# thinking-together
n
My challenge to #C5T9GPWFL: try and come up with a precise definition for the term data or datum that isn't equivalent to one of these two notions:
• information
• objects (more precisely: physical objects or abstract concepts, such as "the number 3", "an electrical charge", "a cat", or "capitalism")
I contend that this cannot be done 🙃. Most distinctions that people try to make between "data" and "information" either fall down under light scrutiny, or reduce "data" to the role of an arbitrary object. (Consequently, I've come to believe that the term "data" causes more harm than good, since we could instead be using terms with broadly understood meanings.)
👀 3
m
data are raw facts "that which is given", when analyzed they become information
👍 1
k
OK, I'll bite. Data is any symbolic expression using a well-defined notation. Information is data presented in a context that allows its interpretation.
👍 1
👍🏼 1
j
I'd say that one calls something "data" when one isn't interested in its internal semantics. For example, you might say that an integer is "data" in the context of checking its presence in a list; even though it has a rich inductive structure which is important in other contexts. Therefore, I partly agree, partly disagree. Data can encompass "objects" or "information", but the term (IMO) communicates that any internal structure and relations to external objects (both of which define objects and information) is irrelevant.
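A minimal sketch of the list-membership example, in Python: the hypothetical `contains` helper below (not from the thread) touches its elements only through `==`, so whatever internal structure the values have stays irrelevant to it. This is one possible reading of "data" in the sense described above, not a definitive formalization.

```python
from typing import Iterable, TypeVar

T = TypeVar("T")


def contains(items: Iterable[T], target: T) -> bool:
    """Return True if target equals some element of items.

    Only equality (==) is used; nothing else about the values is inspected,
    so integers, strings, or any other comparable values are treated alike.
    """
    for item in items:
        if item == target:
            return True
    return False


print(contains([1, 2, 3], 2))
print(contains(["a", "b"], "c"))
```

The same function works for integers and strings alike, which is the sense in which the elements are treated as opaque here.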
a
Data is a special case of information, that leans toward physical fact and measurement results. I don't know if that's the "distinction" you were looking for. To a significant extent "data" is about the intended usage of the information. Also, I contend that just because a term is tricky to define doesn't mean it's useless. Humans aren't computers, and can communicate and stay productive with, when you analyze it to this level, a shocking level of ambiguity. In what actual situation does the common understanding of "data" fail?
d
(I hadn’t seen this when I posted elsewhere, so here’s my slightly repackaged version.) First, I personally think both “information” and “data” tend to be more confusing than useful, and so I avoid them when possible. However, insofar as they might be made meaningful, I think of them something like this. “Information” is relative, defined in reference to an agent, its environment, and the agent’s model of its environment. An agent has a body, through which it interfaces with its environment as well as its internal model of its environment via a set of sensors and actuators. “Data” is relative too, defined in reference to a given set of sensors, their environments, and a given model of each sensor/environment. So I would define data as “a sequence of observations from a set of sensors, represented according to a given model of each sample/sensor/environment”. Agents use data to decide how to update their internal models, and use their internal models to decide how to act on themselves and their environments. I would define information operationally as “data that leads an agent to update its internal model of its environment” (which notably includes the common case of data that represents another agent’s internal model of a shared environment, i.e. “communication”).
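One way to make the operational definition above concrete, as a rough sketch only: assume (this is not stated in the thread) that the agent's internal model is a probability distribution over hypotheses and that it updates by Bayes' rule. An observation then counts as "information" exactly when the posterior differs from the prior. The names `bayes_update` and `is_information`, and all the numbers, are hypothetical.

```python
def bayes_update(prior, likelihoods):
    """Return the posterior over hypotheses after one observation.

    prior:        dict mapping hypothesis -> P(hypothesis)
    likelihoods:  dict mapping hypothesis -> P(observation | hypothesis)
    """
    unnormalized = {h: prior[h] * likelihoods[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}


def is_information(prior, likelihoods, tol=1e-12):
    """True if the observation changes the agent's model at all."""
    posterior = bayes_update(prior, likelihoods)
    return any(abs(posterior[h] - prior[h]) > tol for h in prior)


# Assumed toy model: will it rain today?
prior = {"rain": 0.3, "dry": 0.7}

# Dark clouds are much more likely if rain is coming, so the model shifts.
dark_clouds = {"rain": 0.9, "dry": 0.2}

# A coin flip is equally likely either way, so the model stays put.
coin_flip = {"rain": 0.5, "dry": 0.5}

print(is_information(prior, dark_clouds))
print(is_information(prior, coin_flip))
```

Under these assumptions both observations are data, but only the first is information for this particular agent.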
a
If we're going to abandon Shannon's definition of information, the only sensible place to start is physics, probably quantum information theory, not agent-based AI.
e
what's the difference between data and symbols? Is there some sort of data that is not, and can't be, quantized/digitized?
a
Any definition of "data" that assumes a sequence is too narrow to capture the term as used. A single measurement of, say, a manufactured part's or a room's dimensions is sometimes considered a datum. A "Material Safety Data Sheet" consists of a wide spread of single values (measured from the real world; possibly derived as the average of some sequence, but nevertheless still called "data" when you see it singularly). Computer science does have a somewhat idiosyncratic usage of "data" compared to the natural sciences and really the rest of the world, so that's probably where some of the confusion is coming from.
e
data is to be interpreted
data can be accompanied by other data that help with interpretation, such as a type or a unit but it will always be open to interpretation
d
@Andrew F I’m very much starting with physics. 😉 Unlike Shannon, most everyone wants the term ‘information’ to say something about semantics, which means you need something like an agent in your definition. Shannon entropy is much more ‘mechanical’ and explicitly excludes semantics. What’s your favorite physical model of semantics?
PS my ‘sequence’ includes a sequence of one.
e
data at its looseliest https://www.reddit.com/r/data_irl
❤️ 1
a
AFAICT there isn't and can't be a physical semantics, unless you stretch the definition to include physics itself, i.e. the semantics of a bit (encoded in a physical system) is exactly the way it causes different physical configurations in systems that interact with it. Shannon was right to dodge the question; this isn't so much a question of a good definition not existing as people ignoring prior work. Anyway, yes, information in general and data in particular is only meaningful in conjunction with an interpreter; again, both considered as physical systems rather than bits or code if you want to go down all the turtles. Even in the cushy digital universe, I think it's important to keep track of what the intended interpreter(s) of a given piece of info is/are. I'm guessing we're agreed on that.
a
"Consequently, I've come to believe that the term "data" causes more harm than good, since we could instead be using terms with broadly understood meanings."
I'm curious to hear what you think is an equivalent term that is more broadly understood... I'm not coming up with one
d
Yep, “including physics itself” is definitely what I mean! I like David Wolpert’s line of thinking on this, among others. https://arxiv.org/pdf/0708.1362.pdf
o
I’ll refrain from adding my own definitions here, but I will say that this thread feels more to do with disambiguation than with definitions of information vs data. There are so many ways these terms get used that it seems unlikely we can reach consensus. @daltonb your definition actually came up in a chat at work about information, comparing Shannon to others. In that conversation these links to Jane Austen's (implicit) definition and also Bateson were discussed. The first link has some nice history and references for the differing notions of information.
👍 1
d
Really fun links, thank you! Not being precise here but I think the key shared notion between ordinary “austen information” and “shannon information” is “*what can I infer from an observation based on my current model of the world?*” Shannon’s version is restricted to ‘symbols picked from a known set sent over a noisy channel’, which is a great constraint for telecomms applications, and has implications for compression since if you can do better than random at predicting the next observation, then your sequences can be encoded as diffs with what your model would predict. in the more general ‘austen information’, intuitively ‘high information’ feels like ‘surprise,’ like if it’s a nice sunny day maybe i don’t bother checking the forecast before heading out (low expected ‘information’), whereas if it’s overcast i expect to learn more by doing so.
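The "surprise" intuition above matches the standard Shannon quantities: the surprisal of an outcome with probability p is -log2(p) bits, and entropy is the expected surprisal. A small sketch, where the forecast probabilities are made-up numbers for illustration:

```python
import math


def surprisal_bits(p: float) -> float:
    """Information content, in bits, of an outcome with probability p."""
    return -math.log2(p)


def entropy_bits(dist) -> float:
    """Expected surprisal of a distribution given as a list of probabilities."""
    return sum(p * surprisal_bits(p) for p in dist if p > 0)


# Assumed beliefs about "rain / no rain" before checking the forecast.
sunny_day = [0.05, 0.95]     # fairly sure it stays dry
overcast_day = [0.5, 0.5]    # genuinely unsure

print(entropy_bits(sunny_day))     # low: little expected to be learned
print(entropy_bits(overcast_day))  # higher: checking the forecast is worth more
```

With these assumed numbers the overcast case carries one full bit of expected information and the sunny case well under half a bit, which is the "do I bother checking the forecast" asymmetry.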
a
We can model the conventional idea of information, the Austen definition if you will, as Shannon information interpreted by a human brain, or some idealized consensus brain. The hazards in defining that interpreter highlight exactly the difficulties in that general approach to information.
@daltonb We seem to have different ideas of "physics itself". Physics itself doesn't have agents, at least not as anything more than a leaky abstraction (nor does the paper you linked, afaict?). As far as we know it basically just has excitations in fundamental fields (plus some gravity stuff, but you get the idea). "Physics itself" only begins to work as a semantics because it's vacuous, "it is what it is" shoehorned into the role of foundational axiom. I certainly don't think it resembles the semantics people intuitively want information to have: I believe that's the idealized consensus brain semantics.
d
💯 The difficulties are enormous, but the baseline is that we’re already doing this implicitly in a million ad hoc ways, so the bar is pretty low. Hence this group! Personally I think you get a lotta mileage out of “interpreter = input signals + model + output signals”, where “model” is some version of this guy (i.e. predicted + desired world states)
Ah. My ‘agent’ is roughly wolpert’s ‘inference device’. Nothing implied about biology, brains, consciousness, etc. Currently, when doing physics, the inference devices tend to be physicists 🙂
My ‘physics’ is roughly ‘patterns that tend to occur wrt some set of invariants + parameters’, as reported by many observations across time & space
n
Naturally, I'm now going to critique all of these definitions. 🙂
@Mariano Guerra "Data are raw facts. When analyzed they become information." What is the difference between a "fact" and "information"? If something needs to be analyzed to be considered information, does that mean an encyclopedia is not information until a human or machine reads it? What does it mean to "analyze"? If the encyclopedia emits photons, and those photons interact with other matter in the world, then has the encyclopedia been analyzed? If so, then the encyclopedia is in a constant state of being analyzed, and so is every piece of data. To exist is to be analyzed. Thus the distinction evaporates. (See also: Appendix A below.)
@Konrad Hinsen "Data is any symbolic expression using a well-defined notation. Information is data presented in a context that allows its interpretation." Can data exist outside of the physical universe? If not, then every piece of data is continually surrounded by a context: its physical environment. Moreover, your use of the term "interpretation" is synonymous with Mariano's "analyze", and so I will again assert that data is constantly being interpreted by its environment, even if unintentionally.
@Jan Ruzicka "The term 'data' communicates that any internal structure and relations to external objects are irrelevant." I can accept this definition. But is it useful in reality? In what situations is it important to stress that you don't care about the structure and relationships that a piece of information expresses? In what situations would using the word "information" mislead a reader where your definition of "data" would not?
@daltonb "I would define information operationally as data that leads an agent to update its internal model of its environment." This seems very close to the earlier definitions that "information is data that has been analyzed or interpreted". Thus, I shall refer you to my responses to those, and also to Appendix A. (Also: what conditions are necessary for something to be considered an "agent"? Can an arbitrary collection of atoms be considered an agent?)
@Andrew F "I contend that just because a term is tricky to define doesn't mean it's useless. In what actual situation does the common understanding of 'data' fail?" In general, I agree. However, there are hundreds (thousands?) of research papers, books, and blog posts which laboriously try to articulate a difference between "data" and "information", not to mention companies that are trying to sell products and services related to "information systems". The amount of human effort spent on understanding the (imagined) distinction is considerable. That's the "harm" I'm talking about.
@abeyer "I'm curious to hear what you think is an equivalent term for 'data' that is more broadly understood... I'm not coming up with one." Information 😇 (or "formation": see Appendix A).
Appendix A: Some of the responses in this thread come close to describing information as a process, rather than a thing. One might suggest that "information" is the process of analyzing something, or the process of updating an agent's internal model. In simpler words: information is the process of informing. If we accept this definition, then we still need to describe what things (objects) are involved in an occurrence of informing. An obvious term that could help us is "formation" (or "form"). We could say: to inform is to create a formation (in someone's mind, in a computer, or more generally, at any location in the universe). "In-formation" is then the process of creating such a formation. But what induces a formation to be created? Well... interactions between formations of course! (This matches up with physics: the configuration of the universe is fully determined by interactions between matter and/or energy.) Given these definitions, the term "data" would be a synonym for "formation". I could accept that.
🤔 1
a
I'm hesitant to even accept that people spending effort trying to define data is a problem in itself; lots of times incremental progress on a tricky problem looks like a waste of time until there's a breakthrough that they've been slowly building up to. But I think the real point is that people who actually talk about data for their job aren't significantly confused about what it is. Confusion exists, but mostly in people who don't know how science works or have been lied to. People who try to exploit confusion about data vs information are especially a red herring. Even if you came up with a perfectly granular term, the BSers would just turn their attention to corrupting that instead. Nothing you do will change that, so it's of limited use at best to factor it in to your actions. Honestly the same is kinda true of people who endlessly muse about definitions, which is another reason not to worry too much about that. I'll resubmit my original proposal with a tad more precision: data is information (whatever you think that is) that is [claimed to be] derived more or less directly from measuring real phenomena. "Claimed" because, leaving aside falsification, "directly" is somewhat context sensitive, and someone could be mistaken about whether the degree of directness is adequate for a given usage of the supposed data. I don't think the fuzziness can be entirely eliminated, so maybe I agree with @Nick Smith there. I think it's just one of those cases where "we all know" the concept exists and deserves naming, even if we can't nail down the edges. Reminds me of https://en.wikipedia.org/wiki/Sorites_paradox
n
You make good points. It is true that most people are using the term "data" every day, and are effectively communicating with other humans, even though the term does not have a precise definition. I think the sorites paradox you've linked to pretty much nails the problem with the idea that data is something "directly observed", or something that has "not yet been analysed". Such a definition involves laying down a line/boundary, but in reality it seems such a boundary (between "direct" and "indirect", or between "un-analysed" and "analysed") cannot be made precise. Perhaps you can make "data" precise by defining it as information which is given as input to a particular computation/analysis. But that would make the term synonymous with "argument" or "input". (I'm pretty sure this was the original meaning of "data" in its modern usage.)
k
Nice discussion, with many different and complementary points of view. One comment on semantics: in any sufficiently isolated system, semantics is an emergent phenomenon. Well-defined semantics exist only when you focus on a small subsystem (a book, a piece of software, ...); the semantics are then set by the surrounding environment via its interactions with the subsystem. It follows that "physics" cannot have semantics, because it concerns, by definition, the lowest levels of organization of matter at the scale of the universe. Shannon's definition of information is just one layer above physics, so it doesn't touch on semantics either. Note that this is not the only case of a term from common language being recruited into technical jargon. Energy is another nice example.
m
@Nick Smith the raw dumps from the Large Hadron Collider are data, analyzing them to find the presence of the Higgs boson is information
sensor readings from LIGO are data, analyzing them to find the presence of gravitational waves is information
a pile of hard drives with radio telescope measurements is data, analyzing them to plot an "image" of a black hole is information
👍🏼 1
n
@Mariano Guerra I understand the distinction you're trying to make, but those statements aren't really conveying that distinction. In fact, those statements could still be valid if data and information were synonyms. For example: "Coriander is a leafy plant, cilantro is a green herb." "Coriander is good in a salad, cilantro is good in a soup." Those terms are synonyms, yet I can still make statements that could be construed as trying to draw a distinction between them 😉. To really draw a distinction between the terms you would have to give examples of something that is data but not information, and/or something that is information but not data. I can't tell if you're intending to do this in your statements above. In particular, I can't tell whether you're suggesting data ⊆ information, or whether they are disjoint.
m
what's a platypus? what's a tree? is a virus alive? I'm not part of the "taxonomists", I'm OK with fuzzy boundaries and ambiguity 🙂
a
If we want a term for results of analyzing data, I would suggest "insight" before "information". "Information" already has a really solid, useful definition, and that ain't it.
j
@Nick Smith Regarding my attempt at a distinction, maybe “data” and “information” can be used to describe the structure of a single “thing” - namely: “information” is the structure you care about, “data” is the structure you don’t care about. Then if you have a list of values, and you’re checking the presence of a value, the “information” is the structure that lets you compare them for equality, and the rest is “data”. Regarding the examples of @Mariano Guerra, some structure of LHC measurements allows physicists to draw the conclusion of the presence of the Higgs, whereas some other structure doesn’t. The former is “information”, the latter is “data”. Where you draw the line really depends on your goal, as sometimes “data” might become “information” under a new goal structure. So I still believe these terms are useful in communication in this sense, albeit they are irrelevant for any practical considerations.
a
I bet not a single person in the LHC project, probably even in the natural sciences overall, thinks of data vs information like that. No one thinks a series of e.g. temperature measurements is "structure you don't care about". Note that there is a huge variety of possible analyses you could and likely will run on that series to answer different questions, which may take into account different aspects of its structure (e.g. need ordering for seeing trends but not for averaging), so we can't incorporate a specific interpretation in our definitions besides "this is a time series of temperatures". I hate to be the one defending the status quo, but the common sense of "data" was basically hashed out centuries ago, and "information" has as rigorous a definition as it can have. We're only muddying the water here.
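The point above about different analyses using different aspects of a series' structure can be sketched quickly: an average ignores ordering, while a trend estimate depends on it. The temperature values and the `slope` helper below are made up for illustration.

```python
from statistics import mean


def slope(values):
    """Least-squares slope of values against their index 0..n-1."""
    xs = range(len(values))
    x_bar = mean(xs)
    y_bar = mean(values)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


# Hypothetical temperature readings, in the order they were taken.
temps = [14.0, 15.0, 16.0, 17.0, 18.0, 19.0]
reversed_temps = temps[::-1]  # same readings, ordering destroyed

# The average only needs the multiset of values.
print(mean(temps), mean(reversed_temps))

# The trend needs the ordering: reversing the series flips its sign.
print(slope(temps), slope(reversed_temps))
```

The same series supports both analyses, which is why a definition that bakes in one particular interpretation of its structure would be too narrow.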
j
This definition and the book from which it comes might be helpful: https://en.wikipedia.org/wiki/Exformation
j
@Andrew F Well, terms are just terms (= made up words), and the meaning is for users to decide. For example, the proof assistant community (especially the people around homotopy type theory) uses the distinction I described (information = structure you care about, data = opaque & not interesting) quite often. Since we’re discussing the potential distinctions of data v information, the “muddying of waters” is the goal ;)