# thinking-together
n
Disregarding performance or implementation concerns, is self-describing data the way to go? Is there a conceptually "nice" approach to encoding the meaning of data other than composing it from well-defined attributes? Do schemas have any purpose beyond acting as a "type system" and a data compression technique? Should every piece of data in a programming system be self-describing? We can compress away the redundant descriptions within collections.
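To make the "compress away the redundant descriptions" idea concrete, here is a minimal sketch in Python. It assumes every record in a collection carries the same attribute names; `compress` and `expand` are hypothetical helper names, not anything from an existing library. The shared descriptions are stored once as a header, which is roughly what a schema does.

```python
def compress(records):
    """Factor the attribute names shared by all records into one header."""
    if not records:
        return {"attributes": [], "rows": []}
    attributes = list(records[0].keys())
    for record in records:
        if list(record.keys()) != attributes:
            raise ValueError("records do not share the same attributes")
    rows = [[record[name] for name in attributes] for record in records]
    return {"attributes": attributes, "rows": rows}


def expand(table):
    """Rebuild the fully self-describing records from header plus rows."""
    return [dict(zip(table["attributes"], row)) for row in table["rows"]]


# Each record repeats its own description (the attribute names).
records = [
    {"model": "A", "price_usd": 499.0},
    {"model": "B", "price_usd": 649.0},
]
table = compress(records)
```

The round trip `expand(compress(records))` gives back the original records, so on this reading the header is only a storage optimisation and nothing about the meaning is lost.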
g
What does "self-describing" mean? Are you going to label every number with what it is? A price? A price in dollars? A price of a TV in dollars? A price of a Sony TV in dollars? A price of a Sony TV model KHL123456 in dollars? etc... How about this other number? It's an area of a house floor plan in square meters. Will it be labeled with how accurate it is? Is it rounded to the nearest square meter? Is it all the area inside the outside edges of the walls (so it includes the walls), or just the space between the walls? Does unusable space like a staircase get included? Just trying to clarify what "self-describing" means. Do you need a giant taxonomy of categories?
n
Taxonomies are an orthogonal issue. A self-describing data object is (according to me) just one that contains all the metadata (e.g. references to standardised attribute definitions) needed to determine the information it contains. The actual amount of information stored in the object (level of detail) is also an orthogonal issue.
d
http://ivizlab.sfu.ca/arya/Papers/SW/SOP.pdf Subject-Oriented Programming (A Critique of Pure Objects) This essay critiques the idea that it is possible and desirable to put all of the information needed to interpret an object inside the object itself. Different applications or modules may interpret the same data in different ways, and you can't always plan for all of these interpretations in advance.
If you have too much self description, it's cumbersome: code can get very verbose with all the adding and removing of tags. I've seen class libraries for computer graphics where there are separate Point3 and Vector3 classes (for points and vectors in 3D space). That's too much self-description for my taste. In the graphics languages I use, a point and a vector have the same representation, as a generic 1D array of numbers, viz: [x,y,z]. If I store one of these values in a record, then the field name will informally indicate whether the value is being interpreted as a vector or a point. No need for the value [x,y,z] to be self describing as well. If you have too little self description, code gets cryptic. Like in old-style Lisp programming, where you use lists for everything (no records or maps).
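A rough Python sketch of the two styles described above, for comparison. The class and field names (`Point3`, `Vector3`, `camera`, `position`, `view_direction`) are illustrative assumptions, not taken from any particular graphics library.

```python
from dataclasses import dataclass


# Heavily self-describing: the interpretation lives in the value's type.
@dataclass(frozen=True)
class Vector3:
    x: float
    y: float
    z: float


@dataclass(frozen=True)
class Point3:
    x: float
    y: float
    z: float

    def translate(self, v: Vector3) -> "Point3":
        return Point3(self.x + v.x, self.y + v.y, self.z + v.z)


# Lightly described: points and vectors share one representation, [x, y, z].
def translate(point, vector):
    return [p + v for p, v in zip(point, vector)]


# The field names informally say which list is a point and which is a vector.
camera = {"position": [0.0, 1.0, 5.0], "view_direction": [0.0, 0.0, -1.0]}
moved = translate(camera["position"], camera["view_direction"])
```

The typed version rejects nothing at runtime in plain Python, but it documents intent in every signature; the list version is shorter and leans entirely on the surrounding record for meaning.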
s
@Nick Smith I'm interested in this as well. Mostly from the perspective of resilience and as a way to separate content from structure. You set aside taxonomies and level of detail as orthogonal. Would you mind elaborating what you'd consider essential? And are you by any chance currently experimenting with this?
n
@Stefan The "essential" definition of self-describing data is (to my understanding) the one I gave earlier in the thread. The data must (at the UI level, not the DRAM level) contain references to well-known attributes that have an agreed meaning within a certain community/context. I haven't been able to think of any alternative definition.
This requires that attributes have IDs (potentially UUIDs) that can be looked up in some kind of registry.
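A minimal sketch of what that could look like, in Python. Everything here is an assumption for illustration: the in-memory `REGISTRY`, the `register` and `describe` helpers, and the `https://example.org/attributes/` namespace are all hypothetical. The only point is that a record's keys are attribute IDs, and the agreed meaning is found by looking those IDs up.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class AttributeDef:
    """A shared definition that gives an attribute its agreed meaning."""
    name: str
    description: str
    unit: Optional[str] = None


# Hypothetical registry: attribute ID -> definition.
REGISTRY: Dict[uuid.UUID, AttributeDef] = {}


def register(name, description, unit=None):
    # uuid5 is deterministic, so the same name always yields the same ID.
    attr_id = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/attributes/" + name)
    REGISTRY[attr_id] = AttributeDef(name, description, unit)
    return attr_id


PRICE = register("price", "Retail price of an item", "USD")
FLOOR_AREA = register("floor_area", "Interior floor area of a dwelling", "m^2")

# A self-describing record: keys are attribute IDs, not bare field names.
record = {PRICE: 499.0, FLOOR_AREA: 82.5}


def describe(rec):
    """Resolve each attribute ID in the record against the registry."""
    lines = []
    for attr_id, value in rec.items():
        definition = REGISTRY[attr_id]
        unit = " " + definition.unit if definition.unit else ""
        lines.append(f"{definition.name} = {value}{unit} ({definition.description})")
    return lines
```

How detailed each definition is (rounding rules, whether walls count towards floor area, and so on) is a separate choice from the lookup mechanism itself.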
@Doug Moen I agree with the premise that a data set may be perceived differently at different places/times within an application/system. Though that doesn't obsolesce this discussion: we can refine it to be about self-describing views (derived/reactive data sets). Though such a perspective won't necessarily aid us here. The problem I'm trying to solve is: when you give a data set to someone (a human or a device), how are they supposed to figure out its meaning? Or rather, what is the best way to do so? Send them an email explaining the meaning? Or something more formal and machine-friendly.
s
> references to well-known attributes that have an agreed meaning within a certain community/context.
@Nick Smith Isn’t that what a taxonomy is all about? When I think about self-describing data, I’m most interested in in-band vs. out-of-band transmission. Some parts of a data format need to be agreed on out-of-band. That could be generic assumptions like endianness, that a string is encoded in UTF-8, or that a field in the UI only takes a valid email address. If that’s just assumed, these assumptions need to be transmitted out-of-band, i.e. on a different channel, which could be that email you mentioned or could just be “obvious” within a community (side note: there’s a danger of exclusion here). Self-describing data formats transmit more information in-band, as part of the data. So there is an explicit part of the data that says what follows is UTF-8, or there’s a MIME type that indicates here comes an image or JavaScript, or you go all the way and end up in semantic web land with those unique, agreed-upon entities described in RDF and OWL. The question to me is: how much of a data format can and should be described explicitly in-band? And the interesting challenges hiding under there are: encoding structure vs. content, and what is data vs. what is metadata?
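A small Python sketch of the in-band vs. out-of-band distinction. The envelope layout (a 4-byte big-endian length, then a JSON header, then the body) is a made-up format for illustration only, and the function names are hypothetical.

```python
import json


def decode_out_of_band(payload: bytes) -> str:
    # The receiver relies on a convention agreed elsewhere: "it's UTF-8".
    return payload.decode("utf-8")


def encode_in_band(text: str, charset: str) -> bytes:
    body = text.encode(charset)
    # json.dumps escapes non-ASCII by default, so the header is pure ASCII.
    header = json.dumps({"media_type": "text/plain", "charset": charset}).encode("ascii")
    # 4-byte big-endian header length, then the header, then the body.
    return len(header).to_bytes(4, "big") + header + body


def decode_in_band(message: bytes) -> str:
    header_len = int.from_bytes(message[:4], "big")
    header = json.loads(message[4:4 + header_len].decode("ascii"))
    body = message[4 + header_len:]
    # The charset travels with the data instead of being assumed.
    return body.decode(header["charset"])


message = encode_in_band("héllo", "utf-16")
```

Even here the length prefix, its byte order, and the use of JSON for the header are still agreed out-of-band, so the sketch moves the boundary rather than removing it.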
👍 1
n
I'm trying to avoid the baggage associated with the term "taxonomy", since the most common usage implies "hierarchy" and declarations from some authority (and I don't want the discussion to veer off in that direction). We can use the term folksonomy if we want a specific term (thanks to Jack Rusher). What's important here is not the characteristics/structure of a classification scheme (which is what taxonomy/folksonomy is concerned with), but the mere existence of such a scheme.
> The question to me is: how much of a data format can and should be described explicitly in-band?
Serialisation formats are not a language or environment concern, they're an implementation concern, and I'm acting as a language/environment designer rather than implementer as of late, so I'm not focusing on that stuff. But I'm interested in how the semantics of a data object are presented to the user! Which is closely related. If you're on that bandwagon, then sure, that's the question 🙂
(Btw, if one was to argue that data formats are part of the user interface, then my retort would be that your programming environment is too small in scope)
s
@Nick Smith If you're interested in which "bandwagon I'm on", the second half of this post explains it better: https://stefan-lesser.com/2019/12/06/structure-and-behavior/
While I do see a distinction between data format and UI, I don't think they're just two different categories, but more like several layers, almost like a gradient. Whether a string is ASCII or UTF-8 is probably just an implementation detail (until we talk about import/export). Requiring an email in a form field, not so much. Although it's still just some bytes being parsed in a certain way. So I wouldn't exactly claim that "data formats are part of the user interface", but I wouldn't claim that these things are completely separable either.
a
Data in Wolfram Language is tagged in various ways that tell you where it came from and what it represents. For example, if you download one of their stock datasets and select data from it, the data is going to present itself as a table even after operations that manipulate it, unless you do something that changes its format, such as plotting it or using a *Form function. I'm not an expert on it, unfortunately, but I thought it'd be a useful reference if you're looking for how other languages do it.
👍 2