A recurrent topic in this community is "Why do tod...
# thinking-together
k
A recurrent topic in this community is "Why do today's programming system so strongly rely on text files, and can we do better?" This tweet made me think of a possible answer: epistemic transparence (of text) vs. epistemic opacity (of data formats requiring more specialized tools for inspection). We have so many tools for inspecting text files that it's hard to imagine that someone could sneak in a tool that deliberately misrepresents the information in a file. Human-readable data encodings in text files thus provide acces to a shared ground truth. The tools intermediating between bits in memory and UIs (screens etc.) are so simple that they are easy to understand and easy to verify and validate. Even for relatively simple structured binary formats such as tar, this is no longer true. https://twitter.com/slpnix/status/1457642326956855296
❤️ 1
🤔 1
c
I read a similar thought recently when discussing "binary vs text" formats. You never really think of it this way, but text is a binary format too; it's just that the tools to view it are extremely widespread.
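A minimal illustration in Python (assuming Python 3.8+ for the hex separator): the same stored bytes, viewed either as a hex dump or as text.
```python
# The "text" only exists as bytes; readable text is just one view of them.
data = "café".encode("utf-8")     # the bytes actually stored in a file
print(data)                       # b'caf\xc3\xa9'
print(data.hex(" "))              # 63 61 66 c3 a9
print(data.decode("utf-8"))       # café: the same bytes viewed as text
```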
👍 2
s
Interestingly, text is much more complex than many binary formats. Unicode has given us the world's scripts, and fully supporting text is a mammoth undertaking.
👍 1
💯 2
😱 1
d
I think we have to underline that text files are punch card emulators, in the same way that TTY terminals are teletype emulators… the fundamental aspects of how people use computers still haven't changed much since the von Neumann days, even though the machines certainly have. You can make arguments about how things are "better", which generally comes to mean faster, more "expressive", or simply that you have color in the terminal so you can do highlighting etc., but that's all incremental improvement on the same setup.
s
@Konrad Hinsen yes transparency seems like a strong motivator for sticking to plain text files. Often this shows up in stated reasons like "I can see it". There is a clear boundary of this artifact (the bytes in the file) and the standard representation is "total" - the entirety of the text is visible, no hidden layers. Another related possible reason is reproducibility - if I can see some plain text, I can reproduce it quite easily. Max calls this the "production script": https://twitter.com/maxkreminski/status/1149825466603200518
👍 1
k
@Chris Knott Everything is binary in the end. What sets text apart is its extreme generality, leading to widespread tool support. If you have the slightest worry that a text-based tool is hiding something from you, you just check using one of the hundreds of alternatives. As for @Stephen De Gabrielle's comment about the complexity of text: most real-life data formats use only a tiny subset of Unicode, plus perhaps full Unicode in string data. Decent Unicode support matters (and is widespread), but full support is rarely required.
@Daniel Krasner I think what you describe is an orthogonal issue. You can have a user interface at the level of some convenient data structure, but use a text-based encoding underneath (JSON, XML, all the other W3C standards, ...). You don't have to look at that representation, but you can.
k
To add weight to @Stephen De Gabrielle's point, it's pretty common at this point to see vulnerabilities involving Unicode homoglyphs. https://en.wikipedia.org/wiki/Homoglyph
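A tiny Python sketch of the trick: Latin and Cyrillic letters that render identically in most fonts but are different code points.
```python
import unicodedata

latin = "paypal"                    # all Latin letters
mixed = "p\u0430yp\u0430l"          # U+0430 is CYRILLIC SMALL LETTER A
print(latin, mixed)                 # usually indistinguishable on screen
print(latin == mixed)               # False: different code points
print(unicodedata.name(mixed[1]))   # CYRILLIC SMALL LETTER A
```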
👀 1
d
@Konrad Hinsen fair point. What I was mostly focusing on is the 'file' part, not the text. Why do we have files, and why so much focus on them?
p
❤️ 1
a
Yes, transparency (or at least the illusion thereof) is a huge aspect of why text will be hard or impossible to kill. Key aspect: text has (almost) no room for hidden state. A text editor is pretty much surjective onto the set of binaries that won't crash your text editor, and once it opens you can change any byte... Of course the eldritch complexity of Unicode spoils a lot of those "guarantees". If we're going to replace text, I'd honestly like a format that offers even better surjectivity (onto parsable files if not onto binary strings). The fact is, if your visual editing tool doesn't let me inspect every literal bit of code, I will never fully trust that bugs in your editor aren't hiding important things from me. For serious applications, that's a problem. (Also, text is convenient when dealing with "invalid" under-construction states.)
k
@Daniel Krasner Files matter for transparency as well, because it's the system infrastructure that guarantees access to the basic data layout to a wide variety of tools. We could have something else as the "ground truth" layer, of course. But it matters that this layer is accessible to a large range of tools. Illustrative example: a Smalltalk image. Reflection provides the means to inspect the object graph, and Smalltalk comes with the tools to do so. But all that access goes through the Smalltalk VM. If you have to envisage the possibility of a compromised VM, you can't trust your system any more. The same holds of course for the file system layer of a typical Unix system. But there is diversity-with-compatibility. Don't trust btrfs? Use ext4. Paranoid? Store all your stuff on three different file systems, and run consistency checks from time to time. In the end, it's tool diversity that matters. No single bottleneck for accessing your data.
d
The original smalltalk "image" was full memory dumped to tape/disk and you would boot right into that state. The image files exist b/c in the end we are all running a unix. A more smalltalk smalltalk would be a collection of objects with addresses living in some ecosystem, remote or otherwise. Kay had imagined each object being a VM, fully self-contained. We are just shoving the current VM onto unix, which is what it is… Even the word <<file>> evokes old thinking in a new medium, thinking that has nothing to do with computing but with how we organized paper-based information. This is one reason why the 'younger' generation is trending towards a complete lack of understanding, or mental image, of files, folders, directories. For them it's just "a bunch of stuff" out there and ways to find the stuff. (Of course we don't give them good metaphors for how the stuff should be found, but that's another story.) But their mentality is modern, it's a computing mentality, even if much of it amounts to improved means to an unimproved end.
k
@Daniel Krasner I completely agree. My goal is to figure out what makes the text-in-file storage system so attractive that it has persisted over decades, in spite of its shortcomings. One attraction is a sense of agency and data ownership. Not sure that people who grew up with today's cloud silos see the value of that.
BTW, I briefly tried to use Smalltalk in its original form. It was a pirate copy of Smalltalk-80 for the Atari ST. Booted from a floppy disk, and completely independent from the standard Atari OS. And that made it completely useless for me. Try something new, fine, but without access to any of my data, no. Perhaps it made sense in a better networked environment, but for me back then, networking was via dial-up modems: slow and expensive.
Wondering: could we define some measure of the opacity of a data representation? For example, the Kolmogorov complexity of an algorithm that produces a minimal but complete human-intelligible representation? That opacity measure would be small for a text-based format, but also for simple (uncompressed ...) image formats, avoiding an explicit bias towards text. The obvious difficulty is defining "human-intelligible". Nothing prevents a human from practicing for many years to read a hex dump of some binary data format. The second difficulty is context-dependence. A text written in English is intelligible to me, a text written in Chinese is not, but that distinction is specific to me.
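Kolmogorov complexity isn't computable, but as a very rough sketch one could approximate it with a general-purpose compressor, e.g. the normalized compression distance between a representation and a human-readable rendering of the same data (the data and the choice of "readable rendering" below are made up for illustration):
```python
import json
import zlib

def ncd(x: bytes, y: bytes) -> float:
    # Normalized compression distance: a computable, if rough, stand-in for
    # Kolmogorov-complexity-based similarity (Cilibrasi & Vitanyi).
    cx, cy = len(zlib.compress(x, 9)), len(zlib.compress(y, 9))
    cxy = len(zlib.compress(x + y, 9))
    return (cxy - min(cx, cy)) / max(cx, cy)

# The same (made-up) data as a readable rendering, a compact text format,
# and a stand-in for an opaque binary encoding.
records = [{"name": f"sample-{i}", "value": i * 0.5} for i in range(200)]
readable = json.dumps(records, indent=2).encode()   # pretty-printed "ground truth"
text_fmt = json.dumps(records).encode()             # compact, but still text
opaque   = zlib.compress(text_fmt)                  # bytes with no visible structure

# Distance to the readable rendering as a crude opacity score:
print("text  :", round(ncd(text_fmt, readable), 3))  # low: shares its surface structure
print("opaque:", round(ncd(opaque, readable), 3))    # close to 1: shares almost none
```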
👍 1
d
Although I have literally put 15 seconds of "thought" into this, my gut feeling is that something analogous to Shannon entropy is a place to start. This sounds like an information theory problem.
k
Not sure. A plain text and its rot13 encoding are equivalent in information theory. But the text is human-readable, its rot13 encoding isn't. There's my 15s of thought 😉
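A quick Python check of that intuition: the character-frequency (Shannon) entropy of a text and its rot13 encoding is identical, even though only one of them is readable.
```python
import codecs
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    # Shannon entropy over character frequencies, in bits per character.
    counts = sorted(Counter(s).values())   # sorted so the sum order is deterministic
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts)

text = "Plain text and its rot13 encoding carry exactly the same information."
rot13 = codecs.encode(text, "rot_13")

print(rot13)                   # gibberish to a human reader
print(shannon_entropy(text))   # identical values: rot13 only relabels
print(shannon_entropy(rot13))  # characters, it doesn't change frequencies
```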
🤔 1
a
I want to say: count how many parseable encodings correspond to a human-indistinguishable result in the editor, just like physical entropy counts microstates. Problem: I suspect that will be infinite for most formats (e.g. stick arbitrarily long zero-width space sequences everywhere), which is not super illuminating for current formats at least (but maybe not wrong either?). You need some way to bucket the "microstates". I forget exactly how they do this in physics, but I think it relies on real numbers being continuous, which is not going to work for discrete strings.
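Concretely, a small Python sketch of those microstates: zero-width spaces give byte-distinct strings that usually look identical in an editor.
```python
ZWSP = "\u200b"   # ZERO WIDTH SPACE: renders as nothing in most editors and fonts

visible = "print('hello')"
variants = [visible[:5] + ZWSP * k + visible[5:] for k in range(4)]

for v in variants:
    # These usually look identical on screen...
    print(v, len(v.encode("utf-8")))   # ...but each one is a different byte string
print(len(set(variants)), "distinct 'microstates' for one visible 'macrostate'")
```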
c
"human intelligible" can be quantified to an extent by experiments like getting people to describe it over a phone call, or recreate it in a separate room where they have to read it, remember a bit of it, go next door, and write it down.
It would be interesting to see if someone could recreate a JSON with fewer trips than the same data as XML
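To make that concrete, here is the same record in both encodings, generated with the standard library (the XML layout is just one plausible mapping):
```python
import json
import xml.etree.ElementTree as ET

record = {"name": "Ada Lovelace", "born": 1815, "fields": ["mathematics", "computing"]}

# JSON: what one participant would have to read, remember and rewrite.
print(json.dumps(record, indent=2))

# One plausible XML rendering of the same data.
root = ET.Element("person", name=record["name"], born=str(record["born"]))
for f in record["fields"]:
    ET.SubElement(root, "field").text = f
print(ET.tostring(root, encoding="unicode"))
```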
k
That's an interesting experiment! It also illustrates that human-intelligibility comes at different levels. There's a difference between "this contains obviously no privacy-relevant data" and the ability to recreate an equivalent data object down to the details of syntax.
Had an interesting video call yesterday, about data management in large-scale distributed computations (think bioinformatics etc.). Someone with a lot of experience in both practice and teaching said that an important lesson is to keep most relevant metadata in filenames. Her experience is that metadata in the files (in the file formats that explicitly provide metadata fields) is often wrong in practice, because it's invisible. Users don't care to look at it (an extra step, not always obvious), and then tools don't update it, because, why update something that nobody looks at? So the "use text because it is so much more accessible" principle holds even for the small text bites that are filenames.
a
Worthwhile points, but they don't address the transparency issue, at least not directly.
k
@Konrad Hinsen:
Wondering: could we define some measure of the opacity of a data representation?
A plain text and its rot13 encoding are equivalent in information theory. But the text is human-readable, its rot13 encoding isn't.
@Chris Knott:
"human intelligible" can be quantified to an extent by experiments like getting people to describe it over a phone call, or recreate it in a separate room where they have to read it, remember a bit of it, go next door, and write it down.
Another idea, in the spirit of the Turing test: what fraction of humans can tell if two pieces of data (in the given representation) represent the same 'object'. (Sorry for the necrobump; this feels like a really valuable thread to preserve in Slack history a little longer.)
🤔 2
c
Is there any research on measuring human understanding of data formats? Quite difficult to Google for, because Google doesn't understand "data" as the subject of the research.
k
Yeah, I had some similar questions. Understanding depends on what tools the humans use.