A recurrent topic in this community is "Why do tod...
# thinking-together
k
A recurrent topic in this community is "Why do today's programming system so strongly rely on text files, and can we do better?" This tweet made me think of a possible answer: epistemic transparence (of text) vs. epistemic opacity (of data formats requiring more specialized tools for inspection). We have so many tools for inspecting text files that it's hard to imagine that someone could sneak in a tool that deliberately misrepresents the information in a file. Human-readable data encodings in text files thus provide acces to a shared ground truth. The tools intermediating between bits in memory and UIs (screens etc.) are so simple that they are easy to understand and easy to verify and validate. Even for relatively simple structured binary formats such as tar, this is no longer true. https://twitter.com/slpnix/status/1457642326956855296
❤️ 1
🤔 1
c
I read a similar thought recently when discussing "binary vs text" formats. You never really think of it this way, but text is a binary format too; it's just that the tools to view it are extremely widespread.
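A minimal illustration in Python (assuming Python 3.8+ for the hex separator): the same stored bytes, viewed either as a hex dump or as text.
```python
# The "text" only exists as bytes; readable text is just one view of them.
data = "café".encode("utf-8")     # the bytes actually stored in a file
print(data)                       # b'caf\xc3\xa9'
print(data.hex(" "))              # 63 61 66 c3 a9
print(data.decode("utf-8"))       # café: the same bytes viewed as text
```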
👍 2
s
Interestingly, text is much more complex than many binary formats. Unicode has given us the world's scripts, and fully supporting text is a mammoth undertaking.
👍 1
💯 2
😱 1
d
I think we have to underline that text files are punch card emulators, in the same way that TTY terminals are teletype emulators… the fundamental aspects of how people use computers still haven't changed much since the von Neumann days, even though the machines certainly have. You can make arguments about how things are "better", which generally comes to mean faster, more "expressive", or simply that you have color in the terminal so you can do highlighting etc., but that's all incremental improvement on the same setup.
s
@Konrad Hinsen yes transparency seems like a strong motivator for sticking to plain text files. Often this shows up in stated reasons like "I can see it". There is a clear boundary of this artifact (the bytes in the file) and the standard representation is "total" - the entirety of the text is visible, no hidden layers. Another related possible reason is reproducibility - if I can see some plain text, I can reproduce it quite easily. Max calls this the "production script": https://twitter.com/maxkreminski/status/1149825466603200518
👍 1
k
@Chris Knott Everything is binary in the end. What sets text apart is its extreme generality, leading to widespread tool support. If you have the slightest worry that a text-based tool is hiding something from you, you just check using one of the hundreds of alternatives. As for @Stephen De Gabrielle's comment about the complexity of text: most real-life data formats use only a tiny subset of Unicode, plus perhaps full Unicode in string data. Decent Unicode support matters (and is widespread), but full support is rarely required.
@Daniel Krasner I think what you describe is an orthogonal issue. You can have a user interface at the level of some convenient data structure, but use a text-based encoding underneath (JSON, XML, all the other W3C standards, ...). You don't have to look at that representation, but you can.
k
To add weight to @Stephen De Gabrielle's point, it's pretty common at this point to see vulnerabilities involving Unicode homoglyphs. https://en.wikipedia.org/wiki/Homoglyph
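A tiny Python sketch of the trick: Latin and Cyrillic letters that render identically in most fonts but are different code points.
```python
import unicodedata

latin = "paypal"                    # all Latin letters
mixed = "p\u0430yp\u0430l"          # U+0430 is CYRILLIC SMALL LETTER A
print(latin, mixed)                 # usually indistinguishable on screen
print(latin == mixed)               # False: different code points
print(unicodedata.name(mixed[1]))   # CYRILLIC SMALL LETTER A
```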
👀 1
d
@Konrad Hinsen fair point. What I was mostly focusing on is the 'file' part, not the text. Why do we have files, and why so much focus on them?
p
❤️ 1
a
Yes, transparency (or at least the illusion thereof) is a huge aspect of why text will be hard or impossible to kill. Key aspect: text has (almost) no room for hidden state. A text editor is pretty much surjective onto the set of binaries that won't crash your text editor, and once it opens you can change any byte... Of course the eldritch complexity of Unicode spoils a lot of those "guarantees". If we're going to replace text, I'd honestly like a format that offers even better surjectivity (onto parsable files if not onto binary strings). The fact is, if your visual editing tool doesn't let me inspect every literal bit of code, I will never fully trust that bugs in your editor aren't hiding important things from me. For serious applications, that's a problem. (Also, text is convenient when dealing with "invalid" under-construction states.)
k
@Daniel Krasner Files matter for transparency as well, because it's the system infrastructure that guarantees access to the basic data layout to a wide variety of tools. We could have something else as the "ground truth" layer, of course. But it matters that this layer is accessible to a large range of tools. Illustrative example: a Smalltalk image. Reflection provides the means to inspect the object graph, and Smalltalk comes with the tools to do so. But all that access goes through the Smalltalk VM. If you have to envisage the possibility of a compromised VM, you can't trust your system any more. The same holds of course for the file system layer of a typical Unix system. But there is diversity-with-compatibility. Don't trust btrfs? Use ext4. Paranoid? Store all your stuff on three different file systems, and run consistency checks from time to time. In the end, it's tool diversity that matters. No single bottleneck for accessing your data.
d
The original smalltalk "image" was full memory dumped to tape/disk and you would boot right into that state. The image files exist b/c in the end we are all running a unix. A more smalltalk smalltalk would be a collection of objects with addresses living in some ecosystem, remote or otherwise. Kay had imagined each object being a VM, fully self-contained. We are just shoving the current VM onto unix, which is what it is… Even the word <<file>> evokes old thinking in a new medium, thinking that has nothing to do with computing but with how we organized paper-based information. This is one reason why the 'younger' generation is trending towards a complete lack of understanding, or mental image, of files, folders, directories. For them it's just "a bunch of stuff" out there and ways to find the stuff. (Of course we don't give them good metaphors for how the stuff should be found, but that's another story.) But their mentality is modern, it's a computing mentality, even if much of it amounts to improved means to an unimproved end.
k
@Daniel Krasner I completely agree. My goal is to figure out what makes the text-in-file storage system so attractive that it has persisted over decades, in spite of its shortcomings. One attraction is a sense of agency and data ownership. Not sure that people who grew up with today's cloud silos see the value of that.
BTW, I briefly tried to use Smalltalk in its original form. It was a pirate copy of Smalltalk-80 for the Atari ST. Booted from a floppy disk, and completely independent from the standard Atari OS. And that made it completely useless for me. Try something new, fine, but without access to any of my data, no. Perhaps it made sense in a better networked environment, but for me back then, networking was via dial-up modems: slow and expensive.
Wondering: could we define some measure of the opacity of a data representation? For example, the Kolmogorov complexity of an algorithm that produces a minimal but complete human-intelligible representation? That opacity measure would be small for a text-based format, but also for simple (uncompressed ...) image formats, avoiding an explicit bias towards text. The obvious difficulty is defining "human-intelligible". Nothing prevents a human from practicing for many years to read a hex dump of some binary data format. The second difficulty is context-dependence. A text written in English is intelligible to me, a text written in Chinese is not, but that distinction is specific to me.
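Kolmogorov complexity isn't computable, but as a very rough sketch one could approximate it with a general-purpose compressor, e.g. the normalized compression distance between a representation and a human-readable rendering of the same data (the data and the choice of "readable rendering" below are made up for illustration):
```python
import json
import zlib

def ncd(x: bytes, y: bytes) -> float:
    # Normalized compression distance: a computable, if rough, stand-in for
    # Kolmogorov-complexity-based similarity (Cilibrasi & Vitanyi).
    cx, cy = len(zlib.compress(x, 9)), len(zlib.compress(y, 9))
    cxy = len(zlib.compress(x + y, 9))
    return (cxy - min(cx, cy)) / max(cx, cy)

# The same (made-up) data as a readable rendering, a compact text format,
# and a stand-in for an opaque binary encoding.
records = [{"name": f"sample-{i}", "value": i * 0.5} for i in range(200)]
readable = json.dumps(records, indent=2).encode()   # pretty-printed "ground truth"
text_fmt = json.dumps(records).encode()             # compact, but still text
opaque   = zlib.compress(text_fmt)                  # bytes with no visible structure

# Distance to the readable rendering as a crude opacity score:
print("text  :", round(ncd(text_fmt, readable), 3))  # low: shares its surface structure
print("opaque:", round(ncd(opaque, readable), 3))    # close to 1: shares almost none
```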
👍 1
d
Although I have literally put 15 seconds of "thought" into this, my gut feeling is that something analogous to Shannon entropy is a place to start. This sounds like an information theory problem.
k
Not sure. A plain text and its rot13 encoding are equivalent in information theory. But the text is human-readable, its rot13 encoding isn't. There's my 15s of thought 😉
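A quick Python check of that intuition: the character-frequency (Shannon) entropy of a text and its rot13 encoding is identical, even though only one of them is readable.
```python
import codecs
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    # Shannon entropy over character frequencies, in bits per character.
    counts = sorted(Counter(s).values())   # sorted so the sum order is deterministic
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts)

text = "Plain text and its rot13 encoding carry exactly the same information."
rot13 = codecs.encode(text, "rot_13")

print(rot13)                   # gibberish to a human reader
print(shannon_entropy(text))   # identical values: rot13 only relabels
print(shannon_entropy(rot13))  # characters, it doesn't change frequencies
```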
🤔 1
a
I want to say: count how many parseable encodings correspond to a human-indistinguishable result in the editor, just like physical entropy counts microstates. Problem: I suspect that will be infinite for most formats (e.g. stick arbitrarily long zero-width space sequences everywhere), which is not super illuminating for current formats at least (but maybe not wrong either?). You need some way to bucket the "microstates". I forget exactly how they do this in physics, but I think it relies on real numbers being continuous, which is not going to work for discrete strings.
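Concretely, a small Python sketch of those microstates: zero-width spaces give byte-distinct strings that usually look identical in an editor.
```python
ZWSP = "\u200b"   # ZERO WIDTH SPACE: renders as nothing in most editors and fonts

visible = "print('hello')"
variants = [visible[:5] + ZWSP * k + visible[5:] for k in range(4)]

for v in variants:
    # These usually look identical on screen...
    print(v, len(v.encode("utf-8")))   # ...but each one is a different byte string
print(len(set(variants)), "distinct 'microstates' for one visible 'macrostate'")
```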
c
"human intelligible" can be quantified to an extent by experiments like getting people to describe it over a phone call, or recreate it in a separate room where they have to read it, remember a bit of it, go next door, and write it down.
It would be interesting to see if someone could recreate a JSON with fewer trips than the same data as XML
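To make that concrete, here is the same record in both encodings, generated with the standard library (the XML layout is just one plausible mapping):
```python
import json
import xml.etree.ElementTree as ET

record = {"name": "Ada Lovelace", "born": 1815, "fields": ["mathematics", "computing"]}

# JSON: what one participant would have to read, remember and rewrite.
print(json.dumps(record, indent=2))

# One plausible XML rendering of the same data.
root = ET.Element("person", name=record["name"], born=str(record["born"]))
for f in record["fields"]:
    ET.SubElement(root, "field").text = f
print(ET.tostring(root, encoding="unicode"))
```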
k
That's an interesting experiment! It also illustrates that human-intelligibility comes at different levels. There's a difference between "this contains obviously no privacy-relevant data" and the ability to recreate an equivalent data object down to the details of syntax.
Had an interesting video call yesterday, about data management in large-scale distributed computations (think bioinformatics etc.). Someone with a lot of experience in both practice and teaching said that an important lesson is to keep most relevant metadata in filenames. Her experience is that metadata in the files (in the file formats that explicitly provide metadata fields) is often wrong in practice, because it's invisible. Users don't care to look at it (an extra step, not always obvious), and then tools don't update it, because, why update something that nobody looks at? So the "use text because it is so much more accessible" principle holds even for the small text bites that are filenames.
a
Worthwhile points, but they don't address the transparency issue, at least not directly.
k
@Konrad Hinsen:
Wondering: could we define some measure of the opacity of a data representation?
A plain text and its rot13 encoding are equivalent in information theory. But the text is human-readable, its rot13 encoding isn't.
@Chris Knott:
"human intelligible" can be quantified to an extent by experiments like getting people to describe it over a phone call, or recreate it in a separate room where they have to read it, remember a bit of it, go next door, and write it down.
Another idea, in the spirit of the Turing test: what fraction of humans can tell if two pieces of data (in the given representation) represent the same 'object'. (Sorry for the necrobump; this feels like a really valuable thread to preserve in Slack history a little longer.)
🤔 2
c
Is there any research on measuring human understanding of data formats? Quite difficult to Google for, because Google doesn't understand "data" as the subject of the research.
k
Yeah, I had some similar questions. Understanding depends on what tools the humans use.