# thinking-together
f
I was reading some text vs. binary file arguments and one thing that kept coming up was that text files are easier to recover when they get corrupted. I've got a couple of questions regarding this and would like to hear your thoughts:
1. When did you have to deal with corrupted files, and what was causing the corruption?
2. What did the corruption look like (single swapped bits, part of the file missing, ...)?
My current line of thought is that small errors in binary files could simply be corrected by adding redundancy to the format (CRC etc.), and larger missing parts wouldn't be easy to recover from in text formats either, so the argument wouldn't make much sense. I'm curious what you might have experienced, as I can't remember having seen corrupted files in the last couple of years.
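To make the redundancy idea concrete, here's a rough sketch with an invented record layout: a CRC32 trailer means a reader at least detects a flipped bit instead of silently returning garbage (actually *correcting* it would take a real error-correcting code, not a CRC):
```
import struct
import zlib

def pack_record(payload: bytes) -> bytes:
    # length prefix + payload + CRC32 trailer (layout invented for this sketch)
    return struct.pack("<I", len(payload)) + payload + struct.pack("<I", zlib.crc32(payload))

def unpack_record(blob: bytes) -> bytes:
    (length,) = struct.unpack_from("<I", blob, 0)
    payload = blob[4:4 + length]
    (crc,) = struct.unpack_from("<I", blob, 4 + length)
    if zlib.crc32(payload) != crc:
        raise ValueError("CRC mismatch: record is corrupt")
    return payload

rec = bytearray(pack_record(b"hello, binary world"))
rec[6] ^= 0x01                  # flip one bit in the payload
try:
    unpack_record(bytes(rec))
except ValueError as e:
    print(e)                    # CRC mismatch: record is corrupt
```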
d
I saw corrupted files frequently, mostly when they came over the network and got cut off in the middle, or when a stream wasn't working well, but also in memory dumps where parts of the memory had been overwritten, etc.
n
In some ways the distinction between text and binary is arbitrary. Text files just have a whole ecosystem of tools that are able to parse (and display) the bits in a certain way. I expect any issues of corruption or error-correction would equally apply to both.
💯 1
g
"text file" often indicates (to some extent) a human-readable file, and human-readable files have a lot of redundancy built in; all human languages do
a
Yep. When a human reads a file, they can eyeball it to see if it looks corrupt, make a guess at what it should have looked like, and manually patch it. With binary files, usually your parser just barfs. All the recovery steps a human performs have to be programmed in. As to the actual question, the only time I can remember personally was from downloading ISOs: a couple of them failed their hash check and had to be re-downloaded. Bad disk dismounts are the other classic case, with a broader category of bad disk write/flush handling that can leave a file in weird states (https://danluu.com/file-consistency/). TBH I wouldn't want to hand-restore one of those files either.
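The hash check itself is nothing fancy; something like this, with the file name and published digest as placeholders:
```
import hashlib

def sha256sum(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB at a time
            h.update(chunk)
    return h.hexdigest()

expected = "..."  # the distributor's published checksum (placeholder)
if sha256sum("distro.iso") != expected:
    print("hash mismatch, re-download")
```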
k
Corrupted files have always been part of my digital life. The main cause: software bugs. Number two: aborted computation jobs, usually due to resource limitations on batch systems. For programs that write serialized output, a corrupted file is usually just truncated. That is usually recoverable. But when working on large binary files, random-access modification is quite frequent and can result in just about any mistake.
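A sketch of why the truncated case is usually recoverable, assuming a made-up length-prefixed record format: read complete records until the stream runs short and drop the partial tail:
```
import struct

def recover_records(blob: bytes) -> list[bytes]:
    records, offset = [], 0
    while offset + 4 <= len(blob):
        (length,) = struct.unpack_from("<I", blob, offset)
        if offset + 4 + length > len(blob):
            break  # truncated mid-record: keep the complete ones, drop the tail
        records.append(blob[offset + 4:offset + 4 + length])
        offset += 4 + length
    return records
```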
f
@Naveen Michaud-Agrawal The "advantage" of text files is that characters are encoded individually and each character only takes a couple of bytes. If there's an incorrect byte, the corruption is limited to at most two characters. In binary encodings, a single error (e.g. in a length field) can corrupt everything that follows. But you're right, text files are basically a subset of binary files and all advantages they have could also be achieved in other (non-text) formats.
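A quick illustration of that failure mode, using an invented length-prefixed format: one flipped bit in a text file damages one character, while the same flip in a length field desynchronizes everything after it:
```
import struct

def pack(items):
    return b"".join(struct.pack("<I", len(x)) + x for x in items)

def parse(blob):
    out, off = [], 0
    while off + 4 <= len(blob):
        (n,) = struct.unpack_from("<I", blob, off)
        out.append(blob[off + 4:off + 4 + n])
        off += 4 + n
    return out

text = bytearray(b"alpha beta gamma")
text[0] ^= 0x02                  # 'a' -> 'c': one character damaged
print(text.decode())             # "clpha beta gamma"

blob = bytearray(pack([b"alpha", b"beta", b"gamma"]))
blob[0] ^= 0x02                  # length 5 -> 7 in the first prefix
print(parse(bytes(blob)))        # every record after the flip is misread
```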
t
I think @Andrew F nailed it with "they can eyeball it". It's less of a theoretical math problem and more of a pragmatic "how long does it take you to figure out exactly what the file should be and what it is". I worked at a data storage company, so we'd corrupt binary stuff all the time in dev, and it was really hard to tell what was broken when it was binary. I really wanted to write a to_json serializer so I could see system state in a structured, textual format, but never got around to it.
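The to_json idea could start as small as decoding a header into a dict and letting json do the legible rendering; the header layout here is invented purely for illustration:
```
import json
import struct

def header_to_json(blob: bytes) -> str:
    # invented header: 4-byte magic, u16 version, u32 block count
    magic, version, n_blocks = struct.unpack_from("<4sHI", blob, 0)
    return json.dumps({
        "magic": magic.decode("ascii", errors="replace"),
        "version": version,
        "n_blocks": n_blocks,
    }, indent=2)

print(header_to_json(b"BLK0" + struct.pack("<HI", 3, 128)))
```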
👍 1
k
Most people don't appreciate how easy it is to undelete a file on Unix:
* make the file system read-only (to prevent more writes)
* `grep -n10000` the device of the hard disk, go get a coffee
* scan the output in a text editor, find the right section, delete above and below

If it's a binary file you now need more tooling.
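For the binary case, the "more tooling" might begin as a raw scan for some marker you remember from the file; a rough sketch, with the device path and marker as placeholders (and the device mounted read-only, run as root):
```
MARKER = b"a string you remember from the file"  # placeholder
DEVICE = "/dev/sdb1"                             # placeholder

with open(DEVICE, "rb") as dev:
    offset, tail = 0, b""
    while True:
        chunk = dev.read(1 << 20)
        if not chunk:
            break
        buf = tail + chunk
        pos = buf.find(MARKER)
        if pos != -1:
            print("found at byte", offset - len(tail) + pos)
        tail = buf[-(len(MARKER) - 1):]  # keep overlap so matches can span chunks
        offset += len(chunk)
```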
🙌 1
j
+1 that the main benefits are legibility and the ubiquity of text-based tooling
m
in my case the corrupted files I've had to handle mostly fall into two cases: 1) truncated files, 2) a file with a block of zeros at the end
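A quick check for case 2 might look like this; the 4096-byte block size is just an assumption:
```
import os

def zero_tail(path: str, block: int = 4096) -> bool:
    # does the file end in a block of zeros?
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(0, size - block))
        tail = f.read()
    return len(tail) > 0 and not any(tail)
```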