# thinking-together
k
Immortal programs vs crash-only programs

Immortal programs: http://steve-yegge.blogspot.com/2007/01/pinocchio-problem.html
Crash-only programs: https://en.wikipedia.org/wiki/Crash-only_software

In brief, immortal programs try to never, ever reboot. Crash-only programs are designed to always be able to recover gracefully from a reboot.

There's a fundamental tension here, and I'm starting to realize I'm very definitely on one side of it. I like a neat desk and compulsively close things (terminals, browser tabs, browser sessions) when I'm done with them. I prefer text editors to IDEs, vim to emacs, Unix as my IDE rather than Slime. I'd always thought of these as subjective opinions that came down to my personality and past experience. But, upon reflection, I want to make a stronger case that "my side" is superior.

1. Focusing on recovering from reboots makes you better at simulating immortality; restarts can in principle become instantaneous. Focusing on never rebooting makes you worse at recovering from crashes.
2. It's easy for immortal programs to end up in situations that are difficult to reproduce. I spent some time recently programming with @Tudor Girba's Glamorous Toolkit. Modern Smalltalk uncomfortably straddles the image and git-repo worlds: the way you work is to make changes to your running image until you have something you like, then go back and package up a slice of your image into a git repository to publish. If you make mistakes, others can have trouble reproducing the behavior you created in your image, and testing whether you did it right necessarily requires rebooting the image.

Putting these reasons together, immortal systems are more forbidding to newcomers. Crashing becomes a traumatic event, one newcomers are not used to and beginner tutorials don't cover. When things don't work, it's harder to ask for help, because creating and sharing reproducible test cases requires crash-recovery skills.

Rereading the Pinocchio post now, I notice that it actually states no concrete benefits for long-lived programs; all it offers are (compelling) analogies. A counter-analogy: an immortal program is like a spaceship: once launched you're in a little bubble, stuck with whoever you happened to start out with. A crash-only program is like a little stone rolling down a hillside, gathering other stones until it turns into an avalanche.

As I said above, I'm biased because of my experiences. I'm curious to hear from others with more experience of immortal programs: am I understating the benefits, overstating the drawbacks?
😂 1
🤔 2
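A minimal sketch of what point 1 can look like in practice (all names and file paths here are hypothetical, not from any project mentioned above): a crash-only process checkpoints its state atomically and reloads it on startup, so killing it at any moment loses at most the last in-flight update and a restart is effectively instantaneous.

```python
# Hypothetical crash-only counter: state survives a kill -9 at any point.
import json, os, tempfile

STATE_FILE = "counter.json"  # illustrative path

def load_state():
    # On boot, recover the last checkpoint; a missing file is the "known good" empty state.
    try:
        with open(STATE_FILE) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"count": 0}

def save_state(state):
    # Write to a temp file and rename: the checkpoint on disk is either the
    # old version or the new one, never a half-written file.
    fd, tmp = tempfile.mkstemp(dir=".", prefix="counter.")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, STATE_FILE)

if __name__ == "__main__":
    state = load_state()
    state["count"] += 1
    save_state(state)
    print("count is now", state["count"])
```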
c
These two write-ups influenced me to take a similar point of view to the one you seem to be voicing here: https://ferd.ca/the-zen-of-erlang.html https://medium.com/@mattklein123/crash-early-and-crash-often-for-more-reliable-software-597738dd21c5
❤️ 1
k
That's a good point. I think Erlang might be the ultimate no-compromises crash-only system, the way Smalltalk is the ultimate no-compromises immortal system.
j
I think this is the key to the success of the database / application server separation. You put all of your long-lived state into some immortal process with a very controlled data model and put all the scary weird stuff into a crash-only process.
💯 2
☝️ 1
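A sketch of that split, assuming nothing beyond the Python standard library (the schema and function names are made up): the "immortal" part is a SQLite file with a very controlled data model, and the application logic is crash-only code that holds no state of its own between calls.

```python
# Hypothetical split: durable state lives in SQLite ("the immortal process"),
# while the application logic is stateless and can crash and restart freely.
import sqlite3

DB_PATH = "app.db"  # illustrative path

def init_db():
    with sqlite3.connect(DB_PATH) as db:
        db.execute(
            "CREATE TABLE IF NOT EXISTS orders "
            "(id INTEGER PRIMARY KEY, item TEXT, qty INTEGER)"
        )

def place_order(item, qty):
    # The scary weird stuff (validation, business rules) lives here; if this
    # process dies mid-request, the database transaction keeps it consistent.
    with sqlite3.connect(DB_PATH) as db:
        db.execute("INSERT INTO orders (item, qty) VALUES (?, ?)", (item, qty))

if __name__ == "__main__":
    init_db()
    place_order("widget", 3)
```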
Maybe the real benefit of the immortal systems Yegge describes is not that you can avoid restarting them, but that they are forced to have lots of tools for inspection and modification. If you have something like https://ourmachinery.com/post/the-story-behind-the-truth-designing-a-data-model/ instead, you can get the same benefits while also being able to do clean restarts.
❤️ 2
Similarly, the problem I had with Smalltalk is not that state lives in the image, but that the state is smeared all over the place and built out of pointers and mutable variables. Having all your state and code in, e.g., SQLite or CouchDB is essentially the same idea, but much easier to inspect and understand.
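One concrete reading of that claim (a sketch, reusing the hypothetical app.db file from the snippet above): because the state is an ordinary SQLite file, any other process can open it and dump it with standard tools, no live image or attached debugger required.

```python
# Hypothetical external inspector: point it at the same SQLite file the app
# uses and dump every table, without touching the running application at all.
import sqlite3

def dump(db_path):
    db = sqlite3.connect(db_path)
    tables = [row[0] for row in
              db.execute("SELECT name FROM sqlite_master WHERE type='table'")]
    for table in tables:
        print(f"-- {table}")
        for row in db.execute(f"SELECT * FROM {table}"):
            print(row)
    db.close()

if __name__ == "__main__":
    dump("app.db")
```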
k
There's definitely value in reducing the blast radius of crashes, but that feels like an orthogonal axis. I love how the DB is separate when I restart my web app, but that doesn't help me when I need to restart my DB. Is there any stateful system today that has a decent story for restarting without downtime? Now I wonder what a storage system would look like if it were designed from the ground up to be crash-only...
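One possible answer, sketched under obvious simplifications (the class and file names are made up): a storage engine that is crash-only from the ground up has no clean-shutdown path at all. Every write goes to an append-only log, and "recovery" is just the normal startup path of replaying whatever prefix of the log survived. This is roughly the write-ahead-log idea.

```python
# Hypothetical crash-only key-value store: the append-only log *is* the database.
# Startup and crash recovery are the same code path: replay the log.
import json, os

class LogStore:
    def __init__(self, path):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            with open(path) as f:
                for line in f:
                    try:
                        record = json.loads(line)
                    except json.JSONDecodeError:
                        break  # torn final write from a crash: stop replaying
                    self.data[record["k"]] = record["v"]
        self.log = open(path, "a")

    def put(self, key, value):
        self.log.write(json.dumps({"k": key, "v": value}) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())  # durable before we acknowledge the write
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)

if __name__ == "__main__":
    store = LogStore("store.log")
    store.put("greeting", "hello")
    print(store.get("greeting"))
```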
a
I definitely lean toward crash-only. The only truly immortal program is one that runs on a machine that never has power failures, angry users with hammers, etc. I also agree that, in principle, requiring reboots is a sign of flawed software; I froth at the mouth a little bit every time I have to "turn it off and on again". I think the only real tension between those two ideals is the one related to users being unfamiliar with the process of rebooting, and I'm pretty sure that can be surmounted. I can't think of any reason you wouldn't try for both. Maybe think of it as "immortal unless you pull the plug, which must be allowed".

I don't have the patience I once did for Stevey Blog Rants, so I skimmed his post, especially the middle. However, I think the parts he thinks should be immortal (what he calls software, as opposed to the rigidly defined software-as-hardware he says static types create) should be thought of as user-created content that is interpreted by relatively stateless infrastructure. From that perspective, it's clear that content should be immortal (and portable, inspectable, etc.), while the infrastructure can be started up or killed whenever. Obviously there's some caching involved, maybe even including JITed user code, but I think this architecture could basically get the nice properties of both immortality and crash-only.
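A sketch of that content-vs-infrastructure split (all file names hypothetical): the user's content is the only durable thing, and everything the infrastructure derives from it (caches, compiled artifacts, JITed code) is treated as disposable and rebuilt on the next start.

```python
# Hypothetical split: content.txt is immortal user data, the derived cache is
# disposable and rebuilt whenever the infrastructure process starts.
import json, os

CONTENT = "content.txt"   # user-created, must survive everything
CACHE = "content.cache"   # derived, safe to delete at any time

def build_cache():
    with open(CONTENT) as f:
        text = f.read()
    derived = {"words": len(text.split()), "chars": len(text)}
    with open(CACHE, "w") as f:
        json.dump(derived, f)
    return derived

def load():
    # Infrastructure startup: use the cache if it exists, otherwise rebuild it.
    if os.path.exists(CACHE):
        with open(CACHE) as f:
            return json.load(f)
    return build_cache()

if __name__ == "__main__":
    if not os.path.exists(CONTENT):
        with open(CONTENT, "w") as f:
            f.write("hello immortal content")
    print(load())
```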
g
There are tons of distributed databases that handle partial failure all the time. I work with Cassandra. One of the interesting things we’ve done is to use AWS’s virtual block storage (EBS) to swap versions by reattaching storage to the new instances.
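Roughly what that swap looks like with the AWS SDK for Python (a sketch, not the poster's actual tooling; the volume and instance IDs are placeholders, and configured AWS credentials are assumed): detach the EBS volume from the old node, wait for it to become available, then attach it to the replacement instance.

```python
# Sketch of moving a persistent EBS volume to a freshly provisioned node.
import boto3

def move_volume(volume_id, new_instance_id, device="/dev/sdf"):
    ec2 = boto3.client("ec2")
    # Detach the volume from whichever instance currently holds it...
    ec2.detach_volume(VolumeId=volume_id)
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])
    # ...then attach the same data to the replacement instance.
    ec2.attach_volume(VolumeId=volume_id, InstanceId=new_instance_id, Device=device)
    ec2.get_waiter("volume_in_use").wait(VolumeIds=[volume_id])

# Example (placeholder IDs):
# move_volume("vol-0123456789abcdef0", "i-0123456789abcdef0")
```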
b
I'm 99.9% biased toward crash-reboot from my empirical dataset. I'd also say that in biology, cell division seems like a crash-reboot.
c
This is why I like git - it can't really "crash" as such, because it isn't running 99.9% of the time. It's a highly stateful system, but all the state is either on disk or very short-lived. The equivalent of "crashing" in systems like this is getting a bad config, so it won't run at all, which is incredibly annoying to deal with. So I'm not sure there's actually a philosophical difference here, other than "be mindful of unanticipated states"; having a crashy system probably just makes you think about state more explicitly. The main difference, I guess, is that a "crashed" (i.e. corrupted) git repo is an automatic "memory dump" to which your file browser and text editor act like an already-attached debugger.
💡 1
w
There really are important differences between long-lived and short-lived parts of a system. To a first approximation it's all just data and transformations thereof, but how you manage the tradeoffs of robustness, performance, mutability, and recovery differs.
k
For me it's crash-only for small software units, and immortal as the vision for large software assemblies. In practice, moderate-size systems (Emacs, Smalltalk, Unix, ...) are immortal assemblies of crash-only processes. Just like biological systems, by the way: individual organisms are crash-only, but ecosystems are (aiming at) immortal. In both universes, immortal systems must adapt to changing environmental conditions, which they do by replacing crashed units with modified crash-only units.
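A sketch of that "immortal assembly of crash-only units" shape, in the spirit of an Erlang supervisor but using only the Python standard library (the worker here is deliberately trivial and made up):

```python
# Hypothetical supervisor: the assembly aims at immortality by restarting
# crash-only workers whenever they die, not by never crashing.
import multiprocessing, random, time

def worker(name):
    # A crash-only unit: it does its job and may die at any moment.
    while True:
        time.sleep(1)
        if random.random() < 0.2:
            raise RuntimeError(f"{name} crashed")

def supervise(names):
    procs = {n: multiprocessing.Process(target=worker, args=(n,)) for n in names}
    for p in procs.values():
        p.start()
    while True:  # the supervisor loop runs indefinitely
        time.sleep(0.5)
        for name, p in list(procs.items()):
            if not p.is_alive():
                # Replace the crashed unit with a fresh (possibly modified) one.
                procs[name] = multiprocessing.Process(target=worker, args=(name,))
                procs[name].start()
                print(f"restarted {name}")

if __name__ == "__main__":
    supervise(["cache", "indexer"])
```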
That leaves the question of personal preferences such as @Kartik Agaram describes. BTW, mine are largely the opposite: I am happy with Emacs, Smalltalk, Slime, and my desk is a mess. All I can offer is an educated guess based on comparing my way of thinking/working with other people I know very well (family, friends, long-time collaborators). I am the big-picture guy who will happily ignore details as much as possible. That seems to be correlated with the messy desk, vs. the tidy desk for people who pay attention to detail, but then easily lose the big picture.
I don't think the "blast radius" of a crash is orthogonal. On the contrary, it's why larger assemblies cannot afford the crash-only approach. Complex state takes too much time to rebuild. And it takes an even larger embedding system to handle the reboot. Once you reach the scale where a system is one of a very few, crashing is no longer an option.
j
Somewhat tangentially: how does our thinking change when we switch to non-volatile RAM, so that program state is sort of inherently immortal? The Twizzler OS talk I watched hurt my brain a bit in this respect.
k
Does Twizzler OS provide a means to reinitialize state to a known good value? That's what I consider essential for rebooting. Non-volatility just removes one cause of crashes; that looks like a minor detail.
g
[moved into thread from #C5T9GPWFL] I agree with your perspective, but I think that the Happy Path culture is more insidious than imagined. It deeply affects our tools and thought processes. Here are some of my thoughts: [Failure-Driven Design](https://guitarvydas.github.io/2021/04/23/Failure-Driven-Design.html). (edited)
❤️ 2
t
If you run state machines on a Paxos group, you get immortality despite distributed-systems failures. (This is how a component of Spanner works.)
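For context, a sketch of the state-machine half of that idea (the consensus half, Paxos itself, is elided, and the command format is made up): as long as every replica applies the same agreed, ordered log of commands to the same deterministic state machine, any replica can crash and be rebuilt by replaying the log.

```python
# Sketch of a deterministic replicated state machine. Paxos (or Raft) is only
# responsible for making every replica see the same ordered log; after that,
# "immortality" is just deterministic replay.
class KVStateMachine:
    def __init__(self):
        self.state = {}

    def apply(self, command):
        op, key, *rest = command
        if op == "set":
            self.state[key] = rest[0]
        elif op == "delete":
            self.state.pop(key, None)

# The log the consensus layer is assumed to have agreed on.
agreed_log = [("set", "leader", "node-1"), ("set", "epoch", 7), ("delete", "leader")]

replica_a, replica_b = KVStateMachine(), KVStateMachine()
for cmd in agreed_log:
    replica_a.apply(cmd)
    replica_b.apply(cmd)

# Both replicas converge to the same state, and a crashed replica can catch up
# simply by replaying the log from the start.
assert replica_a.state == replica_b.state == {"epoch": 7}
```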
j
The thing that keeps popping into my head when this thread comes up in my window in Slack (it's what autoloads every time I restart my machine) is "maybe code really isn't data".
❤️ 1
🤔 2