# thinking-together
Ancient tech, but also the only one ever designed for this kind of use case without compromises. All of today's Open Source clones of Mathematica notebooks started with the constraint of building on pre-existing technology that was designed for different use cases. Perhaps even worse, they suffered from a rush to widespread adoption before the design was battle-tested. I have often said at conferences that Jupyter will be tomorrow's legacy format that everyone will wish would go away. Fortunately nobody carries rotten tomatoes to such events, so I only get verbal abuse.
I was reflecting on why it became the convention on my team to save notebooks in source control with cell output stripped, since, coming from Mathematica, it’s ludicrous to open a notebook and not see the half of the content that contains the results. I realized that a big part of it was that legacy-tech problem. Outputs can’t be used as inputs because the frontend is a totally different stack, so they’re not nearly as useful. And they’re bloated because there’s no concise way to specify the graphics commands to render a plot: they’re giant blobs of HTML and CSS and JavaScript, which will probably be different every time you rerun the notebook.
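For what it’s worth, the stripping itself is mechanical, since an `.ipynb` file is just JSON. A minimal sketch, assuming the standard nbformat-v4 schema (tools like `nbstripout` do this more robustly, with git-filter integration):

```python
import json

def strip_outputs(nb: dict) -> dict:
    """Clear outputs and execution counts from a v4 notebook dict,
    leaving only the source cells for version control."""
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return nb

# Typical use: round-trip the .ipynb file as plain JSON, e.g.
#   nb = json.load(open("analysis.ipynb"))
#   json.dump(strip_outputs(nb), open("analysis.ipynb", "w"), indent=1)
```

The fact that this loses nothing you’d want to diff is exactly the point: the outputs are opaque blobs, not data you could feed back in.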
I use Jupyter notebooks a decent amount and they definitely aren’t perfect. But as a data person, I sadly don’t have much of a choice. IDEs are even worse. There are some good SaaS products that offer richer, live-collaborative notebooks which I’ve enjoyed using. They aren’t open source, so that’s the tradeoff. Gone are the days of open source software with good design being built for specific craftspeople (like scientists or data scientists). Sadly, nowadays open source to me usually means “not designed tightly with users’ workflows in mind.”
All that is why I advocate for convivial software in computational science. If we want tools that fit our needs, we need to be able to tweak them to our needs. Today's "Open Source with insufficient funding" approach to scientific software means that we have to live with what we can hack together based on ingredients made by others for other purposes. Jupyter is a good example. Unfortunately, most of my colleagues don't even believe this to be possible.
> If we want tools that fit our needs, we need to be able to tweak them to our needs.
Definitely agree. One challenge I see is that software engineering culture is driven by, and dominated by, people shipping software into production for users in commercial environments ‘at scale’. Their job historically has been made ‘harder’ by non-engineers being empowered by tools like Excel, Tableau, SQL, etc., because those aren’t version controlled, they’re somewhat brittle ‘at scale’, and engineers don’t want to support shipping their ‘content’ into production. So we end up with this half-baked solution: end-users get a high-level programming language that has expressiveness and freedom (unlike Excel or Tableau), and engineers are happy to support that, nay embrace it. But most data scientists don’t want to learn computer science and Python, they want to do their work.
I sense we’re missing some strong examples of convivial software that “have scaled” or examples that didn’t scale but that was acceptable for the org
But computational scientists want to feel empowered! They want to use R or Excel to analyze data but then ship an interactive data experience on the web. We end up with lots of weird half baked solutions in those cases 😕
I'm not a data scientist, but I thought Philip Guo's work on IncPy was an interesting alternative. I don't know if it's still being developed, though. My main gripe with notebooks is that there's lots of intermediate state that depends on which cells have been executed. And that's not really visible, so it forces the user to remember all of the implicit context of what they're doing. As I understand it, IncPy solves that by effectively running your entire script from scratch every time, but memoizing the parts that haven't changed. So you can still tweak the script and get instant feedback for minor changes. I haven't actually tried it, but it sounds much friendlier to the user than a traditional notebook. And the approach also ensures the results are 100% reproducible.
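A toy sketch of that memoization idea (not IncPy's actual mechanism, and the cache location is made up): key each function's cached result on its compiled bytecode plus its arguments, so rerunning the whole script only recomputes what changed. The hard part IncPy tackled, doing this safely and automatically for impure code, is exactly what this toy skips.

```python
import functools
import hashlib
import pickle
from pathlib import Path

CACHE_DIR = Path(".memo_cache")  # hypothetical on-disk cache location

def memoize_to_disk(fn):
    """Cache a pure function's results on disk, keyed by its bytecode
    and its arguments, so unchanged work survives a full re-run."""
    CACHE_DIR.mkdir(exist_ok=True)

    @functools.wraps(fn)
    def wrapper(*args):
        key = hashlib.sha256(
            pickle.dumps((fn.__name__, fn.__code__.co_code, args))
        ).hexdigest()
        path = CACHE_DIR / key
        if path.exists():                       # same code + args: reuse
            return pickle.loads(path.read_bytes())
        result = fn(*args)                      # otherwise recompute...
        path.write_bytes(pickle.dumps(result))  # ...and record for next run
        return result

    return wrapper
```

Editing a function's body changes its bytecode and invalidates its entries, which is what gives the "instant feedback for minor changes" effect. It is only sound for pure functions with picklable arguments, which is why doing it for arbitrary Python is so hard.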
IncPy is Python 2 only, so it's safe to declare it dead. Similar ideas have been proposed, and sometimes implemented, but I have yet to see anything for a mutable-data language like Python that's actually safe and efficient to use. The problem with Jupyter is very fundamental: the data model of the notebook. It combines (1) code and documentation with (2) a partial trace of execution and results. It would have been just as straightforward, and a lot more useful, to keep (1) and a full trace of execution as separate data structures. It's (1) that you want to keep under version control, and (2) that you want to make computationally reproducible. But now it's too late for such changes (see here for technical details).
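To make the proposed split concrete, here is a minimal sketch (my illustration, not any existing format): the document is just cells, and execution produces a separate, complete trace keyed by cell id and source hash. Cells assign to `_` to "display" a value, loosely mimicking a REPL.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Cell:
    id: str
    source: str        # code or prose: the part you version-control

@dataclass
class TraceEntry:
    cell_id: str
    source_hash: str   # ties each result to the exact code that produced it
    result: object

def run(document: list[Cell]) -> list[TraceEntry]:
    """Toy evaluator: execute every cell in order and return a full
    execution trace as a data structure separate from the document."""
    ns: dict = {}
    trace = []
    for cell in document:
        ns.pop("_", None)
        exec(compile(cell.source, cell.id, "exec"), ns)
        digest = hashlib.sha256(cell.source.encode()).hexdigest()
        trace.append(TraceEntry(cell.id, digest, ns.get("_")))
    return trace
```

The document never grows an `outputs` field, so diffs stay clean, while the trace can be archived whole, or checked against a re-run, for reproducibility.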