# thinking-together
s
I’m a data scientist by background, and a lot of this PL stuff is new to me. However, I think data science is an interesting use case for innovation in PL. The most common use cases are a bit more bounded and well defined, the persona base is ideal (people who just wanna do data stuff, not program), and there’s a non-PL success here already (Excel!).
• Are there others here who are motivated by this data science use case / working on it? I know Instadeq is here, I’m sure there are many others. I’ve chatted with Erik Blakke about his Ultorg system.
• Why haven’t we seen a good live programming language for data science? It’s so ideal for it! Everything from sampling / streaming in data results to keep things live, to the fact that data analysts / scientists want to move / iterate at the speed of thought, and most of data science is this type of curiosity-driven stumbling around in the data-dark
🆒 1
❤️ 3
👍 1
m
I guess because data science as a field is fairly new and most of the people here are programmers and want to improve the tools they work with, that is, programming languages.
for example R which is used a lot in data science was created mostly by statisticians
e
Have you looked at tools like Roassal?
s
eh not sure I agree. Data science was the first application of programming to begin with! In the broadest sense, anyway: initially it was just simpler computations
👍 2
R is definitely interesting. Super wonky language but very friendly for statisticians
at my last job we had to add an R learning track b/c Python was too much for most people (too many CS-y concepts to learn)
with R, you just install RStudio
a
I'm a data analyst working primarily in RStudio, and I'd argue that the R/RStudio/Tidyverse stack is already a very good live programming environment for data work. R's wonkiness as a language is only an issue if you're coming from a more traditional PL. I'm also a huge fan of Observable, and with the recent release of Arquero, Observable and JS is a viable platform for lots of data work.
👍 2
s
I love the premise of Observable but I personally really struggle to use JS (even their dialect of JS). I need to spend more time here
Arquero looks neat…
a
Yeah, Arquero is heavily influenced by R's dplyr library, which is like 70% of why I love using R, and it seems like "tidy" data is catching on in JS. But both Observable and RStudio can be used without learning any command line (library installation is in-language, Observable punts on git altogether). I think it's a really interesting question which CS concepts you need to know to be an effective data scientist (broadly defined). My career might have gone way differently if I had been able to install a Python library like 10 years ago when I started teaching myself how to code.
s
interesting, what do you think would have been different about your career (hypothesizing) if you had picked up Python / grinded through the CS stuff?
from my perspective, the “CS stuff” you need to know should mirror the conceptual structures in your head
tables / spreadsheets / csvs / databases are all a format, a CS concept, and a mental representation
a
In R for instance, data frames are lists of columns, but when I'm wrangling data, I don't actually think of them as such, they're just data frames. Likewise for the non-standard evaluation that makes the syntax more user friendly, which I still only vaguely understand. I don't think loops are a particularly complicated idea, but I also don't know that loops are necessary to do data work. Excel doesn't have named functions and didn't have the concept of a table until a few years ago. Even the difference between in-memory and on-disk data feels like accidental complexity from the perspective of data work. But I mostly mean things like git, docker, and command line bullshittery.
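As a toy sketch of the "data frame = list of columns" view (purely illustrative, not any particular library's API), in Python:

```python
# Toy column-oriented "data frame": a dict mapping column names to
# equal-length lists. This mirrors how R stores data frames internally,
# even though users mostly just think "data frame".
frame = {
    "apples": [1, 2, 3, 4, 5],
    "pears": [6, 7, 8, 9, 10],
}

# Deriving a new column is a column-wise operation; no loop is visible
# to the user (the iteration hides inside the comprehension).
frame["fruit"] = [a + p for a, p in zip(frame["apples"], frame["pears"])]

print(frame["fruit"])  # → [7, 9, 11, 13, 15]
```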
👍 1
🤔 1
s
right that makes sense
a
Mildly related, I actually have a little blog where I solve data science problems with "esoteric" languages. It has mostly turned into just lesser known languages. But I am inspired by the idea of a top to bottom language built with data science at its heart.
👍 2
s
woah woah woah, I NEED to read this 🙂
I’ve been thinking of starting a data tools / languages focused blog
g
I'm mainly interested in the distributed computing wing of data science. Having dealt with Spark and Hadoop, I really think they're very ripe for a better programming experience
j
Have you had a look at Julia? It's basically an infix re-implementation of Common Lisp that, while general purpose, is targeted toward exactly this niche: https://docs.julialang.org/en/v1/stdlib/REPL/
k
Julia is definitely worth a look for anything data science, if you don't depend on libraries in other languages. I wouldn't quite call it a variant of Common Lisp, since its live programming support is not at the same level, but it's certainly better than any of the more established data science languages. One potential issue to watch out for is the enormous dependency stack of Julia, being based on LLVM. If you ever find yourself having to install it from source, that's a major undertaking, and if you need your code to work for ten years, the fragility of that stack could also become a source of trouble.
s
yeah I’ve played with Julia a bit
I think it’s fine for its designed purpose, which is “high-performance-ish stepchild of Python / R but modern”
but it’s not a way of computing 🤔
l
@Andrew Carr Would you share a link to your blog?
a
Happy to! https://andrewnc.github.io/blog/blog.html I haven't been allowed to write this summer because of my internship. But I have 2 or 3 posts in the pipeline for the end of the year. I hope you enjoy reading, they're quite simple, mostly me recording my experience and experimentation. I'll gradually add more "deep" PL and data science topics as time goes on.
I probably have 15 or so languages I want to try out still too. I'll probably compile a list eventually of the cool esolangs I find
l
> tables / spreadsheets / csvs / databases are all a format, a CS concept, and a mental representation.
They're all tables 😉 Spreadsheets are tables with reactive function execution. (relational) Databases are collections of related tables, with the relationships themselves maintained in other tables.
@Andrew Carr thanks!
👍 1
@Eric Gade The first time I saw a Roassal demo video, I was so hyped, I tweeted it with a comment paraphrasing Arthur C. Clarke: "Any sufficiently advanced technology is indistinguishable from Smalltalk 80." Unfortunately, I could rarely get it to work reliably. I really wish someone would address Smalltalk's module system issues.
> In R for instance, data frames are lists of columns, but when I'm wrangling data, I don't actually think of them as such, they're just data frames.
@Alex Wein That's funny. I almost always think of them as lists of columns, unless I'm reading or writing them from/to disk. OTOH, I almost never extract a column from a table to use a column operation directly. It's always t.col(whatever).
p
Regarding live coding in Julia, I've heard good things about Pluto.jl https://github.com/fonsp/Pluto.jl
e
@larry I don’t really use Roassal myself, but I do know they’ve been hard at work on version 3, which hopefully provides more stability. I think it follows the new Iceberg/Baseline git-based installation. It will also come built in to Pharo 9, which is itself currently in its dev phase: https://github.com/ObjectProfile/Roassal3
l
FWIW, I think Wolfram/Mathematica is a language/toolkit that is worth considering when talking about data science languages. Its lispy style and abilities in the code-as-data area are interesting/impressive.
@Eric Gade I don't think the stability issues are Roassal-specific. I think it's a function of how Pharo combines code from various sources into one super-environment, especially without the relative safety provided by static type and version checking. Direct updates to the system classes are another issue. Base Pharo simply has 'too much junk in its trunk' to ever be reliable. The Date class, for example, has an easter function that, IIRC, returns the day Easter falls on in any given year.
e
I’ve not had too many problems with this kind of stability in Squeak or Pharo, but the environments do expect users to be a little more proactive in managing what’s going on
Both are still full of older classes/methods, like you’ve pointed out. But they’ve slowly been purging this stuff. The Pharo team actually builds the image from the ground up now, so anyone can bootstrap their own minimal images as needed
And yes, static type checking is not something you will get in that environment by definition and specific intent
s
@larry for sure, I need to play with Mathematica more. It gets a bad rep cuz it’s not open source but
they pioneered interactive / notebook-driven exploration
l
@Eric Gade I understand. I programmed in Smalltalk for years and love the language, but it hasn't evolved. Anyway, wrong thread for this 😉
j
@Konrad Hinsen I usually use the Julia support in ESS mode, which -- while not as good as SLIME -- gets me completion, "evaluation in place", and many of the other interactive programming goodies without which I'm always grumpy. My understanding is that colleagues mainly use Juno to get this sort of setup in what the kids like to call a "modern" editor: https://junolab.org
o
Does anyone know of or has used a good visual programming tool for data science (I mean à la Scratch or à la PureData)? I guess I will explore this space at some point. Maybe I will try to create a data extension for Scratch (maybe using Arquero; I didn't know it and it looks like a good candidate).
k
@Jack Rusher I wasn't thinking so much of tooling but of packages. In Common Lisp, packages are just namespaces. Live coders can modify code everywhere at any time. In Julia, as in Python or in most Schemes, live coding is limited to the "main" namespace (whatever it is called) and code in modules or packages can be modified predictably only by restarting the session. Caveat: I haven't looked seriously at Julia for more than a year. Maybe there is tooling to work around this by now.
j
@Konrad Hinsen Schemers usually call that namespace "the top level", thus the phrase "the top-level is hopeless" among Racketeers. (I fall on the opposite side of this from Matthias Felleisen, as live-coding is my preferred way of interacting with a computer.) Anyway, yeah, one can develop Julia packages in the interactive style, but modifying someone else's package at runtime is not part of the culture.
k
@Jack Rusher The aversion to the top level is also my main gripe with Racket. As for Julia, I'd be happy to be able to live-code my own packages, which of course includes forks of someone else's packages. Is that possible now? It's a pain not to be able to restructure code as it grows because of this "live coding only in one namespace" restriction.
1
w
One thing to keep in mind about Scheme is that it is carefully specced so as to not really be a dynamic language.
k
Indeed, and for good reasons, but there is still some room for tooling to provide better live programming support. Emacs with Geiser, for example.
d
@Srini K every field is working with the same raw resources, so there is going to be a huge overlap in tools, but they're all going to have different names. Can you give us more specific details about the problem you're trying to solve?
j
@wtaysom I'm a little confused as I've been livecoding scheme for 35 years. Would you say a bit more to help me understand what you mean?
w
Sure. Liveness isn't required of Scheme though most Schemes support it. In particular I remember R5RS (that's when I noticed and I could be wrong here since that was a while ago to say the least) does not mandate any sort of rich reflection or runtime interrogation of what's going on.
s
@Srini K wrote:
> Why haven’t we seen a good live programming language for data science? It’s so ideal for it! Everything from sampling / streaming in data results to keep things live to the fact that data analysts / scientists want to move / iterate at the speed of thought, and most of data science is this type of curiosity driven stumbling around the data-dark
To a large extent, S and R aim to be that language, and always have. I've taken the following from a slide in [History and Ecology of R](https://calcul.math.cnrs.fr/attachments/spip/IMG/pdf/r-history-ecology.pdf) by Martyn Plummer, 2015:
> “We wanted users to be able to begin in an interactive environment [emph. Sietse], where they did not consciously think of themselves as programming. Then as their needs became clearer and their sophistication increased, they should be able to slide gradually into programming.”
– John Chambers, Stages in the Evolution of S
This philosophy was later articulated explicitly in Programming With Data (Chambers, 1998) as a kind of mission statement for S:
> "To turn ideas into software, quickly and faithfully"
The 'iterate at the speed of thought' bit is especially interesting, because that implies interactive use, and that informs many of R and S's design choices. For example, `library(my_package)` loads that package's components (functions, objects, etc.) directly into the global/top-level namespace. Python, contrariwise, makes `import my_package` create a module object `my_package` that contains `my_package.glm` etc. Python's tradeoff favours codebase construction: you know the provenance of every name in your namespace, and you won't have namespace collisions. There's even a taboo on writing `from my_package import *`. R's tradeoff, contrariwise, favours interactive development. You load a package because you want to use its functions; and you want those functions ready at hand, which means the global namespace and no prefixes.
j
@Sietse Brouwer Totally. I use R the same way I use Julia or any Lisp, hanging off of emacs allowing me to send forms interactively from the buffer, pull up a plot window, and so on. And, for all the warts of the language, if one stays in the TidyVerse it's a fairly agreeable experience.
s
Yeah! And the Tidyverse, too, heartily favours interactive usage over programming. Take dplyr, for example: that you can type `data.frame(apples=1:5, pears=6:10) %>% mutate(fruit = apples + pears)` to compute a new fruit column is a big win for ergonomics. Alternatives like `df.fruit = df.apples + df.pears` get real old real fast, in my unfortunately-extensive experience with Pandas. And `mutate("fruit = apples + pears")` limits you to whatever operators are allowed in that stringly mini-DSL. But allowing unquoted names has a tradeoff: it disfavours programming in the tidyverse. What if I'd like to write a function that does the same 'analysis', but is generic over column names?
add_fruit <- function(df, col1, col2, colresult) {
    # usage: add_fruit(mydf, "apples", "pears", "fruit")
    df %>% mutate(colresult = col1 + col2)
}
would get me an error saying 'no `col1` in `df`', and I'd need to do some contortions to make clear that I want to use the contents of `col1` as a name, rather than the name `col1` itself. Other languages that make the same tradeoff:
• the various shell languages (sh, fish, PowerShell), where bare words are literal, variables are $something, and the meaning of a literal is determined by the invoked command's argument parser. Like R, a language focused on interactive use.
• Tcl, too: first word is a command, later words are command arguments, bare words are taken literally, $something denotes the contents of the 'something' variable. And Tcl, too, is focused on providing an interactive shell over lower libraries -- and even on making such shells easy to create.
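The shell version of the tradeoff fits in three lines (a minimal plain-sh sketch; `col` is just an example variable name):

```shell
# Bare words are literal; $name substitutes a variable's contents.
# Same tradeoff as R's unquoted column names: great interactively,
# but you need explicit syntax ($) to "use the contents of a name
# rather than the name itself".
col=apples
echo col      # → col      (the literal word)
echo $col     # → apples   (the variable's contents)
```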
k
@wtaysom @Jack Rusher One feature of Scheme that is a bit at odds with live coding (and thus the eternally running "system") is continuations. They make sense only in the context of a computation that has an end. A language designed for live-coding would use delimited continuations instead. Scheme implementations thus must treat the top level as something special.
🤔 1