Srini K
09/25/2020, 3:07 PM
Mariano Guerra
Mariano Guerra
Eric Gade
09/25/2020, 3:40 PM
Srini K
09/25/2020, 3:40 PM
Srini K
09/25/2020, 3:40 PM
Srini K
09/25/2020, 3:40 PM
Srini K
09/25/2020, 3:40 PM
Alex Wein
09/25/2020, 4:06 PM
Srini K
09/25/2020, 4:08 PM
Srini K
09/25/2020, 4:08 PM
Alex Wein
09/25/2020, 4:17 PM
Srini K
09/25/2020, 4:30 PM
Srini K
09/25/2020, 4:30 PM
Srini K
09/25/2020, 4:30 PM
Alex Wein
09/25/2020, 4:43 PM
Srini K
09/25/2020, 4:46 PM
Andrew Carr
09/25/2020, 9:56 PM
Srini K
09/25/2020, 9:59 PM
Srini K
09/25/2020, 9:59 PM
Gabriel Pickard
09/26/2020, 1:27 AM
Jack Rusher
09/26/2020, 7:37 AM
Konrad Hinsen
09/26/2020, 10:10 AM
Srini K
09/26/2020, 3:36 PM
Srini K
09/26/2020, 3:36 PM
Srini K
09/26/2020, 3:36 PM
larry
09/26/2020, 4:41 PM
Andrew Carr
09/26/2020, 4:46 PM
Andrew Carr
09/26/2020, 4:48 PM
larry
09/26/2020, 4:48 PM
> tables / spreadsheets / csvs / databases are all a format, a CS concept, and a mental representation.
They're all tables 😉 Spreadsheets are tables with reactive function execution. (Relational) databases are collections of related tables, with the relationships themselves maintained in other tables.
larry
09/26/2020, 4:49 PM
larry
09/26/2020, 4:59 PM
larry
09/26/2020, 5:04 PM
> In R for instance, data frames are lists of columns, but when I'm wrangling data, I don't actually think of them as such, they're just data frames.
@Alex Wein That's funny. I almost always think of them as lists of columns, unless I'm reading or writing them from/to disk. OTOH, I almost never extract a column from a table to use a column operation directly. It's always `t.col(whatever)`.
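A minimal Python sketch of the "list of columns" view, for illustration only. The `Table` class, its `col` and `row` methods, and the sample data are all hypothetical; they are not the API of R or of any library mentioned in this thread.

```python
class Table:
    """A tiny column-oriented table: column name -> list of values."""

    def __init__(self, columns):
        lengths = {len(values) for values in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        # Copy each column so later changes to the caller's lists don't leak in.
        self._columns = {name: list(values) for name, values in columns.items()}

    def col(self, name):
        """Return one whole column; it is stored as a unit."""
        return self._columns[name]

    def row(self, index):
        """Assemble one row by picking the same index out of every column."""
        return {name: values[index] for name, values in self._columns.items()}


t = Table({"apples": [1, 2, 3], "pears": [6, 7, 8]})
total_apples = sum(t.col("apples"))  # a column operation via t.col(...)
second_row = t.row(1)                # a row is built on demand
```

In this layout a column is the stored unit and a row is a derived view, which matches thinking of the table as a list of columns.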
Paul Butler
09/26/2020, 5:20 PM
Eric Gade
09/26/2020, 5:20 PM
larry
09/26/2020, 5:21 PM
larry
09/26/2020, 5:26 PM
Eric Gade
09/26/2020, 5:31 PM
Eric Gade
09/26/2020, 5:32 PM
Eric Gade
09/26/2020, 5:33 PM
Srini K
09/26/2020, 6:36 PM
Srini K
09/26/2020, 6:36 PM
larry
09/26/2020, 7:23 PM
Jack Rusher
09/27/2020, 7:13 AM
ogadaki
09/27/2020, 4:26 PM
Konrad Hinsen
09/28/2020, 6:25 AM
Jack Rusher
09/28/2020, 6:48 AM
Konrad Hinsen
09/28/2020, 8:47 AM
wtaysom
09/29/2020, 1:42 AM
Konrad Hinsen
09/29/2020, 5:56 AM
Drewverlee
09/29/2020, 9:25 AM
Jack Rusher
09/29/2020, 9:39 AM
wtaysom
09/29/2020, 9:51 AM
Sietse Brouwer
09/29/2020, 10:01 AM
@Srini K wrote:
> Why haven’t we seen a good live programming language for data science? It’s so ideal for it! Everything from sampling / streaming in data results to keep things live to the fact that data analysts / scientists want to move / iterate at the speed of thought, and most of data science is this type of curiosity driven stumbling around the data-dark
To a large extent, S and R aim to be that language, and always have. I've taken the following from a slide in [History and Ecology of R](https://calcul.math.cnrs.fr/attachments/spip/IMG/pdf/r-history-ecology.pdf) by Martyn Plummer, 2015:
> “We wanted users to be able to begin in an interactive environment [emph. Sietse], where they did not consciously think of themselves as programming. Then as their needs became clearer and their sophistication increased, they should be able to slide gradually into programming.”
– John Chambers, Stages in the Evolution of S
This philosophy was later articulated explicitly in Programming With Data (Chambers, 1998) as a kind of mission statement for S:
> "To turn ideas into software, quickly and faithfully"
Sietse Brouwer
09/29/2020, 10:17 AM
In R, `library(my_package)` loads that package's components (functions, objects, etc.) directly into the global/top-level namespace. Python, contrariwise, makes `import my_package` create a module object `my_package` that contains `my_package.glm` etc.
Python's tradeoff favours codebase construction: you know the provenance of every name in your namespace, and you won't have namespace collisions. There's even a taboo on writing `from my_package import *`.
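A small Python sketch of that tradeoff, using only the standard library's `math` and `cmath` modules. The star imports are run through `exec` into a scratch dictionary purely so the example does not pollute its own namespace; that detail is an illustration choice, not a recommended practice.

```python
import cmath
import math

# Qualified access: the prefix shows where each name comes from.
real_root = math.sqrt(4.0)       # real-valued square root
complex_root = cmath.sqrt(-4.0)  # complex-valued square root

# Star imports: both modules export `sqrt`, and the later import
# silently replaces the earlier one in the shared namespace.
scratch = {}
exec("from math import *", scratch)
exec("from cmath import *", scratch)

shadowed_sqrt = scratch["sqrt"]
# True when the bare name `sqrt` now refers to the cmath version.
print(shadowed_sqrt is cmath.sqrt)
```

With the prefixed form both functions stay reachable; after the two star imports only one `sqrt` survives under the bare name, and nothing at the call site says which.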
R's tradeoff, contrariwise, favours interactive development. You load a package because you want to use its functions; and you want those functions ready at hand, which means the global namespace and no prefixes.
Jack Rusher
09/29/2020, 10:28 AM
Sietse Brouwer
09/29/2020, 10:43 AM
Writing `data.frame(apples=1:5, pears=6:10) %>% mutate(fruit = apples + pears)` to compute `fruit`, with bare unquoted column names, is a big win for ergonomics. Alternatives like `df["fruit"] = df["apples"] + df["pears"]` get real old real fast, in my unfortunately-extensive experience with Pandas. And `mutate("fruit = apples + pears")` limits you to whatever operators are allowed in that stringly mini-DSL.
But allowing unquoted names has a tradeoff: it disfavours programming in the tidyverse. What if I'd like to write a function that does the same 'analysis', but is generic over column names?
add_fruit <- function(df, col1, col2, colresult) {
  # usage: add_fruit(mydf, "apples", "pears", "fruit")
  # naive attempt: col1, col2 and colresult are not substituted as column names
  df %>% mutate(colresult = col1 + col2)
}
would get me an error along the lines of 'no `col1` in `df`', and I'd need to do some contortions to make clear that I want to use the contents of `col1` as a name, rather than the name `col1` itself.
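For contrast, a rough Python sketch of the other side of this tradeoff: when column names are ordinary strings, a function that is generic over them needs no contortions. The table-as-dict representation and the `add_columns` helper are hypothetical, made up for this illustration, and are not part of Pandas or the tidyverse.

```python
def add_columns(table, col1, col2, colresult):
    """Return a new table whose `colresult` column is `col1 + col2`, element-wise.

    `table` maps column names to equal-length lists. The column names are
    plain strings, so they can be passed around like any other value.
    """
    for name in (col1, col2):
        if name not in table:
            raise KeyError(f"no column {name!r} in table")
    summed = [a + b for a, b in zip(table[col1], table[col2])]
    # Build a new dict instead of mutating the caller's table.
    return {**table, colresult: summed}


fruit_table = {"apples": [1, 2, 3, 4, 5], "pears": [6, 7, 8, 9, 10]}
with_fruit = add_columns(fruit_table, "apples", "pears", "fruit")
print(with_fruit["fruit"])
```

The price is the one described above: every interactive use has to spell out the quotes and the table lookups by hand.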
Other languages that make the same tradeoff:
• the various shell languages (sh, fish, PowerShell), where bare words are literal, variables are $something, and the meaning of a literal is determined by the invoked command's argument parser. Like R, these are languages focused on interactive use.
• Tcl, too: first word is a command, later words are command arguments, bare words are taken literally, $something denotes the contents of the 'something' variable. And Tcl, too, is focused on providing an interactive shell over lower-level libraries -- and even on making such shells easy to create.
Konrad Hinsen
09/29/2020, 1:23 PM