# thinking-together
j
Hey folks! Bit of a random question, but can anyone think of any language interpreters which parse text straight to bytecode, omitting the AST step completely? QuickJS is one of the most well-known (https://github.com/bellard/quickjs/tree/master), but I wondered if anyone knew of other interesting examples for other languages?
m
I think the original Pascal compiler did this, with one-pass parsing straight to p-code: https://en.wikipedia.org/wiki/P-code_machine
a
The bytecode version of the language defined in Bob Nystrom’s Crafting Interpreters does this.
b
[EDIT: oops, I didn't notice I'm necroposting on an old thread] [bad example] Tcl formally has no AST; it's all just strings. Informally, though, since 1996 there have been multiple internal representations, all satisfying the same stringy API but each optimized for particular uses. These include "list" (a string adopts this representation after operations like indexing) and "bytecode" (a string adopts this representation after it gets evaluated). (There is a command to peek under the hood; e.g. here I'm showing a string becoming bytecode after `eval`; it's part of a longer video on one-API-multiple-representations dualities across several languages.) So while the code for control structures like `if` or `proc` may never say so explicitly, in practice it ends up slicing strings into something quite akin to an AST, with parts converted to bytecode on demand. 🪄
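Tcl's cached-representation trick can be sketched like this (a toy in Python, not Tcl's actual C implementation; the class and method names here are made up for illustration):

```python
# Toy sketch of Tcl's "one string API, many cached internal
# representations" idea: the value is always a string, but it caches a
# richer parsed form after the first operation that needs one.
# (Hypothetical names; Tcl's real mechanism lives in Tcl_Obj in C.)

class DualValue:
    """A string-valued object that lazily caches richer representations."""

    def __init__(self, text):
        self.text = text          # the canonical string representation
        self.rep = None           # name of the cached internal rep, if any
        self.cached = None        # the cached rep itself

    def as_list(self):
        # After a list operation, the value "shimmers" to a list rep.
        if self.rep != "list":
            self.rep, self.cached = "list", self.text.split()
        return self.cached

    def as_code(self):
        # After evaluation, the value caches a compiled ("bytecode") rep.
        if self.rep != "code":
            self.rep, self.cached = "code", compile(self.text, "<dual>", "eval")
        return self.cached


v = DualValue("1 + 2")
print(v.rep)                 # None: still just a string
print(v.as_list())           # ['1', '+', '2']; rep is now "list"
print(eval(v.as_code()))     # 3; rep shimmered to "code"
print(v.rep)                 # "code"
```

The key property, as in Tcl, is that nothing about the outside API changes when the representation shimmers; the cached form is purely an optimization.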
[better example] Forth is a more interesting case to study: it really doesn't have an AST. It doesn't exactly have bytecode either, but builds up a sequence of "word" pointers to execute (or, in direct-threaded implementations, a sequence of JMP instructions); I'll let you decide whether that's closer to bytecode or tree-walking. But am I cheating, because Forth has no syntax either, just a flat sequence of words? Well, it has some syntax here and there; it's just implemented in diffuse fashion by various words. (Which means it has no grammar.) For example:
- `( ... )` is a comment, implemented by the `(` word eating up and discarding input until it meets a `)` character.
- `: NAME word1 word2... ;` is a word definition: `:` eats a single NAME from the input, allocates its entry-to-be-built in the dictionary, then toggles the main loop from "interpret" mode to "compile" mode, where the following words are appended to the dictionary entry instead of being executed, unless they're marked "immediate" for special handling.
- `;` is an immediate word that wraps up the dictionary entry and toggles the main loop back to "interpret" mode.
- `IF`, `THEN`, etc. are immediate words that take note of addresses and later fix up some pointers to build conditionals, loops, etc., in a manner similar to assembly language.

Assemblers could answer your question in a similar spirit (a mostly flat 1:1 mapping to output, with some fixups), except most assemblers are not interpreted.
I suspect neither is quite what you meant. IMHO they're interesting not for the behind-the-scenes implementation choice of skipping a parser-to-AST step, but for the syntax malleability they were able to expose to user space: in both Tcl and Forth it's possible to define custom control structures / DSLs in the language itself.
As for Pascal, conceptually it does have a grammar, and parts of the implementation can be identified as a "parser", but it's not a separate "pass" in the sense that it never allocates the whole AST as a data structure. You could probably write something similar with YACC actions that emit p-code instead of allocating AST nodes. The grammar was carefully designed to enable this, e.g. by requiring all variables to be declared before the code starts. I think Wirth experimented with single-pass approaches in his later languages too; https://people.inf.ethz.ch/wirth/Articles/Modula-Oberon-June.pdf sheds interesting light on the historic constraints:
... PDP-11 with a 64K-byte store. The single-pass strategy of our Pascal compilers could not be adopted; a multipass approach was unavoidable in view of the small memory. ...
The first Modula-2 compiler, written by K. van Le in 1977 consisted of 7 passes, each generating sequential output written onto the 2M-byte disk.
IIUC what he means is that a multi-pass design lets you reduce the RAM occupied by compiler code by loading one pass's code at a time, leaving more RAM for data (plus the data can be streamed from/to disk). Funnily, a similar reason also motivated the Unix "do one thing" + pipes design 😄
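Going back to the single-pass idea: "parser actions emit code directly" can be sketched like this (a hypothetical toy stack machine in Python, not actual p-code, but the same no-AST, one-pass shape):

```python
# One-pass recursive-descent parser for arithmetic that appends
# stack-machine bytecode as it parses, never building an AST.

def compile_expr(src):
    toks = src.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0
    code = []                       # the output bytecode, emitted in-line

    def peek():
        return toks[pos] if pos < len(toks) else None

    def expr():                     # expr := term (('+' | '-') term)*
        nonlocal pos
        term()
        while peek() in ("+", "-"):
            op = toks[pos]; pos += 1
            term()
            code.append(("ADD" if op == "+" else "SUB",))

    def term():                     # term := factor ('*' factor)*
        nonlocal pos
        factor()
        while peek() == "*":
            pos += 1
            factor()
            code.append(("MUL",))

    def factor():                   # factor := NUMBER | '(' expr ')'
        nonlocal pos
        if peek() == "(":
            pos += 1
            expr()
            pos += 1                # skip ')'
        else:
            code.append(("PUSH", int(toks[pos]))); pos += 1

    expr()
    return code

def run(code):
    stack = []
    for op, *args in code:
        if op == "PUSH": stack.append(args[0])
        elif op == "ADD": b, a = stack.pop(), stack.pop(); stack.append(a + b)
        elif op == "SUB": b, a = stack.pop(), stack.pop(); stack.append(a - b)
        elif op == "MUL": b, a = stack.pop(), stack.pop(); stack.append(a * b)
    return stack[0]

print(run(compile_expr("2 * (3 + 4)")))   # 14
```

As with Pascal's single-pass compilers, the trick is that the grammar's shape matches the evaluation order of a stack machine, so each parse action can emit its instruction immediately and forget it.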