# thinking-together
j
I've put together some thoughts about syntax; they're pretty roughly defined at the moment, but I'd love to get your feedback! https://gist.github.com/jaredly/593d66a955b09572f3810b43b75a22a1
m
> The structured editor that I'm building
for which language(s)?
j
The idea is to be multi-modal, but I've started it with a clojure-esque language that I'm creating. Ultimately it's an editor for developing new languages, and I want to have various syntax options
m
Can you expand on what is *structured* for you? Syntax as in "commas, parens, semicolons" (collections and atoms) is just a part of the story. The other part is the semantic meaning of that syntax. For example, the difference in meaning of tokens in `(+ 1 2 3)` vs `(if 1 2 3)`, or the meaning of vector (`[]`) elements in `(defn foo [a b] ...)` vs `(let [a b] ...)`. This is why both "lisp has almost no syntax" and "lisp code is AST" are BS.
> Ultimately it's an editor for developing new languages
"editor for new clojure/lisp macros" would be a nice test/milestone/challenge for it. I tried to approach the same/similar problem recently as "DSL for custom macros support for a Clojure IDE", because kondo configs are a nightmare: https://clojurians.slack.com/archives/C06AH8PGS/p1713614069031559 (https://clojurians-log.clojureverse.org/instaparse/2024-04-20)
j
Yeah, so my structured editor works at a level between raw text and the AST. I've tried doing structured editors at the AST level, but it ended up being misaligned with the way I wanted to be inputting & manipulating code. So I think that the level that treats `(+ 1 2 3)` and `(if 1 2 3)` the same is the right spot for editor manipulation. And then a language's parser converts this "concrete syntax tree" into an AST.
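A minimal TypeScript sketch of a CST at that level (type and field names are illustrative, not the project's actual ones): collections and atoms only, so `(+ 1 2 3)` and `(if 1 2 3)` come out shaped identically.

```typescript
// A CST of collections and atoms, with no semantic knowledge:
// (+ 1 2 3) and (if 1 2 3) are both just a list of four atoms.
type CSTNode =
  | { type: "id"; text: string } // atoms: +, if, 1, foo
  | { type: "list"; kind: "(" | "[" | "{"; children: CSTNode[] };

const plus: CSTNode = {
  type: "list",
  kind: "(",
  children: [
    { type: "id", text: "+" },
    { type: "id", text: "1" },
    { type: "id", text: "2" },
    { type: "id", text: "3" },
  ],
};
```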
Yeah, I had a great chat with Peter Vilter a couple of years ago about his datalog stuff, it's very cool!
m
"paredit for js"? slurp, barf, wrap in parens/brackets/curlies, swap tokens (left-right, top-bottom)? Feels like not enough to make new editor. What else you have in mind? (if you don't mind ofc)
j
hah so, there's a variety of things going on in the project:
• structured editor is a part of it (I guess I haven't really gotten around to writing out my "why structured editors" thoughts, but one game-changer is persistent addressability in the midst of changes. For an editor to be able to have a durable location for e.g. "the `name` of the function `flatMap`" that's not a line/col pair that will break at the slightest touch unlocks a lot of nice things)
• jupyter/observable/etc.-style super-REPL/literate programming environment for pure functional languages
• unison-style "terms are referenced by the hash of their contents, stored and synced in a database"
• a Development Environment for programming languages themselves, making it easy to iterate and play with various aspects of a programming language (compilation targets, execution semantics, type inference algorithms) in relative isolation, as well as enabling the bootstrapping of self-hosted languages
It used to be "I want to make a programming language that has All The Best Features" and while I was at it I figured I'd make a structured editor for it at the same time, because I've tried making a structured editor for existing languages and concluded that it would work much better if the language (& compiler) were designed with structured editing in mind.... and then I got a little distracted by wanting to make "a minimally-featured language that is capable of self-hosting its own type inference while being nice to use", and so it has morphed into being an Editor Environment that can be used to make a variety of programming languages 🙃
Thus far the editor has only allowed clojure-style syntax, but the past few days I've been wondering what it would take to open it up to c-style languages, and if I'm going to do that might as well come up with a General Unified Theory of Syntax 😄
m
It seems to me #1 is at odds with `level that treats (+ 1 2 3) and (if 1 2 3) the same is the right spot for editor manipulation`, because all you get is `some-hash[0][1][0][6]` (a nested array address) or something, without knowledge of what `+` or `defn` means.
re unison: YES. designing a new lang without even giving "content-addressable" a try... it solves/simplifies/amplifies so much later on in the toolchain: deps, version control, diffs/reviews, (structural) editing. (in my like 4th spare time I'm trying to retrofit content-addressability onto at least a subset of clojure, which too started elsewhere: from custom macros, to kondo-config for it, to "screw it - I'm writing myself a clojure IDE with blackjack", to "might as well bake in addressability and distribution for source control, because git is both overkill and underwhelming (like any text-files-diffing SCM)")
relevant too: https://www.youtube.com/watch?v=GB_oTjVVgDc
j
Ah so `(+ a b c)` is actually a map of node-id to node: `0=list(1 2 3 4), 1=id(+, ref=hash of the + function), 2=id(a, ref=hash of the a term)`, etc. So the `+` in `(defn + [a b] ...)` is addressable as `some-toplevel-id : the-loc-of-that-id-node`, which ends up being nicely durable.
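A sketch of that node-id map in TypeScript, assuming hypothetical field names (`ref`, `nodes`) that mirror the notation above:

```typescript
// Each toplevel is a flat map of node-id -> node; children are ids, not
// nested objects, so any node has a durable (toplevel-id, node-id) address.
type NodeId = number;

type Node =
  | { type: "id"; text: string; ref: string | null } // ref = content hash, or null if unlinked
  | { type: "list"; children: NodeId[] };

type Toplevel = { id: string; root: NodeId; nodes: Record<NodeId, Node> };

// `(+ a b c)` as a node map:
const top: Toplevel = {
  id: "x91q",
  root: 0,
  nodes: {
    0: { type: "list", children: [1, 2, 3, 4] },
    1: { type: "id", text: "+", ref: "hash-of-plus-fn" },
    2: { type: "id", text: "a", ref: "hash-of-a-term" },
    3: { type: "id", text: "b", ref: "hash-of-b-term" },
    4: { type: "id", text: "c", ref: "hash-of-c-term" },
  },
};
```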
Yeah, Dion is super cool! I wish they'd produced more about it 😭
m
so essentially what I wrote? `hash[1st][0th][7th]`, or am I not seeing something? also, how is "hash of a func" different from "hash of a term"? so you have some "rule" that "1st item in a () list = function call"? that's the semantic knowledge I mentioned
j
nope nope
(sorry talking afk, 1 minute)
m
at the very least you need to differentiate "new name `N`" from "reference `R`", as in my example, `defn` vs `let`:
```
(defn foo [x y] ...)
 R    N    N N

(let [x y] ...)
 R    N R
```
and that is semantic knowledge, not just "collections and atoms"
j
yeah so in the editor identifiers are by default "unlinked", and there are editor affordances for "linking" an id to a definition
and the parser provides hints back to the editor about when to provide those affordances
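One hypothetical shape those hints could take (the thread doesn't specify the actual protocol):

```typescript
type NodeId = number;

// Hypothetical parser -> editor feedback: for each id node, the parser
// reports whether it introduces a new name (N) or should offer the
// link-to-a-definition affordance (R).
type Hint =
  | { kind: "binder" } // e.g. foo, x, y in (defn foo [x y] ...)
  | { kind: "reference"; scope: "local" | "global" }; // ids that may link

type ParseHints = Map<NodeId, Hint>;
```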
m
+ constants/literals `C`
+ scope (which `R` are known, and which are an error):
```
(defn foo [x y] ...)
 R    N    N N
(quote (defn foo [x y] ...)
 R      C    C    C C
```
j
C is the same as N, no need to distinguish
Importantly: linking an ID (turning an N into an R) is done by the user, not by some out-of-band algorithm
So an important difference from unison: I'm not trying to Normalize All The Things
m
> linking is done by the user
are you describing "when the user writes the grammar for a new lang"? or "when the user programs in the new lang"?
j
Writes programs using the new lang
it's autocomplete that actually means something
So more realistically, the toplevel `(defn a [b] c)` probably looks like `id=x45r, root: 37, nodes: 37=list(11 3 7 1), 11=id(defn, ref=builtin), 3=id(a, ref=null), 7=array(20), ...`. So "the name of that defn" is `(x45r, 3)`.
m
then you need "scoping rules", or rather "autocomplete needs to know scoping rules". that's again part of the semantics of a particular list of atoms (I might have tunnel vision, because I spend lots of time in clojure and see it from the clojure POV)
j
hahaha
Yeah, that's the part where the parser gives autocomplete hints back to the editor
m
"the name of that defn" -
defn
or
(defn ...)
?
j
sorry, `a` is what I meant to be referencing
the ID that defines the 'name' of the function that is produced by that toplevel
This gets more interesting when a toplevel can have multiple exports, for example with `(deftype (option a) (some a) (none))`. References to the type constructor `some` have a durable reference to the id that defines its name, so renames are trivial.
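A sketch of why that makes renames trivial, reusing the hypothetical node-map types from above:

```typescript
type NodeId = number;
type Node =
  | { type: "id"; text: string; ref: string | null }
  | { type: "list"; children: NodeId[] };
type Toplevel = { id: string; root: NodeId; nodes: Record<NodeId, Node> };

// With durable ids, a rename is one field update on the defining id node;
// every reference addresses (toplevel-id, node-id), not text, so nothing
// else has to be rewritten or re-resolved.
function rename(top: Toplevel, nameNode: NodeId, newText: string): void {
  const node = top.nodes[nameNode];
  if (node.type !== "id") throw new Error("can only rename id nodes");
  node.text = newText;
}
```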
m
how scope "spreads" is a semantic too. in clojure (again, sorry :D) - there are (at least) parallel scope, forward sequential scope, backward sequential scope: here, numbers is sequence of scope propagation (higher number gets its scope from prev number):
Copy code
forward + backward example:
0
 1   2
        3
(let [a x b a] [a b])
      4     
            5
          6    7
                8 8 ;; a and b have parallel scope at this point
 
parallel example:
0
 1       2
          4 3 4 3  5
                    6 6 
(binding [a 1 b 2] [a b])  
;;The new bindings are made in parallel (unlike let);
also notice, that in
let
,
b
ejects/exports its cope from vector to body
[a b]
, but body does not export scope outside let (propagation stops). So spread direction is based on the meaning of first symbol, and the fact that 1st symbol meaning is important - is a higher level semantic too
One instance of scope *export* is creating a global definition: in `(defn foo [a b] body)`, `defn` exports `foo` to the global scope, but not `a`, `b`, or `body`. which is solely the semantics of `defn`
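A hypothetical way a parser could declare these propagation rules to the editor/autocomplete (illustrative only; none of these names are from the thread):

```typescript
// Hypothetical descriptors for how bindings propagate, per special form.
type ScopeRule =
  | { form: "let"; bindings: "sequential" } // each binding sees the previous ones
  | { form: "binding"; bindings: "parallel" } // bindings see only the outer scope
  | { form: "defn"; exports: "global" }; // exports the name, not args/body

// Autocomplete inside (let [a x b a] ...) would consult the "sequential"
// rule to know that b's init expression may reference a.
const rules: ScopeRule[] = [
  { form: "let", bindings: "sequential" },
  { form: "binding", bindings: "parallel" },
  { form: "defn", exports: "global" },
];
```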
j
So locals aren't linked
Also the parser tells the editor what is exported
m
so the parser knows what's up (which list el is local and which is not), because you baked some semantics into it (for c-like langs: defined a set of keywords and what they mean: `if` `def` `for` `while`)
but if you allow user-defined macros in your lang, you need to provide a way for the user to let the parser know what's up
j
So macros work on the cst, and are expanded before the parser operates
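A sketch of that ordering, with assumed names (`expand`, `parse`); only the macros-before-parser pipeline is from the thread:

```typescript
// Macros transform CST -> CST; only afterwards does the language-specific
// parser turn the fully-expanded CST into an AST.
type CST = { type: "id"; text: string } | { type: "list"; children: CST[] };
type Macro = (form: CST) => CST;

function compileToplevel(
  cst: CST,
  macros: Map<string, Macro>,
  parse: (cst: CST) => unknown, // language-specific CST -> AST
) {
  return parse(expand(cst, macros));
}

function expand(cst: CST, macros: Map<string, Macro>): CST {
  if (cst.type === "id") return cst;
  const head = cst.children[0];
  const macro = head?.type === "id" ? macros.get(head.text) : undefined;
  if (macro) return expand(macro(cst), macros); // re-expand the result
  return { ...cst, children: cst.children.map((c) => expand(c, macros)) };
}
```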
m
basically you describe (bake in) N special forms and their semantics during the lang-design phase, and then rely on macroexpand for autocomplete?
j
Oh yeah so macros also can report autocomplete hints
m
> C is the same as N, no need to distinguish
a literal symbol `defn` is not the same as a `defn` which is meant to be looked up and resolved as e.g. `clojure.core/defn`. so either you need to prompt the user on every word ("is it a ref or is it static/literal?"), or forbid literals, or rely on the semantics of something in the text before the word, again, in clojure: `'` or `quote`, which is a semantic, not just "colls and atoms"
this is all a long-winded way of saying that "syntax families" seem incomplete w/o mentioning scope propagation rules and semantics
j
So when typing `defn`, if it autocompletes to link then it's an R, otherwise it's a C
Macros operate on Rs mostly tbh
But yeah only global scope references are linked
m
mostly≠only )
j
Also macros don't have access to any environment to resolve things
The only time they'd use a C is for a numeric literal or as the export-name for a new definition
Either a macro consumes core/defn as an R, or has it referenced in its definition, or it doesn't have access to it
m
how does a macro know it's an R when there is no env access to look it up?
j
an attribute on the node
Whether `node.ref` is null
m
so macro receives already resolved things?
j
Yup
Well it can also access "terms associated with resolved things it has received"
Where term associations are explicitly defined as a first class thing
This is all critical for term hashes to be useful. Can't depend on "the whole environment"
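A sketch of content-hashing under that constraint (illustrative; uses Node's built-in `crypto`): a term's hash covers its structure plus the refs it mentions, never a surrounding environment.

```typescript
import { createHash } from "node:crypto";

// A term's hash is derived from its own nodes and the hashes it explicitly
// references, so a term means the same thing wherever it is stored/synced,
// and unchanged dependencies keep their hashes.
type Term =
  | { type: "id"; text: string; ref: string | null }
  | { type: "list"; children: Term[] };

function hashTerm(term: Term): string {
  const h = createHash("sha256");
  const walk = (t: Term): void => {
    if (t.type === "id") h.update(`id:${t.text}:${t.ref ?? ""};`);
    else {
      h.update("list(");
      t.children.forEach(walk);
      h.update(")");
    }
  };
  walk(term);
  return h.digest("hex");
}
```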
m
ok, `defn` is a macro, it receives `foo` and `a b`. all 3 (`foo a b`) exist in global scope (= can be resolved, and are resolved before being passed into the macro). now `defn` throws away that resolution and assigns new roles of `global` to `foo` and `local` to `a` and `b`?
j
So def isn't a macro
m
or `foo a b` are not in the globals, resolution resolves to 'unknown', and passes that to `defn`?
j
Gotta be built in
Macros bottom out to def/deftype/etc
m
by *built in* you mean *semantics (scope propagation rules, locals/globals export) baked into the "parser"*?
ok, but what about `defn` being a macro in clojure (it bottoms out to `def`), and all those `prismatic.schema/defschema` etc., basically any macro exporting a new global?
j
I mean it (def) can't be a macro
defn produces a def, which the parser determines produces an export
This allows different parsers (e.g. languages) to have different forms for defining things
m
> defn produces a def, which the parser determines produces an export
yes, but `defn` knows what is the new global and what are the args (new locals). and you resolve them before `defn` gets them
j
"new global" is just an id with ref=null
m
ok
j
'fn' also can't be a macro
m
does the editor show it as an error? how does the editor know it's not an error, and `foo` is ok to be unresolved at this particular place: the second token inside the `(defn foo ...)` list?
j
Parser knows what ids need to be resolved
m
macroexpand + a "source map" from the exported new global back to `foo` in the `(defn foo)`?
j
No need to source map :) durable ids
m
I mean conceptually
"expand, and see that id=7 goes from unresolved to export-new-global, and its all good"?
j
I mean it's a parser error to use an id with ref=null as an expression if it's not resolvable with local scope
I can imagine a parser using an unresolved id in other ways that it would determine are valid
In fact it would be a parser error to use a resolved id as the "name of what I'm exporting"
So it's not a "how do I ensure unresolved ids eventually have a home" problem, it's more generally "do all these nodes make sense"
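A minimal sketch of those two checks as a "do these nodes make sense" pass (hypothetical roles and error messages):

```typescript
type NodeRole = "expression" | "binder"; // position the parser found the id in

// Returns an error message, or null if the node makes sense where it is.
function checkId(
  role: NodeRole,
  ref: string | null,
  inLocalScope: boolean,
): string | null {
  if (role === "binder" && ref !== null)
    return "a resolved id can't name a new export";
  if (role === "expression" && ref === null && !inLocalScope)
    return "unresolved id used as an expression";
  return null; // ok
}
```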
m
> In fact it would be a parser error to use a resolved id as the "name of what I'm exporting"
can't redefine things this way:
```
(def foo 1)
(def foo 2) ;; exports already resolved(able) global foo
```
j
Yeah so no name conflicts allowed in the same module
m
```
(def elsewhere/foo 1) ;)
```
j
And it wouldn't autocomplete to resolve that id
m
because it knows from hardcoded knowledge "no qualified symbols here"?
j
Parser decides what autocompletes
m
I understand, I'm just trying to zero in on "based on what"
j
When the parser is parsing, and sees the sibling to a 'def' in this case
At the top level
Btw ids with refs are underlined
Visually distinct
m
circling back to "IDE for defining new languages": that initial grammar/lang-description needs to provide that info for parser/autocomplete.
j
Yeah so the base lang is just raw js lol
In a big ol string literal
m
😄
j
And then you use that to make other languages
m
does that base-lang restrict what semantics are un/available to new langs? or is it too academic or hard to tell atm?
j
So the base lang doesn't produce any restrictions
The nature of the editor and such does produce limits though. For example, macros don't have access to the environment. The parser and compiler don't even have global access. Dependency graphs are calculated by the editor.
Also impurity is a no go for the repl to make sense
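A sketch of editor-side dependency calculation (hypothetical; it just walks the explicit `ref` fields from the earlier node-map sketch, which is what makes this possible without giving the parser or compiler global access):

```typescript
type NodeId = number;
type Node =
  | { type: "id"; text: string; ref: string | null }
  | { type: "list"; children: NodeId[] };
type Toplevel = { id: string; root: NodeId; nodes: Record<NodeId, Node> };

// Because every reference is an explicit `ref` hash on an id node, the
// editor can compute a toplevel's dependencies by scanning its nodes,
// without consulting any global environment.
function dependencies(top: Toplevel): Set<string> {
  const deps = new Set<string>();
  for (const node of Object.values(top.nodes)) {
    if (node.type === "id" && node.ref !== null) deps.add(node.ref);
  }
  return deps;
}
```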
a
Still reading this thread. But in response to the original post. I think it would probably be useful to talk about syntax in terms of formal grammars where you have terminal/non-terminal symbols. I think terminal symbols are your atoms and non-terminal symbols are your collections. Expressing these grammars in terms of a meta-syntax may also be useful for grouping languages together with similar structural editing affordances.
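A sketch of that framing (an illustrative meta-syntax, not a real grammar library): atoms as terminal symbols, collections as non-terminals, so languages with similar productions can share structural editing affordances.

```typescript
// Atoms are terminals, collections are non-terminals; a "syntax family"
// groups languages whose grammars share these shapes.
type GrammarSymbol =
  | { kind: "terminal"; name: string } // atoms: identifier, number, "("
  | { kind: "nonterminal"; name: string }; // collections: list, vector, expr

type Production = { head: string; body: GrammarSymbol[] };

// e.g. a lisp-family production: list -> "(" expr* ")"
const lispList: Production = {
  head: "list",
  body: [
    { kind: "terminal", name: "(" },
    { kind: "nonterminal", name: "expr*" },
    { kind: "terminal", name: ")" },
  ],
};
```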
j
That's a good tip, thanks!
m
btw, which structure-edit operations do you have in mind?
j
so I'm starting out building a structured editor that is "usable as a normal editor", such that most of the keystrokes would be the same as in a text editor for general "code input" operations. And then I'll work on layering structure editing on top (probably with a heavy emphasis on end-user-coded transforms)
a
That sounds similar to the work we're doing on Hazel using tylr https://hazel.org/papers/tiny-tylr-tyde2022.pdf
j
yup 🙂 I'm a huge fan
haven't read that paper though, thanks for the link
m
re syntax flavors: https://futureofcoding.slack.com/archives/C03RR0W5DGC/p1731679753126049 reminded me of:
• commenting out or ignoring blocks of code/text: `/* ... */` `//` `;` `#_` `(comment ...)`
• markers/indentation as denotation of nestedness (python): `>` `>>` `\t`