does anyone work on making a programming language ...
# thinking-together
i
does anyone work on making a programming language? and if so, can you share a repo or some notes? I'm curious what features you are implementing, what programming language you are using to develop it, and how you do the codegen part. I'm currently struggling quite a bit with LLVM
i
great resources. thank you
m
what are you looking for? I wrote a weird one https://marianoguerra.org/posts/bootstrapping-oop-part-3-who-parses-the-parser/ (follow the links back to the first post)
I'm also writing about a less weird one https://wasmgroundup.com/
i
Thank you for the reading material. I'm writing my own programming language. For fun and profit - fun mostly. I went through all the earlier steps successfully (lexing, parsing, all the static analysis checks) and now I'm at the step of actually building the executable file. From multiple articles I've seen, I chose to generate LLVM IR (multiple targets and platforms was a plus) and then compile that to an executable. But it seems overly complicated and I'm not sure it's the right path to take. Or maybe I just don't get it yet. I would like to see another completed language that uses LLVM and see how things are organized, defined and called. There are a bunch of tutorials on the site, but they seem to cover just one or two toy instructions. I've also seen a lot of other languages, and Paul Tarvydas also has some articles suggesting transpilation to another high-level language (like C, or even better, something with GC like Go) and then compiling that. But that also brings in a lot of issues. So... I'm just fishing around here to see if something clicks. 🤓
m
this may be useful, but I would suggest that, if llvm is not a requirement for you, you target something easier or higher level https://lambdaclass.github.io/mlir-workshop/
you could target some existing bytecode, webassembly, transpile to another language or use something simpler than llvm like https://cranelift.dev/
or emit your own bytecode and write an interpreter
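A minimal sketch of the "emit your own bytecode and write an interpreter" route, in Python. Everything here is made up for illustration: the tuple-based expression tree, the opcode names (`PUSH`, `ADD`, `SUB`, `MUL`) and the function names are assumptions, not taken from any project linked in this thread.

```python
import operator

# Hypothetical opcode names for a tiny stack machine.
OPCODES = {"+": "ADD", "-": "SUB", "*": "MUL"}
APPLY = {"ADD": operator.add, "SUB": operator.sub, "MUL": operator.mul}


def compile_expr(node, code=None):
    """Flatten a nested tuple like ("+", 1, ("*", 2, 3)) into stack bytecode."""
    code = [] if code is None else code
    if isinstance(node, int):
        code.append(("PUSH", node))
    else:
        op, lhs, rhs = node
        compile_expr(lhs, code)   # operands first (post-order walk)
        compile_expr(rhs, code)
        code.append((OPCODES[op],))
    return code


def run(code):
    """Interpret the bytecode with an explicit value stack."""
    stack = []
    for instr in code:
        if instr[0] == "PUSH":
            stack.append(instr[1])
        else:
            rhs = stack.pop()
            lhs = stack.pop()
            stack.append(APPLY[instr[0]](lhs, rhs))
    return stack.pop()


if __name__ == "__main__":
    program = compile_expr(("+", 1, ("*", 2, 3)))
    print(program)
    print(run(program))
```

The front end (lexer, parser, static checks) stays the same; only the last step changes from "emit LLVM IR" to "emit a flat instruction list", which is why this is often suggested as a first back end.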
i
that is really helpful. thank you!
g
Yes, LLVM is complicated. It probably encodes everything we've learned about building compilers in a traditional way. At the code emission point, things get complicated because there are so many disparate possible targets. I suggest that it helps to first understand how to emit code - manually - for only one target, then the reasons for all of LLVM's complication will become more clear. [Aside: my favourite approach is to cut the emission problem into 2 halves - a dumb, general pass, then a rewriting pass that targets a pile of real targets. This is what GCC does. GCC uses "RTL". I think that "OCG" is even better. Hmm, is GCC's RTL more approachable than LLVM?]
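The "two halves" idea in the aside can be sketched roughly as follows, in Python. This is only an illustration of the split, not GCC's RTL or OCG: the generic op names (`push`, `add`, `mul`) and the rewrite tables are assumptions, and the x86-64 output is a naive stack-style sequence in Intel syntax that has not been run through an assembler.

```python
# Pass 1: a "dumb", target-agnostic walk that flattens an expression tree
# into generic stack operations (one op per line of text).
def to_generic(node, out=None):
    out = [] if out is None else out
    if isinstance(node, int):
        out.append(f"push {node}")
    else:
        op, lhs, rhs = node
        to_generic(lhs, out)
        to_generic(rhs, out)
        out.append({"+": "add", "*": "mul"}[op])
    return out


# Pass 2: per-target rewrite rules. Each generic op is rewritten into
# one or more lines of target text. Adding a target means adding a table.
REWRITE = {
    "wasm": {
        "push": lambda n: [f"i32.const {n}"],
        "add": lambda: ["i32.add"],
        "mul": lambda: ["i32.mul"],
    },
    "x86_64": {
        "push": lambda n: [f"push {n}"],
        "add": lambda: ["pop rbx", "pop rax", "add rax, rbx", "push rax"],
        "mul": lambda: ["pop rbx", "pop rax", "imul rax, rbx", "push rax"],
    },
}


def rewrite(generic, target):
    rules = REWRITE[target]
    out = []
    for line in generic:
        name, *args = line.split()
        out.extend(rules[name](*args))
    return out


if __name__ == "__main__":
    generic = to_generic(("+", 1, ("*", 2, 3)))
    for target in ("wasm", "x86_64"):
        print(f"; {target}")
        print("\n".join(rewrite(generic, target)))
```

The first pass knows nothing about any machine, so it stays small; all the target-specific mess is confined to the rewrite tables.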
Are you aware of Bob Nystrom's "Crafting Interpreters"? There is a Thursday night (EST) reading group - CS Cabal on Slack (cscabal.slack.com). We're incrementally reading through the book and asking questions as we go. Apparently, the book contains the hoary details for interpreting an AST and for converting the AST into bytecodes (we haven't got very far. It's definitely not too late to join in).
If you insist on writing your own code generator, maybe ask ChatGPT or Claude to write some LLVM IR for you. Spend a few hours chatting with it to see if it goes anywhere. Start by writing a simple program in your programming language. Get Claude to generate LLVM IR for it. I think that LLVM is old enough to be included in ChatGPT's/Claude's training data. Typically, I don't trust the output of LLMs, but they are very helpful, for me, in generating example code and helping me down the learning curve.
i
I was mostly using https://godbolt.org/ with mingw clang and -S -emit-llvm to see what IR it spits out from some C I write there. but it seems way different from any documentation I find about llvm
j
I'm building my own language. It's a dynamically typed functional language that compiles straight to machine code. https://github.com/jimmyhmiller/beagle I will just say personally that I found all language stuff to be much easier once I just went and learned how I can go straight to machine code. It is way less confusing than I thought it would be. And made so many things click in place for me. LLVM is an impressive feat of engineering, but it is made for industrial strength things, not for helping people learn the first time. There are definitely a ton of concepts there that assume some background that might be hard. But of course, don't let that discourage you from going in that direction! It's all about your goals.
g
At this very moment, I'm thinking that targeting WASM is a good idea. Mariano's given you some starter links...
"Syntax" is not just for the front end human-facing parts of a compiler. You'll note that even LLVM has a "syntax" for its IR, but, it ain't very human-friendly. I consider compilers to be pipelines of little DSLs, each with a specialized, machine-readable syntax. At this point in your compiler project, can you output some sort of text-file with its own specialized, machine-readable syntax? Can you use OhmJS to bolt this specialized "syntax" to LLVM-IR?
i
i never used ohmjs. is ohmjs capable of doing llvm-ir ? that would mean that my source code would go to some custom ir, then use nodejs with ohm to convert that to ir, then something else to convert the llvm-ir to machine code
i would ideally generate machine code myself but it will only work for a targeted platform. i was thinking that if i manage to generate llvm-ir, that would just compile to whatever platform it is capable of.
i just need to see real world examples of llvm: how structures are defined, control statements, all the usual things, to understand it better. and then see how i can mold it to what i want/need. i wouldn't want to run it in an interpreter or a vm. i have big goals 🙂
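As a small illustration of what emitting textual LLVM IR for a control statement can look like, here is a sketch in Python that turns a tiny expression tree (arithmetic plus an if/else) into a function using `icmp`, `br` and `phi`. The tree format, the `Emitter` class and the label/temporary naming are all hypothetical, and the output has not been checked against any particular LLVM release, so treat it as a starting point to compare against what clang prints with -S -emit-llvm.

```python
import itertools

ARITH = {"+": "add", "-": "sub", "*": "mul"}
CMP = {">": "sgt", "<": "slt", "==": "eq"}


class Emitter:
    """Hypothetical emitter: walks a tuple tree and appends IR text lines."""

    def __init__(self):
        self.lines = []
        self.counter = itertools.count()
        self.block = "entry"  # name of the basic block currently being filled

    def fresh(self, prefix):
        return f"{prefix}{next(self.counter)}"

    def label(self, name):
        self.lines.append(f"{name}:")
        self.block = name

    def emit(self, text):
        self.lines.append("  " + text)

    def expr(self, node):
        if isinstance(node, int):
            return str(node)
        if isinstance(node, str):          # a parameter name
            return "%" + node
        op = node[0]
        if op in ARITH:
            lhs, rhs = self.expr(node[1]), self.expr(node[2])
            tmp = "%" + self.fresh("t")
            self.emit(f"{tmp} = {ARITH[op]} i32 {lhs}, {rhs}")
            return tmp
        if op == "if":                     # ("if", (cmp, a, b), then, else)
            _, (cmp, a, b), then_e, else_e = node
            lhs, rhs = self.expr(a), self.expr(b)
            cond = "%" + self.fresh("c")
            self.emit(f"{cond} = icmp {CMP[cmp]} i32 {lhs}, {rhs}")
            n = self.fresh("")
            then_l, else_l, end_l = f"then{n}", f"else{n}", f"end{n}"
            self.emit(f"br i1 {cond}, label %{then_l}, label %{else_l}")
            self.label(then_l)
            then_v = self.expr(then_e)
            then_from = self.block         # block that actually jumps to end
            self.emit(f"br label %{end_l}")
            self.label(else_l)
            else_v = self.expr(else_e)
            else_from = self.block
            self.emit(f"br label %{end_l}")
            self.label(end_l)
            res = "%" + self.fresh("t")
            self.emit(f"{res} = phi i32 [ {then_v}, %{then_from} ], "
                      f"[ {else_v}, %{else_from} ]")
            return res
        raise ValueError(f"unknown node: {node!r}")


def emit_function(name, params, body):
    e = Emitter()
    args = ", ".join(f"i32 %{p}" for p in params)
    e.lines.append(f"define i32 @{name}({args}) {{")
    e.label("entry")
    result = e.expr(body)
    e.emit(f"ret i32 {result}")
    e.lines.append("}")
    return "\n".join(e.lines)


if __name__ == "__main__":
    tree = ("+", ("if", (">", "a", "b"), "a", "b"), 1)
    print(emit_function("max_plus_one", ["a", "b"], tree))
```

Each branch ends by jumping to a shared end block, and the `phi` there picks the value according to which block control came from; tracking the current block name is what keeps the `phi` correct when ifs are nested.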
j
Not sure if you read haskell, but here's this old idris llvm backend https://github.com/idris-hackers/idris-llvm/blob/master/src/IRTS/CodegenLLVM.hs Other than that, I'm not sure of small examples. But zig source might be worth a read for a bigger real world project. There's also roc https://github.com/roc-lang/roc/tree/main/crates/compiler/gen_llvm/src/llvm
g
OhmJS is capable of doing llvm-ir, but, doesn't come with it. You would have to write an llvm-ir outputter yourself. It sounds like you're doing something like that anyway, but keeping it all in your head. When I feel confused about something, I draw a diagram or just write about it. OhmJS doesn't build a compiler back end for you. You still have to do the work. I believe that chopping up the work - divide and conquer - into smaller pieces makes it easier to do the work. Basing the pieces on little-DSLs makes it easier to write down what you're thinking. The hoary part of building a back-end emitter is that you want to target a bunch of very disparate target architectures - you either have to build a custom solution for each target, or, you have to find a way to generalize and cull the common stuff out of the task. LLVM shows where this kind of strive-for-generalization is ultimately gonna go.
j
And I'll stop pitching my alternative. But I spent years being confused because I kept trying to use systems like ohmjs without understanding them. Here is a (very messy) project I did of making an x86-64 assembler, and then building up to a simple language. Helped me way more to not have any tools doing things for me. https://github.com/jimmyhmiller/PlayGround/blob/4069532cc2366706a9f9ff88a2c41f448f8c908f/rust/assembler/src/main.rs
I'll throw out the offer though (to anyone) if you are doing a programming language project and want to pair, let me know 🙂
g
I agree with Jimmy's comment. Understanding how something works is harder - and everything looks magical - if you start by looking at decades of incremental evolution of the toolchain. FWIW, here's a 10-minute intro to one of the ultra-simple compilers I learned from: Ron Cain's SmallC. My references to OhmJS are moot unless you already believe that a compiler is just a pipeline with little, custom syntactic APIs in between the passes.
I did something completely unheard of: I followed my own advice. I wrote an extremely simple function in C (a language that I already know) and asked an LLM to compile it to LLVM. Then to explain the result. I include a link to the voluminous explanation. My conclusion: unless you've already built several compilers and enjoy dealing with painful niggly issues like alignment, don't try to understand LLVM just yet. It's been under development for 24 years and, today, addresses a lot of niggly portability issues that can't be grokked just by staring at LLVM-IR (nor, probably the LLVM documentation). Punt. Use some other existing language as your assembler and emit that instead. You can get all of the production-level efficiency you need from Odin or Zig or C or ...
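A minimal sketch of the "use some other existing language as your assembler" route, in Python, emitting C source text from a tiny expression tree. The tuple tree format and the `c_expr` / `c_function` names are invented for illustration, and everything is treated as `int` to keep it short.

```python
def c_expr(node):
    """Render a nested tuple tree as a fully parenthesized C expression."""
    if isinstance(node, int):
        return str(node)
    if isinstance(node, str):              # a parameter name
        return node
    op = node[0]
    if op == "if":                         # ("if", cond, then, else)
        _, cond, then_e, else_e = node
        return f"({c_expr(cond)} ? {c_expr(then_e)} : {c_expr(else_e)})"
    # Binary operators map one-to-one onto C's own operators.
    return f"({c_expr(node[1])} {op} {c_expr(node[2])})"


def c_function(name, params, body):
    args = ", ".join(f"int {p}" for p in params)
    return f"int {name}({args}) {{\n    return {c_expr(body)};\n}}\n"


if __name__ == "__main__":
    tree = ("+", ("if", (">", "a", "b"), "a", "b"), 1)
    print(c_function("max_plus_one", ["a", "b"], tree))
```

The generated text can then be handed to any C compiler, which takes over alignment, register allocation and the per-platform details.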