Currently, the lapyst compiler works like the following (a small code sketch follows the list):

  1. Lex and parse the source code into an AST
  2. Enhance the AST with scope information and inferred types for (nearly) every node
  3. Walk the AST to generate LLVM IR
  4. Let LLVM do its magic, with some LLD sprinkled in so we get a fully linked result
  5. Profit!
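To make that pipeline concrete, here is a minimal sketch of its shape. All names are illustrative stand-ins, not the actual lapyst API:

```rust
// A minimal sketch of the current pipeline shape. All names here are
// illustrative stand-ins, not the actual lapyst compiler API.
struct Ast;        // parsed syntax tree
struct LlvmModule; // generated LLVM IR

fn parse(_src: &str) -> Ast { Ast }                      // step 1
fn resolve_scopes(_ast: &mut Ast) {}                     // step 2: mutates the
fn infer_types(_ast: &mut Ast) {}                        //         AST in place
fn emit_llvm_ir(_ast: &Ast) -> LlvmModule { LlvmModule } // step 3
fn optimize_and_link(_module: LlvmModule) {}             // steps 4 + 5

fn compile(source: &str) {
    let mut ast = parse(source);
    resolve_scopes(&mut ast); // after this, the AST carries non-syntax info
    infer_types(&mut ast);
    let module = emit_llvm_ir(&ast);
    optimize_and_link(module);
}
```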

The problem with this lies in the second and third steps: not only do we store some information inside the AST that isn't related to the source code, but we also face problems when we want to transform the AST, be it inefficiencies in AST creation or simply losing the ability to reason about the AST.

I finally took this problem head-on in an attempt to solve it once and for all, and that's what I want to write about today.


So, first things first: what even is the full problem? Like I mentioned above, the lapyst compiler uses the AST to store information from later steps. This of course has multiple problems in itself:

  • It violates the separation of concerns: By stuffing more and more into the AST, it does more than it should. An AST is an abstract representation of syntax (hence the name). By adding things that aren't syntax (such as scope information or inferred types), it suddenly becomes involved in far more aspects of your software than is visible at first glance.

  • It breaks modularity: By storing more information than needed, the AST forces any consumer to at least load the code for that added information. For example: if I want to write another tool that consumes the AST as input (say, a documentation generator or even a simple formatter), this program would not only need the whole parsing code, it would also need to load all the code responsible for storing the scope information and so on. While technically this **could** be mitigated by conditional compilation through something like #ifdef X, if you want to provide your compiler as reusable libraries, you still need to compile the library with all features so it can be linked in all use cases. A sketch of such a bloated node follows below.
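Here's what such a bloated node can end up looking like; a minimal sketch with hypothetical field names, not lapyst's real AST:

```rust
// A sketch of the problem: a hypothetical AST node that carries far
// more than syntax. The field names are illustrative, not lapyst's
// real types.
struct Span { line: u32, column: u32 }

enum Type { Int, Str } // placeholder for the real type representation

enum Expr {
    Call(Box<CallExpr>),
    // ...
}

struct CallExpr {
    // Actual syntax: what the parser saw.
    callee: String,
    args: Vec<Expr>,
    span: Span,

    // Non-syntax baggage from later passes. Every consumer of the AST
    // (formatter, doc generator, ...) now has to drag this code along.
    scope_id: Option<u32>,       // filled in by scope resolution
    inferred_type: Option<Type>, // filled in by type inference
}
```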

While these things are already somewhat of a problem, you can "easily" get away with them by simply deciding that your code isn't meant to be modular or reusable. It surely gets you further, but I personally don't like falling back on such "hacks" just to keep the software moving. I really like to write everything as reusable as possible; it's easier for me to reuse code I've already written than to roll, say, the 100th implementation of some process-spawning abstraction.

But there's a second problem, far bigger than the one we already discussed: transformations. Many languages (lapyst included) will want to add some syntactic sugar, meta-programming or simple optimizations at some point or another. These all require transforming the code you read from disk into something else; some examples include:

  • Syntactic sugar that lets you use iterators by compiling them down to for loops
  • Rust's trailing ? operator for simplifying value-based error handling (sketched after this list)
  • Properties on classes / structs
  • Dlang's mixin("code") feature, which is used to "inject" source code created at compile time
  • Macros!
  • Optimizations like constant folding, CTFE...
  • And so on...
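To make the transformation idea concrete, here's roughly what Rust's trailing ? operator does, written out as a manual source-to-source rewrite (simplified; the real desugaring goes through the Try trait):

```rust
// What the programmer writes:
fn read_num(s: &str) -> Result<i32, std::num::ParseIntError> {
    let n = s.parse::<i32>()?;
    Ok(n)
}

// Roughly what the compiler transforms it into (simplified; the real
// desugaring goes through the `Try` trait):
fn read_num_desugared(s: &str) -> Result<i32, std::num::ParseIntError> {
    let n = match s.parse::<i32>() {
        Ok(v) => v,
        Err(e) => return Err(From::from(e)),
    };
    Ok(n)
}
```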

Let's look at one problem I faced myself, which even prompted me to think about all this in the first place:

I wanted to have a .length property on strings, like so: "hello world".length. But this wasn't as easy as compiling a "normal" property, because we can compile this to three different things! For starters, if the receiver is indeed a string literal, like in the example, we can just replace the whole expression with a constant number, since the literal will never change. The same goes for a named constant: we can retrieve its value at compile time and again replace the whole expression with a constant number. Only if we're using a variable do we have to check the string at runtime, and the complexity of that varies too: if it's a length-prefixed string, we can simply use the length component. But if we're using a C-style string (the cstr type in lapyst), we're out of luck and need to count the characters manually. For this case, I wanted to insert a call to a builtin function for the string length.
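Put into code, the lowering could look something like this; a sketch with hypothetical AST types, not lapyst's actual implementation:

```rust
use std::collections::HashMap;

// A sketch of the three lowerings of `.length`, using hypothetical
// AST types; illustrative only, not lapyst's actual code.
enum StrKind { LengthPrefixed, CStr }

enum Expr {
    StrLiteral(String),                   // "hello world"
    ConstRef(String),                     // a named compile-time constant
    Var(String, StrKind),                 // a runtime variable
    IntLiteral(usize),                    // result of constant folding
    FieldAccess(Box<Expr>, &'static str), // read the length component
    BuiltinStrlen(Box<Expr>),             // runtime call counting characters
}

// Lower `receiver.length`; `consts` maps constant names to their values.
fn lower_length(receiver: Expr, consts: &HashMap<String, String>) -> Expr {
    match receiver {
        // Case 1: a literal -> fold to a constant number right away.
        Expr::StrLiteral(s) => Expr::IntLiteral(s.len()),
        // Case 2: a named constant -> look up its value and fold as well
        // (the sketch assumes the constant exists).
        Expr::ConstRef(name) => Expr::IntLiteral(consts[&name].len()),
        // Case 3a: length-prefixed string -> just read the length field.
        v @ Expr::Var(_, StrKind::LengthPrefixed) => {
            Expr::FieldAccess(Box::new(v), "length")
        }
        // Case 3b: C-style string -> emit a builtin that counts at runtime.
        v @ Expr::Var(_, StrKind::CStr) => Expr::BuiltinStrlen(Box::new(v)),
        // Anything else stays untouched in this sketch.
        other => other,
    }
}
```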

And here's the problem: if we were to insert a call into the AST, what should it be? We could reuse the node that function calls use, but that node would then need to be type-checked as well, would need a source location (which it doesn't really have), and some additional information that would be present when reading a call from source code but is still required by the code generation backend. A dedicated BuiltinCall node would of course eliminate those problems, but it creates a brand-new one; or rather, it exposes with painful clarity the problem that was there all along: we lose the ability to reason about the AST.

If we modify the AST in such broad ways, we can no longer know whether any given AST has already been modified. We don't know if all transformations were applied. This is even more visible with a feature like macros: if we modify the AST when executing macros, we could never tell whether a given AST is already expanded or not. This means we can't even guarantee that code which needs to run after the transformations actually receives the transformed AST it requires to function in the first place!
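For what it's worth, there are known ways to make that guarantee checkable, for example by tagging the tree with a phase marker; a general pattern sketched below, not something lapyst does:

```rust
use std::marker::PhantomData;

// A general technique (not what lapyst actually does): tag the tree
// with a phase marker so "was this AST transformed yet?" becomes a
// compile-time fact instead of a runtime guess.
struct Raw;      // fresh from the parser
struct Expanded; // all macros and sugar transformed away

struct Ast<Phase> {
    // ... the actual nodes would live here ...
    _phase: PhantomData<Phase>,
}

// The only way to obtain an `Ast<Expanded>` is to run the transforms.
fn expand(_ast: Ast<Raw>) -> Ast<Expanded> {
    // ... run all transformations, then re-tag the tree ...
    Ast { _phase: PhantomData }
}

// Codegen can only receive a fully expanded tree; passing a raw AST
// is now a type error instead of a silent bug.
fn codegen(_ast: &Ast<Expanded>) { /* ... */ }
```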

This was the point where I got stuck. On the one hand, an enhanced AST meant I didn't need a whole separate structure (i.e. an IR), so the code size of my project stayed low and manageable. Additionally, it would "save" memory allocations, time spent translating things that didn't need translation, and so on. But if I continued down that road, I would need to "abuse" the AST to create "virtual" nodes that don't occur naturally, either being half-broken because some of their information wouldn't be completely correct, or simply incurring the same overhead in memory allocations I was trying to avoid. If you're like me, these problems are the worst: you want to write something decently performant, so it doesn't burn anyone's computer just by compiling a hello world, but you also want it to be modular, reusable and expertly crafted, with a clean code flow and all that.

Conclusion

Starting a language is easy; implementing it to such an extent that it can be used for simple (or even some not-so-simple) programs is a bit harder, but only in the time it takes to reach that stage. But actually crafting a good implementation of a language is harder than writing almost any other piece of software; except maybe your own kernel.

In the end, I decided it would be best to start implementing an IR; I mean, Rust does that too while using LLVM under the hood, and they even have (at least) two IRs of their own! So it can't be that inefficient to use one in my project too.
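Sketching what such an IR could look like (the instruction set here is invented for illustration) also shows how it dissolves the BuiltinCall dilemma from earlier:

```rust
// A rough sketch of what a small own IR could look like; the
// instruction set is invented for illustration.
#[derive(Clone, Copy)]
struct Reg(u32);    // virtual register
#[derive(Clone, Copy)]
struct FuncId(u32); // resolved function reference

enum Inst {
    ConstInt { dest: Reg, value: i64 },
    LoadField { dest: Reg, object: Reg, field: u32 },
    Call { dest: Reg, func: FuncId, args: Vec<Reg> },
    // The earlier dilemma disappears: a builtin is simply its own
    // instruction, with no fake source location and no half-filled
    // fields borrowed from a syntax node.
    BuiltinStrlen { dest: Reg, ptr: Reg },
    Ret { value: Reg },
}

struct Function {
    name: String,
    body: Vec<Inst>,
}
```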

An IR also has one additional advantage I haven't mentioned until now: it makes you a bit more independent from your codegen, especially if you're using LLVM like me. With your own IR you can add more codegen backends at any time, with the benefit that all of them can make use of your optimization passes and so on.