11/17/2022

typelalr: Nerd sniping myself into building lexer/parser generators for Typescript's type system

Aka the dumbest idea ever


Background

You know those wacky type-level Typescript projects like Hypescript (Typescript implemented in types) or ts-sql (an SQL database written in Typescript's type system)?

They work by taking an input Typescript string type and lexing/parsing it into a higher-level representation that's easier to manipulate (an AST).

If you want to build anything similar, you’re going to have to do the same. But writing a lexer/parser by hand isn’t particularly fun, and having to work under the constraints of Typescript’s type-level programming makes it much worse.

Decades ago, people fixed this problem by inventing lexer and parser generators. You give them a “grammar” (usually defined in some syntax of its own), and they spit out code that can lex and parse it.

This brings us to my terrible idea: build a tool to generate lexers and parsers to be used in Typescript's type system.

Now I'll talk a little about the process.

Lexer/parser generators

How do they work? The key insight is modeling the grammar as states and transitions, then generating a finite-state machine that can move between these states as it moves through your input.

This is my favorite application of state machines, because instead of having to write all the complicated logic and control flow associated with lexing/parsing, that information is encoded within the states and transitions.

So the actual implementation of the state machine is very simple: it's really a glorified while loop over the input. Most of the work goes into actually generating the states and transitions, but you write that once and it will work for any grammar you give it, rather than writing a new lexer/parser for every language.

[show image of an example regex lexer]
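To make that "glorified while loop" concrete, here's a minimal sketch (not from the actual project; the table and names are invented for illustration) of a table-driven DFA lexer in runtime TypeScript that matches the single token /ab*/:

```typescript
// Hypothetical DFA for the regex /ab*/ :
// state 0 --a--> state 1, state 1 --b--> state 1; state 1 accepts.
const transitions: Record<number, Record<string, number>> = {
  0: { a: 1 },
  1: { b: 1 },
};
const accepting = new Set([1]);

// The "glorified while loop": follow transitions until no edge matches,
// remembering the longest accepting prefix seen so far.
function matchToken(input: string): string | null {
  let state = 0;
  let lastAccept = -1; // index just past the longest accepting prefix
  for (let i = 0; i < input.length; i++) {
    const next = transitions[state]?.[input[i]];
    if (next === undefined) break;
    state = next;
    if (accepting.has(state)) lastAccept = i + 1;
  }
  return lastAccept >= 0 ? input.slice(0, lastAccept) : null;
}

// matchToken("abbbc") → "abbb"
// matchToken("ba")    → null
```

All the regex-specific knowledge lives in the `transitions` table; the loop itself never changes.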

One weird trick

Equipped with this knowledge of state machines, I had a realization that would make this whole process far simpler for myself.

The hard part (generating the states and transitions) doesn't even have to be in type-level Typescript. It doesn't matter how they're generated, as long as they end up as Typescript types that the type-level state machine I write (the easy part) can interpret.

This means I can generate the states and transitions in whatever language I choose! So I chose the language I'm most comfortable in: Rust.
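To illustrate the idea, here's a hypothetical shape (names invented for this sketch, not the project's actual codegen output) that the Rust side could emit for a tiny DFA, plus the kind of indexed-access lookup the type-level machine would do against it:

```typescript
// Hypothetical codegen output: the DFA for /ab*/ emitted as a plain
// Typescript type that the type-level state machine can consume.
type LexerDfa = {
  start: 0;
  accepting: [1];
  transitions: {
    0: { a: 1 };
    1: { b: 1 };
  };
};

// At the type level, following a transition is just an indexed access,
// guarded by conditional checks so the lookup is known to be valid.
type Step<S extends number, C extends string> =
  S extends keyof LexerDfa["transitions"]
    ? C extends keyof LexerDfa["transitions"][S]
      ? LexerDfa["transitions"][S][C]
      : never
    : never;

// Runtime witness: this only compiles if Step<0, "a"> resolves to 1.
const next: Step<0, "a"> = 1;
```

Because the table is pure data, it genuinely doesn't matter whether Rust, a shell script, or anything else produced the text of that type.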

Plan

After this revelation, a plan began to take form in my mind:

  1. Lexer: Generate lexer DFA states in Rust
  2. Parser: Generate the LALR(1) parsing tables in Rust
  3. Codegen: Turn the lexer DFA states and the LALR(1) parsing tables into Typescript types
  4. Machines: Write Typescript type-level state machines that interpret the lexer DFA states and the LALR(1) tables respectively
  5. Grammar language: Design a language people can use to specify the grammar to be parsed

That plan uses some terminology I haven't covered. This is the boring part, so I'll make it quick:

A deterministic finite automaton (DFA) is a finite-state machine variant whose transitions between states are deterministic: for a given state and input, there is at most one transition to take. In a regex state machine, for example, this means each transition out of a state is associated with exactly one char.

[show dfa vs nfa]

What is LALR(1)? An LALR parser belongs to a whole class of parsers called LR parsers. They parse grammars bottom-up, which is challenging to do by hand, so they're most commonly built with a parser generator. LALR is a variant of LR that uses much smaller parsing tables while still allowing most syntactic constructs of programming languages to be expressed.

With my plan formed, I pulled out the Dragon Book and read up on its chapters about lexer and parser generators, then began hacking on the project.

The grammar language

It looks like this:

This grammar language itself is parsed using an LR(1) parser generated by this library for Rust.

Once I have the AST for the grammar, I turn all the token rules into DFA states and generate a lexer covering all the different token types.

Notice that tokens can use regex; in fact, every token is a regular expression. That's important for lexing.

Lexer

Since each token is actually a regular expression, I need to parse the regex for each token and turn that into DFA states/transitions. Then I need to combine all the states and transitions for each token into one DFA.

Merging all the token states is actually not something the Dragon Book covers. The way I did it was to first compile each token's regex to an NFA, merge them all into one big NFA, then turn that big NFA into a DFA.
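The merge step can be sketched like this (not the project's actual code; the `Nfa` shape and `mergeNfas` are made up for illustration): renumber each token's states into a shared space, then add a fresh start state with epsilon transitions into each original start state:

```typescript
// Hypothetical NFA representation: numbered states, labeled edges
// (label null = epsilon), and a map from accepting state to token name.
interface Nfa {
  start: number;
  edges: Array<{ from: number; to: number; label: string | null }>;
  accepts: Map<number, string>;
}

// Merge per-token NFAs into one: offset each NFA's state numbers so they
// don't collide, then wire a new shared start state (0) to every original
// start state with an epsilon edge.
function mergeNfas(nfas: Nfa[]): Nfa {
  const merged: Nfa = { start: 0, edges: [], accepts: new Map() };
  let offset = 1; // state 0 is the new shared start state
  for (const nfa of nfas) {
    let maxState = nfa.start;
    for (const e of nfa.edges) maxState = Math.max(maxState, e.from, e.to);
    // epsilon edge from the shared start into this NFA's renumbered start
    merged.edges.push({ from: 0, to: nfa.start + offset, label: null });
    for (const e of nfa.edges) {
      merged.edges.push({ from: e.from + offset, to: e.to + offset, label: e.label });
    }
    nfa.accepts.forEach((tok, s) => merged.accepts.set(s + offset, tok));
    offset += maxState + 1;
  }
  return merged;
}

// Two toy token NFAs: the keyword "if" and a one-letter identifier "x".
const ifNfa: Nfa = {
  start: 0,
  edges: [{ from: 0, to: 1, label: "i" }, { from: 1, to: 2, label: "f" }],
  accepts: new Map([[2, "IF"]]),
};
const idNfa: Nfa = {
  start: 0,
  edges: [{ from: 0, to: 1, label: "x" }],
  accepts: new Map([[1, "ID"]]),
};
const big = mergeNfas([ifNfa, idNfa]);
```

The subset construction then turns `big` into a DFA, with each accepting DFA state remembering which token it came from.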

I also realized I needed to test if the lexer DFA states worked properly, so I ended up having to write the state machine that interprets the states in Rust too. But that ended up being useful because I could transliterate the Rust code to Typescript code when it came to writing the type-level state machine.

There are also some tokens that need to refer to the input source; for example, a string token should also carry the actual string with it. The Dragon Book doesn't cover this either, but the solution is simple: when you begin lexing a token, store the index in the input string where it begins; then, when that token is done, slice the string from that begin index up to the current position.
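A sketch of that begin-index trick in runtime TypeScript (`lexWithText` and `tokenEnd` are made-up names; in the real thing, the DFA decides where a token ends):

```typescript
// Remember where each token starts, run the token machine, then slice the
// source between the start index and the current position.
function lexWithText(
  src: string,
  tokenEnd: (src: string, start: number) => number,
) {
  const tokens: { text: string; start: number }[] = [];
  let pos = 0;
  while (pos < src.length) {
    const start = pos;          // store where this token begins
    pos = tokenEnd(src, start); // stand-in for running the DFA to the token's end
    tokens.push({ text: src.slice(start, pos), start });
  }
  return tokens;
}

// Toy tokenEnd: consume a maximal run of spaces or non-spaces.
const isSpace = (c: string) => c === " ";
const end = (s: string, i: number) => {
  const sp = isSpace(s[i]);
  while (i < s.length && isSpace(s[i]) === sp) i++;
  return i;
};

const toks = lexWithText("let x", end);
// toks → [{text:"let",start:0}, {text:" ",start:3}, {text:"x",start:4}]
```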

Parser

I like to think of the LR parsing state machine as a lexing state machine with extra steps.

In the lexing state machine, characters drive the progress of lexing a token forward, causing a state transition. This relationship of characters -> tokens extends to LR parsing with tokens -> AST nodes (known as productions in the literature). However, there are some additional requirements of parsing that introduce some nuance.

Parsing often requires us to build nested AST structures, unlike lexing, where we simply produce a flat list of tokens. If we complete an AST node, we need to know what to do with it if it has a parent, and perhaps its parent has a parent, and so forth. This creates the need to somehow track our progress through nested structures, and in classic computer science style it is solved with a stack.

But once you introduce a stack, you also need to know how to manipulate it: when do you push and pop? This is where the action and goto tables come into play. The action table says, given the current state and lookahead token, whether to shift (push a new state) or reduce (pop the states of a completed production); the goto table says which state to push after a reduction.

Machines

Like I mentioned before, all the complicated logic/control flow is encoded into the states, so the state machine implementation is quite elegant. To demonstrate this, a hypothetical implementation of the parser state machine in JS/TS is only ~30 lines long:

```typescript
let a = firstChar();
while (true) {
  const state = stackTop();
  const action = actionTable[state][a];
  if (action.type === 'shift') {
    const newState = action.newState;
    pushToStack(newState);
    a = nextChar();
    continue;
  }
  if (action.type === 'reduce') {
    const production = action.production;
    popN(production.to.length);
    const newTop = stackTop();
    pushToStack(gotoTable[newTop][production.name]);
    outputProduction(production);
    continue;
  }
  if (action.type === 'accept') {
    // Done parsing!
    break;
  }
  throw new Error('Invalid action type');
}
```

Despite this, the type-level code still ended up being a monstrosity. Here's the type-level code equivalent to the ~10-line action.type === 'reduce' branch from the code sample above:

The most annoying problem is that Typescript can't see through the return type of a generic type, so you constantly have to add gratuitous conditional checks everywhere. Other than that, I find type-level Typescript to be like an extremely handicapped version of Lisp, and not that bad.

Another annoying problem was debugging. I could approximate "printf" style debugging by returning template literal strings, but it was really painful.
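For example, a type-level "print" is just a template literal type that renders the machine's state as a string you can hover over in an editor (`DebugStep` and its parameters are invented for this sketch):

```typescript
// Approximate printf debugging at the type level: return a template
// literal string describing the machine's state instead of the real result.
type DebugStep<State extends number, Char extends string> =
  `at state ${State}, consuming '${Char}'`;

// Hovering over Msg in an editor shows the rendered string.
type Msg = DebugStep<3, "a">; // "at state 3, consuming 'a'"

// Runtime witness: this only compiles if Msg is exactly that literal.
const msg: Msg = "at state 3, consuming 'a'";
```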

The lexer ended up being ~200 lines of type-level Typescript, and the parser about the same.

Final result

Here it is parsing a simplified version of Lisp:

You can try it out here on the Typescript playground (warning: this link is huge).

I was happy I got this working, but I ran into a Typescript recursion limit problem that pretty much means you can't use this for longer inputs.

I'm still trying to figure out why it's happening, since all my type-level functions should be tail-call optimized, but perhaps there's something I've missed.
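For reference, the shape that Typescript's tail-recursion elimination (added for conditional types in TS 4.5) expects is accumulator style, where the recursive reference is the directly returned type; a minimal sketch (`Chars` is an illustrative type, not from the project):

```typescript
// Accumulator pattern eligible for tail-recursion elimination: the
// recursive call is returned directly, with all work carried in Acc.
type Chars<S extends string, Acc extends string[] = []> =
  S extends `${infer Head}${infer Rest}`
    ? Chars<Rest, [...Acc, Head]> // tail position: nothing wraps the call
    : Acc;

type Result = Chars<"abc">; // ["a", "b", "c"]

// Runtime witness that the type resolves as expected.
const result: Chars<"abc"> = ["a", "b", "c"];
```

If anything in the real machine wraps its recursive reference (even in another conditional), the elimination doesn't apply and the default depth limit kicks in, so that's where I'd look first.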

But I feel like I've spent too much time on a really stupid idea so I'm laying this one to rest for now.


Got any questions? Did I say something completely bogus and wrong? Hit me up on Twitter.