11/19/2021

Implementing syntax highlighting

Using Tree-sitter to add syntax highlighting to my personal code editor, Glyph.

To make Glyph usable, I need syntax highlighting to accentuate and attenuate different parts of the code I’m editing. This is a must-have feature for any code editor.

I was able to get syntax highlighting working while still maintaining roughly 100 FPS in Glyph. Here’s how it looks highlighting some Go code:

Syntax highlighting

Most syntax highlighting implementations use some form of regex or pattern matching to identify the areas of the text to highlight, and this is a fine way of doing things for most use cases.

However, a more interesting approach is made possible through Tree-sitter.

What is Tree-sitter?

Tree-sitter is a parser generator. Given a set of grammar rules, it is able to generate an efficient parser that parses code into a concrete syntax tree (CST or simply syntax tree).

You can then walk this syntax tree to find the types of nodes you want to highlight (e.g. function, variable) and their positions in the source text.
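To make the idea of walking the tree concrete, here is a minimal sketch using a toy node type. `ToyNode`, `collect_ranges` and `sample_tree` are hypothetical names invented for this illustration; Tree-sitter’s real node type is richer, but the recursive shape of the walk is the same idea.

```rust
/// Toy stand-in for a syntax tree node (not Tree-sitter's real `Node` type).
#[derive(Debug)]
struct ToyNode {
    kind: &'static str,
    start_byte: usize,
    end_byte: usize,
    children: Vec<ToyNode>,
}

/// Depth-first walk that records the byte range of every node whose kind
/// appears in `wanted`.
fn collect_ranges(
    node: &ToyNode,
    wanted: &[&str],
    out: &mut Vec<(&'static str, usize, usize)>,
) {
    if wanted.contains(&node.kind) {
        out.push((node.kind, node.start_byte, node.end_byte));
    }
    for child in &node.children {
        collect_ranges(child, wanted, out);
    }
}

/// Hand-built tree for the 11-byte source text `fn add() {}`.
fn sample_tree() -> ToyNode {
    ToyNode {
        kind: "function_item",
        start_byte: 0,
        end_byte: 11,
        children: vec![
            ToyNode { kind: "identifier", start_byte: 3, end_byte: 6, children: vec![] },
            ToyNode { kind: "block", start_byte: 9, end_byte: 11, children: vec![] },
        ],
    }
}
```

Calling `collect_ranges` on `sample_tree()` with `&["identifier"]` yields the range 3..6, which is exactly the span of `add` that a highlighter would color.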

If you use nvim 0.5, you may have already seen it in action: the editor is commonly paired with Tree-sitter to provide syntax highlighting. Tree-sitter also powers the syntax highlighting of GitHub and the Atom code editor.

Why Tree-sitter?

What makes Tree-sitter compelling for text editors is that it is an incremental parser: it can efficiently update the syntax tree when the source text changes. In a text editor, these edits happen constantly.

Additionally, the CST provides you a representation of the code’s structure, which you can use to provide more intelligent syntax highlighting that would be difficult to achieve via regex.

You can also build smarter text editing operations on top of the structure provided by the CST. For example, if I wanted to provide a command to delete a function, this would be very easy with Tree-sitter.

Or, if I wanted to highlight multiple nested brackets, this too would be much easier to achieve with Tree-sitter’s syntax tree.

In essence, Tree-sitter gives you an efficient way to get the structure of the code currently being edited in a text editor, which you can then use to provide more intelligent editing operations or smarter syntax highlighting.
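As a rough illustration of a structure-aware edit, the sketch below deletes the function under the cursor. It assumes the byte ranges of all function nodes have already been collected from the syntax tree; `Span`, `innermost_function` and `delete_function_at` are hypothetical names for this example, not part of Tree-sitter.

```rust
/// Byte range of a syntax node, as a parser would report it.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Span {
    start: usize,
    end: usize,
}

/// Of all function spans containing the cursor, pick the smallest one,
/// which is the innermost function when functions are nested.
fn innermost_function(functions: &[Span], cursor: usize) -> Option<Span> {
    functions
        .iter()
        .copied()
        .filter(|s| s.start <= cursor && cursor < s.end)
        .min_by_key(|s| s.end - s.start)
}

/// Return the source with the function under the cursor removed.
/// If the cursor is not inside any function, the source is returned unchanged.
fn delete_function_at(source: &str, functions: &[Span], cursor: usize) -> String {
    match innermost_function(functions, cursor) {
        Some(span) => {
            let mut edited = String::with_capacity(source.len());
            edited.push_str(&source[..span.start]);
            edited.push_str(&source[span.end..]);
            edited
        }
        None => source.to_string(),
    }
}
```

With a regex-based highlighter there is no reliable way to know where a function ends; with node ranges the whole operation reduces to slicing the text.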

Syntax highlighting with Tree-sitter

Tree-sitter has a library (tree-sitter-highlight) with Rust bindings that provides some basic syntax highlighting functionality.

Most supported languages come with a set of “queries”, which basically map a syntax tree node to a highlight name.

You provide a list of highlight names you want to match (e.g. function, variable, etc.), and the library will parse the text and walk the CST returning the positions of nodes in the syntax tree that match the highlights.
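The library reports each match as an index into the list of highlight names you supplied, so the order of that list matters. A minimal sketch of turning such an index into a color might look like this; the names, the color values, and the helpers `color_for_name` and `color_for_index` are all made up for illustration and are not Glyph’s actual theme code.

```rust
/// RGBA color with components in the 0.0..=1.0 range.
type Rgba = [f32; 4];

/// Highlight names the editor asks for. Matches come back as indices
/// into this list, so index 1 means "keyword" here.
const HIGHLIGHT_NAMES: [&str; 3] = ["function", "keyword", "string"];

/// Fallback color for text without a highlight (illustrative value).
const FOREGROUND: Rgba = [0.9, 0.9, 0.9, 1.0];

/// Hypothetical theme lookup: highlight name -> color.
fn color_for_name(name: &str) -> Option<Rgba> {
    match name {
        "function" => Some([0.4, 0.7, 1.0, 1.0]),
        "keyword" => Some([0.8, 0.5, 0.9, 1.0]),
        "string" => Some([0.6, 0.8, 0.5, 1.0]),
        _ => None,
    }
}

/// Resolve a highlight index to a color, falling back to the foreground
/// color for unknown indices or names the theme does not define.
fn color_for_index(index: usize) -> Rgba {
    HIGHLIGHT_NAMES
        .get(index)
        .and_then(|name| color_for_name(name))
        .unwrap_or(FOREGROUND)
}
```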

In Glyph, I then use these positions to generate the appropriate RGBA color values for the vertices of the text that I send to the GPU through OpenGL.

Here’s a rough example of what I do for Glyph:

use tree_sitter_highlight::{HighlightEvent, Highlighter};

// Text represented as a byte array
let source_code: &[u8] = get_text_from_wherever();

// `highlight` takes `&mut self`, so the highlighter must be mutable
let mut highlighter = Highlighter::new();
let highlights = highlighter.highlight(
    // Pass in a language-specific highlight configuration, which also
    // selects which node types to highlight.
    &rust_highlight_config,
    source_code,
    None,
    |_| None
).unwrap();

// Initialize our colors for each character, default to the foreground color
let mut text_colors: Vec<&Color> = vec![self.theme.fg(); source_code.len()];

let mut color_stack: Vec<&Color> = Vec::new();

for event in highlights {
    match event.unwrap() {
        // Processed a chunk of text spanning from start..end
        HighlightEvent::Source {start, end} => {
            // Sometimes you will get a source event that has no highlight,
            // so make sure to check if there is a color on the stack
            if let Some(&color) = color_stack.last() {
                text_colors[start..end].fill(color);
            }
        },
        HighlightEvent::HighlightStart(highlight) => {
            // `highlight` is a tuple struct containing the node type's ID
            let node_type_id = highlight.0;
            if let Some(highlight) = Theme::color_from_highlight_id(node_type_id) {
                color_stack.push(
                    self.theme
                        .highlight(highlight)
                        .unwrap_or_else(|| self.theme.fg()),
                );
            } else {
                // Just use the normal text color if we can't find a highlight
                color_stack.push(self.theme.fg())
            }
        },
        HighlightEvent::HighlightEnd => {
            color_stack.pop();
        },
    }
}

The library API makes you process highlights through events yielded by an iterator. Processed text comes through as the HighlightEvent::Source event. If the text matches a highlight, this event will be preceded by a HighlightEvent::HighlightStart, which gives you the type of node the text belongs to (e.g. function); you can then use that to fetch a corresponding color. A matching HighlightEvent::HighlightEnd closes the highlight, and because highlights can nest, a stack keeps track of which one is currently innermost.
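To see how the stack behaves with nested highlights, here is a small self-contained model of the event stream. `Event` is a local stand-in with the same three shapes as the events described above, not the library’s own type, and `resolve` is a hypothetical helper.

```rust
/// Simplified mirror of the three event shapes (a local stand-in, not
/// the `tree_sitter_highlight::HighlightEvent` type).
enum Event {
    /// A highlight opens; the value is an index into the highlight names.
    Start(usize),
    /// A chunk of source text spanning `start..end` was processed.
    Source { start: usize, end: usize },
    /// The most recently opened highlight closes.
    End,
}

/// Resolve events into one highlight index per byte (`None` = no highlight).
fn resolve(events: &[Event], len: usize) -> Vec<Option<usize>> {
    let mut per_byte = vec![None; len];
    let mut stack: Vec<usize> = Vec::new();
    for event in events {
        match event {
            Event::Start(id) => stack.push(*id),
            Event::Source { start, end } => {
                // The innermost open highlight wins; with an empty stack
                // the bytes keep their default of `None`.
                if let Some(&id) = stack.last() {
                    per_byte[*start..*end].fill(Some(id));
                }
            }
            Event::End => {
                stack.pop();
            }
        }
    }
    per_byte
}
```

Feeding it a stream where highlight 1 is nested inside highlight 0 shows the outer highlight resuming once the inner one ends, which is the reason a stack is used rather than a single "current color" variable.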

Optimizations

This is a really simple and dumb way of achieving syntax highlighting (I’m re-parsing the code on every keystroke). But with the warning about “premature optimization” in mind, I profiled Glyph first to make sure I wasn’t about to solve a performance problem that didn’t actually exist.

Sure enough, I was still able to maintain a consistent ~100 FPS on a medium-large file, and the flamegraph showed the time Tree-sitter took to build the syntax tree was minuscule in comparison to other operations like the OpenGL GPU calls. I am really impressed by how performant Tree-sitter is.

I’m pretty happy with ~100 FPS, so no optimizations are absolutely necessary at this time. However, it’s still important to make note of them in case I want to go back and speed things up.

Tree-sitter provides an API to efficiently edit the syntax tree without rebuilding it every time. However, after some poking around, it appears it’s not exposed by the highlighting library. So I’d have to implement the highlighting myself: call the incremental parsing API after each edit, then manually walk the syntax tree to find the nodes to highlight.
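Incremental parsing needs to be told what changed. In the Rust bindings this is, as far as I can tell, an `InputEdit` value passed to `Tree::edit`, after which the old tree is handed back to the parser for the re-parse. The sketch below only computes that information with the standard library; `Position`, `EditDescription`, `position_at` and `describe_edit` are local, hypothetical names that mirror the byte offsets and row/column points such an edit carries.

```rust
/// Row/column position, with the column counted in bytes.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Position {
    row: usize,
    column: usize,
}

/// Local stand-in for the information an incremental parser needs about an edit.
#[derive(Debug, PartialEq)]
struct EditDescription {
    start_byte: usize,
    old_end_byte: usize,
    new_end_byte: usize,
    start_position: Position,
    old_end_position: Position,
    new_end_position: Position,
}

/// Convert a byte offset into a row/column position by scanning for newlines.
fn position_at(text: &str, byte: usize) -> Position {
    let before = &text.as_bytes()[..byte];
    let row = before.iter().filter(|&&b| b == b'\n').count();
    let column = match before.iter().rposition(|&b| b == b'\n') {
        Some(newline) => byte - newline - 1,
        None => byte,
    };
    Position { row, column }
}

/// Describe replacing `old[start..old_end]` with `inserted_len` new bytes.
/// `new` is the text after the edit has been applied.
fn describe_edit(
    old: &str,
    new: &str,
    start: usize,
    old_end: usize,
    inserted_len: usize,
) -> EditDescription {
    let new_end = start + inserted_len;
    EditDescription {
        start_byte: start,
        old_end_byte: old_end,
        new_end_byte: new_end,
        start_position: position_at(old, start),
        old_end_position: position_at(old, old_end),
        new_end_position: position_at(new, new_end),
    }
}
```

Scanning from the start of the file for every edit is wasteful; an editor that already tracks the cursor’s row and column could fill these fields in directly.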

Another, simpler optimization is to put the highlighting task on another thread. Tree-sitter trees are very inexpensive to copy and send to different threads.
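A minimal sketch of the threading idea, using only the standard library: the UI thread sends versioned snapshots of the text to a worker and picks up results when they are ready. `toy_highlight` is a placeholder for the real highlighting pass, `spawn_highlighter` is a hypothetical helper, and sending a `String` copy of the text is a simplification chosen to keep the example self-contained.

```rust
use std::sync::mpsc;
use std::thread;

/// Placeholder for the real highlighting pass: marks ASCII digits with 1
/// and every other byte with 0.
fn toy_highlight(source: &str) -> Vec<u8> {
    source
        .bytes()
        .map(|b| if b.is_ascii_digit() { 1 } else { 0 })
        .collect()
}

/// Spawn a worker thread that highlights every snapshot it receives and
/// sends the result back, tagged with the snapshot's version number.
fn spawn_highlighter() -> (
    mpsc::Sender<(u64, String)>,
    mpsc::Receiver<(u64, Vec<u8>)>,
) {
    let (request_tx, request_rx) = mpsc::channel::<(u64, String)>();
    let (result_tx, result_rx) = mpsc::channel();
    thread::spawn(move || {
        // The loop ends when the request sender is dropped.
        for (version, source) in request_rx {
            if result_tx.send((version, toy_highlight(&source))).is_err() {
                break; // The receiving side went away.
            }
        }
    });
    (request_tx, result_rx)
}
```

The version number lets the render loop discard results that arrive for text that has since changed, and polling the receiver with `try_recv` once per frame keeps rendering from blocking on the parser.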