zackoverflow

Why does tsgo use so much memory?


If you run tsgo on a decently sized Typescript project, it’s not uncommon to see it use gigabytes of memory.

Why is that?

The short answer is:

It’s not uncommon for Typescript projects to have:

When running tsgo on a large Typescript project, these type creation patterns compound and result in a lot of duplicated or unused memory.

Let’s dig deeper.

Heap analysis

Let’s first get a breakdown of the heap so we can see what’s taking up so much memory.

I’ll run tsgo on a large Next.js project with Zod, tRPC, Drizzle, all the good stuff that makes the typechecker do work. Including node_modules, it’s about 7k .ts files.

We can use Go’s runtime/pprof package to capture peak heap snapshots and the pprof tool to tell us which functions allocated the most memory with the -inuse_space flag.
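As a rough illustration of the capture step, here is a minimal sketch of taking a heap profile with runtime/pprof. It is not the instrumentation used for the numbers in this post: the captureHeapProfile helper and the 64 MiB of dummy data are made up for the example, and it takes a single snapshot rather than tracking the peak.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
)

// captureHeapProfile is a hypothetical helper. It forces a GC so the
// profile reflects up-to-date live objects, then serializes the heap
// profile into memory.
func captureHeapProfile() ([]byte, error) {
	runtime.GC()
	var buf bytes.Buffer
	if err := pprof.WriteHeapProfile(&buf); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	// Hold some live data so the profile has something to report.
	live := make([][]byte, 0, 64)
	for i := 0; i < 64; i++ {
		live = append(live, make([]byte, 1<<20))
	}

	profile, err := captureHeapProfile()
	if err != nil {
		panic(err)
	}
	fmt.Printf("captured %d bytes of heap profile (%d MiB held live)\n", len(profile), len(live))
	// Writing these bytes to a file (say heap.pb.gz) lets you inspect them with:
	//   go tool pprof -inuse_space heap.pb.gz
}
```

In a real run you would write the bytes to a file at the point of interest and feed that file to the pprof tool.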

If we categorize the allocations by family (AST, typechecker, etc.), we see this:

Total live heap:                    1471.9 MB
  pprof writer self-overhead:         75.2 MB
  real live data:                   1321.5 MB

      MB     pct   Family
──────────────────────────────────────────────────────────────────────────────
  594.72   45.0%   AST arenas (parser-allocated)
  399.12   30.2%   Checker (type/signature computation)
  121.79    9.2%   LinkStore (per-node/per-symbol caches)
   63.38    4.8%   OS / syscall / file I/O
   62.58    4.7%   Binder (symbol/flow declarations)
   22.33    1.7%   Parser (intern maps, etc.)
   20.24    1.5%   pkg: collections
   15.54    1.2%   Checker arenas
   13.46    1.0%   AST utilities
    6.58    0.5%   Compiler / module resolution
    1.10    0.1%   pkg: core
    0.70    0.1%   pkg: packagejson

What sticks out at first glance is that 45% of memory (about 600MB) is allocated for AST nodes. It sounds like a lot, but it’s actually expected for the bulk of the memory allocated by a compiler to be taken up by AST nodes.

AST nodes also typically need to live for the duration of the compiler’s execution, so there’s nothing we can really do here. A lot of files means a lot of AST nodes!

I’m more interested in the memory allocated by the typechecker (the Checker struct in the source).

What happens if we run tsgo with --singleThreaded?

Total live heap:                     797.4 MB
  pprof writer self-overhead:          3.6 MB
  real live data:                    790.2 MB

      MB     pct   Family
──────────────────────────────────────────────────────────────────────────────
  522.95   66.2%   AST arenas (parser-allocated)
   63.37    8.0%   OS / syscall / file I/O
   62.63    7.9%   Binder (symbol/flow declarations)
   51.93    6.6%   Checker (type/signature computation)
   23.01    2.9%   LinkStore (per-node/per-symbol caches)
   22.51    2.8%   Parser (intern maps, etc.)
   16.78    2.1%   AST utilities
   16.15    2.0%   pkg: collections
   10.21    1.3%   Compiler / module resolution
    0.58    0.1%   pkg: packagejson
    0.10    0.0%   pkg: core
    0.01    0.0%   ** unclassified **

The typechecker takes up only ~50MB instead of ~400MB! This strongly suggests to me that there is some overhead with multi-threading here.

Let’s look deeper at the typechecker.

The type checker

The way tsgo multi-threads typechecking is by creating a pool of Checkers, one per thread:

// internal/compiler/checkerpool.go

func newCheckerPoolWithTracing(program *Program, tr *tracing.Tracing) *checkerPool {
	checkerCount := 4
	if program.SingleThreaded() {
		checkerCount = 1
	} else if c := program.Options().Checkers; c != nil {
		checkerCount = *c
	}

	checkerCount = max(min(checkerCount, len(program.files), 256), 1)

	pool := &checkerPool{
		program:  program,
		checkers: make([]*checker.Checker, checkerCount),
		locks:    make([]*sync.Mutex, checkerCount),
		tracing:  tr,
	}

	return pool
}

When a Checker is created, it is given the entire Typescript program AST and all its files:

// internal/checker/checker.go

func NewChecker(program Program, tracer *Tracer) (*Checker, *sync.Mutex) {
	program.BindSourceFiles()

	c := &Checker{}
	c.id = nextCheckerID.Add(1)
	c.tracer = tracer
	c.program = program
	c.compilerOptions = program.Options()
	c.files = program.SourceFiles()
	c.fileIndexMap = createFileIndexMap(c.files)

	// ... more code
}

During typechecking and emitting diagnostics for a file, each file gets assigned to the next available Checker.
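To make the "next available Checker" idea concrete, here is a toy sketch, not tsgo’s actual scheduling code. The toyChecker type and checkAll function are hypothetical, and an unbuffered channel stands in for whatever mechanism tsgo really uses to hand out files.

```go
package main

import (
	"fmt"
	"sync"
)

// toyChecker stands in for a type checker that owns private state.
type toyChecker struct {
	id      int
	checked []string // files this checker ended up handling
}

// checkAll hands each file to whichever toy checker is free next.
func checkAll(files []string, checkerCount int) []*toyChecker {
	checkers := make([]*toyChecker, checkerCount)
	queue := make(chan string)
	var wg sync.WaitGroup
	for i := range checkers {
		checkers[i] = &toyChecker{id: i}
		wg.Add(1)
		go func(c *toyChecker) {
			defer wg.Done()
			// Only this goroutine touches c.checked, so no lock is needed.
			for f := range queue {
				c.checked = append(c.checked, f)
			}
		}(checkers[i])
	}
	for _, f := range files {
		queue <- f
	}
	close(queue)
	wg.Wait()
	return checkers
}

func main() {
	files := []string{"a.ts", "b.ts", "c.ts", "d.ts", "e.ts", "f.ts"}
	total := 0
	for _, c := range checkAll(files, 4) {
		fmt.Printf("checker %d handled %d file(s)\n", c.id, len(c.checked))
		total += len(c.checked)
	}
	fmt.Println("total files checked:", total)
}
```

Which checker gets which file varies from run to run. The point is only that any file can land on any checker, so types shared between files can end up being computed by several of them.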

Each Checker has its own state for type-checking (which we’ll see in more detail later). Here’s an example of duplicated work:

From the pprof run I noticed the top allocating Checker functions were:

Let’s look at precisely what data is being allocated in each Checker.

Duplicated types

Each Checker has a lot of stores for the many types that could be constructed:

type Checker struct {
  stringLiteralTypes            map[string]*Type
  numberLiteralTypes            map[jsnum.Number]*Type
  bigintLiteralTypes            map[jsnum.PseudoBigInt]*Type
  enumLiteralTypes              map[EnumLiteralKey]*Type
  indexedAccessTypes            map[CacheHashKey]*Type
  templateLiteralTypes          map[CacheHashKey]*Type
  stringMappingTypes            map[StringMappingKey]*Type
  cachedTypes                   map[CachedTypeKey]*Type        
  cachedSignatures              map[CachedSignatureKey]*Signature
  narrowedTypes                 map[NarrowedTypeKey]*Type
  assignmentReducedTypes        map[AssignmentReducedKey]*Type
  discriminatedContextualTypes  map[DiscriminatedContextualTypeKey]*Type
  instantiationExpressionTypes  map[InstantiationExpressionKey]*Type
  substitutionTypes             map[SubstitutionTypeKey]*Type
  reverseMappedCache            map[ReverseMappedTypeKey]*Type
  reverseHomomorphicMappedCache map[ReverseMappedTypeKey]*Type
  iterationTypesCache           map[IterationTypesKey]IterationTypes
  tupleTypes                    map[CacheHashKey]*Type
  unionTypes                    map[CacheHashKey]*Type
  unionOfUnionTypes             map[UnionOfUnionKey]*Type
  intersectionTypes             map[CacheHashKey]*Type
  propertiesTypes               map[PropertiesTypesKey]*Type
  flowLoopCache                 map[FlowLoopKey]*Type
  flowTypeCache                 map[*ast.Node]*Type
  errorTypes                    map[CacheHashKey]*Type
  // and many more!
}

Remember that:

This means there can be a lot of duplicated memory that sits around.

To verify this, let’s start by creating a file with some code that builds tuples:

type BuildTuple<L extends number, T extends any[] = []> =
  T['length'] extends L ? T : BuildTuple<L, [...T, any]>;

type TC = BuildTuple<100>;
declare const x: TC;
export const c0 = x[0];
export const cLen: 100 = x.length;

The BuildTuple<L, T> type will recursively build a tuple type from the empty tuple type [] all the way to a tuple with 100 any’s in it ([any, any, ... any]).

Each iteration of the recursion creates a new tuple and caches it forever[1].

If we create 4 files with the content above and run them through tsgo, we should see roughly 100 tuple types created in each of the 4 typecheckers (and likewise roughly 100 number literal types).

Let’s see:

                            single checker     4 checkers
                            ─────────────────  ─────────────────────────────
  tupleTypes                102                [102 102 102 102]  →  408
  numberLiteralTypes        101                [101 101 101 101]  →  404

This illustrates two things:

This is just a trivial example. Imagine the level of duplication that could happen when typechecking many thousands of files.
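The effect can be modeled in a few lines of Go. This is a simplified sketch, not tsgo’s code: toyChecker, internTuple, and totalEntries are hypothetical names, files are spread round-robin instead of going to the next free checker, and the toy does not try to reproduce the exact counts measured above.

```go
package main

import "fmt"

// toyType stands in for an interned type object.
type toyType struct{ key string }

// toyChecker owns a private intern table, like the per-Checker maps above.
type toyChecker struct {
	tupleTypes map[string]*toyType
}

func newToyChecker() *toyChecker {
	return &toyChecker{tupleTypes: make(map[string]*toyType)}
}

// internTuple returns the cached tuple type of the given length,
// creating it on first use. Entries are never evicted.
func (c *toyChecker) internTuple(length int) *toyType {
	key := fmt.Sprintf("tuple/%d", length)
	if t, ok := c.tupleTypes[key]; ok {
		return t
	}
	t := &toyType{key: key}
	c.tupleTypes[key] = t
	return t
}

// checkFile mimics a recursive type that builds tuples of length 0..n.
func (c *toyChecker) checkFile(n int) {
	for length := 0; length <= n; length++ {
		c.internTuple(length)
	}
}

// totalEntries checks fileCount identical files, spread round-robin over
// checkerCount checkers, and sums the cache entries across all checkers.
func totalEntries(fileCount, checkerCount, n int) int {
	checkers := make([]*toyChecker, checkerCount)
	for i := range checkers {
		checkers[i] = newToyChecker()
	}
	for f := 0; f < fileCount; f++ {
		checkers[f%checkerCount].checkFile(n)
	}
	total := 0
	for _, c := range checkers {
		total += len(c.tupleTypes)
	}
	return total
}

func main() {
	fmt.Println("1 checker: ", totalEntries(4, 1, 100)) // 101
	fmt.Println("4 checkers:", totalEntries(4, 4, 100)) // 404
}
```

With one checker, the second, third, and fourth files are pure cache hits. With four checkers, every cache starts cold and fills up with the same entries.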

Duplicated symbols

In compilers, named things (identifiers for functions, variables, etc.) often get recorded in a layer of indirection called a “symbol”.

Usually, this lets names be scoped (“foo” in the global scope and “foo” in function scope mean two different things) and also lets you give a stable handle to them in case you want to rename (e.g. minification).
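Here is a minimal sketch of that idea in Go. It is a generic textbook-style symbol table, not tsgo’s ast.Symbol; the Symbol, Scope, Declare, and Lookup names are made up for the example.

```go
package main

import "fmt"

// Symbol is a stable handle for a named declaration.
type Symbol struct {
	ID   int
	Name string
}

// Scope maps names to symbols and falls back to its parent on lookup.
type Scope struct {
	parent  *Scope
	symbols map[string]*Symbol
}

var nextID int

func NewScope(parent *Scope) *Scope {
	return &Scope{parent: parent, symbols: make(map[string]*Symbol)}
}

// Declare creates a fresh symbol for name in this scope.
func (s *Scope) Declare(name string) *Symbol {
	nextID++
	sym := &Symbol{ID: nextID, Name: name}
	s.symbols[name] = sym
	return sym
}

// Lookup searches this scope first, then each enclosing scope.
func (s *Scope) Lookup(name string) *Symbol {
	for sc := s; sc != nil; sc = sc.parent {
		if sym, ok := sc.symbols[name]; ok {
			return sym
		}
	}
	return nil
}

func main() {
	global := NewScope(nil)
	fn := NewScope(global)
	outer := global.Declare("foo")
	inner := fn.Declare("foo")

	fmt.Println(global.Lookup("foo") == outer) // true
	fmt.Println(fn.Lookup("foo") == inner)     // true
	fmt.Println(outer == inner)                // false

	// A minifier-style rename updates the symbol once; everything that
	// holds a pointer to the symbol sees the new name.
	inner.Name = "a"
	fmt.Println(fn.Lookup("foo").Name) // a
}
```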

Each Checker stores a bunch of symbols:

type Checker struct {
	// ... more code
	symbolArena core.Arena[ast.Symbol]
	// ... more code
}

Are symbols being duplicated a lot?

I modified tsgo to dump the top symbol names (the string part of the symbol) when running 4 threads:

tsgo --checkers 4

| Symbol         | Kind     |  Count |
| -------------- | -------- | -----: |
| `at`           | Method   | 34,500 |
| `_`            | Property | 25,600 |
| `name`         | Property | 24,700 |
| `value`        | FuncVar  | 22,800 |
| `@@iterator`   | Method   | 22,300 |
| `data`         | Property | 22,100 |
| `enumValues`   | Property | 21,900 |
| `columnType`   | Property | 21,000 |
| `dataType`     | Property | 21,000 |
| `generated`    | Property | 19,500 |

Let’s look at the at symbol count. If it decreases with single-threaded tsgo then that probably means other threads are duplicating it:

tsgo --checkers 1

| Symbol         | Kind     |  Count |
| -------------- | -------- | -----: |
| `props`        | FuncVar  | 16,800 |
| `at`           | Method   | 14,600 |
| `children`     | Property | 10,400 |
| `value`        | FuncVar  | 10,200 |
| `@@iterator`   | Method   |  9,500 |
| `className`    | Property |  9,200 |
| `data`         | Property |  8,500 |
| `forEach`      | Method   |  8,100 |
| `map`          | Method   |  8,000 |
| `find`         | Method   |  7,900 |

So there’s about 20k more at symbols created when running tsgo with 4 threads!

Let’s verify it by creating a little test file.

The at symbol is from Array<T>.prototype.at. We can force Typescript to create this symbol by creating an Array<T> and doing any property lookup on it. This causes Typescript to resolve all members (and create their symbols)[2] on the Array object:

declare const arr: Array<string>;
export const len = arr.length;

Now we can create 4 files with these exact same contents. If we run tsgo with --checkers 4, each file should go to a Checker and we’ll see if it duplicates the at symbol:

               --checkers 1         --checkers 4
               ─────────────        ──────────────────────────────
               total                total    c0    c1    c2    c3
at             1                    4        1     1     1     1

So each checker duplicated the symbol for Array<string>.prototype.at.

Also note that new symbols are created for every new instantiation of a generic type. So Array<string>, Array<number>, etc. will all get their own symbols for at and any other members. This is pretty standard and normal.

But you can start to see how it could be easy for tsgo to duplicate a lot of symbols on other threads.

Imagine your code creates some generic type with a lot of fields and methods, maybe for a data structure:

type MyDataStructure<T> = {
  field1: T;
  field2: string;
  // ...
  field100: string;
}

Each instantiation will create 100 symbols. And if you import this type in many files, it’s highly likely that it will be seen, and its symbols duplicated, by more than one Checker.
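The arithmetic is easy to sketch. The Go model below is a back-of-the-envelope illustration under a worst-case assumption (every checker ends up resolving every instantiation); the toyChecker, resolveMembers, and symbolsCreated names are hypothetical and it does not mirror how tsgo actually tracks symbols.

```go
package main

import "fmt"

// instantiation identifies a generic type applied to a type argument,
// e.g. {"MyDataStructure", "string"}.
type instantiation struct{ generic, typeArg string }

// toyChecker records which instantiations it has already resolved.
type toyChecker struct {
	resolved map[instantiation]bool
	symbols  int // member symbols created by this checker
}

// resolveMembers creates one symbol per member the first time this
// checker sees an instantiation; later lookups reuse them.
func (c *toyChecker) resolveMembers(inst instantiation, memberCount int) {
	if c.resolved[inst] {
		return
	}
	c.resolved[inst] = true
	c.symbols += memberCount
}

// symbolsCreated assumes the worst case: every checker resolves every
// instantiation at least once.
func symbolsCreated(checkerCount int, typeArgs []string, memberCount int) int {
	total := 0
	for i := 0; i < checkerCount; i++ {
		c := &toyChecker{resolved: make(map[instantiation]bool)}
		for _, arg := range typeArgs {
			inst := instantiation{"MyDataStructure", arg}
			c.resolveMembers(inst, memberCount)
			c.resolveMembers(inst, memberCount) // second use is a cache hit
		}
		total += c.symbols
	}
	return total
}

func main() {
	args := []string{"string", "number", "boolean"}
	fmt.Println(symbolsCreated(1, args, 100)) // 300
	fmt.Println(symbolsCreated(4, args, 100)) // 1200
}
```

Within one checker the cache does its job, so repeated uses are free. The multiplication only shows up across checkers, because they share nothing.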

A real-world example is Zod objects. Zod’s method chaining API returns a ZodObject instantiated with different type parameters:

const emailSchema = z.string().email().min(5).max(120).toLowerCase();

Each .string(), .email(), etc. instantiates some new ZodObject<Shape, Config> type, and the property chaining causes Typescript to resolve and create symbols (as well as allocating the individual types!).

Libraries like Drizzle and tRPC have APIs that do a similar thing, and when multiplied across multiple threads this leads to a lot of memory usage.

Conclusion

This was a fun dive into the tsgo source.

How could memory usage be made better in the future?

Garbage collecting types sounds promising, especially since Typescript types behave like regular values in a programming language. Transient types wouldn’t be bound to an AST node or anything and should get GC’ed.

Another option for reducing memory usage is using data structures which share data. This is used in FP languages where data structures like lists, maps, etc. are immutable and a naive implementation would mean duplicating the data on every append.
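For a feel of what structural sharing looks like, here is a minimal persistent list in Go. It is a generic FP-style sketch, not a proposal for how tsgo should store types; List, Prepend, and Len are made-up names, and it adds to the front (where sharing is trivial) rather than appending.

```go
package main

import "fmt"

// List is an immutable singly linked list. A nil *List is the empty list.
type List struct {
	head int
	tail *List
}

// Prepend returns a new list that shares l as its tail; l is not modified.
func Prepend(l *List, v int) *List {
	return &List{head: v, tail: l}
}

// Len walks the list and counts its nodes.
func Len(l *List) int {
	n := 0
	for ; l != nil; l = l.tail {
		n++
	}
	return n
}

func main() {
	base := Prepend(Prepend(nil, 2), 1) // [1 2]
	a := Prepend(base, 10)              // [10 1 2]
	b := Prepend(base, 20)              // [20 1 2]

	fmt.Println(Len(a), Len(b))   // 3 3
	fmt.Println(a.tail == b.tail) // true
}
```

Both a and b behave like full three-element lists, yet only one new node was allocated for each. The rest is the same memory, which is safe to share precisely because nothing ever mutates it.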

Incremental compilation should fix memory usage on subsequent tsgo runs.

Here is my fork which contains scripts to process pprof’s data as well as some modifications to tsgo code to emit profiling data.

Footnotes

[1] The getTupleTargetType() function creates a tuple and stores it in the tupleTypes map of Checker.

[2] In Typescript’s typechecker, getting a property on an object causes it to resolve the object’s members, which then calls instantiateSymbolTable().