Why does tsgo use so much memory?
If you run tsgo on a decently sized Typescript project, it’s not uncommon to see
it using gigabytes of memory.
Why is that?
The short answer is:
- when multi-threading, tsgo makes a type checker per thread
- each type checker has its own state (types, symbols, etc.)
- this state is not shared as synchronizing it between threads is costly
- so each type checker often allocates duplicate, redundant memory
- in addition, allocated types are never freed
It’s not uncommon for Typescript projects to have:
- several thousand Typescript files
- libraries like Zod, tRPC, Drizzle which result in many, many type instantiations
- recursive generic types which produce a lot of transient types which are never freed
When running tsgo on a large Typescript project, these type creation patterns
compound and result in a lot of duplicated or unused memory.
Let’s dig deeper.
Heap analysis
Let’s first get a breakdown of the heap so we can see what’s taking up so much memory.
I’ll run tsgo on a large Next.js project with Zod, tRPC, Drizzle, all the good
stuff that makes the typechecker do work. Including node_modules, it’s about 7k
.ts files.
We can use Go’s runtime/pprof package to capture peak heap snapshots and the
pprof tool to tell us which functions allocated the most memory with the
-inuse_space flag.
If we categorize the allocating functions by family (AST, typechecker, etc.) we see this:
Total live heap:             1471.9 MB
pprof writer self-overhead:    75.2 MB
real live data:              1321.5 MB

    MB    pct   Family
──────────────────────────────────────────────────────────────────────────────
594.72  45.0%   AST arenas (parser-allocated)
399.12  30.2%   Checker (type/signature computation)
121.79   9.2%   LinkStore (per-node/per-symbol caches)
 63.38   4.8%   OS / syscall / file I/O
 62.58   4.7%   Binder (symbol/flow declarations)
 22.33   1.7%   Parser (intern maps, etc.)
 20.24   1.5%   pkg: collections
 15.54   1.2%   Checker arenas
 13.46   1.0%   AST utilities
  6.58   0.5%   Compiler / module resolution
  1.10   0.1%   pkg: core
  0.70   0.1%   pkg: packagejson
What sticks out at first glance is that 45% of memory (about 595 MB) is allocated for AST nodes. That sounds like a lot, but it’s actually expected for the bulk of the memory allocated by a compiler to be taken up by AST nodes.
AST nodes also typically need to live for the duration of the compiler’s execution, so there’s nothing we can really do here. A lot of files means a lot of AST nodes!
I’m more interested in the memory allocated by the typechecker (the Checker
struct in the source).
What happens if we run tsgo with --singleThreaded?
Total live heap:             797.4 MB
pprof writer self-overhead:    3.6 MB
real live data:              790.2 MB

    MB    pct   Family
──────────────────────────────────────────────────────────────────────────────
522.95  66.2%   AST arenas (parser-allocated)
 63.37   8.0%   OS / syscall / file I/O
 62.63   7.9%   Binder (symbol/flow declarations)
 51.93   6.6%   Checker (type/signature computation)
 23.01   2.9%   LinkStore (per-node/per-symbol caches)
 22.51   2.8%   Parser (intern maps, etc.)
 16.78   2.1%   AST utilities
 16.15   2.0%   pkg: collections
 10.21   1.3%   Compiler / module resolution
  0.58   0.1%   pkg: packagejson
  0.10   0.0%   pkg: core
  0.01   0.0%   ** unclassified **
The typechecker takes up only ~50MB instead of ~400MB! This strongly suggests to me that there is some overhead with multi-threading here.
Let’s look deeper at the typechecker.
The type checker
The way tsgo multi-threads typechecking is by creating a pool of Checkers, one
per thread. By default the pool holds 4 checkers, capped by the number of files and at 256:
// internal/compiler/checkerpool.go
func newCheckerPoolWithTracing(program *Program, tr *tracing.Tracing) *checkerPool {
	checkerCount := 4
	if program.SingleThreaded() {
		checkerCount = 1
	} else if c := program.Options().Checkers; c != nil {
		checkerCount = *c
	}
	checkerCount = max(min(checkerCount, len(program.files), 256), 1)
	pool := &checkerPool{
		program:  program,
		checkers: make([]*checker.Checker, checkerCount),
		locks:    make([]*sync.Mutex, checkerCount),
		tracing:  tr,
	}
	return pool
}
When a Checker is created, it is given the entire Typescript program AST
and all its files:
// internal/checker/checker.go
func NewChecker(program Program, tracer *Tracer) (*Checker, *sync.Mutex) {
	program.BindSourceFiles()
	c := &Checker{}
	c.id = nextCheckerID.Add(1)
	c.tracer = tracer
	c.program = program
	c.compilerOptions = program.Options()
	c.files = program.SourceFiles()
	c.fileIndexMap = createFileIndexMap(c.files)
	// ... more code
}
During typechecking and emitting diagnostics for a file, each file gets assigned
to the next available Checker.
Each Checker has its own state for type-checking (which we’ll see in more
detail later). Here’s an example of duplicated work:
- File a.ts goes to Checker 1, which creates a bunch of types.
- File b.ts imports some type from a.ts and goes to Checker 2.
- Checker 2 has its own separate state, so it needs to recompute and re-allocate data for a.ts.
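To make the duplication concrete, here is a toy model of it. Everything in it is hypothetical: toyChecker, toyType and toyFile are stand-ins rather than tsgo's real types, and files are assigned round-robin purely to keep the output deterministic (tsgo hands each file to the next available Checker).

```go
package main

import "fmt"

// toyType and toyChecker are hypothetical stand-ins, not tsgo's real types.
type toyType struct{ name string }

type toyChecker struct {
	types map[string]*toyType // private cache: never shared, never freed
}

// getType returns the cached type or allocates a new one in this checker.
func (c *toyChecker) getType(name string) *toyType {
	if t, ok := c.types[name]; ok {
		return t
	}
	t := &toyType{name: name}
	c.types[name] = t
	return t
}

type toyFile struct {
	name string
	uses []string // names of the types the file needs resolved
}

// typesPerChecker assigns files round-robin (an assumption made for
// determinism) and reports how many types each checker allocated.
func typesPerChecker(files []toyFile, checkerCount int) []int {
	checkers := make([]*toyChecker, checkerCount)
	for i := range checkers {
		checkers[i] = &toyChecker{types: map[string]*toyType{}}
	}
	for i, f := range files {
		c := checkers[i%checkerCount]
		for _, name := range f.uses {
			c.getType(name)
		}
	}
	counts := make([]int, checkerCount)
	for i, c := range checkers {
		counts[i] = len(c.types)
	}
	return counts
}

func main() {
	files := []toyFile{
		{name: "a.ts", uses: []string{"User", "Post"}},
		{name: "b.ts", uses: []string{"User"}}, // imports User from a.ts
		{name: "c.ts", uses: []string{"User", "Post"}},
		{name: "d.ts", uses: []string{"Post"}},
	}
	fmt.Println(typesPerChecker(files, 1)) // [2]
	fmt.Println(typesPerChecker(files, 4)) // [2 1 2 1]
}
```

With one checker the program only ever holds 2 types. With four, the same 2 types are allocated 6 times in total because no checker can see another's cache.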
From the pprof run I noticed the top allocating Checker functions were:
- Checker.newSymbol() (symbols)
- Checker.newObjectType() (types)
- Checker.instantiateType() (types)
Let’s look at precisely what data is being allocated in each Checker.
Duplicated types
Each Checker has a lot of stores for the many types that could be constructed:
type Checker struct {
	stringLiteralTypes            map[string]*Type
	numberLiteralTypes            map[jsnum.Number]*Type
	bigintLiteralTypes            map[jsnum.PseudoBigInt]*Type
	enumLiteralTypes              map[EnumLiteralKey]*Type
	indexedAccessTypes            map[CacheHashKey]*Type
	templateLiteralTypes          map[CacheHashKey]*Type
	stringMappingTypes            map[StringMappingKey]*Type
	cachedTypes                   map[CachedTypeKey]*Type
	cachedSignatures              map[CachedSignatureKey]*Signature
	narrowedTypes                 map[NarrowedTypeKey]*Type
	assignmentReducedTypes        map[AssignmentReducedKey]*Type
	discriminatedContextualTypes  map[DiscriminatedContextualTypeKey]*Type
	instantiationExpressionTypes  map[InstantiationExpressionKey]*Type
	substitutionTypes             map[SubstitutionTypeKey]*Type
	reverseMappedCache            map[ReverseMappedTypeKey]*Type
	reverseHomomorphicMappedCache map[ReverseMappedTypeKey]*Type
	iterationTypesCache           map[IterationTypesKey]IterationTypes
	tupleTypes                    map[CacheHashKey]*Type
	unionTypes                    map[CacheHashKey]*Type
	unionOfUnionTypes             map[UnionOfUnionKey]*Type
	intersectionTypes             map[CacheHashKey]*Type
	propertiesTypes               map[PropertiesTypesKey]*Type
	flowLoopCache                 map[FlowLoopKey]*Type
	flowTypeCache                 map[*ast.Node]*Type
	errorTypes                    map[CacheHashKey]*Type
	// and many more!
}
Remember that:
- this memory belongs to a single Checker and there’s no sharing of the data.
- allocated types never get freed
This means there can be a lot of duplicated memory that sits around.
To verify this, let’s start by creating a file with some code that builds tuples:
type BuildTuple<L extends number, T extends any[] = []> =
  T['length'] extends L ? T : BuildTuple<L, [...T, any]>;
type TC = BuildTuple<100>;
declare const x: TC;
export const c0 = x[0];
export const cLen: 100 = x.length;
The BuildTuple<L, T> type will recursively build a tuple type from the empty
tuple type [] all the way to a tuple with 100 any’s in it ([any, any, ... any]).
Each iteration of the recursion creates a new tuple and caches it forever [1].
If we create 4 files with the content above and run them through tsgo, we should see roughly 100 tuple types created in
each of the 4 typecheckers (and roughly 100 number literal types as well).
Let’s see:
                      single checker    4 checkers
                      ──────────────    ─────────────────────────────
tupleTypes                       102    [102 102 102 102] → 408
numberLiteralTypes               101    [101 101 101 101] → 404
This illustrates two things:
- types will be redundantly created on different threads
- a recursive generic type can create a lot of transient types which take up memory
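The second point can be modelled outside the typechecker. The sketch below is a toy, not tsgo's code: tupleCache and buildTuple are made-up names, and the cache is keyed by tuple length only. It also doesn't reproduce the exact counts measured above; it just shows how caching every step of a recursion leaves the intermediates behind.

```go
package main

import "fmt"

// tupleCache is a hypothetical model of a per-checker tuple cache keyed by
// tuple length. Entries are added but never removed.
type tupleCache map[int][]string

// buildTuple mimics BuildTuple<L, T>: it grows the tuple one element at a
// time and caches every intermediate tuple along the way.
func buildTuple(cache tupleCache, target int, current []string) []string {
	if _, ok := cache[len(current)]; !ok {
		cache[len(current)] = current
	}
	if len(current) == target {
		return current
	}
	next := make([]string, len(current), len(current)+1)
	copy(next, current)
	next = append(next, "any")
	return buildTuple(cache, target, next)
}

func main() {
	cache := tupleCache{}
	result := buildTuple(cache, 100, nil)
	fmt.Println(len(result)) // 100
	fmt.Println(len(cache))  // 101 (lengths 0 through 100)
}
```

Only the last tuple is the one the program asked for. The other 100 entries are transient, but they stay in the cache for as long as the cache lives.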
This is just a trivial example. Imagine the level of duplication that could happen when typechecking many thousands of files.
Duplicated symbols
In compilers, named things (identifiers for functions, variables, etc.) often get recorded in a layer of indirection called a “symbol”.
Usually, this lets names be scoped (“foo” in the global scope and “foo” in function scope mean two different things) and also lets you give a stable handle to them in case you want to rename (e.g. minification).
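As a generic illustration of that idea (not tsgo's actual symbol representation), here is a minimal scoped symbol table. The symbol and scope types and their methods are hypothetical names chosen for the sketch.

```go
package main

import "fmt"

// symbol is a hypothetical, minimal symbol: a stable handle for a named thing.
type symbol struct {
	name string
}

// scope maps names to symbols and falls back to its parent on lookup.
type scope struct {
	parent  *scope
	symbols map[string]*symbol
}

func newScope(parent *scope) *scope {
	return &scope{parent: parent, symbols: map[string]*symbol{}}
}

// declare creates a new symbol for name in this scope, shadowing outer ones.
func (s *scope) declare(name string) *symbol {
	sym := &symbol{name: name}
	s.symbols[name] = sym
	return sym
}

// resolve walks outward from this scope until it finds name.
func (s *scope) resolve(name string) *symbol {
	for cur := s; cur != nil; cur = cur.parent {
		if sym, ok := cur.symbols[name]; ok {
			return sym
		}
	}
	return nil
}

func main() {
	global := newScope(nil)
	fn := newScope(global)

	globalFoo := global.declare("foo")
	localFoo := fn.declare("foo")

	fmt.Println(fn.resolve("foo") == localFoo)      // true
	fmt.Println(global.resolve("foo") == globalFoo) // true
	fmt.Println(globalFoo == localFoo)              // false

	// Renaming through the handle: every reference that resolved to
	// localFoo sees the new name, and globalFoo is untouched.
	localFoo.name = "a"
	fmt.Println(fn.resolve("foo").name, globalFoo.name) // a foo
}
```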
Each Checker stores a bunch of symbols:
type Checker struct {
	// ... more code
	symbolArena core.Arena[ast.Symbol]
	// ... more code
}
Are symbols being duplicated a lot?
I modified tsgo to dump the top symbol names (the string part of the symbol) when running 4 threads:
tsgo --checkers 4
| Symbol | Kind | Count |
| -------------- | -------- | -----: |
| `at` | Method | 34,500 |
| `_` | Property | 25,600 |
| `name` | Property | 24,700 |
| `value` | FuncVar | 22,800 |
| `@@iterator` | Method | 22,300 |
| `data` | Property | 22,100 |
| `enumValues` | Property | 21,900 |
| `columnType` | Property | 21,000 |
| `dataType` | Property | 21,000 |
| `generated` | Property | 19,500 |
Let’s look at the at symbol count. If it decreases with single-threaded tsgo
then that probably means other threads are duplicating it:
tsgo --checkers 1
| Symbol | Kind | Count |
| -------------- | -------- | -----: |
| `props` | FuncVar | 16,800 |
| `at` | Method | 14,600 |
| `children` | Property | 10,400 |
| `value` | FuncVar | 10,200 |
| `@@iterator` | Method | 9,500 |
| `className` | Property | 9,200 |
| `data` | Property | 8,500 |
| `forEach` | Method | 8,100 |
| `map` | Method | 8,000 |
| `find` | Method | 7,900 |
So there’s about 20k more at symbols created when running tsgo with 4 threads!
Let’s verify it by creating a little test file.
The at symbol is from Array<T>.prototype.at. We can force Typescript to create
this symbol by creating an Array<T> and doing any property lookup on it. This
causes Typescript to resolve all members of the Array object (and create their
symbols) [2]:
declare const arr: Array<string>;
export const len = arr.length;
Now we can create 4 files with these exact same contents. If we run tsgo with
--checkers 4, each file should go to its own Checker and we’ll see if it duplicates the
at symbol:
        --checkers 1    --checkers 4
        ────────────    ──────────────────────────
        total           total   c0   c1   c2   c3
at      1               4       1    1    1    1
So each checker duplicated the symbol for Array<string>.prototype.at.
Also note that new symbols are created for every new instantiation of a generic type, so Array<string>, Array<number>, etc.
will all get their own symbols for at and any other members. This is pretty standard and normal.
But you can start to see how it could be easy for tsgo to duplicate a lot of symbols on other threads.
Imagine your code creates some generic type with a lot of fields and methods, maybe for a data structure:
type MyDataStructure<T> = {
  field1: T;
  field2: string;
  // ...
  field100: string;
}
Each instantiation will create 100 symbols. And then perhaps if you import this
type in many files, it’s highly likely that it will be seen and duplicated
across more than one Checker.
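To get a feel for how quickly this multiplies, here is a back-of-the-envelope model. It is a sketch under a worst-case assumption (every checker ends up seeing every instantiation), and symbolChecker, memberSymbol and totalSymbols are hypothetical names rather than anything from tsgo.

```go
package main

import "fmt"

type memberSymbol struct{ name string }

// symbolChecker holds a hypothetical per-checker cache of member symbols,
// keyed by instantiation (e.g. "MyDataStructure<string>").
type symbolChecker struct {
	members map[string][]*memberSymbol
}

// resolveMembers creates one symbol per member the first time this checker
// sees an instantiation, and reuses them afterwards.
func (c *symbolChecker) resolveMembers(instantiation string, memberCount int) {
	if _, ok := c.members[instantiation]; ok {
		return
	}
	syms := make([]*memberSymbol, memberCount)
	for i := range syms {
		syms[i] = &memberSymbol{name: fmt.Sprintf("field%d", i+1)}
	}
	c.members[instantiation] = syms
}

// totalSymbols counts symbols across all checkers, assuming (worst case)
// that every checker sees every instantiation.
func totalSymbols(instantiations []string, memberCount, checkerCount int) int {
	total := 0
	for i := 0; i < checkerCount; i++ {
		c := &symbolChecker{members: map[string][]*memberSymbol{}}
		for _, inst := range instantiations {
			c.resolveMembers(inst, memberCount)
			c.resolveMembers(inst, memberCount) // second use hits the cache
		}
		for _, syms := range c.members {
			total += len(syms)
		}
	}
	return total
}

func main() {
	insts := []string{
		"MyDataStructure<string>",
		"MyDataStructure<number>",
		"MyDataStructure<boolean>",
	}
	fmt.Println(totalSymbols(insts, 100, 1)) // 300
	fmt.Println(totalSymbols(insts, 100, 4)) // 1200
}
```

In this model the symbol count is members × instantiations × checkers, so the cache inside a single checker helps with repeated uses but does nothing across checkers.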
A real-world example is Zod objects. Zod’s method chaining API returns a ZodObject instantiated with different type arguments:
const emailSchema = z.string().email().min(5).max(120).toLowerCase();
Each call in the chain (.string(), .email(), etc.) returns a schema type (for object
schemas, some new ZodObject<Shape, Config> instantiation), and the property chaining
causes Typescript to resolve members and create symbols (as well as allocating the
individual types!).
Libraries like Drizzle and tRPC have APIs that do a similar thing, and when multiplied across multiple threads this leads to a lot of memory usage.
Conclusion
This was a fun dive into the tsgo source.
How could memory usage be made better in the future?
Garbage collecting types sounds promising, especially since Typescript types behave like regular values in a programming language. Transient types wouldn’t be bound to an AST node or anything and should get GC’ed.
Another option for reducing memory usage is using data structures which share data (structural sharing). This is common in FP languages, where data structures like lists and maps are immutable and a naive implementation would mean duplicating the data on every append.
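For readers who haven't seen the technique, here is a minimal sketch of structural sharing with an immutable linked list. It is a generic example, not a proposal for how tsgo should store types, and node, prepend and length are names invented for it.

```go
package main

import "fmt"

// node is an immutable singly linked list cell. Once built it is never mutated.
type node struct {
	value string
	next  *node
}

// prepend returns a new list whose tail is the existing list, shared as-is.
func prepend(list *node, value string) *node {
	return &node{value: value, next: list}
}

// length counts the cells reachable from list.
func length(list *node) int {
	n := 0
	for cur := list; cur != nil; cur = cur.next {
		n++
	}
	return n
}

func main() {
	base := prepend(prepend(nil, "string"), "number") // number -> string
	a := prepend(base, "boolean")                     // boolean -> number -> string
	b := prepend(base, "bigint")                      // bigint -> number -> string

	fmt.Println(length(a), length(b))           // 3 3
	fmt.Println(a.next == base, b.next == base) // true true
}
```

Both a and b have three elements, but only four cells are allocated in total because the two-element tail is shared rather than copied.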
Incremental compilation should fix memory usage on subsequent tsgo runs.
Here is my fork which contains scripts to process pprof’s data as well as some
modifications to tsgo code to emit profiling data.
Footnotes
[1] The getTupleTargetType() function creates a tuple type and stores it in the tupleTypes map of the Checker.
[2] In Typescript’s typechecker, getting a property on an object causes it to resolve the object’s members, which then calls instantiateSymbolTable().