How often does Python allocate?


I recently saw this tweet which I thought was funny:

[Image: the “Python allocating” tweet]

How true is it?

I was surprised to learn that CPython represents each integer as a heap-allocated PyLongObject*.

Does that mean Python heap-allocates every integer you create? Integers are probably the most used data type in any program, so that would mean a lot of heap allocations.

Beyond the heap allocations themselves, this could make basic arithmetic operations like adding two numbers 200-500x slower [1].

Or maybe it uses the industry-standard pointer tagging technique to avoid allocating small integers?
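For reference, pointer tagging looks roughly like this (a minimal sketch in plain C, not CPython code): heap pointers are aligned, so their low bit is always zero, and a runtime can stash a small integer directly in the word by setting that bit as a tag.

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical tagged value, NOT CPython's representation. Heap pointers
 * are aligned to at least 2 bytes, so their lowest bit is always 0 and we
 * can use it to mark "this word holds an integer, not a pointer". */
typedef uintptr_t Value;

static inline Value make_small_int(intptr_t n) {
    return ((uintptr_t)n << 1) | 1;   /* low bit = 1 means "inline integer" */
}

static inline bool is_small_int(Value v) {
    return v & 1;
}

static inline intptr_t small_int_value(Value v) {
    return (intptr_t)v >> 1;          /* arithmetic shift drops the tag bit */
}

With a scheme like this, small integers never touch the heap at all.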

To figure this out, I modified CPython to print whenever it allocates a PyLongObject*:

static PyLongObject *
long_alloc(Py_ssize_t size)
{
    assert(size >= 0);
    PyLongObject *result = NULL;

    /* ... */

    Py_ssize_t ndigits = size ? size : 1;

    if (ndigits == 1) {
        result = (PyLongObject *)_Py_FREELIST_POP(PyLongObject, ints);
    }
    
    if (result == NULL) {
        printf("ALLOCATING a new number object\n");
        
        result = PyObject_Malloc(offsetof(PyLongObject, long_value.ob_digit) +
                                ndigits*sizeof(digit));
        if (!result) {
            PyErr_NoMemory();
            return NULL;
        }
        _PyObject_Init((PyObject*)result, &PyLong_Type);
    }

    /* ...omitted code... */
    return result;
}

And wrote a script to add some numbers 100k times:

# lol.py
for i in range(0, 100_000):
  print(i + 1)

And then counted the number of times it allocated:

echo "Allocating number object $(
    ./python.exe lol.py \
    | rg -o ALLOCATING \
    | wc -l
) times"
Allocating number object 100904 times

Huh, that seems like a lot!

It looks like it could be allocating once per iteration of the loop.

Let’s take out the print statement and see if it’s just the addition:

# lol.py
for i in range(0, 100_000):
  a = i + 1

echo "Allocating number object $(
    ./python.exe lol.py \
    | rg -o ALLOCATING \
    | wc -l
) times"

Allocating number object 905 times

That’s almost 100k fewer, so it does look like the printing was responsible, and CPython has some optimization in place to avoid allocations for the addition itself.

Let’s look at the internal function to add two integers, long_add(a, b).

Just by looking at its signature, the return type is PyLongObject*, which could mean that it’s allocating a new object every time you add two integers:

static PyLongObject *
long_add(PyLongObject *a, PyLongObject *b)
{
    // An integer is "compact" if it fits in a single 30-bit digit (|value| < 2^30)
    if (_PyLong_BothAreCompact(a, b)) {
        // Unwrap the values of `a` and `b` and add them
        stwodigits z = medium_value(a) + medium_value(b);
        return _PyLong_FromSTwoDigits(z);
    }
    
    /* ... more code ... */
}

The _PyLong_BothAreCompact(...) call checks a bitfield on PyLongObject to see if the integer is “compact” (its value fits in a single 30-bit digit), which will always be the case for our script.

The code appears to be unwrapping the integer values for a and b, which are represented inside of PyLongObject structs, and adding them together.

The reason the result of adding them has the type stwodigits is that PyLongObject stores the number in base 2^30: essentially, each digit is a 30-bit integer.
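To make that concrete, here’s a tiny standalone snippet (mine, not CPython’s) that splits a 64-bit value into 30-bit digits, least significant digit first:

#include <stdint.h>
#include <stdio.h>

#define MY_SHIFT 30                      /* bits per digit, like CPython */
#define MY_MASK  ((1u << MY_SHIFT) - 1)

int main(void) {
    uint64_t n = (1ULL << 35) + 7;       /* needs two 30-bit digits */
    uint32_t digits[3] = {0};
    int ndigits = 0;
    while (n) {
        digits[ndigits++] = (uint32_t)(n & MY_MASK);
        n >>= MY_SHIFT;
    }
    /* prints digit[0]=7, digit[1]=32, i.e. 32 * 2^30 + 7 */
    for (int i = 0; i < ndigits; i++)
        printf("digit[%d]=%u\n", i, digits[i]);
    return 0;
}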

Let’s look at _PyLong_FromSTwoDigits:

/* Create a new int object from a C word-sized int */
static inline PyLongObject *
_PyLong_FromSTwoDigits(stwodigits x)
{
    if (IS_SMALL_INT(x)) {
        return (PyLongObject*)get_small_int((sdigit)x);
    }
    assert(x != 0);
    /* fits in a single 30-bit digit */
    if (is_medium_int(x)) {
        return (PyLongObject*)_PyLong_FromMedium((sdigit)x);
    }
    return (PyLongObject*)_PyLong_FromLarge(x);
}

There are three separate code paths:

  1. when x is a “small int”: call get_small_int(x).

  2. when x is a “medium int”: call _PyLong_FromMedium(x).

  3. when x is a “large int”: call _PyLong_FromLarge(x).

The call to get_small_int() is catching my eye. Is it like the tagged pointer trick I mentioned earlier?

static PyObject *
get_small_int(sdigit ival)
{
    assert(IS_SMALL_INT(ival));
    return (PyObject *)&_PyLong_SMALL_INTS[_PY_NSMALLNEGINTS + ival];
}

Nope, it seems there is a pre-allocated array of objects for integers in the range -5 to 1025 (exclusive) [2]. This would account for about 1025 iterations of our loop, but not for the rest.
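As an aside, you can see the cache from the C API side too. Here’s a quick standalone check (my example, not from the modified build used above): asking for the same small value twice should hand back the exact same pre-allocated object.

#include <Python.h>

int main(void) {
    Py_Initialize();
    /* 100 falls inside the small-int cache range, so both calls should
     * return the same pre-allocated singleton rather than allocating. */
    PyObject *a = PyLong_FromLong(100);
    PyObject *b = PyLong_FromLong(100);
    printf("same object: %s\n", a == b ? "yes" : "no");  /* expected: yes */
    Py_Finalize();
    return 0;
}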

Let’s look at the medium case (meaning x still fits in a single 30-bit digit, but isn’t one of the cached small ints) and the _PyLong_FromMedium() function:

static PyObject *
_PyLong_FromMedium(sdigit x)
{
    assert(!IS_SMALL_INT(x));
    assert(is_medium_int(x));

    PyLongObject *v = (PyLongObject *)_Py_FREELIST_POP(PyLongObject, ints);
    if (v == NULL) {
        v = PyObject_Malloc(sizeof(PyLongObject));
        if (v == NULL) {
            PyErr_NoMemory();
            return NULL;
        }
        _PyObject_Init((PyObject*)v, &PyLong_Type);
    }
    digit abs_x = x < 0 ? -x : x;
    _PyLong_SetSignAndDigitCount(v, x<0?-1:1, 1);
    v->long_value.ob_digit[0] = abs_x;
    return (PyObject*)v;
}

Okay, so it isn’t using the long_alloc() function from earlier; instead it tries to grab a PyLongObject* from a freelist (using _Py_FREELIST_POP()), and only allocates memory for it if that fails.

A “freelist” is a common memory-allocation technique: keep track of freed allocations so they can be reused later, avoiding new allocations where possible.
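As a rough illustration (not CPython’s actual code), a freelist is often just a singly linked list threaded through the freed blocks themselves:

#include <stdlib.h>

/* Minimal freelist sketch: freed blocks are linked together and handed
 * back out on the next allocation instead of calling malloc() again.
 * (Assumes every allocation is at least sizeof(FreeNode) bytes.) */
typedef struct FreeNode { struct FreeNode *next; } FreeNode;

static FreeNode *freelist = NULL;

void *obj_alloc(size_t size) {
    if (freelist != NULL) {              /* reuse a previously freed block */
        void *p = freelist;
        freelist = freelist->next;
        return p;
    }
    return malloc(size);                 /* fall back to the system allocator */
}

void obj_free(void *p) {
    FreeNode *node = p;                  /* push the block back for reuse */
    node->next = freelist;
    freelist = node;
}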

If we look at the long_dealloc() function, there’s a codepath specifically for pushing the allocation to the freelist and not calling free on it so it can be reused:

static void
long_dealloc(PyObject *self)
{
    /* ... */
       
    if (PyLong_CheckExact(self) && _PyLong_IsCompact((PyLongObject *)self)) {
        _Py_FREELIST_FREE(ints, self, PyObject_Free);
        return;
    }
    Py_TYPE(self)->tp_free(self);
}

I deleted the printf() I added to long_alloc() earlier, and added two print statements to _PyLong_FromMedium() to print if it allocated or reused memory:
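Roughly, the modification looked something like this (my reconstruction; the exact placement is a guess, but the strings match the output below):

    PyLongObject *v = (PyLongObject *)_Py_FREELIST_POP(PyLongObject, ints);
    if (v == NULL) {
        printf("ALLOCATING\n");          /* freelist was empty: real allocation */
        v = PyObject_Malloc(sizeof(PyLongObject));
        /* ... */
    }
    else {
        printf("REUSING\n");             /* got a recycled object from the freelist */
    }

Running the script again and counting: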

> ./python.exe lol.py | rg 'ALLOCATING|REUSING' | sort | uniq -c
  102 ALLOCATING
99193 REUSING

Our script appears to actually be reusing most of the PyLongObject objects!

So where did the allocation come from when our script was print()’ing the integers? It turns out the implementation of print() allocates a scratch PyLongObject* for conversion purposes, even though this can be avoided in most cases [3].

The cost of allocating

It’s nice to see that simply adding two numbers in a loop does not allocate memory every time.

We just saw an optimization to reduce the frequency of allocations (by reusing them), but I also want to poke around and see if Python is doing anything to amortize the actual cost of an allocation.

Here’s the internal allocation function used for objects:

static inline void*
pymalloc_alloc(OMState *state, void *Py_UNUSED(ctx), size_t nbytes)
{
    if (UNLIKELY(nbytes == 0)) {
        return NULL;
    }
    if (UNLIKELY(nbytes > SMALL_REQUEST_THRESHOLD)) {
        return NULL;
    }

    uint size = (uint)(nbytes - 1) >> ALIGNMENT_SHIFT;
    poolp pool = usedpools[size + size];
    pymem_block *bp;

    if (LIKELY(pool != pool->nextpool)) {
        /*
         * There is a used pool for this size class.
         * Pick up the head block of its free list.
         */
        ++pool->ref.count;
        bp = pool->freeblock;
        assert(bp != NULL);

        if (UNLIKELY((pool->freeblock = *(pymem_block **)bp) == NULL)) {
            // Reached the end of the free list, try to extend it.
            pymalloc_pool_extend(pool, size);
        }
    }
    else {
        /* There isn't a pool of the right size class immediately
         * available:  use a free pool.
         */
        bp = allocate_from_new_pool(state, size);
    }

    return (void *)bp;
}

It looks like the memory allocator for objects is a pool allocator with multiple different pools based on the size of the object.

The pool allocator splits its backing buffer into fixed-size chunks and is typically implemented with a freelist. This comes with some pretty nice advantages.

First, because everything is fixed in size, allocating and freeing become very simple (and thus fast!), especially since the allocator doesn’t need any of the complicated bookkeeping malloc(...) does to handle arbitrary allocation sizes.

There’s also less fragmentation (fragmentation arises from variable allocation sizes), which means freed memory can be reused optimally.
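Here’s a toy version of the idea (my sketch, not CPython’s pymalloc): carve a backing buffer into fixed-size blocks once, thread a freelist through them, and then every allocation and free is just a pointer pop or push.

#include <stddef.h>

#define BLOCK_SIZE  32
#define NUM_BLOCKS  1024

/* Toy fixed-size pool allocator: one size class, all blocks carved out of
 * a single backing buffer, free blocks linked through the blocks themselves. */
typedef struct Block { struct Block *next; } Block;

static char   backing[BLOCK_SIZE * NUM_BLOCKS];
static Block *free_head = NULL;

void pool_init(void) {
    for (size_t i = 0; i < NUM_BLOCKS; i++) {
        Block *b = (Block *)&backing[i * BLOCK_SIZE];
        b->next = free_head;             /* push every block onto the freelist */
        free_head = b;
    }
}

void *pool_alloc(void) {                 /* O(1): pop the freelist head */
    if (free_head == NULL)
        return NULL;                     /* pool exhausted */
    Block *b = free_head;
    free_head = b->next;
    return b;
}

void pool_free(void *p) {                /* O(1): push the block back */
    Block *b = p;
    b->next = free_head;
    free_head = b;
}

CPython’s version is fancier, with many size classes and pools that grow on demand, but the core alloc/free fast path is this cheap.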

Similarly to an arena, you pay most of the cost up front by pre-allocating a backing buffer at initialization. In fact, in CPython’s case the pool’s backing buffer itself is carved out of an arena whose capacity is 1 MiB or 256 KiB:

#ifdef USE_LARGE_ARENAS
    #define ARENA_BITS              20                    /* 1 MiB */
#else
    #define ARENA_BITS              18                    /* 256 KiB */
#endif

#define ARENA_SIZE              (1 << ARENA_BITS)
#define ARENA_SIZE_MASK         (ARENA_SIZE - 1)

Also, CPython uses mmap() with MAP_ANONYMOUS, so it reserves 1 MiB of virtual memory up front and physical memory is only committed on demand, via page faults.
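Reserving the arena like that looks roughly like this (a simplified sketch, not CPython’s exact call):

#include <stddef.h>
#include <sys/mman.h>

static void *reserve_arena(size_t size) {
    /* MAP_ANONYMOUS: not backed by a file. The kernel only commits physical
     * pages lazily, when a page is first touched (on a page fault). */
    void *arena = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return arena == MAP_FAILED ? NULL : arena;
}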

With all these tricks, our script (the version that prints and does allocate in the loop) is probably reusing blocks from the pool allocator far more often than it’s actually acquiring new memory (at least for PyLongObject integers).

Conclusion

So to finally answer the title question “How often does Python allocate?”, the answer is what everyone expected: a lot. There are some good optimizations, but it could definitely do better :)

Despite the optimizations, there’s obviously still a lot of overhead in executing all this allocation code, when in theory adding two integers should only take the CPU a single ADD instruction.

In addition, the representation of PyLongObject is designed to handle both small and arbitrarily big integers alike, meaning the code for the slow and unlikely path (using a really big integer in your Python script) pessimizes the fast path (using regular-sized integers).

To be honest, I’m a little surprised there isn’t a fast-path optimization here. Even Smalltalk in the ’80s used pointer tagging to avoid heap-allocating integers [4]!

But we do see a good example of how using specialized memory allocators lets us greatly reduce some of the costs of memory allocation.

This is why Zig is one of my favorite languages. The answer to the question “I wonder if this is allocating?” is “does this function take an allocator?” [5]. The std.mem.Allocator interface lets you easily swap out allocators, and the entire standard library is designed to accept a std.mem.Allocator whenever it wants to allocate memory.

Footnotes

[1]

According to this chart of the cost of CPU operations in cycles, allocating + deallocating an object could take 200-500 cycles. Meanwhile, an ADD op is listed as taking less than 1 cycle (since modern CPUs have multiple ALU units, meaning more parallelism). So the 200-500x slower estimate is quite modest.

Also, this does not include the other overheads of Python’s integer representation beyond allocation.

For example, this is the code path for reading the value of a compact PyLongObject* back out as a Py_ssize_t:

static inline Py_ssize_t
_PyLong_CompactValue(const PyLongObject *op)
{
    Py_ssize_t sign;
    sign = 1 - (op->long_value.lv_tag & _PyLong_SIGN_MASK);
    return sign * (Py_ssize_t)op->long_value.ob_digit[0];
}

There’s a pointer indirection, a bit mask, a subtraction, a multiplication, and then another pointer indirection from reading ob_digit[0].

[2]

We can see the declaration of these small ints here:

#define _PY_NSMALLPOSINTS           1025
#define _PY_NSMALLNEGINTS           5

struct _Py_static_objects {
    struct {
        /* Small integers are preallocated in this array so that they
         * can be shared.
         * The integers that are preallocated are those in the range
         * -_PY_NSMALLNEGINTS (inclusive) to _PY_NSMALLPOSINTS (exclusive).
         */
        PyLongObject small_ints[_PY_NSMALLNEGINTS + _PY_NSMALLPOSINTS];
        
        /* ... */
    } singletons;
};

Now, the nice thing about these is that they require no heap allocation (notice the items in the array are PyLongObject values, not pointers).

And the runtime avoids accidentally calling free() on these static small ints by setting the IMMORTALITY_BIT_MASK bit, which marks the small int object as “immortal”.

[3]

From briefly glancing at the function that does the allocating inside print()’s internals, most of the logic is complicated by the fact that integers are represented in base 2^30 (to allow for arbitrarily long integers) and must be converted back to base 10 for display. It allocates a new PyLongObject with enough space to fit the base-10 digits.

But realistically the majority of integers in a program are going to be less than 2^30, so why not introduce a fast path that skips this complicated code entirely?

In addition, there’s no need to heap-allocate the list of digits in every case. If you know the list of digits is small (which is often the case), you can stack-allocate it.

This is what the Zig standard library does, for example.
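A hedged sketch of the general trick (names are mine, not Zig’s or CPython’s): keep a small fixed-size buffer on the stack and only fall back to the heap when the number is genuinely huge.

#include <stdio.h>
#include <stdlib.h>

/* Sketch: format a value into a caller-provided stack buffer when it fits,
 * and only heap-allocate for the (rare) case where it doesn't. */
char *format_int(long long value, char *stack_buf, size_t stack_len) {
    int needed = snprintf(NULL, 0, "%lld", value) + 1;
    if ((size_t)needed <= stack_len) {
        snprintf(stack_buf, stack_len, "%lld", value);
        return stack_buf;                      /* common case: zero allocations */
    }
    char *heap_buf = malloc((size_t)needed);   /* rare case: fall back to the heap */
    if (heap_buf)
        snprintf(heap_buf, (size_t)needed, "%lld", value);
    return heap_buf;
}

/* usage: char buf[32]; char *s = format_int(42, buf, sizeof buf); */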

[4]

See “Representation of Small Integers” on page 565 of “Smalltalk-80: The Language and its Implementation”, also known as the “Blue Book”.

[5]

A good quote by paperclover!