How often does Python allocate?


I recently saw this tweet which I thought was funny:

[Image: the “Python allocating” tweet]

How true is it?

I was surprised to learn that CPython represents each integer as a heap-allocated PyLongObject*.

Does that mean Python heap-allocates every integer you create? Integers are probably the most used data type in any program, so that would mean a lot of heap allocations.

Beyond the heap allocations themselves, this could make basic arithmetic operations like adding two numbers 200-500x slower [1].

Or maybe it uses the industry-standard pointer tagging technique to avoid allocating small integers?
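For reference, pointer tagging looks roughly like this (a minimal sketch in plain C, not CPython code): heap pointers are aligned, so their low bit is always zero, and a runtime can stash a small integer directly in the word by setting that bit as a tag.

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical tagged value, NOT CPython's representation. Heap pointers
 * are aligned to at least 2 bytes, so their lowest bit is always 0 and we
 * can use it to mark "this word holds an integer, not a pointer". */
typedef uintptr_t Value;

static inline Value make_small_int(intptr_t n) {
    return ((uintptr_t)n << 1) | 1;   /* low bit = 1 means "inline integer" */
}

static inline bool is_small_int(Value v) {
    return v & 1;
}

static inline intptr_t small_int_value(Value v) {
    return (intptr_t)v >> 1;          /* arithmetic shift drops the tag bit */
}

With a scheme like this, small integers never touch the heap at all.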

To figure this out, I modified CPython to print whenever it allocates a PyLongObject*:

static PyLongObject *
long_alloc(Py_ssize_t size)
{
    assert(size >= 0);
    PyLongObject *result = NULL;

    /* ... */

    Py_ssize_t ndigits = size ? size : 1;

    if (ndigits == 1) {
        result = (PyLongObject *)_Py_FREELIST_POP(PyLongObject, ints);
    }
    
    if (result == NULL) {
        printf("ALLOCATING a new number object\n");
        
        result = PyObject_Malloc(offsetof(PyLongObject, long_value.ob_digit) +
                                ndigits*sizeof(digit));
        if (!result) {
            PyErr_NoMemory();
            return NULL;
        }
        _PyObject_Init((PyObject*)result, &PyLong_Type);
    }

    /* ...omitted code... */
    return result;
}

And wrote a script to add some numbers 100k times:

# lol.py
for i in range(0, 100_000):
  print(i + 1)

And then counted the number of times it allocated:

echo "Allocating number object $(
    ./python.exe lol.py \
    | rg -o ALLOCATING \
    | wc -l
) times"
Allocating number object 100904 times

Huh, that seems like a lot!

It looks like it could be allocating once per iteration of the loop.

Let’s take out the print statement and see if it’s just the addition:

# lol.py
for i in range(0, 100_000):
  a = i + 1

echo "Allocating number object $(
    ./python.exe lol.py \
    | rg -o ALLOCATING \
    | wc -l
) times"

Allocating number object 905 times

That’s almost 100k fewer, so it does look like the printing was responsible, and CPython has some optimization in place to avoid allocations for the addition itself.

Let’s look at the internal function to add two integers, long_add(a, b).

Just by looking at its signature, the return type is PyLongObject*, which could mean that it’s allocating a new object every time you add two integers:

static PyLongObject *
long_add(PyLongObject *a, PyLongObject *b)
{
    // An integer is "compact" if it fits in a single 30-bit digit (|value| < 2^30)
    if (_PyLong_BothAreCompact(a, b)) {
        // Unwrap the values of `a` and `b` and add them
        stwodigits z = medium_value(a) + medium_value(b);
        return _PyLong_FromSTwoDigits(z);
    }
    
    /* ... more code ... */
}

The _PyLong_BothAreCompact(...) call checks a bitfield on PyLongObject to see if the integer is “compact” (its value fits in a single 30-bit digit), which will always be the case for our script.

The code appears to be unwrapping the integer values for a and b, which are represented inside of PyLongObject structs, and adding them together.

The reason the result of adding them has the type stwodigits is that PyLongObject stores the number in base 2^30: essentially, each digit is a 30-bit integer.
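To make that concrete, here’s a tiny standalone snippet (mine, not CPython’s) that splits a 64-bit value into 30-bit digits, least significant digit first:

#include <stdint.h>
#include <stdio.h>

#define MY_SHIFT 30                      /* bits per digit, like CPython */
#define MY_MASK  ((1u << MY_SHIFT) - 1)

int main(void) {
    uint64_t n = (1ULL << 35) + 7;       /* needs two 30-bit digits */
    uint32_t digits[3] = {0};
    int ndigits = 0;
    while (n) {
        digits[ndigits++] = (uint32_t)(n & MY_MASK);
        n >>= MY_SHIFT;
    }
    /* prints digit[0]=7, digit[1]=32, i.e. 32 * 2^30 + 7 */
    for (int i = 0; i < ndigits; i++)
        printf("digit[%d]=%u\n", i, digits[i]);
    return 0;
}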

Let’s look at _PyLong_FromSTwoDigits:

/* Create a new int object from a C word-sized int */
static inline PyLongObject *
_PyLong_FromSTwoDigits(stwodigits x)
{
    if (IS_SMALL_INT(x)) {
        return (PyLongObject*)get_small_int((sdigit)x);
    }
    assert(x != 0);
    /* fits in a single 30-bit digit */
    if (is_medium_int(x)) {
        return (PyLongObject*)_PyLong_FromMedium((sdigit)x);
    }
    return (PyLongObject*)_PyLong_FromLarge(x);
}

There are three separate code paths:

  1. when x is a “small int”: call get_small_int(x).

  2. when x is a “medium int”: call _PyLong_FromMedium(x).

  3. when x is a “large int”: call _PyLong_FromLarge(x).

The call to get_small_int() is catching my eye. Is it like the tagged pointer trick I mentioned earlier?

static PyObject *
get_small_int(sdigit ival)
{
    assert(IS_SMALL_INT(ival));
    return (PyObject *)&_PyLong_SMALL_INTS[_PY_NSMALLNEGINTS + ival];
}

Nope, it seems there is a pre-allocated array of objects for integers in the range -5 to 1025 (exclusive) [2]. This would account for about 1025 iterations of our loop, but not for the rest.
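As an aside, you can see the cache from the C API side too. Here’s a quick standalone check (my example, not from the modified build used above): asking for the same small value twice should hand back the exact same pre-allocated object.

#include <Python.h>

int main(void) {
    Py_Initialize();
    /* 100 falls inside the small-int cache range, so both calls should
     * return the same pre-allocated singleton rather than allocating. */
    PyObject *a = PyLong_FromLong(100);
    PyObject *b = PyLong_FromLong(100);
    printf("same object: %s\n", a == b ? "yes" : "no");  /* expected: yes */
    Py_Finalize();
    return 0;
}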

Let’s look at the medium case (meaning x still fits in a single 30-bit digit, but isn’t one of the cached small ints) and the _PyLong_FromMedium() function:

static PyObject *
_PyLong_FromMedium(sdigit x)
{
    assert(!IS_SMALL_INT(x));
    assert(is_medium_int(x));

    PyLongObject *v = (PyLongObject *)_Py_FREELIST_POP(PyLongObject, ints);
    if (v == NULL) {
        v = PyObject_Malloc(sizeof(PyLongObject));
        if (v == NULL) {
            PyErr_NoMemory();
            return NULL;
        }
        _PyObject_Init((PyObject*)v, &PyLong_Type);
    }
    digit abs_x = x < 0 ? -x : x;
    _PyLong_SetSignAndDigitCount(v, x<0?-1:1, 1);
    v->long_value.ob_digit[0] = abs_x;
    return (PyObject*)v;
}

Okay, so it isn’t using the long_alloc() function from earlier; instead it tries to grab a PyLongObject* from a freelist (using _Py_FREELIST_POP()), and only allocates memory for it if that fails.

A “freelist” is a common memory-allocation technique: keep track of freed allocations so they can be reused later, avoiding new allocations where possible.
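As a rough illustration (not CPython’s actual code), a freelist is often just a singly linked list threaded through the freed blocks themselves:

#include <stdlib.h>

/* Minimal freelist sketch: freed blocks are linked together and handed
 * back out on the next allocation instead of calling malloc() again.
 * (Assumes every allocation is at least sizeof(FreeNode) bytes.) */
typedef struct FreeNode { struct FreeNode *next; } FreeNode;

static FreeNode *freelist = NULL;

void *obj_alloc(size_t size) {
    if (freelist != NULL) {              /* reuse a previously freed block */
        void *p = freelist;
        freelist = freelist->next;
        return p;
    }
    return malloc(size);                 /* fall back to the system allocator */
}

void obj_free(void *p) {
    FreeNode *node = p;                  /* push the block back for reuse */
    node->next = freelist;
    freelist = node;
}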

If we look at the long_dealloc() function, there’s a codepath specifically for pushing the allocation to the freelist and not calling free on it so it can be reused:

static void
long_dealloc(PyObject *self)
{
    /* ... */
       
    if (PyLong_CheckExact(self) && _PyLong_IsCompact((PyLongObject *)self)) {
        _Py_FREELIST_FREE(ints, self, PyObject_Free);
        return;
    }
    Py_TYPE(self)->tp_free(self);
}

I deleted the printf() I added to long_alloc() earlier, and added two print statements to _PyLong_FromMedium() to print if it allocated or reused memory:
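Roughly, the modification looked something like this (my reconstruction; the exact placement is a guess, but the strings match the output below):

    PyLongObject *v = (PyLongObject *)_Py_FREELIST_POP(PyLongObject, ints);
    if (v == NULL) {
        printf("ALLOCATING\n");          /* freelist was empty: real allocation */
        v = PyObject_Malloc(sizeof(PyLongObject));
        /* ... */
    }
    else {
        printf("REUSING\n");             /* got a recycled object from the freelist */
    }

Running the script again and counting: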

> ./python.exe lol.py | rg 'ALLOCATING|REUSING' | sort | uniq -c
  102 ALLOCATING
99193 REUSING

Our script appears to actually be reusing most of the PyLongObject objects!

So where did the allocation come from when our script was print()’ing the integers? It turns out the implementation of print() allocates a scratch PyLongObject* for conversion purposes, even though this can be avoided in most cases [3].

The cost of allocating

It’s nice to see that simply adding two numbers in a loop does not allocate memory every time.

We just saw an optimization to reduce the frequency of allocations (by reusing them), but I also want to poke around and see if Python is doing anything to amortize the actual cost of an allocation.

Here’s the internal allocation function used for objects:

static inline void*
pymalloc_alloc(OMState *state, void *Py_UNUSED(ctx), size_t nbytes)
{
    if (UNLIKELY(nbytes == 0)) {
        return NULL;
    }
    if (UNLIKELY(nbytes > SMALL_REQUEST_THRESHOLD)) {
        return NULL;
    }

    uint size = (uint)(nbytes - 1) >> ALIGNMENT_SHIFT;
    poolp pool = usedpools[size + size];
    pymem_block *bp;

    if (LIKELY(pool != pool->nextpool)) {
        /*
         * There is a used pool for this size class.
         * Pick up the head block of its free list.
         */
        ++pool->ref.count;
        bp = pool->freeblock;
        assert(bp != NULL);

        if (UNLIKELY((pool->freeblock = *(pymem_block **)bp) == NULL)) {
            // Reached the end of the free list, try to extend it.
            pymalloc_pool_extend(pool, size);
        }
    }
    else {
        /* There isn't a pool of the right size class immediately
         * available:  use a free pool.
         */
        bp = allocate_from_new_pool(state, size);
    }

    return (void *)bp;
}

It looks like the memory allocator for objects is a pool allocator with multiple different pools based on the size of the object.

The pool allocator splits its backing buffer into fixed-size chunks and is typically implemented with a freelist. This comes with some pretty nice advantages.

First, because everything is fixed in size, allocating and freeing become very simple (and thus fast!), especially since the allocator doesn’t need any of the complicated bookkeeping malloc(...) does to handle arbitrary allocation sizes.

There’s also less fragmentation (fragmentation arises from variable allocation sizes), which means freed memory can be reused optimally.
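Here’s a toy version of the idea (my sketch, not CPython’s pymalloc): carve a backing buffer into fixed-size blocks once, thread a freelist through them, and then every allocation and free is just a pointer pop or push.

#include <stddef.h>

#define BLOCK_SIZE  32
#define NUM_BLOCKS  1024

/* Toy fixed-size pool allocator: one size class, all blocks carved out of
 * a single backing buffer, free blocks linked through the blocks themselves. */
typedef struct Block { struct Block *next; } Block;

static char   backing[BLOCK_SIZE * NUM_BLOCKS];
static Block *free_head = NULL;

void pool_init(void) {
    for (size_t i = 0; i < NUM_BLOCKS; i++) {
        Block *b = (Block *)&backing[i * BLOCK_SIZE];
        b->next = free_head;             /* push every block onto the freelist */
        free_head = b;
    }
}

void *pool_alloc(void) {                 /* O(1): pop the freelist head */
    if (free_head == NULL)
        return NULL;                     /* pool exhausted */
    Block *b = free_head;
    free_head = b->next;
    return b;
}

void pool_free(void *p) {                /* O(1): push the block back */
    Block *b = p;
    b->next = free_head;
    free_head = b;
}

CPython’s version is fancier, with many size classes and pools that grow on demand, but the core alloc/free fast path is this cheap.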

Similarly to an arena, you pay most of the cost up front by pre-allocating a backing buffer at initialization. In fact, in CPython’s case the pool’s backing buffer itself is carved out of an arena whose capacity is 1 MiB or 256 KiB:

#ifdef USE_LARGE_ARENAS
    #define ARENA_BITS              20                    /* 1 MiB */
#else
    #define ARENA_BITS              18                    /* 256 KiB */
#endif

#define ARENA_SIZE              (1 << ARENA_BITS)
#define ARENA_SIZE_MASK         (ARENA_SIZE - 1)

Also, CPython uses mmap() with MAP_ANONYMOUS, so it reserves 1 MiB of virtual memory up front and physical memory is only committed on demand, via page faults.
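Reserving the arena like that looks roughly like this (a simplified sketch, not CPython’s exact call):

#include <stddef.h>
#include <sys/mman.h>

static void *reserve_arena(size_t size) {
    /* MAP_ANONYMOUS: not backed by a file. The kernel only commits physical
     * pages lazily, when a page is first touched (on a page fault). */
    void *arena = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return arena == MAP_FAILED ? NULL : arena;
}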

With all these tricks, our script (the version that prints and does allocate in the loop) is probably reusing blocks from the pool allocator far more often than it’s actually acquiring new memory (at least for PyLongObject integers).

Conclusion

So to finally answer the title question “How often does Python allocate?”, the answer is what everyone expected: a lot. There are some good optimizations, but it could definitely do better :)

Despite the optimizations, there’s obviously still a lot of overhead in executing all this allocation code, when in theory adding two integers should only take the CPU a single ADD instruction.

In addition, the representation of PyLongObject is designed to handle both small and arbitrarily big integers alike, meaning the code for the slow and unlikely path (using a really big integer in your Python script) pessimizes the fast path (using regular-sized integers).

To be honest, I’m a little surprised there isn’t a fast-path optimization here. Even Smalltalk in the ’80s used pointer tagging to avoid heap-allocating integers [4]!

But we do see a good example of how using specialized memory allocators lets us greatly reduce some of the costs of memory allocation.

This is why Zig is one of my favorite languages. The answer to the question “I wonder if this is allocating?” is “does this function take an allocator?” [5]. The std.mem.Allocator interface lets you easily swap out allocators, and the entire standard library is designed to accept a std.mem.Allocator whenever it wants to allocate memory.

Footnotes

[1]

According to this chart of the cost of CPU operations in cycles, allocating + deallocating an object could take 200-500 cycles. Meanwhile, an ADD op is listed as taking less than 1 cycle (since modern CPUs have multiple ALU units, meaning more parallelism). So the 200-500x slower estimate is quite modest.

Also, this does not include the other overheads of Python’s integer representation beyond allocation.

For example, this is the code path for reading the value of a compact PyLongObject* back out as a Py_ssize_t:

static inline Py_ssize_t
_PyLong_CompactValue(const PyLongObject *op)
{
    Py_ssize_t sign;
    sign = 1 - (op->long_value.lv_tag & _PyLong_SIGN_MASK);
    return sign * (Py_ssize_t)op->long_value.ob_digit[0];
}

There’s a pointer indirection, a bit mask, a subtraction, a multiplication, and then another pointer indirection from reading ob_digit[0].

[2]

We can see the declaration of these small ints here:

#define _PY_NSMALLPOSINTS           1025
#define _PY_NSMALLNEGINTS           5

struct _Py_static_objects {
    struct {
        /* Small integers are preallocated in this array so that they
         * can be shared.
         * The integers that are preallocated are those in the range
         * -_PY_NSMALLNEGINTS (inclusive) to _PY_NSMALLPOSINTS (exclusive).
         */
        PyLongObject small_ints[_PY_NSMALLNEGINTS + _PY_NSMALLPOSINTS];
        
        /* ... */
    } singletons;
};

Now, the nice thing about these is that they require no heap allocation (notice the items in the array are PyLongObject values, not pointers).

And the runtime avoids accidentally calling free() on these static small ints by setting the IMMORTALITY_BIT_MASK bit, which marks the small int object as “immortal”.

[3]

From briefly glancing at the function that does the allocating inside print()’s internals, most of the logic is complicated by the fact that integers are represented in base 2^30 (to allow for arbitrarily long integers) and must be converted back to base 10 for display. It allocates a new PyLongObject with enough space to fit the base-10 digits.

But realistically the majority of integers in a program are going to be less than 2^30, so why not introduce a fast path that skips this complicated code entirely?

In addition, there’s no need to heap-allocate the list of digits in every case. If you know the list of digits is small (which is often the case), you can stack-allocate it.

This is what the Zig standard library does, for example.
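A hedged sketch of the general trick (names are mine, not Zig’s or CPython’s): keep a small fixed-size buffer on the stack and only fall back to the heap when the number is genuinely huge.

#include <stdio.h>
#include <stdlib.h>

/* Sketch: format a value into a caller-provided stack buffer when it fits,
 * and only heap-allocate for the (rare) case where it doesn't. */
char *format_int(long long value, char *stack_buf, size_t stack_len) {
    int needed = snprintf(NULL, 0, "%lld", value) + 1;
    if ((size_t)needed <= stack_len) {
        snprintf(stack_buf, stack_len, "%lld", value);
        return stack_buf;                      /* common case: zero allocations */
    }
    char *heap_buf = malloc((size_t)needed);   /* rare case: fall back to the heap */
    if (heap_buf)
        snprintf(heap_buf, (size_t)needed, "%lld", value);
    return heap_buf;
}

/* usage: char buf[32]; char *s = format_int(42, buf, sizeof buf); */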

[4]

See “Representation of Small Integers” on page 565 of “Smalltalk-80: The Language and its Implementation”, also known as the “Blue Book”.

[5]

A good quote by paperclover!