How often does Python allocate?
I recently saw this tweet which I thought was funny:

How true is it?
I was surprised to learn that CPython represents each integer in a heap
allocated PyLongObject*.
Does that mean Python heap-allocates every integer you create? Integers are likely the most used data type in any program, so that would mean a lot of heap allocations.
Beyond the heap allocations themselves, this could make basic arithmetic operations like adding two numbers 200-500x slower[1].
Or maybe it uses the industry standard pointer tagging technique to avoid allocating small integers?
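For reference, the classic one-bit pointer-tagging scheme can be sketched in a few lines. This is a toy model of the general idea (and, as we'll see, not something CPython does): a word with its low bit set holds an immediate integer in its upper bits, while an even word is a real pointer.

```python
TAG_BIT = 1  # low bit distinguishes immediate ints from pointers

def tag_int(n: int) -> int:
    # Store the integer shifted left by one, with the tag bit set.
    return (n << 1) | TAG_BIT

def is_immediate(word: int) -> bool:
    return (word & TAG_BIT) == TAG_BIT

def untag_int(word: int) -> int:
    # An arithmetic right shift recovers the (possibly negative) value.
    return word >> 1

assert is_immediate(tag_int(42))
assert untag_int(tag_int(42)) == 42
assert untag_int(tag_int(-7)) == -7
```

Since aligned pointers always have their low bits clear, the tag bit lets the runtime store small integers directly in the word that would otherwise hold a pointer, with no heap allocation at all.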
To figure it out, I modified CPython to print when it allocates a PyLongObject*:
static PyLongObject *
long_alloc(Py_ssize_t size)
{
    assert(size >= 0);
    PyLongObject *result = NULL;
    /* ... */
    Py_ssize_t ndigits = size ? size : 1;
    if (ndigits == 1) {
        result = (PyLongObject *)_Py_FREELIST_POP(PyLongObject, ints);
    }
    if (result == NULL) {
        printf("ALLOCATING a new number object\n");
        result = PyObject_Malloc(offsetof(PyLongObject, long_value.ob_digit) +
                                 ndigits*sizeof(digit));
        if (!result) {
            PyErr_NoMemory();
            return NULL;
        }
        _PyObject_Init((PyObject*)result, &PyLong_Type);
    }
    /* ...omitted code... */
    return result;
}
And wrote a script to add some numbers 100k times:
# lol.py
for i in range(0, 100_000):
    print(i + 1)
And then counted the number of times it allocated:
echo "Allocating number object $(
./python.exe lol.py \
| rg -o ALLOCATING \
| wc -l
) times"
Allocating number object 100904 times
Huh, that seems like a lot!
It looks like it could be allocating once per iteration of the loop.
Let’s take out the print statement and see if it’s just the addition:
# lol.py
for i in range(0, 100_000):
    a = i + 1
echo "Allocating number object $(
./python.exe lol.py \
| rg -o ALLOCATING \
| wc -l
) times"
Allocating number object 905 times
That’s almost 100k fewer, so it does look like the allocations come from the printing, and CPython has some optimization in place to avoid allocations for addition.
Let’s look at the internal function to add two integers, long_add(a, b).
Just by looking at its signature, the return type is PyLongObject*, which could
mean that it’s allocating a new object every time you add two integers:
static PyLongObject *
long_add(PyLongObject *a, PyLongObject *b)
{
    // An integer is "compact" if it is <= 2^30
    if (_PyLong_BothAreCompact(a, b)) {
        // Unwrap the values of `a` and `b` and add them
        stwodigits z = medium_value(a) + medium_value(b);
        return _PyLong_FromSTwoDigits(z);
    }
    /* ... more code ... */
}
The _PyLong_BothAreCompact(...) function checks a bitfield on PyLongObject to
determine whether it’s “compact” (the integer value can fit in 63 bits), which
will always be the case for our script.
The code appears to be unwrapping the integer values for a and b, which are
represented inside of PyLongObject structs, and adding them together.
The reason the result of adding them together has the type stwodigits is that
PyLongObject stores the number in base 2^30: essentially, each digit is a
30-bit integer.
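To make that representation concrete, here’s a short sketch of how a value is reassembled from base-2^30 digits (stored least significant first, as in PyLongObject; the function name here is mine, not CPython’s):

```python
SHIFT = 30  # each "digit" holds 30 bits, so the base is 2**30

def value_from_digits(digits: list[int], sign: int = 1) -> int:
    # Digits are least significant first; fold from the top digit down.
    value = 0
    for d in reversed(digits):
        value = (value << SHIFT) | d
    return sign * value

assert value_from_digits([7]) == 7
assert value_from_digits([0, 1]) == 2**30        # value spilled into digit 1
assert value_from_digits([1, 2]) == 2 * 2**30 + 1
```

Any value up to 2^30 - 1 fits in a single digit, which is exactly the “compact” case the fast path above handles.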
Let’s look at _PyLong_FromSTwoDigits:
/* Create a new int object from a C word-sized int */
static inline PyLongObject *
_PyLong_FromSTwoDigits(stwodigits x)
{
    if (IS_SMALL_INT(x)) {
        return (PyLongObject*)get_small_int((sdigit)x);
    }
    assert(x != 0);
    /* check that it is <= 2^(32-1) */
    if (is_medium_int(x)) {
        return (PyLongObject*)_PyLong_FromMedium((sdigit)x);
    }
    return (PyLongObject*)_PyLong_FromLarge(x);
}
There are three separate code paths:
- when x is a “small int”: call get_small_int(x).
- when x is a “medium int”: call _PyLong_FromMedium(x).
- when x is a “large int”: call _PyLong_FromLarge(x).
The call to get_small_int() is catching my eye. Is it like the tagged pointer
trick I mentioned earlier?
static PyObject *
get_small_int(sdigit ival)
{
    assert(IS_SMALL_INT(ival));
    return (PyObject *)&_PyLong_SMALL_INTS[_PY_NSMALLNEGINTS + ival];
}
Nope, it seems there is a pre-allocated array of objects for integers in the
range -5 to 1025[2]. This would account for about the first thousand iterations
of our loop but not for the rest.
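You can observe this cache from Python itself: results in the cached range always come back as the very same object, while larger results are distinct objects. (This is a CPython implementation detail, so don’t write code that depends on it.)

```python
a, b = 10, 10

# 20 is inside the preallocated small-int range, so both additions
# return the very same cached object.
x = a + b
y = b + a
assert x is y

# 10**6 is far outside the cache, so two live results with the same
# value are two distinct heap objects.
big1 = a ** 6
big2 = a ** 6
assert big1 == big2
assert big1 is not big2
```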
Let’s look at the medium case (meaning x is less than 2^32 - 1) and the
_PyLong_FromMedium() function:
static PyObject *
_PyLong_FromMedium(sdigit x)
{
    assert(!IS_SMALL_INT(x));
    assert(is_medium_int(x));
    PyLongObject *v = (PyLongObject *)_Py_FREELIST_POP(PyLongObject, ints);
    if (v == NULL) {
        v = PyObject_Malloc(sizeof(PyLongObject));
        if (v == NULL) {
            PyErr_NoMemory();
            return NULL;
        }
        _PyObject_Init((PyObject*)v, &PyLong_Type);
    }
    digit abs_x = x < 0 ? -x : x;
    _PyLong_SetSignAndDigitCount(v, x<0?-1:1, 1);
    v->long_value.ob_digit[0] = abs_x;
    return (PyObject*)v;
}
Okay, it’s not using the long_alloc() function from earlier; instead it tries
to get a PyLongObject* from a freelist (using _Py_FREELIST_POP()), and only
allocates fresh memory if that fails.
A “freelist” is a common technique used in memory allocation strategies to keep track of freed allocations so they can be reused later, avoiding allocating more memory if possible.
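The idea fits in a few lines of Python. This is a toy model of a freelist, not CPython’s `_Py_FREELIST_*` machinery: freed objects are parked on a list and handed back out before any fresh allocation happens.

```python
class FreelistAllocator:
    """Toy freelist: reuse freed objects instead of allocating new ones."""

    def __init__(self):
        self._freelist = []
        self.fresh_allocations = 0

    def alloc(self) -> dict:
        if self._freelist:
            return self._freelist.pop()  # reuse a previously freed object
        self.fresh_allocations += 1
        return {}  # stand-in for a freshly heap-allocated object

    def free(self, obj: dict) -> None:
        obj.clear()
        self._freelist.append(obj)

allocator = FreelistAllocator()
for _ in range(100_000):
    obj = allocator.alloc()
    allocator.free(obj)

# One real allocation services the entire loop.
assert allocator.fresh_allocations == 1
```

This mirrors what we saw in _PyLong_FromMedium(): the alloc-free-alloc pattern of a tight loop means the same object keeps getting recycled.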
If we look at the long_dealloc() function, there’s a codepath specifically for
pushing the allocation to the freelist and not calling free on it so it can be
reused:
static void
long_dealloc(PyObject *self)
{
    /* ... */
    if (PyLong_CheckExact(self) && _PyLong_IsCompact((PyLongObject *)self)) {
        _Py_FREELIST_FREE(ints, self, PyObject_Free);
        return;
    }
    Py_TYPE(self)->tp_free(self);
}
I deleted the printf() I added to long_alloc() earlier, and added two print
statements to _PyLong_FromMedium() to print if it allocated or reused memory:
> ./python.exe lol.py | rg 'ALLOCATING|REUSING' | sort | uniq -c
102 ALLOCATING
99193 REUSING
Our script appears to actually be reusing most of the PyLongObject objects!
So where did the allocations come from when our script was print()’ing the
integers? It turns out the implementation of print() allocates a scratch
PyLongObject* for conversion purposes, even though this could be avoided in
most cases[3].
The cost of allocating
It’s nice to see that simply adding two numbers in a loop does not allocate memory every time. We just saw an optimization that reduces the frequency of allocations (by reusing them), but I also want to poke around and see if Python does anything to amortize the actual cost of an allocation.
Here’s the internal allocation function used for objects:
static inline void*
pymalloc_alloc(OMState *state, void *Py_UNUSED(ctx), size_t nbytes)
{
    if (UNLIKELY(nbytes == 0)) {
        return NULL;
    }
    if (UNLIKELY(nbytes > SMALL_REQUEST_THRESHOLD)) {
        return NULL;
    }
    uint size = (uint)(nbytes - 1) >> ALIGNMENT_SHIFT;
    poolp pool = usedpools[size + size];
    pymem_block *bp;
    if (LIKELY(pool != pool->nextpool)) {
        /*
         * There is a used pool for this size class.
         * Pick up the head block of its free list.
         */
        ++pool->ref.count;
        bp = pool->freeblock;
        assert(bp != NULL);
        if (UNLIKELY((pool->freeblock = *(pymem_block **)bp) == NULL)) {
            // Reached the end of the free list, try to extend it.
            pymalloc_pool_extend(pool, size);
        }
    }
    else {
        /* There isn't a pool of the right size class immediately
         * available: use a free pool.
         */
        bp = allocate_from_new_pool(state, size);
    }
    return (void *)bp;
}
It looks like the memory allocator for objects is a pool allocator with multiple pools, one per object size class.
A pool allocator splits its backing buffer into fixed-size chunks and is typically implemented with a freelist. This comes with some pretty nice advantages.
First, because everything is fixed in size, allocating and freeing becomes very
simple (and thus faster!) especially since it doesn’t need to do any of the
complicated bookkeeping malloc(...) needs to do to work for any allocation size.
There’s also less fragmentation (which arises from variable allocation sizes), meaning freed memory can be reused optimally since the pool never fragments.
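A size-classed pool allocator can be sketched like this. It’s a toy model of the idea, not pymalloc’s implementation: requests are rounded up to a size class, and each class hands out fixed-size blocks from its own pool.

```python
ALIGNMENT = 8  # round every request up to a multiple of 8 bytes

class Pool:
    """One pool serves a single size class with fixed-size blocks."""

    def __init__(self, block_size: int):
        self.block_size = block_size
        self.free_blocks: list[bytearray] = []

    def alloc(self) -> bytearray:
        if self.free_blocks:
            return self.free_blocks.pop()    # O(1) reuse from the freelist
        return bytearray(self.block_size)    # grab a fresh fixed-size block

    def free(self, block: bytearray) -> None:
        self.free_blocks.append(block)       # O(1) free, no coalescing needed

class PoolAllocator:
    def __init__(self):
        self.pools: dict[int, Pool] = {}

    def alloc(self, nbytes: int) -> bytearray:
        # Round the request up to its size class, analogous to
        # pymalloc's (nbytes - 1) >> ALIGNMENT_SHIFT computation.
        size_class = -(-nbytes // ALIGNMENT) * ALIGNMENT
        pool = self.pools.setdefault(size_class, Pool(size_class))
        return pool.alloc()

    def free(self, block: bytearray) -> None:
        self.pools[len(block)].free(block)

allocator = PoolAllocator()
a = allocator.alloc(30)   # rounded up to the 32-byte size class
assert len(a) == 32
allocator.free(a)
b = allocator.alloc(25)   # same size class, so the freed block is reused
assert b is a
```

Because every block in a pool has the same size, both alloc and free are a push/pop on a list: no searching, no coalescing, no per-allocation headers.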
Similarly to an arena, you pay most of the cost upfront by pre-allocating a backing buffer at initialization. In fact, in CPython’s case, the pool’s backing buffer is itself allocated by an arena whose capacity is 1 MiB or 256 KiB:
#ifdef USE_LARGE_ARENAS
#define ARENA_BITS 20 /* 1 MiB */
#else
#define ARENA_BITS 18 /* 256 KiB */
#endif
#define ARENA_SIZE (1 << ARENA_BITS)
#define ARENA_SIZE_MASK (ARENA_SIZE - 1)
Also, CPython uses mmap() with MAP_ANONYMOUS, so it reserves 1 MiB of virtual memory
upfront and the physical memory is allocated on demand via page faults.
With all these tricks, our script (the one that prints and does seem to
allocate memory in the loop) is probably reusing memory from the pool
allocator more often than it’s actually allocating new memory (at least for
PyLongObject integers).
Conclusion
So to finally answer the title question “How often does Python allocate?”, the answer is what everyone expected: a lot. There are some good optimizations, but it could definitely do better :)
Despite the optimizations, there’s obviously still a lot of overhead in
executing all this allocation code, when theoretically all it should take the
CPU to add two integers is a single ADD instruction.
In addition, the representation of PyLongObject is designed to handle both small
and really big integers alike, meaning the code to handle the
slow and unlikely path (using a really big integer in your Python script)
pessimizes the fast path (using regularly sized integers).
To be honest, I’m a little surprised there isn’t a fast path optimization here. Even Smalltalk in the 80s used pointer tagging to avoid heap-allocating integers[4]!
But we do see a good example of how using specialized memory allocators lets us greatly reduce some of the costs of memory allocation.
This is why Zig is one of my favorite languages. The answer to the question “I
wonder if this is allocating?” is “does this function take an allocator?”[5]. The
std.mem.Allocator interface lets you easily swap out allocators, and the entire
standard library is designed to accept a std.mem.Allocator when it wants to
allocate memory.
Footnotes
1. According to this chart of the cost of CPU operations in cycles,
allocating + deallocating an object could take 200-500 cycles. Meanwhile, an ADD
op is listed as taking less than 1 cycle (since modern CPUs have multiple ALU units,
meaning more parallelism). So the 200-500x slower estimate is quite modest.
Also, this does not include the other overheads of Python’s integer representation beyond allocation.
For example, this is the codepath for converting a PyLongObject* into a Py_ssize_t:
static inline Py_ssize_t
_PyLong_CompactValue(const PyLongObject *op)
{
    Py_ssize_t sign;
    sign = 1 - (op->long_value.lv_tag & _PyLong_SIGN_MASK);
    return sign * (Py_ssize_t)op->long_value.ob_digit[0];
}
There’s a pointer indirection, a bit mask, a subtraction, a multiplication,
and then another pointer indirection from reading ob_digit[0].
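Mirroring that arithmetic in Python makes the trick clearer. The mask value here is an assumption about CPython’s internal lv_tag layout (the low two bits encoding the sign), shown only to illustrate the branchless sign computation:

```python
SIGN_MASK = 0b11  # assumed: low two bits of lv_tag hold the sign

def compact_value(lv_tag: int, digit0: int) -> int:
    # tag bits 00 -> sign +1, 01 -> zero, 10 -> sign -1,
    # so the sign falls out of a mask and a subtraction with no branch.
    sign = 1 - (lv_tag & SIGN_MASK)
    return sign * digit0

assert compact_value(0b00, 42) == 42
assert compact_value(0b10, 42) == -42
assert compact_value(0b01, 0) == 0
```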
2. We can see the declaration of these small ints here:
#define _PY_NSMALLPOSINTS 1025
#define _PY_NSMALLNEGINTS 5
struct _Py_static_objects {
    struct {
        /* Small integers are preallocated in this array so that they
         * can be shared.
         * The integers that are preallocated are those in the range
         * -_PY_NSMALLNEGINTS (inclusive) to _PY_NSMALLPOSINTS (exclusive).
         */
        PyLongObject small_ints[_PY_NSMALLNEGINTS + _PY_NSMALLPOSINTS];
        /* ... */
    } singletons;
};
Now the nice thing about these is that they require no heap allocation (notice
that the type of the items in the array is PyLongObject, with no pointer).
And the runtime avoids accidentally calling free on these static small ints by
setting the IMMORTALITY_BIT_MASK, which marks the small int object as
“immortal”.
3. From briefly glancing at the function that allocates in the internals of
print(), most of the logic is made complicated by the fact that integers are
represented in base 2^30 (to allow for extra long integers) and must be
converted back to base 10. It allocates a new PyLongObject with enough
space to fit the base-10 digits.
But realistically the majority of integers in a program are going to be less than 2^30, so why not introduce a fast path that skips this complicated code entirely?
In addition, there’s no need to heap-allocate the list of digits in every case. If you know the list of digits is small (which is often the case), you can stack-allocate the digit list.
This is what the Zig standard library does, for example.
4. See “Representation of Small Integers” on page 565 of “Smalltalk-80: The Language and its Implementation”, also known as the “Blue Book”.
5. A good quote by paperclover!