Examples of palloc overhead


2014/09/30 by Tomas Vondra

This is the post I promised last week, explaining a few common issues with memory contexts. These issues mostly lead to excessive overhead, i.e. excessive use of resources - usually memory. And that's exactly what this post is about, so whenever I say "overhead" you can read "excessive memory usage." I will also try to give advice on how to avoid these issues or at least minimize their impact.

As I briefly mentioned when explaining allocation set internals, there are three main sources of overhead:

  • chunk headers - for large chunks this becomes negligible
  • unused space (because of the 2^N chunk sizes) - expected ~25% for randomly sized chunks, but it can get much worse
  • chunk reuse not working efficiently - we'll see some examples of how this can happen

If you haven't read that post, it's probably the right time to do that. Also, if you want to learn more about memory management and allocator implementations, there's a great post at IBM developerWorks explaining it quite well and also listing many interesting additional resources (e.g. various malloc implementations).

So let's look at some common (but somewhat unexpected) examples of palloc overhead.


Allocating many tiny pieces of memory

Say we need to store a lot of small elements, a few bytes each - e.g. 64-bit integers, words in an ispell dictionary, or something like that. You could do a separate palloc call for each element (and keep the pointer), but then you pay the 'header' price for every single element. For example, by doing this

int64 *items[1024];
int    i;

for (i = 0; i < 1024; i++)
    items[i] = (int64 *) palloc(sizeof(int64));

you might think you allocated ~8kB of memory (1024 x 8B), but in fact each value gets prepended with a 16B chunk header. So you end up with about 24kB (not counting the items array itself, which needs another 8kB), which is ~3x the requested amount. If we asked for smaller values (e.g. 4B integers), the overhead would be even larger, because 8B is the smallest chunk palloc ever allocates (as mentioned in the previous post).
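By the way, an easy way to check this for yourself is MemoryContextStats, which prints block and chunk statistics for a memory context (and its children) to stderr:

/* print usage statistics for the current memory context tree */
MemoryContextStats(CurrentMemoryContext);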

Let's assume all the elements have about the same life span / scope (i.e. can be freed at the same time) and don't need to be passed somewhere else (i.e. are used only locally). In such cases the best approach is often to preallocate a large chunk of memory and then slice it into small pieces - essentially doing what the allocation set allocator does when slicing a block into chunks, but without the chunk headers.

Of course, that means you won't be able to call pfree on the individual elements, but that's OK because we assumed purely local usage (if we passed one of the elements somewhere else, that code would be unaware of this). And because the elements share the same life span (like words in an ispell dictionary), we don't really need to pfree individual words - the whole dictionary gets freed at once.

This is especially trivial to do when the elements are of fixed size, and you know how many of them to expect - in that case you can just do

char *tmp = palloc(elementSize * elementCount);

/*
 * tmp                      -> element #0
 * tmp + elementSize        -> element #1
 * ...
 * tmp + k * elementSize    -> element #k
 */

and you're done. But what to do when the elements are variable-sized, or when you don't know how many of them to expect? Well, who says you have to allocate space for all the elements at once? You can allocate a large chunk of memory, use it for elements until you run out of space, then allocate another chunk and start over.

A very simple example of this approach is the compact_palloc0 function in spell.c, used for efficient allocation of words and other dictionary-related data (for the purposes of full-text search). The following code is somewhat simplified, to make it easier to understand:

#define COMPACT_ALLOC_CHUNK 8192

/* current chunk and available space */
char * current_ptr = NULL;
Size   available_space = 0;

static void *
compact_palloc0(IspellDict *Conf, size_t size)
{
    void    *result;

    /* Need more space? */
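    /* (simplified: the real code also handles requests larger than
     *  the chunk, by falling back to plain palloc0) */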
    if (size > available_space)
    {
        current_ptr = palloc0(COMPACT_ALLOC_CHUNK);
        available_space = COMPACT_ALLOC_CHUNK;
    }

    result = (void *) current_ptr;

    current_ptr += size;
    available_space -= size;

    return result;
}

This effectively adds another (very simple) allocator on top of the AllocSet one. The "large" chunks (allocated using palloc) are tracked by the parent memory context and will be freed when that context gets destroyed.
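For illustration, copying a word into this densely packed storage might look like this (a hypothetical usage sketch, assuming src is the source string and Conf the dictionary being built - this is not code from spell.c):

/* copy the word, including the terminating '\0' */
char *word = (char *) compact_palloc0(Conf, strlen(src) + 1);
memcpy(word, src, strlen(src) + 1);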

Another example of such "dense" allocation was recently committed into our hashjoin implementation - it's slightly more complicated, though (to support some hashjoin-specific requirements). See commit 45f6240a for more details.

Allocating pieces that are not 2^N bytes

The other trap you may fall into is the 2^N sizing of chunks. There are plenty of ways to "achieve" that - I'll show you a real-world example from my quantile extension. I've already fixed it, so you have to look at commit c1e7bba9 to see it.

Let's say you're writing an aggregate function MEDIAN(float) - to do that, you need to collect the float values into the aggregate state (so that you can later sort them and pick the "middle" value, which is the median). The aggregate state might be represented by this structure:

typedef struct median_state {
    int nelements;
    int next;
    float * elements;
} median_state;

The float values are accumulated into the 'elements' array, which is resized on the fly as needed. The size of the array (including the unused part) is tracked by 'nelements', and 'next' is the index of the next available slot. So when next == nelements happens, we need to enlarge the array, which is done like this:

#define SLICE_SIZE 5
...

if (state->nelements == 0) {
    state->nelements = SLICE_SIZE;
    state->elements = (float *) palloc(sizeof(float) * state->nelements);
} else if (state->next > state->nelements - 1) {
    state->nelements = state->nelements + SLICE_SIZE;
    state->elements = (float *) repalloc(state->elements,
                                         sizeof(float) * state->nelements);
}

Can you spot the problem? It's pretty obvious :-/

Well, every time we call repalloc, the 'elements' array grows by just 20 bytes (SLICE_SIZE * 4B). That's rather pointless - either the new size still fits into the current 2^N chunk (and then the repalloc is almost free, but also gains nothing), or the array has to be moved to a chunk twice the size anyway. Either way, growing in small constant steps does not save any memory at all.

Moreover, there's a small inefficiency, because 20 is not a divisor of the chunk sizes (which follow the 2^N rule), so the array sizes never match the chunks exactly. This wastes a bit of memory, but for larger chunks it's negligible.
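To illustrate (4B floats, chunks following the 2^N rule), the first few steps look like this:

nelements :    5    10    15    20    25    30    35
request   :  20B   40B   60B   80B  100B  120B  140B
chunk     :  32B   64B   64B  128B  128B  128B  256B

The request grows by 20B each time, while the chunk sizes jump between powers of two - so most repalloc calls change nothing, and the remaining ones behave just like doubling would.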

It does clearly show, however, that constant growth does not work, because it ignores the 2^N chunk size pattern. A saner version of the resize would look like this:

#define SLICE_SIZE 8
...

if (state->nelements == 0) {
    state->nelements = SLICE_SIZE;
    state->elements = (float *) palloc(sizeof(float) * state->nelements);
} else if (state->next > state->nelements - 1) {
    state->nelements = state->nelements * 2;
    state->elements = (float *) repalloc(state->elements,
                                         sizeof(float) * state->nelements);
}

So, this is better - it starts with 8 elements (because 8 * 4B = 32B, which is 2^5 and thus follows the 2^N rule) and doubles from there. I could have used a smaller value (down to 2 elements, because the smallest chunk is 8B). This has about the same memory consumption as before, but the sizes now match the chunks exactly, and repalloc gets called much less frequently (exponentially less, in fact).

Now, let's see a more serious example - one that I almost committed into count_distinct, luckily realizing the error before doing so:

Let's say the state for the MEDIAN() aggregate was defined like this:

typedef struct median_state {
    int nelements;
    int next;
    float elements[1];
} median_state;

Placing a single-element array at the end of a struct is a well-known trick for defining variable-length structures. The resize might then look like this:

#define SLICE_SIZE 8
...

if (state == NULL) {
    state = (median_state *) palloc(offsetof(median_state, elements)
                                    + sizeof(float) * SLICE_SIZE);
    state->nelements = SLICE_SIZE;
    state->next = 0;
} else if (state->next > state->nelements - 1) {
    state->nelements = state->nelements * 2;
    state = (median_state *) repalloc(state,
                                      offsetof(median_state, elements)
                                      + sizeof(float) * state->nelements);
}

Update 30/09/2014: Fixed the broken example, reported by Patrick TJ McPhee.

We're still keeping the array nicely sized (still 2^N bytes), so perfect, right? Well, no. What gets allocated is the whole structure, and sadly that's always (2^N + 8B), because the two integers are part of the chunk. So we always request the perfect size + 8B, which means getting a chunk twice the necessary size (because of the overflowing 8B). And this time we're guaranteed to waste the upper half of the chunk (except the first 8B). That kinda sucks.
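Again, to illustrate (8B of integer fields, followed by the float array):

nelements :        8        16         32         64
request   :  8+32=40   8+64=72   8+128=136  8+256=264
chunk     :       64       128        256        512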

There are two ways to fix this - either allocate the array separately from the struct (which is what I did in count_distinct), or keep the total allocation size 2^N (and tweak nelements appropriately).
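The second option might look something like this (just a sketch, not the actual count_distinct code - INITIAL_SIZE is an arbitrary power of two):

#define INITIAL_SIZE 64    /* total allocation in bytes (2^N) */
...

if (state == NULL) {
    state = (median_state *) palloc(INITIAL_SIZE);
    state->nelements = (INITIAL_SIZE - offsetof(median_state, elements))
                       / sizeof(float);
    state->next = 0;
} else if (state->next > state->nelements - 1) {
    /* double the total allocation, so it stays 2^N bytes */
    Size    newsize = 2 * (offsetof(median_state, elements)
                           + sizeof(float) * state->nelements);

    state = (median_state *) repalloc(state, newsize);
    state->nelements = (newsize - offsetof(median_state, elements))
                       / sizeof(float);
}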

Creating many small contexts

Sometimes "less is more" and it certainly holds for memory contexts. A nice example is array_agg - an aggregate function that accumulates all the values into an array. The heavylifting is done by accumArrayResult in arrayfuncs.c. This piece of code performs initialization when the first value is passed to the function:

if (astate == NULL)
{
    /* First time through --- initialize */
    /* Make a temporary context to hold all the junk */
    arr_context = AllocSetContextCreate(rcontext,
                            "accumArrayResult",
                            ALLOCSET_DEFAULT_MINSIZE,
                            ALLOCSET_DEFAULT_INITSIZE,
                            ALLOCSET_DEFAULT_MAXSIZE);
    oldcontext = MemoryContextSwitchTo(arr_context);
    astate = (ArrayBuildState *) palloc(sizeof(ArrayBuildState));
    astate->mcontext = arr_context;
    astate->alen = 64;  /* arbitrary starting array size */
    astate->dvalues = (Datum *) palloc(astate->alen * sizeof(Datum));
    astate->dnulls = (bool *) palloc(astate->alen * sizeof(bool));
    astate->nelems = 0;
    astate->element_type = element_type;
    get_typlenbyvalalign(element_type,
                            &astate->typlen,
                            &astate->typbyval,
                            &astate->typalign);
}

It's slightly complicated because it sets a lot of fields in the aggregate state, but notice that it preallocates space for 64 elements (astate->alen), which is 512B on 64-bit architectures. It then properly doubles the size as needed (not shown here for brevity). Perfect, right? What could go wrong?

Well, the first thing the code actually does is create a dedicated memory context (one per group), and it uses ALLOCSET_DEFAULT_INITSIZE as the initial block size. And ALLOCSET_DEFAULT_INITSIZE is 8kB, so on the first palloc this memory context allocates an 8kB block. The fact that we wanted to preallocate space for only 64 elements is irrelevant - we get 16x that.
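If a dedicated per-group context is really necessary, a cheaper option might be the "small" allocation set parameters, which start with a 1kB block (just an illustration of the API, not what arrayfuncs.c does):

arr_context = AllocSetContextCreate(rcontext,
                                    "accumArrayResult",
                                    ALLOCSET_SMALL_MINSIZE,
                                    ALLOCSET_SMALL_INITSIZE,   /* 1kB */
                                    ALLOCSET_SMALL_MAXSIZE);   /* 8kB */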

Now, imagine an aggregation with many distinct groups. With the default parameters, each group gets its own 8kB block, even if its array holds a single item. And we actually get bug reports related to this.

What makes this even worse is that keeping per-group memory contexts makes it impossible to reuse chunks across groups.
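One conceivable remedy (a sketch only, assuming the state can live directly in the parent context) is to drop the per-group context and allocate from rcontext, so all groups share the same blocks and freelists:

/* allocate the state directly in the parent context */
astate = (ArrayBuildState *) MemoryContextAlloc(rcontext,
                                                sizeof(ArrayBuildState));
astate->mcontext = rcontext;
astate->alen = 8;    /* start small, double as needed */
astate->dvalues = (Datum *) MemoryContextAlloc(rcontext,
                                               astate->alen * sizeof(Datum));
astate->dnulls = (bool *) MemoryContextAlloc(rcontext,
                                             astate->alen * sizeof(bool));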

Gradually growing request sizes

The last issue is quite different from the previous ones, because it's about (in)efficient chunk reuse.

As I mentioned, chunks are not really freed - instead they're moved to a freelist for later reuse. So when you do palloc(50) you get a 64B chunk, and when you do pfree(chunk) it's moved to a freelist, to be eventually reused for other requests needing 64B chunks.

But what happens if the requested sizes only grow? Consider this example:

int i = 0, j = 0;
size_t current_size = 8;
int nchunks = 100;
int nloops = 10;
char * tmp[nchunks];

for (i = 0; i < nchunks; i++)
    tmp[i] = palloc(current_size);

for (i = 0; i < nloops; i++)
{
    current_size *= 2;
    for (j = 0; j < nchunks; j++)
        tmp[j] = repalloc(tmp[j], current_size);
}

So, how much memory is allocated at the end? There are 100 chunks, and the final chunk size is 8kB. So it has to be 800kB, right?

Actually, it's about double that, because none of the smaller chunks ever gets reused. The fact that the request sizes only grow prevents chunk reuse - the old chunks get stuck in the freelists until the memory context is destroyed.
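To be concrete, each of the 100 arrays leaves behind chunks of all the smaller sizes:

total = 100 x (8 + 16 + 32 + ... + 8192) B
      = 100 x (2 x 8192 - 8) B
      ~ 1.6 MB

which is roughly twice the 800kB held by the final 8kB chunks.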

It's true, however, that this overhead is bounded - it can't exceed 100%, because the stranded chunks form a geometric series (1/2 + 1/4 + 1/8 + ... = 1, relative to the final size). And most workloads actually mix requests of various sizes, making the reuse possible.

Summary

  • Don't try to be overly smart - follow the 2^N rule by allocating properly sized pieces and doubling the size when needed.
  • Where applicable, consider using dense allocation (as for example compact_palloc0 in spell.c).



