PATCH: decreasing memory needlessly consumed by array_agg

From Tomas Vondra
Subject PATCH: decreasing memory needlessly consumed by array_agg
Msg-id 5334D7A5.2000907@fuzzy.cz
Hi,

this is a patch for an issue reported in October 2013 in pgsql-bugs:

  http://www.postgresql.org/message-id/3839201.Nfa2RvcheX@techfox.foxi

Frank van Vugt reported that a simple query with array_agg() and a
large number of groups (1e7) fails with OOM even on a machine with 32GB
of RAM.

So for example doing this:

   CREATE TABLE test (a INT, b INT);
   INSERT INTO test SELECT i, i FROM generate_series(1,10000000) s(i);

   SELECT a, array_agg(b) FROM test GROUP BY a;

allocates a huge amount of RAM, easily forces the machine into
swapping, and eventually gets killed by the OOM killer (on my
workstation with 8GB of RAM that happens almost immediately).

Upon investigation, it seems to be caused by a combination of issues:

(1) per-group memory contexts - each group state uses a dedicated
    memory context, which is defined like this (in accumArrayResult):

    arr_context = AllocSetContextCreate(rcontext,
                "accumArrayResult",
                ALLOCSET_DEFAULT_MINSIZE,
                ALLOCSET_DEFAULT_INITSIZE,
                ALLOCSET_DEFAULT_MAXSIZE);

    which effectively means this:

    arr_context = AllocSetContextCreate(rcontext,
                "accumArrayResult",
                0,
                (8*1024),
                (8*1024*1024));

    so each group allocates at least 8kB of memory (on the first
    palloc call). With 1e7 groups, that's ~80GB of RAM, even if each
    group contains just a single item.

(2) minimum block size in aset.c - My first idea was to decrease the
    block size in the allocator, so I lowered it to 256B, but I was
    still getting OOM. Then I found that aset.c contains this:

        if (initBlockSize < 1024)
            initBlockSize = 1024;

    so effectively the lowest allowed block size is 1kB, which means
    ~10GB of memory for the state data alone (i.e. not considering the
    overhead of the hash table etc., which is not negligible).

    Considering we're talking about 1e7 32-bit integers, i.e. 40MB of
    raw data, that's still pretty excessive (about 250x more); see the
    arithmetic sketch below.
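
To make the arithmetic explicit, here's a trivial stand-alone program
(just back-of-the-envelope math, not part of the patch) reproducing
the numbers above:

    #include <stdio.h>

    int main(void)
    {
        long long ngroups = 10000000LL;         /* 1e7 groups */
        long long ctx8kb  = ngroups * 8 * 1024; /* 8kB initBlockSize */
        long long ctx1kb  = ngroups * 1024;     /* clamped 1kB blocks */
        long long raw     = ngroups * 4;        /* 1e7 32-bit integers */

        printf("8kB blocks: %lld GB\n", ctx8kb / 1000000000); /* ~80GB */
        printf("1kB blocks: %lld GB\n", ctx1kb / 1000000000); /* ~10GB */
        printf("raw data:   %lld MB\n", raw / 1000000);       /* ~40MB */
        printf("overhead:   %lldx\n",   ctx1kb / raw);        /* ~256x */

        return 0;
    }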

While I question whether the 1kB minimum block size makes sense, I
really think the per-group memory contexts are the main problem here.
What is the point of per-group memory contexts?

The memory is allocated when the first row of the group is received,
and isn't freed until the whole result set is processed. At least
that's how it works for Hash Aggregate.

However, that's exactly how it would work with a single memory context
too, with the significant benefit that all the groups share the same
memory (so the minimum block size is not an issue).

That is exactly what the patch aims to do - it removes the per-group
memory contexts and reuses the main memory context of the aggregate
itself.
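
For illustration, here's a rough sketch of what the initialization
branch of accumArrayResult() looks like with this change (simplified,
not the actual patch - the field and function names are those used by
the existing code):

    if (astate == NULL)
    {
        /* First time through - allocate the state and the arrays
         * directly in the caller's context (rcontext), without
         * creating a dedicated per-group context. */
        oldcontext = MemoryContextSwitchTo(rcontext);

        astate = (ArrayBuildState *) palloc(sizeof(ArrayBuildState));
        astate->mcontext = rcontext;    /* shared by all the groups */
        astate->alen = 4;               /* small initial array size */
        astate->dvalues = (Datum *) palloc(astate->alen * sizeof(Datum));
        astate->dnulls = (bool *) palloc(astate->alen * sizeof(bool));
        astate->nelems = 0;
        astate->element_type = element_type;
        get_typlenbyvalalign(element_type,
                             &astate->typlen,
                             &astate->typbyval,
                             &astate->typalign);
    }
    else
        oldcontext = MemoryContextSwitchTo(astate->mcontext);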

The patch also does one more thing - it changes how the arrays (in the
aggregate state) grow. Originally it worked like this:

    /* initial size */
    astate->alen = 64;

    /* when full, grow exponentially */
    if (astate->nelems >= astate->alen)
        astate->alen *= 2;

so the array length grows like this: 64 -> 128 -> 256 -> 512 ...
(note we're talking about elements, not bytes, so with 32-bit integers
it's actually 256B -> 512B -> 1kB -> ...).

While I do understand the point of this (minimizing palloc overhead), I
find it pretty dangerous, especially in the case of array_agg(). I've
modified the growth like this:

    /* initial size */
    astate->alen = 4;

    /* when full, grow linearly */
    if (astate->nelems >= astate->alen)
        astate->alen += 4;

I admit that might be a bit too aggressive, and maybe there's a better
way to do this, with a better balance between safety and speed. I was
thinking about something like this:


    /* initial size */
    astate->alen = 4;

    /* when full, double up to 128 elements, then grow linearly */
    if (astate->nelems >= astate->alen)
    {
        if (astate->alen < 128)
            astate->alen *= 2;
        else
            astate->alen += 128;
    }

i.e. a small initial size with exponential growth, with the growth
increment capped at 128 elements.
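
To get a feel for the trade-off, here's a small stand-alone simulation
(hypothetical, not part of the patch) that counts how many times each
strategy has to grow the array, and how much allocated-but-unused
space (slack) is left at the end:

    #include <stdio.h>

    /* Simulate inserting 'nelems' elements; cap == 0 means pure
     * doubling, otherwise double up to 'cap' elements and then grow
     * by 'cap' elements at a time. */
    static void simulate(const char *name, int nelems, int init, int cap)
    {
        int alen = init;
        int grows = 0;
        int n;

        for (n = 1; n <= nelems; n++)
        {
            if (n > alen)
            {
                if (cap > 0 && alen >= cap)
                    alen += cap;    /* linear phase */
                else
                    alen *= 2;      /* exponential phase */
                grows++;
            }
        }

        printf("%-22s nelems=%-7d grows=%-5d slack=%d\n",
               name, nelems, grows, alen - nelems);
    }

    int main(void)
    {
        /* the pathological case: many groups, one element each */
        simulate("original (64, double)", 1, 64, 0);
        simulate("capped (4, cap 128)", 1, 4, 128);

        /* a single large group, to see the cost on the other end */
        simulate("original (64, double)", 100000, 64, 0);
        simulate("capped (4, cap 128)", 100000, 4, 128);

        return 0;
    }

With a single element per group the original strategy leaves 63
elements of slack versus 3, while for a 100000-element group it needs
11 grows versus ~786 - which is the safety/speed balance in question.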


regards
Tomas

