Thread: array_agg() on a set larger than some arbitrary(?) limit causes runaway memory usage and eventually memory exhaustion

L.S.

Something seems wrong here: when applying array_agg() to a large recordset
above a certain size, 'things fall over' and memory usage races away until the
system runs out of memory:


# select version();
                                      version
-----------------------------------------------------------------------------------
 PostgreSQL 9.2.4 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.7.1,
64-bit


# \! free -g
             total       used       free     shared    buffers     cached
Mem:            31          1         30          0          0          0
-/+ buffers/cache:          0         30
Swap:           31          0         31



====================
Create a concatenation function and aggregate:

CREATE OR REPLACE FUNCTION comma_cat(text, text)
        RETURNS text
        LANGUAGE 'plpgsql'
        IMMUTABLE
        STRICT
        SECURITY INVOKER
        AS '
                BEGIN
                        IF $1 = '''' THEN
                                RETURN $2;
                        ELSIF $2 = '''' THEN
                                RETURN $1;
                        ELSE
                                RETURN $1 || '', '' || $2;
                        END IF;
                END;';
CREATE AGGREGATE comcat(text) ( SFUNC = comma_cat, STYPE = text );


Activate timing:

\timing
Timing is on.


Create demo data:

create temp table f as
    select id, random() as value
    from generate_series(1, 1e7::int) as f(id);

Time: 7036,917 ms



====================
Don't mind the 'usefulness' of the exact query below; I ran into this issue
while experimenting a bit with random().




On my system, using the comcat() aggregate is no problem regardless of the
size of the recordset:

with g as (select * from f limit 1e5)
    select comcat(id::text), min(value)
    from g
    group by g.value
    having count(1) > 1;
Time: 189,835 ms


with g as (select * from f limit 1e6)
    select comcat(id::text), min(value)
    from g
    group by g.value
    having count(1) > 1;
Time: 1815,756 ms


with g as (select * from f)
    select comcat(id::text), min(value)
    from g
    group by g.value
    having count(1) > 1;
Time: 18660,326 ms

====================
However, using array_agg() this breaks (on my system) on the largest set:

with g as (select * from f limit 1e5)
    select array_agg(id), min(value)
    from g
    group by g.value
    having count(1) > 1;
Time: 361,242 ms


with g as (select * from f limit 1e6)
    select array_agg(id), min(value)
    from g
    group by g.value
    having count(1) > 1;
Time: 3310,347 ms


with g as (select * from f)
    select array_agg(id), min(value)
    from g
    group by g.value
    having count(1) > 1;
Time: <none, fails>


=> the last query eats up all 32GB of main memory in seconds, then starts on
the 32GB of swap (which obviously takes a bit longer to digest), until
eventually the child process gets killed by the oom-killer and PostgreSQL
restarts....






--

Best,




Frank.
Hi Frank,

This is a bit tangential to the question itself, but to prevent the OOM
killer from corrupting your database, please consider this for your
production environments:
http://www.postgresql.org/docs/9.1/static/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT

ვალენტინ გოგიჩაშვილი
Valentine Gogichashvili


On Sat, Oct 19, 2013 at 3:38 PM, Frank van Vugt <ftm.van.vugt@foxi.nl> wrote:

> ...
Hi,

On 19.10.2013 21:02, Valentine Gogichashvili wrote:
> Hi Frank,
>
> This is a bit tangential to the question itself, but to prevent the OOM
> killer from corrupting your database, please consider this for your
> production environments:
> http://www.postgresql.org/docs/9.1/static/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT

I'm not sure what your suggestion is. The problem is that the database is
actually using the memory, so overcommit is not directly applicable here.

He might easily add more swap to "fix" the issue, but it will take ages
to actually execute the query because of the overhead.

Tomas
Hi,

On 19.10.2013 15:38, Frank van Vugt wrote:
> L.S.
>
> Something seems wrong here: when applying array_agg() to a large recordset
> above a certain size, 'things fall over' and memory usage races away until
> the system runs out of memory:

...

> with g as (select * from f)
>     select comcat(id::text), min(value)
>     from g
>     group by g.value
>     having count(1) > 1;
> Time: 18660,326 ms

...

> with g as (select * from f)
>     select array_agg(id), min(value)
>     from g
>     group by g.value
>     having count(1) > 1;
> Time: <none, fails>

Hmmmm. I initially thought that this was because of 2x resizing the
array in accumArrayResult, but that clearly is not the case (e.g.
because the example is constructed so that groups with multiple elements
are very unlikely).

Then I thought that it's because the array is pre-allocated with 64
elements, but that's not an issue either (at least not the main one),
because even with 1e7 groups this amounts to about

   10M groups * 64 Datums * 8 bytes = ~5GB

and the report says that it consumed ~32GB RAM and swap (not sure how
much, but probably not a small amount). On my machine it easily ate 8GB
of RAM and 4GB of swap (and then got shot by OOM).

Anyway, I disabled the preallocation (i.e. 1 element initially, +1 for
each iteration) which should be ~80MB of data, but even then I was
unable to execute that query.

The effect on smaller queries (say, 1e6 rows) was negligible too - it
consumed more or less the same amount of memory, irrespective of the
preallocation.

Then I started to wonder about the local memory context, which is
defined like this:

    arr_context = AllocSetContextCreate(rcontext,
                "accumArrayResult",
                ALLOCSET_DEFAULT_MINSIZE,
                ALLOCSET_DEFAULT_INITSIZE,
                ALLOCSET_DEFAULT_MAXSIZE);

which means

    arr_context = AllocSetContextCreate(rcontext,
                "accumArrayResult",
                0,
                (8*1024),
                (8*1024*1024));

I may be wrong, but it seems to me that each group gets its own
allocation context, and once something gets allocated in it (which is pretty
much guaranteed for each group), that context takes up at least 8kB of memory.

With 1e6 groups that's ~8GB (not really far from what I see here), and
with 1e7 groups it's ~80GB. Not the most efficient approach for 80MB of
values.
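
To make that concrete, here is a toy sketch, added purely for illustration
(ordinary backend code using the public memory-context API; it is not code
from the server or from any patch), of what effectively happens once per
group:

#include "postgres.h"
#include "utils/memutils.h"

/*
 * One AllocSet per "group", created with the default parameters.  With
 * ALLOCSET_DEFAULT_MINSIZE = 0 nothing is allocated when the context is
 * created, but the very first palloc in it grabs a whole initBlockSize
 * block of 8kB, no matter how small the request is.
 */
static void
per_group_context_demo(int ngroups)
{
    int         i;

    for (i = 0; i < ngroups; i++)
    {
        MemoryContext cxt = AllocSetContextCreate(CurrentMemoryContext,
                                                  "per-group state",
                                                  ALLOCSET_DEFAULT_MINSIZE,
                                                  ALLOCSET_DEFAULT_INITSIZE,
                                                  ALLOCSET_DEFAULT_MAXSIZE);

        /* 64 Datums = 512 bytes requested, but an 8kB block is allocated */
        (void) MemoryContextAlloc(cxt, 64 * sizeof(Datum));

        /*
         * The per-group states stay around until the aggregation finishes,
         * so this adds up to at least ngroups * 8kB: ~8GB for 1e6 groups,
         * ~80GB for 1e7 groups.
         */
    }
}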

I tried this:

#define ALLOCSET_TINY_MINSIZE     0
#define ALLOCSET_TINY_INITSIZE  (1 * 256)
#define ALLOCSET_TINY_MAXSIZE    (8 * 1024)

    arr_context = AllocSetContextCreate(rcontext,
                "accumArrayResult",
                ALLOCSET_TINY_MINSIZE,
                ALLOCSET_TINY_INITSIZE,
                ALLOCSET_TINY_MAXSIZE);

and after a minor tweak in aset.c, which by default enforces that init
size >= 1024:

    if (initBlockSize < 1024)
        initBlockSize = 1024;
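
(the tweak itself being essentially nothing more than lowering that floor,
roughly along these lines:)

    /* allow much smaller init block sizes, e.g. down to 256 bytes */
    if (initBlockSize < 256)
        initBlockSize = 256;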

I'm able to process even the 1e7 values, although it still consumes a
significant amount of memory.

The failing query is slightly artificial, but pretty much any array_agg
query with a large number of groups is going to fail in exactly the same way.
So if we could improve that somehow, that'd be nice.

However, the more I think about the local memory context, the more I
think the idea of using a dedicated memory context for each group is
flawed. A single context for the whole aggregate would work much
better, and indeed, after replacing the local memory context with
rcontext (which is just the aggcontext passed in from array_agg_transfn),
the memory consumption was cut in half.
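
Roughly, the change looks like this (a sketch of the idea only, not the exact
patch I tried; the signature and the ArrayBuildState fields are the ones in
9.2's arrayfuncs.c):

ArrayBuildState *
accumArrayResult(ArrayBuildState *astate,
                 Datum dvalue, bool disnull,
                 Oid element_type,
                 MemoryContext rcontext)
{
    /* build everything directly in rcontext (the caller's aggcontext) */
    MemoryContext oldcontext = MemoryContextSwitchTo(rcontext);

    if (astate == NULL)
    {
        /* first call for this group --- no private context anymore */
        astate = (ArrayBuildState *) palloc(sizeof(ArrayBuildState));
        astate->mcontext = rcontext;    /* shared context, not a private one */
        astate->alen = 8;               /* much smaller initial preallocation */
        astate->dvalues = (Datum *) palloc(astate->alen * sizeof(Datum));
        astate->dnulls = (bool *) palloc(astate->alen * sizeof(bool));
        astate->nelems = 0;
        astate->element_type = element_type;
        get_typlenbyvalalign(element_type,
                             &astate->typlen,
                             &astate->typbyval,
                             &astate->typalign);
    }
    else
    {
        Assert(astate->element_type == element_type);
        /* enlarge dvalues[]/dnulls[] if needed; still doubling here */
        if (astate->nelems >= astate->alen)
        {
            astate->alen *= 2;
            astate->dvalues = (Datum *) repalloc(astate->dvalues,
                                                 astate->alen * sizeof(Datum));
            astate->dnulls = (bool *) repalloc(astate->dnulls,
                                               astate->alen * sizeof(bool));
        }
    }

    /* copy pass-by-reference values into the (now shared) context */
    if (!disnull && !astate->typbyval)
    {
        if (astate->typlen == -1)
            dvalue = PointerGetDatum(PG_DETOAST_DATUM_COPY(dvalue));
        else
            dvalue = datumCopy(dvalue, astate->typbyval, astate->typlen);
    }

    astate->dvalues[astate->nelems] = dvalue;
    astate->dnulls[astate->nelems] = disnull;
    astate->nelems++;

    MemoryContextSwitchTo(oldcontext);

    return astate;
}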

The other thing I think we should reevaluate is the preallocation, i.e.
the initial size (currently 64 elements) and whether growing to 2x the size
is a good idea.

Based on experience and measurements with multiple custom aggregates I
did in the past few weeks, I think a significantly lower initial size
(say 16 or maybe even 8) would work equally well. We're talking about a
difference of a few percent vs. a much higher chance of running out of memory.

But let's say that's a minor issue; by switching to a shared memory
context we've already saved ~50% of the memory for cases like this (a single
element per group: the 64 preallocated Datums are 64 * 8B = 512B, which
still lands in a 1kB chunk).

Regarding the growth rate: I think once the array reaches a certain size
(e.g. 1024 elements), we should probably decrease the growth rate, for
example +128 elements instead of 2x, or something like that. With the current
approach we're pretty much bound to have ~50% overhead because of the
doubling.
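
Purely as an illustration (the 1024 and +128 numbers are just the ones
mentioned above, not measured values), the growth policy could look like:

/*
 * Double while the array is still small (the waste is then capped at a few
 * kB per group), switch to fixed-size increments once it gets large.
 */
static int
next_array_alen(int alen)
{
    if (alen < 1024)
        return alen * 2;
    else
        return alen + 128;
}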

regards
Tomas
Hi Valentine,

On Saturday 19 October 2013 21:02:25, Valentine Gogichashvili wrote:
> This is a bit tangential to the question itself, but to prevent the OOM
> killer from corrupting your database, please consider this for your
> production environments:

Sure, I understand why you'd want to mention this 'on the side', but I'm aware
of OOM-tuning. In production, I use it wherever it's _really_ needed, but mind
that the oom-killer in newer kernels is already selecting processes a bit
smarter than it used to. In the example I gave, the correct child process was
killed.





--

Best,




Frank.
Hi,

Interesting read so far!

On Sunday 20 October 2013 04:48:01, Tomas Vondra wrote:
> because the example is constructed so that groups with multiple elements
> are very unlikely).

Yep, the original intention of the 'experiment' was to investigate how often
random() would 'clash' on various machines with/without hardware random
number generators, etc.

> the report says that it consumed ~32GB RAM and swap (not sure how
> much, but probably not a small amount). On my machine it easily ate 8GB
> of RAM and 4GB of swap (and then got shot by OOM).

The amount of swap is 32GB as well (see the output of 'free' at the top of the report ;) )

> Anyway, I disabled the preallocation (i.e. 1 element initially, +1 for
> each iteration) which should be ~80MB of data, but even then I was
> unable to execute that query.
>
> The effect on smaller queries (say, 1e6 rows) was negligible too - it
> consumed more or less the same amount of memory, irrespective of the
> preallocation.

On my setup, running array_agg() over 1e6 records uses up ~5GB.

> With 1e6 groups that's ~8GB (not really far from what I see here), and
> with 1e7 groups it's ~80GB. Not the most efficient approach for 80MB of
> values.

Exactly; the 'scaling' of the memory hunger was not what I'd expect from such
a recordset, which is what made me write up the report.

> The failing query is slightly artificial, but pretty much any array_agg
> query with a large number of groups is going to fail in exactly the same
> way. So if we could improve that somehow, that'd be nice.

exactly

> But let's say that's a minor issue; by switching to a shared memory
> context we've already saved ~50% of the memory for cases like this (a
> single element per group: the 64 preallocated Datums are 64 * 8B = 512B,
> which still lands in a 1kB chunk).

I never ran into this earlier (and I've been using PostgreSQL for quite some
time now) and even now, this is not about a 'production query', so I'm happy
with any and all benefits for future versions that come from this ;)

Having said that, should you want me to check the effects of some patch, just
let me know, I'd be happy to do that.




--

Best,




Frank.
On 20 October 2013, 10:26, Frank van Vugt wrote:
>
> I never ran into this earlier (and I've been using PostgreSQL for quite some
> time now) and even now, this is not about a 'production query', so I'm happy
> with any and all benefits for future versions that come from this ;)
>
> Having said that, should you want me to check the effects of some patch, just
> let me know, I'd be happy to do that.

Attached is a quick patch removing the local memory context and using
aggcontext instead. I've also tuned down the preallocation. It's against
9.2 stable, so it should apply fine against your 9.2.4. But be careful,
it's not really tested.

Tomas
Hi Frank,

On Sun, Oct 20, 2013 at 10:26 AM, Frank van Vugt <ftm.van.vugt@foxi.nl> wrote:

> Sure, I understand why you'd want to mention this 'on the side', but I'm
> aware
> of OOM-tuning. In production, I use it wherever it's _really_ needed, but
> mind
> that the oom-killer in newer kernels is already selecting processes a bit
> smarter than it used to. In the example I gave, the correct child process
> was
> killed.
>

I personally prefer to see "ERROR: out of memory" coming from Postgres than
the OOM killer killing whatever processes it thinks should be killed :)

Apropos, how does the string_agg aggregate behave in your example? It would be
nice to know without experimenting myself and looking into the sources :)

Regards,

-- Valentine
* Frank van Vugt (ftm.van.vugt@foxi.nl) wrote:
> On Saturday 19 October 2013 21:02:25, Valentine Gogichashvili wrote:
> > This is a bit tangential to the question itself, but to prevent the OOM
> > killer from corrupting your database, please consider this for your
> > production environments:
>
> Sure, I understand why you'd want to mention this 'on the side', but I'm aware
> of OOM-tuning. In production, I use it wherever it's _really_ needed, but mind
> that the oom-killer in newer kernels is already selecting processes a bit
> smarter than it used to. In the example I gave, the correct child process was
> killed.

The correct child being killed doesn't actually mean that it's a *good
idea* to kill off PG child processes, in general, particularly in the
way that the OOM killer goes about it (kill -9).  If it was possible to
tune the OOM killer to use a different signal, which would allow PG to
actually clean things up, it *might* be reasonable to allow it, but I
still wouldn't recommend it.

In production, for my part, proper memory accounting (and disabling of
OOM) should *always* be used.

    Thanks,

        Stephen
Hi Valentine,

On Sunday 20 October 2013 14:06:19, Valentine Gogichashvili wrote:
> Apropos, how does the string_agg aggregate behave in your example? It would
> be nice to know without experimenting myself and looking into the sources :)



Interesting....

On the 1e7 dataset, our custom comcat() function runs for ~18.5
seconds and eats up to ~6,5 GB of memory, while the string_agg() on
the same dataset runs for ~23.5 seconds and eats up to ~16 GB of
memory.




--

Best,




Frank.
Hi Stephen,

On Monday 21 October 2013 10:47:26, Stephen Frost wrote:
> > In production, I use it wherever it's _really_ needed, but mind
> > that the oom-killer in newer kernels is already selecting processes
> > a bit smarter than it used to. In the example I gave,
> > the correct child process was killed.

> The correct child being killed doesn't actually mean that it's a *good
> idea* to kill off PG child processes, in general, particularly in the
> way that the OOM killer goes about it (kill -9).  If it was possible to
> tune the OOM killer to use a different signal, which would allow PG to
> actually clean things up, it *might* be reasonable to allow it, but I
> still wouldn't recommend it.
>
> In production, for my part, proper memory accounting (and disabling of
> OOM) should *always* be used.

I did not describe this the way I should have; I meant to say that I do
exactly that: I _always_ disable the OOM-killer in production, since there it
is _really_ needed.

I noticed on the development machine where I did the testing that the correct
child process was being killed, which used to not be the case.


Thanks for emphasizing this, though.




--

Best,




Frank.
Hi,

On Sunday 20 October 2013 12:57:43, Tomas Vondra wrote:
> Attached is a quick patch removing the local memory context and using
> aggcontext instead. I've also tuned down the preallocation. It's against
> 9.2 stable, so it should apply fine against your 9.2.4. But be careful,
> it's not really tested.

I applied the patch and ran my queries again with the following results:



> with g as (select * from f limit 1e5)
>         select array_agg(id), min(value)

used to be:
    Time: 361,242 ms
now:
    Time: 363,767 ms



> with g as (select * from f limit 1e6)
>         select array_agg(id), min(value)

used to be:
    Time: 3310,347 ms
now:
    Time: 2134,688 ms



> with g as (select * from f limit 1e7)
>         select array_agg(id), min(value)

used to be:
    Time: <none, fails>
now:
    Time: 23234,045 ms

The last query now uses up ~3.5GB of memory.





and as for the comcat() / string_agg() comparison:

> with g as (select * from f limit 1e7)
>         select comcat(id::text), min(value)

used to be / still is:
    Time: ~18.5 seconds
    Mem: ~6,5 GB


> with g as (select * from f limit 1e7)
>         select string_agg(id::text, ', '), min(value)

used to be:
    Time: ~28.5 seconds
    Mem: ~16 GB
now:
    Time: ~28.5 seconds
    Mem: ~12 GB



So, the patch seems to have the desired effect ;)

It looks like this didn't go into git yet. Will it be in v9.2.6/v9.3.1? Since
it's still a 'rough' patch, I reverted it on our development server for now.





--

Best,




Frank.
Frank van Vugt <ftm.van.vugt@foxi.nl> writes:
> On Sunday 20 October 2013 12:57:43, Tomas Vondra wrote:
>> Attached is a quick patch removing the local memory context and using
>> aggcontext instead.

> It looks like this didn't go into git yet. Will it be in v9.2.6/v9.3.1?

Certainly not.  We can consider it for 9.4, if Tomas submits it to the
commitfest process, but it needs review and testing.  In particular,
I suspect that it probably outright breaks some cases (look at
makeMdArrayResult) and results in substantial added bloat, not savings,
in others.

            regards, tom lane
On Thursday 31 October 2013 10:05:16, Tom Lane wrote:
> > It looks like this didn't go into git yet. Will it be in v9.2.6/v9.3.1?
>
> Certainly not.  We can consider it for 9.4, if Tomas submits it to the
> commitfest process, but it needs review and testing.  In particular,
> I suspect that it probably outright breaks some cases (look at
> makeMdArrayResult) and results in substantial added bloat, not savings,

Understood; do mind the fact that Tomas did point out to me that it was an
untested patch ;)

However, although the example with which I ran into this issue was a bit
contrived, the current behaviour of the array functions in terms of memory
usage does seem to be a bit greedy. So it at least _looks_ like a good idea to
do something about it, since triggering this issue in the real world doesn't
seem that far-fetched...

Tomas, are you planning to submit a patch?



--

Best,




Frank.
On 31 October 2013, 15:05, Tom Lane wrote:
> Frank van Vugt <ftm.van.vugt@foxi.nl> writes:
>> On Sunday 20 October 2013 12:57:43, Tomas Vondra wrote:
>>> Attached is a quick patch removing the local memory context and using
>>> aggcontext instead.
>
>> It looks like this didn't go into git yet. Will it be in v9.2.6/v9.3.1?
>
> Certainly not.  We can consider it for 9.4, if Tomas submits it to the
> commitfest process, but it needs review and testing.  In particular,
> I suspect that it probably outright breaks some cases (look at
> makeMdArrayResult) and results in substantial added bloat, not savings,
> in others.

Yes, I'll submit it to the next commitfest. And you're right that
makeMdArrayResult is faulty - it shouldn't do the MemoryContextDelete now
that I've reused the context of the aggregation function. Actually, I'm quite
surprised it did not fail, even though my tests were only sketchy.

As for the 9.2 / 9.3 branches, I think we should discuss whether the
current state is actually correct. I'm inclined to say that code which
manages to crash a server with 64GB of virtual memory, even though with very
light tweaking it can happily complete within ~3.5GB of RAM, is not exactly
right.

It seems to me that the code somehow assumes the accumulated arrays are
going to be quite large - in that case preallocating > 1kB per group is
probably fine. I'm fine with "reasonable" preallocation, but 1kB per group
seems way too aggressive to me. And it's not possible to use a smaller
preallocation, because that is prevented by the block size check in aset.c.

So I see three options for <9.4 versions:

(a) keep the current implementation, which crashes on cases like this
(I don't really like this option, because the code seems broken to me)

(b) keep the local memory context but tune it somehow - this, however, means
changes in aset.c, which impacts a significant part of the codebase
(not sure how much internal / external code relies on the current behaviour)

(c) remove the local memory context (I'm really wondering what other
benefit it has in this particular case) and maybe tune the other
parameters (e.g. the initial size / growth rate of the array, etc.)

I'd certainly vote for (c) in this case.

Tomas