Thread: excessive amounts of consumed memory (RSS), triggering OOM killer
Hi all,

while working on the patch decreasing the amount of memory consumed by array_agg [1], I've run into some strange OOM issues. Reproducing them using the attached SQL script is rather simple.

[1] https://commitfest.postgresql.org/action/patch_view?id=1652

At first I thought there's some rare hidden memory leak, but I'm pretty sure that's not the case. I've even put explicit counters into aset.c, counting allocated/freed blocks, and it all seems to be working just fine (and matching the context tree perfectly). So no memory leak.

The only thing I can think of is some interaction (fragmentation?) with the system allocator (I'm using glibc-2.19 and kernel 3.16.1, BTW), making the RSS values even less useful than I thought. Sadly, it seems to trigger the OOM killer :-(

It's entirely possible that this is known behavior of the allocator and I've simply been unaware of it. It's also true that the work_mem values used here are really excessive; realistic values would be much lower, which would make the issues much less serious.

To demonstrate the problem I'll use the attached SQL script - a rather simple script generating sample tables and then executing trivial array_agg() queries. The script sets work_mem to two different values:

    work_mem = '1GB'    ==> limits the INSERT (generating data)
    work_mem = '1024GB' ==> bogus value, forcing a hash aggregate
                            (assuming the hash table fits into memory)

The size of the sample tables (and the amount of memory needed for the hash aggregate) is determined by the first parameter set in the script. With this value

    \set size 10000

I get an OOM crash on the first execution of the script (on a machine with 8GB of RAM and 512MB shared buffers), but YMMV. The problem is that even with a much smaller dataset (say, size 7500) you'll get an OOM error after several executions of the script; how many executions are needed seems to be inversely proportional to the size of the data set.
The "problem" is that the RSS amount keeps increasing over time for some reason. For example with "size = 5000", the memory stats for the process look like this over the first few minutes:

       VIRT     RES     SHR  %CPU  %MEM    TIME+  COMMAND
    5045508  2,818g  187220  51,9  36,6  0:15.39  postgres: INSERT
    5045508  3,600g  214868  62,8  46,8  3:11.58  postgres: INSERT
    5045508  3,771g  214868  50,9  49,0  3:40.03  postgres: INSERT
    5045508  3,840g  214868  48,5  49,9  4:00.71  postgres: INSERT
    5045508  3,978g  214880  51,5  51,7  4:40.73  postgres: INSERT
    5045508  4,107g  214892  53,2  53,4  5:22.04  postgres: INSERT
    5045508  4,297g  215276  53,9  55,9  6:22.63  postgres: INSERT
    ...

Those are rows for the backend process, captured from "top" over time; the TIME column shows how long the backend has been running. Each iteration takes ~30 seconds, so those lines represent approximately iterations 1, 6, 7, 8, 11, etc.

Notice how the RSS value grows over time, and also notice that this is the INSERT, restricted by work_mem = '1GB'. So the memory consumption should be ~1.5GB, and MemoryContextStats(TopMemoryContext) collected at this point is consistent with that (see the mem-ctx.log). Eventually the value stabilizes at ~4,430g and stops growing.

With size 7500 it however takes only ~20 iterations to reach the OOM issue, with a crash log like this:

    [10560.843547] Killed process 15862 (postgres) total-vm:7198260kB,
    anon-rss:6494136kB, file-rss:300436kB

So, any ideas what might be the culprit here? As I said, this is clearly made worse by inappropriately high work_mem values, but I'm not sure it's completely harmless. Imagine for example long-running backends executing complex queries with inaccurate estimates. That may easily result in using much more memory than the work_mem limit, with the RSS value increasing over time.

regards
Tomas
On 12/01/2014 05:39 PM, Tomas Vondra wrote:
> At first I thought there's some rare hidden memory leak, but I'm pretty
> sure that's not the case. I've even put some explicit counters into
> aset.c counting allocated/freed blocks, and it seems to be working just
> fine (and matching the context tree perfectly). So no memory leak.

Doesn't this line:

    TopMemoryContext: 136614192 total in 16678 blocks; 136005936 free
    (500017 chunks); 608256 used

look pretty suspicious?

cheers

andrew
On 2.12.2014 00:31, Andrew Dunstan wrote:
> Doesn't this line:
>
>     TopMemoryContext: 136614192 total in 16678 blocks; 136005936 free
>     (500017 chunks); 608256 used
>
> look pretty suspicious?

It certainly looks a bit large, no doubt about that, especially considering this is a fresh cluster with a single database containing a single user-created table. For comparison, this is a new backend:

    TopMemoryContext: 78128 total in 11 blocks; 8528 free (16 chunks);
    69600 used

OTOH, it's "just" 130MB, and the RSS values are ~6GB when the OOM hits.

For the record, I only tweaked these settings in the config file:

    test=# select name, setting from pg_settings
           where source like '%configuration file%';
                name            |      setting
    ----------------------------+-------------------
     checkpoint_segments        | 32
     DateStyle                  | ISO, DMY
     default_text_search_config | pg_catalog.simple
     dynamic_shared_memory_type | posix
     lc_messages                | cs_CZ.UTF-8
     lc_monetary                | cs_CZ.UTF-8
     lc_numeric                 | cs_CZ.UTF-8
     lc_time                    | cs_CZ.UTF-8
     log_timezone               | Europe/Prague
     maintenance_work_mem       | 524288
     max_connections            | 100
     shared_buffers             | 65536
     TimeZone                   | Europe/Prague
     work_mem                   | 1048576
    (14 rows)

Some of those are set by initdb, of course.

regards
Tomas
Tomas Vondra <tv@fuzzy.cz> writes:
> On 2.12.2014 00:31, Andrew Dunstan wrote:
>> Doesn't this line:
>> TopMemoryContext: 136614192 total in 16678 blocks; 136005936 free
>> (500017 chunks); 608256 used
>> look pretty suspicious?

> It certainly looks a bit large, especially considering this is a fresh
> cluster with a single DB, containing a single table (created by user),
> no doubt about that.
> OTOH, it's "just" 130MB, and the RSS values are ~6GB when the OOM hits.

Yeah, but what was that 130MB being used for? It's not generally considered good practice to put large or random stuff in TopMemoryContext --- arguably, any code doing so is broken already, because almost by definition such allocations won't be reclaimed on error.

What I suspect you're looking at here is the detritus of creation of a huge number of memory contexts. mcxt.c keeps its own state about existing contexts in TopMemoryContext. So, if we posit that those contexts weren't real small, there's certainly room to believe that there was some major memory bloat going on recently.

			regards, tom lane
On 2.12.2014 01:33, Tom Lane wrote:
> What I suspect you're looking at here is the detritus of creation of
> a huge number of memory contexts. mcxt.c keeps its own state about
> existing contexts in TopMemoryContext. So, if we posit that those
> contexts weren't real small, there's certainly room to believe that
> there was some major memory bloat going on recently.

Aha! MemoryContextCreate allocates the memory for the new context from TopMemoryContext explicitly, so that it survives resets of the parent context. Is that what you had in mind by keeping state about existing contexts?

That'd probably explain the TopMemoryContext size, because array_agg() creates a separate context for each group. So if you have 1M groups, you have 1M contexts. Although I don't see how the size of those contexts would matter?

Maybe we could move this info (the list of child contexts) to the parent context somehow, so that it's freed when the context is destroyed - aside from the regular blocks, of course. But maybe it's not worth the effort, because you can only have so many memory contexts created concurrently. For AllocSet it's at most RAM/1kB (because that's the minimum block size) before the OOM intervenes.
And in this case the contexts have an 8kB initial size, so the number of contexts is even lower. So maybe it's not worth the effort.

Also, this explains the TopMemoryContext size, but not the RSS size (or am I missing something)?

regards
Tomas
Tomas Vondra <tv@fuzzy.cz> writes:
> On 2.12.2014 01:33, Tom Lane wrote:
>> What I suspect you're looking at here is the detritus of creation of
>> a huge number of memory contexts. mcxt.c keeps its own state about
>> existing contexts in TopMemoryContext. So, if we posit that those
>> contexts weren't real small, there's certainly room to believe that
>> there was some major memory bloat going on recently.

> Aha! MemoryContextCreate allocates the memory for the new context from
> TopMemoryContext explicitly, so that it survives resets of the parent
> context. Is that what you had in mind by keeping state about existing
> contexts?

Right.

> That'd probably explain the TopMemoryContext size, because array_agg()
> creates separate context for each group. So if you have 1M groups, you
> have 1M contexts. Although I don't see how the size of those contexts
> would matter?

Well, if they're each 6K, that's your 6GB right there.

> Maybe we could move this info (list of child contexts) to the parent
> context somehow, so that it's freed when the context is destroyed?

We intentionally didn't do that, because in many cases it'd result in parent contexts becoming nonempty when otherwise they'd never have anything actually in them. The idea was that such "shell" parent contexts should be cheap, requiring only a control block in TopMemoryContext and not an actual allocation arena. This idea of a million separate child contexts was never part of the design of course; we might need to rethink whether that's a good idea. Or maybe there need to be two different policies about where to put child control blocks.

> Also, this explains the TopMemoryContext size, but not the RSS size (or
> am I missing something)?

Very possibly you're left with "islands" that prevent reclaiming very much of the peak RAM usage. It'd be hard to be sure without some sort of memory map, of course.

			regards, tom lane
On 2014-12-02 02:52, Tom Lane wrote:
> Tomas Vondra <tv@fuzzy.cz> writes:
>> That'd probably explain the TopMemoryContext size, because array_agg()
>> creates separate context for each group. So if you have 1M groups, you
>> have 1M contexts. Although I don't see how the size of those contexts
>> would matter?
>
> Well, if they're each 6K, that's your 6GB right there.

Yeah, but this memory should be freed after the query finishes, no?

>> Maybe we could move this info (list of child contexts) to the parent
>> context somehow, so that it's freed when the context is destroyed?
>
> We intentionally didn't do that, because in many cases it'd result in
> parent contexts becoming nonempty when otherwise they'd never have
> anything actually in them. The idea was that such "shell" parent
> contexts should be cheap, requiring only a control block in
> TopMemoryContext and not an actual allocation arena. This idea of a
> million separate child contexts was never part of the design of course;
> we might need to rethink whether that's a good idea. Or maybe there
> need to be two different policies about where to put child control
> blocks.

Maybe. For me, the 130MB is not really a big deal, because for this to happen there really need to be many child contexts at the same time, consuming much more memory.
With 6.5GB consumed in total, 130MB amounts to ~2%, which is negligible - unless we can fix the RSS bloat.

>> Also, this explains the TopMemoryContext size, but not the RSS size
>> (or am I missing something)?
>
> Very possibly you're left with "islands" that prevent reclaiming very
> much of the peak RAM usage. It'd be hard to be sure without some sort
> of memory map, of course.

Yes, that's something I was thinking about too - I believe what happens is that the allocations of info in TopMemoryContext and the actual contexts are interleaved, and at the end only the memory contexts are deleted. The blocks allocated in TopMemoryContext are kept, creating the "islands".

If that's the case, allocating the child context info within the parent context would solve this, because those pieces would be reclaimed along with the rest of the parent's memory. But then again, there are probably other ways to create such islands (e.g. allocating an additional block in a long-lived context while the child contexts exist).

regards
Tomas
On 2 December 2014, 10:59, Tomas Vondra wrote:
> On 2014-12-02 02:52, Tom Lane wrote:
>> Very possibly you're left with "islands" that prevent reclaiming very
>> much of the peak RAM usage. It'd be hard to be sure without some sort
>> of memory map, of course.
>
> Yes, that's something I was thinking about too - I believe what happens
> is that allocations of info in TopMemoryContext and the actual contexts
> are interleaved, and at the end only the memory contexts are deleted.
> The blocks allocated in TopMemoryContext are kept, creating the
> "islands".

On second thought, I'm not sure this explains the continuous increase of consumed memory. When the first iteration consumes 2,818g of memory, why should the following iterations consume significantly more? The allocation patterns should be (almost) exactly the same, reusing the already allocated memory (either at the system level or at the TopMemoryContext level).

regards
Tomas
On 2.12.2014 02:52, Tom Lane wrote:
> Tomas Vondra <tv@fuzzy.cz> writes:
>> That'd probably explain the TopMemoryContext size, because
>> array_agg() creates separate context for each group. So if you have
>> 1M groups, you have 1M contexts. Although I don't see how the size
>> of those contexts would matter?
>
> Well, if they're each 6K, that's your 6GB right there.

FWIW, I tried to repeat the test with the array_agg() patch submitted to the next CF:

    http://www.postgresql.org/message-id/5479FAEA.3070501@fuzzy.cz

and the memory context stats after the first iteration look like this:

    master (unpatched)
    ------------------
    TopMemoryContext: 204878128 total in 25011 blocks; 204010192 free
    (750034 chunks); 867936 used

    patched with array_agg
    ----------------------
    TopMemoryContext: 78128 total in 11 blocks; 10144 free (34 chunks);
    67984 used

Also, the RSS issue (which was the initial subject of this thread) pretty much disappeared. So indeed, this seems to be caused by the "islands" referenced from TopMemoryContext.

I wonder whether it'd be worth looking for other places doing something similar to array_agg(), potentially creating many contexts in parallel, and maybe modifying them as in the patch.

regards
Tomas