Thread: OOM with many sorts

OOM with many sorts

From: Justin Pryzby
One of our instances was killed:

| Out of memory: Kill process 27670 (postmaster) score 381 or sacrifice child
| Killed process 27670 (postmaster) total-vm:13207796kB, anon-rss:5736kB, file-rss:0kB, shmem-rss:12421696kB

The VM has 32GB RAM, with shared_buffers=12GB and work_mem=128MB, and runs
only PostgreSQL 11.3.

Trying to diagnose the issue, I ran the query over a narrower range of dates;
it looks to me like the problem is the large number of Sort nodes per worker.

Part of the query plan looks like this:

     [...]
     ->  Finalize GroupAggregate (actual rows=1788 loops=1)
       Group Key: huawei_umts_ucell_20181217.sect_id, (to_char(huawei_umts_ucell_20181217.start_time,
'YYYY-MM'::text))
       ->  Gather Merge (actual rows=56921 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort (actual rows=18974 loops=3)
               Sort Key: huawei_umts_ucell_20181217.sect_id, (to_char(huawei_umts_ucell_20181217.start_time,
'YYYY-MM'::text))
               Sort Method: quicksort  Memory: 36499kB
               Worker 0:  Sort Method: quicksort  Memory: 40275kB
               Worker 1:  Sort Method: quicksort  Memory: 40263kB
               ->  Parallel Append (actual rows=18974 loops=3)
                 ->  Partial GroupAggregate (actual rows=1783 loops=1)
                   Group Key: huawei_umts_ucell_20181217.sect_id, (to_char(huawei_umts_ucell_20181217.start_time,
'YYYY-MM'::text))
                   ->  Sort (actual rows=42792 loops=1)
                     Sort Key: huawei_umts_ucell_20181217.sect_id, (to_char(huawei_umts_ucell_20181217.start_time,
'YYYY-MM'::text))
                     Worker 0:  Sort Method: quicksort  Memory: 44997kB
                     ->  Parallel Seq Scan on huawei_umts_ucell_20181217 (actual rows=42792 loops=1)
                           Filter: ((start_time >= '2018-12-01 00:00:00'::timestamp without time zone) AND (start_time
<'2019-01-01 00:00:00'::timestamp without time zone))
 
                 ->  Partial GroupAggregate (actual rows=1783 loops=1)
                   Group Key: huawei_umts_ucell_20181216.sect_id, (to_char(huawei_umts_ucell_20181216.start_time,
'YYYY-MM'::text))
                   ->  Sort (actual rows=42792 loops=1)
                     Sort Key: huawei_umts_ucell_20181216.sect_id, (to_char(huawei_umts_ucell_20181216.start_time,
'YYYY-MM'::text))
                     Worker 1:  Sort Method: quicksort  Memory: 44997kB
                     ->  Parallel Seq Scan on huawei_umts_ucell_20181216 (actual rows=42792 loops=1)
                           Filter: ((start_time >= '2018-12-01 00:00:00'::timestamp without time zone) AND (start_time
<'2019-01-01 00:00:00'::timestamp without time zone))
 

                 [...many partial GroupAggregate/Sort repeated here for ~200 child tables...]

This isn't the first time I've had to reduce work_mem on a parallel query to
avoid OOM, but it seems unreasonable to be concerned with 50MB sorts.

It looks like the results of each Sort node stay in RAM while additional Sort
nodes are processed (is that required?)

I SET max_parallel_workers_per_gather=0 and dumped the memory contexts:
...
    PortalContext: 203948152 total in 25756 blocks; 176600 free (319 chunks); 203771552 used:
...
Grand total: 1918118632 bytes in 30716 blocks; 192472208 free (3685 chunks); 1725646424 used

The grand total shows ~1700MB, but the largest single context is only ~200MB;
the "Caller tuples" contexts together account for ~1400MB:

# sum the "total" bytes of every "Caller tuples" context and print it in GB
awk '/Caller tuples/{s+=$3}END{print s/1024^3}' /tmp/mem-context
1.44043
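
(Such a per-backend dump can be produced by attaching gdb to the backend and
calling MemoryContextStats(TopMemoryContext); the per-context statistics are
written to the process's stderr, which normally ends up in the server log.)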

I'm attaching the full plan and the memory context dump.

The behavior doesn't seem to have changed under pg12b2, which is concerning
since PG12 advertises graceful support for "thousands" of partitions.  It
seems to me that would require a tiny work_mem, which would be devastating to
some workloads.  This is not a contrived test case; it's one of our reports
run across ~10 months of history with weekly granularity.

Thanks in advance for any advice.

Justin


Re: OOM with many sorts

From: Thomas Munro
On Tue, Jul 9, 2019 at 4:44 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
>                              [...many partial GroupAggregate/Sort repeated here for ~200 child tables...]
>
> This isn't the first time I've had to reduce work_mem on a parallel query to
> avoid OOM, but it seems unreasonable to be concerned with 50MB sorts.

This is a fundamental problem with our model of memory budgets.  To
oversimplify massively, we can use about work_mem * ntables * the
number of parallel workers.  Core counts are going up, and now we have
a convenient way to get large values of ntables.  One of many -hackers
threads to discuss the problem:

https://www.postgresql.org/message-id/flat/CAH2-WzmNwV%3DLfDRXPsmCqgmm91mp%3D2b4FvXNF%3DcCvMrb8YFLfQ%40mail.gmail.com
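
Plugging this report's numbers into that oversimplified formula gives a rough
sense of the ceiling, e.g. in psql:

    SELECT pg_size_pretty(128 * 200 * 3 * 1024 * 1024::numeric);
    -- work_mem (128MB) * ~200 partitions * 3 processes (leader + 2 workers) = 75 GB

The real peak is of course lower (in this plan each partition's sort runs in
only one of the three processes, and the sorts used 40-50MB rather than the
full 128MB), but it's easy to see how a single backend climbs into the
gigabytes, which is what your "Caller tuples" total shows.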

> It looks like the results of each Sort node stay in RAM while additional Sort
> nodes are processed (is that required?)

That's a very interesting question linked to whole-query memory
budgets.  If you don't have a whole-query memory budget, then you have
nothing to lose by keeping hash and sort results in memory, and you
gain the ability to do cheap rescans (if the node happens to be in a
nested loop).  I discussed this in the context of hash joins over here:

https://www.postgresql.org/message-id/CAEepm%3D0N6DODN7nx6Zb93YOW-y%3DRftNNFZJRaLyG6jbJHJVjsA%40mail.gmail.com

The TL;DR is that some other databases throw out e.g. hash tables as
soon as possible, and consider changing the shape of their join
nesting to minimise the total number of hash tables in memory at once,
so that they come in under a certain budget for memory used at any one
point in time.  That is, they don't just divide query_work_mem up over
all the nodes; they understand things about when nodes will allocate
and free memory.

There are some less clever things we can do, though, before we tackle
the big problems involved.  We could probably still opportunistically
give back memory sooner, when we know there is no chance of rescan,
and other things along those lines.

> Thanks in advance for any advice.

I think it's impossible to choose a single value for work_mem if you
have a mixture of types of queries that hit wildly different numbers
of partitions and workers.  I think this is an ongoing topic for
-hackers.
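
In the meantime the practical mitigation is roughly what you already
describe: override work_mem only for the partition-heavy reports rather than
lowering it globally, along the lines of (the value here is illustrative):

    BEGIN;
    SET LOCAL work_mem = '16MB';  -- applies to this transaction only
    -- ... run the partition-heavy report ...
    COMMIT;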

-- 
Thomas Munro
https://enterprisedb.com