Thread: OOM in hash join
Hi hackers,

A too-small value of work_mem causes memory overflow in parallel hash join
because of too large a number of batches.

Here is the plan:

explain SELECT * FROM solixschema.MIG_50GB_APR04_G1_H a join
solixschema.MIG_50GB_APR04_G2_H b on a.seq_pk = b.seq_pk join
solixschema.MIG_50GB_APR04_G3_H c on b.seq_pk = c.seq_pk join
solixschema.MIG_50GB_APR04_G4_H d on c.seq_pk = d.seq_pk join
solixschema.MIG_50GB_APR04_G5_H e on d.seq_pk = e.seq_pk;

                                                          QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=205209076.76..598109290.40 rows=121319744 width=63084)
   Workers Planned: 8
   ->  Parallel Hash Join  (cost=205208076.76..585976316.00 rows=15164968 width=63084)
         Hash Cond: (b.seq_pk = a.seq_pk)
         ->  Parallel Hash Join  (cost=55621683.59..251148173.17 rows=14936978 width=37851)
               Hash Cond: (b.seq_pk = c.seq_pk)
               ->  Parallel Hash Join  (cost=27797595.68..104604780.40 rows=15346430 width=25234)
                     Hash Cond: (b.seq_pk = d.seq_pk)
                     ->  Parallel Seq Scan on mig_50gb_apr04_g2_h b  (cost=0.00..4021793.90 rows=15783190 width=12617)
                     ->  Parallel Hash  (cost=3911716.30..3911716.30 rows=15346430 width=12617)
                           ->  Parallel Seq Scan on mig_50gb_apr04_g4_h d  (cost=0.00..3911716.30 rows=15346430 width=12617)
               ->  Parallel Hash  (cost=3913841.85..3913841.85 rows=15362085 width=12617)
                     ->  Parallel Seq Scan on mig_50gb_apr04_g3_h c  (cost=0.00..3913841.85 rows=15362085 width=12617)
         ->  Parallel Hash  (cost=102628306.07..102628306.07 rows=15164968 width=25233)
               ->  Parallel Hash Join  (cost=27848049.61..102628306.07 rows=15164968 width=25233)
                     Hash Cond: (a.seq_pk = e.seq_pk)
                     ->  Parallel Seq Scan on mig_50gb_apr04_g1_h a  (cost=0.00..3877018.68 rows=15164968 width=12617)
                     ->  Parallel Hash  (cost=3921510.05..3921510.05 rows=15382205 width=12616)
                           ->  Parallel Seq Scan on mig_50gb_apr04_g5_h e  (cost=0.00..3921510.05 rows=15382205 width=12616)

work_mem is 4MB and the leader + two parallel workers consume about 10Gb each.
There are 262144 batches:

(gdb) p *hjstate->hj_HashTable
$2 = {nbuckets = 1024, log2_nbuckets = 10, nbuckets_original = 1024,
  nbuckets_optimal = 1024, log2_nbuckets_optimal = 10,
  buckets = {unshared = 0x7fa5d5211000, shared = 0x7fa5d5211000},
  keepNulls = false, skewEnabled = false, skewBucket = 0x0, skewBucketLen = 0,
  nSkewBuckets = 0, skewBucketNums = 0x0, nbatch = 262144, curbatch = 86506,
  nbatch_original = 262144, nbatch_outstart = 262144, growEnabled = true,
  totalTuples = 122600000, partialTuples = 61136408, skewTuples = 0,
  innerBatchFile = 0x0, outerBatchFile = 0x0,
  outer_hashfunctions = 0x55ce086a3288, inner_hashfunctions = 0x55ce086a32d8,
  hashStrict = 0x55ce086a3328, collations = 0x55ce086a3340, spaceUsed = 0,
  spaceAllowed = 8388608, spacePeak = 204800, spaceUsedSkew = 0,
  spaceAllowedSkew = 167772, hashCxt = 0x55ce086a3170,
  batchCxt = 0x55ce086a5180, chunks = 0x0, current_chunk = 0x7fa5d5283000,
  area = 0x55ce085b56d8, parallel_state = 0x7fa5ee993520,
  batches = 0x7fa5d3ff8048, current_chunk_shared = 1099512193024}

The biggest memory contexts are:

ExecutorState: 1362623568
  HashTableContext: 102760280
    HashBatchContext: 7968
  HashTableContext: 178257752
    HashBatchContext: 7968
  HashTableContext: 5306745728
    HashBatchContext: 7968

There is still some gap between the size reported by the memory context dump
and the actual size of the backend. But it seems obvious that, in trying to
fit in work_mem, the shared tuplestore creates so many batches that they
consume much more memory than work_mem.
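For a rough sense of where that memory goes (a back-of-the-envelope sketch,
not a measurement taken from the report above): with nbatch = 262144, the
batch file buffers alone dwarf work_mem, assuming roughly one BLCKSZ-sized
(8 kB) buffer per open batch file and one inner plus one outer file per batch
in each process.

/*
 * Back-of-the-envelope sketch (not PostgreSQL source code): estimate the
 * memory taken by batch file buffers alone, assuming roughly one
 * BLCKSZ-sized (8 kB) buffer per open batch file and one inner plus one
 * outer file per batch in each process.  Real per-batch bookkeeping in a
 * Parallel Hash Join is larger still, so this is a lower bound.
 */
#include <stdio.h>

int
main(void)
{
    const long long nbatch = 262144;      /* from the gdb dump above */
    const long long buf_size = 8192;      /* BLCKSZ-sized buffer per file */
    const long long files_per_batch = 2;  /* inner + outer */
    const long long work_mem = 4LL * 1024 * 1024;

    long long per_process = nbatch * files_per_batch * buf_size;

    printf("batch file buffers per process: %lld MB\n",
           per_process / (1024 * 1024));
    printf("that is %lld times work_mem (4 MB)\n",
           per_process / work_mem);
    return 0;
}

Under those assumptions this comes to about 4 GB per process before counting
any tuples, which is already in the same ballpark as the ~10 GB per backend
observed above.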
On Fri, 14 Apr 2023 at 12:59, Konstantin Knizhnik <knizhnik@garret.ru> wrote:
>
> Hi hackers,
>
> A too-small value of work_mem causes memory overflow in parallel hash join
> because of too large a number of batches.
>
> Here is the plan:

[...]

> There is still some gap between the size reported by the memory context dump
> and the actual size of the backend. But it seems obvious that, in trying to
> fit in work_mem, the shared tuplestore creates so many batches that they
> consume much more memory than work_mem.

The same issue [0] was reported a few weeks ago, with the same diagnosis
here [1]. I think it's being worked on over there.

Kind regards,

Matthias van de Meent

[0] https://www.postgresql.org/message-id/flat/20230228190643.1e368315%40karst
[1] https://www.postgresql.org/message-id/flat/3013398b-316c-638f-2a73-3783e8e2ef02%40enterprisedb.com#ceb9e14383122ade8b949b7479c6f7e2
On Fri, Apr 14, 2023 at 10:59 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote:
> A too-small value of work_mem causes memory overflow in parallel hash join
> because of too large a number of batches.

Yeah.  Not only in parallel hash join, but in any hash join (admittedly
parallel hash join has higher per-batch overheads; that is perhaps something
we could improve).  That's why we tried to invent an alternative strategy
where, at some point, you loop over batches N times instead of making more
batches:

https://www.postgresql.org/message-id/flat/CA+hUKGKWWmf=WELLG=aUGbcugRaSQbtm0tKYiBut-B2rVKX63g@mail.gmail.com

That thread starts out talking about 'extreme skew' etc, but the more general
problem is that, at some point, even with perfectly evenly distributed keys,
adding more batches requires more memory than you can save by doing so.

Sure, it's a problem that we don't account for that memory properly, as
complained about here:

https://www.postgresql.org/message-id/flat/20190504003414.bulcbnge3rhwhcsh@development

If you did have perfect prediction of every byte you will need, maybe you
could say: oh, well, we just don't have enough memory for a hash join, so
let's do a sort/merge instead.  But you can't, because (1) some types aren't
merge-joinable, and (2) in reality sometimes you've already started the hash
join due to imperfect stats, so it's too late to change strategies.
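To make the "more batches can cost more than they save" point concrete, here
is a toy model (illustrative only, not code from either thread; the ~60 GB
inner-side size is an assumption, and real per-batch overheads are higher
than a bare 16 kB). The in-memory hash table shrinks as inner_bytes / nbatch
while the batch file buffers grow as roughly nbatch * 16 kB, so past the
crossover point doubling nbatch makes the total footprint worse.

/*
 * Toy model of the trade-off: hash table memory ~ inner_bytes / nbatch,
 * batch buffer memory ~ nbatch * 2 * 8 kB (inner + outer files).  Doubling
 * nbatch past the crossover point makes the total grow instead of shrink.
 */
#include <stdio.h>

int
main(void)
{
    const double inner_bytes = 60.0 * 1024 * 1024 * 1024; /* assumed inner side */
    const double buf_per_batch = 2.0 * 8192;               /* inner + outer buffers */

    for (long nbatch = 1024; nbatch <= 262144; nbatch *= 2)
    {
        double hashtable = inner_bytes / nbatch;
        double batch_bufs = nbatch * buf_per_batch;

        printf("nbatch=%7ld  hash table=%6.0f MB  batch buffers=%6.0f MB  total=%6.0f MB\n",
               nbatch,
               hashtable / (1024 * 1024),
               batch_bufs / (1024 * 1024),
               (hashtable + batch_bufs) / (1024 * 1024));
    }
    return 0;
}

With those assumptions the total bottoms out at a couple of thousand batches;
pushing on towards the 262144 batches needed to chase a 4 MB work_mem target
only inflates the footprint, which is the curve the quoted report ran into.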
On Fri, 14 Apr 2023 13:21:05 +0200
Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:

> On Fri, 14 Apr 2023 at 12:59, Konstantin Knizhnik <knizhnik@garret.ru> wrote:
> >
> > Hi hackers,
> >
> > A too-small value of work_mem causes memory overflow in parallel hash join
> > because of too large a number of batches.
> >
> > Here is the plan:
>
> [...]
>
> > There is still some gap between the size reported by the memory context
> > dump and the actual size of the backend. But it seems obvious that, in
> > trying to fit in work_mem, the shared tuplestore creates so many batches
> > that they consume much more memory than work_mem.

Indeed. The memory consumed by the batches is not accounted for, and the
consumption reported by EXPLAIN ANALYZE is wrong.

Would you be able to test the latest patchset posted [1]? This does not fix
the work_mem overflow, but it helps to keep the number of batches balanced
and acceptable. Any feedback, comment or review would be useful.

[1] https://www.postgresql.org/message-id/flat/20230408020119.32a0841b%40karst#616c1f41fcc10e8f89d41e8e5693618c

Regards,
On Fri, Apr 14, 2023 at 11:43 PM Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote:
> Would you be able to test the latest patchset posted [1]? This does not fix
> the work_mem overflow, but it helps to keep the number of batches balanced
> and acceptable. Any feedback, comment or review would be useful.
>
> [1] https://www.postgresql.org/message-id/flat/20230408020119.32a0841b%40karst#616c1f41fcc10e8f89d41e8e5693618c

Hi Jehan-Guillaume.  I hadn't paid attention to that thread before, probably
due to timing, the subject, and erm ETOOMANYTHREADS.  Thanks for all the work
you've done to study this area, and also to review and summarise the previous
writing/patches/ideas.