Thread: OOM in hash join
Hi hackers,

A too-small value of work_mem causes memory overflow in parallel hash join
because of too large a number of batches.

Here is the plan:

explain SELECT * FROM solixschema.MIG_50GB_APR04_G1_H a join
solixschema.MIG_50GB_APR04_G2_H b on a.seq_pk = b.seq_pk join
solixschema.MIG_50GB_APR04_G3_H c on b.seq_pk = c.seq_pk join
solixschema.MIG_50GB_APR04_G4_H d on c.seq_pk = d.seq_pk join
solixschema.MIG_50GB_APR04_G5_H e on d.seq_pk = e.seq_pk;

                                                          QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=205209076.76..598109290.40 rows=121319744 width=63084)
   Workers Planned: 8
   ->  Parallel Hash Join  (cost=205208076.76..585976316.00 rows=15164968 width=63084)
         Hash Cond: (b.seq_pk = a.seq_pk)
         ->  Parallel Hash Join  (cost=55621683.59..251148173.17 rows=14936978 width=37851)
               Hash Cond: (b.seq_pk = c.seq_pk)
               ->  Parallel Hash Join  (cost=27797595.68..104604780.40 rows=15346430 width=25234)
                     Hash Cond: (b.seq_pk = d.seq_pk)
                     ->  Parallel Seq Scan on mig_50gb_apr04_g2_h b  (cost=0.00..4021793.90 rows=15783190 width=12617)
                     ->  Parallel Hash  (cost=3911716.30..3911716.30 rows=15346430 width=12617)
                           ->  Parallel Seq Scan on mig_50gb_apr04_g4_h d  (cost=0.00..3911716.30 rows=15346430 width=12617)
               ->  Parallel Hash  (cost=3913841.85..3913841.85 rows=15362085 width=12617)
                     ->  Parallel Seq Scan on mig_50gb_apr04_g3_h c  (cost=0.00..3913841.85 rows=15362085 width=12617)
         ->  Parallel Hash  (cost=102628306.07..102628306.07 rows=15164968 width=25233)
               ->  Parallel Hash Join  (cost=27848049.61..102628306.07 rows=15164968 width=25233)
                     Hash Cond: (a.seq_pk = e.seq_pk)
                     ->  Parallel Seq Scan on mig_50gb_apr04_g1_h a  (cost=0.00..3877018.68 rows=15164968 width=12617)
                     ->  Parallel Hash  (cost=3921510.05..3921510.05 rows=15382205 width=12616)
                           ->  Parallel Seq Scan on mig_50gb_apr04_g5_h e  (cost=0.00..3921510.05 rows=15382205 width=12616)

work_mem is 4MB and the leader + two parallel workers consume about 10Gb each.
There are 262144 batches:

(gdb) p *hjstate->hj_HashTable
$2 = {nbuckets = 1024, log2_nbuckets = 10, nbuckets_original = 1024,
  nbuckets_optimal = 1024, log2_nbuckets_optimal = 10,
  buckets = {unshared = 0x7fa5d5211000, shared = 0x7fa5d5211000},
  keepNulls = false, skewEnabled = false, skewBucket = 0x0, skewBucketLen = 0,
  nSkewBuckets = 0, skewBucketNums = 0x0, nbatch = 262144, curbatch = 86506,
  nbatch_original = 262144, nbatch_outstart = 262144, growEnabled = true,
  totalTuples = 122600000, partialTuples = 61136408, skewTuples = 0,
  innerBatchFile = 0x0, outerBatchFile = 0x0,
  outer_hashfunctions = 0x55ce086a3288, inner_hashfunctions = 0x55ce086a32d8,
  hashStrict = 0x55ce086a3328, collations = 0x55ce086a3340, spaceUsed = 0,
  spaceAllowed = 8388608, spacePeak = 204800, spaceUsedSkew = 0,
  spaceAllowedSkew = 167772, hashCxt = 0x55ce086a3170,
  batchCxt = 0x55ce086a5180, chunks = 0x0, current_chunk = 0x7fa5d5283000,
  area = 0x55ce085b56d8, parallel_state = 0x7fa5ee993520,
  batches = 0x7fa5d3ff8048, current_chunk_shared = 1099512193024}

The biggest memory contexts are:

ExecutorState: 1362623568
  HashTableContext: 102760280
    HashBatchContext: 7968
  HashTableContext: 178257752
    HashBatchContext: 7968
  HashTableContext: 5306745728
    HashBatchContext: 7968

There is still some gap between the size reported by the memory context dump
and the actual size of the backend. But it seems obvious that, in trying to
fit in work_mem, the shared tuplestore creates so many batches that they
consume much more memory than work_mem.
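For a rough sense of where that memory goes (a back-of-the-envelope sketch,
not a measurement taken from the report above): with nbatch = 262144, the
batch file buffers alone dwarf work_mem, assuming roughly one BLCKSZ-sized
(8 kB) buffer per open batch file and one inner plus one outer file per batch
in each process.

/*
 * Back-of-the-envelope sketch (not PostgreSQL source code): estimate the
 * memory taken by batch file buffers alone, assuming roughly one
 * BLCKSZ-sized (8 kB) buffer per open batch file and one inner plus one
 * outer file per batch in each process.  Real per-batch bookkeeping in a
 * Parallel Hash Join is larger still, so this is a lower bound.
 */
#include <stdio.h>

int
main(void)
{
    const long long nbatch = 262144;      /* from the gdb dump above */
    const long long buf_size = 8192;      /* BLCKSZ-sized buffer per file */
    const long long files_per_batch = 2;  /* inner + outer */
    const long long work_mem = 4LL * 1024 * 1024;

    long long per_process = nbatch * files_per_batch * buf_size;

    printf("batch file buffers per process: %lld MB\n",
           per_process / (1024 * 1024));
    printf("that is %lld times work_mem (4 MB)\n",
           per_process / work_mem);
    return 0;
}

Under those assumptions this comes to about 4 GB per process before counting
any tuples, which is already in the same ballpark as the ~10 GB per backend
observed above.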
On Fri, 14 Apr 2023 at 12:59, Konstantin Knizhnik <knizhnik@garret.ru> wrote:
>
> Hi hackers,
>
> A too-small value of work_mem causes memory overflow in parallel hash join
> because of too large a number of batches.
>
> Here is the plan:

[...]

> There is still some gap between the size reported by the memory context dump
> and the actual size of the backend. But it seems obvious that, in trying to
> fit in work_mem, the shared tuplestore creates so many batches that they
> consume much more memory than work_mem.

The same issue [0] was reported a few weeks ago, with the same diagnosis
here [1]. I think it's being worked on over there.

Kind regards,

Matthias van de Meent

[0] https://www.postgresql.org/message-id/flat/20230228190643.1e368315%40karst
[1] https://www.postgresql.org/message-id/flat/3013398b-316c-638f-2a73-3783e8e2ef02%40enterprisedb.com#ceb9e14383122ade8b949b7479c6f7e2
On Fri, Apr 14, 2023 at 10:59 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote:
> A too-small value of work_mem causes memory overflow in parallel hash join
> because of too large a number of batches.

Yeah.  Not only in parallel hash join, but in any hash join (admittedly
parallel hash join has higher per-batch overheads; that is perhaps something
we could improve).  That's why we tried to invent an alternative strategy
where, at some point, you loop over batches N times instead of making more
batches:

https://www.postgresql.org/message-id/flat/CA+hUKGKWWmf=WELLG=aUGbcugRaSQbtm0tKYiBut-B2rVKX63g@mail.gmail.com

That thread starts out talking about 'extreme skew' etc, but the more general
problem is that, at some point, even with perfectly evenly distributed keys,
adding more batches requires more memory than you can save by doing so.

Sure, it's a problem that we don't account for that memory properly, as
complained about here:

https://www.postgresql.org/message-id/flat/20190504003414.bulcbnge3rhwhcsh@development

If you did have perfect prediction of every byte you will need, maybe you
could say: oh, well, we just don't have enough memory for a hash join, so
let's do a sort/merge instead.  But you can't, because (1) some types aren't
merge-joinable, and (2) in reality sometimes you've already started the hash
join due to imperfect stats, so it's too late to change strategies.
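To make the "more batches can cost more than they save" point concrete, here
is a toy model (illustrative only, not code from either thread; the ~60 GB
inner-side size is an assumption, and real per-batch overheads are higher
than a bare 16 kB). The in-memory hash table shrinks as inner_bytes / nbatch
while the batch file buffers grow as roughly nbatch * 16 kB, so past the
crossover point doubling nbatch makes the total footprint worse.

/*
 * Toy model of the trade-off: hash table memory ~ inner_bytes / nbatch,
 * batch buffer memory ~ nbatch * 2 * 8 kB (inner + outer files).  Doubling
 * nbatch past the crossover point makes the total grow instead of shrink.
 */
#include <stdio.h>

int
main(void)
{
    const double inner_bytes = 60.0 * 1024 * 1024 * 1024; /* assumed inner side */
    const double buf_per_batch = 2.0 * 8192;               /* inner + outer buffers */

    for (long nbatch = 1024; nbatch <= 262144; nbatch *= 2)
    {
        double hashtable = inner_bytes / nbatch;
        double batch_bufs = nbatch * buf_per_batch;

        printf("nbatch=%7ld  hash table=%6.0f MB  batch buffers=%6.0f MB  total=%6.0f MB\n",
               nbatch,
               hashtable / (1024 * 1024),
               batch_bufs / (1024 * 1024),
               (hashtable + batch_bufs) / (1024 * 1024));
    }
    return 0;
}

With those assumptions the total bottoms out at a couple of thousand batches;
pushing on towards the 262144 batches needed to chase a 4 MB work_mem target
only inflates the footprint, which is the curve the quoted report ran into.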
On Fri, 14 Apr 2023 13:21:05 +0200
Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:

> On Fri, 14 Apr 2023 at 12:59, Konstantin Knizhnik <knizhnik@garret.ru> wrote:
> >
> > Hi hackers,
> >
> > A too-small value of work_mem causes memory overflow in parallel hash join
> > because of too large a number of batches.
> >
> > Here is the plan:
>
> [...]
>
> > There is still some gap between the size reported by the memory context
> > dump and the actual size of the backend. But it seems obvious that, in
> > trying to fit in work_mem, the shared tuplestore creates so many batches
> > that they consume much more memory than work_mem.

Indeed. The memory consumed by the batches is not accounted for, and the
consumption reported by EXPLAIN ANALYZE is wrong.

Would you be able to test the latest patchset posted [1]? This does not fix
the work_mem overflow, but it helps to keep the number of batches balanced
and acceptable. Any feedback, comment or review would be useful.

[1] https://www.postgresql.org/message-id/flat/20230408020119.32a0841b%40karst#616c1f41fcc10e8f89d41e8e5693618c

Regards,
On Fri, Apr 14, 2023 at 11:43 PM Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote:
> Would you be able to test the latest patchset posted [1]? This does not fix
> the work_mem overflow, but it helps to keep the number of batches balanced
> and acceptable. Any feedback, comment or review would be useful.
>
> [1] https://www.postgresql.org/message-id/flat/20230408020119.32a0841b%40karst#616c1f41fcc10e8f89d41e8e5693618c

Hi Jehan-Guillaume.  I hadn't paid attention to that thread before, probably
due to timing, the subject, and erm ETOOMANYTHREADS.  Thanks for all the work
you've done to study this area, and also to review and summarise the previous
writing/patches/ideas.