From: Tomas Vondra
Subject: Re: Fix overflow of nbatch
Msg-id: 244dc6c1-3b3d-4de2-b3de-b1511e6a6d10@vondra.me
In response to: Re: Fix overflow of nbatch (Nathan Bossart <nathandbossart@gmail.com>)
List: pgsql-hackers

Hi,

Here are a couple of draft patches fixing the bug:

- 0001 adds the missing size_t cast, to fix the overflow

- 0002 fixes the balancing, by adjusting the hash table size limit

- 0003 adds the missing overflow protection for nbuckets and the hash
table limit

- 0004 rewords the comment explaining how the balancing works. Reading
it after a couple of months, I found it overly verbose and not very
clear. I'm sure it could be improved even further.


0001 and 0002 are pretty trivial. 0003 is a bit bigger, but most of the
code simply mirrors how we clamp nbuckets elsewhere (more or less).
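
To illustrate both fixes, here's a sketch (with made-up variable names,
not the actual patch):

  /* 0001: evaluated in int, 2 * nbatch * BLCKSZ overflows once
   * nbatch reaches 131072 (with 8kB blocks the product is 2^31);
   * casting one operand to Size forces 64-bit arithmetic */
  batch_bytes = 2 * nbatch * (Size) BLCKSZ;

  /* 0003: the usual clamping pattern for nbuckets, keeping it a
   * power of two and within what palloc can allocate */
  nbuckets = Max(nbuckets, 1024);
  nbuckets = pg_nextpower2_32(nbuckets);
  nbuckets = (int) Min((uint64) nbuckets,
                       MaxAllocSize / sizeof(HashJoinTuple));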

At some point I started wondering whether it would have been simpler to
use the algebraic solution posted by James Hunter back in February [1].
It wouldn't need the loop, but it'd still need all this new overflow
protection etc.
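
As I understand it, the algebraic approach minimizes the total memory
directly instead of iterating (a sketch of the idea, not James's actual
formula):

  /* total(W) = W + 2 * BLCKSZ * (inner_bytes / W) is minimized
   * at W = sqrt(2 * BLCKSZ * inner_bytes), i.e. at the point
   * where the hash table and the batch files use the same
   * amount of memory - which is also where the loop stops */
  double  W = sqrt(2.0 * BLCKSZ * inner_rel_bytes);

  nbatch = (int) ceil(inner_rel_bytes / W);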


I wanted to make sure the patches actually make it work correctly, so I
created a table with 4B rows:

  create table t (a bigint, b text);
  insert into t select i, md5(i::text)
                  from generate_series(1,4000000000) s(i);

and I added this log message at the end of ExecChooseHashTableSize:

  elog(WARNING, "wm %d nbatch %d nbucket %d hash %zu total %zu",
       work_mem, nbatch, nbuckets, (*space_allowed) / 1024,
       (*space_allowed + 2 * nbatch * (Size) BLCKSZ) / 1024);

and I ran an explain on a self-join

  set enable_mergejoin = off;
  set max_parallel_workers_per_gather = 0;
  set work_mem = '...';

  explain select * from t t1 join t t2 on (t1.a = t2.a);

with work_mem set to values between 64kB and 1GB.

On 18.0 I got this:

  wm 64 nbatch 8 nbucket 2097152 hash 131072 total 131200
  wm 128 nbatch 16 nbucket 4194304 hash 262144 total 262400
  wm 256 nbatch 32 nbucket 8388608 hash 524288 total 524800
  wm 512 nbatch 64 nbucket 16777216 hash 1048576 total 1049600
  wm 1024 nbatch 128 nbucket 33554432 hash 2097152 total 2099200
  wm 2048 nbatch 256 nbucket 33554432 hash 2097152 total 2101248
  wm 4096 nbatch 512 nbucket 16777216 hash 1048576 total 1056768
  wm 8192 nbatch 1024 nbucket 8388608 hash 524288 total 540672
  wm 16384 nbatch 2048 nbucket 4194304 hash 262144 total 294912
  wm 32768 nbatch 4096 nbucket 2097152 hash 131072 total 196608
  wm 65536 nbatch 4096 nbucket 2097152 hash 131072 total 196608
  wm 131072 nbatch 2048 nbucket 4194304 hash 262144 total 294912
  wm 262144 nbatch 1024 nbucket 8388608 hash 524288 total 540672
  wm 524288 nbatch 512 nbucket 16777216 hash 1048576 total 1056768
  wm 1048576 nbatch 256 nbucket 33554432 hash 2097152 total 2101248

I wanted to know how serious the issue is, compared to what would happen
without the balancing. So I disabled the balancing (by skipping the
loop), and got this:

  wm 64 nbatch 8192 nbucket 2048 hash 128 total 131200
  wm 128 nbatch 16384 nbucket 4096 hash 256 total 262400
  wm 256 nbatch 32768 nbucket 8192 hash 512 total 524800
  wm 512 nbatch 65536 nbucket 16384 hash 1024 total 1049600
  wm 1024 nbatch 131072 nbucket 32768 hash 2048 total 2099200
  wm 2048 nbatch 131072 nbucket 65536 hash 4096 total 2101248
  wm 4096 nbatch 65536 nbucket 131072 hash 8192 total 1056768
  wm 8192 nbatch 32768 nbucket 262144 hash 16384 total 540672
  wm 16384 nbatch 16384 nbucket 524288 hash 32768 total 294912
  wm 32768 nbatch 8192 nbucket 1048576 hash 65536 total 196608
  wm 65536 nbatch 4096 nbucket 2097152 hash 131072 total 196608
  wm 131072 nbatch 2048 nbucket 4194304 hash 262144 total 294912
  wm 262144 nbatch 1024 nbucket 8388608 hash 524288 total 540672
  wm 524288 nbatch 512 nbucket 16777216 hash 1048576 total 1056768
  wm 1048576 nbatch 256 nbucket 33554432 hash 2097152 total 2101248

The interesting bit is that the expected total memory usage (the last
number in the log line) is exactly the same with and without the
balancing. IIUC this is because the loop's stop condition uses the
initial hash table size. It makes me a bit less worried about this
triggering OOM crashes - the balancing does not improve the behavior,
but it doesn't use more memory than before either. Still an
embarrassing bug, though.
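
For example, with work_mem=64kB (and 8kB blocks) the two cases are
mirror images of each other, trading hash table memory for batch-file
memory:

  without balancing:  128 + 2 * 8192 * 8 = 131200 kB
  with balancing:     131072 + 2 * 8 * 8 = 131200 kB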

With the attached patches, this looks like this:

  wm 64 nbatch 256 nbucket 65536 hash 4096 total 8192
  wm 128 nbatch 512 nbucket 131072 hash 8192 total 16384
  wm 256 nbatch 1024 nbucket 262144 hash 16384 total 32768
  wm 512 nbatch 2048 nbucket 524288 hash 32768 total 65536
  wm 1024 nbatch 4096 nbucket 1048576 hash 65536 total 131072
  wm 2048 nbatch 4096 nbucket 2097152 hash 131072 total 196608
  wm 4096 nbatch 4096 nbucket 2097152 hash 131072 total 196608
  wm 8192 nbatch 4096 nbucket 2097152 hash 131072 total 196608
  wm 16384 nbatch 4096 nbucket 2097152 hash 131072 total 196608
  wm 32768 nbatch 4096 nbucket 2097152 hash 131072 total 196608
  wm 65536 nbatch 4096 nbucket 2097152 hash 131072 total 196608
  wm 131072 nbatch 2048 nbucket 4194304 hash 262144 total 294912
  wm 262144 nbatch 1024 nbucket 8388608 hash 524288 total 540672
  wm 524288 nbatch 512 nbucket 16777216 hash 1048576 total 1056768
  wm 1048576 nbatch 256 nbucket 33554432 hash 2097152 total 2101248

So, this time it actually seems to work correctly and significantly
reduces the memory usage ...
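
For the smallest work_mem values the loop now stops exactly where the
hash table and the batch files use the same amount of memory, e.g. for
work_mem=64kB:

  total = 4096 + 2 * 256 * 8 = 4096 + 4096 = 8192 kB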

There's one weird thing remaining - if you look at nbatch, it actually
increases for the first couple of work_mem steps. That's weird, because
after increasing work_mem we should need *fewer* batches. But this has
nothing to do with the balancing - it happens even with the balancing
disabled.

The reason is that when calculating nbatch we do this:

    dbatch = Min(dbatch, max_pointers);

and max_pointers is calculated from work_mem (among other things). It's
a bit funny that the logic worries about how many batch pointers we can
keep, and refuses to allow more, but at the same time it ignores the
BufFiles.
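
The relevant logic looks roughly like this (paraphrased from
ExecChooseHashTableSize, so the details may be off):

  /* how many HashJoinTuple pointers fit into the memory limit */
  max_pointers = hash_table_bytes / sizeof(HashJoinTuple);
  max_pointers = Min(max_pointers, MaxAllocSize / sizeof(HashJoinTuple));

  /* the same limit also clamps the number of batches */
  dbatch = ceil(inner_rel_bytes / hash_table_bytes);
  dbatch = Min(dbatch, max_pointers);
  nbatch = pg_nextpower2_32(Max(2, (int) dbatch));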

AFAICS it's harmless - we may pick a low number of batches initially,
but later we'll ramp it up (and the balancing will work too). And if
you choose to run huge hash joins with tiny work_mem, I guess you're in
for some suffering anyway. In any case, it's unrelated to the balancing.


regards

[1]
https://www.postgresql.org/message-id/CAJVSvF6290rJF2MtgSx_SuT9Kn2amZ_%2BzecoZYMU%2Bdn3BVVaZg%40mail.gmail.com

-- 
Tomas Vondra
