Re: pgsql: Add parallel-aware hash joins. - Mailing list pgsql-committers
From:           Thomas Munro
Subject:        Re: pgsql: Add parallel-aware hash joins.
Date:
Msg-id:         CAEepm=34PDuR69kfYVhmZPgMdy8pSA-MYbpesEN1SR+2oj3Y+w@mail.gmail.com
In response to: Re: pgsql: Add parallel-aware hash joins. (Tom Lane <tgl@sss.pgh.pa.us>)
Responses:      Re: pgsql: Add parallel-aware hash joins.
List:           pgsql-committers
On Wed, Jan 3, 2018 at 2:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> I mean that ExecChooseHashTableSize() estimates the hash table size
>> like this:
>>
>>     inner_rel_bytes = ntuples * tupsize;
>>
>> ... but then at execution time, in the Parallel Hash case, we do
>> memory accounting not in tuples but in chunks.  The various
>> participants pack tuples into 32KB chunks, and they trigger an
>> increase in the number of batches when the total size of all chunks
>> happens to exceed the memory budget.  In this case they do so
>> unexpectedly, due to extra overhead at execution time that the
>> planner didn't account for.  Because we happened to be close to the
>> threshold between choosing 8 batches and 16 batches, we can get it
>> wrong and have to increase nbatch at execution time.
>
> If that's the issue, why doesn't the test fail every time on affected
> platforms?  There shouldn't be anything nondeterministic about the
> number or size of tuples going into the hash table?
>
>> ... You get a
>> larger size if more workers manage to load at least one tuple, due to
>> their final partially filled chunk.
>
> Hm.  That could do it, except it doesn't really account for the observed
> result that slower single-processor machines seem more prone to the
> bug.  Surely they should be less likely to get multiple workers activated.

I can reproduce the instability here with an -m32 build and this:

  create table simple as
    select generate_series(1, 20000) AS id,
           'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
  alter table simple set (parallel_workers = 2);
  analyze simple;
  set parallel_setup_cost = 0;
  set work_mem = '96kB';
  explain analyze select count(*) from simple r join simple s using (id);

It plans for 8 batches, and then usually but not always goes to 16 at
execution time, depending on timing.
It doesn't happen for me with 128kB (the setting used in the regression
test), but I think the affected BF machines are all 32 bit systems that
have MAXIMUM_ALIGNOF 8 and therefore use a bit more space, whereas my
machines have MAXIMUM_ALIGNOF 4 in a 32 bit build, so that would explain
the different location of this unstable border.  We could probably fix
the failures by simply increasing work_mem out of that zone, but I'm
hoping to fix the problem in a more principled way.  More soon.

> BTW, I'm seeing a few things that look bug-like about
> ExecParallelHashTuplePrealloc, for instance why does it use just
> "size" to decide if space_allowed is exceeded but if not then add the
> typically-much-larger value "want + HASH_CHUNK_HEADER_SIZE" to
> estimated_size.  That clearly can allow estimated_size to get
> significantly past space_allowed --- if it's not a bug, it at least
> deserves a comment explaining why not.

Right.  Fixed in the attached.

> Another angle, which does not
> apply to this test case but seems like a bug for real usage, is that
> ExecParallelHashTuplePrealloc doesn't account correctly for tuples wider
> than HASH_CHUNK_THRESHOLD.

Right.  I'll address that separately.

> I'm also wondering why the non-parallel path seems to prefer to allocate
> in units of HASH_CHUNK_SIZE + HASH_CHUNK_HEADER_SIZE while the parallel
> path targets allocations of exactly HASH_CHUNK_SIZE,

That is intentional: dsa.c sucks at allocating 32KB + a tiny bit because
it works in 4KB pages for large allocations, so I wanted to make
HASH_CHUNK_SIZE the total size that arrives into dsa_allocate().  The
non-parallel path has similar problems on some libc implementations, as
we discussed over here:

https://www.postgresql.org/message-id/flat/29770.1511495642%40sss.pgh.pa.us

> and why there's such
> inconsistency in whether tuples of exactly HASH_CHUNK_THRESHOLD bytes
> are treated as "big" or not.

Right, that's inconsistent.  Fixed in the attached.

--
Thomas Munro
http://www.enterprisedb.com
Attachment