Thread: pgsql: Add some regression tests that exercise hash join code.
Add some regression tests that exercise hash join code. Although hash joins are already tested by many queries, these tests systematically cover the four different states we can reach as part of the strategy for respecting work_mem. Author: Thomas Munro Reviewed-By: Andres Freund Branch ------ master Details ------- https://git.postgresql.org/pg/commitdiff/fa330f9adf4e83c0707b0b1164e7bf09c9204b3d Modified Files -------------- src/test/regress/expected/join.out | 457 +++++++++++++++++++++++++++++++++++++ src/test/regress/sql/join.sql | 253 ++++++++++++++++++++ 2 files changed, 710 insertions(+)
Andres Freund <andres@anarazel.de> writes: > Add some regression tests that exercise hash join code. At least one buildfarm member doesn't like this ... regards, tom lane
On Thu, Nov 30, 2017 at 4:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Andres Freund <andres@anarazel.de> writes: >> Add some regression tests that exercise hash join code. > > At least one buildfarm member doesn't like this ... $$); initially_multibatch | increased_batches ----------------------+------------------- ! t | f (1 row) rollback to settings; --- 6002,6008 ---- $$); initially_multibatch | increased_batches ----------------------+------------------- ! | (1 row) Hmm. aholehole didn't give me the hash join EXPLAIN ANALYZE output. All other animals are fine so far. Gah. My guess is that this is the following unlikely sequence: 1. Worker launched. 2. Leader descheduled. 3. Worker runs whole join. 4. Leader awakes from slumber, begins running plan and tries to scan outer relation, sees EOF, takes empty-outer optimisation and skips building hash table. 5. Explain has no data. It's arguably a bug that EXPLAIN ANALYZE can fail to give you what you asked for because of details of timing and I have a patch in development to fix that but it's not yet baked. Hmm. Wondering if there is a quick way to avoid that case in the meantime... -- Thomas Munro http://www.enterprisedb.com
On Thu, Nov 30, 2017 at 4:47 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Thu, Nov 30, 2017 at 4:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Andres Freund <andres@anarazel.de> writes: >>> Add some regression tests that exercise hash join code. >> >> At least one buildfarm member doesn't like this ... > > $$); > initially_multibatch | increased_batches > ----------------------+------------------- > ! t | f > (1 row) > > rollback to settings; > --- 6002,6008 ---- > $$); > initially_multibatch | increased_batches > ----------------------+------------------- > ! | > (1 row) > > Hmm. aholehole didn't give me the hash join EXPLAIN ANALYZE output. > All other animals are fine so far. Gah. My guess is that this is the > following unlikely sequence: > > 1. Worker launched. > 2. Leader descheduled. > 3. Worker runs whole join. > 4. Leader awakes from slumber, begins running plan and tries to scan > outer relation, sees EOF, takes empty-outer optimisation and skips > building hash table. > 5. Explain has no data. > > It's arguably a bug that EXPLAIN ANALYZE can fail to give you what you > asked for because of details of timing and I have a patch in > development to fix that but it's not yet baked. Hmm. Wondering if > there is a quick way to avoid that case in the meantime... A couple more have failed in the same way. If my theory above is correct, the attached should fix the problem by skipping the affected bits of the test for now. Once we fix EXPLAIN so that it reliably prints hash table info no matter what (for which I should have a patch fairly soon), we can uncomment them again. Is that an acceptable solution for now? BTW this commit tipped nodeHash.c from yellow to green on coverage.postgresql.org :-) -- Thomas Munro http://www.enterprisedb.com
Attachment
Hi, On 2017-11-30 23:53:55 +1300, Thomas Munro wrote: > On Thu, Nov 30, 2017 at 4:47 PM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: > > On Thu, Nov 30, 2017 at 4:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> Andres Freund <andres@anarazel.de> writes: > >>> Add some regression tests that exercise hash join code. > >> > >> At least one buildfarm member doesn't like this ... > > > > $$); > > initially_multibatch | increased_batches > > ----------------------+------------------- > > ! t | f > > (1 row) > > > > rollback to settings; > > --- 6002,6008 ---- > > $$); > > initially_multibatch | increased_batches > > ----------------------+------------------- > > ! | > > (1 row) > > > > Hmm. aholehole didn't give me the hash join EXPLAIN ANALYZE output. > > All other animals are fine so far. Gah. My guess is that this is the > > following unlikely sequence: > > > > 1. Worker launched. > > 2. Leader descheduled. > > 3. Worker runs whole join. > > 4. Leader awakes from slumber, begins running plan and tries to scan > > outer relation, sees EOF, takes empty-outer optimisation and skips > > building hash table. > > 5. Explain has no data. > > > > It's arguably a bug that EXPLAIN ANALYZE can fail to give you what you > > asked for because of details of timing and I have a patch in > > development to fix that but it's not yet baked. Hmm. Wondering if > > there is a quick way to avoid that case in the meantime... > > A couple more have failed in the same way. If my theory above is > correct, the attached should fix the problem by skipping the affected > bits of the test for now. Once we fix EXPLAIN so that it reliably > prints hash table info no matter what (for which I should have a patch > fairly soon), we can uncomment them again. Is that an acceptable > solution for now? BTW this commit tipped nodeHash.c from yellow to > green on coverage.postgresql.org :-) At the current rate of failure I'm slightly inclined to just leave the occasionally failing test, which sounds like an actual bug imo, in place till we have the fix. But if there's any pushback I'll go with your disablign patch till then. - Andres
On 2017-11-30 08:06:44 -0800, Andres Freund wrote: > Hi, > > On 2017-11-30 23:53:55 +1300, Thomas Munro wrote: > > On Thu, Nov 30, 2017 at 4:47 PM, Thomas Munro > > <thomas.munro@enterprisedb.com> wrote: > > > On Thu, Nov 30, 2017 at 4:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > >> Andres Freund <andres@anarazel.de> writes: > > >>> Add some regression tests that exercise hash join code. > > >> > > >> At least one buildfarm member doesn't like this ... > > > > > > $$); > > > initially_multibatch | increased_batches > > > ----------------------+------------------- > > > ! t | f > > > (1 row) > > > > > > rollback to settings; > > > --- 6002,6008 ---- > > > $$); > > > initially_multibatch | increased_batches > > > ----------------------+------------------- > > > ! | > > > (1 row) > > > > > > Hmm. aholehole didn't give me the hash join EXPLAIN ANALYZE output. > > > All other animals are fine so far. Gah. My guess is that this is the > > > following unlikely sequence: > > > > > > 1. Worker launched. > > > 2. Leader descheduled. > > > 3. Worker runs whole join. > > > 4. Leader awakes from slumber, begins running plan and tries to scan > > > outer relation, sees EOF, takes empty-outer optimisation and skips > > > building hash table. > > > 5. Explain has no data. > > > > > > It's arguably a bug that EXPLAIN ANALYZE can fail to give you what you > > > asked for because of details of timing and I have a patch in > > > development to fix that but it's not yet baked. Hmm. Wondering if > > > there is a quick way to avoid that case in the meantime... > > > > A couple more have failed in the same way. If my theory above is > > correct, the attached should fix the problem by skipping the affected > > bits of the test for now. Once we fix EXPLAIN so that it reliably > > prints hash table info no matter what (for which I should have a patch > > fairly soon), we can uncomment them again. Is that an acceptable > > solution for now? BTW this commit tipped nodeHash.c from yellow to > > green on coverage.postgresql.org :-) > > At the current rate of failure I'm slightly inclined to just leave the > occasionally failing test, which sounds like an actual bug imo, in place > till we have the fix. But if there's any pushback I'll go with your > disablign patch till then. Thomas, any updates on the status of that explain fix? Greetings, Andres Freund
On Mon, Dec 4, 2017 at 7:39 AM, Andres Freund <andres@anarazel.de> wrote: > On 2017-11-30 08:06:44 -0800, Andres Freund wrote: >> At the current rate of failure I'm slightly inclined to just leave the >> occasionally failing test, which sounds like an actual bug imo, in place >> till we have the fix. But if there's any pushback I'll go with your >> disablign patch till then. > > Thomas, any updates on the status of that explain fix? Will post today. -- Thomas Munro http://www.enterprisedb.com
On Mon, Dec 4, 2017 at 9:12 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Mon, Dec 4, 2017 at 7:39 AM, Andres Freund <andres@anarazel.de> wrote: >> On 2017-11-30 08:06:44 -0800, Andres Freund wrote: >>> At the current rate of failure I'm slightly inclined to just leave the >>> occasionally failing test, which sounds like an actual bug imo, in place >>> till we have the fix. But if there's any pushback I'll go with your >>> disablign patch till then. >> >> Thomas, any updates on the status of that explain fix? > > Will post today. Here it is. Thoughts? -- Thomas Munro http://www.enterprisedb.com
Attachment
On Mon, Dec 4, 2017 at 12:18 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Mon, Dec 4, 2017 at 9:12 AM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> On Mon, Dec 4, 2017 at 7:39 AM, Andres Freund <andres@anarazel.de> wrote: >>> On 2017-11-30 08:06:44 -0800, Andres Freund wrote: >>>> At the current rate of failure I'm slightly inclined to just leave the >>>> occasionally failing test, which sounds like an actual bug imo, in place >>>> till we have the fix. But if there's any pushback I'll go with your >>>> disablign patch till then. >>> >>> Thomas, any updates on the status of that explain fix? >> >> Will post today. > > Here it is. Thoughts? Here's a version with a better commit message. -- Thomas Munro http://www.enterprisedb.com
Attachment
On 2017-12-04 12:28:56 +1300, Thomas Munro wrote: > From 43eeea0bc35204440d262224b56efca958b33428 Mon Sep 17 00:00:00 2001 > From: Thomas Munro <thomas.munro@enterprisedb.com> > Date: Mon, 4 Dec 2017 11:52:11 +1300 > Subject: [PATCH] Fix EXPLAIN ANALYZE of hash join when the leader doesn't > execute. > > If a hash join appears in a parallel query, there may be no hash table > available for explain.c to inspect even though a hash table may have been > built in other processes. This could happen either because > parallel_leader_participation was set to off or because the leader happened to > hit the end of the outer relation immediately (even though the complete > relation is not empty) and decided not to build the hash table. > > Commit bf11e7ee introduced a way for workers to exchange instrumentation via > the DSM segment for Sort nodes even though they are not parallel-aware. This > commit does the same for Hash nodes, so that explain.c has a way to find > instrumentation data from an arbitrary participant that actually built the > hash table. Makes sense. Seems ok to only fix this inv11, even if present earlier. > diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c > index 447f69d044e..1ffe635d66c 100644 > --- a/src/backend/commands/explain.c > +++ b/src/backend/commands/explain.c > @@ -2379,34 +2379,62 @@ show_sort_info(SortState *sortstate, ExplainState *es) > static void > show_hash_info(HashState *hashstate, ExplainState *es) > { > - HashJoinTable hashtable; > + HashInstrumentation *hinstrument = NULL; > > - hashtable = hashstate->hashtable; > + /* > + * In a parallel query, the leader process may or may not have run the > + * hash join, and even if it did it may not have built a hash table due to > + * timing (if it started late it might have seen no tuples in the outer > + * relation and skipped building the hash table). Therefore we have to be > + * prepared to get instrumentation data from a worker if there is no hash > + * table. > + */ > + if (hashstate->hashtable) > + { > + hinstrument = (HashInstrumentation *) > + palloc(sizeof(HashInstrumentation)); > + ExecHashGetInstrumentation(hinstrument, hashstate->hashtable); > + } > + else if (hashstate->shared_info) > + { > + SharedHashInfo *shared_info = hashstate->shared_info; > + int i; > + > + /* Find the first worker that built a hash table. */ > + for (i = 0; i < shared_info->num_workers; ++i) > + { > + if (shared_info->hinstrument[i].nbatch > 0) > + { > + hinstrument = &shared_info->hinstrument[i]; > + break; > + } > + } > + } I can't think of a way that, before the parallel HJ patch, results in a difference between the size of the hashtable between workers. Short of some participant not having a hashtable at all, that is. Wwe also don't currently display lookup information etc. So just more or less randomly choosing one worker's seems ok. > + case T_HashState: > + /* even when not parallel-aware */ "..., for stats" or such maybe? > + ExecHashEstimate((HashState *) planstate, e->pcxt); > + break; > case T_SortState: > /* even when not parallel-aware */ I'd just update that comment as well. > #endif /* NODEHASH_H */ > diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h > index e05bc04f525..6c322e57c00 100644 > --- a/src/include/nodes/execnodes.h > +++ b/src/include/nodes/execnodes.h > @@ -1980,6 +1980,29 @@ typedef struct GatherMergeState > struct binaryheap *gm_heap; /* binary heap of slot indices */ > } GatherMergeState; > > +/* ---------------- > + * Values displayed by EXPLAIN ANALYZE > + * ---------------- > + */ > +typedef struct HashInstrumentation > +{ > + int nbuckets; > + int nbuckets_original; > + int nbatch; > + int nbatch_original; > + size_t space_peak; > +} HashInstrumentation; Maybe I'm being pedantic here, but I think it'd not hurt to explain what these mean. It's obviously pretty clear what nbuckets/batch mean, but the difference between original and not is far less clea.r Looks reasonable these nitpicks aside. Greetings, Andres Freund
On Tue, Dec 5, 2017 at 9:21 AM, Andres Freund <andres@anarazel.de> wrote: > On 2017-12-04 12:28:56 +1300, Thomas Munro wrote: >> From 43eeea0bc35204440d262224b56efca958b33428 Mon Sep 17 00:00:00 2001 >> From: Thomas Munro <thomas.munro@enterprisedb.com> >> Date: Mon, 4 Dec 2017 11:52:11 +1300 >> Subject: [PATCH] Fix EXPLAIN ANALYZE of hash join when the leader doesn't >> execute. >> >> If a hash join appears in a parallel query, there may be no hash table >> available for explain.c to inspect even though a hash table may have been >> built in other processes. This could happen either because >> parallel_leader_participation was set to off or because the leader happened to >> hit the end of the outer relation immediately (even though the complete >> relation is not empty) and decided not to build the hash table. >> >> Commit bf11e7ee introduced a way for workers to exchange instrumentation via >> the DSM segment for Sort nodes even though they are not parallel-aware. This >> commit does the same for Hash nodes, so that explain.c has a way to find >> instrumentation data from an arbitrary participant that actually built the >> hash table. > > Makes sense. Seems ok to only fix this inv11, even if present earlier. +1 I doubt any v10 users will notice. v11 makes it more noticeable because it has the new parallel_leader_participation GUC and these new regression tests which hit the racy case surprisingly often when running on somebody's toaster in the buildfarm. > I can't think of a way that, before the parallel HJ patch, results in a > difference between the size of the hashtable between workers. Short of > some participant not having a hashtable at all, that is. Wwe also don't > currently display lookup information etc. So just more or less randomly > choosing one worker's seems ok. Right. Finding the first arbitrary backend that actually built the hash table and using its instrumentation data works OK for now because they each do either exactly the same hash build or nothing. Even with the later PHJ patch the batch/buckets sizing info will always be the same across all workers that built the hash table, so it just needs some special handling to choose what peak size to report (or some more elaborate information on space usage, to be bikeshedded later). >> + case T_HashState: >> + /* even when not parallel-aware */ > > "..., for stats" or such maybe? How about ", for EXPLAIN ANALYZE"? Done that way. >> + ExecHashEstimate((HashState *) planstate, e->pcxt); >> + break; >> case T_SortState: >> /* even when not parallel-aware */ > > I'd just update that comment as well. Done. Looking for all the places where Robert has that comment, I realised that I probably also needed to supply ExecHashReInitializeDSM() to zap the instrumentation array for each loop. Done. >> +/* ---------------- >> + * Values displayed by EXPLAIN ANALYZE >> + * ---------------- >> + */ >> +typedef struct HashInstrumentation >> +{ >> + int nbuckets; >> + int nbuckets_original; >> + int nbatch; >> + int nbatch_original; >> + size_t space_peak; >> +} HashInstrumentation; > > Maybe I'm being pedantic here, but I think it'd not hurt to explain what > these mean. It's obviously pretty clear what nbuckets/batch mean, but > the difference between original and not is far less clea.r Fixed. -- Thomas Munro http://www.enterprisedb.com
Attachment
Hi, On 2017-12-05 12:59:21 +1300, Thomas Munro wrote: > src/backend/commands/explain.c | 58 ++++++++++++++------ This didn't actually include nodeHash.h, therefore ExecHashGetInstrumentation() didn't have a prototype. I added the relevant include and remove the hashjoin.h include which isn't required anymore. Pushed, thanks for the patch. Let's see what the buildfarm says... - Andres
On Wed, Dec 6, 2017 at 8:03 AM, Andres Freund <andres@anarazel.de> wrote: > On 2017-12-05 12:59:21 +1300, Thomas Munro wrote: >> src/backend/commands/explain.c | 58 ++++++++++++++------ > > This didn't actually include nodeHash.h, ... Ugh, I'd lost -Werror from my COPT. Back in there it goes. > Pushed, thanks for the patch. Let's see what the buildfarm says... Thank you. I'm keeping an eye on it. By the way, the current coding of ExecSortRetrieveInstrumentation() and now also ExecHashRetrieveInstrumentation() is wrong in the case of rescans, since they get called multiple times but are written in a way that assumes they'll be called only once: https://www.postgresql.org/message-id/CAA4eK1JBj4YCEQKeTua5%3DBMXy7zW7zNOvoXomzBP%3Dkb_aqMF7w%40mail.gmail.com Amit's patch would move ExecParallelRetrieveInstrumentation() so that they are called only once, so that would fix that problem without any change to this code. More over on that thread. -- Thomas Munro http://www.enterprisedb.com