Thread: Wrong results from Parallel Hash Full Join
I came across $subject and reduced the repro query as below.
create table a (i int);
create table b (i int);
insert into a values (1);
insert into b values (2);
update b set i = 2;
set min_parallel_table_scan_size to 0;
set parallel_tuple_cost to 0;
set parallel_setup_cost to 0;
# explain (costs off) select * from a full join b on a.i = b.i;
                QUERY PLAN
------------------------------------------
 Gather
   Workers Planned: 2
   ->  Parallel Hash Full Join
         Hash Cond: (a.i = b.i)
         ->  Parallel Seq Scan on a
         ->  Parallel Hash
               ->  Parallel Seq Scan on b
(7 rows)
# select * from a full join b on a.i = b.i;
 i | i
---+---
 1 |
(1 row)
Tuple (NULL, 2) is missing from the results.
Thanks
Richard
On Wed, Apr 12, 2023 at 7:36 AM Richard Guo <guofenglinux@gmail.com> wrote:
> I came across $subject and reduced the repro query as below.
> [...]
> Tuple (NULL, 2) is missing from the results.

Thanks so much for reporting this, Richard. This is a fantastic minimal
repro!

So, I looked into this, and it seems that, as you can imagine, the
tuple in b is HOT-updated, resulting in a heap-only tuple:

 t_ctid |                              raw_flags
--------+----------------------------------------------------------------------
 (0,2)  | {HEAP_XMIN_COMMITTED,HEAP_XMAX_COMMITTED,HEAP_HOT_UPDATED}
 (0,2)  | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID,HEAP_UPDATED,HEAP_ONLY_TUPLE}

In ExecParallelScanHashTableForUnmatched() we don't emit the
NULL-extended tuple because HeapTupleHeaderHasMatch() is true for our
desired tuple:

    while (hashTuple != NULL)
    {
        if (!HeapTupleHeaderHasMatch(HJTUPLE_MINTUPLE(hashTuple)))
        {

HeapTupleHeaderHasMatch() checks if HEAP_TUPLE_HAS_MATCH is set.

In htup_details.h, you will see that HEAP_TUPLE_HAS_MATCH is defined as
HEAP_ONLY_TUPLE:

    /*
     * HEAP_TUPLE_HAS_MATCH is a temporary flag used during hash joins. It is
     * only used in tuples that are in the hash table, and those don't need
     * any visibility information, so we can overlay it on a visibility flag
     * instead of using up a dedicated bit.
     */
    #define HEAP_TUPLE_HAS_MATCH    HEAP_ONLY_TUPLE /* tuple has a join match */

If you redefine HEAP_TUPLE_HAS_MATCH as something that isn't already
used, say 0x1800, the query returns correct results:

                QUERY PLAN
------------------------------------------
 Gather
   Workers Planned: 2
   ->  Parallel Hash Full Join
         Hash Cond: (a.i = b.i)
         ->  Parallel Seq Scan on a
         ->  Parallel Hash
               ->  Parallel Seq Scan on b
(7 rows)

 i | i
---+---
 1 |
   | 2
(2 rows)

The question is, why does this only happen for a parallel full hash
join?

unpatched:

postgres=# explain (costs off) select * from a full join b on a.i = b.i;
        QUERY PLAN
---------------------------
 Hash Full Join
   Hash Cond: (a.i = b.i)
   ->  Seq Scan on a
   ->  Hash
         ->  Seq Scan on b
(5 rows)

postgres=# select * from a full join b on a.i = b.i;
 i | i
---+---
 1 |
   | 2
(2 rows)

I imagine it has something to do with what tuples are put in the
parallel hashtable. I am about to investigate that but just wanted to
share what I had so far.

- Melanie
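For readers following along, the match-bit macros in htup_details.h work
along these lines; this is a paraphrased sketch, not a verbatim excerpt
of the header:

    /* Paraphrased sketch of the match-bit macros in htup_details.h (not a
     * verbatim excerpt). The match bit is overlaid on the heap-only-tuple
     * bit in t_infomask2, which is why a HOT tuple copied into the hash
     * table can look as if it already had a join match. */
    #define HeapTupleHeaderHasMatch(tup) \
        (((tup)->t_infomask2 & HEAP_TUPLE_HAS_MATCH) != 0)

    #define HeapTupleHeaderSetMatch(tup) \
        ((tup)->t_infomask2 |= HEAP_TUPLE_HAS_MATCH)

    #define HeapTupleHeaderClearMatch(tup) \
        ((tup)->t_infomask2 &= ~HEAP_TUPLE_HAS_MATCH)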
Hi,

On 2023-04-12 10:57:17 -0400, Melanie Plageman wrote:
> HeapTupleHeaderHasMatch() checks if HEAP_TUPLE_HAS_MATCH is set.
> [...]
> If you redefine HEAP_TUPLE_HAS_MATCH as something that isn't already
> used, say 0x1800, the query returns correct results.
> [...]
> The question is, why does this only happen for a parallel full hash join?

I'd guess that the PHJ code is missing a HeapTupleHeaderClearMatch()
somewhere, but the non-parallel case isn't.

Greetings,

Andres Freund
On Wed, Apr 12, 2023 at 2:14 PM Andres Freund <andres@anarazel.de> wrote:
> I'd guess that the PHJ code is missing a HeapTupleHeaderClearMatch()
> somewhere, but the non-parallel case isn't.

Indeed. Thanks! This diff fixes the case Richard provided.

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index a45bd3a315..54c06c5eb3 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -1724,6 +1724,7 @@ retry:
     /* Store the hash value in the HashJoinTuple header. */
     hashTuple->hashvalue = hashvalue;
     memcpy(HJTUPLE_MINTUPLE(hashTuple), tuple, tuple->t_len);
+    HeapTupleHeaderClearMatch(HJTUPLE_MINTUPLE(hashTuple));

     /* Push it onto the front of the bucket's list */
     ExecParallelHashPushTuple(&hashtable->buckets.shared[bucketno],

I will propose a patch that includes this change and a test.

I just want to convince myself that ExecParallelHashTableInsertCurrentBatch()
covers the non-batch 0 cases and we don't need to add something to
sts_puttuple().

- Melanie
On Wed, Apr 12, 2023 at 2:59 PM Melanie Plageman <melanieplageman@gmail.com> wrote:
> I just want to convince myself that ExecParallelHashTableInsertCurrentBatch()
> covers the non-batch 0 cases and we don't need to add something to
> sts_puttuple().

So, indeed, tuples in batches after batch 0 already had their match bit
cleared by ExecParallelHashTableInsertCurrentBatch().

Attached patch includes the fix for ExecParallelHashTableInsert() as
well as a test. I toyed with adapting one of the existing parallel full
hash join tests to cover this case, however, I think Richard's repro is
much more clear. Maybe it is worth throwing in a few updates to the
tables in the existing queries to provide coverage for the other
HeapTupleHeaderClearMatch() calls in the code, though.

- Melanie
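For context, the batch-loading insert path already ends with a clear of
the match bit; a paraphrased sketch of that portion of nodeHash.c, with
surrounding code elided and variable names assumed:

    /* Paraphrased sketch of the tail of ExecParallelHashTableInsertCurrentBatch()
     * in nodeHash.c (surrounding code elided, variable names assumed). Tuples
     * loaded into later batches get their match bit cleared here, which is why
     * only the batch-0 path in ExecParallelHashTableInsert() needed the fix. */
    hashTuple->hashvalue = hashvalue;
    memcpy(HJTUPLE_MINTUPLE(hashTuple), tuple, tuple->t_len);
    HeapTupleHeaderClearMatch(HJTUPLE_MINTUPLE(hashTuple));
    ExecParallelHashPushTuple(&hashtable->buckets.shared[bucketno],
                              hashTuple, shared);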
On Thu, Apr 13, 2023 at 9:48 AM Melanie Plageman <melanieplageman@gmail.com> wrote:
> Attached patch includes the fix for ExecParallelHashTableInsert() as
> well as a test. I toyed with adapting one of the existing parallel full
> hash join tests to cover this case, however, I think Richard's repro is
> much more clear. Maybe it is worth throwing in a few updates to the
> tables in the existing queries to provide coverage for the other
> HeapTupleHeaderClearMatch() calls in the code, though.

Oof. Analysis and code LGTM.

I thought about the way non-parallel HJ also clears the match bits when
re-using the hash table for rescans. PHJ doesn't keep hash tables
across rescans. (There's no fundamental reason why it couldn't, but
there was some complication and it seemed absurd to have NestLoop over
Gather over PHJ, forking a new set of workers for every tuple, so I
didn't implement that in the original PHJ.) But... there is something a
little odd about the code in ExecHashTableResetMatchFlags(), or the
fact that we appear to be calling it: it's using the unshared union
member unconditionally, which wouldn't actually work for PHJ (there
should be a variant of that function with Parallel in its name if we
ever want that to work). That's not a bug AFAICT, as in fact we don't
actually call it--it should be unreachable because the hash table
should be gone when we rescan--but it's confusing. I'm wondering if we
should put in something explicit about that, maybe a comment and an
assertion in ExecReScanHashJoin().

+-- Ensure that hash join tuple match bits have been cleared before putting them
+-- into the hashtable.

Could you mention that the match flag steals a bit from the HOT flag,
ie *why* we're testing a join after an update? And if we're going to
exercise/test that case, should we do the non-parallel version too?

For the commit message, I think it's a good idea to use something like
"Fix ..." for the headline of bug fix commits to make that clearer, and
to add something like "oversight in commit XYZ" in the body, just to
help people connect the dots. (Yeah, I know I failed to reference the
delinquent commit in the recent assertion-removal commit, my bad.)

I think "Discussion:" footers are supposed to use
https://postgr.es/m/XXX shortened URLs.
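For illustration only, the kind of comment-plus-assertion being floated
here might look something like the sketch below; the field names
(hj_HashTable, parallel_state) are assumptions about the surrounding
code, and this is not a quote of any committed change:

    /* Hypothetical sketch of a guard in ExecReScanHashJoin(); not committed code. */
    void
    ExecReScanHashJoin(HashJoinState *node)
    {
        /*
         * Parallel hash join does not keep its hash table across rescans, so
         * any hash table we still see here must be an unshared one; otherwise
         * an ExecHashTableResetMatchFlags() call, which only knows about the
         * unshared bucket array, would be reading the wrong union member.
         */
        Assert(node->hj_HashTable == NULL ||
               node->hj_HashTable->parallel_state == NULL);

        /* ... existing rescan logic ... */
    }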
On Wed, Apr 12, 2023 at 6:50 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> [...] I'm wondering if we should put in something explicit about that,
> maybe a comment and an assertion in ExecReScanHashJoin().

An assert about it not being a parallel hash join? I support this.

> +-- Ensure that hash join tuple match bits have been cleared before putting them
> +-- into the hashtable.
>
> Could you mention that the match flag steals a bit from the HOT flag,
> ie *why* we're testing a join after an update?

v2 attached has some wordsmithing along these lines.

> And if we're going to
> exercise/test that case, should we do the non-parallel version too?

I've added this.

I thought if we were adding the serial case, we might as well add the
multi-batch case as well. However, that proved a bit more challenging.
We can get a HOT tuple in one of the existing tables with no issues.
Doing this and then deleting the reset match bit code doesn't cause any
of the tests to fail, however, because we use this expression as the
join condition when we want to emit NULL-extended unmatched tuples:

    select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);

I don't think we want to add yet another time-consuming test to this
test file. So, I was trying to decide if it was worth changing these
existing tests so that they would fail when the match bit wasn't reset.
I'm not sure.

> For the commit message, I think it's a good idea to use something like
> "Fix ..." for the headline of bug fix commits to make that clearer,
> and to add something like "oversight in commit XYZ" in the body, just
> to help people connect the dots.

I've made these edits and tried to improve the commit message clarity
in general.

> I think "Discussion:" footers are supposed to use
> https://postgr.es/m/XXX shortened URLs.

Hmm. Is the problem with mine that I included "flat"? Because I did use
postgr.es/m format. The message id is unfortunately long, but I believe
that is on google and not me.
- Melanie
On Thu, Apr 13, 2023 at 12:31 PM Melanie Plageman <melanieplageman@gmail.com> wrote:
> On Wed, Apr 12, 2023 at 6:50 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > I think "Discussion:" footers are supposed to use
> > https://postgr.es/m/XXX shortened URLs.
>
> Hmm. Is the problem with mine that I included "flat"? Because I did use
> postgr.es/m format. The message id is unfortunately long, but I believe
> that is on google and not me.

For some reason I thought we weren't supposed to use the flat thing,
but it looks like I'm just wrong and people do that all the time so I
take that back.

Pushed. Thanks Richard and Melanie.
On Thu, Apr 13, 2023 at 7:06 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> For some reason I thought we weren't supposed to use the flat thing,
> but it looks like I'm just wrong and people do that all the time so I
> take that back.
>
> Pushed. Thanks Richard and Melanie.

I tend to use http://postgr.es/m/ or https://postgr.es/m/ just to keep
the URL a bit shorter, and also because I like to point anyone reading
the commit log to the particular message that I think is most relevant
rather than to the thread as a whole. But I don't think there's any
hard-and-fast rule that committers have to do it one way rather than
another.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Apr 12, 2023 at 08:31:26PM -0400, Melanie Plageman wrote:
> On Wed, Apr 12, 2023 at 6:50 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > And if we're going to
> > exercise/test that case, should we do the non-parallel version too?
>
> I've added this. I thought if we were adding the serial case, we might
> as well add the multi-batch case as well. However, that proved a bit
> more challenging.
> [...]
> I don't think we want to add yet another time-consuming test to this
> test file. So, I was trying to decide if it was worth changing these
> existing tests so that they would fail when the match bit wasn't reset.
> I'm not sure.

I couldn't stop thinking about how my explanation for why this test
didn't fail sounded wrong. After some further investigation, I found
that the real reason that the HOT bit is already cleared in the tuples
inserted into the hashtable for this query is that the tuple descriptor
for the relation "simple" and the target list for the scan node are not
identical (because we only need to retain a single column from simple
in order to eventually do count(*)), so we make a new virtual tuple and
build projection info for the scan node. The virtual tuple doesn't have
the HOT bit set anymore (the buffer heap tuple would have). So we
couldn't fail a test of the code clearing the match bit.

Ultimately this is probably fine. If we wanted to modify one of the
existing tests to cover the multi-batch case, changing the select
count(*) to a select * would do the trick. I imagine we wouldn't want
to do this because of the excessive output this would produce. I
wondered if there was a pattern in the tests for getting around this.
But, perhaps we don't care enough to cover this code.

- Melanie
On Wed, Apr 19, 2023 at 11:17:04AM -0400, Melanie Plageman wrote:
> Ultimately this is probably fine. If we wanted to modify one of the
> existing tests to cover the multi-batch case, changing the select
> count(*) to a select * would do the trick. I imagine we wouldn't want to
> do this because of the excessive output this would produce. I wondered
> if there was a pattern in the tests for getting around this.

You could use explain (ANALYZE). But the output is machine-dependent in
various ways (which is why the tests use "explain analyze" so rarely).

--
Justin
Hi,

On 2023-04-19 12:16:24 -0500, Justin Pryzby wrote:
> On Wed, Apr 19, 2023 at 11:17:04AM -0400, Melanie Plageman wrote:
> > Ultimately this is probably fine. If we wanted to modify one of the
> > existing tests to cover the multi-batch case, changing the select
> > count(*) to a select * would do the trick. [...]
>
> You could use explain (ANALYZE). But the output is machine-dependent in
> various ways (which is why the tests use "explain analyze" so rarely).

I think with sufficient options it's not machine specific. We have a
bunch of

    EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF) ..

in our tests.

Greetings,

Andres Freund
On Wed, Apr 19, 2023 at 12:20:51PM -0700, Andres Freund wrote:
> On 2023-04-19 12:16:24 -0500, Justin Pryzby wrote:
> > You could use explain (ANALYZE). But the output is machine-dependent in
> > various ways (which is why the tests use "explain analyze" so rarely).
>
> I think with sufficient options it's not machine specific.

It *can* be machine specific depending on the node type.

In particular, for parallel workers, it shows "Workers Launched: ..",
which can vary even across executions on the same machine. And don't
forget about "loops=".

Plus:

src/backend/commands/explain.c: "Buckets: %d  Batches: %d  Memory Usage: %ldkB\n",

> We have a bunch of
> EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF) ..
> in our tests.

There's 81 uses of "timing off", out of a total of ~1600 explains. Most
of them are in partition_prune.sql. explain analyze is barely used.

I sent a patch to elide the machine-specific parts, which would make it
easier to use. But there was no interest.

--
Justin
On Wed, Apr 19, 2023 at 3:20 PM Andres Freund <andres@anarazel.de> wrote:
> On 2023-04-19 12:16:24 -0500, Justin Pryzby wrote:
> > You could use explain (ANALYZE). But the output is machine-dependent in
> > various ways (which is why the tests use "explain analyze" so rarely).
>
> I think with sufficient options it's not machine specific. We have a bunch of
>
>     EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF) ..
>
> in our tests.
Cool. Yea, so ultimately these options are almost enough but memory
usage changes from execution to execution. There are some tests which do
regexp_replace() on the memory usage part of the EXPLAIN ANALYZE output
to allow us to still compare the plans. However, I figured if I was
already going to go to the trouble of using regexp_replace(), I might as
well write a function that returns the "Actual Rows" field from the
EXPLAIN ANALYZE output.
The attached patch does that. I admittedly mostly copy-pasted the
plpgsql function from similar examples in other tests, and I suspect it
may be overkill and also poorly written.
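Purely for illustration, a minimal sketch of the kind of helper being
described might look like the following; the function name, options,
and regex are assumptions, and this is not the attached patch:

    -- Illustrative sketch only: run a query under EXPLAIN ANALYZE with
    -- machine-stable options and return just the "actual rows" counts.
    -- Name and regex are assumptions, not the attached patch.
    create or replace function actual_rows_sketch(query text)
    returns setof int language plpgsql as
    $$
    declare
        ln text;
    begin
        for ln in
            execute 'explain (analyze, costs off, summary off, timing off) ' || query
        loop
            -- keep only the deterministic row counts from each plan line
            if ln ~ 'actual rows=\d+' then
                return next (regexp_match(ln, 'actual rows=(\d+)'))[1]::int;
            end if;
        end loop;
    end;
    $$;

    -- example usage:
    -- select * from actual_rows_sketch('select * from a full join b on a.i = b.i');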
The nice thing about this approach is that we can modify some of the
existing tests in join_hash.sql to use this function and cover the code
to reset the matchbit for serial hashjoin, single batch parallel
hashjoin, and all batches of parallel multi-batch hashjoin without any
additional queries. (I'll leave testing match bit resetting with the
skew hashtable and match bit resetting in case of a rescan for another
day.)
I was able to delete the tests added in 558c9d75fe, as they became
redundant.
I wonder if any other tests are in need of an EXPLAIN (ANALYZE,
MEMORY_USAGE OFF) option? Perhaps it is quite unusual to only require a
deterministic field like 'Actual Rows'. If we had that option we could
also remove the extra EXPLAIN invocations before the actual query
executions.
- Melanie
On Wed, Apr 19, 2023 at 8:41 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> There's 81 uses of "timing off", out of a total of ~1600 explains. Most
> of them are in partition_prune.sql. explain analyze is barely used.
>
> I sent a patch to elide the machine-specific parts, which would make it
> easier to use. But there was no interest.

While I don't know about other use cases, I would have used that here.
Do you still have that patch laying around? I'd be interested to at
least review it.

- Melanie
On Wed, Apr 19, 2023 at 8:43 PM Melanie Plageman <melanieplageman@gmail.com> wrote:
> The attached patch does that. I admittedly mostly copy-pasted the
> plpgsql function from similar examples in other tests, and I suspect it
> may be overkill and also poorly written.

I renamed the function to join_hash_actual_rows to avoid potentially
affecting other tests. Nothing about the function is specific to a hash
join plan, so I think it is more clear to prefix the function with the
test file name.

v2 attached.

- Melanie
On Wed, Apr 19, 2023 at 08:47:07PM -0400, Melanie Plageman wrote:
> On Wed, Apr 19, 2023 at 8:41 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > I sent a patch to elide the machine-specific parts, which would make it
> > easier to use. But there was no interest.
>
> While I don't know about other use cases, I would have used that here.
> Do you still have that patch laying around? I'd be interested to at
> least review it.

https://commitfest.postgresql.org/41/3409/
I noticed that BF animal conchuela has several times fallen over on the
test case added by 558c9d75f:

diff -U3 /home/pgbf/buildroot/HEAD/pgsql.build/src/test/regress/expected/join_hash.out /home/pgbf/buildroot/HEAD/pgsql.build/src/test/regress/results/join_hash.out
--- /home/pgbf/buildroot/HEAD/pgsql.build/src/test/regress/expected/join_hash.out  2023-04-19 10:20:26.159840000 +0200
+++ /home/pgbf/buildroot/HEAD/pgsql.build/src/test/regress/results/join_hash.out   2023-04-19 10:21:47.971900000 +0200
@@ -974,8 +974,8 @@
 SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id;
  id | id
 ----+----
-  1 |
     |  2
+  1 |
 (2 rows)

 -- Test serial full hash join.

Considering that this is a parallel plan, I don't think there's any
mystery about why an ORDER-BY-less query might have unstable output
order; the only mystery is why more of the buildfarm hasn't failed.
Can we just add "ORDER BY t1.id" to this query? It looks like you get
the same PHJ plan, although now underneath Sort/Gather Merge.

            regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2023-04-19%2008%3A20%3A56
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2023-05-03%2006%3A21%3A03
[3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2023-05-19%2022%3A21%3A04
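A sketch of the suggested adjustment, not necessarily the committed
wording of the test:

    -- Sketch only: the same PHJ plan, now under Sort/Gather Merge, with a
    -- deterministic output order.
    SELECT * FROM hjtest_matchbits_t1 t1
      FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id
      ORDER BY t1.id;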
On Fri, May 19, 2023 at 8:05 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Considering that this is a parallel plan, I don't think there's any
> mystery about why an ORDER-BY-less query might have unstable output
> order; the only mystery is why more of the buildfarm hasn't failed.
> Can we just add "ORDER BY t1.id" to this query? It looks like you
> get the same PHJ plan, although now underneath Sort/Gather Merge.

Yes, this was an oversight on my part. Attached is the patch that does
just what you suggested.

I can't help but take this opportunity to bump my un-reviewed patch
further upthread which adds additional test coverage for match bit
clearing for multi-batch hash joins [1]. It happens to also remove the
test that failed on the buildfarm, which is why I thought to bring it
up.

-- Melanie

[1] https://www.postgresql.org/message-id/CAAKRu_bdwDN_aHVctHcc9VoDP9av7LUMeuLbch1fHD2ESouw1g%40mail.gmail.com
On Wed, Jun 07, 2023 at 05:16:12PM -0400, Melanie Plageman wrote:
> On Fri, May 19, 2023 at 8:05 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Considering that this is a parallel plan, I don't think there's any
> > mystery about why an ORDER-BY-less query might have unstable output
> > order; the only mystery is why more of the buildfarm hasn't failed.
> > Can we just add "ORDER BY t1.id" to this query? It looks like you
> > get the same PHJ plan, although now underneath Sort/Gather Merge.
>
> Yes, this was an oversight on my part. Attached is the patch that does
> just what you suggested.

Confirmed that adding an ORDER BY adds a Sort node between a Gather
Merge and a Parallel Hash Full Join, not removing coverage.

This has fallen through the cracks and conchuela has failed again
today, so I went ahead and applied the fix on HEAD. Thanks!

--
Michael
Michael Paquier <michael@paquier.xyz> writes:
> This has fallen through the cracks and conchuela has failed again
> today, so I went ahead and applied the fix on HEAD. Thanks!

Thanks! I'd intended to push that but it didn't get to the top of the
to-do queue yet. (I'm still kind of wondering why only conchuela has
failed to date.)

            regards, tom lane
On Sun, Jun 11, 2023 at 11:24 PM Michael Paquier <michael@paquier.xyz> wrote:
> Confirmed that adding an ORDER BY adds a Sort node between a Gather
> Merge and a Parallel Hash Full Join, not removing coverage.
>
> This has fallen through the cracks and conchuela has failed again
> today, so I went ahead and applied the fix on HEAD. Thanks!

Thanks!