Thread: Wrong results from Parallel Hash Full Join
I came across $subject and reduced the repro query as below.
create table a (i int);
create table b (i int);
insert into a values (1);
insert into b values (2);
update b set i = 2;
set min_parallel_table_scan_size to 0;
set parallel_tuple_cost to 0;
set parallel_setup_cost to 0;
# explain (costs off) select * from a full join b on a.i = b.i;
                QUERY PLAN
------------------------------------------
 Gather
   Workers Planned: 2
   ->  Parallel Hash Full Join
         Hash Cond: (a.i = b.i)
         ->  Parallel Seq Scan on a
         ->  Parallel Hash
               ->  Parallel Seq Scan on b
(7 rows)
# select * from a full join b on a.i = b.i;
 i | i
---+---
 1 |
(1 row)
Tuple (NULL, 2) is missing from the results.
Thanks
Richard
On Wed, Apr 12, 2023 at 7:36 AM Richard Guo <guofenglinux@gmail.com> wrote:
> I came across $subject and reduced the repro query as below.
> [...]
> Tuple (NULL, 2) is missing from the results.

Thanks so much for reporting this, Richard. This is a fantastic minimal
repro!

So, I looked into this, and it seems that, as you can imagine, the
tuple in b is HOT-updated, resulting in a heap-only tuple:

 t_ctid |                              raw_flags
--------+----------------------------------------------------------------------
 (0,2)  | {HEAP_XMIN_COMMITTED,HEAP_XMAX_COMMITTED,HEAP_HOT_UPDATED}
 (0,2)  | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID,HEAP_UPDATED,HEAP_ONLY_TUPLE}

In ExecParallelScanHashTableForUnmatched() we don't emit the
NULL-extended tuple because HeapTupleHeaderHasMatch() is true for our
desired tuple:

    while (hashTuple != NULL)
    {
        if (!HeapTupleHeaderHasMatch(HJTUPLE_MINTUPLE(hashTuple)))
        {

HeapTupleHeaderHasMatch() checks if HEAP_TUPLE_HAS_MATCH is set.

In htup_details.h, you will see that HEAP_TUPLE_HAS_MATCH is defined as
HEAP_ONLY_TUPLE:

    /*
     * HEAP_TUPLE_HAS_MATCH is a temporary flag used during hash joins. It is
     * only used in tuples that are in the hash table, and those don't need
     * any visibility information, so we can overlay it on a visibility flag
     * instead of using up a dedicated bit.
     */
    #define HEAP_TUPLE_HAS_MATCH    HEAP_ONLY_TUPLE /* tuple has a join match */

If you redefine HEAP_TUPLE_HAS_MATCH as something that isn't already
used, say 0x1800, the query returns correct results:

                QUERY PLAN
------------------------------------------
 Gather
   Workers Planned: 2
   ->  Parallel Hash Full Join
         Hash Cond: (a.i = b.i)
         ->  Parallel Seq Scan on a
         ->  Parallel Hash
               ->  Parallel Seq Scan on b
(7 rows)

 i | i
---+---
 1 |
   | 2
(2 rows)

The question is, why does this only happen for a parallel full hash
join?

unpatched:

postgres=# explain (costs off) select * from a full join b on a.i = b.i;
        QUERY PLAN
---------------------------
 Hash Full Join
   Hash Cond: (a.i = b.i)
   ->  Seq Scan on a
   ->  Hash
         ->  Seq Scan on b
(5 rows)

postgres=# select * from a full join b on a.i = b.i;
 i | i
---+---
 1 |
   | 2
(2 rows)

I imagine it has something to do with what tuples are put in the
parallel hashtable. I am about to investigate that but just wanted to
share what I had so far.

- Melanie
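For readers following along, the match-bit macros in htup_details.h work
along these lines; this is a paraphrased sketch, not a verbatim excerpt
of the header:

    /* Paraphrased sketch of the match-bit macros in htup_details.h (not a
     * verbatim excerpt). The match bit is overlaid on the heap-only-tuple
     * bit in t_infomask2, which is why a HOT tuple copied into the hash
     * table can look as if it already had a join match. */
    #define HeapTupleHeaderHasMatch(tup) \
        (((tup)->t_infomask2 & HEAP_TUPLE_HAS_MATCH) != 0)

    #define HeapTupleHeaderSetMatch(tup) \
        ((tup)->t_infomask2 |= HEAP_TUPLE_HAS_MATCH)

    #define HeapTupleHeaderClearMatch(tup) \
        ((tup)->t_infomask2 &= ~HEAP_TUPLE_HAS_MATCH)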
Hi,

On 2023-04-12 10:57:17 -0400, Melanie Plageman wrote:
> HeapTupleHeaderHasMatch() checks if HEAP_TUPLE_HAS_MATCH is set.
> [...]
> If you redefine HEAP_TUPLE_HAS_MATCH as something that isn't already
> used, say 0x1800, the query returns correct results.
> [...]
> The question is, why does this only happen for a parallel full hash join?

I'd guess that the PHJ code is missing a HeapTupleHeaderClearMatch()
somewhere, but the non-parallel case isn't.

Greetings,

Andres Freund
On Wed, Apr 12, 2023 at 2:14 PM Andres Freund <andres@anarazel.de> wrote:
> I'd guess that the PHJ code is missing a HeapTupleHeaderClearMatch()
> somewhere, but the non-parallel case isn't.

Indeed. Thanks! This diff fixes the case Richard provided.

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index a45bd3a315..54c06c5eb3 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -1724,6 +1724,7 @@ retry:
     /* Store the hash value in the HashJoinTuple header. */
     hashTuple->hashvalue = hashvalue;
     memcpy(HJTUPLE_MINTUPLE(hashTuple), tuple, tuple->t_len);
+    HeapTupleHeaderClearMatch(HJTUPLE_MINTUPLE(hashTuple));

     /* Push it onto the front of the bucket's list */
     ExecParallelHashPushTuple(&hashtable->buckets.shared[bucketno],

I will propose a patch that includes this change and a test.

I just want to convince myself that ExecParallelHashTableInsertCurrentBatch()
covers the non-batch 0 cases and we don't need to add something to
sts_puttuple().

- Melanie
On Wed, Apr 12, 2023 at 2:59 PM Melanie Plageman <melanieplageman@gmail.com> wrote:
> I just want to convince myself that ExecParallelHashTableInsertCurrentBatch()
> covers the non-batch 0 cases and we don't need to add something to
> sts_puttuple().

So, indeed, tuples in batches after batch 0 already had their match bit
cleared by ExecParallelHashTableInsertCurrentBatch().

Attached patch includes the fix for ExecParallelHashTableInsert() as
well as a test. I toyed with adapting one of the existing parallel full
hash join tests to cover this case, however, I think Richard's repro is
much more clear. Maybe it is worth throwing in a few updates to the
tables in the existing queries to provide coverage for the other
HeapTupleHeaderClearMatch() calls in the code, though.

- Melanie
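For context, the batch-loading insert path already ends with a clear of
the match bit; a paraphrased sketch of that portion of nodeHash.c, with
surrounding code elided and variable names assumed:

    /* Paraphrased sketch of the tail of ExecParallelHashTableInsertCurrentBatch()
     * in nodeHash.c (surrounding code elided, variable names assumed). Tuples
     * loaded into later batches get their match bit cleared here, which is why
     * only the batch-0 path in ExecParallelHashTableInsert() needed the fix. */
    hashTuple->hashvalue = hashvalue;
    memcpy(HJTUPLE_MINTUPLE(hashTuple), tuple, tuple->t_len);
    HeapTupleHeaderClearMatch(HJTUPLE_MINTUPLE(hashTuple));
    ExecParallelHashPushTuple(&hashtable->buckets.shared[bucketno],
                              hashTuple, shared);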
On Thu, Apr 13, 2023 at 9:48 AM Melanie Plageman <melanieplageman@gmail.com> wrote:
> Attached patch includes the fix for ExecParallelHashTableInsert() as
> well as a test. I toyed with adapting one of the existing parallel full
> hash join tests to cover this case, however, I think Richard's repro is
> much more clear. Maybe it is worth throwing in a few updates to the
> tables in the existing queries to provide coverage for the other
> HeapTupleHeaderClearMatch() calls in the code, though.

Oof. Analysis and code LGTM.

I thought about the way non-parallel HJ also clears the match bits when
re-using the hash table for rescans. PHJ doesn't keep hash tables
across rescans. (There's no fundamental reason why it couldn't, but
there was some complication and it seemed absurd to have NestLoop over
Gather over PHJ, forking a new set of workers for every tuple, so I
didn't implement that in the original PHJ.) But... there is something a
little odd about the code in ExecHashTableResetMatchFlags(), or the
fact that we appear to be calling it: it's using the unshared union
member unconditionally, which wouldn't actually work for PHJ (there
should be a variant of that function with Parallel in its name if we
ever want that to work). That's not a bug AFAICT, as in fact we don't
actually call it--it should be unreachable because the hash table
should be gone when we rescan--but it's confusing. I'm wondering if we
should put in something explicit about that, maybe a comment and an
assertion in ExecReScanHashJoin().

+-- Ensure that hash join tuple match bits have been cleared before putting them
+-- into the hashtable.

Could you mention that the match flag steals a bit from the HOT flag,
ie *why* we're testing a join after an update? And if we're going to
exercise/test that case, should we do the non-parallel version too?

For the commit message, I think it's a good idea to use something like
"Fix ..." for the headline of bug fix commits to make that clearer, and
to add something like "oversight in commit XYZ" in the body, just to
help people connect the dots. (Yeah, I know I failed to reference the
delinquent commit in the recent assertion-removal commit, my bad.)

I think "Discussion:" footers are supposed to use
https://postgr.es/m/XXX shortened URLs.
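For illustration only, the kind of comment-plus-assertion being floated
here might look something like the sketch below; the field names
(hj_HashTable, parallel_state) are assumptions about the surrounding
code, and this is not a quote of any committed change:

    /* Hypothetical sketch of a guard in ExecReScanHashJoin(); not committed code. */
    void
    ExecReScanHashJoin(HashJoinState *node)
    {
        /*
         * Parallel hash join does not keep its hash table across rescans, so
         * any hash table we still see here must be an unshared one; otherwise
         * an ExecHashTableResetMatchFlags() call, which only knows about the
         * unshared bucket array, would be reading the wrong union member.
         */
        Assert(node->hj_HashTable == NULL ||
               node->hj_HashTable->parallel_state == NULL);

        /* ... existing rescan logic ... */
    }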
On Wed, Apr 12, 2023 at 6:50 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> [...] I'm wondering if we should put in something explicit about that,
> maybe a comment and an assertion in ExecReScanHashJoin().

An assert about it not being a parallel hash join? I support this.

> +-- Ensure that hash join tuple match bits have been cleared before putting them
> +-- into the hashtable.
>
> Could you mention that the match flag steals a bit from the HOT flag,
> ie *why* we're testing a join after an update?

v2 attached has some wordsmithing along these lines.

> And if we're going to
> exercise/test that case, should we do the non-parallel version too?

I've added this.

I thought if we were adding the serial case, we might as well add the
multi-batch case as well. However, that proved a bit more challenging.
We can get a HOT tuple in one of the existing tables with no issues.
Doing this and then deleting the reset match bit code doesn't cause any
of the tests to fail, however, because we use this expression as the
join condition when we want to emit NULL-extended unmatched tuples:

    select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);

I don't think we want to add yet another time-consuming test to this
test file. So, I was trying to decide if it was worth changing these
existing tests so that they would fail when the match bit wasn't reset.
I'm not sure.

> For the commit message, I think it's a good idea to use something like
> "Fix ..." for the headline of bug fix commits to make that clearer,
> and to add something like "oversight in commit XYZ" in the body, just
> to help people connect the dots.

I've made these edits and tried to improve the commit message clarity
in general.

> I think "Discussion:" footers are supposed to use
> https://postgr.es/m/XXX shortened URLs.

Hmm. Is the problem with mine that I included "flat"? Because I did use
postgr.es/m format. The message id is unfortunately long, but I believe
that is on google and not me.
- Melanie
On Thu, Apr 13, 2023 at 12:31 PM Melanie Plageman <melanieplageman@gmail.com> wrote:
> On Wed, Apr 12, 2023 at 6:50 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > I think "Discussion:" footers are supposed to use
> > https://postgr.es/m/XXX shortened URLs.
>
> Hmm. Is the problem with mine that I included "flat"? Because I did use
> postgr.es/m format. The message id is unfortunately long, but I believe
> that is on google and not me.

For some reason I thought we weren't supposed to use the flat thing,
but it looks like I'm just wrong and people do that all the time so I
take that back.

Pushed. Thanks Richard and Melanie.
On Thu, Apr 13, 2023 at 7:06 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> For some reason I thought we weren't supposed to use the flat thing,
> but it looks like I'm just wrong and people do that all the time so I
> take that back.
>
> Pushed. Thanks Richard and Melanie.

I tend to use http://postgr.es/m/ or https://postgr.es/m/ just to keep
the URL a bit shorter, and also because I like to point anyone reading
the commit log to the particular message that I think is most relevant
rather than to the thread as a whole. But I don't think there's any
hard-and-fast rule that committers have to do it one way rather than
another.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Apr 12, 2023 at 08:31:26PM -0400, Melanie Plageman wrote:
> On Wed, Apr 12, 2023 at 6:50 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > And if we're going to
> > exercise/test that case, should we do the non-parallel version too?
>
> I've added this. I thought if we were adding the serial case, we might
> as well add the multi-batch case as well. However, that proved a bit
> more challenging.
> [...]
> I don't think we want to add yet another time-consuming test to this
> test file. So, I was trying to decide if it was worth changing these
> existing tests so that they would fail when the match bit wasn't reset.
> I'm not sure.

I couldn't stop thinking about how my explanation for why this test
didn't fail sounded wrong. After some further investigation, I found
that the real reason that the HOT bit is already cleared in the tuples
inserted into the hashtable for this query is that the tuple descriptor
for the relation "simple" and the target list for the scan node are not
identical (because we only need to retain a single column from simple
in order to eventually do count(*)), so we make a new virtual tuple and
build projection info for the scan node. The virtual tuple doesn't have
the HOT bit set anymore (the buffer heap tuple would have). So we
couldn't fail a test of the code clearing the match bit.

Ultimately this is probably fine. If we wanted to modify one of the
existing tests to cover the multi-batch case, changing the select
count(*) to a select * would do the trick. I imagine we wouldn't want
to do this because of the excessive output this would produce. I
wondered if there was a pattern in the tests for getting around this.
But, perhaps we don't care enough to cover this code.

- Melanie
On Wed, Apr 19, 2023 at 11:17:04AM -0400, Melanie Plageman wrote:
> Ultimately this is probably fine. If we wanted to modify one of the
> existing tests to cover the multi-batch case, changing the select
> count(*) to a select * would do the trick. I imagine we wouldn't want to
> do this because of the excessive output this would produce. I wondered
> if there was a pattern in the tests for getting around this.

You could use explain (ANALYZE). But the output is machine-dependent in
various ways (which is why the tests use "explain analyze" so rarely).

--
Justin
Hi,

On 2023-04-19 12:16:24 -0500, Justin Pryzby wrote:
> On Wed, Apr 19, 2023 at 11:17:04AM -0400, Melanie Plageman wrote:
> > Ultimately this is probably fine. If we wanted to modify one of the
> > existing tests to cover the multi-batch case, changing the select
> > count(*) to a select * would do the trick. [...]
>
> You could use explain (ANALYZE). But the output is machine-dependent in
> various ways (which is why the tests use "explain analyze" so rarely).

I think with sufficient options it's not machine specific. We have a
bunch of

    EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF) ..

in our tests.

Greetings,

Andres Freund
On Wed, Apr 19, 2023 at 12:20:51PM -0700, Andres Freund wrote:
> On 2023-04-19 12:16:24 -0500, Justin Pryzby wrote:
> > You could use explain (ANALYZE). But the output is machine-dependent in
> > various ways (which is why the tests use "explain analyze" so rarely).
>
> I think with sufficient options it's not machine specific.

It *can* be machine specific depending on the node type.

In particular, for parallel workers, it shows "Workers Launched: ..",
which can vary even across executions on the same machine. And don't
forget about "loops=".

Plus:

src/backend/commands/explain.c: "Buckets: %d  Batches: %d  Memory Usage: %ldkB\n",

> We have a bunch of
> EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF) ..
> in our tests.

There's 81 uses of "timing off", out of a total of ~1600 explains. Most
of them are in partition_prune.sql. explain analyze is barely used.

I sent a patch to elide the machine-specific parts, which would make it
easier to use. But there was no interest.

--
Justin
On Wed, Apr 19, 2023 at 3:20 PM Andres Freund <andres@anarazel.de> wrote:
> On 2023-04-19 12:16:24 -0500, Justin Pryzby wrote:
> > You could use explain (ANALYZE). But the output is machine-dependent in
> > various ways (which is why the tests use "explain analyze" so rarely).
>
> I think with sufficient options it's not machine specific. We have a bunch of
>
>     EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF) ..
>
> in our tests.
Cool. Yea, so ultimately these options are almost enough but memory
usage changes from execution to execution. There are some tests which do
regexp_replace() on the memory usage part of the EXPLAIN ANALYZE output
to allow us to still compare the plans. However, I figured if I was
already going to go to the trouble of using regexp_replace(), I might as
well write a function that returns the "Actual Rows" field from the
EXPLAIN ANALYZE output.
The attached patch does that. I admittedly mostly copy-pasted the
plpgsql function from similar examples in other tests, and I suspect it
may be overkill and also poorly written.
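Purely for illustration, a minimal sketch of the kind of helper being
described might look like the following; the function name, options,
and regex are assumptions, and this is not the attached patch:

    -- Illustrative sketch only: run a query under EXPLAIN ANALYZE with
    -- machine-stable options and return just the "actual rows" counts.
    -- Name and regex are assumptions, not the attached patch.
    create or replace function actual_rows_sketch(query text)
    returns setof int language plpgsql as
    $$
    declare
        ln text;
    begin
        for ln in
            execute 'explain (analyze, costs off, summary off, timing off) ' || query
        loop
            -- keep only the deterministic row counts from each plan line
            if ln ~ 'actual rows=\d+' then
                return next (regexp_match(ln, 'actual rows=(\d+)'))[1]::int;
            end if;
        end loop;
    end;
    $$;

    -- example usage:
    -- select * from actual_rows_sketch('select * from a full join b on a.i = b.i');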
The nice thing about this approach is that we can modify some of the
existing tests in join_hash.sql to use this function and cover the code
to reset the matchbit for serial hashjoin, single batch parallel
hashjoin, and all batches of parallel multi-batch hashjoin without any
additional queries. (I'll leave testing match bit resetting with the
skew hashtable and match bit resetting in case of a rescan for another
day.)
I was able to delete the tests added in 558c9d75fe, as they became
redundant.
I wonder if any other tests are in need of an EXPLAIN (ANALYZE,
MEMORY_USAGE OFF) option? Perhaps it is quite unusual to only require a
deterministic field like 'Actual Rows'. If we had that option we could
also remove the extra EXPLAIN invocations before the actual query
executions.
- Melanie
On Wed, Apr 19, 2023 at 8:41 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> There's 81 uses of "timing off", out of a total of ~1600 explains. Most
> of them are in partition_prune.sql. explain analyze is barely used.
>
> I sent a patch to elide the machine-specific parts, which would make it
> easier to use. But there was no interest.

While I don't know about other use cases, I would have used that here.
Do you still have that patch laying around? I'd be interested to at
least review it.

- Melanie
On Wed, Apr 19, 2023 at 8:43 PM Melanie Plageman <melanieplageman@gmail.com> wrote:
> The attached patch does that. I admittedly mostly copy-pasted the
> plpgsql function from similar examples in other tests, and I suspect it
> may be overkill and also poorly written.

I renamed the function to join_hash_actual_rows to avoid potentially
affecting other tests. Nothing about the function is specific to a hash
join plan, so I think it is more clear to prefix the function with the
test file name.

v2 attached.

- Melanie
On Wed, Apr 19, 2023 at 08:47:07PM -0400, Melanie Plageman wrote:
> On Wed, Apr 19, 2023 at 8:41 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > I sent a patch to elide the machine-specific parts, which would make it
> > easier to use. But there was no interest.
>
> While I don't know about other use cases, I would have used that here.
> Do you still have that patch laying around? I'd be interested to at
> least review it.

https://commitfest.postgresql.org/41/3409/
I noticed that BF animal conchuela has several times fallen over on the
test case added by 558c9d75f:

diff -U3 /home/pgbf/buildroot/HEAD/pgsql.build/src/test/regress/expected/join_hash.out /home/pgbf/buildroot/HEAD/pgsql.build/src/test/regress/results/join_hash.out
--- /home/pgbf/buildroot/HEAD/pgsql.build/src/test/regress/expected/join_hash.out  2023-04-19 10:20:26.159840000 +0200
+++ /home/pgbf/buildroot/HEAD/pgsql.build/src/test/regress/results/join_hash.out   2023-04-19 10:21:47.971900000 +0200
@@ -974,8 +974,8 @@
 SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id;
  id | id
 ----+----
-  1 |
     |  2
+  1 |
 (2 rows)

 -- Test serial full hash join.

Considering that this is a parallel plan, I don't think there's any
mystery about why an ORDER-BY-less query might have unstable output
order; the only mystery is why more of the buildfarm hasn't failed.
Can we just add "ORDER BY t1.id" to this query? It looks like you get
the same PHJ plan, although now underneath Sort/Gather Merge.

            regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2023-04-19%2008%3A20%3A56
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2023-05-03%2006%3A21%3A03
[3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2023-05-19%2022%3A21%3A04
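A sketch of the suggested adjustment, not necessarily the committed
wording of the test:

    -- Sketch only: the same PHJ plan, now under Sort/Gather Merge, with a
    -- deterministic output order.
    SELECT * FROM hjtest_matchbits_t1 t1
      FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id
      ORDER BY t1.id;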
On Fri, May 19, 2023 at 8:05 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Considering that this is a parallel plan, I don't think there's any
> mystery about why an ORDER-BY-less query might have unstable output
> order; the only mystery is why more of the buildfarm hasn't failed.
> Can we just add "ORDER BY t1.id" to this query? It looks like you
> get the same PHJ plan, although now underneath Sort/Gather Merge.

Yes, this was an oversight on my part. Attached is the patch that does
just what you suggested.

I can't help but take this opportunity to bump my un-reviewed patch
further upthread which adds additional test coverage for match bit
clearing for multi-batch hash joins [1]. It happens to also remove the
test that failed on the buildfarm, which is why I thought to bring it
up.

-- Melanie

[1] https://www.postgresql.org/message-id/CAAKRu_bdwDN_aHVctHcc9VoDP9av7LUMeuLbch1fHD2ESouw1g%40mail.gmail.com
On Wed, Jun 07, 2023 at 05:16:12PM -0400, Melanie Plageman wrote:
> On Fri, May 19, 2023 at 8:05 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Considering that this is a parallel plan, I don't think there's any
> > mystery about why an ORDER-BY-less query might have unstable output
> > order; the only mystery is why more of the buildfarm hasn't failed.
> > Can we just add "ORDER BY t1.id" to this query? It looks like you
> > get the same PHJ plan, although now underneath Sort/Gather Merge.
>
> Yes, this was an oversight on my part. Attached is the patch that does
> just what you suggested.

Confirmed that adding an ORDER BY adds a Sort node between a Gather
Merge and a Parallel Hash Full Join, not removing coverage.

This has fallen through the cracks and conchuela has failed again
today, so I went ahead and applied the fix on HEAD. Thanks!

--
Michael
Michael Paquier <michael@paquier.xyz> writes:
> This has fallen through the cracks and conchuela has failed again
> today, so I went ahead and applied the fix on HEAD. Thanks!

Thanks! I'd intended to push that but it didn't get to the top of the
to-do queue yet. (I'm still kind of wondering why only conchuela has
failed to date.)

            regards, tom lane
On Sun, Jun 11, 2023 at 11:24 PM Michael Paquier <michael@paquier.xyz> wrote:
> Confirmed that adding an ORDER BY adds a Sort node between a Gather
> Merge and a Parallel Hash Full Join, not removing coverage.
>
> This has fallen through the cracks and conchuela has failed again
> today, so I went ahead and applied the fix on HEAD. Thanks!

Thanks!