Thread: Getting different number of results when using hashjoin on/off

Getting different number of results when using hashjoin on/off

From

Mario Weilguni

Date:

28 November 2005, 09:00:31

I've a problem that might be a bug in the core system (hashjoins) or with ltree using gist-index, but I fail miserable
toproduce a useful testcase (using 8.1, worked in 8.0):
 

A query produces wrong (=0) results, when a different plan is enforced, I get a merge-join plan that looks similar, but
producesthe correct result (=16 rows).
 

I can post a queryplan, but cannot post the data itself since it's confidental (though I might be able to randomize
somedata and construct a self contained case, but this would take quite some time).
 


The working case is:
set enable_hashjoin to off;Seq Scan on foo1 cost=0.00..423583.57 rows=10810 width=4) (actual time=675.422..706.815
rows=16loops=1)  Filter: (subplan)  SubPlan    ->  Merge Join  (cost=19.49..19.55 rows=1 width=0) (actual
time=0.028..0.028rows=0 loops=21619)          Merge Cond: ("outer".str_id = "inner".id)          ->  Sort
(cost=6.49..6.50rows=5 width=4) (actual time=0.023..0.023 rows=0 loops=21619)                Sort Key: bz.str_id
       ->  Bitmap Heap Scan on foo2 bz  (cost=2.02..6.43 rows=5 width=4) (actual time=0.012..0.012 rows=0 loops=21619)
                   Recheck Cond: (bid = $0)                      ->  Bitmap Index Scan on foo2_bid_key1
(cost=0.00..2.02rows=5 width=0) (actual time=0.009..0.009 rows=0 loops=21619)                            Index Cond:
(bid= $0)          ->  Sort  (cost=13.00..13.01 rows=6 width=4) (actual time=0.002..0.003 rows=1 loops=136)
  Sort Key: str.id                ->  Bitmap Heap Scan on structure str  (cost=2.02..12.92 rows=6 width=4) (actual
time=0.095..0.097rows=1 loops=1)                      Recheck Cond: (path ~ '142.2330445.2330598.2330676.*'::lquery)
                 ->  Bitmap Index Scan on str_uk4  (cost=0.00..2.02 rows=6 width=0) (actual time=0.086..0.086 rows=1
loops=1)                           Index Cond: (path ~ '142.2330445.2330598.2330676.*'::lquery)Total runtime: 707.019
ms

16 rows...


The failing case is:
set enable_hashjoin to on;Seq Scan on foo1 cost=0.00..421679.00 rows=10810 width=4) (actual time=154.663..154.663
rows=0loops=1)  Filter: (subplan)  SubPlan    ->  Hash Join  (cost=8.47..19.46 rows=1 width=0) (actual
time=0.004..0.004rows=0 loops=21619)          Hash Cond: ("outer".id = "inner".str_id)          ->  Bitmap Heap Scan on
structurestr  (cost=2.02..12.92 rows=6 width=4) (actual time=0.100..30.095 rows=1 loops=1)                Recheck Cond:
(path~ '142.2330445.2330598.2330676.*'::lquery)                ->  Bitmap Index Scan on str_uk4  (cost=0.00..2.02
rows=6width=0) (actual time=0.090..0.090 rows=1 loops=1)                      Index Cond: (path ~
'142.2330445.2330598.2330676.*'::lquery)         ->  Hash  (cost=6.43..6.43 rows=5 width=4) (actual time=0.032..0.032
rows=0loops=1)                ->  Bitmap Heap Scan on foo2 bz  (cost=2.02..6.43 rows=5 width=4) (actual
time=0.025..0.025rows=0 loops=1)                      Recheck Cond: (bid = $0)                      ->  Bitmap Index
Scanon foo2_bid_key1  (cost=0.00..2.02 rows=5 width=0) (actual time=0.021..0.021 rows=0 loops=1)
   Index Cond: (bid = $0)Total runtime: 154.862 ms
 
No rows....

The query itself is quite simple:
select foo1.id
from foo1
where  foo1.datloesch is null and exists (select 1                from foo2 bz,                    structure str
     where bz.bid=foo1.id                and str.id = bz.str_id                and str.path ~ '*.2330676.*'
);

The path field is an "ltree" column, with an GIST index on it.


Any ideas what I could try to track this down?

Best regards,Mario Weilguni

Re: Getting different number of results when using hashjoin

From

Christopher Kings-Lynne

Date:

28 November 2005, 09:12:29

> The path field is an "ltree" column, with an GIST index on it.

Something to do with bitmap indexscans on lossy indexes?

Chris

Re: Getting different number of results when using hashjoin on/off

From

Mario Weilguni

Date:

28 November 2005, 09:21:57

Am Montag, 28. November 2005 14:12 schrieb Christopher Kings-Lynne:
> > The path field is an "ltree" column, with an GIST index on it.
>
> Something to do with bitmap indexscans on lossy indexes?
>
> Chris

I doubt that, "set enable_bitmapscan to off" produces the wrong result as 
well.

Best regards
Mario

Re: Getting different number of results when using hashjoin on/off

From

Tom Lane

Date:

28 November 2005, 11:11:31

Mario Weilguni <mweilguni@sime.com> writes:
> The failing case is:
> ...
>    SubPlan
>      ->  Hash Join  (cost=8.47..19.46 rows=1 width=0) (actual time=0.004..0.004 rows=0 loops=21619)
>            Hash Cond: ("outer".id = "inner".str_id)
>            ->  Bitmap Heap Scan on structure str  (cost=2.02..12.92 rows=6 width=4) (actual time=0.100..30.095 rows=1
loops=1)
>                  Recheck Cond: (path ~ '142.2330445.2330598.2330676.*'::lquery)
>                  ->  Bitmap Index Scan on str_uk4  (cost=0.00..2.02 rows=6 width=0) (actual time=0.090..0.090 rows=1
loops=1)
>                        Index Cond: (path ~ '142.2330445.2330598.2330676.*'::lquery)
>            ->  Hash  (cost=6.43..6.43 rows=5 width=4) (actual time=0.032..0.032 rows=0 loops=1)
>                  ->  Bitmap Heap Scan on foo2 bz  (cost=2.02..6.43 rows=5 width=4) (actual time=0.025..0.025 rows=0
loops=1)
>                        Recheck Cond: (bid = $0)
>                        ->  Bitmap Index Scan on foo2_bid_key1  (cost=0.00..2.02 rows=5 width=0) (actual
time=0.021..0.021rows=0 loops=1)
 
>                              Index Cond: (bid = $0)

Hmm, I wonder why the hash join's input nodes are showing "loops=1" ...
the hash depends on the subplan parameter $0 so it needs to be
re-evaluated each time through.  It looks like that's not happening.
Do you have the corresponding results from 8.0 --- if so, what do
the loop counts look like?
        regards, tom lane

Re: Getting different number of results when using hashjoin on/off

From

Tom Lane

Date:

28 November 2005, 12:20:27

"Mario Weilguni" <mario.weilguni@icomedias.com> writes:
> Yes. This is from a 8.0.3 (with slightly older and different data,
> resulting in only 9 rows, but the rest is the same):

Yeah, that looks more reasonable.

I tried to reproduce this, without any luck:

regression=# explain analyze select count(*) from tenk1 a where exists (select 1 from tenk1 b, tenk1 c where
b.unique1=c.unique2and b.hundred in (4,5) and c.hundred=a.hundred);
                QUERY PLAN
 

--------------------------------------------------------------------------------------------------------------------------------------------------------Aggregate
(cost=3879742.37..3879742.38 rows=1 width=0) (actual time=46579.077..46579.082 rows=1 loops=1)  ->  Seq Scan on tenk1 a
(cost=0.00..3879729.87 rows=5000 width=0) (actual time=5.129..46528.208 rows=8500 loops=1)        Filter: (subplan)
  SubPlan          ->  Hash Join  (cost=229.20..546.66 rows=2 width=0) (actual time=4.569..4.569 rows=1 loops=10000)
           Hash Cond: ("outer".unique1 = "inner".unique2)                ->  Bitmap Heap Scan on tenk1 b
(cost=4.69..321.15rows=196 width=4) (actual time=0.947..1.698 rows=90 loops=10000)                      Recheck Cond:
((hundred= 4) OR (hundred = 5))                      ->  BitmapOr  (cost=4.69..4.69 rows=197 width=0) (actual
time=0.544..0.544rows=0 loops=10000)                            ->  Bitmap Index Scan on tenk1_hundred
(cost=0.00..2.34rows=98 width=0) (actual time=0.271..0.271 rows=100 loops=10000)                                  Index
Cond:(hundred = 4)                            ->  Bitmap Index Scan on tenk1_hundred  (cost=0.00..2.34 rows=98 width=0)
(actualtime=0.262..0.262 rows=100 loops=10000)                                  Index Cond: (hundred = 5)
->  Hash  (cost=224.26..224.26 rows=100 width=4) (actual time=2.370..2.370 rows=100 loops=10000)
-> Bitmap Heap Scan on tenk1 c  (cost=2.35..224.26 rows=100 width=4) (actual time=0.492..1.616 rows=100 loops=10000)
                       Recheck Cond: (hundred = $0)                            ->  Bitmap Index Scan on tenk1_hundred
(cost=0.00..2.35rows=100 width=0) (actual time=0.278..0.278 rows=100 loops=10000)
IndexCond: (hundred = $0)Total runtime: 46584.654 ms
 
(19 rows)

(I'm not bothering with setting up an ltree index, since the question
of what index is being used shouldn't affect hashjoin's decision to
rescan or not.)

That's using 8.1 branch CVS tip, but there aren't any related bug fixes
since 8.1 release.  We did have several bug fixes in the hash join code
during the 8.1 beta cycle though ... is it possible you are really
running an 8.1 beta and not 8.1.0?
        regards, tom lane

Re: Getting different number of results when using hashjoin on/off

From

"Mario Weilguni"

Date:

28 November 2005, 12:31:16

Thanks Tom for you quick answer!

No, I'm using 8.1.0, and tried it on different machines, always the same results.

SELECT version();
PostgreSQL 8.1.0 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.3.4 20040623 (Gentoo Hardened Linux 3.3.4-r1,
ssp-3.3.2-2,pie-8.7.6) 

Best regards,Mario Weilguni


icomedias - Digitale Kommunikation
------------------------------------------------------------------------
Mario Weilguni, Forschung und Entwicklung
mario.weilguni@icomedias.com, http://www.icomedias.com/

icomedias Österreich Systemhaus GmbH: 8020 Graz, Entenplatz 1  Tel: +43 (316) 721.671-272, Fax: -103

icomedias Deutschland Systemhaus GmbH: 10969 Berlin, Alexandrinenstraße 2-3 Tel: +49 (30) 695.399-272, Fax: -103

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Monday, November 28, 2005 5:20 PM
To: Mario Weilguni
Cc: Mario Weilguni; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Getting different number of results when using hashjoin on/off

"Mario Weilguni" <mario.weilguni@icomedias.com> writes:
> Yes. This is from a 8.0.3 (with slightly older and different data,
> resulting in only 9 rows, but the rest is the same):

Yeah, that looks more reasonable.

I tried to reproduce this, without any luck:

regression=# explain analyze select count(*) from tenk1 a where exists (select 1 from tenk1 b, tenk1 c where
b.unique1=c.unique2and b.hundred in (4,5) and c.hundred=a.hundred);
                QUERY PLAN 

--------------------------------------------------------------------------------------------------------------------------------------------------------Aggregate
(cost=3879742.37..3879742.38 rows=1 width=0) (actual time=46579.077..46579.082 rows=1 loops=1)  ->  Seq Scan on tenk1 a
(cost=0.00..3879729.87 rows=5000 width=0) (actual time=5.129..46528.208 rows=8500 loops=1)        Filter: (subplan)
  SubPlan          ->  Hash Join  (cost=229.20..546.66 rows=2 width=0) (actual time=4.569..4.569 rows=1 loops=10000)
           Hash Cond: ("outer".unique1 = "inner".unique2)                ->  Bitmap Heap Scan on tenk1 b
(cost=4.69..321.15rows=196 width=4) (actual time=0.947..1.698 rows=90 loops=10000)                      Recheck Cond:
((hundred= 4) OR (hundred = 5))                      ->  BitmapOr  (cost=4.69..4.69 rows=197 width=0) (actual
time=0.544..0.544rows=0 loops=10000)                            ->  Bitmap Index Scan on tenk1_hundred
(cost=0.00..2.34rows=98 width=0) (actual time=0.271..0.271 rows=100 loops=10000)                                  Index
Cond:(hundred = 4)                            ->  Bitmap Index Scan on tenk1_hundred  (cost=0.00..2.34 rows=98 width=0)
(actualtime=0.262..0.262 rows=100 loops=10000)                                  Index Cond: (hundred = 5)
->  Hash  (cost=224.26..224.26 rows=100 width=4) (actual time=2.370..2.370 rows=100 loops=10000)
-> Bitmap Heap Scan on tenk1 c  (cost=2.35..224.26 rows=100 width=4) (actual time=0.492..1.616 rows=100 loops=10000)
                       Recheck Cond: (hundred = $0)                            ->  Bitmap Index Scan on tenk1_hundred
(cost=0.00..2.35rows=100 width=0) (actual time=0.278..0.278 rows=100 loops=10000)
IndexCond: (hundred = $0)Total runtime: 46584.654 ms 
(19 rows)

(I'm not bothering with setting up an ltree index, since the question
of what index is being used shouldn't affect hashjoin's decision to
rescan or not.)

That's using 8.1 branch CVS tip, but there aren't any related bug fixes
since 8.1 release.  We did have several bug fixes in the hash join code
during the 8.1 beta cycle though ... is it possible you are really
running an 8.1 beta and not 8.1.0?
        regards, tom lane

Re: Getting different number of results when using hashjoin on/off

From

Tom Lane

Date:

28 November 2005, 13:08:59

"Mario Weilguni" <mario.weilguni@icomedias.com> writes:
> No, I'm using 8.1.0, and tried it on different machines, always the same results.

I see it, I think: the recent changes to avoid work when one or the
other side of the hash join is empty would exit the hash join leaving
a state that confused ExecReScanHashJoin() into thinking it didn't
have to do anything.  Try the attached patch.
        regards, tom lane


Index: src/backend/executor/nodeHashjoin.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/executor/nodeHashjoin.c,v
retrieving revision 1.75.2.1
diff -c -r1.75.2.1 nodeHashjoin.c
*** src/backend/executor/nodeHashjoin.c    22 Nov 2005 18:23:09 -0000    1.75.2.1
--- src/backend/executor/nodeHashjoin.c    28 Nov 2005 17:04:43 -0000
***************
*** 152,163 ****          * outer join, we can quit without scanning the outer relation.          */         if
(hashtable->totalTuples== 0 && node->js.jointype != JOIN_LEFT)
 
-         {
-             ExecHashTableDestroy(hashtable);
-             node->hj_HashTable = NULL;
-             node->hj_FirstOuterTupleSlot = NULL;             return NULL;
-         }          /*          * need to remember whether nbatch has increased since we began
--- 152,158 ----
***************
*** 487,493 ****     {         ExecHashTableDestroy(node->hj_HashTable);         node->hj_HashTable = NULL;
-         node->hj_FirstOuterTupleSlot = NULL;     }      /*
--- 482,487 ----
***************
*** 805,841 **** ExecReScanHashJoin(HashJoinState *node, ExprContext *exprCtxt) {     /*
-      * If we haven't yet built the hash table then we can just return; nothing
-      * done yet, so nothing to undo.
-      */
-     if (node->hj_HashTable == NULL)
-         return;
- 
-     /*      * In a multi-batch join, we currently have to do rescans the hard way,      * primarily because batch
tempfiles may have already been released. But      * if it's a single-batch join, and there is no parameter change for
the     * inner subnode, then we can just re-use the existing hash table without      * rebuilding it.      */
 
!     if (node->hj_HashTable->nbatch == 1 &&
!         ((PlanState *) node)->righttree->chgParam == NULL)
!     {
!         /* okay to reuse the hash table; needn't rescan inner, either */
!     }
!     else     {
!         /* must destroy and rebuild hash table */
!         ExecHashTableDestroy(node->hj_HashTable);
!         node->hj_HashTable = NULL;
!         node->hj_FirstOuterTupleSlot = NULL; 
!         /*
!          * if chgParam of subnode is not null then plan will be re-scanned by
!          * first ExecProcNode.
!          */
!         if (((PlanState *) node)->righttree->chgParam == NULL)
!             ExecReScan(((PlanState *) node)->righttree, exprCtxt);     }      /* Always reset intra-tuple state */
--- 799,830 ---- ExecReScanHashJoin(HashJoinState *node, ExprContext *exprCtxt) {     /*      * In a multi-batch join,
wecurrently have to do rescans the hard way,      * primarily because batch temp files may have already been released.
But     * if it's a single-batch join, and there is no parameter change for the      * inner subnode, then we can just
re-usethe existing hash table without      * rebuilding it.      */
 
!     if (node->hj_HashTable != NULL)     {
!         if (node->hj_HashTable->nbatch == 1 &&
!             ((PlanState *) node)->righttree->chgParam == NULL)
!         {
!             /* okay to reuse the hash table; needn't rescan inner, either */
!         }
!         else
!         {
!             /* must destroy and rebuild hash table */
!             ExecHashTableDestroy(node->hj_HashTable);
!             node->hj_HashTable = NULL; 
!             /*
!              * if chgParam of subnode is not null then plan will be re-scanned
!              * by first ExecProcNode.
!              */
!             if (((PlanState *) node)->righttree->chgParam == NULL)
!                 ExecReScan(((PlanState *) node)->righttree, exprCtxt);
!         }     }      /* Always reset intra-tuple state */
***************
*** 847,852 ****
--- 836,842 ----     node->js.ps.ps_TupFromTlist = false;     node->hj_NeedNewOuter = true;     node->hj_MatchedOuter =
false;
+     node->hj_FirstOuterTupleSlot = NULL;      /*      * if chgParam of subnode is not null then plan will be
re-scannedby

Re: Getting different number of results when using hashjoin on/off

From

"Mario Weilguni"

Date:

28 November 2005, 13:41:15

Hello Tom,

Thanks for the quick response, I've tried the patch, but it did not work
as expected. When I set enable_hashjoin to off, everything works as
expected, but with hashjoin on I do not even get results anymore, CPU is
going up to 100% and after 3 minutes I cancelled the query (it normale
would take ~100-500 milliseconds).

I will check the patch on a different machine again and inform you of
the results.

Best regards,Mario Weilguni


-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Monday, November 28, 2005 6:09 PM
To: Mario Weilguni
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Getting different number of results when using
hashjoin on/off

"Mario Weilguni" <mario.weilguni@icomedias.com> writes:
> No, I'm using 8.1.0, and tried it on different machines, always the
same results.

I see it, I think: the recent changes to avoid work when one or the
other side of the hash join is empty would exit the hash join leaving
a state that confused ExecReScanHashJoin() into thinking it didn't
have to do anything.  Try the attached patch.
        regards, tom lane


Index: src/backend/executor/nodeHashjoin.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/executor/nodeHashjoin.c,v
retrieving revision 1.75.2.1
diff -c -r1.75.2.1 nodeHashjoin.c
*** src/backend/executor/nodeHashjoin.c    22 Nov 2005 18:23:09 -0000
1.75.2.1
--- src/backend/executor/nodeHashjoin.c    28 Nov 2005 17:04:43 -0000
***************
*** 152,163 ****          * outer join, we can quit without scanning the outer
relation.          */         if (hashtable->totalTuples == 0 && node->js.jointype !=
JOIN_LEFT)
-         {
-             ExecHashTableDestroy(hashtable);
-             node->hj_HashTable = NULL;
-             node->hj_FirstOuterTupleSlot = NULL;             return NULL;
-         }          /*          * need to remember whether nbatch has increased since
we began
--- 152,158 ----
***************
*** 487,493 ****     {         ExecHashTableDestroy(node->hj_HashTable);         node->hj_HashTable = NULL;
-         node->hj_FirstOuterTupleSlot = NULL;     }      /*
--- 482,487 ----
***************
*** 805,841 **** ExecReScanHashJoin(HashJoinState *node, ExprContext *exprCtxt) {     /*
-      * If we haven't yet built the hash table then we can just
return; nothing
-      * done yet, so nothing to undo.
-      */
-     if (node->hj_HashTable == NULL)
-         return;
-
-     /*      * In a multi-batch join, we currently have to do rescans the
hard way,      * primarily because batch temp files may have already been
released. But      * if it's a single-batch join, and there is no parameter change
for the      * inner subnode, then we can just re-use the existing hash
table without      * rebuilding it.      */
!     if (node->hj_HashTable->nbatch == 1 &&
!         ((PlanState *) node)->righttree->chgParam == NULL)
!     {
!         /* okay to reuse the hash table; needn't rescan inner,
either */
!     }
!     else     {
!         /* must destroy and rebuild hash table */
!         ExecHashTableDestroy(node->hj_HashTable);
!         node->hj_HashTable = NULL;
!         node->hj_FirstOuterTupleSlot = NULL;
!         /*
!          * if chgParam of subnode is not null then plan will be
re-scanned by
!          * first ExecProcNode.
!          */
!         if (((PlanState *) node)->righttree->chgParam == NULL)
!             ExecReScan(((PlanState *) node)->righttree,
exprCtxt);     }      /* Always reset intra-tuple state */
--- 799,830 ---- ExecReScanHashJoin(HashJoinState *node, ExprContext *exprCtxt) {     /*      * In a multi-batch join,
wecurrently have to do rescans the 
hard way,      * primarily because batch temp files may have already been
released. But      * if it's a single-batch join, and there is no parameter change
for the      * inner subnode, then we can just re-use the existing hash
table without      * rebuilding it.      */
!     if (node->hj_HashTable != NULL)     {
!         if (node->hj_HashTable->nbatch == 1 &&
!             ((PlanState *) node)->righttree->chgParam ==
NULL)
!         {
!             /* okay to reuse the hash table; needn't rescan
inner, either */
!         }
!         else
!         {
!             /* must destroy and rebuild hash table */
!             ExecHashTableDestroy(node->hj_HashTable);
!             node->hj_HashTable = NULL;
!             /*
!              * if chgParam of subnode is not null then plan
will be re-scanned
!              * by first ExecProcNode.
!              */
!             if (((PlanState *) node)->righttree->chgParam ==
NULL)
!                 ExecReScan(((PlanState *)
node)->righttree, exprCtxt);
!         }     }      /* Always reset intra-tuple state */
***************
*** 847,852 ****
--- 836,842 ----     node->js.ps.ps_TupFromTlist = false;     node->hj_NeedNewOuter = true;     node->hj_MatchedOuter =
false;
+     node->hj_FirstOuterTupleSlot = NULL;      /*      * if chgParam of subnode is not null then plan will be
re-scanned by

Re: Getting different number of results when using hashjoin on/off

From

"Mario Weilguni"

Date:

28 November 2005, 13:55:05

Yes. This is from a 8.0.3 (with slightly older and different data,
resulting in only 9 rows, but the rest is the same):
Index Scan using ben_uk3 on foo1 ben  (cost=0.00..73867.23 rows=863
width=27) (actual time=38.591..501.839 rows=9 loops=1)  Filter: (subplan)  SubPlan    ->  Hash Join  (cost=14.25..42.53
rows=1width=0) (actual 
time=0.284..0.284 rows=0 loops=1725)          Hash Cond: ("outer".id = "inner".str_id)          ->  Index Scan using
str_uk4on structure str 
(cost=0.00..27.91 rows=13 width=4) (actual time=0.765..4.043 rows=1
loops=112)                Index Cond: (path ~ '*.2330676.*'::lquery)          ->  Hash  (cost=14.23..14.23 rows=10
width=4)(actual 
time=0.012..0.012 rows=0 loops=1725)                ->  Index Scan using foo2_ben_id_key1 on foo2 bz
(cost=0.00..14.23 rows=10 width=4) (actual time=0.008..0.009 rows=1
loops=1725)                      Index Cond: (ben_id = $0)Total runtime: 501.980 ms

Best regards

P.s. sorry for the stupid quoting, I've to use Outlook....

Mario Weilguni <mweilguni@sime.com> writes:
> The failing case is:
> ...
>    SubPlan
>      ->  Hash Join  (cost=8.47..19.46 rows=1 width=0) (actual
time=0.004..0.004 rows=0 loops=21619)
>            Hash Cond: ("outer".id = "inner".str_id)
>            ->  Bitmap Heap Scan on structure str  (cost=2.02..12.92
rows=6 width=4) (actual time=0.100..30.095 rows=1 loops=1)
>                  Recheck Cond: (path ~
'142.2330445.2330598.2330676.*'::lquery)
>                  ->  Bitmap Index Scan on str_uk4  (cost=0.00..2.02
rows=6 width=0) (actual time=0.090..0.090 rows=1 loops=1)
>                        Index Cond: (path ~
'142.2330445.2330598.2330676.*'::lquery)
>            ->  Hash  (cost=6.43..6.43 rows=5 width=4) (actual
time=0.032..0.032 rows=0 loops=1)
>                  ->  Bitmap Heap Scan on foo2 bz  (cost=2.02..6.43
rows=5 width=4) (actual time=0.025..0.025 rows=0 loops=1)
>                        Recheck Cond: (bid = $0)
>                        ->  Bitmap Index Scan on foo2_bid_key1
(cost=0.00..2.02 rows=5 width=0) (actual time=0.021..0.021 rows=0
loops=1)
>                              Index Cond: (bid = $0)

Hmm, I wonder why the hash join's input nodes are showing "loops=1" ...
the hash depends on the subplan parameter $0 so it needs to be
re-evaluated each time through.  It looks like that's not happening.
>Do you have the corresponding results from 8.0 --- if so, what do
>the loop counts look like?

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate      subscribe-nomail command to
majordomo@postgresql.orgso that your      message can get through to the mailing list cleanly

Hashjoin startup strategy (was Re: Getting different number of results when using hashjoin on/off)

From

Tom Lane

Date:

28 November 2005, 17:27:16

"Mario Weilguni" <mario.weilguni@icomedias.com> writes:
> Thanks for the quick response, I've tried the patch, but it did not work
> as expected. When I set enable_hashjoin to off, everything works as
> expected, but with hashjoin on I do not even get results anymore, CPU is
> going up to 100% and after 3 minutes I cancelled the query (it normale
> would take ~100-500 milliseconds).

Try letting it run longer.  I think your expectation is tuned for the
broken implementation (which runs the subqueries only once instead of
26k times...)

The test case I developed for this failure in the regression database is

select count(*) from tenk1 a
where exists (select 1 from tenk1 b, tenk1 c             where b.unique1=c.unique2 and             b.hundred in (4,5)
andc.hundred=a.hundred+99);

8.0 prefers a nestloop for the subquery, and that plan runs in about
600 ms on my machine.  If forced to a hash join, it takes about 2450 ms.
8.1 prefers the hash join to start with, but takes 11300 ms to run it :-(
(after the patch that is).

The reason for the differential is that 8.1 guesses wrong about which
subplan to cycle first: most of the time, the inner plan is empty and
so there's no need to pull any rows from the outer plan, but 8.1 pulls
the first row from the outer plan anyway, and doing that 10000 times is
what's eating the extra runtime.  It looks from your previous message
that similar things are happening with your data distribution, allowing
8.0 to run faster for you than 8.1 does.

Not sure if there's much we can do about this.  The presence of the
upper-query parameter in the subplan makes it difficult to derive any
stats at all, let alone guess how often the subplan will be completely
empty, so I'm not sure the planner can help.

For a query like this, where the hash join is being done repeatedly,
it might be useful for the executor itself to track how often each
subplan has been seen to be empty.  In particular, the executor knows
that the outer subplan is parameterless and therefore should deliver
the same results each time (modulo volatile functions of course), so
after the first cycle it could know that there's no point in trying
the early fetch on that side.  Dunno if this will be of wide enough
use to be worth implementing though --- in simple cases the join
won't be rescanned and so the executor can't help.

Anyone have any other ideas?
        regards, tom lane

Re: Hashjoin startup strategy (was Re: Getting different number of results when using hashjoin on/off)

From

Tom Lane

Date:

28 November 2005, 19:50:22

I wrote:
> For a query like this, where the hash join is being done repeatedly,
> it might be useful for the executor itself to track how often each
> subplan has been seen to be empty.

I implemented a simple form of this, and it made 8.1 faster than 8.0
on the test case I was using.  Give it a try ...
        regards, tom lane


Index: src/backend/executor/nodeHashjoin.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/executor/nodeHashjoin.c,v
retrieving revision 1.75.2.2
diff -c -r1.75.2.2 nodeHashjoin.c
*** src/backend/executor/nodeHashjoin.c    28 Nov 2005 17:14:47 -0000    1.75.2.2
--- src/backend/executor/nodeHashjoin.c    28 Nov 2005 23:41:28 -0000
***************
*** 120,135 ****          * since we aren't going to be able to skip the join on the strength          * of an empty
innerrelation anyway.)          *          * The only way to make the check is to try to fetch a tuple from the
* outer plan node.  If we succeed, we have to stash it away for later          * consumption by
ExecHashJoinOuterGetTuple.         */
 
!         if (outerNode->plan->startup_cost < hashNode->ps.plan->total_cost ||
!             node->js.jointype == JOIN_LEFT)         {             node->hj_FirstOuterTupleSlot =
ExecProcNode(outerNode);            if (TupIsNull(node->hj_FirstOuterTupleSlot))                 return NULL;         }
       else             node->hj_FirstOuterTupleSlot = NULL;
 
--- 120,147 ----          * since we aren't going to be able to skip the join on the strength          * of an empty
innerrelation anyway.)          *
 
+          * If we are rescanning the join, we make use of information gained
+          * on the previous scan: don't bother to try the prefetch if the
+          * previous scan found the outer relation nonempty.  This is not
+          * 100% reliable since with new parameters the outer relation might
+          * yield different results, but it's a good heuristic.
+          *          * The only way to make the check is to try to fetch a tuple from the          * outer plan node.
Ifwe succeed, we have to stash it away for later          * consumption by ExecHashJoinOuterGetTuple.          */
 
!         if (node->js.jointype == JOIN_LEFT ||
!             (outerNode->plan->startup_cost < hashNode->ps.plan->total_cost &&
!              !node->hj_OuterNotEmpty))         {             node->hj_FirstOuterTupleSlot = ExecProcNode(outerNode);
          if (TupIsNull(node->hj_FirstOuterTupleSlot))
 
+             {
+                 node->hj_OuterNotEmpty = false;                 return NULL;
+             }
+             else
+                 node->hj_OuterNotEmpty = true;         }         else             node->hj_FirstOuterTupleSlot =
NULL;
***************
*** 159,164 ****
--- 171,183 ----          * scanning the outer relation          */         hashtable->nbatch_outstart =
hashtable->nbatch;
+ 
+         /*
+          * Reset OuterNotEmpty for scan.  (It's OK if we fetched a tuple
+          * above, because ExecHashJoinOuterGetTuple will immediately
+          * set it again.)
+          */
+         node->hj_OuterNotEmpty = false;     }      /*
***************
*** 454,459 ****
--- 473,479 ----     hjstate->js.ps.ps_TupFromTlist = false;     hjstate->hj_NeedNewOuter = true;
hjstate->hj_MatchedOuter= false;
 
+     hjstate->hj_OuterNotEmpty = false;      return hjstate; }
***************
*** 546,551 ****
--- 566,574 ----             *hashvalue = ExecHashGetHashValue(hashtable, econtext,
         hjstate->hj_OuterHashKeys); 
 
+             /* remember outer relation is not empty for possible rescan */
+             hjstate->hj_OuterNotEmpty = true;
+              return slot;         } 
***************
*** 810,816 ****         if (node->hj_HashTable->nbatch == 1 &&             ((PlanState *) node)->righttree->chgParam
==NULL)         {
 
!             /* okay to reuse the hash table; needn't rescan inner, either */         }         else         {
--- 833,851 ----         if (node->hj_HashTable->nbatch == 1 &&             ((PlanState *) node)->righttree->chgParam
==NULL)         {
 
!             /*
!              * okay to reuse the hash table; needn't rescan inner, either.
!              *
!              * What we do need to do is reset our state about the emptiness
!              * of the outer relation, so that the new scan of the outer will
!              * update it correctly if it turns out to be empty this time.
!              * (There's no harm in clearing it now because ExecHashJoin won't
!              * need the info.  In the other cases, where the hash table
!              * doesn't exist or we are destroying it, we leave this state
!              * alone because ExecHashJoin will need it the first time
!              * through.)
!              */
!             node->hj_OuterNotEmpty = false;         }         else         {
Index: src/include/nodes/execnodes.h
===================================================================
RCS file: /cvsroot/pgsql/src/include/nodes/execnodes.h,v
retrieving revision 1.139.2.2
diff -c -r1.139.2.2 execnodes.h
*** src/include/nodes/execnodes.h    22 Nov 2005 18:23:28 -0000    1.139.2.2
--- src/include/nodes/execnodes.h    28 Nov 2005 23:41:28 -0000
***************
*** 1101,1106 ****
--- 1101,1107 ----  *        hj_FirstOuterTupleSlot    first tuple retrieved from outer plan  *        hj_NeedNewOuter
         true if need new outer tuple on next call  *        hj_MatchedOuter            true if found a join match for
currentouter
 
+  *        hj_OuterNotEmpty        true if outer relation known not empty  * ----------------  */ 
***************
*** 1125,1130 ****
--- 1126,1132 ----     TupleTableSlot *hj_FirstOuterTupleSlot;     bool        hj_NeedNewOuter;     bool
hj_MatchedOuter;
+     bool        hj_OuterNotEmpty; } HashJoinState;

Re: Hashjoin startup strategy (was Re: Getting different number of results when using hashjoin on/off)

From

Tom Lane

Date:

28 November 2005, 20:35:10

Greg Stark <gsstark@mit.edu> writes:
> I suspect this is obvious but since you asked, there isn't any way to keep
> around the hash table and just reuse it repeatedly instead of having to rescan
> the data over and over is there?

We already do that when possible --- which it's not in the particular
case at hand, because there's an outer-query parameter used in the
hashed subplan.

It occurs to me that the planner ought to favor putting parameterized
subplans on the outside of a hash join instead of the inside, so as to
make reuse more likely.  Not sure how to factor that into the cost
model though.
        regards, tom lane

Re: Hashjoin startup strategy (was Re: Getting different number of results when using hashjoin on/off)

From

Greg Stark

Date:

28 November 2005, 20:48:24

Tom Lane <tgl@sss.pgh.pa.us> writes:

> In particular, the executor knows that the outer subplan is parameterless
> and therefore should deliver the same results each time (modulo volatile
> functions of course), so after the first cycle it could know that there's no
> point in trying the early fetch on that side.

> Anyone have any other ideas?

I suspect this is obvious but since you asked, there isn't any way to keep
around the hash table and just reuse it repeatedly instead of having to rescan
the data over and over is there?

-- 
greg