Re: Improve hash join's handling of tuples with null join keys - Mailing list pgsql-hackers
| From | Joel Jacobson |
|---|---|
| Subject | Re: Improve hash join's handling of tuples with null join keys |
| Date | |
| Msg-id | 3ccec807-d2ff-4150-9585-74cc52926646@app.fastmail.com Whole thread Raw |
| In response to | Re: Improve hash join's handling of tuples with null join keys (Tom Lane <tgl@sss.pgh.pa.us>) |
| List | pgsql-hackers |
On Tue, Mar 3, 2026, at 21:58, Tom Lane wrote:
> I wrote:
>> Bug #19030 [1] seems to be a fresh report of the problem this patch
>> aims to solve. While answering that, I realized that the v2 patch
>> causes null-keyed inner rows to not be included in EXPLAIN ANALYZE's
>> report of the number of rows output by the Hash node. Now on the
>> one hand, what it's reporting is an accurate reflection of the
>> number of rows in the hash table, which perhaps is useful. On the
>> other hand, it's almost surely going to confuse users, and it's
>> different from the number we produced before. Should we try to
>> preserve the old behavior here? (I've not looked at what code
>> changes would be needed for that.)
>
> I got around to looking at that finally. It's not terribly difficult
> to fix, but while figuring out which counters were used for what,
> I noticed a pre-existing bug: when ExecHashRemoveNextSkewBucket moves
> tuples into the main hash table from the skew hash table, it fails to
> adjust hashtable->skewTuples, meaning that subsequent executions of
> ExecHashTableInsert will have the wrong idea of how many tuples are in
> the main table. The error is probably not very large because the
> skew table is not supposed to be big relative to the main table,
> but still, it's wrong. So I tried to clean that up here.
>
> 0001 attached is the same patch as before (brought up to HEAD, but
> only line numbers change). 0002 is the new code to fix these
> tuple-counting issues.
>
> regards, tom lane
I've tested v3-0001 and v3-0002 and can confirm the bug introduced
in v3-0001 is fixed in v3-0002:
% cat explain-analyze-problem.sql
CREATE TABLE ea_hash (id int);
INSERT INTO ea_hash SELECT g FROM generate_series(1, 10) g;
INSERT INTO ea_hash SELECT NULL FROM generate_series(1, 90);
ANALYZE ea_hash;
CREATE TABLE ea_probe (id int);
INSERT INTO ea_probe SELECT (g % 10) + 1 FROM generate_series(1, 10000) g;
ANALYZE ea_probe;
SET enable_nestloop = off;
SET enable_mergejoin = off;
EXPLAIN (COSTS OFF, ANALYZE, TIMING OFF, BUFFERS OFF, SUMMARY OFF)
SELECT count(*) FROM ea_probe FULL OUTER JOIN ea_hash ON ea_probe.id = ea_hash.id;
EXPLAIN (COSTS OFF, ANALYZE, TIMING OFF, BUFFERS OFF, SUMMARY OFF)
SELECT count(*) FROM ea_probe RIGHT OUTER JOIN ea_hash ON ea_probe.id = ea_hash.id;
% git diff --no-index master.out v3-0001.out
diff --git a/master.out b/v3-0001.out
index 1e05e7e39a6..54210c49757 100644
--- a/master.out
+++ b/v3-0001.out
@@ -17,8 +17,8 @@ SET
-> Hash Full Join (actual rows=10090.00 loops=1)
Hash Cond: (ea_probe.id = ea_hash.id)
-> Seq Scan on ea_probe (actual rows=10000.00 loops=1)
- -> Hash (actual rows=100.00 loops=1)
- Buckets: 1024 Batches: 1 Memory Usage: 12kB
+ -> Hash (actual rows=10.00 loops=1)
+ Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on ea_hash (actual rows=100.00 loops=1)
(7 rows)
@@ -29,8 +29,8 @@ SET
-> Hash Right Join (actual rows=10090.00 loops=1)
Hash Cond: (ea_probe.id = ea_hash.id)
-> Seq Scan on ea_probe (actual rows=10000.00 loops=1)
- -> Hash (actual rows=100.00 loops=1)
- Buckets: 1024 Batches: 1 Memory Usage: 12kB
+ -> Hash (actual rows=10.00 loops=1)
+ Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on ea_hash (actual rows=100.00 loops=1)
(7 rows)
% git diff --no-index v3-0001.out v3-0002.out
diff --git a/v3-0001.out b/v3-0002.out
index 54210c49757..17dfe335b0b 100644
--- a/v3-0001.out
+++ b/v3-0002.out
@@ -17,7 +17,7 @@ SET
-> Hash Full Join (actual rows=10090.00 loops=1)
Hash Cond: (ea_probe.id = ea_hash.id)
-> Seq Scan on ea_probe (actual rows=10000.00 loops=1)
- -> Hash (actual rows=10.00 loops=1)
+ -> Hash (actual rows=100.00 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on ea_hash (actual rows=100.00 loops=1)
(7 rows)
@@ -29,7 +29,7 @@ SET
-> Hash Right Join (actual rows=10090.00 loops=1)
Hash Cond: (ea_probe.id = ea_hash.id)
-> Seq Scan on ea_probe (actual rows=10000.00 loops=1)
- -> Hash (actual rows=10.00 loops=1)
+ -> Hash (actual rows=100.00 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on ea_hash (actual rows=100.00 loops=1)
(7 rows)
% git diff --no-index master.out v3-0002.out
diff --git a/master.out b/v3-0002.out
index 1e05e7e39a6..17dfe335b0b 100644
--- a/master.out
+++ b/v3-0002.out
@@ -18,7 +18,7 @@ SET
Hash Cond: (ea_probe.id = ea_hash.id)
-> Seq Scan on ea_probe (actual rows=10000.00 loops=1)
-> Hash (actual rows=100.00 loops=1)
- Buckets: 1024 Batches: 1 Memory Usage: 12kB
+ Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on ea_hash (actual rows=100.00 loops=1)
(7 rows)
@@ -30,7 +30,7 @@ SET
Hash Cond: (ea_probe.id = ea_hash.id)
-> Seq Scan on ea_probe (actual rows=10000.00 loops=1)
-> Hash (actual rows=100.00 loops=1)
- Buckets: 1024 Batches: 1 Memory Usage: 12kB
+ Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on ea_hash (actual rows=100.00 loops=1)
(7 rows)
/Joel
pgsql-hackers by date: