Re: Non-trivial condition is only propagated to one side of JOIN - Mailing list pgsql-hackers

From Wetmore, Matthew (CTR)
Subject Re: Non-trivial condition is only propagated to one side of JOIN
Date
Msg-id 8A2B6D94-75EA-4CC6-AFF5-C8527C16B716@glbcore.com
Whole thread Raw
In response to Non-trivial condition is only propagated to one side of JOIN  (Tobias Hoffmann <ldev-list@thax.hardliners.org>)
List pgsql-hackers

You must use a where clause on the FDW table or you get a full load/SEQ scan of that table, per documentation.

Select * is not recommended for FDW tables.

 

From: Tobias Hoffmann <ldev-list@thax.hardliners.org>
Date: Sunday, August 25, 2024 at 8:10 AM
To: "pgsql-hackers@postgresql.org" <pgsql-hackers@postgresql.org>
Subject: Non-trivial condition is only propagated to one side of JOIN

 

Hi, using `PostgreSQL 16.2 (Debian 16.2-1.pgdg120+2) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit`, I've observed the following behavior: – keep in mind that this example is as simplified as possible, the original

Hi,
 
using `PostgreSQL 16.2 (Debian 16.2-1.pgdg120+2) on x86_64-pc-linux-gnu, 
compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit`, I've observed the 
following behavior:
 
– keep in mind that this example is as simplified as possible, the 
original query involves foreign tables, and the failure to propagate / 
push down the condition results in a query plan that basically tries to 
download the complete foreign table, which is not a feasible execution 
strategy:
 
Setup:
 
CREATE TABLE tbl1 (id INTEGER GENERATED ALWAYS AS IDENTITY, site_id 
INTEGER NOT NULL, data TEXT);
CREATE TABLE tbl2 (id INTEGER GENERATED ALWAYS AS IDENTITY, site_id 
INTEGER NOT NULL, data TEXT);
CREATE INDEX ON tbl1 (site_id);
CREATE INDEX ON tbl2 (site_id);
 
Working queries:
 
SELECT * FROM tbl1 WHERE tbl1.site_id = 1;  -- "trivial condition"
SELECT * FROM tbl2 WHERE tbl2.site_id = 1;
SELECT * FROM tbl1 WHERE tbl1.site_id = 1 OR tbl1.site_id IS NULL;  -- 
"non-trivial condition"
SELECT * FROM tbl2 WHERE tbl2.site_id = 1 OR tbl2.site_id IS NULL;
 
1) Exemplary Query Plan:
 
# EXPLAIN SELECT * FROM tbl2 WHERE tbl2.site_id = 1 OR tbl2.site_id IS NULL;
 
                                      QUERY PLAN
-------------------------------------------------------------------------------------
  Bitmap Heap Scan on tbl2  (cost=8.40..19.08 rows=12 width=40)
    Recheck Cond: ((site_id = 1) OR (site_id IS NULL))
    ->  BitmapOr  (cost=8.40..8.40 rows=12 width=0)
          ->  Bitmap Index Scan on tbl2_site_id_idx (cost=0.00..4.20 
rows=6 width=0)
                Index Cond: (site_id = 1)
          ->  Bitmap Index Scan on tbl2_site_id_idx (cost=0.00..4.20 
rows=6 width=0)
                Index Cond: (site_id IS NULL)
(7 rows)
 
The key takeaway is, that the index can be used, because the condition 
is propagated deep enough.
 
2) Still working example:
 
# EXPLAIN SELECT * FROM tbl1 LEFT JOIN tbl2 ON tbl2.site_id = 
tbl1.site_id WHERE tbl1.site_id = 1;
 
                                         QUERY PLAN
-------------------------------------------------------------------------------------------
  Nested Loop Left Join  (cost=8.40..27.80 rows=36 width=80)
    ->  Bitmap Heap Scan on tbl1  (cost=4.20..13.67 rows=6 width=40)
          Recheck Cond: (site_id = 1)
          ->  Bitmap Index Scan on tbl1_site_id_idx (cost=0.00..4.20 
rows=6 width=0)
                Index Cond: (site_id = 1)
    ->  Materialize  (cost=4.20..13.70 rows=6 width=40)
          ->  Bitmap Heap Scan on tbl2  (cost=4.20..13.67 rows=6 width=40)
                Recheck Cond: (site_id = 1)
                ->  Bitmap Index Scan on tbl2_site_id_idx 
(cost=0.00..4.20 rows=6 width=0)
                      Index Cond: (site_id = 1)
(10 rows)
 
The condition is propagated into BOTH branches of the join. The join 
could also be an INNER join and might also be realized as a Merge Join 
or Hash Join: they all behave the same.
 
3) Problematic example:
 
# EXPLAIN SELECT * FROM tbl1 JOIN tbl2 ON tbl2.site_id = tbl1.site_id 
WHERE tbl1.site_id = 1 OR tbl1.site_id IS NULL;
 
                                            QUERY PLAN
-------------------------------------------------------------------------------------------------
  Hash Join  (cost=19.23..46.45 rows=72 width=80)
    Hash Cond: (tbl2.site_id = tbl1.site_id)
    ->  Seq Scan on tbl2  (cost=0.00..22.00 rows=1200 width=40)
    ->  Hash  (cost=19.08..19.08 rows=12 width=40)
          ->  Bitmap Heap Scan on tbl1  (cost=8.40..19.08 rows=12 width=40)
                Recheck Cond: ((site_id = 1) OR (site_id IS NULL))
                ->  BitmapOr  (cost=8.40..8.40 rows=12 width=0)
                      ->  Bitmap Index Scan on tbl1_site_id_idx 
(cost=0.00..4.20 rows=6 width=0)
                            Index Cond: (site_id = 1)
                      ->  Bitmap Index Scan on tbl1_site_id_idx 
(cost=0.00..4.20 rows=6 width=0)
                            Index Cond: (site_id IS NULL)
(11 rows)
 
Now, a full seq scan used for tbl2, the condition is only pushed down on 
ONE side of the JOIN!
(with `WHERE tbl2.site_id = 1 OR tbl2.site_id IS NULL`, the Seq Scan 
would have been on tbl1... [not so easily demostrated w/ LEFT JOINs]).
Also, `ON tbl1.site_id IS NOT DISTINCT FROM tbl2.site_id` does not help,
 
The weird thing is: The subqueries on both sides of the join are 
perfectly capable of accepting/using the "non-trivial" condition, as 
demonstrated in 1), and JOINs are generally able to propagate conditions 
to both sides, as demonstrated in 2).
 
Is there a magic knob to force postgres to do the right thing, or is 
this basically a bug in the query planner?
 
   Tobias
 
 
 
 
 
 

pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: Redundant Result node
Next
From: "Zhijie Hou (Fujitsu)"
Date:
Subject: RE: Doc: fix the note related to the GUC "synchronized_standby_slots"