From: Tomas Vondra
Subject: Re: Condition pushdown: why (=) is pushed down into join, but BETWEEN or >= is not?
Msg-id: 1642f540-badc-ef36-5d24-465f3ccb3343@enterprisedb.com
In response to: Re: Condition pushdown: why (=) is pushed down into join, but BETWEEN or >= is not? (Andy Fan <zhihui.fan1213@gmail.com>)
List: pgsql-hackers

Hi,

there's been an interesting case [1] of a slow query on pgsql-general, 
related to the topic discussed in this thread. It causes the query to 
run slower by multiple orders of magnitude, and I think it's 
interesting, so let me present a simple example demonstrating it.

------------------------------------------------------------------------
create table t1 (a int);
create table t2 (a int);

insert into t1
select i from generate_series(1,100000) s(i);

insert into t2
select mod(i,100000) from generate_series(1,10000000) s(i);

create index on t1(a);
create index on t2(a);

vacuum analyze t1, t2;

-- we need to force mergejoin
set enable_nestloop = off;
------------------------------------------------------------------------

Now, let's run a simple query (the plans below come from EXPLAIN ANALYZE):

SELECT t1.a, t2.a FROM t1 JOIN t2 USING (a)
  WHERE (t1.a > 99000) ORDER BY t1.a LIMIT 100;

                               QUERY PLAN
------------------------------------------------------------------------
  Limit  (cost=4.63..224.57 rows=100 width=8)
         (actual time=8999.487..8999.707 rows=100 loops=1)
    ->  Merge Join  (cost=4.63..209447.97 rows=95226 width=8)
                    (actual time=8999.485..8999.620 rows=100 loops=1)
          Merge Cond: (t1.a = t2.a)
          ->  Index Only Scan using t1_a_idx on t1
              (cost=0.29..29.25 rows=969 width=4)
              (actual time=0.010..0.011 rows=1 loops=1)
                Index Cond: (a > 99000)
                Heap Fetches: 0
          ->  Index Only Scan using t2_a_idx on t2
              (cost=0.43..183464.09 rows=9999977 width=4)
              (actual time=0.026..4594.757 rows=9900200 loops=1)
                Heap Fetches: 0
  Planning Time: 0.338 ms
  Execution Time: 8999.768 ms
(10 rows)


Now, let's do a simple trick and add a condition on t2.a, "implied" by 
the join condition (t1.a = t2.a) and the inequality (t1.a > 99000).


SELECT t1.a, t2.a FROM t1 JOIN t2 USING (a)
  WHERE (t1.a > 99000) AND (t2.a > 99000) ORDER BY t1.a LIMIT 100;

                               QUERY PLAN
------------------------------------------------------------------------
  Limit  (cost=0.77..250.39 rows=100 width=8)
         (actual time=0.040..0.294 rows=100 loops=1)
    ->  Merge Join  (cost=0.77..2297.23 rows=920 width=8)
                    (actual time=0.039..0.172 rows=100 loops=1)
          Merge Cond: (t1.a = t2.a)
          ->  Index Only Scan using t1_a_idx on t1
              (cost=0.29..29.25 rows=969 width=4)
              (actual time=0.031..0.031 rows=1 loops=1)
                Index Cond: (a > 99000)
                Heap Fetches: 0
          ->  Index Only Scan using t2_a_idx on t2
              (cost=0.43..2014.87 rows=96596 width=4)
              (actual time=0.005..0.052 rows=100 loops=1)
                Index Cond: (a > 99000)
                Heap Fetches: 0
  Planning Time: 0.222 ms
  Execution Time: 0.414 ms
(11 rows)

Well, that's quite a difference! From ~9000ms to under 1ms, pretty good.

What is happening in the first plan is that the merge join needs t2 
sorted by t2.a, and the index-only scan looks like a great way to do 
that, as it has a low startup cost (which the LIMIT likes). But this 
completely misses that (t1.a > 99000) implies (t2.a > 99000) through 
the equality in the join condition. So we start scanning t2_a_idx from 
the very beginning, only to throw the first 99% of tuples away.
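
For comparison, with a plain equality the planner does derive the 
matching condition on the other side by itself, through the equivalence 
classes - that's what the subject of this thread is about. As a quick 
sketch (I'd expect both index scans to get the condition):

EXPLAIN SELECT t1.a, t2.a FROM t1 JOIN t2 USING (a)
  WHERE (t1.a = 99001);

-- I'd expect "Index Cond: (a = 99001)" on both t1_a_idx and t2_a_idx,
-- because equality clauses get propagated through equivalence classes,
-- while inequalities like (a > 99000) do not.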

In the original report this is particularly egregious, because the 
index-only scan looks like this:

   -> Index Only Scan using data_class_pkey on data_class ta
      (cost=0.57..4935483.78 rows=216964862 width=8)
      (actual time=0.018..35022.908 rows=151321889 loops=1)
        Heap Fetches: 151321889

So yeah, 151M heap fetches, that's bound to be expensive :-/
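
The heap fetches happen because the visibility map is not set for most 
of the table (my reproducer gets "Heap Fetches: 0" only thanks to the 
vacuum analyze right after the load). Assuming the pg_visibility 
contrib module is available, checking the map coverage is easy:

CREATE EXTENSION IF NOT EXISTS pg_visibility;
SELECT * FROM pg_visibility_map_summary('data_class');

-- all_visible close to the table's page count means the index-only
-- scan can skip most heap fetches; a low value would explain the
-- 151M fetches above.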

Adding the condition on t2.a allows just skipping the first chunk of the 
index, eliminating the expensive part.

Of course, this breaks the estimates in the faster query, because we now 
apply the condition twice - once in the index scan, once as the join 
clause. So instead of ~100k rows the join is now estimated at ~1000 rows.
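
Roughly speaking (ignoring the MCV lists), the join size estimate comes 
down to the usual eqjoinsel formula, and the trick applies the same ~1% 
selectivity on both inputs:

   join_rows ~= outer_rows * inner_rows / ndistinct

   condition once:   969 * 9999977 / 100000 ~= 96900   (plan says 95226)
   condition twice:  969 *   96596 / 100000 ~=   936   (plan says 920)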

I'm also not claiming this is 100% worth it - queries with a suitable 
combination of clauses (conditions on the join keys) seem rather 
uncommon. But it seems like an interesting example, because it may be 
seen either as a missed execution optimization (failing to skip the 
initial chunk of rows), or as a costing issue due to not accounting for 
having to process those rows (which would likely result in picking a 
different plan).



regards

[1] 
https://www.postgresql.org/message-id/CA%2B1Wm9U_sP9237f7OH7O%3D-UTab71DWOO4Qc-vnC78DfsJQBCwQ%40mail.gmail.com

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


