Thread: Incorrect choice of Nested Loop for a skewed distribution

Incorrect choice of Nested Loop for a skewed distribution

From: Oleg Kharin
Date: Tue, 03 Sep 2019 21:47:42 +0400
Hi All,

With PostgreSQL 10 and 11, the planner doesn't use the lists of most
common values (MCVs) to determine the selectivity of "=" for a Nested Loop as
it does for a normal inner join in eqjoinsel_inner(). The incorrect choice of
a nested-loop join strategy causes poor query performance.
To demonstrate it, one can use the following test case:

   create table t(f integer not null, g integer not null);
   create table u(f integer not null, g integer not null);
   create sequence s cache 1000;
   insert into t select 0,s from (select nextval('s') as s) as d;
   insert into t select 0,s from (select nextval('s') as s) as d;
   insert into t select 0,s from (select nextval('s') as s from t,t t1,t t2) as d;
   insert into t select 0,s from (select nextval('s') as s from t,t t1,t t2,t t3) as d;
   insert into t(f,g) select g,f from t;
   insert into u select * from t;
   create index t_f on t(f);
   vacuum analyze;
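
A quick sanity check of the generated data (not part of the original test
case) confirms the shape described below:

   -- expect 10010 zeros, 10010 distinct nonzero values, 20020 rows in total
   select count(*) filter (where f = 0) as zeros,
          count(distinct f) - 1 as distinct_nonzero,
          count(*) as total
   from t;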

The columns f and g of both tables t and u have a skewed distribution: 
10010 values of 0 and 10010 unique values starting from 1.
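
The planner's view of this skew can be inspected in pg_stats (a quick check;
most_common_vals/most_common_freqs are whatever ANALYZE collected, so exact
numbers may vary slightly):

   select tablename, attname, n_distinct, most_common_vals, most_common_freqs
   from pg_stats
   where tablename in ('t', 'u') and attname in ('f', 'g');

For f one would expect a single MCV of 0 with a frequency near 0.5.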
Let's see the query plan for the join of t and u:

   explain analyze select * from t,u where t.f=u.f and t.g=u.g;

                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
  Nested Loop  (cost=0.29..7629.95 rows=25055030 width=16) (actual time=0.042..22540.629 rows=20020 loops=1)
    ->  Seq Scan on u  (cost=0.00..289.20 rows=20020 width=8) (actual time=0.011..3.025 rows=20020 loops=1)
    ->  Index Scan using t_f on t  (cost=0.29..0.36 rows=1 width=8) (actual time=0.565..1.125 rows=1 loops=20020)
          Index Cond: (f = u.f)
          Filter: (u.g = g)
          Rows Removed by Filter: 5004
  Planning Time: 0.394 ms
  Execution Time: 22542.639 ms

After dropping the index
   drop index t_f;
we obtain a much better query plan (without a Nested Loop):

                                                    QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
  Merge Join  (cost=3439.09..442052.26 rows=25055030 width=16) (actual time=15.708..32.735 rows=20020 loops=1)
    Merge Cond: ((t.f = u.f) AND (t.g = u.g))
    ->  Sort  (cost=1719.54..1769.59 rows=20020 width=8) (actual time=8.189..10.189 rows=20020 loops=1)
          Sort Key: t.f, t.g
          Sort Method: quicksort  Memory: 1707kB
          ->  Seq Scan on t  (cost=0.00..289.20 rows=20020 width=8) (actual time=0.012..2.958 rows=20020 loops=1)
    ->  Sort  (cost=1719.54..1769.59 rows=20020 width=8) (actual time=7.510..9.459 rows=20020 loops=1)
          Sort Key: u.f, u.g
          Sort Method: quicksort  Memory: 1707kB
          ->  Seq Scan on u  (cost=0.00..289.20 rows=20020 width=8) (actual time=0.008..2.748 rows=20020 loops=1)
  Planning Time: 0.239 ms
  Execution Time: 34.585 ms

Using MCV lists in var_eq_non_const() would prevent the planner from choosing
a Nested Loop in such cases.

Regards,
Oleg Kharin




Re: Incorrect choice of Nested Loop for a skewed distribution

From: Justin Pryzby
Date: Wed, 04 Sep 2019 17:18:36 -0700 (PDT)
On Tue, Sep 03, 2019 at 09:47:42PM +0400, Oleg Kharin wrote:
> Hi All,
> 
> With PostgreSQL 10 and 11, the planner doesn't use the lists of most common
> values (MCVs) to determine the selectivity of "=" for a Nested Loop as it does
> for a normal inner join in eqjoinsel_inner(). The incorrect choice of a
> nested-loop join strategy causes poor query performance.
> To demonstrate it, one can use the following test case:
...
> The columns f and g of both tables t and u have a skewed distribution: 10010
> values of 0 and 10010 unique values starting from 1.
> Let's see the query plan for the join of t and u:
> 
>   explain analyze select * from t,u where t.f=u.f and t.g=u.g;
>  Nested Loop  (cost=0.29..7629.95 rows=25055030 width=16) (actual time=0.042..22540.629 rows=20020 loops=1)
>    ->  Seq Scan on u  (cost=0.00..289.20 rows=20020 width=8) (actual time=0.011..3.025 rows=20020 loops=1)
>    ->  Index Scan using t_f on t  (cost=0.29..0.36 rows=1 width=8) (actual time=0.565..1.125 rows=1 loops=20020)
>          Index Cond: (f = u.f)
>          Filter: (u.g = g)
>          Rows Removed by Filter: 5004
>  Planning Time: 0.394 ms
>  Execution Time: 22542.639 ms
> 
> After dropping the index
>   drop index t_f;
> we obtain a much better query plan (without a Nested Loop):
>  Merge Join  (cost=3439.09..442052.26 rows=25055030 width=16) (actual time=15.708..32.735 rows=20020 loops=1)
>    Merge Cond: ((t.f = u.f) AND (t.g = u.g))
>    ->  Sort  (cost=1719.54..1769.59 rows=20020 width=8) (actual time=8.189..10.189 rows=20020 loops=1)
>          Sort Key: t.f, t.g
>          Sort Method: quicksort  Memory: 1707kB
>          ->  Seq Scan on t  (cost=0.00..289.20 rows=20020 width=8) (actual time=0.012..2.958 rows=20020 loops=1)
>    ->  Sort  (cost=1719.54..1769.59 rows=20020 width=8) (actual time=7.510..9.459 rows=20020 loops=1)
>          Sort Key: u.f, u.g
>          Sort Method: quicksort  Memory: 1707kB
>          ->  Seq Scan on u  (cost=0.00..289.20 rows=20020 width=8) (actual time=0.008..2.748 rows=20020 loops=1)
>  Planning Time: 0.239 ms
>  Execution Time: 34.585 ms

>  Nested Loop  (cost=0.29..     7629.95 rows=25055030 width=16) (actual...
>  Merge Join   (cost=3439.09..442052.26 rows=25055030 width=16) (actual...

When you dropped the index, you necessarily excluded the plan involving an
index scan.  So it did a merge join instead (which it thinks of as expensive
because it has to sort both sides).

As you saw, the rowcount estimate of the join result is badly off.  But that's
true of both plans.

Choice of join type is affected by the size of its *inputs*, but doesn't and
shouldn't have any effect on the estimate of the result rowcount.  The
rowcount *output* of the join would only affect any *following* plan nodes (of
which there are none in this case).

You suggested using the MCV list, but I don't think that's possible, since the
nested loop is evaluating its "inner" table multiple times, in a "loop":

>    ->  Index Scan using t_f on t  (cost=0.29..0.36 rows=1 width=8) (actual time=0.565..1.125 rows=1 loops=20020)

Hypothetically, one could plan the query 20k times, once for each value of u.f
and u.g, but that's tantamount to actually executing the query, which is why
(I gather) it uses var_eq_non_const.
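
For comparison, when the key is compared to a constant the planner does consult
the MCV list (the var_eq_const path), so the estimate for a query like the
following should come out near 10010 rather than 2 (a sketch to show the
contrast, not output from this thread):

   explain select * from t where f = 0;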

If you enable BUFFERS, you can see:

postgres=# explain (analyze on,buffers) select * from t,u WHERE t.g=u.g AND t.f=u.f ;
 Nested Loop  (cost=0.29..7629.95 rows=25055030 width=16) (actual time=0.031..22634.482 rows=20020 loops=1)
   Buffers: shared hit=770913
   ->  Seq Scan on t  (cost=0.00..289.20 rows=20020 width=8) (actual time=0.011..22.883 rows=20020 loops=1)
         Buffers: shared hit=89
   ->  Index Scan using u_f_idx on u  (cost=0.29..0.36 rows=1 width=8) (actual time=0.596..1.125 rows=1 loops=20020)
         ...
         Buffers: shared hit=770824

vs.

postgres=# SET enable_nestloop=off;
postgres=# explain (analyze on,buffers) select * from t,u WHERE t.g=u.g AND t.f=u.f ;
 Merge Join  (cost=3439.09..442052.26 rows=25055030 width=16) (actual time=74.262..187.454 rows=20020 loops=1)
   Merge Cond: ((t.g = u.g) AND (t.f = u.f))
   Buffers: shared hit=178
   ...

So the nested-loop plan is hitting 770k buffer pages (roughly 6GB of page
accesses, counting repeat visits) while looping 20k times, rather than 178
buffers when each page is read once (to sort).

Perhaps you could get a good plan by playing with these (a sketch follows the
list), but it's unclear why that should be necessary.

 effective_cache_size         | 1280
 cpu_index_tuple_cost         | 0.005
 cpu_operator_cost            | 0.0025
 cpu_tuple_cost               | 0.01
 random_page_cost             | 4
 seq_page_cost                | 1
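
For example (illustrative values only, not recommendations; the point is to
make the repeated random index probes look as expensive as they really are
here):

   -- make random page access costlier relative to sequential access
   set random_page_cost = 8;
   -- or simply rule out nested loops for this session while testing
   set enable_nestloop = off;
   explain analyze select * from t,u where t.g=u.g and t.f=u.f;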

Also, since PG96, FKs can improve join estimates:

postgres=# CREATE UNIQUE INDEX ON u(f,g);
postgres=# CREATE UNIQUE INDEX ON t(f,g);
postgres=# ALTER TABLE t ADD CONSTRAINT fk_f FOREIGN KEY (f,g) REFERENCES u(f,g);
postgres=# ALTER TABLE u ADD CONSTRAINT fk_u FOREIGN KEY (f,g) REFERENCES t(f,g);
postgres=# explain analyze select * from t,u WHERE t.g=u.g AND t.f=u.f ;
 Hash Join  (cost=589.50..999.14 rows=20020 width=16) (actual time=29.054..69.296 rows=20020 loops=1)
   Hash Cond: ((t.g = u.g) AND (t.f = u.f))
   ->  Seq Scan on t  (cost=0.00..289.20 rows=20020 width=8) (actual time=0.016..11.331 rows=20020 loops=1)
   ->  Hash  (cost=289.20..289.20 rows=20020 width=8) (actual time=28.980..28.981 rows=20020 loops=1)
         Buckets: 32768  Batches: 1  Memory Usage: 1039kB
         ->  Seq Scan on u  (cost=0.00..289.20 rows=20020 width=8) (actual time=0.010..12.730 rows=20020 loops=1)

Justin



Re: Incorrect choice of Nested Loop for a skewed distribution

From: Oleg Kharin
Date:
Thank you Justin!

On Wed, 04 Sep 2019 17:18:36 -0700 (PDT), Justin Pryzby wrote:

> When you dropped the index, you necessarily excluded the plan involving an
> index scan.  So it did a merge join instead (which it thinks of as expensive
> because it has to sort both sides).
>
> As you saw, the rowcount estimate of the join result is badly off.  But that's
> true of both plans.
>
> Choice of join type is affected by the size of its *inputs*, but doesn't and
> shouldn't have any effect on the estimate of the result rowcount.  The
> rowcount *output* of the join would only affect any *following* plan nodes (of
> which there are none in this case).
>
> You suggested using the MCV list, but I don't think that's possible, since the
> nested loop is evaluating its "inner" table multiple times, in a "loop":
>
>>     ->  Index Scan using t_f on t  (cost=0.29..0.36 rows=1 width=8) (actual time=0.565..1.125 rows=1 loops=20020)
> Hypothetically, one could plan the query 20k times, once for each value of u.f
> and u.g, but that's tantamount to actually executing the query, which is why
> (I gather) it uses var_eq_non_const.

In fact the planner has information about the outer relation when it is
estimating the number of inner rows for a nested-loop join. It ignores this
information and treats the outer and inner relations as independent. So for
the rowcount estimate it uses only the number of distinct values, not the MCV
lists of both tables.

For the test case above, the planner estimates the selectivity of the clause
"t.f=u.f" as 1/10011. It expects to scan 1/10011*20020=2 rows in an inner loop
for each row of the outer loop. In fact it scans 10010 rows 10010 times and
1 row 10010 times. The average number of rows scanned in an inner loop is
(10010*10010+10010)/20020≈5005, which is badly off from the 2 rows expected.
The planner should use 5005/20020 as a more accurate value for the effective
selectivity of "t.f=u.f".
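
That average can be verified directly against the data (a check query, not
part of the patch):

   -- for each row of u, count the matching rows of t on the join key f
   select avg(cnt)
   from (select (select count(*) from t where t.f = u.f) as cnt
         from u) as s;
   -- expected result: 5005.5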

I tried to make improvements to the function var_eq_non_const() so that it
calculates the effective selectivity using MCV lists when possible. The patch
for PostgreSQL 11.5 is attached to this message.

The patched PostgreSQL chooses an optimal plan for the test case. As a result
the query execution time is reduced by a factor of more than 600.

Now if we force the planner to choose the nested-loop join

   set enable_mergejoin=off;

   set enable_hashjoin=off;

   explain analyze select * from t,u where t.f=u.f and t.g=u.g;

  Nested Loop  (cost=0.29..2261681.75 rows=25055030 width=16) (actual time=0.048..22519.232 rows=20020 loops=1)
    ->  Seq Scan on u  (cost=0.00..289.20 rows=20020 width=8) (actual time=0.012..2.970 rows=20020 loops=1)
    ->  Index Scan using t_f on t  (cost=0.29..100.44 rows=1252 width=8) (actual time=0.565..1.124 rows=1 loops=20020)
          Index Cond: (f = u.f)
          Filter: (u.g = g)
          Rows Removed by Filter: 5004
  Planning Time: 0.188 ms
  Execution Time: 22521.248 ms

we will see that the cost of the index scan has increased from 0.36 to 100.44,
which is much more realistic and reflects the 2-to-5005 row ratio.

Regards,

Oleg

