Thread: Incorrect choice of Nested Loop for a skewed distribution
Hi All,

With PostgreSQL 10 and 11, the planner doesn't use the lists of most common
values to determine the selectivity of "=" for Nested Loop as it does for a
normal inner join in eqjoinsel_inner(). The incorrect choice of a nested-loop
join strategy causes poor query performance. To demonstrate it, one can use
the following test case:

create table t(f integer not null, g integer not null);
create table u(f integer not null, g integer not null);
create sequence s cache 1000;
insert into t select 0,s from (select nextval('s') as s) as d;
insert into t select 0,s from (select nextval('s') as s) as d;
insert into t select 0,s from (select nextval('s') as s from t,t t1,t t2) as d;
insert into t select 0,s from (select nextval('s') as s from t,t t1,t t2,t t3) as d;
insert into t(f,g) select g,f from t;
insert into u select * from t;
create index t_f on t(f);
vacuum analyze;

The columns f and g of both tables t and u have a skewed distribution:
10010 values of 0 and 10010 unique values starting from 1.
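The claimed skew can be cross-checked with a short Python sketch (an illustration added for clarity, not part of the original test case):

```python
# Model the final contents of table t built by the INSERTs above and
# check the claimed distribution of column f.
from collections import Counter

# The four sequence-driven INSERTs leave 10010 rows (0, g) with
# g = 1..10010; "insert into t(f,g) select g,f from t" then mirrors
# them into 10010 rows (g, 0).
rows = [(0, g) for g in range(1, 10011)]
rows += [(g, f) for (f, g) in rows]

f_counts = Counter(f for f, _ in rows)
assert len(rows) == 20020            # total rows in t (and in u)
assert f_counts[0] == 10010          # 10010 zeros in column f
assert len(f_counts) - 1 == 10010    # plus 10010 distinct nonzero values
```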
Let's see the query plan for the join of t and u:

explain analyze select * from t,u where t.f=u.f and t.g=u.g;

 Nested Loop  (cost=0.29..7629.95 rows=25055030 width=16) (actual time=0.042..22540.629 rows=20020 loops=1)
   ->  Seq Scan on u  (cost=0.00..289.20 rows=20020 width=8) (actual time=0.011..3.025 rows=20020 loops=1)
   ->  Index Scan using t_f on t  (cost=0.29..0.36 rows=1 width=8) (actual time=0.565..1.125 rows=1 loops=20020)
         Index Cond: (f = u.f)
         Filter: (u.g = g)
         Rows Removed by Filter: 5004
 Planning Time: 0.394 ms
 Execution Time: 22542.639 ms

After dropping the index

drop index t_f;

we obtain a much better query plan (without Nested Loop):

 Merge Join  (cost=3439.09..442052.26 rows=25055030 width=16) (actual time=15.708..32.735 rows=20020 loops=1)
   Merge Cond: ((t.f = u.f) AND (t.g = u.g))
   ->  Sort  (cost=1719.54..1769.59 rows=20020 width=8) (actual time=8.189..10.189 rows=20020 loops=1)
         Sort Key: t.f, t.g
         Sort Method: quicksort  Memory: 1707kB
         ->  Seq Scan on t  (cost=0.00..289.20 rows=20020 width=8) (actual time=0.012..2.958 rows=20020 loops=1)
   ->  Sort  (cost=1719.54..1769.59 rows=20020 width=8) (actual time=7.510..9.459 rows=20020 loops=1)
         Sort Key: u.f, u.g
         Sort Method: quicksort  Memory: 1707kB
         ->  Seq Scan on u  (cost=0.00..289.20 rows=20020 width=8) (actual time=0.008..2.748 rows=20020 loops=1)
 Planning Time: 0.239 ms
 Execution Time: 34.585 ms

Using MCV lists in var_eq_non_const() would prevent the planner from choosing
a Nested Loop in such cases.

Regards,
Oleg Kharin
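As context, the rows=25055030 join estimate in both plans can be roughly reproduced with eqjoinsel_inner()-style MCV arithmetic. This is a back-of-envelope sketch, not PostgreSQL's actual code: it assumes each column's MCV list is just {0: 0.5}, with the remaining half of the frequency spread evenly over the 10010 distinct values:

```python
# Rough sketch of eqjoinsel_inner()-style join selectivity (an
# approximation, not the real implementation).
n_t = n_u = 20020
nd = 10011                    # distinct values per column: 0 plus 1..10010

match_mcv = 0.5 * 0.5         # MCV entries matching across the join
other = 0.5                   # non-MCV frequency on each side
# Non-MCV portions are matched using the remaining distinct-value count:
sel_f = match_mcv + other * other / (nd - 1)
sel_g = sel_f                 # g has the same distribution as f

est_rows = n_t * n_u * sel_f * sel_g
print(round(est_rows))        # close to the planner's rows=25055030
```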
On Tue, Sep 03, 2019 at 09:47:42PM +0400, Oleg Kharin wrote:
> Hi All,
>
> With PostgreSQL 10 and 11, the planner doesn't use the lists of most common
> values to determine the selectivity of "=" for Nested Loop as it does for a
> normal inner join in eqjoinsel_inner(). Incorrect choice of a nested loops
> join strategy causes poor query performance.
> To demonstrate it one can make the following test case:
...
> The columns f and g of both tables t and u have a skewed distribution: 10010
> values of 0 and 10010 unique values starting from 1.
> Let's see query plan for the join of t and u:
>
> explain analyze select * from t,u where t.f=u.f and t.g=u.g;
> Nested Loop  (cost=0.29..7629.95 rows=25055030 width=16) (actual time=0.042..22540.629 rows=20020 loops=1)
>   ->  Seq Scan on u  (cost=0.00..289.20 rows=20020 width=8) (actual time=0.011..3.025 rows=20020 loops=1)
>   ->  Index Scan using t_f on t  (cost=0.29..0.36 rows=1 width=8) (actual time=0.565..1.125 rows=1 loops=20020)
>         Index Cond: (f = u.f)
>         Filter: (u.g = g)
>         Rows Removed by Filter: 5004
> Planning Time: 0.394 ms
> Execution Time: 22542.639 ms
>
> After dropping the index
> drop index t_f;
> we obtain much better query plan (without Nested Loop):
> Merge Join  (cost=3439.09..442052.26 rows=25055030 width=16) (actual time=15.708..32.735 rows=20020 loops=1)
>   Merge Cond: ((t.f = u.f) AND (t.g = u.g))
>   ->  Sort  (cost=1719.54..1769.59 rows=20020 width=8) (actual time=8.189..10.189 rows=20020 loops=1)
>         Sort Key: t.f, t.g
>         Sort Method: quicksort  Memory: 1707kB
>         ->  Seq Scan on t  (cost=0.00..289.20 rows=20020 width=8) (actual time=0.012..2.958 rows=20020 loops=1)
>   ->  Sort  (cost=1719.54..1769.59 rows=20020 width=8) (actual time=7.510..9.459 rows=20020 loops=1)
>         Sort Key: u.f, u.g
>         Sort Method: quicksort  Memory: 1707kB
>         ->  Seq Scan on u  (cost=0.00..289.20 rows=20020 width=8) (actual time=0.008..2.748 rows=20020 loops=1)
> Planning Time: 0.239 ms
> Execution Time: 34.585 ms

> Nested Loop  (cost=0.29..7629.95 rows=25055030 width=16) (actual...
> Merge Join  (cost=3439.09..442052.26 rows=25055030 width=16) (actual...

When you dropped the index, you necessarily refused the plan involving an
index scan. So it did a merge join instead (which it thinks of as expensive
because it has to sort both sides).

As you saw, the rowcount estimate of the join result is badly off. But that's
true of both plans.

Choice of join type is affected by the size of its *inputs*, but doesn't and
shouldn't have any effect on the result rowcount (or its estimate). The
rowcount *output* of the join would only affect any *following* plan nodes
(of which there are none in this case).

You suggested using the MCV list, but I don't think that's possible, since
the nested loop is evaluating its "inner" table multiple times, in a "loop":

> ->  Index Scan using t_f on t  (cost=0.29..0.36 rows=1 width=8) (actual time=0.565..1.125 rows=1 loops=20020)

Hypothetically, one could plan the query 20k times, once for each value of
u.f and u.g, but that's tantamount to actually executing the query, which is
why it uses (I gather) var_eq_non_const.

If you enable BUFFERS, you can see:

postgres=# explain (analyze on,buffers) select * from t,u WHERE t.g=u.g AND t.f=u.f;
 Nested Loop  (cost=0.29..7629.95 rows=25055030 width=16) (actual time=0.031..22634.482 rows=20020 loops=1)
   Buffers: shared hit=770913
   ->  Seq Scan on t  (cost=0.00..289.20 rows=20020 width=8) (actual time=0.011..22.883 rows=20020 loops=1)
         Buffers: shared hit=89
   ->  Index Scan using u_f_idx on u  (cost=0.29..0.36 rows=1 width=8) (actual time=0.596..1.125 rows=1 loops=20020)
...
         Buffers: shared hit=770824

vs.

postgres=# SET enable_nestloop=off;
postgres=# explain (analyze on,buffers) select * from t,u WHERE t.g=u.g AND t.f=u.f;
 Merge Join  (cost=3439.09..442052.26 rows=25055030 width=16) (actual time=74.262..187.454 rows=20020 loops=1)
   Merge Cond: ((t.g = u.g) AND (t.f = u.f))
   Buffers: shared hit=178
...
So the nested loop plan is hitting 770k buffer pages (5GB) while looping 20k
times, rather than 178 buffers when each page is read once (to sort).

Perhaps you could get a good plan by playing with these, but it's unclear why
that's necessary.

 effective_cache_size | 1280
 cpu_index_tuple_cost | 0.005
 cpu_operator_cost    | 0.0025
 cpu_tuple_cost       | 0.01
 random_page_cost     | 4
 seq_page_cost        | 1

Also, since PG 9.6, FKs can improve join estimates:

postgres=# CREATE UNIQUE INDEX ON u(f,g);
postgres=# CREATE UNIQUE INDEX ON t(f,g);
postgres=# ALTER TABLE t ADD CONSTRAINT fk_f FOREIGN KEY (f,g) REFERENCES u(f,g);
postgres=# ALTER TABLE u ADD CONSTRAINT fk_u FOREIGN KEY (f,g) REFERENCES t(f,g);
postgres=# explain analyze select * from t,u WHERE t.g=u.g AND t.f=u.f;
 Hash Join  (cost=589.50..999.14 rows=20020 width=16) (actual time=29.054..69.296 rows=20020 loops=1)
   Hash Cond: ((t.g = u.g) AND (t.f = u.f))
   ->  Seq Scan on t  (cost=0.00..289.20 rows=20020 width=8) (actual time=0.016..11.331 rows=20020 loops=1)
   ->  Hash  (cost=289.20..289.20 rows=20020 width=8) (actual time=28.980..28.981 rows=20020 loops=1)
         Buckets: 32768  Batches: 1  Memory Usage: 1039kB
         ->  Seq Scan on u  (cost=0.00..289.20 rows=20020 width=8) (actual time=0.010..12.730 rows=20020 loops=1)

Justin
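The FK's effect on the estimate reduces to simple arithmetic (a back-of-envelope reading added here, not from the mail): once the planner knows every (f,g) in t references exactly one row of u, the join cardinality is just the size of the referencing table rather than the product of per-column selectivities:

```python
# Hypothetical back-of-envelope: independent-column vs FK-aware estimates.
n_t = n_u = 20020

# Without the FK, the planner multiplies per-column join selectivities
# (~0.25 each for this skewed data), giving the inflated estimate.
naive_est = n_t * n_u * 0.25 * 0.25

# With the FK, each t row matches exactly one u row.
fk_est = n_t

print(round(naive_est))   # ~25 million, like rows=25055030 earlier
print(fk_est)             # 20020, matching the Hash Join's rows=20020
```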
Thank you Justin!

On Wed, 04 Sep 2019 17:18:36 -0700 (PDT), Justin Pryzby wrote:
> When you dropped the index, you necessarily refused the plan involving
> index scan.
> So it did merge join instead (which it thinks of as expensive because it
> has to sort both sides).
>
> As you saw, the rowcount estimate of the join result is badly off. But
> that's true of both plans.
>
> Choice of join type is affected by the size of its *inputs*, but doesn't
> and shouldn't have any effect on the result rowcount (or its estimate).
> The rowcount *output* of the join would only affect any *following* plan
> nodes (of which there are none in this case).
>
> You suggested using the MCV list, but I don't think that's possible, since
> the nested loop is evaluating its "inner" table multiple times, in a "loop":
>
>> ->  Index Scan using t_f on t  (cost=0.29..0.36 rows=1 width=8) (actual time=0.565..1.125 rows=1 loops=20020)
>
> Hypothetically, one could plan the query 20k times, for each value of u.f
> and u.g, but that's tantamount to actually executing the query, which is
> why it uses (I gather) var_eq_non_const.

In fact the planner has information about the outer relation when it is
estimating the number of inner rows for a nested-loop join. It ignores this
information and treats the outer and inner relations as independent, so for
the rowcount estimate it uses only the number of distinct values, not the
MCV lists of both tables. For the test case above, the planner estimates the
selectivity of the clause "t.f=u.f" as 1/10011. It expects to scan
1/10011*20020=2 rows in an inner loop for each row of the outer loop. In fact
it scans 10010 rows 10010 times and 1 row 10010 times. The average number of
rows scanned in an inner loop is (10010*10010+10010)/20020=5005, which is
badly off from the 2 rows expected. The planner should use 5005/20020 as a
more accurate value for the effective selectivity of "t.f=u.f".
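The arithmetic above can be checked directly (a quick sketch added for clarity, not from the mail):

```python
# Verify the inner-loop row counts discussed above.
n = 20020            # rows in each of t and u
nd = 10011           # distinct values in t.f: 0 plus 1..10010

# Planner's estimate: selectivity 1/ndistinct, independent of the outer row.
est_inner_rows = n / nd
assert round(est_inner_rows) == 2

# Reality: 10010 outer rows (f=0) each match 10010 inner rows;
# 10010 outer rows with distinct f each match exactly 1.
avg_inner_rows = (10010 * 10010 + 10010 * 1) / n   # 5005.5, i.e. ~5005
assert int(avg_inner_rows) == 5005

# The more accurate effective selectivity for "t.f=u.f" is therefore
# roughly 5005/20020 rather than 1/10011.
assert avg_inner_rows / n > 2500 * (1 / nd)
```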
I tried to improve the function var_eq_non_const() so that it calculates the
effective selectivity using the MCV lists when possible. A patch for
PostgreSQL 11.5 is attached. The patched PostgreSQL chooses an optimal plan
for the test case; as a result, the query execution time is reduced by more
than 600 times.

Now if we force the planner to choose the nested-loop join

set enable_mergejoin=off;
set enable_hashjoin=off;
explain analyze select * from t,u where t.f=u.f and t.g=u.g;

 Nested Loop  (cost=0.29..2261681.75 rows=25055030 width=16) (actual time=0.048..22519.232 rows=20020 loops=1)
   ->  Seq Scan on u  (cost=0.00..289.20 rows=20020 width=8) (actual time=0.012..2.970 rows=20020 loops=1)
   ->  Index Scan using t_f on t  (cost=0.29..100.44 rows=1252 width=8) (actual time=0.565..1.124 rows=1 loops=20020)
         Index Cond: (f = u.f)
         Filter: (u.g = g)
         Rows Removed by Filter: 5004
 Planning Time: 0.188 ms
 Execution Time: 22521.248 ms

we will see that the cost of the index scan has increased from 0.36 to
100.44, which is much more realistic and reflects the 2 to 5005 rows ratio.

Regards,
Oleg
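The patched plan's rows=1252 is consistent with a simple MCV model (a back-of-envelope reconstruction, assuming the patch averages the match frequency over both columns' MCV-style stats; the actual patch may differ):

```python
# Assumed model of the patched estimate, not the actual patch code:
# average the inner-side match frequency of "f = u.f" over the outer
# column's distribution.
n = 20020

# Value 0 has frequency 0.5 on both sides; the 10010 distinct values
# have frequency 1/20020 each.
sel_f = 0.5 * 0.5 + 10010 * (1 / n) * (1 / n)
inner_rows_f = n * sel_f
print(round(inner_rows_f))        # ~5005 rows matching the index condition

# Applying the same selectivity for the g filter yields the plan's estimate:
sel_g = sel_f
print(round(n * sel_f * sel_g))   # ~1252, close to rows=1252 in the plan
```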