Nested loop vs merge join: inconsistencies between estimated and actual time - Mailing list pgsql-performance

From Vlad Arkhipov
Subject Nested loop vs merge join: inconsistencies between estimated and actual time
Date
Msg-id 47D0D7B0.7060109@dc.baikal.ru
Whole thread Raw
Responses Re: Nested loop vs merge join: inconsistencies between estimated and actual time  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-performance
I've came across this issue while writing report-like query for 2 not
very large tables. I've tried several methods to resolve this one (see
below). But now I'm really stuck...
PostgreSQL 8.3, default configuration

There are 2 tables (structure was simplified to show only problematic
place):
create table c
(
  id bigint primary key
  cdate date
);

create index c_cdate_idx on c (cdate);

create table i
(
  id bigint primary key,
  id_c bigint references c(id)
);

select count(*) from c

count
--------
636 565

select count(*) from i

count
--------
4 646 145

analyze i;
analyze c;

explain analyze
select id
from c
  join i on i.idc = c.id
where c.cdate between '2007-02-01' and '2007-02-16'

QUERY
PLAN




---------------------------------------------------------------------------------------------------------------------------------------------------------



Merge Join  (cost=738.95..57864.63 rows=14479 width=8) (actual
time=13954.681..14358.731 rows=14583
loops=1)
  Merge Cond: (i.idc =
c.id)


  ->  Index Scan using fki_i_c_fk on i  (cost=0.00..194324.34
rows=4646145 width=8) (actual time=17.254..12061.414 rows=1042599 loops=1)
  ->  Sort  (cost=738.94..756.88 rows=7178 width=8) (actual
time=53.942..75.013 rows=14583
loops=1)
        Sort Key:
c.id


        Sort Method:  quicksort  Memory:
404kB


        ->  Index Scan using c_cdate_idx on c  (cost=0.00..279.21
rows=7178 width=8) (actual time=23.595..41.470 rows=7064 loops=1)
              Index Cond: ((cdate >= '2007-02-01'::date) AND (cdate <=
'2007-02-16'::date))
Total runtime: 14379.461
ms



set enable_mergejoin to off;
set enable_hashjoin to off;

QUERY
PLAN



--------------------------------------------------------------------------------------------------------------------------------------------------



Nested Loop  (cost=0.00..59833.70 rows=14479 width=8) (actual
time=0.129..153.038 rows=14583
loops=1)
  ->  Index Scan using  c_cdate_idx on c  (cost=0.00..279.21 rows=7178
width=8) (actual time=0.091..14.468 rows=7064 loops=1)
        Index Cond: ((cdate >= '2007-02-01'::date) AND (cdate <=
'2007-02-16'::date))
  ->  Index Scan using fki_i_c_fk on i  (cost=0.00..8.13 rows=13
width=8) (actual time=0.007..0.011 rows=2 loops=7064)
        Index Cond: (i.idc =
c.id)


Total runtime: 172.599 ms

Ok, the first problem is here:
  ->  Index Scan using fki_i_c_fk on i  (cost=0.00..8.13 rows=13
width=8) (actual time=0.007..0.011 rows=2 loops=7064)

I collected statistics for these tables at level 1000 for all columns.

select attname, null_frac, avg_width, n_distinct, correlation
from pg_stats
where tablename = 'i'

attname     null_frac           avg_width     n_distinct
correlation
----------  ------------------  ------------  -------------
------------------
id     0                   8             -1             0,9999849796295166
idc    0,7236369848251343  8             95583          0,999763011932373

Nice stats except of n_distinct for idc column.

select count(distinct idc)
from i

count
--------
633 864

Of course it is not correct solution but...

update pg_statistic
set stadistinct = 633864
where starelid = ... and staattnum = ...

Reconnect and execute:

explain analyze
select id
from c
  join i on i.idc = c.id
where c.cdate between '2007-02-01' and '2007-02-16'

QUERY
PLAN



--------------------------------------------------------------------------------------------------------------------------------------------------



Nested Loop  (cost=0.00..57342.39 rows=14479 width=8) (actual
time=0.133..151.426 rows=14583
loops=1)
  ->  Index Scan using c_cdate_idx on c  (cost=0.00..279.21 rows=7178
width=8) (actual time=0.094..14.242 rows=7064 loops=1)
        Index Cond: ((cdate >= '2007-02-01'::date) AND (cdate <=
'2007-02-16'::date))
  ->  Index Scan using fki_i_c_fk on i  (cost=0.00..7.92 rows=2 width=8)
(actual time=0.007..0.011 rows=2 loops=7064)
        Index Cond: (i.idc =
c.id)


Total runtime: 170.911
ms



But the reason of this issue is not the incorrect value of n_distinct.
Let's expand dates interval in WHERE clause.


explain analyze
select id
from c
  join i on i.idc = c.id
where c.cdate between '2007-02-01' and '2007-02-19'

QUERY
PLAN



--------------------------------------------------------------------------------------------------------------------------------------------------------



Merge Join  (cost=831.16..57981.98 rows=16155 width=8) (actual
time=11691.156..12155.201 rows=16357
loops=1)
  Merge Cond: (i.idc =
c.id)


  ->  Index Scan using fki_i_c_fk on i  (cost=0.00..194324.34
rows=4646145 width=8) (actual time=22.236..9928.489 rows=1044373 loops=1)
  ->  Sort  (cost=831.15..851.17 rows=8009 width=8) (actual
time=31.660..55.277 rows=16357
loops=1)
        Sort Key:
c.id


        Sort Method:  quicksort  Memory:
438kB


        ->  Index Scan using c_cdate_idx on c  (cost=0.00..311.87
rows=8009 width=8) (actual time=0.116..17.050 rows=7918 loops=1)
              Index Cond: ((cdate >= '2007-02-01'::date) AND (cdate <=
'2007-02-19'::date))
Total runtime: 12178.678 ms

set enable_mergejoin to off;
set enable_hashjoin to off;

QUERY
PLAN



--------------------------------------------------------------------------------------------------------------------------------------------------



Nested Loop  (cost=0.00..63724.20 rows=16155 width=8) (actual
time=0.131..171.292 rows=16357
loops=1)
  ->  Index Scan using c_cdate_idx on c  (cost=0.00..311.87 rows=8009
width=8) (actual time=0.093..15.906 rows=7918 loops=1)
        Index Cond: ((cdate >= '2007-02-01'::date) AND (cdate <=
'2007-02-19'::date))
  ->  Index Scan using fki_i_c_fk on i  (cost=0.00..7.89 rows=2 width=8)
(actual time=0.007..0.011 rows=2 loops=7918)
        Index Cond: (i.idc =
c.id)


Total runtime: 193.221 ms

Why nested loop is overestimated here (63000 estimated = 171 actual,
58000 estimated = 12155 actual).


pgsql-performance by date:

Previous
From: Tom Lane
Date:
Subject: Re: Improve Full text rank in a query
Next
From: Tom Lane
Date:
Subject: Re: Nested loop vs merge join: inconsistencies between estimated and actual time