Re: [PERFORM] Optimization inner join - Mailing list pgsql-performance

From Gustavo Rezende Montesino
Subject Re: [PERFORM] Optimization inner join
Date
Msg-id 5880D11A.7060900@trtsp.jus.br
In response to Re: [PERFORM] Optimization inner join  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-performance
Em 19/01/2017 12:13, Tom Lane escreveu:
> Gustavo Rezende Montesino <gustavo.montesino@trtsp.jus.br> writes:
>> Being the client in question, I would like to make a little remark: what
>> we thought could be optimized here is the row estimate of the index
>> scan, which could take null_frac into account. To put things into
>> perspective, our similar case in production has a table with 6 million
>> rows where only 9.5k are non-null for the join field, and the
>> over-estimation is throwing away good plans (~150ms execution time)
>> in favor of pretty bad ones (~80s execution time).
> Please provide a concrete test case for that.  AFAIK the null fraction
> should be accounted for in join size estimates.  Here's a little test
> case showing that it is:

Hello,

Expanding a little on your example:

postgres=# create table t1 as select generate_series(1,1000000) as f1;
SELECT 1000000
postgres=# create table t2 as select generate_series(1,1000000) as f1;
SELECT 1000000
postgres=# insert into t2 select null from generate_series(1,1000000);
INSERT 0 1000000
postgres=# create index on t1(f1);
CREATE INDEX
postgres=# create index on t2(f1);
CREATE INDEX
postgres=# analyze t1;
ANALYZE
postgres=# analyze t2;
ANALYZE
postgres=# explain select * from t1,t2 where t1.f1=t2.f1;
                                       QUERY PLAN
-----------------------------------------------------------------------------------------
 Merge Join  (cost=2.68..59298.81 rows=499433 width=8)
   Merge Cond: (t1.f1 = t2.f1)
   ->  Index Only Scan using t1_f1_idx on t1  (cost=0.42..24916.42 rows=1000000 width=4)
   ->  Index Only Scan using t2_f1_idx on t2  (cost=0.43..48837.43 rows=2000000 width=4)
(4 rows)
postgres=# explain select * from t1,t2 where t1.f1=t2.f1 and t2.f1 is not null;
                                       QUERY PLAN
-----------------------------------------------------------------------------------------
 Merge Join  (cost=1.85..44588.02 rows=249434 width=8)
   Merge Cond: (t1.f1 = t2.f1)
   ->  Index Only Scan using t1_f1_idx on t1  (cost=0.42..24916.42 rows=1000000 width=4)
   ->  Index Only Scan using t2_f1_idx on t2  (cost=0.43..26890.60 rows=998867 width=4)
         Index Cond: (f1 IS NOT NULL)
(5 rows)
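For reference, the join-size arithmetic under discussion can be sketched as below. This is a simplified model of how null_frac enters an equijoin estimate (following the no-MCV case of PostgreSQL's eqjoinsel logic); the real planner uses per-column statistics and various clamps, so the inputs here are illustrative assumptions rather than values read from pg_stats, and the result is not an exact reproduction of the estimates above:

```python
# Simplified sketch: in the no-MCV case, PostgreSQL's equijoin
# selectivity is roughly
#   (1 - nullfrac1) * (1 - nullfrac2) / max(ndistinct1, ndistinct2)
# so NULLs in the join column shrink the estimate proportionally.
# All inputs below are illustrative assumptions, not pg_stats values.

def equijoin_rows(n1, n2, nullfrac1, nullfrac2, ndistinct1, ndistinct2):
    """Estimated output rows of an inner equijoin between n1 and n2 rows."""
    selectivity = (1.0 - nullfrac1) * (1.0 - nullfrac2) / max(ndistinct1, ndistinct2)
    return n1 * n2 * selectivity

# t1: 1M rows, no NULLs; t2: 2M rows, half NULL (as in the session above),
# assuming both columns are estimated to have ~1M distinct non-null values.
print(equijoin_rows(1_000_000, 2_000_000, 0.0, 0.5, 1_000_000, 1_000_000))  # 1000000.0
```

With null_frac ignored (both fractions set to 0), the same formula returns twice as many rows, which is the flavor of over-estimate described for the production case.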


Notice the difference in the estimated costs. In our real case this
difference leads to a (very) bad plan choice.

BTW, execution itself is indeed faster without the NOT NULL clause.

These tests were on 9.3, but our production server with the "real" case
is on 9.6. Behavior seems to be the same on both.


Regards,

Gustavo R. Montesino


