Re: PATCH: use foreign keys to improve join estimates v1 - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: PATCH: use foreign keys to improve join estimates v1
Date
Msg-id 560939A4.4080305@2ndquadrant.com
In response to Re: PATCH: use foreign keys to improve join estimates v1  (David Rowley <david.rowley@2ndquadrant.com>)
Responses Re: PATCH: use foreign keys to improve join estimates v1  (David Rowley <david.rowley@2ndquadrant.com>)
Re: PATCH: use foreign keys to improve join estimates v1  (David Rowley <david.rowley@2ndquadrant.com>)
List pgsql-hackers
Hi,

On 09/27/2015 02:00 PM, David Rowley wrote:
> I've been working on this again. I've put back the code that you wrote
> for the looping over each combination of relations from either side of
> the join.
>
> I've also added some code to get around the problem with eclass joins
> and the RestrictInfo having some alternative Vars that don't belong to
> the foreign key. Basically I'm just checking if the RestrictInfo has a
> parent_ec, and if it does just loop over the members to try and find the
> Vars that belong to the foreign key. I've tested it with the following,
> and it seems to work:

I didn't have time to look into the code yet, but this seems like an 
interesting idea.
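For reference, the matching scheme described above (falling back to the equivalence-class members when the RestrictInfo's own Vars are not part of the foreign key) could be sketched roughly like this. This is my own illustrative pseudocode in Python, not the actual C from the patch; all names here are made up:

```python
# Hypothetical sketch of the eclass fallback described above: when the
# clause's own Vars don't belong to the FK, scan the other members of
# the clause's equivalence class for a Var that does.

def match_clause_to_fk(clause_vars, fk_cols, eclass_members):
    """Return the first Var (from the clause or its eclass) that is
    covered by the foreign key, or None if there is no match.

    clause_vars    -- Vars appearing directly in the RestrictInfo
    fk_cols        -- column names covered by the foreign key
    eclass_members -- all Vars known equal to the clause's Vars
    """
    # Check the clause's own Vars first, then the equivalent ones.
    for var in list(clause_vars) + list(eclass_members):
        if var in fk_cols:
            return var
    return None

# Example mirroring the test case below: the clause mentions "dummy1",
# but the eclass also contains "a_id1", which is part of the FK.
print(match_clause_to_fk(("dummy1", "b_id1"), {"a_id1", "a_id2"},
                         ["dummy1", "a_id1", "b_id1"]))  # → a_id1
```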

>
> create table a as select i as a_id1, i as a_id2, i as dummy1 from
> generate_series(0,999) s(i);
> alter table a add unique (a_id1, a_id2);
> create table b as select i as b_id1, i as b_id2 from
> generate_series(0,332) s(i);
>
> analyze a;
> analyze b;
>
> alter table b add foreign key (b_id1, b_id2) references a (a_id1, a_id2);
>
> explain analyze select * from a inner join b on a.dummy1 = b.b_id1 and
> a.a_id2 = b.b_id2 where a.a_id1 = a.dummy1;
>
>                                                   QUERY PLAN
> -----------------------------------------------------------------------------------------------------------
>   Hash Join  (cost=18.57..26.41 rows=2 width=20) (actual
> time=0.775..1.046 rows=333 loops=1)
>     Hash Cond: ((b.b_id1 = a.dummy1) AND (b.b_id2 = a.a_id2))
>     ->  Seq Scan on b  (cost=0.00..5.33 rows=333 width=8) (actual
> time=0.013..0.046 rows=333 loops=1)
>     ->  Hash  (cost=18.50..18.50 rows=5 width=12) (actual
> time=0.737..0.737 rows=1000 loops=1)
>           Buckets: 1024  Batches: 1  Memory Usage: 51kB
>           ->  Seq Scan on a  (cost=0.00..18.50 rows=5 width=12) (actual
> time=0.014..0.389 rows=1000 loops=1)
>                 Filter: (dummy1 = a_id1)
>
> The non-patched version estimates 1 row. The patched estimates 2 rows,
> but that's due to the bad estimate on dummy1 = a_id1.
>
> The 2 comes from ceil(5 * 0.333).
>
> Perhaps you have a better test case to for this?
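Just to spell out the arithmetic quoted above (my reading of the numbers, not code from the patch): the 5 is the planner's poor estimate for the dummy1 = a_id1 filter on a, and 0.333 is the FK-derived join selectivity (333 referencing rows against 1000 referenced rows):

```python
import math

a_rows_after_filter = 5    # planner's (bad) estimate for dummy1 = a_id1
join_selectivity = 0.333   # 333 rows in b referencing 1000 rows in a

# ceil(5 * 0.333) = ceil(1.665) = 2, matching the patched estimate
print(math.ceil(a_rows_after_filter * join_selectivity))  # → 2
```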

I think the additional WHERE clause is needlessly confusing. I've been 
able to come up with an example - pretty much a normalized schema with a 
"main" table and auxiliary tables (referencing the main one through a 
FK) carrying additional info. So not unlikely to happen in practice 
(except maybe for the multi-column foreign key bit).


CREATE TABLE f (id1 INT, id2 INT, PRIMARY KEY (id1, id2));

CREATE TABLE d1 (id1 INT, id2 INT, FOREIGN KEY (id1, id2) REFERENCES 
f(id1, id2));
CREATE TABLE d2 (id1 INT, id2 INT, FOREIGN KEY (id1, id2) REFERENCES 
f(id1, id2));

INSERT INTO f SELECT i, i FROM generate_series(1,1000000) s(i);

INSERT INTO d1 SELECT i, i FROM generate_series(1,100000) s(i);
INSERT INTO d2 SELECT i, i FROM generate_series(1,300000) s(i);

now, both pair-wise joins (f JOIN d1) and (f JOIN d2) are estimated 
perfectly accurately, but as soon as the query involves both of them, 
this happens:

SELECT * FROM f JOIN d1 ON (f.id1 = d1.id1 AND f.id2 = d1.id2)
                JOIN d2 ON (f.id1 = d2.id1 AND f.id2 = d2.id2);

                          QUERY PLAN
-------------------------------------------------------------------------
 Nested Loop  (cost=3334.43..12647.57 rows=30000 width=24)
              (actual time=221.086..1767.206 rows=100000 loops=1)
   Join Filter: ((d1.id1 = f.id1) AND (d1.id2 = f.id2))
   ->  Hash Join  (cost=3334.00..12647.01 rows=1 width=16)
                  (actual time=221.058..939.482 rows=100000 loops=1)
         Hash Cond: ((d2.id1 = d1.id1) AND (d2.id2 = d1.id2))
         ->  Seq Scan on d2  (cost=0.00..4328.00 rows=300000 width=8)
                     (actual time=0.038..263.356 rows=300000 loops=1)
         ->  Hash  (cost=1443.00..1443.00 rows=100000 width=8)
                   (actual time=220.721..220.721 rows=100000 loops=1)
               Buckets: 131072  Batches: 2  Memory Usage: 2982kB
               ->  Seq Scan on d1  (cost=0.00..1443.00 rows=100000 ...)
                       (actual time=0.033..101.547 rows=100000 loops=1)
   ->  Index Only Scan using f_pkey on f  (cost=0.42..0.54 rows=1 ...)
                        (actual time=0.004..0.004 rows=1 loops=100000)
         Index Cond: ((id1 = d2.id1) AND (id2 = d2.id2))
         Heap Fetches: 100000

Clearly, the inner join (d1 JOIN d2) is poorly estimated (1 vs. 100000). 
I assume that's only because we find FK only on the second join with f.
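The improvement the patch provides for the (f JOIN d1) and (f JOIN d2) steps can be mimicked numerically: with a FK in place, every referencing row matches exactly one referenced row, so the FK-implied selectivity is 1/|referenced|. A rough sketch of that arithmetic (my illustration, not code from the patch):

```python
def fk_join_estimate(referencing_rows, referenced_rows):
    """FK-based join size estimate: each referencing row matches exactly
    one referenced row, i.e. selectivity is 1/referenced_rows, so the
    join produces referencing_rows rows."""
    selectivity = 1.0 / referenced_rows
    return round(referencing_rows * referenced_rows * selectivity)

# f has 1,000,000 rows; d1 references it with 100,000 rows, so the
# FK-based estimate for (f JOIN d1) is simply |d1|.
print(fk_join_estimate(100_000, 1_000_000))  # → 100000
```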

So this seems like a clear improvement, both compared to master and to 
the previous versions of the patch.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


