Re: PATCH: use foreign keys to improve join estimates v1 - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: PATCH: use foreign keys to improve join estimates v1
Msg-id: 56CDD883.7050209@2ndquadrant.com
In response to: Re: PATCH: use foreign keys to improve join estimates v1 (David Rowley <david.rowley@2ndquadrant.com>)
List: pgsql-hackers
Hi,

On 09/30/2015 03:12 AM, David Rowley wrote:

...

> CREATE TABLE f (id1 INT, id2 INT, PRIMARY KEY (id1, id2));
>
> CREATE TABLE d1 (id1 INT, id2 INT,
>                  FOREIGN KEY (id1, id2) REFERENCES f(id1, id2));
> CREATE TABLE d2 (id1 INT, id2 INT,
>                  FOREIGN KEY (id1, id2) REFERENCES f(id1, id2));
>
> INSERT INTO f  SELECT i, i FROM generate_series(1,1000000) s(i);
>
> INSERT INTO d1 SELECT i, i FROM generate_series(1,100000) s(i);
> INSERT INTO d2 SELECT i, i FROM generate_series(1,300000) s(i);
>
> now, both pair-wise joins (f JOIN d1) and (f JOIN d2) are estimated
> perfectly accurately, but as soon as the query involves both of them,
> this happens:
>
> SELECT * FROM f JOIN d1 ON (f.id1 = d1.id1 AND f.id2 = d1.id2)
>                 JOIN d2 ON (f.id1 = d2.id1 AND f.id2 = d2.id2);
>
>                               QUERY PLAN
> -------------------------------------------------------------------------
>  Nested Loop  (cost=3334.43..12647.57 rows=30000 width=24)
>               (actual time=221.086..1767.206 rows=100000 loops=1)
>    Join Filter: ((d1.id1 = f.id1) AND (d1.id2 = f.id2))
>    ->  Hash Join  (cost=3334.00..12647.01 rows=1 width=16)
>                   (actual time=221.058..939.482 rows=100000 loops=1)
>          Hash Cond: ((d2.id1 = d1.id1) AND (d2.id2 = d1.id2))
>          ->  Seq Scan on d2  (cost=0.00..4328.00 rows=300000 width=8)
>                              (actual time=0.038..263.356 rows=300000 loops=1)
>          ->  Hash  (cost=1443.00..1443.00 rows=100000 width=8)
>                    (actual time=220.721..220.721 rows=100000 loops=1)
>                Buckets: 131072  Batches: 2  Memory Usage: 2982kB
>                ->  Seq Scan on d1  (cost=0.00..1443.00 rows=100000 ...)
>                                    (actual time=0.033..101.547 rows=100000 loops=1)
>    ->  Index Only Scan using f_pkey on f  (cost=0.42..0.54 rows=1 ...)
>                                           (actual time=0.004..0.004 rows=1 loops=100000)
>          Index Cond: ((id1 = d2.id1) AND (id2 = d2.id2))
>          Heap Fetches: 100000
>
> Clearly, the inner join (d1 JOIN d2) is poorly estimated (1 vs. 100000).
> I assume that's only because we find the FK only on the second join
> with f.
>
> So it seems like a clear improvement, both compared to master and the
> previous versions of the patch.
>
>
> I've been experimenting with this example. Of course, the reason why we
> get the 1 row estimate on the join between d1 and d2 is that there's no
> foreign key between those two relations.
>
> The attached patch changes things so that the foreign key matching code
> is better able to see foreign keys "hidden" behind eclasses. So it does
> now in fact detect a foreign key on d2 referencing d1, by looking for
> foreign keys which have Vars in the same eclasses as the joinquals are
> built from.
> This has improved the result:
>
> postgres=# EXPLAIN ANALYZE SELECT * FROM f JOIN d1 ON (f.id1 = d1.id1
> AND f.id2 = d1.id2) JOIN d2 ON (f.id1 = d2.id1 AND f.id2 = d2.id2);
>
>                               QUERY PLAN
> -------------------------------------------------------------------------
>  Hash Join  (cost=16655.94..26066.95 rows=30000 width=24)
>             (actual time=267.322..468.383 rows=100000 loops=1)
>    Hash Cond: ((d2.id1 = f.id1) AND (d2.id2 = f.id2))
>    ->  Seq Scan on d2  (cost=0.00..4328.00 rows=300000 width=8)
>                        (actual time=0.019..31.396 rows=300000 loops=1)
>    ->  Hash  (cost=14666.94..14666.94 rows=100000 width=16)
>              (actual time=266.263..266.263 rows=100000 loops=1)
>          Buckets: 131072  Batches: 2  Memory Usage: 3373kB
>          ->  Merge Join  (cost=9748.32..14666.94 rows=100000 width=16)
>                          (actual time=104.494..224.908 rows=100000 loops=1)
>                Merge Cond: ((f.id1 = d1.id1) AND (f.id2 = d1.id2))
>                ->  Index Only Scan using f_pkey on f
>                      (cost=0.42..36214.93 rows=1000000 width=8)
>                      (actual time=0.045..35.758 rows=100001 loops=1)
>                      Heap Fetches: 100001
>                ->  Sort  (cost=9747.82..9997.82 rows=100000 width=8)
>                          (actual time=104.440..122.401 rows=100000 loops=1)
>                      Sort Key: d1.id1, d1.id2
>                      Sort Method: external sort  Disk: 2152kB
>                      ->  Seq Scan on d1  (cost=0.00..1443.00 rows=100000 width=8)
>                                          (actual time=0.019..9.443 rows=100000 loops=1)
>
> The problem is that the code I added is sometimes a bit too optimistic
> when looking for a suitable foreign key. When performing estimates for
> the join between (f,d1) <-> (d2), since the code loops over each
> relation making up the set of relations on either side of the join, we
> find a foreign key on 'f' which references d2; this one actually
> exists. It then goes on and also finds a foreign key where (d1)
> references (d2); of course this one does not exist and is only matched
> because of the eclasses. The problem here is: which one do we use? If
> we multiply the selectivity for each of these foreign keys then we'd
> end up with a selectivity = (1.0 / 1000000) * (1.0 / 300000), which is
> a massive underestimation. Perhaps doing this would be perfectly valid
> if the foreign key matched were not the same one in both cases, but it
> seems wrong when we match the same foreign key in both instances.
>
> I've gone through a few variations on ways to handle this and I'm a
> bit stuck on what's the best way.
>
> In the attached patch I've coded it to take the Min() selectivity when
> the same quals are matched more than once. I know this is not correct,
> but since it seems impossible to obtain an exact estimate in this
> case, we'd need to decide on some logic which does well in the average
> case.

I don't think we should derive foreign keys like this. The basis for
using foreign keys to improve estimates is that the foreign keys are
supposed to provide guarantees of existence, but that's clearly not the
case here - there's no foreign key between the two tables that get
joined first, and the FK derived this way guarantees nothing.

For example the two tables might reference different subsets of the "f"
table, e.g. d1 might reference the odd rows while d2 references the even
rows. That breaks the assumption of containment, which is exactly what
real FK constraints are supposed to guarantee (and what we rely on to
improve the estimates).
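A minimal sketch of that scenario, reusing the schema from the example
above (the data below is purely illustrative, it is not part of the
patch or the original example):

    -- both foreign keys to "f" are satisfied, yet d1 and d2 have no
    -- rows in common, so a foreign key "derived" between d1 and d2
    -- guarantees nothing about the d1:d2 join
    TRUNCATE d1, d2;

    INSERT INTO d1 SELECT i, i FROM generate_series(1,1000000) s(i)
     WHERE i % 2 = 1;   -- odd keys only
    INSERT INTO d2 SELECT i, i FROM generate_series(1,1000000) s(i)
     WHERE i % 2 = 0;   -- even keys only

    SELECT COUNT(*) FROM d1 JOIN d2 ON (d1.id1 = d2.id1
                                    AND d1.id2 = d2.id2);
    -- returns 0, while an estimate that trusted the derived FK would
    -- expect roughly one match in d1 for every row of d2

So an estimate built on a foreign key that does not actually exist can
be arbitrarily wrong.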
The problem with estimating the cardinality of the d1:d2 join is purely
in our inability to estimate the cardinality of the pair of columns used
in the join condition. The planner sees the conditions independently,
estimates the selectivity of each one as 1/ndistinct for the column, and
then multiplies the selectivities together. Sadly, in this case ndistinct
is equal to the cardinality of each of the tables, so we get an extreme
under-estimate.

Consider a simplified example:

    CREATE TABLE d1 (id1 INT, id2 INT);
    CREATE TABLE d2 (id1 INT, id2 INT);

    INSERT INTO d1 SELECT i/100, i%100 FROM generate_series(0,9999) s(i);
    INSERT INTO d2 SELECT i/548, i%548 FROM generate_series(0,299999) s(i);

In this case the data are constructed in a way that the product of the
column cardinalities is equal to the table cardinality:

    d1: 100 x 100 = 10,000
    d2: 548 x 548 = 300,000 (about)

                               QUERY PLAN
------------------------------------------------------------------------
 Hash Join  (cost=10000.00..30046.69 rows=9969 width=8)
            (actual time=278.989..306.935 rows=10000 loops=1)
   Hash Cond: ((d1.id1 = d2.id1) AND (d1.id2 = d2.id2))
   ->  Seq Scan on d1  (cost=0.00..145.00 rows=10000 width=8)
                       (actual time=0.031..4.202 rows=10000 loops=1)
   ->  Hash  (cost=4328.00..4328.00 rows=300000 width=8)
             (actual time=278.717..278.717 rows=300000 loops=1)
         Buckets: 131072  Batches: 4  Memory Usage: 3947kB
         ->  Seq Scan on d2  (cost=0.00..4328.00 rows=300000 width=8)
                             (actual time=0.020..129.025 rows=300000 loops=1)
 Planning time: 0.556 ms
 Execution time: 311.037 ms
(8 rows)

Here the product of the per-column estimates matches the true combined
cardinality, so the estimate is nearly exact (9969 vs. 10000 rows). So
fixing the cardinality estimate for (id1,id2) actually fixes the
estimate, but that's a task for the multivariate stats patch, not for
this one. The FK improves the ndistinct estimate implicitly, as it
guarantees the cardinality of one side of the join is exactly the
cardinality of the table (thanks to referencing a primary key). Maybe we
could use the existence of unique constraints in other cases.

Overall, I still believe the FK patch is a clear improvement over the
current state - while it's true it does not improve all possible cases
and there is room for additional improvements (like handling multiple
candidate FK constraints), it should not make existing estimates worse.

regards

--
Tomas Vondra                      http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services