Problems with estimating OR conditions, IS NULL on LEFT JOINs - Mailing list pgsql-hackers
From:      Tomas Vondra
Subject:   Problems with estimating OR conditions, IS NULL on LEFT JOINs
Date:
Msg-id:    52b55b53-420a-722c-adbd-706922fc059b@enterprisedb.com
Responses: Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs
List:      pgsql-hackers
Hi,

I ran into a pretty terrible case of a LEFT JOIN estimate, resulting in a pretty arbitrary underestimate. The query is trivial, nothing overly complex, and the more I think about it the more I think this is a fairly fundamental flaw in how we estimate this type of join.

Imagine you have two trivial tables:

CREATE TABLE large (id INT, a INT);
INSERT INTO large SELECT i, i FROM generate_series(1,1000000) s(i);

CREATE TABLE small (id INT, b INT);
INSERT INTO small SELECT i, i FROM generate_series(1,100) s(i);

The business meaning may be that "large" stores orders and "small" is for events related to a tiny fraction of the large table (e.g. returns). Let's run a couple of simple LEFT JOIN queries, adding conditions to them. Let's start with no condition at all:

EXPLAIN ANALYZE SELECT * FROM large LEFT JOIN small ON (large.id = small.id);

                              QUERY PLAN
----------------------------------------------------------------------
 Hash Left Join  (cost=3.25..18179.25 rows=1000000 width=16)
                 (actual time=0.069..550.290 rows=1000000 loops=1)
   Hash Cond: (large.id = small.id)
   ->  Seq Scan on large  (cost=0.00..14425.00 rows=1000000 width=8)
                          (actual time=0.010..174.056 rows=1000000 loops=1)
   ->  Hash  (cost=2.00..2.00 rows=100 width=8) (actual time=0.052...
         Buckets: 1024  Batches: 1  Memory Usage: 12kB
         ->  Seq Scan on small  (cost=0.00..2.00 rows=100 width=8) ...
 Planning Time: 0.291 ms
 Execution Time: 663.551 ms
(8 rows)

So great, this estimate is perfect. Now, let's add an IS NULL condition on the small table, to find rows without a match (e.g. orders that were not returned):

EXPLAIN ANALYZE SELECT * FROM large LEFT JOIN small ON (large.id = small.id)
 WHERE (small.id IS NULL);

                              QUERY PLAN
----------------------------------------------------------------------
 Hash Anti Join  (cost=3.25..27052.36 rows=999900 width=16)
                 (actual time=0.071..544.568 rows=999900 loops=1)
   Hash Cond: (large.id = small.id)
   ->  Seq Scan on large  (cost=0.00..14425.00 rows=1000000 width=8)
                          (actual time=0.015..166.019 rows=1000000 loops=1)
   ->  Hash  (cost=2.00..2.00 rows=100 width=8) (actual time=0.051...
         Buckets: 1024  Batches: 1  Memory Usage: 12kB
         ->  Seq Scan on small  (cost=0.00..2.00 rows=100 width=8) ...
 Planning Time: 0.260 ms
 Execution Time: 658.379 ms
(8 rows)

Also very accurate, great! Now let's put a condition on the large table instead, filtering some of the rows:

EXPLAIN ANALYZE SELECT * FROM large LEFT JOIN small ON (large.id = small.id)
 WHERE (large.a IN (1000, 2000, 3000, 4000, 5000));

                              QUERY PLAN
----------------------------------------------------------------------
 Nested Loop Left Join  (cost=0.00..20684.75 rows=5 width=16)
                        (actual time=0.957..127.376 rows=5 loops=1)
   Join Filter: (large.id = small.id)
   Rows Removed by Join Filter: 500
   ->  Seq Scan on large  (cost=0.00..20675.00 rows=5 width=8)
                          (actual time=0.878..127.171 rows=5 loops=1)
         Filter: (a = ANY ('{1000,2000,3000,4000,5000}'::integer[]))
         Rows Removed by Filter: 999995
   ->  Materialize  (cost=0.00..2.50 rows=100 width=8) ...
         ->  Seq Scan on small  (cost=0.00..2.00 rows=100 width=8) ...
 Planning Time: 0.223 ms
 Execution Time: 127.407 ms
(10 rows)

Also a great estimate! Surely, if we combine both conditions with OR, we'll get a decent estimate too?
EXPLAIN ANALYZE SELECT * FROM large LEFT JOIN small ON (large.id = small.id)
 WHERE (small.id IS NULL) OR (large.a IN (1000, 2000, 3000, 4000, 5000));

                              QUERY PLAN
----------------------------------------------------------------------
 Hash Left Join  (cost=3.25..18179.88 rows=5 width=16)
                 (actual time=0.073..580.827 rows=999900 loops=1)
   Hash Cond: (large.id = small.id)
   Filter: ((small.id IS NULL) OR (large.a = ANY ('{1000,2000,3000,4000,5000}'::integer[])))
   Rows Removed by Filter: 100
   ->  Seq Scan on large  (cost=0.00..14425.00 rows=1000000 width=8)
                          (actual time=0.015..166.809 rows=1000000 loops=1)
   ->  Hash  (cost=2.00..2.00 rows=100 width=8) (actual time=0.052...
         Buckets: 1024  Batches: 1  Memory Usage: 12kB
         ->  Seq Scan on small  (cost=0.00..2.00 rows=100 width=8) ...
 Planning Time: 0.309 ms
 Execution Time: 694.427 ms
(10 rows)

Well, bummer! This is pretty surprising, because if we know that clause A is estimated at 1M rows and clause B at 5 rows, we'd expect (A OR B) to be estimated as something >= max(1M, 5). For users running into this, it has to be really surprising. It's also quite serious, because with underestimates like this the planner is likely to pick nested loops for additional joins, and we all know how that performs for millions of rows ...

So, how does this happen? Well, the simple reason is that joins are estimated by applying the selectivities of all the clauses to a cartesian product. So we calculate the product (100 * 1M), and then apply the selectivity of the join condition and of the two single-table clauses.

The problem is that the selectivity for "IS NULL" is estimated using the table-level statistics. But the LEFT JOIN entirely breaks the idea that null_frac has anything to do with NULLs in the join result, because the join result is not a subset of the cartesian product - it's a superset. Even if the small table has no NULLs, the join can have them - that's the whole point of outer joins.

When there's only the IS NULL condition, we actually recognize this as a special case and treat the join as an antijoin (Hash Anti Join), and that also fixes the estimate - antijoins handle this fine. But as soon as you change the condition a bit and do the IS NULL check on the other column of the table, it goes wrong again:

EXPLAIN ANALYZE SELECT * FROM large LEFT JOIN small ON (large.id = small.id)
 WHERE (small.b IS NULL);

                              QUERY PLAN
----------------------------------------------------------------------
 Hash Left Join  (cost=3.25..18179.25 rows=1 width=16)
                 (actual time=0.311..3110.298 rows=999900 loops=1)
   Hash Cond: (large.id = small.id)
   Filter: (small.b IS NULL)
   Rows Removed by Filter: 100
   ->  Seq Scan on large  (cost=0.00..14425.00 rows=1000000 width=8)
                          (actual time=0.014..1032.497 rows=1000000 loops=1)
   ->  Hash  (cost=2.00..2.00 rows=100 width=8) (actual time=0.287...
         Buckets: 1024  Batches: 1  Memory Usage: 12kB
         ->  Seq Scan on small  (cost=0.00..2.00 rows=100 width=8) ...
 Planning Time: 0.105 ms
 Execution Time: 4083.634 ms
(10 rows)

I'd bet most users would be rather surprised to learn this subtle change makes a difference.

I wonder how to improve this, say by adjusting the IS NULL selectivity when we know it operates on the outer side of the join. We're able to do this for antijoins, so maybe we could do something similar here, somehow? Unfortunately the important pieces (join type detection, IS NULL clause estimation) happen pretty far apart, but maybe we could pass the information about the fraction of NULLs introduced by the join down to nulltestsel, and use it there (essentially by doing null_frac + join_null_frac).
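For illustration, the mismatch between the table-level null_frac and the fraction of NULLs the join actually introduces - the two quantities a null_frac + join_null_frac approach would combine - is easy to see directly. This is just a sketch, assuming the tables have been ANALYZEd; "join_null_frac" here is merely a label, not an existing statistic:

-- table-level stats: small.id contains no NULLs at all
SELECT null_frac
  FROM pg_stats
 WHERE tablename = 'small' AND attname = 'id';            -- 0

-- yet ~99.99% of the join output has small.id NULL, thanks to the outer join
SELECT avg((small.id IS NULL)::int) AS join_null_frac
  FROM large LEFT JOIN small ON (large.id = small.id);    -- ~0.9999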
And maybe we could do something like that for NULL tests on other columns from the outer side ...

The other thing that might be beneficial is calculating boundaries for the estimate. In this case we're capable of estimating the join size for each individual condition, and then we could ensure the final estimate for the "a OR b" join is >= max of those cardinalities. Of course, that might be expensive if we have to redo some of the join planning/estimates for each individual condition (otherwise we might not notice the antijoin transformation and stuff like that).
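For what it's worth, here's a back-of-the-envelope version of both numbers: the rows=5 the planner currently produces, next to the lower bound suggested above. The "clamp to the outer row count" step is my reading of the left-join sizing arithmetic, so treat the details (and the column names) as illustrative assumptions rather than fact:

-- rough arithmetic behind the current estimate vs. the proposed lower bound
SELECT greatest(1000000 * 100 * (1.0 / 1000000),  -- cartesian product * join clause selectivity
                1000000)                          -- assumed clamp: a left join returns at least the outer rows
       * (0 + 5.0 / 1000000)  AS current_estimate,      -- OR of IS NULL (null_frac = 0) and the 5-value IN list
       greatest(999900, 5)    AS proposed_lower_bound;  -- max of the two per-condition estimates shown above

That works out to 5 and 999900 respectively, i.e. the two numbers from the plans above.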
regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company