PoC: using sampling to estimate joins / complex conditions - Mailing list pgsql-hackers

From:    Tomas Vondra
Subject: PoC: using sampling to estimate joins / complex conditions
Msg-id:  580ba824-7630-e6e8-f80f-725dfa587da5@enterprisedb.com
List:    pgsql-hackers

Hi,

Estimating joins is one of the significant gaps related to extended
statistics, and I've been regularly asked about what we might do about
that. This is an early experimental patch that I think might help us
with improving this, possibly even in PG15.

Note: I do not claim this is exactly how it should be implemented, but
it's probably sufficient to demonstrate the pros/cons of the various
alternative approaches, etc.

In short, the patch samples the tables and uses those samples to
estimate selectivity for scans and joins. The samples are collected
during planning, which may be quite expensive - random I/O for each
query, etc. It'd be possible to build them during ANALYZE, but that'd
require solving serialization, tweaking CREATE STATISTICS to handle
join queries, etc. I decided to keep the PoC simple. It still uses
CREATE STATISTICS, with a new "sample" kind, instructing the optimizer
to use sampling when estimating clauses on the attributes.

A little example demonstrating what the patch does:

create table t (a int, b int, c int);

insert into t select mod(i,10), mod(i,20), mod(i,40)
  from generate_series(1,10000000) s(i);

analyze t;

-- estimate without any statistics / sampling
explain analyze select * from t where a = 0 and b = 0 and c = 0;

                             QUERY PLAN
-------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..229055.00 rows=1361 width=12)
                (actual time=0.025..761.571 rows=250000 loops=1)
   Filter: ((a = 0) AND (b = 0) AND (c = 0))
   Rows Removed by Filter: 9750000
 Planning Time: 0.471 ms
 Execution Time: 901.182 ms
(5 rows)

-- enable sampling on those columns
create statistics s (sample) on a, b, c from t;

explain analyze select * from t where a = 0 and b = 0 and c = 0;

                             QUERY PLAN
-------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..229055.00 rows=250390 width=12)
                (actual time=0.307..717.937 rows=250000 loops=1)
   Filter: ((a = 0) AND (b = 0) AND (c = 0))
   Rows Removed by Filter: 9750000
 Planning Time: 194.528 ms
 Execution Time: 851.832 ms
(5 rows)

Of course, in this case an MCV would work well too, because there are
very few distinct combinations in (a,b,c) - but a sample works even
when that's not the case, and it has various other benefits (it can
estimate almost any expression, while the MCV supports only a subset,
etc.).
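For comparison, the same scan-level estimate could also be improved by
the existing multi-column MCV statistics kind. A quick sketch - the
statistics name here is just illustrative:

-- existing alternative: a multi-column MCV list on the same columns
-- (works here because (a,b,c) has only a few distinct combinations)
create statistics s_mcv (mcv) on a, b, c from t;
analyze t;

explain analyze select * from t where a = 0 and b = 0 and c = 0;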
Now, let's look at a join between a fact table and a dimension table:

create table f (d1 int, d2 int, f1 int, f2 int, f3 int);

create table d (d1 int, d2 int, d3 int, d4 int, d5 int,
                primary key (d1, d2));

insert into d select i, i, mod(i,100), mod(i,100), mod(i,100)
  from generate_series(0,999) s(i);

insert into f select mod(i,1000), mod(i,1000), mod(i,100), mod(i,100), mod(i,100)
  from generate_series(1,1000000) s(i);

analyze f, d;

explain analyze select * from f join d using (d1,d2)
 where f1 < 50 and f2 < 50 and d3 < 50 and d4 < 50;

                              QUERY PLAN
----------------------------------------------------------------------
 Hash Join  (cost=25.75..22717.01 rows=63 width=32)
            (actual time=3.197..861.899 rows=500000 loops=1)
   Hash Cond: ((f.d1 = d.d1) AND (f.d2 = d.d2))
   ->  Seq Scan on f  (cost=0.00..21370.00 rows=251669 width=20)
                      (actual time=0.033..315.401 rows=500000 loops=1)
         Filter: ((f1 < 50) AND (f2 < 50))
         Rows Removed by Filter: 500000
   ->  Hash  (cost=22.00..22.00 rows=250 width=20)
             (actual time=3.139..3.141 rows=500 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 34kB
         ->  Seq Scan on d  (cost=0.00..22.00 rows=250 width=20)
                            (actual time=0.018..1.706 rows=500 loops=1)
               Filter: ((d3 < 50) AND (d4 < 50))
               Rows Removed by Filter: 500
 Planning Time: 0.806 ms
 Execution Time: 1099.229 ms
(12 rows)

So, not great - underestimated by ~10000x, which is likely to lead to
inefficient plans. And now with the samples enabled on both sides:

create statistics s1 (sample) on d1, d2, f1, f2, f3 from f;
create statistics s2 (sample) on d1, d2, d3, d4, d5 from d;

explain analyze select * from f join d using (d1,d2)
 where f1 < 50 and f2 < 50 and d3 < 50 and d4 < 50;

                              QUERY PLAN
----------------------------------------------------------------------
 Hash Join  (cost=29.50..24057.25 rows=503170 width=32)
            (actual time=0.630..837.483 rows=500000 loops=1)
   Hash Cond: ((f.d1 = d.d1) AND (f.d2 = d.d2))
   ->  Seq Scan on f  (cost=0.00..21370.00 rows=503879 width=20)
                      (actual time=0.008..301.584 rows=500000 loops=1)
         Filter: ((f1 < 50) AND (f2 < 50))
         Rows Removed by Filter: 500000
   ->  Hash  (cost=22.00..22.00 rows=500 width=20)
             (actual time=0.616..0.618 rows=500 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 34kB
         ->  Seq Scan on d  (cost=0.00..22.00 rows=500 width=20)
                            (actual time=0.004..0.321 rows=500 loops=1)
               Filter: ((d3 < 50) AND (d4 < 50))
               Rows Removed by Filter: 500
 Planning Time: 603.442 ms
 Execution Time: 1071.735 ms
(12 rows)

Yes, it takes 600ms to do the sampling, but I'm sure most of this can
be eliminated by optimizing the code and/or storing the samples just
like other types of stats.

Note that most of the underestimate is not due to poor estimates at the
scan level, but mostly due to the join condition having two correlated
clauses. Yes, adding a proper foreign key would probably improve this
(we already leverage that information in planning), but there can be
cross-table correlations between the other conditions, and the FK can't
help with that. Correlations between different dimension tables are
quite common, and sampling can help with those.

Note: There's another PoC patch using multi-column MCVs to improve join
estimates - that has the same limitations as MCVs at the scan level. It
works quite well, but only when the MCV represents a large part of the
data, and it does not support evaluating arbitrary expressions.

Now, a little bit about the implementation, sampling limitations, etc.

At the scan level, sampling is fairly straightforward - the patch
simply runs a TABLESAMPLE query through SPI, with the sample fraction
calculated from a GUC (estimate_sample_rate, 1% by default) and the
statistics target.
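To illustrate, the scan-level sampling boils down to running a query of
roughly this shape through SPI (the exact query the patch builds,
including the sampling method, is my guess; the 1% fraction corresponds
to the estimate_sample_rate default):

-- approximate shape of the sampling query issued through SPI; the real
-- sample fraction is derived from estimate_sample_rate and the
-- statistics target
select a, b, c from t tablesample bernoulli (1);

The clauses being estimated are then presumably evaluated against the
sampled rows, and the matching fraction is used as the selectivity.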
The samples may be too large and the calculation may need some changes,
but that's a minor detail, I think. I'm not sure SPI is the right way
to do this, but for a PoC it's good enough.

For joins, sampling is way more complicated - we can't just sample both
tables randomly, because that'd require huge samples on both sides. As
shown in [3], to get a sample of n rows of the join, a table with N
rows needs a sample of roughly sqrt(n * N) rows. Which is a lot.

So what this patch attempts to do is "correlated sampling", described
in [1] and [3]. Imagine a join on a foreign key, as in the example
query. (The patch only looks for a PK, for simplicity.) This is a
pretty common pattern, especially in star and snowflake queries, which
join a "fact" table to one or more "dimension" tables.

The "correlated" sampling means the "fact" table (the side of the join
without the PK) is sampled randomly, but the dimensions are simply
scanned for matching rows. The PK means there can be only one matching
row for each sampled one, so we're "enriching" the random sample (see
the SQL sketch at the end of this message). This is what [1] describes
as CS2, including how to extend the approach to joins without the PK/FK
requirement and various corner cases, and [3] improves that to leverage
indexes. [4] discusses various CS2 variations, addressing various
problems - reducing space requirements, etc.

The current PoC patch is, however, very simplistic and naive - for
example, it does not attempt to correlate joins with multiple
dimensions, so when joining F with D1 and then D2, we sample (F,D1) and
then (F,D2) independently. This means we sample F twice, which can be
quite expensive, and it also fails to capture correlations between D1
and D2 (which are common in actual data sets).

There are various other efficiency issues, because the joins go through
calc_joinrel_size_estimate and compute_semi_anti_join_factors, and each
place does the sampling again. The samples should probably be cached
somewhere and reused.

I'm sure there are plenty of open questions, some of which are
mentioned in the many XXX comments added to the patch.

FWIW, the patch does have some issues with expressions, so joins on
complex expressions (e.g. ON ((a+b) = (c+d))) do not work properly.
That shouldn't be a big deal for a PoC, I think.

regards

[1] CS2: A new database synopsis for query estimation
https://www.researchgate.net/publication/262350868_CS2_A_new_database_synopsis_for_query_estimation

[2] Join Size Estimation Subject to Filter Conditions
https://www.semanticscholar.org/paper/Join-Size-Estimation-Subject-to-Filter-Conditions-Vengerov-Menck/c8bd4caf0fc9c8a4fbffc7e05416901d4fd7a41b

[3] Cardinality Estimation Done Right: Index-Based Join Sampling
https://www.semanticscholar.org/paper/Cardinality-Estimation-Done-Right%3A-Index-Based-Join-Leis-Radke/15f211eaafc6ce421a511a413613e1d2683879d2

[4] Improved Correlated Sampling for Join Size Estimation
https://www.comp.nus.edu.sg/~taining/estimation/report.pdf

[5] A Survey on Advancing the DBMS Query Optimizer: Cardinality
Estimation, Cost Model, and Plan Enumeration
https://arxiv.org/abs/2101.01507

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
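PS: To make the correlated sampling idea above a bit more concrete,
here is a minimal SQL sketch using the f/d tables from the join example
- my illustration only, not what the patch literally executes:

-- sample the fact table f randomly, then "enrich" each sampled row with
-- the (at most one) matching dimension row, found via the PK (d1, d2);
-- the scan and join clauses are then evaluated against this joined sample
select *
  from (select * from f tablesample bernoulli (1)) fs
  left join d using (d1, d2);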