using extended statistics to improve join estimates - Mailing list pgsql-hackers
| From | Tomas Vondra |
|---|---|
| Subject | using extended statistics to improve join estimates |
| Date | |
| Msg-id | c8c0ff31-3a8a-7562-bbd3-78b2ec65f16c@enterprisedb.com |
| Responses | Re: using extended statistics to improve join estimates |
| List | pgsql-hackers |
Hi,

So far the extended statistics are applied only at scan level, i.e. when estimating selectivity for individual tables. Which is great, but joins are a known challenge, so let's try doing something about it ...

Konstantin Knizhnik posted a patch [1] using functional dependencies to improve join estimates in January. It's an interesting approach, but as I explained in that thread, I think we should try a different approach, similar to how we use MCV lists without extended stats. We'll probably end up considering functional dependencies too, but probably only as a fallback (similar to what we do for single-relation estimates).

This is a PoC demonstrating the approach I envisioned. It's incomplete and has various limitations:

- no expression support, just plain attribute references
- only equality conditions
- requires MCV lists on both sides
- inner joins only

All of this can / should be relaxed later, of course. But for a PoC this seems sufficient.

The basic principle is fairly simple, and mimics what eqjoinsel_inner does. Assume we have a query:

  SELECT * FROM t1 JOIN t2 ON (t1.a = t2.a AND t1.b = t2.b)

If we have MCV lists on (t1.a, t1.b) and (t2.a, t2.b), then we can use the same logic as eqjoinsel_inner and "match" them together. If the MCV list is "larger" - e.g. on (a,b,c) - then it's a bit more complicated, but the general idea is the same.

To demonstrate this, consider a very simple example with a table that has a lot of dependency between the columns:

==================================================================
CREATE TABLE t (a INT, b INT, c INT, d INT);

INSERT INTO t SELECT mod(i,100), mod(i,100), mod(i,100), mod(i,100)
  FROM generate_series(1,100000) s(i);

ANALYZE t;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);

CREATE STATISTICS s (mcv, ndistinct) ON a, b, c, d FROM t;

ANALYZE t;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);

ALTER STATISTICS s SET STATISTICS 10000;

ANALYZE t;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);
==================================================================

The results look like this:

- actual rows:            100000000
- estimated (no stats):     1003638
- estimated (stats, 100): 100247844
- estimated (stats, 10k): 100000000

So in this case the extended stats help quite a bit, even with the default statistics target.

However, there are other things we can do. For example, we can use restrictions (at relation level) as "conditions" to filter the MCV lists, and calculate conditional probability. This is useful even if there's just a single join condition (on one column) but there are dependencies between that column and the other filters, or when there are dependencies between the filters on the two sides. Consider for example these two queries:

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b)
 WHERE t1.c < 25 AND t2.d < 25;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b)
 WHERE t1.c < 25 AND t2.d > 75;

In this particular case we know that (a = b = c = d), so the two filters are somewhat redundant. The regular estimates ignore that, but with MCVs we can actually detect it - when we combine the two MCV lists we essentially calculate an MCV list on (a, b, t1.c, t2.d), and use that.

                             Q1          Q2
- actual rows:             25000000          0
- estimated (no stats):       62158      60241
- estimated (stats, 100):  25047900          1
- estimated (stats, 10k):  25000000          1

Obviously, the accuracy depends on how representative the MCV list is (what fraction of the data it represents), and in this case it works fairly nicely.
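To make the matching step concrete, here is a minimal, self-contained sketch of the MCV x MCV part of the estimate (plain C; the MCVItem layout, the function name and the numbers are made up for this illustration - the real multi-column MCV items also track NULLs and base frequencies, and the patch has to deal with the non-MCV remainder too):

==================================================================
/*
 * Toy sketch of matching two multi-column MCV lists, mimicking how
 * eqjoinsel_inner matches per-column MCVs. Not the patch code - the
 * types and names are simplified for illustration.
 */
#include <stdio.h>
#include <stdbool.h>

#define NCOLS 2                 /* join clauses: (a, b) */

typedef struct MCVItem
{
    int         values[NCOLS];  /* values of the join columns */
    double      frequency;      /* fraction of rows with these values */
} MCVItem;

/*
 * Sum frequency products over item pairs that match on all join
 * columns - the MCV x MCV contribution to join selectivity.
 */
static double
mcv_join_selectivity(const MCVItem *mcv1, int n1,
                     const MCVItem *mcv2, int n2)
{
    double      sel = 0.0;

    for (int i = 0; i < n1; i++)
    {
        for (int j = 0; j < n2; j++)
        {
            bool        match = true;

            for (int k = 0; k < NCOLS; k++)
            {
                if (mcv1[i].values[k] != mcv2[j].values[k])
                {
                    match = false;
                    break;
                }
            }

            if (match)
                sel += mcv1[i].frequency * mcv2[j].frequency;
        }
    }

    return sel;
}

int
main(void)
{
    /* hypothetical MCV lists on (a,b) for the two sides of the join */
    MCVItem     mcv1[] = {{{0, 0}, 0.01}, {{1, 1}, 0.01}, {{2, 2}, 0.01}};
    MCVItem     mcv2[] = {{{0, 0}, 0.01}, {{1, 1}, 0.01}, {{3, 3}, 0.01}};

    /* two item pairs match, so this prints 0.000200 */
    printf("MCV join selectivity: %f\n",
           mcv_join_selectivity(mcv1, 3, mcv2, 3));

    return 0;
}
==================================================================

The selectivity from the matched pairs then gets multiplied by the cardinalities of the two relations, and handling the part of the data not covered by the MCV lists is where much of the remaining complexity lives.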
A lot of the future work will be about handling cases where the MCV list represents only part of the data.

The attached PoC patch has a number of FIXME and XXX comments describing stuff I ignored to keep it simple, possible future improvements, and so on.

regards

[1] https://www.postgresql.org/message-id/flat/71d67391-16a9-3e5e-b5e4-8f7fd32cc1b2@postgrespro.ru

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company