Re: PoC/WIP: Extended statistics on expressions - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: PoC/WIP: Extended statistics on expressions |
Date | |
Msg-id | ce50ee26-7135-c435-cae8-073ad59eba60@enterprisedb.com Whole thread Raw |
In response to | Re: PoC/WIP: Extended statistics on expressions (Dean Rasheed <dean.a.rasheed@gmail.com>) |
Responses |
Re: PoC/WIP: Extended statistics on expressions
|
List | pgsql-hackers |
On 3/17/21 9:58 PM, Dean Rasheed wrote: > On Wed, 17 Mar 2021 at 20:48, Dean Rasheed <dean.a.rasheed@gmail.com> wrote: >> >> For reference, here is the test case I was using (which isn't really very good for >> catching dependence between columns): >> > > And here's a test case with much more dependence between the columns: > > DROP TABLE IF EXISTS foo; > CREATE TABLE foo (a int, b int, c int, d int); > INSERT INTO foo SELECT x%2, x%5, x%10, x%15 FROM generate_series(1,100000) x; > SELECT COUNT(DISTINCT a) FROM foo; -- 2 > SELECT COUNT(DISTINCT b) FROM foo; -- 5 > SELECT COUNT(DISTINCT c) FROM foo; -- 10 > SELECT COUNT(DISTINCT d) FROM foo; -- 15 > SELECT COUNT(DISTINCT (a+b)) FROM foo; -- 6 > SELECT COUNT(DISTINCT (c+d)) FROM foo; -- 20 > SELECT COUNT(DISTINCT ((a+b),c)) FROM foo; -- 10 > SELECT COUNT(DISTINCT ((a+b),(c+d))) FROM foo; -- 30 > > -- First case: stats on [(a+b),c] > CREATE STATISTICS s1(ndistinct) ON (a+b),c FROM foo; > ANALYSE foo; > EXPLAIN ANALYSE > SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d); > -- Estimate = 150, Actual = 30 > -- This estimate is ndistinct((a+b),c) * ndistinct(d) = 10*15, > -- which is much better than ndistinct((a+b)) * ndistinct(c) * > ndistinct(d) = 6*10*15 = 900 > -- Estimate with no stats = 1500 > > -- Second case: stats on (c+d) as well > CREATE STATISTICS s2 ON (c+d) FROM foo; > ANALYSE foo; > EXPLAIN ANALYSE > SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d); > -- Estimate = 120, Actual = 30 > -- This estimate is ndistinct((a+b)) * ndistinct((c+d)) = 6*20 > > Again, I'd say the current behaviour is pretty good. > Thanks! I agree applying at least the [(a+b),c] stats is probably the right approach, as it means we're considering at least the available information about dependence between the columns. I think to improve this, we'll need to teach the code to use overlapping statistics, a bit like conditional probability. In this case we might do something like this: ndistinct((a+b),c) * (ndistinct((c+d)) / ndistinct(c)) Which in this case would be either, for the "less correlated" case 228 * 24 / 12 = 446 (actual = 478, current estimate = 480) or, for the "more correlated" case 10 * 20 / 10 = 20 (actual = 30, current estimate = 120) But that's clearly a matter for a future patch, and I'm sure there are cases where this will produce worse estimates. Anyway, I plan to go over the patches one more time, and start pushing them sometime early next week. I don't want to leave it until the very last moment in the CF. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: