Thread: using extended statistics to improve join estimates
Hi,

So far the extended statistics are applied only at scan level, i.e. when
estimating selectivity for individual tables. Which is great, but joins
are a known challenge, so let's try doing something about it ...

Konstantin Knizhnik posted a patch [1] using functional dependencies to
improve join estimates in January. It's an interesting approach, but as
I explained in that thread I think we should try a different approach,
similar to how we use MCV lists without extended stats. We'll probably
end up considering functional dependencies too, but probably only as a
fallback (similar to what we do for single-relation estimates).

This is a PoC demonstrating the approach I envisioned. It's incomplete
and has various limitations:

- no expression support, just plain attribute references
- only equality conditions
- requires MCV lists on both sides
- inner joins only

All of this can / should be relaxed later, of course. But for a PoC this
seems sufficient.

The basic principle is fairly simple, and mimics what eqjoinsel_inner
does. Assume we have a query:

  SELECT * FROM t1 JOIN t2 ON (t1.a = t2.a AND t1.b = t2.b)

If we have MCV lists on (t1.a, t1.b) and (t2.a, t2.b) then we can use the
same logic as eqjoinsel_inner and "match" them together. If the MCV list
is "larger" - e.g. on (a,b,c) - then it's a bit more complicated, but
the general idea is the same.

To demonstrate this, consider a very simple example with a table that
has a lot of dependency between the columns:

==================================================================
CREATE TABLE t (a INT, b INT, c INT, d INT);

INSERT INTO t SELECT mod(i,100), mod(i,100), mod(i,100), mod(i,100)
  FROM generate_series(1,100000) s(i);

ANALYZE t;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);

CREATE STATISTICS s (mcv, ndistinct) ON a,b,c,d FROM t;

ANALYZE t;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);

ALTER STATISTICS s SET STATISTICS 10000;

ANALYZE t;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);
==================================================================

The results look like this:

- actual rows:              100000000
- estimated (no stats):       1003638
- estimated (stats, 100):   100247844
- estimated (stats, 10k):   100000000

So, in this case the extended stats help quite a bit, even with the
default statistics target.

However, there are other things we can do. For example, we can use
restrictions (at relation level) as "conditions" to filter the MCV
lists, and calculate conditional probability. This is useful even if
there's just a single join condition (on one column), but there are
dependencies between that column and the other filters. Or maybe when
there are filters between conditions on the two sides.

Consider for example these two queries:

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b)
 WHERE t1.c < 25 AND t2.d < 25;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b)
 WHERE t1.c < 25 AND t2.d > 75;

In this particular case we know that (a = b = c = d), so the two filters
are somewhat redundant. The regular estimates will ignore that, but with
MCV we can actually detect that - when we combine the two MCV lists, we
essentially calculate an MCV on (a, b, t1.c, t2.d), and use that.

                                  Q1          Q2
- actual rows:              25000000           0
- estimated (no stats):        62158       60241
- estimated (stats, 100):   25047900           1
- estimated (stats, 10k):   25000000           1

Obviously, the accuracy depends on how representative the MCV list is
(what fraction of the data it represents), and in this case it works
fairly nicely.
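To make the "matching" idea above a bit more concrete, here is a minimal
standalone sketch - the struct and function names are invented and the
items are reduced to two integer columns, so this illustrates the
principle only and is not the patch code:

==================================================================
/* simplified two-column MCV item, for illustration only */
typedef struct SketchMCVItem
{
    int     a;
    int     b;
    double  frequency;
} SketchMCVItem;

/*
 * Sum the products of frequencies of matching items, i.e. the join
 * selectivity contributed by the part of both tables covered by the
 * MCV lists (the same thing eqjoinsel_inner does per column).
 */
static double
sketch_mcv_join_sel(SketchMCVItem *mcv1, int n1,
                    SketchMCVItem *mcv2, int n2)
{
    double  matchfreq = 0.0;

    for (int i = 0; i < n1; i++)
        for (int j = 0; j < n2; j++)
            if (mcv1[i].a == mcv2[j].a && mcv1[i].b == mcv2[j].b)
                matchfreq += mcv1[i].frequency * mcv2[j].frequency;

    return matchfreq;
}
==================================================================

The part of the data not covered by either MCV list still has to be
estimated separately (e.g. using ndistinct), just like eqjoinsel_inner
does for the non-MCV part of per-column statistics.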
A lot of the future work will be about handling cases when the MCV list
represents only part of the data.

The attached PoC patch has a number of FIXMEs and XXXs, describing stuff
I ignored to keep it simple, possible future improvements, and so on.

regards

[1] https://www.postgresql.org/message-id/flat/71d67391-16a9-3e5e-b5e4-8f7fd32cc1b2@postgrespro.ru

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
Hi,
+ * has_matching_mcv
+ * Check whether the list contains statistic of a given kind
The method name is find_matching_mcv(). It seems the method initially returned bool but later the return type was changed.
+ StatisticExtInfo *found = NULL;
The name "found" is normally associated with a bool return value. Maybe call the variable matching_mcv or something similar.
+ /* skip items eliminated by restrictions on rel2 */
+ if (conditions2 && !conditions2[j])
+ continue;
Maybe you can add a counter recording the number of non-skipped items in the inner loop. If this counter is 0 after one full pass of the inner loop, we can break out of the outer loop directly (a rough sketch follows below).
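A rough sketch of that suggestion (the names are made up and the matching
itself is elided, so this only illustrates the early exit, not the patch
code):

==================================================================
#include <stdbool.h>

static void
match_items_sketch(int nitems1, int nitems2, const bool *conditions2)
{
    for (int i = 0; i < nitems1; i++)
    {
        int     nkept = 0;

        for (int j = 0; j < nitems2; j++)
        {
            /* skip items eliminated by restrictions on rel2 */
            if (conditions2 && !conditions2[j])
                continue;

            nkept++;

            /* ... match item i against item j here ... */
        }

        /* nothing survived the filter, later iterations cannot match either */
        if (nkept == 0)
            break;
    }
}
==================================================================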
Cheers
Hi,

Here's a slightly improved / cleaned up version of the PoC patch,
removing a bunch of XXXs and FIXMEs, adding comments, etc.

The approach is sound in principle, I think, although there's still a
bunch of things to address:

1) statext_compare_mcvs only really deals with equijoins / inner joins
at the moment, as it's based on eqjoinsel_inner. It's probably desirable
to add support for additional join types (inequality and outer joins).

2) Some of the steps are performed multiple times - e.g. matching base
restrictions to statistics, etc. Those probably can be cached somehow,
to reduce the overhead.

3) The logic of picking the statistics to apply is somewhat simplistic,
and maybe could be improved in some way. OTOH the number of candidate
statistics is likely low, so this is not a big issue.

4) statext_compare_mcvs is based on eqjoinsel_inner and makes a bunch of
assumptions similar to the original, but some of those assumptions may
be wrong in the multi-column case, particularly when working with a
subset of columns. For example, (ndistinct - size(MCV)) may not be the
number of distinct combinations outside the MCV when ignoring some
columns. The same applies to nullfrac, and so on. I'm not sure we can do
much more than pick some reasonable approximation.

5) There are no regression tests at the moment. Clearly a gap.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
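To illustrate point 4 above: a much simplified, eqjoinsel_inner-style
estimate for the part of the data not covered by the MCV lists might look
roughly like the sketch below (invented names, not the patch code). The
caveat is that with multi-column statistics, ndistinct counts distinct
combinations of all columns covered by the statistics object, so
(ndistinct - nmcv) can easily overestimate the number of distinct
combinations of just the join columns.

==================================================================
/*
 * Rough approximation only: spread the rows not covered by the MCVs
 * uniformly over the distinct combinations not covered by the MCVs.
 */
static double
sketch_non_mcv_sel(double mcvfreq1, double nullfrac1, double nd1, double nmcv1,
                   double mcvfreq2, double nullfrac2, double nd2, double nmcv2)
{
    double  otherfreq1 = 1.0 - mcvfreq1 - nullfrac1;
    double  otherfreq2 = 1.0 - mcvfreq2 - nullfrac2;
    double  otherdistinct1 = nd1 - nmcv1;
    double  otherdistinct2 = nd2 - nmcv2;
    double  otherdistinct = (otherdistinct1 > otherdistinct2) ? otherdistinct1
                                                              : otherdistinct2;

    if (otherdistinct < 1.0)
        otherdistinct = 1.0;

    return (otherfreq1 * otherfreq2) / otherdistinct;
}
==================================================================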
Hi,

attached is an improved version of this patch, addressing some of the
points mentioned in my last message:

1) Adds a couple of regression tests, testing various join cases with
expressions, additional conditions, etc.

2) Adds support for expressions, so the join clauses don't need to
reference just simple columns. So e.g. this can benefit from extended
statistics, when defined on the expressions:

  -- CREATE STATISTICS s1 ON (a+1), b FROM t1;
  -- CREATE STATISTICS s2 ON (a+1), b FROM t2;

  SELECT * FROM t1 JOIN t2 ON ((t1.a + 1) = (t2.a + 1) AND t1.b = t2.b);

3) Can combine extended statistics and regular (per-column) statistics.
The previous version required an extended statistics MCV on both sides of
the join, but adding extended statistics on both sides may be impractical
(e.g. if one side does not have correlated columns it's silly to have to
add it just to make this patch happy).

For example, you may have extended stats on the dimension table but not
the fact table, and the patch can still combine those two. Of course, if
there's no MCV on either side, we can't do much.

So this patch works when both sides have an extended statistics MCV, or
when one side has an extended MCV and the other side a regular MCV. It
might seem silly, but the extended MCV allows considering additional
baserel conditions (if there are any).


examples
========

The table / data is very simple, but hopefully good enough for some
simple examples.

  create table t1 (a int, b int, c int);
  create table t2 (a int, b int, c int);

  insert into t1 select mod(i,50), mod(i,50), mod(i,50)
    from generate_series(1,1000) s(i);
  insert into t2 select mod(i,50), mod(i,50), mod(i,50)
    from generate_series(1,1000) s(i);

  analyze t1, t2;

First, without extended stats (just the first line of explain analyze,
to keep the message short):

  explain analyze select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b);

  Hash Join  (cost=31.00..106.00 rows=400 width=24)
             (actual time=5.426..22.678 rows=20000 loops=1)

  explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c < 25;

  Hash Join  (cost=28.50..160.75 rows=10000 width=24)
             (actual time=5.325..19.760 rows=10000 loops=1)

  explain analyze select * from t1 join t2 on (t1.a = t2.a)
    where t1.c < 25 and t2.c > 25;

  Hash Join  (cost=24.50..104.75 rows=4800 width=24)
             (actual time=5.618..5.639 rows=0 loops=1)

Now, let's create statistics:

  create statistics s1 on a, b, c from t1;
  create statistics s2 on a, b, c from t2;
  analyze t1, t2;

and now the same queries again:

  explain analyze select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b);

  Hash Join  (cost=31.00..106.00 rows=20000 width=24)
             (actual time=5.448..22.713 rows=20000 loops=1)

  explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c < 25;

  Hash Join  (cost=28.50..160.75 rows=10000 width=24)
             (actual time=5.317..16.680 rows=10000 loops=1)

  explain analyze select * from t1 join t2 on (t1.a = t2.a)
    where t1.c < 25 and t2.c > 25;

  Hash Join  (cost=24.50..104.75 rows=1 width=24)
             (actual time=5.647..5.667 rows=0 loops=1)

Those examples are a bit simplistic, but the improvements are fairly
clear, I think.
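As an aside, the "additional baserel conditions" mentioned in point 3 are
what makes the t1.c < 25 / t2.c > 25 examples work: conceptually, the
per-relation filters are applied to the extended MCV items first, and only
the surviving items take part in the join matching, which turns the result
into a conditional probability. A toy sketch (invented names and types,
not the patch code):

==================================================================
#include <stdbool.h>

typedef struct SketchItem
{
    int     a;              /* join column */
    int     c;              /* column referenced by the baserel filter */
    double  frequency;
} SketchItem;

typedef bool (*SketchCondition) (const SketchItem *item);

/* mark which MCV items survive the baserel condition (e.g. "c < 25") */
static void
sketch_apply_condition(const SketchItem *items, int nitems,
                       SketchCondition cond, bool *survives)
{
    for (int i = 0; i < nitems; i++)
        survives[i] = (cond == NULL) || cond(&items[i]);
}

/* match only the surviving items on the join column */
static double
sketch_conditional_join_sel(const SketchItem *mcv1, const bool *keep1, int n1,
                            const SketchItem *mcv2, const bool *keep2, int n2)
{
    double  matchfreq = 0.0;

    for (int i = 0; i < n1; i++)
    {
        if (!keep1[i])
            continue;

        for (int j = 0; j < n2; j++)
        {
            if (!keep2[j])
                continue;

            if (mcv1[i].a == mcv2[j].a)
                matchfreq += mcv1[i].frequency * mcv2[j].frequency;
        }
    }

    return matchfreq;
}
==================================================================

With perfectly correlated columns, contradictory filters on the two sides
leave no overlapping items, which is why the last example drops from
rows=4800 to rows=1.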
limitations & open issues
=========================

Let's talk about the main general restrictions and open issues in the
current patch that I can think of at the moment.

1) statistics covering all join clauses

The patch requires the statistics to cover all the join clauses, mostly
because it simplifies the implementation. This means that to use the
per-column statistics, there has to be just a single join clause.

AFAICS this could be relaxed and we could use multiple statistics to
estimate the clauses. But it'd make the selection of statistics much more
complicated, because we have to pick "matching" statistics on both sides
of the join. So it seems like overkill, and most joins have very few
conditions anyway.

2) only equality join clauses

The other restriction is that at the moment this only supports simple
equality clauses, combined with AND. So for example this is supported:

  t1 JOIN t2 ON ((t1.a = t2.a) AND (t1.b + 2 = t2.b + 1))

while these are not:

  t1 JOIN t2 ON ((t1.a = t2.a) OR (t1.b + 2 = t2.b + 1))

  t1 JOIN t2 ON ((t1.a - t2.a = 0) AND (t1.b + 2 = t2.b + 1))

  t1 JOIN t2 ON ((t1.a = t2.a) AND ((t1.b = t2.b) OR (t1.c = t2.c)))

I'm not entirely sure these restrictions can be relaxed. It's not that
difficult to evaluate these cases when matching items between the MCV
lists, similarly to how we evaluate bitmaps for baserel estimates (see
the sketch after this message).

But I'm not sure what to do about the part not covered by the MCV lists.
The eqjoinsel() approach uses ndistinct estimates for that, but that
only works for AND clauses, I think. How would that work for OR?

Similarly, I'm not sure we can do much for non-equality conditions, but
those are currently estimated using default selectivities in selfuncs.c.

3) estimation by join pairs

At the moment, the estimates are calculated for pairs of relations, so
for example given a query

  explain analyze
  select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
                  join t3 on (t1.b = t3.b and t2.c = t3.c);

we'll estimate the first join (t1,t2) just fine, but then the second
join actually combines (t1,t2,t3). What the patch currently does is
split it into (t1,t2) and (t2,t3) and estimate those. I wonder if this
should actually combine all three MCVs at once - we're pretty much
combining the MCVs into one large MCV representing the join result.

But I haven't done that yet, as it requires the MCVs to be combined
using the join clauses (overlap in a way), and I'm not sure how likely
that is in practice. In the example it could help, but that's a bit of
an artificial example.

4) still just inner equi-joins

I haven't done any work on extending this to outer joins etc. Adding
outer and semi joins should not be complicated, mostly copying and
tweaking what eqjoinsel() does.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
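Regarding limitation 2 above (only equality clauses combined with AND),
here is a rough sketch of what evaluating a more general clause tree for a
pair of MCV items could look like, analogous to the bitmaps built for
baserel estimates. The types and names are invented for illustration; the
open problem of the part not covered by the MCV lists is untouched here.

==================================================================
#include <stdbool.h>

/* toy clause tree: a leaf equality (col1 = col2) or an AND/OR node */
typedef enum { CLAUSE_EQ, CLAUSE_AND, CLAUSE_OR } SketchClauseType;

typedef struct SketchClause
{
    SketchClauseType type;
    int         col1;       /* leaf: column index on side 1 */
    int         col2;       /* leaf: column index on side 2 */
    struct SketchClause *left;
    struct SketchClause *right;
} SketchClause;

/* evaluate the clause tree for one pair of MCV items */
static bool
sketch_clause_matches(const SketchClause *c,
                      const int *values1, const int *values2)
{
    switch (c->type)
    {
        case CLAUSE_EQ:
            return values1[c->col1] == values2[c->col2];
        case CLAUSE_AND:
            return sketch_clause_matches(c->left, values1, values2) &&
                   sketch_clause_matches(c->right, values1, values2);
        case CLAUSE_OR:
            return sketch_clause_matches(c->left, values1, values2) ||
                   sketch_clause_matches(c->right, values1, values2);
    }
    return false;
}
==================================================================

Summing the frequency products over all matching item pairs would give the
MCV part of the selectivity, just as in the equality-only case.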
Hi,
+ conditions2 = statext_determine_join_restrictions(root, rel, mcv);
+
+ /* if the new statistics covers more conditions, use it */
+ if (list_length(conditions2) > list_length(conditions1))
+ {
+ mcv = stat;
It seems conditions2 is calculated using mcv, I wonder why mcv is replaced by stat (for conditions1 whose length is shorter) ?
Cheers
On 10/6/21 23:03, Zhihong Yu wrote:
> Hi,
>
> +     conditions2 = statext_determine_join_restrictions(root, rel, mcv);
> +
> +     /* if the new statistics covers more conditions, use it */
> +     if (list_length(conditions2) > list_length(conditions1))
> +     {
> +         mcv = stat;
>
> It seems conditions2 is calculated using mcv, I wonder why mcv is
> replaced by stat (for conditions1 whose length is shorter) ?
>

Yeah, that's wrong - it should be the other way around, i.e.

    if (list_length(conditions1) > list_length(conditions2))

There's no test with multiple candidate statistics yet, so this went
unnoticed :-/

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
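For context, a toy sketch of the selection logic being discussed - the
intent stated in the code comment is "keep the statistics whose MCV covers
the most conditions", which (with invented types, not the patch code)
amounts to something like:

==================================================================
typedef struct SketchStat
{
    const char *name;
    int         nconditions;    /* baserel conditions covered by its MCV */
} SketchStat;

static const SketchStat *
sketch_choose_mcv(const SketchStat *stats, int nstats)
{
    const SketchStat *best = NULL;

    for (int i = 0; i < nstats; i++)
    {
        /* if the new statistics covers more conditions, use it */
        if (best == NULL || stats[i].nconditions > best->nconditions)
            best = &stats[i];
    }

    return best;
}
==================================================================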
Hi Tomas:

This is the exact patch I want, thanks for the patch!

On Thu, Oct 7, 2021 at 3:33 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

> 3) estimation by join pairs
>
> At the moment, the estimates are calculated for pairs of relations, so
> for example given a query
>
> explain analyze
> select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
> join t3 on (t1.b = t3.b and t2.c = t3.c);
>
> we'll estimate the first join (t1,t2) just fine, but then the second
> join actually combines (t1,t2,t3). What the patch currently does is it
> splits it into (t1,t2) and (t2,t3) and estimates those.

Actually, I can't understand how this works even for a simpler example.
Let's say we query like this (and ONLY use t2's columns to join t3):

  select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
                  join t3 on (t2.c = t3.c and t2.d = t3.d);

Then it works well on JoinRel(t1, t2) AND JoinRel(t2, t3). But when it
comes to JoinRel(t1, t2, t3), we didn't maintain the MCV on the join rel,
so it is hard to make it work. Here I see your solution is splitting it
into (t1, t2) AND (t2, t3) and estimating those. But how does this help to
estimate the size of JoinRel(t1, t2, t3)?

> I wonder if this
> should actually combine all three MCVs at once - we're pretty much
> combining the MCVs into one large MCV representing the join result.
>

I guess we can keep the MCVs on the joinrel for these matches. Take the
above query I provided for example, and suppose the MCV data is as below:

  t1(a, b)
  (1, 2)  -> 0.1
  (1, 3)  -> 0.2
  (2, 3)  -> 0.5
  (2, 8)  -> 0.1

  t2(a, b)
  (1, 2)  -> 0.2
  (1, 3)  -> 0.1
  (2, 4)  -> 0.2
  (2, 10) -> 0.1

After t1.a = t2.a AND t1.b = t2.b, we can build the MCV as below:

  (1, 2, 1, 2) -> 0.1 * 0.2
  (1, 3, 1, 3) -> 0.2 * 0.1

and record the total MCV frequency as

  (0.1 + 0.2 + 0.5 + 0.1) * (0.2 + 0.1 + 0.2 + 0.1)

With this design, the nitems of the MCV on the joinrel would be less than
either of the baserels. And since we handle the eqjoin as well, we can
even record the items as

  (1, 2) -> 0.1 * 0.2
  (1, 3) -> 0.2 * 0.1

About when we should maintain the JoinRel's MCV data: rather than
maintaining it just after the JoinRel size is estimated, we could estimate
it only when it is needed. For example, in

  select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
                  join t3 on (t2.c = t3.c and t2.d = t3.d);

we don't need to maintain the MCV on (t1, t2, t3) since nothing else needs
it at all. However, I haven't checked the code closely enough to see
whether lazily computing the MCV on a joinrel is convenient to do.

> But I haven't done that yet, as it requires the MCVs to be combined
> using the join clauses (overlap in a way), but I'm not sure how likely
> that is in practice. In the example it could help, but that's a bit
> artificial example.
>
>
> 4) still just inner equi-joins
>
> I haven't done any work on extending this to outer joins etc. Adding
> outer and semi joins should not be complicated, mostly copying and
> tweaking what eqjoinsel() does.
>

Overall, thanks for the feature and I expect there are more cases to
handle during the discussion. To make the review process more efficient, I
suggest that we split the patch into smaller ones and review/commit them
separately once we have roughly finalized the design. For example:

  Patch 1 -- require both sides to have extended statistics.
  Patch 2 -- require one side to have extended statistics and the other
             side to have a per-column MCV.
  Patch 3 -- handle cases like WHERE t1.a = t2.a and t1.b = Const;
  Patch 4 -- handle the case for 3+ table joins.
  Patch 5 -- support outer joins.
I think we can do this if we are sure that each individual patch would
work in some cases and would not make any other case worse. If you agree
with this, I can do that splitting work during my review process.

--
Best Regards
Andy Fan  (https://www.aliyun.com/)
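A toy sketch of the join-level MCV construction described above (invented
types, two-column items, equality on both columns; error handling and
freeing the result are omitted), just to illustrate the bookkeeping of
matched items and the total frequency:

==================================================================
#include <stdlib.h>

typedef struct SketchItem
{
    int     a;
    int     b;
    double  frequency;
} SketchItem;

typedef struct SketchJoinMCV
{
    SketchItem *items;
    int         nitems;
    double      totalfreq;      /* totalfreq1 * totalfreq2 */
} SketchJoinMCV;

static SketchJoinMCV
sketch_build_join_mcv(const SketchItem *mcv1, int n1, double totalfreq1,
                      const SketchItem *mcv2, int n2, double totalfreq2)
{
    SketchJoinMCV result;

    /* worst case: every pair of items matches */
    result.items = malloc(sizeof(SketchItem) * n1 * n2);
    result.nitems = 0;
    result.totalfreq = totalfreq1 * totalfreq2;

    for (int i = 0; i < n1; i++)
        for (int j = 0; j < n2; j++)
            if (mcv1[i].a == mcv2[j].a && mcv1[i].b == mcv2[j].b)
            {
                SketchItem *it = &result.items[result.nitems++];

                it->a = mcv1[i].a;
                it->b = mcv1[i].b;
                it->frequency = mcv1[i].frequency * mcv2[j].frequency;
            }

    return result;
}
==================================================================

(As discussed below, with clauses that do not pin down all columns the
number of matched items can actually grow rather than shrink.)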
Your regression tests include two errors, which appear to be accidental,
and fixing the error shows that this case is being estimated poorly.

+-- try combining with single-column (and single-expression) statistics
+DROP STATISTICS join_test_2;
+ERROR: statistics object "join_test_2" does not exist
...
+ERROR: statistics object "join_stats_2" already exists

--
Justin
On 11/22/21 02:23, Justin Pryzby wrote:
> Your regression tests include two errors, which appear to be accidental, and
> fixing the error shows that this case is being estimated poorly.
>
> +-- try combining with single-column (and single-expression) statistics
> +DROP STATISTICS join_test_2;
> +ERROR: statistics object "join_test_2" does not exist
> ...
> +ERROR: statistics object "join_stats_2" already exists
>

D'oh, what a silly mistake ...

You're right that fixing the DROP STATISTICS results in a worse estimate,
but that's actually expected, for a fairly simple reason. The join
condition has expressions on both sides, and dropping the statistics means
we don't have any MCV for the join_test_2 side. So the optimizer ends up
not using the new estimates, as if there were no extended stats.

A couple of lines later the script creates extended statistics on that
expression alone, which fixes this. An expression index would do the trick
too.

Attached is a patch fixing the test and also the issue reported by Zhihong
Yu some time ago.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On 11/6/21 11:03, Andy Fan wrote:
> Hi Tomas:
>
> This is the exact patch I want, thanks for the patch!
>

Good to hear.

> On Thu, Oct 7, 2021 at 3:33 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>
>> 3) estimation by join pairs
>>
>> At the moment, the estimates are calculated for pairs of relations, so
>> for example given a query
>>
>> explain analyze
>> select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
>> join t3 on (t1.b = t3.b and t2.c = t3.c);
>>
>> we'll estimate the first join (t1,t2) just fine, but then the second
>> join actually combines (t1,t2,t3). What the patch currently does is it
>> splits it into (t1,t2) and (t2,t3) and estimates those.
>
> Actually I can't understand how this works even for a simpler example.
> let's say we query like this (ONLY use t2's column to join t3).
>
> select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
> join t3 on (t2.c = t3.c and t2.d = t3.d);
>
> Then it works well on JoinRel(t1, t2) AND JoinRel(t2, t3). But when comes
> to JoinRel(t1, t2, t3), we didn't maintain the MCV on join rel, so it
> is hard to work. Here I see your solution is splitting it into (t1, t2)
> AND (t2, t3) and estimate those. But how does this help to estimate the
> size of JoinRel(t1, t2, t3)?
>

Yeah, this is really confusing. The crucial thing to keep in mind is that
this works with clauses before running setrefs.c, so the clauses reference
the original relations - *not* the join relation. Otherwise even the
regular estimation would not work, because where would it get the
per-column MCV lists etc.?

Let's use a simple case with join clauses referencing just a single
attribute for each pair of relations, and let's talk about how many join
pairs it'll extract:

  t1 JOIN t2 ON (t1.a = t2.a) JOIN t3 ON (t1.b = t3.b)

=> first we join t1/t2, which is 1 join pair (t1,t2)

=> then we join t1/t2/t3, but the join clause references just 2 rels,
   so 1 join pair (t1,t3)

Now a more complicated case, with a more complex join clause:

  t1 JOIN t2 ON (t1.a = t2.a) JOIN t3 ON (t1.b = t3.b AND t2.c = t3.c)

=> first we join t1/t2, which is 1 join pair (t1,t2)

=> then we join t1/t2/t3, but this time the join clause references all
   three rels, so we have two join pairs (t1,t3) and (t2,t3) and we can
   use all the statistics.

(A rough sketch of this pair extraction follows at the end of this
message.)

>> I wonder if this
>> should actually combine all three MCVs at once - we're pretty much
>> combining the MCVs into one large MCV representing the join result.
>>
>
> I guess we can keep the MCVs on joinrel for these matches. Take the above
> query I provided for example, and suppose the MCV data as below:
>
> t1(a, b)
> (1, 2) -> 0.1
> (1, 3) -> 0.2
> (2, 3) -> 0.5
> (2, 8) -> 0.1
>
> t2(a, b)
> (1, 2) -> 0.2
> (1, 3) -> 0.1
> (2, 4) -> 0.2
> (2, 10) -> 0.1
>
> After t1.a = t2.a AND t1.b = t2.b, we can build the MCV as below
>
> (1, 2, 1, 2) -> 0.1 * 0.2
> (1, 3, 1, 3) -> 0.2 * 0.1
>
> And recording the total mcv frequence as (0.1 + 0.2 + 0.5 + 0.1) *
> (0.2 + 0.1 + 0.2 + 0.1)
>

Right, that's about the joint distribution of the whole join.

> With this design, the nitems of MCV on joinrel would be less than
> either of baserel.
>

Actually, I think the number of items can grow, because the matches may
duplicate some items. For example, in your example with (t1.a = t2.a) the
first (1,2) item in t1 matches (1,2) and (1,3) in t2. And the same for
(1,3) in t1. So that's 4 combinations. Of course, we could aggregate the
MCV by ignoring columns not used in the query.
> and since we handle the eqjoin as well, we even can record the items as
>
> (1, 2) -> 0.1 * 0.2
> (1, 3) -> 0.2 * 0.1;
>
> About when we should maintain the JoinRel's MCV data, rather than
> maintain this just after the JoinRel size is estimated, we can only
> estimate it when it is needed. for example:
>
> select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
> join t3 on (t2.c = t3.c and t2.d = t3.d);
>
> we don't need to maintain the MCV on (t1, t2, t3) since no others
> need it at all. However I don't check code too closely to see if it
> (Lazing computing MVC on joinrel) is convenient to do.
>

I'm not sure I understand what you're proposing here. However, I think
that estimating it for pairs has two advantages:

1) Combining MCVs for k relations requires k nested loops. Processing 2
relations at a time limits the amount of CPU we need. Of course, this
assumes the joins are independent, which may or may not be true.

2) It seems fairly easy to combine different types of statistics
(regular, extended, ...), and also to consider the part not represented
by the MCV. That seems much harder when joining more than 2 relations.

I'm also worried about amplification of errors - I suspect attempting to
build the joint MCV for the whole join relation may produce significant
estimation errors.

Furthermore, I think joins with clauses referencing more than just two
relations are fairly uncommon. And we can always improve the feature in
this direction in the future.

>
>> But I haven't done that yet, as it requires the MCVs to be combined
>> using the join clauses (overlap in a way), but I'm not sure how likely
>> that is in practice. In the example it could help, but that's a bit
>> artificial example.
>>
>>
>> 4) still just inner equi-joins
>>
>> I haven't done any work on extending this to outer joins etc. Adding
>> outer and semi joins should not be complicated, mostly copying and
>> tweaking what eqjoinsel() does.
>>
>
> Overall, thanks for the feature and I am expecting there are more cases
> to handle during discussion. To make the review process more efficient,
> I suggest that we split the patch into smaller ones and review/commit
> them separately if we have finalized the design roughly. For example:
>
> Patch 1 -- required both sides to have extended statistics.
> Patch 2 -- required one side to have extended statistics and the other
>            side had per-column MCV.
> Patch 3 -- handle the case like WHERE t1.a = t2.a and t1.b = Const;
> Patch 4 -- handle the case for 3+ table joins.
> Patch 5 -- supports the outer join.
>
> I think we can do this if we are sure that each individual patch would
> work in some cases and would not make any other case worse. If you agree
> with this, I can do that splitting work during my review process.
>

I'll consider splitting it like this, but I'm not sure it makes the main
patch that much smaller.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
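Here is a rough, self-contained sketch of the join-pair extraction
described in the message above (the types are invented; the real code
works with RestrictInfo clauses and relid bitmapsets): each join clause
contributes the unordered pair of base relations it references, and
clauses referencing the same pair are grouped so the matching statistics
can be applied together.

==================================================================
#include <stdbool.h>

#define MAX_PAIRS 16

/* toy representation of a join clause between two base relations */
typedef struct SketchJoinClause
{
    int     relid1;
    int     relid2;
} SketchJoinClause;

typedef struct SketchJoinPair
{
    int     relid1;
    int     relid2;
    int     nclauses;
} SketchJoinPair;

static int
sketch_extract_join_pairs(const SketchJoinClause *clauses, int nclauses,
                          SketchJoinPair *pairs)
{
    int     npairs = 0;

    for (int i = 0; i < nclauses; i++)
    {
        int     r1 = (clauses[i].relid1 < clauses[i].relid2) ?
                     clauses[i].relid1 : clauses[i].relid2;
        int     r2 = (clauses[i].relid1 < clauses[i].relid2) ?
                     clauses[i].relid2 : clauses[i].relid1;
        bool    found = false;

        for (int j = 0; j < npairs; j++)
        {
            if (pairs[j].relid1 == r1 && pairs[j].relid2 == r2)
            {
                pairs[j].nclauses++;
                found = true;
                break;
            }
        }

        if (!found && npairs < MAX_PAIRS)
        {
            pairs[npairs].relid1 = r1;
            pairs[npairs].relid2 = r2;
            pairs[npairs].nclauses = 1;
            npairs++;
        }
    }

    return npairs;
}
==================================================================

For the ON (t1.b = t3.b AND t2.c = t3.c) example this yields the two pairs
(t1,t3) and (t2,t3), each of which is then estimated separately.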
Hi,

Here's an updated patch, rebased and fixing a couple typos reported by
Justin Pryzby directly.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On 2022-01-01 18:21:06 +0100, Tomas Vondra wrote:
> Here's an updated patch, rebased and fixing a couple typos reported by
> Justin Pryzby directly.

FWIW, cfbot reports a few compiler warnings:
https://cirrus-ci.com/task/6067262669979648?logs=gcc_warning#L505

[18:52:15.132] time make -s -j${BUILD_JOBS} world-bin
[18:52:22.697] mcv.c: In function ‘mcv_combine_simple’:
[18:52:22.697] mcv.c:2787:7: error: ‘reverse’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
[18:52:22.697]  2787 |   if (reverse)
[18:52:22.697]       |      ^
[18:52:22.697] mcv.c:2766:27: error: ‘index’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
[18:52:22.697]  2766 |   if (mcv->items[i].isnull[index])
[18:52:22.697]       |                           ^

Greetings,

Andres Freund
Hi,

On Tue, Jan 04, 2022 at 03:55:50PM -0800, Andres Freund wrote:
> On 2022-01-01 18:21:06 +0100, Tomas Vondra wrote:
> > Here's an updated patch, rebased and fixing a couple typos reported by
> > Justin Pryzby directly.
>
> FWIW, cfbot reports a few compiler warnings:

Also the patch doesn't apply anymore:

http://cfbot.cputube.org/patch_36_3055.log

=== Applying patches on top of PostgreSQL commit ID 74527c3e022d3ace648340b79a6ddec3419f6732 ===
=== applying patch ./0001-Estimate-joins-using-extended-statistics-20220101.patch
patching file src/backend/optimizer/path/clausesel.c
patching file src/backend/statistics/extended_stats.c
Hunk #1 FAILED at 30.
Hunk #2 succeeded at 102 (offset 1 line).
Hunk #3 succeeded at 2619 (offset 9 lines).
1 out of 3 hunks FAILED -- saving rejects to file src/backend/statistics/extended_stats.c.rej
On Wed, Jan 19, 2022 at 06:18:09PM +0800, Julien Rouhaud wrote:
> On Tue, Jan 04, 2022 at 03:55:50PM -0800, Andres Freund wrote:
> > On 2022-01-01 18:21:06 +0100, Tomas Vondra wrote:
> > > Here's an updated patch, rebased and fixing a couple typos reported by
> > > Justin Pryzby directly.
> >
> > FWIW, cfbot reports a few compiler warnings:
>
> Also the patch doesn't apply anymore:
>
> http://cfbot.cputube.org/patch_36_3055.log
> === Applying patches on top of PostgreSQL commit ID 74527c3e022d3ace648340b79a6ddec3419f6732 ===
> === applying patch ./0001-Estimate-joins-using-extended-statistics-20220101.patch
> patching file src/backend/optimizer/path/clausesel.c
> patching file src/backend/statistics/extended_stats.c
> Hunk #1 FAILED at 30.
> Hunk #2 succeeded at 102 (offset 1 line).
> Hunk #3 succeeded at 2619 (offset 9 lines).
> 1 out of 3 hunks FAILED -- saving rejects to file src/backend/statistics/extended_stats.c.rej

Rebased over 269b532ae and muted compiler warnings.

Tomas - is this patch viable for pg15, or should it move to the next CF?

In case it's useful, I ran this on cirrus with my branch for code coverage.

https://cirrus-ci.com/task/5816731397521408
https://api.cirrus-ci.com/v1/artifact/task/5816731397521408/coverage/coverage/00-index.html

statext_find_matching_mcv() has poor coverage.
statext_clauselist_join_selectivity() has poor coverage for the "stats2" case.

In mcv.c:
mcv_combine_extended() and mcv_combine_simple() have poor coverage for the
"else if" cases (does it matter?)

Not related to this patch:

build_attnums_array() isn't being hit.
Same at statext_is_compatible_clause_internal()
 1538    0 :     *exprs = lappend(*exprs, clause);

statext_mcv_[de]serialize() aren't being hit for cstrings.

--
Justin
On Wed, Mar 02, 2022 at 11:38:21AM -0600, Justin Pryzby wrote:
> Rebased over 269b532ae and muted compiler warnings.

And attached.
Attachment
> On Wed, Mar 02, 2022 at 11:38:21AM -0600, Justin Pryzby wrote:
>> Rebased over 269b532ae and muted compiler warnings.

Thank you Justin for the rebase!

Hello Tomas,

Thanks for the patch! Before I review the patch at the code level, I want
to explain my understanding of this patch first.

Before this patch, we already use MCV information for eqjoinsel: it works
by combining the MCVs on both sides to figure out the mcv_freq and then
treating the rest equally. But this doesn't work for MCVs in extended
statistics, and this patch fills that gap. Besides that, since extended
statistics mean more than 1 column is involved, if 1+ columns are Const
based on RestrictInfo, we can use such information to filter the MCVs we
are interested in. That's really cool.

I did some more testing, all of it inner joins so far; all of it works
amazingly well and I am surprised this patch didn't draw more attention.
I will test more after I go through the code.

As for the code level, I reviewed it in a top-down manner and am almost
40% complete. Here are some findings, just FYI. For efficiency purposes, I
provide each piece of feedback as an individual commit; after all, I want
to make sure my comments are practical, and coding and testing is a good
way to achieve that. I tried to make each of them as small as possible so
that you can reject or accept them conveniently.

0001 is your patch, I just rebased it against the current master. 0006 is
not very relevant to the current patch, and I think it could be committed
individually if you are OK with that.

Hope this kind of review is helpful.

--
Best Regards
Andy Fan
Attachment
- v1-0001-Estimate-joins-using-extended-statistics.patch
- v1-0002-Remove-estimiatedcluases-and-varRelid-arguments.patch
- v1-0003-Remove-SpecialJoinInfo-sjinfo-argument.patch
- v1-0004-Remove-joinType-argument.patch
- v1-0005-use-the-pre-calculated-RestrictInfo-left-right_re.patch
- v1-0006-Fast-path-for-general-clauselist_selectivity.patch
- v1-0007-bms_is_empty-is-more-effective-than-bms_num_membe.patch
- v1-0008-a-branch-of-updates-around-JoinPairInfo.patch
On 4/2/24 10:23, Andy Fan wrote:
>
>> On Wed, Mar 02, 2022 at 11:38:21AM -0600, Justin Pryzby wrote:
>>> Rebased over 269b532ae and muted compiler warnings.
>
> Thank you Justin for the rebase!
>
> Hello Tomas,
>
> Thanks for the patch! Before I review the path at the code level, I want
> to explain my understanding about this patch first.
>

If you want to work on this patch, that'd be cool. A review would be
great, but if you want to maybe take over and try moving it forward,
that'd be even better. I don't know when I'll have time to work on it
again, but I'd promise to help you with working on it.

> Before this patch, we already use MCV information for the eqjoinsel, it
> works as combine the MCV on the both sides to figure out the mcv_freq
> and then treat the rest equally, but this doesn't work for MCV in
> extended statistics, this patch fill this gap. Besides that, since
> extended statistics means more than 1 columns are involved, if 1+
> columns are Const based on RestrictInfo, we can use such information to
> filter the MCVs we are interesting, that's really cool.
>

Yes, I think that's an accurate description of what the patch does.

> I did some more testing, all of them are inner join so far, all of them
> works amazing and I am suprised this patch didn't draw enough
> attention. I will test more after I go though the code.
>

I think it didn't go forward for a bunch of reasons:

1) I got distracted by something else requiring immediate attention, and
forgot about this patch.

2) I got stuck on some detail of the patch, unsure which of the possible
solutions to try first.

3) Uncertainty about how applicable the patch is in practice.

I suppose it was some combination of these reasons, not sure.

As for the "practicality" mentioned in (3), it's been a while since I
worked on the patch so I don't recall the details, but I think I've been
thinking mostly about "star join" queries, where a big "fact" table joins
to small dimensions. And in that case the fact table may have an MCV, but
the dimensions certainly don't have any (because the join happens on a
PK).

But maybe that's a wrong way to think about it - it was clearly useful
to consider the case with (per-attribute) MCVs on both sides as worth
special handling. So why not do that for multi-column MCVs, right?

> At for the code level, I reviewed them in the top-down manner and almost
> 40% completed. Here are some findings just FYI. For efficiency purpose,
> I provide each feedback with a individual commit, after all I want to
> make sure my comment is practical and coding and testing is a good way
> to archive that. I tried to make each of them as small as possible so
> that you can reject or accept them convinently.
>
> 0001 is your patch, I just rebase them against the current master. 0006
> is not much relevant with current patch, and I think it can be committed
> individually if you are OK with that.
>
> Hope this kind of review is helpful.
>

Cool! There's obviously no chance to get this into v18, and I have stuff
to do in this CF. But I'll take a look after that.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

> On 4/2/24 10:23, Andy Fan wrote:
>>
>>> On Wed, Mar 02, 2022 at 11:38:21AM -0600, Justin Pryzby wrote:
>>>> Rebased over 269b532ae and muted compiler warnings.
>>
>> Thank you Justin for the rebase!
>>
>> Hello Tomas,
>>
>> Thanks for the patch! Before I review the path at the code level, I want
>> to explain my understanding about this patch first.
>>
>
> If you want to work on this patch, that'd be cool. A review would be
> great, but if you want to maybe take over and try moving it forward,
> that'd be even better. I don't know when I'll have time to work on it
> again, but I'd promise to help you with working on it.

OK, I'll try to move it forward.

>> Before this patch, we already use MCV information for the eqjoinsel, it
>> works as combine the MCV on the both sides to figure out the mcv_freq
>> and then treat the rest equally, but this doesn't work for MCV in
>> extended statistics, this patch fill this gap. Besides that, since
>> extended statistics means more than 1 columns are involved, if 1+
>> columns are Const based on RestrictInfo, we can use such information to
>> filter the MCVs we are interesting, that's really cool.
>>
>
> Yes, I think that's an accurate description of what the patch does.

Great to know that :)

>> I did some more testing, all of them are inner join so far, all of them
>> works amazing and I am suprised this patch didn't draw enough
>> attention.

> I think it didn't go forward for a bunch of reasons:
> ..
>
> 3) Uncertainty about how applicable the patch is in practice.
>
> I suppose it was some combination of these reasons, not sure.
>
> As for the "practicality" mentioned in (3), it's been a while since I
> worked on the patch so I don't recall the details, but I think I've been
> thinking mostly about "star join" queries, where a big "fact" table
> joins to small dimensions. And in that case the fact table may have a
> MCV, but the dimensions certainly don't have any (because the join
> happens on a PK).
>
> But maybe that's a wrong way to think about it - it was clearly useful
> to consider the case with (per-attribute) MCVs on both sides as worth
> special handling. So why not to do that for multi-column MCVs, right?

Yes, that's what my current understanding is. There are some cases where
there are 2+ clauses between two tables AND the row estimation is bad AND
the plan is not the best one. In such situations, I'd think this patch
would probably be helpful. The current case in hand is on PG11, where
there is no MCV information for extended statistics, so I can't even
verify manually whether the patch would be useful there. When I see such
cases next time on a newer version of PG, I can verify manually whether
the row estimation gets better.

>> At for the code level, I reviewed them in the top-down manner and almost
>> 40% completed. Here are some findings just FYI. For efficiency purpose,
>> I provide each feedback with a individual commit, after all I want to
>> make sure my comment is practical and coding and testing is a good way
>> to archive that. I tried to make each of them as small as possible so
>> that you can reject or accept them convinently.
>>
>> 0001 is your patch, I just rebase them against the current master. 0006
>> is not much relevant with current patch, and I think it can be committed
>> individually if you are OK with that.
>>
>> Hope this kind of review is helpful.
>>
>
> Cool! There's obviously no chance to get this into v18, and I have stuff
> to do in this CF. But I'll take a look after that.

Good to know that. I will continue my work before that.

--
Best Regards
Andy Fan
On Tue, Apr 02, 2024 at 04:23:45PM +0800, Andy Fan wrote:
>
> 0001 is your patch, I just rebase them against the current master. 0006
> is not much relevant with current patch, and I think it can be committed
> individually if you are OK with that.

Your 002 should also remove listidx to avoid a warning:

../src/backend/statistics/extended_stats.c:2879:8: error: variable 'listidx' set but not used [-Werror,-Wunused-but-set-variable]

> Subject: [PATCH v1 2/8] Remove estimiatedcluases and varRelid arguments

> @@ -2939,15 +2939,11 @@ statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
>   /* needs to happen before skipping any clauses */
>   listidx++;
>
> -  /* Skip clauses that were already estimated. */
> -  if (bms_is_member(listidx, estimatedclauses))
> -   continue;
> -

Your 007 could instead test if relids == NULL:

> Subject: [PATCH v1 7/8] bms_is_empty is more effective than bms_num_members(b)

> -  if (bms_num_members(relids) == 0)
> +  if (bms_is_empty(relids))

typos:

001: s/heuristict/heuristics/
002: s/grantee/guarantee/
002: s/estimiatedcluases/estimatedclauses/

It'd be nice to fix/silence these warnings from 001:

|../src/backend/statistics/extended_stats.c:3151:36: warning: ‘relid’ may be used uninitialized [-Wmaybe-uninitialized]
| 3151 |     if (var->varno != relid)
|      |                       ^
|../src/backend/statistics/extended_stats.c:3104:33: note: ‘relid’ was declared here
| 3104 |     int      relid;
|      |              ^~~~~
|[1016/1893] Compiling C object src/backend/postgres_lib.a.p/statistics_mcv.c.o
|../src/backend/statistics/mcv.c: In function ‘mcv_combine_extended’:
|../src/backend/statistics/mcv.c:2431:49: warning: declaration of ‘idx’ shadows a previous local [-Wshadow=compatible-local]

FYI, I also ran the patch with a $large number of reports without
observing any errors or crashes.

I'll try to look harder at the next patch revision.

--
Justin
Hello Tomas!

>>> At for the code level, I reviewed them in the top-down manner and almost
>>> 40% completed. Here are some findings just FYI. For efficiency purpose,
>>> I provide each feedback with a individual commit, after all I want to
>>> make sure my comment is practical and coding and testing is a good way
>>> to archive that. I tried to make each of them as small as possible so
>>> that you can reject or accept them convinently.
>>>
>>> 0001 is your patch, I just rebase them against the current master. 0006
>>> is not much relevant with current patch, and I think it can be committed
>>> individually if you are OK with that.
>>>
>>> Hope this kind of review is helpful.
>>>
>>
>> Cool! There's obviously no chance to get this into v18, and I have stuff
>> to do in this CF. But I'll take a look after that.
>
> Good to know that. I will continue my work before that.

I have completed my code-level review and modifications. These individual
commits and their messages will probably be helpful for the discussion.

--
Best Regards
Andy Fan
Attachment
Hello Justin!

Justin Pryzby <pryzby@telsasoft.com> writes:

> |../src/backend/statistics/extended_stats.c:3151:36: warning: ‘relid’ may be used uninitialized [-Wmaybe-uninitialized]
> | 3151 |     if (var->varno != relid)
> |      |                       ^
> |../src/backend/statistics/extended_stats.c:3104:33: note: ‘relid’ was declared here
> | 3104 |     int      relid;
> |      |              ^~~~~
> |[1016/1893] Compiling C object src/backend/postgres_lib.a.p/statistics_mcv.c.o
> |../src/backend/statistics/mcv.c: In function ‘mcv_combine_extended’:
> |../src/backend/statistics/mcv.c:2431:49: warning: declaration of ‘idx’ shadows a previous local [-Wshadow=compatible-local]

Thanks for the feedback. The warnings should be fixed in the latest
revision. The 's/estimiatedcluases/estimatedclauses/' typo in the commit
message is not fixed, since I would have to regenerate all the commits to
fix it. We are still in the discussion stage, and I think its impact on
the discussion is pretty limited.

> FYI, I also ran the patch with a $large number of reports without
> observing any errors or crashes.

Good to know that.

> I'll try to look harder at the next patch revision.

Thank you!

--
Best Regards
Andy Fan
On Sun, Apr 28, 2024 at 10:07:01AM +0800, Andy Fan wrote:
> 's/estimiatedcluases/estimatedclauses/' typo error in the
> commit message is not fixed since I have to regenerate all the commits

Maybe you know this, but some of these patches need to be squashed.
Regenerating the patches to address feedback is the usual process.
When they're not squashed, it makes it hard to review the content of the
patches. For example:

[PATCH v1 18/22] Fix error "unexpected system attribute" when join with system attr

..adds .sql regression tests, but the expected .out isn't updated until

[PATCH v1 19/22] Fix the incorrect comment on extended stats.

That fixes an elog() in Tomas' original commit, so it should probably be
002 or 003. It might make sense to keep the first commit separate for
now, since it's nice to keep Tomas' original patch "pristine" to make
more apparent the changes you're proposing.

Another:

[PATCH v1 20/22] Add fastpath when combine the 2 MCV like eqjoinsel_inner.

..doesn't compile without

[PATCH v1 21/22] When mcv->ndimensions == list_length(clauses), handle it same as

Your 022 patch fixes a typo in your 002 patch, which means that first
one reads a patch with a typo, and then later, a 10-line-long patch
reflowing the comment with the typo fixed.

A good guideline is that each patch should be self-contained, compiling
and passing tests. Which is more difficult with a long stack of patches.

--
Justin
Hello Justin,

Thanks for showing interest in this!

> On Sun, Apr 28, 2024 at 10:07:01AM +0800, Andy Fan wrote:
>> 's/estimiatedcluases/estimatedclauses/' typo error in the
>> commit message is not fixed since I have to regenerate all the commits
>
> Maybe you know this, but some of these patches need to be squashed.
> Regenerating the patches to address feedback is the usual process.
> When they're not squished, it makes it hard to review the content of the
> patches.

You might have overlooked the fact that each individual commit is just
there to make the communication effective (easy to review), and all of
them will be merged into 1 commit at the end / during the review process.

Even so, if something makes it hard to review, I am pretty happy to
regenerate the patches, but does 's/estimiatedcluases/estimatedclauses/'
belong in that category? I'm pretty sure that is not the only typo or
inappropriate word, and if we need to regenerate the 22 patches because of
that, we would have to regenerate them pretty often. Do you mind providing
more feedback at once so I can merge all of it in one modification, or do
you think the typo has blocked the review process?

> For example:
> [PATCH v1 18/22] Fix error "unexpected system attribute" when join with system attr
> ..adds .sql regression tests, but the expected .out isn't updated until
> [PATCH v1 19/22] Fix the incorrect comment on extended stats.
>
> That fixes an elog() in Tomas' original commit, so it should probably be
> 002 or 003.

Which elog are you talking about?

> It might make sense to keep the first commit separate for
> now, since it's nice to keep Tomas' original patch "pristine" to make
> more apparent the changes you're proposing.

This is my goal as well. Did you find anything I did which breaks this
rule? That's absolutely not my intention.

> Another:
> [PATCH v1 20/22] Add fastpath when combine the 2 MCV like eqjoinsel_inner.
> ..doesn't compile without
> [PATCH v1 21/22] When mcv->ndimensions == list_length(clauses), handle it same as
>
> Your 022 patch fixes a typo in your 002 patch, which means that first
> one reads a patch with a typo, and then later, a 10 line long patch
> reflowing the comment with a typo fixed.

I would like to regenerate the 22 patches if you think the typo makes the
review process hard. I can do such things, but I am not willing to do that
often.

> A good guideline is that each patch should be self-contained, compiling
> and passing tests. Which is more difficult with a long stack of
> patches.

I agree.

--
Best Regards
Andy Fan
On 4/3/24 01:22, Tomas Vondra wrote:
> Cool! There's obviously no chance to get this into v18, and I have stuff
> to do in this CF. But I'll take a look after that.

I'm looking at your patch now - an excellent start to an eagerly awaited
feature!

A couple of questions:

1. I didn't find the implementation of strategy 'c' - estimation by the
number of distinct values. Did you forget it?

2. Can we add a clauselist selectivity hook into the core (something
similar to the code in the attachment)? It would allow the development
and testing of multicolumn join estimations without patching the core.

--
regards, Andrei Lepikhov
Postgres Professional
Attachment
Hi Andrei,

> On 4/3/24 01:22, Tomas Vondra wrote:
>> Cool! There's obviously no chance to get this into v18, and I have stuff
>> to do in this CF. But I'll take a look after that.
> I'm looking at your patch now - an excellent start to an eagerly
> awaited feature!
> A couple of questions:
> 1. I didn't find the implementation of strategy 'c' - estimation by the
> number of distinct values. Did you forget it?

What do you mean by "strategy 'c'"?

> 2. Can we add a clauselist selectivity hook into the core (something
> similar to the code in the attachment)? It would allow developing and
> testing multicolumn join estimations without patching the core.

The idea LGTM. But do you want

+ if (clauselist_selectivity_hook)
+     s1 = clauselist_selectivity_hook(root, clauses, varRelid, jointype,
+

rather than

+ if (clauselist_selectivity_hook)
+     *return* clauselist_selectivity_hook(root, clauses, ..)

?

--
Best Regards
Andy Fan
On 20/5/2024 15:52, Andy Fan wrote:
>
> Hi Andrei,
>
>> On 4/3/24 01:22, Tomas Vondra wrote:
>>> Cool! There's obviously no chance to get this into v18, and I have stuff
>>> to do in this CF. But I'll take a look after that.
>> I'm looking at your patch now - an excellent start to an eagerly
>> awaited feature!
>> A couple of questions:
>> 1. I didn't find the implementation of strategy 'c' - estimation by the
>> number of distinct values. Did you forget it?
>
> What do you mean by "strategy 'c'"?
As described in the 0001-* patch:
 * c) No extended stats with MCV. If there are multiple join clauses,
 * we can try using ndistinct coefficients and do what eqjoinsel does.
>
>> 2. Can we add a clauselist selectivity hook into the core (something
>> similar to the code in the attachment)? It would allow developing and
>> testing multicolumn join estimations without patching the core.
>
> The idea LGTM. But do you want
>
> + if (clauselist_selectivity_hook)
> +     s1 = clauselist_selectivity_hook(root, clauses, varRelid, jointype,
> +
>
> rather than
>
> + if (clauselist_selectivity_hook)
> +     *return* clauselist_selectivity_hook(root, clauses, ..)
Of course - the library may estimate only some of the clauses. That is
why I added the input/output parameter 'estimatedclauses', by analogy
with statext_clauselist_selectivity.

--
regards, Andrei Lepikhov
Postgres Professional
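For readers following the thread, here is a minimal sketch of what the
proposed hook could look like, pieced together from the fragments quoted
above and modeled on the existing statext_clauselist_selectivity()
interface; the hook name, its exact argument list, and the surrounding
function are assumptions, not the code from the attached patch.

==================================================================
#include "postgres.h"
#include "nodes/pathnodes.h"
#include "optimizer/optimizer.h"

/*
 * Hypothetical hook type, modeled on statext_clauselist_selectivity();
 * the name and argument list are assumptions, not the attached patch.
 */
typedef Selectivity (*clauselist_selectivity_hook_type) (PlannerInfo *root,
                                                         List *clauses,
                                                         int varRelid,
                                                         JoinType jointype,
                                                         SpecialJoinInfo *sjinfo,
                                                         Bitmapset **estimatedclauses);

clauselist_selectivity_hook_type clauselist_selectivity_hook = NULL;

/*
 * Illustration of the call site being discussed: the hook estimates the
 * clauses it recognizes and marks their list indexes in *estimatedclauses,
 * so the regular per-clause loop only handles the remaining clauses,
 * instead of the hook short-circuiting the whole function with "return".
 */
static Selectivity
clauselist_selectivity_sketch(PlannerInfo *root, List *clauses,
                              int varRelid, JoinType jointype,
                              SpecialJoinInfo *sjinfo)
{
    Selectivity s1 = 1.0;
    Bitmapset  *estimatedclauses = NULL;
    ListCell   *lc;

    if (clauselist_selectivity_hook)
        s1 = clauselist_selectivity_hook(root, clauses, varRelid, jointype,
                                         sjinfo, &estimatedclauses);

    foreach(lc, clauses)
    {
        Node       *clause = (Node *) lfirst(lc);

        /* skip clauses already estimated by the hook */
        if (bms_is_member(foreach_current_index(lc), estimatedclauses))
            continue;

        s1 *= clause_selectivity(root, clause, varRelid, jointype, sjinfo);
    }

    return s1;
}
==================================================================

The point made above is visible in the sketch: because the hook reports
which clauses it handled via the estimatedclauses bitmap, the remaining
clauses still get the regular per-clause treatment rather than being
skipped by an early return.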
On 5/20/24 16:40, Andrei Lepikhov wrote:
> On 20/5/2024 15:52, Andy Fan wrote:
>> + if (clauselist_selectivity_hook)
>> +     *return* clauselist_selectivity_hook(root, clauses, ..)
> Of course - the library may estimate only some of the clauses. That is
> why I added the input/output parameter 'estimatedclauses', by analogy
> with statext_clauselist_selectivity.
Here is a polished and slightly modified version of the proposed hook.

Additionally, I propose exporting the statext_mcv_clauselist_selectivity
routine, just like dependencies_clauselist_selectivity. This could
potentially enhance the functionality of the PostgreSQL estimation code.

To clarify the purpose: I want an optional, loaded-as-a-library, more
conservative estimation based on distinct statistics. Let me provide a
(somewhat degenerate) example:

CREATE TABLE is_test(x1 integer, x2 integer, x3 integer, x4 integer);
INSERT INTO is_test (x1,x2,x3,x4)
  SELECT x%5,x%7,x%11,x%13 FROM generate_series(1,1E3) AS x;
INSERT INTO is_test (x1,x2,x3,x4)
  SELECT 14,14,14,14 FROM generate_series(1,100) AS x;
CREATE STATISTICS ist_stat (dependencies,ndistinct)
  ON x1,x2,x3,x4 FROM is_test;
ANALYZE is_test;
EXPLAIN (ANALYZE, COSTS ON, SUMMARY OFF, TIMING OFF)
SELECT * FROM is_test WHERE x1=14 AND x2=14 AND x3=14 AND x4=14;
DROP TABLE is_test CASCADE;

I see:
(cost=0.00..15.17 rows=3 width=16) (actual rows=100 loops=1)

Functional dependencies work great if the same dependency holds for all
the data in the columns, but we get underestimations when different
subsets of rows follow different rules. So, if we don't have MCV
statistics, we sometimes need to skip the dependencies statistics and
use ndistinct instead.

--
regards, Andrei Lepikhov
Postgres Professional
Attachment
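To make the "loaded as a library" idea concrete, here is a minimal
sketch of how an extension might install such a hook from _PG_init().
The hook type and variable are the assumed ones from the earlier sketch,
and the estimator body is a stub - this is not the code from the
attachment.

==================================================================
#include "postgres.h"
#include "fmgr.h"
#include "nodes/pathnodes.h"

PG_MODULE_MAGIC;

/* assumed to be provided by the patched core, as sketched earlier */
typedef Selectivity (*clauselist_selectivity_hook_type) (PlannerInfo *root,
                                                         List *clauses,
                                                         int varRelid,
                                                         JoinType jointype,
                                                         SpecialJoinInfo *sjinfo,
                                                         Bitmapset **estimatedclauses);
extern PGDLLIMPORT clauselist_selectivity_hook_type clauselist_selectivity_hook;

static clauselist_selectivity_hook_type prev_hook = NULL;

static Selectivity
conservative_clauselist_selectivity(PlannerInfo *root, List *clauses,
                                    int varRelid, JoinType jointype,
                                    SpecialJoinInfo *sjinfo,
                                    Bitmapset **estimatedclauses)
{
    /*
     * A real implementation would estimate the clauses it can handle
     * (e.g. using ndistinct statistics) and record their list indexes
     * in *estimatedclauses.  This stub only chains to a previous hook.
     */
    if (prev_hook)
        return prev_hook(root, clauses, varRelid, jointype, sjinfo,
                         estimatedclauses);

    return 1.0;                 /* nothing estimated here */
}

void
_PG_init(void)
{
    prev_hook = clauselist_selectivity_hook;
    clauselist_selectivity_hook = conservative_clauselist_selectivity;
}
==================================================================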
Andrei Lepikhov <a.lepikhov@postgrespro.ru> writes:

> On 20/5/2024 15:52, Andy Fan wrote:
>> Hi Andrei,
>>
>>> On 4/3/24 01:22, Tomas Vondra wrote:
>>>> Cool! There's obviously no chance to get this into v18, and I have stuff
>>>> to do in this CF. But I'll take a look after that.
>>> I'm looking at your patch now - an excellent start to an eagerly
>>> awaited feature!
>>> A couple of questions:
>>> 1. I didn't find the implementation of strategy 'c' - estimation by the
>>> number of distinct values. Did you forget it?
>> What do you mean by "strategy 'c'"?
> As described in the 0001-* patch:
>  * c) No extended stats with MCV. If there are multiple join clauses,
>  * we can try using ndistinct coefficients and do what eqjoinsel does.

OK, I didn't pay enough attention to this comment before, and yes, I
come to the same conclusion as you - there is no implementation of
this.

If so, I think we should remove the comment, and do the implementation
in a later patch.

>>> 2. Can we add a clauselist selectivity hook into the core (something
>>> similar to the code in the attachment)? It would allow developing and
>>> testing multicolumn join estimations without patching the core.
>> The idea LGTM. But do you want
>> + if (clauselist_selectivity_hook)
>> +     s1 = clauselist_selectivity_hook(root, clauses, varRelid, jointype,
>> rather than
>> + if (clauselist_selectivity_hook)
>> +     *return* clauselist_selectivity_hook(root, clauses, ..)
> Of course - the library may estimate only some of the clauses. That is
> why I added the input/output parameter 'estimatedclauses', by analogy
> with statext_clauselist_selectivity.

OK. Do you think the hook proposal is closely connected with the
current topic? IIUC it seems not, so a dedicated thread explaining the
problem to solve and the proposal, with the discussion that follows,
would be helpful for both topics. I'm just worried that mixing the two
in one thread would make things unnecessarily complex.

--
Best Regards
Andy Fan
On 5/23/24 09:04, Andy Fan wrote:
> Andrei Lepikhov <a.lepikhov@postgrespro.ru> writes:
>> * c) No extended stats with MCV. If there are multiple join clauses,
>> * we can try using ndistinct coefficients and do what eqjoinsel does.
>
> OK, I didn't pay enough attention to this comment before, and yes, I
> come to the same conclusion as you - there is no implementation of
> this.
>
> If so, I think we should remove the comment, and do the implementation
> in a later patch.
I have the opposite opinion:
1. Distinct estimation is a more universal thing - you can use it
precisely on any subset of columns.
2. Distinct estimation is faster - it is just a number; you don't need
to detoast a huge array of values and compare them one by one.
So, IMO, it is an essential part of join estimation and it should be
implemented the same way as in eqjoinsel.

> Do you think the hook proposal is closely connected with the current
> topic? IIUC it seems not, so a dedicated thread explaining the problem
> to solve and the proposal, with the discussion that follows, would be
> helpful for both topics. I'm just worried that mixing the two in one
> thread would make things unnecessarily complex.
Sure.

--
regards, Andrei Lepikhov
Postgres Professional
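For reference, the "strategy c" fallback being discussed would mimic
what eqjoinsel_inner() does when no MCV lists are available: divide by
the larger of the two ndistinct estimates and discount the NULL
fractions. Below is a minimal sketch under that assumption; it is not
taken from any of the posted patches, and nd1/nd2 would come from the
extended ndistinct statistics covering the join columns on each side.

==================================================================
#include "postgres.h"
#include "utils/selfuncs.h"     /* DEFAULT_EQ_SEL */

/*
 * Sketch of the "strategy c" fallback: nd1/nd2 are the estimated numbers
 * of distinct combinations of the join columns on each side, and
 * nullfrac1/nullfrac2 the fraction of rows with a NULL in any of those
 * columns.  This just restates the no-MCV branch of eqjoinsel_inner()
 * for the multi-column case.
 */
static Selectivity
ndistinct_join_selectivity(double nd1, double nd2,
                           double nullfrac1, double nullfrac2)
{
    Selectivity selec;

    if (nd1 <= 0.0 || nd2 <= 0.0)
        return DEFAULT_EQ_SEL;  /* no usable statistics, use the default */

    /*
     * Each distinct combination on the side with fewer combinations
     * matches at most one combination on the other side, so divide by
     * the larger ndistinct, as eqjoinsel_inner() does without MCVs.
     */
    selec = 1.0 / Max(nd1, nd2);

    /* rows with NULLs never match an equality join clause */
    selec *= (1.0 - nullfrac1) * (1.0 - nullfrac2);

    return selec;
}
==================================================================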
Hi,

I finally got to do a review of the reworked patch series. For the most
part I do like the changes, although I'm not 100% sure about some of
them. I do like that the changes have been kept in separate patches,
which makes it much easier to understand what the goal is etc. But it's
probably time to start merging some of the patches back into the main
patch - it's a bit tedious to work with 22 patches.

Note: This needs to be added to the next CF, so that we get cfbot
results and can focus on it in 2024-07. Also, I'd attach the patches
directly, not as a .tar.

I went through the patches one by one and did a review for each of them
separately. I only had a couple of hours for this today, so it's not a
super-deep review, more a start for a discussion / asking questions.

For each patch I added a "review" and a "pgindent" patch, where "review"
is my comments and "pgindent" is the changes pgindent would make (which
we now expect to happen before commit). In hindsight I should have
skipped the pgindent part; it made things more tedious with little
benefit. But I only realized that half-way through the series, so it was
easier to just continue.

Let me quickly go through the original parts - most of this is already
in the "review" patches, but it's better to quote the main points here
to start a discussion. I'll omit some of the smaller suggestions, so
please look at the 'review' patches.

v20240617-0001-Estimate-joins-using-extended-statistics.patch

- rewords a couple comments, particularly for statext_find_matching_mcv

- a couple XXX comments about possibly stale/inaccurate comments

- a suggestion to improve statext_determine_join_restrictions, but one
of the later patches already does the caching

v20240617-0004-Remove-estimiatedcluases-and-varRelid-argu.patch

- I'm not sure we actually should do this (especially the removal of the
estimatedclauses bitmap). It breaks if we add the new hook.

v20240617-0007-Remove-SpecialJoinInfo-sjinfo-argument.patch
v20240617-0009-Remove-joinType-argument.patch

- I'm skeptical about removing these two. Yes, the current code does not
actually use those fields, but selfuncs.c always passes both jointype
and sjinfo, so maybe we should do that too for consistency. What happens
if we end up wanting to call an existing selfuncs function that needs
these parameters in the future? Say, because we want to call the regular
join estimator and then apply some "correction" to the result?

v20240617-0011-use-the-pre-calculated-RestrictInfo-left-r.patch

- Why not keep the BMS_MULTIPLE check on clause_relids? It seems cheap,
so maybe we could do it before the more expensive stuff.

v20240617-0014-Fast-path-for-general-clauselist_selectivi.patch

- Does this actually make a meaningful difference?

v20240617-0017-a-branch-of-updates-around-JoinPairInfo.patch

- Can we actually assume the clause has a RestrictInfo on top? IIRC
there are cases where we can get here without it (e.g. an AND clause?).

v20240617-0020-Cache-the-result-of-statext_determine_join.patch

- This addresses some of my suggestions on 0001, but I think we don't
actually need to recalculate both lists in each loop.

v20240617-0030-optimize-the-order-of-mcv-equal-function-e.patch

- There's no explanation to support this optimization. I guess I know
what it tries to do, but doesn't it have the same issues with
unpredictable behavior as the GROUP BY patch, which ended up being
reverted and reworked?

- modifies the .sql test but not the expected output

- The McvProc name seems a bit misleading. I think it's really "procs",
for example.
v20240617-0033-Merge-3-palloc-into-1-palloc.patch

- Not sure. It's presented as an optimization to save on palloc calls,
but I doubt that's measurable. Maybe it makes the code a little bit more
readable, but right now I'm not convinced it's worth it.

v20240617-0036-Remove-2-pull_varnos-calls-with-rinfo-left.patch

- Again, can we rely on this always getting a RestrictInfo? Maybe we
can, but it's not obvious to me, so a comment explaining that would be
nice. And maybe an assert to check this.

v20240617-0040-some-code-refactor-as-before.patch

- Essentially applies earlier refactorings/tweaks to another place.

- Seems OK (depending on whether we agree on those changes), but it
seems mostly independent of this patch series. So I'd at least keep it
in a separate patch.

v20240617-0043-Fix-error-unexpected-system-attribute-when.patch

- seems to only tweak the .sql, not the expected output

- One of the comments refers to the "above limitation", but I'm not sure
what that's about.

v20240617-0048-Add-fastpath-when-combine-the-2-MCV-like-e.patch
v20240617-0050-When-mcv-ndimensions-list_length-clauses-h.patch

- I'm not sure about one of the optimizations, relying on having a
clause for each dimension of the MCV.

v20240617-0054-clauselist_selectivity_hook.patch

- I believe this does not work with the earlier patch that removed the
estimatedclauses bitmap from the "try" function.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
- v20240617-0001-Estimate-joins-using-extended-statistics.patch
- v20240617-0002-review.patch
- v20240617-0003-pgindent.patch
- v20240617-0004-Remove-estimiatedcluases-and-varRelid-argu.patch
- v20240617-0005-review.patch
- v20240617-0006-pgindent.patch
- v20240617-0007-Remove-SpecialJoinInfo-sjinfo-argument.patch
- v20240617-0008-review.patch
- v20240617-0009-Remove-joinType-argument.patch
- v20240617-0010-review.patch
- v20240617-0011-use-the-pre-calculated-RestrictInfo-left-r.patch
- v20240617-0012-review.patch
- v20240617-0013-pgindent.patch
- v20240617-0014-Fast-path-for-general-clauselist_selectivi.patch
- v20240617-0015-review.patch
- v20240617-0016-bms_is_empty-is-more-effective-than-bms_nu.patch
- v20240617-0017-a-branch-of-updates-around-JoinPairInfo.patch
- v20240617-0018-review.patch
- v20240617-0019-pgindent.patch
- v20240617-0020-Cache-the-result-of-statext_determine_join.patch
- v20240617-0021-review.patch
- v20240617-0022-pgindent.patch
- v20240617-0023-Simplify-code-by-using-list_cell_number.patch
- v20240617-0024-review.patch
- v20240617-0025-pgindent.patch
- v20240617-0026-Handle-the-RelableType.patch
- v20240617-0027-Use-FunctionCallInvoke-instead-of-Function.patch
- v20240617-0028-review.patch
- v20240617-0029-pgindent.patch
- v20240617-0030-optimize-the-order-of-mcv-equal-function-e.patch
- v20240617-0031-review.patch
- v20240617-0032-pgindent.patch
- v20240617-0033-Merge-3-palloc-into-1-palloc.patch
- v20240617-0034-review.patch
- v20240617-0035-pgindent.patch
- v20240617-0036-Remove-2-pull_varnos-calls-with-rinfo-left.patch
- v20240617-0037-review.patch
- v20240617-0038-pgindent.patch
- v20240617-0039-add-the-statistic_proc_security_check-chec.patch
- v20240617-0040-some-code-refactor-as-before.patch
- v20240617-0041-review.patch
- v20240617-0042-pgindent.patch
- v20240617-0043-Fix-error-unexpected-system-attribute-when.patch
- v20240617-0044-review.patch
- v20240617-0045-pgindent.patch
- v20240617-0046-Fix-the-incorrect-comment-on-extended-stat.patch
- v20240617-0047-review.patch
- v20240617-0048-Add-fastpath-when-combine-the-2-MCV-like-e.patch
- v20240617-0049-review.patch
- v20240617-0050-When-mcv-ndimensions-list_length-clauses-h.patch
- v20240617-0051-review.patch
- v20240617-0052-pgindent.patch
- v20240617-0053-Fix-typo-error-s-grantee-guarantee.patch
- v20240617-0054-clauselist_selectivity_hook.patch
- v20240617-0055-review.patch
- v20240617-0056-pgindent.patch
On 17/6/2024 18:10, Tomas Vondra wrote:
> Let me quickly go through the original parts - most of this is already
> in the "review" patches, but it's better to quote the main points here
> to start a discussion. I'll omit some of the smaller suggestions, so
> please look at the 'review' patches.
>
> v20240617-0001-Estimate-joins-using-extended-statistics.patch
>
> - rewords a couple comments, particularly for statext_find_matching_mcv
>
> - a couple XXX comments about possibly stale/inaccurate comments

> v20240617-0054-clauselist_selectivity_hook.patch
>
> - I believe this does not work with the earlier patch that removed the
> estimatedclauses bitmap from the "try" function.

This patch set is too big to digest at once - it's just challenging to
invent examples and counterexamples. Can we get these two patches into
master and analyse further improvements based on that?

Some thoughts:
You removed varRelid. I have thought about replacing this value with a
RelOptInfo, which would allow extensions (remember the selectivity hook)
to know about the underlying path tree.

The first patch is generally OK, and I vote for having it in master.
However, the most harmful case, and the one I see the most reports
about, is a parameterised JOIN on multiple ANDed clauses. In that case,
we have a scan filter on something like the below:

x = $1 AND y = $2 AND ...

As I see it, the current patch doesn't resolve this issue.

--
regards, Andrei Lepikhov