Thread: using extended statistics to improve join estimates

using extended statistics to improve join estimates

From
Tomas Vondra
Date:
Hi,

So far the extended statistics are applied only at scan level, i.e. when
estimating selectivity for individual tables. Which is great, but joins
are a known challenge, so let's try doing something about it ...

Konstantin Knizhnik posted a patch [1] using functional dependencies to
improve join estimates in January. It's an interesting approach, but as
I explained in that thread I think we should try a different approach,
similar to how we use MCV lists without extended stats. We'll probably
end up considering functional dependencies too, but probably only as a
fallback (similar to what we do for single-relation estimates).

This is a PoC demonstrating the approach I envisioned. It's incomplete
and has various limitations:

- no expression support, just plain attribute references
- only equality conditions
- requires MCV lists on both sides
- inner joins only

All of this can / should be relaxed later, of course. But for a PoC this
seems sufficient.

The basic principle is fairly simple, and mimics what eqjoinsel_inner
does. Assume we have a query:

  SELECT * FROM t1 JOIN t2 ON (t1.a = t2.a AND t1.b = t2.b)

If we have MCV lists on (t1.a,t1.b) and (t2.a,t2.b) then we can use the
same logic as eqjoinsel_inner and "match" them together. If the MCV list
is "larger" - e.g. on (a,b,c) - then it's a bit more complicated, but
the general idea is the same.

To demonstrate this, consider a very simple example with a table that
has a lot of dependency between the columns:

==================================================================

CREATE TABLE t (a INT, b INT, c INT, d INT);
INSERT INTO t SELECT mod(i,100), mod(i,100), mod(i,100), mod(i,100)
  FROM generate_series(1,100000) s(i);
ANALYZE t;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);

CREATE STATISTICS s (mcv, ndistinct) ON a,b,c,d FROM t;
ANALYZE t;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);

ALTER STATISTICS s SET STATISTICS 10000;
ANALYZE t;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);

==================================================================

The results look like this:

- actual rows:             100000000
- estimated (no stats):      1003638
- estimated (stats, 100):  100247844
- estimated (stats, 10k):  100000000

So, in this case the extended stats help quite a bit, even with the
default statistics target.

However, there are other things we can do. For example, we can use
restrictions (at relation level) as "conditions" to filter the MCV lits,
and calculate conditional probability. This is useful even if there's
just a single join condition (on one column), but there are dependencies
between that column and the other filters. Or maybe when there are
filters between conditions on the two sides.

Consider for example these two queries:

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b)
 WHERE t1.c < 25 AND t2.d < 25;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b)
 WHERE t1.c < 25 AND t2.d > 75;

In this particular case we know that (a = b = c = d) so the two filters
are somewhat redundant. The regular estimates will ignore that, but with
MCV we can actually detect that - when we combine the two MCV lists, we
essentially calculate MCV (a,b,t1.c,t2.d), and use that.

                                  Q1          Q2
- actual rows:              25000000           0
- estimated (no stats):        62158       60241
- estimated (stats, 100):   25047900           1
- estimated (stats, 10k):   25000000           1

Obviously, the accuracy depends on how representative the MCV list is
(what fraction of the data it represents), and in this case it works
fairly nicely. A lot of the future work will be about handling cases
when it represents only part of the data.

The attached PoC patch has a number of FIXME and XXX, describing stuff I
ignored to keep it simple, possible future improvement. And so on.


regards


[1]
https://www.postgresql.org/message-id/flat/71d67391-16a9-3e5e-b5e4-8f7fd32cc1b2@postgrespro.ru

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: using extended statistics to improve join estimates

From
Zhihong Yu
Date:
Hi,

+ * has_matching_mcv
+ *     Check whether the list contains statistic of a given kind

The method name is find_matching_mcv(). It seems the method initially returned bool but later the return type was changed.

+   StatisticExtInfo *found = NULL;

found normally is associated with bool return value. Maybe call the variable matching_mcv or something similar.

+           /* skip items eliminated by restrictions on rel2 */
+           if (conditions2 && !conditions2[j])
+               continue;

Maybe you can add a counter recording the number of non-skipped items for the inner loop. If this counter is 0 after the completion of one iteration, we come out of the outer loop directly.

Cheers

On Wed, Mar 31, 2021 at 10:36 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
Hi,

So far the extended statistics are applied only at scan level, i.e. when
estimating selectivity for individual tables. Which is great, but joins
are a known challenge, so let's try doing something about it ...

Konstantin Knizhnik posted a patch [1] using functional dependencies to
improve join estimates in January. It's an interesting approach, but as
I explained in that thread I think we should try a different approach,
similar to how we use MCV lists without extended stats. We'll probably
end up considering functional dependencies too, but probably only as a
fallback (similar to what we do for single-relation estimates).

This is a PoC demonstrating the approach I envisioned. It's incomplete
and has various limitations:

- no expression support, just plain attribute references
- only equality conditions
- requires MCV lists on both sides
- inner joins only

All of this can / should be relaxed later, of course. But for a PoC this
seems sufficient.

The basic principle is fairly simple, and mimics what eqjoinsel_inner
does. Assume we have a query:

  SELECT * FROM t1 JOIN t2 ON (t1.a = t2.a AND t1.b = t2.b)

If we have MCV lists on (t1.a,t1.b) and (t2.a,t2.b) then we can use the
same logic as eqjoinsel_inner and "match" them together. If the MCV list
is "larger" - e.g. on (a,b,c) - then it's a bit more complicated, but
the general idea is the same.

To demonstrate this, consider a very simple example with a table that
has a lot of dependency between the columns:

==================================================================

CREATE TABLE t (a INT, b INT, c INT, d INT);
INSERT INTO t SELECT mod(i,100), mod(i,100), mod(i,100), mod(i,100)
  FROM generate_series(1,100000) s(i);
ANALYZE t;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);

CREATE STATISTICS s (mcv, ndistinct) ON a,b,c,d FROM t;
ANALYZE t;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);

ALTER STATISTICS s SET STATISTICS 10000;
ANALYZE t;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);

==================================================================

The results look like this:

- actual rows:             100000000
- estimated (no stats):      1003638
- estimated (stats, 100):  100247844
- estimated (stats, 10k):  100000000

So, in this case the extended stats help quite a bit, even with the
default statistics target.

However, there are other things we can do. For example, we can use
restrictions (at relation level) as "conditions" to filter the MCV lits,
and calculate conditional probability. This is useful even if there's
just a single join condition (on one column), but there are dependencies
between that column and the other filters. Or maybe when there are
filters between conditions on the two sides.

Consider for example these two queries:

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b)
 WHERE t1.c < 25 AND t2.d < 25;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b)
 WHERE t1.c < 25 AND t2.d > 75;

In this particular case we know that (a = b = c = d) so the two filters
are somewhat redundant. The regular estimates will ignore that, but with
MCV we can actually detect that - when we combine the two MCV lists, we
essentially calculate MCV (a,b,t1.c,t2.d), and use that.

                                  Q1          Q2
- actual rows:              25000000           0
- estimated (no stats):        62158       60241
- estimated (stats, 100):   25047900           1
- estimated (stats, 10k):   25000000           1

Obviously, the accuracy depends on how representative the MCV list is
(what fraction of the data it represents), and in this case it works
fairly nicely. A lot of the future work will be about handling cases
when it represents only part of the data.

The attached PoC patch has a number of FIXME and XXX, describing stuff I
ignored to keep it simple, possible future improvement. And so on.


regards


[1]
https://www.postgresql.org/message-id/flat/71d67391-16a9-3e5e-b5e4-8f7fd32cc1b2@postgrespro.ru

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: using extended statistics to improve join estimates

From
Tomas Vondra
Date:
Hi,

Here's a slightly improved / cleaned up version of the PoC patch, 
removing a bunch of XXX and FIXMEs, adding comments, etc.

The approach is sound in principle, I think, although there's still a 
bunch of things to address:

1) statext_compare_mcvs only really deals with equijoins / inner joins 
at the moment, as it's based on eqjoinsel_inner. It's probably desirable 
to add support for additional join types (inequality and outer joins).

2) Some of the steps are performed multiple times - e.g. matching base 
restrictions to statistics, etc. Those probably can be cached somehow, 
to reduce the overhead.

3) The logic of picking the statistics to apply is somewhat simplistic, 
and maybe could be improved in some way. OTOH the number of candidate 
statistics is likely low, so this is not a big issue.

4) statext_compare_mcvs is based on eqjoinsel_inner and makes a bunch of 
assumptions similar to the original, but some of those assumptions may 
be wrong in multi-column case, particularly when working with a subset 
of columns. For example (ndistinct - size(MCV)) may not be the number of 
distinct combinations outside the MCV, when ignoring some columns. Same 
for nullfract, and so on. I'm not sure we can do much more than pick 
some reasonable approximation.

5) There are no regression tests at the moment. Clearly a gap.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: using extended statistics to improve join estimates

From
Tomas Vondra
Date:
Hi,

attached is an improved version of this patch, addressing some of the
points mentioned in my last message:

1) Adds a couple regression tests, testing various join cases with
expressions, additional conditions, etc.

2) Adds support for expressions, so the join clauses don't need to
reference just simple columns. So e.g. this can benefit from extended
statistics, when defined on the expressions:

 -- CREATE STATISTICS s1 ON (a+1), b FROM t1;
 -- CREATE STATISTICS s2 ON (a+1), b FROM t2;

 SELECT * FROM t1 JOIN t2 ON ((t1.a + 1) = (t2.a + 1) AND t1.b = t2.b);

3) Can combine extended statistics and regular (per-column) statistics.
The previous version required extended statistics MCV on both sides of
the join, but adding extended statistics on both sides may impractical
(e.g. if one side does not have correlated columns it's silly to have to
add it just to make this patch happy).

For example you may have extended stats on the dimension table but not
the fact table, and the patch still can combine those two. Of course, if
there's no MCV on either side, we can't do much.

So this patch works when both sides have extended statistics MCV, or
when one side has extended MCV and the other side regular MCV. It might
seem silly, but the extended MCV allows considering additional baserel
conditions (if there are any).


examples
========

The table / data is very simple, but hopefully good enough for some
simple examples.

  create table t1 (a int, b int, c int);
  create table t2 (a int, b int, c int);

  insert into t1 select mod(i,50), mod(i,50), mod(i,50)
    from generate_series(1,1000) s(i);

  insert into t2 select mod(i,50), mod(i,50), mod(i,50)
    from generate_series(1,1000) s(i);

  analyze t1, t2;

First, without extended stats (just the first line of explain analyze,
to keep the message short):

explain analyze select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b);

                            QUERY PLAN
------------------------------------------------------------------------
 Hash Join  (cost=31.00..106.00 rows=400 width=24)
            (actual time=5.426..22.678 rows=20000 loops=1)


explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c < 25;

                            QUERY PLAN
------------------------------------------------------------------------
 Hash Join  (cost=28.50..160.75 rows=10000 width=24)
            (actual time=5.325..19.760 rows=10000 loops=1)


explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c <
25 and t2.c > 25;

                            QUERY PLAN
------------------------------------------------------------------------
 Hash Join  (cost=24.50..104.75 rows=4800 width=24)
            (actual time=5.618..5.639 rows=0 loops=1)


Now, let's create statistics:

  create statistics s1 on a, b, c from t1 ;
  create statistics s2 on a, b, c from t2 ;
  analyze t1, t2;

and now the same queries again:

explain analyze select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b);

                            QUERY PLAN
------------------------------------------------------------------------
 Hash Join  (cost=31.00..106.00 rows=20000 width=24)
            (actual time=5.448..22.713 rows=20000 loops=1)

explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c < 25;

                            QUERY PLAN
------------------------------------------------------------------------
 Hash Join  (cost=28.50..160.75 rows=10000 width=24)
            (actual time=5.317..16.680 rows=10000 loops=1)


explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c <
25 and t2.c > 25;

                            QUERY PLAN
------------------------------------------------------------------------
 Hash Join  (cost=24.50..104.75 rows=1 width=24)
            (actual time=5.647..5.667 rows=0 loops=1)


Those examples are a bit simplistic, but the improvements are fairly
clear I think.


limitations & open issues
=========================

Let's talk about the main general restrictions and open issues in the
current patch that I can think of at the moment.

1) statistics covering all join clauses

The patch requires the statistics to cover all the join clauses, mostly
because it simplifies the implementation. This means that to use the
per-column statistics, there has to be just a single join clause.

AFAICS this could be relaxed and we could use multiple statistics to
estimate the clauses. But it'd make selection of statistics much more
complicated, because we have to pick "matching" statistics on both sides
of the join. So it seems like an overkill, and most joins have very few
conditions anyway.


2) only equality join clauses

The other restriction is that at the moment this only supports simple
equality clauses, combined with AND. So for example this is supported

   t1 JOIN t2 ON ((t1.a = t2.a) AND (t1.b + 2 = t2.b + 1))

while these are not:

   t1 JOIN t2 ON ((t1.a = t2.a) OR (t1.b + 2 = t2.b + 1))

   t1 JOIN t2 ON ((t1.a - t2.a = 0) AND (t1.b + 2 = t2.b + 1))

   t1 JOIN t2 ON ((t1.a = t2.a) AND ((t1.b = t2.b) OR (t1.c = t2.c)))

I'm not entirely sure these restrictions can be relaxed. It's not that
difficult to evaluate these cases when matching items between the MCV
lists, similarly to how we evaluate bitmaps for baserel estimates.

But I'm not sure what to do about the part not covered by the MCV lists.
The eqjoinsel() approach uses ndistinct estimates for that, but that
only works for AND clauses, I think. How would that work for OR?

Similarly, I'm not sure we can do much for non-equality conditions, but
those are currently estimated as default selectivity in selfuncs.c.


3) estimation by join pairs

At the moment, the estimates are calculated for pairs of relations, so
for example given a query

  explain analyze
  select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
                   join t3 on (t1.b = t3.b and t2.c = t3.c);

we'll estimate the first join (t1,t2) just fine, but then the second
join actually combines (t1,t2,t3). What the patch currently does is it
splits it into (t1,t2) and (t2,t3) and estimates those. I wonder if this
should actually combine all three MCVs at once - we're pretty much
combining the MCVs into one large MCV representing the join result.

But I haven't done that yet, as it requires the MCVs to be combined
using the join clauses (overlap in a way), but I'm not sure how likely
that is in practice. In the example it could help, but that's a bit
artificial example.


4) still just inner equi-joins

I haven't done any work on extending this to outer joins etc. Adding
outer and semi joins should not be complicated, mostly copying and
tweaking what eqjoinsel() does.



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: using extended statistics to improve join estimates

From
Zhihong Yu
Date:


On Wed, Oct 6, 2021 at 12:33 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
Hi,

attached is an improved version of this patch, addressing some of the
points mentioned in my last message:

1) Adds a couple regression tests, testing various join cases with
expressions, additional conditions, etc.

2) Adds support for expressions, so the join clauses don't need to
reference just simple columns. So e.g. this can benefit from extended
statistics, when defined on the expressions:

 -- CREATE STATISTICS s1 ON (a+1), b FROM t1;
 -- CREATE STATISTICS s2 ON (a+1), b FROM t2;

 SELECT * FROM t1 JOIN t2 ON ((t1.a + 1) = (t2.a + 1) AND t1.b = t2.b);

3) Can combine extended statistics and regular (per-column) statistics.
The previous version required extended statistics MCV on both sides of
the join, but adding extended statistics on both sides may impractical
(e.g. if one side does not have correlated columns it's silly to have to
add it just to make this patch happy).

For example you may have extended stats on the dimension table but not
the fact table, and the patch still can combine those two. Of course, if
there's no MCV on either side, we can't do much.

So this patch works when both sides have extended statistics MCV, or
when one side has extended MCV and the other side regular MCV. It might
seem silly, but the extended MCV allows considering additional baserel
conditions (if there are any).


examples
========

The table / data is very simple, but hopefully good enough for some
simple examples.

  create table t1 (a int, b int, c int);
  create table t2 (a int, b int, c int);

  insert into t1 select mod(i,50), mod(i,50), mod(i,50)
    from generate_series(1,1000) s(i);

  insert into t2 select mod(i,50), mod(i,50), mod(i,50)
    from generate_series(1,1000) s(i);

  analyze t1, t2;

First, without extended stats (just the first line of explain analyze,
to keep the message short):

explain analyze select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b);

                            QUERY PLAN
------------------------------------------------------------------------
 Hash Join  (cost=31.00..106.00 rows=400 width=24)
            (actual time=5.426..22.678 rows=20000 loops=1)


explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c < 25;

                            QUERY PLAN
------------------------------------------------------------------------
 Hash Join  (cost=28.50..160.75 rows=10000 width=24)
            (actual time=5.325..19.760 rows=10000 loops=1)


explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c <
25 and t2.c > 25;

                            QUERY PLAN
------------------------------------------------------------------------
 Hash Join  (cost=24.50..104.75 rows=4800 width=24)
            (actual time=5.618..5.639 rows=0 loops=1)


Now, let's create statistics:

  create statistics s1 on a, b, c from t1 ;
  create statistics s2 on a, b, c from t2 ;
  analyze t1, t2;

and now the same queries again:

explain analyze select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b);

                            QUERY PLAN
------------------------------------------------------------------------
 Hash Join  (cost=31.00..106.00 rows=20000 width=24)
            (actual time=5.448..22.713 rows=20000 loops=1)

explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c < 25;

                            QUERY PLAN
------------------------------------------------------------------------
 Hash Join  (cost=28.50..160.75 rows=10000 width=24)
            (actual time=5.317..16.680 rows=10000 loops=1)


explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c <
25 and t2.c > 25;

                            QUERY PLAN
------------------------------------------------------------------------
 Hash Join  (cost=24.50..104.75 rows=1 width=24)
            (actual time=5.647..5.667 rows=0 loops=1)


Those examples are a bit simplistic, but the improvements are fairly
clear I think.


limitations & open issues
=========================

Let's talk about the main general restrictions and open issues in the
current patch that I can think of at the moment.

1) statistics covering all join clauses

The patch requires the statistics to cover all the join clauses, mostly
because it simplifies the implementation. This means that to use the
per-column statistics, there has to be just a single join clause.

AFAICS this could be relaxed and we could use multiple statistics to
estimate the clauses. But it'd make selection of statistics much more
complicated, because we have to pick "matching" statistics on both sides
of the join. So it seems like an overkill, and most joins have very few
conditions anyway.


2) only equality join clauses

The other restriction is that at the moment this only supports simple
equality clauses, combined with AND. So for example this is supported

   t1 JOIN t2 ON ((t1.a = t2.a) AND (t1.b + 2 = t2.b + 1))

while these are not:

   t1 JOIN t2 ON ((t1.a = t2.a) OR (t1.b + 2 = t2.b + 1))

   t1 JOIN t2 ON ((t1.a - t2.a = 0) AND (t1.b + 2 = t2.b + 1))

   t1 JOIN t2 ON ((t1.a = t2.a) AND ((t1.b = t2.b) OR (t1.c = t2.c)))

I'm not entirely sure these restrictions can be relaxed. It's not that
difficult to evaluate these cases when matching items between the MCV
lists, similarly to how we evaluate bitmaps for baserel estimates.

But I'm not sure what to do about the part not covered by the MCV lists.
The eqjoinsel() approach uses ndistinct estimates for that, but that
only works for AND clauses, I think. How would that work for OR?

Similarly, I'm not sure we can do much for non-equality conditions, but
those are currently estimated as default selectivity in selfuncs.c.


3) estimation by join pairs

At the moment, the estimates are calculated for pairs of relations, so
for example given a query

  explain analyze
  select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
                   join t3 on (t1.b = t3.b and t2.c = t3.c);

we'll estimate the first join (t1,t2) just fine, but then the second
join actually combines (t1,t2,t3). What the patch currently does is it
splits it into (t1,t2) and (t2,t3) and estimates those. I wonder if this
should actually combine all three MCVs at once - we're pretty much
combining the MCVs into one large MCV representing the join result.

But I haven't done that yet, as it requires the MCVs to be combined
using the join clauses (overlap in a way), but I'm not sure how likely
that is in practice. In the example it could help, but that's a bit
artificial example.


4) still just inner equi-joins

I haven't done any work on extending this to outer joins etc. Adding
outer and semi joins should not be complicated, mostly copying and
tweaking what eqjoinsel() does.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

+       conditions2 = statext_determine_join_restrictions(root, rel, mcv);
+
+       /* if the new statistics covers more conditions, use it */
+       if (list_length(conditions2) > list_length(conditions1))
+       {
+           mcv = stat;

It seems conditions2 is calculated using mcv, I wonder why mcv is replaced by stat (for conditions1 whose length is shorter) ?

Cheers

Re: using extended statistics to improve join estimates

From
Tomas Vondra
Date:
On 10/6/21 23:03, Zhihong Yu wrote:
> Hi,
> 
> +       conditions2 = statext_determine_join_restrictions(root, rel, mcv);
> +
> +       /* if the new statistics covers more conditions, use it */
> +       if (list_length(conditions2) > list_length(conditions1))
> +       {
> +           mcv = stat;
> 
> It seems conditions2 is calculated using mcv, I wonder why mcv is 
> replaced by stat (for conditions1 whose length is shorter) ?
> 

Yeah, that's wrong - it should be the other way around, i.e.

     if (list_length(conditions1) > list_length(conditions2))

There's no test with multiple candidate statistics yet, so this went 
unnoticed :-/


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: using extended statistics to improve join estimates

From
Andy Fan
Date:
Hi Tomas:

This is the exact patch I want, thanks for the patch!

On Thu, Oct 7, 2021 at 3:33 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:


> 3) estimation by join pairs
>
> At the moment, the estimates are calculated for pairs of relations, so
> for example given a query
>
>   explain analyze
>   select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
>                    join t3 on (t1.b = t3.b and t2.c = t3.c);
>
> we'll estimate the first join (t1,t2) just fine, but then the second
> join actually combines (t1,t2,t3). What the patch currently does is it
> splits it into (t1,t2) and (t2,t3) and estimates those.

Actually I can't understand how this works even for a simpler example.
let's say we query like this (ONLY use t2's column to join t3).

select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
                         join t3 on (t2.c = t3.c and t2.d = t3.d);

Then it works well on JoinRel(t1, t2)  AND JoinRel(t2, t3).  But when comes
to JoinRel(t1, t2, t3), we didn't maintain the MCV on join rel,  so it
is hard to
work.  Here I see your solution is splitting it into (t1, t2) AND (t2,
t3) and estimate
those.  But how does this help to estimate the size of JoinRel(t1, t2, t3)?

> I wonder if this
> should actually combine all three MCVs at once - we're pretty much
> combining the MCVs into one large MCV representing the join result.
>

I guess we can keep the MCVs on joinrel for these matches.  Take the above
query I provided for example, and suppose the MCV data as below:

t1(a, b)
(1, 2) -> 0.1
(1, 3) -> 0.2
(2, 3) -> 0.5
(2, 8) -> 0.1

t2(a, b)
(1, 2) -> 0.2
(1, 3) -> 0.1
(2, 4) -> 0.2
(2, 10) -> 0.1

After t1.a = t2.a AND t1.b = t2.b,  we can build the MCV as below

(1, 2, 1, 2)  -> 0.1 * 0.2
(1, 3, 1, 3)  -> 0.2 * 0.1

And recording the total mcv frequence as (0.1 + 0.2 + 0.5 + 0.1) *
(0.2 + 0.1 + 0.2 + 0.1)

With this design, the nitems of MCV on joinrel would be less than
either of baserel.

and since we handle the eqjoin as well, we even can record the items as

(1, 2) -> 0.1 * 0.2
(1, 3) -> 0.2 * 0.1;

About when we should maintain the JoinRel's MCV data, rather than
maintain this just
after the JoinRel size is estimated, we can only estimate it when it
is needed.  for example:

select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
                         join t3 on (t2.c = t3.c and t2.d = t3.d);

we don't need to maintain the MCV on (t1, t2, t3)  since no others
need it at all. However
I don't check code too closely to see if it (Lazing computing MVC on
joinrel) is convenient
to do.


> But I haven't done that yet, as it requires the MCVs to be combined
> using the join clauses (overlap in a way), but I'm not sure how likely
> that is in practice. In the example it could help, but that's a bit
> artificial example.
>
>
> 4) still just inner equi-joins
>
> I haven't done any work on extending this to outer joins etc. Adding
> outer and semi joins should not be complicated, mostly copying and
> tweaking what eqjoinsel() does.
>

Overall, thanks for the feature and I am expecting there are more cases
to handle during discussion.  To make the review process more efficient,
I suggest that we split the patch into smaller ones and review/commit them
separately if we have finalized the design roughly . For example:

Patch 1 -- required both sides to have extended statistics.
Patch 2 -- required one side to have extended statistics and the other side had
per-column MCV.
Patch 3 -- handle the case like  WHERE t1.a = t2.a  and t1.b = Const;
Patch 3 -- handle the case for 3+ table joins.
Patch 4 -- supports the outer join.

I think we can do this if we are sure that each individual patch would work in
some cases and would not make any other case worse.  If you agree with this,
I can do that splitting work during my review process.


-- 
Best Regards
Andy Fan (https://www.aliyun.com/)



Re: using extended statistics to improve join estimates

From
Justin Pryzby
Date:
Your regression tests include two errors, which appear to be accidental, and
fixing the error shows that this case is being estimated poorly.

+-- try combining with single-column (and single-expression) statistics
+DROP STATISTICS join_test_2;
+ERROR:  statistics object "join_test_2" does not exist
...
+ERROR:  statistics object "join_stats_2" already exists

-- 
Justin



Re: using extended statistics to improve join estimates

From
Tomas Vondra
Date:
On 11/22/21 02:23, Justin Pryzby wrote:
> Your regression tests include two errors, which appear to be accidental, and
> fixing the error shows that this case is being estimated poorly.
> 
> +-- try combining with single-column (and single-expression) statistics
> +DROP STATISTICS join_test_2;
> +ERROR:  statistics object "join_test_2" does not exist
> ...
> +ERROR:  statistics object "join_stats_2" already exists
> 

D'oh, what a silly mistake ...

You're right fixing the DROP STATISTICS results in worse estimate, but 
that's actually expected for a fairly simple reason. The join condition 
has expressions on both sides, and dropping the statistics means we 
don't have any MCV for the join_test_2 side. So the optimizer ends up 
not using the regular estimates, as if there were no extended stats.

A couple lines later the script creates an extended statistics on that 
expression alone, which fixes this. An expression index would do the 
trick too.

Attached is a patch fixing the test and also the issue reported by 
Zhihong Yu some time ago.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: using extended statistics to improve join estimates

From
Tomas Vondra
Date:
On 11/6/21 11:03, Andy Fan wrote:
> Hi Tomas:
> 
> This is the exact patch I want, thanks for the patch!
> 

Good to hear.

> On Thu, Oct 7, 2021 at 3:33 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> 
> 
>> 3) estimation by join pairs
>>
>> At the moment, the estimates are calculated for pairs of relations, so
>> for example given a query
>>
>>    explain analyze
>>    select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
>>                     join t3 on (t1.b = t3.b and t2.c = t3.c);
>>
>> we'll estimate the first join (t1,t2) just fine, but then the second
>> join actually combines (t1,t2,t3). What the patch currently does is it
>> splits it into (t1,t2) and (t2,t3) and estimates those.
> 
> Actually I can't understand how this works even for a simpler example.
> let's say we query like this (ONLY use t2's column to join t3).
> 
> select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
>                           join t3 on (t2.c = t3.c and t2.d = t3.d);
> 
> Then it works well on JoinRel(t1, t2)  AND JoinRel(t2, t3).  But when comes
> to JoinRel(t1, t2, t3), we didn't maintain the MCV on join rel,  so it
> is hard to
> work.  Here I see your solution is splitting it into (t1, t2) AND (t2,
> t3) and estimate
> those.  But how does this help to estimate the size of JoinRel(t1, t2, t3)?
> 

Yeah, this is really confusing. The crucial thing to keep in mind is 
this works with clauses before running setrefs.c, so the clauses 
reference the original relations - *not* the join relation. Otherwise 
even the regular estimation would not work, because where would it get 
the per-column MCV lists etc.

Let's use a simple case with join clauses referencing just a single 
attribute for each pair or relations. And let's talk about how many join 
pairs it'll extract:

   t1 JOIN t2 ON (t1.a = t2.a) JOIN t3 ON (t1.b = t3.b)

=> first we join t1/t2, which is 1 join pair (t1,t2)
=> then we join t1/t2/t3, but the join clause references just 2 rels, so 
1 join pair (t1,t3)

Now a more complicated case, with more complex join clause

   t1 JOIN t2 ON (t1.a = t2.a) JOIN t3 ON (t1.b = t3.b AND t2.c = t3.c)

=> first we join t1/t2, which is 1 join pair (t1,t2)
=> then we join t1/t2/t3, but this time the join clause references all 
three rels, so we have two join pairs (t1,t3) and (t2,t3) and we can use 
all the statistics.



>> I wonder if this
>> should actually combine all three MCVs at once - we're pretty much
>> combining the MCVs into one large MCV representing the join result.
>>
> 
> I guess we can keep the MCVs on joinrel for these matches.  Take the above
> query I provided for example, and suppose the MCV data as below:
> 
> t1(a, b)
> (1, 2) -> 0.1
> (1, 3) -> 0.2
> (2, 3) -> 0.5
> (2, 8) -> 0.1
> 
> t2(a, b)
> (1, 2) -> 0.2
> (1, 3) -> 0.1
> (2, 4) -> 0.2
> (2, 10) -> 0.1
> 
> After t1.a = t2.a AND t1.b = t2.b,  we can build the MCV as below
> 
> (1, 2, 1, 2)  -> 0.1 * 0.2
> (1, 3, 1, 3)  -> 0.2 * 0.1
> 
> And recording the total mcv frequence as (0.1 + 0.2 + 0.5 + 0.1) *
> (0.2 + 0.1 + 0.2 + 0.1)
> 

Right, that's about the joint distribution I whole join.

> With this design, the nitems of MCV on joinrel would be less than
> either of baserel.
> 

Actually, I think the number of items can grow, because the matches may 
duplicate some items. For example in your example with (t1.a = t2.a) the 
first first (1,2) item in t1 matches (1,2) and (1,3) in t2. And same for 
(1,3) in t1. So that's 4 combinations. Of course, we could aggregate the 
MCV by ignoring columns not used in the query.

> and since we handle the eqjoin as well, we even can record the items as
> 
> (1, 2) -> 0.1 * 0.2
> (1, 3) -> 0.2 * 0.1;
> 
> About when we should maintain the JoinRel's MCV data, rather than
> maintain this just
> after the JoinRel size is estimated, we can only estimate it when it
> is needed.  for example:
> 
> select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
>                           join t3 on (t2.c = t3.c and t2.d = t3.d);
> 
> we don't need to maintain the MCV on (t1, t2, t3)  since no others
> need it at all. However
> I don't check code too closely to see if it (Lazing computing MVC on
> joinrel) is convenient
> to do.
> 

I'm not sure I understand what you're proposing here.

However, I think that estimating it for pairs has two advantages:

1) Combining MCVs for k relations requires k for loops. Processing 2 
relations at a time limits the amount of CPU we need. Of course, this 
assumes the joins are independent, which may or may not be true.

2) It seems fairly easy to combine different types of statistics 
(regular, extended, ...), and also consider the part not represented by 
MCV. It seems much harder when joining more than 2 relations.

I'm also worried about amplification of errors - I suspect attempting to 
build the joint MCV for the whole join relation may produce significant 
estimation errors.

Furthermore, I think joins with clauses referencing more than just two 
relations are fairly uncommon. And we can always improve the feature in 
this direction in the future.

> 
>> But I haven't done that yet, as it requires the MCVs to be combined
>> using the join clauses (overlap in a way), but I'm not sure how likely
>> that is in practice. In the example it could help, but that's a bit
>> artificial example.
>>
>>
>> 4) still just inner equi-joins
>>
>> I haven't done any work on extending this to outer joins etc. Adding
>> outer and semi joins should not be complicated, mostly copying and
>> tweaking what eqjoinsel() does.
>>
> 
> Overall, thanks for the feature and I am expecting there are more cases
> to handle during discussion.  To make the review process more efficient,
> I suggest that we split the patch into smaller ones and review/commit them
> separately if we have finalized the design roughly . For example:
> 
> Patch 1 -- required both sides to have extended statistics.
> Patch 2 -- required one side to have extended statistics and the other side had
> per-column MCV.
> Patch 3 -- handle the case like  WHERE t1.a = t2.a  and t1.b = Const;
> Patch 3 -- handle the case for 3+ table joins.
> Patch 4 -- supports the outer join.
> 
> I think we can do this if we are sure that each individual patch would work in
> some cases and would not make any other case worse.  If you agree with this,
> I can do that splitting work during my review process.
> 

I'll consider splitting it like this, but I'm not sure it makes the main 
patch that much smaller.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: using extended statistics to improve join estimates

From
Tomas Vondra
Date:
Hi,

Here's an updated patch, rebased and fixing a couple typos reported by 
Justin Pryzby directly.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: using extended statistics to improve join estimates

From
Andres Freund
Date:
On 2022-01-01 18:21:06 +0100, Tomas Vondra wrote:
> Here's an updated patch, rebased and fixing a couple typos reported by
> Justin Pryzby directly.

FWIW, cfbot reports a few compiler warnings:

https://cirrus-ci.com/task/6067262669979648?logs=gcc_warning#L505
[18:52:15.132] time make -s -j${BUILD_JOBS} world-bin
[18:52:22.697] mcv.c: In function ‘mcv_combine_simple’:
[18:52:22.697] mcv.c:2787:7: error: ‘reverse’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
[18:52:22.697]  2787 |    if (reverse)
[18:52:22.697]       |       ^
[18:52:22.697] mcv.c:2766:27: error: ‘index’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
[18:52:22.697]  2766 |   if (mcv->items[i].isnull[index])
[18:52:22.697]       |                           ^

Greetings,

Andres Freund



Re: using extended statistics to improve join estimates

From
Julien Rouhaud
Date:
Hi,

On Tue, Jan 04, 2022 at 03:55:50PM -0800, Andres Freund wrote:
> On 2022-01-01 18:21:06 +0100, Tomas Vondra wrote:
> > Here's an updated patch, rebased and fixing a couple typos reported by
> > Justin Pryzby directly.
> 
> FWIW, cfbot reports a few compiler warnings:

Also the patch doesn't apply anymore:

http://cfbot.cputube.org/patch_36_3055.log
=== Applying patches on top of PostgreSQL commit ID 74527c3e022d3ace648340b79a6ddec3419f6732 ===
=== applying patch ./0001-Estimate-joins-using-extended-statistics-20220101.patch
patching file src/backend/optimizer/path/clausesel.c
patching file src/backend/statistics/extended_stats.c
Hunk #1 FAILED at 30.
Hunk #2 succeeded at 102 (offset 1 line).
Hunk #3 succeeded at 2619 (offset 9 lines).
1 out of 3 hunks FAILED -- saving rejects to file src/backend/statistics/extended_stats.c.rej



Re: using extended statistics to improve join estimates

From
Justin Pryzby
Date:
On Wed, Jan 19, 2022 at 06:18:09PM +0800, Julien Rouhaud wrote:
> On Tue, Jan 04, 2022 at 03:55:50PM -0800, Andres Freund wrote:
> > On 2022-01-01 18:21:06 +0100, Tomas Vondra wrote:
> > > Here's an updated patch, rebased and fixing a couple typos reported by
> > > Justin Pryzby directly.
> > 
> > FWIW, cfbot reports a few compiler warnings:
> 
> Also the patch doesn't apply anymore:
> 
> http://cfbot.cputube.org/patch_36_3055.log
> === Applying patches on top of PostgreSQL commit ID 74527c3e022d3ace648340b79a6ddec3419f6732 ===
> === applying patch ./0001-Estimate-joins-using-extended-statistics-20220101.patch
> patching file src/backend/optimizer/path/clausesel.c
> patching file src/backend/statistics/extended_stats.c
> Hunk #1 FAILED at 30.
> Hunk #2 succeeded at 102 (offset 1 line).
> Hunk #3 succeeded at 2619 (offset 9 lines).
> 1 out of 3 hunks FAILED -- saving rejects to file src/backend/statistics/extended_stats.c.rej

Rebased over 269b532ae and muted compiler warnings.

Tomas - is this patch viable for pg15 , or should move to the next CF ?

In case it's useful, I ran this on cirrus with my branch for code coverage.
https://cirrus-ci.com/task/5816731397521408
https://api.cirrus-ci.com/v1/artifact/task/5816731397521408/coverage/coverage/00-index.html

statext_find_matching_mcv() has poor coverage.
statext_clauselist_join_selectivity() has poor coverage for the "stats2" case.

In mcv.c: mcv_combine_extended() and mcv_combine_simple() have poor coverage
for the "else if" cases (does it matter?)

Not related to this patch:
build_attnums_array() isn't being hit.

Same at statext_is_compatible_clause_internal()
    1538           0 :         *exprs = lappend(*exprs, clause);

statext_mcv_[de]serialize() aren't being hit for cstrings.

-- 
Justin



Re: using extended statistics to improve join estimates

From
Justin Pryzby
Date:
On Wed, Mar 02, 2022 at 11:38:21AM -0600, Justin Pryzby wrote:
> Rebased over 269b532ae and muted compiler warnings.

And attached.

Attachment

Re: using extended statistics to improve join estimates

From
Andy Fan
Date:
> On Wed, Mar 02, 2022 at 11:38:21AM -0600, Justin Pryzby wrote:
>> Rebased over 269b532ae and muted compiler warnings.

Thank you Justin for the rebase!

Hello Tomas,

Thanks for the patch! Before I review the path at the code level, I want
to explain my understanding about this patch first.

Before this patch, we already use MCV information for the eqjoinsel, it
works as combine the MCV on the both sides to figure out the mcv_freq
and then treat the rest equally, but this doesn't work for MCV in
extended statistics, this patch fill this gap. Besides that, since
extended statistics means more than 1 columns are involved, if 1+
columns are Const based on RestrictInfo, we can use such information to
filter the MCVs we are interesting, that's really cool. 

I did some more testing, all of them are inner join so far, all of them
works amazing and I am suprised this patch didn't draw enough
attention. I will test more after I go though the code.

At for the code level, I reviewed them in the top-down manner and almost
40% completed. Here are some findings just FYI. For efficiency purpose,
I provide each feedback with a individual commit, after all I want to
make sure my comment is practical and coding and testing is a good way
to archive that. I tried to make each of them as small as possible so
that you can reject or accept them convinently.

0001 is your patch, I just rebase them against the current master. 0006
is not much relevant with current patch, and I think it can be committed
individually if you are OK with that.

Hope this kind of review is helpful.

-- 
Best Regards
Andy Fan


Attachment

Re: using extended statistics to improve join estimates

From
Tomas Vondra
Date:
On 4/2/24 10:23, Andy Fan wrote:
> 
>> On Wed, Mar 02, 2022 at 11:38:21AM -0600, Justin Pryzby wrote:
>>> Rebased over 269b532ae and muted compiler warnings.
> 
> Thank you Justin for the rebase!
> 
> Hello Tomas,
> 
> Thanks for the patch! Before I review the path at the code level, I want
> to explain my understanding about this patch first.
> 

If you want to work on this patch, that'd be cool. A review would be
great, but if you want to maybe take over and try moving it forward,
that'd be even better. I don't know when I'll have time to work on it
again, but I'd promise to help you with working on it.

> Before this patch, we already use MCV information for the eqjoinsel, it
> works as combine the MCV on the both sides to figure out the mcv_freq
> and then treat the rest equally, but this doesn't work for MCV in
> extended statistics, this patch fill this gap. Besides that, since
> extended statistics means more than 1 columns are involved, if 1+
> columns are Const based on RestrictInfo, we can use such information to
> filter the MCVs we are interesting, that's really cool. 
> 

Yes, I think that's an accurate description of what the patch does.

> I did some more testing, all of them are inner join so far, all of them
> works amazing and I am suprised this patch didn't draw enough
> attention. I will test more after I go though the code.
> 

I think it didn't go forward for a bunch of reasons:

1) I got distracted by something else requiring immediate attention, and
forgot about this patch.

2) I got stuck on some detail of the patch, unsure which of the possible
solutions to try first.

3) Uncertainty about how applicable the patch is in practice.

I suppose it was some combination of these reasons, not sure.


As for the "practicality" mentioned in (3), it's been a while since I
worked on the patch so I don't recall the details, but I think I've been
thinking mostly about "start join" queries, where a big "fact" table
joins to small dimensions. And in that case the fact table may have a
MCV, but the dimensions certainly don't have any (because the join
happens on a PK).

But maybe that's a wrong way to think about it - it was clearly useful
to consider the case with (per-attribute) MCVs on both sides as worth
special handling. So why not to do that for multi-column MCVs, right?

> At for the code level, I reviewed them in the top-down manner and almost
> 40% completed. Here are some findings just FYI. For efficiency purpose,
> I provide each feedback with a individual commit, after all I want to
> make sure my comment is practical and coding and testing is a good way
> to archive that. I tried to make each of them as small as possible so
> that you can reject or accept them convinently.
> 
> 0001 is your patch, I just rebase them against the current master. 0006
> is not much relevant with current patch, and I think it can be committed
> individually if you are OK with that.
> 
> Hope this kind of review is helpful.
> 

Cool! There's obviously no chance to get this into v18, and I have stuff
to do in this CF. But I'll take a look after that.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: using extended statistics to improve join estimates

From
Andy Fan
Date:
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

> On 4/2/24 10:23, Andy Fan wrote:
>> 
>>> On Wed, Mar 02, 2022 at 11:38:21AM -0600, Justin Pryzby wrote:
>>>> Rebased over 269b532ae and muted compiler warnings.
>> 
>> Thank you Justin for the rebase!
>> 
>> Hello Tomas,
>> 
>> Thanks for the patch! Before I review the path at the code level, I want
>> to explain my understanding about this patch first.
>> 
>
> If you want to work on this patch, that'd be cool. A review would be
> great, but if you want to maybe take over and try moving it forward,
> that'd be even better. I don't know when I'll have time to work on it
> again, but I'd promise to help you with working on it.

OK, I'd try to moving it forward.

>
>> Before this patch, we already use MCV information for the eqjoinsel, it
>> works as combine the MCV on the both sides to figure out the mcv_freq
>> and then treat the rest equally, but this doesn't work for MCV in
>> extended statistics, this patch fill this gap. Besides that, since
>> extended statistics means more than 1 columns are involved, if 1+
>> columns are Const based on RestrictInfo, we can use such information to
>> filter the MCVs we are interesting, that's really cool. 
>> 
>
> Yes, I think that's an accurate description of what the patch does.

Great to know that:)

>
>> I did some more testing, all of them are inner join so far, all of them
>> works amazing and I am suprised this patch didn't draw enough
>> attention.

> I think it didn't go forward for a bunch of reasons:
>
..
>
> 3) Uncertainty about how applicable the patch is in practice.
>
> I suppose it was some combination of these reasons, not sure.
>
> As for the "practicality" mentioned in (3), it's been a while since I
> worked on the patch so I don't recall the details, but I think I've been
> thinking mostly about "start join" queries, where a big "fact" table
> joins to small dimensions. And in that case the fact table may have a
> MCV, but the dimensions certainly don't have any (because the join
> happens on a PK).
>
> But maybe that's a wrong way to think about it - it was clearly useful
> to consider the case with (per-attribute) MCVs on both sides as worth
> special handling. So why not to do that for multi-column MCVs, right?

Yes, that's what my current understanding is.

There are some cases where there are 2+ clauses between two tables AND
the rows estimiation is bad AND the plan is not the best one. In such
sisuations, I'd think this patch probably be helpful. The current case
in hand is PG11, there is no MCV information for extended statistics, so
I even can't verify the patch here is useful or not manually. When I see
them next time in a newer version of PG, I can verity it manually to see
if the rows estimation can be better.  

>> At for the code level, I reviewed them in the top-down manner and almost
>> 40% completed. Here are some findings just FYI. For efficiency purpose,
>> I provide each feedback with a individual commit, after all I want to
>> make sure my comment is practical and coding and testing is a good way
>> to archive that. I tried to make each of them as small as possible so
>> that you can reject or accept them convinently.
>> 
>> 0001 is your patch, I just rebase them against the current master. 0006
>> is not much relevant with current patch, and I think it can be committed
>> individually if you are OK with that.
>> 
>> Hope this kind of review is helpful.
>> 
>
> Cool! There's obviously no chance to get this into v18, and I have stuff
> to do in this CF. But I'll take a look after that.

Good to know that. I will continue my work before that. 

-- 
Best Regards
Andy Fan




Re: using extended statistics to improve join estimates

From
Justin Pryzby
Date:
On Tue, Apr 02, 2024 at 04:23:45PM +0800, Andy Fan wrote:
> 
> 0001 is your patch, I just rebase them against the current master. 0006
> is not much relevant with current patch, and I think it can be committed
> individually if you are OK with that.

Your 002 should also remove listidx to avoid warning
../src/backend/statistics/extended_stats.c:2879:8: error: variable 'listidx' set but not used
[-Werror,-Wunused-but-set-variable]

> Subject: [PATCH v1 2/8] Remove estimiatedcluases and varRelid arguments

> @@ -2939,15 +2939,11 @@ statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
>          /* needs to happen before skipping any clauses */
>          listidx++;
>  
> -        /* Skip clauses that were already estimated. */
> -        if (bms_is_member(listidx, estimatedclauses))
> -            continue;
> -

Your 007 could instead test if relids == NULL:

> Subject: [PATCH v1 7/8] bms_is_empty is more effective than bms_num_members(b)
>-       if (bms_num_members(relids) == 0)
>+       if (bms_is_empty(relids))

typos:
001: s/heuristict/heuristics/
002: s/grantee/guarantee/
002: s/estimiatedcluases/estimatedclauses/

It'd be nice to fix/silence these warnings from 001:

|../src/backend/statistics/extended_stats.c:3151:36: warning: ‘relid’ may be used uninitialized
[-Wmaybe-uninitialized]
| 3151 |                                 if (var->varno != relid)
|      |                                    ^
|../src/backend/statistics/extended_stats.c:3104:33: note: ‘relid’ was declared here
| 3104 |         int                     relid;
|      |                                 ^~~~~
|[1016/1893] Compiling C object src/backend/postgres_lib.a.p/statistics_mcv.c.o
|../src/backend/statistics/mcv.c: In function ‘mcv_combine_extended’:
|../src/backend/statistics/mcv.c:2431:49: warning: declaration of ‘idx’ shadows a previous local
[-Wshadow=compatible-local]

FYI, I also ran the patch with a $large number of reports without
observing any errors or crashes.

I'll try to look harder at the next patch revision.

-- 
Justin



Re: using extended statistics to improve join estimates

From
Andy Fan
Date:
Hello Tomas!

>>> At for the code level, I reviewed them in the top-down manner and almost
>>> 40% completed. Here are some findings just FYI. For efficiency purpose,
>>> I provide each feedback with a individual commit, after all I want to
>>> make sure my comment is practical and coding and testing is a good way
>>> to archive that. I tried to make each of them as small as possible so
>>> that you can reject or accept them convinently.
>>> 
>>> 0001 is your patch, I just rebase them against the current master. 0006
>>> is not much relevant with current patch, and I think it can be committed
>>> individually if you are OK with that.
>>> 
>>> Hope this kind of review is helpful.
>>> 
>>
>> Cool! There's obviously no chance to get this into v18, and I have stuff
>> to do in this CF. But I'll take a look after that.
>
> Good to know that. I will continue my work before that. 

I have completed my code level review and modification. These individual
commits and message probably be helpful for discussion. 

-- 
Best Regards
Andy Fan


Attachment

Re: using extended statistics to improve join estimates

From
Andy Fan
Date:
Hello Justin!

Justin Pryzby <pryzby@telsasoft.com> writes:


> |../src/backend/statistics/extended_stats.c:3151:36: warning: ‘relid’ may be used uninitialized
[-Wmaybe-uninitialized]
> | 3151 |                                 if (var->varno != relid)
> |      |                                    ^
> |../src/backend/statistics/extended_stats.c:3104:33: note: ‘relid’ was declared here
> | 3104 |         int                     relid;
> |      |                                 ^~~~~
> |[1016/1893] Compiling C object src/backend/postgres_lib.a.p/statistics_mcv.c.o
> |../src/backend/statistics/mcv.c: In function ‘mcv_combine_extended’:
> |../src/backend/statistics/mcv.c:2431:49: warning: declaration of ‘idx’ shadows a previous local
[-Wshadow=compatible-local]

Thanks for the feedback, the warnning should be fixed in the lastest
revision and 's/estimiatedcluases/estimatedclauses/' typo error in the
commit message is not fixed since I have to regenerate all the commits
to fix that. We are still in dicussion stage and I think these impact is
pretty limited on dicussion.

> FYI, I also ran the patch with a $large number of reports without
> observing any errors or crashes.

Good to know that.

> I'll try to look harder at the next patch revision.

Thank you!

--
Best Regards
Andy Fan




Re: using extended statistics to improve join estimates

From
Justin Pryzby
Date:
On Sun, Apr 28, 2024 at 10:07:01AM +0800, Andy Fan wrote:
> 's/estimiatedcluases/estimatedclauses/' typo error in the
> commit message is not fixed since I have to regenerate all the commits

Maybe you know this, but some of these patches need to be squashed.
Regenerating the patches to address feedback is the usual process.
When they're not squished, it makes it hard to review the content of the
patches.

For example:
[PATCH v1 18/22] Fix error "unexpected system attribute" when join with system attr
..adds .sql regression tests, but the expected .out isn't updated until
[PATCH v1 19/22] Fix the incorrect comment on extended stats.

That fixes an elog() in Tomas' original commit, so it should probably be
002 or 003.  It might make sense to keep the first commit separate for
now, since it's nice to keep Tomas' original patch "pristine" to make
more apparent the changes you're proposing.

Another:
[PATCH v1 20/22] Add fastpath when combine the 2 MCV like eqjoinsel_inner.
..doesn't compile without
[PATCH v1 21/22] When mcv->ndimensions == list_length(clauses), handle it same as

Your 022 patch fixes a typo in your 002 patch, which means that first
one reads a patch with a typo, and then later, a 10 line long patch
reflowing the comment with a typo fixed.

A good guideline is that each patch should be self-contained, compiling
and passing tests.  Which is more difficult with a long stack of
patches.

-- 
Justin



Re: using extended statistics to improve join estimates

From
Andy Fan
Date:
Hello Justin,

Thanks for showing interest on this!

> On Sun, Apr 28, 2024 at 10:07:01AM +0800, Andy Fan wrote:
>> 's/estimiatedcluases/estimatedclauses/' typo error in the
>> commit message is not fixed since I have to regenerate all the commits
>
> Maybe you know this, but some of these patches need to be squashed.
> Regenerating the patches to address feedback is the usual process.
> When they're not squished, it makes it hard to review the content of the
> patches.

You might overlooked the fact that the each individual commit is just to
make the communication effectively (easy to review) and all of them
will be merged into 1 commit at the last / during the process of review. 

Even though if something make it hard to review, I am pretty happy to
regenerate the patches, but does 's/estimiatedcluases/estimatedclauses/'
belongs to this category? I'm pretty sure that is not the only typo
error or inapproprate word, if we need to regenerate the 22 patches
because of that, we have to regenerate that pretty often.

Do you mind to provide more feedback once and I can merge all of them in
one modification or you think the typo error has blocked the review
process?

>
> For example:
> [PATCH v1 18/22] Fix error "unexpected system attribute" when join with system attr
> ..adds .sql regression tests, but the expected .out isn't updated until
> [PATCH v1 19/22] Fix the incorrect comment on extended stats.
>
> That fixes an elog() in Tomas' original commit, so it should probably be
> 002 or 003.

Which elog are you talking about?

> It might make sense to keep the first commit separate for
> now, since it's nice to keep Tomas' original patch "pristine" to make
> more apparent the changes you're proposing.

This is my goal as well, did you find anything I did which break this
rule, that's absoluately not my intention.

> Another:
> [PATCH v1 20/22] Add fastpath when combine the 2 MCV like eqjoinsel_inner.
> ..doesn't compile without
> [PATCH v1 21/22] When mcv->ndimensions == list_length(clauses), handle it same as
>
> Your 022 patch fixes a typo in your 002 patch, which means that first
> one reads a patch with a typo, and then later, a 10 line long patch
> reflowing the comment with a typo fixed.

I would like to regenerate the 22 patches if you think the typo error
make the reivew process hard. I can do such things but not willing to
do that often.

>
> A good guideline is that each patch should be self-contained, compiling
> and passing tests.  Which is more difficult with a long stack of
> patches.

I agree.

-- 
Best Regards
Andy Fan




Re: using extended statistics to improve join estimates

From
Andrei Lepikhov
Date:
On 4/3/24 01:22, Tomas Vondra wrote:
> Cool! There's obviously no chance to get this into v18, and I have stuff
> to do in this CF. But I'll take a look after that.
I'm looking at your patch now - an excellent start to an eagerly awaited 
feature!
A couple of questions:
1. I didn't find the implementation of strategy 'c' - estimation by the 
number of distinct values. Do you forget it?
2. Can we add a clauselist selectivity hook into the core (something 
similar the code in attachment)? It can allow the development and 
testing of multicolumn join estimations without patching the core.

-- 
regards,
Andrei Lepikhov
Postgres Professional

Attachment

Re: using extended statistics to improve join estimates

From
Andy Fan
Date:
Hi Andrei,

> On 4/3/24 01:22, Tomas Vondra wrote:
>> Cool! There's obviously no chance to get this into v18, and I have stuff
>> to do in this CF. But I'll take a look after that.
> I'm looking at your patch now - an excellent start to an eagerly awaited
> feature!
> A couple of questions:
> 1. I didn't find the implementation of strategy 'c' - estimation by the
> number of distinct values. Do you forget it?

What do you mean the "strategy 'c'"?  

> 2. Can we add a clauselist selectivity hook into the core (something
> similar the code in attachment)? It can allow the development and
> testing of multicolumn join estimations without patching the core.

The idea LGTM. But do you want 

+    if (clauselist_selectivity_hook)
+        s1 = clauselist_selectivity_hook(root, clauses, varRelid, jointype,
+

rather than

+    if (clauselist_selectivity_hook)
+        *return* clauselist_selectivity_hook(root, clauses, ..)


?

-- 
Best Regards
Andy Fan




Re: using extended statistics to improve join estimates

From
Andrei Lepikhov
Date:
On 20/5/2024 15:52, Andy Fan wrote:
> 
> Hi Andrei,
> 
>> On 4/3/24 01:22, Tomas Vondra wrote:
>>> Cool! There's obviously no chance to get this into v18, and I have stuff
>>> to do in this CF. But I'll take a look after that.
>> I'm looking at your patch now - an excellent start to an eagerly awaited
>> feature!
>> A couple of questions:
>> 1. I didn't find the implementation of strategy 'c' - estimation by the
>> number of distinct values. Do you forget it?
> 
> What do you mean the "strategy 'c'"?
As described in 0001-* patch:
* c) No extended stats with MCV. If there are multiple join clauses,
* we can try using ndistinct coefficients and do what eqjoinsel does.

> 
>> 2. Can we add a clauselist selectivity hook into the core (something
>> similar the code in attachment)? It can allow the development and
>> testing of multicolumn join estimations without patching the core.
> 
> The idea LGTM. But do you want
> 
> +    if (clauselist_selectivity_hook)
> +        s1 = clauselist_selectivity_hook(root, clauses, varRelid, jointype,
> +
> 
> rather than
> 
> +    if (clauselist_selectivity_hook)
> +        *return* clauselist_selectivity_hook(root, clauses, ..)
Of course - library may estimate not all the clauses - it is a reason, 
why I added input/output parameter 'estimatedclauses' by analogy with 
statext_clauselist_selectivity.

-- 
regards,
Andrei Lepikhov
Postgres Professional




Re: using extended statistics to improve join estimates

From
Andrei Lepikhov
Date:
On 5/20/24 16:40, Andrei Lepikhov wrote:
> On 20/5/2024 15:52, Andy Fan wrote:
>> +    if (clauselist_selectivity_hook)
>> +        *return* clauselist_selectivity_hook(root, clauses, ..)
> Of course - library may estimate not all the clauses - it is a reason, 
> why I added input/output parameter 'estimatedclauses' by analogy with 
> statext_clauselist_selectivity.
Here is a polished and a bit modified version of the hook proposed.
Additionally, I propose exporting the statext_mcv_clauselist_selectivity 
routine, likewise dependencies_clauselist_selectivity. This could 
potentially enhance the functionality of the PostgreSQL estimation code.

To clarify the purpose, I want an optional, loaded as a library, more 
conservative estimation based on distinct statistics. Let's provide (a 
bit degenerate) example:

CREATE TABLE is_test(x1 integer, x2 integer, x3 integer, x4 integer);
INSERT INTO is_test (x1,x2,x3,x4)
    SELECT x%5,x%7,x%11,x%13 FROM generate_series(1,1E3) AS x;
INSERT INTO is_test (x1,x2,x3,x4)
    SELECT 14,14,14,14 FROM generate_series(1,100) AS x;
CREATE STATISTICS ist_stat (dependencies,ndistinct)
   ON x1,x2,x3,x4 FROM is_test;
ANALYZE is_test;
EXPLAIN (ANALYZE, COSTS ON, SUMMARY OFF, TIMING OFF)
SELECT * FROM is_test WHERE x1=14 AND x2=14 AND x3=14 AND x4=14;
DROP TABLE is_test CASCADE;

I see:
(cost=0.00..15.17 rows=3 width=16) (actual rows=100 loops=1)

Dependency works great if it is the same for all the data in the 
columns. But we get underestimations if we have different laws for 
subsets of rows. So, if we don't have MCV statistics, sometimes we need 
to pass over dependency statistics and use ndistinct instead.

-- 
regards,
Andrei Lepikhov
Postgres Professional

Attachment

Re: using extended statistics to improve join estimates

From
Andy Fan
Date:
Andrei Lepikhov <a.lepikhov@postgrespro.ru> writes:

> On 20/5/2024 15:52, Andy Fan wrote:
>> Hi Andrei,
>> 
>>> On 4/3/24 01:22, Tomas Vondra wrote:
>>>> Cool! There's obviously no chance to get this into v18, and I have stuff
>>>> to do in this CF. But I'll take a look after that.
>>> I'm looking at your patch now - an excellent start to an eagerly awaited
>>> feature!
>>> A couple of questions:
>>> 1. I didn't find the implementation of strategy 'c' - estimation by the
>>> number of distinct values. Do you forget it?
>> What do you mean the "strategy 'c'"?
> As described in 0001-* patch:
> * c) No extended stats with MCV. If there are multiple join clauses,
> * we can try using ndistinct coefficients and do what eqjoinsel does.

OK, I didn't pay enough attention to this comment before. and yes, I get
the same conclusion as you -  there is no implementation of this.

and if so, I think we should remove the comments and do the
implementation in the next patch. 

>>> 2. Can we add a clauselist selectivity hook into the core (something
>>> similar the code in attachment)? It can allow the development and
>>> testing of multicolumn join estimations without patching the core.
>> The idea LGTM. But do you want
>> +    if (clauselist_selectivity_hook)
>> +        s1 = clauselist_selectivity_hook(root, clauses, varRelid, jointype,
>> +
>> rather than
>> +    if (clauselist_selectivity_hook)
>> +        *return* clauselist_selectivity_hook(root, clauses, ..)
> Of course - library may estimate not all the clauses - it is a reason,
> why I added input/output parameter 'estimatedclauses' by analogy with
> statext_clauselist_selectivity.

OK.

Do you think the hook proposal is closely connected with the current
topic? IIUC it's seems not. So a dedicated thread to explain the problem
to slove and the proposal and the follwing discussion should be helpful
for both topics. I'm just worried about mixing the two in one thread
would make things complexer unnecessarily.

-- 
Best Regards
Andy Fan




Re: using extended statistics to improve join estimates

From
Andrei Lepikhov
Date:
On 5/23/24 09:04, Andy Fan wrote:
> Andrei Lepikhov <a.lepikhov@postgrespro.ru> writes:
>> * c) No extended stats with MCV. If there are multiple join clauses,
>> * we can try using ndistinct coefficients and do what eqjoinsel does.
> 
> OK, I didn't pay enough attention to this comment before. and yes, I get
> the same conclusion as you -  there is no implementation of this.
> 
> and if so, I think we should remove the comments and do the
> implementation in the next patch.
I have an opposite opinion about it:
1. distinct estimation is more universal thing - you can use it 
precisely on any subset of columns.
2. distinct estimation is faster - it just a number, you don't need to 
detoast huge array of values and compare them one-by-one.

So, IMO, it is essential part of join estimation and it should be 
implemented like in eqjoinsel.
> Do you think the hook proposal is closely connected with the current
> topic? IIUC it's seems not. So a dedicated thread to explain the problem
> to slove and the proposal and the follwing discussion should be helpful
> for both topics. I'm just worried about mixing the two in one thread
> would make things complexer unnecessarily.
Sure.

-- 
regards,
Andrei Lepikhov
Postgres Professional




Re: using extended statistics to improve join estimates

From
Tomas Vondra
Date:
Hi,

I finally got to do a review of the reworked patch series. For the most
part I do like the changes, although I'm not 100% sure about some of
them. I do like that the changes have been kept in separate patches,
which makes it much easier to understand what the goal is etc. But it's
probably time to start merging some of the patches back into the main
patch - it's a bit tedious work with 22 patches.

Note: This needs to be added to the next CF, so that we get cfbot
results and can focus on it in 2024-07. Also, I'd attach the patches
directly, not as .tar.

I did go though the patches one by one, and did a review for each of
them separately. I only had a couple hours for this today, so it's not
super-deep review, more a start for a discussion / asking questions.

For each patch I added a "review" and "pgindent" where review is my
comments, pgindent is the changes pgindent would do (which we now expect
to happen before commit). In hindsight I should have skipped the
pgindent, it made it a more tedious with little benefits. But I realized
that half-way through the series, so it was easier to just continue.

Let me quickly go through the original parts - most of this is already
in the "review" patches, but it's better to quote the main points here
to start a discussion. I'll omit some of the smaller suggestions, so
please look at the 'review' patches.


v20240617-0001-Estimate-joins-using-extended-statistics.patch

- rewords a couple comments, particularly for statext_find_matching_mcv

- a couple XXX comments about possibly stale/inaccurate comments

- suggestion to improve statext_determine_join_restrictions, but we one
of the later patches already does the caching


v20240617-0004-Remove-estimiatedcluases-and-varRelid-argu.patch

- I'm not sure we actually should do this (esp. the removal of
estimatedclauses bitmap). It breaks if we add the new hook.


v20240617-0007-Remove-SpecialJoinInfo-sjinfo-argument.patch
v20240617-0009-Remove-joinType-argument.patch

- I'm skeptical about removing these two. Yes, the current code does not
actually use those fields, but selfuncs.c always passes both  jointype
and sjinfo, so maybe we should do that too for consistency. What happens
if we end up wanting to call an existing selfuncs function that needs
these parameters in the future? Say because we want to call the regular
join estimator, and then apply some "correction" to the result?


v20240617-0011-use-the-pre-calculated-RestrictInfo-left-r.patch

- why not to keep the BMS_MULTIPLE check on clause_relids, seems cheap
so maybe we could do it before the more expensive stuff?


v20240617-0014-Fast-path-for-general-clauselist_selectivi.patch

- Does this actually make meaningful difference?


v20240617-0017-a-branch-of-updates-around-JoinPairInfo.patch

- Can we actually assume the clause has a RestrictInfo on top? IIRC
there are cases where we can get here without it (e.g. AND clause?).


v20240617-0020-Cache-the-result-of-statext_determine_join.patch

- This addresses some of my suggestions in 0001, but I think we don't
actually need to recalculate both lists in each loop.


v20240617-0030-optimize-the-order-of-mcv-equal-function-e.patch

-  There's no explanation to support this optimization. I guess I know
what it tries to do, but doesn't it have the same issues withu
npredictable behavior like the GROUP BY patch, which ended up reverting
and reworking?

- modifies .sql test but not the expected output

- The McvProc name seems a bit misleading. I think it's really "procs",
for example.


v20240617-0033-Merge-3-palloc-into-1-palloc.patch

- Not sure. It's presented as an optimization to save on palloc calls,
but I doubt that's measurable. Maybe it makes it a little bit more
readable, but now I'm not convinced it's worth it.


v20240617-0036-Remove-2-pull_varnos-calls-with-rinfo-left.patch

- Again, can we rely on this always getting a RestrictInfo? Maybe we do,
but it's not obvious to me, so a comment explaining that would be nice.
And maybe an assert to check this.


v20240617-0040-some-code-refactor-as-before.patch

- Essentially applies earlier refactorings/tweaks to another place.

- Seems OK (depending on whether we agree on those changes), but it
seems mostly independent of this patch series. So I'd at least keep it
in a separate patch.


v20240617-0043-Fix-error-unexpected-system-attribute-when.patch

- seems to only tweak the .sql, not expected output

- One of the comments refers to "above limitation" but I'm unsure what
that's about.


v20240617-0048-Add-fastpath-when-combine-the-2-MCV-like-e.patch
v20240617-0050-When-mcv-ndimensions-list_length-clauses-h.patch

- I'm not sure about one of the opimizations, relying on having a clause
for each dimensions of the MCV.


v20240617-0054-clauselist_selectivity_hook.patch

- I believe this does not work with the earlier patch that removed
estimatedclaused bitmap from the "try" function.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: using extended statistics to improve join estimates

From
Andrei Lepikhov
Date:
On 17/6/2024 18:10, Tomas Vondra wrote:
> Let me quickly go through the original parts - most of this is already
> in the "review" patches, but it's better to quote the main points here
> to start a discussion. I'll omit some of the smaller suggestions, so
> please look at the 'review' patches.
> 
> 
> v20240617-0001-Estimate-joins-using-extended-statistics.patch
> 
> - rewords a couple comments, particularly for statext_find_matching_mcv
> 
> - a couple XXX comments about possibly stale/inaccurate comments
> v20240617-0054-clauselist_selectivity_hook.patch
> 
> - I believe this does not work with the earlier patch that removed
> estimatedclaused bitmap from the "try" function.
This patch set is too big to eat at once - it's just challenging to 
invent examples and counterexamples. Can we see these two patches in the 
master and analyse further improvements based on that?

Some thoughts:
You remove verRelid. I have thought about replacing this value with 
RelOptInfo, which would allow extensions (remember selectivity hook) to 
know about the underlying path tree.

The first patch is generally ok, and I vote for having it in the master. 
However, the most harmful case I see most reports about is parameterised 
JOIN on multiple anded clauses. In that case, we have a scan filter on 
something like the below:
x = $1 AND y = $2 AND ...
As I see, current patch doesn't resolve this issue currently.

-- 
regards, Andrei Lepikhov