Re: using extended statistics to improve join estimates - Mailing list pgsql-hackers

From Zhihong Yu
Subject Re: using extended statistics to improve join estimates
Date
Msg-id CALNJ-vSOptX3bF_gDE+fgb0=3FFDRAnP5QpCq7cW-cw5ua0J4w@mail.gmail.com
Whole thread Raw
In response to using extended statistics to improve join estimates  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List pgsql-hackers
Hi,

+ * has_matching_mcv
+ *     Check whether the list contains statistic of a given kind

The method name is find_matching_mcv(). It seems the method initially returned bool but later the return type was changed.

+   StatisticExtInfo *found = NULL;

found normally is associated with bool return value. Maybe call the variable matching_mcv or something similar.

+           /* skip items eliminated by restrictions on rel2 */
+           if (conditions2 && !conditions2[j])
+               continue;

Maybe you can add a counter recording the number of non-skipped items for the inner loop. If this counter is 0 after the completion of one iteration, we come out of the outer loop directly.

Cheers

On Wed, Mar 31, 2021 at 10:36 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
Hi,

So far the extended statistics are applied only at scan level, i.e. when
estimating selectivity for individual tables. Which is great, but joins
are a known challenge, so let's try doing something about it ...

Konstantin Knizhnik posted a patch [1] using functional dependencies to
improve join estimates in January. It's an interesting approach, but as
I explained in that thread I think we should try a different approach,
similar to how we use MCV lists without extended stats. We'll probably
end up considering functional dependencies too, but probably only as a
fallback (similar to what we do for single-relation estimates).

This is a PoC demonstrating the approach I envisioned. It's incomplete
and has various limitations:

- no expression support, just plain attribute references
- only equality conditions
- requires MCV lists on both sides
- inner joins only

All of this can / should be relaxed later, of course. But for a PoC this
seems sufficient.

The basic principle is fairly simple, and mimics what eqjoinsel_inner
does. Assume we have a query:

  SELECT * FROM t1 JOIN t2 ON (t1.a = t2.a AND t1.b = t2.b)

If we have MCV lists on (t1.a,t1.b) and (t2.a,t2.b) then we can use the
same logic as eqjoinsel_inner and "match" them together. If the MCV list
is "larger" - e.g. on (a,b,c) - then it's a bit more complicated, but
the general idea is the same.

To demonstrate this, consider a very simple example with a table that
has a lot of dependency between the columns:

==================================================================

CREATE TABLE t (a INT, b INT, c INT, d INT);
INSERT INTO t SELECT mod(i,100), mod(i,100), mod(i,100), mod(i,100)
  FROM generate_series(1,100000) s(i);
ANALYZE t;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);

CREATE STATISTICS s (mcv, ndistinct) ON a,b,c,d FROM t;
ANALYZE t;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);

ALTER STATISTICS s SET STATISTICS 10000;
ANALYZE t;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);

==================================================================

The results look like this:

- actual rows:             100000000
- estimated (no stats):      1003638
- estimated (stats, 100):  100247844
- estimated (stats, 10k):  100000000

So, in this case the extended stats help quite a bit, even with the
default statistics target.

However, there are other things we can do. For example, we can use
restrictions (at relation level) as "conditions" to filter the MCV lits,
and calculate conditional probability. This is useful even if there's
just a single join condition (on one column), but there are dependencies
between that column and the other filters. Or maybe when there are
filters between conditions on the two sides.

Consider for example these two queries:

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b)
 WHERE t1.c < 25 AND t2.d < 25;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b)
 WHERE t1.c < 25 AND t2.d > 75;

In this particular case we know that (a = b = c = d) so the two filters
are somewhat redundant. The regular estimates will ignore that, but with
MCV we can actually detect that - when we combine the two MCV lists, we
essentially calculate MCV (a,b,t1.c,t2.d), and use that.

                                  Q1          Q2
- actual rows:              25000000           0
- estimated (no stats):        62158       60241
- estimated (stats, 100):   25047900           1
- estimated (stats, 10k):   25000000           1

Obviously, the accuracy depends on how representative the MCV list is
(what fraction of the data it represents), and in this case it works
fairly nicely. A lot of the future work will be about handling cases
when it represents only part of the data.

The attached PoC patch has a number of FIXME and XXX, describing stuff I
ignored to keep it simple, possible future improvement. And so on.


regards


[1]
https://www.postgresql.org/message-id/flat/71d67391-16a9-3e5e-b5e4-8f7fd32cc1b2@postgrespro.ru

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

pgsql-hackers by date:

Previous
From: "'alvherre@alvh.no-ip.org'"
Date:
Subject: Re: libpq debug log
Next
From: David Rowley
Date:
Subject: Re: Hybrid Hash/Nested Loop joins and caching results from subplans