using extended statistics to improve join estimates - Mailing list pgsql-hackers

From Tomas Vondra
Subject using extended statistics to improve join estimates
Date
Msg-id c8c0ff31-3a8a-7562-bbd3-78b2ec65f16c@enterprisedb.com
Whole thread Raw
Responses Re: using extended statistics to improve join estimates
Re: using extended statistics to improve join estimates
List pgsql-hackers
Hi,

So far the extended statistics are applied only at scan level, i.e. when
estimating selectivity for individual tables. Which is great, but joins
are a known challenge, so let's try doing something about it ...

Konstantin Knizhnik posted a patch [1] using functional dependencies to
improve join estimates in January. It's an interesting approach, but as
I explained in that thread I think we should try a different approach,
similar to how we use MCV lists without extended stats. We'll probably
end up considering functional dependencies too, but probably only as a
fallback (similar to what we do for single-relation estimates).

This is a PoC demonstrating the approach I envisioned. It's incomplete
and has various limitations:

- no expression support, just plain attribute references
- only equality conditions
- requires MCV lists on both sides
- inner joins only

All of this can / should be relaxed later, of course. But for a PoC this
seems sufficient.

The basic principle is fairly simple, and mimics what eqjoinsel_inner
does. Assume we have a query:

  SELECT * FROM t1 JOIN t2 ON (t1.a = t2.a AND t1.b = t2.b)

If we have MCV lists on (t1.a,t1.b) and (t2.a,t2.b) then we can use the
same logic as eqjoinsel_inner and "match" them together. If the MCV list
is "larger" - e.g. on (a,b,c) - then it's a bit more complicated, but
the general idea is the same.

To demonstrate this, consider a very simple example with a table that
has a lot of dependency between the columns:

==================================================================

CREATE TABLE t (a INT, b INT, c INT, d INT);
INSERT INTO t SELECT mod(i,100), mod(i,100), mod(i,100), mod(i,100)
  FROM generate_series(1,100000) s(i);
ANALYZE t;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);

CREATE STATISTICS s (mcv, ndistinct) ON a,b,c,d FROM t;
ANALYZE t;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);

ALTER STATISTICS s SET STATISTICS 10000;
ANALYZE t;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);

==================================================================

The results look like this:

- actual rows:             100000000
- estimated (no stats):      1003638
- estimated (stats, 100):  100247844
- estimated (stats, 10k):  100000000

So, in this case the extended stats help quite a bit, even with the
default statistics target.

However, there are other things we can do. For example, we can use
restrictions (at relation level) as "conditions" to filter the MCV lits,
and calculate conditional probability. This is useful even if there's
just a single join condition (on one column), but there are dependencies
between that column and the other filters. Or maybe when there are
filters between conditions on the two sides.

Consider for example these two queries:

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b)
 WHERE t1.c < 25 AND t2.d < 25;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b)
 WHERE t1.c < 25 AND t2.d > 75;

In this particular case we know that (a = b = c = d) so the two filters
are somewhat redundant. The regular estimates will ignore that, but with
MCV we can actually detect that - when we combine the two MCV lists, we
essentially calculate MCV (a,b,t1.c,t2.d), and use that.

                                  Q1          Q2
- actual rows:              25000000           0
- estimated (no stats):        62158       60241
- estimated (stats, 100):   25047900           1
- estimated (stats, 10k):   25000000           1

Obviously, the accuracy depends on how representative the MCV list is
(what fraction of the data it represents), and in this case it works
fairly nicely. A lot of the future work will be about handling cases
when it represents only part of the data.

The attached PoC patch has a number of FIXME and XXX, describing stuff I
ignored to keep it simple, possible future improvement. And so on.


regards


[1]
https://www.postgresql.org/message-id/flat/71d67391-16a9-3e5e-b5e4-8f7fd32cc1b2@postgrespro.ru

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

pgsql-hackers by date:

Previous
From: Mark Dilger
Date:
Subject: Re: pg_amcheck contrib application
Next
From: Robert Haas
Date:
Subject: Re: making update/delete of inheritance trees scale better