From: Tomas Vondra
Subject: PoC: using sampling to estimate joins / complex conditions
Msg-id: 580ba824-7630-e6e8-f80f-725dfa587da5@enterprisedb.com

Hi,

estimating joins is one of the significant gaps in extended statistics,
and I've been regularly asked what we might do about it. This is an
early experimental patch that I think might help us improve this,
possibly even in PG15.

Note: I do not claim this is exactly how it should be implemented, but
it's probably sufficient to demonstrate the pros/cons of various
alternative approaches, etc.

In short, the patch samples the tables and uses those samples to
estimate selectivity for scans and joins. The samples are collected
during planning, which may be quite expensive - random I/O for each
query, etc. It'd be possible to build them during analyze, but that'd
require solving serialization, tweaking CREATE STATISTICS to handle join
queries, etc. I decided to keep the PoC simple.

It still uses CREATE STATISTICS with a new "sample" kind, instructing
the optimizer to use sampling when estimating clauses on the attributes.

A little example demonstrating what the patch does:

  create table t (a int, b int, c int);

  insert into t select mod(i,10), mod(i,20), mod(i,40)
    from generate_series(1,10000000) s(i);

  analyze t;

  -- estimate without any statistics / sampling
  explain analyze select * from t where a = 0 and b = 0 and c = 0;

                             QUERY PLAN
  -------------------------------------------------------------------
   Seq Scan on t  (cost=0.00..229055.00 rows=1361 width=12)
                  (actual time=0.025..761.571 rows=250000 loops=1)
     Filter: ((a = 0) AND (b = 0) AND (c = 0))
     Rows Removed by Filter: 9750000
   Planning Time: 0.471 ms
   Execution Time: 901.182 ms
  (5 rows)

  -- enable sampling on those columns
  create statistics s (sample) on a, b, c from t;

  explain analyze select * from t where a = 0 and b = 0 and c = 0;

                             QUERY PLAN
  -------------------------------------------------------------------
   Seq Scan on t  (cost=0.00..229055.00 rows=250390 width=12)
                  (actual time=0.307..717.937 rows=250000 loops=1)
     Filter: ((a = 0) AND (b = 0) AND (c = 0))
     Rows Removed by Filter: 9750000
   Planning Time: 194.528 ms
   Execution Time: 851.832 ms
  (5 rows)

Of course, in this case an MCV would work well too, because there are
very few combinations in (a,b,c). A sample works even when that's not
the case, and it has various other benefits (it can estimate almost any
expression, while an MCV supports only a subset, etc.).
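
For comparison, the equivalent MCV statistics (which already exist)
could be created like this - the statistics name is arbitrary:

  -- multi-column MCV list on the same columns
  create statistics s_mcv (mcv) on a, b, c from t;

  -- extended statistics are built by the next analyze
  analyze t;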

Now, let's look at a join between a fact and a dimension table:

  create table f (d1 int, d2 int, f1 int, f2 int, f3 int);

  create table d (d1 int, d2 int, d3 int, d4 int, d5 int,
                  primary key (d1, d2));

  insert into d select i, i, mod(i,100), mod(i,100), mod(i,100)
    from generate_series(0,999) s(i);

  insert into f select mod(i,1000), mod(i,1000), mod(i,100), mod(i,100),
                       mod(i,100) from generate_series(1,1000000) s(i);

  analyze f, d;

  explain analyze select * from f join d using (d1,d2)
    where f1 < 50 and f2 < 50 and d3 < 50 and d4 < 50;

                                QUERY PLAN
  ----------------------------------------------------------------------
   Hash Join  (cost=25.75..22717.01 rows=63 width=32)
              (actual time=3.197..861.899 rows=500000 loops=1)
     Hash Cond: ((f.d1 = d.d1) AND (f.d2 = d.d2))
     ->  Seq Scan on f  (cost=0.00..21370.00 rows=251669 width=20)
                        (actual time=0.033..315.401 rows=500000 loops=1)
           Filter: ((f1 < 50) AND (f2 < 50))
           Rows Removed by Filter: 500000
     ->  Hash  (cost=22.00..22.00 rows=250 width=20)
               (actual time=3.139..3.141 rows=500 loops=1)
           Buckets: 1024  Batches: 1  Memory Usage: 34kB
           ->  Seq Scan on d  (cost=0.00..22.00 rows=250 width=20)
                        (actual time=0.018..1.706 rows=500 loops=1)
                 Filter: ((d3 < 50) AND (d4 < 50))
                 Rows Removed by Filter: 500
   Planning Time: 0.806 ms
   Execution Time: 1099.229 ms
  (12 rows)

So, not great - an estimate that's off by ~10000x is likely to lead to
inefficient plans. And now with the samples enabled on both sides:

  create statistics s1 (sample) on d1, d2, f1, f2, f3 from f;
  create statistics s2 (sample) on d1, d2, d3, d4, d5 from d;

  explain analyze select * from f join d using (d1,d2)
    where f1 < 50 and f2 < 50 and d3 < 50 and d4 < 50;

                                QUERY PLAN
  ----------------------------------------------------------------------
   Hash Join  (cost=29.50..24057.25 rows=503170 width=32)
              (actual time=0.630..837.483 rows=500000 loops=1)
     Hash Cond: ((f.d1 = d.d1) AND (f.d2 = d.d2))
     ->  Seq Scan on f  (cost=0.00..21370.00 rows=503879 width=20)
                        (actual time=0.008..301.584 rows=500000 loops=1)
           Filter: ((f1 < 50) AND (f2 < 50))
           Rows Removed by Filter: 500000
     ->  Hash  (cost=22.00..22.00 rows=500 width=20)
               (actual time=0.616..0.618 rows=500 loops=1)
           Buckets: 1024  Batches: 1  Memory Usage: 34kB
           ->  Seq Scan on d  (cost=0.00..22.00 rows=500 width=20)
                             (actual time=0.004..0.321 rows=500 loops=1)
                 Filter: ((d3 < 50) AND (d4 < 50))
                 Rows Removed by Filter: 500
   Planning Time: 603.442 ms
   Execution Time: 1071.735 ms
  (12 rows)

Yes, it takes 600ms to do the sampling, but I'm sure most of this can be
eliminated by optimizing the code and/or storing the samples just like
other types of stats.

Note that most of the ~10000x underestimate is not due to poor estimates
at the scan level - it's mostly due to the join condition having two
correlated clauses. Yes, adding a proper foreign key would probably
improve this (we already leverage this information in planning), but
there can be cross-table correlations between the other conditions, and
the FK can't help with that. Correlations between different dimension
tables are quite common, and sampling can help with those.
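
For the example above, the foreign key could be declared like this (the
constraint name is just an example):

  -- lets the planner assume every f row has exactly one match in d
  alter table f add constraint f_d_fkey
    foreign key (d1, d2) references d (d1, d2);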

Note: There's another PoC patch using multi-column MCVs to improve join
estimates - it has the same limitations as MCVs at the scan level. It
works quite well only when the MCV represents a large part of the data,
and it does not support evaluating arbitrary expressions.


Now, a little bit about the implementation, sampling limitations, etc.

At the scan level, sampling is fairly straightforward - the patch simply
runs a TABLESAMPLE query through SPI, with a sample fraction calculated
from a GUC (estimate_sample_rate, 1% by default) and the statistics
target. The samples may be too large and the calculation may need some
changes, but that's a minor detail, I think. I'm not sure SPI is the
right way to do this, but for a PoC it's good enough.
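
To illustrate, the query built internally is roughly equivalent to the
following - the sampling method and the fraction here are assumptions,
the actual fraction being derived from the GUC and statistics target:

  -- roughly what the patch runs via SPI to sample the scan;
  -- 1.0 (percent) is a placeholder for the computed fraction
  select a, b, c from t tablesample bernoulli (1.0);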

For joins, sampling is way more complicated - we can't sample both
tables randomly, because that'd require huge samples on both sides. As
shown in [3], getting a sample of n rows from a join involving a table
with N rows requires sampling about sqrt(n * N) rows from that table.
For example, getting n = 1000 join rows from a table with N = 1,000,000
rows means sampling sqrt(1000 * 1000000) ~= 32,000 rows. Which is a lot.

So what this patch attempts to do is "correlated sampling", described in
[1] and [3]. Imagine a join on a foreign key, as in the example query.
(The patch only looks for a PK, for simplicity.)

This is a pretty common pattern, especially in star and snowflake
queries, which join a "fact" table to one or more "dimension" tables.

The "correlated" sampling means the "fact" table (side of the join
without the PK) is sampled randomly, but the dimensions are simply
scanned for matching rows. The PK means there can only be one matching
row for each sample one, so we're "enriching" the random sample.
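
In SQL terms the idea looks roughly like this - a sketch of the
principle, not what the patch literally executes:

  -- sample the fact table randomly, then fetch the (at most one)
  -- matching dimension row for each sampled fact row
  select * from f tablesample bernoulli (1) join d using (d1, d2);

The selectivity observed on such an enriched sample can then be
extrapolated to the whole join.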

This is what [1] describes as CS2, including how to extend the approach
to joins without the PK/FK requirement and various corner cases, and [3]
improves that to leverage indexes. [4] discusses several CS2 variations,
addressing problems like reducing the space requirements, etc.

The current PoC patch is, however, very simplistic and naive - for
example, it does not attempt to correlate joins with multiple
dimensions, so when joining F with D1 and then D2, we sample (F,D1) and
then (F,D2) independently. This means we sample F twice, which can be
quite expensive, and it also fails to capture correlations between D1
and D2 (which are common in actual data sets).
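
Conceptually, the PoC currently does the equivalent of two independent
estimates like these (the d1/d2 tables and join keys are hypothetical):

  -- f is sampled twice, and any d1/d2 correlation is lost
  select * from f tablesample bernoulli (1) join d1 using (k1);
  select * from f tablesample bernoulli (1) join d2 using (k2);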

There are various other efficiency issues, because the joins go through
calc_joinrel_size_estimate and compute_semi_anti_join_factors, and each
place does the sampling again. The samples should probably be cached
somewhere and reused.

I'm sure there are plenty of open questions, some of which are mentioned
in the many XXX comments added to the patch.

FWIW, the patch does have some issues with expressions, so joins on
complex expressions (e.g. ON ((a+b) = (c+d))) do not work properly. That
shouldn't be a big deal for a PoC, I think.


regards


[1] CS2: A new database synopsis for query estimation
https://www.researchgate.net/publication/262350868_CS2_A_new_database_synopsis_for_query_estimation

[2] Join Size Estimation Subject to Filter Conditions
https://www.semanticscholar.org/paper/Join-Size-Estimation-Subject-to-Filter-Conditions-Vengerov-Menck/c8bd4caf0fc9c8a4fbffc7e05416901d4fd7a41b

[3] Cardinality Estimation Done Right: Index-Based Join Sampling
https://www.semanticscholar.org/paper/Cardinality-Estimation-Done-Right%3A-Index-Based-Join-Leis-Radke/15f211eaafc6ce421a511a413613e1d2683879d2

[4] Improved Correlated Sampling for Join Size Estimation
https://www.comp.nus.edu.sg/~taining/estimation/report.pdf

[5] A Survey on Advancing the DBMS Query Optimizer: Cardinality
Estimation, Cost Model, and Plan Enumeration
https://arxiv.org/abs/2101.01507

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
