Re: Is there value in having optimizer stats for joins/foreignkeys? - Mailing list pgsql-hackers

From Alexandra Wang
Subject Re: Is there value in having optimizer stats for joins/foreignkeys?
Date
Msg-id CAK98qZ2ySno00SApwj3X_MgN4iuBzKaXKXOY+U3jaFqTxPS8Tg@mail.gmail.com
Whole thread
In response to Re: Is there value in having optimizer stats for joins/foreignkeys?  (Tomas Vondra <tomas@vondra.me>)
Responses Re: Is there value in having optimizer stats for joins/foreignkeys?
List pgsql-hackers
Hi hackers,

I posted a proof-of-concept for join statistics in the end of January [1].
Thank you to Tomas, Corey, Tom, and Andrei for the detailed feedback!
They were very helpful in shaping the v1 design.

Rather than reply to each point inline (I plan to follow up on
specific items separately), I'd like to post the v1 patch and
summarize what changed since the PoC, along with benchmark results
that include the cardinality accuracy measurements Tomas asked for.

I've attached the v1 patch and am looking for reviews.

## What changed since the PoC

The major changes, largely driven by Tomas's feedback:

1. Index-based join sampling (Algorithm 1 from Leis et al., CIDR 2017 [3]).
   The PoC relied on per-column MCV lists; v1 samples tuples from one
   side of the join during ANALYZE and probes the other table via index
   lookup to build a sample of the join result.

2. Catalog redesign. The PoC used a new stats kind 'c' and new catalog
   columns. v1 reuses the existing MCV kind ('m') and stxdmcv field,
   and stores join metadata in three new nullable columns on
   pg_statistic_ext: stxjoinrels, stxkeyrefs, and stxjoinconds. These
   are NULL for single-table statistics.

3. N-way catalog support. The catalog design now supports N-way joins
   (stxjoinrels stores the OIDs of participating relations beyond
   stxrelid, stxkeyrefs maps each stats key to its source relation,
   stxjoinconds stores the join conditions). v1
   collection is still limited to 2-way joins, but the schema is ready
   for future extension.

4. Full JOIN syntax.

     CREATE STATISTICS s (mcv) ON d.name
         FROM fact f JOIN dim d ON (f.dim_id = d.id);

5. Dropped the auto-creation-from-FK patch (0002 from the PoC). As
   Tom and Corey noted, this should remain explicit via CREATE
   STATISTICS.

6. A lot more test coverage and code for edge cases.

Scope for v1 is deliberately narrow: equality joins, equality/IN
filters, simple Var references, inner joins, 2-way only, MCV only.


## Benchmark results

I measured using the Join Order Benchmark (JOB) [2] with a single
join statistics object on the keyword/movie_keyword join:

  CREATE STATISTICS movie_keyword_keyword_stats (mcv)
      ON k.keyword
      FROM movie_keyword mk
      JOIN keyword k ON (mk.keyword_id = k.id);

Benchmark methodology:

- 4 database instances, 3 cold + 3 warm runs each configuration.
- To eliminate ANALYZE sampling noise from the main comparison, I used
  a single database for the "with stats" vs "without stats" runs:
  CREATE STATISTICS + ANALYZE once, benchmark with stats, then DROP
  STATISTICS and benchmark without. DROP STATISTICS only removes
  pg_statistic_ext/pg_statistic_ext_data, leaving pg_statistic
  untouched, so single-table estimates are identical in both runs.
- Separate noise calibration (master-vs-master on different DBs) and
  regression check (master vs patched-no-stats) confirmed that (a)
  cross-DB ANALYZE noise is ~30% of queries, up to 19x, and (b) the
  patch causes zero estimate changes when no join stats are created.

Cardinality accuracy at the keyword/movie_keyword join node (32
queries where this join appears in both plans with the same actual
row count, allowing direct comparison of the estimated rows at
that join),
measured using q-error = max(estimated/actual, actual/estimated).
A q-error of 1.0 means the estimate exactly matches reality; higher
is worse.

                   Without stats    With stats
  geometric mean      103.4x           4.2x
  median              131.7x           2.4x
  90th percentile   1,230.6x          29.6x

  16 improved, 0 regressed, 16 unchanged.

  The 16 unchanged queries have filters that the MCV join stat doesn't
  cover: LIKE patterns (e.g., 3a: k.keyword LIKE '%sequel%'), no
  filter on keyword at all (e.g., 15c), or equality values that
  aren't frequent enough to appear in the MCV list (e.g., 32a:
  k.keyword = '10,000-mile-club').

Execution time (all 113 JOB queries):

                   Without stats    With stats    Speedup
  warm (median)       114.1s          74.0s        1.54x
  cold (median)       141.7s         103.7s        1.37x

All 38 unrelated queries (not involving keyword/movie_keyword) show
identical estimates, confirming no side effects.


## On using single-table extended stats for join estimation

Tomas suggested that using per-table extended statistics (functional
dependencies, MCV) in join estimation could be an orthogonal
improvement. I experimented with two approaches: (1) using functional
dependency stats to adjust ndistinct in eqjoinsel when a filter
column determines the join key, and (2) using single-table extended
MCV stats covering both the join key and filter columns to compute a
correlated post-filter MCV for join estimation. Both are sound in
theory, but neither showed measurable improvement on JOB. I'll
follow up with more detail on those experiments in a separate reply.


## Feedback welcome

I'd appreciate feedback on:
- The catalog design (stxjoinrels, stxkeyrefs, stxjoinconds)
- The index-based sampling implementation
- The planner integration (clausesel.c, costsize.c)
- What should be in scope for v1 vs deferred

Best,
Alex

[1] https://www.postgresql.org/message-id/CAK98qZ0LwJbUoiZjjFXitojHy4UskkjYDiSd_JZfGE9LbfZm9w%40mail.gmail.com
[2] https://github.com/l-wang/join-order-benchmark
[3] https://www.cidrdb.org/cidr2017/papers/p9-leis-cidr17.pdf


Attachment

pgsql-hackers by date:

Previous
From: shveta malik
Date:
Subject: Re: Parallel Apply
Next
From: John Naylor
Date:
Subject: Re: [PATCH] Fix duplicate errmsg in ALTER TABLE SPLIT PARTITION