Home > mailing lists

Re: Is there value in having optimizer stats for joins/foreignkeys? - Mailing list pgsql-hackers

From	Alexandra Wang
Subject	Re: Is there value in having optimizer stats for joins/foreignkeys?
Date	April 29 08:59:28
Msg-id	CAK98qZ2ySno00SApwj3X_MgN4iuBzKaXKXOY+U3jaFqTxPS8Tg@mail.gmail.com Whole thread
In response to	Re: Is there value in having optimizer stats for joins/foreignkeys? (Tomas Vondra <tomas@vondra.me>)
Responses	Re: Is there value in having optimizer stats for joins/foreignkeys?
List	pgsql-hackers

Tree view

Hi hackers,

I posted a proof-of-concept for join statistics in the end of January [1].

Thank you to Tomas, Corey, Tom, and Andrei for the detailed feedback!

They were very helpful in shaping the v1 design.

Rather than reply to each point inline (I plan to follow up on
specific items separately), I'd like to post the v1 patch and
summarize what changed since the PoC, along with benchmark results
that include the cardinality accuracy measurements Tomas asked for.

I've attached the v1 patch and am looking for reviews.

## What changed since the PoC

The major changes, largely driven by Tomas's feedback:

1. Index-based join sampling (Algorithm 1 from Leis et al., CIDR 2017 [3]).
The PoC relied on per-column MCV lists; v1 samples tuples from one
side of the join during ANALYZE and probes the other table via index
lookup to build a sample of the join result.

2. Catalog redesign. The PoC used a new stats kind 'c' and new catalog
columns. v1 reuses the existing MCV kind ('m') and stxdmcv field,
and stores join metadata in three new nullable columns on
pg_statistic_ext: stxjoinrels, stxkeyrefs, and stxjoinconds. These
are NULL for single-table statistics.

3. N-way catalog support. The catalog design now supports N-way joins
(stxjoinrels stores the OIDs of participating relations beyond
stxrelid, stxkeyrefs maps each stats key to its source relation,
stxjoinconds stores the join conditions). v1
collection is still limited to 2-way joins, but the schema is ready
for future extension.

4. Full JOIN syntax.

CREATE STATISTICS s (mcv) ON d.name
FROM fact f JOIN dim d ON (f.dim_id = d.id);

5. Dropped the auto-creation-from-FK patch (0002 from the PoC). As
Tom and Corey noted, this should remain explicit via CREATE
STATISTICS.

6. A lot more test coverage and code for edge cases.

Scope for v1 is deliberately narrow: equality joins, equality/IN
filters, simple Var references, inner joins, 2-way only, MCV only.

## Benchmark results

I measured using the Join Order Benchmark (JOB) [2] with a single
join statistics object on the keyword/movie_keyword join:

CREATE STATISTICS movie_keyword_keyword_stats (mcv)
ON k.keyword
FROM movie_keyword mk
JOIN keyword k ON (mk.keyword_id = k.id);

Benchmark methodology:

- 4 database instances, 3 cold + 3 warm runs each configuration.
- To eliminate ANALYZE sampling noise from the main comparison, I used
a single database for the "with stats" vs "without stats" runs:
CREATE STATISTICS + ANALYZE once, benchmark with stats, then DROP
STATISTICS and benchmark without. DROP STATISTICS only removes
pg_statistic_ext/pg_statistic_ext_data, leaving pg_statistic
untouched, so single-table estimates are identical in both runs.
- Separate noise calibration (master-vs-master on different DBs) and
regression check (master vs patched-no-stats) confirmed that (a)
cross-DB ANALYZE noise is ~30% of queries, up to 19x, and (b) the
patch causes zero estimate changes when no join stats are created.

Cardinality accuracy at the keyword/movie_keyword join node (32
queries where this join appears in both plans with the same actual
row count, allowing direct comparison of the estimated rows at
that join),
measured using q-error = max(estimated/actual, actual/estimated).
A q-error of 1.0 means the estimate exactly matches reality; higher
is worse.

Without stats With stats
geometric mean 103.4x 4.2x
median 131.7x 2.4x
90th percentile 1,230.6x 29.6x

16 improved, 0 regressed, 16 unchanged.

The 16 unchanged queries have filters that the MCV join stat doesn't
cover: LIKE patterns (e.g., 3a: k.keyword LIKE '%sequel%'), no
filter on keyword at all (e.g., 15c), or equality values that
aren't frequent enough to appear in the MCV list (e.g., 32a:
k.keyword = '10,000-mile-club').

Execution time (all 113 JOB queries):

Without stats With stats Speedup
warm (median) 114.1s 74.0s 1.54x
cold (median) 141.7s 103.7s 1.37x

All 38 unrelated queries (not involving keyword/movie_keyword) show
identical estimates, confirming no side effects.

## On using single-table extended stats for join estimation

Tomas suggested that using per-table extended statistics (functional
dependencies, MCV) in join estimation could be an orthogonal
improvement. I experimented with two approaches: (1) using functional
dependency stats to adjust ndistinct in eqjoinsel when a filter
column determines the join key, and (2) using single-table extended
MCV stats covering both the join key and filter columns to compute a
correlated post-filter MCV for join estimation. Both are sound in
theory, but neither showed measurable improvement on JOB. I'll
follow up with more detail on those experiments in a separate reply.

## Feedback welcome

I'd appreciate feedback on:
- The catalog design (stxjoinrels, stxkeyrefs, stxjoinconds)
- The index-based sampling implementation
- The planner integration (clausesel.c, costsize.c)
- What should be in scope for v1 vs deferred

Best,
Alex

[1] https://www.postgresql.org/message-id/CAK98qZ0LwJbUoiZjjFXitojHy4UskkjYDiSd_JZfGE9LbfZm9w%40mail.gmail.com
[2] https://github.com/l-wang/join-order-benchmark
[3] https://www.cidrdb.org/cidr2017/papers/p9-leis-cidr17.pdf

Attachment

v1-0001-Add-join-MCV-statistics-for-selectivity-estimatio.patch

pgsql-hackers by date:

From: shveta malik
Date: 29 April, 08:50:48
Subject: Re: Parallel Apply

From: John Naylor
Date: 29 April, 08:59:45
Subject: Re: [PATCH] Fix duplicate errmsg in ALTER TABLE SPLIT PARTITION

Re: Is there value in having optimizer stats for joins/foreignkeys? - Mailing list pgsql-hackers

Attachment

Previous

Next