pgsql: Fix estimate_hash_bucket_stats's correction for skewed data. - Mailing list pgsql-committers

From Tom Lane
Subject pgsql: Fix estimate_hash_bucket_stats's correction for skewed data.
Date
Msg-id E1vxsuJ-002PNY-1E@gemulon.postgresql.org
Whole thread Raw
List pgsql-committers
Fix estimate_hash_bucket_stats's correction for skewed data.

The previous idea was "scale up the bucketsize estimate by the ratio
of the MCV's frequency to the average value's frequency".  But we
should have been suspicious of that plan, since it frequently led to
impossible (> 1) values which we had to apply an ad-hoc clamp to.
Joel Jacobson demonstrated that it sometimes leads to making the
wrong choice about which side of the hash join should be inner.

Instead, drop the whole business of estimating average frequency, and
just clamp the bucketsize estimate to be at least the MCV's frequency.
This corresponds to the bucket size we'd get if only the MCV appears
in a bucket, and the MCV's frequency is not affected by the
WHERE-clause filters.  (We were already making the latter assumption.)
This also matches the coding used since 4867d7f62 in the case where
only a default ndistinct estimate is available.

Interestingly, this change affects no existing regression test cases.
Add one to demonstrate that it helps pick the smaller table to be
hashed when the MCV is common enough to affect the results.

This leaves estimate_hash_bucket_stats not considering the effects of
null join keys at all, which we should probably improve.  However,
I have a different patch in the queue that will change the executor's
handling of null join keys, so it seems appropriate to wait till
that's in before doing anything more here.

Reported-by: Joel Jacobson <joel@compiler.org>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Joel Jacobson <joel@compiler.org>
Discussion: https://postgr.es/m/341b723c-da45-4058-9446-1514dedb17c1@app.fastmail.com

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/e6a1d8f5acbc661d3165d92d0cc1ff6b8f997f16

Modified Files
--------------
src/backend/utils/adt/selfuncs.c   | 39 +++++++-------------------------------
src/test/regress/expected/join.out | 25 ++++++++++++++++++++++++
src/test/regress/sql/join.sql      | 19 +++++++++++++++++++
3 files changed, 51 insertions(+), 32 deletions(-)


pgsql-committers by date:

Previous
From: Tom Lane
Date:
Subject: pgsql: Fix yet another bug in archive streamer with LZ4 decompression.
Next
From: Andrew Dunstan
Date:
Subject: pgsql: Fix casting away const-ness in pg_restore.c