pgsql: Omit null rows when applying the Haas-Stokes estimator for ndist - Mailing list pgsql-committers

From	Tom Lane
Subject	pgsql: Omit null rows when applying the Haas-Stokes estimator for ndist
Date	April 1, 2016 22:48:42
Msg-id	E1am53b-0000Dw-EU@gemulon.postgresql.org
List	pgsql-committers

Omit null rows when applying the Haas-Stokes estimator for ndistinct.

Previously, we included null rows in the values of n and N that went
into the formula, which amounts to considering null as a value in its
own right; but the d and f1 values do not include nulls.  This is
inconsistent, and it contributes to significant underestimation of
ndistinct when the column is mostly nulls.  In any case stadistinct
is defined as the number of distinct non-null values, so we should
exclude nulls when doing this computation.
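
For illustration, here is a minimal Python sketch of the estimator as I
understand it, n*d / (n - f1 + f1*n/N), where n is the number of sampled
rows, N the total number of rows, d the number of distinct values in the
sample, and f1 the number of values seen exactly once. The function name,
the clamping step, and all the numbers are assumptions made up for this
example; this is not the code in analyze.c.

```python
def haas_stokes_ndistinct(n, N, d, f1):
    """Haas-Stokes style estimate: n*d / (n - f1 + f1*n/N).

    n  -- rows in the sample
    N  -- rows in the table
    d  -- distinct values observed in the sample
    f1 -- values observed exactly once in the sample
    """
    if n <= 0 or N <= 0:
        return 0.0
    estimate = (n * d) / ((n - f1) + f1 * n / N)
    # Sanity clamp (an assumption of this sketch): the estimate should
    # not fall below d or exceed N.
    return min(max(estimate, d), N)


# Hypothetical column: 1,000,000 rows, 99% null, 30,000 rows sampled.
total_rows = 1_000_000
sample_rows = 30_000
null_frac = 0.99
d, f1 = 290, 280  # counted over non-null sample values only

# Inconsistent inputs: n and N count nulls, d and f1 do not.
with_nulls = haas_stokes_ndistinct(sample_rows, total_rows, d, f1)

# Consistent inputs: nulls excluded from n and N as well.
n_nonnull = round(sample_rows * (1 - null_frac))
N_nonnull = total_rows * (1 - null_frac)
without_nulls = haas_stokes_ndistinct(n_nonnull, N_nonnull, d, f1)

print(with_nulls, without_nulls)
```

With these made-up inputs the null-inclusive call comes out at roughly 293
and the null-free call at roughly 3063, so counting nulls in n and N
underestimates ndistinct by about a factor of ten here.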

This is an aboriginal bug in our application of the Haas-Stokes formula,
but we'll refrain from back-patching for fear of destabilizing plan
choices in released branches.

While at it, make the code a bit more readable by omitting unnecessary
casts and intermediate variables.

Observation and original patch by Tomas Vondra, adjusted to fix both
uses of the formula by Alex Shulgin, cosmetic improvements by me.

Branch
------
master

Details
-------
http://git.postgresql.org/pg/commitdiff/be4b4dc75955318e763f5b2e3a990e35366ac797

Modified Files
--------------
src/backend/commands/analyze.c | 62 ++++++++++++++++++++++++++----------------
1 file changed, 38 insertions(+), 24 deletions(-)
