Re: [BUGS] BUG #14664: Nonsensical join selectivity estimationdespite n_distinct - Mailing list pgsql-bugs

From Jeff Janes
Subject Re: [BUGS] BUG #14664: Nonsensical join selectivity estimationdespite n_distinct
Date
Msg-id CAMkU=1y6tz1oygO0Z-caDUy1P1msT8ZM4RY9rujOF7wwCvkemg@mail.gmail.com
Whole thread Raw
In response to Re: [BUGS] BUG #14664: Nonsensical join selectivity estimation despite n_distinct  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
On Sat, Jun 3, 2017 at 1:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

If I increase the stats target to 1000, I get results like

regression=# select f, count(*) from (select unnest(most_common_freqs) f from pg_stats where tablename = 'qqq' and attname = 'other') ss group by 1 order by 1;
      f      | count
-------------+-------
 6.66667e-06 |   935
       1e-05 |    65
(2 rows)

regression=# select f, count(*) from (select unnest(most_common_freqs) f from pg_stats where tablename = 'qqq' and attname = 'other') ss group by 1 order by 1;
      f      | count
-------------+-------
 6.66667e-06 |   936
       1e-05 |    63
 1.33333e-05 |     1
(3 rows)

(the counts move around a bit, but not very much).  That corresponds to a
minimum of 2 and a maximum of either 3 or 4 duplicates in a 300000-row
sample.

Given that we know that the duplicate values of "other" are adjacent,
what this is looking like to me is that there's something fishy about
ANALYZE's choice of sample rows.  It seems like it should not be picking 3
or 4 rows from the same block very often, and for all of those to come
from the same 10-row span (out of 225 rows per page, more or less, in
this table) seems like it ought to be a really low-probability event.
But maybe my math is off.

Yeah, first it samples 30,000 blocks, then it samples 30,000 rows from those blocks.  This leads to too many blocks getting zero rows sampled, too many getting more than one sampled, and too few getting exactly one sampled.

Basically, the variance for the poisson distribution for selected rows per block for sampling 30,000 row from 30,000 blocks is much broader than it is for sampling 30,000 rows from >> 30,000 blocks, so the 2 stage sampling method doesn't work very well.

Tomas was working on a patch a while back for fixing this by resampling from blocks, but I think he dropped it to work on multivariate stuff instead.

 Cheers,

Jeff

pgsql-bugs by date:

Previous
From: Tom Lane
Date:
Subject: Re: [BUGS] BUG #14701: pg_dump fails to dump pg_catalog schema
Next
From: Neil Anderson
Date:
Subject: Re: [BUGS] BUG #14701: pg_dump fails to dump pg_catalog schema