Home > mailing lists

Re: [BUGS] BUG #14664: Nonsensical join selectivity estimationdespite n_distinct - Mailing list pgsql-bugs

From	Jeff Janes
Subject	Re: [BUGS] BUG #14664: Nonsensical join selectivity estimationdespite n_distinct
Date	June 12, 2017 22:20:41
Msg-id	CAMkU=1y6tz1oygO0Z-caDUy1P1msT8ZM4RY9rujOF7wwCvkemg@mail.gmail.com Whole thread Raw
In response to	Re: [BUGS] BUG #14664: Nonsensical join selectivity estimation despite n_distinct (Tom Lane <tgl@sss.pgh.pa.us>)
List	pgsql-bugs

Tree view

On Sat, Jun 3, 2017 at 1:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

If I increase the stats target to 1000, I get results like

regression=# select f, count(*) from (select unnest(most_common_freqs) f from pg_stats where tablename = 'qqq' and attname = 'other') ss group by 1 order by 1;
f | count
-------------+-------
6.66667e-06 | 935
1e-05 | 65
(2 rows)

regression=# select f, count(*) from (select unnest(most_common_freqs) f from pg_stats where tablename = 'qqq' and attname = 'other') ss group by 1 order by 1;
f | count
-------------+-------
6.66667e-06 | 936
1e-05 | 63
1.33333e-05 | 1
(3 rows)

(the counts move around a bit, but not very much). That corresponds to a
minimum of 2 and a maximum of either 3 or 4 duplicates in a 300000-row
sample.

Given that we know that the duplicate values of "other" are adjacent,
what this is looking like to me is that there's something fishy about
ANALYZE's choice of sample rows. It seems like it should not be picking 3

or 4 rows from the same block very often, and for all of those to come
from the same 10-row span (out of 225 rows per page, more or less, in
this table) seems like it ought to be a really low-probability event.
But maybe my math is off.

Yeah, first it samples 30,000 blocks, then it samples 30,000 rows from those blocks. This leads to too many blocks getting zero rows sampled, too many getting more than one sampled, and too few getting exactly one sampled.

Basically, the variance for the poisson distribution for selected rows per block for sampling 30,000 row from 30,000 blocks is much broader than it is for sampling 30,000 rows from >> 30,000 blocks, so the 2 stage sampling method doesn't work very well.

Tomas was working on a patch a while back for fixing this by resampling from blocks, but I think he dropped it to work on multivariate stuff instead.

Cheers,

Jeff

pgsql-bugs by date:

From: Tom Lane
Date: 12 June 2017, 22:18:06
Subject: Re: [BUGS] BUG #14701: pg_dump fails to dump pg_catalog schema

From: Neil Anderson
Date: 12 June 2017, 22:24:07
Subject: Re: [BUGS] BUG #14701: pg_dump fails to dump pg_catalog schema

Re: [BUGS] BUG #14664: Nonsensical join selectivity estimationdespite n_distinct - Mailing list pgsql-bugs

Previous

Next