Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H
Date
Msg-id 55847DEA.3060002@2ndquadrant.com
In response to Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H  (Jeff Janes <jeff.janes@gmail.com>)
Responses Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H  (Jeff Janes <jeff.janes@gmail.com>)
List pgsql-hackers
On 06/19/2015 09:48 PM, Jeff Janes wrote:
> On Fri, Jun 19, 2015 at 12:27 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>
>     But I think you might be on to something, because I manually
>     collected a random sample with 30k rows (by explicitly generating
>     30k random TIDs), and I get this:
>
>     tpch=# select cnt, count(*) from (select l_orderkey, count(*) AS cnt
>     from lineitem_sample group by 1) foo group by 1;
>
>       cnt | count
>     -----+-------
>         1 | 29998
>         2 |     1
>     (2 rows)
>
>
>     That's quite different compared to what analyze gets, which
>     effectively looks something like this (this is derived from the
>     logs, so not perfectly accurate - I only have f1, ndistinct, nmultiple):
>
>       cnt | count
>     -----+-------
>         1 | 27976
>         2 |   976
>         3 |    24
>
>     Am I wrong or is the sample not that random?
>
>
> The sample is not truly random.  The two-stage sampling method causes
> too few blocks to have exactly one row chosen from them, and too many to
> have either 0 or 2+ rows chosen from them.
>
> When values in the same block are likely to be equal, then it finds too
> many duplicates because it too often picks two rows from a single block.
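
The effect is easy to reproduce with TABLESAMPLE on a deliberately 
clustered table. This is just a crude comparison of block-level vs. 
row-level sampling, not the two-stage algorithm ANALYZE actually uses, 
but it shows how much the duplicates get inflated once values follow 
the physical order of the table:

create table clustered as
  select i / 100 AS key from generate_series(0, 999999) i;

-- ~3% of the blocks: most sampled keys show up many times
select cnt, count(*)
  from (select key, count(*) AS cnt
          from clustered tablesample system (3) group by 1) foo
 group by 1 order by 1;

-- ~3% of the rows: almost every sampled key is seen exactly once
select cnt, count(*)
  from (select key, count(*) AS cnt
          from clustered tablesample bernoulli (3) group by 1) foo
 group by 1 order by 1;

The SYSTEM variant will report most sampled keys many times (whole 
blocks of identical values get pulled in), while BERNOULLI should see 
nearly every key just once.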

Yeah, I came to the same conclusion after a bit of experimenting. I've 
logged the block numbers for all the 30k sampled tuples (target=100) and 
I get this histogram of repetition counts:
 cnt | count
-----+-------
   1 | 11020
   2 |  5637
   3 |  1800
   4 |   450
   5 |    94
   6 |     6

so 11020 blocks have exactly 1 tuple sampled from them, 5637 blocks have 
2 tuples sampled etc.
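
FWIW the histogram above comes from a query along these lines, with the 
logged block numbers loaded into an ad-hoc table (the same 'blocks' table 
used below, one row per sampled tuple; I'm assuming a single blockno 
column here):

select cnt, count(*)
  from (select blockno, count(*) AS cnt
          from blocks group by 1) foo
 group by 1 order by 1;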

With truly random sampling (just generating 30k random block numbers 
between 0 and 328509442, the number of pages of this particular table), 
I get this:

test=# select cnt, count(*)
         from (select (328509442 * random())::int AS blockno, count(*) AS cnt
                 from blocks group by 1) foo
        group by 1 order by 1;
 cnt | count
-----+-------
   1 | 29994
   2 |     3

So yeah, not really random.
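
FWIW the same check works without the logged block numbers at all, just 
by generating them directly (a quick sketch, the exact counts will of 
course vary a bit between runs):

select cnt, count(*)
  from (select (328509442 * random())::int AS blockno, count(*) AS cnt
          from generate_series(1, 30000)
         group by 1) foo
 group by 1 order by 1;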

> See analysis here:
>
> http://www.postgresql.org/message-id/CAMkU=1wRH_jopyCAyUKbdQY4DWhsx1-1e2s0VVgfrryfXDe2SQ@mail.gmail.com

Thanks.

> If we assume all the blocks have the same tuple density, then it is
> easy to correct this. But without that assumption of constant tuple
> density, I don't know how to get a truly random sample while still
> skipping most of the table.

Hmmm, that's probably true. OTOH correlated columns are not all that 
uncommon (e.g. tables storing time-series data and the like), and the 
resulting blowup in duplicates is quite bad ...

I don't think we really need to assume the density is constant; maybe 
we can verify that while collecting the sample? We're already reading 
the pages anyway, so we can track the per-page tuple counts and then 
either apply the correction or not.
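
Something along these lines, as a sanity check from SQL (one full pass 
over the table, so purely illustrative; ANALYZE could collect the same 
per-page counts from the pages it already reads):

select min(tuples), max(tuples),
       round(avg(tuples), 1) AS avg_tuples,
       round(stddev(tuples), 1) AS stddev_tuples
  from (select (ctid::text::point)[0]::bigint AS blkno, count(*) AS tuples
          from lineitem group by 1) per_block;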

Also, doesn't Vitter make pretty much the same assumption implicitly? 
Otherwise it couldn't skip some of the blocks.

regards
Tomas

--
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


