Home > mailing lists

Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H - Mailing list pgsql-hackers

From	Jeff Janes
Subject	Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H
Date	June 19, 2015 22:48:46
Msg-id	CAMkU=1zEV+Pwx=QBKTPSGU=qjvqVkpAsHi0JbEKWjQifmV4TYQ@mail.gmail.com Whole thread Raw
In response to	Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses	Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H
List	pgsql-hackers

Tree view

On Fri, Jun 19, 2015 at 12:27 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

But I think you might be on to something, because I manually collected a random sample with 30k rows (by explicitly generating 30k random TIDs), and I get this:

tpch=# select cnt, count(*) from (select l_orderkey, count(*) AS cnt from lineitem_sample group by 1) foo group by 1;

cnt | count
-----+-------
1 | 29998
2 | 1
(2 rows)

That's quite different compared to what analyze gets, which effectively looks something like this (this is derived from the logs, so not perfectly accurate - I only have f1, ndistinct, nmultiple):

cnt | count
-----+-------
1 | 27976
2 | 976
3 | 24

Am I wrong or is the sample not that random?

The sample is not truly random. The two-stage sampling method causes too few blocks to have exactly one row chosen from them, and too many to have either 0 or 2+ rows chosen from them.

When values in the same block are likely to be equal, then it finds too many duplicates because it too often picks two rows from a single block.

See analysis here:

http://www.postgresql.org/message-id/CAMkU=1wRH_jopyCAyUKbdQY4DWhsx1-1e2s0VVgfrryfXDe2SQ@mail.gmail.com

If we assume all the blocks have the same tuple density, then it is easy to correct this. But without that assumption of constant tuple density, I don't know how to get a truly random sample while still skipping most of the table.

Cheers,

Jeff

pgsql-hackers by date:

From: Andres Freund
Date: 19 June 2015, 22:44:12
Subject: Re: 9.5 release notes

From: Brendan Jurd
Date: 19 June 2015, 23:02:05
Subject: Re: Tab completion for TABLESAMPLE

Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H - Mailing list pgsql-hackers

Previous

Next