Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

From Jeff Janes
Subject Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H
Date
Msg-id CAMkU=1yRtAjujpvz1agrQuq_2w6sEaKUM0canGgSXdiCiCZfTg@mail.gmail.com
In response to Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers
On Fri, Jun 19, 2015 at 1:39 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On 06/19/2015 09:48 PM, Jeff Janes wrote:
On Fri, Jun 19, 2015 at 12:27 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>> wrote:

    But I think you might be on to something, because I manually
    collected a random sample with 30k rows (by explicitly generating
    30k random TIDs), and I get this:

    tpch=# select cnt, count(*) from (select l_orderkey, count(*) AS cnt
    from lineitem_sample group by 1) foo group by 1;

      cnt | count
    -----+-------
        1 | 29998
        2 |     1
    (2 rows)
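
    The exact way the random TIDs were generated isn't shown; a rough sketch of one way to build such a sample (the maximum line pointer of 100 per page is an assumption, TIDs that point at nothing simply return no row so the generator over-shoots, and it is not necessarily efficient since the planner may not use a TID scan for this kind of join):

    create table lineitem_sample AS
    select l.*
    from (select format('(%s,%s)',
                        (random() * 328509442)::bigint,   -- random block number
                        1 + (random() * 100)::int         -- random line pointer
                 )::tid AS tid
          from generate_series(1, 60000)) r,
         lateral (select * from lineitem where lineitem.ctid = r.tid) l
    limit 30000;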


    That's quite different compared to what analyze gets, which
    effectively looks something like this (this is derived from the
    logs, so not perfectly accurate - I only have f1, ndistinct, nmultiple):

      cnt | count
    -----+-------
        1 | 27976
        2 |   976
        3 |    24

    Am I wrong or is the sample not that random?


The sample is not truly random.  The two-stage sampling method causes
too few blocks to have exactly one row chosen from them, and too many to
have either 0 or 2+ rows chosen from them.

When values in the same block are likely to be equal, then it finds too
many duplicates because it too often picks two rows from a single block.
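
To see that effect in isolation, here is a purely synthetic illustration (not the actual two-stage code in analyze.c): a table where the value is perfectly correlated with the "block", sampled once block-wise (two rows from each of 150 random blocks) and once uniformly (300 rows from anywhere). The block-wise sample reports far more duplicate values:

create temp table toy AS
select g / 100 AS blkno,   -- pretend each "block" holds 100 rows
       g / 100 AS val      -- value identical for every row in a block
from generate_series(0, 999999) g;
create index on toy (blkno);

-- block-wise: 150 random "blocks", 2 random rows from each (300 rows total)
with blks AS (select (random() * 9999)::int AS blkno from generate_series(1, 150)),
     samp AS (select t.val
              from blks b,
                   lateral (select val from toy
                            where toy.blkno = b.blkno
                            order by random() limit 2) t)
select cnt, count(*)
from (select val, count(*) AS cnt from samp group by 1) s
group by 1 order by 1;

-- uniform: 300 rows picked from anywhere; nearly every value shows up just once
select cnt, count(*)
from (select val, count(*) AS cnt
      from (select val from toy order by random() limit 300) u
      group by 1) s
group by 1 order by 1;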

Yeah, I came to the same conclusion after a bit of experimenting. I've logged the block numbers for all the 30k sampled tuples (target=100) and I get these statistics for the number of repetitions:

 cnt | count
-----+-------
   1 | 11020
   2 |  5637
   3 |  1800
   4 |   450
   5 |    94
   6 |     6

so 11020 blocks have exactly 1 tuple sampled from them, 5637 blocks have 2 tuples sampled, etc.
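
(For reference, that distribution falls out of an aggregation along these lines, assuming the logged block numbers were loaded into the blocks table reused just below; the column name blockno is an assumption:)

select cnt, count(*) from (select blockno, count(*) AS cnt from blocks group by 1) foo group by 1 order by 1;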

With truly random sampling (just generating 30k random numbers between 0 and 328509442, the number of pages of this particular table), I get this:

test=# select cnt, count(*) from (select (328509442 * random())::int AS blockno, count(*) AS cnt from blocks group by 1) foo group by 1 order by 1;

 cnt | count
-----+-------
   1 | 29994
   2 |     3

So yeah, not really random.

See analysis here:

http://www.postgresql.org/message-id/CAMkU=1wRH_jopyCAyUKbdQY4DWhsx1-1e2s0VVgfrryfXDe2SQ@mail.gmail.com

Thanks.

If we assume all the blocks have the same tuple density, then it is
easy to correct this. But without that assumption of constant tuple
density, I don't know how to get a truly random sample while still
skipping most of the table.

Hmmm, that's probably true. OTOH correlated columns are not all that uncommon (e.g. a table storing time-series data), and this blowup is quite bad ...

True, but we don't know how big a problem the density skew might be (since the current algorithm isn't sensitive to it). It might be just as big a problem. Tom mentioned some ancient history in the above-mentioned thread that made me think the density skew was enough of a problem to motivate the current system.
 

I don't think we really need to assume the density is constant; maybe we can verify that while collecting the sample? I mean, we're already reading the pages, so we can track the density, and either do the correction or not.

Maybe.  I don't know how that would work. We would have to keep two samples, and dynamically decide which to use. And what if the decision is that both density skew and value clustering are problems?
 
I wonder if the n_distinct estimate could be tweaked so that it counts any given value only once for each block it is found in?  So instead of asking "how many times was this value sampled", ask "in how many blocks was this value sampled at least once"?
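
A sketch of what that would look like over the manual sample, assuming the sampled rows also carried their source block number in a column, here called src_blkno, which they currently don't:

select cnt, count(*)
from (select l_orderkey,
             count(distinct src_blkno) AS cnt   -- count each value once per block it was seen in
      from lineitem_sample
      group by 1) per_value
group by 1 order by 1;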


Also, doesn't Vitter make pretty much the same assumption implicitly? Otherwise it couldn't skip some of the blocks.

Vitter samples an unstructured stream in a single pass, and is unbiased.  The explicit block sampling is not part of Vitter; it is something we bolted on top of it.

My solution was to just unbolt the block sampling from the top, and let it sample the rows (still 300 * stats_target of them) from the whole table rather than from a random 300 * 100 blocks of the table.  On tables of the size I was worried about, block sampling was not very helpful anyway.  Reading 30,000 blocks out of 250,000 blocks led to no meaningful IO advantage on my hardware. Any advantage from skipped blocks was made up for (and sometimes exceeded) by fouling up the read-ahead mechanisms.
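
In SQL terms the end result is statistically the same as a plain unbiased row sample of the whole table (a conceptual analogue only; the actual change is inside ANALYZE's one-pass reservoir code, not a sort):

select * from lineitem order by random() limit 30000;   -- simple random sample of 30k rows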

With 1000 times more blocks, that probably won't work for you.
 
But I do wonder: how much time does it take to read a random 1/10, 1/100, 1/1000, or 1/10000 of a table of your size, just from an IO perspective?  How much are we gaining by doing the block sample?

Cheers,

Jeff
