Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H
Msg-id 55882892.9040100@2ndquadrant.com
In response to Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H  (Jeff Janes <jeff.janes@gmail.com>)
List pgsql-hackers

On 06/22/2015 07:21 AM, Jeff Janes wrote:
> On Sat, Jun 20, 2015 at 9:55 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>
>     Hi,
>
>     On 06/20/2015 03:01 AM, Jeff Janes wrote:
>
>
>
>              I don't think we need to really assume the density to be
>              constant, maybe we can verify that while collecting the
>              sample? I mean, we're already reading the pages, so we can
>              track the density, and either do the correction or not.
>
>         Maybe.  I don't know how that would work.
>
>
>     I was thinking about something like
>
>        (1) compute average tuples per page
>
>        (2) generate random sample of 'samplesize' blocks
>
>        (3) for each block, see what is the actual number of tuples on the
>            page and perform correction (e.g. if there's only 1/2 the average
>            tuples, toss a coin and decide whether to really sample the
>            block)
>
>        (4) repeat until you have sample of the requested size
>
>
> Are you producing a block sample or a row sample?  If a block is
> sampled, then what do you feed it to?

I'm not sure what the question is. The ultimate goal is to get a random 
row sample, but we need to choose which blocks to sample first. So you 
can just generate K random block numbers (where K is the requested 
sample size). If you get a block number once, sample one row from that 
block; if you get it twice, sample two rows, etc.

Of course, this is based on the assumption of uniform tuple density, 
which allows you to choose a block first and then, independently, a row 
from that block. Hence the correction idea I outlined in the previous 
message.
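
Just to make that a bit more concrete, here's a trivial standalone sketch 
(plain C, with made-up nblocks/sample_size values - not meant to be actual 
analyze.c code):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int
main(void)
{
    int     nblocks = 1000;     /* hypothetical number of blocks in the table */
    int     sample_size = 100;  /* K = requested number of sample rows */
    int    *hits = calloc(nblocks, sizeof(int));

    srand((unsigned) time(NULL));

    /* stage 1: draw K block numbers at random, with replacement */
    for (int i = 0; i < sample_size; i++)
        hits[rand() % nblocks]++;

    /*
     * stage 2: a block drawn n times contributes n randomly chosen rows;
     * here we only print the plan, a real implementation would read the
     * block and pick random tuple offsets.  This is only fair if the
     * tuple density is roughly the same on every block.
     */
    for (int i = 0; i < nblocks; i++)
        if (hits[i] > 0)
            printf("block %d: sample %d row(s)\n", i, hits[i]);

    free(hits);
    return 0;
}

The block draws alone decide how many rows each block contributes, which 
is exactly what breaks once the density is not uniform.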

> Are you just picking one row out of each block that survives step 3?
> If so, that would be similar to my idea of counting a given value
> only once per block it was in, except I was thinking of applying that
> only to n_distinct, not to the entire stats collection process.

No, I'm not doing that, and I think it's a bad idea in general. The 
problem with the current sampling is that it does not produce a random 
sample of rows, but something skewed, which causes serious trouble in 
the estimators processing the sample. I think this 'count only once' 
approach would result in similar issues (e.g. for low-cardinality columns).

>
> My answer was to take out the block sampling entirely and read the
> whole table. That is what probably won't work for you. (Come to think
> of it, I was hesitant to deploy custom code to production, so I never
> actually deployed that. Instead I cranked up the stats target and let
> the chips fall where they may)

Yeah, reading the whole table is not going to fly I guess. But the 
estimates would be very accurate! ;-)

>
> I think you want to go back to the table to find new blocks to
> replace ones which were included in the original block sample but
> ended up having no tuples in the end tuple sample, either because
> they were low density blocks, or just through the luck of the draw.
> But I don't see how that fixes the problem, unless you also prevent a
> block from contributing more than 1 tuple. And at that point, I worry
> about the uneven density again. If a block has 2 rows selected by
> pure random chance, I think it would be fine to keep only one (or, go
> find another random block to pull one row from as a replacement). But
> if it has 2 rows selected because it has more rows than average, then
> we would have to pull the replacement from a random block *of similar
> density* to avoid swapping one bias for another one.

Hmmm, maybe limiting the sample to just 1 tuple per block is a good idea, 
but I have to think about that a bit more.

The basic idea behind density correction is that when you get a block 
with density significantly different from the average (let's say by more 
than 10%?), you can either treat it as a partial block (density below 
average) or as a collection of multiple blocks (density above average), 
and then decide whether it really gets sampled.

Handling the 'partial block' case seems rather simple - if the block has 
half the average density, do a random coin toss and either keep or 
discard the tuple.

With high-density blocks it's probably more complicated - I think it 
really means treating the block as multiple 'virtual' blocks, and only 
keeping samples from the first one.

And of course, this needs to be repeated if some of the rows get evicted 
because of the correction mechanism, which will lead to slightly larger 
samples. But I don't see that as a major problem (the difference is not 
that big, at least not in the case discussed earlier in this thread).
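
To show how I imagine the correction and the re-drawing fitting together, 
here's another standalone sketch (again plain C with made-up numbers; 
keep_sampled_block() is a hypothetical helper, and the avg/ntuples 
acceptance probability for over-dense blocks is just one possible reading 
of the 'virtual blocks' idea):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NBLOCKS       1000      /* hypothetical number of blocks */
#define SAMPLE_SIZE    100      /* requested number of sample rows */
#define AVG_TUPLES      50.0    /* average tuples per block, from step (1) */

static int ntuples[NBLOCKS];    /* tuples actually found on each block */

/*
 * Decide whether a block drawn for the sample should really contribute
 * a row.  Below-average blocks are kept with probability proportional
 * to their density (the coin toss for 'partial' blocks); above-average
 * blocks are treated as several 'virtual' blocks of which only the
 * first one counts, i.e. roughly probability avg/ntuples.
 */
static int
keep_sampled_block(int blkno)
{
    double  ratio = ntuples[blkno] / AVG_TUPLES;
    double  p = (ratio < 1.0) ? ratio : 1.0 / ratio;

    return ((double) rand() / RAND_MAX) < p;
}

int
main(void)
{
    int     accepted = 0;
    int     draws = 0;

    srand((unsigned) time(NULL));

    /* fake per-block densities between 50% and 150% of the average */
    for (int i = 0; i < NBLOCKS; i++)
        ntuples[i] = (int) (AVG_TUPLES * (0.5 + rand() % 101 / 100.0));

    /*
     * Keep drawing random blocks until we have the requested number of
     * rows - rejected draws are simply replaced by new ones, which is
     * why the total number of blocks visited ends up slightly larger.
     */
    while (accepted < SAMPLE_SIZE)
    {
        int     blkno = rand() % NBLOCKS;

        draws++;
        if (keep_sampled_block(blkno))
            accepted++;     /* a real implementation would pick a random tuple here */
    }

    printf("visited %d blocks to collect %d sample rows\n", draws, accepted);
    return 0;
}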

--
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


