
From Tomas Vondra
Subject Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H
Date
Msg-id 55859B10.2040200@2ndquadrant.com
In response to Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H  (Jeff Janes <jeff.janes@gmail.com>)
List pgsql-hackers
Hi,

On 06/20/2015 03:01 AM, Jeff Janes wrote:
>
>     Hmmm, that's probably true. OTOH correlated columns are not all that
>     uncommon (e.g. tables storing time-series data etc.), and this blowup
>     is quite bad ...
>
>
> True, but we don't know how big of a problem the density-skew
> problem might be (since the current algorithm isn't sensitive to it).
> It might be just as big of a problem. Tom mentioned some ancient
> history in the above mentioned thread that made me think the density
> skew was enough of a problem to motivate the current system.
>
>
>     I don't think we need to really assume the density to be constant,
>     maybe we can verify that while collecting the sample? I mean, we're
>     already reading the pages, so we can track the density, and either
>     do the correction or not.
>
>
> Maybe.  I don't know how that would work.

I was thinking about something like this (a rough code sketch follows 
the list):

  (1) compute the average number of tuples per page
  (2) generate a random sample of 'samplesize' blocks
  (3) for each block, look at the actual number of tuples on the page
      and apply a correction (e.g. if the page holds only 1/2 the
      average number of tuples, toss a coin to decide whether to really
      sample the block)
  (4) repeat until you have a sample of the requested size
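
A minimal sketch of that loop, just to make the idea concrete - 
tuples_on_page() is a placeholder for actually reading the block and 
counting its tuples, and avg_tuples_per_page has to come from somewhere 
(more on that below):

    import random

    def sample_blocks_with_density_correction(nblocks, samplesize,
                                               avg_tuples_per_page,
                                               tuples_on_page):
        """Steps (2)-(4): pick random blocks, accept each one with
        probability proportional to its actual tuple density."""
        assert samplesize <= nblocks
        chosen = set()
        while len(chosen) < samplesize:
            blkno = random.randrange(nblocks)        # step (2)
            if blkno in chosen:
                continue
            density = tuples_on_page(blkno)          # step (3)
            # a half-full page is kept only half the time; pages denser
            # than average are effectively capped at probability 1, which
            # is the weak spot discussed below
            if random.random() < density / avg_tuples_per_page:
                chosen.add(blkno)
        return chosen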

The first weak spot is that this requires knowledge of reltuples, which 
is something we currently estimate from the sample itself, so it would 
have to use the old values, I guess.

Also, I'm not sure this works all that well for blocks with higher than 
average density - you'd like to choose them with higher probability, 
but once a block has been selected there's no point in selecting it 
again.

> We would have to keep two samples, and dynamically decide which to
> use. And what if the decision is that both density skew is a problem
> and that value clustering is also a problem?

I don't see why we would have to keep two samples? The idea is that 
sampling the blocks truly randomly solves the clustering issue, and the 
correction I mentioned above solves the density skew problem.

> I wonder if the n_distinct could be tweaked so that it counted any
> given value only once for each block it finds it in? So instead of
> asking "how many times was this value sampled", ask "in how many
> blocks was this value sampled at least once"?

I don't think that's a really good idea. Imagine a column with a very 
low cardinality, so that every block you sample has just a few 
distinct values. That's going to get very nasty very soon, especially 
when combined with estimators assuming a truly random sample (which is 
exactly the issue with the current ndistinct estimator).
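
Just to make the two counting schemes concrete, here's a rough sketch 
(nothing like the actual analyze.c data structures, of course):

    from collections import Counter

    def per_row_counts(sampled_blocks):
        # current approach: every sampled row contributes to the
        # per-value frequencies fed to the ndistinct estimator
        counts = Counter()
        for rows in sampled_blocks:
            counts.update(rows)
        return counts

    def per_block_counts(sampled_blocks):
        # proposed variant: a value counts at most once per block, so
        # no count can ever exceed the number of sampled blocks
        counts = Counter()
        for rows in sampled_blocks:
            counts.update(set(rows))
        return counts

With a low-cardinality column the second variant throws away nearly all 
the duplicate information within blocks, which is not the kind of input 
an estimator expecting a random sample of rows can work with.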

>
>     Also, doesn't Vitter make pretty much the same assumption implicitly,
>     otherwise it couldn't skip some of the blocks?
>
>
> Vitter samples an unstructured stream in a single pass, and is
> unbiased.  The explicit block sampling is not part of Vitter, it is
> something we bolted on top of it.

Yeah, you're right of course. Sorry for not being quite precise. We're 
using a variant of the S algorithm, based on the idea that we know the 
number of blocks at the beginning - but that's pretty much the place 
where we assume uniform density of rows per block, no? Algorithm S 
skips blocks using that assumption implicitly.
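
For reference, the block selection is roughly the textbook Algorithm S 
(a sketch of the textbook version, not the actual BlockSampler code):

    import random

    def algorithm_s(N, n):
        """Select n of N block numbers with equal probability,
        scanning the block numbers in order."""
        selected = []
        for t in range(N):
            remaining_items = N - t
            remaining_needed = n - len(selected)
            # take block t with probability remaining_needed / remaining_items
            if random.random() * remaining_items < remaining_needed:
                selected.append(t)
                if remaining_needed == 1:
                    break
        return selected

Every block gets selected with probability n/N no matter how many rows 
it holds, so treating the rows coming out of those blocks as a uniform 
row sample only works if the blocks hold roughly the same number of rows.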

>
> My solution was to just unbolt the block sampling from the top, and let
> it sample the rows (still 300 * stats_target of them) from the whole
> table rather than a random 300 * 100 blocks of the table.  On tables of
> the size I was worried about, block sampling was not very helpful
> anyway.  Reading 30,000 blocks out of 250,000 blocks led to no
> meaningful IO advantage on my hardware. Any advantage from skipped
> blocks was made up for (and sometimes exceeded) by fouling up the
> read-ahead mechanisms.
>
> With 1000 times more blocks, that probably won't work for you.
> But I do wonder, how much time does it take to read a random 1/10,
> 1/100, 1/1000, 1/10000 of a table of your size, just from an IO
> perspective?  How much are we gaining by doing the block sample?

Well, actually I think it would be even more appropriate for very large 
tables. With a 2.5TB table, you don't really care whether ANALYZE 
collects a 5GB or an 8GB sample - the difference is rather minor 
compared to the I/O generated by other queries etc. The current sample 
is already random enough not to work well with read-ahead, and it scans 
only slightly fewer blocks.

And if the larger "random" sample results in better plans (especially 
plans without OOM) ...

--
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


