Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H
Msg-id 558586B8.8090803@2ndquadrant.com
In response to Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H  (Jeff Janes <jeff.janes@gmail.com>)
List pgsql-hackers

On 06/20/2015 04:17 PM, Robert Haas wrote:
> On Wed, Jun 17, 2015 at 1:52 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> I'm currently running some tests on a 3TB TPC-H data set, and I
>> tripped over a pretty bad n_distinct underestimate, causing OOM in
>> HashAgg (which somehow illustrates the importance of the
>> memory-bounded hashagg patch Jeff Davis is working on).
>
> Stupid question, but why not just override it using ALTER TABLE ...
> ALTER COLUMN ... SET (n_distinct = ...)?

Sure, I'll do that, and it's probably the only way to fix this at the
moment. But I wasn't really sure why we're producing such a poor
estimate in the first place - even considering how small the sample
is, it seemed a bit too bad, and the proportionality to the statistics
target seemed really strange.

Also, if we could produce a better n_distinct estimate, wouldn't that
be awesome? I'd much rather not force users to use the manual override
if possible.
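
FWIW the override would look something like this (the table/column
names below are just an example, and a negative n_distinct is
interpreted as a fraction of the row count rather than an absolute
number of distinct values):

    -- pin the estimate to 25% of the row count for the column
    ALTER TABLE lineitem ALTER COLUMN l_orderkey
        SET (n_distinct = -0.25);

    -- the override only takes effect on the next ANALYZE
    ANALYZE lineitem;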

>
> I think it's been discussed quite often on previous threads that you
> need to sample an awful lot of the table to get a good estimate for
> n_distinct. We could support that, but it would be expensive, and it
> would have to be done again every time the table is auto-analyzed.
> The above syntax supports nailing the estimate to either an exact
> value or a percentage of the table, and I'm not sure why that isn't
> good enough.

Not really. Using a larger sample would certainly help in most cases,
but in this case the main problem is that the sample is not as random
as needed. It produces way more duplicate values than it should, which
then utterly confuses the estimator, which assumes a random sample.
This happens because the column is correlated with the position in the
table, so the block-based sample ends up with far more repeated values
than a uniformly random sample of rows would contain.
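
Just to explain why the duplicates matter so much - IIRC the estimator
used by ANALYZE is the Haas-Stokes one, which is roughly

    n_distinct ~ (n * d) / (n - f1 + f1 * n / N)

where n is the number of sampled rows, N the total number of rows, d
the number of distinct values in the sample and f1 the number of values
seen exactly once. Once the sample is full of block-level duplicates,
both d and f1 drop, and the estimate collapses towards d - hence the
huge underestimate.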

I've lobotomized the sampling a bit to really produce a random set of 
blocks first, and that produces way better estimates:
   statistics target     estimate            random
   ------------------------------------------------------------
   100                   429491   (10000x)   334430766 (14x)
   1000                  4240418  (1000x)    439010499 (10x)
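
In case anyone wants to check the stored estimate on their own data,
this is just the usual pg_stats lookup (again, the TPC-H names are
used purely as an example):

    SELECT n_distinct
      FROM pg_stats
     WHERE tablename = 'lineitem' AND attname = 'l_orderkey';

A negative value there means a fraction of the row count, a positive
one an absolute number of distinct values.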
 

Also, the number of sampled blocks is not that different. With target 
100, the current sampler reads ~2900 blocks, while a completely random 
sampler uses 3000 blocks. So, where's the benefit?

I'm sure there are counter-examples, but that's true for all estimators.
I do realize the random sampler assumes the tuple density is about the
same in all blocks (so that it can sample the blocks directly), but ISTM
the current block sampler has to make basically the same assumption to
skip some of the blocks.

Have to read the previous threads on this topic I guess ...

--
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


