Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H
Msg-id 558586B8.8090803@2ndquadrant.com
In response to Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H  (Jeff Janes <jeff.janes@gmail.com>)
List pgsql-hackers

On 06/20/2015 04:17 PM, Robert Haas wrote:
> On Wed, Jun 17, 2015 at 1:52 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> I'm currently running some tests on a 3TB TPC-H data set, and I
>> tripped over a pretty bad n_distinct underestimate, causing OOM in
>> HashAgg (which somehow illustrates the importance of the
>> memory-bounded hashagg patch Jeff Davis is working on).
>
> Stupid question, but why not just override it using ALTER TABLE ...
> ALTER COLUMN ... SET (n_distinct = ...)?

Sure, I'll do that, and it's probably the only way to fix this at the
moment. But I wasn't really sure why we're producing such a poor
estimate in the first place - even considering how small the sample
is, it seemed a bit too bad, and the proportionality to the statistics
target seemed really strange.

Also, if we could produce a better n_distinct estimate, wouldn't that
be awesome? I'd much rather not force users to use the manual override
if possible.
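
FWIW the override would look something like this (the table/column
names below are just an example, and a negative n_distinct is
interpreted as a fraction of the row count rather than an absolute
number of distinct values):

    -- pin the estimate to 25% of the row count for the column
    ALTER TABLE lineitem ALTER COLUMN l_orderkey
        SET (n_distinct = -0.25);

    -- the override only takes effect on the next ANALYZE
    ANALYZE lineitem;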

>
> I think it's been discussed quite often on previous threads that you
> need to sample an awful lot of the table to get a good estimate for
> n_distinct. We could support that, but it would be expensive, and it
> would have to be done again every time the table is auto-analyzed.
> The above syntax supports nailing the estimate to either an exact
> value or a percentage of the table, and I'm not sure why that isn't
> good enough.

Not really. Using a larger sample would certainly help in most cases,
but in this case the main problem is that the sample is not as random
as needed. It produces way more duplicate values than it should, which
then utterly confuses the estimator, which assumes a random sample.
This happens because the column is correlated with the position in the
table, so the block-based sample ends up with far more repeated values
than a uniformly random sample of rows would contain.
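
Just to explain why the duplicates matter so much - IIRC the estimator
used by ANALYZE is the Haas-Stokes one, which is roughly

    n_distinct ~ (n * d) / (n - f1 + f1 * n / N)

where n is the number of sampled rows, N the total number of rows, d
the number of distinct values in the sample and f1 the number of values
seen exactly once. Once the sample is full of block-level duplicates,
both d and f1 drop, and the estimate collapses towards d - hence the
huge underestimate.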

I've lobotomized the sampling a bit to really produce a random set of 
blocks first, and that produces way better estimates:
   statistics target     estimate            random
   ------------------------------------------------------------
   100                   429491   (10000x)   334430766 (14x)
   1000                  4240418  (1000x)    439010499 (10x)
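
In case anyone wants to check the stored estimate on their own data,
this is just the usual pg_stats lookup (again, the TPC-H names are
used purely as an example):

    SELECT n_distinct
      FROM pg_stats
     WHERE tablename = 'lineitem' AND attname = 'l_orderkey';

A negative value there means a fraction of the row count, a positive
one an absolute number of distinct values.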
 

Also, the number of sampled blocks is not that different. With target 
100, the current sampler reads ~2900 blocks, while a completely random 
sampler uses 3000 blocks. So, where's the benefit?

I'm sure there are counter-examples, but that's true for all estimators.
I do realize the random sampler assumes the tuple density is about the
same in all blocks (so that it can sample the blocks directly), but ISTM
the current block sampler has to make basically the same assumption to
skip some of the blocks.

Have to read the previous threads on this topic I guess ...

--
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


