Home > mailing lists

Re: Estimating number of distinct values. - Mailing list pgsql-hackers

From	Tom Lane
Subject	Re: Estimating number of distinct values.
Date	October 24, 2018 14:27:19
Msg-id	19857.1540391239@sss.pgh.pa.us Whole thread Raw
In response to	Estimating number of distinct values. (Konstantin Knizhnik <k.knizhnik@postgrespro.ru>)
List	pgsql-hackers

Tree view

Konstantin Knizhnik <k.knizhnik@postgrespro.ru> writes:
> I will be pleased if somebody (first of all Robert) can comment me 
> "strange" results of distinct values estimation.

Estimating the number of distinct values from a small sample is a hard
problem; every algorithm is going to blow it in some cases.

> In my case there are no null values:
>  totalrows = 48980014
>  samplerows = 30000
>  ndistinct = 29135
>  nmultiple = 800

You seem to be using the default statistics target.  Possibly raising that
would give better answers for this column.

> May be instead of sampling-based estimation use streaming 
> based estimation, for example HyperLogLog algorithm already present in 
> Postgres?

Maybe, but I'd be really surprised if HLL were any sort of magic bullet.
I think it's intended to make an ndistinct estimate with just a small
amount of state preserved as rows are scanned, which is an admirable
goal but not one that ANALYZE particularly needs.  Adopting such a
constraint when we don't need it does not sound like a recipe for
getting a better final answer.

            regards, tom lane

pgsql-hackers by date:

From: Sergei Kornilov
Date: 24 October 2018, 14:24:01
Subject: JSON validation behavior

From: "David G. Johnston"
Date: 24 October 2018, 14:40:00
Subject: Re: JSON validation behavior

Re: Estimating number of distinct values. - Mailing list pgsql-hackers

Previous

Next