Re: Yet another abort-early plan disaster on 9.3 - Mailing list pgsql-performance

From Tomas Vondra
Subject Re: Yet another abort-early plan disaster on 9.3
Msg-id 542F319A.6030002@fuzzy.cz
In response to Re: Yet another abort-early plan disaster on 9.3  (Jeff Janes <jeff.janes@gmail.com>)
List pgsql-performance
On 3.10.2014 21:58, Jeff Janes wrote:
> On Thu, Oct 2, 2014 at 12:56 PM, Josh Berkus <josh@agliodbs.com> wrote:
>
>     Yes, it's only intractable if you're wedded to the idea of a tiny,
>     fixed-size sample.  If we're allowed to sample, say, 1% of the table, we
>     can get a MUCH more accurate n_distinct estimate using multiple
>     algorithms, of which HLL is one.  While n_distinct will still have some
>     variance, it'll be over a much smaller range.
>
>
> In my hands, the problems with poor n_distinct were not due to the
> insufficient size of the sample, but the insufficient randomness of it.
> Increasing  default_statistics_target did help but not because it
> increases the number of rows sampled, but rather because it increases
> the number of blocks sampled.  Once substantially all of the blocks are
> part of the block sampling, the bias is eliminated even though it was
> still sampling a small fraction of the rows (roughly one per block).

I don't think that's entirely accurate. According to [1], there's a
lower bound on the ratio error, depending on the number of sampled rows.

Say there's a table with 10M rows and we sample 30k rows (the default).
Then with probability 5% we'll get a ratio error over 20. That is, we
may estimate either less than 5% or more than 20x (2000%) of the actual
ndistinct value. Combined with the arbitrary 10% threshold we use to
decide whether ndistinct scales with the number of rows, this sometimes
explodes.
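
For illustration, here's a quick back-of-the-envelope check of that
bound. It assumes the Theorem 1 form from [1] as I read it, i.e.
ratio error >= sqrt((n - r) / (2r) * ln(1/delta)) with probability
delta, so treat the exact constants with a grain of salt:

import math

def ratio_error_lower_bound(total_rows, sample_rows, delta):
    # Worst-case ratio error that *any* ndistinct estimator examining
    # only sample_rows out of total_rows must suffer with probability
    # delta, per the bound in [1] (formula reproduced from memory).
    return math.sqrt((total_rows - sample_rows) / (2.0 * sample_rows)
                     * math.log(1.0 / delta))

print(ratio_error_lower_bound(10_000_000, 30_000, 0.05))   # ~22

which is where the "ratio error over 20" above comes from.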

By increasing the statistics target you get a much larger sample and
thus a lower probability of such an error. But it still breaks from
time to time, and the fact that the statistics target is static (it
does not scale with the size of the table to give an appropriately
sized sample) is not really helping IMHO. A static sample size may
work for histograms; for ndistinct, not so much.
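
To put a number on that: inverting the same bound (same caveat about
the exact formula) and solving for the sample size r needed to keep
the worst-case ratio error at some target E gives
r = n * ln(1/delta) / (2*E^2 + ln(1/delta)), i.e. the sample has to
grow roughly linearly with the table:

import math

def sample_rows_needed(total_rows, max_ratio_error, delta):
    # Sample size keeping the lower bound from [1] below
    # max_ratio_error with probability delta (derived by inverting
    # the formula above).
    l = math.log(1.0 / delta)
    return total_rows * l / (2.0 * max_ratio_error ** 2 + l)

for n in (1_000_000, 10_000_000, 100_000_000):
    print(n, int(sample_rows_needed(n, 10.0, 0.05)))
# ~15k, ~148k, ~1.5M -- the required sample grows with the row count,
# while a fixed statistics target keeps it constant

So a 30k-row sample that's plenty for a 1M-row table is an order of
magnitude too small for a 100M-row one.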

[1]
http://ftp.cse.buffalo.edu/users/azhang/disc/disc01/cd1/out/papers/pods/towardsestimatimosur.pdf


> So one idea would be go get rid of the 2-stage sampling algorithm
> (sample blocks, sample rows from the chosen blocks) and just read
> the whole table and sample rows from it unbiased, at least under
> some conditions. Some low level benchmarking on my favorite server
> showed that reading 1% of a system's blocks (in block number order
> within each file) was no faster than reading all of them from an IO
> perspective. But that is a virtualized server that wasn't really
> speced out to be an IO intensive database server in the first place.
> It would be interesting to see what people get on real hardware that
> they actually designed for the task.
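
For anyone who wants to play with this, here's a deliberately
exaggerated toy simulation of the effect Jeff describes. It samples
many rows per block instead of roughly one, and uses a Haas/Stokes
style estimator reproduced from memory, so it's a sketch of the
mechanism, not of what ANALYZE actually does:

import random
from collections import Counter

random.seed(1)

ROWS_PER_BLOCK = 100
N_BLOCKS = 10_000                 # 1M-row toy table
AVG_RUN_LENGTH = 20               # values clustered in runs of ~20 rows

# Build a clustered column: consecutive rows tend to share a value,
# so each 100-row "block" holds only a handful of distinct values.
table, value = [], 0
for _ in range(N_BLOCKS * ROWS_PER_BLOCK):
    if random.random() < 1.0 / AVG_RUN_LENGTH:
        value += 1
    table.append(value)
true_ndistinct = len(set(table))

def duj1(sample, total_rows):
    # Duj1-style estimator (similar in spirit to what analyze.c uses,
    # formula from memory): n*d / (n - f1 + f1*n/N)
    n, d = len(sample), len(set(sample))
    f1 = sum(1 for c in Counter(sample).values() if c == 1)
    return n * d / (n - f1 + f1 * n / total_rows)

SAMPLE_ROWS = 3_000

# (a) truly uniform row sample
uniform = random.sample(table, SAMPLE_ROWS)

# (b) block-correlated sample: 100 random blocks, 30 rows from each
blocky = []
for b in random.sample(range(N_BLOCKS), SAMPLE_ROWS // 30):
    start = b * ROWS_PER_BLOCK
    blocky += random.sample(table[start:start + ROWS_PER_BLOCK], 30)

print("true ndistinct:   ", true_ndistinct)                   # ~50k
print("uniform rows:     ", int(duj1(uniform, len(table))))   # close
print("block-correlated: ", int(duj1(blocky, len(table))))    # way low

With clustered data like this, the block-correlated sample sees far
more duplicates than a uniform row sample, so the estimate comes out
orders of magnitude too low - exactly the kind of bias that reading
the whole table (or better row-level sampling) would avoid.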

I think there was a discussion about the sampling on pgsql-hackers a
while ago ... and yes, here it is [2]. However, it seems there was no
clear conclusion on how to change it at that time ...


[2]
http://www.postgresql.org/message-id/CA+TgmoZaqyGSuaL2v+YFVsX06DQDQh-pEV0nobGPws-dNwAwBw@mail.gmail.com

regards
Tomas

