Home > mailing lists

Re: Yet another abort-early plan disaster on 9.3 - Mailing list pgsql-performance

From	Tomas Vondra
Subject	Re: Yet another abort-early plan disaster on 9.3
Date	October 10, 2014 12:10:58
Msg-id	f4c06746b31e468f9550fb9924798da3.squirrel@2.emaily.eu Whole thread Raw
In response to	Re: Yet another abort-early plan disaster on 9.3 (Greg Stark <stark@mit.edu>)
Responses	Re: Yet another abort-early plan disaster on 9.3 Re: Yet another abort-early plan disaster on 9.3
List	pgsql-performance

Tree view

Dne 10 Říjen 2014, 13:16, Greg Stark napsal(a):
> On Thu, Oct 2, 2014 at 8:56 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> Yes, it's only intractable if you're wedded to the idea of a tiny,
>> fixed-size sample.  If we're allowed to sample, say, 1% of the table, we
>> can get a MUCH more accurate n_distinct estimate using multiple
>> algorithms, of which HLL is one.  While n_distinct will still have some
>> variance, it'll be over a much smaller range.
>
> I've gone looking for papers on this topic but from what I read this
> isn't so. To get any noticeable improvement you need to read 10-50% of
> the table and that's effectively the same as reading the entire table
> -- and it still had pretty poor results. All the research I could find
> went into how to analyze the whole table while using a reasonable
> amount of scratch space and how to do it incrementally.

I think it's really difficult to discuss the estimation without some basic
agreement on what are the goals. Naturally, we can't get a perfect
estimator with small samples (especially when the sample size is fixed and
not scaling with the table). But maybe we can improve the estimates
without scanning most of the table?

FWIW I've been playing with the adaptive estimator described in [1] and
the results looks really interesting, IMHO. So far I was testing it on
synthetic datasets outside the database, but I plan to use it instead of
our estimator, and do some more tests.

Would be helpful to get a collection of test cases that currently perform
poorly. I have collected a few from the archives, but if those who follow
this thread can provide additional test cases / point to a thread
describing related etc. that'd be great.

It certainly won't be perfect, but if it considerably improves the
estimates then I believe it's step forward. Ultimately, it's impossible to
improve the estimates without increasing the sample size.

[1]
http://ftp.cse.buffalo.edu/users/azhang/disc/disc01/cd1/out/papers/pods/towardsestimatimosur.pdf

regards
Tomas

pgsql-performance by date:

From: Greg Stark
Date: 10 October 2014, 11:16:57
Subject: Re: Yet another abort-early plan disaster on 9.3

From: Craig James
Date: 10 October 2014, 14:21:15
Subject: Re: Yet another abort-early plan disaster on 9.3

Re: Yet another abort-early plan disaster on 9.3 - Mailing list pgsql-performance

Previous

Next