Re: Yet another abort-early plan disaster on 9.3 - Mailing list pgsql-performance

From Tomas Vondra
Subject Re: Yet another abort-early plan disaster on 9.3
Date
Msg-id f4c06746b31e468f9550fb9924798da3.squirrel@2.emaily.eu
Whole thread Raw
In response to Re: Yet another abort-early plan disaster on 9.3  (Greg Stark <stark@mit.edu>)
Responses Re: Yet another abort-early plan disaster on 9.3  (Craig James <cjames@emolecules.com>)
Re: Yet another abort-early plan disaster on 9.3  (Tomas Vondra <tv@fuzzy.cz>)
List pgsql-performance
Dne 10 Říjen 2014, 13:16, Greg Stark napsal(a):
> On Thu, Oct 2, 2014 at 8:56 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> Yes, it's only intractable if you're wedded to the idea of a tiny,
>> fixed-size sample.  If we're allowed to sample, say, 1% of the table, we
>> can get a MUCH more accurate n_distinct estimate using multiple
>> algorithms, of which HLL is one.  While n_distinct will still have some
>> variance, it'll be over a much smaller range.
>
> I've gone looking for papers on this topic but from what I read this
> isn't so. To get any noticeable improvement you need to read 10-50% of
> the table and that's effectively the same as reading the entire table
> -- and it still had pretty poor results. All the research I could find
> went into how to analyze the whole table while using a reasonable
> amount of scratch space and how to do it incrementally.

I think it's really difficult to discuss the estimation without some basic
agreement on what are the goals. Naturally, we can't get a perfect
estimator with small samples (especially when the sample size is fixed and
not scaling with the table). But maybe we can improve the estimates
without scanning most of the table?

FWIW I've been playing with the adaptive estimator described in [1] and
the results looks really interesting, IMHO. So far I was testing it on
synthetic datasets outside the database, but I plan to use it instead of
our estimator, and do some more tests.

Would be helpful to get a collection of test cases that currently perform
poorly. I have collected a few from the archives, but if those who follow
this thread can provide additional test cases / point to a thread
describing related etc. that'd be great.

It certainly won't be perfect, but if it considerably improves the
estimates then I believe it's step forward. Ultimately, it's impossible to
improve the estimates without increasing the sample size.

[1]
http://ftp.cse.buffalo.edu/users/azhang/disc/disc01/cd1/out/papers/pods/towardsestimatimosur.pdf

regards
Tomas



pgsql-performance by date:

Previous
From: Greg Stark
Date:
Subject: Re: Yet another abort-early plan disaster on 9.3
Next
From: Craig James
Date:
Subject: Re: Yet another abort-early plan disaster on 9.3