Thread: PATCH: adaptive ndistinct estimator v3 (WAS: Re: [PERFORM] Yet another abort-early plan disaster on 9.3)

Hi!

This was initially posted to pgsql-performance in this thread:

  http://www.postgresql.org/message-id/5472416C.3080506@fuzzy.cz

but pgsql-hackers seems like a more appropriate place for further
discussion.

Anyways, attached is v3 of the patch implementing the adaptive ndistinct
estimator. Just like the previous version, the original estimate is the
one stored/used, and the alternative one is just printed, to make it
possible to compare the results.

Changes in this version:

1) implementing compute_minimal_stats

   - So far only the 'scalar' (more common) case was handled.

   - The algorithm requires more detailed input data than the MCV-based
     stats provide, so the code hashes the sampled values and then
     determines the f1, f2, ..., fN coefficients (fK being the number of
     values occurring exactly K times in the sample) by sorting and
     walking the array of hashes. A short sketch of this step follows
     the list below.

2) handling wide values properly (they are now counted into f1)

3) compensating for NULL values when calling optimize_estimate

   - The estimator has no notion of NULL values, so it's necessary to
     remove them both from the total row count and from the sampled rows.

4) some minor fixes and refactorings
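
A minimal sketch of the f1, f2, ..., fN computation mentioned in (1), in
illustrative Python rather than the actual patch code:

    def count_frequencies(hashes):
        """f[k] = number of values occurring exactly k times in the
        sample, computed by sorting the hashes and walking the runs."""
        hashes = sorted(hashes)
        f = [0] * (len(hashes) + 1)
        i = 0
        while i < len(hashes):
            j = i
            while j < len(hashes) and hashes[j] == hashes[i]:
                j += 1               # walk to the end of the current run
            f[j - i] += 1            # one more value seen exactly (j-i) times
            i = j
        return f

The number of distinct values in the sample is then simply sum(f[1:]),
and f[1], f[2] are the terms the estimator relies on most heavily.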


I also repeated the tests comparing the results to the current estimator
- full results are at the end of the post.

The one interesting case is the 'step skew' with statistics_target=10,
i.e. estimates based on a mere 3,000 rows. In that case, the adaptive
estimator significantly overestimates:

    values   current    adaptive
    ------------------------------
    106           99         107
    106            8     6449190
    1006          38     6449190
    10006        327       42441

I don't know why I didn't get these errors in the previous runs, because
when I repeat the tests with the old patches I get similar results, with
a 'good' result from time to time. Apparently I had a lucky day back
then :-/

I've been messing with the code for a few hours, and I haven't found any
significant error in the implementation, so it seems that the estimator
does not perform terribly well for very small samples (in this case
3,000 rows out of 10,000,000, i.e. ~0.03%).

However, I've been able to come up with a simple way to limit such
errors, because the number of distinct values is naturally bounded by

    (totalrows / samplerows) * ndistinct

where ndistinct is the number of distinct values in the sample. This
corresponds to the worst case where, if you slice the table into
(totalrows / samplerows) chunks of samplerows rows each, every chunk
contributes its own disjoint set of ndistinct values.
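
In code the cap is a one-liner - a sketch of the idea (the names are
illustrative, not the patch's actual identifiers):

    def cap_estimate(estimate, total_rows, sample_rows, sample_ndistinct):
        # scale the ndistinct observed in the sample up by the sampling
        # fraction, and never return anything above that
        return min(estimate,
                   (total_rows / float(sample_rows)) * sample_ndistinct)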

BTW, this also fixes the issue reported by Jeff Janes on 21/11.

With this additional sanity check, the results look like this:

    values   current    adaptive
    ------------------------------
    106           99         116
    106            8       23331
    1006          38       96657
    10006        327       12400

Which is much better, but clearly still a bit on the high side.

So either the estimator really is a bit unstable for such small samples
(it tends to overestimate a bit in all the tests), or there's a bug in
the implementation - I'd be grateful if someone could peek at the code
and maybe compare it to the paper describing the estimator. I've spent a
fair amount of time analyzing it, but found nothing.

But maybe the estimator really is unstable for such small samples - in
that case we could probably use the current estimator as a fallback.
After all, this only happens when someone explicitly decreases the
statistics target to 10 - with the default statistics target it's damn
accurate.

kind regards
Tomas


statistics_target = 10
======================

a) smooth skew, 101 values, different skew ('k')

   - defaults to the current estimator

b) smooth skew, 10,001 values, different skew ('k')

    k    current  adaptive
    -----------------------
    1      10231     11259
    2       6327      8543
    3       4364      7707
    4       3436      7052
    5       2725      5868
    6       2223      5071
    7       1979      5011
    8       1802      5017
    9       1581      4546

c) step skew (different numbers of values)

    values   current    adaptive
    ------------------------------
    106           99         107
    106            8     6449190
    1006          38     6449190
    10006        327       42441

   patched:

    values   current    adaptive
    ------------------------------
    106           99         116
    106            8       23331
    1006          38       96657
    10006        327       12400


statistics_target = 100
=======================

a) smooth skew, 101 values, different skew ('k')

   - defaults to the current estimator

b) smooth skew, 10,001 values, different skew ('k')

    k      current     adaptive
    -----------------------------
    1        10011        10655
    2         9641        10944
    3         8837        10846
    4         8315        10992
    5         7654        10760
    6         7162        10524
    7         6650        10375
    8         6268        10275
    9         5871         9783

c) step skew (different numbers of values)

    values   current    adaptive
    ------------------------------
    106           30          70
    1006         271        1181
    10006       2804       10312


statistics_target = 1000
========================

a) smooth skew, 101 values, different skew ('k')

   - defaults to the current estimator

b) smooth skew, 10,001 values, different skew ('k')

    k    current     adaptive
    ---------------------------
    3      10001        10002
    4      10000        10003
    5       9996        10008
    6       9985        10013
    7       9973        10047
    8       9954        10082
    9       9932        10100

c) step skew (different numbers of values)

    values   current    adaptive
    ------------------------------
    106          105         113
    1006         958        1077
    10006       9592       10840


Attachment
On 12/07/2014 03:54 AM, Tomas Vondra wrote:
> The one interesting case is the 'step skew' with statistics_target=10,
> i.e. estimates based on mere 3000 rows. In that case, the adaptive
> estimator significantly overestimates:
>
>      values   current    adaptive
>      ------------------------------
>      106           99         107
>      106            8     6449190
>      1006          38     6449190
>      10006        327       42441
>
> I don't know why I didn't get these errors in the previous runs, because
> when I repeat the tests with the old patches I get similar results with
> a 'good' result from time to time. Apparently I had a lucky day back
> then :-/
>
> I've been messing with the code for a few hours, and I haven't found any
> significant error in the implementation, so it seems that the estimator
> does not perform terribly well for very small samples (in this case it's
> 3000 rows out of 10.000.000 (i.e. ~0.03%).

The paper [1] gives an equation for an upper bound of the error of this 
GEE estimator. How do the above numbers compare with that bound?

[1] 
http://ftp.cse.buffalo.edu/users/azhang/disc/disc01/cd1/out/papers/pods/towardsestimatimosur.pdf

- Heikki




On 23.12.2014 11:28, Heikki Linnakangas wrote:
> On 12/07/2014 03:54 AM, Tomas Vondra wrote:
>> The one interesting case is the 'step skew' with statistics_target=10,
>> i.e. estimates based on mere 3000 rows. In that case, the adaptive
>> estimator significantly overestimates:
>>
>>      values   current    adaptive
>>      ------------------------------
>>      106           99         107
>>      106            8     6449190
>>      1006          38     6449190
>>      10006        327       42441
>>
>> I don't know why I didn't get these errors in the previous runs, because
>> when I repeat the tests with the old patches I get similar results with
>> a 'good' result from time to time. Apparently I had a lucky day back
>> then :-/
>>
>> I've been messing with the code for a few hours, and I haven't
>> found any significant error in the implementation, so it seems that
>> the estimator does not perform terribly well for very small samples
>> (in this case it's 3000 rows out of 10.000.000 (i.e. ~0.03%).
> 
> The paper [1] gives an equation for an upper bound of the error of this
> GEE estimator. How do the above numbers compare with that bound?

Well, that's a bit more complicated, because the "Theorem 1" you mention
does not directly specify an upper bound for a single estimate. It's
formulated like this:

    Assume a table with N rows, D distinct values and a sample of r rows
    (all those values are fixed). Then there exists a dataset with those
    features, such that the "ratio error" of the estimate D'

        error(D, D') = max(D'/D, D/D')

    is greater than f(N, r, P) with probability at least P. I.e. if you
    randomly choose a sample of r rows, you'll get an error exceeding the
    ratio with probability P.

So it's not a hard limit, but speaks about the probability of estimation
error on some (unknown) dataset. So it describes what you can achieve at
best - if you think your estimator is better, there'll always be a
dataset hiding in the shadows hissing "Theorem 1".


Let's say we're looking for a boundary that's crossed only in 1% (or 5%)
of measurements. Applying this to the sample data I posted before, i.e.
10M rows with three sample sizes 'r' (3,000, 30,000 and 300,000 rows),
the ratio error boundary per the paper is

              3,000    30,000    300,000
    ------------------------------------
     1%          88        28          9
     5%          70        22          7


At least that's what I get if I compute it using this Python function:

    from math import sqrt, log

    def err(N, r, p):
        return sqrt((N - r) / (2.0 * r) * log(1.0 / p))
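
Just to double-check, plugging in the numbers used above (N = 10M, the
three sample sizes, p = 0.01 and 0.05) gives values matching that table,
give or take rounding:

    for r in (3000, 30000, 300000):
        print(r, err(10000000, r, 0.01), err(10000000, r, 0.05))

    # prints approximately:
    #     3000    87.6    70.6
    #    30000    27.7    22.3
    #   300000     8.6     7.0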


So the estimates I posted before are not terribly good, I guess,
especially the ones returning 6449190. I wonder whether there really is
some stupid bug in the implementation.

regards

-- 
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: adaptive ndistinct estimator v4

From: Tomas Vondra
Hi all,

attached is v4 of the patch implementing adaptive ndistinct estimator.

I've been looking into the strange estimates, mentioned on 2014/12/07:

>      values   current    adaptive
>      ------------------------------
>      106           99         107
>      106            8     6449190
>      1006          38     6449190
>      10006        327       42441

I suspected this might be some sort of rounding error in the numerical
optimization (looking for 'm' solving the equation from the paper), but
it turns out that's not the case.

The adaptive estimator is a bit unstable for skewed distributions that
are not sufficiently smooth. Whenever f[1] or f[2] was 0 (i.e. there
were no values occurring exactly once or twice in the sample), the result
was rather off.

The simple workaround for this was adding a fallback to GEE when f[1] or
f[2] is 0. GEE is another estimator described in the paper, behaving
much better in those cases.
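
For reference, GEE itself is quite simple - roughly this (a sketch using
the f[] coefficients from the sample, not the actual patch code):

    from math import sqrt

    def gee(total_rows, sample_rows, f):
        # values seen exactly once are scaled up by sqrt(N/r), values
        # seen two or more times are counted once each
        return sqrt(total_rows / float(sample_rows)) * f[1] + sum(f[2:])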

With the current version, I do get this (with statistics_target=10):

      values   current    adaptive
      ------------------------------
      106           99         108
      106            8         178
      1006          38        2083
      10006        327       11120

The results do change a bit based on the sample, but these values are a
good example of the values I'm getting.

The other examples (with skewed but smooth distributions) work as well
as before.

--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: PATCH: adaptive ndistinct estimator v4

From: Greg Stark
<p dir="ltr">> The simple workaround for this was adding a fallback to GEE when f[1] or f[2] is 0. GEE is another
estimatordescribed in the paper, behaving much better in those cases.<p dir="ltr">For completeness, what's the downside
injust always using GEE? 

Re: PATCH: adaptive ndistinct estimator v4

From: Tomas Vondra
Hi,

On 04/03/15 15:46, Greg Stark wrote:
>  > The simple workaround for this was adding a fallback to GEE when f[1]
> or f[2] is 0. GEE is another estimator described in the paper, behaving
> much better in those cases.
>
> For completeness, what's the downside in just always using GEE?

That's a good question.

GEE is the estimator with minimal average error, as defined in Theorem 1 
in that paper. The exact formulation of the theorem is a bit complex, 
but it essentially says that knowing just the sizes of the data set and 
sample, there's an accuracy limit.

Or put another way, it's possible to construct the data set so that the 
estimator gives you estimates with error exceeding some limit (with a 
certain probability).

Knowledge of how much the data set is skewed gives us opportunity to 
improve the estimates by choosing an estimator performing better with 
such data sets. The problem is we don't know the skew - we can only 
estimate it from the sample, which is what the hybrid estimators do.

The AE estimator (presented in the paper and implemented in the patch) 
is an example of such hybrid estimators, but based on my experiments it 
does not work terribly well with one particular type of skew that I'd 
expect to be relatively common (few very common values, many very rare 
values).

Luckily, GEE performs pretty well in this case, but we can use the AE 
otherwise (ISTM it gives really good estimates).

But of course - there'll always be data sets that are poorly estimated 
(pretty much as Theorem 1 in the paper says). It'd be nice to do more 
testing on real-world data sets, to see if this performs better or worse 
than our current estimator.

regards
Tomas



Re: PATCH: adaptive ndistinct estimator v4

From: Jeff Janes
On Tue, Mar 31, 2015 at 12:02 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
Hi all,

attached is v4 of the patch implementing adaptive ndistinct estimator.

Hi Tomas,

I have a case here where the adaptive algorithm underestimates ndistinct by a factor of 7 while the default estimator is pretty close.

5MB file:


# create table foo2 (x text);
# \copy foo2 from program 'bzcat ~/temp/foo1.txt.bz2'
# analyze verbose foo2;
INFO:  analyzing "public.foo2"
INFO:  "foo2": scanned 6021 of 6021 pages, containing 1113772 live rows and 0 dead rows; 30000 rows in sample, 1113772 estimated total rows
WARNING: ndistinct estimate current=998951.78 adaptive=135819.00

Cheers,

Jeff

Re: PATCH: adaptive ndistinct estimator v4

From: Jeff Janes
On Tue, Apr 14, 2015 at 11:45 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Tue, Mar 31, 2015 at 12:02 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
Hi all,

attached is v4 of the patch implementing adaptive ndistinct estimator.

Hi Tomas,

I have a case here where the adaptive algorithm underestimates ndistinct by a factor of 7 while the default estimator is pretty close.

5MB file:


# create table foo2 (x text);
# \copy foo2 from program 'bzcat ~/temp/foo1.txt.bz2'
# analyze verbose foo2;
INFO:  analyzing "public.foo2"
INFO:  "foo2": scanned 6021 of 6021 pages, containing 1113772 live rows and 0 dead rows; 30000 rows in sample, 1113772 estimated total rows
WARNING: ndistinct estimate current=998951.78 adaptive=135819.00

I've done a more complete analysis with a real world database I have access to.

I've analyzed patched and current with default_statistics_target of 100 and 10000. (I also have some of the same data under 9.2, but that is not meaningfully different than unpatched head).

For easier interpretation I hacked up the analyzer so that it just reports the estimated number, never converting to the negative fraction.

See the spreadsheet here:

https://docs.google.com/spreadsheets/d/1qUcBoQkRFFcSDq7GtkiQkHqlLTbxQYl5hh6S0byff2M/edit?usp=sharing

The 10000 target was initially collected in an attempt to discern the truth when the 100 target methods disagreed, but I decided to just collect the gold-standard truth.

The truth is given by:
select count(*) from (select distinct column from schema.table where column is not null) foo;

And number_not_null is given by:
select count(*) from schema.table where column is not null;

It looks like the proposed method sometimes overestimates, although never by a huge amount, while the old one never overestimated.  Overall it mostly seems to be more accurate, but occasionally it does substantially worse than the current method.  I suspect most of the problems are related to the same issue reported in the last email.  There are a lot of places where both underestimate, but where the new method does so by less than head.

If there are any columns anyone wants to examine further, give me the token for it and I'll try to create a generator that generates data with the same distribution (and clustering, if that seems relevant).

Cheers,

Jeff

Re: PATCH: adaptive ndistinct estimator v4

From: Robert Haas
On Tue, Mar 31, 2015 at 3:02 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> attached is v4 of the patch implementing adaptive ndistinct estimator.

So, I took a look at this today.  It's interesting work, but it looks
more like a research project than something we can commit to 9.5.  As
far as I can see, this still computes the estimate the way we do
today, but then also computes the estimate using this new method.  The
estimate computed the new way isn't stored anywhere, so this doesn't
really change behavior, except for printing out a WARNING comparing
the values produced by the two estimators.

IMHO, the comments in this patch are pretty inscrutable.  I believe
this is because they presume more knowledge of what the patch is doing
than I myself possess.  For example:

+ * The AEE estimator is based on solving this equality (for "m")
+ *
+ *     m - f1 - f2 = f1 * (A + A(m)) / (B + B(m))
+ *
+ * where A, B are effectively constants (not depending on m), and A(m)
+ * and B(m) are functions. This is equal to solving
+ *
+ *     0 = f1 * (A + A(m)) / (B + B(m)) - (m - f1 - f2)

Perhaps I am just a dummy, but I have no idea what any of that means.
I think that needs to be fixed so that someone who knows what
n_distinct is but knows nothing about the details of these estimators
can get an idea of how they are doing their thing without too much
effort.  I think a lot of the comments share this problem.

Aside from the problems mentioned above, there's the broader question
of how to evaluate the quality of the estimates produced by this
estimator vs. what we're doing right now.  I see that Jeff Janes has
pointed out some cases that seem to regress with this patch; those
presumably need some investigation, or at least some comment.  And I
think some testing from other people would be good too, before we
commit to this.

Leaving that aside, at some point, you'll say, "OK, there may be some
regressions left but overall I believe this is going to be a win in
most cases".  It's going to be really hard for anyone, no matter how
smart, to figure out through code review whether that is true.  So
committing this figures to be extremely frightening.  It's just not
going to be reasonably possible to know what percentage of users are
going to be more happy after this change and what percentage are going
to be less happy.

Therefore, I think that:

1. This should be committed near the beginning of a release cycle, not
near the end.  That way, if there are problem cases, we'll have a year
or so of developer test to shake them out.

2. There should be a compatibility GUC to restore the old behavior.
The new behavior should be the default, because if we're not confident
that the new behavior will be better for most people, we have no
business installing it in the first place (plus few people will try
it).  But just in case it turns out to suck for some people, we should
provide an escape hatch, at least for a few releases.

3. There should be some clear documentation in the comments indicating
why we believe that this is a whole lot better than what we do today.
Maybe this has been discussed adequately on the thread and maybe it
hasn't, but whoever goes to look at the committed code should not have
to go root through hackers threads to understand why we replaced the
existing estimator.  It should be right there in the code.  If,
hypothetically speaking, I were to commit this, and if, again strictly
hypothetically, another distinguished committer were to write back a
year or two later, "clearly Robert was an idiot to commit this because
it's no better than what we had before" then I want to be able to say
"clearly, you have not what got committed very carefully, because the
comment for function <blat> clearly explains that this new technology
is teh awesome".

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: adaptive ndistinct estimator v4

From: Heikki Linnakangas
On 04/30/2015 01:57 PM, Robert Haas wrote:
>
> 2. There should be a compatibility GUC to restore the old behavior.
> The new behavior should be the default, because if we're not confident
> that the new behavior will be better for most people, we have no
> business installing it in the first place (plus few people will try
> it).  But just in case it turns out to suck for some people, we should
> provide an escape hatch, at least for a few releases.

You can override the ndistinct estimate with ALTER TABLE. I think that's 
enough for an escape hatch.

- Heikki




Re: PATCH: adaptive ndistinct estimator v4

From: Robert Haas
On Thu, Apr 30, 2015 at 5:31 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 04/30/2015 01:57 PM, Robert Haas wrote:
>> 2. There should be a compatibility GUC to restore the old behavior.
>> The new behavior should be the default, because if we're not confident
>> that the new behavior will be better for most people, we have no
>> business installing it in the first place (plus few people will try
>> it).  But just in case it turns out to suck for some people, we should
>> provide an escape hatch, at least for a few releases.
>
> You can override the ndistinct estimate with ALTER TABLE. I think that's
> enough for an escape hatch.

I'm not saying that isn't nice to have, but I don't think it really
helps much here.  Setting the value manually requires that you know
what value to set, and you might not.  If, on some workloads, the old
algorithm beats the new one reliably, you want to be able to actually
go back to the old algorithm, not manually override every wrong
decision it makes.  A GUC for this is pretty cheap insurance.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: adaptive ndistinct estimator v4

From: Tomas Vondra
Hi,

On 04/30/15 22:57, Robert Haas wrote:
> On Tue, Mar 31, 2015 at 3:02 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> attached is v4 of the patch implementing adaptive ndistinct estimator.
>
> So, I took a look at this today. It's interesting work, but it looks
> more like a research project than something we can commit to 9.5. As
> far as I can see, this still computes the estimate the way we do
> today, but then also computes the estimate using this new method.
> The estimate computed the new way isn't stored anywhere, so this
> doesn't really change behavior, except for printing out a WARNING
> comparing the values produced by the two estimators.

I agree that this is not ready for 9.5 - it was meant as an experiment
(hence printing the estimate in a WARNING, to make it easier to compare
the value to the current estimator). Without that it'd be much more
complicated to compare the old/new estimates, but you're right this is
not suitable for commit.

So far it received only reviews from Jeff Janes (thanks!), but I think a 
change like this really requires a more thorough review, including the 
math part. I don't expect that to happen at the very end of the last CF 
before the freeze.

> IMHO, the comments in this patch are pretty inscrutable.  I believe
> this is because they presume more knowledge of what the patch is doing
> than I myself possess.  For example:
>
> + * The AEE estimator is based on solving this equality (for "m")
> + *
> + *     m - f1 - f2 = f1 * (A + A(m)) / (B + B(m))
> + *
> + * where A, B are effectively constants (not depending on m), and A(m)
> + * and B(m) are functions. This is equal to solving
> + *
> + *     0 = f1 * (A + A(m)) / (B + B(m)) - (m - f1 - f2)
>
> Perhaps I am just a dummy, but I have no idea what any of that means.
> I think that needs to be fixed so that someone who knows what
> n_distinct is but knows nothing about the details of these estimators
> can get an idea of how they are doing their thing without too much
> effort.  I think a lot of the comments share this problem.

Well, I don't think you're dummy, but this requires reading the paper 
describing the estimator. Explaining that fully would essentially mean 
copying a large portion of the paper in the comment, and I suppose 
that's not a good idea. The explanation might be perhaps a bit more 
detailed, though - not sure what's the right balance.

> Aside from the problems mentioned above, there's the broader question
> of how to evaluate the quality of the estimates produced by this
> estimator vs. what we're doing right now. I see that Jeff Janes has
> pointed out some cases that seem to regress with this patch; those
> presumably need some investigation, or at least some comment. And I
> think some testing from other people would be good too, before we
> commit to this.

Yeah, evaluating is difficult. I think that Jeff's approach - i.e. 
testing the estimator on real-world data sets - is the right approach 
here. Testing on synthetic data sets has its value too (if only to 
better understand the failure cases).

> Leaving that aside, at some point, you'll say, "OK, there may be some
> regressions left but overall I believe this is going to be a win in
> most cases". It's going to be really hard for anyone, no matter how
> smart, to figure out through code review whether that is true. So
> committing this figures to be extremely frightening. It's just not
> going to be reasonably possible to know what percentage of users are
> going to be more happy after this change and what percentage are
> going to be less happy.

For every pair of estimators you can find cases where one of them is 
better than the other one. It's pretty much impossible to find an 
estimator that beats all other estimators on all possible inputs.

There's no way to make this an improvement for everyone - it will 
produce worse estimates in some cases, and we have to admit that. If we 
think this is unacceptable, we're effectively stuck with the current 
estimator forever.

> Therefore, I think that:
>
> 1. This should be committed near the beginning of a release cycle,
> not near the end. That way, if there are problem cases, we'll have a
> year or so of developer test to shake them out.

+1

> 2. There should be a compatibility GUC to restore the old behavior.
> The new behavior should be the default, because if we're not
> confident that the new behavior will be better for most people, we
> have no business installing it in the first place (plus few people
> will try it). But just in case it turns out to suck for some people,
> we should provide an escape hatch, at least for a few releases.

I think a "compatibility GUC" is a damn poor solution, IMNSHO.

For example, GUCs are database-wide, but I do expect the estimator to 
perform worse only on a few data sets / columns. So making this a 
column-level settings would be more appropriate, I think.

But it might work during the development cycle, as it would make 
comparing the estimators possible (just run the tests with the GUC set 
differently). Assuming we'll re-evaluate it at the end, and remove the 
GUC if possible.

>
> 3. There should be some clear documentation in the comments indicating
> why we believe that this is a whole lot better than what we do today.
> Maybe this has been discussed adequately on the thread and maybe it
> hasn't, but whoever goes to look at the committed code should not have
> to go root through hackers threads to understand why we replaced the
> existing estimator.  It should be right there in the code.  If,
> hypothetically speaking, I were to commit this, and if, again strictly
> hypothetically, another distinguished committer were to write back a
> year or two later, "clearly Robert was an idiot to commit this because
> it's no better than what we had before" then I want to be able to say
> "clearly, you have not what got committed very carefully, because the
> comment for function <blat> clearly explains that this new technology
> is teh awesome".

I certainly can add such comment to the patch ;-) Choose a function.


--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: adaptive ndistinct estimator v4

From: Tomas Vondra

On 05/01/15 00:18, Robert Haas wrote:
> On Thu, Apr 30, 2015 at 5:31 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>
>> You can override the ndistinct estimate with ALTER TABLE. I think
>> that's enough for an escape hatch.
>
> I'm not saying that isn't nice to have, but I don't think it really
> helps much here. Setting the value manually requires that you know
> what value to set, and you might not. If, on some workloads, the old
> algorithm beats the new one reliably, you want to be able to
> actually go back to the old algorithm, not manually override every
> wrong decision it makes. A GUC for this is pretty cheap insurance.

IMHO this is exactly the same situation as with the current ndistinct 
estimator. If we find out we'd have to use this workaround more 
frequently than before, then clearly the new estimator is rubbish and 
should not be committed.

In other words, I agree with Heikki.


--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: adaptive ndistinct estimator v4

From: Robert Haas
On Thu, Apr 30, 2015 at 9:20 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> I agree that this is not ready for 9.5 - it was meant as an experiment
> (hence printing the estimate in a WARNING, to make it easier to compare
> the value to the current estimator). Without that it'd be much more
> complicated to compare the old/new estimates, but you're right this is
> not suitable for commit.
>
> So far it received only reviews from Jeff Janes (thanks!), but I think a
> change like this really requires a more thorough review, including the math
> part. I don't expect that to happen at the very end of the last CF before
> the freeze.

OK.

>> IMHO, the comments in this patch are pretty inscrutable.  I believe
>> this is because they presume more knowledge of what the patch is doing
>> than I myself possess.  For example:
>>
>> + * The AEE estimator is based on solving this equality (for "m")
>> + *
>> + *     m - f1 - f2 = f1 * (A + A(m)) / (B + B(m))
>> + *
>> + * where A, B are effectively constants (not depending on m), and A(m)
>> + * and B(m) are functions. This is equal to solving
>> + *
>> + *     0 = f1 * (A + A(m)) / (B + B(m)) - (m - f1 - f2)
>>
>> Perhaps I am just a dummy, but I have no idea what any of that means.
>> I think that needs to be fixed so that someone who knows what
>> n_distinct is but knows nothing about the details of these estimators
>> can get an idea of how they are doing their thing without too much
>> effort.  I think a lot of the comments share this problem.
>
> Well, I don't think you're dummy, but this requires reading the paper
> describing the estimator. Explaining that fully would essentially mean
> copying a large portion of the paper in the comment, and I suppose that's
> not a good idea. The explanation might be perhaps a bit more detailed,
> though - not sure what's the right balance.

Well, I think the problem in this case is that the comment describes
what the values are mathematically without explaining what they are
conceptually.  For example, in s = 1/2 a t^2 + v_0 t + s_0, we could say
that a is meant to be the rate of change of an unseen variable v, while
v_0 is the initial value of v, and that s_0 is meant to be the starting
value of s, changing at a rate described by v.  That's basically the
kind of explanation you have right now.  It's all correct, but what
does it really mean?  It's more helpful to say that we're trying to
project the position of a body at a given time (t) given its initial
position (s_0), its initial velocity (v), and its rate of acceleration
(a).

>> 3. There should be some clear documentation in the comments indicating
>> why we believe that this is a whole lot better than what we do today.
>> Maybe this has been discussed adequately on the thread and maybe it
>> hasn't, but whoever goes to look at the committed code should not have
>> to go root through hackers threads to understand why we replaced the
>> existing estimator.  It should be right there in the code.  If,
>> hypothetically speaking, I were to commit this, and if, again strictly
>> hypothetically, another distinguished committer were to write back a
>> year or two later, "clearly Robert was an idiot to commit this because
>> it's no better than what we had before" then I want to be able to say
>> "clearly, you have not what got committed very carefully, because the
>> comment for function <blat> clearly explains that this new technology
>> is teh awesome".
>
> I certainly can add such comment to the patch ;-) Choose a function.

Well, at least the way things are organized right now,
adaptive_estimator seems like the place.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: adaptive ndistinct estimator v4

From: Jeff Janes
On Thu, Apr 30, 2015 at 6:20 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
Hi,

On 04/30/15 22:57, Robert Haas wrote:
On Tue, Mar 31, 2015 at 3:02 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
attached is v4 of the patch implementing adaptive ndistinct estimator.

So, I took a look at this today. It's interesting work, but it looks
more like a research project than something we can commit to 9.5. As
far as I can see, this still computes the estimate the way we do
today, but then also computes the estimate using this new method.
The estimate computed the new way isn't stored anywhere, so this
doesn't really change behavior, except for printing out a WARNING
comparing the values produced by the two estimators.

I agree that this is not ready for 9.5 - it was meant as an experiment
(hence printing the estimate in a WARNING, to make it easier to compare
the value to the current estimator). Without that it'd be much more
complicated to compare the old/new estimates, but you're right this is
not suitable for commit.

With the warning it is very hard to correlate the discrepancy you do see with which column is causing it, as the warnings don't include table or column names (Assuming of course that you run it on a substantial database--if you just run it on a few toy cases then the warning works well).  

If we want to have an explicitly experimental patch which we want people with interesting real-world databases to report back on, what kind of patch would it have to be to encourage that to happen?  Or are we never going to get such feedback no matter how friendly we make it?  Another problem is that you really need to have the gold standard to compare them to, and getting that is expensive (which is why we resort to sampling in the first place).  I don't think there is much to be done on that front other than bite the bullet and just do it--perhaps only for the tables which have discrepancies.

Some of the regressions I've seen are at least partly a bug:

+   /* find the 'm' value minimizing the difference */
+   for (m = 1; m <= total_rows; m += step)
+   {
+       double q = k / (sample_rows * m);

sample_rows and m are both integers, and their product overflows vigorously.  A simple cast to double before the multiplication fixes the first example I produced.  The estimate goes from 137,177 to 1,108,076.  The reality is 1,062,223.

Perhaps m should just be declared a double, as it is frequently used in double arithmetic.

 
Leaving that aside, at some point, you'll say, "OK, there may be some
regressions left but overall I believe this is going to be a win in
most cases". It's going to be really hard for anyone, no matter how
smart, to figure out through code review whether that is true. So
committing this figures to be extremely frightening. It's just not
going to be reasonably possible to know what percentage of users are
going to be more happy after this change and what percentage are
going to be less happy.

For every pair of estimators you can find cases where one of them is better than the other one. It's pretty much impossible to find an estimator that beats all other estimators on all possible inputs.

There's no way to make this an improvement for everyone - it will produce worse estimates in some cases, and we have to admit that. If we think this is unacceptable, we're effectively stuck with the current estimator forever.

Therefore, I think that:

1. This should be committed near the beginning of a release cycle,
not near the end. That way, if there are problem cases, we'll have a
year or so of developer test to shake them out.

It can't hurt, but how effective will it be?  Will developers know or care whether ndistinct happened to get better or worse while they are working on other things?  I would think that problems will be found by focused testing, or during beta, and probably not by accidental discovery during the development cycle.  It can't hurt, but I don't know how much it will help.

 
2. There should be a compatibility GUC to restore the old behavior.
The new behavior should be the default, because if we're not
confident that the new behavior will be better for most people, we
have no business installing it in the first place (plus few people
will try it). But just in case it turns out to suck for some people,
we should provide an escape hatch, at least for a few releases.

I think a "compatibility GUC" is a damn poor solution, IMNSHO.

For example, GUCs are database-wide, but I do expect the estimator to perform worse only on a few data sets / columns. So making this a column-level settings would be more appropriate, I think.

But it might work during the development cycle, as it would make comparing the estimators possible (just run the tests with the GUC set differently). Assuming we'll re-evaluate it at the end, and remove the GUC if possible.

I agree with the "experimental GUC".  That way if hackers do happen to see something suspicious, they can just turn it off and see what difference it makes.  If they have to reverse out a patch from 6 months ago in an area of the code they aren't particularly interested in and then recompile their code and then juggle two different sets of binaries, they will likely just shrug it off without investigation.

Cheers,

Jeff

Re: PATCH: adaptive ndistinct estimator v4

From: Robert Haas
On Wed, May 13, 2015 at 5:07 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> With the warning it is very hard to correlate the discrepancy you do see
> with which column is causing it, as the warnings don't include table or
> column names (Assuming of course that you run it on a substantial
> database--if you just run it on a few toy cases then the warning works
> well).

Presumably the warning is going to go away before we actually commit this thing.

> If we want to have an explicitly experimental patch which we want people
> with interesting real-world databases to report back on, what kind of patch
> would it have to be to encourage that to happen?  Or are we never going to
> get such feedback no matter how friendly we make it?  Another problem is
> that you really need to have the gold standard to compare them to, and
> getting that is expensive (which is why we resort to sampling in the first
> place).  I don't think there is much to be done on that front other than
> bite the bullet and just do it--perhaps only for the tables which have
> discrepancies.

If we stick with the idea of a GUC to control the behavior, then
somebody can run ANALYZE, save the ndistinct estimates, run ANALYZE
again, and compare.  They can also run SQL queries against the tables
themselves to check the real value.  We could even provide a script
for all of that.  I think that would be quite handy.
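
Something along those lines, perhaps (a rough sketch in Python/psycopg2;
the GUC name is made up, since nothing of the sort exists yet, and the
identifiers are interpolated naively - fine for a throwaway dev script):

    import psycopg2

    GUC = "enable_adaptive_ndistinct"     # hypothetical name

    def ndistinct_estimates(cur, table):
        # n_distinct is stored as a negative fraction of the row count
        # when the estimate is expected to scale with the table size
        cur.execute("SELECT attname, n_distinct FROM pg_stats"
                    " WHERE tablename = %s", (table,))
        return dict(cur.fetchall())

    def compare(dsn, table):
        conn = psycopg2.connect(dsn)
        conn.autocommit = True            # so the ANALYZE results stick
        cur = conn.cursor()
        estimates = {}
        for setting in ("off", "on"):
            cur.execute("SET %s = %s" % (GUC, setting))
            cur.execute("ANALYZE " + table)
            estimates[setting] = ndistinct_estimates(cur, table)
        for att, new in estimates["on"].items():
            # the expensive part - the real number of distinct values
            cur.execute("SELECT count(DISTINCT %s) FROM %s" % (att, table))
            truth = cur.fetchone()[0]
            print(att, estimates["off"].get(att), new, truth)
        conn.close()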

> It can't hurt, but how effective will it be?  Will developers know or care
> whether ndistinct happened to get better or worse while they are working on
> other things?  I would think that problems will be found by focused testing,
> or during beta, and probably not by accidental discovery during the
> development cycle.  It can't hurt, but I don't know how much it will help.

Once we enter beta (or even feature freeze), it's too late to whack
around the algorithm heavily.  We're pretty much committed to
releasing and supporting whatever we have got at that point.  I guess
we could revert it if it doesn't work out, but that's about the only
option at that point.  We have more flexibility during the main part
of the development cycle.  But your point is certainly valid and I
don't mean to dispute it.

> I agree with the "experimental GUC".  That way if hackers do happen to see
> something suspicious, they can just turn it off and see what difference it
> makes.  If they have to reverse out a patch from 6 months ago in an area of
> the code they aren't particularly interested in and then recompile their
> code and then juggle two different sets of binaries, they will likely just
> shrug it off without investigation.

Yep.  Users, too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: adaptive ndistinct estimator v4

From: Josh Berkus
On 05/15/2015 11:30 AM, Robert Haas wrote:
> Once we enter beta (or even feature freeze), it's too late to whack
> around the algorithm heavily.  We're pretty much committed to
> releasing and supporting whatever we have got at that point.  I guess
> we could revert it if it doesn't work out, but that's about the only
> option at that point.  We have more flexibility during the main part
> of the development cycle.  But your point is certainly valid and I
> don't mean to dispute it.

I will finally have a customer workload available to test this on this
weekend.  That's been rather delayed by the availability of customer
hardware, because I'm not allowed to copy out the database.  However,
this is a database which suffers from multiple ndistinct estimation
issues in production, so I should be able to get a set of stats back by
Monday which would show how much of a general improvement it is.

I realize that's after the deadline, but there wasn't much I could do
about it.  I've tried to simulate the kind of estimation issues I've
seen, but they don't simulate well.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: PATCH: adaptive ndistinct estimator v4

From: Robert Haas
On Fri, May 15, 2015 at 3:35 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 05/15/2015 11:30 AM, Robert Haas wrote:
>> Once we enter beta (or even feature freeze), it's too late to whack
>> around the algorithm heavily.  We're pretty much committed to
>> releasing and supporting whatever we have got at that point.  I guess
>> we could revert it if it doesn't work out, but that's about the only
>> option at that point.  We have more flexibility during the main part
>> of the development cycle.  But your point is certainly valid and I
>> don't mean to dispute it.
>
> I will finally have a customer workload available to test this on this
> weekend.  That's been rather delayed by the availability of customer
> hardware,because I'm not allowed to copy out the database.  However,
> this is a database which suffers from multiple ndistinct estimation
> issues in production, so I should be able to get a set of stats back by
> Monday which would show how much of a general improvement it is.
>
> I realize that's after the deadline, but there wasn't much I could do
> about it.  I've tried to simulate the kind of estimation issues I've
> seen, but they don't simulate well.

This is clearly 9.6 material at this point, and has been for a while.
The patch - at least the last version I looked at - didn't store
anything different in pg_statistic.  It just logged what it would have
stored.  So testing is good, but there's not a question of pushing
this into 9.5.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: adaptive ndistinct estimator v4

From: Josh Berkus
On 05/15/2015 12:58 PM, Robert Haas wrote:
> On Fri, May 15, 2015 at 3:35 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> On 05/15/2015 11:30 AM, Robert Haas wrote:
>>> Once we enter beta (or even feature freeze), it's too late to whack
>>> around the algorithm heavily.  We're pretty much committed to
>>> releasing and supporting whatever we have got at that point.  I guess
>>> we could revert it if it doesn't work out, but that's about the only
>>> option at that point.  We have more flexibility during the main part
>>> of the development cycle.  But your point is certainly valid and I
>>> don't mean to dispute it.
>>
>> I will finally have a customer workload available to test this on this
>> weekend.  That's been rather delayed by the availability of customer
>> hardware,because I'm not allowed to copy out the database.  However,
>> this is a database which suffers from multiple ndistinct estimation
>> issues in production, so I should be able to get a set of stats back by
>> Monday which would show how much of a general improvement it is.
>>
>> I realize that's after the deadline, but there wasn't much I could do
>> about it.  I've tried to simulate the kind of estimation issues I've
>> seen, but they don't simulate well.
> 
> This is clearly 9.6 material at this point, and has been for a while.
> The patch - at least the last version I looked at - didn't store
> anything different in pg_statistic.  It just logged what it would have
> stored.  So testing is good, but there's not a question of pushing
> this into 9.5.

I'm personally OK with that.  The last thing we want to do is make query
costing changes *in haste*.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: PATCH: adaptive ndistinct estimator v4

From: Tomas Vondra
Hi,

On 05/13/15 23:07, Jeff Janes wrote:
> With the warning it is very hard to correlate the discrepancy you do
> see with which column is causing it, as the warnings don't include
> table or column names (Assuming of course that you run it on a
> substantial database--if you just run it on a few toy cases then the
> warning works well).

That's true. I've added attnum/attname to the warning in the attached
version of the patch.

> If we want to have an explicitly experimental patch which we want
> people with interesting real-world databases to report back on, what
> kind of patch would it have to be to encourage that to happen? Or are
> we never going to get such feedback no matter how friendly we make
> it? Another problem is that you really need to have the gold standard
> to compare them to, and getting that is expensive (which is why we
> resort to sampling in the first place). I don't think there is much
> to be done on that front other than bite the bullet and just do
> it--perhaps only for the tables which have discrepancies.

Not sure. The "experimental" part of the patch was not really aimed at
the users outside the development community - it was meant to be used by
members of the community, possibly testing it on customer databases I
don't think adding the GUC into the final release is a good idea, it's
just a noise in the config no-one would actually use.

> Some of the regressions I've seen are at least partly a bug:
>
> +   /* find the 'm' value minimizing the difference */
> +   for (m = 1; m <= total_rows; m += step)
> +   {
> +       double q = k / (sample_rows * m);
>
> sample_rows and m are both integers, and their product overflows
> vigorously. A simple cast to double before the multiplication fixes
> the first example I produced. The estimate goes from 137,177 to
> 1,108,076. The reality is 1,062,223.
>
> Perhaps m should be just be declared a double, as it is frequently
> used in double arithmetic.

Yeah, I just discovered this bug independently. There's another bug:
adaptive_estimator takes total_rows as an integer, so it breaks for
tables with more than INT_MAX rows. Both are fixed in v5.

>
>         Therefore, I think that:
>
>         1. This should be committed near the beginning of a release cycle,
>         not near the end. That way, if there are problem cases, we'll have a
>         year or so of developer test to shake them out.
>
>
> It can't hurt, but how effective will it be? Will developers know or
> care whether ndistinct happened to get better or worse while they
> are working on other things? I would think that problems will be
> found by focused testing, or during beta, and probably not by
> accidental discovery during the development cycle. It can't hurt, but
> I don't know how much it will help.

I agree with that - it's unlikely the regressions will get discovered
randomly. OTOH I'd expect a non-trivial number of people on this list to
have a few examples of ndistinct failures, and testing those would be
more useful I guess. But that's unlikely to find the cases that worked
OK before and got broken by the new estimator :-(

> I agree with the "experimental GUC".  That way if hackers do happen to
> see something suspicious, they can just turn it off and see what
> difference it makes.  If they have to reverse out a patch from 6 months
> ago in an area of the code they aren't particularly interested in and
> then recompile their code and then juggle two different sets of
> binaries, they will likely just shrug it off without investigation.

+1


--
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment