From:        Tomas Vondra <tv@fuzzy.cz>
Subject:     Re: Yet another abort-early plan disaster on 9.3
Msg-id:      54381D20.9010501@fuzzy.cz
In response to: Re: Yet another abort-early plan disaster on 9.3 (Tomas Vondra <tv@fuzzy.cz>)
List:        pgsql-performance
On 10.10.2014 14:10, Tomas Vondra wrote:
> On 10 October 2014 at 13:16, Greg Stark wrote:
>> On Thu, Oct 2, 2014 at 8:56 PM, Josh Berkus <josh@agliodbs.com> wrote:
>>> Yes, it's only intractable if you're wedded to the idea of a tiny,
>>> fixed-size sample.  If we're allowed to sample, say, 1% of the table, we
>>> can get a MUCH more accurate n_distinct estimate using multiple
>>> algorithms, of which HLL is one.  While n_distinct will still have some
>>> variance, it'll be over a much smaller range.
>>
>> I've gone looking for papers on this topic but from what I read this
>> isn't so. To get any noticeable improvement you need to read 10-50% of
>> the table and that's effectively the same as reading the entire table
>> -- and it still had pretty poor results. All the research I could find
>> went into how to analyze the whole table while using a reasonable
>> amount of scratch space and how to do it incrementally.
>
> I think it's really difficult to discuss the estimation without some basic
> agreement on what the goals are. Naturally, we can't get a perfect
> estimator with small samples (especially when the sample size is fixed and
> not scaling with the table). But maybe we can improve the estimates
> without scanning most of the table?
>
> FWIW I've been playing with the adaptive estimator described in [1] and
> the results look really interesting, IMHO. So far I have been testing it
> on synthetic datasets outside the database, but I plan to use it instead
> of our estimator, and do some more tests.

Attached is an experimental patch implementing the adaptive estimator.

It was fairly simple to write (although it's a bit messy). It only computes
the estimate for the "scalar" case (i.e. data types that we can sort).
Implementing this for the "minimal" case is possible, but requires a bit
more work.

The patch only computes the new estimate and prints a WARNING with both the
current and the new value; the old estimate is still the one that gets
stored.
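
To see what actually gets stored (as opposed to the value printed in the
WARNING), something like this works - the table/column names here are just
my placeholders, not anything from the patch:

   ANALYZE test;

   -- estimate stored by ANALYZE (positive = absolute count,
   -- negative = negated fraction of the row count)
   SELECT n_distinct
     FROM pg_stats
    WHERE tablename = 'test' AND attname = 'val';

   -- exact number of distinct values, for comparison (full scan)
   SELECT count(DISTINCT val) FROM test;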

I also attach a few examples of synthetic datasets, skewed in various ways,
that I used for testing. In all cases there's a single table with 10M rows
and a single INT column. There are two kinds of skew (a complete generation
sketch follows the list):

1) smooth skew

   - N distinct values (100, 10,000 and 100,000 values)
   - average moves to 0 as 'k' increases ('k' between 1 and 9)
   - smooth distribution of frequencies

   INSERT INTO test
        SELECT pow(random(),k) * 10000 FROM generate_series(1,10000000);

2) step skew

   - a few very frequent values, many rare values
   - for example this generates 5 very frequent and ~10k rare values

   INSERT INTO test
        SELECT (CASE WHEN (v < 90000) THEN MOD(v,5) ELSE v END)
          FROM (
             SELECT (random()*100000)::int AS v
               FROM generate_series(1,10000000)
          ) foo;

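For completeness, a self-contained sketch of how these datasets can be
generated (the column name is arbitrary - the INSERTs above don't use a
column list, so any single-column INT table works); this builds the k = 3
smooth-skew case and the step-skew case with 5 frequent values:

   CREATE TABLE test (val INT);

   -- smooth skew, k = 3 (roughly 10k distinct values, skewed towards 0)
   INSERT INTO test
        SELECT pow(random(), 3) * 10000 FROM generate_series(1,10000000);

   -- step skew: 5 very frequent values plus ~10k rare ones
   TRUNCATE test;
   INSERT INTO test
        SELECT (CASE WHEN (v < 90000) THEN MOD(v,5) ELSE v END)
          FROM (
             SELECT (random()*100000)::int AS v
               FROM generate_series(1,10000000)
          ) foo;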

Results
=======

I tested this with various statistics target settings (10, 100, 1000),
which translates to different sample sizes.
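
For reference, ANALYZE samples 300 * statistics_target rows, which is where
the 3k / 30k / 300k figures below come from. The target can be changed per
column or globally (again, 'val' is just my placeholder column name):

   -- per-column statistics target; the next ANALYZE samples 300 * 10 rows
   ALTER TABLE test ALTER COLUMN val SET STATISTICS 10;
   ANALYZE test;

   -- or globally, for the current session
   -- (a per-column setting overrides the global default)
   SET default_statistics_target = 1000;    -- ~300k sampled rows
   ANALYZE test;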

statistics target 100 (default, 30k rows, 0.3% sample)
======================================================

a) smooth skew, 101 values, different skew ('k')

   k   current    adaptive
   -------------------------
   1   101        102
   3   101        102
   5   101        102
   7   101        102
   9   101        102

b) smooth skew, 10,001 values, different skew ('k')

   k   current    adaptive
   -------------------------
   1   9986       10542
   3   8902       10883
   5   7579       10824
   7   6639       10188
   9   5947       10013

c) step skew (different numbers of values)

   values   current    adaptive
   ------------------------------
   106      106        107
   106      35         104
   1006     259        1262
   10006    2823       11047


statistics target 10 (3k rows, 0.03% sample)
============================================

a) smooth skew, 101 values, different skew ('k')

   k   current    adaptive
   -------------------------
   1   101        102
   3   101        102
   5   101        102
   7   101        102
   9   101        102

b) smooth skew, 10,001 values, different skew ('k')

   k   current    adaptive
   -------------------------
   1   9846       10014
   3   4399       7190
   5   2532       5477
   7   1938       4932
   9   1623       1623

c) step skew (different numbers of values)

   values   current    adaptive
   ------------------------------
   106      100        114
   106      5          5
   1006     37         532
   10006    323        20970

statistics target 1000 (300k rows, 3% sample)
=============================================

a) smooth skew, 101 values, different skew ('k')

   k   current    adaptive
   -------------------------
   1   101        102
   3   101        102
   5   101        102
   7   101        102
   9   101        102

b) smooth skew, 10,001 values, different skew ('k')

   k   current    adaptive
   -------------------------
   1   10001      10002
   3   10000      10000
   5   9998       10011
   7   9973       10045
   9   9939       10114

c) step skew (different numbers of values)

   values   current    adaptive
   ------------------------------
   106      106        107
   106      101        107
   1006     957        1096
   10006    9551       10550


Summary
=======

I've yet to see an example where the adaptive estimator produces worse
results than the current estimator, regardless of the distribution
and sample size / statistics target.

What really matters is the sample size relative to the table size, so
I'll refer to the 0.03%, 0.3% and 3% samples rather than the statistics
targets.

For the large sample (3%) both estimators produce reasonably accurate
results. However, this won't hold as the tables grow, because the number
of rows we sample is static (it does not grow with the table) - the same
statistics target that samples 3% of a 10M-row table samples only 0.03%
of a 1B-row table.

As the sample size decreases, the adaptive estimator starts winning. For
the 0.3% sample the difference is easily 3x for the high-skew cases. E.g.
for one of the "step skew" distributions the actual ndistinct value is
10006; the current estimator gives 2823 while the adaptive one gives
11047. That's a ratio error of ~3.5x vs. ~1.1x.

For the tiny 0.03% sample the difference gets even more significant,
especially for the step-skew cases, where the improvement is often an
order of magnitude. E.g. for the 10006-value step-skew case the current
estimator gives 323 (off by ~30x) while the adaptive one gives 20970
(off by ~2x).


Proposal
========

I think the adaptive estimator works very well, and I plan to submit it
to the next commitfest after a bit of polishing. Examples of
distributions that break it are welcome of course.

Also, I think it's clear it'd be useful to be able to scale the sample
proportionally to the table (instead of only allowing the current
statistics target approach). I understand this may result in scanning a
large part of the table, but I don't see a problem with that as long as
it's not the default behavior (and is clearly documented). I don't see a
way around it - small samples simply result in poor estimates.

I was thinking that this could be done using the statistics target, but
it's already used for other things (e.g. the size of the MCV list and the
histogram) and we don't want to break those by overloading it with yet
another function.

Ideas?

regards
Tomas


