Thread: ANALYZE sampling is too good

ANALYZE sampling is too good

From
Greg Stark
Date:
At multiple conferences I've heard about people trying all sorts of
gymnastics to avoid ANALYZE, which they expect to take too long and
consume too much I/O. This is especially a big complaint after upgrades,
when their new database performs poorly until the new statistics are
in: they did pg_upgrade to avoid an extended downtime and now complain
about ANALYZE taking hours.

I always gave the party line that ANALYZE only takes a small
constant-sized sample so even very large tables should be very quick.
But after hearing the same story again in Heroku I looked into it a
bit further. I was kind of shocked by the numbers.

ANALYZE takes a sample of 300 * statistics_target rows. That sounds
pretty reasonable but with default_statistics_target set to 100 that's
30,000 rows. If I'm reading the code right, it takes this sample by
sampling 30,000 blocks and then (if the table is large enough) taking
an average of one row per block. Each block is 8192 bytes so that
means it's reading 240MB of each table. That's a lot more than I
realized.

It means if your table is anywhere up to 240MB you're effectively
doing a full table scan and then throwing out nearly all the data
read.
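
For concreteness, here's that arithmetic as a tiny standalone C sketch
(nothing Postgres-specific, it just multiplies out the constants described
above):

/*
 * How much data does ANALYZE touch if it samples 300 * statistics_target
 * rows and, in the worst case, each sampled row lives on its own 8k block?
 */
#include <stdio.h>

int main(void)
{
    const double block_size = 8192.0;           /* bytes per heap block */
    const int targets[] = {10, 100, 400, 1000};

    for (int i = 0; i < 4; i++)
    {
        long sample_rows = 300L * targets[i];   /* also blocks, worst case */
        double mb_read = sample_rows * block_size / 1e6;
        printf("statistics_target %4d -> %6ld rows, ~%.0f MB read\n",
               targets[i], sample_rows, mb_read);
    }
    return 0;
}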

Worse, my experience with the posix_fadvise benchmarking is that on
spinning media reading one out of every 16 blocks takes about the same
time as reading them all. Presumably this is because the seek time
between tracks dominates and reading one out of every 16 blocks is
still reading every track. So in fact if your table is up to about
3-4G ANALYZE is still effectively going to do a full table scan, at
least as far as I/O time goes.
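
A rough model of why that happens (the track size below is an assumption,
roughly 1MB or 128 blocks per track; real disk geometry varies):

/*
 * If we sample a fraction f of all blocks uniformly at random, the chance
 * that a given track is skipped entirely is about (1 - f)^blocks_per_track,
 * so the expected fraction of tracks we still have to seek to is:
 */
#include <stdio.h>
#include <math.h>

int main(void)
{
    const double blocks_per_track = 128;        /* assumed ~1MB tracks */
    const double fractions[] = {1.0/4, 1.0/16, 1.0/64, 1.0/256};

    for (int i = 0; i < 4; i++)
    {
        double f = fractions[i];
        double tracks_touched = 1.0 - pow(1.0 - f, blocks_per_track);
        printf("sampling 1 block in %3.0f -> ~%.1f%% of tracks touched\n",
               1.0 / f, 100.0 * tracks_touched);
    }
    return 0;
}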

The current algorithm seems like it was designed with a 100G+ table in
mind but the consequences on the more common 100M-100G tables weren't
really considered. Consider what this means for partitioned tables: if
someone partitions their terabyte table into 10 partitions, ANALYZE will
suddenly want to use 10x as much I/O, which seems like a perverse
consequence.

I'm not sure I have a prescription but my general feeling is that
we're spending an awful lot of resources going after a statistically
valid sample when we can spend a lot less resources and get something
90% as good. Or, if we're really going to read that much data, we
might as well use more of the rows we find.

-- 
greg



Re: ANALYZE sampling is too good

From
Josh Berkus
Date:
On 12/03/2013 03:30 PM, Greg Stark wrote:
> It means if your table is anywhere up to 240MB you're effectively
> doing a full table scan and then throwing out nearly all the data
> read.

There are lots of issues with our random sampling approach for ANALYZE.
This is why, back in our Greenplum days, Simon proposed changing to a
block-based sampling approach, where we would sample random *pages*
instead of random *rows*.  That would allow us to do things like sample
5% of the table, but only read 5% of the table, although we might have
to play some with OS-FS operations to make sure of that.  In addition to
solving the issue you cite above, it would let us get MUCH more accurate
estimates for very large tables, where currently we sample only about
0.1% of the table.

There are fairly well researched algorithms for block-based sampling
which estimate for the skew introduced by looking at consecutive rows in
a block.  In general, a minimum sample size of 5% is required, and the
error is no worse than our current system.  However, the idea was shot
down at the time, partly because I think other hackers didn't get the math.

I believe that both Oracle and MSSQL use block-based sampling, but of
course, I don't know which specific algo they use.
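
For what it's worth, the mechanical half of the idea is simple; here's a
minimal sketch of picking a fixed percentage of pages without replacement
(Knuth's selection-sampling technique), on the understanding that the
statistical corrections for within-page correlation are the hard part:

/*
 * Sketch only: choose roughly sample_pct percent of a table's blocks,
 * visiting block numbers in order, and hand each chosen block to the
 * caller, which would read the page and keep *every* row on it.
 */
#include <stdio.h>
#include <stdlib.h>

static void sample_blocks(long nblocks, double sample_pct,
                          void (*process_block)(long blkno))
{
    long want = (long) (nblocks * sample_pct / 100.0);
    long remaining = nblocks;

    for (long blkno = 0; blkno < nblocks && want > 0; blkno++, remaining--)
    {
        /* select this block with probability want/remaining */
        if ((double) rand() / ((double) RAND_MAX + 1.0) * remaining < want)
        {
            process_block(blkno);
            want--;
        }
    }
}

static void show(long blkno)
{
    printf("would read block %ld and keep all its rows\n", blkno);
}

int main(void)
{
    srand(42);
    sample_blocks(200, 5.0, show);    /* 5% of a 200-block table */
    return 0;
}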

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: ANALYZE sampling is too good

From
Claudio Freire
Date:
On Tue, Dec 3, 2013 at 8:30 PM, Greg Stark <stark@mit.edu> wrote:
> Worse, my experience with the posix_fadvise benchmarking is that on
> spinning media reading one out of every 16 blocks takes about the same
> time as reading them all. Presumably this is because the seek time
> between tracks dominates and reading one out of every 16 blocks is
> still reading every track. So in fact if your table is up to about
> 3-4G ANALYZE is still effectively going to do a full table scan, at
> least as far as I/O time goes.


Actually, it's rotational latency that's the dominant cost there.



Re: ANALYZE sampling is too good

From
Peter Geoghegan
Date:
On Thu, Dec 5, 2013 at 3:50 PM, Josh Berkus <josh@agliodbs.com> wrote:
> There are fairly well researched algorithms for block-based sampling
> which estimate for the skew introduced by looking at consecutive rows in
> a block.  In general, a minimum sample size of 5% is required, and the
> error is no worse than our current system.  However, the idea was shot
> down at the time, partly because I think other hackers didn't get the math.

I think that this certainly warrants revisiting. The benefits would be
considerable.

Has anyone ever thought about opportunistic ANALYZE piggy-backing on
other full-table scans? That doesn't really help Greg, because his
complaint is mostly that a fresh ANALYZE is too expensive, but it
could be an interesting, albeit risky approach.
Opportunistically/unpredictably acquiring a ShareUpdateExclusiveLock
would be kind of weird, for one thing, but if a full table scan really
is very expensive, would it be so unreasonable to attempt to amortize
that cost?

-- 
Peter Geoghegan



Re: ANALYZE sampling is too good

From
Amit Kapila
Date:
On Fri, Dec 6, 2013 at 7:22 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, Dec 5, 2013 at 3:50 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> There are fairly well researched algorithms for block-based sampling
>> which estimate for the skew introduced by looking at consecutive rows in
>> a block.  In general, a minimum sample size of 5% is required, and the
>> error is no worse than our current system.  However, the idea was shot
>> down at the time, partly because I think other hackers didn't get the math.
>
> I think that this certainly warrants revisiting. The benefits would be
> considerable.
>
> Has anyone ever thought about opportunistic ANALYZE piggy-backing on
> other full-table scans? That doesn't really help Greg, because his
> complaint is mostly that a fresh ANALYZE is too expensive, but it
> could be an interesting, albeit risky approach.

Is only a fresh ANALYZE costly, or are subsequent ones equally costly?

Doing it in some background operation might not be a bad idea, but doing it
during backend query execution (seq scan) might add overhead to query response
time, especially if part or most of the table's data is in RAM: the overhead
of the actual reads might not be very high, but the computation for analyze
(like sorting) will make it costly.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: ANALYZE sampling is too good

From
Andres Freund
Date:
On 2013-12-05 17:52:34 -0800, Peter Geoghegan wrote:
> Has anyone ever thought about opportunistic ANALYZE piggy-backing on
> other full-table scans? That doesn't really help Greg, because his
> complaint is mostly that a fresh ANALYZE is too expensive, but it
> could be an interesting, albeit risky approach.

What I've been thinking of is

a) making it piggy back on scans vacuum is doing instead of doing
separate ones all the time (if possible, analyze needs to be more
frequent). Currently with quite some likelihood the cache will be gone
again when revisiting.

b) make analyze incremental. In lots of bigger tables most of the table
is static - and we actually *do* know that, thanks to the vm. So keep a
rawer form of what ends up in the catalogs around somewhere, chunked by the
region of the table the statistic is from. Every time a part of the table
changes, re-sample only that part. Then recompute the aggregate.
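
To illustrate (b): the bookkeeping could look something like the sketch
below (hypothetical structures, nothing analyze.c keeps today). Merging
row counts, null fractions and min/max across chunks is the easy part
shown here; merging per-chunk n_distinct, MCVs and histograms is where
the real design work would be.

/* Per-region ("chunk") summary that gets rebuilt only when the VM says
 * pages in that region have changed, then folded into table-level stats. */
#include <stdio.h>

typedef struct ChunkStats
{
    long    live_rows;
    long    null_count;
    double  min_val;
    double  max_val;
    int     stale;          /* set when pages in this chunk changed */
} ChunkStats;

static void merge_chunks(const ChunkStats *chunks, int nchunks, ChunkStats *out)
{
    out->live_rows = 0;
    out->null_count = 0;
    out->min_val = chunks[0].min_val;
    out->max_val = chunks[0].max_val;

    for (int i = 0; i < nchunks; i++)
    {
        out->live_rows += chunks[i].live_rows;
        out->null_count += chunks[i].null_count;
        if (chunks[i].min_val < out->min_val) out->min_val = chunks[i].min_val;
        if (chunks[i].max_val > out->max_val) out->max_val = chunks[i].max_val;
    }
}

int main(void)
{
    ChunkStats chunks[3] = {
        {1000, 10,  1.0, 50.0, 0},
        {2000, 15, 20.0, 99.0, 1},   /* only this chunk would be re-sampled */
        {1500,  5,  3.0, 80.0, 0},
    };
    ChunkStats total;

    /* ... re-sample the stale chunk(s), then: */
    merge_chunks(chunks, 3, &total);
    printf("rows=%ld nullfrac=%.3f min=%.1f max=%.1f\n",
           total.live_rows, (double) total.null_count / total.live_rows,
           total.min_val, total.max_val);
    return 0;
}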

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: ANALYZE sampling is too good

From
Greg Stark
Date:
It looks like this is a fairly well understood problem, because in the
real world it's often cheaper to speak to people in a small
geographic area or time interval too. These wikipedia pages sound
interesting and have some external references:

http://en.wikipedia.org/wiki/Cluster_sampling
http://en.wikipedia.org/wiki/Multistage_sampling

I suspect the hard part will be characterising the nature of the
non-uniformity in the sample generated by taking a whole block. Some
of it may come from how the rows were loaded (e.g. older rows were
loaded by pg_restore but newer rows were inserted retail) or from the
way Postgres works (e.g. hotter rows are on blocks with fewer rows in
them and colder rows are more densely packed).

I've felt for a long time that Postgres would make an excellent test
bed for some aspiring statistics research group.


-- 
greg



Re: ANALYZE sampling is too good

From
Fabien COELHO
Date:
> http://en.wikipedia.org/wiki/Cluster_sampling
> http://en.wikipedia.org/wiki/Multistage_sampling
>
> I suspect the hard part will be characterising the nature of the
> non-uniformity in the sample generated by taking a whole block. Some
> of it may come from how the rows were loaded (e.g. older rows were
> loaded by pg_restore but newer rows were inserted retail) or from the
> way Postgres works (e.g. hotter rows are on blocks with fewer rows in
> them and colder rows are more densely packed).

I would have thought that as VACUUM reclaims space it levels that issue in 
the long run and on average, so that it could be simply ignored?

> I've felt for a long time that Postgres would make an excellent test
> bed for some aspiring statistics research group.

I would say "applied statistics" rather than "research". Nevertheless I 
can ask my research statistician colleagues next door about their opinion 
on this sampling question.

-- 
Fabien.



Re: ANALYZE sampling is too good

From
Robert Haas
Date:
On Tue, Dec 3, 2013 at 6:30 PM, Greg Stark <stark@mit.edu> wrote:
> I always gave the party line that ANALYZE only takes a small
> constant-sized sample so even very large tables should be very quick.
> But after hearing the same story again in Heroku I looked into it a
> bit further. I was kind of shocked by the numbers.
>
> ANALYZE takes a sample of 300 * statistics_target rows. That sounds
> pretty reasonable but with default_statistics_target set to 100 that's
> 30,000 rows. If I'm reading the code right, it takes this sample by
> sampling 30,000 blocks and then (if the table is large enough) taking
> an average of one row per block. Each block is 8192 bytes so that
> means it's reading 240MB of each table. That's a lot more than I
> realized.

That is a lot.  On the other hand, I question the subject line:
sometimes, our ANALYZE sampling is not good enough.  Before we raised
the default statistics target from 10 to 100, complaints about bad
plan choices due to insufficiently-precise statistics were legion --
and we still have people periodically proposing to sample a fixed
percentage of the table instead of a fixed amount of the table, even
on large tables, which is going the opposite direction.  I think this
is because they're getting really bad n_distinct estimates, and no
fixed-size sample can reliably give a good one.

More generally, I think the basic problem that people are trying to
solve by raising the statistics target is avoid index scans on
gigantic tables.  Obviously, there are a lot of other situations where
inadequate statistics can cause problems, but that's a pretty
easy-to-understand one that we do not always get right.  We know that
an equality search looking for some_column = 'some_constant', where
some_constant is an MCV, must be more selective than a search for the
least-frequent MCV.  If you store more and more MCVs for a table,
eventually you'll have enough that the least-frequent one is pretty
infrequent, and then things work a lot better.

This is more of a problem for big tables than for small tables.  MCV
#100 can't have a frequency of greater than 1/100 = 0.01, but that's a
lot more rows on a big table than small one.  On a table with 10
million rows we might estimate something close to 100,000 rows when
the real number is just a handful; when the table has only 10,000
rows, we just can't be off by as many orders of magnitude.  Things
don't always work out that badly, but in the worst case they do.

Maybe there's some highly-principled statistical approach which could
be taken here, and if so that's fine, but I suspect not.  So what I
think we should do is auto-tune the statistics target based on the
table size.  If, say, we think that the generally useful range for the
statistics target is something like 10 to 400, then let's come up with
a formula based on table size that outputs 10 for small tables, 400
for really big tables, and intermediate values for tables in the
middle.
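
One illustrative shape for such a formula (this particular curve is just an
example, not a proposal from the thread): scale the target with log10 of the
row count, clamped to the 10..400 range.

#include <stdio.h>
#include <math.h>

/* Hypothetical auto-tuning curve: 10 below ~10^4 rows, 400 above ~10^9. */
static int auto_stats_target(double reltuples)
{
    const double min_target = 10, max_target = 400;
    const double lo = 1e4, hi = 1e9;
    double t;

    if (reltuples <= lo)
        return (int) min_target;
    if (reltuples >= hi)
        return (int) max_target;
    t = (log10(reltuples) - log10(lo)) / (log10(hi) - log10(lo));
    return (int) (min_target + t * (max_target - min_target));
}

int main(void)
{
    const double sizes[] = {1e3, 1e5, 1e6, 1e7, 1e8, 1e9, 1e11};

    for (int i = 0; i < 7; i++)
        printf("%12.0f rows -> statistics target %d\n",
               sizes[i], auto_stats_target(sizes[i]));
    return 0;
}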

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: ANALYZE sampling is too good

From
Josh Berkus
Date:
On 12/07/2013 11:46 AM, Robert Haas wrote:
> Maybe there's some highly-principled statistical approach which could
> be taken here, and if so that's fine, but I suspect not.  So what I
> think we should do is auto-tune the statistics target based on the
> table size.  If, say, we think that the generally useful range for the
> statistics target is something like 10 to 400, then let's come up with
> a formula based on table size that outputs 10 for small tables, 400
> for really big tables, and intermediate values for tables in the
> middle.

The only approach which makes sense is to base it on a % of the table.
In fact, pretty much every paper which has examined statistics
estimation for database tables has determined that any estimate based on
a less-than-5% sample is going to be wildly inaccurate.  Not that 5%
samples are 100% accurate, but at least they fit the 80/20 rule.

This is the reason why implementing block-based sampling is critical;
using our current "take one row out of every page" method, sampling 5%
of the table means scanning the whole thing in most tables.  We also
need to decouple the number of MCVs we keep from the sample size.
Certainly our existing sampling algo seems designed to maximize IO for
the sample size.

There are other qualitative improvements we could make, which Nathan Boley
has spoken on.   For example, our stats code has no way to recognize a
normal or exponential distribution -- it assumes that all columns are
randomly distributed.  If we could recognize common distribution
patterns, then not only could we have better query estimates, those
would require keeping *fewer* stats, since all you need for a normal
distribution are the end points and the variance.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: ANALYZE sampling is too good

From
Greg Stark
Date:
On Sat, Dec 7, 2013 at 8:25 PM, Josh Berkus <josh@agliodbs.com> wrote:
> The only approach which makes sense is to base it on a % of the table.
> In fact, pretty much every paper which has examined statistics
> estimation for database tables has determined that any estimate based on
> a less-than-5% sample is going to be wildly inaccurate.  Not that 5%
> samples are 100% accurate, but at least they fit the 80/20 rule.

This is nonsense. The math behind the deductions the planner makes is
well understood and we know how to estimate the precision based on the
sample size. There are some questions that really need a proportional
sample such as n_distinct (and as Robert points out the MCV list) but
the main motivation of the sample size is the histograms and those are
used to answer questions which very clearly do not need a proportional
sample. The statistics is very clear there. Using a proportional
sample for the histograms would just be silly. It would be
substituting a gut feeling for real statistics.

The problems with using a proportional sample for things like
n_distinct or the MCV list is that you're talking about sorting or
hashing an unboundedly large set of values and storing an unboundedly
large array in the statistics table. I don't think that would be
tenable without dramatically changing the way process and store that
data to be more scalable. Using a lossy counting algorithm and
something more scalable than toasted arrays would be prerequisites I
think.
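
For reference, a compact sketch of the lossy counting idea (Manku and
Motwani's algorithm, simplified here to a fixed-size linear-scan table;
nothing like this exists in the tree today). Counts can undercount by at
most eps * N, and no value with true frequency >= eps is ever dropped:

#include <stdio.h>

#define MAX_ENTRIES 1024

typedef struct { int value; long count; long delta; int used; } Entry;

static Entry table_[MAX_ENTRIES];
static long  nseen = 0;
static long  bucket_width;                  /* roughly 1/eps */

static void lossy_init(double eps) { bucket_width = (long) (1.0 / eps) + 1; }

static void lossy_add(int value)
{
    long bucket = (++nseen + bucket_width - 1) / bucket_width;  /* ceil(nseen/w) */
    int  i, free_slot = -1, found = 0;

    for (i = 0; i < MAX_ENTRIES; i++)
    {
        if (table_[i].used && table_[i].value == value)
        {
            table_[i].count++;
            found = 1;
            break;
        }
        if (!table_[i].used && free_slot < 0)
            free_slot = i;
    }
    if (!found && free_slot >= 0)
        table_[free_slot] = (Entry){ value, 1, bucket - 1, 1 };

    if (nseen % bucket_width == 0)          /* bucket boundary: prune */
        for (i = 0; i < MAX_ENTRIES; i++)
            if (table_[i].used && table_[i].count + table_[i].delta <= bucket)
                table_[i].used = 0;
}

int main(void)
{
    lossy_init(0.01);                        /* eps = 1% */
    for (int i = 0; i < 100000; i++)
        lossy_add(i % 7 == 0 ? 1 : i);       /* value 1 is ~14% of the stream */

    for (int i = 0; i < MAX_ENTRIES; i++)
        if (table_[i].used && table_[i].count >= 0.05 * nseen)
            printf("heavy hitter %d: approx count %ld of %ld\n",
                   table_[i].value, table_[i].count, nseen);
    return 0;
}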

And as Robert mentions even if we solved those problems it wouldn't
help n_distinct. You really need to read all the values to deal with
n_distinct.

> This is the reason why implementing block-based sampling is critical;
> using our current "take one row out of every page" method, sampling 5%
> of the table means scanning the whole thing in most tables.  We also
> need to decouple the number of MCVs we keep from the sample size.
> Certainly our existing sampling algo seems designed to maximize IO for
> the sample size.

This would be helpful if you could specify what you mean by
"black-based sampling". The reason these previous pleas didn't go
anywhere is not because we can't do math. It's because of the lack of
specifics here. What we do *today* is called "block-based sampling" by
the literature.

What I'm saying is we need to change the "block-based sampling" that
we do to extract more rows per block. We currently extract the
"correct" number of rows to get a strong guarantee of uniformity. If
we could get a weaker guarantee of being "close enough" to uniform
samples for the deductions we want to make and get many more rows per
block then we could read a lot fewer blocks.

Or to put it another way people could increase the statistics target
dramatically and still be reading the same number of blocks as today.
In an ideal world perhaps we could have people reading 1% of the
blocks they read today and get statistics targets 10x better than
today.

-- 
greg



Re: ANALYZE sampling is too good

From
Mark Kirkwood
Date:
On 08/12/13 09:25, Josh Berkus wrote:
> On 12/07/2013 11:46 AM, Robert Haas wrote:
>> Maybe there's some highly-principled statistical approach which could
>> be taken here, and if so that's fine, but I suspect not.  So what I
>> think we should do is auto-tune the statistics target based on the
>> table size.  If, say, we think that the generally useful range for the
>> statistics target is something like 10 to 400, then let's come up with
>> a formula based on table size that outputs 10 for small tables, 400
>> for really big tables, and intermediate values for tables in the
>> middle.
>
> The only approach which makes sense is to base it on a % of the table.
> In fact, pretty much every paper which has examined statistics
> estimation for database tables has determined that any estimate based on
> a less-than-5% sample is going to be wildly inaccurate.  Not that 5%
> samples are 100% accurate, but at least they fit the 80/20 rule.
>
> This is the reason why implementing block-based sampling is critical;
> using our current "take one row out of every page" method, sampling 5%
> of the table means scanning the whole thing in most tables.  We also
> need to decouple the number of MCVs we keep from the sample size.
> Certainly our existing sampling algo seems designed to maximize IO for
> the sample size.
>
> There are other qualitative improvements we could make, which Nathan Boley
> has spoken on.   For example, our stats code has no way to recognize a
> normal or exponential distribution -- it assumes that all columns are
> randomly distributed.  If we could recognize common distribution
> patterns, then not only could we have better query estimates, those
> would require keeping *fewer* stats, since all you need for a normal
> distribution are the end points and the variance.
>
From src/backend/commands/analyze.c

 * As of May 2004 we use a new two-stage method:  Stage one selects up
 * to targrows random blocks (or all blocks, if there aren't so many).
 * Stage two scans these blocks and uses the Vitter algorithm to create
 * a random sample of targrows rows (or less, if there are less in the
 * sample of blocks).  The two stages are executed simultaneously: each
 * block is processed as soon as stage one returns its number and while
 * the rows are read stage two controls which ones are to be inserted
 * into the sample.

I don't think we always read every block (been a while since I looked at 
this code, so I'll recheck). From what I understand this stuff is based 
on reasonable research (Vitter algorithm). Not to say it 
couldn't/shouldn't be looked at again to improve it - but it is not just 
dreamed up with no valid research!

Cheers

Mark



Re: ANALYZE sampling is too good

From
Mark Kirkwood
Date:
On 08/12/13 12:27, Mark Kirkwood wrote:
> On 08/12/13 09:25, Josh Berkus wrote:
>> On 12/07/2013 11:46 AM, Robert Haas wrote:
>>> Maybe there's some highly-principled statistical approach which could
>>> be taken here, and if so that's fine, but I suspect not.  So what I
>>> think we should do is auto-tune the statistics target based on the
>>> table size.  If, say, we think that the generally useful range for the
>>> statistics target is something like 10 to 400, then let's come up with
>>> a formula based on table size that outputs 10 for small tables, 400
>>> for really big tables, and intermediate values for tables in the
>>> middle.
>>
>> The only approach which makes sense is to base it on a % of the table.
>> In fact, pretty much every paper which has examined statistics
>> estimation for database tables has determined that any estimate based on
>> a less-than-5% sample is going to be wildly inaccurate.  Not that 5%
>> samples are 100% accurate, but at least they fit the 80/20 rule.
>>
>> This is the reason why implementing block-based sampling is critical;
>> using our current "take one row out of every page" method, sampling 5%
>> of the table means scanning the whole thing in most tables.  We also
>> need to decouple the number of MCVs we keep from the sample size.
>> Certainly our existing sampling algo seems designed to maximize IO for
>> the sample size.
>>
>> There are other qualitative improvements we could make, which Nathan Boley
>> has spoken on.   For example, our stats code has no way to recognize a
>> normal or exponential distribution -- it assumes that all columns are
>> randomly distributed.  If we could recognize common distribution
>> patterns, then not only could we have better query estimates, those
>> would require keeping *fewer* stats, since all you need for a normal
>> distribution are the end points and the variance.
>>
>
>  From src/backend/commands/analyze.c
>
> * As of May 2004 we use a new two-stage method:  Stage one selects up
>   * to targrows random blocks (or all blocks, if there aren't so many).
>   * Stage two scans these blocks and uses the Vitter algorithm to create
>   * a random sample of targrows rows (or less, if there are less in the
>   * sample of blocks).  The two stages are executed simultaneously: each
>   * block is processed as soon as stage one returns its number and while
>   * the rows are read stage two controls which ones are to be inserted
>   * into the sample.
>
> I don't think we always read every block (been a while since I looked at
> this code, so I'll recheck). From what I understand this stuff is based
> on reasonable research (Vitter algorithm). Not to say it
> couldn't/shouldn't be looked at again to improve it - but it is not just
> dreamed up with no valid research!
>

Since I volunteered to recheck :-)

from analyze.c again:


/*--------------------
 * The following choice of minrows is based on the paper
 * "Random sampling for histogram construction: how much is enough?"
 * by Surajit Chaudhuri, Rajeev Motwani and Vivek Narasayya, in
 * Proceedings of ACM SIGMOD International Conference on Management
 * of Data, 1998, Pages 436-447.  Their Corollary 1 to Theorem 5
 * says that for table size n, histogram size k, maximum relative
 * error in bin size f, and error probability gamma, the minimum
 * random sample size is
 *      r = 4 * k * ln(2*n/gamma) / f^2
 * Taking f = 0.5, gamma = 0.01, n = 10^6 rows, we obtain
 *      r = 305.82 * k
 * Note that because of the log function, the dependence on n is
 * quite weak; even at n = 10^12, a 300*k sample gives <= 0.66
 * bin size error with probability 0.99.  So there's no real need to
 * scale for n, which is a good thing because we don't necessarily
 * know it at this point.
 *--------------------
 */
stats->minrows = 300 * attr->attstattarget;


So for typical settings (default_statistics_target = 100), we try to get 
30000 rows. This means we will sample about 30000 blocks.
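
Those numbers are easy to reproduce from the formula in that comment
(plain C, link with -lm):

#include <stdio.h>
#include <math.h>

int main(void)
{
    const double f = 0.5, gamma = 0.01;
    const double sizes[] = {1e6, 1e9, 1e12};

    /* r = 4 * k * ln(2*n/gamma) / f^2, shown here per histogram bucket (k = 1) */
    for (int i = 0; i < 3; i++)
        printf("n = %.0e: r = %.2f * k\n",
               sizes[i], 4.0 * log(2.0 * sizes[i] / gamma) / (f * f));

    /* and the reverse question: with 300*k rows, what bin error f at n = 10^12? */
    printf("f at n = 1e12 with 300*k rows: %.2f\n",
           sqrt(4.0 * log(2.0 * 1e12 / gamma) / 300.0));
    return 0;
}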

Indeed quickly checking with a scale 100 pgbench db and a simple patch 
to make the block sampler say how many blocks it reads (note 
pgbench_accounts has 163935 blocks):

bench=# ANALYZE pgbench_branches;
NOTICE:  acquire sample will need 30000 blocks
NOTICE:  sampled 1 blocks
ANALYZE
Time: 16.935 ms

bench=# ANALYZE pgbench_accounts;
NOTICE:  acquire sample will need 30000 blocks
NOTICE:  sampled 30000 blocks
ANALYZE
Time: 10059.446 ms
bench=# \q


Regards

Mark



Re: ANALYZE sampling is too good

From
Gavin Flower
Date:
On 08/12/13 10:27, Greg Stark wrote:
> On Sat, Dec 7, 2013 at 8:25 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> The only approach which makes sense is to base it on a % of the table.
>> In fact, pretty much every paper which has examined statistics
>> estimation for database tables has determined that any estimate based on
>> a less-than-5% sample is going to be wildly inaccurate.  Not that 5%
>> samples are 100% accurate, but at least they fit the 80/20 rule.
> This is nonsense. The math behind the deductions the planner makes is
> well understood and we know how to estimate the precision based on the
> sample size. There are some questions that really need a proportional
> sample such as n_distinct (and as Robert points out the MCV list) but
> the main motivation of the sample size is the histograms and those are
> used to answer questions which very clearly do not need a proportional
> sample. The statistics is very clear there. Using a proportional
> sample for the histograms would just be silly. It would be
> substituting a gut feeling for real statistics.
>
> The problems with using a proportional sample for things like
> n_distinct or the MCV list is that you're talking about sorting or
> hashing an unboundedly large set of values and storing an unboundedly
> large array in the statistics table. I don't think that would be
> tenable without dramatically changing the way we process and store that
> data to be more scalable. Using a lossy counting algorithm and
> something more scalable than toasted arrays would be prerequisites I
> think.
>
> And as Robert mentions even if we solved those problems it wouldn't
> help n_distinct. You really need to read all the values to deal with
> n_distinct.
>
>> This is the reason why implementing block-based sampling is critical;
>> using our current "take one row out of every page" method, sampling 5%
>> of the table means scanning the whole thing in most tables.  We also
>> need to decouple the number of MCVs we keep from the sample size.
>> Certainly our existing sampling algo seems designed to maximize IO for
>> the sample size.
> This would be helpful if you could specify what you mean by
> "black-based sampling". The reason these previous pleas didn't go
> anywhere is not because we can't do math. It's because of the lack of
> specifics here. What we do *today* is called "block-based sampling" by
> the literature.
>
> What I'm saying is we need to change the "block-based sampling" that
> we do to extract more rows per block. We currently extract the
> "correct" number of rows to get a strong guarantee of uniformity. If
> we could get a weaker guarantee of being "close enough" to uniform
> samples for the deductions we want to make and get many more rows per
> block then we could read a lot fewer blocks.
>
> Or to put it another way people could increase the statistics target
> dramatically and still be reading the same number of blocks as today.
> In an ideal world perhaps we could have people reading 1% of the
> blocks they read today and get statistics targets 10x better than
> today.
>
I suspect that the number of rows to sample should be something of the form:

      { N                      where N <= a * 100
n =   { (a * N) / 100          where a * 100 < N <= 10000
      { (a * SQRT(N)) / 100    where N > 10000

n: the number of rows to sample
N: total number of rows
a: configurable coefficient  (1 <= a <= 100)


For very large numbers of rows, like 10^8, taking a constant fraction 
will over sample for most purposes, hence the square root.

  a       N          n   sampled%
  1   10000        100         1%
100   10000      10000       100%
  1    10^8      10000      0.01%
100    10^8       10^6         1%
  1   10^10       10^5     0.001%
100   10^10       10^7      0.01%


For very large values of N, you can still get a representative 
sample with a very small fraction of rows sampled.

Hmm... Yes, I can see some issues, once I tried it out with test values!  
However, it might inspire a more useful and practical approach!
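
For anyone else who wants to try out test values, here is a direct
transcription of the piecewise rule above (hypothetical helper, nothing
more):

#include <stdio.h>
#include <math.h>

static double sample_rows(double N, double a)
{
    if (N <= a * 100)
        return N;
    if (N <= 10000)
        return a * N / 100.0;
    return a * sqrt(N) / 100.0;
}

int main(void)
{
    const double Ns[] = {1e4, 1e8, 1e10};
    const double as[] = {1, 100};

    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 2; j++)
        {
            double n = sample_rows(Ns[i], as[j]);
            printf("a=%3.0f N=%.0e -> n=%.0f (%.4f%% sampled)\n",
                   as[j], Ns[i], n, 100.0 * n / Ns[i]);
        }
    return 0;
}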


Cheers,
Gavin



Re: ANALYZE sampling is too good

From
Greg Stark
Date:
On Sun, Dec 8, 2013 at 12:06 AM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:
>
> bench=# ANALYZE pgbench_accounts;
> NOTICE:  acquire sample will need 30000 blocks
> NOTICE:  sampled 30000 blocks
> ANALYZE
> Time: 10059.446 ms
> bench=# \q


I did some experimenting here as well.

I hacked up a version of analyze.c that has a guc for rows_per_block
to sample. Setting it to 1 doesn't modify the behaviour at all whereas
setting it to 4 divides the number of blocks to sample by 4 which
causes it to do less I/O and use more rows from each block.

I then initialized pgbench with scale factor 100 but modified the code
to run the actual pgbench with scale factor 1. In other words I ran a
lot of updates on 1% of the database but left the other 99% untouched
from the initial load.

Then I ran "ANALYZE VERBOSE accounts" with rows_per_block set to 1, 4,
16, and 64. The latter is slightly larger than the average number of
tuples per block so the resulting sample is actually slightly short.

The whole accounts table is 1.2GB and contains 10 million rows. As
expected with rows_per_block set to 1 it reads 240MB of that
containing nearly 2 million rows (and takes nearly 20s -- doing a full
table scan for select count(*) only takes about 5s):


stark=# analyze verbose pgbench_accounts;
INFO:  analyzing "public.pgbench_accounts"
INFO:  "pgbench_accounts": scanned 30000 of 158756 pages, containing
1889701 live rows and 0 dead rows; 30000 rows in sample, 10000036
estimated total rows
ANALYZE
Time: 19468.987 ms


With rows_per_block=4 it reads only a quarter as many blocks but it's
not much faster:

stark=# analyze verbose pgbench_accounts;
INFO:  analyzing "public.pgbench_accounts"
INFO:  "pgbench_accounts": scanned 7501 of 158756 pages, containing
472489 live rows and 0 dead rows; 30000 rows in sample, 10000037
estimated total rows
ANALYZE
Time: 17062.331 ms


But with rows_per_block=16 it's much faster, 6.7s

stark=# set statistics_rows_per_block = 16;
SET
Time: 1.583 ms
stark=# analyze verbose pgbench_accounts;
INFO:  analyzing "public.pgbench_accounts"
INFO:  "pgbench_accounts": scanned 1876 of 158756 pages, containing
118163 live rows and 0 dead rows; 30000 rows in sample, 10000031
estimated total rows
ANALYZE
Time: 6694.331 ms


And with rows_per_block=64 it's under 2s:

stark=# set statistics_rows_per_block = 64;
SET
Time: 0.693 ms
stark=# analyze verbose pgbench_accounts;
INFO:  analyzing "public.pgbench_accounts"
INFO:  "pgbench_accounts": scanned 469 of 158756 pages, containing
29544 live rows and 0 dead rows; 29544 rows in sample, 10000033
estimated total rows
ANALYZE
Time: 1937.055 ms


The estimates for the total rows are just as accurate in every case.
(They seem to be consistently slightly high though, which is a bit
disconcerting.)

However looking at the actual pg_stats entries the stats are
noticeably less accurate for the "blockier" samples. The "bid" column
actually has 100 distinct values and so with a statistics_target of
100 each value should appear in the MCV list with a frequency of about
.01.

With rows_per_block=1   the MCV frequency list ranges from .0082 to .0123
With rows_per_block=4   the MCV frequency list ranges from .0063 to .0125
With rows_per_block=16 the MCV frequency list ranges from .0058 to .0164
With rows_per_block=64 the MCV frequency list ranges from .0021 to .0213

I'm not really sure if this is due to the blocky sample combined with
the skewed pgbench run or not. It doesn't seem to be consistently
biasing towards or against bid 1 which I believe are the only rows
that would have been touched by pgbench. Still it's suspicious that
they seem to be consistently getting less accurate as the blockiness
increases.

I've attached the results of pg_stats following the analyze with the
various levels of "blockiness".

--
greg


Re: ANALYZE sampling is too good

From
Josh Berkus
Date:
On 12/08/2013 10:14 AM, Greg Stark wrote:
> With rows_per_block=1   the MCV frequency list ranges from .0082 to .0123
> With rows_per_block=4   the MCV frequency list ranges from .0063 to .0125
> With rows_per_block=16 the MCV frequency list ranges from .0058 to .0164
> With rows_per_block=64 the MCV frequency list ranges from .0021 to .0213
> 
> I'm not really sure if this is due to the blocky sample combined with
> the skewed pgbench run or not. It doesn't seem to be consistently
> biasing towards or against bid 1 which I believe are the only rows
> that would have been touched by pgbench. Still it's suspicious that
> they seem to be consistently getting less accurate as the blockiness
> increases.

They will certainly do so if you don't apply any statistical adjustments
for selecting more rows from the same pages.

So there's a set of math designed to calculate for the skew introduced
by reading *all* of the rows in each block.  That's what I meant by
"block-based sampling"; you read, say, 400 pages, you compile statistics
on *all* of the rows on those pages, you apply some algorithms to adjust
for groupings of rows based on how grouped they are.  And you have a
pretty good estimate of how grouped they are, because you just looked at
complete sets of rows on a bunch of nonadjacent pages.

Obviously, you need to look at more rows than you would with a
pure-random sample.  Like I said, the 80%+ accurate point in the papers
seemed to be at a 5% sample.  However, since those rows come from the
same pages, the cost of looking at more rows is quite small, compared to
the cost of looking at 64 times as many disk pages.
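
One standard piece of that math, assuming the textbook cluster-sampling
correction is the kind of thing being referred to (that's a guess, not a
citation), is Kish's design effect:

/*
 * With m rows taken from each sampled page and a within-page intraclass
 * correlation of rho, an n-row sample is only "worth" about
 *     n / (1 + (m - 1) * rho)
 * independent rows.  rho = 0 means rows on a page look like random rows;
 * rho = 1 means every page is internally homogeneous.
 */
#include <stdio.h>

int main(void)
{
    const double rhos[] = {0.0, 0.05, 0.2, 0.5};
    const int    m = 64;                 /* take every row on an 8k page */
    const int    n = 30000;              /* raw rows in the sample */

    for (int i = 0; i < 4; i++)
    {
        double deff = 1.0 + (m - 1) * rhos[i];
        printf("rho = %.2f: effective sample size ~%.0f of %d rows\n",
               rhos[i], n / deff, n);
    }
    return 0;
}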

My ACM subscription has lapsed, though; someone with a current ACM
subscription could search for this; there are several published papers,
with math and pseudocode.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: ANALYZE sampling is too good

From
Heikki Linnakangas
Date:
On 12/08/2013 08:14 PM, Greg Stark wrote:
> The whole accounts table is 1.2GB and contains 10 million rows. As
> expected with rows_per_block set to 1 it reads 240MB of that
> containing nearly 2 million rows (and takes nearly 20s -- doing a full
> table scan for select count(*) only takes about 5s):

One simple thing we could do, without or in addition to changing the 
algorithm, is to issue posix_fadvise() calls for the blocks we're going 
to read. It should at least be possible to match the speed of a plain 
sequential scan that way.
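
A standalone sketch of that idea (not a patch against analyze.c): hint all
the sampled blocks with POSIX_FADV_WILLNEED first, then read them in order
and let the kernel overlap the I/O:

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BLCKSZ 8192

static void prefetch_then_read(int fd, const long *blocks, int nblocks)
{
    char buf[BLCKSZ];

    /* tell the kernel about all the sampled blocks up front ... */
    for (int i = 0; i < nblocks; i++)
        (void) posix_fadvise(fd, (off_t) blocks[i] * BLCKSZ, BLCKSZ,
                             POSIX_FADV_WILLNEED);

    /* ... then read them as usual */
    for (int i = 0; i < nblocks; i++)
        if (pread(fd, buf, BLCKSZ, (off_t) blocks[i] * BLCKSZ) < 0)
            perror("pread");
}

int main(int argc, char **argv)
{
    long blocks[] = {17, 42, 300, 9000};   /* stand-in for a sorted sample */
    int  fd;

    if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
    {
        fprintf(stderr, "usage: %s <data-file>\n", argv[0]);
        return 1;
    }
    prefetch_then_read(fd, blocks, 4);
    close(fd);
    return 0;
}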

- Heikki



Re: ANALYZE sampling is too good

From
Greg Stark
Date:
On Sun, Dec 8, 2013 at 7:03 PM, Josh Berkus <josh@agliodbs.com> wrote:
> They will certainly do so if you don't apply any statistical adjustments
> for selecting more rows from the same pages.
>
> So there's a set of math designed to calculate for the skew introduced
> by reading *all* of the rows in each block.

I just think this is an oversimplification. There's no skew introduced
just by reading all the rows in each block unless there's some kind of
dependency between the block a row is placed on and the data in it. So
I don't believe there can be some single set of math that
automatically removes any skew automatically. The math will depend on
what the dependency is.

Just to be clear, you have to think pretty hard about the way Postgres
internals work to see what kinds of skew might be appearing here. Due
to the way Postgres updates work and HOT cleanups work "hot" tuples
will be weighted less than "cold" tuples. That's not going to be
something someone in ACM knew to design into their maths.

I do have access to ACM or other academic articles if you remember any
author names or any keywords but if it's a database journal I would
worry about patent issues. Do you remember if it was over 17 years
old?


> Obviously, you need to look at more rows than you would with a
> pure-random sample.  Like I said, the 80%+ accurate point in the papers
> seemed to be at a 5% sample.

I really don't believe the 5% thing. It's not enough for n_distinct
and it's *far* too high a value for linear properties like histograms
or nullfrac etc. From a computer point of view it's too high to be
worth bothering. If we have to read 5% of the table we might as well
do a full scan anyways, it'll be marginally slower but much better
quality results.



-- 
greg



Re: ANALYZE sampling is too good

From
Mark Kirkwood
Date:
On 09/12/13 08:03, Josh Berkus wrote:
>
> So there's a set of math designed to calculate for the skew introduced
> by reading *all* of the rows in each block.  That's what I meant by
> "block-based sampling"; you read, say, 400 pages, you compile statistics
> on *all* of the rows on those pages, you apply some algorithms to adjust
> for groupings of rows based on how grouped they are.  And you have a
> pretty good estimate of how grouped they are, because you just looked at
> complete sets of rows on a bunch of nonadjacent pages.
>
> Obviously, you need to look at more rows than you would with a
> pure-random sample.  Like I said, the 80%+ accurate point in the papers
> seemed to be at a 5% sample.  However, since those rows come from the
> same pages, the cost of looking at more rows is quite small, compared to
> the cost of looking at 64 times as many disk pages.
>
> My ACM subscription has lapsed, though; someone with a current ACM
> subscription could search for this; there are several published papers,
> with math and pseudocode.
>

This makes more sense! Unfortunately you had confused everyone (well me 
at least) by decreeing that we needed block based sampling - when we 
started using that in 2004 - there is more than one way to sample blocks 
it seems :-)

What you are actually suggesting is a way to do block based sampling 
that requires reading fewer blocks, and that is a great idea! Of course 
as Greg has just suggested - the details are gonna be the clincher. Can 
you supply authors or titles of papers?

Also with reference to Greenplum - their use of postgres is fairly 
special case, and an approach that works well for them is not always 
suitable for more general purpose use.

Regards

Mark



Re: ANALYZE sampling is too good

From
Josh Berkus
Date:
Greg,

> I really don't believe the 5% thing. It's not enough for n_distinct
> and it's *far* too high a value for linear properties like histograms
> or nullfrac etc. 

Actually, it is enough for n_distinct, or more properly, 5% is as good
as you can get for n_distinct unless you're going to jump to scanning
50% or more.

It's also applicable for the other stats; histogram buckets constructed
from a 5% sample are more likely to be accurate than those constructed
from a 0.1% sample.   Same with nullfrac.  The degree of improved
accuracy would, of course, require some math to determine.

> From a computer point of view it's too high to be
> worth bothering. If we have to read 5% of the table we might as well
> do a full scan anyways, it'll be marginally slower but much better
> quality results.

Reading 5% of a 200GB table is going to be considerably faster than
reading the whole thing, if that 5% is being scanned in a way that the
FS understands.

Also, we can optimize this significantly by using the VM, as Robert (I
think) suggested.

In the advanced approaches section, there's also the idea of collecting
analyze data from table pages while they're in memory anyway for other
reasons.

You do seem kind of hostile to the idea of full-page-sampling, going
pretty far beyond the "I'd need to see the math".  Why?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: ANALYZE sampling is too good

From
Robert Haas
Date:
On Mon, Dec 9, 2013 at 1:03 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> I really don't believe the 5% thing. It's not enough for n_distinct
>> and it's *far* too high a value for linear properties like histograms
>> or nullfrac etc.
>
> Actually, it is enough for n_distinct, or more properly, 5% is as good
> as you can get for n_distinct unless you're going to jump to scanning
> 50% or more.

I'd like to see a proof of that result.

Not because I'm hostile to changing the algorithm, but because you've
made numerous mathematical claims on this thread that fly in the face
of what Greg, myself, and others understand to be mathematically true
- including this one.  If our understanding is wrong, then by all
means let's get that fixed.  But you're not going to convince anyone
here that we should rip out the existing algorithm and its
peer-reviewed journal citations by making categorical assertions about
the right way to do things.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: ANALYZE sampling is too good

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> Reading 5% of a 200GB table is going to be considerably faster than
> reading the whole thing, if that 5% is being scanned in a way that the
> FS understands.

Really?  See the upthread point that reading one sector from each track
has just as much seek overhead as reading the whole thing.  I will grant
that if you think that reading a *contiguous* 5% of the table is good
enough, you can make it faster --- but I don't believe offhand that
you can make this better without seriously compromising the randomness
of your sample.  Too many tables are loaded roughly in time order, or
in other ways that make contiguous subsets nonrandom.

> You do seem kind of hostile to the idea of full-page-sampling, going
> pretty far beyond the "I'd need to see the math".  Why?

I'm detecting a lot of hostility to assertions unsupported by any math.
For good reason.
        regards, tom lane



Re: ANALYZE sampling is too good

From
Greg Stark
Date:
On Mon, Dec 9, 2013 at 6:03 PM, Josh Berkus <josh@agliodbs.com> wrote:
>
> It's also applicable for the other stats; histogram buckets constructed
> from a 5% sample are more likely to be accurate than those constructed
> from a 0.1% sample.   Same with nullfrac.  The degree of improved
> accuracy, would, of course, require some math to determine.

This "some math" is straightforward basic statistics.  The 95th
percentile confidence interval for a sample consisting of 300 samples
from a population of a 1 million would be 5.66%. A sample consisting
of 1000 samples would have a 95th percentile confidence interval of
+/- 3.1%.
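
Reproducing those margins of error directly (the usual normal-approximation
interval at the worst case p = 0.5, with the finite population correction,
which barely matters at this population size):

#include <stdio.h>
#include <math.h>

int main(void)
{
    const double z = 1.96, p = 0.5, N = 1e6;
    const int samples[] = {300, 1000, 30000};

    for (int i = 0; i < 3; i++)
    {
        double n = samples[i];
        double fpc = sqrt((N - n) / (N - 1));          /* finite population correction */
        double moe = z * sqrt(p * (1 - p) / n) * fpc;  /* margin of error at 95% */
        printf("n = %5d: +/- %.2f%%\n", samples[i], 100.0 * moe);
    }
    return 0;
}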

The histogram and nullfrac answer the same kind of question as a
political poll, "what fraction of the population falls within this
subset". This is why pollsters don't need to sample 15 million
Americans to have a decent poll result. That's just not how the math
works for these kinds of questions.

n_distinct is an entirely different kettle of fish. It's a different
kind of problem and the error rate there *is* going to be dependent on
the percentage of the total population that you sampled. Moreover, from
the papers I read I'm convinced any sample less than 50-80% is nearly
useless, so you can't get good results without reading
the whole table.

-- 
greg



Re: ANALYZE sampling is too good

From
Greg Stark
Date:
On Mon, Dec 9, 2013 at 6:54 PM, Greg Stark <stark@mit.edu> wrote:
>
> This "some math" is straightforward basic statistics.  The 95th
> percentile confidence interval for a sample consisting of 300 samples
> from a population of a 1 million would be 5.66%. A sample consisting
> of 1000 samples would have a 95th percentile confidence interval of
> +/- 3.1%.

Incidentally I got this using an online sample size calculator. Google
turns up several but this one seems the easiest to use:
http://www.raosoft.com/samplesize.html


-- 
greg



Re: ANALYZE sampling is too good

From
Jeff Janes
Date:
On Sat, Dec 7, 2013 at 11:46 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Dec 3, 2013 at 6:30 PM, Greg Stark <stark@mit.edu> wrote:
>> I always gave the party line that ANALYZE only takes a small
>> constant-sized sample so even very large tables should be very quick.
>> But after hearing the same story again in Heroku I looked into it a
>> bit further. I was kind of shocked by the numbers.
>>
>> ANALYZE takes a sample of 300 * statistics_target rows. That sounds
>> pretty reasonable but with default_statistics_target set to 100 that's
>> 30,000 rows. If I'm reading the code right, it takes this sample by
>> sampling 30,000 blocks and then (if the table is large enough) taking
>> an average of one row per block. Each block is 8192 bytes so that
>> means it's reading 240MB of each table. That's a lot more than I
>> realized.
>
> That is a lot.  On the other hand, I question the subject line:
> sometimes, our ANALYZE sampling is not good enough.  Before we raised
> the default statistics target from 10 to 100, complaints about bad
> plan choices due to insufficiently-precise statistics were legion --
> and we still have people periodically proposing to sample a fixed
> percentage of the table instead of a fixed amount of the table, even
> on large tables, which is going the opposite direction.  I think this
> is because they're getting really bad n_distinct estimates, and no
> fixed-size sample can reliably give a good one.


I don't recall ever tracing a bad plan down to a bad n_distinct.  I have seen several that were due to bad frequency estimates in MCV list, because hash join planning is extremely sensitive to that.  Do we have some kind of catalog of generators of problematic data, so that changes can be tested on known problem sets?  Perhaps a wiki page to accumulate them would be useful.  For automated testing I guess the generator and query is the easy part, the hard part is the cost settings/caching/RAM needed to trigger the problem, and parsing and interpreting the results.
 

> More generally, I think the basic problem that people are trying to
> solve by raising the statistics target is avoid index scans on
> gigantic tables.  Obviously, there are a lot of other situations where
> inadequate statistics can cause problems, but that's a pretty
> easy-to-understand one that we do not always get right.  We know that
> an equality search looking for some_column = 'some_constant', where
> some_constant is an MCV, must be more selective than a search for the
> least-frequent MCV.  If you store more and more MCVs for a table,
> eventually you'll have enough that the least-frequent one is pretty
> infrequent, and then things work a lot better.

My reading of the code is that if it is not in the MCV, then it is assumed to have the average selectivity (about 1/n_distinct, but deflating top and bottom for the MCV list).  There is also a check that it is less than the least common of the MCV, but I don't know why that situation would ever prevail--that should always be higher or equal to the average selectivity.

 

> This is more of a problem for big tables than for small tables.  MCV
> #100 can't have a frequency of greater than 1/100 = 0.01, but that's a
> lot more rows on a big table than small one.  On a table with 10
> million rows we might estimate something close to 100,000 rows when
> the real number is just a handful; when the table has only 10,000
> rows, we just can't be off by as many orders of magnitude.  Things
> don't always work out that badly, but in the worst case they do.
>
> Maybe there's some highly-principled statistical approach which could
> be taken here, and if so that's fine, but I suspect not.  So what I
> think we should do is auto-tune the statistics target based on the
> table size.  If, say, we think that the generally useful range for the
> statistics target is something like 10 to 400, then let's come up with
> a formula based on table size that outputs 10 for small tables, 400
> for really big tables, and intermediate values for tables in the
> middle.

I think that parts of the planner are N^2 in the size of histogram (or was that the size of the MCV list?).  So we would probably need a way to use a larger sample size to get more accurate n_distinct and MCV frequencies, but not save the entire histogram that goes with that sample size.

 
Cheers,

Jeff

Re: ANALYZE sampling is too good

From
Jim Nasby
Date:
On 12/6/13 3:21 AM, Andres Freund wrote:
> On 2013-12-05 17:52:34 -0800, Peter Geoghegan wrote:
>> Has anyone ever thought about opportunistic ANALYZE piggy-backing on
>> other full-table scans? That doesn't really help Greg, because his
>> complaint is mostly that a fresh ANALYZE is too expensive, but it
>> could be an interesting, albeit risky approach.
>
> What I've been thinking of is
>
> a) making it piggy back on scans vacuum is doing instead of doing
> separate ones all the time (if possible, analyze needs to be more
> frequent). Currently with quite some likelihood the cache will be gone
> again when revisiting.

FWIW, if synchronize_seqscans is on I'd think it'd be pretty easy to fire up a 2nd backend to do the ANALYZE portion
(or perhaps use Robert's fancy new shared memory stuff).
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



Re: ANALYZE sampling is too good

From
Jim Nasby
Date:
On 12/8/13 1:49 PM, Heikki Linnakangas wrote:
> On 12/08/2013 08:14 PM, Greg Stark wrote:
>> The whole accounts table is 1.2GB and contains 10 million rows. As
>> expected with rows_per_block set to 1 it reads 240MB of that
>> containing nearly 2 million rows (and takes nearly 20s -- doing a full
>> table scan for select count(*) only takes about 5s):
>
> One simple thing we could do, without or in addition to changing the algorithm, is to issue posix_fadvise() calls for
> the blocks we're going to read. It should at least be possible to match the speed of a plain sequential scan that way.

Hrm... maybe it wouldn't be very hard to use async IO here either? I'm thinking it wouldn't be very hard to do the
stage 2 work in the callback routine...
 
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



Re: ANALYZE sampling is too good

From
Peter Geoghegan
Date:
On Mon, Dec 9, 2013 at 1:18 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> I don't recall ever tracing a bad plan down to a bad n_distinct.

It does happen. I've seen it several times.


-- 
Peter Geoghegan



Re: ANALYZE sampling is too good

From
Heikki Linnakangas
Date:
On 12/09/2013 11:35 PM, Jim Nasby wrote:
> On 12/8/13 1:49 PM, Heikki Linnakangas wrote:
>> On 12/08/2013 08:14 PM, Greg Stark wrote:
>>> The whole accounts table is 1.2GB and contains 10 million rows. As
>>> expected with rows_per_block set to 1 it reads 240MB of that
>>> containing nearly 2 million rows (and takes nearly 20s -- doing a full
>>> table scan for select count(*) only takes about 5s):
>>
>> One simple thing we could do, without or in addition to changing the
>> algorithm, is to issue posix_fadvise() calls for the blocks we're
>> going to read. It should at least be possible to match the speed of a
>> plain sequential scan that way.
>
> Hrm... maybe it wouldn't be very hard to use async IO here either? I'm
> thinking it wouldn't be very hard to do the stage 2 work in the callback
> routine...

Yeah, other than the fact we have no infrastructure to do asynchronous 
I/O anywhere in the backend. If we had that, then we could easily use it 
here. I doubt it would be much better than posix_fadvising the blocks, 
though.

- Heikki



Re: ANALYZE sampling is too good

From
Claudio Freire
Date:
On Mon, Dec 9, 2013 at 6:47 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 12/09/2013 11:35 PM, Jim Nasby wrote:
>>
>> On 12/8/13 1:49 PM, Heikki Linnakangas wrote:
>>>
>>> On 12/08/2013 08:14 PM, Greg Stark wrote:
>>>>
>>>> The whole accounts table is 1.2GB and contains 10 million rows. As
>>>> expected with rows_per_block set to 1 it reads 240MB of that
>>>> containing nearly 2 million rows (and takes nearly 20s -- doing a full
>>>> table scan for select count(*) only takes about 5s):
>>>
>>>
>>> One simple thing we could do, without or in addition to changing the
>>> algorithm, is to issue posix_fadvise() calls for the blocks we're
>>> going to read. It should at least be possible to match the speed of a
>>> plain sequential scan that way.
>>
>>
>> Hrm... maybe it wouldn't be very hard to use async IO here either? I'm
>> thinking it wouldn't be very hard to do the stage 2 work in the callback
>> routine...
>
>
> Yeah, other than the fact we have no infrastructure to do asynchronous I/O
> anywhere in the backend. If we had that, then we could easily use it here. I
> doubt it would be much better than posix_fadvising the blocks, though.


Without patches to the kernel, it is much better.

posix_fadvise interferes with read-ahead, so posix_fadvise on, say,
bitmap heap scans (or similarly sorted analyze block samples) run at 1
IO / block, ie horrible, whereas aio can do read coalescence and
read-ahead when the kernel thinks it'll be profitable, significantly
increasing IOPS. I've seen everything from a 2x to 10x difference.



Re: ANALYZE sampling is too good

From
Robert Haas
Date:
On Mon, Dec 9, 2013 at 4:18 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> My reading of the code is that if it is not in the MCV, then it is assumed
> to have the average selectivity (about 1/n_distinct, but deflating top and
> bottom for the MCV list).  There is also a check that it is less than the
> least common of the MCV, but I don't know why that situation would ever
> prevail--that should always be higher or equal to the average selectivity.

I've never seen an n_distinct value of more than 5 digits, regardless
of reality.  Typically I've seen 20-50k, even if the real number is
much higher.  But the n_distinct value is only for non-MCVs, so if we
estimate the selectivity of column = 'rarevalue' to be
(1-nullfrac-mcvfrac)/n_distinct, then making mcvfrac bigger reduces
the estimate, and making the MCV list longer naturally makes mcvfrac
bigger.  I'm not sure how important the
less-frequent-than-the-least-common-MCV part is, but I'm very sure
that raising the statistics target helps to solve the problem of
overestimating the prevalence of uncommon values in a very big table.
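
In miniature, the estimate being discussed looks something like this (a
simplification of the selfuncs.c logic for col = const when the constant is
not in the MCV list; the names here are made up):

#include <stdio.h>

static double non_mcv_selectivity(double nullfrac, double mcv_total_freq,
                                  int n_mcvs, double n_distinct)
{
    double otherdistinct = n_distinct - n_mcvs;      /* distinct values outside the MCV list */
    double selec = 1.0 - nullfrac - mcv_total_freq;  /* probability mass left for them */

    if (otherdistinct > 1)
        selec /= otherdistinct;
    return selec;
}

int main(void)
{
    const double rows = 10e6;
    double sel;

    /* 100 MCVs covering 30% of the rows, n_distinct estimated at 20000 */
    sel = non_mcv_selectivity(0.0, 0.30, 100, 20000);
    printf("selectivity %.6f -> %.0f rows estimated\n", sel, sel * rows);

    /* same table, but n_distinct badly underestimated at 2000 */
    sel = non_mcv_selectivity(0.0, 0.30, 100, 2000);
    printf("selectivity %.6f -> %.0f rows estimated\n", sel, sel * rows);
    return 0;
}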

> I think that parts of the planner are N^2 in the size of histogram (or was
> that the size of the MCV list?).  So we would probably need a way to use a
> larger sample size to get more accurate n_distinct and MCV frequencies, but
> not save the entire histogram that goes with that sample size.

I think the saving the histogram part is important.  As you say, the
MCVs are important for a variety of planning purposes, such as hash
joins.  More than that, in my experience, people with large tables are
typically very willing to spend more planning time to get a better
plan, because mistakes are expensive and the queries are likely to run
for a while anyway.  People with small tables care about planning
time, because it makes no sense to spend an extra 1ms planning a query
unless you improve the plan by enough to save at least 1ms when
executing it, and when the tables are small and access is expected to
be fast anyway that's often not the case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: ANALYZE sampling is too good

From
Heikki Linnakangas
Date:
On 12/09/2013 11:56 PM, Claudio Freire wrote:
> On Mon, Dec 9, 2013 at 6:47 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> On 12/09/2013 11:35 PM, Jim Nasby wrote:
>>>
>>> On 12/8/13 1:49 PM, Heikki Linnakangas wrote:
>>>>
>>>> On 12/08/2013 08:14 PM, Greg Stark wrote:
>>>>>
>>>>> The whole accounts table is 1.2GB and contains 10 million rows. As
>>>>> expected with rows_per_block set to 1 it reads 240MB of that
>>>>> containing nearly 2 million rows (and takes nearly 20s -- doing a full
>>>>> table scan for select count(*) only takes about 5s):
>>>>
>>>> One simple thing we could do, without or in addition to changing the
>>>> algorithm, is to issue posix_fadvise() calls for the blocks we're
>>>> going to read. It should at least be possible to match the speed of a
>>>> plain sequential scan that way.
>>>
>>> Hrm... maybe it wouldn't be very hard to use async IO here either? I'm
>>> thinking it wouldn't be very hard to do the stage 2 work in the callback
>>> routine...
>>
>> Yeah, other than the fact we have no infrastructure to do asynchronous I/O
>> anywhere in the backend. If we had that, then we could easily use it here. I
>> doubt it would be much better than posix_fadvising the blocks, though.
>
> Without patches to the kernel, it is much better.
>
> posix_fadvise interferes with read-ahead, so posix_fadvise on, say,
> bitmap heap scans (or similarly sorted analyze block samples) run at 1
> IO / block, ie horrible, whereas aio can do read coalescence and
> read-ahead when the kernel thinks it'll be profitable, significantly
> increasing IOPS. I've seen everything from a 2x to 10x difference.

How did you test that, given that we don't actually have an asynchronous
I/O implementation? I don't recall any recent patches floating around
either to do that. When Greg Stark investigated this back in 2007-2008
and implemented posix_fadvise() for bitmap heap scans, posix_fadvise
certainly gave a significant speedup on the test data he used. What kind
of a data distribution gives a slowdown like that?

I took a stab at using posix_fadvise() in ANALYZE. It turned out to be
very easy, patch attached. Your mileage may vary, but I'm seeing a nice
gain from this on my laptop. Taking a 30000 page sample of a table with
717717 pages (ie. slightly larger than RAM), ANALYZE takes about 6
seconds without the patch, and less than a second with the patch, with
effective_io_concurrency=10. If anyone with a good test data set loaded
would like to test this and post some numbers, that would be great.

- Heikki
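
For readers without the attachment, the general shape of the change is
something like the sketch below. This is not the actual patch, just an
illustration: the function and variable names are invented, and the real
work would sit in ANALYZE's block-sampling loop. It simply hints the
kernel about the next few sampled blocks before they are read.

#include <fcntl.h>

#define BLCKSZ 8192             /* assumed 8 kB block size */

/* Hint the kernel about upcoming sample blocks.  "fd" is the relation's
 * file descriptor and "blocks" a sorted array of sampled block numbers;
 * "depth" plays the role of effective_io_concurrency. */
static void
prefetch_sample_blocks(int fd, const long *blocks, int nblocks, int depth)
{
    int     i;

    for (i = 0; i < nblocks && i < depth; i++)
        (void) posix_fadvise(fd,
                             (off_t) blocks[i] * BLCKSZ,
                             BLCKSZ,
                             POSIX_FADV_WILLNEED);
}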


Re: ANALYZE sampling is too good

From
Claudio Freire
Date:
On Mon, Dec 9, 2013 at 8:14 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 12/09/2013 11:56 PM, Claudio Freire wrote:
>> Without patches to the kernel, it is much better.
>>
>> posix_fadvise interferes with read-ahead, so posix_fadvise on, say,
>> bitmap heap scans (or similarly sorted analyze block samples) run at 1
>> IO / block, ie horrible, whereas aio can do read coalescence and
>> read-ahead when the kernel thinks it'll be profitable, significantly
>> increasing IOPS. I've seen everything from a 2x to 10x difference.
>
>
> How did you test that, given that we don't actually have an asynchronous I/O
> implementation? I don't recall any recent patches floating around either to
> do that. When Greg Stark investigated this back in 2007-2008 and implemented
> posix_fadvise() for bitmap heap scans, posix_fadvise certainly gave a
> significant speedup on the test data he used. What kind of a data
> distribution gives a slowdown like that?

That's basically my summarized experience from working on this[0]
patch, and the feedback given there about competing AIO work.

Sequential I/O was the biggest issue. I had to actively avoid
fadvising on sequential I/O, or sequential-ish I/O, which was a huge
burden on fadvise logic.

>
> I took a stab at using posix_fadvise() in ANALYZE. It turned out to be very
> easy, patch attached. Your mileage may vary, but I'm seeing a nice gain from
> this on my laptop. Taking a 30000 page sample of a table with 717717 pages
> (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without the
> patch, and less than a second with the patch, with
> effective_io_concurrency=10. If anyone with a good test data set loaded
> would like to test this and post some numbers, that would be great.

Kernel version?

I raised this issue on LKML, and, while I got no news on this front,
they might have silently fixed it. I'd have to check the sources
again.

[0] http://www.postgresql.org/message-id/COL116-W162AEAA64173E77D4597EEA3670@phx.gbl



Re: ANALYZE sampling is too good

From
Heikki Linnakangas
Date:
Claudio Freire <klaussfreire@gmail.com> wrote:
>On Mon, Dec 9, 2013 at 8:14 PM, Heikki Linnakangas
><hlinnakangas@vmware.com> wrote:
>> I took a stab at using posix_fadvise() in ANALYZE. It turned out to
>be very
>> easy, patch attached. Your mileage may vary, but I'm seeing a nice
>gain from
>> this on my laptop. Taking a 30000 page sample of a table with 717717
>pages
>> (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without
>the
>> patch, and less than a second with the patch, with
>> effective_io_concurrency=10. If anyone with a good test data set
>loaded
>> would like to test this and post some numbers, that would be great.
>
>Kernel version?

3.12, from Debian experimental. With an SSD drive and btrfs filesystem.
Admittedly not your average database server setup, so it would be nice to
get more reports from others.

>I raised this issue on LKML, and, while I got no news on this front,
>they might have silently fixed it. I'd have to check the sources
>again.

Maybe. Or maybe the heuristic read ahead isn't significant/helpful, when
you're prefetching with posix_fadvise anyway.

- Heikki



Re: ANALYZE sampling is too good

From
Claudio Freire
Date:
On Mon, Dec 9, 2013 at 8:45 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Claudio Freire <klaussfreire@gmail.com> wrote:
>>On Mon, Dec 9, 2013 at 8:14 PM, Heikki Linnakangas
>><hlinnakangas@vmware.com> wrote:
>>> I took a stab at using posix_fadvise() in ANALYZE. It turned out to
>>be very
>>> easy, patch attached. Your mileage may vary, but I'm seeing a nice
>>gain from
>>> this on my laptop. Taking a 30000 page sample of a table with 717717
>>pages
>>> (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without
>>the
>>> patch, and less than a second with the patch, with
>>> effective_io_concurrency=10. If anyone with a good test data set
>>loaded
>>> would like to test this and post some numbers, that would be great.
>>
>>Kernel version?
>
> 3.12, from Debian experimental. With an SSD drive and btrfs filesystem.
> Admittedly not your average database server setup, so it would be nice to
> get more reports from others.


Yeah, read-ahead isn't relevant for SSD.



Re: ANALYZE sampling is too good

From
Tom Lane
Date:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> Maybe. Or maybe the heuristic read ahead isn't significant/helpful, when
> you're prefetching with posix_fadvise anyway.

Yeah.  If we're not reading consecutive blocks, readahead is unlikely
to do anything anyhow.

Claudio's comments do suggest that it might be a bad idea to issue a
posix_fadvise when the next block to be examined *is* adjacent to the
current one, though.
        regards, tom lane
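
In code terms, that suggestion amounts to something like the following
(again a sketch with invented names, not anyone's actual patch): skip the
hint whenever the next sampled block immediately follows the previous one,
so ordinary kernel read-ahead can still coalesce sequential runs.

#include <fcntl.h>

#define BLCKSZ 8192             /* assumed 8 kB block size */

static void
prefetch_nonadjacent_blocks(int fd, const long *blocks, int nblocks)
{
    int     i;

    for (i = 0; i < nblocks; i++)
    {
        /* Adjacent to the previous sample block: leave it to readahead. */
        if (i > 0 && blocks[i] == blocks[i - 1] + 1)
            continue;

        (void) posix_fadvise(fd, (off_t) blocks[i] * BLCKSZ, BLCKSZ,
                             POSIX_FADV_WILLNEED);
    }
}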



Re: ANALYZE sampling is too good

From
Mark Kirkwood
Date:
On 10/12/13 12:14, Heikki Linnakangas wrote:
>
>
> I took a stab at using posix_fadvise() in ANALYZE. It turned out to be 
> very easy, patch attached. Your mileage may vary, but I'm seeing a 
> nice gain from this on my laptop. Taking a 30000 page sample of a 
> table with 717717 pages (ie. slightly larger than RAM), ANALYZE takes 
> about 6 seconds without the patch, and less than a second with the 
> patch, with effective_io_concurrency=10. If anyone with a good test 
> data set loaded would like to test this and post some numbers, that 
> would be great.
>
>



I did a test run:

pgbench scale 2000 (pgbench_accounts approx 25GB).
postgres 9.4

i7 3.5Ghz Cpu
16GB Ram
500 GB Velociraptor 10K

(cold os and pg cache both runs)
Without patch:  ANALYZE pgbench_accounts    90s
With patch: ANALYZE pgbench_accounts  91s

So I'm essentially seeing no difference :-(

Regards

Mark



Re: ANALYZE sampling is too good

From
Tom Lane
Date:
Mark Kirkwood <mark.kirkwood@catalyst.net.nz> writes:
> I did a test run:

> pgbench scale 2000 (pgbench_accounts approx 25GB).
> postgres 9.4

> i7 3.5Ghz Cpu
> 16GB Ram
> 500 GB Velociraptor 10K

> (cold os and pg cache both runs)
> Without patch:  ANALYZE pgbench_accounts    90s
> With patch: ANALYZE pgbench_accounts  91s

> So I'm essentially seeing no difference :-(

What OS and filesystem?
        regards, tom lane



Re: ANALYZE sampling is too good

From
Mark Kirkwood
Date:
On 10/12/13 13:14, Mark Kirkwood wrote:
> On 10/12/13 12:14, Heikki Linnakangas wrote:
>>
>>
>> I took a stab at using posix_fadvise() in ANALYZE. It turned out to 
>> be very easy, patch attached. Your mileage may vary, but I'm seeing a 
>> nice gain from this on my laptop. Taking a 30000 page sample of a 
>> table with 717717 pages (ie. slightly larger than RAM), ANALYZE takes 
>> about 6 seconds without the patch, and less than a second with the 
>> patch, with effective_io_concurrency=10. If anyone with a good test 
>> data set loaded would like to test this and post some numbers, that 
>> would be great.
>>
>>
>
>
>
> I did a test run:
>
> pgbench scale 2000 (pgbench_accounts approx 25GB).
> postgres 9.4
>
> i7 3.5Ghz Cpu
> 16GB Ram
> 500 GB Velociraptor 10K
>
> (cold os and pg cache both runs)
> Without patch:  ANALYZE pgbench_accounts    90s
> With patch: ANALYZE pgbench_accounts  91s
>
> So I'm essentially seeing no difference :-(


Arrg - sorry forgot the important bits:

Ubuntu 13.10 (kernel 3.11.0-14)
filesystem is ext4




Re: ANALYZE sampling is too good

From
Mark Kirkwood
Date:
On 10/12/13 13:20, Mark Kirkwood wrote:
> On 10/12/13 13:14, Mark Kirkwood wrote:
>> On 10/12/13 12:14, Heikki Linnakangas wrote:
>>>
>>>
>>> I took a stab at using posix_fadvise() in ANALYZE. It turned out to 
>>> be very easy, patch attached. Your mileage may vary, but I'm seeing 
>>> a nice gain from this on my laptop. Taking a 30000 page sample of a 
>>> table with 717717 pages (ie. slightly larger than RAM), ANALYZE 
>>> takes about 6 seconds without the patch, and less than a second with 
>>> the patch, with effective_io_concurrency=10. If anyone with a good 
>>> test data set loaded would like to test this and post some numbers, 
>>> that would be great.
>>>
>>>
>>
>>
>>
>> I did a test run:
>>
>> pgbench scale 2000 (pgbench_accounts approx 25GB).
>> postgres 9.4
>>
>> i7 3.5Ghz Cpu
>> 16GB Ram
>> 500 GB Velociraptor 10K
>>
>> (cold os and pg cache both runs)
>> Without patch:  ANALYZE pgbench_accounts    90s
>> With patch: ANALYZE pgbench_accounts  91s
>>
>> So I'm essentially seeing no difference :-(
>
>
> Arrg - sorry forgot the important bits:
>
> Ubuntu 13.10 (kernel 3.11.0-14)
> filesystem is ext4
>
>
>

Doing the same test as above, but on a 80GB Intel 520 (ext4 filesystem 
mounted with discard):

(cold os and pg cache both runs)
Without patch:  ANALYZE pgbench_accounts  5s
With patch: ANALYZE pgbench_accounts  5s






Re: ANALYZE sampling is too good

From
Craig Ringer
Date:
On 12/06/2013 09:52 AM, Peter Geoghegan wrote:
> Has anyone ever thought about opportunistic ANALYZE piggy-backing on
> other full-table scans? That doesn't really help Greg, because his
> complaint is mostly that a fresh ANALYZE is too expensive, but it
> could be an interesting, albeit risky approach.

It'd be particularly interesting, IMO, if autovacuum was able to do it
using the synchronized scans machinery, so the analyze still ran in a
separate proc and didn't impact the user's query. Have an autovac worker
or two waiting for new scans to start on tables they need to analyze,
and if one doesn't start within a reasonable time frame they start their
own scan.

We've seen enough issues with hint-bit setting causing unpredictable
query times for user queries that I wouldn't want to add another "might
write during a read-only query" behaviour.

--
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: ANALYZE sampling is too good

From
Craig Ringer
Date:
On 12/10/2013 05:20 AM, Jim Nasby wrote:
>>
> 
> FWIW, if synchronize_seqscans is on I'd think it'd be pretty easy to
> fire up a 2nd backend to do the ANALYZE portion (or perhaps use Robert's
> fancy new shared memory stuff).

Apologies for posting the same as a new idea before I saw your post. I
need to finish the thread before posting.

--
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: ANALYZE sampling is too good

From
Mark Kirkwood
Date:
On 10/12/13 13:53, Mark Kirkwood wrote:
> On 10/12/13 13:20, Mark Kirkwood wrote:
>> On 10/12/13 13:14, Mark Kirkwood wrote:
>>> On 10/12/13 12:14, Heikki Linnakangas wrote:
>>>>
>>>>
>>>> I took a stab at using posix_fadvise() in ANALYZE. It turned out to 
>>>> be very easy, patch attached. Your mileage may vary, but I'm seeing 
>>>> a nice gain from this on my laptop. Taking a 30000 page sample of a 
>>>> table with 717717 pages (ie. slightly larger than RAM), ANALYZE 
>>>> takes about 6 seconds without the patch, and less than a second 
>>>> with the patch, with effective_io_concurrency=10. If anyone with a 
>>>> good test data set loaded would like to test this and post some 
>>>> numbers, that would be great.
>>>>
>>>>
>>>
>>>
>>>
>>> I did a test run:
>>>
>>> pgbench scale 2000 (pgbench_accounts approx 25GB).
>>> postgres 9.4
>>>
>>> i7 3.5Ghz Cpu
>>> 16GB Ram
>>> 500 GB Velociraptor 10K
>>>
>>> (cold os and pg cache both runs)
>>> Without patch:  ANALYZE pgbench_accounts    90s
>>> With patch: ANALYZE pgbench_accounts  91s
>>>
>>> So I'm essentially seeing no difference :-(
>>
>>
>> Arrg - sorry forgot the important bits:
>>
>> Ubuntu 13.10 (kernel 3.11.0-14)
>> filesystem is ext4
>>
>>
>>
>
> Doing the same test as above, but on a 80GB Intel 520 (ext4 filesystem 
> mounted with discard):
>
> (cold os and pg cache both runs)
> Without patch:  ANALYZE pgbench_accounts  5s
> With patch: ANALYZE pgbench_accounts  5s
>
>
>
>
>

Redoing the filesystem on the 520 as btrfs didn't seem to make any 
difference either:

(cold os and pg cache both runs)
Without patch:  ANALYZE pgbench_accounts  6.4s
With patch: ANALYZE pgbench_accounts  6.4s





Re: ANALYZE sampling is too good

From
Mark Kirkwood
Date:
On 10/12/13 15:04, Mark Kirkwood wrote:
> On 10/12/13 13:53, Mark Kirkwood wrote:
>> On 10/12/13 13:20, Mark Kirkwood wrote:
>>> On 10/12/13 13:14, Mark Kirkwood wrote:
>>>> On 10/12/13 12:14, Heikki Linnakangas wrote:
>>>>>
>>>>>
>>>>> I took a stab at using posix_fadvise() in ANALYZE. It turned out 
>>>>> to be very easy, patch attached. Your mileage may vary, but I'm 
>>>>> seeing a nice gain from this on my laptop. Taking a 30000 page 
>>>>> sample of a table with 717717 pages (ie. slightly larger than 
>>>>> RAM), ANALYZE takes about 6 seconds without the patch, and less 
>>>>> than a second with the patch, with effective_io_concurrency=10. If 
>>>>> anyone with a good test data set loaded would like to test this 
>>>>> and post some numbers, that would be great.
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> I did a test run:
>>>>
>>>> pgbench scale 2000 (pgbench_accounts approx 25GB).
>>>> postgres 9.4
>>>>
>>>> i7 3.5Ghz Cpu
>>>> 16GB Ram
>>>> 500 GB Velociraptor 10K
>>>>
>>>> (cold os and pg cache both runs)
>>>> Without patch:  ANALYZE pgbench_accounts    90s
>>>> With patch: ANALYZE pgbench_accounts  91s
>>>>
>>>> So I'm essentially seeing no difference :-(
>>>
>>>
>>> Arrg - sorry forgot the important bits:
>>>
>>> Ubuntu 13.10 (kernel 3.11.0-14)
>>> filesystem is ext4
>>>
>>>
>>>
>>
>> Doing the same test as above, but on a 80GB Intel 520 (ext4 
>> filesystem mounted with discard):
>>
>> (cold os and pg cache both runs)
>> Without patch:  ANALYZE pgbench_accounts  5s
>> With patch: ANALYZE pgbench_accounts  5s
>>
>>
>>
>>
>>
>
> Redoing the filesystem on the 520 as btrfs didn't seem to make any 
> difference either:
>
> (cold os and pg cache both runs)
> Without patch:  ANALYZE pgbench_accounts  6.4s
> With patch: ANALYZE pgbench_accounts  6.4s
>
>
>

Ah - I have just realized I was not setting effective_io_concurrency - 
so I'll redo the test. - Apologies.

Regards

Mark




Re: ANALYZE sampling is too good

From
Mark Kirkwood
Date:
On 10/12/13 15:11, Mark Kirkwood wrote:
> On 10/12/13 15:04, Mark Kirkwood wrote:
>> On 10/12/13 13:53, Mark Kirkwood wrote:
>>> On 10/12/13 13:20, Mark Kirkwood wrote:
>>>> On 10/12/13 13:14, Mark Kirkwood wrote:
>>>>> On 10/12/13 12:14, Heikki Linnakangas wrote:
>>>>>>
>>>>>>
>>>>>> I took a stab at using posix_fadvise() in ANALYZE. It turned out 
>>>>>> to be very easy, patch attached. Your mileage may vary, but I'm 
>>>>>> seeing a nice gain from this on my laptop. Taking a 30000 page 
>>>>>> sample of a table with 717717 pages (ie. slightly larger than 
>>>>>> RAM), ANALYZE takes about 6 seconds without the patch, and less 
>>>>>> than a second with the patch, with effective_io_concurrency=10. 
>>>>>> If anyone with a good test data set loaded would like to test 
>>>>>> this and post some numbers, that would be great.
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> I did a test run:
>>>>>
>>>>> pgbench scale 2000 (pgbench_accounts approx 25GB).
>>>>> postgres 9.4
>>>>>
>>>>> i7 3.5Ghz Cpu
>>>>> 16GB Ram
>>>>> 500 GB Velociraptor 10K
>>>>>
>>>>> (cold os and pg cache both runs)
>>>>> Without patch:  ANALYZE pgbench_accounts    90s
>>>>> With patch: ANALYZE pgbench_accounts  91s
>>>>>
>>>>> So I'm essentially seeing no difference :-(
>>>>
>>>>
>>>> Arrg - sorry forgot the important bits:
>>>>
>>>> Ubuntu 13.10 (kernel 3.11.0-14)
>>>> filesystem is ext4
>>>>
>>>>
>>>>
>>>
>>> Doing the same test as above, but on a 80GB Intel 520 (ext4 
>>> filesystem mounted with discard):
>>>
>>> (cold os and pg cache both runs)
>>> Without patch:  ANALYZE pgbench_accounts  5s
>>> With patch: ANALYZE pgbench_accounts  5s
>>>
>>>
>>>
>>>
>>>
>>
>> Redoing the filesystem on the 520 as btrfs didn't seem to make any 
>> difference either:
>>
>> (cold os and pg cache both runs)
>> Without patch:  ANALYZE pgbench_accounts  6.4s
>> With patch: ANALYZE pgbench_accounts  6.4s
>>
>>
>>
>
> Ah - I have just realized I was not setting effective_io_concurrency - 
> so I'll redo the test. - Apologies.
>
>

Redoing the test on the velociraptor gives me exactly the same numbers 
as before (effective_io_concurrency = 10 instead of 1).

Cheers

Mark




Re: ANALYZE sampling is too good

From
Josh Berkus
Date:
On 12/09/2013 02:37 PM, Robert Haas wrote:
> I've never seen an n_distinct value of more than 5 digits, regardless
> of reality.  Typically I've seen 20-50k, even if the real number is
> much higher.  But the n_distinct value is only for non-MCVs, so if we
> estimate the selectivity of column = 'rarevalue' to be
> (1-nullfrac-mcvfrac)/n_distinct, then making mcvfrac bigger reduces
> the estimate, and making the MCV list longer naturally makes mcvfrac
> bigger.  I'm not sure how important the
> less-frequent-than-the-least-common-MCV part is, but I'm very sure
> that raising the statistics target helps to solve the problem of
> overestimating the prevalence of uncommon values in a very big table.

I did an analysis of our ndistinct algorithm several years ago (~8.1),
and to sum up:

1. we take far too small of a sample to estimate ndistinct well for
tables larger than 100,000 rows.

2. the estimation algo we have chosen is one which tends to be wrong in
the downwards direction, rather strongly so.  That is, if we could
potentially have an ndistinct of 1000 to 100,000 based on the sample,
our algo estimates 1500 to 3000.

3. Other algos exist.  They tend to be wrong in other directions.

4. Nobody has done an analysis of whether it's worse, on average, to
estimate low vs. high for ndistinct.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
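
For context, the estimator in question is, as far as I recall, the Haas
and Stokes "Duj1" formula used in analyze.c:

    n_distinct = (n * d) / (n - f1 + f1 * n / N)

where n is the number of sampled rows, N the total row count, d the number
of distinct values in the sample and f1 the number of values seen exactly
once.  With an invented sample of n = 30,000 rows from N = 10,000,000,
d = 25,000 and f1 = 22,000, that gives 750,000,000 / (8,000 + 66), or
roughly 93,000, even if the true number of distinct values is in the
millions -- which is the downward bias being described.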



Re: ANALYZE sampling is too good

From
Mark Kirkwood
Date:
On 10/12/13 15:17, Mark Kirkwood wrote:
> On 10/12/13 15:11, Mark Kirkwood wrote:
>> On 10/12/13 15:04, Mark Kirkwood wrote:
>>> On 10/12/13 13:53, Mark Kirkwood wrote:
>>>> On 10/12/13 13:20, Mark Kirkwood wrote:
>>>>> On 10/12/13 13:14, Mark Kirkwood wrote:
>>>>>> On 10/12/13 12:14, Heikki Linnakangas wrote:
>>>>>>>
>>>>>>>
>>>>>>> I took a stab at using posix_fadvise() in ANALYZE. It turned out 
>>>>>>> to be very easy, patch attached. Your mileage may vary, but I'm 
>>>>>>> seeing a nice gain from this on my laptop. Taking a 30000 page 
>>>>>>> sample of a table with 717717 pages (ie. slightly larger than 
>>>>>>> RAM), ANALYZE takes about 6 seconds without the patch, and less 
>>>>>>> than a second with the patch, with effective_io_concurrency=10. 
>>>>>>> If anyone with a good test data set loaded would like to test 
>>>>>>> this and post some numbers, that would be great.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> I did a test run:
>>>>>>
>>>>>> pgbench scale 2000 (pgbench_accounts approx 25GB).
>>>>>> postgres 9.4
>>>>>>
>>>>>> i7 3.5Ghz Cpu
>>>>>> 16GB Ram
>>>>>> 500 GB Velociraptor 10K
>>>>>>
>>>>>> (cold os and pg cache both runs)
>>>>>> Without patch:  ANALYZE pgbench_accounts    90s
>>>>>> With patch: ANALYZE pgbench_accounts  91s
>>>>>>
>>>>>> So I'm essentially seeing no difference :-(
>>>>>
>>>>>
>>>>> Arrg - sorry forgot the important bits:
>>>>>
>>>>> Ubuntu 13.10 (kernel 3.11.0-14)
>>>>> filesystem is ext4
>>>>>
>>>>>
>>>>>
>>>>
>>>> Doing the same test as above, but on a 80GB Intel 520 (ext4 
>>>> filesystem mounted with discard):
>>>>
>>>> (cold os and pg cache both runs)
>>>> Without patch:  ANALYZE pgbench_accounts  5s
>>>> With patch: ANALYZE pgbench_accounts  5s
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>> Redoing the filesystem on the 520 as btrfs didn't seem to make any 
>>> difference either:
>>>
>>> (cold os and pg cache both runs)
>>> Without patch:  ANALYZE pgbench_accounts  6.4s
>>> With patch: ANALYZE pgbench_accounts  6.4s
>>>
>>>
>>>
>>
>> Ah - I have just realized I was not setting effective_io_concurrency 
>> - so I'll redo the test. - Apologies.
>>
>>
>
> Redoing the test on the velociraptor gives me exactly the same numbers 
> as before (effective_io_concurrency = 10 instead of 1).
>
>

Good grief - repeating the test gave:

Without patch:  ANALYZE pgbench_accounts 90s
With patch: ANALYZE pgbench_accounts  42s

pretty consistent *now*. No idea what was going on in the 1st run (maybe 
I happened to have it running at the same time as a checkpoint)? Anyway 
will stop now before creating more confusion.

Regards

Mark




Re: ANALYZE sampling is too good

From
Mark Kirkwood
Date:
On 10/12/13 15:32, Mark Kirkwood wrote:
> On 10/12/13 15:17, Mark Kirkwood wrote:
>> On 10/12/13 15:11, Mark Kirkwood wrote:
>>> On 10/12/13 15:04, Mark Kirkwood wrote:
>>>> On 10/12/13 13:53, Mark Kirkwood wrote:
>>>>> On 10/12/13 13:20, Mark Kirkwood wrote:
>>>>>> On 10/12/13 13:14, Mark Kirkwood wrote:
>>>>>>> On 10/12/13 12:14, Heikki Linnakangas wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> I took a stab at using posix_fadvise() in ANALYZE. It turned 
>>>>>>>> out to be very easy, patch attached. Your mileage may vary, but 
>>>>>>>> I'm seeing a nice gain from this on my laptop. Taking a 30000 
>>>>>>>> page sample of a table with 717717 pages (ie. slightly larger 
>>>>>>>> than RAM), ANALYZE takes about 6 seconds without the patch, and 
>>>>>>>> less than a second with the patch, with 
>>>>>>>> effective_io_concurrency=10. If anyone with a good test data 
>>>>>>>> set loaded would like to test this and post some numbers, that 
>>>>>>>> would be great.
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I did a test run:
>>>>>>>
>>>>>>> pgbench scale 2000 (pgbench_accounts approx 25GB).
>>>>>>> postgres 9.4
>>>>>>>
>>>>>>> i7 3.5Ghz Cpu
>>>>>>> 16GB Ram
>>>>>>> 500 GB Velociraptor 10K
>>>>>>>
>>>>>>> (cold os and pg cache both runs)
>>>>>>> Without patch:  ANALYZE pgbench_accounts    90s
>>>>>>> With patch: ANALYZE pgbench_accounts  91s
>>>>>>>
>>>>>>> So I'm essentially seeing no difference :-(
>>>>>>
>>>>>>
>>>>>> Arrg - sorry forgot the important bits:
>>>>>>
>>>>>> Ubuntu 13.10 (kernel 3.11.0-14)
>>>>>> filesystem is ext4
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> Doing the same test as above, but on a 80GB Intel 520 (ext4 
>>>>> filesystem mounted with discard):
>>>>>
>>>>> (cold os and pg cache both runs)
>>>>> Without patch:  ANALYZE pgbench_accounts  5s
>>>>> With patch: ANALYZE pgbench_accounts  5s
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>> Redoing the filesystem on the 520 as btrfs didn't seem to make any 
>>>> difference either:
>>>>
>>>> (cold os and pg cache both runs)
>>>> Without patch:  ANALYZE pgbench_accounts  6.4s
>>>> With patch: ANALYZE pgbench_accounts  6.4s
>>>>
>>>>
>>>>
>>>
>>> Ah - I have just realized I was not setting effective_io_concurrency 
>>> - so I'll redo the test. - Apologies.
>>>
>>>
>>
>> Redoing the test on the velociraptor gives me exactly the same 
>> numbers as before (effective_io_concurrency = 10 instead of 1).
>>
>>
>
> Good grief - repeating the test gave:
>
> Without patch:  ANALYZE pgbench_accounts 90s
> With patch: ANALYZE pgbench_accounts  42s
>
> pretty consistent *now*. No idea what was going on in the 1st run 
> (maybe I happened to have it running at the same time as a 
> checkpoint)? Anyway will stop now before creating more confusion.
>
>

Just one more...

The Intel 520 with ext4:

Without patch:  ANALYZE pgbench_accounts 5s
With patch: ANALYZE pgbench_accounts  1s

And double checking -
With patch, but effective_io_concurrency = 1: ANALYZE pgbench_accounts  5s

These results look more like Heikki's. Which suggests more benefit on 
SSD than spinning disks. Some more data points (apart from mine) would 
be good to see tho.

Cheers

Mark







Re: ANALYZE sampling is too good

From
Claudio Freire
Date:
On Tue, Dec 10, 2013 at 12:13 AM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:
> Just one more...
>
> The Intel 520 with ext4:
>
>
> Without patch:  ANALYZE pgbench_accounts 5s
> With patch: ANALYZE pgbench_accounts  1s
>
> And double checking -
> With patch, but effective_io_concurrency = 1: ANALYZE pgbench_accounts  5s
>
> These results look more like Heikki's. Which suggests more benefit on SSD
> than spinning disks. Some more data points (apart from mine) would be good
> to see tho.


Assuming ANALYZE is sampling less than 5% of the table, I'd say
fadvising will always be a win.

I'd also suggest higher e_i_c for rotating media. Rotating media has
longer latencies, and e_i_c has to be computed in terms of latency,
rather than "spindles", when doing prefetch.

For backward index scans, I found the optimum number for a single
spindle to be about 20.



Re: ANALYZE sampling is too good

From
Mark Kirkwood
Date:
On 10/12/13 17:18, Claudio Freire wrote:
> On Tue, Dec 10, 2013 at 12:13 AM, Mark Kirkwood
> <mark.kirkwood@catalyst.net.nz> wrote:
>> Just one more...
>>
>> The Intel 520 with ext4:
>>
>>
>> Without patch:  ANALYZE pgbench_accounts 5s
>> With patch: ANALYZE pgbench_accounts  1s
>>
>> And double checking -
>> With patch, but effective_io_concurrency = 1: ANALYZE pgbench_accounts  5s
>>
>> These results look more like Heikki's. Which suggests more benefit on SSD
>> than spinning disks. Some more data points (apart from mine) would be good
>> to see tho.
>
> Assuming ANALYZE is sampling less than 5% of the table, I'd say
> fadvising will always be a win.
>
> I'd also suggest higher e_i_c for rotating media. Rotating media has
> longer latencies, and e_i_c has to be computed in terms of latency,
> rather than "spindles", when doing prefetch.
>
> For backward index scans, I found the optimum number for a single
> spindle to be about 20.

Yeah - was just fooling about with this on the velociraptor, looks like 
somewhere in the 20's works well.
                   | pgbench_accounts
eff_io_concurrency | analyze time (s)
-------------------+-----------------
8                  |  43
16                 |  40
24                 |  36
32                 |  35
64                 |  35

Cheers

Mark



Re: ANALYZE sampling is too good

From
Albe Laurenz
Date:
Greg Stark wrote:
>> It's also applicable for the other stats; histogram buckets constructed
>> from a 5% sample are more likely to be accurate than those constructed
>> from a 0.1% sample.   Same with nullfrac.  The degree of improved
>> accuracy, would, of course, require some math to determine.
> 
> This "some math" is straightforward basic statistics.  The 95th
> percentile confidence interval for a sample consisting of 300 samples
> from a population of a 1 million would be 5.66%. A sample consisting
> of 1000 samples would have a 95th percentile confidence interval of
> +/- 3.1%.

Doesn't all that assume a normally distributed random variable?

I don't think it can be applied to database table contents
without further analysis.

Yours,
Laurenz Albe
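
For reference, the figures Greg quotes follow from the usual worst-case
margin-of-error formula for a sampled proportion, z * sqrt(p*(1-p)/n) with
p = 0.5 and z = 1.96 for 95% confidence, ignoring the finite-population
correction:

    1.96 * sqrt(0.25 / 300)  = 0.0566   (+/- 5.66%)
    1.96 * sqrt(0.25 / 1000) = 0.031    (+/- 3.1%)

p = 0.5 is the worst case for estimating a proportion such as nullfrac or
an MCV frequency; it says nothing by itself about statistics like
ndistinct.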

Re: ANALYZE sampling is too good

From
Greg Stark
Date:
On 10 Dec 2013 08:28, "Albe Laurenz" <laurenz.albe@wien.gv.at> wrote:
>
> Doesn't all that assume a normally distributed random variable?

I don't think so because of the law of large numbers. If you have a large
population and sample it, the sample behaves like a normal distribution
even if the distribution of the population isn't.

Re: ANALYZE sampling is too good

From
Albe Laurenz
Date:
Greg Stark wrote:
>> Doesn't all that assume a normally distributed random variable?

> I don't think so because of the law of large numbers. If you have a large population and sample it the
> sample behaves like a normal distribution when if the distribution of the population isn't.

Statistics is the part of mathematics I know least of, but aren't
you saying that in a large enough sample of people there will
always be some with age < 0 (which is what a normal distribution
would imply)?

Yours,
Laurenz Albe

Re: ANALYZE sampling is too good

From
Claudio Freire
Date:
On Tue, Dec 10, 2013 at 11:02 AM, Greg Stark <stark@mit.edu> wrote:
>
> On 10 Dec 2013 08:28, "Albe Laurenz" <laurenz.albe@wien.gv.at> wrote:
>>
>>
>> Doesn't all that assume a normally distributed random variable?
>
> I don't think so because of the law of large numbers. If you have a large
> population and sample it the sample behaves like a normal distribution when
> if the distribution of the population isn't.


No, the large population says that if you have an AVERAGE of many
samples of a random variable, the random variable that is the AVERAGE
behaves like a normal.

The variable itself doesn't.

And for n_distinct, you need to know the variable itself.



Re: ANALYZE sampling is too good

From
Claudio Freire
Date:
On Tue, Dec 10, 2013 at 11:32 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Tue, Dec 10, 2013 at 11:02 AM, Greg Stark <stark@mit.edu> wrote:
>>
>> On 10 Dec 2013 08:28, "Albe Laurenz" <laurenz.albe@wien.gv.at> wrote:
>>>
>>>
>>> Doesn't all that assume a normally distributed random variable?
>>
>> I don't think so because of the law of large numbers. If you have a large
>> population and sample it the sample behaves like a normal distribution when
>> if the distribution of the population isn't.
>
>
> No, the large population says that if you have an AVERAGE of many
> samples of a random variable, the random variable that is the AVERAGE
> behaves like a normal.


Sorry, that should be "No, the *law of large numbers* says..."



Re: ANALYZE sampling is too good

From
Simon Riggs
Date:
On 6 December 2013 09:21, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-12-05 17:52:34 -0800, Peter Geoghegan wrote:
>> Has anyone ever thought about opportunistic ANALYZE piggy-backing on
>> other full-table scans? That doesn't really help Greg, because his
>> complaint is mostly that a fresh ANALYZE is too expensive, but it
>> could be an interesting, albeit risky approach.
>
> What I've been thinking of is
>
> a) making it piggy back on scans vacuum is doing instead of doing
> separate ones all the time (if possible, analyze needs to be more
> frequent). Currently with quite some likelihood the cache will be gone
> again when revisiting.

> b) make analyze incremental. In lots of bigger tables most of the table
> is static - and we actually *do* know that, thanks to the vm. So keep a
> rawer form of what ends in the catalogs around somewhere, chunked by the
> region of the table the statistic is from. Everytime a part of the table
> changes, re-sample only that part. Then recompute the aggregate.

Piggy-backing sounds like a bad thing. If I run a query, I don't want
to be given some extra task thanks! Especially if we might need to run
data type code I'm not authorised to run, or to sample data I may not
be authorised to see. The only way that could work is to kick off an
autovacuum worker to run the ANALYZE as a separate process and then
use synchronous scan behaviour to derive benefit indirectly.

However, these things presume that we need to continue scanning most
of the blocks of the table, which I don't think needs to be the case.
There is a better way.

Back in 2005/6, I advocated a block sampling method, as described by
Chaudri et al (ref?)
That has two advantages

* We don't need to visit all of the blocks, reducing I/O

* It would also give better analysis of clustered data (its all in
that paper...)

The recent patch on TABLESAMPLE contained a rewrite of the guts of
ANALYZE into a generic sample scan, which would then be used as the
basis for a TABLESAMPLE BERNOULLI. A TABLESAMPLE SYSTEM could use a
block sample, as it does in some other DBMS (DB2, SQLServer).

While I was unimpressed with the TABLESAMPLE patch, all it needs is
some committer-love, so Greg...

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: ANALYZE sampling is too good

From
Andres Freund
Date:
On 2013-12-10 19:23:37 +0000, Simon Riggs wrote:
> On 6 December 2013 09:21, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2013-12-05 17:52:34 -0800, Peter Geoghegan wrote:
> >> Has anyone ever thought about opportunistic ANALYZE piggy-backing on
> >> other full-table scans? That doesn't really help Greg, because his
> >> complaint is mostly that a fresh ANALYZE is too expensive, but it
> >> could be an interesting, albeit risky approach.
> >
> > What I've been thinking of is
> >
> > a) making it piggy back on scans vacuum is doing instead of doing
> > separate ones all the time (if possible, analyze needs to be more
> > frequent). Currently with quite some likelihood the cache will be gone
> > again when revisiting.
> 
> > b) make analyze incremental. In lots of bigger tables most of the table
> > is static - and we actually *do* know that, thanks to the vm. So keep a
> > rawer form of what ends in the catalogs around somewhere, chunked by the
> > region of the table the statistic is from. Everytime a part of the table
> > changes, re-sample only that part. Then recompute the aggregate.
> 
> Piggy-backing sounds like a bad thing. If I run a query, I don't want
> to be given some extra task thanks! Especially if we might need to run
> data type code I'm not authorised to run, or to sample data I may not
> be authorised to see. The only way that could work is to kick off an
> autovacuum worker to run the ANALYZE as a separate process and then
> use synchronous scan behaviour to derive benefit indirectly.

I was suggesting to piggyback on VACUUM, not user queries. The latter
suggestion was somebody else.
In combination with incremental or chunk-wise building of statistics,
doing more frequent partial vacuums which re-compute the changing part
of the stats would be great. All those blocks have to be read anyway,
and we should be more agressive about vacuuming anyway.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: ANALYZE sampling is too good

From
Peter Geoghegan
Date:
On Tue, Dec 10, 2013 at 11:23 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> However, these things presume that we need to continue scanning most
> of the blocks of the table, which I don't think needs to be the case.
> There is a better way.

Do they? I think it's one opportunistic way of ameliorating the cost.

> Back in 2005/6, I advocated a block sampling method, as described by
> Chaudri et al (ref?)

I don't think that anyone believes that not doing block sampling is
tenable, fwiw. Clearly some type of block sampling would be preferable
for most or all purposes.


-- 
Peter Geoghegan



Re: ANALYZE sampling is too good

From
Simon Riggs
Date:
On 10 December 2013 19:49, Peter Geoghegan <pg@heroku.com> wrote:
> On Tue, Dec 10, 2013 at 11:23 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> However, these things presume that we need to continue scanning most
>> of the blocks of the table, which I don't think needs to be the case.
>> There is a better way.
>
> Do they? I think it's one opportunistic way of ameliorating the cost.
>
>> Back in 2005/6, I advocated a block sampling method, as described by
>> Chaudri et al (ref?)
>
> I don't think that anyone believes that not doing block sampling is
> tenable, fwiw. Clearly some type of block sampling would be preferable
> for most or all purposes.

If we have one way of reducing cost of ANALYZE, I'd suggest we don't
need 2 ways - especially if the second way involves the interaction of
otherwise not fully related parts of the code.

Or to put it clearly, lets go with block sampling and then see if that
needs even more work.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: ANALYZE sampling is too good

From
Josh Berkus
Date:
On 12/10/2013 11:49 AM, Peter Geoghegan wrote:
> On Tue, Dec 10, 2013 at 11:23 AM, Simon Riggs <simon@2ndquadrant.com> wrote: 
> I don't think that anyone believes that not doing block sampling is
> tenable, fwiw. Clearly some type of block sampling would be preferable
> for most or all purposes.

As discussed, we need math though.  Does anyone have an ACM subscription
and time to do a search?  Someone must.  We can buy one with community
funds, but no reason to do so if we don't have to.


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: ANALYZE sampling is too good

From
Greg Stark
Date:
On Tue, Dec 10, 2013 at 7:54 PM, Josh Berkus <josh@agliodbs.com> wrote:
> As discussed, we need math though.  Does anyone have an ACM subscription
> and time to do a search?  Someone must.  We can buy one with community
> funds, but no reason to do so if we don't have to.

Anyone in a university likely has access through their library.

But I don't really think this is the right way to go about this.
Research papers are going to turn up pretty specialized solutions that
are probably patented. We don't even have the basic understanding we
need. I suspect a basic textbook chapter on multistage sampling will
discuss at least the standard techniques.

Once we have a handle on the standard multistage sampling techniques
that would be safe from patents then we might want to go look at
research papers to find how they've been applied to databases in the
past but we would have to do that fairly carefully.

-- 
greg



Re: ANALYZE sampling is too good

From
Simon Riggs
Date:
On 10 December 2013 19:54, Josh Berkus <josh@agliodbs.com> wrote:
> On 12/10/2013 11:49 AM, Peter Geoghegan wrote:
>> On Tue, Dec 10, 2013 at 11:23 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> I don't think that anyone believes that not doing block sampling is
>> tenable, fwiw. Clearly some type of block sampling would be preferable
>> for most or all purposes.
>
> As discussed, we need math though.  Does anyone have an ACM subscription
> and time to do a search?  Someone must.  We can buy one with community
> funds, but no reason to do so if we don't have to.

We already have that, just use Vitter's algorithm at the block level
rather than the row level.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: ANALYZE sampling is too good

From
Peter Geoghegan
Date:
On Tue, Dec 10, 2013 at 11:59 AM, Greg Stark <stark@mit.edu> wrote:
> But I don't really think this is the right way to go about this.
> Research papers are going to turn up pretty specialized solutions that
> are probably patented. We don't even have the basic understanding we
> need. I suspect a basic textbook chapter on multistage sampling will
> discuss at least the standard techniques.

I agree that looking for information on block level sampling
specifically, and its impact on estimation quality is likely to not
turn up very much, and whatever it does turn up will have patent
issues.

-- 
Peter Geoghegan



Re: ANALYZE sampling is too good

From
Heikki Linnakangas
Date:
On 12/10/2013 10:00 PM, Simon Riggs wrote:
> On 10 December 2013 19:54, Josh Berkus <josh@agliodbs.com> wrote:
>> On 12/10/2013 11:49 AM, Peter Geoghegan wrote:
>>> On Tue, Dec 10, 2013 at 11:23 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> I don't think that anyone believes that not doing block sampling is
>>> tenable, fwiw. Clearly some type of block sampling would be preferable
>>> for most or all purposes.
>>
>> As discussed, we need math though.  Does anyone have an ACM subscription
>> and time to do a search?  Someone must.  We can buy one with community
>> funds, but no reason to do so if we don't have to.
>
> We already have that, just use Vitter's algorithm at the block level
> rather than the row level.

And what do you do with the blocks? How many blocks do you choose? 
Details, please.

- Heikki



Re: ANALYZE sampling is too good

From
Mark Kirkwood
Date:
On 11/12/13 09:19, Heikki Linnakangas wrote:
> On 12/10/2013 10:00 PM, Simon Riggs wrote:
>> On 10 December 2013 19:54, Josh Berkus <josh@agliodbs.com> wrote:
>>> On 12/10/2013 11:49 AM, Peter Geoghegan wrote:
>>>> On Tue, Dec 10, 2013 at 11:23 AM, Simon Riggs 
>>>> <simon@2ndquadrant.com> wrote:
>>>> I don't think that anyone believes that not doing block sampling is
>>>> tenable, fwiw. Clearly some type of block sampling would be preferable
>>>> for most or all purposes.
>>>
>>> As discussed, we need math though.  Does anyone have an ACM 
>>> subscription
>>> and time to do a search?  Someone must.  We can buy one with community
>>> funds, but no reason to do so if we don't have to.
>>
>> We already have that, just use Vitter's algorithm at the block level
>> rather than the row level.
>
> And what do you do with the blocks? How many blocks do you choose? 
> Details, please.
>
>

Yeah - and we seem to be back to Josh's point about needing 'some math' 
to cope with the rows within a block not being a purely random selection.

Regards

Mark






Re: ANALYZE sampling is too good

From
Josh Berkus
Date:
On 12/10/2013 01:33 PM, Mark Kirkwood wrote:
> Yeah - and we seem to be back to Josh's point about needing 'some math'
> to cope with the rows within a block not being a purely random selection.

Well, sometimes they are effectively random.  But sometimes they are
not.  The Chaudri et al paper had a formula for estimating randomness
based on the grouping of rows in each block, assuming that the sampled
blocks were widely spaced (if they aren't there's not much you can do).
This is where you get up to needing a 5% sample; you need to take
enough blocks that you're confident that the blocks you sampled are
representative of the population.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: ANALYZE sampling is too good

From
Greg Stark
Date:
On Tue, Dec 10, 2013 at 7:49 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Back in 2005/6, I advocated a block sampling method, as described by
>> Chaudri et al (ref?)
>
> I don't think that anyone believes that not doing block sampling is
> tenable, fwiw. Clearly some type of block sampling would be preferable
> for most or all purposes.

We do block sampling now. But then we select rows from those blocks uniformly.

Incidentally, if you mean Surajit Chaudhuri, he's a Microsoft Research
lead so I would be nervous about anything he's published being
patented by Microsoft.


-- 
greg



Re: ANALYZE sampling is too good

From
Jeff Janes
Date:
On Mon, Dec 9, 2013 at 2:37 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Dec 9, 2013 at 4:18 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> My reading of the code is that if it is not in the MCV, then it is assumed
>> to have the average selectivity (about 1/n_distinct, but deflating top and
>> bottom for the MCV list).  There is also a check that it is less than the
>> least common of the MCV, but I don't know why that situation would ever
>> prevail--that should always be higher or equal to the average selectivity.
>
> I've never seen an n_distinct value of more than 5 digits, regardless
> of reality.  Typically I've seen 20-50k, even if the real number is
> much higher.


I don't recall seeing an n_distinct that is literally above 100,000 in the wild, but I've seen negative ones that multiply with reltuples to give values more than that.  In test cases it is easy enough to get values in the millions by creating tables using floor(random()*$something).

create table baz as select floor(random()*10000000), md5(random()::text) from generate_series(1,100000000);
create table baz2 as select * from baz order by floor;
create table baz3 as select * from baz order by md5(floor::text);

baz unclustered, baz2 is clustered with perfect correlation, baz3 is clustered but without correlation.

After analyzing all of them:

select tablename, n_distinct,correlation  from pg_stats where tablename  like 'baz%' and attname='floor'  ;
 tablename | n_distinct  | correlation
-----------+-------------+-------------
 baz       | 8.56006e+06 |  0.00497713
 baz2      |      333774 |           1
 baz3      |      361048 |  -0.0118147

So baz is pretty close, while the other two are way off.  But the n_distinct computation doesn't depend on the order of the rows, as far as I can tell, but only the composition of the sample.  So this can only mean that our "random" sampling method is desperately non-random, right?

 
> But the n_distinct value is only for non-MCVs, so if we
> estimate the selectivity of column = 'rarevalue' to be
> (1-nullfrac-mcvfrac)/n_distinct, then making mcvfrac bigger reduces
> the estimate, and making the MCV list longer naturally makes mcvfrac
> bigger.

Ah, I see.  By including more things into MCV, we crowd out the rest of the space implicitly.


 
> I'm not sure how important the
> less-frequent-than-the-least-common-MCV part is, but I'm very sure
> that raising the statistics target helps to solve the problem of
> overestimating the prevalence of uncommon values in a very big table.
>
>> I think that parts of the planner are N^2 in the size of histogram (or was
>> that the size of the MCV list?).  So we would probably need a way to use a
>> larger sample size to get more accurate n_distinct and MCV frequencies, but
>> not save the entire histogram that goes with that sample size.
>
> I think the saving the histogram part is important.  As you say, the
> MCVs are important for a variety of planning purposes, such as hash
> joins.  More than that, in my experience, people with large tables are
> typically very willing to spend more planning time to get a better
> plan, because mistakes are expensive and the queries are likely to run
> for a while anyway.  People with small tables care about planning
> time, because it makes no sense to spend an extra 1ms planning a query
> unless you improve the plan by enough to save at least 1ms when
> executing it, and when the tables are small and access is expected to
> be fast anyway that's often not the case.

I would think that the dichotomy is more about the size of the query-plan than of the tables.  I think a lot of people with huge tables end up doing mostly indexed lookups in unique or highly selective indexes, once all the planning is done.

Does anyone have generators for examples of cases where increasing the sample size to get better histograms (as opposed more accurate n_distinct or more accurate MCV) was the key to fixing bad plans?  I wonder if it is more bins, or more accurate boundaries, that makes the difference.
 
Cheers,

Jeff

Re: ANALYZE sampling is too good

From
Jim Nasby
Date:
On 12/10/13 2:17 PM, Peter Geoghegan wrote:
> On Tue, Dec 10, 2013 at 11:59 AM, Greg Stark <stark@mit.edu> wrote:
>> But I don't really think this is the right way to go about this.
>> Research papers are going to turn up pretty specialized solutions that
>> are probably patented. We don't even have the basic understanding we
>> need. I suspect a basic textbook chapter on multistage sampling will
>> discuss at least the standard techniques.
>
> I agree that looking for information on block level sampling
> specifically, and its impact on estimation quality is likely to not
> turn up very much, and whatever it does turn up will have patent
> issues.

We have an entire analytics dept. at work that specializes in finding patterns in our data. I might be able to get some
timefrom them to at least provide some guidance here, if the community is interested. They could really only serve in a
consultingrole though.
 
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



Re: ANALYZE sampling is too good

From
Peter Geoghegan
Date:
On Tue, Dec 10, 2013 at 3:26 PM, Jim Nasby <jim@nasby.net> wrote:
>> I agree that looking for information on block level sampling
>> specifically, and its impact on estimation quality is likely to not
>> turn up very much, and whatever it does turn up will have patent
>> issues.
>
>
> We have an entire analytics dept. at work that specializes in finding
> patterns in our data. I might be able to get some time from them to at least
> provide some guidance here, if the community is interested. They could
> really only serve in a consulting role though.

I think that Greg had this right several years ago: it would probably
be very useful to have the input of someone with a strong background
in statistics. It doesn't seem that important that they already know a
lot about databases, provided they can understand what our constraints
are, and what is important to us. It might just be a matter of having
them point us in the right direction.


-- 
Peter Geoghegan



Re: ANALYZE sampling is too good

From
Simon Riggs
Date:
On 10 December 2013 23:43, Peter Geoghegan <pg@heroku.com> wrote:
> On Tue, Dec 10, 2013 at 3:26 PM, Jim Nasby <jim@nasby.net> wrote:
>>> I agree that looking for information on block level sampling
>>> specifically, and its impact on estimation quality is likely to not
>>> turn up very much, and whatever it does turn up will have patent
>>> issues.
>>
>>
>> We have an entire analytics dept. at work that specializes in finding
>> patterns in our data. I might be able to get some time from them to at least
>> provide some guidance here, if the community is interested. They could
>> really only serve in a consulting role though.
>
> I think that Greg had this right several years ago: it would probably
> be very useful to have the input of someone with a strong background
> in statistics. It doesn't seem that important that they already know a
> lot about databases, provided they can understand what our constraints
> are, and what is important to us. It might just be a matter of having
> them point us in the right direction.

err, so what does stats target mean exactly in statistical theory?
Waiting for a statistician, and confirming his credentials before you
believe him above others here, seems like wasted time.

What your statistician will tell you is that YMMV, depending on the data.

So we'll still need a parameter to fine tune things when the default
is off. We can argue about the default later, in various level of
rigour.

Block sampling, with parameter to specify sample size. +1

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: ANALYZE sampling is too good

From
Greg Stark
Date:
   On Wed, Dec 11, 2013 at 12:14 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Block sampling, with parameter to specify sample size. +1

Simon this is very frustrating. Can you define "block sampling"?

-- 
greg



Re: ANALYZE sampling is too good

From
Simon Riggs
Date:
On 11 December 2013 00:28, Greg Stark <stark@mit.edu> wrote:
>    On Wed, Dec 11, 2013 at 12:14 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Block sampling, with parameter to specify sample size. +1
>
> Simon this is very frustrating. Can you define "block sampling"?

Blocks selected using Vitter's algorithm, using a parameterised
fraction of the total.

When we select a block we should read all rows on that block, to help
identify the extent of clustering within the data.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
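
For anyone who hasn't seen the reference, the reservoir idea behind
Vitter's algorithm, applied to block numbers, looks roughly like the
sketch below.  Vitter's actual contribution is working out how far ahead
to skip instead of visiting every candidate, and the names here are
invented rather than taken from analyze.c.

#include <stdlib.h>

/* Keep a uniform random sample of sample_size block numbers out of
 * nblocks, visiting every candidate (plain Algorithm R). */
static void
sample_block_numbers(long nblocks, long *sample, long sample_size)
{
    long    b;

    for (b = 0; b < sample_size && b < nblocks; b++)
        sample[b] = b;                      /* fill the reservoir */

    for (; b < nblocks; b++)
    {
        long    j = random() % (b + 1);     /* modulo bias ignored here */

        if (j < sample_size)
            sample[j] = b;                  /* replace a random slot */
    }
}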



Re: ANALYZE sampling is too good

From
Greg Stark
Date:
On Wed, Dec 11, 2013 at 12:40 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> When we select a block we should read all rows on that block, to help
> identify the extent of clustering within the data.

So how do you interpret the results of the sample read that way that
doesn't introduce bias?

-- 
greg



Re: ANALYZE sampling is too good

From
Peter Geoghegan
Date:
On Tue, Dec 10, 2013 at 4:14 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> err, so what does stats target mean exactly in statistical theory?

Why would I even mention that to a statistician? We want guidance. But
yes, I bet I could give a statistician an explanation of statistics
target that they'd understand without too much trouble.

> Waiting for a statistician, and confirming his credentials before you
> believe him above others here, seems like wasted time.
>
> What your statistician will tell you is it that YMMV, depending on the data.

I'm reasonably confident that they'd give me more than that.

> So we'll still need a parameter to fine tune things when the default
> is off. We can argue about the default later, in various level of
> rigour.
>
> Block sampling, with parameter to specify sample size. +1

Again, it isn't as if the likely efficacy of *some* block sampling
approach is in question. I'm sure analyze.c is currently naive about
many things. Everyone knows that there are big gains to be had.
Balancing those gains against the possible downsides in terms of
impact on the quality of statistics generated is pretty nuanced. I do
know enough to know that a lot of thought goes into mitigating and/or
detecting the downsides of block-based sampling.

-- 
Peter Geoghegan



Re: ANALYZE sampling is too good

From
Simon Riggs
Date:
On 11 December 2013 00:44, Greg Stark <stark@mit.edu> wrote:
> On Wed, Dec 11, 2013 at 12:40 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> When we select a block we should read all rows on that block, to help
>> identify the extent of clustering within the data.
>
> So how do you interpret the results of the sample read that way that
> doesn't introduce bias?

Yes, it is not a perfect statistical sample. All sampling is subject
to an error that is data dependent.

I'm happy that we have an option to select this or not, and a default
that maintains current behaviour, since otherwise we might expect some
plan instability.

I would like to be able to

* allow ANALYZE to run faster in some cases
* increase/decrease sample size when it matters
* have the default sample size vary according to the size of the
table, i.e. a proportional sample

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: ANALYZE sampling is too good

From
"Sergey E. Koposov"
Date:
For what it's worth.

I'll quote Chaudhuri et al. first line from the abstract about the block 
sampling.
"Block-level sampling is far more efficient than true uniform-random 
sampling over a large database, but prone to  significant errors if used 
to create database statistics."

And after briefly glancing through the paper, my opinion is that it works 
because after making one version of statistics they cross-validate, see 
how well it goes and then collect more if the cross-validation error is 
large (for example because the data is clustered). Without this bit, as 
far as I can tell a simple block-based sampler will be bound to make 
catastrophic mistakes depending on the distribution.

Also, just another point about targets (e.g. X%) for estimating stuff from 
the samples (as discussed in the thread). Basically, there is a point in 
talking about sampling a fixed target (5%) of the data 
ONLY if you fix the actual distribution of your data in the table, and 
decide what statistic you are trying to find, e.g. average, std. dev., a 
90th percentile, ndistinct or a histogram and so forth. There won't be a 
general answer as the percentages will be distribution dependent and 
statistic dependent.

Cheers,    Sergey

PS I'm not a statistician, but I use statistics a lot

*******************************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/



Re: ANALYZE sampling is too good

From
Tom Lane
Date:
Peter Geoghegan <pg@heroku.com> writes:
> Again, it isn't as if the likely efficacy of *some* block sampling
> approach is in question. I'm sure analyze.c is currently naive about
> many things.

It's not *that* naive; this is already about a third-generation algorithm.
The last major revision (commit 9d6570b8a4) was to address problems with
misestimating the number of live tuples due to nonuniform tuple density
in a table.  IIRC, the previous code could be seriously misled if the
first few pages in the table were significantly non-representative of the
live-tuple density further on.  I'm not sure how we can significantly
reduce the number of blocks examined without re-introducing that hazard in
some form.  In particular, given that you want to see at least N tuples,
how many blocks will you read if you don't have an a-priori estimate of
tuple density?  You have to decide that before you start sampling blocks,
if you want all blocks to have the same probability of being selected
and you want to read them in sequence.
        regards, tom lane



Re: ANALYZE sampling is too good

From
Jeff Janes
Date:
On Tuesday, December 10, 2013, Simon Riggs wrote:
> On 11 December 2013 00:28, Greg Stark <stark@mit.edu> wrote:
>>    On Wed, Dec 11, 2013 at 12:14 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> Block sampling, with parameter to specify sample size. +1
>>
>> Simon this is very frustrating. Can you define "block sampling"?
>
> Blocks selected using Vitter's algorithm, using a parameterised
> fraction of the total.

OK, thanks for defining that.

We only need Vitter's algorithm when we don't know in advance how many items we are sampling from (such as for tuples--unless we want to rely on the previous estimate for the current round of analysis).  But for blocks, we do know how many there are, so there are simpler ways to pick them.
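
For illustration, a minimal standalone C sketch of one such simpler scheme - Knuth's selection sampling (Algorithm S) - not analyze.c code, just the idea: when the block count is known up front, a single sequential pass can pick exactly the wanted number of blocks, each with equal probability, with no reservoir replacement at all.

#include <stdio.h>
#include <stdlib.h>

/*
 * Select "wanted" block numbers uniformly at random from a relation of
 * "nblocks" blocks in one sequential pass (Knuth's Algorithm S).
 * Each block ends up chosen with probability wanted/nblocks, and exactly
 * "wanted" blocks are chosen.
 */
static void
sample_blocks(int nblocks, int wanted)
{
    int remaining = nblocks;    /* blocks not yet considered */
    int needed = wanted;        /* blocks still to be chosen */

    for (int blk = 0; blk < nblocks && needed > 0; blk++)
    {
        /* choose this block with probability needed / remaining */
        if ((double) rand() / ((double) RAND_MAX + 1) * remaining < needed)
        {
            printf("sample block %d\n", blk);
            needed--;
        }
        remaining--;
    }
}

int
main(void)
{
    srand(42);
    sample_blocks(1000, 30);    /* e.g. 30 blocks out of 1000 */
    return 0;
}
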
 

> When we select a block we should read all rows on that block, to help
> identify the extent of clustering within the data.

But we have no mechanism to store such information (or to use it if it were stored), nor even ways to prevent the resulting skew in the sample from seriously messing up the estimates which we do have ways of storing and using.

Cheers,

Jeff

Re: ANALYZE sampling is too good

From
Simon Riggs
Date:
On 11 December 2013 01:27, Sergey E. Koposov <math@sai.msu.ru> wrote:
> For what it's worth.
>
> I'll quote Chaudhuri et al. first line from the abstract about the block
> sampling.
> "Block-level sampling is far more efficient than true uniform-random
> sampling over a large database, but prone to  significant errors if used to
> create database statistics."

This glosses over the point that both SQLServer and Oracle use this technique.

> And after briefly glancing through the paper, my opinion is why it works is
> because after making one version of statistics they cross-validate, see how
> well it goes and then collect more if the cross-validation error is large
> (for example because the data is clustered). Without this bit, as far as I
> can a simply block based sampler will be bound to make catastrophic mistakes
> depending on the distribution

I don't think it's true that a block based sampler will be *bound* to
make "catastrophic mistakes". They can clearly happen, just as they
can with random samples, hence the need for a parameter to control the
sample size.

Realistically, I never heard of an Oracle DBA doing advanced
statistical mathematics before setting the sample size on ANALYZE. You
use the default and bump it up if the sample is insufficient for the
data.

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: ANALYZE sampling is too good

From
Peter Geoghegan
Date:
On Tue, Dec 10, 2013 at 10:34 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 11 December 2013 01:27, Sergey E. Koposov <math@sai.msu.ru> wrote:
>> For what it's worth.
>>
>> I'll quote Chaudhuri et al. first line from the abstract about the block
>> sampling.
>> "Block-level sampling is far more efficient than true uniform-random
>> sampling over a large database, but prone to  significant errors if used to
>> create database statistics."
>
> This glosses over the point that both SQLServer and Oracle use this technique.

That seems like an unusual omission for Microsoft Research to have made.

I didn't read that paper, because undoubtedly it's all patented. But
before I figured that out, after finding it on Google randomly, I did
read the first couple of paragraphs, which more or less said "what
follows - the entire paper - is an explanation as to why it's okay
that we do block sampling".


-- 
Peter Geoghegan



Re: ANALYZE sampling is too good

From
Hannu Krosing
Date:
On 12/11/2013 01:44 AM, Greg Stark wrote:
> On Wed, Dec 11, 2013 at 12:40 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> When we select a block we should read all rows on that block, to help
>> identify the extent of clustering within the data.
> So how do you interpret the results of the sample read that way that
> doesn't introduce bias?
>
Initially/experimentally we could just compare it to our current approach :)

That is, implement *some* block sampling and then check it against what
we currently have. Then figure out the bad differences. Rinse. Repeat.

Cheers

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ




Re: ANALYZE sampling is too good

From
Mark Kirkwood
Date:
On 11/12/13 19:34, Simon Riggs wrote:
>
> Realistically, I never heard of an Oracle DBA doing advanced
> statistical mathematics before setting the sample size on ANALYZE. You
> use the default and bump it up if the sample is insufficient for the
> data.
>

I'm not sure that Oracle's stats and optimizer design is an example to 
be envied - pretty much all Oracle DBAs I've encountered will apply 
hints to all queries to get the plan they want...

Regards

Mark



Re: ANALYZE sampling is too good

From
Greg Stark
Date:
On Wed, Dec 11, 2013 at 12:58 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

> Yes, it is not a perfect statistical sample. All sampling is subject
> to an error that is data dependent.

Well there's random variation due to the limitations of dealing with a
sample. And then there's systemic biases due to incorrect algorithms.
You wouldn't be happy if the samples discarded every row with NULLs or
every row older than some date etc. These things would not be
corrected by larger samples. That's the kind of "error" we're talking
about here.

But the more I think about things the less convinced I am that there
is a systemic bias introduced by reading the entire block. I had
assumed larger rows would be selected against, but that's not really
true; they're just selected against relative to the number of bytes
they occupy, which is the correct frequency to sample.

Even blocks that are mostly empty don't really bias things. Picture a
table that consists of 100 blocks with 100 rows each (value A) and
another 100 blocks with only 1 row each (value B). The rows with value
B have a 50% chance of being in any given block, which looks grossly
inflated; however, each selected block with value A will produce 100
rows. So if you sample 10 blocks of each kind you'll get 10x100 A rows
and 10x1 B rows, which is the correct 100:1 proportion.
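
In expectation, for 10 blocks drawn uniformly from all 200, the same conclusion holds:

$$ E[\text{A rows sampled}] = 10 \cdot \tfrac{100}{200} \cdot 100 = 500,
\qquad
E[\text{B rows sampled}] = 10 \cdot \tfrac{100}{200} \cdot 1 = 5, $$

so the expected sample ratio is $500 : 5 = 100 : 1$, matching the true ratio $10000 : 100$.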

I'm not actually sure there is any systemic bias here. The larger
number of rows per block generates less precise results, but from my
thought experiments they still seem to be accurate?



Re: ANALYZE sampling is too good

From
Greg Stark
Date:
On Wed, Dec 11, 2013 at 11:01 AM, Greg Stark <stark@mit.edu> wrote:
> I'm not actually sure there is any systemic bias here. The larger
> number of rows per block generate less precise results but from my
> thought experiments they seem to still be accurate?

So I've done some empirical tests for a table generated by:
create table sizeskew as (select i,j,repeat('i',i) from
generate_series(1,1000) as i, generate_series(1,1000) as j);

I find that using the whole block doesn't cause any problem with the
avg_width field for the "repeat" column. That does reinforce my belief
that we might not need any particularly black magic here.

It does however cause a systemic error in the histogram bounds. It
seems the median is systematically overestimated by more and more the
larger the number of rows per block are used:

1: 524
4: 549
8: 571
12: 596
16: 602
20: 618 (total sample slightly smaller than normal)
30: 703 (substantially smaller sample)

So there is something clearly wonky in the histogram stats that's
affected by the distribution of the sample. The only thing I can think
of is maybe the most common elements are being selected preferentially
from the early part of the sample which is removing a substantial part
of the lower end of the range. But even removing 100 from the
beginning shouldn't be enough to push the median above 550.

-- 
greg



Re: ANALYZE sampling is too good

From
Greg Stark
Date:
On Wed, Dec 11, 2013 at 12:08 PM, Greg Stark <stark@mit.edu> wrote:
> The only thing I can think
> of is maybe the most common elements are being selected preferentially
> from the early part of the sample which is removing a substantial part
> of the lower end of the range. But even removing 100 from the
> beginning shouldn't be enough to push the median above 550.

Just to follow up here. I think what's going on is that not only are the
most_common_vals being preferentially taken from the beginning of the
sample but also their frequency is being massively overestimated. All
values have a frequency of about .001 but the head of the MCV has a
frequency as high as .10 in some of my tests.



-- 
greg



Re: ANALYZE sampling is too good

From
Heikki Linnakangas
Date:
On 12/11/2013 02:08 PM, Greg Stark wrote:
> On Wed, Dec 11, 2013 at 11:01 AM, Greg Stark <stark@mit.edu> wrote:
>> I'm not actually sure there is any systemic bias here. The larger
>> number of rows per block generate less precise results but from my
>> thought experiments they seem to still be accurate?
>
> So I've done some empirical tests for a table generated by:
> create table sizeskew as (select i,j,repeat('i',i) from
> generate_series(1,1000) as i, generate_series(1,1000) as j);
>
> I find that using the whole block doesn't cause any problem with the
> avg_width field for the "repeat" column.That does reinforce my belief
> that we might not need any particularly black magic here.

How large a sample did you use? Remember that the point of doing 
block-level sampling instead of the current approach would be to allow 
using a significantly smaller sample (in # of blocks), and still achieve 
the same sampling error. If the sample is "large enough", it will mask 
any systemic bias caused by block-sampling, but the point is to reduce 
the number of sampled blocks.

The practical question here is this: What happens to the quality of the 
statistics if you only read 1/2 the number of blocks than you normally 
would, but included all the rows in the blocks we read in the sample? 
How about 1/10 ?

Or to put it another way: could we achieve more accurate statistics by 
including all rows from the sampled blocks, while reading the same number 
of blocks? In particular, I wonder if it would help with estimating 
ndistinct. It generally helps to have a larger sample for ndistinct 
estimation, so it might be beneficial.

- Heikki



Re: ANALYZE sampling is too good

From
Florian Pflug
Date:
On Dec10, 2013, at 15:32 , Claudio Freire <klaussfreire@gmail.com> wrote:
> On Tue, Dec 10, 2013 at 11:02 AM, Greg Stark <stark@mit.edu> wrote:
>>
>> On 10 Dec 2013 08:28, "Albe Laurenz" <laurenz.albe@wien.gv.at> wrote:
>>>
>>>
>>> Doesn't all that assume a normally distributed random variable?
>>
>> I don't think so because of the law of large numbers. If you have a large
>> population and sample it the sample behaves like a normal distribution when
>> if the distribution of the population isn't.
>
> No, the large population says that if you have an AVERAGE of many
> samples of a random variable, the random variable that is the AVERAGE
> behaves like a normal.

Actually, that's the central limit theorem, and it doesn't hold for all
random variables, only for those with finite expected value and variance.

The law of large numbers, in contrast, only tells you that the AVERAGE of
n samples of a random variable will converge to the random variables'
expected value as n goes to infinity (there are different versions of the
law which guarantee different kinds of convergence, weak or strong).
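
For reference, the two textbook statements being distinguished here: for i.i.d. $X_1, X_2, \ldots$ with $\mathbb{E}[X_i] = \mu$, the law of large numbers says

$$ \bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i \;\longrightarrow\; \mu \qquad (n \to \infty), $$

while the central limit theorem additionally requires $\operatorname{Var}(X_i) = \sigma^2 < \infty$ and says

$$ \sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma} \;\xrightarrow{\;d\;}\; \mathcal{N}(0, 1). $$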

best regards,
Florian Pflug




Re: ANALYZE sampling is too good

From
Simon Riggs
Date:
On 11 December 2013 12:08, Greg Stark <stark@mit.edu> wrote:

> So there is something clearly wonky in the histogram stats that's
> affected by the distribution of the sample.

...in the case where the avg width changes in a consistent manner
across the table.

Well spotted.

ISTM we can have a specific cross check for bias in the sample of that
nature. We just calculate the avg width per block and then check for
correlation of the avg width against block number. If we find bias we
can calculate how many extra blocks to sample and from where.

There may be other biases also, so we can check for them and respond
accordingly.
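
A rough standalone C sketch of that cross-check (illustration only; how the per-block averages are gathered is left out, and none of this is existing analyze.c code): compute the average tuple width seen on each sampled block and the Pearson correlation of that against block number; a correlation well away from zero suggests width varies systematically across the table and the sample should be extended. The same shape of check could be run for other per-block summaries.

#include <math.h>
#include <stdio.h>

/*
 * Given, for each sampled block, its block number and the average tuple
 * width observed on it, return the Pearson correlation between the two.
 * A value far from zero hints that width varies systematically across
 * the table and the sample may need to be widened.
 */
static double
width_blockno_correlation(const double *blockno, const double *avgwidth, int n)
{
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;

    for (int i = 0; i < n; i++)
    {
        sx += blockno[i];
        sy += avgwidth[i];
        sxx += blockno[i] * blockno[i];
        syy += avgwidth[i] * avgwidth[i];
        sxy += blockno[i] * avgwidth[i];
    }

    double cov = sxy - sx * sy / n;
    double vx = sxx - sx * sx / n;
    double vy = syy - sy * sy / n;

    if (vx <= 0 || vy <= 0)
        return 0.0;             /* degenerate: no variation */
    return cov / sqrt(vx * vy);
}

int
main(void)
{
    /* toy data: width grows with block number, so correlation is near +1 */
    double blk[] = {0, 100, 200, 300, 400};
    double width[] = {40, 55, 70, 90, 110};

    printf("r = %.3f\n", width_blockno_correlation(blk, width, 5));
    return 0;
}
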

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: ANALYZE sampling is too good

From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> So I've done some empirical tests for a table generated by:
> create table sizeskew as (select i,j,repeat('i',i) from
> generate_series(1,1000) as i, generate_series(1,1000) as j);

> I find that using the whole block doesn't cause any problem with the
> avg_width field for the "repeat" column.That does reinforce my belief
> that we might not need any particularly black magic here.

> It does however cause a systemic error in the histogram bounds. It
> seems the median is systematically overestimated by more and more the
> larger the number of rows per block are used:

Hm.  You can only take N rows from a block if there actually are at least
N rows in the block.  So the sampling rule I suppose you are using is
"select up to N rows from each sampled block" --- and that is going to
favor the contents of blocks containing narrower-than-average rows.

Now in this case, it looks like that ought to favor rows with *smaller* i
values, but you say the median goes up not down.  So I'm not sure what's
going on.  I thought at first that TOAST compression might be part of the
explanation, but TOAST shouldn't kick in on rows with raw representation
narrower than 2KB.

Did you do a run with no upper limit on the number of rows per block?
Because I'm not sure that tests with a limit in place are a good guide
to what happens without it.
        regards, tom lane



Re: ANALYZE sampling is too good

From
Tom Lane
Date:
I wrote:
> Hm.  You can only take N rows from a block if there actually are at least
> N rows in the block.  So the sampling rule I suppose you are using is
> "select up to N rows from each sampled block" --- and that is going to
> favor the contents of blocks containing narrower-than-average rows.

Oh, no, wait: that's backwards.  (I plead insufficient caffeine.)
Actually, this sampling rule discriminates *against* blocks with
narrower rows.  You previously argued, correctly I think, that
sampling all rows on each page introduces no new bias because row
width cancels out across all sampled pages.  However, if you just
include up to N rows from each page, then rows on pages with more
than N rows have a lower probability of being selected, but there's
no such bias against wider rows.  This explains why you saw smaller
values of "i" being undersampled.

Had you run the test series all the way up to the max number of
tuples per block, which is probably a couple hundred in this test,
I think you'd have seen the bias go away again.  But the takeaway
point is that we have to sample all tuples per page, not just a
limited number of them, if we want to change it like this.
        regards, tom lane
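
Stated explicitly, the bias Tom describes is this: under a "sample up to N rows per selected page" rule, a tuple on a page holding $k$ live tuples is included, given that its page was selected, with probability

$$ P(\text{tuple sampled} \mid \text{page sampled}) = \frac{\min(N, k)}{k}, $$

which is 1 for pages with $k \le N$ but falls off as $1/k$ on denser pages, under-representing rows that live on pages with many (typically narrower) rows. Taking every row on a selected page makes the factor identically 1.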



Re: ANALYZE sampling is too good

From
Gavin Flower
Date:
On 12/12/13 06:22, Tom Lane wrote:
> I wrote:
>> Hm.  You can only take N rows from a block if there actually are at least
>> N rows in the block.  So the sampling rule I suppose you are using is
>> "select up to N rows from each sampled block" --- and that is going to
>> favor the contents of blocks containing narrower-than-average rows.
> Oh, no, wait: that's backwards.  (I plead insufficient caffeine.)
> Actually, this sampling rule discriminates *against* blocks with
> narrower rows.  You previously argued, correctly I think, that
> sampling all rows on each page introduces no new bias because row
> width cancels out across all sampled pages.  However, if you just
> include up to N rows from each page, then rows on pages with more
> than N rows have a lower probability of being selected, but there's
> no such bias against wider rows.  This explains why you saw smaller
> values of "i" being undersampled.
>
> Had you run the test series all the way up to the max number of
> tuples per block, which is probably a couple hundred in this test,
> I think you'd have seen the bias go away again.  But the takeaway
> point is that we have to sample all tuples per page, not just a
> limited number of them, if we want to change it like this.
>
>             regards, tom lane
>
>
Surely we want to sample a 'constant fraction' (obviously, in practice 
you have to sample an integral number of rows in a page!) of rows per 
page? The simplest way, as Tom suggests, is to use all the rows in a page.

However, if you wanted the same number of rows from a greater number of 
pages, you could (for example) select a quarter of the rows from each 
page.  In which case, when this is a fractional number: take the 
integral number of rows, plus one extra row with a probability equal to 
the fraction (here 0.25).

Either way, if it is determined that you need N rows, then keep 
selecting pages at random (but never use the same page more than once) 
until you have at least N rows.


Cheers,
Gavin




Re: ANALYZE sampling is too good

From
Gavin Flower
Date:
On 12/12/13 06:22, Tom Lane wrote:
> I wrote:
>> Hm.  You can only take N rows from a block if there actually are at least
>> N rows in the block.  So the sampling rule I suppose you are using is
>> "select up to N rows from each sampled block" --- and that is going to
>> favor the contents of blocks containing narrower-than-average rows.
> Oh, no, wait: that's backwards.  (I plead insufficient caffeine.)
> Actually, this sampling rule discriminates *against* blocks with
> narrower rows.  You previously argued, correctly I think, that
> sampling all rows on each page introduces no new bias because row
> width cancels out across all sampled pages.  However, if you just
> include up to N rows from each page, then rows on pages with more
> than N rows have a lower probability of being selected, but there's
> no such bias against wider rows.  This explains why you saw smaller
> values of "i" being undersampled.
>
> Had you run the test series all the way up to the max number of
> tuples per block, which is probably a couple hundred in this test,
> I think you'd have seen the bias go away again.  But the takeaway
> point is that we have to sample all tuples per page, not just a
> limited number of them, if we want to change it like this.
>
>             regards, tom lane
>
>
Hmm...

In my previous reply, which hasn't shown up yet!

I realized I made a mistake!

The fraction/probability could be any of 0.25, 0.50, or 0.75.


Cheers,
Gavin



Re: ANALYZE sampling is too good

From
Gavin Flower
Date:
On 12/12/13 07:22, Gavin Flower wrote:
> On 12/12/13 06:22, Tom Lane wrote:
>> I wrote:
>>> Hm.  You can only take N rows from a block if there actually are at 
>>> least
>>> N rows in the block.  So the sampling rule I suppose you are using is
>>> "select up to N rows from each sampled block" --- and that is going to
>>> favor the contents of blocks containing narrower-than-average rows.
>> Oh, no, wait: that's backwards.  (I plead insufficient caffeine.)
>> Actually, this sampling rule discriminates *against* blocks with
>> narrower rows.  You previously argued, correctly I think, that
>> sampling all rows on each page introduces no new bias because row
>> width cancels out across all sampled pages.  However, if you just
>> include up to N rows from each page, then rows on pages with more
>> than N rows have a lower probability of being selected, but there's
>> no such bias against wider rows.  This explains why you saw smaller
>> values of "i" being undersampled.
>>
>> Had you run the test series all the way up to the max number of
>> tuples per block, which is probably a couple hundred in this test,
>> I think you'd have seen the bias go away again.  But the takeaway
>> point is that we have to sample all tuples per page, not just a
>> limited number of them, if we want to change it like this.
>>
>>             regards, tom lane
>>
>>
> Surely we want to sample a 'constant fraction' (obviously, in practice 
> you have to sample an integral number of rows in a page!) of rows per 
> page? The simplest way, as Tom suggests, is to use all the rows in a 
> page.
>
> However, if you wanted the same number of rows from a greater number 
> of pages, you could (for example) select a quarter of the rows from 
> each page.  In which case, when this is a fractional number: take the 
> integral number of rows, plus on extra row with a probability equal to 
> the fraction (here 0.25).
>
> Either way, if it is determined that you need N rows, then keep 
> selecting pages at random (but never use the same page more than once) 
> until you have at least N rows.
>
>
> Cheers,
> Gavin
>
>
>
Yes, the fraction/probability could actually be one of 0.25, 0.50, or 0.75.

But there is a bias introduced by the arithmetic average size of the 
rows in a page. This results in block sampling favouring large rows, as 
they are in a larger proportion of pages.

For example, assume 1000 rows of 200 bytes and 1000 rows of 20 bytes, 
using 400 byte pages.  In the pathologically worst case, assuming 
maximum packing density and no page has both types: the large rows would 
occupy  500 pages and the smaller rows 50 pages. So if one selected 11 
pages at random, you get about 10 pages of large rows and about one for 
small rows!  In practice, it would be much less extreme - for a start, 
not all blocks will be fully packed, most blocks would have both types 
of rows, and there is usually greater variation in row size - but still 
a bias towards sampling larger rows.  So somehow, this bias needs to be 
counteracted.


Cheers,
Gavin





Re: ANALYZE sampling is too good

From
Kevin Grittner
Date:
Gavin Flower <GavinFlower@archidevsys.co.nz> wrote:

> For example, assume 1000 rows of 200 bytes and 1000 rows of 20 bytes,
> using 400 byte pages.  In the pathologically worst case, assuming
> maximum packing density and no page has both types: the large rows would
> occupy  500 pages and the smaller rows 50 pages. So if one selected 11
> pages at random, you get about 10 pages of large rows and about one for
> small rows!

With 10 * 2 = 20 large rows, and 1 * 20 = 20 small rows.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: ANALYZE sampling is too good

From
Gavin Flower
Date:
<div class="moz-cite-prefix">On 12/12/13 08:14, Gavin Flower wrote:<br /></div><blockquote
cite="mid:52A8B998.5040602@archidevsys.co.nz"type="cite">On 12/12/13 07:22, Gavin Flower wrote: <br /><blockquote
type="cite">On12/12/13 06:22, Tom Lane wrote: <br /><blockquote type="cite">I wrote: <br /><blockquote type="cite">Hm. 
Youcan only take N rows from a block if there actually are at least <br /> N rows in the block.  So the sampling rule I
supposeyou are using is <br /> "select up to N rows from each sampled block" --- and that is going to <br /> favor the
contentsof blocks containing narrower-than-average rows. <br /></blockquote> Oh, no, wait: that's backwards.  (I plead
insufficientcaffeine.) <br /> Actually, this sampling rule discriminates *against* blocks with <br /> narrower rows. 
Youpreviously argued, correctly I think, that <br /> sampling all rows on each page introduces no new bias because row
<br/> width cancels out across all sampled pages.  However, if you just <br /> include up to N rows from each page,
thenrows on pages with more <br /> than N rows have a lower probability of being selected, but there's <br /> no such
biasagainst wider rows.  This explains why you saw smaller <br /> values of "i" being undersampled. <br /><br /> Had
yourun the test series all the way up to the max number of <br /> tuples per block, which is probably a couple hundred
inthis test, <br /> I think you'd have seen the bias go away again.  But the takeaway <br /> point is that we have to
sampleall tuples per page, not just a <br /> limited number of them, if we want to change it like this. <br /><br />
           regards, tom lane <br /><br /><br /></blockquote> Surely we want to sample a 'constant fraction' (obviously,
inpractice you have to sample an integral number of rows in a page!) of rows per page? The simplest way, as Tom
suggests,is to use all the rows in a page. <br /><br /> However, if you wanted the same number of rows from a greater
numberof pages, you could (for example) select a quarter of the rows from each page.  In which case, when this is a
fractionalnumber: take the integral number of rows, plus on extra row with a probability equal to the fraction (here
0.25).<br /><br /> Either way, if it is determined that you need N rows, then keep selecting pages at random (but never
usethe same page more than once) until you have at least N rows. <br /><br /><br /> Cheers, <br /> Gavin <br /><br
/><br/><br /></blockquote> Yes the fraction/probability, could actually be one of: 0.25, 0.50, 0.75. <br /><br /> But
thereis a bias introduced by the arithmetic average size of the rows in a page. This results in block sampling
favouringlarge rows, as they are in a larger proportion of pages. <br /><br /> For example, assume 1000 rows of 200
bytesand 1000 rows of 20 bytes, using 400 byte pages.  In the pathologically worst case, assuming maximum packing
densityand no page has both types: the large rows would occupy  500 pages and the smaller rows 50 pages. So if one
selected11 pages at random, you get about 10 pages of large rows and about one for small rows!  In practice, it would
bemuch less extreme - for a start, not all blocks will be fully packed, most blocks would have both types of rows, and
thereis usually greater variation in row size - but still a bias towards sampling larger rows.  So somehow, this bias
needsto be counteracted. <br /><br /><br /> Cheers, <br /> Gavin <br /><br /></blockquote> Actually, I just thought of
apossible way to overcome the bias towards large rows.<br /><br /><ol><li>Calculate (a rough estimate may be
sufficient,if not too 'rough') the size of the smallest row.<br /><br /><li>Select a page at random (never selecting
thesame page twice)<br /><br /><li>Then select rows at random within the page (never selecting the same row twice). 
Foreach row selected, accept it with the probability equal to (size of smallest row)/(size of selected row).  I think
youfind that will almost completely offset the bias towards larger rows!<br /><br /><li>If you do not have sufficient
rows,and you still have pages not yet selected, goto 2<br /></ol> Note that it will be normal for for some pages not to
haveany rows selected, especially for large tables!<br /><br /><br /> Cheers,<br /> Gavin<br /><br />  P.S.<br /> I
reallyneed to stop thinking about this problem, and get on with my assigned project!!!<br /><br /><br /> 
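
A minimal standalone C sketch of the acceptance/rejection step in point 3 (illustration only: the row struct and helpers are made up, not PostgreSQL APIs, and later messages in the thread question whether the size bias being compensated for actually exists):

#include <stdio.h>
#include <stdlib.h>

/* made-up tuple representation, for illustration only */
typedef struct
{
    int     size;               /* row size in bytes */
} SampleRow;

/* uniform random double in [0, 1) */
static double
drand(void)
{
    return (double) rand() / ((double) RAND_MAX + 1);
}

/*
 * Acceptance/rejection step: keep a row picked from a sampled page with
 * probability min_size / row_size, so that larger rows are accepted
 * proportionally less often.
 */
static int
accept_row(const SampleRow *row, int min_size)
{
    return drand() < (double) min_size / row->size;
}

int
main(void)
{
    SampleRow r = { 200 };
    int kept = 0;

    srand(0);
    for (int i = 0; i < 100000; i++)
        kept += accept_row(&r, 20);

    /* with min_size = 20 and size = 200, roughly 10% should survive */
    printf("kept %d of 100000\n", kept);
    return 0;
}
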

Re: ANALYZE sampling is too good

From
Peter Geoghegan
Date:
On Tue, Dec 10, 2013 at 4:48 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Why would I even mention that to a statistician? We want guidance. But
> yes, I bet I could give a statistician an explanation of statistics
> target that they'd understand without too much trouble.

Actually, I think that if we told a statistician about the statistics
target, his or her response would be: why would you presume to know
ahead of time what statistics target is going to be effective? I
suspect that the basic problem is that it isn't adaptive. I think that
if we could somehow characterize the quality of our sample as we took
it, and then cease sampling when we reached a certain degree of
confidence in its quality, that would be helpful. It might not even
matter that the sample was clustered from various blocks.


-- 
Peter Geoghegan



Re: ANALYZE sampling is too good

From
Gavin Flower
Date:
On 12/12/13 08:31, Kevin Grittner wrote:
> Gavin Flower <GavinFlower@archidevsys.co.nz> wrote:
>
>> For example, assume 1000 rows of 200 bytes and 1000 rows of 20 bytes,
>> using 400 byte pages.  In the pathologically worst case, assuming
>> maximum packing density and no page has both types: the large rows would
>> occupy  500 pages and the smaller rows 50 pages. So if one selected 11
>> pages at random, you get about 10 pages of large rows and about one for
>> small rows!
> With 10 * 2 = 20 large rows, and 1 * 20 = 20 small rows.
>
> --
> Kevin Grittner
> EDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
Sorry, I've simply come up with well argued nonsense!

Kevin, you're dead right.


Cheers,
Gavin



Re: ANALYZE sampling is too good

From
Gavin Flower
Date:
On 12/12/13 08:39, Gavin Flower wrote:
> On 12/12/13 08:31, Kevin Grittner wrote:
>> Gavin Flower <GavinFlower@archidevsys.co.nz> wrote:
>>
>>> For example, assume 1000 rows of 200 bytes and 1000 rows of 20 bytes,
>>> using 400 byte pages.  In the pathologically worst case, assuming
>>> maximum packing density and no page has both types: the large rows 
>>> would
>>> occupy  500 pages and the smaller rows 50 pages. So if one selected 11
>>> pages at random, you get about 10 pages of large rows and about one for
>>> small rows!
>> With 10 * 2 = 20 large rows, and 1 * 20 = 20 small rows.
>>
>> -- 
>> Kevin Grittner
>> EDB: http://www.enterprisedb.com
>> The Enterprise PostgreSQL Company
> Sorry, I've simply come up with well argued nonsense!
>
> Kevin, you're dead right.
>
>
> Cheers,
> Gavin
>
>
I looked at:
http://www.postgresql.org/docs/current/interactive/storage-page-layout.html
this says that each row has an overhead, which suggests there should be 
a bias towards small rows.

There must be a lot of things going on, that I'm simply not aware of, 
that affect sampling bias...


Cheers,
Gavin



Re: ANALYZE sampling is too good

From
Gavin Flower
Date:
On 12/12/13 09:12, Gavin Flower wrote:
> On 12/12/13 08:39, Gavin Flower wrote:
>> On 12/12/13 08:31, Kevin Grittner wrote:
>>> Gavin Flower <GavinFlower@archidevsys.co.nz> wrote:
>>>
>>>> For example, assume 1000 rows of 200 bytes and 1000 rows of 20 bytes,
>>>> using 400 byte pages.  In the pathologically worst case, assuming
>>>> maximum packing density and no page has both types: the large rows 
>>>> would
>>>> occupy  500 pages and the smaller rows 50 pages. So if one selected 11
>>>> pages at random, you get about 10 pages of large rows and about one 
>>>> for
>>>> small rows!
>>> With 10 * 2 = 20 large rows, and 1 * 20 = 20 small rows.
>>>
>>> -- 
>>> Kevin Grittner
>>> EDB: http://www.enterprisedb.com
>>> The Enterprise PostgreSQL Company
>> Sorry, I've simply come up with well argued nonsense!
>>
>> Kevin, you're dead right.
>>
>>
>> Cheers,
>> Gavin
>>
>>
> I looked at:
> http://www.postgresql.org/docs/current/interactive/storage-page-layout.html 
>
> this says that each row has an overhead, which suggests there should 
> be a bias towards small rows.
>
> There must be a lot of things going on, that I'm simply not aware of, 
> that affect sampling bias...
>
>
> Cheers,
> Gavin
>
>
Ignoring overheads per row and other things... There will be a biasing 
effect when the distribution of sizes is not symmetric.  For example: 
when the majority of rows have sizes greater than the arithmetic mean, 
then most samples will be biased towards larger rows. Similarly there 
could be a bias towards smaller rows when most rows are smaller than the 
arithmetic mean.  Yes, I did think about this in depth - but it is way 
too complicated to attempt to quantify the bias, because it depends on 
too many factors (even just limiting it to the distribution of row sizes).

So apart from the nature of volatility of the table, and the pattern of 
insertions/updates/deletes - there will be a bias depending on the 
distribution of values in the table.

So I despair, that a simple elegant & practical algorithm will ever be 
found.

Therefore, I expect the best answer is probably some kind of empirical 
adaptive approach - which I think has already been suggested.


Cheers,
Gavin





Re: ANALYZE sampling is too good

From
Greg Stark
Date:
<p dir="ltr"><br /> I think we're all wet here. I don't see any bias towards larger or smaller rows. Larger tied will
beon a larger number of pages but there will be fewer of them on any one page. The average effect should be the same.<p
dir="ltr">Smallervalues might have a higher variance with block based sampling than larger values. But that actually
*is*the kind of thing that Simon's approach of just compensating with later samples can deal with.<br /><br /><p
dir="ltr">--<br /> greg 

Re: ANALYZE sampling is too good

From
Martijn van Oosterhout
Date:
On Thu, Dec 12, 2013 at 07:22:59AM +1300, Gavin Flower wrote:
> Surely we want to sample a 'constant fraction' (obviously, in
> practice you have to sample an integral number of rows in a page!)
> of rows per page? The simplest way, as Tom suggests, is to use all
> the rows in a page.
>
> However, if you wanted the same number of rows from a greater number
> of pages, you could (for example) select a quarter of the rows from
> each page.  In which case, when this is a fractional number: take
> the integral number of rows, plus on extra row with a probability
> equal to the fraction (here 0.25).

In this discussion we've mostly used block = 1 postgresql block of 8k.
But when reading from a disk once you've read one block you can
basically read the following ones practically for free.

So I wonder if you could make your sampling read always 16 consecutive
blocks, but then use 25-50% of the tuples.  That way you get many more
tuples for the same amount of disk I/O seeks..
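
A toy standalone C sketch of that shape of sampling (process_block is a made-up placeholder, not a PostgreSQL call): pick random 16-block chunk starts rather than random single blocks, then keep only a fraction of the tuples from each chunk.

#include <stdio.h>
#include <stdlib.h>

#define CHUNK_BLOCKS 16

/* placeholder for "read this block and feed some fraction of its tuples
 * into the sample" -- not a real PostgreSQL call */
static void
process_block(long blkno, double keep_fraction)
{
    printf("read block %ld, keep ~%.0f%% of its tuples\n",
           blkno, keep_fraction * 100);
}

int
main(void)
{
    long nblocks = 100000;          /* table size in blocks */
    int nchunks = 4;                /* how many 16-block chunks to sample */
    double keep_fraction = 0.25;    /* tuples retained per chunk */

    srand(1);
    for (int i = 0; i < nchunks; i++)
    {
        /* random chunk start, chosen so the chunk stays inside the table */
        long start = (long) ((double) rand() / ((double) RAND_MAX + 1)
                             * (nblocks - CHUNK_BLOCKS));

        for (int b = 0; b < CHUNK_BLOCKS; b++)
            process_block(start + b, keep_fraction);
    }
    return 0;
}
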

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.  -- Arthur Schopenhauer

Re: ANALYZE sampling is too good

From
Josh Berkus
Date:
On 12/11/2013 02:39 PM, Martijn van Oosterhout wrote:
> In this discussion we've mostly used block = 1 postgresql block of 8k. 
> But when reading from a disk once you've read one block you can
> basically read the following ones practically for free.
> 
> So I wonder if you could make your sampling read always 16 consecutive
> blocks, but then use 25-50% of the tuples.  That way you get many more
> tuples for the same amount of disk I/O seeks..

Yeah, that's what I meant by "tune this for the FS".   We'll probably
have to test a lot of different "block sizes" on different FSes before
we arrive at a reasonable size, and even then I'll bet we have to offer
a GUC.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: ANALYZE sampling is too good

From
Florian Pflug
Date:
Here's an analysis of Jeff Janes' simple example of a table where our
n_distinct estimate is way off.

On Dec11, 2013, at 00:03 , Jeff Janes <jeff.janes@gmail.com> wrote:
> create table baz as select floor(random()*10000000), md5(random()::text) from generate_series(1,100000000);
> create table baz2 as select * from baz order by floor;
> create table baz3 as select * from baz order by md5(floor::text);
>
> baz unclustered, baz2 is clustered with perfect correlation, baz3 is clustered but without correlation.
>
> After analyzing all of them:
>
> select tablename, n_distinct,correlation  from pg_stats where tablename  like 'baz%' and attname='floor'  ;
>  tablename | n_distinct  | correlation
> -----------+-------------+-------------
>  baz       | 8.56006e+06 |  0.00497713
>  baz2      |      333774 |           1
>  baz3      |      361048 |  -0.0118147

I think I understand what's going on here. I worked with a reduced test case
of 1e7 rows containing random values between 0 and 5e5 and instrumented
analyze to print the raw ndistinct and nmultiple values of the sample
population (i.e. the number of distinct values in the sample, and the number
of distinct values which appeared more than once). I've also considered only
baz and baz2, and thus removed the then-unnecessary md5 column. To account for
the reduced table sizes, I adjusted default_statistics_target to 10 instead of
100. The resulting estimates are then
tablename | n_distinct (est) | n_distinct (act)
-----------+------------------+------------------baz       |           391685 |           500000 baz2      |
36001|           500000  

ANALYZE assumes that both tables contain 10000048 rows and samples 3000 of
those.

The sample of baz contains 2989 distinct values, 11 of which appear more than
once. The sample of baz2 contains 2878 distinct values, 117 (!) of which
appear more than once.

The very different results stem from the Duj1 estimator we use. It estimates
n_distinct by computing n*d/(n - f1 + f1*n/N) where n is the number of
samples, N the number of rows, d the number of distinct samples, and f1 the
number of distinct samples occurring exactly once. If all samples are unique
(i.e. n=d=f1) this yields N. But if f1 is less than d, the result drops very
quickly - sampling baz2 produces 117 non-unique values out of 2878 distinct,
roughly 4% - and the estimate is already less than 1/10 of what it would be
if f1 were 0.
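
The arithmetic is easy to check. Below is a standalone C sketch (not analyze.c code) of the Duj1 formula as given above, fed with the sample counts from this test, with f1 taken as the distinct count minus the values seen more than once; it prints 391685 and 36001, i.e. the two estimates in the table above, up to rounding.

#include <stdio.h>

/*
 * Duj1 estimator as described above: n*d / (n - f1 + f1*n/N)
 *   n  - number of sampled rows
 *   N  - estimated total rows in the table
 *   d  - distinct values in the sample
 *   f1 - distinct values appearing exactly once in the sample
 */
static double
duj1(double n, double N, double d, double f1)
{
    return n * d / (n - f1 + f1 * n / N);
}

int
main(void)
{
    double n = 3000, N = 10000048;

    /* baz:  2989 distinct values, 11 of them seen more than once */
    printf("baz  ~ %.0f\n", duj1(n, N, 2989, 2989 - 11));    /* 391685 */

    /* baz2: 2878 distinct values, 117 of them seen more than once */
    printf("baz2 ~ %.0f\n", duj1(n, N, 2878, 2878 - 117));   /* 36001 */

    return 0;
}
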

Which leaves us with the question why sampling baz2 produces more duplicate
values than sampling baz does. Now, if we used block sampling, that behaviour
would be unsurprising - baz2 is sorted, so each block very likely contains
each value more than once, since the row count exceeds the number of possible
values by more than an order of magnitude. In fact, with block sampling, we'd probably
see f1 values close to 0 and thus our estimate of n_distinct would be roughly
equal to the number of distinct values in the *sample* population, i.e. about
300 or so.

So why does Vitter's algorithm fail here? Given that we see inflated numbers
of duplicate values in our sample, yet still far fewer than block-based
sampling would yield, my guess is that it's the initial reservoir filling that
bites us here. After that initial filling step, the reservoir contains a lot of
consecutive rows, i.e. a block-based sample taken from just a few blocks. If
the replacement phase that follows somehow uses a slightly smaller replacement
probability than it should, more of these rows will survive replacement,
resulting in exactly the kind of inflated numbers of duplicate values we're
seeing. I've yet to validate this by looking at the reservoir before and after
the replacement stage, though.

So at least for the purpose of estimating n_distinct, our current sampling
method seems to exhibit the worst rather than the best properties of block-
and row-based sampling. What conclusions to draw from that, I'm not sure yet -
other than that if we move to block-based sampling, we'll certainly have to change
the way we estimate n_distinct.

best regards,
Florian Pflug




Re: ANALYZE sampling is too good

From
Jeff Janes
Date:
On Wed, Dec 11, 2013 at 2:33 PM, Greg Stark <stark@mit.edu> wrote:


I think we're all wet here. I don't see any bias towards larger or smaller rows. Larger tied will be on a larger number of pages but there will be fewer of them on any one page. The average effect should be the same.

Smaller values might have a higher variance with block based sampling than larger values. But that actually *is* the kind of thing that Simon's approach of just compensating with later samples can deal with.


I think that looking at all rows in randomly-chosen blocks will not bias size or histograms.  But it will bias n_distinct and MCV for some data distributions, unless we find some way to compensate for it.

But even for avg size and histograms, what does block sampling get us?  We get larger sample sizes for the same IO, but those samples are less independent (assuming data is not randomly scattered over the table), so the "effective sample size" is less than the true sample size.  So we can't just sample 100 times fewer blocks because there are about 100 rows per block--doing so would not bias our avg size or histogram boundaries, but it would certainly make them noisier.
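
The "effective sample size" point has a standard quantitative form from cluster-sampling theory (the usual design-effect formula, not something derived in this thread): with $m$ rows taken per sampled block and intra-block correlation $\rho$,

$$ n_{\mathrm{eff}} = \frac{n}{1 + (m - 1)\,\rho}, $$

so a well-shuffled table ($\rho \approx 0$) gives $n_{\mathrm{eff}} \approx n$, while a strongly clustered one ($\rho \to 1$) gives $n_{\mathrm{eff}} \approx n/m$, i.e. roughly one independent observation per block.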

Cheers,

Jeff

Re: ANALYZE sampling is too good

From
Jeff Janes
Date:
On Thu, Dec 12, 2013 at 6:39 AM, Florian Pflug <fgp@phlo.org> wrote:
Here's an analysis of Jeff Janes' simple example of a table where our
n_distinct estimate is way off.

On Dec11, 2013, at 00:03 , Jeff Janes <jeff.janes@gmail.com> wrote:
> create table baz as select floor(random()*10000000), md5(random()::text) from generate_series(1,100000000);
> create table baz2 as select * from baz order by floor;
> create table baz3 as select * from baz order by md5(floor::text);
>
> baz unclustered, baz2 is clustered with perfect correlation, baz3 is clustered but without correlation.
>
> After analyzing all of them:
>
> select tablename, n_distinct,correlation  from pg_stats where tablename  like 'baz%' and attname='floor'  ;
>  tablename | n_distinct  | correlation
> -----------+-------------+-------------
>  baz       | 8.56006e+06 |  0.00497713
>  baz2      |      333774 |           1
>  baz3      |      361048 |  -0.0118147

I think I understand what's going on here. I worked with a reduced test cases
of 1e7 rows containing random values between 0 and 5e5 and instrumented
analyze to print the raw ndistinct and nmultiple values of the sample
population (i.e. the number of distinct values in the sample, and the number
of distinct values which appeared more than once). I've also considered only
baz and baz2, and thus removed the than unnecessary md5 column. To account for
the reduced table sizes, I adjusted default_statistics_target to 10 instead of
100. The resulting estimates are then

 tablename | n_distinct (est) | n_distinct (act)
-----------+------------------+------------------
 baz       |           391685 |           500000
 baz2      |            36001 |           500000

ANALYZE assumes that both tables contain 10000048 rows and samples 3000 of
those.

The sample of baz contains 2989 distinct values, 11 of which appear more than
once. The sample of baz2 contains 2878 distinct values, 117 (!) of which
appear more than once.

The very different results stem from the Duj1 estimator we use. It estimates
n_distinct by computing n*d/(n - f1 + f1*n/N) where n is the number of
samples, N the number of rows, d the number of distinct samples, and f1 the
number of distinct samples occurring exactly once. If all samples are unique
(i.e. n=d=f1) this yields N. But if f1 is less than d, the results drops very
quickly - sampling baz2 produces 117 non-unique values out of 2878 - roughly
0.03% - and the estimate already less than a 1/10 of what it would be if f1
where 0.

Which leaves us with the question why sampling baz2 produces more duplicate
values than sampling baz does. Now, if we used block sampling, that behaviour
would be unsurprising - baz2 is sorted, so each block very likely contains
each value more than since, since the row count exceeds the number of possible
values by more than a magnitude. In fact, with block sampling, we'd probably
see f1 values close to 0 and thus our estimate of n_distinct would be roughly
equal to the number of distinct values in the *sample* population, i.e. about
300 or so.

So why does vitter's algorithm fail here? Given that we see inflated numbers
of duplicate values in our sample, yet still far fewer than block-based
sampling would yield, my guess is that it's the initial reservoir filling that
bites us here. After that initial filling step, the reservoir contains a lot of
consecutive rows, i.e. a block-based sample taken from just a few blocks. If
the replacement phase that follows somehow uses a slightly smaller replacement
probability than it should, more of these rows will survive replacement,
resulting in exactly the kind of inflated numbers of duplicate values we're
seeing. I've yet to validate this by looking at the reservoir before and after
the replacement stage, though.


I think the problem is more subtle than that.  It is easier to visualize it if you think of every block having the same number of rows, with that number being fairly large.  If you pick 30,000 rows at random from 1,000,000 blocks, the number of rows chosen from any given block should be close to following a poisson distribution with average of 0.03, which means about 29113 blocks should have exactly 1 row chosen from them while 441 would have two or more rows chosen from them.  

But if you instead select 30,000 row from 30,000 blocks, which is what we ask Vitter's algorithm to do, you get about a Poisson distribution with average of 1.0.  Then about 11036 blocks have exactly one row chosen from them, and 7927 blocks have two or more rows sampled from it.  Another 11,036 blocks get zero rows selected from them due to Vitter, in addition to the 970,000 that didn't even get submitted to Vitter in the first place.  That is why you see too many duplicates for clustered data, as too many blocks are sampled multiple times.
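
Those expected block counts follow directly from the Poisson probabilities; a quick standalone C check that reproduces the figures above:

#include <math.h>
#include <stdio.h>

/* expected numbers of blocks with 0, 1, and >= 2 sampled rows,
 * if per-block counts follow a Poisson distribution with mean lambda */
static void
report(const char *label, double nblocks, double lambda)
{
    double p0 = exp(-lambda);
    double p1 = lambda * exp(-lambda);
    double p2plus = 1.0 - p0 - p1;

    printf("%s: 0 rows: %.0f blocks, 1 row: %.0f blocks, >=2 rows: %.0f blocks\n",
           label, nblocks * p0, nblocks * p1, nblocks * p2plus);
}

int
main(void)
{
    /* 30,000 rows spread over 1,000,000 blocks: lambda = 0.03 */
    report("30k rows from 1M blocks ", 1000000, 0.03);

    /* 30,000 rows squeezed into 30,000 preselected blocks: lambda = 1.0 */
    report("30k rows from 30k blocks", 30000, 1.0);

    return 0;
}
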

The Poisson argument doesn't apply cleanly when blocks have variable number of rows, but the general principle still applies.  Too-few blocks have exactly one row sampled from them, and too many blocks have either zero row, or more than one row.

It would be relatively easy to fix this if we trusted the number of visible rows in each block to be fairly constant.  But without that assumption, I don't see a way to fix the sample selection process without reading the entire table.


 
So at least for the purpose of estimating n_distinct, our current sampling
method seems to exhibit the worst rather than the best properties of block-
and row- based sampling. What conclusions to draw of that, I'm not sure yet -
other that if we move to block-based sampling, we'll certainly have to change
the way we estimate n_distinct.

Perhaps we should consider changing that even if we don't change to block based sampling--but what would the other candidates be?  I guess the same paper that presented Duj1 would be a good place to start, but are there others?  It sounds like this was discussed several years ago, but I didn't find it in the archives with a casual search.
 
Cheers,

Jeff

Re: ANALYZE sampling is too good

From
Tom Lane
Date:
Jeff Janes <jeff.janes@gmail.com> writes:
> It would be relatively easy to fix this if we trusted the number of visible
> rows in each block to be fairly constant.  But without that assumption, I
> don't see a way to fix the sample selection process without reading the
> entire table.

Yeah, varying tuple density is the weak spot in every algorithm we've
looked at.  The current code is better than what was there before, but as
you say, not perfect.  You might be entertained to look at the threads
referenced by the patch that created the current sampling method:
http://www.postgresql.org/message-id/1tkva0h547jhomsasujt2qs7gcgg0gtvrp@email.aon.at

particularly

http://www.postgresql.org/message-id/flat/ri5u70du80gnnt326k2hhuei5nlnimonbs@email.aon.at#ri5u70du80gnnt326k2hhuei5nlnimonbs@email.aon.at


However ... where this thread started was not about trying to reduce
the remaining statistical imperfections in our existing sampling method.
It was about whether we could reduce the number of pages read for an
acceptable cost in increased statistical imperfection.
        regards, tom lane



Re: ANALYZE sampling is too good

From
Claudio Freire
Date:
On Thu, Dec 12, 2013 at 3:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> It would be relatively easy to fix this if we trusted the number of visible
>> rows in each block to be fairly constant.  But without that assumption, I
>> don't see a way to fix the sample selection process without reading the
>> entire table.
>
> Yeah, varying tuple density is the weak spot in every algorithm we've
> looked at.  The current code is better than what was there before, but as
> you say, not perfect.  You might be entertained to look at the threads
> referenced by the patch that created the current sampling method:
> http://www.postgresql.org/message-id/1tkva0h547jhomsasujt2qs7gcgg0gtvrp@email.aon.at
>
> particularly
>
http://www.postgresql.org/message-id/flat/ri5u70du80gnnt326k2hhuei5nlnimonbs@email.aon.at#ri5u70du80gnnt326k2hhuei5nlnimonbs@email.aon.at
>
>
> However ... where this thread started was not about trying to reduce
> the remaining statistical imperfections in our existing sampling method.
> It was about whether we could reduce the number of pages read for an
> acceptable cost in increased statistical imperfection.


Well, why not take a supersample containing all visible tuples from N
selected blocks, and do bootstrapping over it, with subsamples of M
independent rows each?



Re: ANALYZE sampling is too good

From
Josh Berkus
Date:
On 12/12/2013 10:33 AM, Claudio Freire wrote:
> Well, why not take a supersample containing all visible tuples from N
> selected blocks, and do bootstrapping over it, with subsamples of M
> independent rows each?

Well, we still need to look at each individual block to determine
grouping correlation.  Let's take a worst case example: imagine a table
has *just* been created by:

CREATE TABLE newdata AS SELECT * FROM olddata ORDER BY category, item;

If "category" is fairly low cardinality, then grouping will be severe;
we can reasonably expect that if we sample 100 blocks, many of them will
have only one category value present.  The answer to this is to make our
block samples fairly widely spaced and compare them.

In this simplified example, if the table had 1000 blocks, we would take
blocks 1,101,201,301,401,etc.  Then we would compare the number and
content of values found on each block with the number and content found
on each other block.  For example, if we see that block 101 is entirely
the category "cats", and block 701 is entirely the category "shopping"
and block 901 is split 60/40 between the categories "transportation" and
"voting", then we can assume that the level of grouping is very high,
and the number of unknown values we haven't seen is also high.

Whereas if 101 is "cats" and 201 is "cats" and 301 through 501 are
"cats" with 2% other stuff, then we assume that the level of grouping is
moderate and it's just the case that most of the dataset is "cats".
Which means that the number of unknown values we haven't seen is low.

Whereas if 101, 201, 501, and 901 have near-identical distributions of
values, we assume that the level of grouping is very low, and that there
are very few values we haven't seen.

As someone else pointed out, full-block (the proposal) vs. random-row
(our current style) doesn't have a very significant effect on estimates
of Histograms and nullfrac, as long as the sampled blocks are widely
spaced.  Well, nullfrac is affected in the extreme example of a totally
ordered table where the nulls are all in one block, but I'll point out
that we can (and do) also miss that using our current algo.

Estimated grouping should, however, affect MCVs.  In cases where we
estimate that grouping levels are high, the expected % of observed
values should be "discounted" somehow.  That is, with total random
distribution you have a 1:1 ratio between observed frequency of a value
and assumed frequency.  However, with highly grouped values, you might
have a 2:1 ratio.

Again, more math (backed by statistical analysis) is needed.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: ANALYZE sampling is too good

From
Claudio Freire
Date:
On Thu, Dec 12, 2013 at 3:56 PM, Josh Berkus <josh@agliodbs.com> wrote:
>
> Estimated grouping should, however, affect MCVs.  In cases where we
> estimate that grouping levels are high, the expected % of observed
> values should be "discounted" somehow.  That is, with total random
> distribution you have a 1:1 ratio between observed frequency of a value
> and assumed frequency.  However, with highly grouped values, you might
> have a 2:1 ratio.

Cross validation can help there. But it's costly.



Re: ANALYZE sampling is too good

From
Jeff Janes
Date:
On Thu, Dec 12, 2013 at 10:33 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
On Thu, Dec 12, 2013 at 3:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> It would be relatively easy to fix this if we trusted the number of visible
>> rows in each block to be fairly constant.  But without that assumption, I
>> don't see a way to fix the sample selection process without reading the
>> entire table.
>
> Yeah, varying tuple density is the weak spot in every algorithm we've
> looked at.  The current code is better than what was there before, but as
> you say, not perfect.  You might be entertained to look at the threads
> referenced by the patch that created the current sampling method:
> http://www.postgresql.org/message-id/1tkva0h547jhomsasujt2qs7gcgg0gtvrp@email.aon.at
>
> particularly
> http://www.postgresql.org/message-id/flat/ri5u70du80gnnt326k2hhuei5nlnimonbs@email.aon.at#ri5u70du80gnnt326k2hhuei5nlnimonbs@email.aon.at


Thanks, I will read those. 


>
>
> However ... where this thread started was not about trying to reduce
> the remaining statistical imperfections in our existing sampling method.
> It was about whether we could reduce the number of pages read for an
> acceptable cost in increased statistical imperfection.

I think it is pretty clear that n_distinct at least, and probably MCV, would be a catastrophe under some common data distribution patterns if we sample all rows in each block without changing our current computation method.  If we come up with a computation that works for that type of sampling, it would probably be an improvement under our current sampling as well.  If we find such a thing, I wouldn't want it to get rejected just because the larger block-sampling method change did not make it in.

> Well, why not take a supersample containing all visible tuples from N
> selected blocks, and do bootstrapping over it, with subsamples of M
> independent rows each?

Bootstrapping methods generally do not work well when ties are significant events, i.e. when two values being identical is meaningfully different from them being very close but not identical.

Cheers,

Jeff

Re: ANALYZE sampling is too good

From
Claudio Freire
Date:
On Thu, Dec 12, 2013 at 4:13 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> Well, why not take a supersample containing all visible tuples from N
>> selected blocks, and do bootstrapping over it, with subsamples of M
>> independent rows each?
>
>
> Bootstrapping methods generally do not work well when ties are significant
> events, i.e. when two values being identical is meaningfully different from
> them being very close but not identical.

Yes, that's why I meant to say (but I see now that I didn't) that it
wouldn't do much for n_distinct, just the histogram.



Re: ANALYZE sampling is too good

From
Jeff Janes
Date:
On Tue, Dec 3, 2013 at 3:30 PM, Greg Stark <stark@mit.edu> wrote:
At multiple conferences I've heard about people trying all sorts of
gymnastics to avoid ANALYZE which they expect to take too long and
consume too much I/O. This is especially a big complain after upgrades
when their new database performs poorly until the new statistics are
in and they did pg_upgrade to avoid an extended downtime and complain
about ANALYZE taking hours.

Out of curiosity, are they using the 3 stage script "analyze_new_cluster.sh"?  If so, is the complaint that even the first rounds are too slow, or that the database remains unusable until the last round is complete?

Cheers,

Jeff

Re: ANALYZE sampling is too good

From
Florian Pflug
Date:
On Dec12, 2013, at 19:29 , Tom Lane <tgl@sss.pgh.pa.us> wrote:
> However ... where this thread started was not about trying to reduce
> the remaining statistical imperfections in our existing sampling method.
> It was about whether we could reduce the number of pages read for an
> acceptable cost in increased statistical imperfection.

True, but Jeff's case shows that even the imperfections of the current
sampling method are larger than what the n_distinct estimator expects.

Making it even more biased will thus require us to rethink how we
obtain an n_distinct estimate or something equivalent. I don't mean that
as an argument against changing the sampling method, just as something
to watch out for.

best regards,
Florian Pflug




Re: ANALYZE sampling is too good

From
Jeff Janes
Date:
On Mon, Dec 9, 2013 at 3:14 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

 
> I took a stab at using posix_fadvise() in ANALYZE. It turned out to be
> very easy, patch attached. Your mileage may vary, but I'm seeing a nice
> gain from this on my laptop. Taking a 30000 page sample of a table with
> 717717 pages (ie. slightly larger than RAM), ANALYZE takes about 6 seconds
> without the patch, and less than a second with the patch, with
> effective_io_concurrency=10. If anyone with a good test data set loaded
> would like to test this and post some numbers, that would be great.
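
For anyone wanting to play with the effect outside of Postgres, the access
pattern is roughly the one below (a Python sketch using os.posix_fadvise on
a raw data file, purely illustrative; the real patch of course works through
PostgreSQL's own storage layer):

import os

BLOCK_SIZE = 8192

def read_sample_blocks(path, block_numbers, prefetch_distance=10):
    """Read the sampled blocks of a data file, keeping prefetch_distance
    fadvise hints ahead of the read position (prefetch_distance plays
    the role of effective_io_concurrency here)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Prime the prefetch window.
        for blkno in block_numbers[:prefetch_distance]:
            os.posix_fadvise(fd, blkno * BLOCK_SIZE, BLOCK_SIZE,
                             os.POSIX_FADV_WILLNEED)
        blocks = []
        for i, blkno in enumerate(block_numbers):
            nxt = i + prefetch_distance
            if nxt < len(block_numbers):
                # Tell the kernel about a block we'll want soon, so its
                # read can be in flight while we process the current one.
                os.posix_fadvise(fd, block_numbers[nxt] * BLOCK_SIZE,
                                 BLOCK_SIZE, os.POSIX_FADV_WILLNEED)
            blocks.append(os.pread(fd, BLOCK_SIZE, blkno * BLOCK_SIZE))
        return blocks
    finally:
        os.close(fd)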

Performance is often chaotic near transition points, so I try to avoid data sets that are slightly bigger or slightly smaller than RAM (or some other limit).

Do you know how many io channels your SSD has (or whatever the term of art is for SSD drives)?

On a RAID with 12 spindles, analyzing pgbench_accounts at scale 1000 (13GB) with 4 GB of RAM goes from ~106 seconds to ~19 seconds.

However, I'm not sure what problem we want to solve here.  I certainly would not wish to give a background maintenance process permission to confiscate my entire RAID throughput for its own operation.  Perhaps this could only be active for explicit analyze, and only if vacuum_cost_delay=0?

Perhaps there should be something like "alter background role autovac set ...".  Otherwise we are going to end up with an "autovacuum_*" shadow parameter for many of our parameters, see "autovacuum_work_mem" discussions.

Cheers,

Jeff

Re: ANALYZE sampling is too good

From
Heikki Linnakangas
Date:
On 12/17/2013 12:06 AM, Jeff Janes wrote:
> On Mon, Dec 9, 2013 at 3:14 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com>wrote:
>
>>   I took a stab at using posix_fadvise() in ANALYZE. It turned out to be
>> very easy, patch attached. Your mileage may vary, but I'm seeing a nice
>> gain from this on my laptop. Taking a 30000 page sample of a table with
>> 717717 pages (ie. slightly larger than RAM), ANALYZE takes about 6 seconds
>> without the patch, and less than a second with the patch, with
>> effective_io_concurrency=10. If anyone with a good test data set loaded
>> would like to test this and post some numbers, that would be great.
>
> Performance is often chaotic near transition points, so I try to avoid data
> sets that are slightly bigger or slightly smaller than RAM (or some other
> limit).
>
> Do you know how many io channels your SSD has (or whatever the term of art
> is for SSD drives)?

No idea. It's an Intel 335.

> On a RAID with 12 spindles, analyzing pgbench_accounts at scale 1000 (13GB)
> with 4 GB of RAM goes from ~106 seconds to ~19 seconds.
>
> However, I'm not sure what problem we want to solve here.

The case that Greg Stark mentioned in the email starting this thread is 
doing a database-wide ANALYZE after an upgrade. In that use case, you 
certainly want to get it done as quickly as possible, using all the 
available resources.

> I certainly would not wish to give a background maintenance process
> permission to confiscate my entire RAID throughput for its own
> operation.

Then don't set effective_io_concurrency. If you're worried about that, 
you probably wouldn't want any other process to monopolize the RAID 
array either.

> Perhaps this could only be active for explicit analyze, and only if
> vacuum_cost_delay=0?

That would be a bit weird, because ANALYZE in general doesn't obey 
vacuum_cost_delay. Maybe it should, though...

> Perhaps there should be something like "alter background role autovac set
> ...".  Otherwise we are going to end up with an "autovacuum_*" shadow
> parameter for many of our parameters, see "autovacuum_work_mem" discussions.

Yeah, so it seems.

- Heikki



Re: ANALYZE sampling is too good

From
Bruce Momjian
Date:
I assume we never came up with a TODO from this thread:

---------------------------------------------------------------------------

On Tue, Dec  3, 2013 at 11:30:44PM +0000, Greg Stark wrote:
> At multiple conferences I've heard about people trying all sorts of
> gymnastics to avoid ANALYZE which they expect to take too long and
> consume too much I/O. This is especially a big complain after upgrades
> when their new database performs poorly until the new statistics are
> in and they did pg_upgrade to avoid an extended downtime and complain
> about ANALYZE taking hours.
> 
> I always gave the party line that ANALYZE only takes a small
> constant-sized sample so even very large tables should be very quick.
> But after hearing the same story again in Heroku I looked into it a
> bit further. I was kind of shocked but the numbers.
> 
> ANALYZE takes a sample of 300 * statistics_target rows. That sounds
> pretty reasonable but with default_statistics_target set to 100 that's
> 30,000 rows. If I'm reading the code right It takes this sample by
> sampling 30,000 blocks and then (if the table is large enough) taking
> an average of one row per block. Each block is 8192 bytes so that
> means it's reading 240MB of each table.That's a lot more than I
> realized.
> 
> It means if your table is anywhere up to 240MB you're effectively
> doing a full table scan and then throwing out nearly all the data
> read.
> 
> Worse, my experience with the posix_fadvise benchmarking is that on
> spinning media reading one out of every 16 blocks takes about the same
> time as reading them all. Presumably this is because the seek time
> between tracks dominates and reading one out of every 16 blocks is
> still reading every track. So in fact if your table is up to about
> 3-4G ANALYZE is still effectively going to do a full table scan, at
> least as far as I/O time goes.
> 
> The current algorithm seems like it was designed with a 100G+ table in
> mind but the consequences on the more common 100M-100G tables weren't
> really considered. Consider what this means for partitioned tables. If
> they partition their terabyte table into 10 partitions ANALYZE will
> suddenly want to use 10x as much I/O which seems like a perverse
> consequence.
> 
> I'm not sure I have a prescription but my general feeling is that
> we're spending an awful lot of resources going after a statistically
> valid sample when we can spend a lot less resources and get something
> 90% as good. Or if we're really going to read that much data that we
> might as well use more of the rows we find.
> 
> -- 
> greg

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +