Re: ANALYZE sampling is too good - Mailing list pgsql-hackers

From Greg Stark
Subject Re: ANALYZE sampling is too good
Date
Msg-id CAM-w4HNfAaWSZKLoSjnF4SMU4qUoumtMLcYwG=YQQEjdTwnHXA@mail.gmail.com
Whole thread Raw
In response to Re: ANALYZE sampling is too good  (Josh Berkus <josh@agliodbs.com>)
List pgsql-hackers
On Sun, Dec 8, 2013 at 7:03 PM, Josh Berkus <josh@agliodbs.com> wrote:
> They will certainly do so if you don't apply any statistical adjustments
> for selecting more rows from the same pages.
>
> So there's a set of math designed to calculate for the skew introduced
> by reading *all* of the rows in each block.

I just think this is an oversimplification. There's no skew introduced
just by reading all the rows in each block unless there's some kind of
dependency between the block a row is placed on and the data in it. So
I don't believe there can be some single set of math that
automatically removes any skew automatically. The math will depend on
what the dependency is.

Just to be clear, you have to think pretty hard about the way Postgres
internals work to see what kinds of skew might be appearing here. Due
to the way Postgres updates work and HOT cleanups work "hot" tuples
will be weighted less than "cold" tuples. That's not going to be
something someone in ACM knew to design into their maths.

I do have access to ACM or other academic articles if you remember any
author names or any keywords but if it's a database journal I would
worry about patent issues. Do you remember if it was over 17 years
old?


> Obviously, you need to look at more rows than you would with a
> pure-random sample.  Like I said, the 80%+ accurate point in the papers
> seemed to be at a 5% sample.

I really don't believe the 5% thing. It's not enough for n_distinct
and it's *far* too high a value for linear properties like histograms
or nullfrac etc. From a computer point of view it's too high to be
worth bothering. If we have to read 5% of the table we might as well
do a full scan anyways, it'll be marginally slower but much better
quality results.



-- 
greg



pgsql-hackers by date:

Previous
From: "MauMau"
Date:
Subject: Re: [RFC] Shouldn't we remove annoying FATAL messages from server log?
Next
From: Mark Kirkwood
Date:
Subject: Re: ANALYZE sampling is too good