Re: Gsoc2012 idea, tablesample - Mailing list pgsql-hackers

From Greg Stark
Subject Re: Gsoc2012 idea, tablesample
Msg-id CAM-w4HObqcAgM2n=cqym5LtY9oXndOzkQtZJ+11NZbYxKSLEFw@mail.gmail.com
In response to Re: Gsoc2012 idea, tablesample  (Christopher Browne <cbbrowne@gmail.com>)
Responses Re: Gsoc2012 idea, tablesample  (Josh Berkus <josh@agliodbs.com>)
List pgsql-hackers
On Tue, Apr 17, 2012 at 5:33 PM, Christopher Browne <cbbrowne@gmail.com> wrote:
> I get the feeling that this is a somewhat-magical feature (in that
> users haven't much hope of understanding in what ways the results are
> deterministic) that is sufficiently "magical" that anyone serious
> about their result sets is likely to be unhappy to use either SYSTEM
> or BERNOULLI.

These both sound pretty useful. "BERNOULLI" is fine for cases where
you aren't worried about time dependency in your data, for example if
you're looking for the average or total value of some column.

SYSTEM just means "I'm willing to trade some unspecified amount of
speed for some unspecified amount of accuracy" which presumably is
only good if you trust the database designers to make a reasonable
trade-off for cases where speed matters and the accuracy requirements
aren't very strict.
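
To make the trade-off concrete, here is a rough Python sketch of the two
methods as I understand them: BERNOULLI keeps each row independently,
while SYSTEM keeps or skips whole pages. This is a toy model only, not
how any database implements it; the table layout, the 10% fraction, and
all function names are assumptions made up for illustration.

```python
import random

def bernoulli_sample(pages, fraction, rng):
    """Row-level sampling: keep each row independently with probability `fraction`."""
    return [row for page in pages for row in page if rng.random() < fraction]

def system_sample(pages, fraction, rng):
    """Block-level sampling: keep or skip each page as a whole."""
    return [row for page in pages if rng.random() < fraction for row in page]

def estimate_total(sample, fraction):
    """Scale the sampled sum up by the inverse of the sampling fraction."""
    return sum(sample) / fraction

# Hypothetical table: 1000 pages of 100 rows each, where the value stored
# in a row is correlated with its physical position (the page number).
pages = [[p for _ in range(100)] for p in range(1000)]
true_total = sum(sum(page) for page in pages)

rng = random.Random(42)
bernoulli_estimate = estimate_total(bernoulli_sample(pages, 0.1, rng), 0.1)
system_estimate = estimate_total(system_sample(pages, 0.1, rng), 0.1)

print("true total:        ", true_total)
print("BERNOULLI estimate:", bernoulli_estimate)
print("SYSTEM estimate:   ", system_estimate)
```

The row-level version has to look at every row, while the block-level
version only reads the pages it picks. When values are clustered by page,
as in this made-up table, the block-level estimate should be expected to
vary more from run to run.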

> Possibly the forms of sampling that people *actually* need, most of
> the time, are more like Dollar Unit Sampling, which are pretty
> deterministic, in ways that mandate that they be rather expensive
> (e.g. - guaranteeing Seq Scan).

I don't know about that, but the cases I would expect to need other
distributions would be ones where you're looking at the tuples in a
non-linear way. Things like "what's the average gap between events" or
"what's the average number of instances per value".  These might
require a full table scan but might still be useful if the data is
going to be subsequently aggregated or joined in ways that would be
too expensive on the full data set.
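
As a rough illustration of why "average gap between events" doesn't fit
simple row-level sampling, here is a small Python sketch. The event
timestamps, the 10% fraction, and the helper name are all made up for
this example; it only shows that dropping rows independently stretches
the gaps between the rows that survive.

```python
import random

def average_gap(timestamps):
    """Mean distance between consecutive events, after sorting by time."""
    ts = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ts, ts[1:])]
    return sum(gaps) / len(gaps)

# Hypothetical event log: one event every 10 time units.
events = list(range(0, 100_000, 10))

# Row-level sample: keep each event independently with probability 0.1.
rng = random.Random(7)
sampled = [t for t in events if rng.random() < 0.1]

full_gap = average_gap(events)
sampled_gap = average_gap(sampled)

print("average gap on full data:", full_gap)
print("average gap on sample:   ", sampled_gap)
```

The gap measured on the sample comes out far larger than the real one,
because neighbouring events are rarely both kept. A query like this
either needs the whole table or a sampling scheme that keeps runs of
adjacent events together.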

But we shouldn't let the best be the enemy of the good here. Having
SYSTEM and BERNOULLI would solve most use cases, and having those would
make it easier to add more later.

-- 
greg

