Re: Gsoc2012 Idea --- Social Network database schema - Mailing list pgsql-hackers

From Christopher Browne
Subject Re: Gsoc2012 Idea --- Social Network database schema
Date
Msg-id CAFNqd5UaDOZFoa_bk6n0fo-LvVsWaQmWbzdYx+LaJikh6WykSw@mail.gmail.com
Whole thread Raw
In response to Re: Gsoc2012 Idea --- Social Network database schema  ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
Responses Re: Gsoc2012 Idea --- Social Network database schema
List pgsql-hackers
On Thu, Mar 22, 2012 at 12:38 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> Well, the standard syntax apparently aims to reduce the number of
>>> returned rows, which ORDER BY does not.  Maybe you could do it
>>> with ORDER BY .. LIMIT, but the idea here I think is that we'd
>>> like to sample the table without reading all of it first, so that
>>> seems to miss the point.
>>
>> I think actually the traditional locution is more like
>>       WHERE random() < constant
>> where the constant is the fraction of the table you want.  And
>> yeah, the presumption is that you'd like it to not actually read
>> every row.  (Though unless the sampling density is quite a bit
>> less than 1 row per page, it's not clear how much you're really
>> going to win.)
>
> It's all going to depend on the use cases, which I don't think I've
> heard described very well yet.
>
> I've had to pick random rows from, for example, a table of
> disbursements to support a financial audit.  In those cases it has
> been the sample size that mattered, and order didn't.  One
> interesting twist there is that for some of these financial audits
> they wanted the probability of a row being selected to be
> proportional to the dollar amount of the disbursement.  I don't
> think you can do this without a first pass across the whole data
> set.

This one was commonly called "Dollar Unit Sampling," though the
terminology has gradually gotten internationalized.
http://www.dummies.com/how-to/content/how-does-monetary-unit-sampling-work.html

What the article doesn't mention is that some particularly large items
might wind up covering multiple samples.  In the example, they're
looking for a sample every $3125 down the list.  If there was a single
transaction valued at $30000, that (roughly) covers 10 of the desired
samples.

It isn't possible to do this without scanning across the entire table.

If you want repeatability, you probably want to instantiate a copy of
enough information to indicate the ordering chosen.  That's probably
something that needs to be captured as part of the work of the audit,
so not only does it need to involve a pass across the data, it
probably requires capturing a fair bit of data for posterity.
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


pgsql-hackers by date:

Previous
From: "Kevin Grittner"
Date:
Subject: Re: Gsoc2012 Idea --- Social Network database schema
Next
From: Tom Lane
Date:
Subject: Re: Re: pg_stat_statements normalisation without invasive changes to the parser (was: Next steps on pg_stat_statements normalisation)