From: Kevin Grittner
Subject: Re: Gsoc2012 Idea --- Social Network database schema
Msg-id: 4F6B0F2902000025000465DA@gw.wicourts.gov
In response to: Re: Gsoc2012 Idea --- Social Network database schema (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Gsoc2012 Idea --- Social Network database schema (Christopher Browne <cbbrowne@gmail.com>)
List: pgsql-hackers

Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Well, the standard syntax apparently aims to reduce the number of
>> returned rows, which ORDER BY does not.  Maybe you could do it
>> with ORDER BY .. LIMIT, but the idea here I think is that we'd
>> like to sample the table without reading all of it first, so that
>> seems to miss the point.
> 
> I think actually the traditional locution is more like
>     WHERE random() < constant
> where the constant is the fraction of the table you want.  And
> yeah, the presumption is that you'd like it to not actually read
> every row.  (Though unless the sampling density is quite a bit
> less than 1 row per page, it's not clear how much you're really
> going to win.)
It's all going to depend on the use cases, which I don't think I've
heard described very well yet.
I've had to pick random rows from, for example, a table of
disbursements to support a financial audit.  In those cases it has
been the sample size that mattered, and order didn't.  One
interesting twist there is that for some of these financial audits
they wanted the probability of a row being selected to be
proportional to the dollar amount of the disbursement.  I don't
think you can do this without a first pass across the whole data
set.
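
A sketch of one way to do that weighted pick in a single full pass
(table and column names invented; assumes amount is positive):

    -- Efraimidis-Spirakis style sort key: rows with larger amounts
    -- tend to draw larger keys, so the top 100 form a sample biased
    -- in proportion to the dollar amount
    SELECT *
      FROM disbursements
     ORDER BY random() ^ (1.0 / amount::float8) DESC
     LIMIT 100;
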
I've also been involved in developing software to pick random people
for jury selection processes.  In some of these cases you don't want
a certain percentage; you want a particular number of people, and you
want the list to be ordered so that an initial set of potential
jurors can be seated from the top of the list, and then, as the voir
dire[1] process progresses, excused jurors can be replaced from
successive positions on the randomized list.
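
That shape is easy to express today, at the cost of scanning and
sorting the whole list (names invented):

    -- randomized ordered list; seat the panel from the top and take
    -- replacements from successive positions as jurors are excused
    SELECT person_id, name
      FROM jurors
     ORDER BY random()
     LIMIT 500;
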
In both cases you need to be able to explain the technique used in
lay terms and show why it is very hard for anyone to predict or
control which rows are chosen, but also use statistical analysis to
prove that there is no significant correlation between the rows
chosen and identifiable characteristics of the rows.  For example,
selecting all the rows from random blocks would be right out for
juror selection because a list from the state DOT of people with
driver's licenses and state ID cards would probably be in license
number order when loaded, and since the start of the driver's
license number is a soundex of the last name, there is a strong
correlation between blocks of the table and ethnicity.
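
The sort of sanity check I have in mind can be as simple as comparing
inclusion rates across an identifiable grouping (everything here is
invented):

    -- each group's rate should hover near n/N with no visible pattern
    SELECT v.county,
           avg(CASE WHEN s.person_id IS NOT NULL
                    THEN 1.0 ELSE 0 END) AS rate
      FROM source_list v
      LEFT JOIN sample s USING (person_id)
     GROUP BY v.county
     ORDER BY rate;
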
One technique which might be suitably random without reading the
whole table would be to figure out a maximum block number and tuple
ID for the table, and generate a series of random ctid values to
read.  If the tuple doesn't exist or is not visible to the snapshot,
you ignore it and continue until you have read the requisite number
of rows.  You could try to generate the ctids in advance and sort
them by block number, but then you have to decide what to do when
that set yields too many or too few rows; both cases have sticky
issues.
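
A rough sketch of the probing loop (assumes a hypothetical "people"
table and guesses 250 as a ceiling on tuples per page; the ceiling
has to cover the real maximum for the sample to stay uniform):

    DO $$
    DECLARE
      nblocks bigint := pg_relation_size('people')
                          / current_setting('block_size')::bigint;
      wanted  int := 100;
      hits    int := 0;
      r       people%ROWTYPE;
    BEGIN
      WHILE hits < wanted LOOP
        -- probe a uniformly random ctid; missing or invisible
        -- tuples simply fail the lookup and we try again
        SELECT * INTO r
          FROM people
         WHERE ctid = format('(%s,%s)',
                             trunc(random() * nblocks)::bigint,
                             1 + trunc(random() * 250)::int)::tid;
        IF FOUND THEN
          hits := hits + 1;
          -- collect r here; a real version would also have to
          -- reject duplicate ctids
        END IF;
      END LOOP;
    END $$;
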
-Kevin
[1] http://en.wikipedia.org/wiki/Voir_dire#Use_in_the_United_States


