Re: Selecting "sample" data from large tables. - Mailing list pgsql-sql

From Greg Stark
Subject Re: Selecting "sample" data from large tables.
Date
Msg-id 877jun3l9r.fsf@stark.xeocode.com
Whole thread Raw
In response to Re: Selecting "sample" data from large tables.  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-sql
> Joseph Turner <joseph.turner@oakleynetworks.com> writes:
> > I have a table with a decent number of rows (let's say for example a
> > billion rows).  I am trying to construct a graph that displays the
> > distribution of that data.  However, I don't want to read in the
> > complete data set (as reading a billion rows would take a while).  Can
> > anyone thing of a way to do this is postgresql?

One way would be to have an indexed column with random values in it. Then you
could use an index scan to pull out samples.

However this has a few downsides. 

a) index scans are a lot less efficient than sequential scans. Effectively
reducing the sample size you can get for a given amount of time even further.
a 10% sample using this technique is probably almost as slow as reading the
entire table, for example. If you only need .1% though this might be a good
approach.

b) the data in the random column would have to be static meaning multiple
samples wouldn't be entirely independent. Depending on what you're doing with
the samples this might be a problem or not.

Oh, and if you're content with always using the same sample but want to
analyze it multiple different ways, you might want to use a partial index. You
could have partial indexes matching the order by and where clauses of your
analyses but with a where clause of its own selecting on the random data. Then
make sure that where clause is in every query.

But still, the sample has to be small enough that even using an index scan
you're winning over reading the entire data.

-- 
greg



pgsql-sql by date:

Previous
From: Tom Lane
Date:
Subject: Re: Selecting "sample" data from large tables.
Next
From: Khairul Azmi
Date:
Subject: ERROR: duplicate key violates unique constraint