Thread: Selecting "sample" data from large tables.

Selecting "sample" data from large tables.

From
Joseph Turner
Date:

I have a table with a decent number of rows (let's say, for example, a
billion rows).  I am trying to construct a graph that displays the
distribution of that data.  However, I don't want to read in the
complete data set (as reading a billion rows would take a while).  Can
anyone think of a way to do this in PostgreSQL?  I've been looking
online, and most of the stuff I've found has been for other databases.
As far as I can tell, ANSI SQL doesn't provide for this scenario.

I could potentially write a function to do this, but I'd prefer not
to.  If that's what I'm going to be stuck doing, though, I'd like to
know sooner rather than later.  Here's the description of the table:

create table score (
    pageId        Integer NOT NULL,
    ruleId        Integer NOT NULL,
    score         Double Precision NULL,
    rowAddedDate  BigInt NULL,
    primary key (pageId, ruleId)
);

I also have an index on rowAddedDate, which is just the number of
milliseconds since the epoch (Jan 1, 1970 [Java-style timestamps]).
I'd be willing to accept that the rowAddedDate values are random
enough to stand in for a random ordering.

Thanks in advance,
 -- Joe T.




Re: Selecting "sample" data from large tables.

From
Tom Lane
Date:
Joseph Turner <joseph.turner@oakleynetworks.com> writes:
> I have a table with a decent number of rows (let's say for example a
> billion rows).  I am trying to construct a graph that displays the
> distribution of that data.  However, I don't want to read in the
> complete data set (as reading a billion rows would take a while).  Can
> anyone think of a way to do this in PostgreSQL?

There is a fairly decent random-sampling engine inside ANALYZE, but
no way for the user to get at it :-(.  Can you make any use of
ANALYZE's results, viz the pg_stats view?
        regards, tom lane
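
For reference, pulling ANALYZE's estimates out of pg_stats might look
roughly like this; pg_stats is a standard system view, and the table
and column names come from the original post:

    ANALYZE score;

    SELECT attname, n_distinct, most_common_vals,
           most_common_freqs, histogram_bounds
      FROM pg_stats
     WHERE tablename = 'score'
       AND attname = 'score';

The histogram_bounds and most_common_vals columns give a rough picture
of the distribution without touching the billion-row table itself.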


Re: Selecting "sample" data from large tables.

From
Greg Stark
Date:
> Joseph Turner <joseph.turner@oakleynetworks.com> writes:
> > I have a table with a decent number of rows (let's say for example a
> > billion rows).  I am trying to construct a graph that displays the
> > distribution of that data.  However, I don't want to read in the
> > complete data set (as reading a billion rows would take a while).  Can
> > anyone think of a way to do this in PostgreSQL?

One way would be to have an indexed column with random values in it. Then you
could use an index scan to pull out samples.
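
For example, the setup might look something like this (just a sketch;
the sample_key column name, the index name, and the 0.1% cutoff are
made up for illustration):

    ALTER TABLE score ADD COLUMN sample_key double precision;
    UPDATE score SET sample_key = random();  -- itself a full pass over the table
    CREATE INDEX score_sample_key_idx ON score (sample_key);

    -- pull roughly a 0.1% sample via the index
    SELECT score, rowAddedDate
      FROM score
     WHERE sample_key < 0.001;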

However, this has a few downsides.

a) Index scans are a lot less efficient than sequential scans,
effectively reducing the sample size you can get in a given amount of
time even further.  A 10% sample using this technique is probably
almost as slow as reading the entire table, for example.  If you only
need 0.1%, though, this might be a good approach.

b) The data in the random column would have to be static, meaning
multiple samples wouldn't be entirely independent.  Depending on what
you're doing with the samples, this may or may not be a problem.

Oh, and if you're content with always using the same sample but want
to analyze it multiple different ways, you might want to use a partial
index.  You could have partial indexes matching the ORDER BY and WHERE
clauses of your analyses, but with a WHERE clause of their own
selecting on the random data.  Then make sure that WHERE clause is in
every query.
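
Something along these lines, assuming the sample_key column from the
sketch above and an analysis ordered by rowAddedDate (names again made
up for illustration):

    CREATE INDEX score_sample_by_date_idx
        ON score (rowAddedDate)
     WHERE sample_key < 0.001;

    SELECT score, rowAddedDate
      FROM score
     WHERE sample_key < 0.001   -- must match the index predicate
     ORDER BY rowAddedDate;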

But still, the sample has to be small enough that even using an index
scan you're winning over reading the entire table.

-- 
greg