Thread: Selecting "sample" data from large tables.
I have a table with a decent number of rows (let's say for example a
billion rows). I am trying to construct a graph that displays the
distribution of that data. However, I don't want to read in the
complete data set (as reading a billion rows would take a while). Can
anyone think of a way to do this in postgresql? I've been looking
online and most of the stuff I've found has been for other databases.
As far as I can tell ANSI SQL doesn't provide for this scenario. I
could potentially write a function to do this, however I'd prefer not
to. But if that's what I'm going to be stuck doing I'd like to know
sooner rather than later.

Here's the description of the table:

    create table score (
        pageId       integer NOT NULL,
        ruleId       integer NOT NULL,
        score        double precision NULL,
        rowAddedDate bigint NULL,
        primary key (pageId, ruleId)
    );

I also have an index on rowAddedDate, which is just the number of
millis since the epoch (Jan 1, 1970 or so [Java-style timestamps]).
I'd be willing to accept that the rowAddedDate values are random
enough to represent a random sample.

Thanks in advance,
-- Joe T.
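One way to act on the "rowAddedDate is random enough" assumption above is to pull a few narrow time windows via the existing index, each of which is a cheap index range scan. This is only a sketch; the window boundaries below are hypothetical millisecond timestamps, and the sample is only as unbiased as the arrival times are:

```sql
-- Sample from two arbitrary 10-minute windows instead of scanning
-- a billion rows; each BETWEEN clause can use the rowAddedDate index.
-- The literal timestamps are placeholders, not real values.
SELECT score
FROM score
WHERE rowAddedDate BETWEEN 1086000000000 AND 1086000600000
   OR rowAddedDate BETWEEN 1086400000000 AND 1086400600000;
```

Picking more, smaller windows spread across the table's time range gives a sample that is less sensitive to bursts of writes in any one period.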
Joseph Turner <joseph.turner@oakleynetworks.com> writes:
> I have a table with a decent number of rows (let's say for example a
> billion rows). I am trying to construct a graph that displays the
> distribution of that data. However, I don't want to read in the
> complete data set (as reading a billion rows would take a while). Can
> anyone think of a way to do this in postgresql?

There is a fairly decent random-sampling engine inside ANALYZE, but no
way for the user to get at it :-(. Can you make any use of ANALYZE's
results, viz the pg_stats view?

			regards, tom lane
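For reference, the statistics ANALYZE gathers can be read back through the pg_stats system view Tom mentions; something along these lines would give an approximate distribution of the score column without touching the table at all (assuming ANALYZE has been run on it):

```sql
-- histogram_bounds divides the column's observed range into
-- roughly equal-population buckets; most_common_vals/freqs cover
-- the values too frequent to fit the histogram.
SELECT most_common_vals,
       most_common_freqs,
       histogram_bounds
FROM pg_stats
WHERE tablename = 'score'
  AND attname   = 'score';
```

The resolution is limited by the statistics target, but for a coarse distribution graph that may be all that is needed.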
> Joseph Turner <joseph.turner@oakleynetworks.com> writes:
> > I have a table with a decent number of rows (let's say for example a
> > billion rows). I am trying to construct a graph that displays the
> > distribution of that data. However, I don't want to read in the
> > complete data set (as reading a billion rows would take a while). Can
> > anyone think of a way to do this in postgresql?

One way would be to have an indexed column with random values in it.
Then you could use an index scan to pull out samples. However this has
a few downsides.

a) Index scans are a lot less efficient than sequential scans,
   effectively reducing the sample size you can get for a given amount
   of time even further. A 10% sample using this technique is probably
   almost as slow as reading the entire table, for example. If you only
   need 0.1%, though, this might be a good approach.

b) The data in the random column would have to be static, meaning
   multiple samples wouldn't be entirely independent. Depending on what
   you're doing with the samples this might or might not be a problem.

Oh, and if you're content with always using the same sample but want to
analyze it multiple different ways, you might want to use a partial
index. You could have partial indexes matching the ORDER BY and WHERE
clauses of your analyses, but with a WHERE clause of its own selecting
on the random data. Then make sure that WHERE clause is in every query.
But still, the sample has to be small enough that even using an index
scan you're winning over reading the entire table.

-- greg
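A sketch of the random-column-plus-partial-index idea above, using a hypothetical sample_key column added to the poster's score table (the 0.001 cutoff, giving roughly a 0.1% sample, is an arbitrary choice):

```sql
-- One-time setup: a static random key per row. The UPDATE is a full
-- rewrite of the table, so this is paid once, not per query.
ALTER TABLE score ADD COLUMN sample_key double precision;
UPDATE score SET sample_key = random();

-- Partial index covering only the sampled slice; it stays small
-- because only ~0.1% of rows satisfy its predicate.
CREATE INDEX score_sample_idx ON score (score)
    WHERE sample_key < 0.001;

-- Every analysis query must repeat the partial-index predicate
-- verbatim so the planner can match it to the index:
SELECT score
FROM score
WHERE sample_key < 0.001;
```

As noted above, the sample is fixed once sample_key is populated; re-running the UPDATE draws a fresh sample at the cost of rewriting the table again.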