Re: SELECTing every Nth record for better performance - Mailing list pgsql-general

From Greg Smith
Subject Re: SELECTing every Nth record for better performance
Date
Msg-id 4B18E772.4050506@2ndquadrant.com
In response to SELECTing every Nth record for better performance  (Tom <tom@cstcomposites.com>)
List pgsql-general
Tom wrote:
> 1. taking the primary key (which is an auto increasing integer) and
> running modulo on it. This will not work in my case as the database is
> being used for a number of different logging 'runs', and these runs
> are not necessarily equally distributed over the primary keys.
>
Problem #1 with what you're trying to do is that it's tricky to get SQL
to expose a notion of "row number" in a result set so that you can then
filter on that number.  The best approach to this is PostgreSQL 8.4,
where the SQL window functions (row_number() in particular) can be used
for exactly this purpose.
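A minimal sketch of that window-function approach; the table and column
names (samples, run_id, ts) are assumptions for illustration, not from
your schema:

    -- Number the rows of one run in time order, then keep every 100th.
    SELECT *
    FROM (
        SELECT s.*,
               row_number() OVER (ORDER BY ts) AS rn
        FROM samples s
        WHERE run_id = 42
    ) numbered
    WHERE rn % 100 = 1;

Note that this still reads the whole run before throwing rows away, which
leads into the second problem.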

> 2. Subqueries that do SELECT COUNT to generate a row number are too
> slow as im dealing with thousands to tens of thousands of records.
>
Problem #2 is that if you're looking at only a mildly filtered version
of your data, you're going to pull in the whole set anyway.  Random data
point in this area:  if you run a query that needs to look at 20% of a
table's rows through an index, what plan will you get?  The likely
answer is "sequential scan of the entire table", because once the
percentage gets large enough that's actually more efficient than trying
to grab individual records through the index.  Because multiple records
get packed onto a single page, you're likely to read every page of the
data anyway even when grabbing only a subset of the rows, unless that
subset is very small relative to the table and you can traverse an index
usefully.
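You can watch the planner make that switch yourself; a quick check,
again assuming a hypothetical samples table with an index on ts:

    -- A selective predicate may use an index or bitmap scan; once the
    -- predicate matches a large share of the rows, expect "Seq Scan".
    EXPLAIN ANALYZE
    SELECT * FROM samples WHERE ts > now() - interval '1 hour';

    EXPLAIN ANALYZE
    SELECT * FROM samples WHERE ts > now() - interval '30 days';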

Once the "n" in your zoom gets large enough, it's possible to make this
worthwhile.  Unless your records are really wide, I would guess that it
would take a 1000:1 compression or more before you'd end up with a query
that's truly shorter than scanning the whole set.  And that wouldn't
work like what you're trying to do right now at all:  you'd instead have
to know the bounds of the data set, generate a sequence of points from
within that set, and then grab the records best matching those to get a
useful zoomed-out subset.  Basically, determine where the sequence of
records you need should be, then go into the data set to find just them
using something like "WHERE ts>x LIMIT 1"; that's the only way to not
scan the whole thing.
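One way that could look, as a sketch only (table name, run_id filter,
and the one-minute step are all assumptions; with an index on ts this
does one index lookup per generated point rather than a full scan):

    SELECT s.*
    FROM (
        SELECT g.point,
               (SELECT id
                  FROM samples
                 WHERE run_id = 42
                   AND ts >= g.point
                 ORDER BY ts
                 LIMIT 1) AS sample_id
        FROM generate_series(
                 (SELECT min(ts) FROM samples WHERE run_id = 42),
                 (SELECT max(ts) FROM samples WHERE run_id = 42),
                 interval '1 minute') AS g(point)
    ) pts
    JOIN samples s ON s.id = pts.sample_id;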

I think this whole approach isn't likely to ever converge on what you
want.  The direction I think you should be going is to consider whether
it's possible to create materialized views of your data that summarize
it at wider time scales.  You can't usefully compute such a thing in
real time without reading the whole data set, and once you realize that
you might as well figure out how to compute the summarized version only
once.  The last comment in this thread as I write this, from Grzegorz,
suggests one approach for something like that.
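For example, a precomputed rollup table along these lines (names and the
one-minute bucket size are assumptions; a summary table built with
CREATE TABLE AS stands in for a materialized view here, which PostgreSQL
doesn't provide natively in 8.4):

    -- One row per run per minute; zoomed-out views read this instead
    -- of the raw data.  Refresh it as new data arrives.
    CREATE TABLE samples_1min AS
    SELECT run_id,
           date_trunc('minute', ts) AS bucket,
           avg(value) AS avg_value,
           min(value) AS min_value,
           max(value) AS max_value,
           count(*)   AS n
    FROM samples
    GROUP BY run_id, date_trunc('minute', ts);

    CREATE INDEX samples_1min_idx ON samples_1min (run_id, bucket);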

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com

