Thread: Table Clustering & Time Range Queries

Table Clustering & Time Range Queries

From

Kevin Buckham

Date:

22 October 2009, 16:16:29

I am running several servers with Postgres 8.3 that are used to house
location data from thousands of devices.  Location updates are quite
frequent, so our tables rapidly become fairly large (often about 2GB per
day of growth).  We've been using Postgres for close to 10 years now and
have been very happy until recent performance issues with larger data
sets.

Our primary location table is clustered by "reporttime" (bigint).  Many
of the queries we need to perform are of the nature : "get me all
positions from a given device for yesterday".  Similar queries are "get
me the most recent 10 positions from a given device".

These are pretty simple and straightforward queries that run
surprisingly quickly either on small tables or tables that have been
clustered by reporttime.  If the tables are large and haven't been
actively clustered for a while, the simplest looking queries will take
close to a minute to execute.

Unfortunately, the clustering operation now takes far too long to run in
any reasonable maintenance window.  On smaller datasets clustering will
take from 30 minutes to an hour and on our large datasets several hours
(gave up at our 4 hour maintenance window limit).  Obviously I would
love to have Postgres support "online" clustering, however I need to
figure a way around this problem now or start planning the port to
another database server.

We have tried the "SELECT INTO ... ORDER BY REPORTTIME" trick instead or
running cluster, but since the tables are quite large that is still
taking too long (although quicker than clustering).

I have spent more time than I would like looking into clustering options
and other load balancing techniques.  I was thinking that I'd like to
take one server set down for clustering, while failing over to the
secondary database set.  The problem is I don't see an efficient or easy
way to then synchronize the sizable amount of updates that will have
occurred since the clustering started back to the now clustered primary
database without significantly affecting write performance.  If it is
difficult to solve, porting to Oracle or *gasp* SQL Server will be an
easier solution.

I have spent a lot of time Googling, and so far no obvious solutions
jump to mind besides either hoping (or helping) online clustering become
a reality on Postgres, or to migrate to a different DB engine.  I'd
really appreciate any thoughts or suggestions.

-Kevin

Re: Table Clustering & Time Range Queries

From

"Kevin Grittner"

Date:

22 October 2009, 16:25:21

Kevin Buckham <kbuckham@applocation.net> wrote:

> Our primary location table is clustered by "reporttime" (bigint).
> Many of the queries we need to perform are of the nature : "get me
> all positions from a given device for yesterday".  Similar queries
> are "get me the most recent 10 positions from a given device".

Have you looked at table partitioning?  You would then only need to
cluster the most recent partition or two.  I *seems* like a good fit
for your application.

http://www.postgresql.org/docs/8.4/interactive/ddl-partitioning.html

-Kevin

Re: Table Clustering & Time Range Queries

From

Scott Carey

Date:

22 October 2009, 18:34:14


On 10/22/09 12:25 PM, "Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:

> Kevin Buckham <kbuckham@applocation.net> wrote:
>
>> Our primary location table is clustered by "reporttime" (bigint).
>> Many of the queries we need to perform are of the nature : "get me
>> all positions from a given device for yesterday".  Similar queries
>> are "get me the most recent 10 positions from a given device".
>
> Have you looked at table partitioning?  You would then only need to
> cluster the most recent partition or two.  I *seems* like a good fit
> for your application.
>
> http://www.postgresql.org/docs/8.4/interactive/ddl-partitioning.html
>
> -Kevin

Partitioning by time should help a lot here as Kevin says.

Also, you might want to experiment with things like pg_reorg:
http://reorg.projects.postgresql.org/
http://pgfoundry.org/projects/reorg/
http://reorg.projects.postgresql.org/pg_reorg.html

Which is basically an online, optimized cluster or vacuum full.  However it
has several caveats.  I have not used it in production myself, just
experiments with it.


>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>

Re: Table Clustering & Time Range Queries

From

Scara Maccai

Date:

23 October 2009, 04:23:17

> Also, you might want to experiment with things like
> pg_reorg:


Do you happen to know if that works with 8.4?

Re: Table Clustering & Time Range Queries

From

Kevin Buckham

Date:

23 October 2009, 16:18:08

I came across links to pg_reorg previously but it seemed that the
project was a bit "dead".  There is active development but not much
information, and not much in the way of discussions. I will definitely
be testing both partitioning and pg_reorg.  I am curious to see if
pg_reorg will be stable enough for us to use or not.

Thanks to everyone who provided answers for great and quick responses!
Wow, it makes me really want to keep Postgres around. :)

-Kevin

On Thu, 2009-10-22 at 14:33 -0700, Scott Carey wrote:
>
> Partitioning by time should help a lot here as Kevin says.
>
> Also, you might want to experiment with things like pg_reorg:
> http://reorg.projects.postgresql.org/
> http://pgfoundry.org/projects/reorg/
> http://reorg.projects.postgresql.org/pg_reorg.html
>
> Which is basically an online, optimized cluster or vacuum full.  However it
> has several caveats.  I have not used it in production myself, just
> experiments with it.

Re: Table Clustering & Time Range Queries

From

Stephen Frost

Date:

23 October 2009, 19:55:53

* Kevin Buckham (kbuckham@applocation.net) wrote:
> I came across links to pg_reorg previously but it seemed that the
> project was a bit "dead".  There is active development but not much
> information, and not much in the way of discussions. I will definitely
> be testing both partitioning and pg_reorg.  I am curious to see if
> pg_reorg will be stable enough for us to use or not.
>
> Thanks to everyone who provided answers for great and quick responses!
> Wow, it makes me really want to keep Postgres around. :)

I've been following this but havn't commented since it seemed well in
hand.  A few specific things I would mention:

Be sure to read:
http://www.postgresql.org/docs/current/static/ddl-partitioning.html

I'd recommend partitioning using inheiritance.
Make sure to set constraint_exclusion = on unless you're using 8.4
(in 8.4, constraint_exclusion is tri-state: 'partition', where it will
be used when UNION ALL or inheiritance is used in queries, 'on' where it
will try to be used for all queries, and 'off' where it won't be used at
all; 8.4's default is 'partition').
You may want to consider upgrading to 8.4 if you're not on it already.
You probably want to use triggers on your 'input' table to handle
incoming traffic.
Decide on a sensible partitioning scheme and then test, test, test.
Make sure it does what you want.  explain analyze and all that.

    Enjoy,

        Stephen

Attachment

signature.asc

Re: Table Clustering & Time Range Queries

From

Greg Smith

Date:

24 October 2009, 21:22:35

I'm surprised clustering as your main optimization has scaled up for you
as long as it has, I normally see that approach fall apart once you're
past a few hundred GB of data.  You're putting a lot of work into a
technique that only is useful for smaller data sets than you have now.
There are two basic approaches to optimizing queries against large
archives of time-series data that do scale up when you can use them:

1) Partition the tables downward until you reach a time scale where the
working set fits in RAM.

2) Create materialized views that roll up the data needed for the most
common reports people need run in real-time.  Optimize when those run to
keep overhead reasonable (which sounds possible given your comments about
regular maintenance windows).  Switch the app over to running against the
materialized versions of any data it's possible to do so on.  The two
standard intros to this topic are at
http://tech.jonathangardner.net/wiki/PostgreSQL/Materialized_Views and
http://www.pgcon.org/2008/schedule/events/69.en.html

From what you've said about your app, I'd expect both of these would be
worth considering.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD