Thread: Table Clustering & Time Range Queries
I am running several servers with Postgres 8.3 that are used to house location data from thousands of devices. Location updates are quite frequent, so our tables rapidly become fairly large (often about 2GB per day of growth). We've been using Postgres for close to 10 years now and have been very happy until recent performance issues with larger data sets. Our primary location table is clustered by "reporttime" (bigint). Many of the queries we need to perform are of the nature : "get me all positions from a given device for yesterday". Similar queries are "get me the most recent 10 positions from a given device". These are pretty simple and straightforward queries that run surprisingly quickly either on small tables or tables that have been clustered by reporttime. If the tables are large and haven't been actively clustered for a while, the simplest looking queries will take close to a minute to execute. Unfortunately, the clustering operation now takes far too long to run in any reasonable maintenance window. On smaller datasets clustering will take from 30 minutes to an hour and on our large datasets several hours (gave up at our 4 hour maintenance window limit). Obviously I would love to have Postgres support "online" clustering, however I need to figure a way around this problem now or start planning the port to another database server. We have tried the "SELECT INTO ... ORDER BY REPORTTIME" trick instead or running cluster, but since the tables are quite large that is still taking too long (although quicker than clustering). I have spent more time than I would like looking into clustering options and other load balancing techniques. I was thinking that I'd like to take one server set down for clustering, while failing over to the secondary database set. The problem is I don't see an efficient or easy way to then synchronize the sizable amount of updates that will have occurred since the clustering started back to the now clustered primary database without significantly affecting write performance. If it is difficult to solve, porting to Oracle or *gasp* SQL Server will be an easier solution. I have spent a lot of time Googling, and so far no obvious solutions jump to mind besides either hoping (or helping) online clustering become a reality on Postgres, or to migrate to a different DB engine. I'd really appreciate any thoughts or suggestions. -Kevin
Kevin Buckham <kbuckham@applocation.net> wrote: > Our primary location table is clustered by "reporttime" (bigint). > Many of the queries we need to perform are of the nature : "get me > all positions from a given device for yesterday". Similar queries > are "get me the most recent 10 positions from a given device". Have you looked at table partitioning? You would then only need to cluster the most recent partition or two. I *seems* like a good fit for your application. http://www.postgresql.org/docs/8.4/interactive/ddl-partitioning.html -Kevin
On 10/22/09 12:25 PM, "Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote: > Kevin Buckham <kbuckham@applocation.net> wrote: > >> Our primary location table is clustered by "reporttime" (bigint). >> Many of the queries we need to perform are of the nature : "get me >> all positions from a given device for yesterday". Similar queries >> are "get me the most recent 10 positions from a given device". > > Have you looked at table partitioning? You would then only need to > cluster the most recent partition or two. I *seems* like a good fit > for your application. > > http://www.postgresql.org/docs/8.4/interactive/ddl-partitioning.html > > -Kevin Partitioning by time should help a lot here as Kevin says. Also, you might want to experiment with things like pg_reorg: http://reorg.projects.postgresql.org/ http://pgfoundry.org/projects/reorg/ http://reorg.projects.postgresql.org/pg_reorg.html Which is basically an online, optimized cluster or vacuum full. However it has several caveats. I have not used it in production myself, just experiments with it. > > -- > Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-performance >
> Also, you might want to experiment with things like > pg_reorg: Do you happen to know if that works with 8.4?
I came across links to pg_reorg previously but it seemed that the project was a bit "dead". There is active development but not much information, and not much in the way of discussions. I will definitely be testing both partitioning and pg_reorg. I am curious to see if pg_reorg will be stable enough for us to use or not. Thanks to everyone who provided answers for great and quick responses! Wow, it makes me really want to keep Postgres around. :) -Kevin On Thu, 2009-10-22 at 14:33 -0700, Scott Carey wrote: > > Partitioning by time should help a lot here as Kevin says. > > Also, you might want to experiment with things like pg_reorg: > http://reorg.projects.postgresql.org/ > http://pgfoundry.org/projects/reorg/ > http://reorg.projects.postgresql.org/pg_reorg.html > > Which is basically an online, optimized cluster or vacuum full. However it > has several caveats. I have not used it in production myself, just > experiments with it.
* Kevin Buckham (kbuckham@applocation.net) wrote: > I came across links to pg_reorg previously but it seemed that the > project was a bit "dead". There is active development but not much > information, and not much in the way of discussions. I will definitely > be testing both partitioning and pg_reorg. I am curious to see if > pg_reorg will be stable enough for us to use or not. > > Thanks to everyone who provided answers for great and quick responses! > Wow, it makes me really want to keep Postgres around. :) I've been following this but havn't commented since it seemed well in hand. A few specific things I would mention: Be sure to read: http://www.postgresql.org/docs/current/static/ddl-partitioning.html I'd recommend partitioning using inheiritance. Make sure to set constraint_exclusion = on unless you're using 8.4 (in 8.4, constraint_exclusion is tri-state: 'partition', where it will be used when UNION ALL or inheiritance is used in queries, 'on' where it will try to be used for all queries, and 'off' where it won't be used at all; 8.4's default is 'partition'). You may want to consider upgrading to 8.4 if you're not on it already. You probably want to use triggers on your 'input' table to handle incoming traffic. Decide on a sensible partitioning scheme and then test, test, test. Make sure it does what you want. explain analyze and all that. Enjoy, Stephen
Attachment
I'm surprised clustering as your main optimization has scaled up for you as long as it has, I normally see that approach fall apart once you're past a few hundred GB of data. You're putting a lot of work into a technique that only is useful for smaller data sets than you have now. There are two basic approaches to optimizing queries against large archives of time-series data that do scale up when you can use them: 1) Partition the tables downward until you reach a time scale where the working set fits in RAM. 2) Create materialized views that roll up the data needed for the most common reports people need run in real-time. Optimize when those run to keep overhead reasonable (which sounds possible given your comments about regular maintenance windows). Switch the app over to running against the materialized versions of any data it's possible to do so on. The two standard intros to this topic are at http://tech.jonathangardner.net/wiki/PostgreSQL/Materialized_Views and http://www.pgcon.org/2008/schedule/events/69.en.html From what you've said about your app, I'd expect both of these would be worth considering. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD