Thread: Strategies/Best Practices for Handling Large Tables
Hi,
I currently have a table that is growing very quickly - i.e. 7 million records in 5 days. This table acts as a placeholder for statistics, and hence the records are merely inserted and never updated or deleted.
Many queries are run on this table to obtain trend analysis. However, these queries are now starting to take a very long time (hours) to execute due to the size of the table.
I have put indexes on this table, to no significant benefit. Some of the other strategies I have thought of:
1. Purge old data
2. Reindex
3. Partition
4. Creation of daily, monthly, yearly summary tables that contain aggregated data specific to the statistics required
Does anyone know what is the best practice to handle this situation?
I would appreciate knowledge sharing on the pros and cons of the above, or if there are any other strategies that I could put in place.
Thanking you in advance.
On Fri, Oct 12, 2012 at 7:44 AM, Chitra Creta <chitracreta@gmail.com> wrote:
> I currently have a table that is growing very quickly - i.e. 7 million
> records in 5 days. [...]
> I would appreciate knowledge sharing on the pros and cons of the above, or
> if there are any other strategies that I could put in place.

Partitioning is prolly your best solution. 3 & 4 sound like variations on the same thing. Before you go that route, you should make sure that your bottleneck is really a result of the massive amount of data, and not some other problem. Are you sure that the indices you created are being used, and that you have all the indices that you need for your queries? Look at the query plan output from EXPLAIN, and/or post here if you're unsure.

Reindexing shouldn't make a difference unless something is wrong with the indices that you already have in place.

Purging old data is only a good solution if you do not need the data, and never will need it.
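For reference, checking whether those indices are actually used is a matter of prefixing a representative query with EXPLAIN (or EXPLAIN ANALYZE to run it and time each plan node). A minimal sketch; the stats table and its metric, recorded_at, and value columns are made-up stand-ins for the poster's schema, and they are reused in the later sketches below:

    -- Look for an Index Scan or Bitmap Index Scan on the large table in the
    -- output, rather than a Seq Scan over all 7+ million rows.
    EXPLAIN ANALYZE
    SELECT metric, count(*), sum(value)
    FROM   stats
    WHERE  recorded_at >= DATE '2012-10-01'
    GROUP  BY metric;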
Hi,

On 13 October 2012 01:44, Chitra Creta <chitracreta@gmail.com> wrote:
> I currently have a table that is growing very quickly - i.e. 7 million
> records in 5 days. [...]
> Many queries are run on this table to obtain trend analysis. However, these
> queries are now starting to take a very long time (hours) to execute due to
> the size of the table.

Have you changed the autovacuum settings to make it more aggressive? Another option is to run ANALYZE after loading.

> 1. Purge old data
> 3. Partition

Those two go together. Partitioning is useful if you can constrain queries to specific ranges, i.e. "this query needs the last two days". You shouldn't go over 200 - 300 partitions per table. Partition granularity should match the amount of data in an average query: if you run weekly queries then use weekly partitions (bi-weekly or daily partitions might work, but I do not have good experience with them). Purging old data is easy because you only need to drop the unwanted partitions (no table/index bloat). Loading is a little bit tricky because you have to load data into the right partition.

> 4. Creation of daily, monthly, yearly summary tables that contain
> aggregated data specific to the statistics required

I think this is the way to hell. You start with a few tables and then you add more until you realise that it takes longer to update them than to run your queries :)

You might benefit from query parallelisation, for example pgpool-II, Stado, Postgres-XC, or not use Postgres at all; any column-oriented NoSQL database might be a good choice.

--
Ondrej Ivanic
(ondrej.ivanic@gmail.com)
(http://www.linkedin.com/in/ondrejivanic)
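For illustration, a minimal sketch of the trigger/inheritance style of partitioning being discussed here, reusing the hypothetical stats table from the sketch above. Each month gets its own child table with a CHECK constraint on the partition key, and purging a month is then a cheap DROP rather than a huge DELETE:

    -- Parent table holds no data of its own once routing is in place.
    CREATE TABLE stats (
        metric      text        NOT NULL,
        recorded_at timestamptz NOT NULL,
        value       numeric     NOT NULL
    );

    -- One child per month; the CHECK constraint lets the planner skip
    -- partitions that cannot match a query's date range.
    CREATE TABLE stats_2012_10 (
        CHECK (recorded_at >= DATE '2012-10-01'
               AND recorded_at <  DATE '2012-11-01')
    ) INHERITS (stats);

    CREATE INDEX stats_2012_10_recorded_at_idx
        ON stats_2012_10 (recorded_at);

    -- Purging an old month later on:
    -- DROP TABLE stats_2012_10;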
On Sat, Oct 13, 2012 at 01:44:02AM +1100, Chitra Creta wrote:
> I have put indexes on this table, to no significant benefit. Some of the
> other strategies I have thought of:
> 1. Purge old data
> 2. Reindex
> 3. Partition
> 4. Creation of daily, monthly, yearly summary tables that contain
> aggregated data specific to the statistics required
>
> Does anyone know what is the best practice to handle this situation?

Three and four will probably be your biggest wins. We do both. There are trade-offs for both. If you need the absolute best in response times (e.g., in a web application), summary tables are the way to go. If you're regularly querying data in a specific date range, but response times aren't as critical (e.g., a daily report generated at the end of day), partitioning will also help there. Having the correct indexes is crucial, as always. Additionally, you will be able to purge old data relatively quickly and without bloat using the partitioning method.

-Ryan Kelly
On 10/12/12 7:44 AM, Chitra Creta wrote:
> 1. Purge old data
> 2. Reindex
> 3. Partition
> 4. Creation of daily, monthly, yearly summary tables that contain
> aggregated data specific to the statistics required

If most of your queries read the majority of the table, indexing will be of little help.

Partitioning will aid in purging old data, as you can partition by date (for instance, by week) and drop whole partitions rather than deleting individual records.

Aggregate tables will likely be the biggest win for your statistics if they reduce the amount of data you need to query.

--
john r pierce                  N 37, W 122
santa cruz ca                  mid-left coast
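A hedged sketch of the aggregate-table idea, again using the hypothetical stats table: a nightly job folds yesterday's raw rows into one row per metric per day, so trend queries can read the small summary table instead of the raw data.

    CREATE TABLE stats_daily (
        day         date    NOT NULL,
        metric      text    NOT NULL,
        row_count   bigint  NOT NULL,
        total_value numeric NOT NULL,
        PRIMARY KEY (day, metric)
    );

    -- Run nightly (e.g. from cron) to aggregate yesterday's rows.
    INSERT INTO stats_daily (day, metric, row_count, total_value)
    SELECT recorded_at::date, metric, count(*), sum(value)
    FROM   stats
    WHERE  recorded_at >= current_date - 1
    AND    recorded_at <  current_date
    GROUP  BY recorded_at::date, metric;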
Thank you all for your suggestions. Since all of you recommended the Partition option, I decided to try it out myself.
I found a function that created partition tables for every month in the last two years. A trigger was also added to the parent table to ensure that every insert into it from now on is routed to the appropriate child table.
However, there were a few observations that I made which I would appreciate your comments on:
1. Since existing data was in the parent table, I had to do a pg_dump on it, drop it, and then do a restore on it to force the trigger to work on the existing data. Is this how partitioning existing data should be done?
2. I noticed that there are two copies of the same record - i.e. the one that was inserted into the parent table and another that was inserted into the child table. If I delete the record in the parent table, the child record gets deleted automatically. I was under the impression that partitioning meant that my parent table would not be large anymore because the data would be moved to smaller child tables. Is this the case?
3. Is there a way for me to evaluate the effectiveness of the partitioned table? Would performing an EXPLAIN ANALYZE allow me to determine whether querying the parent table for statistics is quicker than querying against a massive non-partitioned table?
Thank you.
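The partitioning function and trigger Chitra used are not shown in the thread. The sketch below shows the usual shape of such a routing trigger, under the same hypothetical schema as the earlier sketches; only the October 2012 branch is written out, whereas generated functions normally cover every existing partition:

    CREATE OR REPLACE FUNCTION stats_insert_router() RETURNS trigger AS $$
    BEGIN
        IF NEW.recorded_at >= DATE '2012-10-01'
           AND NEW.recorded_at < DATE '2012-11-01' THEN
            INSERT INTO stats_2012_10 VALUES (NEW.*);
        ELSE
            RAISE EXCEPTION 'no partition for recorded_at = %', NEW.recorded_at;
        END IF;
        RETURN NULL;  -- NULL keeps the row out of the parent table itself
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER stats_insert_trigger
        BEFORE INSERT ON stats
        FOR EACH ROW EXECUTE PROCEDURE stats_insert_router();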
On Tue, Oct 16, 2012 at 09:26:09PM +1100, Chitra Creta wrote:
> 1. Since existing data was in the parent table, I had to do a pg_dump on
> it, drop it, and then do a restore on it to force the trigger to work on
> the existing data. Is this how partitioning existing data should be done?

I just wrote a one-time function to move it.

> 2. I noticed that there are two copies of the same record - i.e. the one
> that was inserted into the parent table and another that was inserted into
> the child table. [...] Is this the case?

The data *is* in the child tables. Queries on the parent table, by default, affect data in the child tables, so issuing a SELECT against your parent table will also query the child tables. DELETE will, similarly, delete data in the child tables. You may target just the parent table using ONLY, e.g. SELECT * FROM ONLY foo. This behavior is also controlled by the GUC sql_inheritance, though I encourage you not to change that value. To get a better idea of what is happening, look at the output from EXPLAIN to see all the tables that are being included in your plan.

> 3. Is there a way for me to evaluate the effectiveness of the partitioned
> table? Would performing an EXPLAIN ANALYZE allow me to determine whether
> querying the parent table for statistics is quicker than querying against
> a massive non-partitioned table?

Well, you can do it with EXPLAIN ANALYZE, or you can do it by timing your query so that the overhead of EXPLAIN ANALYZE does not come into play.

Also, I assume you've read this:
http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html

That link will be helpful in understanding how partitioning could benefit you.

-Ryan Kelly
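Ryan's one-time move function is not shown in the thread either. A minimal sketch of the idea, assuming the same hypothetical stats parent and stats_2012_10 child, is just an INSERT into the child followed by a DELETE against the parent alone (note the ONLY keyword), wrapped in one transaction:

    BEGIN;

    INSERT INTO stats_2012_10
    SELECT * FROM ONLY stats
    WHERE recorded_at >= DATE '2012-10-01'
      AND recorded_at <  DATE '2012-11-01';

    DELETE FROM ONLY stats
    WHERE recorded_at >= DATE '2012-10-01'
      AND recorded_at <  DATE '2012-11-01';

    COMMIT;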
On Fri, Oct 12, 2012 at 10:00 AM, Lonni J Friedman <netllama@gmail.com> wrote:
> On Fri, Oct 12, 2012 at 7:44 AM, Chitra Creta <chitracreta@gmail.com> wrote:
>> [...]
>
> Partitioning is prolly your best solution. 3 & 4 sound like
> variations on the same thing. Before you go that route, you should
> make sure that your bottleneck is really a result of the massive
> amount of data, and not some other problem. Are you sure that the
> indices you created are being used, and that you have all the indices
> that you need for your queries? Look at the query plan output from
> EXPLAIN, and/or post here if you're unsure.

Partitioning is not a strategy to improve query performance unless you are exploiting the data structure in some way through the partition.

merlin
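To make Merlin's point concrete: with CHECK constraints on the children and constraint exclusion enabled, a query whose WHERE clause filters on the partition key lets the planner skip irrelevant partitions, while a query without such a predicate still scans every child. A sketch, with the same hypothetical names as above:

    SET constraint_exclusion = 'partition';  -- the default setting

    -- The WHERE clause matches the children's CHECK constraints, so the plan
    -- should show only stats_2012_10 (plus the empty parent) being scanned.
    EXPLAIN
    SELECT count(*)
    FROM   stats
    WHERE  recorded_at >= DATE '2012-10-01'
      AND  recorded_at <  DATE '2012-11-01';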
On 10/12/2012 09:44 AM, Chitra Creta wrote:
> 4. Creation of daily, monthly, yearly summary tables that contain
> aggregated data specific to the statistics required

If the partitioning doesn't help you, I strongly urge you to build one or more fact tables with appropriate dimensions. If you don't know what these terms mean, there are lots of books on the subject. They're very versatile for producing fast reports on varying inputs. You can also "layer" them by having cascading levels of aggregation from day -> month -> year, and so on. These kinds of reporting structures are perfect for huge data accumulation warehouses.

The book most recommended to me back in the day was The Data Warehouse Toolkit, and I can also vouch that it covers these subjects pretty well.

I still recommend partitioning simply due to maintenance overhead, but you'll want to look into this too.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas@optionshouse.com

______________________________________________
See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email
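A sketch of the cascading aggregation Shaun describes, rolling the hypothetical stats_daily table from the earlier sketch up into a monthly layer:

    CREATE TABLE stats_monthly (
        month       date    NOT NULL,  -- first day of the month
        metric      text    NOT NULL,
        row_count   bigint  NOT NULL,
        total_value numeric NOT NULL,
        PRIMARY KEY (month, metric)
    );

    -- Run once a month to fold the previous month's daily rows upward.
    INSERT INTO stats_monthly (month, metric, row_count, total_value)
    SELECT date_trunc('month', day)::date, metric,
           sum(row_count), sum(total_value)
    FROM   stats_daily
    WHERE  day >= date_trunc('month', current_date - interval '1 month')
    AND    day <  date_trunc('month', current_date)
    GROUP  BY date_trunc('month', day)::date, metric;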
On Fri, Oct 12, 2012 at 7:44 AM, Chitra Creta <chitracreta@gmail.com> wrote:
> I currently have a table that is growing very quickly - i.e. 7 million
> records in 5 days. [...]
> 4. Creation of daily, monthly, yearly summary tables that contain
> aggregated data specific to the statistics required
>
> Does anyone know what is the best practice to handle this situation?
The answer is, well, it depends. Possibly some combination.

One approach I like, which may be included in #4 but not necessarily, is the idea of summary tables which contain snapshots of the data, allowing you to roll forward or backward from defined points. This is what I call the log, aggregate, and snapshot approach. But it really depends on what you are doing, and there is no one-size-fits-all approach at this volume.

Instead of reindexing, I would suggest also looking into partial indexes.

Best Wishes,
Chris Travers
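A small sketch of the partial-index suggestion, under the same hypothetical schema. The index stays small because it only covers recent rows, and the planner will use it only for queries whose WHERE clause implies the index predicate:

    -- Only rows from October 2012 onwards are indexed, so the index stays
    -- much smaller than the table. (With inheritance partitioning, the index
    -- would be created on each child table rather than on the parent.)
    CREATE INDEX stats_recent_metric_idx
        ON stats (metric, recorded_at)
        WHERE recorded_at >= DATE '2012-10-01';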
Thank you all.

Ryan, would you mind sharing your one-time function to move it?

Merlin, what are your suggestions to improve query performance?

Shaun, thank you. I will look into facts and dimensions should all else fail.

Chris, would you mind giving me an example of what you mean by your log, aggregate and snapshot approach? Also, with indexing, I believe composite and partial indexes are better than plain single-column indexes, am I correct? Do you have any recommendations as to which type (e.g. btree, hash) is better for which situations?
Best Wishes,
On Sun, Oct 21, 2012 at 5:46 AM, Chitra Creta <chitracreta@gmail.com> wrote:
> Chris, would you mind giving me an example of what you mean by your log,
> aggregate and snapshot approach? [...]

Sure. Suppose I have an accounting system.

I may record the amounts of the transactions in a journal_entry and journal_line table. These will be write once, read many. However, over time you will end up having to digest millions of records (given sufficient volume) to find out the balance of a checking account, and this is not really ideal.

So to deal with this, I might, for example, add a table called account_checkpoint which might have the following fields:

account_id
end_date
debits
credits
balance

And then I can snapshot, on closing of the books, the accumulated debits, credits, and balance to date. If I need any of these numbers I can just grab the appropriate ones from account_checkpoint and roll forward from end_date. If I have too much volume I can have closings at a monthly level or whatever.

The checkpoint table contains sufficient information for me to start a report at any point and end it at any other point without having to scan the intervening checkpointed periods. Additionally, if I want to purge old data, I can do so without losing current balance information.

So what this approach does, in essence, is give you a way to purge without losing the aggregated information, and a way to skip portions of the table for aggregation purposes that you cannot skip otherwise. The big thing is that you cannot insert (and if this is in doubt, you need to enforce it with a trigger) any records effective before the most recent checkpoint.
Best Wishes,
Chris Travers
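A sketch of Chris's checkpoint idea in SQL. The account_checkpoint columns come from his message; the column types and the journal_line columns used in the roll-forward query are assumptions made only for illustration:

    CREATE TABLE account_checkpoint (
        account_id integer NOT NULL,
        end_date   date    NOT NULL,
        debits     numeric NOT NULL,
        credits    numeric NOT NULL,
        balance    numeric NOT NULL,
        PRIMARY KEY (account_id, end_date)
    );

    -- Current balance for account 42: take the latest checkpoint and roll
    -- forward only the journal lines recorded after it.
    SELECT cp.balance + coalesce(sum(jl.amount), 0) AS current_balance
    FROM   account_checkpoint cp
    LEFT   JOIN journal_line jl
           ON  jl.account_id = cp.account_id
           AND jl.entry_date > cp.end_date
    WHERE  cp.account_id = 42
    AND    cp.end_date = (SELECT max(end_date)
                          FROM   account_checkpoint
                          WHERE  account_id = 42)
    GROUP  BY cp.balance;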
Thanks for your example, Chris. I will look into it as a long-term solution.

Partitioning tables as a strategy worked very well indeed. This will be my short/medium-term solution.

Another strategy that I would like to evaluate as a short/medium-term solution is archiving old records in a table before purging them.

I am aware that Oracle has a tool that allows records to be exported into a file / archive table before purging them. They also provide a tool to import these records.

Does PostgreSQL have similar tools to export to a file and re-import?

If PostgreSQL does not have a tool to do this, does anyone have any ideas on what file format (e.g. a text file with a header of column names and one row per record) would be ideal for easy re-importing into a PostgreSQL table?

Thank you for your ideas.
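One way to do this kind of archive-and-purge in stock PostgreSQL is COPY, which can write a query's result to a CSV file with a header row and load it back later (psql's \copy does the same thing from the client side, without needing server file access). A hedged sketch, using the hypothetical stats table from the earlier sketches:

    -- Archive September 2012 to a server-side CSV file, then purge it.
    COPY (SELECT * FROM stats
          WHERE recorded_at >= DATE '2012-09-01'
            AND recorded_at <  DATE '2012-10-01')
    TO '/var/backups/stats_2012_09.csv' WITH CSV HEADER;

    DELETE FROM stats
    WHERE recorded_at >= DATE '2012-09-01'
      AND recorded_at <  DATE '2012-10-01';

    -- Re-import later if the archived data is needed again.
    COPY stats FROM '/var/backups/stats_2012_09.csv' WITH CSV HEADER;

With partitioning in place, the DELETE could instead be a DROP TABLE of the archived child partition.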