Thread: Strategies/Best Practises Handling Large Tables

Strategies/Best Practises Handling Large Tables

From
Chitra Creta
Date:

Hi,

I currently have a table that is growing very quickly - i.e. 7 million records in 5 days. This table acts as a placeholder for statistics, and hence the records are merely inserted and never updated or deleted.

Many queries are run on this table to obtain trend analysis. However, these queries are now starting to take a very long time (hours) to execute due to the size of the table.

I have put indexes on this table, to no significant benefit.  Some of the other strategies I have thought of:
1. Purge old data
2. Reindex
3. Partition
4. Creation of daily, monthly, yearly summary tables that contain aggregated data specific to the statistics required

Does anyone know what is the best practice to handle this situation?

I would appreciate knowledge sharing on the pros and cons of the above, or if there are any other strategies that I could put in place.

Thanking you in advance.

Re: Strategies/Best Practises Handling Large Tables

From
Lonni J Friedman
Date:
On Fri, Oct 12, 2012 at 7:44 AM, Chitra Creta <chitracreta@gmail.com> wrote:
> Hi,
>
> I currently have a table that is growing very quickly - i.e 7 million
> records in 5 days. This table acts as a placeholder for statistics, and
> hence the records are merely inserted and never updated or deleted.
>
> Many queries are run on this table to obtain trend analysis. However, these
> queries are now starting to take a very long time (hours) to execute due to
> the size of the table.
>
> I have put indexes on this table, to no significant benefit.  Some of the
> other strategies I have thought of:
> 1. Purge old data
> 2. Reindex
> 3. Partition
> 4. Creation of daily, monthly, yearly summary tables that contains
> aggregated data specific to the statistics required
>
> Does anyone know what is the best practice to handle this situation?
>
> I would appreciate knowledge sharing on the pros and cons of the above, or
> if there are any other strategies that I could put in place.

Partitioning is probably your best solution.  3 & 4 sound like
variations on the same thing.  Before you go that route, you should
make sure that your bottleneck is really a result of the massive
amount of data, and not some other problem.  Are you sure that the
indices you created are being used, and that you have all the indices
that you need for your queries?  Look at the query plan output from
EXPLAIN, and/or post here if you're unsure.
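
For example, a quick sanity check is to look at the plan for one of the
slow trend queries and at the index usage counters (the table and column
names below are only assumptions for illustration):

  EXPLAIN
  SELECT metric, count(*)
  FROM stats
  WHERE recorded_at >= now() - interval '7 days'
  GROUP BY metric;
  -- "Seq Scan on stats" here means the recorded_at index is not being used.

  -- has each index ever been used at all?
  SELECT indexrelname, idx_scan
  FROM pg_stat_user_indexes
  WHERE relname = 'stats';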

Reindexing shouldn't make a difference unless something is wrong with
the indices that you already have in place.

Purging old data is only a good solution if you do not need the data,
and never will need the data.


Re: Strategies/Best Practises Handling Large Tables

From
Ondrej Ivanič
Date:
Hi,

On 13 October 2012 01:44, Chitra Creta <chitracreta@gmail.com> wrote:
> I currently have a table that is growing very quickly - i.e 7 million
> records in 5 days. This table acts as a placeholder for statistics, and
> hence the records are merely inserted and never updated or deleted.
>
> Many queries are run on this table to obtain trend analysis. However, these
> queries are now starting to take a very long time (hours) to execute due to
> the size of the table.

Have you changed the autovacuum settings to make it more aggressive?
Another option is to run ANALYZE after loading.
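
A minimal sketch of both options (the table name and the threshold are made up):

  -- refresh planner statistics right after a bulk load:
  ANALYZE stats;

  -- or make autovacuum analyze this table more aggressively than the default:
  ALTER TABLE stats SET (autovacuum_analyze_scale_factor = 0.01);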

>
> I have put indexes on this table, to no significant benefit.  Some of the
> other strategies I have thought of:
> 1. Purge old data
> 3. Partition

Those two go together. Partitioning is useful if you can constrain
queries to specific ranges, i.e. this query needs only the last two days.
You shouldn't go over 200 - 300 partitions per table. Partition
granularity should match the amount of data in an average query:
if you run weekly queries then use weekly partitions (bi-weekly or
daily partitions might work, but I do not have good experience with them).

It is easy to purge old data because you only need to drop the unwanted
partitions (no table/index bloat). Loading is a little bit tricky
because you have to load the data into the right partition.
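
A rough sketch of what a weekly inheritance partition plus insert trigger
looks like (all table, column, and date values below are made up for
illustration):

  CREATE TABLE stats_2012_w42 (
      CHECK (recorded_at >= DATE '2012-10-15'
         AND recorded_at <  DATE '2012-10-22')
  ) INHERITS (stats);

  CREATE OR REPLACE FUNCTION stats_insert_router() RETURNS trigger AS $$
  BEGIN
      IF NEW.recorded_at >= DATE '2012-10-15'
         AND NEW.recorded_at < DATE '2012-10-22' THEN
          INSERT INTO stats_2012_w42 VALUES (NEW.*);
      ELSE
          RAISE EXCEPTION 'no partition for %', NEW.recorded_at;
      END IF;
      RETURN NULL;  -- keep the row out of the parent table
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER stats_insert_trg
      BEFORE INSERT ON stats
      FOR EACH ROW EXECUTE PROCEDURE stats_insert_router();

  -- purging that week later is just:
  DROP TABLE stats_2012_w42;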

> 4. Creation of daily, monthly, yearly summary tables that contains
> aggregated data specific to the statistics required

I think this is the way to hell. You start with a few tables and then you
add more tables until you realise that it takes longer to update them
than to run your queries :)

You might benefit from query parallelisation, for example pgpool-II,
Stado, or Postgres-XC, or you might not use Postgres at all. For example,
any column-oriented NoSQL database might be a good choice.

--
Ondrej Ivanic
(ondrej.ivanic@gmail.com)
(http://www.linkedin.com/in/ondrejivanic)


Re: Strategies/Best Practises Handling Large Tables

From
Ryan Kelly
Date:
On Sat, Oct 13, 2012 at 01:44:02AM +1100, Chitra Creta wrote:
> Hi,
>
> I currently have a table that is growing very quickly - i.e 7 million
> records in 5 days. This table acts as a placeholder for statistics, and
> hence the records are merely inserted and never updated or deleted.
>
> Many queries are run on this table to obtain trend analysis. However, these
> queries are now starting to take a very long time (hours) to execute due to
> the size of the table.
>
> I have put indexes on this table, to no significant benefit.  Some of the
> other strategies I have thought of:
> 1. Purge old data
> 2. Reindex
> 3. Partition
> 4. Creation of daily, monthly, yearly summary tables that contains
> aggregated data specific to the statistics required
>
> Does anyone know what is the best practice to handle this situation?
Three and four will probably be your biggest wins. We do both. There are
trade-offs for both. If you need the absolute best in response times
(e.g., in a web application), summary tables are the way to go. If
you're regularly querying data in a specific date range, but response
times aren't as critical (e.g., daily report generated at the end of
day), partitioning will also help there.
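
A hedged sketch of the summary-table side of that (the raw table, column
names, and schedule are assumptions):

  CREATE TABLE stats_daily (
      day    date   NOT NULL,
      metric text   NOT NULL,
      total  bigint NOT NULL,
      PRIMARY KEY (day, metric)
  );

  -- run once a day (e.g. from cron) to roll up yesterday's rows:
  INSERT INTO stats_daily (day, metric, total)
  SELECT recorded_at::date, metric, count(*)
  FROM stats
  WHERE recorded_at >= current_date - 1
    AND recorded_at <  current_date
  GROUP BY 1, 2;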

Having the correct indexes is crucial, as always.

Additionally, you will be able to purge old data relatively quickly and
without bloat using the partitioning method.

> I would appreciate knowledge sharing on the pros and cons of the above, or
> if there are any other strategies that I could put in place.
>
> Thanking you in advance.

-Ryan Kelly


Re: Strategies/Best Practises Handling Large Tables

From
John R Pierce
Date:
On 10/12/12 7:44 AM, Chitra Creta wrote:
>
> 1. Purge old data
> 2. Reindex
> 3. Partition
> 4. Creation of daily, monthly, yearly summary tables that contains
> aggregated data specific to the statistics required
>

if most of your queries read the majority of the table, indexing will
be of little help.

partitioning will aid in purging old data, as you can partition by date
(for instance, by week) and drop whole partitions rather than deleting
individual records.

aggregate tables likely will be the biggest win for your statistics if
they reduce the amount of data you need to query.


--
john r pierce                            N 37, W 122
santa cruz ca                         mid-left coast



Re: Strategies/Best Practises Handling Large Tables

From
Chitra Creta
Date:

Thank you all for your suggestions. Since all of you recommended the Partition option, I decided to try it out myself.

I found a function that created partition tables for every month in the last two years. A trigger was also added to the parent table to ensure that every insert into it will henceforth go into the appropriate child table.

However, there were a few observations that I made which I would appreciate your comments on:

1. Since existing data was in the parent table, I had to do a pg_dump on it, drop it, and then do a restore on it to force the trigger to work on the existing data. Is this how partitioning existing data should be done?

2. I noticed that there are two copies of the same record - i.e. the one that was inserted into the parent table and another that was inserted in the child table. If I delete the record in the parent table, the child record gets automatically deleted. I was under the impression that partitioning meant that my parent table would not be large anymore because the data would be moved to smaller child tables. Is this the case?

3. Is there a way for me to evaluate the effectiveness of the partitioned table? Would performing an EXPLAIN ANALYZE allow me to determine whether querying the parent table for statistics is quicker than querying against a massive non-partitioned table?

Thank you.

On Oct 13, 2012 3:49 AM, "John R Pierce" <pierce@hogranch.com> wrote:
On 10/12/12 7:44 AM, Chitra Creta wrote:

1. Purge old data
2. Reindex
3. Partition
4. Creation of daily, monthly, yearly summary tables that contains aggregated data specific to the statistics required


if most of your queries read the majority of the table, indexing will be of little help.

partitioning will aid in purging old data, as you can partition by date (for instance, by week) and drop whole partitions rather than deleting individual records.

aggregate tables likely will be the biggest win for your statistics if they reduce the amount of data you need to query.


--
john r pierce                            N 37, W 122
santa cruz ca                         mid-left coast




Re: Strategies/Best Practises Handling Large Tables

From
Ryan Kelly
Date:
On Tue, Oct 16, 2012 at 09:26:09PM +1100, Chitra Creta wrote:
> Thank you all for your suggestions. Since all of you recommended the
> Partition option, I decided to try it out myself.
>
> I found a function that created partition tables for every month in the
> last two years. A trigger was also added to the parent table to ensure that
> every insert into it will henceforth go into the appropriate child
> table.
>
> However, there were a few observations that I made which I would appreciate
> your comments on:
>
> 1. Since existing data was in the parent table, I had to do a pg_dump on
> it, drop it, and then to a restore on it to force the trigger to work on
> existing data. Is this how partitioning existing data should be done?
I just wrote a one-time function to move it.
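
Something along these lines (a sketch of the general idea only; the names
match the partition example earlier in the thread and are assumptions):

  BEGIN;
  INSERT INTO stats_2012_w42
  SELECT * FROM ONLY stats
  WHERE recorded_at >= DATE '2012-10-15'
    AND recorded_at <  DATE '2012-10-22';

  DELETE FROM ONLY stats
  WHERE recorded_at >= DATE '2012-10-15'
    AND recorded_at <  DATE '2012-10-22';
  COMMIT;

  -- repeat (or loop inside a plpgsql function) for each partition's range.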

> 2. I noticed that there are two copies of the same record - i.e the one
> that was inserted into the parent table and another that was inserted in
> the child table. If I delete the record in the parent table, the child
> record gets automatically deleted. I was under the impression that
> partitioning meant that my parent table will not be large anymore because
> the data will be moved to smaller child tables. Is this the case?
The data *is* in the child tables. Queries on the parent tables, by
default, affect data in the child tables. So, issuing a SELECT against
your parent table will also query the child tables. DELETE will,
similarly, delete data in the child tables. You may target just the
parent table using ONLY, e.g. SELECT * FROM ONLY foo. This behavior is
also controlled by the GUC sql_inheritance, though I encourage you not
to change this value. To get a better idea of what is happening, look at
the output from EXPLAIN to see all the tables that are being included in
your plan.
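
For instance, assuming the parent table is called stats:

  SELECT count(*) FROM stats;        -- parent plus all child partitions
  SELECT count(*) FROM ONLY stats;   -- only rows physically stored in the parent

  EXPLAIN SELECT * FROM stats
  WHERE recorded_at >= DATE '2012-10-15';
  -- the plan lists every child table that will be scanned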

> 3. Is there a way for me to evaluate the effectiveness of the partitioned
> table? Would performing an Explain Analyse allow me to determine whether
> querying the parent table for statistics is quicker than querying against a
> massive non-partitioned table?
Well, you can do it with EXPLAIN ANALYZE, or you can do it by timing
your query, so that the overhead of EXPLAIN ANALYZE does not come into
play.
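
For example (the query here is only a stand-in):

  EXPLAIN ANALYZE
  SELECT metric, count(*) FROM stats
  WHERE recorded_at >= DATE '2012-10-15'
  GROUP BY metric;

  -- or time the plain query in psql, without the instrumentation overhead:
  \timing on
  SELECT metric, count(*) FROM stats
  WHERE recorded_at >= DATE '2012-10-15'
  GROUP BY metric;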

Also, I assume you've read this:
http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html

That link will be helpful in understanding how partitioning could
benefit you.

-Ryan Kelly


Re: Strategies/Best Practises Handling Large Tables

From
Merlin Moncure
Date:
On Fri, Oct 12, 2012 at 10:00 AM, Lonni J Friedman <netllama@gmail.com> wrote:
> On Fri, Oct 12, 2012 at 7:44 AM, Chitra Creta <chitracreta@gmail.com> wrote:
>> Hi,
>>
>> I currently have a table that is growing very quickly - i.e 7 million
>> records in 5 days. This table acts as a placeholder for statistics, and
>> hence the records are merely inserted and never updated or deleted.
>>
>> Many queries are run on this table to obtain trend analysis. However, these
>> queries are now starting to take a very long time (hours) to execute due to
>> the size of the table.
>>
>> I have put indexes on this table, to no significant benefit.  Some of the
>> other strategies I have thought of:
>> 1. Purge old data
>> 2. Reindex
>> 3. Partition
>> 4. Creation of daily, monthly, yearly summary tables that contains
>> aggregated data specific to the statistics required
>>
>> Does anyone know what is the best practice to handle this situation?
>>
>> I would appreciate knowledge sharing on the pros and cons of the above, or
>> if there are any other strategies that I could put in place.
>
> Partitioning is probably your best solution.  3 & 4 sound like
> variations on the same thing.  Before you go that route, you should
> make sure that your bottleneck is really a result of the massive
> amount of data, and not some other problem.  Are you sure that the
> indices you created are being used, and that you have all the indices
> that you need for your queries?  Look at the query plan output from
> EXPLAIN, and/or post here if you're unsure.

Partitioning is not a strategy to improve query performance unless you
are exploiting the partition structure in some way, for example by
filtering on the partition key.
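
In other words (a hedged illustration, using the same made-up names as the
earlier partition sketch):

  SET constraint_exclusion = partition;

  -- filters on the partition key, so only the matching child is scanned:
  EXPLAIN SELECT * FROM stats
  WHERE recorded_at >= DATE '2012-10-15';

  -- no filter on the partition key, so every child is scanned:
  EXPLAIN SELECT * FROM stats
  WHERE metric = 'logins';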

merlin


Re: Strategies/Best Practises Handling Large Tables

From
Shaun Thomas
Date:
On 10/12/2012 09:44 AM, Chitra Creta wrote:

> 4. Creation of daily, monthly, yearly summary tables that contains
> aggregated data specific to the statistics required

If the partitioning doesn't help you, I strongly urge you to build one
or more Fact tables with appropriate Dimensions. If you don't know what
these terms mean, there are lots of books on the subject. They're very
versatile for producing fast reports on varying inputs. You can also
"layer" them by having cascading levels of aggregation from day -> month
-> year, and so on.
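
A very rough sketch of the shape this takes (all names are made up; the
books cover the real design process):

  CREATE TABLE dim_date (
      date_id serial PRIMARY KEY,
      day     date NOT NULL,
      month   int  NOT NULL,
      year    int  NOT NULL
  );

  CREATE TABLE fact_stats_daily (
      date_id int    NOT NULL REFERENCES dim_date,
      metric  text   NOT NULL,
      total   bigint NOT NULL
  );

  -- "layering": monthly and yearly fact tables are aggregated from
  -- fact_stats_daily rather than from the raw data.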

These kinds of reporting structures are perfect for huge data
accumulation warehouses. The book most recommended to me back in the day
was The Data Warehouse Toolkit, and I can also vouch that it covers
these subjects pretty well.

I still recommend partitioning simply due to maintenance overhead, but
you'll want to look into this too.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas@optionshouse.com



Re: Strategies/Best Practises Handling Large Tables

From
Chris Travers
Date:


On Fri, Oct 12, 2012 at 7:44 AM, Chitra Creta <chitracreta@gmail.com> wrote:

Hi,

I currently have a table that is growing very quickly - i.e 7 million records in 5 days. This table acts as a placeholder for statistics, and hence the records are merely inserted and never updated or deleted.

Many queries are run on this table to obtain trend analysis. However, these queries are now starting to take a very long time (hours) to execute due to the size of the table.

I have put indexes on this table, to no significant benefit.  Some of the other strategies I have thought of:
1. Purge old data
2. Reindex
3. Partition
4. Creation of daily, monthly, yearly summary tables that contains aggregated data specific to the statistics required

Does anyone know what is the best practice to handle this situation?


The answer is, well, it depends.  Possibly some combination.

One approach I like, which may or may not be included in #4, is the idea of summary tables that contain snapshots of the data, allowing you to roll forward or backward from defined points.  This is what I call the log, aggregate, and snapshot approach.  But it really depends on what you are doing, and there is no one-size-fits-all approach at this volume.

Instead of reindexing, I would suggest also looking into partial indexes.
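
For example, if most trend queries only touch recent data, something like
this keeps the index small (the names and cutoff are assumptions):

  CREATE INDEX stats_recent_idx
      ON stats (metric, recorded_at)
      WHERE recorded_at >= DATE '2012-10-01';
  -- only queries whose WHERE clause implies that predicate can use it.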

Best Wishes,

Re: Strategies/Best Practises Handling Large Tables

From
Chitra Creta
Date:
Thank you all.

Ryan, would you mind sharing your one-time function to move it?

Merlin, what are your suggestions to improve query performance?

Shaun, thank you. I will look into facts and dimensions should all else fail.

Chris, would you mind giving me an example of what you mean by your log, aggregate and snapshot approach? Also, with indexing, I believe composite and partial indexes are better than plain indexes, am I correct? Do you have any recommendations as to which type (e.g. btree, hash) is better for which situations?

Thank you,
Chitra



On Thu, Oct 18, 2012 at 12:47 AM, Chris Travers <chris.travers@gmail.com> wrote:


On Fri, Oct 12, 2012 at 7:44 AM, Chitra Creta <chitracreta@gmail.com> wrote:

Hi,

I currently have a table that is growing very quickly - i.e 7 million records in 5 days. This table acts as a placeholder for statistics, and hence the records are merely inserted and never updated or deleted.

Many queries are run on this table to obtain trend analysis. However, these queries are now starting to take a very long time (hours) to execute due to the size of the table.

I have put indexes on this table, to no significant benefit.  Some of the other strategies I have thought of:
1. Purge old data
2. Reindex
3. Partition
4. Creation of daily, monthly, yearly summary tables that contains aggregated data specific to the statistics required

Does anyone know what is the best practice to handle this situation?


The answer is well, it depends.  Possibly some combination.

One approach I like that may be included in #4 but not necessarily is the idea of summary tables which contain snapshots of the data, allowing you to roll forward or backward from defined points.  This is what I call the log, aggregate, and snapshot approach.   But it really depends on what you are doing and there is no one size fits all approach at this volume.

Instead of reindexing, I would suggest also looking into partial indexes.

Best Wishes,

Re: Strategies/Best Practises Handling Large Tables

From
Chris Travers
Date:


On Sun, Oct 21, 2012 at 5:46 AM, Chitra Creta <chitracreta@gmail.com> wrote:
Thank you all.

Ryan, would you mind sharing your one-time function to move it?

Merlin, what are your suggestions to improve query performance?

Shaun, thank you. I will look into facts and dimensions should all else fail.

Chris, would you mind giving me an example of what you mean by your log, aggregate and snapshot approach. Also, with indexing, I believe composite and partial indexes are better than indexes, am I correct? Do you have any recommendations as to which type (e.g btree, hash) is better for which situations.

Sure.  Suppose I have an accounting system.

I may record the amounts of the transactions in journal_entry and journal_line tables.  These will be write once, read many.  However, over time you will end up having to digest millions of records (given sufficient volume) to find out the balance of a checking account, and this is not really ideal.

So to deal with this, I might, for example, add a table called account_checkpoint which might have the following fields:

account_id
end_date
debits
credits
balance

And then, on closing of the books, I can snapshot the accumulated debits, credits, and balance to date.  If I need any of these numbers I can just grab the appropriate number from account_checkpoint and roll forward from end_date.  If I have too much volume I can have closings at a monthly level or whatever.

The checkpoint table contains sufficient information for me to start a report at any point and end it at any other point without having to scan the intervening checkpointed periods.  Additionally, if I want to purge old data, I can do so without losing current balance information.

So what this approach does, in essence, is give you a way to purge without losing the aggregated information, and a way to skip portions of the table for aggregation purposes that you otherwise could not skip.  The big thing is that you cannot insert (and if this is in doubt, you need to enforce it with a trigger) any records effective before the most recent checkpoint.
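
A sketch of that in SQL (the account_checkpoint columns follow the list above; the journal_line columns are assumptions):

  CREATE TABLE account_checkpoint (
      account_id int           NOT NULL,
      end_date   date          NOT NULL,
      debits     numeric(18,2) NOT NULL,
      credits    numeric(18,2) NOT NULL,
      balance    numeric(18,2) NOT NULL,
      PRIMARY KEY (account_id, end_date)
  );

  -- current balance = latest snapshot plus everything logged after it
  SELECT cp.balance + coalesce(sum(jl.amount), 0) AS current_balance
  FROM account_checkpoint cp
  LEFT JOIN journal_line jl
         ON jl.account_id = cp.account_id
        AND jl.entry_date > cp.end_date
  WHERE cp.account_id = 42
    AND cp.end_date = (SELECT max(end_date)
                       FROM account_checkpoint
                       WHERE account_id = 42)
  GROUP BY cp.balance;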

Best Wishes,
Chris Travers

Re: Strategies/Best Practises Handling Large Tables

From
Chitra Creta
Date:
Thanks for your example Chris. I will look into it as a long-term solution.

Partitioning tables as a strategy worked very well indeed. This will be my short/medium term solution. 

Another strategy that I would like to evaluate as a short/medium term solution is archiving old records in a table before purging them.

I am aware that Oracle has a tool that allows records to be exported into a file / archive table before purging them. They also provide a tool to import these records.

Does PostgreSQL have similar tools to export to a file and re-import? 

If PostgreSQL does not have a tool to do this, does anyone have any ideas on what file format (e.g. a text file with a header row of column names and one row per record) would be ideal for easy re-importing into a PostgreSQL table?

Thank you for your ideas.


On Mon, Oct 22, 2012 at 12:14 AM, Chris Travers <chris.travers@gmail.com> wrote:


On Sun, Oct 21, 2012 at 5:46 AM, Chitra Creta <chitracreta@gmail.com> wrote:
Thank you all.

Ryan, would you mind sharing your one-time function to move it?

Merlin, what are your suggestions to improve query performance?

Shaun, thank you. I will look into facts and dimensions should all else fail.

Chris, would you mind giving me an example of what you mean by your log, aggregate and snapshot approach. Also, with indexing, I believe composite and partial indexes are better than indexes, am I correct? Do you have any recommendations as to which type (e.g btree, hash) is better for which situations.

Sure.  Suppose I have an accounting system.

I may record the amounts of the transactions in journal_entry and journal_line tables.  These will be write once, read many.  However, over time you will end up having to digest millions of records (given sufficient volume) to find out the balance of a checking account, and this is not really ideal.

So to deal with this, I might, for example, add a table called account_checkpoint which might have the following fields:

account_id
end_date
debits
credits
balance

And then I can snapshot on closing of books the accumulated debits, credits, and balance to date.  If I need any of these numbers I can just grab the appropriate number from account_checkpoint and roll forward from end_date.  If I have too much volume I can have closings on a monthly level of whatever.

The checkpoint table contains sufficient information for me to start a report at any point and end it at any other point without having to scan interceding checkpointed periods.  Additionally if I want to purge old data, I can do so without losing current balance information.

So what this approach does, in essence is it gives you a way to purge without losing some aggregated information, and a way to skip portions of the table for aggregation purposes you can't skip otherwise.  The big thing is you cannot insert (and if this is in doubt, you need to enforce this with a trigger) any records effective before the most recent checkpoint.

Best Wishes,
Chris Travers

Re: Strategies/Best Practises Handling Large Tables

From
Igor Romanchenko
Date:


On Thu, Nov 15, 2012 at 1:34 PM, Chitra Creta <chitracreta@gmail.com> wrote:
Thanks for your example Chris. I will look into it as a long-term solution.

Partitioning tables as a strategy worked very well indeed. This will be my short/medium term solution. 

Another strategy that I would like to evaluate as a short/medium term solution is archiving old records in a table before purging them.

I am aware that Oracle has a tool that allows records to be exported into a file / archive table before purging them. They also provide a tool to import these records.

Does PostgreSQL have similar tools to export to a file and re-import? 

If PostgreSQL does not have a tool to do this, does anyone have any ideas on what file format (e.g. text file containing a table of headers being column names and rows being records) would be ideal for easy re-importing into a PostgreSQL table?

Thank you for your ideas.

PostgreSQL has COPY TO to export records to a file, and COPY FROM to load them back in ( http://wiki.postgresql.org/wiki/COPY ).
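
A minimal sketch of archiving a partition this way before purging it (the
path and table names are assumptions; note that server-side COPY to a file
needs superuser, whereas psql's \copy writes a client-side file):

  -- export the oldest partition, then purge it:
  COPY stats_2012_w42 TO '/archive/stats_2012_w42.csv' WITH (FORMAT csv, HEADER);
  DROP TABLE stats_2012_w42;

  -- much later, if the data is needed again:
  CREATE TABLE stats_2012_w42 () INHERITS (stats);
  COPY stats_2012_w42 FROM '/archive/stats_2012_w42.csv' WITH (FORMAT csv, HEADER);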