Re: Many thousands of partitions - Mailing list pgsql-general

From Gabriel E. Sánchez Martínez
Subject Re: Many thousands of partitions
Date
Msg-id 52548B74.7030801@gmail.com
Whole thread Raw
In response to Many thousands of partitions  (Grzegorz Tańczyk <goliatus@polzone.pl>)
Responses Re: Many thousands of partitions
List pgsql-general

On 10/08/2013 11:23 AM, Grzegorz Tańczyk wrote:
Hello,

I have question regarding one of caveats from docs:
http://www.postgresql.org/docs/8.3/static/ddl-partitioning.html

"Partitioning using these techniques will work well with up to perhaps a hundred partitions; don't try to use many thousands of partitions."

What's the alternative? Nested partitioning could do the trick? I have milions of rows(numbers, timestamps and text(<4kb), which are frequently updated and there are also frequent inserts. Partitioning was my first thought about solution of this problem. I want to avoid long lasting locks, index rebuild problems and neverending vacuum.

Write performance may be low if at the same time I will have no problem selecting single rows using primary key(bigint).Partitioning seems to be the solution, but I'm sure I will end up with several thousands of automatically generated partitions.
I had the same question a short while back and actually tested some of it.  My test data set was on an unpartitioned table having about 432 million rows and taking about 46 GB.  In my case the data partitions nicely by date.  The table had about a year of data, spread over two years, and I was curious to see how much partioning could improve query speeds vs the unpartitioned table with an index on a timestamp field.  I tested partitioning by month (24 partitions) and also by day (over 700 partitions), in both cases using non-overlapping CHECK constraints to enable constraint exclusion.  Both partitioning schemes worked considerably faster than the single table (on the order of 200 ms vs 2500 ms), and were not significantly far from each other in performance.  Since more partitions translates to more management hasle, I concluded that partitioning by month was the best option.

I also tested the partitioning idea with a script that did a partition per day for 20 years.  Most of the tables were empty because I used the same data for this test.  I found that CREATE and DROP took very long, while SELECT queries (on over 7000 thousand of partitions) were still faster than on the single table but a bit slower than the schemes with only a few partitions.  With more complex SELECT queries the performance might be much worse.

I also tested the nested partitioning idea on partitions by day over two years, the idea being that since CHECK constraints are inherited the query planner should be able to exclude sets of tables at once, making thousands of partitions feasible.  I was not able to detect a significant change in the time it took to SELECT on the tables for two years, so I tested it with tables for 20 and 30 years.  If nested partitioning worked, having thousands of partitions would not appreciably slow down the query, but it did a bit, so it is clear that the query planner still scans all partitions down the inheritance tree.  Perhaps this is because one can override the default behavior of inheriting CHECK constraints, so a child table does not necesarilly observe a parent's CHECK.  I suppose that with a bit of bookkeeping the query planner could exploit nested partitioning, but that is not how it currently behaves.  Maybe this could be considered for a future performance feature of PostgreSQL, but it would only be useful in big data applications with several TB of data archival. 

I was not using INSERT triggers to distribute data across partitions.  Instead I left that logic to a Java application that relied on the naming convention for tables.  I imagine that things can slow down considerably as the number of partitions increases if you are inserting data with triggers.

Partioning seems to be a good idea if a single table would be too big to fit in your server's file cache, and also for management, since you can drop partitions of old data and create new ones without having to reindex and lock.  Does your data partition nicely by date, for example?  If most of the inserts are new data and old data is read-mostly, then partitioning may make sense because you would not have to reindex old partitions.  In fact, you could very well not have an index on the hot, write-mostly partition of, say, the current month, until the write activity on that table diminishes, which would make inserts faster.  If, on the other hand, your writes are scattered across many partitions, a single large table with an index may be a better solution. 


Thanks

--
Regards,
  Grzegorz

pgsql-general by date:

Previous
From: Rowan Collins
Date:
Subject: Re: ERROR: invalid value "????" for "YYYY"
Next
From: Steve Crawford
Date:
Subject: Re: ERROR: invalid value "????" for "YYYY"