Re: Transparent table partitioning in future version of PG? - Mailing list pgsql-performance
From | david@lang.hm |
---|---|
Subject | Re: Transparent table partitioning in future version of PG? |
Date | |
Msg-id | alpine.DEB.1.10.0905071932050.15782@asgard Whole thread Raw |
In response to | Re: Transparent table partitioning in future version of PG? (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: Transparent table partitioning in future version of PG?
Re: Transparent table partitioning in future version of PG? |
List | pgsql-performance |
On Thu, 7 May 2009, Robert Haas wrote: > On Wed, May 6, 2009 at 6:08 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> Agreed. Perhaps I should say then that the syntax needs to express the >> requirements of the planner/executor behaviour, rather than being the >> main aspect of the feature, as some have suggested. > > Agreed. > >> Hopefully, notions of partitioning won't be directly tied to chunking of >> data for parallel query access. Most queries access recent data and >> hence only a single partition (or stripe), so partitioning and >> parallelism and frequently exactly orthogonal. > > Yes, I think those things are unrelated. I'm not so sure (warning, I am relativly inexperianced in this area) it sounds like you can take two basic approaches to partition a database 1. The Isolation Plan you want to have it so that your queries match your partitioning. this is with the goal of only having to query a small number of paritions, minimizing the total amount of data touched (including minimumizing the number of indexes searched) this matches the use case mentioned above, with the partition based on date and only looking at the most recent date range. 2. The Load Balancing Plan you want to have your partitioning and your queries _not_ match as much as possible this is with the goal of having the query hit as many partitions as possible, so that the different parts of the search can happen in parallel However, with either partitioning plan, you will have queries that degenerate to look like the other plan. In the case of the isolation plan, you may need to search for all instances of a rare thing over the entire history (after all, if you never need to access that history, why do you pay for disks to store it? ;-) and even when you are searching a narrow time window, it may still span multiple partitions. I have a log analysis setup using the Splunk prioriatary database, it paritions by time, creating a new parition as the current one hits a configurable size (by default 10G on 64 bit systems). for my volume of logs I end up with each parition only covering a few hours. it's very common to want to search over a few days, which can be a few dozen partitions (this is out of many hundreds of partitions, so it's still a _huge_ win to narrow the timeframe) In the case of the load balancing plan, you may run into a query that happens to only fall into one partition (the query matches your paritioning logic) I think the only real difference is how common it is to need to search multiple partitions. If the expectation is that you will frequently need to search most/all of the partitions (the load balancing plan), then it's a waste of time to analyse the query to try and figure out which paritions you need to look at. If the expectation is that you will frequently only need to search a small number of the partitions (the isolation plan), then it's extremely valuble to spend as much time as needed working to analyse the query to try and figure out which partitions you need to look at. I believe that the isolation plan is probably more common than the load balancing plan, but I don't see them as being that different for the database engine point of view. To tune a system that can handle the isolation plan for load balancing, the key thing to do would be to have a knob to disable the partition planning, and just blindly send the search out to every partition. David Lang
pgsql-performance by date: