Thread: Transparent table partitioning in future version of PG?
Hi,
I was looking at the support that PostgreSQL offers for table partitioning at http://www.postgresql.org/docs/8.4/static/ddl-partitioning.html. The concept looks promising, but it's maybe fair to say that PG itself doesn't really support partitioning natively; rather, one can simulate it using some of the existing PG features (namely inheritance, triggers, rules and constraint exclusion). This simulation does seem to work, but there are some disadvantages and caveats.

A major disadvantage is obviously that you need to set up and maintain the whole structure yourself (which is somewhat dangerous, or at least involves a lot of maintenance overhead). Next to that, it seemingly becomes hard to do simple queries like 'select * from foo where bar > 1000 and bar < 5000' when the answer to the query spans multiple partitions. Constraint exclusion works to some degree, but the document I referred to above tells me I can no longer use prepared statements then.
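For readers following along, the manual setup being described looks roughly like this -- a minimal sketch of PostgreSQL 8.x inheritance-based partitioning; the table foo, its column bar, and the 1000-wide ranges are illustrative only:

```sql
-- Parent table; holds no rows itself once the trigger is in place.
CREATE TABLE foo (
    id  bigint  NOT NULL,
    bar integer NOT NULL
);

-- One child per range, each with a CHECK constraint that the planner
-- can use for constraint exclusion.
CREATE TABLE foo_0 (CHECK (bar >= 0    AND bar < 1000)) INHERITS (foo);
CREATE TABLE foo_1 (CHECK (bar >= 1000 AND bar < 2000)) INHERITS (foo);

-- Route INSERTs on the parent into the matching child.
CREATE OR REPLACE FUNCTION foo_insert_trigger() RETURNS trigger AS $$
BEGIN
    IF NEW.bar < 1000 THEN
        INSERT INTO foo_0 VALUES (NEW.*);
    ELSIF NEW.bar < 2000 THEN
        INSERT INTO foo_1 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'bar % out of range', NEW.bar;
    END IF;
    RETURN NULL;   -- the parent keeps no copy of the row
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER foo_insert BEFORE INSERT ON foo
    FOR EACH ROW EXECUTE PROCEDURE foo_insert_trigger();

-- Let the planner skip children whose CHECK contradicts the WHERE
-- clause ('partition' is the 8.4 default).
SET constraint_exclusion = partition;
```

Every range added later means another child table plus another branch in the trigger, which is exactly the maintenance overhead complained about above.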
I wonder if there are any plans to incorporate 'native' or 'transparent' partitioning in some future version of PG? With this I mean that I would basically be able to say something like (pseudo): "alter table foo partition on bar range 100", and PG would then simply start doing internally what we now have to do manually.
Is something like this on the radar, or is it just wishful thinking on my part?
Kind regards
On Fri, May 1, 2009 at 10:32 AM, henk de wit <henk53602@hotmail.com> wrote:
> I was looking at the support that PostgreSQL offers for table partitioning
> at http://www.postgresql.org/docs/8.4/static/ddl-partitioning.html.
> [...]
> Is something like this on the radar or is it just wishful thinking of me?

This has been discussed on this list multiple times previously; search
the archives.  The problem has been finding someone who has both the
time and the ability to do the work.

...Robert
On 5/1/09 7:32 AM, "henk de wit" <henk53602@hotmail.com> wrote:
> I was looking at the support that PostgreSQL offers for table partitioning
> at http://www.postgresql.org/docs/8.4/static/ddl-partitioning.html.
> [...]
> Constraint exclusion works to some degree, but the document I referred to
> above tells me I can no longer use prepared statements then.

More caveats:

Query plans go bad pretty quickly because the planner doesn't aggregate
statistics correctly when scanning more than one table.

The constraint exclusion code is completely and utterly broken on DELETE
or UPDATE queries if the table count gets large -- I can get the query
planner / constraint exclusion stuff to eat up 7GB of RAM trying to
figure out which table to access when the number of partitions is ~6000.
The same thing in SELECT form doesn't consume that memory, but still
takes over a second. This is "not a bug":
http://www.nabble.com/8.3.5:-Query-Planner-takes-15%2B-seconds-to-plan-Update-or-Delete-queries-on-partitioned-tables.-td21992054.html

It's pretty much faster to do merge joins or hash joins client side on
multiple tables -- basically doing partitioning client side -- after a
point, and for any more complicated aggregation or join.
There is a lot of talk about overly complicated partitioning or
auto-partitioning, but two much simpler things would go a long way to
making this fairly workable:

1. Make stat aggregation across tables better -- use a weighted average
   for estimating row width, and aggregate distinct counts and
   correlations better. Right now it mostly assumes the worst possible
   case and can end up with very suboptimal plans.

2. Make a special case for "unique" child inheritance constraints that
   can be checked much faster -- nobody wants to partition and have
   overlapping constraint regions. And whatever is going on on the
   UPDATE / DELETE side that causes it to take so much longer and use so
   much more memory than what should be the same constraint exclusion
   check as a SELECT needs to be attended to.

There would still be manual work for creating and managing partitions,
but at this point, that is the _least_ of the problems.

> I wonder if there are any plans to incorporate 'native' or 'transparent'
> partitioning in some future version of PG?
> [...]
On Fri, 2009-05-01 at 11:27 -0400, Robert Haas wrote:
> The problem has been finding someone who has both the time and the
> ability to do the work.

Unfortunately there has been significant debate over which parts of
partitioning need to be improved. My own view is that considerable
attention needs to be applied to both the executor and planner to
improve matters, and that syntax improvements are largely irrelevant,
though seductive.

Deep improvements will require significant analysis, agreement, effort
and skill. What we have now took approximately 20 days to implement,
with later patches adding about another 10-20 days' work. I'd estimate
the required work at 60-100 days' work from the primary author, plus
planning and discussion time. YMMV.

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Fri, 2009-05-01 at 11:27 -0400, Robert Haas wrote:
>> The problem has been finding someone who has both the time and the
>> ability to do the work.

> Unfortunately there has been significant debate over which parts of
> partitioning need to be improved. My own view is that considerable
> attention needs to be applied to both the executor and planner to
> improve matters and that syntax improvements are largely irrelevant,
> though seductive.

My thought about it is that what we really need is an explicit notion
of partitioned tables built into the system, instead of trying to make
the planner re-deduce the partitioning behavior from first principles
every time it builds a plan for such a table.  Such a notion would
presumably involve some new syntax to allow the partitioning rule to be
specified at table creation time.  I agree that the syntax details are
a minor issue, but the set of possible partitioning rules is certainly
a topic of great interest.

			regards, tom lane
On Wed, 2009-05-06 at 17:55 -0400, Tom Lane wrote:
> My thought about it is that what we really need is an explicit notion
> of partitioned tables built into the system, instead of trying to make
> the planner re-deduce the partitioning behavior from first principles
> every time it builds a plan for such a table.
> [...]

Agreed. Perhaps I should say then that the syntax needs to express the
requirements of the planner/executor behaviour, rather than being the
main aspect of the feature, as some have suggested.

Hopefully, notions of partitioning won't be directly tied to chunking
of data for parallel query access. Most queries access recent data and
hence only a single partition (or stripe), so partitioning and
parallelism are frequently exactly orthogonal.

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support
Alvaro Herrera <alvherre@commandprompt.com> writes:
> I think there should be a way to refer to individual partitions as
> objects.

Yeah, the individual partitions should be nameable tables, otherwise we
will be reinventing a *whole* lot of management stuff to little gain.
I don't actually think there is anything wrong with using table
inheritance as the basic infrastructure --- I just want more smarts
about one particular use pattern of inheritance.

			regards, tom lane
Simon Riggs wrote:
> Hopefully, notions of partitioning won't be directly tied to chunking
> of data for parallel query access. Most queries access recent data and
> hence only a single partition (or stripe), so partitioning and
> parallelism are frequently exactly orthogonal.

I think there should be a way to refer to individual partitions as
objects.  That way we could execute some commands to enable certain
optimizations, for example "mark this partition read only", which would
mean it could be flagged as not needing vacuum.

-- 
Alvaro Herrera                         http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Tom Lane wrote:
> Yeah, the individual partitions should be nameable tables, otherwise we
> will be reinventing a *whole* lot of management stuff to little gain.
> I don't actually think there is anything wrong with using table
> inheritance as the basic infrastructure --- I just want more smarts
> about one particular use pattern of inheritance.

Maybe it's worth examining and documenting existing partition setups,
the reasoning behind them, and how they're implemented, in order to
guide any future plans for native partitioning support?

Maybe that's already been/being done. On the off chance that it's not,
ones I can think of:

- Partitioning an equally active dataset by ranges over a key to improve
  scan performance, INSERT/UPDATE costs on indexes, locking issues, etc.

- The "classic" active/archive partition scheme where there's only one
  partition growing at any one time, and the others are historical data
  that's nowhere near as "hot".

- A variant on the basic active/archive structure, where query activity
  decreases slowly over time and there are many partitions of recent
  data. Partitions are merged into larger ones as they age, somewhat
  like an RRD database.

I also expect that in the future there will be demand for striping data
across multiple partitions in different tablespaces to exploit
in-parallel scanning (when/if supported) for better I/O utilization in
multiple-disk-array situations. For example, partitioning on
"MOD(id,10)" across 10 separate volumes, and firing off 10 concurrent
scans, one per partition, to satisfy a query.

Those are some simpler schemes. Does anyone actively using partitioning
have particular schemes/details that're worth going into?

-- Craig Ringer
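The MOD(id,10) striping Craig describes can be expressed with the same inheritance machinery available today; a sketch with two stripes for brevity (the tablespaces disk0/disk1 and all table names are hypothetical and would need to exist beforehand):

```sql
-- Stripe a table across tablespaces by MOD(id, 2); extend the pattern
-- to MOD(id, 10) with ten children on ten volumes.
CREATE TABLE events (
    id      bigint NOT NULL,
    payload text
);

CREATE TABLE events_m0 (CHECK (id % 2 = 0))
    INHERITS (events) TABLESPACE disk0;
CREATE TABLE events_m1 (CHECK (id % 2 = 1))
    INHERITS (events) TABLESPACE disk1;

-- Constraint exclusion can only skip a stripe when the WHERE clause
-- repeats the partition expression, e.g.:
--   SELECT * FROM events WHERE id % 2 = 0 AND id = 42;
```

The in-parallel scanning of all stripes at once is the part that, as noted in the message above, PostgreSQL does not yet do for you.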
On Thu, 2009-05-07 at 10:56 +0800, Craig Ringer wrote:
> Maybe it's worth examining and documenting existing partition setups,
> the reasoning behind them, and how they're implemented, in order to
> guide any future plans for native partitioning support?
> [...]
> I also expect that in the future there will be demand for striping data
> across multiple partitions in different tablespaces to exploit
> in-parallel scanning (when/if supported) for better I/O utilization in
> multiple-disk-array situations.

That's a good summary. It has already been documented and discussed, but
saying it again and again is the best way to get this across.

You've highlighted that partitioning is a feature with many underlying
requirements: infrequent access to data (frequently historical),
striping for parallelism, and getting around RDBMS flaws (if any). We
must be careful to implement each requirement in full, yet separately,
so we don't end up with 60% functionality in each case by delivering an
average or least-common-denominator solution.

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support
On 5/7/09 1:54 AM, "Simon Riggs" <simon@2ndQuadrant.com> wrote:
> You've highlighted that partitioning is a feature with many underlying
> requirements: infrequent access to data (frequently historical),

Actually, infrequent access is not a requirement, though it is a common
one. Take, for instance, a very large set of data that contains an
integer column 'type_id' with about 200 distinct values, where the data
is accessed with a strict 'type_id = X' predicate 99.9% of the time.

If this were one large table, then scans of all sorts would become much
more expensive than if it were partitioned on 'type_id'. Furthermore,
partitioning on type_id removes the requirement to even index on this
value. Statistics on each partition may vary significantly, and the
planner can thus adapt naturally to changes in the data per value of
type_id.

The raw need is not "infrequent access" but highly partitioned access.
It doesn't matter if your date-partitioned data is accessed evenly
across all dates or skewed to the most frequent -- what matters is that
you are almost always accessing by small date ranges.

> striping for parallelism and getting around RDBMS flaws (if any). We
> must be careful to implement each requirement in full, yet separately,
> so we don't end up with 60% functionality in each case by delivering an
> average or least common denominator solution.
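Scott's type_id scheme corresponds to list-style partitioning over the inheritance machinery; a sketch with table and column names invented for illustration:

```sql
-- One child per type_id value (~200 children in Scott's example); the
-- CHECK constraint replaces an index on type_id entirely.
CREATE TABLE measurements (
    type_id integer NOT NULL,
    val     double precision
);

CREATE TABLE measurements_t1 (CHECK (type_id = 1)) INHERITS (measurements);
CREATE TABLE measurements_t2 (CHECK (type_id = 2)) INHERITS (measurements);
-- ... and so on, one child per distinct type_id.

-- ANALYZE keeps statistics per child, so the planner adapts to each
-- type_id's data distribution independently.
ANALYZE measurements_t1;
ANALYZE measurements_t2;

-- With constraint exclusion on, the 'type_id = X' predicate used 99.9%
-- of the time touches exactly one child:
--   SELECT avg(val) FROM measurements WHERE type_id = 1;
```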
On Wed, May 6, 2009 at 6:08 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Agreed. Perhaps I should say then that the syntax needs to express the
> requirements of the planner/executor behaviour, rather than being the
> main aspect of the feature, as some have suggested.

Agreed.

> Hopefully, notions of partitioning won't be directly tied to chunking
> of data for parallel query access. Most queries access recent data and
> hence only a single partition (or stripe), so partitioning and
> parallelism are frequently exactly orthogonal.

Yes, I think those things are unrelated.

...Robert
On Thu, 7 May 2009, Robert Haas wrote:
>> Hopefully, notions of partitioning won't be directly tied to chunking
>> of data for parallel query access.
>> [...]
> Yes, I think those things are unrelated.

I'm not so sure (warning, I am relatively inexperienced in this area).
It sounds like you can take two basic approaches to partitioning a
database:

1. The Isolation Plan

   You want your queries to match your partitioning, with the goal of
   querying only a small number of partitions and minimizing the total
   amount of data touched (including minimizing the number of indexes
   searched). This matches the use case mentioned above: partition by
   date and only look at the most recent date range.

2. The Load Balancing Plan

   You want your partitioning and your queries _not_ to match as much
   as possible, with the goal of having each query hit as many
   partitions as possible, so that the different parts of the search
   can happen in parallel.

However, with either partitioning plan, you will have queries that
degenerate to look like the other plan.

In the case of the isolation plan, you may need to search for all
instances of a rare thing over the entire history (after all, if you
never need to access that history, why do you pay for disks to store
it? ;-), and even when you are searching a narrow time window, it may
still span multiple partitions.

I have a log analysis setup using the proprietary Splunk database. It
partitions by time, creating a new partition as the current one hits a
configurable size (by default 10G on 64-bit systems). For my volume of
logs, each partition ends up covering only a few hours. It's very
common to want to search over a few days, which can be a few dozen
partitions (out of many hundreds, so it's still a _huge_ win to narrow
the timeframe).

In the case of the load balancing plan, you may run into a query that
happens to fall into only one partition (the query matches your
partitioning logic).

I think the only real difference is how common it is to need to search
multiple partitions. If the expectation is that you will frequently
need to search most/all of the partitions (the load balancing plan),
then it's a waste of time to analyse the query to try to figure out
which partitions you need to look at. If the expectation is that you
will frequently need to search only a small number of partitions (the
isolation plan), then it's extremely valuable to spend as much time as
needed analysing the query to figure out which partitions to look at.

I believe that the isolation plan is probably more common than the load
balancing plan, but I don't see them as being that different from the
database engine's point of view. To tune a system that can handle the
isolation plan for load balancing, the key thing would be a knob to
disable the partition planning and just blindly send the search out to
every partition.

David Lang
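The knob David asks for partly exists already: constraint exclusion is a per-session setting, and turning it off makes the planner skip the per-partition constraint proofs and send the scan to every child (the setting and its values are real PostgreSQL GUCs; the table foo is illustrative):

```sql
-- Load-balancing style: skip exclusion planning, hit every partition.
SET constraint_exclusion = off;
SELECT count(*) FROM foo;   -- appends scans of all children

-- Isolation style (the 8.4 default): prove exclusion for scans of
-- inheritance trees and UNION ALL subqueries only.
SET constraint_exclusion = partition;
```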
On Thu, May 7, 2009 at 10:52 PM,  <david@lang.hm> wrote:
> it sounds like you can take two basic approaches to partition a database
>
> 1. The Isolation Plan
> [...]
> 2. The Load Balancing Plan

Well, even if the table is not partitioned at all, I don't see that it
should preclude parallel query access. If I've got a 1 GB table that
needs to be sequentially scanned for rows meeting some restriction
clause, and I have two CPUs and plenty of I/O bandwidth, ISTM it should
be possible to have them each scan half of the table and combine the
results. Now, this is not easy, and there are probably substantial
planner and executor changes required to make it work, but I don't know
that it would be particularly easier if I had two 500 MB partitions
instead of a single 1 GB table.

IOW, I don't think you should need to partition if all you want is load
balancing. Partitioning should be for isolation, and load balancing
should happen when appropriate, whether there is partitioning involved
or not.

...Robert
On 5/7/09 7:52 PM, "david@lang.hm" <david@lang.hm> wrote:
> I believe that the isolation plan is probably more common than the
> load balancing plan, but I don't see them as being that different from
> the database engine's point of view. To tune a system that can handle
> the isolation plan for load balancing, the key thing would be a knob
> to disable the partition planning and just blindly send the search out
> to every partition.

Lots of good points. However, implicit in the above is that the process
of identifying which partitions contain the data is expensive. Right
now it is (1.5 sec with 6000 partitions and the simplest possible
constraint, column = CONSTANT).

But identifying which partitions can contain a value is really nothing
more than an index. If you constrain the possible partitioning
functions to those where a single partition key can only exist in one
partition, then this index and its lookup should be very fast even for
large partition counts. From what I can tell empirically, the current
system does this in more of a sequential scan, running the constraint
checks for each possibility. Furthermore, the actual tables don't have
to contain the data if the key is a column identity function (date = X)
rather than a range or hash.

At the core, partitioning is really just a form of 'chunky' indexing
that doesn't fragment, or need re-indexing, or have much MVCC
complexity.
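One concrete version of the 'identity function' case Scott mentions: when the partition name can be computed directly from the key, insert routing costs O(1) no matter how many partitions exist, instead of walking every child's constraint. A sketch assuming 8.4's EXECUTE ... USING, a hypothetical log table with a logged_at column, and monthly children named log_YYYYMM that already exist:

```sql
-- Compute the child name from the key itself; no per-child checks.
CREATE OR REPLACE FUNCTION log_insert_trigger() RETURNS trigger AS $$
BEGIN
    EXECUTE 'INSERT INTO log_'
            || to_char(NEW.logged_at, 'YYYYMM')
            || ' SELECT ($1).*'
    USING NEW;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER log_route BEFORE INSERT ON log
    FOR EACH ROW EXECUTE PROCEDURE log_insert_trigger();
```

The planner-side lookup Scott wants would be the SELECT-time analogue of this: an index over partition bounds rather than a sequential scan of every child's constraints.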
On Fri, 8 May 2009, Robert Haas wrote:
> IOW, I don't think you should need to partition if all you want is
> load balancing. Partitioning should be for isolation, and load
> balancing should happen when appropriate, whether there is
> partitioning involved or not.

Actually, I will contradict myself slightly. With the Isolation Plan
there is not necessarily a need to run the query on each partition in
parallel. If parallel queries are possible, they will benefit Isolation
Plan partitioning, but the biggest win with this plan is just reducing
the number of partitions that need to be queried.

With the Load Balancing Plan there is no benefit in partitioning unless
you have the ability to run queries on each partition in parallel.

Using a separate back-end process to do a query on a separate partition
is a fairly straightforward, but not trivial, thing to do (there are
complications in merging the result sets, including the need to be able
to do part of a query, merge the results, then use those results for
the next step in the query). I would also note that there does not seem
to be a huge conceptual difference between doing these parallel queries
on one computer and shipping the queries off to other computers.

However, trying to split the work on a single table runs into all sorts
of 'interesting' issues with things needing to be shared between the
multiple processes (they both need to use the same indexes, for
example). So I think that it is much easier for the database engine to
efficiently search two 500G tables instead of one 1T table.

David Lang
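The merge step David describes is, in effect, what a hand-rolled client-side plan has to do; in SQL terms it amounts to querying the children separately (possibly over separate connections) and recombining the streams afterwards (table and column names illustrative):

```sql
-- Each branch could run in its own backend/connection; the final
-- ORDER BY is the 'merge the results' step that has to happen after
-- both partial scans finish.
SELECT * FROM foo_0 WHERE bar BETWEEN 500 AND 1500
UNION ALL
SELECT * FROM foo_1 WHERE bar BETWEEN 500 AND 1500
ORDER BY bar;
```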
On 5/8/09 11:20 AM, "david@lang.hm" <david@lang.hm> wrote:
> With the Load Balancing Plan there is no benefit in partitioning
> unless you have the ability to run queries on each partition in
> parallel.

I think there is a benefit to partitioning in this case. If the
statistics on other columns are highly skewed WRT the column(s)
partitioned on, the planner statistics will be better. It may have to
access every partition, but it doesn't have to access every partition
in the same way.

Perhaps something like: user_id = 'FOO' is one of the most common
values in date partition A, and one of the least common values in B, so
a WHERE clause with user_id = 'FOO' will sequential scan one and index
scan the other. For really large tables where data correlation varies
significantly, this can be a huge performance gain even if all
partitions are accessed.
>> IOW, I don't think you should need to partition if all you want is
>> load balancing. Partitioning should be for isolation, and load
>> balancing should happen when appropriate, whether there is
>> partitioning involved or not.
>
> Actually, I will contradict myself slightly.
> [...]
> However, trying to split the work on a single table runs into all
> sorts of 'interesting' issues with things needing to be shared between
> the multiple processes (they both need to use the same indexes, for
> example).

I disagree with this part of your email. It is already the case that
tables and indexes need to support concurrent access by multiple
Postgres processes. I don't see why that part of the problem would be
any more difficult for parallel query execution than it would be for
executing two different and unrelated queries on the same table.

> So I think that it is much easier for the database engine to
> efficiently search two 500G tables instead of one 1T table.

And that leads me to the opposite conclusion on this point.

...Robert
Robert Haas <robertmhaas@gmail.com> writes:
>> So I think that it is much easier for the database engine to
>> efficiently search two 500G tables instead of one 1T table.

> And that leads me to the opposite conclusion on this point.

I don't think there would be any difference on that score, either.

			regards, tom lane
Robert Haas wrote:
> Well, even if the table is not partitioned at all, I don't see that it
> should preclude parallel query access. If I've got a 1 GB table that
> needs to be sequentially scanned for rows meeting some restriction
> clause, and I have two CPUs and plenty of I/O bandwidth, ISTM it
> should be possible to have them each scan half of the table and
> combine the results.
> [...]

The point of partitioning in this scenario is primarily that you can
put the different partitions in different tablespaces, most likely on
independent disk devices. You therefore get more I/O bandwidth.

-- Craig Ringer