Thread: nested partitioning
Suppose I have a lot of data (several GB worth) to store each day, going back a few years. I am considering a nested partition schema, in which the data for each day is stored in a separate child table, which inherits from an empty table for each month, which inherits from an empty table for each year. For example, if I had data going back three years, I'd have one master table with the column definitions. Then I'd have three year tables inherit from that master table. Then I'd have twelve month tables inherit from each year table. Then I'd have the appropriate number of day tables inherit from each month table. MASTER 2011 01 01 ... 31 ... 12 ... 2012 ... 2013 01 ... 12 01 ... 31 Check constraints would be used on each day table to ensure that only data for that year-month-day is stored there (performing a range check on an integer or timestamp type, for example). Check constraints would also be used on each month table to ensure that only data for that year-month is stored in that month's child tables. And check constraints would also be used on each year table to ensure that only data for that year is stored in that year's child tables. Each day table would have its day check constraint, plus inherited year and month check constraints. Now to the questions. Would constraint exclusion work in a tree fashion to prune tables from a query plan? In other words, if I query data for a specific day, would constraint exclusion first exclude all the years but the relevant one, then exclude all the months of that year but the relevant one, then exclude all the days of that month but the relevant one, and finally perform a sequential or index scan of the right partition? The idea is that by nesting the partitions in this manner the number of exclusion checks is reduced. One can see that with a decade of data stored, there would be approximately 3652 partitions. The documentation discourages having this many partitions. Thanks in advance for advice. Regards, Gabriel Sánchez-Martínez PhD Candidate Massachusetts Institute of Technology
On 09/18/2013 07:45 AM, "Gabriel E. Sánchez Martínez" wrote: > Now to the questions. Would constraint exclusion work in a tree > fashion to prune tables from a query plan? Constraint exclusion works by checking the constraints themselves. So if your constraints are based on date, they will always work as expected. Assuming they're generated, you could easily check down to the day level by >= desired day, and < next day. With that in mind, if you queried a specific day, the query would remove every partition but the relevant day (or range of days). > One can see that with a decade of data stored, there would be > approximately 3652 partitions. The documentation discourages having > this many partitions. And you shouldn't. In our partitions, we try to keep it around one partition per month. I've recently written a library that eschews triggers in favor of a nightly movement job to redistribute data, and it can handle partition granularity on a sliding scale. The default is one month, but my tests show it works properly down to 1 day. Still, the lowest I'd use it is probably 1 week. That would still drastically reduce your amount of estimated partitions. However you decide to do it, don't optimize prematurely. Monthly partitions are enough for us and our 130GB table with 250M rows, and that's for a single year. -- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604 312-676-8870 sthomas@optionshouse.com ______________________________________________ See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email
On 09/18/2013 11:34 AM, Shaun Thomas wrote: > On 09/18/2013 07:45 AM, "Gabriel E. Sánchez Martínez" wrote: > >> Now to the questions. Would constraint exclusion work in a tree >> fashion to prune tables from a query plan? > > Constraint exclusion works by checking the constraints themselves. So > if your constraints are based on date, they will always work as > expected. Assuming they're generated, you could easily check down to > the day level by >= desired day, and < next day. > > With that in mind, if you queried a specific day, the query would > remove every partition but the relevant day (or range of days). I understand that, but I want to know whether all the child day tables would be checked for exclusion, or if instead entire groups of tables can be excluded by checking the year and month first, and then checking the days only for the queried year-month. For 10 years of data, that is a difference between checking 10 years + 12 months + 30 days = 52 partitions vs. checking all 3652 partitions. Example: Querying for data of 2012-01-01. Would a query plan prune tables in a tree fashion, as follows? # Check partitions 2000, 2001, 2002, ... , 2013. 14 checks. All years but 2012 are excluded. Approximately 4748 partitions excluded with only 14 constraint exclusion checks. Check partitions 2012-01, 2012-02, ... , 2012-12. 12 checks. All months but 2012-01 are excluded. 325 partitions excluded with only 12 constraint exclusion checks. Check partitions 2012-01-01, 2012-01-02, ... , 2012-01-31. 31 checks. All days but 2012-01-01 are excluded. 30 partitions excluded with 31 constraint exclusion checks. Sequential scan on partition 2012-01-01. # Or would it have to check every child partition, as follows? # Check partition 2000-01-01, 2000-01-02, ... , 2013-09-18. 5010 checks. All days but 2012-01-01 are excluded. 5009 partitions excluded with 5010 constraint exclusion checks. Sequential scan on partition 2012-01-01. # The former would be much more efficient. > >> One can see that with a decade of data stored, there would be >> approximately 3652 partitions. The documentation discourages having >> this many partitions. > > And you shouldn't. In our partitions, we try to keep it around one > partition per month. I've recently written a library that eschews > triggers in favor of a nightly movement job to redistribute data, and > it can handle partition granularity on a sliding scale. The default is > one month, but my tests show it works properly down to 1 day. Still, > the lowest I'd use it is probably 1 week. That would still drastically > reduce your amount of estimated partitions. > > However you decide to do it, don't optimize prematurely. Monthly > partitions are enough for us and our 130GB table with 250M rows, and > that's for a single year. > >
On 09/18/2013 11:05 AM, "Gabriel E. Sánchez Martínez" wrote: > Or would it have to check every child partition, as follows? > > # > Check partition 2000-01-01, 2000-01-02, ... , 2013-09-18. 5010 checks. > All days but 2012-01-01 are excluded. 5009 partitions excluded with > 5010 constraint exclusion checks. My guess based on the fact the planner has no concept of ranges aside from their width, is that this is the approach it'll likely take. You can't really teach it that your particular inheritance tree is range constrained per level, so it has to check every range. -- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604 312-676-8870 sthomas@optionshouse.com ______________________________________________ See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email
On 09/18/2013 12:20 PM, Shaun Thomas wrote: > On 09/18/2013 11:05 AM, "Gabriel E. Sánchez Martínez" wrote: > >> Or would it have to check every child partition, as follows? >> >> # >> Check partition 2000-01-01, 2000-01-02, ... , 2013-09-18. 5010 checks. >> All days but 2012-01-01 are excluded. 5009 partitions excluded with >> 5010 constraint exclusion checks. > > My guess based on the fact the planner has no concept of ranges aside > from their width, is that this is the approach it'll likely take. You > can't really teach it that your particular inheritance tree is range > constrained per level, so it has to check every range. > In that case maybe this could be considered as a performance improvement for future versions of PG? If I am not mistaken check constraints are inherited from parent table to child table, making it feasible for the query planner to prune entire branches at once by checking the inheritance structure. I know that query performance should be very good with partitions for each month and indexes on the timestamp column, but I can think of many applications where data is extracted in daily batches and it is more natural and convenient for DB maintenance to have a partition per day. In big data applications, it may be feasible to fit a day's worth of data in RAM, but not a whole month; in such cases there could be performance gains from tree constraint exclusion. Thanks for your insight.