Thread: Dynamic Partitioning using Segment Visibility Maps
Happy New Year, everybody.

This proposal follows on from previous thinking about partitioning, where I've taken up Andrew Sullivan's suggestion to re-examine the current partitioning concept of using tables as partitions. So I've come up with an alternative concept to allow us to discuss the particular merits of each. ISTM that this new proposal has considerable potential.

Recap: Very Large Table Use Case
--------------------------------

Many large tables have a regular access pattern:

- new rows inserted frequently
- some updates and deletes on the "recent" data
- older data becomes effectively read-only
- at some point data may be removed in bulk from the table

Most tables vary on what "recent" means; some are insert only, many not. A typical example might be a large history table where the last 6 months' data is updated, but after that the data has almost zero change.

Partitioning relies on knowledge of the physical location of data to exclude sections of data from a scan. Currently the partitioning knowledge is provided by user declarations, in the form of constraints on tables linked together by views or inheritance. We also currently do this in "constraint-first" fashion, i.e. we put the data where the constraints say we should, rather than deriving the constraints from the data.

In-table Partitioning
---------------------

In the common use case there is a clear affinity between certain data columns and the physical placement of data in a table. This is true for columns such as LOG_DATE or TRANSACTION_DATE, but it is also true for id columns whose values are assigned by a Sequence. HOT helps this to remain true over time.

Given the above use case, when there is an affinity between data values and data placement *and* we know that older data tends to become read only, we can see that certain physical sections of a table will become effectively read only. (My experience is that the larger the table, the less random the write pattern - whatever the guys that wrote TPC-C say).

If we were able to keep track of which sections of a table are now read-only, then we would be able to record information about the data in each section and use it to help solve queries. This turns the current thinking on its head: we could derive the constraints from the data, rather than placing the data according to the constraints. That would be much more natural: load data into the table and have the system work out the rest.

We can imagine doing this within a single table. Imagine: split a table up into 100 sections, then record the min/max values of all tuples in those sections. When we perform a SeqScan we could skip over sections that provably cannot satisfy the query predicate. Same thing as partitioning, but within the table.

Currently tables are already broken up into 1GB file segments, so it seems fairly natural to pick that as our section size. That fits reasonably well with recommendations for ideal partition size.

Segment Exclusion
-----------------

After we note that a segment is read-only, we can scan the segment and record min/max values for all columns. These become "implicit constraints", which can then be used for segment exclusion in a similar manner to the current constraint exclusion mechanism. The implicit constraints allow SeqScans to assess segment exclusion for each segment at execution time, i.e. a scan can skip a whole segment if its constraints don't match the query.
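For example (the table, column and date ranges here are purely illustrative):

  -- Suppose ANALYZE has recorded these implicit constraints for a large
  -- history table, one entry per 1GB segment:
  --   segment 0: log_date between '2006-01-01' and '2006-06-30'
  --   segment 1: log_date between '2006-07-01' and '2006-12-31'
  --   segment 2: (last segment, still read write, no implicit constraints)

  SELECT sum(amount) FROM history WHERE log_date >= '2007-01-01';

  -- At execution time the SeqScan can prove that segments 0 and 1 cannot
  -- contain qualifying rows, so only segment 2 is actually read.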
If a segment does not (yet) have implicit constraints it will always be scanned in full. This allows partitioning, synch scans and buffer recycling to work much better together than they currently do.

Other scan types might also use segment exclusion, though this would only be useful for scans retrieving many rows; otherwise the overhead of segment exclusion might not be worthwhile.

This would also allow a Merge Join to utilise exclusion for the inner plan at execution time. Instead of scanning the whole inner table plan from the start, we would be able to start the scan from the appropriate segment. This would require us to pass the current value of the outer plan down into the inner plan. The typical executor nodes on the inner plan would be a SortNode and below that a SeqScan. In that case the executor would need to pass the outer value from the MergeJoinNode down through the SortNode to the SeqScan node. The SeqScan node could then perform partition exclusion, reducing the time for that scan and also reducing the time for the resulting sort. This sounds like it might be worth doing in normal cases also. It might turn out that the potentially applicable cases are already excluded during planning; I haven't thought about that aspect in enough detail yet.

If we collect data for all columns then many of our implicit constraints would be useless, e.g. if a column only has a few values and these are present in all segments. Matching our predicate against all constraints would be expensive, so we must weed out poor constraints. We would do this by removing any constraint that overlapped more than 10% of other segments. Various heuristics would likely need to be discovered during development to make this work without resorting to manual commands.

Note that all of this exclusion happens in the executor, not the planner. That allows this form of exclusion to work with stable functions and parameters without problem.

Noting which segments are read-only
-----------------------------------

Everything so far has relied upon our ability to note which segments of a table are read-only. We could do this in two ways:

1) have the system automatically keep track of non-changing data
2) have the DBA issue a command to "mark" a segment as read-only now

Probably a combination of the two is better, so we have three states for segments:

- READ_WRITE_ALLOWED
- EFFECTIVE_READ_ONLY (set by 1 only)
- MARKED_READ_ONLY (set by 2 only)

Having said that, I want to concentrate on (1), though I'll consider the other one also if reviewers request it.

Visibility Map
--------------

So *how* would we mark segments as read only? It turns out that attempting to do this is extremely similar to Heikki's Visibility Map idea, so I'll describe it in those terms, even though there are some differences. It also brings to mind Andrew Sullivan's read-only tuples concept.

We would keep a dynamic visibility map at *segment* level, showing which segments have all rows 100% visible. No freespace map data would be held at this level. Most everything about Heikki's Visibility Map idea holds, just that we have a bit for each segment in the table, rather than each block. The map is very small and hardly ever changes for a table, so we can cache it easily on each backend. If the map does change we can perform a relcache invalidation to make everybody re-read the map.
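To put a number on "very small" (sizes here are illustrative): with 1GB segments, even a 1TB table has only

  1 TB / 1 GB per segment = 1024 segments = 1024 bits = 128 bytes

of map, which is trivial to cache in every backend and cheap to re-read after an invalidation.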
No dynamic shared memory cache is required because any concurrent changes to the table would be ignored by a Scan anyway, so it doesn't matter if an INSERT, UPDATE or DELETE occurs while we are scanning. Any new scans that start will attempt to lock the table and then perform a relcache check before continuing. So the visibility will be set correctly for *that* scan at least.

In most cases the visibility map can be summarised as a single boolean to show whether any 100% visible segments exist. That makes accessing the map very cheap in the common, unset case.

Setting the Visibility Map
--------------------------

VACUUM would scan all READ_WRITE_ALLOWED segments and mark some of them as EFFECTIVE_READ_ONLY if 100% of the remaining rows are visible to all current backends. Marking the map will cause a relcache invalidation.

We would never mark the highest numbered segment EFFECTIVE_READ_ONLY, since it is likely to expand when INSERTs occur. This is also a useful way of excluding smaller tables from the overheads involved with this feature. In an insert-only table this would mean that only the last 1 or 2 segments would be read write, so VACUUM would always run in a bounded time, no matter how large the table.

When would we attempt to set the visibility map? If we do this aggressively, then it will be quickly unset. If we wait too long then we might lose some of the benefits of this approach.

First, we recognise that read only segments will change the autovacuum calculation - we simply exclude them entirely. This will mean that VACUUMs are run more regularly than they are now, but when they do run they will be quicker. (Note that currently, VACUUMs on a growing table get further and further apart as the table grows, assuming a constant write rate).

My feeling is that VACUUMs are sufficiently infrequent that we should set the visibility map aggressively after each VACUUM. If the map is quickly unset, then we wait some time before trying again. Unsetting seems to be relatively low contention (see below), so that seems acceptable.

We would then re-scan newly marked segments to derive min/max values for implicit constraints, performed by autovacuum worker processes. The best way seems to be that an ANALYZE command would automatically detect that a segment has been recently set EFFECTIVE_READ_ONLY and perform an all-rows scan rather than just a sample. (We would keep min/max values only for datatypes with valid btree sort operators, just as ANALYZE already does). The data for this can be stored as part of the stats histogram for each column. pg_class.reltuples would become an array of row counts in each segment, known accurate for read only segments.

A later VACUUM will still be required to freeze the rows, though that can happen later when we hit the frozen limit, or earlier if we run an explicit VACUUM FREEZE. We would need to store the oldest xid for each segment, so we know when to kick off a FREEZE on that part of the table. We would do this by making pg_class.relfrozenxid an array also.

This departs completely from the idea that the visibility map and the freespace map are related somehow. However, there is a change required in the FSM to ensure the proposed technique is stable and useful: if we scan a segment and it is 100% visible, yet there is freespace in just one of the blocks in the segment, then we will find that our visibility bit gets unset again very quickly.
If there is more than 5% freespace total in the segment then we will not set EFFECTIVE_READ_ONLY, nor report blocks in the segment to the FSM. That way, small amounts of freespace won't destroy the benefits of segment exclusion. VACUUM currently overwrites FSM information, though if this were not the case then we would have to actively remove it for the newly read only segments. Perhaps that limit would be (100 - fillfactor)%, which is normally set according to a user's expectation of the number of updates likely on a table.

Unsetting the visibility map
----------------------------

For EFFECTIVE_READ_ONLY segments, INSERTs, UPDATEs and DELETEs are still possible. For MARKED_READ_ONLY they would be explicitly refused.

The relcache invalidation caused by making a segment EFFECTIVE_READ_ONLY can flush the rd_targBlock for the relation in each backend. Other INSERTs are not possible since all current code paths either use extension or the FSM when the table is non-empty, and neither of those will cause a problem. Extension only touches the last segment, which will never be EFFECTIVE_READ_ONLY. The FSM contains no entries for an EFFECTIVE_READ_ONLY segment.

UPDATEs or DELETEs are possible on EFFECTIVE_READ_ONLY segments, but we don't expect them often. If one occurs, we will simply reset the visibility map for the table. This can be handled with a non-transactional overwrite - the bit always exists already, so the overwritten data is always the same size. That's pessimistic, since it resets the state even if the UPDATE/DELETE aborts, but it's better than holding the row locked until the transaction completes, which might prevent other things from happening. We probably also want VACUUM to clean up after any aborted transactions anyway.

The user may also wish to clear down very old data, so allowing DELETEs ensures the user can still remove old data from the table. By carefully choosing the values to be deleted, a whole segment can be deleted and then returned to the FSM. Running a huge DELETE will likely perform a SeqScan that avoids all but the data to be deleted. The following VACUUM will also ignore the other segments, so removing older data seems fairly well optimised already, in comparison with now. It would be possible to optimise this further with a segment-level truncate, but since the partitioning will already improve the DELETE and VACUUM, I'm not proposing that yet, if ever.

SHARE locks currently write to data blocks, but since they represent transient state we do not need to unset either the partitioning or the visibility map.

Overlap with Visibility Map Proposal
------------------------------------

This proposal offers many of the advantages of the earlier Visibility Map proposal, yet without major changes to heap structure. Since the segment-level visibility map is less granular it would only be 80% as effective as the more detailed block-level map. Having said that, the bookkeeping overheads will also be considerably reduced, so it does seem a good joint approach. It also allows freezing to be handled fully, which was a problem with the original visibility map proposal. WAL logging visibility map changes is also now much easier.

My thoughts have been targeted directly at partitioning, yet I have to admit that this idea overlaps with, and in my world view replaces, the Visibility Map proposal. I very much like the name Visibility Map though. I've re-read the earlier thread and agree with Jeff Davis that we *could* have multiple kinds of visibility map, but also agree with Heikki that it seems like too much work.
It was never my intent to address both challenges in one proposal, and this is just pure coincidence, or possibly just luck, depending upon how you like the proposal. If we do need to differentiate between the two proposals, we can refer to this one as the Segment Visibility Map (SVM).

Index Scans
-----------

Index-only scans would look at the Visibility Map, just as they would have done in the earlier proposal. It would be costly for an index scan to repeatedly check the visibility map, but probably worth it to do so.

Other Changes
-------------

We can handle select count(*) by scanning the non-100% visible segments of a table, then adding the stored counts for each segment to get a final total. Not sure if it's really worth doing, but it does sound like an added bonus.

There would be additional complexity in selectivity estimation and plan costing. The above proposal allows dynamic segment exclusion, which cannot be assessed at planning time anyway, so suggestions welcome...

I think it would be prudent to have an override option on VACUUM to allow us to scan a table in full, checking rather than using the visibility map. This mode would be called VACUUM CHECK.

None of the above blocks earlier proposals for read only tables, and the two features could work together very straightforwardly.

Comparison with other Partitioning approaches
---------------------------------------------

Once I realised this was possible in a fairly automatic way, I've tried hard to keep away from manual overrides, commands and new DDL. Declarative partitioning is a big overhead, though worth it for large databases. No overhead is *much* better though.

This approach to partitioning solves the following challenges:

- allows automated table extension, so works automatically with Slony
- responds dynamically to changing data
- allows stable functions, nested loop joins and parametrised queries
- allows RI via SHARE locks
- avoids the need for operator push-down through Append nodes
- allows unique indexes
- allows global indexes (because we only have one table)
- allows advanced planning/execution using read-only/visible data
- works naturally with synchronous scans and buffer recycling

All of the above are going to take considerably longer to do in any of the other ways I've come up with so far... And yes, this is a different conclusion to earlier discussions, particularly those in 2005 where I was still thinking in terms of declarative partitioning.

Conclusions
-----------

This technique would be useful for any table with historical data keyed by date or timestamp. It would also be useful for data where a time-of-insert component is implicit, such as many major entity tables where the object ids are assigned by a sequence, e.g. an Orders table with an OrderId as PK. Once all orders placed in a period have been shipped/resolved/closed, the segments holding them will be marked read-only.

It's not really going to change the code path much for small tables, yet seems likely to work reasonably well for large tables of all shapes and sizes. If a segment is being updated, we leave it alone, and maybe never actually set the visibility map at all. So overall, this idea seems to cover the main use case well, yet with only minimal impact on the existing use cases we support.

As before, I will maintain this proposal on the PG developer Wiki, once we get down to detailed design.

Like it?

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
On Wed, Jan 02, 2008 at 05:56:14PM +0000, Simon Riggs wrote:
> Like it?

Sounds good. I've only given it a quick scan though. Would read-only segments retain the same disk-level format as they do currently? It seems possible to remove the MVCC fields and hence get more tuples per page --- whether this would actually be a net performance gain/loss seems like a difficult question to answer; it would definitely be a complexity increase though.

Reading this reminds me of the design of the store for a persistent operating system called EROS. It has a very good paper[1] describing the design (implementation and careful benchmarking thereof) that I think could be a useful read. A lot of your design sounds like the EROS store, with the "Checkpoint Area" being, in current and all previous versions of Postgres, the only place data is stored. Data in EROS also has a "Home Location", which is where the data ends up after a checkpoint, and sounds somewhat like the proposed read-only segments.

Checkpoints there serve a similar purpose to checkpoints in PG, so the following analogy may get a bit confusing. When you're reading the paper I'd try to imagine the checkpoints not occurring as one event, but spread across time as the database recognizes that data is now (or has been marked as) read-only. The home locations would then store only the read-only copies of the data, and all the churn would (if the recognition of read-only data works) be done in the checkpoint area.

  Sam

[1] http://www.eros-os.org/papers/storedesign2002.pdf
On Thu, 2008-01-03 at 00:41 +0000, Sam Mason wrote:
> On Wed, Jan 02, 2008 at 05:56:14PM +0000, Simon Riggs wrote:
> > Like it?
>
> Sounds good. I've only given it a quick scan though. Would read-only
> segments retain the same disk-level format as they do currently?

Yes, no changes at all to the table. So your app would just work, without any DDL changes. Existing partitioning apps would not change.

> It seems possible to remove the MVCC fields and hence get more tuples
> per page --- whether this would actually be a net performance gain/loss
> seems like a difficult question to answer; it would definitely be a
> complexity increase though.

I've been looking at general compression at table and segment level, but that's further down the track. Removing the MVCC fields is too much work, I think.

> Reading this reminds me of the design of the store for a persistent
> operating system called EROS. It has a very good paper[1] describing
> the design (implementation and careful benchmarking thereof) that I
> think could be a useful read.

Thanks, will do.

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
Hi Simon,

Looks like a novel idea. I just want to confirm my understanding of the proposal.

a) This proposal would work for the kind of table organizations which are currently partitioned and maintained based on some kind of timestamp. Consider one of the use-cases: a large Retail firm has a lot of stores. The DBA maintains and updates the inventories of those stores in hash-partitions based on store-no. As the inventory gets updated, the corresponding partition receives the update, and it goes on like that. Here all the partitions are going to get constantly updated, so no partition can possibly become a read only partition. You have clearly called this out in your para; I am just re-confirming that. Or do you have something for this in your soln.?

To my limited experience, most partition strategies are based on some form of time-stamp. If the proposed soln. can cater to those, it has a lot of use-cases.

Thanks,
Gokul.
On Fri, 2008-01-04 at 13:06 +0530, Gokulakannan Somasundaram wrote:
> a) This proposal would work for the kind of table organizations which
> are currently partitioned and maintained based on some kind of
> timestamp. Consider one of the use-cases: a large Retail firm has a lot
> of stores. The DBA maintains and updates the inventories of those stores
> in hash-partitions based on store-no. As the inventory gets updated,
> the corresponding partition receives the update, and it goes on like
> that. Here all the partitions are going to get constantly updated,
> so no partition can possibly become a read only partition. You have
> clearly called this out in your para; I am just re-confirming that.
> Or do you have something for this in your soln.?
>
> To my limited experience, most partition strategies are based on some
> form of time-stamp. If the proposed soln. can cater to those, it has
> a lot of use-cases.

I don't think it would apply to an Inventory table. That is a current state table. It is designed for any large tables that would grow naturally over time if we left them to do so.

Solutions that it would work for:

- any Fact table where measurements/observations/events are accumulated, e.g.
	Web Hits (any Internet events)
	Call Detail Records
	Sales
	Security Events
	Scientific Measurements
	Process Control

- any Major Entity where new entities are created from a sequence, e.g.
	Orders, OrderItems
	Invoices
	Shipments, Returns
	most SCM/DCM events

It's not aimed at any particular benchmark, just real usage scenarios.

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
Simon Riggs wrote:
> We would keep a dynamic visibility map at *segment* level, showing which
> segments have all rows 100% visible. No freespace map data would be
> held at this level.

Small dumb-user question.

I take it you've considered some more flexible consecutive-run-of-blocks unit of flagging rather than file-segments. That obviously complicates the tracking but means you can cope with infrequent updates as well as mark most of the "most recent" segment for log-style tables.

--
Richard Huxton
Archonet Ltd
On Fri, 2008-01-04 at 10:22 +0000, Richard Huxton wrote:
> Simon Riggs wrote:
> > We would keep a dynamic visibility map at *segment* level, showing which
> > segments have all rows 100% visible. No freespace map data would be
> > held at this level.
>
> Small dumb-user question.
>
> I take it you've considered some more flexible consecutive-run-of-blocks
> unit of flagging rather than file-segments. That obviously complicates
> the tracking but means you can cope with infrequent updates as well as
> mark most of the "most recent" segment for log-style tables.

I'm writing the code to abstract that away, so yes.

Now you mention it, it does seem straightforward to have a table storage parameter for partition size, which defaults to 1GB. The partition size is simply a number of consecutive blocks, as you say. The smaller the partition size, the greater the overhead of managing it.

Also I've been looking at read-only tables and compression, as you may know. My idea was that in the future we could mark segments as either
- read-only
- compressed
- able to be shipped off to hierarchical storage

Those ideas work best if the partitioning is based around the physical file sizes we use for segments.

Changing the partition size would simply reset the visibility map for that table, in its easiest implementation.

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
Simon Riggs wrote:
> On Fri, 2008-01-04 at 10:22 +0000, Richard Huxton wrote:
>> Simon Riggs wrote:
>>> We would keep a dynamic visibility map at *segment* level, showing which
>>> segments have all rows 100% visible. No freespace map data would be
>>> held at this level.
>> Small dumb-user question.
>>
>> I take it you've considered some more flexible consecutive-run-of-blocks
>> unit of flagging rather than file-segments. That obviously complicates
>> the tracking but means you can cope with infrequent updates as well as
>> mark most of the "most recent" segment for log-style tables.
>
> I'm writing the code to abstract that away, so yes.
>
> Now you mention it, it does seem straightforward to have a table storage
> parameter for partition size, which defaults to 1GB. The partition size
> is simply a number of consecutive blocks, as you say.
>
> The smaller the partition size, the greater the overhead of managing it.

Oh, obviously, but with smaller partition sizes this also becomes useful for low-end systems as well as high-end ones. Skipping 80% of a seq-scan on a date-range query is a win for even small (by your standards) tables. I shouldn't be surprised if the sensible-number-of-partitions remained more-or-less constant as you scaled the hardware, but the partition size grew.

> Also I've been looking at read-only tables and compression, as you may
> know. My idea was that in the future we could mark segments as either
> - read-only
> - compressed
> - able to be shipped off to hierarchical storage
>
> Those ideas work best if the partitioning is based around the physical
> file sizes we use for segments.

I can see why you've chosen file segments. It certainly makes things easier.

Hmm - thinking about the date-range scenario above, it occurs to me that for seq-scan purposes the correct partition size depends upon the data value you are interested in. What I want to know is which blocks Jan 07 covers (or rather which blocks it doesn't), rather than knowing blocks 1-9999999 cover 2005-04-12 to 2007-10-13. Of course that means that you'd eventually want different partition sizes tracking visibility for different columns (e.g. id, timestamp). I suspect the same would be true for read-only/compressed/archived flags, but I can see how they are tightly linked to physical files (particularly the last two).

--
Richard Huxton
Archonet Ltd
Hello Simon,

Simon Riggs wrote:
> I've come up with an alternative concept to allow us to discuss the
> particular merits of each. ISTM that this new proposal has considerable
> potential.

Hm.. interesting idea.

> If we were able to keep track of which sections of a table are now
> read-only, then we would be able to record information about the data in
> each section and use it to help solve queries. This turns the current
> thinking on its head: we could derive the constraints from the data,
> rather than placing the data according to the constraints. That would
> be much more natural: load data into the table and have the system work
> out the rest.

Yeah, but that's also the most limiting factor of your approach: it covers only horizontal partitioning by time (or to be more precise, by columns which are very likely to increase or decrease with time). All other columns will very likely contain values from the full range of possible values. As you have pointed out, that might be a very frequent use case. I can't argue about that; however, I think it's important to be well aware of that limitation.

> Other scan types might also use segment exclusion, though this would
> only be useful for scans retrieving many rows; otherwise the overhead of
> segment exclusion might not be worthwhile.

Uh.. the overhead of checking against min/max values doesn't seem that big to me. I rather think the gain for index scans would be prohibitively small, because (given frequent enough vacuuming) an index scan shouldn't return many pointers to tuples in segments which could be optimized away by segment exclusion.

> If we collect data for all columns then many of our implicit constraints
> would be useless, e.g. if a column only has a few values and these are
> present in all segments. Matching our predicate against all constraints
> would be expensive, so we must weed out poor constraints. We would do
> this by removing any constraint that overlapped more than 10% of other
> segments. Various heuristics would likely need to be discovered during
> development to make this work without resorting to manual commands.

Uh, well, that's about the limitation I've pointed out above. But is it really worth maintaining statistics about overlapping values and removing min/max checks for certain columns? It would save you the min/max check per segment and scan, but cost maintaining the statistics and checking against the statistics once per scan. AFAICS the block with the min/max tuple per segment will often remain cached anyway... dunno.

> Noting which segments are read-only
> -----------------------------------
>
> Everything so far has relied upon our ability to note which segments of
> a table are read-only. We could do this in two ways:
>
> 1) have the system automatically keep track of non-changing data
> 2) have the DBA issue a command to "mark" a segment as read-only now
>
> Probably a combination of the two is better, so we have three states for
> segments:
> - READ_WRITE_ALLOWED
> - EFFECTIVE_READ_ONLY (set by 1 only)
> - MARKED_READ_ONLY (set by 2 only)
>
> Having said that, I want to concentrate on (1), though I'll consider the
> other one also if reviewers request it.

Hm.. AFAICT, horizontal partitioning often serves manageability, which is quite limited having all data in one table (you can't move a single segment to a different tablespace). Thus I think option 2 is pretty constrained in usability. What would the DBA gain by setting a segment to read only?
How does the DBA figure out which segment a tuple is stored in (so she can decide to mark it read only)?

> The user may also wish to clear down very old data, so allowing DELETEs
> ensures the user can still remove old data from the table. By
> carefully choosing the values to be deleted, a whole segment can be
> deleted and then returned to the FSM.

Oh, yeah, that sounds like a good optimization. Bulk deletes, yummie!

> This proposal offers many of the advantages of the earlier Visibility
> Map proposal, yet without major changes to heap structure. Since the
> segment-level visibility map is less granular it would only be 80% as
> effective as the more detailed block-level map. Having said that, the
> bookkeeping overheads will also be considerably reduced, so it does seem
> a good joint approach. It also allows freezing to be handled fully,
> which was a problem with the original visibility map proposal. WAL
> logging visibility map changes is also now much easier.

I generally agree, although I'm somewhat dubious about the 80% factor.

> My thoughts have been targeted directly at partitioning, yet I have to
> admit that this idea overlaps with, and in my world view replaces, the
> Visibility Map proposal. I very much like the name Visibility Map
> though.

I would even say that partitioning is somewhat of a misleading term here, because it normally allows the DBA to decide on where to split. Given that we are operating on segments here, to which the DBA has very limited information and access, I prefer the term "Segment Exclusion". I think of that as an optimization of sequential scans on tables with the above mentioned characteristics.

> If we do need to differentiate between the two proposals, we can refer
> to this one as the Segment Visibility Map (SVM).

I'm clearly in favor of separating the two proposals. SVM is a good name, IMHO.

> We can handle select count(*) by scanning the non-100% visible segments
> of a table, then adding the stored counts for each segment to get a
> final total. Not sure if it's really worth doing, but it does sound like
> an added bonus.

Yup, sounds tempting. Although it's contrary to the Postgres position so far. And one could argue that you'd then have to maintain not only count(), but also avg(), sum(), etc. for all tuples in the 100% visible segments.

> There would be additional complexity in selectivity estimation and plan
> costing. The above proposal allows dynamic segment exclusion, which
> cannot be assessed at planning time anyway, so suggestions welcome...

Hm.. that looks like a rather bad downside of an executor-only optimization.

> Comparison with other Partitioning approaches
> ---------------------------------------------
>
> Once I realised this was possible in a fairly automatic way, I've tried
> hard to keep away from manual overrides, commands and new DDL.
>
> Declarative partitioning is a big overhead, though worth it for large
> databases. No overhead is *much* better though.
> This approach to partitioning solves the following challenges:
> - allows automated table extension, so works automatically with Slony
> - responds dynamically to changing data
> - allows stable functions, nested loop joins and parametrised queries
> - allows RI via SHARE locks
> - avoids the need for operator push-down through Append nodes
> - allows unique indexes
> - allows global indexes (because we only have one table)
> - allows advanced planning/execution using read-only/visible data
> - works naturally with synchronous scans and buffer recycling
>
> All of the above are going to take considerably longer to do in any of
> the other ways I've come up with so far...

I fully agree. But as I tried to point out above, the gains in manageability from Segment Exclusion are also pretty close to zero. So I'd argue it only fulfills part of the needs for general horizontal partitioning.

> This technique would be useful for any table with historical data keyed
> by date or timestamp. It would also be useful for data where a
> time-of-insert component is implicit, such as many major entity tables
> where the object ids are assigned by a sequence, e.g. an Orders table
> with an OrderId as PK. Once all orders placed in a period have been
> shipped/resolved/closed, the segments holding them will be marked
> read-only.

Agreed. Just a minor note: I find "marked read-only" too strong, as it implies an impossibility to write. I propose speaking about mostly-read segments, or optimized for reading, or similar.

> It's not really going to change the code path much for small tables, yet
> seems likely to work reasonably well for large tables of all shapes and
> sizes.

That sounds a bit too optimistic to me. For Segment Exclusion, it takes only *one* tuple to enlarge the min/max range dramatically in any direction. So it's not the overall correlation between column values and storage location that matters, but rather only the min/max column values. Have you ever checked how well these min/max values correlate with the segment number?

Pretty much the same argument applies to SVM: an update to only one tuple in a segment is enough to remove the optimization for reading (the EFFECTIVE_READ_ONLY property) for the segment. The assumption here (being that updates happen mostly to newer segments) is not quite the same as above.

Maybe a combination with CLUSTERing would be worthwhile? Or even enforced CLUSTERing for the older segments?

> If a segment is being updated, we leave it alone, and maybe never
> actually set the visibility map at all. So overall, this idea seems to
> cover the main use case well, yet with only minimal impact on the
> existing use cases we support.

Yup.

> As before, I will maintain this proposal on the PG developer Wiki, once
> we get down to detailed design.

Cool. Thanks for working out yet another great proposal. I hope to have been of help with my questions and remarks. ;-)

Regards

Markus
Hi,

Simon Riggs wrote:
> - any Fact table where measurements/observations/events are accumulated,
>   e.g.
> 	Web Hits (any Internet events)
> 	Call Detail Records
> 	Sales
> 	Security Events
> 	Scientific Measurements
> 	Process Control
>
> - any Major Entity where new entities are created from a sequence,
>   e.g.
> 	Orders, OrderItems
> 	Invoices
> 	Shipments, Returns
> 	most SCM/DCM events

...and only changed very seldom after a while, I would add. Because changing an old tuple would invalidate the optimization for the affected segment.

That's why this optimization can't help for inventory tables, where an id might correlate with time and storage location, but write access doesn't correlate with storage location (segment number) and time.

Regards

Markus
Hi,

Simon Riggs wrote:
> The smaller the partition size, the greater the overhead of managing it.
> Also I've been looking at read-only tables and compression, as you may
> know. My idea was that in the future we could mark segments as either
> - read-only
> - compressed
> - able to be shipped off to hierarchical storage
>
> Those ideas work best if the partitioning is based around the physical
> file sizes we use for segments.

As much as I'd like this simplification, I'm still thinking of these segments as an implementation detail of Postgres, and not something the user should have to deal with. Allowing the DBA to move segments to a different table space, and giving him the possibility to check which tuples are in which segment, seems awkward from a user's perspective, IMO.

Regards

Markus
On Fri, 2008-01-04 at 13:29 +0100, Markus Schiltknecht wrote:
> Given that we are operating on segments here, to which the DBA has very
> limited information and access, I prefer the term "Segment Exclusion". I
> think of that as an optimization of sequential scans on tables with the
> above mentioned characteristics.
>
> > If we do need to differentiate between the two proposals, we can refer
> > to this one as the Segment Visibility Map (SVM).
>
> I'm clearly in favor of separating the two proposals. SVM is a good
> name, IMHO.

OK, I'll refer to this proposal as SVM.

> > There would be additional complexity in selectivity estimation and plan
> > costing. The above proposal allows dynamic segment exclusion, which
> > cannot be assessed at planning time anyway, so suggestions welcome...
>
> Hm.. that looks like a rather bad downside of an executor-only
> optimization.

I think that's generally true. We already have that problem with planned statements and work_mem, for example, and parameterised query planning is a difficult problem. Stable functions are already estimated at plan time, so we hopefully should be getting that right. I don't see any show stoppers here, just more of the usual problems of query optimization.

> > Comparison with other Partitioning approaches
> > ---------------------------------------------
> >
> > Once I realised this was possible in a fairly automatic way, I've tried
> > hard to keep away from manual overrides, commands and new DDL.
> >
> > Declarative partitioning is a big overhead, though worth it for large
> > databases. No overhead is *much* better though.
> >
> > This approach to partitioning solves the following challenges:
> > - allows automated table extension, so works automatically with Slony
> > - responds dynamically to changing data
> > - allows stable functions, nested loop joins and parametrised queries
> > - allows RI via SHARE locks
> > - avoids the need for operator push-down through Append nodes
> > - allows unique indexes
> > - allows global indexes (because we only have one table)
> > - allows advanced planning/execution using read-only/visible data
> > - works naturally with synchronous scans and buffer recycling
> >
> > All of the above are going to take considerably longer to do in any of
> > the other ways I've come up with so far...
>
> I fully agree. But as I tried to point out above, the gains in
> manageability from Segment Exclusion are also pretty close to zero. So
> I'd argue it only fulfills part of the needs for general horizontal
> partitioning.

Agreed. My focus for this proposal wasn't manageability, as it had been in other recent proposals. I think there are some manageability wins to be had as well, but we need to decide what sort of partitioning we want/need first. So in the case of SVM, enhanced manageability is really a phase 2 thing. Plus, you can always combine a design with constraint and segment exclusion.

> Maybe a combination with CLUSTERing would be worthwhile? Or even
> enforced CLUSTERing for the older segments?

I think there's merit in Heikki's maintain cluster order patch, and that should do an even better job of maintaining locality.

Thanks for the detailed comments. I'll do my best to include all of the viewpoints you've expressed as the design progresses.

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
On Fri, Jan 04, 2008 at 01:29:55PM +0100, Markus Schiltknecht wrote:
> Agreed. Just a minor note: I find "marked read-only" too strong, as it
> implies an impossibility to write. I propose speaking about mostly-read
> segments, or optimized for reading, or similar.

I do want some segments to be _marked_ read-only: I want attempted writes to them to _fail_.

A
On Fri, 2008-01-04 at 13:06 -0500, Andrew Sullivan wrote:
> On Fri, Jan 04, 2008 at 01:29:55PM +0100, Markus Schiltknecht wrote:
> > Agreed. Just a minor note: I find "marked read-only" too strong, as it
> > implies an impossibility to write. I propose speaking about mostly-read
> > segments, or optimized for reading, or similar.
>
> I do want some segments to be _marked_ read-only: I want attempted writes
> to them to _fail_.

I think Markus thought that we would mark them read only automatically, which was not my intention. I believe it's possible to have this in a way that will make you both happy. Some more explanation:

There would be three different states for a segment:

1. read write
2. "optimized for reading", as Markus says it
3. marked read only by explicit command

Transition 1 -> 2 is by autovacuum under the SVM proposal; transition 2 -> 3 is by user command only. So throwing an ERROR is acceptable for segments in state 3.

I came up with a complex scheme for going from 1 -> 3 previously, but I don't think it's needed any longer (for this, at least). It's trivial to go from 2 -> 3 using an ALTER TABLE statement, along the lines of

	ALTER TABLE .... WHERE ....

Files that are completely in state 3 can then be archived by a hierarchical storage manager without problem.

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
Hi,

Simon Riggs wrote:
> On Fri, 2008-01-04 at 13:06 -0500, Andrew Sullivan wrote:
> > On Fri, Jan 04, 2008 at 01:29:55PM +0100, Markus Schiltknecht wrote:
> > > Agreed. Just a minor note: I find "marked read-only" too strong, as it
> > > implies an impossibility to write. I propose speaking about mostly-read
> > > segments, or optimized for reading, or similar.

Hm.. yeah, after rereading, I realize that I've mixed up states no. 2 and 3 here, sorry.

> > I do want some segments to be _marked_ read-only: I want attempted writes
> > to them to _fail_.

Well, I can see use cases for marking tuples or complete relations as read only. But segments?

I'm still puzzled about how a DBA is expected to figure out which segments to mark. Simon, are you assuming we are going to pass on segment numbers to the DBA one day?

If not, a more user friendly command like "MARK READ ONLY WHERE date <= 2006" would have to move tuples around between segments, so as to be able to satisfy the split point exactly, right?

> I think Markus thought that we would mark them read only automatically,
> which was not my intention. I believe it's possible to have this in a way
> that will make you both happy. Some more explanation:
>
> There would be three different states for a segment:
> 1. read write
> 2. "optimized for reading", as Markus says it
> 3. marked read only by explicit command

Thanks for clarification.

Regards

Markus
On Fri, Jan 04, 2008 at 10:26:54PM +0100, Markus Schiltknecht wrote:
> I'm still puzzled about how a DBA is expected to figure out which
> segments to mark.

I think that part might be hand-wavy still. But once this facility is there, what's to prevent the current active segment (and the rest) from also getting this mark, which would mean "the table is read only"?

A
On Fri, 2008-01-04 at 22:26 +0100, Markus Schiltknecht wrote:
> I'm still puzzled about how a DBA is expected to figure out which
> segments to mark. Simon, are you assuming we are going to pass on
> segment numbers to the DBA one day?

No Way!

That would stop Richard's idea to make the segment stride configurable, apart from being a generally ugly thing.

> If not, a more user friendly command like "MARK READ ONLY WHERE date <=
> 2006" would have to move tuples around between segments, so as to be
> able to satisfy the split point exactly, right?

Yes, just a simple WHERE clause that we can translate into segments under the covers. It would be an ALTER TABLE, so we get an exclusive lock:

	ALTER TABLE foo SET READ ONLY WHERE ....

possibly with a few restrictions on the WHERE clause. Anyway this is just futures and dreams so far, so let's just say something like that is possible in the future and work out more when we pass the first few hurdles.

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
On Friday 04 January 2008 17:01, Simon Riggs wrote:
> On Fri, 2008-01-04 at 22:26 +0100, Markus Schiltknecht wrote:
> > I'm still puzzled about how a DBA is expected to figure out which
> > segments to mark. Simon, are you assuming we are going to pass on
> > segment numbers to the DBA one day?
>
> No Way!
>
> That would stop Richard's idea to make the segment stride configurable,
> apart from being a generally ugly thing.
>
> > If not, a more user friendly command like "MARK READ ONLY WHERE date <=
> > 2006" would have to move tuples around between segments, so as to be
> > able to satisfy the split point exactly, right?
>
> Yes, just a simple WHERE clause that we can translate into segments
> under the covers. It would be an ALTER TABLE, so we get an exclusive
> lock:
>
> 	ALTER TABLE foo SET READ ONLY WHERE ....
>
> possibly with a few restrictions on the WHERE clause. Anyway this is
> just futures and dreams so far, so let's just say something like that is
> possible in the future and work out more when we pass the first few
> hurdles.

Not to be negative, but istm that how this feature would be managed is as important as the bits under the hood. Or at least we have to believe there will be some practical way to manage this, of which I am as yet skeptical.

--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
On Fri, 2008-01-04 at 22:31 -0500, Robert Treat wrote:
> Not to be negative, but istm that how this feature would be managed is as
> important as the bits under the hood.

Agreed. On this part of the thread, we've been discussing an extension to the basic proposal, which is why I have not been concentrating there. Core management wise, the basic proposal showed how we would be able to have VACUUM run much faster than before, and how DELETE will also be optimised naturally by this approach. Loading isn't any slower than it is now; loading does need some work, but that's another story.

> Or at least we have to believe there will be some practical way to
> manage this, of which I am as yet skeptical.

Skepticism is OK, but I'd like to get your detailed thoughts on this. I've been an advocate of the multi-tables approach now for many years, so I don't expect everybody to switch their beliefs on my say-so overnight. Let me make a few more comments in this area:

The main proposal deliberately has few, if any, knobs and dials. That's a point of philosophy that I've had views on previously: my normal stance is that we need some knobs to allow the database to be tuned to individual circumstances.

In this case, partitioning is way too complex to administer effectively and requires application changes that make it impossible to use for packaged applications. The latest Oracle TPC-H benchmark uses 10 pages of DDL to set it up, and if I can find a way to avoid that, I'd recommend it to all. I do still want some knobs and dials, just not 10 pages' worth, though I'd like yours and others' guidance on what those should be. Oracle have been responding to feedback with their new interval partitioning, but it's still a multi-table approach in essence.

My observation of partitioned databases is that they all work beautifully at the design stage, but problems emerge over time. A time-based range partitioned table can often have different numbers of rows per partition, giving inconsistent response times. A height-balanced approach, where we make the partitions all the same size yet vary the data value boundaries, will give much more consistent query times and can be completely automated much more easily.

The SVM concept doesn't cover everything that you can do with partitioning, but my feeling is it covers the main use cases well. If that's not true, in broad strokes or in the detail, then we need to uncover that. Everybody's help in doing that is appreciated, whatever the viewpoint and whatever the outcome.

It's probably worth examining existing applications to see how well they would migrate to the segmented tables approach. The following query will analyse one column of a table to produce a list of boundary values, given a segment size of 131072 blocks (1 GB):

	select
		substr(ctid::text,2,strpos(ctid::text,',')-2)::integer/131072 as seg,
		min(PrimaryKey), max(PrimaryKey)
	from bigtable
	group by seg;

We should be able to see whether this works for existing use cases, or not, fairly easily.

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
Andrew Sullivan wrote:
> On Fri, Jan 04, 2008 at 10:26:54PM +0100, Markus Schiltknecht wrote:
> > I'm still puzzled about how a DBA is expected to figure out which
> > segments to mark.
>
> I think that part might be hand-wavy still. But once this facility is
> there, what's to prevent the current active segment (and the rest) from
> also getting this mark, which would mean "the table is read only"?

Well, sure, marking *all* segments read-only is pretty easy. But that was not quite my point.

Regards

Markus
On Sat, Jan 05, 2008 at 09:33:45AM +0000, Simon Riggs wrote:
[...]
> The main proposal deliberately has few, if any, knobs and dials. That's
> a point of philosophy that I've had views on previously: my normal
> stance is that we need some knobs to allow the database to be tuned to
> individual circumstances.

One thought I had back then, with partitioned tables, was "gee -- a B-tree index is already doing a partition; why do a manual partition on top of that?". It's an exhilarating experience to watch you making a design before I could even spell this nebulous and unclear thought :-)

Regards
--
tomás
Hi,

tomas@tuxteam.de wrote:
> > The main proposal deliberately has few, if any, knobs and dials. That's
> > a point of philosophy that I've had views on previously: my normal
> > stance is that we need some knobs to allow the database to be tuned to
> > individual circumstances.
>
> One thought I had back then, with partitioned tables, was "gee -- a B-tree
> index is already doing a partition; why do a manual partition on top of
> that?".

Well, that's why I'm so focused on manageability: I think the primary purpose of partitioning is manageability (of the underlying storage for a table).

Regards

Markus
Hi,

Simon Riggs wrote:
> On Fri, 2008-01-04 at 22:26 +0100, Markus Schiltknecht wrote:
>> I'm still puzzled about how a DBA is expected to figure out which
>> segments to mark. Simon, are you assuming we are going to pass on
>> segment numbers to the DBA one day?
>
> No Way!

Ah, I'm glad ;-)

Simon Riggs wrote:
> Skepticism is OK, but I'd like to get your detailed thoughts on this.
> I've been an advocate of the multi-tables approach now for many years,
> so I don't expect everybody to switch their beliefs on my say-so
> overnight. Let me make a few more comments in this area:

I've so far always thought about some sort of multi-relations approach for partitioning, yes. Let's see if I can get my mind around single-table partitioning.

> The main proposal deliberately has few, if any, knobs and dials. That's
> a point of philosophy that I've had views on previously: my normal
> stance is that we need some knobs to allow the database to be tuned to
> individual circumstances.
>
> In this case, partitioning is way too complex to administer effectively
> and requires application changes that make it impossible to use for
> packaged applications. The latest Oracle TPC-H benchmark uses 10 pages
> of DDL to set it up, and if I can find a way to avoid that, I'd recommend
> it to all. I do still want some knobs and dials, just not 10 pages'
> worth, though I'd like yours and others' guidance on what those should
> be. Oracle have been responding to feedback with their new interval
> partitioning, but it's still a multi-table approach in essence.

I can absolutely support your efforts to minimize knobs and configuration DDL. However, my current feeling is that segment-based partitioning complicates things, because the DBA doesn't have tools and commands to handle segments.

To satisfy all the different requirements of partitioning with segment-based partitioning, we'd have to allow a table to span multiple table spaces. I'm not very keen on going that way.

However, what I certainly like is the automated split point definition. Instead of having to create tables by hand and "linking" them via inheritance and constraint exclusion, I have something very similar in mind, like what you proposed for marking read-only segments. Something like:

  SPLIT TABLE customers AT cust_name > 'n';

or:

  SPLIT TABLE inventory AT inv_id % 4 >= 2;

In my imagination, this should automatically create the underlying relations, i.e.:

  NOTICE: relations inventory__l and inventory__r have been created.

That way, the DBA could then handle those like normal relations, querying them or moving them to different table spaces like all other normal relations.

In a way, that's not so different from possible extensions on top of Segment Exclusion, except that the DBA additionally gets a relation name to be able to address the set of segments which form a partition. Or put it the other way around: go for Segment Exclusion, but add some sort of a sentinel relation for each set of segments, to make them reachable for the DBA.

> My observation of partitioned databases is that they all work
> beautifully at the design stage, but problems emerge over time. A
> time-based range partitioned table can often have different numbers of
> rows per partition, giving inconsistent response times. A
> height-balanced approach, where we make the partitions all the same size
> yet vary the data value boundaries, will give much more consistent query
> times and can be completely automated much more easily.

Uh.. well, consistent query time isn't the first thing I'm expecting from partitioning by time ranges. If I wanted consistent query times I'd rather use hash partitioning or something, no?

I'd even state that one *wants* inconsistent response times when using time based range partitioning: moving old, seldom used data to slower storage and keeping only a small amount of often used tuples on the faster disks, for example.

> The SVM concept doesn't cover everything that you can do with
> partitioning, but my feeling is it covers the main use cases well.

As I regard manageability to be the main advantage of partitioning, which you've intentionally left out for now, I disagree here.

How could SVM or Segment Exclusion possibly cover what hash partitioning does? Maybe together with the ability to store different segments of a table on different table spaces. That could be considered an approach to range partitioning. But then, that would be the partitioning, and not SVM or Segment Exclusion. To me, both SVM and SE look much more like an optimization for certain special cases and don't have much to do with partitioning.

Regards

Markus
On Saturday 05 January 2008 10:42, Markus Schiltknecht wrote: > > The main proposal deliberately has few, if any, knobs and dials. That's > > a point of philosophy that I've had views on previously: my normal > > stance is that we need some knobs to allow the database to be tuned to > > individual circumstances. > > > > In this case, partitioning is way too complex to administer effectively > > and requires application changes that make it impossible to use for > > packaged applications. The latest Oracle TPC-H benchmark uses 10 pages > > of DDL to set it up and if I can find a way to avoid that, I'd recommend > > it to all. I do still want some knobs and dials, just not 10 pages > > worth, though I'd like yours and others' guidance on what those should > > be. Oracle have been responding to feedback with their new interval > > partitioning, but its still a multi-table approach in essence. > > I can absolutely support your efforts to minimize knobs and > configuration DDL. However, my current feeling is, that segments based > partitioning complicates things, because the DBA doesn't have tools and > commands to handle segments. > Personally I cant say it complicates things, because it isn't clear how it will be managed. :-) > To satisfy all the different requirements of partitioning with segments > based partitioning, we'd have to allow a table to span multiple table > spaces. I'm not very keen on going that way. > Why? > However, what I certainly like is the automated split point definition. > Instead of having to create tables by hand and "linking" them via > inheritance and constraint exclusion, I have something very similar in > mind, like what you proposed for marking read-only segments. Something > like: > > SPLIT TABLE customers AT cust_name > 'n'; > > or: > > SPLIT TABLE inventory AT inv_id % 4 >= 2; > > In my imagination, this should automatically create the underlying > relations, i.e.: > > NOTICE: relations inventory__l and inventory__r have been created. > > That way, the DBA could then handle those like normal relations, > querying them or moving them to different table spaces like all other > normal relations. > > In a way, that's not so different from possible extensions on top of > Segment Exclusion, except that the DBA additionally get a relation name > to be able to address the set of segments which form a partition. Or put > it the other way around: go for Segment Exclusion, but add some sort of > a sentinel relation for each set of segments, to make them reachable for > the DBA. > So the one thing that always scares me about these "define it all and let the database sort it out" methods is they seem to lead to cases where the system ends up rewriting the data to fit into some new partition layout. One thing that is nice about the current partitioning scheme is you can control the impact of this behavior in these scenarios, but moving around small portions of the table at a time. > > My observation of partitioned databases is that they all work > > beautifully at the design stage, but problems emerge over time. A > > time-based range partitioned table can often have different numbers of > > rows per partition, giving inconsistent response times. A > > height-balanced approach where we make the partitions all the same size, > > yet vary the data value boundaries will give much more consistent query > > times and can be completely automated much more easily. > > Uh.. well, consistent query time isn't the first thing I'm expecting > from partitioning by time ranges. 
> If I wanted consistent query times I'd rather use hash partitioning or something, no? More to the point (I think) is that people define access to the data based on the meaning of the data, not how it is stored on disk. For example, in some tables we only need to be active on 1 month's worth of data... how that is laid out on disk (# partitions, which tablespaces) is a means to the end of working actively on 1 month's worth of data. I can't think of many cases where people would actually say they want to work actively on the most recent GB of data. > I'd even state that one *wants* inconsistent response times when using > time-based range partitioning, by moving old, seldom used data to slower > storage and keeping only a small amount of often used tuples on the > faster disks, for example. > Yes, we do this quite a bit; and at least one of our partitioned tables (in total) is larger in size than our tablespace with the fastest disks. > > The SVM concept doesn't cover everything that you can do with > > partitioning, but my feeling is it covers the main use cases well. > > As I regard manageability to be the main advantage of partitioning, > which you've intentionally left out for now, I disagree here. > > How could SVM or Segment Exclusion potentially cover what hash > partitioning does? Maybe together with the ability to store different > segments of a table on different table spaces. That could be considered > an approach to range partitioning. But then, that would be the > partitioning, and not SVM or Segment Exclusion. To me, both of SVM and > SE look much more like an optimization for certain special cases and > don't have much to do with partitioning. > Even if this were true, it might still be a useful optimization. One table I am thinking of in particular in my system has one query we need to run across partitions, which ends up doing a slew of bitmap index scans for all the partitions. If using segment exclusion on it meant that I could get a global index to help that query, I'd be happy. -- Robert Treat Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
Hi, Robert Treat wrote: > Personally I can't say it complicates things, because it isn't clear how it > will be managed. :-) Well, management of relations is easy enough, known to the DBA and most importantly: it already exists. Having to set up something which is *not* tied to a relation complicates things just because it's an additional concept. But as I've pointed out, maybe what we have in mind isn't that different at all. Just have a sentinel relation mean a set of segments, i.e. all read-only segments of a table. Then again, a table - in a way - is not much else than a set of segments. So where's the real difference? >> To satisfy all the different requirements of partitioning with segments >> based partitioning, we'd have to allow a table to span multiple table >> spaces. I'm not very keen on going that way. >> > > Why? Uh.. if a table (RELKIND_RELATION) can only span one table space, as it is now, all of its segments are in the same table space. I don't quite call that partitioning. Well, sure, you could call it so, but then, each and every Postgres table is already partitioned in 1G segments. It all depends on the definitions, but in my world, horizontal partitioning for databases involves multiple table spaces (and is quite useless without that). Calling anything else partitioning is confusing, IMO. > So the one thing that always scares me about these "define it all and let the > database sort it out" methods is they seem to lead to cases where the system > ends up rewriting the data to fit into some new partition layout. That holds true no matter if you shuffle between segments or relations. To be able to let the DBA define an exact split point, the database *will* have to shuffle tuples around. Why does that scare you? It's a regular database system's maintenance procedure. > One thing > that is nice about the current partitioning scheme is you can control the > impact of this behavior in these scenarios, by moving around small portions > of the table at a time. Uh.. I'm not quite following. What "current partitioning scheme" are you referring to? Why should that not be possible with other schemes? Moving the split point between two partitions involves moving tuples around, no matter if you are going to move them between segments or between relations building the partitions. > More to the point (I think) is that people define access to the data based on > the meaning of the data, not how it is stored on disk. For example, in some > tables we only need to be active on 1 month's worth of data... how that is > laid out on disk (# partitions, which tablespaces) is a means to the end of > working actively on 1 month's worth of data. I can't think of many cases where > people would actually say they want to work actively on the most recent GB of > data. Agreed. I'd say that's why the DBA needs to be able to define the split point between partitions: only he knows the meaning of the data. >> To me, both of SVM and >> SE look much more like an optimization for certain special cases and >> don't have much to do with partitioning. > > Even if this were true, it might still be a useful optimization. Possibly, yes. To me, the use case seems pretty narrow, though. For example it doesn't affect index scans much. > One table I > am thinking of in particular in my system has one query we need to run across > partitions, which ends up doing a slew of bitmap index scans for all the > partitions.
> If using segment exclusion on it meant that I could get a global > index to help that query, I'd be happy. As proposed, Segment Exclusion works only on exactly one table. Thus, if you already have your data partitioned into multiple relations, it most probably won't affect your setup much. It certainly has nothing to do with what I understand by 'global index' (that's an index spanning multiple tables, right?). Regards Markus
On Jan 5, 2008 6:15 PM, <tomas@tuxteam.de> wrote:
> One thought I had back then, with partitioned tables was "gee -- B-tree
> index is already doing a partition; why do a manual partition on top of
> that?".

Can you please explain more on what you are trying to say here?

Thanks,
Gokul.
On Saturday 05 January 2008 14:02, Markus Schiltknecht wrote: > >> To satisfy all the different requirements of partitioning with segments > >> based partitioning, we'd have to allow a table to span multiple table > >> spaces. I'm not very keen on going that way. > > > > Why? > > Uh.. if a table (RELKIND_RELATION) can only span one table space, as it > is now, all of its segments are in the same table space. I don't quite > call that partitioning. Well, sure, you could call it so, but then, each > and every Postgres table is already partitioned in 1G segments. > > It all depends on the definitions, but in my world, horizontal > partitioning for databases involves multiple table spaces (and is quite > useless without that). Calling anything else partitioning is confusing, > IMO. I'm not following this. If we can work out a scheme, I see no reason not to allow a single table to span multiple tablespaces. Do you see a problem with that? > > > So the one thing that always scares me about these "define it all and let > > the database sort it out" methods is they seem to lead to cases where the > > system ends up rewriting the data to fit into some new partition layout. > > That holds true no matter if you shuffle between segments or relations. > To be able to let the DBA define an exact split point, the database > *will* have to shuffle tuples around. Why does that scare you? It's a > regular database system's maintenance procedure. > It's a question of control.... see below. > > One thing > > that is nice about the current partitioning scheme is you can control the > > impact of this behavior in these scenarios, by moving around small > > portions of the table at a time. > > Uh.. I'm not quite following. What "current partitioning scheme" are you > referring to? > The current, constraint exclusion / inheritance based, partitioning we have now in postgresql. > Why should that not be possible with other schemes? Moving the split > point between two partitions involves moving tuples around, no matter if > you are going to move them between segments or between relations > building the partitions. > The difference is that, if I currently have a table split by month, I can "re-partition" it into weekly segments, and only shuffle one month's data at a time to minimize impact on the system while I shuffle it. This can even be used to do dynamic management, where data from the current month is archived by day, data from the past year by week, and data beyond that done monthly. On many other databases, if you change the partition scheme, it requires exclusive locks and a shuffling of all of the data, even data whose partitions aren't being redefined. Even worse are systems like mysql, where you need to rewrite the indexes as well. To me, these requirements always seem like show stoppers; I generally can't afford to lock a table while the database rewrites a billion rows of data. > > More to the point (I think) is that people define access to the data > > based on the meaning of the data, not how it is stored on disk. For > > example, in some tables we only need to be active on 1 month's worth of > > data... how that is laid out on disk (# partitions, which tablespaces) is > > a means to the end of working actively on 1 month's worth of data. I can't > > think of many cases where people would actually say they want to work > > actively on the most recent GB of data. > > Agreed. I'd say that's why the DBA needs to be able to define the split > point between partitions: only he knows the meaning of the data.
> > >> To me, both of SVM and > >> SE look much more like an optimization for certain special cases and > >> don't have much to do with partitioning. > > > > Even if this were true, it might still be a useful optimization. > > Possibly, yes. To me, the use case seems pretty narrow, though. For > example it doesn't affect index scans much. > > > One table I > > am thinking of in particular in my system has one query we need to run > > across partitions, which ends up doing a slew of bitmap index scans for > > all the partitions. If using segment exclusion on it meant that I could > > get a global index to help that query, I'd be happy. > > As proposed, Segment Exclusion works only on exactly one table. Thus, if > you already have your data partitioned into multiple relations, it most > probably won't affect your setup much. It certainly has nothing to do > with what I understand by 'global index' (that's an index spanning > multiple tables, right?). > In a more general sense, a global index is an index that spans multiple partitions, as opposed to a local index, which is an index on specific partitions; postgresql currently supports the latter, not the former. In any case, my thinking is if we had the segment exclusion technique, I could convert that partitioned table into a regular table again, use segment exclusion to handle what is currently handled by partitions, and create a "global index" across all the other data for that other, currently killer, query. -- Robert Treat Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
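For concreteness, the incremental re-shuffle Robert describes is expressible in today's inheritance scheme. The following is only a sketch, with hypothetical table and column names:

    BEGIN;
    -- carve one week out of an existing monthly partition
    CREATE TABLE orders_2007_12_w1 (
        CHECK (order_date >= '2007-12-01' AND order_date < '2007-12-08')
    ) INHERITS (orders);
    INSERT INTO orders_2007_12_w1
        SELECT * FROM orders_2007_12
        WHERE order_date >= '2007-12-01' AND order_date < '2007-12-08';
    DELETE FROM orders_2007_12
        WHERE order_date >= '2007-12-01' AND order_date < '2007-12-08';
    COMMIT;

Only the month being redefined is locked and rewritten; all other partitions stay untouched, which is the property Robert says any segment-based scheme would need to preserve.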
On Sun, Jan 06, 2008 at 01:12:32AM +0530, Gokulakannan Somasundaram wrote: > On Jan 5, 2008 6:15 PM, <tomas@tuxteam.de> wrote: > > > > One thought I had back then, with partitioned tables was "gee -- B-tree > > index is already doing a partition; why do a manual partition on top of > > that?". > Can you please explain more on what you are trying to say here? Sure. A B-tree is just a device to partition something along some order. If you have, say, a table of orders (to use the example upthread) and a B-tree index on order date, this index partitions your set (at recursively finer levels). Of course, you'd have to "sort" your data along this index, but PostgreSQL knows how to do this trick: CLUSTER. This was just a vague idea, many things were missing (for example to separate out the more quiescent parts of the table into their own files) which are spelled out in Simon Riggs' proposal. This struck me when seeing people partition tables by hand -- and I was delighted to actually watch Simon forging a real design. Regards -- tomás
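For readers following along, the CLUSTER trick tomas alludes to might look like this, with hypothetical table and index names:

    CREATE INDEX orders_date_idx ON orders (order_date);
    CLUSTER orders_date_idx ON orders;  -- rewrites the table in index order

CLUSTER physically sorts the table along the index once, but does not keep it sorted under later inserts, which is one reason something like the read-only segment machinery is needed on top.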
On Jan 6, 2008 3:00 AM, Robert Treat <xzilla@users.sourceforge.net> wrote:
> On Saturday 05 January 2008 14:02, Markus Schiltknecht wrote:
> > >> To satisfy all the different requirements of partitioning with segments
> > >> based partitioning, we'd have to allow a table to span multiple table
> > >> spaces. I'm not very keen on going that way.
> > >
> > > Why?
> >
> > Uh.. if a table (RELKIND_RELATION) can only span one table space, as it
> > is now, all of its segments are in the same table space. I don't quite
> > call that partitioning. Well, sure, you could call it so, but then, each
> > and every Postgres table is already partitioned in 1G segments.
> >
> > It all depends on the definitions, but in my world, horizontal
> > partitioning for databases involves multiple table spaces (and is quite
> > useless without that). Calling anything else partitioning is confusing,
> > IMO.
>
> I'm not following this. If we can work out a scheme, I see no reason not to
> allow a single table to span multiple tablespaces. Do you see a problem with
> that?

If we can span two different partitions under two tablespaces, it would help concepts like parallel queries, parallel updates etc. in data warehousing. If we allow a single table to span across multiple tablespaces, the question arises how we would be able to move around different parts of the table as we like. Segments, I believe, don't let the user/DBA draw the split-points.

> In a more general sense, a global index is an index that spans multiple
> partitions, as opposed to a local index, which is an index on specific
> partitions; postgresql currently supports the latter, not the former.
>
> In any case, my thinking is if we had the segment exclusion technique, I could
> convert that partitioned table into a regular table again, use segment
> exclusion to handle what is currently handled by partitions, and create
> a "global index" across all the other data for that other, currently killer,
> query.

That's a good idea. A local index would then be equivalent to partial indexes. But single-table partition maintenance might not be as flexible as multi-table partitioning. In real partitioning, dropping a single partition should not affect the local indexes of other partitions, but I think that would be very difficult to implement under this scheme.

Thanks,
Gokul.
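Gokul's equivalence between a "local index" and a partial index can be sketched in current syntax; all names here are hypothetical:

    -- a "local" index covering only the active portion of the table:
    CREATE INDEX orders_active_idx ON orders (customer_id)
        WHERE order_date >= '2007-12-01';
    -- a "global" index under single-table SE is just an ordinary index:
    CREATE INDEX orders_customer_idx ON orders (customer_id);

As he notes, the hard part is the other direction: with one big table there is no cheap way to drop a single partition's entries out of the ordinary index.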
On Jan 6, 2008 11:27 AM, <tomas@tuxteam.de> wrote:
> On Sun, Jan 06, 2008 at 01:12:32AM +0530, Gokulakannan Somasundaram wrote:
> > > One thought I had back then, with partitioned tables was "gee -- B-tree
> > > index is already doing a partition; why do a manual partition on top of
> > > that?".
> > Can you please explain more on what you are trying to say here?
>
> Sure. A B-tree is just a device to partition something along some order.
> If you have, say, a table of orders (to use the example upthread) and a
> B-tree index on order date, this index partitions your set (at
> recursively finer levels).

But the current index scans - Index Scan and Bitmap Index Scan - don't provide the exact benefit of partitioning, even if all the columns are covered by the index. They do a lot more disk reads than the partitioning scheme. I think you are looking for something like the block indexes of Multi-dimensional Clustering in DB2. Heikki did something like that in a more subtle way.

Postgresql CLUSTER, as you may know, doesn't maintain the order with inserts. We might go for Index Organized Tables/Clustered Indexes. But then, B-tree would give a lot of performance problems if the index tuple size increases beyond a certain point.

Thanks,
Gokul.
Hi, Gokulakannan Somasundaram wrote: > > > On Jan 5, 2008 6:15 PM, <tomas@tuxteam.de> wrote: > > > One thought I had back then, with partitioned tables was "gee -- B-tree > index is already doing a partition; why do a manual partition on top of > that?". > > Can you please explain more on what you are trying to say here? I think this has to do with SE not being of much use for index scans. Or put it another way: SE is an optimization for sequential scans. For tables where it works well, it could possibly replace the index entirely. Without the index, you would rely on SE to always be able to exclude enough segments, so that the seq scan is less expensive than an index scan with the following table lookups. With an index, the planner gets a hard time deciding between the index scan and the (possibly SE optimized) seq scan. Regards Markus
Hi, Robert Treat wrote: > On Saturday 05 January 2008 14:02, Markus Schiltknecht wrote: >>>> To satisfy all the different requirements of partitioning with segments >>>> based partitioning, we'd have to allow a table to span multiple table >>>> spaces. I'm not very keen on going that way. >>> Why? >> Uh.. if a table (RELKIND_RELATION) can only span one table space, as it >> is now, all of its segments are in the same table space. I don't quite >> call that partitioning. Well, sure, you could call it so, but then, each >> and every Postgres table is already partitioned in 1G segments. >> >> It all depends on the definitions, but in my world, horizontal >> partitioning for databases involves multiple table spaces (and is quite >> useless without that). Calling anything else partitioning is confusing, >> IMO. > > I'm not following this. If we can work out a scheme, I see no reason not to > allow a single table to span multiple tablespaces. Do you see a problem with > that? Uhm... well, no. I was just pointing out that it's a requirement. It depends on how you define things, but I'm seeing it that way: table -- 1:n -- partition -- 1:1 -- table space -- 1:n -- segments What I'm advocating is making partitions available to the DBA as some kind of relation, which she can query separately and move around between table spaces. >> Why should that not be possible with other schemes? Moving the split >> point between two partitions involves moving tuples around, no matter if >> you are going to move them between segments or between relations >> building the partitions. > > The difference is that, if I currently have a table split by month, I > can "re-partition" it into weekly segments, and only shuffle one month's data > at a time to minimize impact on the system while I shuffle it. This can even be > used to do dynamic management, where data from the current month is archived > by day, data from the past year by week, and data beyond that done monthly. This should be possible for both schemes, I see no connection to what we've discussed. SE doesn't magically give you this level of control you are requesting here. Quite the opposite: referring to CLUSTERing makes me wonder if that's not going to shuffle way too many tuples around. What I'm saying is, that SE doesn't partition the segments into different table spaces. Thus I don't consider it "database partitioning" in the first place. As I currently understand it, it's: table -- 1:1 -- table space -- 1:n -- partitions -- 1:n -- segments > On many other databases, if you change the partition scheme, it requires > exclusive locks and a shuffling of all of the data, even data whose > partitions aren't being redefined. Even worse are systems like mysql, where > you need to rewrite the indexes as well. To me, these requirements always > seem like show stoppers; I generally can't afford to lock a table while the > database rewrites a billion rows of data. I fully agree here. How do you plan to solve that problem on top of SE? > In a more general sense, a global index is an index that spans multiple > partitions, as opposed to a local index, which is an index on specific > partitions; postgresql currently supports the latter, not the former. > > In any case, my thinking is if we had the segment exclusion technique, I could > convert that partitioned table into a regular table again, ... on a single table space ... > use segment > exclusion to handle what is currently handled by partitions, ... except, that there is no partitioning (!?!)
(between table spaces) > and create > a "global index" across all the other data for that other, currently killer, > query. I thought the table you are referring to is bigger than your fastest table space? That would even make it impossible. See where I'm coming from? And why I'm stating that SE is an optimization (for seq scans), but not partitioning? Regards Markus
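With relation-level partitions (today's inheritance children), the control Markus is asking for is plain DDL; a sketch with hypothetical names:

    CREATE TABLESPACE slow_disks LOCATION '/mnt/archive/pgdata';
    ALTER TABLE orders_2005 SET TABLESPACE slow_disks;

Under segment-based partitioning there is no named object to put in the ALTER, which is exactly his objection.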
On Jan 6, 2008 4:09 PM, Markus Schiltknecht <markus@bluegap.ch> wrote:
> I think this has to do with SE not being of much use for index scans. Or
> put it another way: SE is an optimization for sequential scans. For
> tables where it works well, it could possibly replace the index entirely.
>
> Without the index, you would rely on SE to always be able to exclude
> enough segments, so that the seq scan is less expensive than an index
> scan with the following table lookups.
>
> With an index, the planner gets a hard time deciding between the index
> scan and the (possibly SE optimized) seq scan.

That's a good point. But I think Simon is planning not to give the job to the planner, but to the executor. So SE optimization will come into play only when the planner has decided on a sequential scan.
On Sunday 06 January 2008 05:48, Markus Schiltknecht wrote: > What I'm saying is, that SE doesn't partition the segments into > different table spaces. Thus I don't consider it "database partitioning" > in the first place. As I currently understand it, it's: > > table -- 1:1 -- table space -- 1:n -- partitions -- 1:n -- segments > Ah, that was a slight misunderstanding of terminology. I see what you're getting at. > > On many other databases, if you change the partition scheme, it requires > > exclusive locks and a shuffling of all of the data, even data whose > > partitions aren't being redefined. Even worse are systems like mysql, > > where you need to rewrite the indexes as well. To me, these requirements > > always seem like show stoppers; I generally can't afford to lock a table > > while the database rewrites a billion rows of data. > > I fully agree here. How do you plan to solve that problem on top of SE? > > > In a more general sense, a global index is an index that spans multiple > > partitions, as opposed to a local index, which is an index on specific > > partitions; postgresql currently supports the latter, not the former. > > > > In any case, my thinking is if we had the segment exclusion technique, I > > could convert that partitioned table into a regular table again, > > .... on a single table space ... > > > use segment > > exclusion to handle what is currently handled by partitions, > > .... except, that there is no partitioning (!?!) (between table spaces) > > > and create > > a "global index" across all the other data for that other, currently > > killer, query. > > I thought the table you are referring to is bigger than your fastest > table space? That would even make it impossible. > Nope, different table :-) Although the above global/local one would probably live entirely on the slow disks, using SE and global index to satisfy the queries. And really our slow disks aren't exactly slow, but do have poor concurrency, so we put primarily read data on them to keep writes to a minimum. > See where I'm coming from? And why I'm stating that SE is an > optimization (for seq scans), but not partitioning? > Yes, but aiui, SE should allow seq scans to achieve performance similar to partitioning, especially if the planner can exclude segments based on values in the data. -- Robert Treat Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
On Wed, 2008-01-02 at 17:56 +0000, Simon Riggs wrote: > Like it? Very cool :-) One additional thought: what about a kind of "segment fill factor"? Meaning: each segment has some free space reserved for future updates/inserts of records in the same range of its partitioning constraint. And when inserting/updating you put the new record into the corresponding segment... just like a very coarse clustering. Then you could vacuum the segments separately to keep the free space from running out. For active segments you would then fix the partitioning constraint range once the fill factor is reached, to allow for keeping its constraint even when heavily updating (heavily vacuuming it too as response to that), and create a new segment for the unbounded range for new inserts... this would work fine for tables where the constraint is based on ever increasing keys and accidental inserts in old ranges (which do happen occasionally in real life). When the change rate of old segments goes down, the segments could be reorganized to have a smaller fill factor, so that you still allow for accidental updates but keep space usage efficient. This would be a similar action to clustering, but hopefully not blocking (which might be a hard thing to do)... and later again you could mark some of the really old things as read only and put them in special segments with no wasted space. One problem would be when the segment's free space runs out, so you must put records from the same constraint range in multiple segments - but that could still work, you just would have multiple segments covering the same range, but if the "segment fill factor" is chosen properly it should not be the case... you could normally maintain a set of non-overlapping segments in terms of the partitioning constraint. Cheers, Csaba.
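For comparison, the closest existing knob is the per-table fillfactor (new in 8.2), which reserves space within every page rather than per segment; Csaba's idea would need new machinery at segment granularity. A hypothetical example:

    ALTER TABLE orders SET (fillfactor = 70);  -- keep 30% of each page free for updates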
Hi Csaba, Csaba Nagy wrote: > One additional thought: what about a kind of "segment fill factor" ? > Meaning: each segment has some free space reserved for future > updates/inserts of records in the same range of it's partitioning > constraint. And when inserting/updating you put the new record into the > corresponding segment... just like a very coarse clustering. Hm.. yeah. That way, a few writes to a "read optimized" segment could be accepted, without having to drop the optimization immediately. And the other way around: generally prevent having to drop the optimization by forcing tuples to be written to a segment with matching min/max tuples. Although, that's not exactly trivial, I think. However, for tables which don't fit the use case of SE, people certainly don't want such a fill factor to bloat their tables. Regards Markus
On Mon, 2008-01-07 at 13:59 +0100, Markus Schiltknecht wrote: > However, for tables which don't fit the use case of SE, people certainly > don't want such a fill factor to bloat their tables. Sure, but it could be configurable and should only be enabled if the table is marked as partitioned on some condition... I think it would be a bad idea anyway if the DB would start partitioning on some arbitrary criteria based on analyzing the table, so the DBA should be the one to decide on what criteria to partition. In particular it could be a bad idea on occasions to partition on the clustering criteria for tables which were clustered once. Cheers, Csaba.
On Mon, 2008-01-07 at 14:20 +0100, Markus Schiltknecht wrote: > Why is that? AFAIUI, Segment Exclusion combines perfectly well with > clustering. Or even better, with an upcoming feature to maintain > clustered ordering. Where do you see disadvantages of such an optimization > for sequential scans? Well, as I understood it, this would be some kind of special case of clustering, where the cluster key is expected to be ever increasing in time and new rows would not be randomly distributed over the complete possible range. In theory you could also have each segment in turn be clustered on some other criteria than the partitioning criteria, so indexed access could also be better on the main selection criteria, which could be different than the partitioning criteria. All this is of course just speculation - but I guess that's what you expected too in this discussion :-) Cheers, Csaba.
Hi, Csaba Nagy wrote: > Sure, but it could be configurable and should only be enabled if the > table is marked as partitioned on some condition... As I'm regarding SE as an optimization, I disagree here.. Like all optimizations, SE should conceptually be reasonably close to cost-less when unneeded. > I think it would be > a bad idea anyway if the DB would start partitioning on some arbitrary > criteria based on analyzing the table, so the DBA should be the one to > decide on what criteria to partition. I absolutely agree for real partitioning, which targets manageability of table partitions. > In particular it could be a bad > idea on occasions to partition on the clustering criteria for tables > which were clustered once. Why is that? AFAIUI, Segment Exclusion combines perfectly well with clustering. Or even better, with an upcoming feature to maintain clustered ordering. Where do you see disadvantages of such an optimization for sequential scans? Regards Markus
On Sat, Jan 05, 2008 at 08:02:41PM +0100, Markus Schiltknecht wrote: > Well, management of relations is easy enough, known to the DBA and most > importantly: it already exists. Having to set up something which is > *not* tied to a relation complicates things just because it's an > additional concept. But we're already dealing with some complicated concepts. There isn't anything that will prevent current-style partitioning strategies from continuing to work in the face of Simon's proposal. But let me see if I can outline the sort of cases where I see real value in what he's outlined. There is a tendency in data systems to gather all manner of data that, in retrospect, _might_ turn out to be valuable; but which, at the time, is not really valuable at all. Moreover, the value later on might be relatively low: if you can learn something much later from that data, and do so easily, then it will be worth doing. But if the work involved passes some threshold (say 1/2 a day), it's suddenly not worth it any more. It's simple economics: below a certain cost, the data is valuable. Above a certain cost, you simply shouldn't keep the data in the first place, because the cost of using it is higher than any value you'll likely be able to extract. Simon's proposal changes the calculations you have to do. If keeping some data online longer does not impose administrative or operational overhead (you have it marked read only, so there's no I/O for vacuum; you don't need to do anything to get the data marked read only; &c.), then all it costs is a little more disk, which is relatively cheap these days. More importantly, if the longer-term effect of this strategy is to make it possible to move such data offline _without imposing a big cost_ when moving it back online, then the value is potentially very high. Without even trying, I can think of a dozen examples in the past 5 years where I could have used that sort of functionality. Because the cost of data retrieval was high enough, we had to decide that the question wasn't worth answering. Some of those answers might have been quite valuable indeed to the Internet community, to be frank; but because I had to pay the cost without getting much direct benefit, it just wasn't worth the effort. The thing about Simon's proposal that is beguiling is that it is aimed at a very common use pattern. The potential for automatic management under such a use pattern makes it seem to me to be worth exploring in some detail. > Agreed. I'd say that's why the DBA needs to be able to define the split > point between partitions: only he knows the meaning of the data. I think this is only partly true. A casual glance at the -general list will reveal all manner of false assumptions on the parts of administrators about how their data is structured. My experience is that, given that the computer has way more information about the data than I do, it is more likely to make the right choice. To the extent it doesn't do so, that's a problem in the planning (or whatever) algorithms, and it ought to be fixed there. A
Hi, Andrew Sullivan wrote: > On Sat, Jan 05, 2008 at 08:02:41PM +0100, Markus Schiltknecht wrote: >> Well, management of relations is easy enough, known to the DBA and most >> importantly: it already exists. Having to set up something which is >> *not* tied to a relation complicates things just because it's an >> additional concept. > > But we're already dealing with some complicated concepts. Possibly, yes, but that's by far no reason to introduce even more complicated concepts... Does anything speak against letting the DBA handle partitions as relations? > There isn't anything that will prevent current-style partitioning strategies > from continuing to work in the face of Simon's proposal. Agreed. Nor will Simon's proposal completely replace them. > Without even trying, I can think of a dozen examples in the past 5 years > where I could have used that sort of functionality. Because the cost of > data retrieval was high enough, we had to decide that the question wasn't > worth answering. Some of those answers might have been quite valuable > indeed to the Internet community, to be frank; but because I had to pay the > cost without getting much direct benefit, it just wasn't worth the effort. Sure, there's value in Simon's proposal. But it has pretty strict requirements. IMO, it's pretty hard to say, if it would have helped at all for your cases. Any of them still available to check? Remember the requirements: no single tuple in the segment may be significantly out of the average bounds. Otherwise, the min/max gets pretty useless and the segment can never be excluded. As said before, combination with CLUSTERing might help, yes. But if you need to maintain CLUSTERed ordering, aren't there better ways? For example, you could use binary searching on the relation directly, much like with indices, instead of sequentially scanning on the CLUSTERed relation. That would even give us some sort of "indices with visibility". >> Agreed. I'd say that's why the DBA needs to be able to define the split >> point between partitions: only he knows the meaning of the data. > > I think this is only partly true. A casual glance at the -general list will > reveal all manner of false assumptions on the parts of administrators about > how their data is structured. My experience is that, given that the > computer has way more information about the data than I do, it is more > likely to make the right choice. To the extent it doesn't do so, that's a > problem in the planning (or whatever) algorithms, and it ought to be fixed > there. Well, Postgres doesn't automatically create indices, for a counter example. With regard to partitioning over multiple table spaces, I think the DBA definitely has more information available, than the computer. A DBA (hopefully) knows future plans and emergency strategies for the storage system, for example. Lacking such information, the database system will have a pretty hard time taking a good decision on how to partition between table spaces, IMO. Regards Markus
On Mon, Jan 07, 2008 at 07:16:35PM +0100, Markus Schiltknecht wrote: > > Does anything speak against letting the DBA handle partitions as relations? Yes: it doesn't solve the problem I have, which is that I don't want to have to manage a whole bunch of tables. I want one table, and I want to be able to say, "That section is closed". > Sure, there's value in Simon's proposal. But it has pretty strict > requirements. IMO, it's pretty hard to say, if it would have helped at > all for your cases. Any of them still available to check? No, but one of your worries doesn't bother me: > Remember the requirements: no single tuple in the segment may be > significantly out of the average bounds. Otherwise, the min/max gets > pretty useless and the segment can never be excluded. The segment can never be excluded in a search on that key, yes. But consider the likely cases we're looking at:

    WHERE some_date >= '1999-01-01' AND some_date < '2001-01-01';
    WHERE sequence_field BETWEEN 3000 AND 300000;

&c. These are the two obvious cases: you're searching for data in a given date range or for primary (sometimes artificial) identifiers in a range, and the source data increases (almost) monotonically. You have to do this now anyway, because there's _some_ basis on which you're partitioning your data; but today, you do this with a lot of fooling around with views and nasty triggers that push data into the "right" table, assuming someone doesn't screw it up. > need to maintain CLUSTERed ordering, aren't there better ways? For > example, you could use binary searching on the relation directly, much > like with indices, instead of sequentially scanning on the CLUSTERed > relation. That would even give us some sort of "indices with visibility". I think this is a nice idea too :) > Well, Postgres doesn't automatically create indices, for a counter example. Yes, and it has no data-use analyser tools that automatically suggest indexes, either. That's the sort of thing people coming from other (err, "Other" ;-) products complain about, in fact. > definitely has more information available, than the computer. A DBA > (hopefully) knows future plans and emergency strategies for the storage > system, for example. Perhaps my jaundice comes from too much time spent in operational trenches, but while good DBAs have some ideas about that, large numbers of them are harried and overwhelmed just by the piles of work they already have. Nevertheless, while what you say is true, I'm not sure what it has to do with the present case. I don't think the current proposal is to address partitioning across table spaces. It's to do with the way certain segments of a table are interpreted by the system. It's undoubtedly true that this strategy is of questionable utility for many kinds of use of PostgreSQL. But it seems to offer very significant advantages for one use-pattern that is very common. That said, I am not trying to argue it should be adopted without poking at its weaknesses. I just think it unfair to ask the proposal to address problems it's not really aimed at. A
Andrew Sullivan wrote: > On Mon, Jan 07, 2008 at 07:16:35PM +0100, Markus Schiltknecht wrote: >> ...the requirements: no single tuple in the segment may be >> significantly out of the average bounds. Otherwise, the min/max gets >> pretty useless and the segment can never be excluded. > > The segment can never be excluded in a search on that key, yes. But > consider the likely cases we're looking at: > ...you're searching for data in a given > date range or for primary (sometimes artificial) identifiers in a range, > and the source data increases (almost) monotonically. It seems to me the idea discussed elsewhere in the thread[1] about configurable segment sizes would make this limitation much less problematic for some types of data. Some of my biggest tables are clustered by zip-code, and are insert-mostly. Common queries are "where state_provence='TX'" or "where city='Dallas'". While I doubt I have enough data to fill a 1GB segment for any but the largest cities, I certainly have runs of many consecutive blocks - since clustering by zip tends to cluster cities as well. Even though the table's not monotonically increasing or decreasing, like values for cities and states are clustered together. Is my understanding right that these Segment Visibility Maps could help this case as well? [1] http://archives.postgresql.org/pgsql-hackers/2008-01/msg00065.php
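A hand-rolled illustration of the min/max bookkeeping may make the zip-code case concrete; this is not the proposed implementation, just a sketch with hypothetical names:

    CREATE TABLE seg_minmax (
        segno     int PRIMARY KEY,
        min_state text NOT NULL,
        max_state text NOT NULL
    );
    -- segments that could contain Texas rows; all others are excluded
    SELECT segno FROM seg_minmax
    WHERE 'TX' BETWEEN min_state AND max_state;

Because clustering by zip keeps like states in consecutive blocks, most segments would fail the BETWEEN test and be skipped, even though the table as a whole is not monotonic.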
Hi, Andrew Sullivan wrote: > On Mon, Jan 07, 2008 at 07:16:35PM +0100, Markus Schiltknecht wrote: >> Does anything speak against letting the DBA handle partitions as relations? > > Yes: it doesn't solve the problem I have, which is that I don't want to have > to manage a whole bunch of tables. I want one table, and I want to be able > to say, "That section is closed". That's a fair requirement. Wanting to manage partitions manually is another fair requirement, IMO. They can well coexist. >> Remember the requirements: no single tuple in the segment may be >> significantly out of the average bounds. Otherwise, the min/max gets >> pretty useless and the segment can never be excluded. > > The segment can never be excluded in a search on that key, yes. But > consider the likely cases we're looking at: Uh, which key are you talking about? AFAIU Simon's proposal, he suggests maintaining min/max values for all columns of the table. > WHERE some_date >= '1999-01-01' AND some_date < '2001-01-01'; Yeah, and if only *one* tuple in the 1G segment has: some_date <= '1998-12-31' OR some_date >= '2001-01-01' Segment Exclusion can't exclude that segment. That's all I'm saying. > That said, I am not trying to argue it should be adopted without poking at > its weaknesses. I just think it unfair to ask the proposal to address > problems it's not really aimed at. Huh? I'm certainly not the one asking for it. Quite the opposite, I'm warning against over-estimating the use of SE. In his proposal, Simon was explicitly comparing to "declarative partitioning", pointing out lots of advantages and implicitly stating that SE could cover most, if not all, needs of what's commonly understood by partitioning. That's where I disagree. But certainly, SE and SVM have some merit for their use case. And I'm looking forward to test patches ;-) Regards Markus
"Andrew Sullivan" <ajs@crankycanuck.ca> writes: > On Mon, Jan 07, 2008 at 07:16:35PM +0100, Markus Schiltknecht wrote: >> >> Does anything speak against letting the DBA handle partitions as relations? > > Yes: it doesn't solve the problem I have, which is that I don't want to have > to manage a whole bunch of tables. I want one table, and I want to be able > to say, "That section is closed". That's not your problem, that's the solution you're looking for. You're assuming a particular solution in your problem statement. I posit that the whole point of partitioning is to create an object which serves to represent the semantically meaningful chunks of data. The reason this is useful is precisely because it serves as a short-hand for the DBA describe the data and how it will be used. I think Simon's proposal loses the very feature that makes partitioning useful. The DBA doesn't have a "thing" to describe, he has to define what parts of the table he's describing for every operation. And if you define a whole new object to name these "things" I think you'll end up with something that looks a lot like tables. I also don't understand how this proposal deals with the more common use case of unloading and loading data. Normally in partitioned tables we build the data in a side table until the data is all correct then load it as a partition. If you treat it as a lower-level object then I don't see that working. The layout of the new table won't often match the layout of the target partitioned table. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!
Gregory Stark wrote: > I also don't understand how this proposal deals with the more common use case > of unloading and loading data. Normally in partitioned tables we build the > data in a side table until the data is all correct then load it as a > partition. If you treat it as a lower-level object then I don't see that > working. The layout of the new table won't often match the layout of the > target partitioned table. > > > +1 In addition, a similar use case is archival of "old" partitions as they are no longer (commonly) accessed - perhaps to a different tablespace (or even to backup media). I don't see how dynamic in-table partitioning handles this, and I think it would be highly desirable to be able to do these things! best wishes Mark
On Tue, Jan 08, 2008 at 01:08:52AM +0100, Markus Schiltknecht wrote: > > Uh, which key are you talking about? AFAIU Simon's proposal, he suggests > maintaining min/max values for all columns of the table. Right, but I think that's just because that approach is automatable. Only some use cases are going to be appropriate to this. > Yeah, and if only *one* tuple in the 1G segment has: > > some_date <= '1998-12-31' OR some_date >= '2001-01-01' > > Segment Exclusion can't exclude that segment. That's all I'm saying. Correct. > Huh? I'm certainly not the one asking for it. Quite the opposite, I'm > warning against over-estimating the use of SE. Right; I think one should be clear that there are many -- maybe most -- uses of PostgreSQL where the proposal will be of no use. I just think we need to be clear that for the areas where it _can_ be useful, it could be very useful indeed. A
On Tue, Jan 08, 2008 at 02:12:28AM +0000, Gregory Stark wrote: > > Yes: it doesn't solve the problem I have, which is that I don't want to > > have to manage a whole bunch of tables. I want one table, and I want to > > be able to say, "That section is closed". > > That's not your problem, that's the solution you're looking for. You're > assuming a particular solution in your problem statement. Probably in that one, yes. I'm still waiting for permission to post my original problem statement; I suspect it's not going to be forthcoming by next Monday, so it's not going to happen. But I did outline something like what I'm talking about elsewhere in this thread. For my case, I'm thinking of the sort of data that builds up over time, and most of which happens probably not to be useful at any moment, but all of which _might_ be useful over the long haul. So what I wanted, originally, was to be able to set arbitrary ranges of tuples to be read-only, and to be able to set them offline if I wanted. Pseudo-DDL:

    ALTER TABLE foo SET read_only='t' WHERE created_on < '2007-01-01';
    ALTER TABLE foo SET tuple_offline='t' WHERE created_on < '2006-01-01';

Now, the second segment is marked "offline". If I query the table for things in that range, I get an ERROR telling me there might be data in the range, but it's not mounted at the moment. If I try to update records in the read-only (first) range, I get an error telling me the tuple is marked read only. The idea then is that these older tuples can be put off into long-term storage (wave hands here about the management of that stuff), and this keeps my online system compact but yet allows me, for just the cost of mounting a backup tape and reading the segments back, to go back and query those old ranges. The case I was particularly aiming at originally was for a case of data that cannot cost more than fractions of pennies to store, but that might represent a hugely expensive liability if the answer is not always right. Driving down that storage cost was mostly what I was aiming at, but people gradually convinced me that slightly more generic implementations might be useful. Simon's proposal came along, and it seems to me to be something like the generic implementation that others already convinced me was needed. > I think Simon's proposal loses the very feature that makes partitioning > useful. The DBA doesn't have a "thing" to describe, he has to define what > parts of the table he's describing for every operation. And if you define a > whole new object to name these "things" I think you'll end up with something > that looks a lot like tables. I don't see how that's the case at all. In fact, I have the feeling it's the opposite, so perhaps I've misunderstood something. A
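Andrew's DDL above is hypothetical; a rough emulation of just the read-only half is possible today with a trigger (sketch only; foo and created_on are the same hypothetical names, with created_on assumed to be a date or timestamp column):

    CREATE FUNCTION forbid_old_writes() RETURNS trigger AS $$
    BEGIN
        IF OLD.created_on < '2007-01-01' THEN
            RAISE EXCEPTION 'tuple is in a read-only range';
        END IF;
        IF TG_OP = 'DELETE' THEN
            RETURN OLD;
        END IF;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER foo_read_only
        BEFORE UPDATE OR DELETE ON foo
        FOR EACH ROW EXECUTE PROCEDURE forbid_old_writes();

This gives the error behaviour but none of the storage benefits (no vacuum savings, no offline segments), which is what the SVM proposal would add.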
On Wed, Jan 02, 2008 at 05:56:14PM +0000, Simon Riggs wrote: > This technique would be useful for any table with historical data keyed > by date or timestamp. It would also be useful for data where a > time-of-insert component is implicit, such as many major entity tables > where the object ids are assigned by a sequence. e.g. an Orders table > with an OrderId as PK. Once all orders placed in a period have been > shipped/resolved/closed then the segments will be marked read-only. > > Its not really going to change the code path much for small tables, yet > seems likely to work reasonably well for large tables of all shapes and > sizes. If a segment is being updated, we leave it alone, and maybe never > actually set the visibility map at all. So overall, this idea seems to > cover the main use case well, yet with only minimal impact on the > existing use cases we support. > > > As before, I will maintain this proposal on the PG developer Wiki, once > we get down to detailed design. > > > > Like it? Simon, A novel approach to the problem. For me, this proposal addresses some of the other problems in postgres (visibility in the heap vs. index) rather than the problems with partitioning itself. It might seem otherwise, but for me partitioning is a tool for managing large volumes of data. It allows the user to encode what they know about the nature of their data, and that's often about management. In this way, your proposal actually makes partitioning too smart: the user wants to tell the system how to organise the data, as opposed to the other way around. At Greenplum, we've been discussing this in depth. Interestingly, we also felt that the storage layer could be much smarter with large tables with predictable layouts and predictable data patterns. But the thing is, people with large tables like partitioning, putting different partitions on different storage; they like the ability to merge and split partitions; they like truncating old partitions. In a word, they seem to like the manageability partitions give them -- as well as the performance that comes with this. To this end, we (well, Jeff Cohen) looked at the syntax and semantics of partitioning in leading databases (Oracle, Informix, DB2) and came up with a highly expressive grammar which takes the best of each, I think (I'll post details on the grammar in a separate thread). The idea is that range (for example, a date range), list (a list of distinct values) and hash partitioning be supported on multiple columns. Partitions can be added, merged, truncated. Partitions can reside on different tablespaces. The existing issues with the rewriter, COPY support and the like go away by smartening up the backend. To explore the grammar and semantics Jeff and I (to a small extent) have implemented the parsing of such a grammar in the backend. At the moment, this just results in the generation of a bunch of rules (i.e., it produces the current behaviour, as if someone had written rules themselves) and debugging output. The fully fledged approach will see partitioning rules stored in a new system catalog which can be interrogated by the planner or COPY to determine which partition a given tuple belongs in or which partition(s) a WHERE clause matches. Yes, this is the traditional approach but I think it still has merit. We shall continue working on this approach because it is what our customers have asked for. We would also like to see it get into postgres too. Thanks, Gavin
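The rules Gavin says their prototype currently generates correspond to what DBAs write by hand today; one hypothetical example of such an insert-redirection rule:

    CREATE RULE orders_insert_2008_01 AS
        ON INSERT TO orders
        WHERE (NEW.order_date >= '2008-01-01' AND NEW.order_date < '2008-02-01')
        DO INSTEAD
        INSERT INTO orders_2008_01 VALUES (NEW.*);

The catalog-driven design he outlines would replace an open-ended pile of such rules with a single lookup at plan or COPY time.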
On Sat, 2008-01-05 at 16:30 -0500, Robert Treat wrote: > I'm not following this. If we can work out a scheme, I see no reason not to > allow a single table to span multiple tablespaces. That seems to be something we might want anyway, so yes. > The difference is that, if I currently have a table split by month, I > can "re-partition" it into weekly segments, and only shuffle one months data > at a time minimize impact on the system while I shuffle it. This can even be > used to do dynamic management, where data from the current month is archived > by day, data from the past year by week, and data beyond that done monthly. Understood > On many other databases, if you change the partition scheme, it requires > exclusive locks and a shuffleing of all of the data, even data whose > partitions arent being redefined. Even worse are systems like mysql, where > you need to rewrite the indexes as well. To me, these requirements always > seem like show stoppers; I generally can't afford to lock a table while the > database rewrites a billion rows of data. Agreed > In any case, my thinking is if we had the segment exclusion technique, I could > convert that partitioned table into a regular table again, use segment > exclusion to handle what is currently handled by partitions, and create > a "global index" across all the other data for that other, currently killer, > query. Yes, that's what I have in mind. Can I ask that you produce a "gap analysis" between what you have now and what you would have in the future, so we can see what omissions or errors there are in the segex proposal? If we had indexes that spanned partitions, would we find that some of the queries that were producing seq scans will now produce better join and index plans, do you think? -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
On Mon, 2008-01-07 at 12:14 +0100, Csaba Nagy wrote: > On Wed, 2008-01-02 at 17:56 +0000, Simon Riggs wrote: > > Like it? > > Very cool :-) Thanks. As ever, a distillation of various thoughts, not all mine. > One additional thought: what about a kind of "segment fill factor"? > Meaning: each segment has some free space reserved for future > updates/inserts of records in the same range of its partitioning > constraint. And when inserting/updating you put the new record into the > corresponding segment... just like a very coarse clustering. Then you > could vacuum the segments separately to keep the free space from running > out. For active segments you would then fix the partitioning constraint > range once the fill factor is reached, to allow for keeping its > constraint even when heavily updating (heavily vacuuming it too as > response to that), and create a new segment for the unbounded range for > new inserts... this would work fine for tables where the constraint is > based on ever increasing keys and accidental inserts in old ranges > (which do happen occasionally in real life). Lots of thoughts there, so I'll try to separate them out and analyse. The way I originally described it is a very simple mechanism and we could tweak that some more. All ideas welcome. If we had dynamic segment constraints when the segment was not yet read only, that would lead to changes in the segment constraint for each INSERT when we have increasing keys. It seems better to set the constraints once only, when the segment was full, then prevent further INSERTs. The accidental inserts in old ranges seem like something that can be avoided if it is a real-world problem. For UPDATEs on a segment with constraints we might choose to apply the constraints to see what to do. You might want to allow the UPDATE and have it stretch the constraints outwards or you might want to prevent it and throw an error, or you might want to allow the UPDATE, yet migrate the new tuple to the appropriate segment. Dynamic partitioning works in multiple dimensions, so there isn't just one single valid location for any row. i.e. if we update a row with OrderId, OrderDate, RequiredByDate, ShipDate and LastModifiedDate on it, we'll probably expand the constraints on at least one of those. If we were lucky enough to have only changed one of those it might turn out there was *in that case* a single more sensible location for the new tuple, but that probably isn't the common case. So the likely best behaviour for UPDATEs is to try to keep the new row version in the same segment, then change the constraints. The segment constraints concept and the read only concept were linked. You're right we could separate them, though that turns out not to be that desirable. When we do a DELETE or an UPDATE we don't know whether the deleted row version was the last tuple with that particular boundary value. So we don't know whether the DELETE or UPDATE changes the constraints or not, and however we try to avoid it, we'll probably need to recalc the constraints in some circumstance. So updating the constraints dynamically isn't anywhere near as easy as it sounds and ultimately probably isn't worth the effort. So that's why constraints and read-only go together. HOT allows us to keep the new row version within the segment, in many cases.
What might be worth doing is marking the FSM space for that segment as update-only, to exclude inserts from using it, then forcing UPDATEs to stay within the segment if possible by providing the current block number in each call to the FSM. That would also mean that every UPDATE that wants to do a multi-block update on a currently read-only segment would need to call the FSM.

Sounds good, but that would only work for the first UPDATE on a segment after it is marked read only, which isn't much use, or we would do it for *every* block-spanning UPDATE, which would cause contention in other use cases.

So although I'm willing to listen and tweak, that hasn't yet resulted in any additional design points, unless I missed something above?

> When the change rate of old segments goes down, the segments could be
> reorganized to have a smaller fill factor, so that you still allow for
> accidental updates but keep space usage efficient. This would be a
> similar action to clustering, but hopefully not blocking (which might
> be a hard thing to do)... and later again you could mark some of the
> really old things as read only and put them in special segments with no
> wasted space.

Yes, hopefully read-only marking and compressed segments are a possible future.

> One problem would be when the segment's free space runs out, so you must
> put records from the same constraint range in multiple segments - but
> that could still work, you just would have multiple segments covering
> the same range, but if the "segment fill factor" is chosen properly it
> should not be the case... you could normally maintain a set of
> non-overlapping segments in terms of the partitioning constraint.

Thanks very much for your detailed thoughts on the proposal.

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
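A minimal C sketch of the "stretch the constraints outwards" option for UPDATEs discussed above. Everything here except Datum (which postgres.h provides) is an assumption for illustration: the SegmentConstraint struct and the comparator hook do not exist in any patch.

    #include "postgres.h"       /* Datum */

    /* Hypothetical per-segment, per-column implicit constraint. */
    typedef struct SegmentConstraint
    {
        bool    valid;          /* min/max recorded for this column yet? */
        Datum   minval;
        Datum   maxval;
    } SegmentConstraint;

    /*
     * Widen [minval, maxval] so it covers the new row version's column
     * value, keeping the tuple in its current segment.  Returns true if
     * the constraint changed, in which case it must be rewritten and a
     * relcache invalidation issued so other backends re-read the map.
     */
    static bool
    stretch_constraint(SegmentConstraint *c, Datum newval,
                       int (*cmp) (Datum, Datum))   /* hypothetical comparator */
    {
        if (!c->valid)
        {
            c->minval = c->maxval = newval;
            c->valid = true;
            return true;
        }
        if (cmp(newval, c->minval) < 0)
        {
            c->minval = newval;
            return true;
        }
        if (cmp(newval, c->maxval) > 0)
        {
            c->maxval = newval;
            return true;
        }
        return false;           /* value already inside the segment's range */
    }

Note that DELETE has no cheap counterpart here: shrinking the range would require proving that no other tuple in the segment holds the boundary value, which is exactly the recalculation problem described above, and why the proposal ties implicit constraints to read-only segments.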
On Sun, 2008-01-06 at 11:39 +0100, Markus Schiltknecht wrote:

> I think this has to do with SE not being of much use for index scans.

Hmmm. I think it fits rather neatly with BitmapIndexScans. It would be easy to apply the index condition and/or filters to see which segments are excluded and then turn off bits in the bitmap appropriately.

Not fully sure about IndexScans yet. I don't think it would be worth trying to apply SE until we estimated we would return, say, 100 rows. It needs to be able to work without slowing down the common path.

> Or
> put it another way: SE is an optimization for sequential scans. For
> tables where it works well, it could possibly replace the index entirely.

True.

> Without the index, you would rely on SE to always be able to exclude
> enough segments, so that the seq scan is less expensive than an index
> scan with the following table lookups.

It would have to be a very fat index scan for so large a table...

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
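A hedged C sketch of the BitmapIndexScan idea above: knock excluded segments' blocks out of the bitmap before the heap scan. RELSEG_SIZE is the real md.c segment size (131072 8K blocks, i.e. 1 GB); table_nblocks(), segment_excluded_by_quals() and bitmap_clear_block() are invented stand-ins, not the actual tidbitmap API.

    #include "postgres.h"
    #include "storage/block.h"

    /* Hypothetical helpers, standing in for real executor machinery. */
    extern BlockNumber table_nblocks(void *rel);
    extern bool segment_excluded_by_quals(void *rel, int segno, void *quals);
    extern void bitmap_clear_block(void *tbm, BlockNumber blk);

    static void
    exclude_segments_from_bitmap(void *tbm, void *rel, void *quals)
    {
        BlockNumber nblocks = table_nblocks(rel);
        BlockNumber seg_start;

        for (seg_start = 0; seg_start < nblocks; seg_start += RELSEG_SIZE)
        {
            int         segno = seg_start / RELSEG_SIZE;
            BlockNumber seg_end;
            BlockNumber blk;

            /* Keep the bits if the segment's min/max might satisfy the quals. */
            if (!segment_excluded_by_quals(rel, segno, quals))
                continue;

            /* Whole segment provably holds no matches: turn its bits off. */
            seg_end = Min(seg_start + RELSEG_SIZE, nblocks);
            for (blk = seg_start; blk < seg_end; blk++)
                bitmap_clear_block(tbm, blk);
        }
    }

Because the check is per segment rather than per block, the extra cost is a handful of comparisons per gigabyte of table, which is why it need not slow down the common path.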
On Mon, 2008-01-07 at 14:20 +0100, Markus Schiltknecht wrote:

> AFAIUI, Segment Exclusion combines perfectly well with
> clustering.

Yes, it seems like it would be possible to have a segment-aware CLUSTER, so that it was actually usable on large tables. Not planning that initially though.

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
Simon Riggs wrote:
> Hmmm. I think it fits rather neatly with BitmapIndexScans. It would be
> easy to apply the index condition and/or filters to see which segments
> are excluded and then turn off bits in the bitmap appropriately.

Yeah, good point.

> Not fully sure about IndexScans yet. I don't think it would be worth
> trying to apply SE until we estimated we would return, say, 100 rows. It
> needs to be able to work without slowing down the common path.

Yup.

>> Or
>> put it another way: SE is an optimization for sequential scans. For
>> tables where it works well, it could possibly replace the index entirely.
>
> True.
>
>> Without the index, you would rely on SE to always be able to exclude
>> enough segments, so that the seq scan is less expensive than an index
>> scan with the following table lookups.
>
> It would have to be a very fat index scan for so large a table...

..for SE to be faster than an index scan, you mean? Yes.

Regards

Markus
On Sat, 2008-01-05 at 16:42 +0100, Markus Schiltknecht wrote:
> Simon Riggs wrote:
> > On Fri, 2008-01-04 at 22:26 +0100, Markus Schiltknecht wrote:
> >
> >> I'm still puzzled about how a DBA is expected to figure out which
> >> segments to mark. Simon, are you assuming we are going to pass on
> >> segment numbers to the DBA one day?
> >
> > No Way!
>
> Ah, I'm glad ;-)

But now you want names on partitions?

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
Gavin and all,

This is quite a long reply, so apologies for that.

On Wed, 2008-01-09 at 07:28 +0100, Gavin Sherry wrote:
> On Wed, Jan 02, 2008 at 05:56:14PM +0000, Simon Riggs wrote:
> > This technique would be useful for any table with historical data keyed
> > by date or timestamp. It would also be useful for data where a
> > time-of-insert component is implicit, such as many major entity tables
> > where the object ids are assigned by a sequence. e.g. an Orders table
> > with an OrderId as PK. Once all orders placed in a period have been
> > shipped/resolved/closed then the segments will be marked read-only.
>
> A novel approach to the problem. For me, this proposal addresses some
> of the other problems in postgres (visibility in the heap vs. index)
> rather than the problems with partitioning itself.

I think you're right that it isn't attempting to address the problems with partitioning directly; it is attempting to address the core use case itself. I think we have an opportunity to bypass the legacy-of-thought that Oracle has left us and implement something more usable.

There are some low-level technical issues associated with declarative partitioning that I'll address last, which are in many ways the big reason to want to go to non-declarative partitioning.

> It might seem otherwise, but for me partitioning is a tool for managing
> large volumes of data. It allows the user to encode what they know about
> the nature of their data, and that's often about management. In this
> way, your proposal actually makes partitioning too smart: the user wants
> to tell the system how to organise the data, as opposed to the other way
> around.
>
> At Greenplum, we've been discussing this in depth. Interestingly, we
> also felt that the storage layer could be much smarter with large tables
> with predictable layouts and predictable data patterns. But the thing
> is, people with large tables like partitioning, putting different
> partitions on different storage; they like the ability to merge and
> split partitions; they like truncating old partitions. In a word, they
> seem to like the manageability partitions give them -- as well as the
> performance that comes with this.

Do people really "like" running all that DDL? There is significant manpower cost in implementing and maintaining a partitioning scheme, plus significant costs in getting it wrong.

If people with large tables like partitioning, why is Oracle moving towards automated partitioning in 11g? Automated partitioning was one of the major goals for this next set of Postgres partitioning functionality also, whether or not we have declarative partitioning. My thinking is let's go for the best ideas and skip over the stuff that will be (is) obsolete before it's left the design stage.

I see many more systems in the Terabyte range now than I did 10 years ago, but I see about the same number of DBAs. We'll always need experts, but I feel we should be using our expertise to simplify the standard cases, not just maintain the same level of difficulty in managing them. One of the big benefits of the dynamic partitioning approach is that it needs no DDL. So it will work out-of-the-box, for anybody.

Deleting older data would be optimised under the proposed scheme, so that's not really a problem. Loading data is actually slightly harder and slower with declarative partitioning (see below).

Merging and splitting partitions are tools for fine-tuning a very complex partitioning scheme.
They do also allow a non-linear segmentation scheme, which might aid performance in some cases.

(On a different note, I'm strongly in favour of a declarative approach to other optimizer information to allow the DBA to say what they know about how a table will be used. But that's off-topic for now.)

One major advantage of the dynamic approach is that it can work on multiple dimensions simultaneously, which isn't possible with declarative partitioning. For example if you have a table of Orders then you will be able to benefit from Segment Exclusion on all of these columns, rather than just one of them: OrderId, OrderDate, RequiredByDate, LastModifiedDate. This will result in some "sloppiness" in the partitioning, e.g. if we fill 1 partition a day of Orders, then the OrderId and OrderDate columns will start out perfectly arranged. Any particular RequiredByDate will probably be spread out over 7 partitions, but that's way better than being spread out over 365+ partitions.

When we look at the data in the partition we can look at any number of columns. When we declaratively partition, you get only one connected set of columns, which is one of the reasons you want multi-dimensional partitioning in the first place.

> To this end, we (well, Jeff Cohen) looked at the syntax and semantics of
> partitioning in leading databases (Oracle, Informix, DB2) and came up
> with a highly expressive grammar which takes the best of each I think
> (I'll post details on the grammar in a separate thread). The idea is
> that range (for example, a date range), list (a list of distinct values)
> and hash partitioning be supported on multiple columns. Partitions can
> be added, merged, truncated. Partitions can reside on different
> tablespaces. The existing issues with the rewriter, COPY support and the
> like go away by smartening up the backend.
>
> To explore the grammar and semantics Jeff and I (to a small extent) have
> implemented the parsing of such a grammar in the backend. At the moment,
> this just results in the generation of a bunch of rules (i.e., it
> produces the current behaviour, as if someone had written rules
> themselves) and debugging output.
>
> The fully fledged approach will see partitioning rules stored in
> a new system catalog which can be interrogated by the planner or COPY to
> determine which partition a given tuple belongs in or which partition(s)
> a WHERE clause matches.

When loading data we would need to examine each incoming tuple and distribute it to the correct place in the partitioned heap. That's a non-trivial overhead on data loading that I would like to avoid, since if we rely on the natural locality of loaded data then that overhead is zero.

> ...partitioning rules stored in
> a new system catalog which can be interrogated...

Agreed. I think this would be the best approach whatever we do. pg_class would eventually impose a limit on the number of partitions for us.

> Yes, this is the traditional approach but I think it still has merit. We
> shall continue working on this approach because it is what our customers
> have asked for. We would also like to see it get into postgres too.

I think there is very likely some merit hiding in the "traditional approach", but we should be clear about what that is and is not. (I still have half my brain in that way of doing things, so I'm working through everything to make sure I miss as little as possible.)
I'd like to hear from anybody with some descriptions of use cases where the differences between what's been proposed and what is missing provide a significant gain. If nobody has ever run such tests or had those thoughts, let's examine them now to make sure we're doing the right thing. I don't want to copy what others have done without considering whether it is correct for us.

After thinking about what you've said, ISTM that the declarative partition approach has these advantages over the dynamic approach

- allows partitions of different sizes, for when a non-linear partitioning scheme gives considerable performance gain
- allows partitioning to continue to work whilst UPDATEs continue within a particular partition

and these disadvantages

- only works with one set of columns, cannot exclude partitions by other columns
- requires manual definition
- requires partition names and numbers to be user visible
- declarative partitioning can improve the situation, or make it worse, depending upon whether the definitions match the reality of the data
- acquires table locks during DDL execution, to enforce validity of the rules, possibly for extended periods while data moves/is checked
- requires loaded data to execute partitioning code to identify where it should insert itself

My assessment is that the dynamic partitioning approach wins in most common cases for large growing tables, though there are additional use cases where more control is desirable.

(The two advantages of declarative partitioning might be something we could change with the dynamic approach, but not in the first phase. For example we might see a way to break down segments into 4 sub-segments recursively using a quad-tree approach. Thus when segments are being updated we simply isolate the issue around the changing data.)

I would like to proceed in a way that allows dynamic partitioning to occur naturally as the default, with the option to add explicit partitioning, if required and possible, at a later stage. At this stage I'm not sure whether those additional cases apply to truly growing tables and whether there is sufficient additional advantage to make declarative partitioning worth considering.

Technical Issues
----------------

The main problems I see occur in the storage layer. If we can discuss what you think is needed at the storage layer then we might stand a chance of laying the foundations for both options. The catalog changes and executor changes are probably fairly similar for both.

My first problem sounds quite mundane: expandability. If you define a partition manually and logically then it needs to cope with an arbitrary number of rows. That means that the view of a partition as a range of contiguous blocks disappears and is replaced by a more complex definition. When you turn partitioning on its head and say "What data is in this range of blocks?" you don't need to have expandability, and the partition can be a fixed set of contiguous blocks (the block arithmetic this buys is sketched in code after this mail). That works well with any shape of data, so merging/splitting operations are not required because we never have fat or thin partitions to cope with. You can also redefine partitioning simply by changing the segment size and re-VACUUMing the table.

At the moment, smgr provides the executor with a single uniform view of a table as a single contiguous range of blocks. The current implementation of md.c allows a table to actually consist of a range of files, without the executor knowing. If we violate that, then we'd need to change all code that accepts a BlockNumber, i.e.
indexes, buffer manager etc. So anything we do that is of any complexity must be below the smgr layer. Allowing expandable partitions eventually implies a non-contiguous range of blocks.

In the worst case, coping with a non-contiguous block range means you need to consult a block index for each block, which would be horrible. One better solution might be to have logical partitions broken up into physical partitions all of the same size. Various solutions are possible, but most seem to detract from the thing which we were seeking in the first place and/or seem to restrict the range of block addresses available, unless we go for variable-size BlockNumbers also. Urgh.

If anybody thinks we can get around this by putting a maximum size on partitions, then I'd say you're mad. The last thing we would want is a mechanism that stops you from adding new data if you get the partition definitions wrong, especially since my experience is that most people set up partitioning in a way that is proven sub-optimal after some time in real operation. The same comments would also apply to Oracle's Hash Cluster and DB2's Multi-Dimensional Clustering techniques. They're great performance features when declared correctly, but almost unusable in real applications that can't control what business and the real world will throw at them.

So it seems likely that we'd be heading for a level of change within the storage layer that would not be easily accepted, given that not everybody needs partitioning in Postgres. We're probably caught on the question of whether this is a general purpose or special purpose database system. Realistically, I have to set limits, so I'd describe my current goal as being able to manage 16TB easily and quickly with Postgres, without too much additional thought on the part of the DBA, when they have one or more naturally growing tables. Special purpose requirements for parallelism or larger databases can use another database system, e.g. GP, XDB, ParAccel etc.

So, there are specific disadvantages that I see to declarative partitioning with respect to its deeper implementation for us. I see no need to rule it out completely, but I do see reasons to de-prioritise it.

Detailed comments and thoughts most welcome.

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
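The fixed-size view referenced above is what keeps the block arithmetic trivial: with md.c's real RELSEG_SIZE as the section size, finding a block's segment is a single division, and an excluded segment is skipped with a single assignment. A sketch under those assumptions; segment_excluded() and process_block() are hypothetical stand-ins for the constraint test and the per-block scan work:

    #include "postgres.h"
    #include "storage/block.h"

    extern bool segment_excluded(void *scanstate, int segno);    /* hypothetical */
    extern void process_block(void *scanstate, BlockNumber blk); /* hypothetical */

    static void
    seqscan_with_segment_exclusion(void *scanstate, BlockNumber nblocks)
    {
        BlockNumber blkno = 0;

        while (blkno < nblocks)
        {
            int segno = blkno / RELSEG_SIZE;    /* contiguous 1 GB sections */

            if (segment_excluded(scanstate, segno))
            {
                /* hop to the first block of the next segment */
                blkno = ((BlockNumber) (segno + 1)) * RELSEG_SIZE;
                continue;
            }
            process_block(scanstate, blkno);
            blkno++;
        }
    }

Expandable partitions lose exactly this property: segno would need a lookup through some block-index structure instead of a division, which is the non-contiguous-range problem described above.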
simon@2ndquadrant.com (Simon Riggs) writes:
> I think we have an opportunity to bypass the legacy-of-thought that
> Oracle has left us and implement something more usable.

This seems like a *very* good thing to me, from a couple of perspectives.

1. I think you're right on in terms of the issue of the cost of "running all that DDL" in managing partitioning schemes.

When I was working as DBA, I was decidedly *NOT* interested in doing a lot of low level partition management work, and those that are in that role now would, I'm quite sure, agree that they are not keen on spending a lot of their time trying to figure out what tablespace to shift a particular table into, or what tablespace filesystem to get sysadmins to set up.

2. Blindly following what Oracle does has always been a dangerous sort of thing to do.

There are two typical risks:

a) There's always the worry that they may have patented some part of how they implement things, and if you follow too closely, There Be Dragons...

b) They have enough billion$ of development dollar$ and development re$ource$ that they can follow strategies that are too expensive for us to even try to follow.

3. If, rather than blindly following, we create something at least quasi-new, there is the chance of doing fundamentally better.

This very thing happened when it was discovered that IBM had a patent on the ARC caching scheme; the "clock" system that emerged was a lot better than ARC ever was.

> One major advantage of the dynamic approach is that it can work on
> multiple dimensions simultaneously, which isn't possible with
> declarative partitioning. For example if you have a table of Orders then
> you will be able to benefit from Segment Exclusion on all of these
> columns, rather than just one of them: OrderId, OrderDate,
> RequiredByDate, LastModifiedDate. This will result in some "sloppiness"
> in the partitioning, e.g. if we fill 1 partition a day of Orders, then
> the OrderId and OrderDate columns will start out perfectly arranged. Any
> particular RequiredByDate will probably be spread out over 7 partitions,
> but that's way better than being spread out over 365+ partitions.

I think it's worth observing both the advantages and demerits of this together.

In effect, with the dynamic approach, Segment Exclusion provides its benefits as an emergent property of the patterns of how INSERTs get drawn into segments.

The tendency will correspondingly be that Segment Exclusion will be able to provide useful constraints for those patterns that can naturally emerge from the INSERTs. We can therefore expect useful constraints for attributes that are assigned in some kind of more or less chronological order. Such attributes will include:

- Object ID, if set by a sequence
- Processing dates

There may be a bit of sloppiness, but the constraints may still be useful enough to exclude enough segments to improve efficiency.

_On The Other Hand_, there will be attributes that are *NOT* set in a more-or-less chronological order, and Segment Exclusion will be pretty useless for these attributes.

In order to do any sort of "Exclusion" for non-"chronological" attributes, it will be necessary to use some mechanism other than the patterns that fall out of "natural chronological insertions." If you want exclusion on such attributes, then there needs to be some sort of rule system to spread such items across additional partitions. Mind you, if you do such, that will weaken the usefulness of Segment Exclusion. For instance, suppose you have 4 regions, and scatter insertions by region.
In that case, there will be more segments that overlap any given chronological range.

> When we look at the data in the partition we can look at any number of
> columns. When we declaratively partition, you get only one connected set
> of columns, which is one of the reasons you want multi-dimensional
> partitioning in the first place.

Upside: Yes, you get to exclude based on examining any number of columns.

Downside: You only get the exclusions that are "emergent properties" of the data...

The more I'm looking at the dynamic approach, the more I'm liking it...

--
"cbbrowne","@","cbbrowne.com"
http://linuxfinances.info/info/linuxxian.html
"Feel free to contribute build files. Or work on your motivational skills, and maybe someone somewhere will write them for you..."
-- "Fredrik Lundh" <effbot@telia.com>
On Wed, Jan 09, 2008 at 11:47:31AM -0500, Chris Browne wrote:
> simon@2ndquadrant.com (Simon Riggs) writes:
> > I think we have an opportunity to bypass the legacy-of-thought that
> > Oracle has left us and implement something more usable.
>
> This seems like a *very* good thing to me, from a couple of
> perspectives.
>
[snip]
> 2. Blindly following what Oracle does has always been a dangerous
> sort of thing to do.
>
> There are two typical risks:
>
> a) There's always the worry that they may have patented some
> part of how they implement things, and if you follow too
> closely, There Be Dragons...

I think that could be equally said of the dynamic partitioning approach. In fact, it might be more likely, since declarative partitioning has been around for eons.

> b) They have enough billion$ of development dollar$ and
> development re$ource$ that they can follow strategies that
> are too expensive for us to even try to follow.

I don't see that as an argument against the declarative approach. Reading the details on this thread so far, I think Simon's approach is probably more complex from an implementation POV.

> 3. If, rather than blindly following, we create something at least
> quasi-new, there is the chance of doing fundamentally better.
>
> This very thing happened when it was discovered that IBM had a
> patent on the ARC caching scheme; the "clock" system that emerged
> was a lot better than ARC ever was.

Well, I don't think I'm proposing we /blindly follow/ others. I propose we choose a grammar which takes the best of what others have tried to do. Oracle's grammar is hideous, IBM's is too restrictive, for example.

> > One major advantage of the dynamic approach is that it can work on
> > multiple dimensions simultaneously, which isn't possible with
> > declarative partitioning. For example if you have a table of Orders then
> > you will be able to benefit from Segment Exclusion on all of these
> > columns, rather than just one of them: OrderId, OrderDate,
> > RequiredByDate, LastModifiedDate. This will result in some "sloppiness"
> > in the partitioning, e.g. if we fill 1 partition a day of Orders, then
> > the OrderId and OrderDate columns will start out perfectly arranged. Any
> > particular RequiredByDate will probably be spread out over 7 partitions,
> > but that's way better than being spread out over 365+ partitions.
>
> I think it's worth observing both the advantages and demerits of this
> together.
>
> In effect, with the dynamic approach, Segment Exclusion provides its
> benefits as an emergent property of the patterns of how INSERTs get
> drawn into segments.
>
> The tendency will correspondingly be that Segment Exclusion will be able
> to provide useful constraints for those patterns that can naturally
> emerge from the INSERTs.

Many people, in my experience, doing the kind of data processing which benefits from partitioning are regularly loading data, rather than collecting it in an OLTP fashion. Let's take the easily understandable concept of processing web site traffic. If the amount of data is large enough to benefit from partitioning, they probably have multiple web servers and therefore almost certainly multiple log files. If these files are not sorted into a single file, the records will not have a naturally progressing chronology: with every file we go back to the beginning of the time period the load covers. If you add parallelism to your load, things look even more different.
This means you could end up with a bunch of partitions, under the dynamic model, which all cover the same time range.

Then there's the way that really big databases are used (say, up around Simon's upper bound of 16 TB). It is expensive to keep data online, so people don't. They're loading and unloading data all the time, to perform different forms of analysis. A common scenario in the example above might be to unload all but the current month's data and then load the same month from the previous year. The unload step needs to be costless (i.e., TRUNCATE). Then, there's no guarantee that what they're interested in is the date range at all. They may want to compare user agents (look at bot activity). In this case, the partitioning is across a small list of strings (well, numbers most likely). Here, the partitions would all have the same range. Partitioning would be useless.

> We can therefore expect useful constraints for attributes that are
> assigned in some kind of more or less chronological order. Such
> attributes will include:
>
> - Object ID, if set by a sequence
> - Processing dates
>
> There may be a bit of sloppiness, but the constraints may still be
> useful enough to exclude enough segments to improve efficiency.

I really like the idea of the system being able to identify trends and patterns in the data (actually useful at the user level) but it's the uncertainty of how the partitioning will work for different kinds of data that I don't like.

> _On The Other Hand_, there will be attributes that are *NOT* set in a
> more-or-less chronological order, and Segment Exclusion will be pretty
> useless for these attributes.
>
> In order to do any sort of "Exclusion" for non-"chronological"
> attributes, it will be necessary to use some mechanism other than the
> patterns that fall out of "natural chronological insertions." If you
> want exclusion on such attributes, then there needs to be some sort of
> rule system to spread such items across additional partitions. Mind
> you, if you do such, that will weaken the usefulness of Segment
> Exclusion. For instance, suppose you have 4 regions, and scatter
> insertions by region. In that case, there will be more segments that
> overlap any given chronological range.

Right, and so the system seems to 'degrade' to a declarative approach. Consider all the kinds of data other than time series people are storing in Postgres. Some of the biggest I've seen are GIS information or network information. Also, what about people with their own data types? It may be that the data types do not support ordering, so range partitioning does not make sense.

> > When we look at the data in the partition we can look at any number of
> > columns. When we declaratively partition, you get only one connected set
> > of columns, which is one of the reasons you want multi-dimensional
> > partitioning in the first place.
>
> Upside: Yes, you get to exclude based on examining any number of
> columns.
>
> Downside: You only get the exclusions that are "emergent properties"
> of the data...

There's also a cost involved with calculating all this in the dynamic approach. Consider tables with a non-trivial number of columns.

Thanks,

Gavin
Chris Browne wrote:
> _On The Other Hand_, there will be attributes that are *NOT* set in a
> more-or-less chronological order, and Segment Exclusion will be pretty
> useless for these attributes.

Really? I was hoping that it'd be useful for any data with long runs of the same value repeated - regardless of ordering.

My biggest tables are clustered by zip/postal-code -- which means that while the City, State, Country attributes aren't monotonically increasing or decreasing, they are grouped tightly together. I'd expect all queries for San Francisco to be able to come from at most 2 segments; and all queries for Texas to be able to come from only a fraction of the whole.

If the segment sizes are configurable - I imagine this would even be useful for other data - like a people table organized by last_name, first_name. "John"s may be scattered throughout the table -- but at least the John Smiths would all be on one segment, while the Aaron-through-Jim Smith segments might get excluded.

Or am I missing something?
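Ron's observation in miniature: min/max exclusion needs runs of similar values, not chronology. A self-contained C toy, with invented city values and plain string comparison standing in for the datatype's ordering, shows the keep-or-skip test a segment would face:

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    /* Can a segment with the given min/max possibly contain key? */
    static bool
    segment_can_match(const char *min, const char *max, const char *key)
    {
        return strcmp(key, min) >= 0 && strcmp(key, max) <= 0;
    }

    int
    main(void)
    {
        /* A zip-clustered segment spanning ["San Bruno", "San Mateo"]
         * may contain San Francisco rows, so it must be scanned... */
        printf("%d\n", segment_can_match("San Bruno", "San Mateo",
                                         "San Francisco"));   /* 1: scan it */

        /* ...but it is provably free of Austin rows and can be skipped. */
        printf("%d\n", segment_can_match("San Bruno", "San Mateo",
                                         "Austin"));          /* 0: skip it */
        return 0;
    }

Whether the data got that way via insertion order or via CLUSTER makes no difference to the test itself, only to how often it succeeds.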
On Wed, 2008-01-09 at 20:03 +0100, Gavin Sherry wrote:

> I think Simon's approach is
> probably more complex from an implementation POV.

Much of the implementation is exactly the same, and I'm sure we agree on more than 50% of how this should work already. We just need to close in on the remainder.

My current opinion on the SE approach is the opposite of the one above, though it gets us nowhere just to state it. I'm trying to avoid opinion and look at the details, which is the reason my viewpoint recently changed in favour of the dynamic approach as the main thrust for implementation.

I've written a detailed email this morning to explain where and how the problems lie, which are nowhere near the syntax level. I haven't ruled out a declarative approach yet, but I need some detailed technical review of the issues. I hope you'll be replying to that?

> > 3. If, rather than blindly following, we create something at least
> > quasi-new, there is the chance of doing fundamentally better.
> >
> > This very thing happened when it was discovered that IBM had a
> > patent on the ARC caching scheme; the "clock" system that emerged
> > was a lot better than ARC ever was.
>
> Well, I don't think I'm proposing we /blindly follow/ others. I propose
> we choose a grammar which takes the best of what others have tried to
> do. Oracle's grammar is hideous, IBM's is too restrictive, for example.

I assume the new grammar is good and if we do go that way, it sounds like the right starting place.

> > > One major advantage of the dynamic approach is that it can work on
> > > multiple dimensions simultaneously, which isn't possible with
> > > declarative partitioning. For example if you have a table of Orders then
> > > you will be able to benefit from Segment Exclusion on all of these
> > > columns, rather than just one of them: OrderId, OrderDate,
> > > RequiredByDate, LastModifiedDate. This will result in some "sloppiness"
> > > in the partitioning, e.g. if we fill 1 partition a day of Orders, then
> > > the OrderId and OrderDate columns will start out perfectly arranged. Any
> > > particular RequiredByDate will probably be spread out over 7 partitions,
> > > but that's way better than being spread out over 365+ partitions.
> >
> > I think it's worth observing both the advantages and demerits of this
> > together.
> >
> > In effect, with the dynamic approach, Segment Exclusion provides its
> > benefits as an emergent property of the patterns of how INSERTs get
> > drawn into segments.
> >
> > The tendency will correspondingly be that Segment Exclusion will be able
> > to provide useful constraints for those patterns that can naturally
> > emerge from the INSERTs.
>
> Many people, in my experience, doing the kind of data processing which
> benefits from partitioning are regularly loading data, rather than
> collecting it in an OLTP fashion. Let's take the easily understandable
> concept of processing web site traffic. If the amount of data is large
> enough to benefit from partitioning, they probably have multiple web
> servers and therefore almost certainly multiple log files. If these
> files are not sorted into a single file, the records will not have a
> naturally progressing chronology: with every file we go back to the beginning
> of the time period the load covers. If you add parallelism to your load,
> things look even more different. This means you could end up with a
> bunch of partitions, under the dynamic model, which all cover the same
> time range.
Depends how big we make the partitions and how sloppy this is, as to whether that is a problem or not. We might still expect a x100 gain from using the SE approach, depending upon the data volume.

> Then there's the way that really big databases are used (say, up around
> Simon's upper bound of 16 TB). It is expensive to keep data online, so
> people don't. They're loading and unloading data all the time, to
> perform different forms of analysis.

That isn't my experience. That sounds very time-consuming.

The storage cost issue was the reason Andrew wanted offline segments, and why I have been talking about hierarchical storage.

> A common scenario in the example
> above might be to unload all but the current month's data and then load
> the same month from the previous year. The unload step needs to be
> costless (i.e., TRUNCATE). Then, there's no guarantee that what they're
> interested in is the date range at all. They may want to compare user
> agents (look at bot activity). In this case, the partitioning is across a
> small list of strings (well, numbers most likely). Here, the partitions
> would all have the same range. Partitioning would be useless.

I take it you mean SE-based partitioning would be useless, but declarative partitioning would be useful? I would agree, assuming they run queries with a few of the small list of strings. Seems like a contrived case.

> Some of the biggest I've seen are GIS information or network
> information.

Those are good examples of where a declarative approach would be the only way of getting partitioning to work well. So based on that I'll agree that some people *need* a declarative approach to get the benefits of partitioning, rather than just want one.

We do already have constraint exclusion. Can we get by with constraint and segment exclusion together?

> Also, what about people with their own data types? It may
> be that the data types do not support ordering so range partitioning
> does not make sense.

I'd like to hear some examples. Oleg?

> There's also a cost involved with calculating all this in the dynamic
> approach. Consider tables with a non-trivial number of columns.

True, but that balances the cost of doing that at load time.

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
Ron Mayer <rm_pg@cheapcomplexdevices.com> writes:
> Chris Browne wrote:
>> _On The Other Hand_, there will be attributes that are *NOT* set in a
>> more-or-less chronological order, and Segment Exclusion will be pretty
>> useless for these attributes.
>
> Really? I was hoping that it'd be useful for any data
> with long runs of the same value repeated - regardless of ordering.
>
> My biggest tables are clustered by zip/postal-code -- which means that
> while the City, State, Country attributes aren't monotonically increasing
> or decreasing, they are grouped tightly together. I'd expect all queries
> for San Francisco to be able to come from at most 2 segments; and all queries
> for Texas to be able to come from only a fraction of the whole.
>
> If the segment sizes are configurable - I imagine this would even
> be useful for other data - like a people table organized
> by last_name, first_name. "John"s may be scattered throughout
> the table -- but at least the John Smiths would all be on one
> segment, while the Aaron-through-Jim Smith segments might get excluded.
>
> Or am I missing something?

Well, this can head in two directions...

1. Suppose we're not using an "organize in CLUSTER order" approach.

If the data is getting added in roughly "by order of insertion" order, then there's no reason to expect San Francisco data to be clustered together. Ditto for "John Smiths"; we can expect them to be as scattered as their dates of creation.

2. Suppose we *are* using an "organize in CLUSTER order" approach.

In that case, yes, indeed, you can expect segments to contain specific ranges of the data.

However, in that case, the ONLY dimension under which the Segment Exclusion may be expected to be useful is that of the first column of the index being used. "Smith" should be useful to SE, but not "John," because, as a secondary criterion, the first name values will be scattered across all segments.

--
(reverse (concatenate 'string "ofni.sesabatadxunil" "@" "enworbbc"))
http://www3.sympatico.ca/cbbrowne/linuxdistributions.html
PALINDROME spelled backwards is EMORDNILAP.
On Tue, 2008-01-08 at 02:12 +0000, Gregory Stark wrote:

> I also don't understand how this proposal deals with the more common use case
> of unloading and loading data. Normally in partitioned tables we build the
> data in a side table until the data is all correct, then load it as a
> partition. If you treat it as a lower-level object then I don't see that
> working. The layout of the new table won't often match the layout of the
> target partitioned table.

We optimised for that in 8.2, but I would say that not many people noticed and that it isn't normal.

The problem with that approach, and the reason many people don't use it, is that it requires all data for a partition to be available at the time you add the partition. That necessarily implies a time delay in the process of loading data, which is no longer acceptable in the world of straight-through processing, or whatever you call the need for zero processing delay in your industry. So people choose to load data directly into the main table, allowing it to be immediately available, though at the cost of some performance.

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
On Wed, Jan 09, 2008 at 02:38:21PM -0500, Chris Browne wrote:
> Ron Mayer <rm_pg@cheapcomplexdevices.com> writes:
> > Or am I missing something?
>
> Well, this can head in two directions...
>
> 1. Suppose we're not using an "organize in CLUSTER order" approach.
>
> If the data is getting added in roughly "by order of insertion" order,
> then there's no reason to expect San Francisco data to be clustered
> together.
>
> Ditto for "John Smiths"; we can expect them to be as scattered as
> their dates of creation.
>
> 2. Suppose we *are* using an "organize in CLUSTER order" approach.
>
> In that case, yes, indeed, you can expect segments to contain specific
> ranges of the data.
>
> However, in that case, the ONLY dimension under which the Segment
> Exclusion may be expected to be useful is that of the first column of
> the index being used. "Smith" should be useful to SE, but not "John,"
> because, as a secondary criterion, the first name values will be
> scattered across all segments.

One of the reasons for using partitions is that the usefulness of indexes breaks down because of the low cardinality of the indexed column. Also, the random IO can kill your performance.

In my opinion, CLUSTER (and VACUUM, mentioned elsewhere) are not data management tools.

Thanks,

Gavin
On Wed, Jan 09, 2008 at 08:17:41PM +0000, Simon Riggs wrote:
> On Wed, 2008-01-09 at 20:03 +0100, Gavin Sherry wrote:
>
> > I think Simon's approach is
> > probably more complex from an implementation POV.
>
> Much of the implementation is exactly the same, and I'm sure we agree on
> more than 50% of how this should work already. We just need to close in
> on the remainder.
>
> My current opinion on the SE approach is the opposite of the one above,
> though it gets us nowhere just to state it. I'm trying to avoid opinion
> and look at the details, which is the reason my viewpoint recently
> changed in favour of the dynamic approach as the main thrust for
> implementation.
>
> I've written a detailed email this morning to explain where and how the
> problems lie, which are nowhere near the syntax level. I haven't ruled
> out a declarative approach yet, but I need some detailed technical
> review of the issues. I hope you'll be replying to that?

I'm responding to that email separately. Just a matter of addressing all the points in the detail they deserve.

> > > 3. If, rather than blindly following, we create something at least
> > > quasi-new, there is the chance of doing fundamentally better.
> > >
> > > This very thing happened when it was discovered that IBM had a
> > > patent on the ARC caching scheme; the "clock" system that emerged
> > > was a lot better than ARC ever was.
> >
> > Well, I don't think I'm proposing we /blindly follow/ others. I propose
> > we choose a grammar which takes the best of what others have tried to
> > do. Oracle's grammar is hideous, IBM's is too restrictive, for example.
>
> I assume the new grammar is good and if we do go that way, it sounds
> like the right starting place.
>
> > > > One major advantage of the dynamic approach is that it can work on
> > > > multiple dimensions simultaneously, which isn't possible with
> > > > declarative partitioning. For example if you have a table of Orders then
> > > > you will be able to benefit from Segment Exclusion on all of these
> > > > columns, rather than just one of them: OrderId, OrderDate,
> > > > RequiredByDate, LastModifiedDate. This will result in some "sloppiness"
> > > > in the partitioning, e.g. if we fill 1 partition a day of Orders, then
> > > > the OrderId and OrderDate columns will start out perfectly arranged. Any
> > > > particular RequiredByDate will probably be spread out over 7 partitions,
> > > > but that's way better than being spread out over 365+ partitions.
> > >
> > > I think it's worth observing both the advantages and demerits of this
> > > together.
> > >
> > > In effect, with the dynamic approach, Segment Exclusion provides its
> > > benefits as an emergent property of the patterns of how INSERTs get
> > > drawn into segments.
> > >
> > > The tendency will correspondingly be that Segment Exclusion will be able
> > > to provide useful constraints for those patterns that can naturally
> > > emerge from the INSERTs.
> >
> > Many people, in my experience, doing the kind of data processing which
> > benefits from partitioning are regularly loading data, rather than
> > collecting it in an OLTP fashion. Let's take the easily understandable
> > concept of processing web site traffic. If the amount of data is large
> > enough to benefit from partitioning, they probably have multiple web
> > servers and therefore almost certainly multiple log files.
> > If these
> > files are not sorted into a single file, the records will not have a
> > naturally progressing chronology: with every file we go back to the beginning
> > of the time period the load covers. If you add parallelism to your load,
> > things look even more different. This means you could end up with a
> > bunch of partitions, under the dynamic model, which all cover the same
> > time range.
>
> Depends how big we make the partitions and how sloppy this is, as to
> whether that is a problem or not. We might still expect a x100 gain from
> using the SE approach, depending upon the data volume.

This comes back to my problem with the dynamic approach: the user might get a certain experience, but with the declarative approach the expectation matches the result.

> > Then there's the way that really big databases are used (say, up around
> > Simon's upper bound of 16 TB). It is expensive to keep data online, so
> > people don't. They're loading and unloading data all the time, to
> > perform different forms of analysis.
>
> That isn't my experience. That sounds very time-consuming.

I cannot offer any evidence, but it's what we see companies doing.

> The storage cost issue was the reason Andrew wanted offline segments,
> and why I have been talking about hierarchical storage.

The thing is, people with lots of data usually have good hierarchical storage systems. Can we really do better?

> > A common scenario in the example
> > above might be to unload all but the current month's data and then load
> > the same month from the previous year. The unload step needs to be
> > costless (i.e., TRUNCATE). Then, there's no guarantee that what they're
> > interested in is the date range at all. They may want to compare user
> > agents (look at bot activity). In this case, the partitioning is across a
> > small list of strings (well, numbers most likely). Here, the partitions
> > would all have the same range. Partitioning would be useless.
>
> I take it you mean SE-based partitioning would be useless, but
> declarative partitioning would be useful? I would agree, assuming they
> run queries with a few of the small list of strings. Seems like a
> contrived case.

I don't think it's contrived. How about the use case elsewhere in the thread which is based on post codes?

> > Some of the biggest I've seen are GIS information or network
> > information.
>
> Those are good examples of where a declarative approach would be the
> only way of getting partitioning to work well. So based on that I'll
> agree that some people *need* a declarative approach to get the benefits
> of partitioning, rather than just want one.
>
> We do already have constraint exclusion. Can we get by with constraint
> and segment exclusion together?

The reason we are (Greenplum) looking at better partitioning is that using the existing approach is too error prone, too different to other vendors, too difficult to manage and too hard to reconfigure.

> > Also, what about people with their own data types? It may
> > be that the data types do not support ordering so range partitioning
> > does not make sense.
>
> I'd like to hear some examples. Oleg?
>
> > There's also a cost involved with calculating all this in the dynamic
> > approach. Consider tables with a non-trivial number of columns.
>
> True, but that balances the cost of doing that at load time.

I agree, but I'll put more detail into the other email.

Thanks,

Gavin
Hi Simon,

On Wed, Jan 02, 2008 at 05:56:14PM +0000, Simon Riggs wrote:
> Segment Exclusion
> -----------------
>
> After we note that a segment is read-only we can scan the segment and
> record min/max values for all columns. These are then "implicit
> constraints", which can then be used for segment exclusion in a similar
> manner as we do with the current constraint exclusion mechanism.

What about columns which are varlen? The maximum value could be very long -- 1 GB in fact. Say we decide to not support those. Well, what about numeric? What about user defined types?

> This would also allow a Merge Join to utilise exclusion for the inner
> plan at execution time. Instead of scanning the whole inner table plan
> from the start, we would be able to start the scan from the appropriate
> segment. This would require us to pass the current value of the outer
> plan down into the inner plan. The typical executor nodes on the inner
> plan would be a SortNode and below that a SeqScan. In that case the
> executor would need to pass the outer value from the MergeJoinNode down
> thru the SortNode to the SeqScan node. The SeqScan node could then
> perform partition exclusion, reducing the time for that scan and also
> reducing the time for the resulting sort. This sounds like it might be
> worth doing in normal cases also. It might turn out that the potentially
> applicable cases are already excluded during planning, I haven't thought
> about that aspect in enough detail yet.

I don't think that would work at all. Say you have skew of data across the partitions. You may end up doing a seq scan for every outer tuple. It would look like a really expensive nested loop.

> If we collect data for all columns then many of our implicit constraints
> would be useless. e.g. if a column only has a few values and these are
> present in all segments. Matching our predicate against all constraints
> would be expensive, so we must weed out poor constraints. We would do
> this by removing any constraint that overlapped more than 10% of other
> segments. Various heuristics would likely need to be discovered during
> development to make this work without resorting to manual commands.
>
> Note that all of this exclusion is happening in the executor, not the
> planner. That allows this form of exclusion to work with stable
> functions and parameters without problem.

So, in that case, you've totally undercut the planner. All the steps up the tree would be based on the stats for scanning the entire table, but you've only returned part of it.

To solve that, you could put the min/max values in a catalog somewhere. For a 16 TB table, that's 16000 min/max values (a back-of-envelope check of the arithmetic follows this mail). That's pretty expensive.

And how do we handle making things non-read-only? You'd have to wait for all current queries to finish... that could take a long time.

How do we handle crashes during setting of min/max? I presume it needs WAL support.

What happens if the free space map gives us a free block in a read-only segment? Then we need to look at concurrency and transactionality of min/max values, don't we? If changing them makes the segment not read-only (which seems to be the case) it would be trivial to totally degrade your partitioning scheme to a full seq scan. On a 16 TB table. And getting the performance back might involve a day or more of VACUUMing. Impossible.
> Visibility Map
> --------------
[snip]
> No dynamic shared memory cache is required because any concurrent
> changes to the table would be ignored by a Scan anyway, so it doesn't
> matter if an INSERT, UPDATE or DELETE occurs while we are scanning. Any
> new scans that start will attempt to lock the table and then perform a
> rel cache check before continuing. So the visibility will be set
> correctly for *that* scan at least.

Is that true in the presence of READ COMMITTED?

>
> In most cases the visibility map can be summarised as a single boolean
> to show whether any 100% visible segments exist. That makes accessing
> the map very cheap in the common, unset case.

What do we do in failure scenarios? Does this go into WAL? Is it replicated via log shipping techniques? You mention Slony support below. I don't know that Slony's target is Very Big Tables (up to 16 TB), but even if it were being used for a 'small' system of a few hundred GB, when you fail over, the system has degraded if you aren't vacuuming it. Moreover, the layout of data may be different, and therefore so may the performance of the segment exclusion. That's a bit of a surprise.

>
> Setting the Visibility Map
> --------------------------
>
> VACUUM would scan all READ_WRITE_ALLOWED segments and mark some of
> them as EFFECTIVE_READ_ONLY if 100% of the remaining rows are visible to
> all current backends. Marking the map will cause a rel cache
> invalidation.

Argh. Remember, we're talking about 16 TB tables here. With the best of tuning, that takes all day. So, day 1, load data; day 2, VACUUM it? With the existing partitioning approach, there's no such thing.

>
> We would never mark the highest numbered segment
> EFFECTIVE_READ_ONLY, since it is likely to expand when INSERTs occur.
> This is also a useful way of excluding smaller tables from the overheads
> involved for this feature.

What about concurrent updates or deletes -- during VACUUM? Surely we don't have to vacuum full? Concurrent updates and deletes could change the partition boundaries, and that will impact the accuracy of segment exclusion (I do not think it leads to data corruption, but it might lead to the inclusion of a segment that shouldn't have been included). That leads to unpredictable behaviour.

>
> In an insert-only table this would mean that only the last 1 or 2
> segments would be read write, so VACUUM would always run in a bounded
> time, no matter how large the table.

You said elsewhere that unload was low cost. What about segments which cross the unload boundary? Half of their data needs to be removed, say. That sounds expensive. And they'll have to be vacuumed.

[snip]

> Unsetting the visibility map
> ----------------------------
>
[snip]

With the existing partitioning, we could just truncate a partition. This seems really expensive in comparison.

> There would be additional complexity in selectivity estimation and plan
> costing. The above proposal allows dynamic segment exclusion, which
> cannot be assessed at planning time anyway, so suggestions welcome...

As I've noted elsewhere, you cannot just undermine the planner like this. Your plans will be totally bogus. I hope you're not proposing that. If you're proposing tracking information for each segment then I think there are problems there too.

> Comparison with other Partitioning approaches
> ---------------------------------------------
>
> Once I realised this was possible in fairly automatic way, I've tried
> hard to keep away from manual overrides, commands and new DDL.
> Declarative partitioning is a big overhead, though worth it for large
> databases. No overhead is *much* better though.

I think you're making this out to be harder than it is. Dealing with lots of data is difficult; issuing one DDL command is not such a big deal.

Thanks,

Gavin
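The back-of-envelope check referenced above, as a standalone C program. The 1 GB segment size is the proposal's; the 20-column table is an invented assumption:

    #include <stdio.h>

    int
    main(void)
    {
        long long table_bytes = 16LL << 40;   /* 16 TB table */
        long long seg_bytes   = 1LL << 30;    /* 1 GB segment */
        long long nsegments   = table_bytes / seg_bytes;  /* 16384 */
        int       ncols       = 20;           /* assumed column count */

        /* one min and one max per column per read-only segment */
        long long datums = nsegments * ncols * 2;

        printf("%lld segments, %lld min/max datums\n", nsegments, datums);
        /* prints: 16384 segments, 655360 min/max datums */
        return 0;
    }

So Gavin's "16000 min/max values" is roughly the segment count; the total metadata to store and search scales with that times the number of constrained columns.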
Hi Simon,

On Wed, Jan 09, 2008 at 03:08:08PM +0000, Simon Riggs wrote:
> Do people really "like" running all that DDL? There is significant
> manpower cost in implementing and maintaining a partitioning scheme,
> plus significant costs in getting it wrong.

Well... that's impossible for me to say. Most of the people who use lots of data like using MDX (a large and quite complex reporting and modelling language from MS, which is more complex than SQL IMHO) so...

> If people with large tables like partitioning, why is Oracle moving
> towards automated partitioning in 11g? Automated partitioning was one of

Have you used Oracle's partitioning? I'm not surprised they're moving away ;-). More seriously, I didn't know that. Do you have a URL?

> the major goals for this next set of Postgres partitioning functionality
> also, whether or not we have declarative partitioning. My thinking is
> let's go for the best ideas and skip over the stuff that will be (is)
> obsolete before it's left the design stage.

Well, you're assuming that because declarative partitioning has been around a while, it's obsolete. I don't think that's the case. Postgres has been around about as long...

> I see many more systems in the Terabyte range now than I did 10 years
> ago, but I see about the same number of DBAs. We'll always need experts,
> but I feel we should be using our expertise to simplify the standard
> cases, not just maintain the same level of difficulty in managing them.
> One of the big benefits of the dynamic partitioning approach is that it
> needs no DDL. So it will work out-of-the-box, for anybody.

It's fine to say that, but people have already started talking about grammar extensions for dynamic partitions. People want the tools to manage it. There's difficulty in learning how and when to use the different VACUUM options that have been discussed.

> Deleting older data would be optimised under the proposed scheme, so
> that's not really a problem. Loading data is actually slightly harder
> and slower with declarative partitioning (see below).
>
> Merging and splitting partitions are tools for fine-tuning a very
> complex partitioning scheme. They do also allow a non-linear
> segmentation scheme, which might aid performance in some cases.

What about adding partitions into a set of partitions? Greg's post on that was dismissed, and it might not be representative, but we have users doing that all the time.

> One major advantage of the dynamic approach is that it can work on
> multiple dimensions simultaneously, which isn't possible with
> declarative partitioning. For example if you have a table of Orders then
> you will be able to benefit from Segment Exclusion on all of these
> columns, rather than just one of them: OrderId, OrderDate,
> RequiredByDate, LastModifiedDate. This will result in some "sloppiness"
> in the partitioning, e.g. if we fill 1 partition a day of Orders, then
> the OrderId and OrderDate columns will start out perfectly arranged. Any
> particular RequiredByDate will probably be spread out over 7 partitions,
> but that's way better than being spread out over 365+ partitions.
>
> When we look at the data in the partition we can look at any number of
> columns. When we declaratively partition, you get only one connected set
> of columns, which is one of the reasons you want multi-dimensional
> partitioning in the first place.
> > To this end, we (well, Jeff Cohen) looked at the syntax and semantics of partitioning in leading databases (Oracle, Informix, DB2) and came up with a highly expressive grammar which takes the best of each, I think (I'll post details on the grammar in a separate thread). The idea is that range (for example, a date range), list (a list of distinct values) and hash partitioning be supported on multiple columns. Partitions can be added, merged, truncated. Partitions can reside on different tablespaces. The existing issues with the rewriter, COPY support and the like go away by smartening up the backend.
> >
> > To explore the grammar and semantics, Jeff and I (to a small extent) have implemented the parsing of such a grammar in the backend. At the moment, this just results in the generation of a bunch of rules (i.e., it produces the current behaviour, as if someone had written rules themselves) and debugging output.
> >
> > The fully fledged approach will see partitioning rules stored in a new system catalog which can be interrogated by the planner or COPY to determine which partition a given tuple belongs in or which partition(s) a WHERE clause matches.
>
> When loading data we would need to examine each incoming tuple and distribute it to the correct place in the partitioned heap. That's a non-trivial overhead on data loading that I would like to avoid, since if we rely on the natural locality of loaded data then that overhead is zero.

Yes, but it's a lot faster than what we have at the moment, and the dynamic approach is not without its costs. I've pointed it out before, but I will again: updating a key which occurs in all segments would mean the whole thing needs to be vacuumed to get partitioning back. That's a MASSIVE cost.

> I'd like to hear from anybody with some descriptions of use cases where the differences between what's been proposed and what is missing provide a significant gain. If nobody has ever run such tests or had those thoughts, let's examine them now to make sure we're doing the right thing. I don't want to copy what others have done without considering whether it is correct for us.

I don't think you're stating the question clearly enough. It's not just a case of how many people hate typing out DDL commands. How many people would be adversely affected by the difficulties I've pointed out? If you update 100 rows in your big table, you're going to have to consider VACUUMing it, since your partition map may now have 100-odd read-write segments selected for scanning, not just 1 or 2.

> After thinking about what you've said, ISTM that the declarative partition approach has these advantages over the dynamic approach:
>
> - allows partitions of different sizes, for when a non-linear partitioning scheme gives considerable performance gain
>
> - allows partitioning to continue to work whilst UPDATEs continue within a particular partition

What about consistency over time? High levels of control for users if they want it? Even simple things like tablespace support? High-performance unload, and load performance much better than now? Reliable and consistent query plans? Support for datatypes other than those relating to time?

> and these disadvantages:
>
> - only works with one set of columns, cannot exclude partitions by other columns
>
> - requires manual definition
>
> - requires partition names and numbers to be user visible

In fact the grammar we will propose does not make this mandatory.
But either way, how is that a disadvantage worth listing?

> - declarative partitioning can improve the situation, or make it worse, depending upon whether the definitions match the reality of the data

Which is the same as dynamic partitioning.

> - acquires table locks during DDL execution, to enforce validity of the rules, possibly for extended periods while data moves/is checked
>
> - requires loaded data to execute partitioning code to identify where it should insert itself

Sure, that's how it works.

> I would like to proceed in a way that allows dynamic partitioning to occur naturally as the default, with the option to add explicit partitioning, if required and possible, at a later stage.

As the default? What about effects on people who do not need partitioning?

> Technical Issues
> ----------------
>
> The main problems I see occur in the storage layer. If we can discuss what you think is needed at the storage layer then we might stand a chance of laying the foundations for both options. The catalog changes and executor changes are probably fairly similar for both.

I agree, there are major issues with the storage layer. I could list a lot of things I'd like to see improved, but they come down to taking advantage of how storage works (reading 1 block is as good as reading 100), compression and much faster load.

> My first problem sounds quite mundane: expandability. If you define a partition manually and logically then it needs to cope with an arbitrary number of rows. That means that the view of a partition as a range of contiguous blocks disappears and is replaced by a more complex definition. When you turn partitioning on its head and say "What data is in this range of blocks?" you don't need to have expandability and the partition can be a fixed set of contiguous blocks. That works well with any shape of data, so merging/splitting operations are not required because we never have fat or thin partitions to cope with. You can also redefine partitioning simply by changing the segment size and re-VACUUMing the table.

And that's what I don't like about it. It makes VACUUM more expensive, and suddenly it becomes a data management tool. You said elsewhere that load is faster, but it's not. It's two times slower, because you /must/ VACUUM after load.

> At the moment, smgr provides the executor with a single uniform view of a table as a single contiguous range of blocks. The current implementation of md.c allows a table to actually consist of a range of files, without the executor knowing. If we violate that, then we'd need to change all code that accepts a BlockNumber, i.e. indexes, buffer manager etc. So anything we do that is of any complexity must be below the smgr layer. Allowing expandable partitions eventually implies a non-contiguous range of blocks.
>
> In the worst case coping with a non-contiguous block range means you need to consult a block index for each block, which would be horrible. One better solution might be to have logical partitions broken up into physical partitions all of the same size. Various solutions possible, but most seem to detract from the thing which we were seeking in the first place and/or seem to restrict the range of block addresses available, unless we go for variable-size BlockNumbers also. Urgh.
>
> If anybody thinks we can get around this by putting a maximum size on partitions, then I'd say you're mad.
> The last thing we would want is a mechanism that stops you from adding new data if you get the partition definitions wrong, especially since my experience is that most people set up partitioning in a way that is proven sub-optimal after some time in real operation. The same comments would also apply to Oracle's Hash Cluster and DB2's Multi-Dimensional Clustering techniques. They're great performance features when declared correctly, but almost unusable in real applications that can't control what business and the real world will throw at them.
>
> So it seems likely that we'd be heading for a level of change within the storage layer that would not be easily accepted, given that not everybody needs partitioning in Postgres.

Originally, when Jeff and others at Greenplum were looking at better partitioning, they came up with a system quite similar to the one you're proposing. There were two main reasons it was put to one side: it undermines the planner, and it doesn't give predictable results; the performance changes and generally degrades over time. I've talked about the latter elsewhere, so, to the former.

If the exclusion is executor driven, the planner cannot help but create a seq scan plan. The planner will think you're returning 100X rows when really you end up returning X rows. After that, all decisions made by the planner are totally bogus. The solution means having the planner look at those values for each partition so that it can generate an accurate plan. But then we have locking issues: what if the partition range changes underneath us?

Postgres wouldn't be the only database with this technology. Netezza works in a similar way AND it degrades in the way that I've said. You can also assume that lots of obvious approaches to doing this are patented by them, by the way.

> We're probably caught on the question of whether this is a general purpose or special purpose database system. Realistically, I have to set limits, so I'd have to describe my current goal of being able to manage 16TB easily and quickly with Postgres, without too much additional thought on the part of the DBA, when they have one or more naturally growing tables. Special purpose requirements for parallelism or larger databases can use another database system, e.g. GP, XDB, ParAccel etc.

Sure, but this still affects everyone. With the declarative approach, those who want it, use it. With the dynamic approach, everyone experiences the cost.

Thanks,

Gavin
On Thu, 2008-01-10 at 03:06 +0100, Gavin Sherry wrote:

> If the exclusion is executor driven, the planner cannot help but create a seq scan plan. The planner will think you're returning 100X rows when really you end up returning X rows. After that, all decisions made by the planner are totally bogus.

One of the most important queries on large tables is handling

    WHERE EventDate = CURRENT DATE;

We cannot perform partition exclusion using this type of WHERE clause at planning time because the CURRENT DATE function is STABLE. We can decide that some form of partition exclusion is possible, but the actual partition we access can *only* be decided within the executor.

That necessarily affects the selectivity estimates. The planner can see we want some data, but it can't tell which data, so it doesn't know whether we will hit the day with the most data or the day with the least data.

You've mentioned this a few times, as if it's a problem with dynamic partitioning, yet it's clearly an issue for any form of partitioning. So it seems clear that we need to make partition exclusion work at executor time, whatever else we do.

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
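A minimal sketch of the plan-time versus execution-time distinction (the events table and column names here are hypothetical):

    -- With a constant, exclusion can happen while planning:
    SELECT * FROM events WHERE event_date = DATE '2008-01-10';

    -- CURRENT_DATE is STABLE: its value is fixed for the whole query but is
    -- not known when a prepared plan is built, so exclusion can only happen
    -- once the executor evaluates it:
    PREPARE todays_events AS
        SELECT * FROM events WHERE event_date = CURRENT_DATE;
    EXECUTE todays_events;  -- which segment(s) qualify is known only now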
On Thu, 2008-01-10 at 03:06 +0100, Gavin Sherry wrote:

> > If people with large tables like partitioning why is Oracle moving towards automated partitioning in 11g? Automated partitioning was one of
>
> Have you used Oracle's partitioning?

Since you ask, yep, certified on it, plus DB2, Teradata and Greenwich.

> What about consistency over time? High levels of control for users if they want it? Even simple things like tablespace support? High-performance unload, and load performance much better than now? Reliable and consistent query plans? Support for datatypes other than those relating to time?

Most of these comments seem like emotional appeals, so refuting them one by one is just going to look and smell like an argument, and a fruitless one as well, since we agree on so many things.

So, I get the message that you really want the DDL approach, and agree that you've demonstrated there are use cases that need it that you are interested in. That's fine by me, as long as we can each work on parts of it to get it done. Will it be possible for you to do that?

I feel I can say that because AFAICS we can in principle have both dynamic and declarative techniques, even on the same table. Around half of the mechanics would be identical anyway.

So from here I say let's work together to get the basic mechanics right. My experience with PostgreSQL is that the team/list always comes up with a better idea in the end, and I'm sure we will this time too.

I have one main technical concern, which is the storage model, the source of the difference between the two types of partitioning we've been discussing. I see difficulties originating there, so I will start another thread to examine just those items in detail.

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
Chris Browne wrote:

> _On The Other Hand_, there will be attributes that are *NOT* set in a more-or-less chronological order, and Segment Exclusion will be pretty useless for these attributes.

Short summary: with the appropriate clustering, ISTM Segment Exclusion can be useful on all columns in a table.

Cluster the table by "one-bit-of-column1 || one-bit-of-column2"... That way any "col2=value" query could exclude at least about half the table, regardless of whether the column is monotonically increasing or even totally random.

It seems one could make Segment Exclusion even useful for multiple unrelated columns in a table. Consider a large table of people where one might want segment exclusion to help with both first name and last name. One could cluster the table by "first-letter-of-last-name || first-letter-of-first-name". Then a query for last name Smith could exclude all but the consecutive segments of S's; while a query for first name John would only have to look in the 26 runs of segments with AJ, BJ, CJ, ...

Perhaps better: hash each column and interleave the bits

    col1bit1, col2bit1, col3bit1, col1bit2, col2bit2, col3bit2, ...

If I understand segment exclusion right, that way on any table bigger than 8 segments any query of col1=val, or col2=val, or col3=val would scan at most half the table; on a table bigger than 64 segments any such query would scan at most 1/4 of the table.

Obviously this only benefits the rarely changing parts of tables; new and updated values would be in a very hot segment at the end of the table that wouldn't be segment-excluded much.
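A rough sketch of the interleaving idea in SQL, assuming three text columns. The function name interleave3 and the 8-bit depth are invented for illustration; hashtext() is PostgreSQL's (undocumented) built-in text hash:

    CREATE FUNCTION interleave3(int, int, int) RETURNS int AS $$
        -- Interleave the low 8 bits of each argument:
        -- arg1 bit i -> result bit 3i, arg2 -> 3i+1, arg3 -> 3i+2.
        SELECT bit_or( ((($1 >> i) & 1) << (3 * i))
                     | ((($2 >> i) & 1) << (3 * i + 1))
                     | ((($3 >> i) & 1) << (3 * i + 2)) )
        FROM generate_series(0, 7) AS i;
    $$ LANGUAGE sql IMMUTABLE;

    -- Cluster a hypothetical people table by the interleaved hashes, so that
    -- runs of segments share prefixes of all three columns' hash bits:
    CREATE INDEX people_interleave_idx ON people
        (interleave3(hashtext(last_name), hashtext(first_name), hashtext(city)));
    CLUSTER people_interleave_idx ON people;   -- 8.x CLUSTER syntax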
On Thu, Jan 10, 2008 at 07:25:00AM +0000, Simon Riggs wrote:

> On Thu, 2008-01-10 at 03:06 +0100, Gavin Sherry wrote:
> > If the exclusion is executor driven, the planner cannot help but create a seq scan plan. The planner will think you're returning 100X rows when really you end up returning X rows. After that, all decisions made by the planner are totally bogus.
>
> One of the most important queries on large tables is handling
>
>     WHERE EventDate = CURRENT DATE;

Really? Well, this isn't handled by the current partitioning approach (i.e., constraint exclusion), so users have worked around it.

> We cannot perform partition exclusion using this type of WHERE clause at planning time because the CURRENT DATE function is STABLE.

We can do the exact same thing -- if it's a direction people want to take. In fact, we can do it better/faster, because once we've evaluated one partition we know that there are no others to evaluate.

> So it seems clear that we need to make partition exclusion work at executor time, whatever else we do.

One example doesn't make the rule. Most people doing range partitioning are going to be doing either specific dates or date ranges -- i.e., constants or things that can be folded to constants by the planner. At least, that's what I've seen.

Thanks,

Gavin
On Thu, Jan 10, 2008 at 04:51:04PM +0000, Simon Riggs wrote:

> On Thu, 2008-01-10 at 03:06 +0100, Gavin Sherry wrote:
> > > If people with large tables like partitioning why is Oracle moving towards automated partitioning in 11g? Automated partitioning was one of
> >
> > Have you used Oracle's partitioning?
>
> Since you ask, yep, certified on it, plus DB2, Teradata and Greenwich.

It was just a joke.

> > What about consistency over time? High levels of control for users if they want it? Even simple things like tablespace support? High-performance unload, and load performance much better than now? Reliable and consistent query plans? Support for datatypes other than those relating to time?
>
> Most of these comments seem like emotional appeals, so refuting them one by one is just going to look and smell like an argument, and a fruitless one as well, since we agree on so many things.

Sigh. You can call those questions what you like, but your approach doesn't deal with them, and the current partitioning approach, for its problems, does deal with them.

> So, I get the message that you really want the DDL approach, and agree that you've demonstrated there are use cases that need it that you are interested in. That's fine by me, as long as we can each work on parts of it to get it done. Will it be possible for you to do that?

I assured you offlist that it was. This is something Greenplum needs right now and I'm being paid to deliver it.

> I feel I can say that because AFAICS we can in principle have both dynamic and declarative techniques, even on the same table. Around half of the mechanics would be identical anyway.
>
> So from here I say let's work together to get the basic mechanics right. My experience with PostgreSQL is that the team/list always comes up with a better idea in the end, and I'm sure we will this time too.

I agree. Scrutiny of ideas almost always leads to better ideas.

> I have one main technical concern, which is the storage model, the source of the difference between the two types of partitioning we've been discussing. I see difficulties originating there, so I will start another thread to examine just those items in detail.

I look forward to it. As promised, I'll soon post some syntax of what we have been working on.

Thanks,

Gavin
On Thu, 2008-01-10 at 21:43 +0100, Gavin Sherry wrote:

> On Thu, Jan 10, 2008 at 07:25:00AM +0000, Simon Riggs wrote:
> > On Thu, 2008-01-10 at 03:06 +0100, Gavin Sherry wrote:
> > > If the exclusion is executor driven, the planner cannot help but create a seq scan plan. The planner will think you're returning 100X rows when really you end up returning X rows. After that, all decisions made by the planner are totally bogus.
> >
> > One of the most important queries on large tables is handling
> >
> >     WHERE EventDate = CURRENT DATE;
>
> Really? Well, this isn't handled by the current partitioning approach

Yes, exactly why we need to fix it and why I'm mentioning it here.

> > We cannot perform partition exclusion using this type of WHERE clause at planning time because the CURRENT DATE function is STABLE.
>
> We can do the exact same thing -- if it's a direction people want to take. In fact, we can do it better/faster, because once we've evaluated one partition we know that there are no others to evaluate.

Lost you completely here. I'm explaining to you that *nobody* can solve those problems solely at planning time, by definition, so it has to be done at execution time. I'm not saying anything about your way, my way.

> > So it seems clear that we need to make partition exclusion work at executor time, whatever else we do.
>
> One example doesn't make the rule. Most people doing range partitioning are going to be doing either specific dates or date ranges -- i.e., constants or things that can be folded to constants by the planner. At least, that's what I've seen.

It's not always true that planning time = execution time.

Using CURRENT DATE to access the current day, week or month is common in many applications, as is parameterised SQL for higher transaction rate apps. We can't re-write all the SQL, so we need to make it work.

I think if you demand a full-function implementation of partitioning, you'd better take into account *all* of the requirements.

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
On Thu, 2008-01-10 at 21:49 +0100, Gavin Sherry wrote:

> > So, I get the message that you really want the DDL approach, and agree that you've demonstrated there are use cases that need it that you are interested in. That's fine by me, as long as we can each work on parts of it to get it done. Will it be possible for you to do that?
>
> I assured you offlist that it was. This is something Greenplum needs right now and I'm being paid to deliver it.

Cool, I'll assist you where possible.

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
On Thu, Jan 10, 2008 at 09:30:10PM +0000, Simon Riggs wrote:

> > > We cannot perform partition exclusion using this type of WHERE clause at planning time because the CURRENT DATE function is STABLE.
> >
> > We can do the exact same thing -- if it's a direction people want to take. In fact, we can do it better/faster, because once we've evaluated one partition we know that there are no others to evaluate.
>
> Lost you completely here. I'm explaining to you that *nobody* can solve those problems solely at planning time, by definition, so it has to be done at execution time. I'm not saying anything about your way, my way.

Sorry, I wasn't clear enough. I was trying to say: if we're going to do something in the executor (for right or wrong), the declarative approach can do it too. Since there will be partition bounding information available, we can do partition selection in the executor (maybe the planner should tell us to do it).

> > > So it seems clear that we need to make partition exclusion work at executor time, whatever else we do.
> >
> > One example doesn't make the rule. Most people doing range partitioning are going to be doing either specific dates or date ranges -- i.e., constants or things that can be folded to constants by the planner. At least, that's what I've seen.
>
> It's not always true that planning time = execution time.
>
> Using CURRENT DATE to access the current day, week or month is common in many applications, as is parameterised SQL for higher transaction rate apps. We can't re-write all the SQL, so we need to make it work.
>
> I think if you demand a full-function implementation of partitioning, you'd better take into account *all* of the requirements.

Okay. As I said above, nothing in declarative partitioning rules out partition selection with stable functions. So let's do it, assuming everyone else thinks it is a good idea.

Thanks,

Gavin
On Fri, 2008-01-11 at 02:28 +0100, Gavin Sherry wrote:

> On Thu, Jan 10, 2008 at 09:30:10PM +0000, Simon Riggs wrote:
> > > > We cannot perform partition exclusion using this type of WHERE clause at planning time because the CURRENT DATE function is STABLE.
> > >
> > > We can do the exact same thing -- if it's a direction people want to take. In fact, we can do it better/faster, because once we've evaluated one partition we know that there are no others to evaluate.
> >
> > Lost you completely here. I'm explaining to you that *nobody* can solve those problems solely at planning time, by definition, so it has to be done at execution time. I'm not saying anything about your way, my way.
>
> Sorry, I wasn't clear enough. I was trying to say: if we're going to do something in the executor (for right or wrong), the declarative approach can do it too. Since there will be partition bounding information available, we can do partition selection in the executor (maybe the planner should tell us to do it).

Of course. It's an identical situation for both. Regrettably, none of your comments about dynamic partitioning and planning were accurate as a result.

> Okay. As I said above, nothing in declarative partitioning rules out partition selection with stable functions. So let's do it, assuming everyone else thinks it is a good idea.

If you check the archives, this was identified as a requirement long ago. And I said exactly the things you said, BTW, when trying to say it didn't matter.

I've kept a list of requests for improvement that I can share with you; I've always been loath to publish a list of bad points.

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
On Fri, Jan 11, 2008 at 08:07:18AM +0000, Simon Riggs wrote:

> On Fri, 2008-01-11 at 02:28 +0100, Gavin Sherry wrote:
> > On Thu, Jan 10, 2008 at 09:30:10PM +0000, Simon Riggs wrote:
> > > > > We cannot perform partition exclusion using this type of WHERE clause at planning time because the CURRENT DATE function is STABLE.
> > > >
> > > > We can do the exact same thing -- if it's a direction people want to take. In fact, we can do it better/faster, because once we've evaluated one partition we know that there are no others to evaluate.
> > >
> > > Lost you completely here. I'm explaining to you that *nobody* can solve those problems solely at planning time, by definition, so it has to be done at execution time. I'm not saying anything about your way, my way.
> >
> > Sorry, I wasn't clear enough. I was trying to say: if we're going to do something in the executor (for right or wrong), the declarative approach can do it too. Since there will be partition bounding information available, we can do partition selection in the executor (maybe the planner should tell us to do it).
>
> Of course. It's an identical situation for both. Regrettably, none of your comments about dynamic partitioning and planning were accurate as a result.

That's not true. We will still have planning drive the partition selection when the predicate is immutable, thus having more accurate plans. Some, but not all, plans use current_date and similar.

> > Okay. As I said above, nothing in declarative partitioning rules out partition selection with stable functions. So let's do it, assuming everyone else thinks it is a good idea.
>
> If you check the archives, this was identified as a requirement long ago. And I said exactly the things you said, BTW, when trying to say it didn't matter.
>
> I've kept a list of requests for improvement that I can share with you; I've always been loath to publish a list of bad points.

Why, please send them.

Thanks,

Gavin
Gavin, I've responded to the technical questions you raise here:

On Thu, 2008-01-10 at 02:22 +0100, Gavin Sherry wrote:

> > After we note that a segment is read-only we can scan the segment and record min/max values for all columns. These are then "implicit constraints", which can then be used for segment exclusion in a similar manner as we do with the current constraint exclusion mechanism.
>
> What about columns which are varlen? The maximum value could be very long -- 1 GB in fact. Say we decide to not support those. Well, what about numeric? What about user defined types?

No problem with any datatypes that I can see. Why would there be? Very long values as boundary values would be a problem with any type of partitioning, so some sensible heuristics would need to apply in any case.

> > This would also allow a Merge Join to utilise exclusion for the inner plan at execution time. Instead of scanning the whole inner table plan from the start, we would be able to start the scan from the appropriate segment. This would require us to pass the current value of the outer plan down into the inner plan. The typical executor nodes on the inner plan would be a SortNode and below that a SeqScan. In that case the executor would need to pass the outer value from the MergeJoinNode down thru the SortNode to the SeqScan node. The SeqScan node could then perform partition exclusion, reducing the time for that scan and also reducing the time for the resulting sort. This sounds like it might be worth doing in normal cases also. It might turn out that the potentially applicable cases are already excluded during planning, I haven't thought about that aspect in enough detail yet.
>
> I don't think that would work at all. Say you have skew of data across the partitions. You may end up doing a seq scan for every outer tuple. It would look like a really expensive nested loop.

No, you've misunderstood. I said start the plan, i.e. do this once, not for every row. That would be ridiculous.

> > If we collect data for all columns then many of our implicit constraints would be useless. e.g. if a column only has a few values and these are present in all segments. Matching our predicate against all constraints would be expensive, so we must weed out poor constraints. We would do this by removing any constraint that overlapped more than 10% of other segments. Various heuristics would likely need to be discovered during development to make this work without resorting to manual commands.
> >
> > Note that all of this exclusion is happening in the executor, not the planner. That allows this form of exclusion to work with stable functions and parameters without problem.
>
> So, in that case, you've totally undercut the planner. All the steps up the tree would be based on the stats for scanning the entire table but you've only returned part of it.

Answered already. Not an issue solely for this particular approach.

> To solve that, you could put the min/max values in a catalog somewhere. For a 16 TB table, that's 16000 min/max values. That's pretty expensive.

We agreed the partitioning could and would be adjustable, so the number need not be 16000. The number of partitions we can support needs to be fairly high to support a wide range of applications. Requests for 1-2,000 partitions seem fairly typical to me, and we need to be able to handle that, however we do partitioning.
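For concreteness, one possible shape for that catalog; the name, the layout, and the idea of storing bounds as text are all invented for illustration, not part of the proposal:

    -- Hypothetical: one row per (segment, column), populated when a VACUUM
    -- finds the segment to be effectively read-only.
    CREATE TABLE pg_segment_bounds (
        relid   oid,
        segno   integer,    -- 1 GB segment number within the relation
        attnum  smallint,
        minval  text,       -- observed boundary values for that column
        maxval  text,
        PRIMARY KEY (relid, segno, attnum)
    );

    -- A scan with "order_date = DATE '2008-01-10'" could skip every segment
    -- whose recorded range cannot contain the value (attnum 2 is assumed to
    -- be order_date here); this query lists the excludable segments:
    SELECT segno
    FROM pg_segment_bounds
    WHERE relid = 'orders'::regclass
      AND attnum = 2
      AND (minval::date > DATE '2008-01-10' OR maxval::date < DATE '2008-01-10');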
> And how do we handle making things non-read-only? You'd have to wait for all current queries to finish... that could take a long time.

No, I explained that it would not require locking or waiting. The existing queries would just not take advantage of the newly read-only state until they complete.

> How do we handle crashes during setting of min/max? I presume it needs WAL support.

It's in a catalog table, as we've said. Handled.

> What happens if the free space map gives us a free block in a read-only segment? Then we need to look at concurrency and transactionality of min/max values, don't we?

I discussed that and explained how it would be handled.

> > Visibility Map
> > --------------
>
> [snip]
>
> > No dynamic shared memory cache is required because any concurrent changes to the table would be ignored by a Scan anyway, so it doesn't matter if an INSERT, UPDATE or DELETE occurs while we are scanning. Any new scans that start will attempt to lock the table and then perform a rel cache check before continuing. So the visibility will be set correctly for *that* scan at least.
>
> Is that true in the presence of READ COMMITTED?

Yes, because READ COMMITTED has nothing to do with it. The SVM would be refreshed via the relcache invalidation mechanism, which gets checked every time we obtain a lock we did not already hold. In the case of a table with bits set in the SVM we would need to recheck that each time we try to obtain a lock, even if we already hold it.

> > In most cases the visibility map can be summarised as a single boolean to show whether any 100% visible segments exist. That makes accessing the map very cheap in the common, unset case.
>
> What do we do in failure scenarios? Does this go into WAL?

Yes.

> Is it replicated via log shipping techniques?

Yes.

> > Setting the Visibility Map
> > --------------------------
> >
> > VACUUM would scan all READ_WRITE_ALLOWED segments and mark some of them as EFFECTIVE_READ_ONLY if 100% of the remaining rows are visible to all current backends. Marking the map will cause a rel cache invalidation.
>
> Argh. Remember, we're talking about 16 TB tables here. With the best of tuning, that takes all day. So, day 1, load data; day 2, VACUUM it? With the existing partitioning approach, there's no such thing.

I would differ with you there. The first query after a large COPY will be forced to set the hint bits and re-write the table. If you just loaded 16TB then you'd have to rewrite 16TB, which as you point out can take a while. The VACUUM will kick in automatically following a large load, and either it or a query will perform the re-writing. For a 16TB table with declared partitions, each one of those tuples would need to have their partition calculated prior to insertion. I think the cost of post-assessing the partition boundaries is the same or cheaper than the cost of pre-assessing the partition location.

> > We would never mark the highest numbered segment EFFECTIVE_READ_ONLY, since it is likely to expand when INSERTs occur. This is also a useful way of excluding smaller tables from the overheads involved for this feature.
>
> What about concurrent updates or deletes -- during VACUUM? Surely we don't have to vacuum full? Concurrent updates and deletes could change the partition boundaries and that will impact on the accuracy of segment exclusion (I do not think it leads to data corruption but it might lead to the inclusion of a segment that shouldn't have been included).
> That leads to unpredictable behaviour.

The VACUUM would need to set a read-only-pending flag. After the VACUUM has finished, any update or delete would clear the flag. As you say, it's possible that concurrent updates and deletes could have changed rows in blocks behind the first VACUUM scan. After the VACUUM we re-scan a read-only-pending segment to derive boundary values. During this scan, if we find a change made by a transaction after the VACUUM's xmin then we would clear the flag. So concurrent changes of any kind would be caught, and we don't need locks or VACUUM FULL.

> > In an insert-only table this would mean that only the last 1 or 2 segments would be read-write, so VACUUM would always run in a bounded time, no matter how large the table.
>
> You said elsewhere that unload was low cost. What about segments which cross the unload boundary? Half of their data needs to be removed, say. That sounds expensive. And they'll have to be vacuumed.

Well, yes, but if you wanted this to be properly optimised you would need to delete using the boundary values of the partition, so it's effectively equivalent to deleting a declared partition.

> With the existing partitioning, we could just truncate a partition. This seems really expensive in comparison.

I explained it would be possible to do that if it was required.

> > There would be additional complexity in selectivity estimation and plan costing. The above proposal allows dynamic segment exclusion, which cannot be assessed at planning time anyway, so suggestions welcome...
>
> As I've noted elsewhere, you cannot just undermine the planner like this. Your plans will be totally bogus. I hope you're not proposing that. If you're proposing tracking information for each segment then I think there are problems there too.

Discussed already; not an issue related solely to dynamic partitioning.

> > Comparison with other Partitioning approaches
> > ---------------------------------------------
> >
> > Once I realised this was possible in a fairly automatic way, I've tried hard to keep away from manual overrides, commands and new DDL.
> >
> > Declarative partitioning is a big overhead, though worth it for large databases. No overhead is *much* better though.
>
> I think you're making this out to be harder than it is. Dealing with lots of data is difficult; issuing one DDL command is not such a big deal.

I agree, that's not really the hard part.

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
On Fri, 2008-01-11 at 10:25 +0100, Gavin Sherry wrote:

> > Of course. It's an identical situation for both. Regrettably, none of your comments about dynamic partitioning and planning were accurate as a result.
>
> That's not true. We will still have planning drive the partition selection when the predicate is immutable, thus having more accurate plans.

Not really.

The planner already evaluates stable functions at plan time to estimate selectivity against statistics. It can do the same here.

The boundary values can't be completely trusted at plan time because they are dynamic, but they're at least as accurate as ANALYZE statistics (and probably derived at identical times), so they can be used as estimates. So I don't see any reason to imagine the plans will be badly adrift.

We're back to saying that if the visibility map is volatile, then SE won't help you much. I agree with that and haven't argued otherwise. Does saying it make us throw away SE? No; at least, not yet and not for that reason.

SE does what I was looking for it to do, but doesn't do all of what you'd like to achieve with partitioning, because we're looking at different use cases. I'm sure you'd agree that all large databases are not the same and that they can have very different requirements. I'd characterise our recent positions on this as: I've been focused on archival requirements, whereas you've been focused on data warehousing.

The difference really lies in how much activity, and of what kind, occurs on a table. You don't unload and reload archives regularly, nor do you perform random updates against the whole table. I'm sure we'd quickly agree that many of the challenges you've seen recently at Greenplum would not be possible with core Postgres, so I'm not really trying too hard to compete.

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
On Mon, 2008-01-07 at 12:53 -0800, Ron Mayer wrote:

> Is my understanding right that these Segment Visibility Maps could help this case as well?

Yes, I think so.

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
> I've kept a list of requests for improvement that I can share with you; I've always been loath to publish a list of bad points.

I think it would help understand the proposal if you also present the shortcomings. When you presented the positive and negative points, the negative list did look intentionally short :-) This imho provokes negative replies; the average hackers that reply are no dummies either.

Some of the issues I see with the proposal made so far:

- segment boundaries will imho sometimes be too fuzzy for a reasonably short segment-describing where clause
- fault isolation
- huge global indexes (index create/reorg needs to repeatedly sort data that is long static)
- unpredictability
- when you delete large parts of the table you want that to be instantaneous and cause little work on the db side
- partitioning that also partitions the indexes; this is imho a must-have for huge non-static tables
- when the user tweaks segment size, this is already declarative
- some indexes are only needed for recent data (a partial index would need to slide == recreate)

The whole subject is imho very interesting, and I expect more feedback after 8.3 is out. I am also in the declarative camp. In my projects the partitioning is the developer/designer's responsibility, and thus all add/drop partition tasks are automated; no DBA.

Needless to say, all existing declarative implementations lack some of the essential features on the implementation side, e.g.:

- use the btree order of partitions in plans that need more than one partition
- a useful syntax that allows automatic creation/exclusion of partitions (e.g. each month of a timestamp in one partition), e.g. partition 'hugetable_'||to_char(ts, 'YYYYMM') with extend(ts, year to month)
- unique indexes, equally partitioned like the table, that don't contain the partitioning column[s]
- some lack expression-based partitioning
- some lack instantaneous attach using a prefilled, preindexed table
- some lack detach
- some need a separate tablespace per partition

Andreas
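To show the sort of automatic creation syntax that second list is asking for, an invented declaration; this is not PostgreSQL syntax, nor the grammar Gavin and Jeff will propose, only an illustration of the idea:

    -- One partition per calendar month of ts, created automatically as
    -- data arrives and named from the partition key:
    CREATE TABLE hugetable (
        id      bigint,
        ts      timestamp,
        payload text
    )
    PARTITION BY RANGE (ts)
        EVERY (INTERVAL '1 month')
        NAMED ('hugetable_' || to_char(ts, 'YYYYMM'));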
On Fri, 2008-01-11 at 17:31 +0100, Zeugswetter Andreas ADI SD wrote:

> > I've kept a list of requests for improvement that I can share with you; I've always been loath to publish a list of bad points.

The "list of bad points" doesn't refer to shortcomings in my proposal, which I would never hide. It refers to a list of general requirements for improvements to partitioning that I have accumulated over a number of years and will publish as soon as I've retyped it. It's not a secret; it all comes from things discussed on list at various points.

> I think it would help understand the proposal if you also present the shortcomings. When you presented the positive and negative points, the negative list did look intentionally short :-)

If I am aware of a downside in any proposal, I always identify it. That's why the damn things are so long. I design many things, but only post some of them. The proposals I do post always have more positive than negative points, otherwise I wouldn't waste anybody's time.

I regularly get comments on my proposal style. First, that my proposals are too long and need a summary. Next, that I don't explain myself enough and need to stop leaving gaps that have people assume I've not thought it through thoroughly. So I try to put the summary in for some people and the detail for others.

For this proposal I identified the target use case at the very top, and IMHO it does meet the needs of that very well, as many people have agreed. It doesn't do everything that everybody wants, because that list is very long indeed, probably man-years of effort, all things considered. We won't cross that gap in a single step.

I'm working on a revised version which will include all of the comments and points that people have made, about 20 different topics in all from my notes. In progress here: http://developer.postgresql.org/index.php/Segment_Exclusion

> This imho provokes negative replies; the average hackers that reply are no dummies either.

I post to -hackers because I know there's no dummies here. If I've provoked negative replies, I'm happy to apologise to all.

Thanks for your candid thoughts,

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
On Fri, Jan 11, 2008 at 11:49:50AM +0000, Simon Riggs wrote:

> On Fri, 2008-01-11 at 10:25 +0100, Gavin Sherry wrote:
> > > Of course. It's an identical situation for both. Regrettably, none of your comments about dynamic partitioning and planning were accurate as a result.
> >
> > That's not true. We will still have planning drive the partition selection when the predicate is immutable, thus having more accurate plans.
>
> Not really.
>
> The planner already evaluates stable functions at plan time to estimate selectivity against statistics. It can do the same here.
>
> The boundary values can't be completely trusted at plan time because they are dynamic, but they're at least as accurate as ANALYZE statistics (and probably derived at identical times), so they can be used as estimates. So I don't see any reason to imagine the plans will be badly adrift.

Okay, it's good that you want the planner to look at those. Did you consider the point I made about the sheer amount of data the planner would have to consider for large cases?

> We're back to saying that if the visibility map is volatile, then SE won't help you much. I agree with that and haven't argued otherwise. Does saying it make us throw away SE? No; at least, not yet and not for that reason.

Yes, I'm not against SE; I just think that only having it would see a serious regression for larger users. Personally, I think SE would be a great idea for append-only tables, since it removes the thing I'm most worried about with it: the need to vacuum to 'turn it on'.

> SE does what I was looking for it to do, but doesn't do all of what you'd like to achieve with partitioning, because we're looking at different use cases. I'm sure you'd agree that all large databases are not the same and that they can have very different requirements. I'd characterise our recent positions on this as: I've been focused on archival requirements, whereas you've been focused on data warehousing.

I think that sums it up, although I'd also say that declarative partitioning is suitable for all those with largish amounts of data which they know how they want stored. This points to another case that SE suits: those who don't know how, or (maybe more importantly) don't care, to manage their data.

I'll go back to what I said above. SE looks like a good performance boost for archival read-only data. If we tighten up the definitions of how some tables can be used -- append only -- then we can remove the vacuum requirement and also change other characteristics of the storage. For example, reduced visibility information, compression, etc. These are hot topics for people with that kind of data.

> The difference really lies in how much activity, and of what kind, occurs on a table. You don't unload and reload archives regularly, nor do you

I couldn't agree more.

> perform random updates against the whole table. I'm sure we'd quickly agree that many of the challenges you've seen recently at Greenplum would not be possible with core Postgres, so I'm not really trying too hard to compete.

There we diverge. Yes, Greenplum produces systems for very large amounts of data, petabyte range in fact. However, the architecture -- individual postmasters on a single CPU with their own storage -- means that we see the problems of non-distributed database users when we look at data at the node level.
This is why I say that VACUUMing such systems, under the SE model, after you do a load of data is just impossible (I know that after time the cost of vacuum will stabilise, but there's always the initial data load).

Thanks,

Gavin
On Fri, 2008-01-11 at 20:03 +0100, Gavin Sherry wrote:

> Okay, it's good that you want the planner to look at those. Did you consider the point I made about the sheer amount of data the planner would have to consider for large cases?

Sorry, thought I had somewhere in all those emails...

If you really do have 16TB of data and it's in the use case of mostly read-only, volatile in the last portion of the table, then it would be straightforward to increase segment size.

Whatever form of partitioning we go for, we do need to allow 1-2,000 partitions fairly easily. We can't sustain a sequential method of applying the rules, which gives O(n) behaviour. We need some form of indexing/tree-search mechanism that gives roughly O(log N) behaviour.

> > We're back to saying that if the visibility map is volatile, then SE won't help you much. I agree with that and haven't argued otherwise. Does saying it make us throw away SE? No; at least, not yet and not for that reason.
>
> Yes, I'm not against SE; I just think that only having it would see a serious regression for larger users.

If we do find regressions, then I'd suggest we look at ways of turning it on/off automatically. We can keep track of the volatility in the map. We can also have an off switch if you're really against it.

But I'm still skeptical. I think we can sustain the last few segments in a table being volatile, as long as the greater proportion of segments are not. The volatile part of the table is the subject of most of the queries anyway, so excluding it from seq scans isn't that important. Index access to historical data doesn't slow down whether or not you have a volatile map.

> Personally, I think SE would be a great idea for append-only tables, since it removes the thing I'm most worried about with it: the need to vacuum to 'turn it on'.

You need to VACUUM anyway, so that's no problem. Sure, you have to scan the table to derive the boundary values, but that's one scan that saves many in the future.

> I'll go back to what I said above. SE looks like a good performance boost for archival read-only data. If we tighten up the definitions of how some tables can be used -- append only -- then we can remove the vacuum requirement and also change other characteristics of the storage. For example, reduced visibility information, compression, etc. These are hot topics for people with that kind of data.

SE isn't aimed solely at INSERT-only data. It's much wider than that. An ERP/SCM type application would easily benefit, say where orders are received and processed within a relatively short period, but order data needs to be maintained online for 2+ years. Anything where the table grows either by date, or by increasing key, and has a read-only "tail". That's a lot of applications and a ton of data. It might not apply to the "Customer" table in that app, but it will apply to Orders, OrderItem etc.

We can compress older read-only data as a way of shrinking file size. That's way easier, plus it applies to more real-world cases than trying to remove all the visibility stuff. The insert-only people will still be able to take advantage of compression, but the more general-purpose people will never be able to make use of the visibility-removal code.

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
Simon Riggs wrote:

> Happy New Year, everybody.
>
> This proposal follows on from previous thinking about partitioning, where I've taken up Andrew Sullivan's suggestion to re-examine the current partitioning concept of using tables as partitions. So I've come up with an alternative concept to allow us to discuss the particular merits of each. ISTM that this new proposal has considerable potential.

I've been lurking and reading a huge number of threads on partitioning. I see that postgresql is likely to give the user lots of knobs to define partitions very flexibly, which is a good thing for things like sales region reports etc.

All that to say, I hope some form of this very automatic tuning makes it in. This automatic option would provide a benefit (not perfect, but improved) for a significant set of use cases. Even better, it is trivial to set up, though I would want a knob for the underlying partition sizes; 1GB feels a bit too big for some situations.

Even expensive databases have found, I think, that there is a cost to administrative complexity. In many cases someone may not be ready to go down the declarative path, but may be open to letting the system take some optimizing approaches.

What I especially like is the ability to later mark things read-only. Perhaps a consultant who checks in on things monthly might mark partitions in that form.

Good luck though with it all, great to see this.

- August