Dynamic Partitioning using Segment Visibility Maps - Mailing list pgsql-hackers
From | Simon Riggs |
---|---|
Subject | Dynamic Partitioning using Segment Visibility Maps |
Date | |
Msg-id | 1199296574.7260.149.camel@ebony.site Whole thread Raw |
Responses |
Re: Dynamic Partitioning using Segment Visibility Maps
Re: Dynamic Partitioning using Segment Visibility Maps Re: Dynamic Partitioning using Segment Visibility Maps Re: Dynamic Partitioning using Segment Visibility Maps Re: Dynamic Partitioning using Segment Visibility Maps Re: Dynamic Partitioning using Segment Visibility Maps Re: Dynamic Partitioning using Segment Visibility Maps |
List | pgsql-hackers |
Happy New Year, everybody. This proposal follows on from previous thinking about partitioning, where I've taken up Andrew Sullivan's suggestion to re-examine the current partitioning concept of using tables as partitions. So I've come up with an alternative concept to allow us to discuss the particular merits of each. ISTM that this new proposal has considerable potential. Recap: Very Large Table Use Case -------------------------------- Many large tables have a regular access pattern: - new rows inserted frequently - some updates and deletes on the "recent" data - older data becomes effectively read-only - at some point data may be removed in bulk from the table Most tables vary on what "recent" means; some are insert only, many not. A typical example might be a large history table where the last 6 months data is updated, but after that the data has almost zero change. Partitioning relies on knowledge of the physical location of data to exclude sections of data from a scan. Currently the partitioning knowledge is provided by user declarations in the form of constraints on tables linked together by views or inheritance. We also currently do this in "constraint-first" fashion, i.e. we put the data where the constraints say we should, rather than deriving the constraints from the data. In-table Partitioning --------------------- In the common use-case there is a clear affinity between certain data columns and the physical placement of data in a table. This is true for columns such as LOG_DATE or TRANSACTION_DATE, but is also true for any columns where the id column is assigned by a Sequence. HOT helps this to remain true over time. Given the above use case, when there is an affinity between data values and data placement *and* we know that older data tends to become read only, we can then realise that certain physical sections of a table will become effectively read only. (My experience is that the larger the table, the less random the write pattern - whatever the guys that wrote TPC-C say). If we were able to keep track of which sections of a table are now read-only then we would be able to record information about the data in that section and use it to help solve queries. This is turning the current thinking on its head: we could derive the constraints from the data, rather than placing the data according to the constraints. That would be much more natural: load data into the table and have the system work out the rest. We can imagine that we do this within a single table. Imagine: split a table up into 100 sections, then record the min/max values of all tuples in those sections. When we perform a SeqScan we could skip over sections that could be proved would never satisfy the query predicate. Same thing as partitioning, but within the table. Currently tables are already broken up into 1GB file segments, so it seems fairly natural to pick that as our section size. That fits reasonably well with recommendations for ideal partition size. Segment Exclusion ----------------- After we note that a segment is read-only we can scan the segment and record min/max values for all columns. These are then "implicit constraints", which can then be used for segment exclusion in a similar manner as we do with the current constraint exclusion mechanism. The implicit constraints can then allow SeqScans to assess segment exclusion for each segment at execution time, i.e. a scan can skip a whole segment if constraints don't match. If a segment does not (yet) have implicit constraints it will always be scanned in full. This allows partitioning, synch scans and buffer recycling to work much better together than they currently do. Other scans types might also use segment exclusion, though this would only be useful for scans retrieving many rows, otherwise the overhead of segment exclusion might not be worthwhile. This would also allow a Merge Join to utilise exclusion for the inner plan at execution time. Instead of scanning the whole inner table plan from the start, we would be able to start the scan from the appropriate segment. This would require us to pass the current value of the outer plan down into the inner plan. The typical executor nodes on the inner plan would be a SortNode and below that a SeqScan. In that case the executor would need to pass the outer value from the MergeJoinNode down thru the SortNode to the SeqScan node. The SeqScan node could then perform partition exclusion, reducing the time for that scan and also reducing the time for the resulting sort. This sounds like it might be worth doing in normal cases also. It might turn out that the potentially applicable cases are already excluded during planning, I haven't thought about that aspect in enough detail yet. If we collect data for all columns then many of our implicit constraints would be useless. e.g. if a column only has a few values and these are present in all segments. Matching our predicate against all constraints would be expensive, so we must weed out poor constraints. We would do this by removing any constraint that overlapped more than 10% of other segments. Various heuristics would likely need to be discovered during development to make this work without resorting to manual commands. Note that all of this exclusion is happening in the executor, not the planner. That allows this form of exclusion to work with stable functions and parameters without problem. Noting which segments are read-only ----------------------------------- Everything so far has relied upon our ability to note which segments of a table are read-only. We could do this in two ways 1) have the system automatically keep track of non-changing data 2) have the DBA issue a command to "mark" a segment as read-only now Probably a combination of the two is better, so we have three states for segments - READ_WRITE_ALLOWED - EFFECTIVE_READ_ONLY (set by 1 only) - MARKED_READ_ONLY (set by 2 only) Having said that I want to concentrate on (1), though consider the other one also in case requested by reviewers. Visibility Map -------------- So *how* would we mark segments as read only? It turns out that attempting to do this is extremely similar to Heikki's Visibility Map idea, so I'll describe it in those terms, even though there are some differences. It also brings to mind Andrew Sullivan's read-only tuples concept. We would keep a dynamic visibility map at *segment* level, showing which segments have all rows as 100% visible. No freespace map data would be held at this level. Most everything about Heikki's Visibility Map idea holds, just that we have a bit for each segment in the table, rather than each block. The map is very small and hardly ever changes for a table, so we can cache it easily on each backend. If the map does change we can perform a relcache invalidation to make everybody re-read the map. No dynamic shared memory cache is required because any concurrent changes to the table would be ignored by a Scan anyway, so it doesn't matter if an INSERT, UPDATE or DELETE occurs while we are scanning. Any new scans that start will attempt to lock the table and then perform a rel cache check before continuing. So the visibility will be set correctly for *that* scan at least. In most cases the visibility map can be summarised as a single boolean to show whether any 100% visible segments exist. That makes accessing the map very cheap in the common, unset case. Setting the Visibility Map -------------------------- VACUUM would scan all READ_WRITE_ALLOWED segments and mark some of them as EFFECTIVE_READ_ONLY if 100% of the remaining rows are visible to all current backends. Marking the map will cause a rel cache invalidation. We would never mark the highest numbered segment EFFECTIVE_READ_ONLY, since it is likely to expand when INSERTs occur. This is also a useful way of excluding smaller tables from the overheads involved for this feature. In an insert-only table this would mean that only the last 1 or 2 segments would be read write, so VACUUM would always run in a bounded time, no matter how large the table. When would we attempt to set the visibility map? If we do this aggressively, then it will be quickly unset. If we wait too long then we might lose some of the benefits of this approach. First, we recognise that read only segments will change the autovacuum calculation - we simply exclude them entirely. This will mean that VACUUMs are run more regularly than they are now, but when they do run they will be quicker. (Note that currently, VACUUMs on a growing table get further and further apart as the table grows, assuming constant write rate). My feeling is that VACUUMs are sufficiently infrequent that we should set the visibility map aggressively after each VACUUM. If the map is quickly unset, then we wait some time before trying again. Unsetting seems to be relatively low contention (see below), so that seems acceptable. We would then re-scan newly marked segments to derive min/max values for implicit constraints, performed by autovacuum worker processes. The best way seems to be that an ANALYZE command would automatically detect that a segment has been recently set EFFECTIVE_READ_ONLY and perform an all-rows scan rather than just a sample. (We would keep min/max values only for datatypes with valid btree sort operators, just as ANALYZE already does). The data for this can be stored as part of the stats histogram for each column. pg_class.reltuples would become an array of row counts in each segment, known accurate for read only segments. A later VACUUM will still be required to freeze the rows, though that can happen later when we hit the frozen limit, or earlier if we run an explicit VACUUM FREEZE. We would need to store the oldest xid for each segment, so we know when to kick off a FREEZE on that part of the table. We would do this by making pg_class.relfrozenxid an array also. This departs completely from the idea that the visibility map and the freespace map are related somehow. However, there is a change required in the FSM to ensure the proposed technique is stable and useful: If we scan a segment and it is 100% visible, if there is freespace in just one of the blocks in the segment then we will soon find that our visibility-bit will be set to off again very quickly. If there is more than 5% freespace total in the segment then we will not set EFFECTIVE_READ_ONLY, nor report blocks in a segment to the FSM. That way, small amounts of freespace won't destroy the benefits of segment exclusion. VACUUM currently overwrites FSM information, though if this was not the case then we would have to actively remove it for the newly read only segments. Perhaps that limit would be (100 - fillfactor)%, which is normally set according to a user's expectation of the number of updates likely on a table. Unsetting the visibility map ---------------------------- For EFFECTIVE_READ_ONLY, INSERTs, UPDATEs and DELETEs are still possible. For MARKED_READ_ONLY it would be explicitly refused. The relcache invalidation caused by making a segment EFFECTIVE_READ_ONLY can flush the rd_targBlock for the relation in each backend. Other INSERTs are not possible since all current code-paths either use extension or the FSM when the table is non-empty, and neither of those will cause a problem. Extension only touches last segment, which will never by EFFECTIVE_READ_ONLY. The FSM contains no entries for an EFFECTIVE_READ_ONLY segment. UPDATEs or DELETEs are possible on EFFECTIVE_READ_ONLY segments, but we don't expect it often. If this occurs, we will simply reset the visibility map for the table. This can be handled with a non-transactional overwrite - the bit always exists already, so the overwritten data is always same size. That's pessimistic, since it resets the state even if the UPDATE/DELETE aborts, but its better than holding the row locked until the transaction completes, which might prevent other things from happening. We probably also want VACUUM to clear up any aborted transactions anyway. The user may also wish to clear down very old data, so allowing DELETEs can ensure the user can still remove old data from the table. By carefully choosing the values to be deleted, a whole segment can be deleted and then returned to the FSM. Running a huge DELETE will likely perform a SeqScan that avoids all but the deleting data. The following VACUUM will also ignore the other segments, so removing older data seems fairly well optimised already, in comparison with now. It would be possible to optimise this to perform a segment-level truncate, but since the partitioning will already improve the DELETE and VACUUM, I'm not proposing that yet, if ever. SHARE locks currently write to data blocks, but since they represent transient state we do not need to unset either the partitioning or the visibility map. Overlap with Visibility Map Proposal ------------------------------------ This proposal offers many of the advantages of the earlier Visibility Map proposal, yet without major changes to heap structure. Since the segment-level visibility map is more granular it would only be 80% as effective as the more detailed block-level map. Having said that, the bookkeeping overheads will also be considerably reduced, so it does seem a good joint approach. It also allows freezing to be handled fully, which was a problem with the original visibility map proposal. WAL logging visibility map changes is also now much easier. My thoughts have been targeted directly at partitioning, yet I have to admit that this idea overlaps, and in my world view, replaces the Visibility Map proposal. I very much like the name Visibility Map though. I've re-read the earlier thread and agree with Jeff Davis that we *could* have multiple kinds of visibility map, but also agree with Heikki that it seems like too much work. It was never my intent to address both challenges in one proposal and this is just pure coincidence, or possibly just luck, depending upon how you like the proposal. If we do need to differentiate between the two proposals, we can refer to this one as the Segment Visibility Map (SVM). Index Scans ----------- Index-only scans would look at the Visibility Map, just as they would have done in the earlier proposal. It would be costly for an index scan to repeatedly check the visibility map, but probably worth it to do so. Other Changes ------------- We can handle select count(*) by scanning the non-100% visible segments of a table, then adding the stored counts for each segment to get a final total. Not sure if its really worth doing, but it does sound like an added bonus. There would be additional complexity in selectivity estimation and plan costing. The above proposal allows dynamic segment exclusion, which cannot be assessed at planning time anyway, so suggestions welcome... I think it would be prudent to have an override option on VACUUM to allow us to scan a table in full, checking rather than using the visibility map. This mode would be called VACUUM CHECK. None of the above blocks earlier proposals for read only tables, and the two features could work together very straightforwardly. Comparison with other Partitioning approaches --------------------------------------------- Once I realised this was possible in fairly automatic way, I've tried hard to keep away from manual overrides, commands and new DDL. Declarative partitioning is a big overhead, though worth it for large databases. No overhead is *much* better though. This approach to partitioning solves the following challenges - allows automated table extension, so works automatically with Slony - responds dynamically to changing data - allows stable functions, nested loop joins and parametrised queries - allows RI via SHARE locks - avoids the need for operator push-down through Append nodes - allows unique indexes - allows both global indexes (because we only have one table) - allows advanced planning/execution using read-only/visible data - works naturally with synchronous scans and buffer recycling All of the above are going to take considerably longer to do in any of the other ways I've come up with so far... And yes, this is a different conclusion to earlier discussions, particularly those in 2005 where I was still thinking in terms of declarative partitioning. Conclusions ----------- This technique would be useful for any table with historical data keyed by date or timestamp. It would also be useful for data where a time-of-insert component is implicit, such as many major entity tables where the object ids are assigned by a sequence. e.g. an Orders table with an OrderId as PK. Once all orders placed in a period have been shipped/resolved/closed then the segments will be marked read-only. Its not really going to change the code path much for small tables, yet seems likely to work reasonably well for large tables of all shapes and sizes. If a segment is being updated, we leave it alone, and maybe never actually set the visibility map at all. So overall, this idea seems to cover the main use case well, yet with only minimal impact on the existing use cases we support. As before, I will maintain this proposal on the PG developer Wiki, once we get down to detailed design. Like it? -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
pgsql-hackers by date: