Thread: Latest plans for Utilities with HOT
Overview -------- CREATE INDEX, CREATE INDEX CONCURRENTLY and VACUUM FULL all need some adaptation to work correctly with HOT. [This summary and proposal supercedes all previous proposals by me regarding utilities with HOT] The Problem ----------- With HOT, CREATE INDEX may find tuples that are HEAP_ONLY_TUPLE, yet have an index value(s) that differs from the indexable value of the tuple version which is at the root of the hot chain. e.g. we have a simple hot chain that looks like this on one page: tup1(attr1=1, attr2=A)[ROOT] -> tup2(attr1=1, attr2=B)[HOT) there is already an index(attr1), and we are trying to build an index on attr2. Index(attr2) will to point to both tup1 and tup2. So by the end of the build we need to have inserted an additional index pointer into index(attr1) (yes, the other index) for tup2 and also broken the link between tup1->tup2. This operation is known as chilling, or de-HOT-ifying. If we don't do this, we will have an index inconsistency. If we attempt to chill a tuple concurrently, we also get the possibility of a concurrent index scan following multiple paths to the same tuple. This effects all types of indexes, not just b-trees. Solutions --------- There are three basic solutions: LOCK, UPDATE & WAIT and MARK 1. LOCK - Use some form of locking to prevent concurrent index scans while we modify the various tuple versions and fiddle with the index entries. This has some sub-options: a) AccessExclusiveLock on main relation prevents all index accesses b) Lock all of the indexes with an AccessExclusiveLock, in a consistent order, so that no concurrent index scans are possible. This would be a non-transactional lock, so could be dropped before end of transaction - we're not locking the data we're just locking the physical access mechanism. 2. UPDATE & WAIT - From Hannu - Perform cold-null-UPDATEs on each tuple, which can happen concurrently. This can happen concurrently, but then requires us to wait until we are older than all non-utility transactions, so that these UPDATEs are visible to all. 3. MARK - From Heikki - verbatim: One solution I thought of is to add another flag to heap tuples, CHILL_IN_PROGRESS. To chill a tuple, you would: 1. Mark heap tuple with CHILL_IN_PROGRESS 2. Insert missing index entries 3. Clear CHILL_IN_PROGRESS and HEAP_ONLY_TUPLE flags Index scans would ignore tuples with CHILL_IN_PROGRESS and directly pointed to from the index. That way if we crash in the middle of step 2, scans and updates would work normally after replay, as if the index entries weren't there. CREATE INDEX would have to fail if there's any CHILL_IN_PROGRESS tuples, because we wouldn't know which index entries need to be inserted; some might already be there. To clear the CHILL_IN_PROGRESS flag, a vacuum would be needed. Vacuum would remove all index entries for those tuples, but instead of removing the heap tuple in the 2nd scan it would just clear the CHILL_IN_PROGRESS flag, bringing us back to where we started. Thoughts -------- - CREATE INDEX CONCURRENTLY can use UPDATE & WAIT or MARK - VACUUM FULL could use any of the solutions. It does currently hold AccessExclusiveLock. UPDATE & WAIT would be very good. - CREATE INDEX has the most problems, but MARK solves most of them. Multiple index builds may need to chill different sets of tuples, but if there is any overlap then the second build will fail when it sees the CHILL_IN_PROGRESS flag that the other has set. So we can no longer achieve multiple concurrent index builds without great care. It might be possible to clean up old CHILL_IN_PROGRESS flags as we go, but if we did that we'd need to make sure concurrent index scans never pass each other, otherwise one index build would clean up the others flags. So we have two choices: a) we continue to allow concurrent index builds, but we throw an ERROR when an index build finds a CHILL_IN_PROGRESS flag b) we clean up CHILL_IN_PROGRESS flags as we go, but we give up the ability to perform concurrent index builds Proposal -------- - CREATE INDEX can use the MARK solution as part of the first scan (cos there's only one), setting flags called HEAP_CHILL_IN_PROGRESS. As the index build scan proceeds we insert records directly into each index using the standard WAL-logged insertion path. No spool file is built up while we do this. Setting HEAP_CHILL_IN_PROGRESS must must also write WAL records and dirty the block, since we might fail after we have performed an index insert but before we remove the mark. This could mean we'd end up writing lots of database blocks when we do CREATE INDEX. (Note here that we use WAL to chill, even though the main index build may still skip writing WAL as it does now, if archive_command is not set). Note we might fail to build an index because of a uniqueness violation that occurs *during* the tuplesort run. So we must mark the tuple, insert the indexes, clean the tuple and then offer the tuple to the index callback. We pick option a): continue to allow concurrent index builds, but we throw an ERROR when an index build finds a HEAP_CHILL_IN_PROGRESS flag. That's the optimistic approach. Building indexes concurrently only usually happens after a restore/large load, which will be when there are no HOT tuples or flags set anyway. This means we also need to modify the VACUUM process to clean up after a failed index build. VACUUM will build a second array of tids that can be used to perform removal of HEAP_CHILL_IN_PROGRESS during lazy_scan_index() [[perhaps with uncool_tuple()? :-) ]]. We use a second array because we need only remove HEAP_CHILL_IN_PROGRESS flags when a tuple will not be removed by the normal vacuum process, so there is by definition no overlap between the two lists. We use an array of tids to allow us to reuse all the existing code for heap/index cleaning. - CREATE INDEX CONCURRENTLY can use the MARK solution as part of the second scan. After the first scan, the new index will be visible to other transactions and no new HOT tuples will be created that could effect the new index. So chilling as part of the second scan will ensure that we catch all HOT tuples. - VACUUM FULL - The best solution, for now, is to make VACUUM FULL perform a reindex on all indexes on the table. Chilling may require us to modify considerably more index entries than previously. UPDATE & WAIT would be very good, but probably should wait for the next release now, since we now have changes to make to 4 utilities. Are we cool with that? -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Simon Riggs wrote:>> - VACUUM FULL - The best solution, for now, is to make VACUUM FULL> perform a reindex on all indexeson the table. Chilling may require us> to modify considerably more index entries than previously. UPDATE & WAIT> wouldbe very good, but probably should wait for the next release now,> since we now have changes to make to 4 utilities. On my way back home, I was thinking about VACUUM FULL. Is there really a problem with VACUUM FULL and HOT ? VF moves tuple chains in the table and does that while holding AccessExclusive lock on the table. If we prune the HOT-update chains (which we anyways do for lazy vacuum), we can guarantee that the entire HOT-update chain will be moved, if any of the tuples in the chain is moved. Also, when VF moves a tuple chain appropriate index entries are inserted into all the indexes. If we don't carry the HEAP_HOT_UPDATED or HEAP_ONLY_TUPLE flags to the moved location, the HOT-update chain will be broken, but that should not have any correctness problems. If VACUUM FULL crashes before completeing its operation, the original HOT-update chain would still remain valid and the new tuples will be discarded. Is there something I am missing ? Thanks, Pavan EnterpriseDB http://www.enterprisedb.com
On Mon, 2007-03-05 at 21:39 +0530, Pavan Deolasee wrote: > Simon Riggs wrote: > > > > - VACUUM FULL - The best solution, for now, is to make VACUUM FULL > > perform a reindex on all indexes on the table. Chilling may require us > > to modify considerably more index entries than previously. UPDATE & WAIT > > would be very good, but probably should wait for the next release now, > > since we now have changes to make to 4 utilities. > > On my way back home, I was thinking about VACUUM FULL. Is there really > a problem with VACUUM FULL and HOT ? VF moves tuple chains in the > table and does that while holding AccessExclusive lock on the table. No problem, just additional work for little benefit. > If we prune the HOT-update chains (which we anyways do for lazy vacuum), > we can guarantee that the entire HOT-update chain will be moved, if > any of the tuples in the chain is moved. Currently each tuple is moved individually. You'd need to inspect the whole HOT chain on a page, calculate space for that and then try to move them all in one go. I was originally thinking that would be a problem, but its not so bad - but it may cause us to end repair_frag() earlier than we otherwise would depending upon the game of Tetris plays out. Thats harder than it sounds. My concern is that VACUUM FULL code is fairly ugly and doing this might introduce a bug into a rarely used piece of code that we don't pick up. I would not choose that path myself, but we can go there if the consensus is that doing a reindex would be the wrong way to go. Reindex will be fast to code and much more likely to be bug-free. > Also, when VF moves a tuple > chain appropriate index entries are inserted into all the indexes. > If we don't carry the HEAP_HOT_UPDATED or HEAP_ONLY_TUPLE flags > to the moved location, the HOT-update chain will be broken, but > that should not have any correctness problems. If VACUUM FULL > crashes before completeing its operation, the original HOT-update > chain would still remain valid and the new tuples will be discarded. You mean all moves will be cold? So VACUUM FULL will actually bloat the indexes? I hope I've misunderstood you. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Simon Riggs wrote: > On Mon, 2007-03-05 at 21:39 +0530, Pavan Deolasee wrote: > > Currently each tuple is moved individually. You'd need to inspect the > whole HOT chain on a page, calculate space for that and then try to move > them all in one go. I was originally thinking that would be a problem, > but its not so bad - but it may cause us to end repair_frag() earlier > than we otherwise would depending upon the game of Tetris plays out. > > Umm.. I still need to look deeper to understand the VACUUM FULL code, but ISTM that we can move tuple chains just the way its done today, without bothering to keep HOT-update chains intact. The tuples may actually got into different pages and have equal number of index entries. To my mind, this is not such a big problem because we shouldn't expect too many HOT-update chains while running VACUUM FULL. Isn't that true ? > Thats harder than it sounds. My concern is that VACUUM FULL code is > fairly ugly and doing this might introduce a bug into a rarely used > piece of code that we don't pick up. I would not choose that path > myself, but we can go there if the consensus is that doing a reindex > would be the wrong way to go. Reindex will be fast to code and much more > likely to be bug-free. > > Oh I agree. VACUUM FULL code is hard to assimilate. I am hoping that we can make HOT work fine with VACUUM FULL without too many changes. If thats not the case, I won't push for it. > You mean all moves will be cold? > > So VACUUM FULL will actually bloat the indexes? I hope I've > misunderstood you. > > Yes, thats what I meant. But are we really worried about bloating indexes while VACUUM FULL ? Any other tuple movements would anyway cause index inserts. And I don't really expect too many tuple chains since VACUUM FULL runs with AccessExclusive lock on the table. Thanks, Pavan
On Mon, 2007-03-05 at 22:25 +0530, Pavan Deolasee wrote: > Simon Riggs wrote: > > On Mon, 2007-03-05 at 21:39 +0530, Pavan Deolasee wrote: > > > > Currently each tuple is moved individually. You'd need to inspect the > > whole HOT chain on a page, calculate space for that and then try to move > > them all in one go. I was originally thinking that would be a problem, > > but its not so bad - but it may cause us to end repair_frag() earlier > > than we otherwise would depending upon the game of Tetris plays out. > > > > > Umm.. I still need to look deeper to understand the VACUUM FULL code, > but ISTM > that we can move tuple chains just the way its done today, without > bothering to keep > HOT-update chains intact. The tuples may actually got into different > pages and have > equal number of index entries. To my mind, this is not such a big > problem because > we shouldn't expect too many HOT-update chains while running VACUUM FULL. > Isn't that true ? Well, its true enough to be a great argument. So what you're saying is: we do nothing and it just works. At least not too badly, and at very least: no worse than it does today. [Oh dear! I just finished writing prototype of VACUUM FULL-with-reindex when I read this, so either way it looks like nothing more needed on this utility. 1 down, 3 to go.] -- Simon Riggs EnterpriseDB http://www.enterprisedb.com