Promise index tuples for UPSERT - Mailing list pgsql-hackers
From | Simon Riggs |
---|---|
Subject | Promise index tuples for UPSERT |
Date | |
Msg-id | CA+U5nMJK6A66TBmJrDSOw1UMZYLmxycUGJu1OUYXVwLGWFT_uA@mail.gmail.com Whole thread Raw |
Responses |
Re: Promise index tuples for UPSERT
Re: Promise index tuples for UPSERT |
List | pgsql-hackers |
Summary of algorithm to use "promise tuples" for concurrency control during UPSERT 1. Perform btree search to location of key, if it exists. a) If an unkilled index tuple exists, we decide this is an UPDATE and drop straight thru to step 2 b) If it does not exist, insert a "promise" tuple into unique index(s) - marked with the xid of the inserting transaction, but using the key. This happens while the page is locked, so it is not possible to insert a second promise tuple concurrently. Record the btree blockid on the index scan and move to step 3 When later insert scans see the promise tuple they perform XactLockTableWait() and when they get control they look again for the key. If they find a promise tuple with an aborted xid they replace that value with their own xid and continue as a). Otherwise b). 2. Find existing heap tuple Find heap tuple. Check it is actually valid. If not, go back to (1), kill the prior tuple and follow 1b) path If it is valid, perform heap_update as normal. 3. Insert new heap tuple Perform heap_insert Re-find index tuple using the btree blockid recorded at step 1; this may require moving right until we find the actual value we are looking for, so block splits don't negatively affect this approach. Once re-found we change the index tuple from a promise tuple to a normal index tuple, by setting tid and removing promise flag. Tuple remains same length because the value was known when promise tuple inserted, so this is an inplace update. Insert other index values normally. If a transaction that inserted a promise tuple dies, how is that cleaned up? Any user that sees a dead promise tuple with a value they want will clean it up. Dead promise tuples can be removed as needed, just as killed tuples currently are. VACUUM can remove dead transactions from index as it scans, no problems. Index bloat is less of a problem than with normal inserts since there are additional ways of removing promise tuples. Only one index tuple at a time can have a specific value, so we would actually reduce heap bloat caused by concurrent inserts. It's possible we find existing rows that we can't see according to our snapshot. That is handled in exactly the same as the way we treat UPDATEs. There is a potential loop here in that we might find an index tuple, follow it, find out the tuple is actually dead then return to insert a promise tuple, only to find that someone else just did that and we have to wait/re-follow the link to update the new tuple. That is an extremely unlikely race condition, though possible; there is no heavy locking to prevent that in this approach. In principle deadlocks are possible from that, but that is not any different from the current case where people that insert same key at same time might cause deadlocks. Its not a common application pattern and not something we should be protecting against. All of this is only needed for unique indexes. It can handled by a new API call for acquire_value_lock() and release_value_lock(), or similar. Advantages * We don't do anything weird in the heap - if this breaks, data is not corrupt * There is no heap bloat or index bloat above existing levels * The approach leverages existing mechanism for transaction waiting * Optimistic approach to value locking will improve performance over heavy weight locking Disadvantages * Not written yet - <1 month to code, still possible for 9.5, doesn't look hard Other stated possible disadvantages * Violates existing invariant that every index tuple has a corresponding heap tuple, which goes back to the Berkeley days. Currently, we always create heap tuples first, and physically delete them last. [Why is that a problem? Could be, but why?] ("Deleting them last" implies something is being touched in that regard by this suggestion. I'm not sure what you are referring to) * Relatedly, concern about breaking VACUUM. VACUUMing of btrees occurs with a list of TIDs to kill in index, built from the heap. Current bulk delete callback cannot be used for this. Super-exclusive lock is needed to delete tuples in btree page (i.e. VACUUM). Normally skipping of LockBufferForCleanup() (as added by bbb6e559) works okay in heap because it tends to imply that list of tids won't be bulk deleted in index, where we currently cannot skip for failure to quickly acquire super exclusive lock. So this could make the problem addressed by bbb6e559 return to some extent. [Don't see any problems; just test the xid as we scan a promise tuple and remove it if needed] "Index-only" bloat becomes a possibility. Implications are not well understood. [FUD, no reason to believe there is a problem.] We have to re-find any items using an ordinary index scan, not a tid. Must match our xid to that. [Explained above, easy and efficient.] Doesn't have a strategy for dealing with unprincipled deadlocks, at least at the moment. Presumably some aspects of #2 could be adopted here. [FUD. Unprincipled deadlock still not properly explained as yet. No way of telling whether it will happen with this approach, or not]. Comments please. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
pgsql-hackers by date: