Thread: Hot Standby, deferred conflict resolution for cleanup records (v2)
I think I've found a better way of doing deferred conflict resolution for WAL cleanup records. (This does not check for conflicts at block level.)

When a cleanup record arrives, check *lock* conflicts to see who is accessing the relation about to be cleaned. If there are any lock conflicts, then wait, if requested. If we waited, re-check *lock* conflicts to see who is accessing the relation about to be cleaned.

While holding the lock, set latestRemovedXid for the relation (protected by the partition lock). Anyone acquiring a lock on a table should check the latestRemovedXid for the table and abort if their xmin is too old. This prevents new lockers from accessing a cleaned relation immediately after we decide to abort anyone looking at that table. (Anyone queuing for the existing locks would be caught by this.)

We then cancel the list of current lock conflicts, using the latestRemovedXid (if there is one) as a cross-check to see if we can avoid cancelling the query.

So if latestRemovedXid advances on a table you have locked, you will have your xmin re-checked. If you access a table that has been, or is about to be, cleaned, your xmin will be checked as well. Taken together, this means far fewer queries get cancelled, since we check on both relid and latestRemovedXid. Reasonably simple queries that take locks on a small number of relations at the start of their execution will continue processing for long periods, provided they do not access fast-changing relations.

In particular, IMHO, this will cure about 90% of the btree delete issue, since only users accessing a particularly busy index will need to cancel themselves. Since many longer-running queries don't use indexes at all, that trait alone will help queries survive longer.

We need to keep track of latestRemovedXids for various relations in shared memory. ISTM we can track the top 8 (say) most common relids per lock partition using a trivial LRU, and then keep a catch-all value for the others. That will allow us to track more than 100 relations without sweating too much. (A sketch of this bookkeeping is at the end of this mail.)

All the fuss is handled during hot standby, so if you choose not to use it, there is no impact.
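Here is that bookkeeping as a minimal, self-contained sketch. This is illustrative C, not a patch: the names (CleanupSlot, CleanupPartition, SetLatestRemovedXid, CheckRelationAccess) are invented, the real structures would live in shared memory protected by the lock partition locks, and the plain '>' xid comparisons stand in for wraparound-aware TransactionIdPrecedes().

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

typedef uint32_t TransactionId;
typedef uint32_t Oid;

#define NUM_LOCK_PARTITIONS 16
#define SLOTS_PER_PARTITION 8      /* "top 8 common relids" per partition */

typedef struct CleanupSlot
{
    Oid           relid;           /* 0 => slot unused */
    TransactionId latestRemovedXid;
    uint64_t      lastUsed;        /* trivial LRU counter */
} CleanupSlot;

typedef struct CleanupPartition
{
    CleanupSlot   slots[SLOTS_PER_PARTITION];
    TransactionId catchAllXid;     /* covers relids evicted from the LRU */
    uint64_t      clock;
} CleanupPartition;

/* In reality this lives in shared memory, one entry per lock partition,
 * protected by the corresponding partition lock. */
static CleanupPartition partitions[NUM_LOCK_PARTITIONS];

static int
partition_for(Oid relid)
{
    return relid % NUM_LOCK_PARTITIONS;
}

/* Redo side: record the latest xid whose row versions were removed. */
static void
SetLatestRemovedXid(Oid relid, TransactionId removedXid)
{
    CleanupPartition *p = &partitions[partition_for(relid)];
    CleanupSlot *victim = &p->slots[0];

    for (int i = 0; i < SLOTS_PER_PARTITION; i++)
    {
        CleanupSlot *s = &p->slots[i];
        if (s->relid == relid || s->relid == 0)
        {
            victim = s;
            break;
        }
        if (s->lastUsed < victim->lastUsed)
            victim = s;            /* least recently used so far */
    }

    if (victim->relid != relid)
    {
        /* Evicting an entry: fold its xid into the catch-all value so
         * the check stays conservative for untracked relations. */
        if (victim->relid != 0 && victim->latestRemovedXid > p->catchAllXid)
            p->catchAllXid = victim->latestRemovedXid;
        victim->relid = relid;
        victim->latestRemovedXid = 0;
    }
    if (removedXid > victim->latestRemovedXid)
        victim->latestRemovedXid = removedXid;
    victim->lastUsed = ++p->clock;
}

/* Locker side: called while acquiring a lock on a relation.
 * Returns false if the backend's xmin is too old and it must cancel. */
static bool
CheckRelationAccess(Oid relid, TransactionId myXmin)
{
    CleanupPartition *p = &partitions[partition_for(relid)];

    for (int i = 0; i < SLOTS_PER_PARTITION; i++)
    {
        CleanupSlot *s = &p->slots[i];
        if (s->relid == relid)
        {
            s->lastUsed = ++p->clock;
            return myXmin > s->latestRemovedXid;
        }
    }
    /* Not individually tracked: fall back to the catch-all value. */
    return myXmin > p->catchAllXid;
}

int main(void)
{
    SetLatestRemovedXid(16384, 1000);  /* cleanup removed xids <= 1000 */
    printf("xmin  900: %s\n", CheckRelationAccess(16384,  900) ? "ok" : "cancel");
    printf("xmin 1100: %s\n", CheckRelationAccess(16384, 1100) ? "ok" : "cancel");
    return 0;
}

With 16 lock partitions (the current NUM_LOCK_PARTITIONS) and 8 slots each, this tracks 128 relations individually, plus the catch-alls; that is where "more than 100 relations" comes from.

--
Simon Riggs
www.2ndQuadrant.com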
On Sat, Dec 12, 2009 at 3:06 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Anyone acquiring a lock on a table should check the latestRemovedXid for
> the table and abort if their xmin is too old. This prevents new lockers
> from accessing a cleaned relation immediately after we decide to abort
> anyone looking at that table. (Anyone queuing for the existing locks
> would be caught by this).

I fear given HOT pruning that this could mean no query can even get started against a busy table. It seems like you would have to start your transaction several times until you manage to get a lock on the busy table soon enough after taking the snapshot to not have missed any cleanups in the table. Or have I missed something that protects against that?

The bigger problem with this is that I don't see any way to tune this to have a safe replica. In the current system you can set max_standby_delay to 0 or -1 or whatever to completely disable killing off valid queries on the replica. In this setup you're going ahead with cleanup records which may or may not be safe, and then have no recourse if they turn out to conflict.

--
greg
Greg Stark wrote:
> On Sat, Dec 12, 2009 at 3:06 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Anyone acquiring a lock on a table should check the latestRemovedXid for
>> the table and abort if their xmin is too old. This prevents new lockers
>> from accessing a cleaned relation immediately after we decide to abort
>> anyone looking at that table. (Anyone queuing for the existing locks
>> would be caught by this).
>
> I fear given HOT pruning that this could mean no query can even get
> started against a busy table. It seems like you would have to start
> your transaction several times until you manage to get a lock on the
> busy table soon enough after taking the snapshot to not have missed
> any cleanups in the table. Or have I missed something that protects
> against that?

I presume max_standby_delay would still apply, and we would only use the new mechanism where we would otherwise outright kill a query.

> The bigger problem with this is that I don't see any way to tune this
> to have a safe replica.

Yeah, it's very opportunistic.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Mon, 2009-12-14 at 04:57 +0000, Greg Stark wrote:
> On Sat, Dec 12, 2009 at 3:06 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > Anyone acquiring a lock on a table should check the latestRemovedXid for
> > the table and abort if their xmin is too old. This prevents new lockers
> > from accessing a cleaned relation immediately after we decide to abort
> > anyone looking at that table. (Anyone queuing for the existing locks
> > would be caught by this).
>
> I fear given HOT pruning that this could mean no query can even get
> started against a busy table. It seems like you would have to start
> your transaction several times until you manage to get a lock on the
> busy table soon enough after taking the snapshot to not have missed
> any cleanups in the table.

The proposal improves this situation. Right now we would cancel all queries, not just the ones looking at the busy table.

> Or have I missed something that protects
> against that?

At your suggestion, I previously added a feature, described in the docs:

"It is also possible to set vacuum_defer_cleanup_age on the primary to defer the cleanup of records by autovacuum, vacuum and HOT. This may allow more time for queries to execute before they are cancelled on the standby, without the need for setting a high max_standby_delay."

vacuum_defer_cleanup_age delays globalxmin by a fixed number of xids. That is fairly crude, so the proposal here adds finer-grained conflict resolution; please read on.

> The bigger problem with this is that I don't see any way to tune this
> to have a safe replica. In the current system you can set
> max_standby_delay to 0 or -1 or whatever to completely disable killing
> off valid queries on the replica. In this setup you're going ahead
> with cleanup records which may or may not be safe and then have no
> recourse if they turn out to conflict.

Attempting a full analysis...

An example of current and proposed behaviours, using tables A, B and C:

T0: An AccessExclusiveLock is applied to B
T1: Q1 takes snapshot, takes lock on A and begins query
T2: Q2 takes snapshot, queues for lock on B behind the AccessExclusiveLock
T3: Cleanup on table C is handled that will conflict with both snapshots
T4: Q3 takes snapshot, takes lock on C and begins query (if possible)
T5: Cleanup on table C is handled that will conflict with Q3

Current: at T3, conflict resolution will wait for max_standby_delay and then cancel Q1 and Q2. Q3 can begin processing immediately, because the snapshot it takes will always be the same as, or later than, the xmin that generated the cleanup at T3 (*). At T5, Q3 will be quickly cancelled, because all the standby delay was used up at T3 and there is none left to spend on delaying for Q3.

(*) is obviously a key assumption.

Proposal1: conflict resolution will not wait at T3 at all, and Q1 and Q2 will continue towards completion. At T5, Q3 will be cancelled without much delay, as explained for current.

Proposal1 seems better than the current situation.
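To make that concrete, here is a rough sketch of Proposal1's redo-side resolution. This is illustrative C, not the actual patch: GetRelationLockers and CancelBackend are invented stand-ins for the lock manager and the signalling machinery, and the xid comparison ignores wraparound (the real code would use TransactionIdPrecedes()).

#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;
typedef uint32_t Oid;

typedef struct LockerInfo
{
    int           backendId;
    TransactionId xmin;
} LockerInfo;

/* Toy stand-ins for the lock manager and signalling machinery. */
static int
GetRelationLockers(Oid relid, LockerInfo *out, int maxout)
{
    (void) relid; (void) out; (void) maxout;
    return 0;                       /* no lockers in this toy build */
}

static void
CancelBackend(int backendId)
{
    printf("cancelling backend %d\n", backendId);
}

/*
 * Proposal1: resolve a cleanup record against lockers of the cleaned
 * relation only, with no wait.  In the example above, Q1 (locking A)
 * and Q2 (queued on B) never appear in the conflict list for C, so
 * they run to completion; only backends actually touching C with a
 * too-old xmin are cancelled.
 */
static void
ResolveCleanupConflict(Oid relid, TransactionId latestRemovedXid)
{
    LockerInfo lockers[64];
    int        nlockers = GetRelationLockers(relid, lockers, 64);

    /* The SetLatestRemovedXid() bookkeeping from the first mail would
     * also happen here, so that new lockers re-check their xmin. */
    for (int i = 0; i < nlockers; i++)
    {
        if (lockers[i].xmin <= latestRemovedXid)    /* wraparound-naive */
            CancelBackend(lockers[i].backendId);
    }
    /* The WAL cleanup record can now be applied. */
}

int main(void)
{
    ResolveCleanupConflict(16385, 1000);    /* table C, horizon xid 1000 */
    return 0;
}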
Taking up your point about timing delays: if the sequence of actions is

T6: Q4 takes snapshot
T7: commit of a transaction that advances xmin
T8: Cleanup on table C handled without delay
T9: Q4 takes lock on C and cancels

then yes, Q4 is cancelled without a delay. There is a possible race condition that would allow this, but in the vast majority of read committed queries this is a small window, since T7 and T8 are seldom adjacent in WAL, whereas T6 and T9 are typically very close together.

If the race does occur, the effect is not incorrect query results, just a query cancelled earlier than we would ideally have liked.

A slight modification to the proposal would be to check for conflicts based upon the snapshot first, wait, then check for lock conflicts before cancelling, rather than the other way around. That closes the timing window you've pointed out, at least as far as max_standby_delay. Call that Proposal2; a sketch of the reordered checks is at the end of this mail.

The first example would then result like this:

Proposal2: conflict resolution will wait at T3; Q1 and Q2 will continue towards completion, because there is no lock conflict. At T5, Q3 will be cancelled without much delay, as explained for current.

So an almost identical outcome in the typical case, but Proposal2 doesn't cancel queries early in some cases.

In summary:

* Current: Q1, Q2 and Q3 are cancelled.
* Proposal1: Q1 and Q2 continue until completion. Q3 is cancelled, with roughly the same delay as currently.
* Proposal2: Q1 and Q2 continue until completion. Q3 is cancelled, with roughly the same delay as currently, and with consistent handling in all cases, not just the typical ones.

So Proposal2 wins, though problems are still possible. AFAICS the faster a table generates cleanup records, the shorter the window of opportunity for queries to run against it without conflict on Hot Standby. The worst case is a single-row table being updated constantly by short transactions, which will generate a rapid stream of cleanup records in WAL. The only simple solution in that case is to pause the standby entirely while queries complete. That option was removed during review to allow the remainder of the patch to proceed, though it clearly needs to be replaced.

I don't think I've solved every usability issue, and I think this needs to go to Alpha to get better feedback on that. More ideas welcome.
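As promised above, here is Proposal2's reordering as a rough, self-contained sketch. Again this is illustrative C rather than the actual patch: the backends[] table, WaitForStandbyDelay, SetLatestRemovedXid and CancelBackend are invented stand-ins, and the xid comparisons ignore wraparound.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;
typedef uint32_t Oid;

typedef struct BackendInfo
{
    int           backendId;
    TransactionId xmin;
    Oid           lockedRelid;      /* toy model: one lock per backend */
    bool          running;
} BackendInfo;

/* A toy snapshot of the system: A = 16384, C = 16385. */
static BackendInfo backends[] = {
    { 101,  900, 16384, true },     /* old snapshot, locks table A */
    { 102,  900, 16385, true },     /* old snapshot, locks table C */
    { 103, 1200, 16385, true },     /* new snapshot, locks table C */
};
#define NBACKENDS 3

static void WaitForStandbyDelay(void) { /* sleep up to max_standby_delay */ }
static void SetLatestRemovedXid(Oid relid, TransactionId xid)
{
    printf("horizon for rel %u is now %u\n", relid, xid);
}
static void CancelBackend(int id) { printf("cancelling backend %d\n", id); }

/*
 * Proposal2: snapshot check first, then the wait, then the lock check.
 */
static void
ResolveCleanupConflictV2(Oid relid, TransactionId latestRemovedXid)
{
    bool anyConflict = false;

    /* 1. Do any snapshots still see the removed row versions? */
    for (int i = 0; i < NBACKENDS; i++)
        if (backends[i].running && backends[i].xmin <= latestRemovedXid)
            anyConflict = true;
    if (!anyConflict)
        return;                     /* no conflict: apply the record now */

    /* 2. Give conflicting queries up to max_standby_delay to finish. */
    WaitForStandbyDelay();

    /* 3. Publish the horizon so new lockers of relid check their xmin. */
    SetLatestRemovedXid(relid, latestRemovedXid);

    /* 4. Cancel only survivors that actually lock the cleaned relation. */
    for (int i = 0; i < NBACKENDS; i++)
        if (backends[i].running &&
            backends[i].xmin <= latestRemovedXid &&
            backends[i].lockedRelid == relid)
            CancelBackend(backends[i].backendId);
}

int main(void)
{
    /* Cleanup on table C removes versions up to xid 1000: backend 101
     * survives (locks A only), 103 survives (snapshot is new enough),
     * and only 102 is cancelled. */
    ResolveCleanupConflictV2(16385, 1000);
    return 0;
}

The point of the ordering is step 4: after the wait, cancellation requires both a too-old snapshot and a lock on the cleaned relation, which is why Q1 and Q2 survive at T3 in the example above.

--
Simon Riggs
www.2ndQuadrant.com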