Thread: Hot standby, conflict resolution
The FATAL and ERROR cancellation modes are quite different. In FATAL
mode, you just want to kill the backend; the target connection doesn't
need to know the LSN. In ERROR mode, you don't really want to
interrupt the target backend. In ReadBuffer, you're checking a global
variable, BufferRecoveryConflictPending, on each call, and if it's
set, you check the buffer's LSN against the earliest conflicting LSN,
and throw an error if it's greater than that.

Why do we jump through so many hoops to get the earliest conflicting
LSN to where it's needed? At the moment:

1. Startup process sets the LSN in the target backend's PGPROC entry,
   and signals it with SIGINT.
2. The target backend receives the signal; ProcessInterrupts is called
   either immediately or at the next CHECK_FOR_INTERRUPTS() call.
3. ProcessInterrupts reads the value from PGPROC, and passes it to
   bufmgr.c.

ISTM that if ReadBuffer read the value directly from the PGPROC entry,
there would be no need for the signaling (in the ERROR mode).

Correct me if I'm wrong, but I thought the idea of this new conflict
resolution was that the startup process doesn't need to wait for the
target backend to die. Instead, the target backend knows to commit
suicide if it stumbles into a buffer that's been modified in a
conflicting way. Looking at ResolveRecoveryConflictWithVirtualXIDs, it
looks like we still wait.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
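For concreteness, the ERROR-mode check described above amounts to
something like the following minimal sketch. The names
(BufferRecoveryConflictPending, BufferRecoveryConflictLSN,
CheckBufferForRecoveryConflict) are hypothetical stand-ins, not actual
code from the patch:

    #include "postgres.h"
    #include "access/xlogdefs.h"
    #include "storage/bufmgr.h"
    #include "storage/bufpage.h"

    /* Set from the interrupt path; checked on every buffer read. */
    static volatile bool BufferRecoveryConflictPending = false;
    static XLogRecPtr BufferRecoveryConflictLSN;  /* earliest conflicting LSN */

    static void
    CheckBufferForRecoveryConflict(Buffer buffer)
    {
        if (BufferRecoveryConflictPending)
        {
            XLogRecPtr  pagelsn = PageGetLSN(BufferGetPage(buffer));

            /* Error out if the page LSN is newer than the conflict LSN. */
            if (XLByteLT(BufferRecoveryConflictLSN, pagelsn))
                ereport(ERROR,
                        (errcode(ERRCODE_QUERY_CANCELED),
                         errmsg("canceling statement due to recovery conflict")));
        }
    }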
On Fri, 2009-01-23 at 17:51 +0200, Heikki Linnakangas wrote:
> In ERROR mode, you don't really want to interrupt the target backend.
> In ReadBuffer, you're checking a global variable,
> BufferRecoveryConflictPending, on each call, and if it's set, you
> check the buffer's LSN against the earliest conflicting LSN, and
> throw an error if it's greater than that. Why do we jump through so
> many hoops to get the earliest conflicting LSN to where it's needed?
> At the moment:
>
> 1. Startup process sets the LSN in the target backend's PGPROC entry,
>    and signals it with SIGINT.
> 2. The target backend receives the signal; ProcessInterrupts is
>    called either immediately or at the next CHECK_FOR_INTERRUPTS()
>    call.
> 3. ProcessInterrupts reads the value from PGPROC, and passes it to
>    bufmgr.c.
>
> ISTM that if ReadBuffer read the value directly from the PGPROC
> entry, there would be no need for the signaling (in the ERROR mode).

That is possible and I considered it. If we did it that way, we would
need to read the PGPROC each time we read a buffer. AFAICS we would
need to use a spinlock to do that, since reading an XLogRecPtr would
not be atomic. Doing it the way I've done it allows us to use a local
variable, which can be more easily cached, and avoids the locking
overhead. We still need to signal for the FATAL case anyway, so
keeping the signal for ERROR doesn't significantly affect the patch
footprint.

> Correct me if I'm wrong, but I thought the idea of this new conflict
> resolution was that the startup process doesn't need to wait for the
> target backend to die. Instead, the target backend knows to commit
> suicide if it stumbles into a buffer that's been modified in a
> conflicting way. Looking at ResolveRecoveryConflictWithVirtualXIDs,
> it looks like we still wait.

err, no, that's just an oversight, not intentional.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support
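A rough illustration of the trade-off Simon describes; the PGPROC
field and mutex here are hypothetical additions, and in the 8.4-era
code an XLogRecPtr is a two-word struct, so a torn read is possible
without a lock:

    #include "postgres.h"
    #include "access/xlogdefs.h"
    #include "storage/proc.h"
    #include "storage/spin.h"

    /*
     * (a) Reading directly from shared memory: every buffer read
     * would pay for a spinlock acquire/release cycle.
     */
    static XLogRecPtr
    GetConflictLSNFromProc(PGPROC *proc)
    {
        XLogRecPtr  lsn;

        SpinLockAcquire(&proc->conflictMutex);  /* hypothetical field */
        lsn = proc->conflictLSN;                /* hypothetical field */
        SpinLockRelease(&proc->conflictMutex);

        return lsn;
    }

    /*
     * (b) Signal-based: the interrupt path copies the value once
     * into backend-local memory, so the per-buffer check in
     * ReadBuffer touches only an ordinary local variable.
     */
    static XLogRecPtr LocalConflictLSN;

    static void
    AbsorbConflictLSN(PGPROC *proc)
    {
        LocalConflictLSN = GetConflictLSNFromProc(proc);
    }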
Simon Riggs wrote:
> On Fri, 2009-01-23 at 17:51 +0200, Heikki Linnakangas wrote:
>> ISTM that if ReadBuffer read the value directly from the PGPROC
>> entry, there would be no need for the signaling (in the ERROR mode).
>
> That is possible and I considered it. If we did it that way, we would
> need to read the PGPROC each time we read a buffer. AFAICS we would
> need to use a spinlock to do that, since reading an XLogRecPtr would
> not be atomic.

Hmm, I think you could make it lock-free by ordering the instructions
carefully. Not sure it's worth the code complexity, though.

>> Correct me if I'm wrong, but I thought the idea of this new conflict
>> resolution was that the startup process doesn't need to wait for the
>> target backend to die. Instead, the target backend knows to commit
>> suicide if it stumbles into a buffer that's been modified in a
>> conflicting way. Looking at ResolveRecoveryConflictWithVirtualXIDs,
>> it looks like we still wait.
>
> err, no, that's just an oversight, not intentional.

Ok, then I think we have a little race condition. The startup process
doesn't get any reply indicating that the target backend has processed
the SIGINT and set the cached conflict LSN. The target backend might
move ahead using the old LSN for a little while, even though the
startup process has already gone ahead and replayed a vacuum record.

Another tiny issue is that it looks like a new conflict LSN always
overwrites the old one, but you should always use the oldest
conflicting LSN in the checks, not the newest.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
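The careful instruction ordering Heikki hints at would presumably look
something like this sketch. pg_write_barrier()/pg_read_barrier() are
the primitives later PostgreSQL releases provide (8.4 would need
platform-specific equivalents), and the scheme is only safe because
the LSN is stored before the flag and is not modified again while the
flag remains set:

    #include "postgres.h"
    #include "access/xlogdefs.h"
    #include "storage/barrier.h"    /* pg_read/write_barrier, 9.2+ */

    typedef struct ConflictSlot     /* hypothetical shared struct */
    {
        volatile bool conflictPending;
        XLogRecPtr    conflictLSN;
    } ConflictSlot;

    /* Writer (startup process): publish the LSN before the flag. */
    static void
    SetConflict(ConflictSlot *slot, XLogRecPtr lsn)
    {
        slot->conflictLSN = lsn;
        pg_write_barrier();         /* make the LSN visible first */
        slot->conflictPending = true;
    }

    /* Reader (target backend): seeing the flag set guarantees the
     * earlier LSN store is visible, so no torn read can occur. */
    static bool
    GetConflict(ConflictSlot *slot, XLogRecPtr *lsn)
    {
        if (!slot->conflictPending)
            return false;

        pg_read_barrier();
        *lsn = slot->conflictLSN;
        return true;
    }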
On Fri, 2009-01-23 at 21:30 +0200, Heikki Linnakangas wrote:
> >> Correct me if I'm wrong, but I thought the idea of this new
> >> conflict resolution was that the startup process doesn't need to
> >> wait for the target backend to die. Instead, the target backend
> >> knows to commit suicide if it stumbles into a buffer that's been
> >> modified in a conflicting way. Looking at
> >> ResolveRecoveryConflictWithVirtualXIDs, it looks like we still
> >> wait.
> >
> > err, no, that's just an oversight, not intentional.
>
> Ok, then I think we have a little race condition. The startup process
> doesn't get any reply indicating that the target backend has
> processed the SIGINT and set the cached conflict LSN. The target
> backend might move ahead using the old LSN for a little while, even
> though the startup process has already gone ahead and replayed a
> vacuum record.

Hah! You had me either way. Very neat :-)

I'll think some more and reply, but not tonight.

> Another tiny issue is that it looks like a new conflict LSN always
> overwrites the old one, but you should always use the oldest
> conflicting LSN in the checks, not the newest.

OK

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Fri, 2009-01-23 at 21:30 +0200, Heikki Linnakangas wrote:
> Ok, then I think we have a little race condition. The startup process
> doesn't get any reply indicating that the target backend has
> processed the SIGINT and set the cached conflict LSN. The target
> backend might move ahead using the old LSN for a little while, even
> though the startup process has already gone ahead and replayed a
> vacuum record.
>
> Another tiny issue is that it looks like a new conflict LSN always
> overwrites the old one, but you should always use the oldest
> conflicting LSN in the checks, not the newest.

That makes it easier, because the conflict LSN is either not set, or
it is set and does not need to be reset as new conflict LSNs appear.

I can see a simple scheme emerging, which I will detail tomorrow
morning.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support
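The "not set, or set and never reset" behaviour could be as simple as
the following fragment, since the startup process encounters conflicts
in WAL order and the first LSN recorded is therefore the oldest
(slot->conflictLSN is a hypothetical field, zeroed when unset):

    /* Record only the first, i.e. oldest, conflict LSN; later,
     * larger LSNs never need to replace it. */
    if (XLogRecPtrIsInvalid(slot->conflictLSN))
        slot->conflictLSN = lsn;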
On Sun, 2009-01-25 at 16:19 +0000, Simon Riggs wrote:
> On Fri, 2009-01-23 at 21:30 +0200, Heikki Linnakangas wrote:
> > Ok, then I think we have a little race condition. The startup
> > process doesn't get any reply indicating that the target backend
> > has processed the SIGINT and set the cached conflict LSN. The
> > target backend might move ahead using the old LSN for a little
> > while, even though the startup process has already gone ahead and
> > replayed a vacuum record.
> >
> > Another tiny issue is that it looks like a new conflict LSN always
> > overwrites the old one, but you should always use the oldest
> > conflicting LSN in the checks, not the newest.
>
> That makes it easier, because the conflict LSN is either not set, or
> it is set and does not need to be reset as new conflict LSNs appear.
>
> I can see a simple scheme emerging, which I will detail tomorrow
> morning.

Rather than signalling, we could use a hasconflict boolean for each
proc in a shared data structure. It can be read without a spinlock,
but should only be written while holding the spinlock.

Each time we read a block, we check whether hasconflict is set. If it
is, we grab the spinlock, recheck whether it is set, and if so read
the conflict details, clear the flag and drop the spinlock.

The aim of this type of conflict resolution was to reduce the
footprint of users that would be affected, and to defer conflicts as
much as possible. We've spent time getting the latestCompletedXid, but
we know deriving that value is very difficult in the btree case at
least. So what I would like to do is pass the relid of a conflict
across as well, and use that to reduce the footprint, now that we are
performing the test inside the buffer manager.

We would keep a relid cache with a very small number of relids,
perhaps just one, maybe as many as 4 or 8, so that we can fit the
relids and their associated LSNs in a single cache line. We can match
the relid using a simple for loop, which we know is well optimised
when there is no dependency between the elements of the loop and the
loop has a compile-time fixed number of iterations.

I would be inclined to make this a separate shared memory area rather
than try to weld it onto PGPROC. We could index it by backendid. If
the relid cache overflows, we just apply a general LSN value.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support
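A sketch of that scheme, with all names hypothetical: one slot per
backend in a separate shared memory area indexed by backendid, an
unlocked fast-path test with a locked recheck, and a fixed-size relid
cache that falls back to a general LSN on overflow:

    #include "postgres.h"
    #include "access/xlogdefs.h"
    #include "storage/spin.h"

    #define CONFLICT_CACHE_SIZE 4   /* relids + LSNs fit one cache line */

    typedef struct RecoveryConflictSlot
    {
        slock_t     mutex;          /* must be held to write */
        bool        hasconflict;    /* may be read without the lock */
        Oid         relids[CONFLICT_CACHE_SIZE];
        XLogRecPtr  lsns[CONFLICT_CACHE_SIZE];
        bool        overflowed;     /* cache full; use generalLSN */
        XLogRecPtr  generalLSN;
    } RecoveryConflictSlot;

    /* Backend-local copy of the conflict details. */
    static RecoveryConflictSlot localConflicts;

    static const XLogRecPtr InvalidConflictLSN = {0, 0};

    /*
     * Called on each block read: unlocked fast-path test, then a
     * recheck under the spinlock before absorbing the details.
     */
    static void
    AbsorbRecoveryConflicts(RecoveryConflictSlot *slot)
    {
        if (!slot->hasconflict)         /* no lock on the fast path */
            return;

        SpinLockAcquire(&slot->mutex);
        if (slot->hasconflict)          /* recheck under the lock */
        {
            localConflicts = *slot;     /* read the conflict details */
            slot->hasconflict = false;  /* clear the flag */
        }
        SpinLockRelease(&slot->mutex);
    }

    /*
     * Match the relid with a simple fixed-iteration loop; if the
     * cache overflowed, apply the general LSN instead.
     */
    static XLogRecPtr
    ConflictLSNForRelation(Oid relid)
    {
        int         i;

        for (i = 0; i < CONFLICT_CACHE_SIZE; i++)
        {
            if (localConflicts.relids[i] == relid)
                return localConflicts.lsns[i];
        }

        if (localConflicts.overflowed)
            return localConflicts.generalLSN;

        return InvalidConflictLSN;      /* no conflict for this rel */
    }

The buffer manager would call AbsorbRecoveryConflicts() on each read
and then compare the page LSN against ConflictLSNForRelation() for the
buffer's relation, erroring out only on an actual conflict.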
Simon Riggs wrote:
> Rather than signalling, we could use a hasconflict boolean for each
> proc in a shared data structure. It can be read without a spinlock,
> but should only be written while holding the spinlock.
>
> Each time we read a block, we check whether hasconflict is set. If it
> is, we grab the spinlock, recheck whether it is set, and if so read
> the conflict details, clear the flag and drop the spinlock.

Yeah, that seems workable.

> The aim of this type of conflict resolution was to reduce the
> footprint of users that would be affected, and to defer conflicts as
> much as possible. We've spent time getting the latestCompletedXid,
> but we know deriving that value is very difficult in the btree case
> at least. So what I would like to do is pass the relid of a conflict
> across as well, and use that to reduce the footprint, now that we are
> performing the test inside the buffer manager.

I agree that would be useful, but I'd prefer to keep it simple for
now...

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com