Thread: Re: [PERFORM] Cpu usage 100% on slave. s_lock problem.
On Fri, Sep 27, 2013 at 7:56 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Thu, Sep 26, 2013 at 10:14 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> On Thu, Sep 26, 2013 at 6:08 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>>> On 2013-08-27 12:17:55 -0500, Merlin Moncure wrote:
>>>> On Tue, Aug 27, 2013 at 10:55 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>>>> > On 2013-08-27 09:57:38 -0500, Merlin Moncure wrote:
>>>> >> + bool
>>>> >> + RecoveryMightBeInProgress(void)
>>>> >> + {
>>>> >> +     /*
>>>> >> +      * We check shared state each time only until we leave recovery mode. We
>>>> >> +      * can't re-enter recovery, so there's no need to keep checking after the
>>>> >> +      * shared variable has once been seen false.
>>>> >> +      */
>>>> >> +     if (!LocalRecoveryInProgress)
>>>> >> +         return false;
>>>> >> +     else
>>>> >> +     {
>>>> >> +         /* use volatile pointer to prevent code rearrangement */
>>>> >> +         volatile XLogCtlData *xlogctl = XLogCtl;
>>>> >> +
>>>> >> +         /* Intentionally query xlogctl without spinlocking! */
>>>> >> +         LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;
>>>> >> +
>>>> >> +         return LocalRecoveryInProgress;
>>>> >> +     }
>>>> >> + }
>>>> >
>>>> > I don't think it's acceptable to *set* LocalRecoveryInProgress
>>>> > here. That should only be done in the normal routine.
>>>>
>>>> quite right -- that was a major error -- you could bypass the
>>>> initialization call to the xlog with some bad luck.
>>>
>>> I've seen this in profiles since, so I'd appreciate pushing this
>>> forward.
>>
>> roger that -- will push ahead when I get into the office...
>
> attached is new version fixing some comment typos.

Attached is simplified patch that replaces the spinlock with a read barrier based on a suggestion made by Andres offlist. The approach has different performance characteristics -- a barrier call is being issued instead of a non-blocking read.
I don't have a performance test case in hand to prove that's better so I'm going with Andres' approach because it's simpler. Aside: can this truly be the only caller of pg_read_barrier()? Also, moving to -hackers from -performance. merlin
On 27.09.2013 22:00, Merlin Moncure wrote:
> On Fri, Sep 27, 2013 at 7:56 AM, Merlin Moncure<mmoncure@gmail.com> wrote:
>> On Thu, Sep 26, 2013 at 10:14 PM, Merlin Moncure<mmoncure@gmail.com> wrote:
>>> On Thu, Sep 26, 2013 at 6:08 PM, Andres Freund<andres@2ndquadrant.com> wrote:
>>>> On 2013-08-27 12:17:55 -0500, Merlin Moncure wrote:
>>>>> On Tue, Aug 27, 2013 at 10:55 AM, Andres Freund<andres@2ndquadrant.com> wrote:
>>>>>> On 2013-08-27 09:57:38 -0500, Merlin Moncure wrote:
>>>>>>> + bool
>>>>>>> + RecoveryMightBeInProgress(void)
>>>>>>> + {
>>>>>>> +     /*
>>>>>>> +      * We check shared state each time only until we leave recovery mode. We
>>>>>>> +      * can't re-enter recovery, so there's no need to keep checking after the
>>>>>>> +      * shared variable has once been seen false.
>>>>>>> +      */
>>>>>>> +     if (!LocalRecoveryInProgress)
>>>>>>> +         return false;
>>>>>>> +     else
>>>>>>> +     {
>>>>>>> +         /* use volatile pointer to prevent code rearrangement */
>>>>>>> +         volatile XLogCtlData *xlogctl = XLogCtl;
>>>>>>> +
>>>>>>> +         /* Intentionally query xlogctl without spinlocking! */
>>>>>>> +         LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;
>>>>>>> +
>>>>>>> +         return LocalRecoveryInProgress;
>>>>>>> +     }
>>>>>>> + }
>>>>>>
>>>>>> I don't think it's acceptable to *set* LocalRecoveryInProgress
>>>>>> here. That should only be done in the normal routine.
>>>>>
>>>>> quite right -- that was a major error -- you could bypass the
>>>>> initialization call to the xlog with some bad luck.
>>>>
>>>> I've seen this in profiles since, so I'd appreciate pushing this
>>>> forward.
>>>
>>> roger that -- will push ahead when I get into the office...
>>
>> attached is new version fixing some comment typos.
>
> Attached is simplified patch that replaces the spinlock with a read
> barrier based on a suggestion made by Andres offlist. The approach
> has different performance characteristics -- a barrier call is being
> issued instead of a non-blocking read.
> I don't have a performance
> test case in hand to prove that's better so I'm going with Andres'
> approach because it's simpler.

Can you think of a concrete example of what might go wrong if we simply removed the spinlock, and did an unprotected read through the volatile pointer? Is there some particular call sequence that might get optimized in an unfortunate way? I'm not suggesting that we do that - better safe than sorry even if we can't think of any particular scenario - but it would help me understand this.

I don't think a read barrier is enough here. See README.barrier:

> When a memory barrier is being used to separate two
> loads, use pg_read_barrier(); when it is separating two stores, use
> pg_write_barrier(); when it is separating a load and a store (in either
> order), use pg_memory_barrier().

The RecoveryInProgress() function does just one load, to read the variable, and wouldn't even need a barrier by itself. The other load or store that needs to be protected by the barrier happens in the caller, before or after the function, and we can't say for sure if it's a load or a store. So, let's use pg_memory_barrier().

> Aside: can this truly be the only caller
> of pg_read_barrier()?

Yep. I only added the first caller of the barriers altogether in the xlog-insertion scaling patch. Robert wrote the infrastructure in 9.3, but it wasn't used until now, in 9.4.

- Heikki
On Fri, Sep 27, 2013 at 1:28 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > Yep. I only added the first caller of the barriers altogether in the > xlog-insertion scaling patch. Robert wrote the infrastructure in 9.3, but it > wasn't used until now, in 9.4. FWIW, it was actually during 9.2 development that Robert first added the barriers. -- Peter Geoghegan
On 2013-09-27 23:28:37 +0300, Heikki Linnakangas wrote:
> On 27.09.2013 22:00, Merlin Moncure wrote:
> >Attached is simplified patch that replaces the spinlock with a read
> >barrier based on a suggestion made by Andres offlist. The approach
> >has different performance characteristics -- a barrier call is being
> >issued instead of a non-blocking read. I don't have a performance
> >test case in hand to prove that's better so I'm going with Andre's
> >approach because it's simpler.
>
> Can you think of a concrete example of what might go wrong if we simply
> removed the spinlock, and did an unprotected read through the volatile
> pointer? Is there some particular call sequence that might get optimized in
> an unfortunate way? I'm not suggesting that we do that - better safe than
> sorry even if we can't think of any particular scenario - but it would help
> me understand this.

I think on some architectures the write from the other side might just not immediately be visible, so RecoveryInProgress() might return a spurious true although we already finished. I haven't checked whether there are any callers that would have a problem with that.

> I don't think a read-barrier is enough here. See README.barrier:
>
> > When a memory barrier is being used to separate two
> > loads, use pg_read_barrier(); when it is separating two stores, use
> > pg_write_barrier(); when it is a separating a load and a store (in either
> > order), use pg_memory_barrier().

I think the documentation is just worded in an easy-to-misunderstand way. If you look at the following example, we're fine. The "two stores", "two loads" it is talking about are what's happening in one thread of execution. The important part of a read barrier is that it forces a fresh LOAD, separated from earlier LOADs to the same address.

> The function RecoveryInProgress() function does just one load, to read the
> variable, and wouldn't even need a barrier by itself.
The other load or
> store that needs to be protected by the barrier happens in the caller,
> before or after the function, and we can't say for sure if it's a load or a
> store. So, let's use pg_memory_barrier().

The caller uses a spinlock, so it's guaranteed to write out before the spinlock is released. A write barrier (the spinlock in the startup process) should always be paired with a read barrier.

Greetings,

Andres Freund

--
Andres Freund  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi, What confuses me is that pg_read_barrier() is just a compiler barrier on x86[-64] in barrier.h. According to my knowledge it needs to be an lfence or the full barrier? The linked papers from Paul McKenney - which are a great read - seem to agree? On 2013-09-27 23:12:17 +0200, Andres Freund wrote: > On 2013-09-27 23:28:37 +0300, Heikki Linnakangas wrote: > > The function RecoveryInProgress() function does just one load, to read the > > variable, and wouldn't even need a barrier by itself. The other load or > > store that needs to be protected by the barrier happens in the caller, > > before or after the function, and we can't say for sure if it's a load or a > > store. So, let's use pg_memory_barrier(). > > The caller uses a spinlock, so it's guaranteed to write out before the > spinlock is released. A write barrier (the spinlock in the startup > process) should always be paired by a read barrier. s/caller/startup process/ Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 9/27/13 3:00 PM, Merlin Moncure wrote: > Attached is simplified patch that replaces the spinlock with a read > barrier based on a suggestion made by Andres offlist. This patch doesn't apply.
On Sat, Sep 28, 2013 at 12:53 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> What confuses me is that pg_read_barrier() is just a compiler barrier on
> x86[-64] in barrier.h. According to my knowledge it needs to be an
> lfence or the full barrier?
> The linked papers from Paul McKenney - which are a great read - seem to
> agree?

x86 provides a pretty strong memory model; the only (relevant) allowed reordering is that stores may be delayed beyond subsequent loads. A simple and functionally accurate model of this is to ignore all caches and think of the system as a bunch of CPUs accessing main memory through a shared bus, but where each CPU has a small buffer that stores writes done by that CPU. Intel and AMD have performed such feats of black magic in the memory pipelines that this isn't a good performance model, but for correctness it is valid. The exceptions are a few unordered memory instructions that you have to specifically ask for, and non-writeback memory that concerns the kernel. (See section 8.2 in [1] for details.) The note by McKenney stating that lfence is required for a read barrier is for those unordered instructions. I don't think it's necessary in PostgreSQL.

As for the specific patch being discussed here: the read barrier is in the wrong place and with the wrong comment, and the write side is assuming that SpinLockAcquire() is a write barrier, which it isn't documented to do at the moment. The correct way to think of this is that StartupXLOG() does a bunch of state modifications and then advertises the fact that it's done by setting xlogctl->SharedRecoveryInProgress = false; The state modifications had better be visible to anyone seeing that last write, so you need one write barrier between the state modifications and setting the flag. On the other side, after reading the flag as false in RecoveryInProgress() the state modified by StartupXLOG() may be investigated; those loads had better not happen before we read the flag.
So we need a read barrier somewhere *after* reading the flag in RecoveryInProgress() and before reading the shared memory structures, and in theory a full barrier if we are going to be writing data. In practice x86 is covered thanks to its memory model, Power is covered thanks to the control dependency, and ARM would need a read barrier, but I don't think any real ARM CPU does speculative stores, as that would be insane.

[1] http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf

Regards,

Ants Aasma

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
On 2013-10-01 03:51:50 +0300, Ants Aasma wrote: > On Sat, Sep 28, 2013 at 12:53 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > What confuses me is that pg_read_barrier() is just a compiler barrier on > > x86[-64] in barrier.h. According to my knowledge it needs to be an > > lfence or the full barrier? > > The linked papers from Paul McKenney - which are a great read - seem to > > agree? > > x86 provides a pretty strong memory model, the only (relevant) allowed > reordering is that stores may be delayed beyond subsequent loads. A > simple and functionally accurate model of this is to ignore all caches > and think of the system as a bunch of CPU's accessing the main memory > through a shared bus, but each CPU has a small buffer that stores > writes done by that CPU. Intel and AMD have performed feats of black > magic in the memory pipelines that this isn't a good performance > model, but for correctness it is valid. The exceptions are few > unordered memory instructions that you have to specifically ask for > and non-writeback memory that concerns the kernel. (See section 8.2 in > [1] for details) The note by McKenney stating that lfence is required > for a read barrier is for those unordered instructions. I don't think > it's necessary in PostgreSQL. I am actually referring to his x86 description where he states smp_rmb() is defined as lock xadd;. But I've since done some research and that was removed for most x86 platforms after intel and AMD relaxed the requirements to match the actually implemented reality. For reference: http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=b6c7347fffa655a3000d9d41640d222c19fc3065 > As for the specific patch being discussed here. The read barrier is in > the wrong place and with the wrong comment, and the write side is > assuming that SpinLockAcquire() is a write barrier, which it isn't > documented to do at the moment. 
Lots of things we do pretty much already assume it is one - and I have a hard time imagining any spinlock implementation that isn't a write barrier. In fact it's the only reason we use spinlocks in quite some places. What we are missing is that it's not guaranteed to be a compiler barrier on all platforms :(. But that's imo a bug. > The correct way to think of this is > that StartupXLOG() does a bunch of state modifications and then > advertises the fact that it's done by setting > xlogctl->SharedRecoveryInProgress = false; The state modifications > should better be visible to anyone seeing that last write, so you need > one write barrier between the state modifications and setting the > flag. SpinLockAcquire() should provide that. > On the other side, after reading the flag as false in > RecoveryInProgress() the state modified by StartupXLOG() may be > investigated, those loads better not happen before we read the flag. Agreed. > So we need a read barrier somewhere *after* reading the flag in > RecoveryInProgress() and reading the shared memory structures, and in > theory a full barrier if we are going to be writing data. In practice > x86 is covered thanks to it's memory model, Power is covered thanks to > the control dependency and ARM would need a read barrier, but I don't > think any real ARM CPU does speculative stores as that would be > insane. Does there necessarily have to be a "visible" control dependency? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2013-10-01 14:31:11 +0300, Ants Aasma wrote: > >> The correct way to think of this is > >> that StartupXLOG() does a bunch of state modifications and then > >> advertises the fact that it's done by setting > >> xlogctl->SharedRecoveryInProgress = false; The state modifications > >> should better be visible to anyone seeing that last write, so you need > >> one write barrier between the state modifications and setting the > >> flag. > > > > SpinLockAcquire() should provide that. > > Yes. It's notable that in this case it's a matter of correctness that > the global state modifications do *not* share the critical section > with the flag update. Otherwise the flag update may become visible > before the state updates. I think we're currently essentially assuming that not only SpinLockAcquire() is a barrier but also that SpinLockRelease() is one... - which is actually far less likely to be true. > >> So we need a read barrier somewhere *after* reading the flag in > >> RecoveryInProgress() and reading the shared memory structures, and in > >> theory a full barrier if we are going to be writing data. In practice > >> x86 is covered thanks to it's memory model, Power is covered thanks to > >> the control dependency and ARM would need a read barrier, but I don't > >> think any real ARM CPU does speculative stores as that would be > >> insane. > > > > Does there necessarily have to be a "visible" control dependency? > > Unfortunately no That's what I thought :( Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Oct 1, 2013 at 2:08 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> As for the specific patch being discussed here. The read barrier is in >> the wrong place and with the wrong comment, and the write side is >> assuming that SpinLockAcquire() is a write barrier, which it isn't >> documented to do at the moment. > > Lots of things we do pretty much already assume it is one - and I have a > hard time imagining any spinlock implementation that isn't a write > barrier. In fact it's the only reason we use spinlocks in quite some > places. > What we are missing is that it's not guaranteed to be a compiler barrier > on all platforms :(. But that's imo a bug. Agree on both counts. >> The correct way to think of this is >> that StartupXLOG() does a bunch of state modifications and then >> advertises the fact that it's done by setting >> xlogctl->SharedRecoveryInProgress = false; The state modifications >> should better be visible to anyone seeing that last write, so you need >> one write barrier between the state modifications and setting the >> flag. > > SpinLockAcquire() should provide that. Yes. It's notable that in this case it's a matter of correctness that the global state modifications do *not* share the critical section with the flag update. Otherwise the flag update may become visible before the state updates. >> So we need a read barrier somewhere *after* reading the flag in >> RecoveryInProgress() and reading the shared memory structures, and in >> theory a full barrier if we are going to be writing data. In practice >> x86 is covered thanks to it's memory model, Power is covered thanks to >> the control dependency and ARM would need a read barrier, but I don't >> think any real ARM CPU does speculative stores as that would be >> insane. > > Does there necessarily have to be a "visible" control dependency? Unfortunately no, the compiler is quite free to hoist loads and even stores out from the conditional block losing the control dependency. 
:( It's quite unlikely to do so as it would be a very large code motion and it probably has no reason to do it, but I don't see anything that would disallow it. I wonder if we should just emit a full fence there for platforms with weak memory models. Trying to enforce the control dependency seems quite fragile. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
On Mon, Sep 30, 2013 at 8:15 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On 9/27/13 3:00 PM, Merlin Moncure wrote:
>> Attached is simplified patch that replaces the spinlock with a read
>> barrier based on a suggestion made by Andres offlist.
>
> This patch doesn't apply.

works for me:

merlin@mmoncure-ubuntu:~/pgdev/pgsql$ git reset --hard HEAD
HEAD is now at 200ba16 Add regression test for bug fixed by recent refactoring.
merlin@mmoncure-ubuntu:~/pgdev/pgsql$ patch -p1 < buffer5.patch
patching file src/backend/access/transam/xlog.c

On Mon, Sep 30, 2013 at 7:51 PM, Ants Aasma <ants@cybertec.at> wrote:
> So we need a read barrier somewhere *after* reading the flag in
> RecoveryInProgress() and reading the shared memory structures, and in
> theory a full barrier if we are going to be writing data.

wow -- thanks for your review and provided detail. Considering there are no examples of barrier instructions to date, I think some of your commentary should be included in the in-source documentation.

In this particular case, a read barrier should be sufficient? By 'writing data', do you mean to the xlog control structure? This routine only sets a backend local flag so that should be safe?

merlin
On 2013-10-02 08:39:59 -0500, Merlin Moncure wrote: > wow -- thanks for your review and provided detail. Considering there > are no examples of barrier instructions to date, I think some of your > commentary should be included in the in-source documentation. I think most of it is in README.barrier Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Oct 2, 2013 at 4:39 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Mon, Sep 30, 2013 at 7:51 PM, Ants Aasma <ants@cybertec.at> wrote:
>> So we need a read barrier somewhere *after* reading the flag in
>> RecoveryInProgress() and reading the shared memory structures, and in
>> theory a full barrier if we are going to be writing data.
>
> wow -- thanks for your review and provided detail. Considering there
> are no examples of barrier instructions to date, I think some of your
> commentary should be included in the in-source documentation.
>
> In this particular case, a read barrier should be sufficient? By
> 'writing data', do you mean to the xlog control structure? This
> routine only sets a backend local flag so that should be safe?

I haven't reviewed the code in as much detail to say if there is an actual race here, I tend to think there's probably not, but the specific pattern that I had in mind is that with the following actual code:

Process A:

    globalVar = 1;
    write_barrier();
    recoveryInProgress = false;

Process B:

    if (!recoveryInProgress) {
        globalVar = 2;
        doWork();
    }

If process B speculatively executes line 2 before reading the flag for line 1, then it's possible that the store in process B is executed before the store in process A, making globalVar move backwards. The barriers as they are defined don't make this scenario impossible. That said, I don't know of any hardware that would make speculatively executed stores visible to non-speculative state, as I said, that would be completely insane. However currently compilers consider it completely legal to rewrite the code into the following form:

    tmp = globalVar;
    globalVar = 2;
    if (!recoveryInProgress) {
        doWork();
    } else {
        globalVar = tmp;
    }

That still exhibits the same problem. An abstract read barrier would not be enough here, as this requires a LoadStore barrier.
However, the control dependency is enough for the hardware, and PostgreSQL's pg_read_barrier() is a full compiler barrier, so in practice a simple pg_read_barrier() is enough.

Regards,

Ants Aasma

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
On Wed, Oct 2, 2013 at 9:45 AM, Ants Aasma <ants@cybertec.at> wrote:
> I haven't reviewed the code in as much detail to say if there is an
> actual race here, I tend to think there's probably not, but the
> specific pattern that I had in mind is that with the following actual
> code:

hm. I think there *is* a race. 2+ threads could race to the line:

LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;

and simultaneously set the value of LocalRecoveryInProgress to false, and both engage InitXLOGAccess, which is destructive. The move operation is atomic, but I don't think there's any guarantee the reads to xlogctl->SharedRecoveryInProgress are ordered between threads without a lock.

I don't think the memory barrier will fix this. Do you agree? If so, my earlier patch (recovery4) is another take on this problem, and arguably safer; the unlocked read is in a separate path that does not engage InitXLOGAccess()

merlin
On Wed, Oct 2, 2013 at 10:37 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Wed, Oct 2, 2013 at 9:45 AM, Ants Aasma <ants@cybertec.at> wrote:
>> I haven't reviewed the code in as much detail to say if there is an
>> actual race here, I tend to think there's probably not, but the
>> specific pattern that I had in mind is that with the following actual
>> code:
>
> hm. I think there *is* a race. 2+ threads could race to the line:
>
> LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;
>
> and simultaneously set the value of LocalRecoveryInProgress to false,
> and both engage InitXLOGAccess, which is destructive. The move
> operation is atomic, but I don't think there's any guarantee the reads
> to xlogctl->SharedRecoveryInProgress are ordered between threads
> without a lock.
>
> I don't think the memory barrier will fix this. Do you agree? If so,
> my earlier patch (recovery4) is another take on this problem, and
> arguable safer; the unlocked read is in a separate path that does not
> engage InitXLOGAccess()

InitXLOGAccess only writes backend-local variables, so it can't be destructive. In fact, it calls GetRedoRecPtr(), which does a spinlock acquisition cycle, which should be a full memory barrier. A read barrier after accessing SharedRecoveryInProgress is good enough, and it seems to be necessary to avoid a race condition for XLogCtl->ThisTimeLineID.

Admittedly the race is so improbable that in practice it probably doesn't matter. I just wanted to spell out the correct way to do unlocked accesses, as it can get quite confusing. I found Herb Sutter's "atomic<> weapons" talk very useful; thinking about the problem in terms of acquire and release makes it much clearer to me.

Regards,

Ants Aasma

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
On Wed, Oct 2, 2013 at 3:43 PM, Ants Aasma <ants@cybertec.at> wrote: > On Wed, Oct 2, 2013 at 10:37 PM, Merlin Moncure <mmoncure@gmail.com> wrote: >> On Wed, Oct 2, 2013 at 9:45 AM, Ants Aasma <ants@cybertec.at> wrote: >>> I haven't reviewed the code in as much detail to say if there is an >>> actual race here, I tend to think there's probably not, but the >>> specific pattern that I had in mind is that with the following actual >>> code: >> >> hm. I think there *is* a race. 2+ threads could race to the line: >> >> LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress; >> >> and simultaneously set the value of LocalRecoveryInProgress to false, >> and both engage InitXLOGAccess, which is destructive. The move >> operation is atomic, but I don't think there's any guarantee the reads >> to xlogctl->SharedRecoveryInProgress are ordered between threads >> without a lock. >> >> I don't think the memory barrier will fix this. Do you agree? If so, >> my earlier patch (recovery4) is another take on this problem, and >> arguable safer; the unlocked read is in a separate path that does not >> engage InitXLOGAccess() > > InitXLOGAccess only writes backend local variables, so it can't be > destructive. In fact, it calls GetRedoRecPtr(), which does a spinlock > acquisition cycle, which should be a full memory barrier. A read > barrier after accessing SharedRecoveryInProgress is good enough, and > it seems to be necessary to avoid a race condition for > XLogCtl->ThisTimeLineID. > > Admittedly the race is so unprobable that in practice it probably > doesn't matter. I just wanted to be spell out the correct way to do > unlocked accesses as it can get quite confusing. I found Herb Sutter's > "atomic<> weapons" talk very useful, thinking about the problem in > terms of acquire and release makes it much clearer to me. If we don't care about multiple calls to InitXLOGAccess() (for a backend) then we can get away with just a barrier. 
That's a pretty weird assumption though (even if 'improbable') and I think it should be well documented in the code, particularly since previous versions went through such trouble to do a proper check->set->init.

I'm still leaning toward my earlier idea (recovery4.patch) since it keeps the old semantics by exploiting the fact that the call site of interest (only) does not require a correct answer (that is, for the backend to think it's in recovery when it's not). That just defers the heap prune for a time and you don't have to mess around with barriers at all or be concerned about present or future issues caused by spurious extra calls to InitXLOGAccess(). The other code paths to RecoveryInProgress() appear not to be interesting from a spinlock perspective.

merlin
On 03.10.2013 00:14, Merlin Moncure wrote: > On Wed, Oct 2, 2013 at 3:43 PM, Ants Aasma <ants@cybertec.at> wrote: >> On Wed, Oct 2, 2013 at 10:37 PM, Merlin Moncure <mmoncure@gmail.com> wrote: >>> On Wed, Oct 2, 2013 at 9:45 AM, Ants Aasma <ants@cybertec.at> wrote: >>>> I haven't reviewed the code in as much detail to say if there is an >>>> actual race here, I tend to think there's probably not, but the >>>> specific pattern that I had in mind is that with the following actual >>>> code: >>> >>> hm. I think there *is* a race. 2+ threads could race to the line: >>> >>> LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress; >>> >>> and simultaneously set the value of LocalRecoveryInProgress to false, >>> and both engage InitXLOGAccess, which is destructive. The move >>> operation is atomic, but I don't think there's any guarantee the reads >>> to xlogctl->SharedRecoveryInProgress are ordered between threads >>> without a lock. The backend is single-threaded, so this is moot. >> InitXLOGAccess only writes backend local variables, so it can't be >> destructive. In fact, it calls GetRedoRecPtr(), which does a spinlock >> acquisition cycle, which should be a full memory barrier. A read >> barrier after accessing SharedRecoveryInProgress is good enough, and >> it seems to be necessary to avoid a race condition for >> XLogCtl->ThisTimeLineID. >> >> Admittedly the race is so unprobable that in practice it probably >> doesn't matter. I just wanted to be spell out the correct way to do >> unlocked accesses as it can get quite confusing. I found Herb Sutter's >> "atomic<> weapons" talk very useful, thinking about the problem in >> terms of acquire and release makes it much clearer to me. > > If we don't care about multiple calls to InitXLOGAccess() (for a > backend) then we can get away with just a barrier. 
That's a pretty
> weird assumption though (even if 'improbable') and I think it should
> be well documented in the code, particularly since previous versions
> went though such trouble to do a proper check->set->init.
>
> I'm still leaning on my earlier idea (recovery4.patch) since it keeps
> the old semantics by exploiting the fact that the call site of
> interest (only) does not require a correct answer (that is, for the
> backend to think it's in recovery when it's not). That just defers
> the heap prune for a time and you don't have to mess around with
> barriers at all or be concerned about present or future issues caused
> by spurious extra calls to InitXLOGAccess(). The other code paths to
> RecoveryInProgress() appear not to be interesting from a spinlock
> perspective.

Hmm. All callers of RecoveryInProgress() must be prepared to handle the case that RecoveryInProgress() returns true, but the system is no longer in recovery. No matter what locking we do in RecoveryInProgress(), the startup process might finish recovery just after RecoveryInProgress() has returned.

What about the attached? It reads the shared variable without a lock or barrier. If it returns 'true', but the system in fact just exited recovery, that's OK. As explained above, all the callers must tolerate that anyway. But if it returns 'false', then it performs a full memory barrier, which should ensure that it sees any other shared variables as they are after the startup process cleared SharedRecoveryInProgress (notably, XLogCtl->ThisTimeLineID).

- Heikki
On 2013-11-21 16:25:02 +0200, Heikki Linnakangas wrote:
> Hmm. All callers of RecoveryInProgress() must be prepared to handle the
> case that RecoveryInProgress() returns true, but the system is no
> longer in recovery. No matter what locking we do in
> RecoveryInProgress(), the startup process might finish recovery just
> after RecoveryInProgress() has returned.

True.

> What about the attached? It reads the shared variable without a lock or
> barrier. If it returns 'true', but the system in fact just exited
> recovery, that's OK. As explained above, all the callers must tolerate
> that anyway. But if it returns 'false', then it performs a full memory
> barrier, which should ensure that it sees any other shared variables as
> they are after the startup process cleared SharedRecoveryInProgress
> (notably, XLogCtl->ThisTimeLineID).

I'd argue that we should also remove the spinlock in StartupXLOG and
replace it with a write barrier. Obviously not for performance reasons,
but because somebody might add more code to run under that spinlock.

Looks good otherwise, although a read memory barrier ought to suffice.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Nov 21, 2013 at 9:02 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-11-21 16:25:02 +0200, Heikki Linnakangas wrote:
>> Hmm. All callers of RecoveryInProgress() must be prepared to handle the
>> case that RecoveryInProgress() returns true, but the system is no
>> longer in recovery. No matter what locking we do in
>> RecoveryInProgress(), the startup process might finish recovery just
>> after RecoveryInProgress() has returned.
>
> True.
>
>> What about the attached? It reads the shared variable without a lock or
>> barrier. If it returns 'true', but the system in fact just exited
>> recovery, that's OK. As explained above, all the callers must tolerate
>> that anyway. But if it returns 'false', then it performs a full memory
>> barrier, which should ensure that it sees any other shared variables as
>> they are after the startup process cleared SharedRecoveryInProgress
>> (notably, XLogCtl->ThisTimeLineID).
>
> I'd argue that we should also remove the spinlock in StartupXLOG and
> replace it with a write barrier. Obviously not for performance reasons,
> but because somebody might add more code to run under that spinlock.
>
> Looks good otherwise, although a read memory barrier ought to suffice.

This code is in a very hot code path. Are we *sure* that the read
barrier is fast enough that we don't want to provide an alternate
function that only returns the local flag? I don't know enough about
them to say either way.

merlin
On 2013-11-21 09:08:05 -0600, Merlin Moncure wrote:
> This code is in a very hot code path. Are we *sure* that the read
> barrier is fast enough that we don't want to provide an alternate
> function that only returns the local flag? I don't know enough about
> them to say either way.

A read barrier is just a compiler barrier on x86.

Greetings,

Andres Freund
On Thu, Nov 21, 2013 at 9:09 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-11-21 09:08:05 -0600, Merlin Moncure wrote:
>> This code is in a very hot code path. Are we *sure* that the read
>> barrier is fast enough that we don't want to provide an alternate
>> function that only returns the local flag? I don't know enough about
>> them to say either way.
>
> A read barrier is just a compiler barrier on x86.

That's good enough for me then.

merlin
On 21.11.2013 17:08, Merlin Moncure wrote:
> On Thu, Nov 21, 2013 at 9:02 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> On 2013-11-21 16:25:02 +0200, Heikki Linnakangas wrote:
>>> Hmm. All callers of RecoveryInProgress() must be prepared to handle
>>> the case that RecoveryInProgress() returns true, but the system is no
>>> longer in recovery. No matter what locking we do in
>>> RecoveryInProgress(), the startup process might finish recovery just
>>> after RecoveryInProgress() has returned.
>>
>> True.
>>
>>> What about the attached? It reads the shared variable without a lock
>>> or barrier. If it returns 'true', but the system in fact just exited
>>> recovery, that's OK. As explained above, all the callers must
>>> tolerate that anyway. But if it returns 'false', then it performs a
>>> full memory barrier, which should ensure that it sees any other
>>> shared variables as they are after the startup process cleared
>>> SharedRecoveryInProgress (notably, XLogCtl->ThisTimeLineID).
>>
>> I'd argue that we should also remove the spinlock in StartupXLOG and
>> replace it with a write barrier. Obviously not for performance
>> reasons, but because somebody might add more code to run under that
>> spinlock.
>>
>> Looks good otherwise, although a read memory barrier ought to suffice.
>
> This code is in a very hot code path. Are we *sure* that the read
> barrier is fast enough that we don't want to provide an alternate
> function that only returns the local flag? I don't know enough about
> them to say either way.

In my patch, I put the barrier inside the if (!LocalRecoveryInProgress)
block. That codepath can only execute once in a backend, so performance
is not an issue there. Does that look sane to you?

- Heikki
On Thu, Nov 21, 2013 at 10:37 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 21.11.2013 17:08, Merlin Moncure wrote:
>> On Thu, Nov 21, 2013 at 9:02 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>>> On 2013-11-21 16:25:02 +0200, Heikki Linnakangas wrote:
>>>> Hmm. All callers of RecoveryInProgress() must be prepared to handle
>>>> the case that RecoveryInProgress() returns true, but the system is
>>>> no longer in recovery. No matter what locking we do in
>>>> RecoveryInProgress(), the startup process might finish recovery just
>>>> after RecoveryInProgress() has returned.
>>>
>>> True.
>>>
>>>> What about the attached? It reads the shared variable without a lock
>>>> or barrier. If it returns 'true', but the system in fact just exited
>>>> recovery, that's OK. As explained above, all the callers must
>>>> tolerate that anyway. But if it returns 'false', then it performs a
>>>> full memory barrier, which should ensure that it sees any other
>>>> shared variables as they are after the startup process cleared
>>>> SharedRecoveryInProgress (notably, XLogCtl->ThisTimeLineID).
>>>
>>> I'd argue that we should also remove the spinlock in StartupXLOG and
>>> replace it with a write barrier. Obviously not for performance
>>> reasons, but because somebody might add more code to run under that
>>> spinlock.
>>>
>>> Looks good otherwise, although a read memory barrier ought to suffice.
>>
>> This code is in a very hot code path. Are we *sure* that the read
>> barrier is fast enough that we don't want to provide an alternate
>> function that only returns the local flag? I don't know enough about
>> them to say either way.
>
> In my patch, I put the barrier inside the if (!LocalRecoveryInProgress)
> block. That codepath can only execute once in a backend, so performance
> is not an issue there. Does that look sane to you?

oh right -- certainly!

merlin
On 21.11.2013 21:37, Merlin Moncure wrote:
> On Thu, Nov 21, 2013 at 10:37 AM, Heikki Linnakangas
>> In my patch, I put the barrier inside the if (!LocalRecoveryInProgress)
>> block. That codepath can only execute once in a backend, so performance
>> is not an issue there. Does that look sane to you?
>
> oh right -- certainly!

Ok, committed.

- Heikki