Thread: Re: [PERFORM] Cpu usage 100% on slave. s_lock problem.

Re: [PERFORM] Cpu usage 100% on slave. s_lock problem.

From: Merlin Moncure
On Fri, Sep 27, 2013 at 7:56 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Thu, Sep 26, 2013 at 10:14 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> On Thu, Sep 26, 2013 at 6:08 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>>> On 2013-08-27 12:17:55 -0500, Merlin Moncure wrote:
>>>> On Tue, Aug 27, 2013 at 10:55 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>>>> > On 2013-08-27 09:57:38 -0500, Merlin Moncure wrote:
>>>> >> + bool
>>>> >> + RecoveryMightBeInProgress(void)
>>>> >> + {
>>>> >> +     /*
>>>> >> +      * We check shared state each time only until we leave recovery mode. We
>>>> >> +      * can't re-enter recovery, so there's no need to keep checking after the
>>>> >> +      * shared variable has once been seen false.
>>>> >> +      */
>>>> >> +     if (!LocalRecoveryInProgress)
>>>> >> +             return false;
>>>> >> +     else
>>>> >> +     {
>>>> >> +             /* use volatile pointer to prevent code rearrangement */
>>>> >> +             volatile XLogCtlData *xlogctl = XLogCtl;
>>>> >> +
>>>> >> +             /* Intentionally query xlogctl without spinlocking! */
>>>> >> +             LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;
>>>> >> +
>>>> >> +             return LocalRecoveryInProgress;
>>>> >> +     }
>>>> >> + }
>>>> >
>>>> > I don't think it's acceptable to *set* LocalRecoveryInProgress
>>>> > here. That should only be done in the normal routine.
>>>>
>>>> quite right -- that was a major error -- you could bypass the
>>>> initialization call to the xlog with some bad luck.
>>>
>>> I've seen this in profiles since, so I'd appreciate pushing this
>>> forward.
>>
>> roger that -- will push ahead when I get into the office...
>
> attached is a new version fixing some comment typos.

Attached is a simplified patch that replaces the spinlock with a read
barrier based on a suggestion made by Andres offlist.  The approach
has different performance characteristics -- a barrier call is being
issued instead of a non-blocking read.   I don't have a performance
test case in hand to prove that's better, so I'm going with Andres's
approach because it's simpler.  Aside: can this truly be the only caller
of pg_read_barrier()?
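
To sketch the shape of the change (not the exact patch text; the
spinlock in question is XLogCtl->info_lck):

/* before: each call while in recovery took the spinlock */
volatile XLogCtlData *xlogctl = XLogCtl;

SpinLockAcquire(&xlogctl->info_lck);
LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;
SpinLockRelease(&xlogctl->info_lck);

/* after: an unlocked read, ordered by a read barrier */
LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;
pg_read_barrier();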

Also, moving to -hackers from -performance.

merlin

Attachment

Re: [PERFORM] Cpu usage 100% on slave. s_lock problem.

From: Heikki Linnakangas
On 27.09.2013 22:00, Merlin Moncure wrote:
> On Fri, Sep 27, 2013 at 7:56 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> On Thu, Sep 26, 2013 at 10:14 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>>> On Thu, Sep 26, 2013 at 6:08 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>>>> On 2013-08-27 12:17:55 -0500, Merlin Moncure wrote:
>>>>> On Tue, Aug 27, 2013 at 10:55 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>>>>>> On 2013-08-27 09:57:38 -0500, Merlin Moncure wrote:
>>>>>>> + bool
>>>>>>> + RecoveryMightBeInProgress(void)
>>>>>>> + {
>>>>>>> +     /*
>>>>>>> +      * We check shared state each time only until we leave recovery mode. We
>>>>>>> +      * can't re-enter recovery, so there's no need to keep checking after the
>>>>>>> +      * shared variable has once been seen false.
>>>>>>> +      */
>>>>>>> +     if (!LocalRecoveryInProgress)
>>>>>>> +             return false;
>>>>>>> +     else
>>>>>>> +     {
>>>>>>> +             /* use volatile pointer to prevent code rearrangement */
>>>>>>> +             volatile XLogCtlData *xlogctl = XLogCtl;
>>>>>>> +
>>>>>>> +             /* Intentionally query xlogctl without spinlocking! */
>>>>>>> +             LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;
>>>>>>> +
>>>>>>> +             return LocalRecoveryInProgress;
>>>>>>> +     }
>>>>>>> + }
>>>>>>
>>>>>> I don't think it's acceptable to *set* LocalRecoveryInProgress
>>>>>> here. That should only be done in the normal routine.
>>>>>
>>>>> quite right -- that was a major error -- you could bypass the
>>>>> initialization call to the xlog with some bad luck.
>>>>
>>>> I've seen this in profiles since, so I'd appreciate pushing this
>>>> forward.
>>>
>>> roger that -- will push ahead when I get into the office...
>>
>> attached is a new version fixing some comment typos.
>
> Attached is a simplified patch that replaces the spinlock with a read
> barrier based on a suggestion made by Andres offlist.  The approach
> has different performance characteristics -- a barrier call is being
> issued instead of a non-blocking read.   I don't have a performance
> test case in hand to prove that's better, so I'm going with Andres's
> approach because it's simpler.

Can you think of a concrete example of what might go wrong if we simply 
removed the spinlock, and did an unprotected read through the volatile 
pointer? Is there some particular call sequence that might get optimized 
in an unfortunate way? I'm not suggesting that we do that - better safe 
than sorry even if we can't think of any particular scenario - but it 
would help me understand this.

I don't think a read-barrier is enough here. See README.barrier:

> When a memory barrier is being used to separate two
> loads, use pg_read_barrier(); when it is separating two stores, use
> pg_write_barrier(); when it is separating a load and a store (in either
> order), use pg_memory_barrier().

The RecoveryInProgress() function does just one load, to read
the variable, and wouldn't even need a barrier by itself. The other load 
or store that needs to be protected by the barrier happens in the 
caller, before or after the function, and we can't say for sure if it's 
a load or a store. So, let's use pg_memory_barrier().
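
To illustrate the rule with generic examples (not code from the patch;
assume "shared" points at some struct in shared memory):

a = shared->x;        /* load */
pg_read_barrier();    /* separates two loads */
b = shared->y;        /* load */

shared->x = 1;        /* store */
pg_write_barrier();   /* separates two stores */
shared->y = 2;        /* store */

a = shared->x;        /* load */
pg_memory_barrier();  /* separates a load and a store */
shared->y = 2;        /* store */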

>  Aside: can this truly be the only caller
> of pg_read_barrier()?

Yep. I only added the first caller of the barriers altogether in the 
xlog-insertion scaling patch. Robert wrote the infrastructure in 9.3, 
but it wasn't used until now, in 9.4.

- Heikki



Re: [PERFORM] Cpu usage 100% on slave. s_lock problem.

From: Peter Geoghegan
On Fri, Sep 27, 2013 at 1:28 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Yep. I only added the first caller of the barriers altogether in the
> xlog-insertion scaling patch. Robert wrote the infrastructure in 9.3, but it
> wasn't used until now, in 9.4.

FWIW, it was actually during 9.2 development that Robert first added
the barriers.

-- 
Peter Geoghegan



Re: [PERFORM] Cpu usage 100% on slave. s_lock problem.

From: Andres Freund
On 2013-09-27 23:28:37 +0300, Heikki Linnakangas wrote:
> On 27.09.2013 22:00, Merlin Moncure wrote:
> >Attached is simplified patch that replaces the spinlock with a read
> >barrier based on a suggestion made by Andres offlist.  The approach
> >has different performance characteristics -- a barrier call is being
> >issued instead of a non-blocking read.   I don't have a performance
> >test case in hand to prove that's better so I'm going with Andre's
> >approach because it's simpler.
> 
> Can you think of a concrete example of what might go wrong if we simply
> removed the spinlock, and did an unprotected read through the volatile
> pointer? Is there some particular call sequence that might get optimized in
> an unfortunate way? I'm not suggesting that we do that - better safe than
> sorry even if we can't think of any particular scenario - but it would help
> me understand this.

I think on some architectures the write from the other side might just
not be immediately visible, so RecoveryInProgress() might return a
spurious true, although recovery has already finished.
I haven't checked whether there are any callers that would have a problem
with that.

> I don't think a read-barrier is enough here. See README.barrier:
> 
> >When a memory barrier is being used to separate two
> >loads, use pg_read_barrier(); when it is separating two stores, use
> >pg_write_barrier(); when it is separating a load and a store (in either
> >order), use pg_memory_barrier().

I think the documentation is just worded in an easy-to-misunderstand
way. If you look at the following example, we're fine.
The "two stores", "two loads" it is talking about are what's happening
in one thread of execution. The important part of a read barrier is that
it forces a fresh LOAD, separated from earlier LOADs to the same
address.

> The RecoveryInProgress() function does just one load, to read the
> variable, and wouldn't even need a barrier by itself. The other load or
> store that needs to be protected by the barrier happens in the caller,
> before or after the function, and we can't say for sure if it's a load or a
> store. So, let's use pg_memory_barrier().

The caller uses a spinlock, so it's guaranteed to write out before the
spinlock is released. A write barrier (the spinlock in the startup
process) should always be paired with a read barrier.
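
Roughly, the pairing looks like this (a sketch, assuming the flag is
still flipped under XLogCtl->info_lck as in current sources):

/* startup process, at the end of StartupXLOG() */
/* ... state modifications ... */
SpinLockAcquire(&xlogctl->info_lck);       /* acts as the write barrier */
xlogctl->SharedRecoveryInProgress = false;
SpinLockRelease(&xlogctl->info_lck);

/* any backend, in RecoveryInProgress() */
LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress; /* unlocked */
pg_read_barrier();                         /* pairs with the write side */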

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [PERFORM] Cpu usage 100% on slave. s_lock problem.

From: Andres Freund
Hi,

What confuses me is that pg_read_barrier() is just a compiler barrier on
x86[-64] in barrier.h. As far as I know, it needs to be an
lfence or a full barrier?
The linked papers from Paul McKenney - which are a great read - seem to
agree?

On 2013-09-27 23:12:17 +0200, Andres Freund wrote:
> On 2013-09-27 23:28:37 +0300, Heikki Linnakangas wrote:
> > The RecoveryInProgress() function does just one load, to read the
> > variable, and wouldn't even need a barrier by itself. The other load or
> > store that needs to be protected by the barrier happens in the caller,
> > before or after the function, and we can't say for sure if it's a load or a
> > store. So, let's use pg_memory_barrier().
> 
> The caller uses a spinlock, so it's guaranteed to write out before the
> spinlock is released. A write barrier (the spinlock in the startup
> process) should always be paired with a read barrier.

s/caller/startup process/

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [PERFORM] Cpu usage 100% on slave. s_lock problem.

From: Peter Eisentraut
On 9/27/13 3:00 PM, Merlin Moncure wrote:
> Attached is a simplified patch that replaces the spinlock with a read
> barrier based on a suggestion made by Andres offlist.

This patch doesn't apply.



Re: [PERFORM] Cpu usage 100% on slave. s_lock problem.

From: Ants Aasma
On Sat, Sep 28, 2013 at 12:53 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> What confuses me is that pg_read_barrier() is just a compiler barrier on
> x86[-64] in barrier.h. As far as I know, it needs to be an
> lfence or a full barrier?
> The linked papers from Paul McKenney - which are a great read - seem to
> agree?

x86 provides a pretty strong memory model; the only (relevant) allowed
reordering is that stores may be delayed beyond subsequent loads. A
simple and functionally accurate model of this is to ignore all caches
and think of the system as a bunch of CPUs accessing the main memory
through a shared bus, but each CPU has a small buffer that stores
writes done by that CPU. Intel and AMD have performed such feats of
black magic in the memory pipelines that this isn't a good performance
model, but for correctness it is valid. The exceptions are a few
unordered memory instructions that you have to specifically ask for,
and non-writeback memory, which concerns the kernel. (See section 8.2
in [1] for details.) The note by McKenney stating that lfence is
required for a read barrier is for those unordered instructions. I
don't think it's necessary in PostgreSQL.
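
As a concrete illustration of the one reordering x86 does allow (a
generic example, not PostgreSQL code; x and y start as 0 in shared
memory):

/* CPU 0 */        /* CPU 1 */
x = 1;             y = 1;
r0 = y;            r1 = x;

/*
 * r0 == 0 && r1 == 0 is a permitted outcome: each store may still sit
 * in its CPU's store buffer while the subsequent load executes. All
 * other reorderings of ordinary writeback accesses are forbidden.
 */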

As for the specific patch being discussed here: the read barrier is in
the wrong place and with the wrong comment, and the write side is
assuming that SpinLockAcquire() is a write barrier, which it isn't
documented to do at the moment. The correct way to think of this is
that StartupXLOG() does a bunch of state modifications and then
advertises the fact that it's done by setting
xlogctl->SharedRecoveryInProgress = false; The state modifications
should better be visible to anyone seeing that last write, so you need
one write barrier between the state modifications and setting the
flag. On the other side, after reading the flag as false in
RecoveryInProgress() the state modified by StartupXLOG() may be
investigated; those loads better not happen before we read the flag.
So we need a read barrier somewhere *after* reading the flag in
RecoveryInProgress() and reading the shared memory structures, and in
theory a full barrier if we are going to be writing data. In practice
x86 is covered thanks to its memory model, Power is covered thanks to
the control dependency, and ARM would need a read barrier, but I don't
think any real ARM CPU does speculative stores, as that would be
insane.
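
Spelled out as code, the placement I mean is roughly this (a sketch,
not a patch):

/* publisher -- StartupXLOG() */
/* ... end-of-recovery state modifications ... */
pg_write_barrier();    /* the state must be visible before the flag */
xlogctl->SharedRecoveryInProgress = false;

/* consumer -- RecoveryInProgress() and its caller */
if (!xlogctl->SharedRecoveryInProgress)
{
    pg_read_barrier();    /* the flag read must happen first */
    /* ... now safe to read state, e.g. XLogCtl->ThisTimeLineID ... */
}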

[1]
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: [PERFORM] Cpu usage 100% on slave. s_lock problem.

From: Andres Freund
On 2013-10-01 03:51:50 +0300, Ants Aasma wrote:
> On Sat, Sep 28, 2013 at 12:53 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > What confuses me is that pg_read_barrier() is just a compiler barrier on
> > x86[-64] in barrier.h. As far as I know, it needs to be an
> > lfence or a full barrier?
> > The linked papers from Paul McKenney - which are a great read - seem to
> > agree?
> 
> x86 provides a pretty strong memory model; the only (relevant) allowed
> reordering is that stores may be delayed beyond subsequent loads. A
> simple and functionally accurate model of this is to ignore all caches
> and think of the system as a bunch of CPUs accessing the main memory
> through a shared bus, but each CPU has a small buffer that stores
> writes done by that CPU. Intel and AMD have performed such feats of
> black magic in the memory pipelines that this isn't a good performance
> model, but for correctness it is valid. The exceptions are a few
> unordered memory instructions that you have to specifically ask for,
> and non-writeback memory, which concerns the kernel. (See section 8.2
> in [1] for details.) The note by McKenney stating that lfence is
> required for a read barrier is for those unordered instructions. I
> don't think it's necessary in PostgreSQL.

I am actually referring to his x86 description, where he states that
smp_rmb() is defined as a lock xadd. But I've since done some research,
and that was removed for most x86 platforms after Intel and AMD relaxed
the requirements to match the actually implemented reality.

For reference:
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=b6c7347fffa655a3000d9d41640d222c19fc3065

> As for the specific patch being discussed here: the read barrier is in
> the wrong place and with the wrong comment, and the write side is
> assuming that SpinLockAcquire() is a write barrier, which it isn't
> documented to do at the moment.

Lots of things we do pretty much already assume it is one - and I have a
hard time imagining any spinlock implementation that isn't a write
barrier. In fact it's the only reason we use spinlocks in quite a few
places.
What we are missing is that it's not guaranteed to be a compiler barrier
on all platforms :(. But that's imo a bug.

> The correct way to think of this is
> that StartupXLOG() does a bunch of state modifications and then
> advertises the fact that it's done by setting
> xlogctl->SharedRecoveryInProgress = false; The state modifications
> should better be visible to anyone seeing that last write, so you need
> one write barrier between the state modifications and setting the
> flag.

SpinLockAcquire() should provide that.

> On the other side, after reading the flag as false in
> RecoveryInProgress() the state modified by StartupXLOG() may be
> investigated; those loads better not happen before we read the flag.

Agreed.

> So we need a read barrier somewhere *after* reading the flag in
> RecoveryInProgress() and reading the shared memory structures, and in
> theory a full barrier if we are going to be writing data. In practice
> x86 is covered thanks to its memory model, Power is covered thanks to
> the control dependency, and ARM would need a read barrier, but I don't
> think any real ARM CPU does speculative stores, as that would be
> insane.

Does there necessarily have to be a "visible" control dependency?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [PERFORM] Cpu usage 100% on slave. s_lock problem.

From: Andres Freund
On 2013-10-01 14:31:11 +0300, Ants Aasma wrote:
> >> The correct way to think of this is
> >> that StartupXLOG() does a bunch of state modifications and then
> >> advertises the fact that it's done by setting
> >> xlogctl->SharedRecoveryInProgress = false; The state modifications
> >> should better be visible to anyone seeing that last write, so you need
> >> one write barrier between the state modifications and setting the
> >> flag.
> >
> > SpinLockAcquire() should provide that.
> 
> Yes. It's notable that in this case it's a matter of correctness that
> the global state modifications do *not* share the critical section
> with the flag update. Otherwise the flag update may become visible
> before the state updates.

I think we're currently essentially assuming not only that
SpinLockAcquire() is a barrier but also that SpinLockRelease() is
one... which is actually far less likely to be true.

> >> So we need a read barrier somewhere *after* reading the flag in
> >> RecoveryInProgress() and reading the shared memory structures, and in
> >> theory a full barrier if we are going to be writing data. In practice
> >> x86 is covered thanks to its memory model, Power is covered thanks to
> >> the control dependency, and ARM would need a read barrier, but I don't
> >> think any real ARM CPU does speculative stores, as that would be
> >> insane.
> >
> > Does there necessarily have to be a "visible" control dependency?
> 
> Unfortunately no

That's what I thought :(

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [PERFORM] Cpu usage 100% on slave. s_lock problem.

From: Ants Aasma
On Tue, Oct 1, 2013 at 2:08 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> As for the specific patch being discussed here: the read barrier is in
>> the wrong place and with the wrong comment, and the write side is
>> assuming that SpinLockAcquire() is a write barrier, which it isn't
>> documented to do at the moment.
>
> Lots of things we do pretty much already assume it is one - and I have a
> hard time imagining any spinlock implementation that isn't a write
> barrier. In fact it's the only reason we use spinlocks in quite a few
> places.
> What we are missing is that it's not guaranteed to be a compiler barrier
> on all platforms :(. But that's imo a bug.

Agree on both counts.

>> The correct way to think of this is
>> that StartupXLOG() does a bunch of state modifications and then
>> advertises the fact that it's done by setting
>> xlogctl->SharedRecoveryInProgress = false; The state modifications
>> should better be visible to anyone seeing that last write, so you need
>> one write barrier between the state modifications and setting the
>> flag.
>
> SpinLockAcquire() should provide that.

Yes. It's notable that in this case it's a matter of correctness that
the global state modifications do *not* share the critical section
with the flag update. Otherwise the flag update may become visible
before the state updates.
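
That is, roughly (sketches, with SpinLockAcquire() assumed to act as
the barrier, per the above):

/* wrong: flag and state share one critical section, so nothing
 * orders the state stores before the flag store */
SpinLockAcquire(&xlogctl->info_lck);
/* ... modify shared state ... */
xlogctl->SharedRecoveryInProgress = false;
SpinLockRelease(&xlogctl->info_lck);

/* right: publish the state first, then flip the flag */
/* ... modify shared state ... */
SpinLockAcquire(&xlogctl->info_lck);
xlogctl->SharedRecoveryInProgress = false;
SpinLockRelease(&xlogctl->info_lck);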

>> So we need a read barrier somewhere *after* reading the flag in
>> RecoveryInProgress() and reading the shared memory structures, and in
>> theory a full barrier if we are going to be writing data. In practice
> >> x86 is covered thanks to its memory model, Power is covered thanks to
> >> the control dependency, and ARM would need a read barrier, but I don't
> >> think any real ARM CPU does speculative stores, as that would be
>> insane.
>
> Does there necessarily have to be a "visible" control dependency?

Unfortunately no, the compiler is quite free to hoist loads and even
stores out from the conditional block, losing the control dependency.
:( It's quite unlikely to do so as it would be a very large code
motion and it probably has no reason to do it, but I don't see
anything that would disallow it. I wonder if we should just emit a
full fence there for platforms with weak memory models. Trying to
enforce the control dependency seems quite fragile.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: [PERFORM] Cpu usage 100% on slave. s_lock problem.

From: Merlin Moncure
On Mon, Sep 30, 2013 at 8:15 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On 9/27/13 3:00 PM, Merlin Moncure wrote:
>> Attached is a simplified patch that replaces the spinlock with a read
>> barrier based on a suggestion made by Andres offlist.
>
> This patch doesn't apply.

works for me:

merlin@mmoncure-ubuntu:~/pgdev/pgsql$ git reset --hard HEAD
HEAD is now at 200ba16 Add regression test for bug fixed by recent refactoring.
merlin@mmoncure-ubuntu:~/pgdev/pgsql$ patch -p1 < buffer5.patch
patching file src/backend/access/transam/xlog.c


On Mon, Sep 30, 2013 at 7:51 PM, Ants Aasma <ants@cybertec.at> wrote:
> So we need a read barrier somewhere *after* reading the flag in
> RecoveryInProgress() and reading the shared memory structures, and in
> theory a full barrier if we are going to be writing data.

wow -- thanks for your review and the provided detail.  Considering there
are no examples of barrier instructions to date, I think some of your
commentary should be included in the in-source documentation.

In this particular case, a read barrier should be sufficient?  By
'writing data', do you mean to the xlog control structure?  This
routine only sets a backend-local flag, so that should be safe?

merlin



Re: [PERFORM] Cpu usage 100% on slave. s_lock problem.

From: Andres Freund
On 2013-10-02 08:39:59 -0500, Merlin Moncure wrote:
> wow -- thanks for your review and the provided detail.  Considering there
> are no examples of barrier instructions to date, I think some of your
> commentary should be included in the in-source documentation.

I think most of it is in README.barrier

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [PERFORM] Cpu usage 100% on slave. s_lock problem.

From: Ants Aasma
On Wed, Oct 2, 2013 at 4:39 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Mon, Sep 30, 2013 at 7:51 PM, Ants Aasma <ants@cybertec.at> wrote:
>> So we need a read barrier somewhere *after* reading the flag in
>> RecoveryInProgress() and reading the shared memory structures, and in
>> theory a full barrier if we are going to be writing data.
>
> wow -- thanks for your review and the provided detail.  Considering there
> are no examples of barrier instructions to date, I think some of your
> commentary should be included in the in-source documentation.
>
> In this particular case, a read barrier should be sufficient?  By
> 'writing data', do you mean to the xlog control structure?  This
> routine only sets a backend-local flag, so that should be safe?

I haven't reviewed the code in enough detail to say whether there is an
actual race here; I tend to think there's probably not, but the
specific pattern that I had in mind is the one shown in the following
code:

Process A:
globalVar = 1;
write_barrier();
recoveryInProgress = false;

Process B:
if (!recoveryInProgress) {
    globalVar = 2;
    doWork();
}

If process B speculatively executes line 2 before reading the flag for
line 1, then it's possible that the store in process B is executed
before the store in process A, making globalVar move backwards. The
barriers as they are defined don't make this scenario impossible. That
said, I don't know of any hardware that would make speculatively
executed stores visible to non-speculative state; as I said, that
would be completely insane. However, compilers currently consider it
completely legal to rewrite the code into the following form:

tmp = globalVar;
globalVar = 2;
if (!recoveryInProgress) {
    doWork();
} else {
    globalVar = tmp;
}

That still exhibits the same problem. An abstract read barrier would
not be enough here, as this requires a LoadStore barrier. However, the
control dependency is enough for the hardware, and PostgreSQL's
pg_read_barrier() is a full compiler barrier, so in practice a simple
pg_read_barrier() is enough.
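
That is, the reader side that works in practice looks like this (a
sketch):

if (!recoveryInProgress) {
    /* full compiler barrier: the accesses below can no longer be
     * hoisted above the flag test */
    pg_read_barrier();
    globalVar = 2;
    doWork();
}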

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: [PERFORM] Cpu usage 100% on slave. s_lock problem.

From: Merlin Moncure
On Wed, Oct 2, 2013 at 9:45 AM, Ants Aasma <ants@cybertec.at> wrote:
> I haven't reviewed the code in enough detail to say whether there is an
> actual race here; I tend to think there's probably not, but the
> specific pattern that I had in mind is the one shown in the following
> code:

hm.  I think there *is* a race.  2+ threads could race to the line:

LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;

and simultaneously set the value of LocalRecoveryInProgress to false,
and both engage InitXLOGAccess, which is destructive.   The move
operation is atomic, but I don't think there's any guarantee the reads
to xlogctl->SharedRecoveryInProgress are ordered between threads
without a lock.

I don't think the memory barrier will fix this.  Do you agree?  If so,
my earlier patch (recovery4) is another take on this problem, and
arguably safer; the unlocked read is in a separate path that does not
engage InitXLOGAccess().

merlin



Re: [PERFORM] Cpu usage 100% on slave. s_lock problem.

From: Ants Aasma
On Wed, Oct 2, 2013 at 10:37 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Wed, Oct 2, 2013 at 9:45 AM, Ants Aasma <ants@cybertec.at> wrote:
>> I haven't reviewed the code in enough detail to say whether there is an
>> actual race here; I tend to think there's probably not, but the
>> specific pattern that I had in mind is the one shown in the following
>> code:
>
> hm.  I think there *is* a race.  2+ threads could race to the line:
>
> LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;
>
> and simultaneously set the value of LocalRecoveryInProgress to false,
> and both engage InitXLOGAccess, which is destructive.   The move
> operation is atomic, but I don't think there's any guarantee the reads
> to xlogctl->SharedRecoveryInProgress are ordered between threads
> without a lock.
>
> I don't think the memory barrier will fix this.  Do you agree?  If so,
> my earlier patch (recovery4) is another take on this problem, and
> arguably safer; the unlocked read is in a separate path that does not
> engage InitXLOGAccess().

InitXLOGAccess only writes backend local variables, so it can't be
destructive. In fact, it calls GetRedoRecPtr(), which does a spinlock
acquisition cycle, which should be a full memory barrier. A read
barrier after accessing SharedRecoveryInProgress is good enough, and
it seems to be necessary to avoid a race condition for
XLogCtl->ThisTimeLineID.

Admittedly the race is so improbable that in practice it probably
doesn't matter. I just wanted to spell out the correct way to do
unlocked accesses, as it can get quite confusing. I found Herb Sutter's
"atomic<> weapons" talk very useful; thinking about the problem in
terms of acquire and release makes it much clearer to me.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: [PERFORM] Cpu usage 100% on slave. s_lock problem.

From: Merlin Moncure
On Wed, Oct 2, 2013 at 3:43 PM, Ants Aasma <ants@cybertec.at> wrote:
> On Wed, Oct 2, 2013 at 10:37 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> On Wed, Oct 2, 2013 at 9:45 AM, Ants Aasma <ants@cybertec.at> wrote:
>>> I haven't reviewed the code in enough detail to say whether there is an
>>> actual race here; I tend to think there's probably not, but the
>>> specific pattern that I had in mind is the one shown in the following
>>> code:
>>
>> hm.  I think there *is* a race.  2+ threads could race to the line:
>>
>> LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;
>>
>> and simultaneously set the value of LocalRecoveryInProgress to false,
>> and both engage InitXLOGAccess, which is destructive.   The move
>> operation is atomic, but I don't think there's any guarantee the reads
>> to xlogctl->SharedRecoveryInProgress are ordered between threads
>> without a lock.
>>
>> I don't think the memory barrier will fix this.  Do you agree?  If so,
>> my earlier patch (recovery4) is another take on this problem, and
>> arguably safer; the unlocked read is in a separate path that does not
>> engage InitXLOGAccess().
>
> InitXLOGAccess only writes backend local variables, so it can't be
> destructive. In fact, it calls GetRedoRecPtr(), which does a spinlock
> acquisition cycle, which should be a full memory barrier. A read
> barrier after accessing SharedRecoveryInProgress is good enough, and
> it seems to be necessary to avoid a race condition for
> XLogCtl->ThisTimeLineID.
>
> Admittedly the race is so improbable that in practice it probably
> doesn't matter. I just wanted to spell out the correct way to do
> unlocked accesses, as it can get quite confusing. I found Herb Sutter's
> "atomic<> weapons" talk very useful; thinking about the problem in
> terms of acquire and release makes it much clearer to me.

If we don't care about multiple calls to InitXLOGAccess() (for a
backend) then we can get away with just a barrier.  That's a pretty
weird assumption though (even if 'improbable'), and I think it should
be well documented in the code, particularly since previous versions
went through such trouble to do a proper check->set->init.

I'm still leaning on my earlier idea (recovery4.patch) since it keeps
the old semantics by exploiting the fact that the call site of
interest (only) does not require a correct answer (that is, for the
backend to think it's in recovery when it's not).  That just defers
the heap prune for a time and you don't have to mess around with
barriers at all or be concerned about present or future issues caused
by spurious extra calls to InitXLOGAccess().  The other code paths to
RecoveryInProgress() appear not to be interesting from a spinlock
perspective.

merlin



Re: [PERFORM] Cpu usage 100% on slave. s_lock problem.

From: Heikki Linnakangas
On 03.10.2013 00:14, Merlin Moncure wrote:
> On Wed, Oct 2, 2013 at 3:43 PM, Ants Aasma <ants@cybertec.at> wrote:
>> On Wed, Oct 2, 2013 at 10:37 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>>> On Wed, Oct 2, 2013 at 9:45 AM, Ants Aasma <ants@cybertec.at> wrote:
>>>> I haven't reviewed the code in enough detail to say whether there is an
>>>> actual race here; I tend to think there's probably not, but the
>>>> specific pattern that I had in mind is the one shown in the following
>>>> code:
>>>
>>> hm.  I think there *is* a race.  2+ threads could race to the line:
>>>
>>> LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;
>>>
>>> and simultaneously set the value of LocalRecoveryInProgress to false,
>>> and both engage InitXLOGAccess, which is destructive.   The move
>>> operation is atomic, but I don't think there's any guarantee the reads
>>> to xlogctl->SharedRecoveryInProgress are ordered between threads
>>> without a lock.

The backend is single-threaded, so this is moot.

>> InitXLOGAccess only writes backend local variables, so it can't be
>> destructive. In fact, it calls GetRedoRecPtr(), which does a spinlock
>> acquisition cycle, which should be a full memory barrier. A read
>> barrier after accessing SharedRecoveryInProgress is good enough, and
>> it seems to be necessary to avoid a race condition for
>> XLogCtl->ThisTimeLineID.
>>
>> Admittedly the race is so improbable that in practice it probably
>> doesn't matter. I just wanted to spell out the correct way to do
>> unlocked accesses, as it can get quite confusing. I found Herb Sutter's
>> "atomic<> weapons" talk very useful; thinking about the problem in
>> terms of acquire and release makes it much clearer to me.
>
> If we don't care about multiple calls to InitXLOGAccess() (for a
> backend) then we can get away with just a barrier.  That's a pretty
> weird assumption though (even if 'improbable'), and I think it should
> be well documented in the code, particularly since previous versions
> went through such trouble to do a proper check->set->init.
>
> I'm still leaning on my earlier idea (recovery4.patch) since it keeps
> the old semantics by exploiting the fact that the call site of
> interest (only) does not require a correct answer (that is, for the
> backend to think it's in recovery when it's not).  That just defers
> the heap prune for a time and you don't have to mess around with
> barriers at all or be concerned about present or future issues caused
> by spurious extra calls to InitXLOGAccess().  The other code paths to
> RecoveryInProgress() appear not to be interesting from a spinlock
> perspective.

Hmm. All callers of RecoveryInProgress() must be prepared to handle the
case that RecoveryInProgress() returns true, but the system is no longer
in recovery. No matter what locking we do in RecoveryInProgress(), the
startup process might finish recovery just after RecoveryInProgress()
has returned.

What about the attached? It reads the shared variable without a lock or
barrier. If it returns 'true', but the system in fact just exited
recovery, that's OK. As explained above, all the callers must tolerate
that anyway. But if it returns 'false', then it performs a full memory
barrier, which should ensure that it sees any other shared variables as
they are after the startup process cleared SharedRecoveryInProgress
(notably, XLogCtl->ThisTimeLineID).
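
In code, the approach looks roughly like this (a sketch of the
attached patch, not necessarily its exact text):

bool
RecoveryInProgress(void)
{
    /*
     * We check shared state each time only until we leave recovery mode.
     * We can't re-enter recovery, so there's no need to keep checking
     * after the shared variable has once been seen false.
     */
    if (!LocalRecoveryInProgress)
        return false;
    else
    {
        /* use volatile pointer to prevent code rearrangement */
        volatile XLogCtlData *xlogctl = XLogCtl;

        /* unlocked read; a stale 'true' is tolerated by all callers */
        LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;

        if (!LocalRecoveryInProgress)
        {
            /*
             * Make sure we see shared state (notably
             * XLogCtl->ThisTimeLineID) as it is after the startup
             * process cleared SharedRecoveryInProgress.
             */
            pg_memory_barrier();
            InitXLOGAccess();
        }

        return LocalRecoveryInProgress;
    }
}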

- Heikki

Attachment

Re: [PERFORM] Cpu usage 100% on slave. s_lock problem.

From: Andres Freund
On 2013-11-21 16:25:02 +0200, Heikki Linnakangas wrote:
> Hmm. All callers of RecoveryInProgress() must be prepared to handle the case
> that RecoveryInProgress() returns true, but the system is no longer in
> recovery. No matter what locking we do in RecoveryInProgress(), the startup
> process might finish recovery just after RecoveryInProgress() has returned.

True.

> What about the attached? It reads the shared variable without a lock or
> barrier. If it returns 'true', but the system in fact just exited recovery,
> that's OK. As explained above, all the callers must tolerate that anyway.
> But if it returns 'false', then it performs a full memory barrier, which
> should ensure that it sees any other shared variables as they are after the
> startup process cleared SharedRecoveryInProgress (notably,
> XLogCtl->ThisTimeLineID).

I'd argue that we should also remove the spinlock in StartupXLOG and
replace it with a write barrier. Obviously not for performance reasons,
but because somebody might add more code to run under that spinlock.

Looks good otherwise, although a read memory barrier ought to suffice.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [PERFORM] Cpu usage 100% on slave. s_lock problem.

From: Merlin Moncure
On Thu, Nov 21, 2013 at 9:02 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-11-21 16:25:02 +0200, Heikki Linnakangas wrote:
>> Hmm. All callers of RecoveryInProgress() must be prepared to handle the case
>> that RecoveryInProgress() returns true, but the system is no longer in
>> recovery. No matter what locking we do in RecoveryInProgress(), the startup
>> process might finish recovery just after RecoveryInProgress() has returned.
>
> True.
>
>> What about the attached? It reads the shared variable without a lock or
>> barrier. If it returns 'true', but the system in fact just exited recovery,
>> that's OK. As explained above, all the callers must tolerate that anyway.
>> But if it returns 'false', then it performs a full memory barrier, which
>> should ensure that it sees any other shared variables as they are after the
>> startup process cleared SharedRecoveryInProgress (notably,
>> XLogCtl->ThisTimeLineID).
>
> I'd argue that we should also remove the spinlock in StartupXLOG and
> replace it with a write barrier. Obviously not for performance reasons,
> but because somebody might add more code to run under that spinlock.
>
> Looks good otherwise, although a read memory barrier ought to suffice.

This code is in a very hot code path.  Are we *sure* that the read
barrier is fast enough that we don't want to provide an alternate
function that only returns the local flag?  I don't know enough about
them to say either way.

merlin



Re: [PERFORM] Cpu usage 100% on slave. s_lock problem.

From: Andres Freund
On 2013-11-21 09:08:05 -0600, Merlin Moncure wrote:
> This code is in a very hot code path.  Are we *sure* that the read
> barrier is fast enough that we don't want to provide an alternate
> function that only returns the local flag?  I don't know enough about
> them to say either way.

A read barrier is just a compiler barrier on x86.
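
That is, in barrier.h terms the x86 case reduces to something like this
(the exact definitions vary by compiler and architecture):

/* constrains only the compiler; emits no instruction at all */
#define pg_compiler_barrier()   __asm__ __volatile__("" : : : "memory")
#define pg_read_barrier()       pg_compiler_barrier()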

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [PERFORM] Cpu usage 100% on slave. s_lock problem.

From: Merlin Moncure
On Thu, Nov 21, 2013 at 9:09 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-11-21 09:08:05 -0600, Merlin Moncure wrote:
>> This code is in a very hot code path.  Are we *sure* that the read
>> barrier is fast enough that we don't want to provide an alternate
>> function that only returns the local flag?  I don't know enough about
>> them to say either way.
>
> A read barrier is just a compiler barrier on x86.

That's good enough for me then.

merlin



Re: [PERFORM] Cpu usage 100% on slave. s_lock problem.

From: Heikki Linnakangas
On 21.11.2013 17:08, Merlin Moncure wrote:
> On Thu, Nov 21, 2013 at 9:02 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> On 2013-11-21 16:25:02 +0200, Heikki Linnakangas wrote:
>>> Hmm. All callers of RecoveryInProgress() must be prepared to handle the case
>>> that RecoveryInProgress() returns true, but the system is no longer in
>>> recovery. No matter what locking we do in RecoveryInProgress(), the startup
>>> process might finish recovery just after RecoveryInProgress() has returned.
>>
>> True.
>>
>>> What about the attached? It reads the shared variable without a lock or
>>> barrier. If it returns 'true', but the system in fact just exited recovery,
>>> that's OK. As explained above, all the callers must tolerate that anyway.
>>> But if it returns 'false', then it performs a full memory barrier, which
>>> should ensure that it sees any other shared variables as they are after the
>>> startup process cleared SharedRecoveryInProgress (notably,
>>> XLogCtl->ThisTimeLineID).
>>
>> I'd argue that we should also remove the spinlock in StartupXLOG and
>> replace it with a write barrier. Obviously not for performance reasons,
>> but because somebody might add more code to run under that spinlock.
>>
>> Looks good otherwise, although a read memory barrier ought to suffice.
>
> This code is in a very hot code path.  Are we *sure* that the read
> barrier is fast enough that we don't want to provide an alternate
> function that only returns the local flag?  I don't know enough about
> them to say either way.

In my patch, I put the barrier inside the if (!LocalRecoveryInProgress) 
block. That codepath can only execute once in a backend, so performance 
is not an issue there. Does that look sane to you?

- Heikki



Re: [PERFORM] Cpu usage 100% on slave. s_lock problem.

From: Merlin Moncure
On Thu, Nov 21, 2013 at 10:37 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 21.11.2013 17:08, Merlin Moncure wrote:
>>
>> On Thu, Nov 21, 2013 at 9:02 AM, Andres Freund <andres@2ndquadrant.com>
>> wrote:
>>>
>>> On 2013-11-21 16:25:02 +0200, Heikki Linnakangas wrote:
>>>>
>>>> Hmm. All callers of RecoveryInProgress() must be prepared to handle the
>>>> case
>>>> that RecoveryInProgress() returns true, but the system is no longer in
>>>> recovery. No matter what locking we do in RecoveryInProgress(), the
>>>> startup
>>>> process might finish recovery just after RecoveryInProgress() has
>>>> returned.
>>>
>>>
>>> True.
>>>
>>>> What about the attached? It reads the shared variable without a lock or
>>>> barrier. If it returns 'true', but the system in fact just exited
>>>> recovery,
>>>> that's OK. As explained above, all the callers must tolerate that
>>>> anyway.
>>>> But if it returns 'false', then it performs a full memory barrier, which
>>>> should ensure that it sees any other shared variables as they are after the
>>>> startup process cleared SharedRecoveryInProgress (notably,
>>>> XLogCtl->ThisTimeLineID).
>>>
>>>
>>> I'd argue that we should also remove the spinlock in StartupXLOG and
>>> replace it with a write barrier. Obviously not for performance reasons,
>>> but because somebody might add more code to run under that spinlock.
>>>
>>> Looks good otherwise, although a read memory barrier ought to suffice.
>>
>>
>> This code is in a very hot code path.  Are we *sure* that the read
>> barrier is fast enough that we don't want to provide an alternate
>> function that only returns the local flag?  I don't know enough about
>> them to say either way.
>
>
> In my patch, I put the barrier inside the if (!LocalRecoveryInProgress)
> block. That codepath can only execute once in a backend, so performance is
> not an issue there. Does that look sane to you?

oh right -- certainly!

merlin



Re: [PERFORM] Cpu usage 100% on slave. s_lock problem.

From: Heikki Linnakangas
On 21.11.2013 21:37, Merlin Moncure wrote:
> On Thu, Nov 21, 2013 at 10:37 AM, Heikki Linnakangas
>> In my patch, I put the barrier inside the if (!LocalRecoveryInProgress)
>> block. That codepath can only execute once in a backend, so performance is
>> not an issue there. Does that look sane to you?
>
> oh right -- certainly!

Ok, committed.

- Heikki