Thread: sinval synchronization considered harmful

sinval synchronization considered harmful

From
Robert Haas
Date:
For the last week or so, in between various other tasks, I've been
trying to understand why performance drops off when you run pgbench -n
-S -c $CLIENTS -j $CLIENTS -T $A_FEW_MINUTES at very high client
counts.  The answer, in a word, is SIGetDataEntries().  I believe we
need to bite the bullet and rewrite this using a lock-free algorithm,
using memory barriers on processors with weak memory ordering.
Perhaps there is another way to do it, but nothing I've tried has
really worked so far, and I've tried quite a few things.  Here's the
data.

On unpatched master, performance scales pretty much linearly out to 32
clients.  As you add more clients, it drops off:

[1 client]
tps = 4459.203159 (including connections establishing)
tps = 4488.456559 (including connections establishing)
tps = 4449.570530 (including connections establishing)
[8 clients]
tps = 33524.668762 (including connections establishing)
tps = 33501.833907 (including connections establishing)
tps = 33382.380908 (including connections establishing)
[32 clients]
tps = 178394.863906 (including connections establishing)
tps = 178457.706972 (including connections establishing)
tps = 179365.310179 (including connections establishing)
[80 clients]
tps = 132518.586371 (including connections establishing)
tps = 130968.749747 (including connections establishing)
tps = 132574.338942 (including connections establishing)

With the lazy vxid locks patch, all of those numbers get better by a
few percentage points (the more clients, the more points) except at 80
clients, where the performance is sometimes better and sometimes
worse.  These are 5-minute runs:

[80 clients, with lazy vxid locks]
tps = 119215.958372 (including connections establishing)
tps = 113056.859871 (including connections establishing)
tps = 160562.770998 (including connections establishing)

I can't explain in detail why there is so much variation between the
5-minute runs, or why some runs perform worse than without the lazy
vxid locks, but I can say that applying the first of the two attached
patches (sinval-per-backend-mutex.patch) fixes it.  The patch gets rid
of SInvalReadLock and instead gives each backend its own spinlock.
SICleanupQueue() must therefore get somewhat fuzzy answers, but it
shouldn't matter.  The only real problem is if you need to do the
super-duper-cleanup where you subtract a constant from all the values
in unison.  I just got rid of that completely, by widening the
counters to 64 bits.  Assuming I haven't botched the implementation,
this is clearly a win.  There is still some performance drop-off going
from 32 clients to 80 clients, but it is much less.

[80 clients, with lazy vxid locks and sinval-per-backend-mutex]
tps = 167392.042393 (including connections establishing)
tps = 171336.145020 (including connections establishing)
tps = 170500.529303 (including connections establishing)

Profiling this combination of patches reveals that there is still some
pretty ugly spinlock contention on sinval's msgnumLock.  And it occurs
to me that on x86, we really don't need this lock ... or
SInvalReadLock ... or a per-backend mutex.  The whole of
SIGetDataEntries() can pretty easily be made lock-free.  The only real
changes that seem to be needed are (1) to use a 64-bit counter, so
you never need to decrement and (2) to recheck resetState after
reading the entries from the queue, to see if we got reset while we
were reading those entries.  Since x86 guarantees that writes will
become visible in the order they are executed, we only need to make
sure that the compiler doesn't rearrange things.  As long as we first
read the maxMsgNum and then read the messages, we can't read garbage.
As long as we read the messages before we check resetState, we will be
sure to notice if we got reset before we read all the messages (which
is the only way that we can have read garbage messages).
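
For concreteness, the read side would then look roughly like this (just
a fragment with invented names, not the actual patch):

    max = segP->maxMsgNum;          /* 1. read the queue head first */
    pg_compiler_barrier();          /* keep the entry reads after this load */
    n = CopyMessages(stateP->nextMsgNum, max, messages);  /* 2. read entries */
    pg_compiler_barrier();          /* keep the resetState check after the reads */
    if (stateP->resetState)         /* 3. were we reset while reading? */
        return -1;                  /* entries may be garbage; rebuild caches */
    stateP->nextMsgNum = max;       /* 4. only we ever write this */
    return n;

Here pg_compiler_barrier() stands for whatever keep-the-compiler-honest
primitive we end up with; on x86 it need not emit any instruction at all.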

Now this implementation (sinval-unlocked.patch) is obviously not a
serious proposal as it stands, because on processors with weak memory
ordering it will presumably blow up all over the place.  But here are
the results on x86:

[80 clients, with lazy vxid locks and sinval-unlocked]
tps = 203256.701227 (including connections establishing)
tps = 190637.957571 (including connections establishing)
tps = 190228.617178 (including connections establishing)

Now, that's actually *faster* than the above 32-client numbers.  Turns
out, with this approach, there's essentially no degradation between 32
clients and 80 clients.  It appears that even one spinlock acquisition
in SIGetDataEntries() is too many.  Currently, we have *three*: one to
get SInvalReadLock, one to get msgnumLock, and one to release
SInvalReadLock.

For contrast, on a two-core MacBook Pro, I can't measure any
difference at all between lazy-vxid,
lazy-vxid+sinval-per-backend-mutex, and lazy-vxid+sinval-unlocked.
The last might even be a touch slower, although there's enough noise
that it's hard to tell for sure, and the effect is very small if there
is one.  But on the 32-core machine, the differences are dramatic.

Thoughts?  Comments?  Ideas?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: sinval synchronization considered harmful

From
"Kevin Grittner"
Date:
Robert Haas  wrote:
> SIGetDataEntries(). I believe we need to bite the bullet and
> rewrite this using a lock-free algorithm, using memory barriers on
> processors with weak memory ordering.
> [32 processors; 80 clients]
> On unpatched master
> tps = 132518.586371 (including connections establishing)
> tps = 130968.749747 (including connections establishing)
> tps = 132574.338942 (including connections establishing)
> With the lazy vxid locks patch
> tps = 119215.958372 (including connections establishing)
> tps = 113056.859871 (including connections establishing)
> tps = 160562.770998 (including connections establishing)
> gets rid of SInvalReadLock and instead gives each backend its own
> spinlock.
> tps = 167392.042393 (including connections establishing)
> tps = 171336.145020 (including connections establishing)
> tps = 170500.529303 (including connections establishing)
> SIGetDataEntries() can pretty easily be made lock-free.
> tps = 203256.701227 (including connections establishing)
> tps = 190637.957571 (including connections establishing)
> tps = 190228.617178 (including connections establishing)
> Thoughts? Comments? Ideas?
Very impressive!  Those numbers definitely justify some #ifdef code
to provide alternatives for weak memory ordering machines versus
others.  With the number of CPUs climbing as it is, this is very
important work!
-Kevin


Re: sinval synchronization considered harmful

From
Robert Haas
Date:
On Thu, Jul 21, 2011 at 12:16 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Very impressive!  Those numbers definitely justify some #ifdef code
> to provide alternatives for weak memory ordering machines versus
> others.  With the number of CPUs climbing as it is, this is very
> important work!

Thanks.  I'm not thinking so much about #ifdef (although that could
work, too) as I am about providing some primitives to allow this sort
of code-writing to be done in a somewhat less ad-hoc fashion.  It
seems like there are basically two categories of machines we need to
worry about.

1. Machines with strong memory ordering.  On this category of machines
(which includes x86), the CPU basically does not perform loads or
stores out of order.  On some of these machines, it is apparently
possible for there to be some reordering of stores relative to loads,
but if the program stores two values or loads two values, those
operations will be performed in the same order they appear in the
program.  The main thing you need to make your code work reliably on
these machines is a primitive that keeps the compiler from reordering
your code during optimization.  On x86, certain categories of exotic
instructions (non-temporal stores, for example) do require explicit
barriers even so.

2. Machines with weak memory ordering.  On this category of machines
(which includes PowerPC, Dec Alpha, and maybe some others), the CPU
reorders memory accesses arbitrarily unless you explicitly issue
instructions that enforce synchronization.  You still need to keep the
compiler from moving things around, too.  Alpha is particularly
pernicious, because something like a->b can fetch the pointed-to value
before loading the pointer itself.  This is otherwise known as "we
have basically no cache coherency circuits on this chip at all".  On
these machines, you need to issue an explicit memory barrier
instruction at each sequence point, or just acquire and release a
spinlock.

So you can imagine a primitive that is defined to be a compiler
barrier on machines with strong memory ordering, and a memory-fencing
instruction on machines with weak memory ordering.
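
Something along these lines, say (a sketch assuming gcc; the name is
invented and the details would obviously need more work):

    /* compiler barrier on strongly ordered hardware, real fence elsewhere */
    #if defined(__i386__) || defined(__x86_64__)
    #define pg_ordering_barrier()   __asm__ __volatile__("" ::: "memory")
    #else
    #define pg_ordering_barrier()   __sync_synchronize()   /* gcc full fence */
    #endif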

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: sinval synchronization considered harmful

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> ... On these machines, you need to issue an explicit memory barrier
> instruction at each sequence point, or just acquire and release a
> spinlock.

Right, and the reason that a spinlock fixes it is that we have memory
barrier instructions built into the spinlock code sequences on machines
where it matters.

To get to the point where we could do the sort of optimization Robert
is talking about, someone will have to build suitable primitives for
all the platforms we support.  In the cases where we use gcc ASM in
s_lock.h, it shouldn't be too hard to pull out the barrier
instruction(s) ... but on platforms where we rely on OS-supplied
functions, some research is going to be needed.
        regards, tom lane


Re: sinval synchronization considered harmful

From
Robert Haas
Date:
On Thu, Jul 21, 2011 at 2:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> ... On these machines, you need to issue an explicit memory barrier
>> instruction at each sequence point, or just acquire and release a
>> spinlock.
>
> Right, and the reason that a spinlock fixes it is that we have memory
> barrier instructions built into the spinlock code sequences on machines
> where it matters.
>
> To get to the point where we could do the sort of optimization Robert
> is talking about, someone will have to build suitable primitives for
> all the platforms we support.  In the cases where we use gcc ASM in
> s_lock.h, it shouldn't be too hard to pull out the barrier
> instruction(s) ... but on platforms where we rely on OS-supplied
> functions, some research is going to be needed.

Yeah, although falling back to SpinLockAcquire() and SpinLockRelease()
on a backend-private slock_t should work anywhere that PostgreSQL
works at all[1]. That will probably be slower than a memory fence
instruction and certainly slower than a compiler barrier, but the
point is that - right now - we're doing it the slow way everywhere.

I think the real challenge is going to be testing.  If anyone has a
machine with weak memory ordering they can give me access to, that
would be really helpful for flushing the bugs out of this stuff.
Getting it to work on x86 is not the hard part.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] This was a suggestion from Noah Misch.  I wasn't quite convinced
when he initially made it, but having studied the issue a lot more, I
now am.  The CPU doesn't know how many processes have the memory
mapped into their address space.


Re: sinval synchronization considered harmful

From
Dave Page
Date:
On Thu, Jul 21, 2011 at 8:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> I think the real challenge is going to be testing.  If anyone has a
> machine with weak memory ordering they can give me access to, that
> would be really helpful for flushing the bugs out of this stuff.
> Getting it to work on x86 is not the hard part.

I believe there's a PPC box in our storage facility in NJ that we
might be able to dig out for you. There's also a couple in our India
office. Let me know if they'd be of help.

--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: sinval synchronization considered harmful

From
Robert Haas
Date:
On Thu, Jul 21, 2011 at 3:22 PM, Dave Page <dpage@pgadmin.org> wrote:
> On Thu, Jul 21, 2011 at 8:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I think the real challenge is going to be testing.  If anyone has a
>> machine with weak memory ordering they can give me access to, that
>> would be really helpful for flushing the bugs out of this stuff.
>> Getting it to work on x86 is not the hard part.
>
> I believe there's a PPC box in our storage facility in NJ that we
> might be able to dig out for you. There's also a couple in our India
> office. Let me know if they'd be of help.

Yes!

More processors is better, of course, but having anything at all to
test on would be an improvement.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: sinval synchronization considered harmful

From
Dave Page
Date:
On Thu, Jul 21, 2011 at 8:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jul 21, 2011 at 3:22 PM, Dave Page <dpage@pgadmin.org> wrote:
>> On Thu, Jul 21, 2011 at 8:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> I think the real challenge is going to be testing.  If anyone has a
>>> machine with weak memory ordering they can give me access to, that
>>> would be really helpful for flushing the bugs out of this stuff.
>>> Getting it to work on x86 is not the hard part.
>>
>> I believe there's a PPC box in our storage facility in NJ that we
>> might be able to dig out for you. There's also a couple in our India
>> office. Let me know if they'd be of help.
>
> Yes!
>
> More processors is better, of course, but having anything at all to
> test on would be an improvement.

OK, will check with India first, as it'll be easier for them to deploy.

--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: sinval synchronization considered harmful

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I think the real challenge is going to be testing.  If anyone has a
> machine with weak memory ordering they can give me access to, that
> would be really helpful for flushing the bugs out of this stuff.

There are multi-CPU PPCen in the buildfarm, or at least there were last
time I broke the sinval code ;-).  Note that testing on a single-core
PPC will prove nothing.
        regards, tom lane


Re: sinval synchronization considered harmful

From
Robert Haas
Date:
On Thu, Jul 21, 2011 at 3:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I think the real challenge is going to be testing.  If anyone has a
>> machine with weak memory ordering they can give me access to, that
>> would be really helpful for flushing the bugs out of this stuff.
>
> There are multi-CPU PPCen in the buildfarm, or at least there were last
> time I broke the sinval code ;-).  Note that testing on a single-core
> PPC will prove nothing.

Yeah, I was just thinking about that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: sinval synchronization considered harmful

From
Noah Misch
Date:
On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote:
> For the last week or so, in between various other tasks, I've been
> trying to understand why performance drops off when you run pgbench -n
> -S -c $CLIENTS -j $CLIENTS -T $A_FEW_MINUTES at very high client
> counts.  The answer, in a word, is SIGetDataEntries().  I believe we
> need to bite the bullet and rewrite this using a lock-free algorithm,
> using memory barriers on processors with weak memory ordering.
> Perhaps there is another way to do it, but nothing I've tried has
> really worked so far, and I've tried quite a few things.  Here's the
> data.
> 
> On unpatched master, performance scales pretty much linearly out to 32
> clients.  As you add more clients, it drops off:

> [80 clients]
> tps = 132518.586371 (including connections establishing)
> tps = 130968.749747 (including connections establishing)
> tps = 132574.338942 (including connections establishing)

> [80 clients, with lazy vxid locks and sinval-unlocked]
> tps = 203256.701227 (including connections establishing)
> tps = 190637.957571 (including connections establishing)
> tps = 190228.617178 (including connections establishing)

Nice numbers.  The sinval-unlocked.patch implementation looks like it's taking a
sound direction.

In
http://archives.postgresql.org/message-id/CA+TgmobbxMh_9zjudheSWO6m8sBMb5hdZt+3ChCLuv5eztv8Ug@mail.gmail.com,
you quoted 210k TPS when you stubbed out AcceptInvalidationMessages().  Is it
correct to conclude that AcceptInvalidationMessages() still reduces the
transaction rate by 5-10% with this stack of patches?

-- 
Noah Misch                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: sinval synchronization considered harmful

From
Robert Haas
Date:
On Thu, Jul 21, 2011 at 4:54 PM, Noah Misch <noah@2ndquadrant.com> wrote:
> On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote:
>> For the last week or so, in between various other tasks, I've been
>> trying to understand why performance drops off when you run pgbench -n
>> -S -c $CLIENTS -j $CLIENTS -T $A_FEW_MINUTES at very high client
>> counts.  The answer, in a word, is SIGetDataEntries().  I believe we
>> need to bite the bullet and rewrite this using a lock-free algorithm,
>> using memory barriers on processors with weak memory ordering.
>> Perhaps there is another way to do it, but nothing I've tried has
>> really worked so far, and I've tried quite a few things.  Here's the
>> data.
>>
>> On unpatched master, performance scales pretty much linearly out to 32
>> clients.  As you add more clients, it drops off:
>
>> [80 clients]
>> tps = 132518.586371 (including connections establishing)
>> tps = 130968.749747 (including connections establishing)
>> tps = 132574.338942 (including connections establishing)
>
>> [80 clients, with lazy vxid locks and sinval-unlocked]
>> tps = 203256.701227 (including connections establishing)
>> tps = 190637.957571 (including connections establishing)
>> tps = 190228.617178 (including connections establishing)
>
> Nice numbers.  The sinval-unlocked.patch implementation looks like it's taking a
> sound direction.
>
> In
> http://archives.postgresql.org/message-id/CA+TgmobbxMh_9zjudheSWO6m8sBMb5hdZt+3ChCLuv5eztv8Ug@mail.gmail.com,
> you quoted 210k TPS when you stubbed out AcceptInvalidationMessages().  Is it
> correct to conclude that AcceptInvalidationMessages() still reduces the
> transaction rate by 5-10% with this stack of patches?

Good question - I have not tested.

One idea I just had... if we use a 64-bit counter for maxMsgNum, maybe
we could make AcceptInvalidationMessages() a macro, something like
this:

if (MyProcState->nextMsgNum != shmInvalState->maxMsgNum)
    ReallyAcceptInvalidationMessages();

That ought to be extremely cheap and - if we use 64-bit counters for
the message-number counters - safe.  You might object that the load of
maxMsgNum might migrate backward, but it can't possibly back up any
further than the preceding lock acquisition, since that's required to
be a full memory barrier on every architecture.  And if we haven't
acquired a relevant lock, then a relevant sinval message could show up
an instant after we check, regardless of the implementation.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: sinval synchronization considered harmful

From
Florian Pflug
Date:
On Jul21, 2011, at 21:15 , Robert Haas wrote:
> On Thu, Jul 21, 2011 at 2:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> ... On these machines, you need to issue an explicit memory barrier
>>> instruction at each sequence point, or just acquire and release a
>>> spinlock.
>> 
>> Right, and the reason that a spinlock fixes it is that we have memory
>> barrier instructions built into the spinlock code sequences on machines
>> where it matters.
>> 
>> To get to the point where we could do the sort of optimization Robert
>> is talking about, someone will have to build suitable primitives for
>> all the platforms we support.  In the cases where we use gcc ASM in
>> s_lock.h, it shouldn't be too hard to pull out the barrier
>> instruction(s) ... but on platforms where we rely on OS-supplied
>> functions, some research is going to be needed.
> 
> Yeah, although falling back to SpinLockAcquire() and SpinLockRelease()
> on a backend-private slock_t should work anywhere that PostgreSQL
> works at all[1]. That will probably be slower than a memory fence
> instruction and certainly slower than a compiler barrier, but the
> point is that - right now - we're doing it the slow way everywhere.

As I discovered while playing with various lockless algorithms to
improve our LWLocks, spin locks aren't actually a replacement for
a (full) barrier.

Lock acquisition only really needs to guarantee that loads and stores
which come after the acquisition operation in program order (i.e., in
the instruction stream) aren't globally visible before that operation
completes. This kind of barrier behaviour is often fittingly called
"acquire barrier".

Similarly, a lock release operation only needs to guarantee that loads
and stores which occur before that operation in program order are
globally visible before the release operation completes. This, again,
is fittingly called "release barrier".

Now assume the following code fragment

global1 = 1;
SpinLockAcquire();
SpinLockRelease();
global2 = 1;

If SpinLockAcquire() has "acquire barrier" semantics, and SpinLockRelease()
has "release barrier" sematics, the it's possible for the store to global1
to be delayed until after SpinLockAcquire(), and similarly for the store
to global2 to be executed before SpinLockRelease() completes. In other
words, what happens is

SpinLockAcquire();
global1 = 1;
global2 = 1;
SpinLockRelease();

But once that can happen, there's no reason that it couldn't also be

SpinLockAcquire();
global2 = 1;
global1 = 1;
SpinLockRelease();

I didn't check if any of our spin lock implementations is actually affected
by this, but it doesn't seem wise to rely on them being full barriers, even
if it may be true today.

best regards,
Florian Pflug



Re: sinval synchronization considered harmful

From
Dan Ports
Date:
On Thu, Jul 21, 2011 at 02:31:15PM -0400, Robert Haas wrote:
> 1. Machines with strong memory ordering.  On this category of machines
> (which include x86), the CPU basically does not perform loads or
> stores out of order.  On some of these machines, it is apparently
> possible for there to be some ordering of stores relative to loads,
> but if the program stores two values or loads two values, those
> operations will performed in the same order they appear in the
> program. 

This is all correct, but...

> The main thing you need to make your code work reliably on
> these machines is a primitive that keeps the compiler from reordering
> your code during optimization. 

If you're suggesting that hardware memory barriers aren't going to be
needed to implement lock-free code on x86, that isn't true. Because a
read can be reordered with respect to a write to a different memory
location, you can still have problems. So you do still need memory
barriers, just fewer of them.

Dekker's algorithm is the classic example: two threads each set a flag
and then check whether the other thread's flag is set. In any
sequential execution, at least one should see the other's flag set, but
on the x86 that doesn't always happen. One thread's read might be
reordered before its write.
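
In code, the problematic pattern looks roughly like this (an
illustrative sketch only, not something from the patches under
discussion):

    volatile int flag[2];

    void try_enter(int me)
    {
        int other = 1 - me;

        flag[me] = 1;           /* my store ... */
        /* Without an mfence (or a locked instruction) here, x86 may
         * satisfy the load below before the store above is globally
         * visible ... */
        if (flag[other] == 0)
        {
            /* ... so both threads can end up in here at once. */
        }
    }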

> 2. Machines with weak memory ordering.  On this category of machines
> (which includes PowerPC, Dec Alpha, and maybe some others), the CPU
> reorders memory accesses arbitrarily unless you explicitly issue
> instructions that enforce synchronization.  You still need to keep the
> compiler from moving things around, too.  Alpha is particularly
> pernicious, because something like a->b can fetch the pointed-to value
> before loading the pointer itself.  This is otherwise known as "we
> have basically no cache coherency circuits on this chip at all".  On
> these machines, you need to issue an explicit memory barrier
> instruction at each sequence point, or just acquire and release a
> spinlock.

The Alpha is pretty much unique (thankfully!) in allowing dependent
reads to be reordered. That makes it even weaker than the typical
weak-ordering machine. Since reading a pointer and then dereferencing
it is a pretty reasonable thing to do regularly in RCU code, you
probably don't want to emit barriers in between on architectures where
it's not actually necessary. That argues for another operation that's
defined to be a barrier (mb) on the Alpha but a no-op elsewhere.
Certainly the Linux kernel found it useful to do so
(read_barrier_depends).

Alternatively, one might question how important it is to support the
Alpha these days...

Dan

-- 
Dan R. K. Ports              MIT CSAIL                http://drkp.net/


Re: sinval synchronization considered harmful

From
Noah Misch
Date:
On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote:
> Profiling this combination of patches reveals that there is still some
> pretty ugly spinlock contention on sinval's msgNumLock.  And it occurs
> to me that on x86, we really don't need this lock ... or
> SInvalReadLock ... or a per-backend mutex.  The whole of
> SIGetDataEntries() can pretty easily be made lock-free.  The only real
> changes that seem to be are needed are (1) to use a 64-bit counter, so
> you never need to decrement

On second thought, won't this be inadequate on 32-bit systems, where updating
the 64-bit counter produces two stores?  You must avoid reading it between those
stores.
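
For illustration (a standalone sketch, not code from the patch), the
hazard is essentially this:

    #include <stdint.h>

    volatile uint64_t maxMsgNum;    /* shared, read without a lock */

    void writer(void)
    {
        /* On a 32-bit target this is typically two 32-bit stores, e.g.
         * stepping from 0x00000000FFFFFFFF to 0x0000000100000000. */
        maxMsgNum++;
    }

    uint64_t reader(void)
    {
        /* Running between those two word-sized stores, this can observe
         * 0x00000001FFFFFFFF or 0x0000000000000000 - values that were
         * never written. */
        return maxMsgNum;
    }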

-- 
Noah Misch                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: sinval synchronization considered harmful

From
Florian Pflug
Date:
On Jul21, 2011, at 03:46 , Robert Haas wrote:
> Profiling this combination of patches reveals that there is still some
> pretty ugly spinlock contention on sinval's msgNumLock.  And it occurs
> to me that on x86, we really don't need this lock ... or
> SInvalReadLock ... or a per-backend mutex.  The whole of
> SIGetDataEntries() can pretty easily be made lock-free.  The only real
> changes that seem to be are needed are (1) to use a 64-bit counter, so
> you never need to decrement and (2) to recheck resetState after
> reading the entries from the queue, to see if we got reset while we
> were reading those entries.  Since x86 guarantees that writes will
> become visible in the order they are executed, we only need to make
> sure that the compiler doesn't rearrange things.  As long as we first
> read the maxMsgNum and then read the messages, we can't read garbage.
> As long as we read the messages before we check resetState, we will be
> sure to notice if we got reset before we read all the messages (which
> is the only way that we can have read garbage messages).

Sounds sensible. There's one additional hazard though - you'll also
need the reads to be atomic. x86 guarantees that for loads of up to
32 bits (i386) or 64 bits (x64), but only for reads from properly
aligned addresses (4-byte alignment for 4-byte reads, 8-byte alignment
for 8-byte reads).

I found that out the hard way a few days ago, again while playing with
different LWLock implementations, when I botched my test setup and
the proc array entries ended up being misaligned. Boy, was it fun
to debug the random crashes caused by non-atomic pointer reads...

If we widen the counter to 64-bit, reading it atomically becomes a bit
of a challenge on i386, but is still doable. From what I remember,
there are two options. You can use the 8-byte compare-and-exchange
operation, but it might be that only quite recent CPUs support it. The
other option seems to be to use floating-point instructions. I believe
the latter is what Intel's own Threading Building Blocks library does,
but I'd have to re-check to be sure. It might also be that, once you
start using floating-point instructions, you find that you actually do
need fencing instructions even on x86. Dunno if the weaker ordering
affects only SIMD instructions or all floating-point stuff...
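
For what it's worth, the compare-and-exchange route might look roughly
like this with gcc builtins (assuming a CPU and -march that provide
cmpxchg8b; just a sketch):

    #include <stdint.h>

    static inline uint64_t
    read_u64_atomic(volatile uint64_t *p)
    {
        /* Compare-and-swap 0 with 0: if *p happens to be 0 it is
         * replaced by 0, otherwise nothing changes; either way the
         * returned old value is an atomic 8-byte snapshot of *p. */
        return __sync_val_compare_and_swap(p, (uint64_t) 0, (uint64_t) 0);
    }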

best regards,
Florian Pflug



Re: sinval synchronization considered harmful

From
Robert Haas
Date:
On Thu, Jul 21, 2011 at 6:22 PM, Florian Pflug <fgp@phlo.org> wrote:
> On Jul21, 2011, at 21:15 , Robert Haas wrote:
>> On Thu, Jul 21, 2011 at 2:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Robert Haas <robertmhaas@gmail.com> writes:
>>>> ... On these machines, you need to issue an explicit memory barrier
>>>> instruction at each sequence point, or just acquire and release a
>>>> spinlock.
>>>
>>> Right, and the reason that a spinlock fixes it is that we have memory
>>> barrier instructions built into the spinlock code sequences on machines
>>> where it matters.
>>>
>>> To get to the point where we could do the sort of optimization Robert
>>> is talking about, someone will have to build suitable primitives for
>>> all the platforms we support.  In the cases where we use gcc ASM in
>>> s_lock.h, it shouldn't be too hard to pull out the barrier
>>> instruction(s) ... but on platforms where we rely on OS-supplied
>>> functions, some research is going to be needed.
>>
>> Yeah, although falling back to SpinLockAcquire() and SpinLockRelease()
>> on a backend-private slock_t should work anywhere that PostgreSQL
>> works at all[1]. That will probably be slower than a memory fence
>> instruction and certainly slower than a compiler barrier, but the
>> point is that - right now - we're doing it the slow way everywhere.
>
> As I discovered while playing with various lockless algorithms to
> improve our LWLocks, spin locks aren't actually a replacement for
> a (full) barrier.
>
> Lock acquisition only really needs to guarantee that loads and stores
> which come after the acquisition operation in program order (i.e., in
> the instruction stream) aren't globally visible before that operation
> completes. This kind of barrier behaviour is often fittingly called
> "acquire barrier".
>
> Similarly, a lock release operation only needs to guarantee that loads
> and stores which occur before that operation in program order are
> globally visible before the release operation completes. This, again,
> is fittingly called "release barrier".
>
> Now assume the following code fragment
>
> global1 = 1;
> SpinLockAcquire();
> SpinLockRelease();
> global2 = 1;
>
> If SpinLockAcquire() has "acquire barrier" semantics, and SpinLockRelease()
> has "release barrier" sematics, the it's possible for the store to global1
> to be delayed until after SpinLockAcquire(), and similarly for the store
> to global2 to be executed before SpinLockRelease() completes. In other
> words, what happens is
>
> SpinLockAcquire();
> global1 = 1;
> global2 = 1;
> SpinLockRelease();
>
> But once that can happens, there's no reason that it couldn't also be
>
> SpinLockAcquire();
> global2 = 1;
> global1 = 1;
> SpinLockRelease();
>
> I didn't check if any of our spin lock implementations is actually affected
> by this, but it doesn't seem wise to rely on them being full barriers, even
> if it may be true today.

Hmm.  I'm not worried about that.  AFAIK, only IA64 has such an
implementation, and our existing spinlock implementation doesn't use
it.  If we were to add something like that in the future, we'd
presumably know that we were doing it, and would add the appropriate
memory barrier primitive at the same time.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: sinval synchronization considered harmful

From
Robert Haas
Date:
On Thu, Jul 21, 2011 at 6:44 PM, Dan Ports <drkp@csail.mit.edu> wrote:
> If you're suggesting that hardware memory barriers aren't going to be
> needed to implement lock-free code on x86, that isn't true. Because a
> read can be reordered with respect to a write to a different memory
> location, you can still have problems. So you do still need memory
> barriers, just fewer of them.
>
> Dekker's algorithm is the classic example: two threads each set a flag
> and then check whether the other thread's flag is set. In any
> sequential execution, at least one should see the other's flag set, but
> on the x86 that doesn't always happen. One thread's read might be
> reordered before its write.

In the case of sinval, what we need to do for SIGetDataEntries() is,
approximately, a bunch of loads, followed by a store to one of the
locations we loaded (which no one else can have written meanwhile).
So I think that's OK.

In SIInsertDataEntries(), what we need to do is, approximately, take a
lwlock, load from a location which can only be written while holding
the lwlock, do a bunch of stores, ending with a store to that first
location, and release the lwlock.  I think that's OK, too.
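
In other words, the write side amounts to something like this fragment
(names approximate, details glossed over; the LWLock provides the
fencing at either end):

    LWLockAcquire(SInvalWriteLock, LW_EXCLUSIVE);
    max = segP->maxMsgNum;          /* only changes while this lock is held */
    for (i = 0; i < n; i++)
        segP->buffer[(max + i) % MAXNUMMESSAGES] = msgs[i];
    segP->maxMsgNum = max + n;      /* publish the new head last of all */
    LWLockRelease(SInvalWriteLock);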

>> 2. Machines with weak memory ordering.  On this category of machines
>> (which includes PowerPC, Dec Alpha, and maybe some others), the CPU
>> reorders memory accesses arbitrarily unless you explicitly issue
>> instructions that enforce synchronization.  You still need to keep the
>> compiler from moving things around, too.  Alpha is particularly
>> pernicious, because something like a->b can fetch the pointed-to value
>> before loading the pointer itself.  This is otherwise known as "we
>> have basically no cache coherency circuits on this chip at all".  On
>> these machines, you need to issue an explicit memory barrier
>> instruction at each sequence point, or just acquire and release a
>> spinlock.
>
> The Alpha is pretty much unique (thankfully!) in allowing dependent
> reads to be reordered. That makes it even weaker than the typical
> weak-ordering machine. Since reading a pointer and then dereferencing
> it is a pretty reasonable thing to do regularly in RCU code, you
> probably don't want to emit barriers in between on architectures where
> it's not actually necessary. That argues for another operation that's
> defined to be a barrier (mb) on the Alpha but a no-op elsewhere.
> Certainly the Linux kernel found it useful to do so
> (read_barrier_depends)
>
> Alternatively, one might question how important it is to support the
> Alpha these days...

Well, currently, we do, so we probably don't want to drop support for
that without some careful thought.  I searched the archive and found
someone trying to compile 8.3.something on Alpha just a few years ago,
so it's apparently not totally dead yet.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: sinval synchronization considered harmful

From
Robert Haas
Date:
On Thu, Jul 21, 2011 at 6:43 PM, Noah Misch <noah@2ndquadrant.com> wrote:
> On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote:
>> Profiling this combination of patches reveals that there is still some
>> pretty ugly spinlock contention on sinval's msgNumLock.  And it occurs
>> to me that on x86, we really don't need this lock ... or
>> SInvalReadLock ... or a per-backend mutex.  The whole of
>> SIGetDataEntries() can pretty easily be made lock-free.  The only real
>> changes that seem to be are needed are (1) to use a 64-bit counter, so
>> you never need to decrement
>
> On second thought, won't this be inadequate on 32-bit systems, where updating
> the 64-bit counter produces two stores?  You must avoid reading it between those
> stores.

Now that is a potentially big problem.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: sinval synchronization considered harmful

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Jul 21, 2011 at 6:43 PM, Noah Misch <noah@2ndquadrant.com> wrote:
>> On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote:
>>> SIGetDataEntries() can pretty easily be made lock-free.  The only real
>>> changes that seem to be are needed are (1) to use a 64-bit counter, so
>>> you never need to decrement

>> On second thought, won't this be inadequate on 32-bit systems, where updating
>> the 64-bit counter produces two stores?  You must avoid reading it between those stores.

> Now that is a potentially big problem.

Could we do something similar to the xxid hacks?  That is, we have a lot
of counters that should be fairly close to each other, so we store only
the low-order 32 bits of each notional value, and separately maintain a
common high-order word.  You probably would need some additional
overhead each time the high-order word bumps, but that's reasonably
infrequent.
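
For illustration, a read might then look something like this sketch
(invented field names, and the handling of the high-word bump is left
out entirely):

    uint32  high = segP->msgNumHighWord;    /* common high-order word */
    uint32  low  = segP->maxMsgNumLow;      /* 32-bit low word of this counter */
    uint64  maxMsgNum = ((uint64) high << 32) | low;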
        regards, tom lane


Re: sinval synchronization considered harmful

From
Robert Haas
Date:
On Thu, Jul 21, 2011 at 9:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Jul 21, 2011 at 6:43 PM, Noah Misch <noah@2ndquadrant.com> wrote:
>>> On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote:
>>>> SIGetDataEntries() can pretty easily be made lock-free.  The only real
>>>> changes that seem to be are needed are (1) to use a 64-bit counter, so
>>>> you never need to decrement
>
>>> On second thought, won't this be inadequate on 32-bit systems, where updating
>>> the 64-bit counter produces two stores?  You must avoid reading it between those stores.
>
>> Now that is a potentially big problem.
>
> Could we do something similar to the xxid hacks?  That is, we have a lot
> of counters that should be fairly close to each other, so we store only
> the low-order 32 bits of each notional value, and separately maintain a
> common high-order word.  You probably would need some additional
> overhead each time the high-order word bumps, but that's reasonably
> infrequent.

Well, the trouble is figuring out what that additional overhead needs
to look like.  I think I have a simpler idea, though:
before acquiring any locks, just have SIGetDataEntries() do this:

+       if (stateP->nextMsgNum == segP->maxMsgNum && !stateP->resetState)
+               return 0;

Patch (with comment explaining why I think this is OK) attached.  If
the message numbers happen to be equal only because the counter has
wrapped, then stateP->resetState will be true, so we'll still realize
we need to do some work.

Test results, with the lazy vxid patch plus this patch, at 8 clients:

tps = 34028.144439 (including connections establishing)
tps = 34079.085935 (including connections establishing)
tps = 34125.295938 (including connections establishing)

And at 32 clients:

tps = 185521.605364 (including connections establishing)
tps = 188250.700451 (including connections establishing)
tps = 186077.847215 (including connections establishing)

And at 80 clients:

tps = 188568.886569 (including connections establishing)
tps = 191035.971512 (including connections establishing)
tps = 189363.019377 (including connections establishing)

Not quite as good as the unlocked version, but better than the
per-backend mutex, and a whole lot simpler.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: sinval synchronization considered harmful

From
Noah Misch
Date:
On Thu, Jul 21, 2011 at 11:37:27PM -0400, Robert Haas wrote:
> I think I have a simpler idea, though:
> before acquiring any locks, just have SIGetDataEntries() do this:
> 
> +       if (stateP->nextMsgNum == segP->maxMsgNum && !stateP->resetState)
> +               return 0;
> 
> Patch (with comment explaining why I think this is OK) attached.  If
> the message numbers happen to be equal only because the counter has
> wrapped, then stateP->resetState will be true, so we'll still realize
> we need to do some work.

This is attractive, and I don't see any problems with it.  (In theory, you could
hit a case where the load of resetState gives an ancient "false" just as the
counters wrap to match.  Given that the wrap interval is 1000000x as long as the
reset interval, I'm not worried about problems on actual silicon.)

+1 for doing this and moving on.

-- 
Noah Misch                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: sinval synchronization considered harmful

From
Robert Haas
Date:
On Fri, Jul 22, 2011 at 3:28 PM, Noah Misch <noah@2ndquadrant.com> wrote:
> On Thu, Jul 21, 2011 at 11:37:27PM -0400, Robert Haas wrote:
>> I think I have a simpler idea, though:
>> before acquiring any locks, just have SIGetDataEntries() do this:
>>
>> +       if (stateP->nextMsgNum == segP->maxMsgNum && !stateP->resetState)
>> +               return 0;
>>
>> Patch (with comment explaining why I think this is OK) attached.  If
>> the message numbers happen to be equal only because the counter has
>> wrapped, then stateP->resetState will be true, so we'll still realize
>> we need to do some work.
>
> This is attractive, and I don't see any problems with it.  (In theory, you could
> hit a case where the load of resetState gives an ancient "false" just as the
> counters wrap to match.  Given that the wrap interval is 1000000x as long as the
> reset interval, I'm not worried about problems on actual silicon.)

It's actually 262,144 times as long - see MSGNUMWRAPAROUND.

It would be pretty easy to eliminate even the theoretical possibility
of a race by getting rid of resetState altogether and using nextMsgNum
= -1 to mean that.  Maybe I should go ahead and do that.

> +1 for doing this and moving on.

Yeah, I think I'll go ahead and commit something along these lines if
no one objects.  We can always fine-tune it more if needed (but it
probably isn't).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: sinval synchronization considered harmful

From
Noah Misch
Date:
On Fri, Jul 22, 2011 at 03:54:03PM -0400, Robert Haas wrote:
> On Fri, Jul 22, 2011 at 3:28 PM, Noah Misch <noah@2ndquadrant.com> wrote:
> > This is attractive, and I don't see any problems with it.  (In theory, you could
> > hit a case where the load of resetState gives an ancient "false" just as the
> > counters wrap to match.  Given that the wrap interval is 1000000x as long as the
> > reset interval, I'm not worried about problems on actual silicon.)
> 
> It's actually 262,144 times as long - see MSGNUMWRAPAROUND.

Ah, so it is.

> It would be pretty easy to eliminate even the theoretical possibility
> of a race by getting rid of resetState altogether and using nextMsgNum
> = -1 to mean that.  Maybe I should go ahead and do that.

Seems like a nice simplification.

-- 
Noah Misch                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: sinval synchronization considered harmful

From
Robert Haas
Date:
On Mon, Jul 25, 2011 at 6:24 PM, Noah Misch <noah@2ndquadrant.com> wrote:
> On Fri, Jul 22, 2011 at 03:54:03PM -0400, Robert Haas wrote:
>> On Fri, Jul 22, 2011 at 3:28 PM, Noah Misch <noah@2ndquadrant.com> wrote:
>> > This is attractive, and I don't see any problems with it.  (In theory, you could
>> > hit a case where the load of resetState gives an ancient "false" just as the
>> > counters wrap to match.  Given that the wrap interval is 1000000x as long as the
>> > reset interval, I'm not worried about problems on actual silicon.)
>>
>> It's actually 262,144 times as long - see MSGNUMWRAPAROUND.
>
> Ah, so it is.
>
>> It would be pretty easy to eliminate even the theoretical possibility
>> of a race by getting rid of resetState altogether and using nextMsgNum
>> = -1 to mean that.  Maybe I should go ahead and do that.
>
> Seems like a nice simplification.

On further reflection, I don't see that this helps: it just moves the
problem around.  With resetState as a separate variable, nextMsgNum is
never changed by anyone other than the owner, so we can never have a
stale load.  But if we overload nextMsgNum to also indicate whether
our state has been reset, then there's a race between when we load
nextMsgNum and when we load maxMsgNum (instead of the code I posted
previously, which has a race between when we load resetState and when
we load maxMsgNum).  Now, as you say, it seems really, really
difficult to hit that in practice, but I don't see a way of getting
rid of the theoretical possibility without either (1) a spinlock or
(2) a fence.  (Of course, on x86, the fence could be optimized down to
a compiler barrier.)  I guess the question is "should we worry about
that?".

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: sinval synchronization considered harmful

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On further reflection, I don't see that this helps: it just moves the
> problem around.  With resetState as a separate variable, nextMsgNum is
> never changed by anyone other than the owner, so we can never have a
> stale load.  But if we overload nextMsgNum to also indicate whether
> our state has been reset, then there's a race between when we load
> nextMsgNum and when we load maxMsgNum (instead of code I posted
> previously, which has a race between when we load resetState and when
> we load maxMsgNum).  Now, as you say, it seems really, really
> difficult to hit that in practice, but I don't see a way of getting
> rid of the theoretical possibility without either (1) a spinlock or
> (2) a fence.  (Of course, on x86, the fence could be optimized down to
> a compiler barrier.)  I guess the question is "should we worry about
> that?".

Uh, yes.  I've lost count of the number of times I've seen people hit
one-instruction-wide race condition windows, SIGSEGV crash based on
accessing only one byte past the valid data structure, etc etc.
The worst thing about this type of bug is that you can't reproduce the
failure when you try; doesn't mean your users won't see it.
        regards, tom lane


Re: sinval synchronization considered harmful

From
Alvaro Herrera
Date:
Excerpts from Tom Lane's message of mar jul 26 13:43:08 -0400 2011:

> Uh, yes.  I've lost count of the number of times I've seen people hit
> one-instruction-wide race condition windows, SIGSEGV crash based on
> accessing only one byte past the valid data structure, etc etc.

I think you need a better locking protocol on that counter of yours.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: sinval synchronization considered harmful

From
Simon Riggs
Date:
On Tue, Jul 26, 2011 at 6:36 PM, Robert Haas <robertmhaas@gmail.com> wrote:

> Now, as you say, it seems really, really
> difficult to hit that in practice, but I don't see a way of getting
> rid of the theoretical possibility without either (1) a spinlock or
> (2) a fence.  (Of course, on x86, the fence could be optimized down to
> a compiler barrier.)  I guess the question is "should we worry about
> that?".

Perhaps the answer lies in a different direction altogether?

Let me ask a few questions to stimulate a different solution

* Can we do this using an active technique (e.g. signals) rather than
a passive one (reading a counter?)

* Can we partition the sinval lock, so we have multiple copies? That
increases the task for those who trigger an invalidation, but will
relieve the pressure for most readers.

* Can we put the sinval info in a different place? e.g. inside each
lock partition.

* Why do we have a different mechanism for cache invalidation
internally (sinval) to the one we offer externally (LISTEN/NOTIFY)?
Why don't we have just one?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: sinval synchronization considered harmful

From
Alvaro Herrera
Date:
Excerpts from Simon Riggs's message of mar jul 26 14:11:19 -0400 2011:

> Let me ask a few questions to stimulate a different solution
> 
> * Can we do this using an active technique (e.g. signals) rather than
> a passive one (reading a counter?)

Signals are already in use for special cases (queue is full), and I
think going through the kernel to achieve much more will lower
performance significantly.

> * Can we partition the sinval lock, so we have multiple copies? That
> increases the task for those who trigger an invalidation, but will
> relieve the pressure for most readers.

Not sure there's a way to meaningfully partition the queue.  In any
case, I think the problem being dealt with here is how to update the
read heads of the queue, not its contents.

> * Can we put the sinval info in a different place? e.g. inside each
> lock partition.

I don't think you can relate sinval messages to particular locks, which
would make this infeasible.  I think this bit is directly related to the
point above.

> * Why do we have a different mechanism for cache invalidation
> internally (sinval) to the one we offer externally (LISTEN/NOTIFY)?
> Why don't we have just one?

Performance.  Not sure there are other reasons as well; but just this
one is significant enough.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: sinval synchronization considered harmful

From
Simon Riggs
Date:
On Tue, Jul 26, 2011 at 7:24 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
> Excerpts from Simon Riggs's message of mar jul 26 14:11:19 -0400 2011:
>
>> Let me ask a few questions to stimulate a different solution
>>
>> * Can we do this using an active technique (e.g. signals) rather than
>> a passive one (reading a counter?)
>
> Signals are already in use for special cases (queue is full), and I
> think going through the kernel to achieve much more will lower
> performance significantly.

If there are no invalidations, there would be no signals. How would
zero signals decrease performance?


>> * Can we partition the sinval lock, so we have multiple copies? That
>> increases the task for those who trigger an invalidation, but will
>> relieve the pressure for most readers.
>
> Not sure there's a way to meaningfully partition the queue.  In any
> case, I think the problem being dealt with here is how to update the
> read heads of the queue, not its contents.

I agree there's no meaningful way to partition the queue, but we can
store the information in more than one place to reduce the contention
of people requesting it.

Both those ideas are still on the table.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: sinval synchronization considered harmful

From
Robert Haas
Date:
On Tue, Jul 26, 2011 at 2:56 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Tue, Jul 26, 2011 at 7:24 PM, Alvaro Herrera
> <alvherre@commandprompt.com> wrote:
>> Excerpts from Simon Riggs's message of mar jul 26 14:11:19 -0400 2011:
>>
>>> Let me ask a few questions to stimulate a different solution
>>>
>>> * Can we do this using an active technique (e.g. signals) rather than
>>> a passive one (reading a counter?)
>>
>> Signals are already in use for special cases (queue is full), and I
>> think going through the kernel to achieve much more will lower
>> performance significantly.
>
> If there are no invalidations, there would be no signals. How would
> zero signals decrease performance?

It wouldn't, although it might be bad in the case where there are lots
of temp tables being created and dropped.  I think the biggest problem
with signals is that they don't provide any meaningful synchronization
guarantees.  When you send somebody a signal, you don't really know
how long it's going to take for them to receive it.

>>> * Can we partition the sinval lock, so we have multiple copies? That
>>> increases the task for those who trigger an invalidation, but will
>>> relieve the pressure for most readers.
>>
>> Not sure there's a way to meaningfully partition the queue.  In any
>> case, I think the problem being dealt with here is how to update the
>> read heads of the queue, not its contents.
>
> I agree there's no meaningful way to partition the queue, but we can
> store the information in more than one place to reduce the contention
> of people requesting it.

I thought about that.  Basically, that saves you a factor of N in
contention on the read side (where N is the number of copies) and
costs you a factor of N on the write side (you have to update N
copies, taking a spinlock or lwlock for each).  In the limit, you
could do one copy of the counter per backend.

I think, though, that a lock-free implementation using memory barriers
is going to end up being the only real solution.  We might possibly
convince ourselves that we're OK with increasing the cost of
SIInsertDataEntries(), but any solution that involves replicating the
data is still based on doing at least some locking.  And I am pretty
well convinced that even one spinlock acquisition in
SIGetDataEntries() is too many.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: sinval synchronization considered harmful

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Tue, Jul 26, 2011 at 7:24 PM, Alvaro Herrera
> <alvherre@commandprompt.com> wrote:
>> Signals are already in use for special cases (queue is full), and I
>> think going through the kernel to achieve much more will lower
>> performance significantly.

> If there are no invalidations, there would be no signals. How would
> zero signals decrease performance?

But if there *is* an invalidation (which is not a negligible case),
it'd get significantly slower.
        regards, tom lane


Re: sinval synchronization considered harmful

From
Simon Riggs
Date:
On Tue, Jul 26, 2011 at 8:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:

> It wouldn't, although it might be bad in the case where there are lots
> of temp tables being created and dropped.

Do temp tables cause relcache invalidations?

That seems like something we'd want to change in itself.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: sinval synchronization considered harmful

From
Simon Riggs
Date:
On Tue, Jul 26, 2011 at 8:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:

>>>> * Can we partition the sinval lock, so we have multiple copies? That
>>>> increases the task for those who trigger an invalidation, but will
>>>> relieve the pressure for most readers.
>>>
>>> Not sure there's a way to meaningfully partition the queue.  In any
>>> case, I think the problem being dealt with here is how to update the
>>> read heads of the queue, not its contents.
>>
>> I agree there's no meaningful way to partition the queue, but we can
>> store the information in more than one place to reduce the contention
>> of people requesting it.
>
> I thought about that.  Basically, that saves you a factor of N in
> contention on the read side (where N is the number of copies) and
> costs you a factor of N on the write side (you have to update N
> copies, taking a spinlock or lwlock for each).  In the limit, you
> could do one copy of the counter per backend.
>
> I think, though, that a lock-free implementation using memory barriers
> is going to end up being the only real solution.  We might possibly
> convince ourselves that we're OK with increasing the cost of
> SIInsertDataEntries(), but any solution that involves replication the
> data is still based on doing at least some locking.  And I am pretty
> well convinced that even one spinlock acquisition in
> SIInsertDataEntries() is too many.

You might be right, but I think we have little knowledge of how some
memory barrier code you haven't written yet affects performance on
various architectures.

A spinlock per backend would cache very nicely, now you mention it. So
my money would be on the multiple copies.

It's not completely clear to me that updating N copies would be more
expensive. Accessing N low-contention copies rather than one
high-contention value might actually be a win.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: sinval synchronization considered harmful

From
Tom Lane
Date:
Noah Misch <noah@2ndQuadrant.com> writes:
> On Thu, Jul 21, 2011 at 11:37:27PM -0400, Robert Haas wrote:
>> I think I have a simpler idea, though:
>> before acquiring any locks, just have SIGetDataEntries() do this:
>> 
>> +       if (stateP->nextMsgNum == segP->maxMsgNum && !stateP->resetState)
>> +               return 0;
>> 
>> Patch (with comment explaining why I think this is OK) attached.  If
>> the message numbers happen to be equal only because the counter has
>> wrapped, then stateP->resetState will be true, so we'll still realize
>> we need to do some work.

> This is attractive, and I don't see any problems with it.  (In theory, you could
> hit a case where the load of resetState gives an ancient "false" just as the
> counters wrap to match.  Given that the wrap interval is 1000000x as long as the
> reset interval, I'm not worried about problems on actual silicon.)

> +1 for doing this and moving on.

After some further reflection I believe this patch actually is pretty
safe, although Noah's explanation of why seems a bit confused.  The key
points are that (1) each of the reads is atomic, and (2) we should not
be able to see a value that is older than our last acquisition/release
of a shared memory lock.  These being granted, we will make a decision
that is correct, or at least was correct as of some time after that last
lock action.  As stated in the patch comments, we are not required to
promptly assimilate sinval actions on objects we don't hold any lock on,
so this seems sufficient.  In essence, we are relying on an assumed
prior visit to the lock manager to provide memory access synchronization
for the sinval no-work-to-do test.
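
(Concretely, the unlocked test being relied on has roughly this shape; a
stand-alone sketch with stand-in types, not the actual sinvaladt.c code:)

#include <stdbool.h>

/* Stand-ins for the relevant fields of the shared sinval structures;
 * the real definitions live in sinvaladt.c. */
typedef struct
{
    int         nextMsgNum;     /* next message number this backend reads */
    bool        resetState;     /* true if this backend fell too far behind */
} ProcStateSketch;

typedef struct
{
    int         maxMsgNum;      /* next message number to be assigned */
} SISegSketch;

/*
 * Unlocked no-work-to-do test.  Both loads are assumed atomic; the memory
 * synchronization comes from the caller's most recent lock
 * acquisition/release, per the reasoning above.
 */
static bool
sinval_no_work_to_do(const SISegSketch *segP, const ProcStateSketch *stateP)
{
    return stateP->nextMsgNum == segP->maxMsgNum && !stateP->resetState;
}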

The case Noah speculates about above amounts to supposing that this
reasoning fails to hold for the resetState value, and I don't see why
that would be true.  Even if the above scenario did manage to happen,
it would not mean that we missed the reset, just that we hadn't noticed
it *yet*.  And by hypothesis, none of the as-yet-not-seen catalog
updates are a problem for us.

BTW, the patch really ought to add something to the comment around
line 100.
        regards, tom lane


Re: sinval synchronization considered harmful

From
Noah Misch
Date:
On Tue, Jul 26, 2011 at 03:40:32PM -0400, Tom Lane wrote:
> Noah Misch <noah@2ndQuadrant.com> writes:
> > On Thu, Jul 21, 2011 at 11:37:27PM -0400, Robert Haas wrote:
> >> I think I have a simpler idea, though:
> >> before acquiring any locks, just have SIGetDataEntries() do this:
> >> 
> >> +       if (stateP->nextMsgNum == segP->maxMsgNum && !stateP->resetState)
> >> +               return 0;
> >> 
> >> Patch (with comment explaining why I think this is OK) attached.  If
> >> the message numbers happen to be equal only because the counter has
> >> wrapped, then stateP->resetState will be true, so we'll still realize
> >> we need to do some work.
> 
> > This is attractive, and I don't see any problems with it.  (In theory, you could
> > hit a case where the load of resetState gives an ancient "false" just as the
> > counters wrap to match.  Given that the wrap interval is 1000000x as long as the
> > reset interval, I'm not worried about problems on actual silicon.)
> 
> > +1 for doing this and moving on.
> 
> After some further reflection I believe this patch actually is pretty
> safe, although Noah's explanation of why seems a bit confused.  The key
> points are that (1) each of the reads is atomic, and (2) we should not
> be able to see a value that is older than our last acquisition/release
> of a shared memory lock.  These being granted, we will make a decision
> that is correct, or at least was correct as of some time after that last
> lock action.  As stated in the patch comments, we are not required to
> promptly assimilate sinval actions on objects we don't hold any lock on,
> so this seems sufficient.  In essence, we are relying on an assumed
> prior visit to the lock manager to provide memory access synchronization
> for the sinval no-work-to-do test.
> 
> The case Noah speculates about above amounts to supposing that this
> reasoning fails to hold for the resetState value, and I don't see why
> that would be true.  Even if the above scenario did manage to happen,
> it would not mean that we missed the reset, just that we hadn't noticed
> it *yet*.  And by hypothesis, none of the as-yet-not-seen catalog
> updates are a problem for us.

Here's the way it can fail:

1. Backend enters SIGetDataEntries() with main memory bearing stateP->resetState
= false, stateP->nextMsgNum = 500, segP->maxMsgNum = 505.  The CPU has those
latest stateP values in cache, but segP->maxMsgNum is *not* in cache.
2. Backend stalls for <long time>.  Meanwhile, other backends issue
MSGNUMWRAPAROUND - 5 invalidation messages.  Main memory bears
stateP->resetState = true, stateP->nextMsgNum = 500 - MSGNUMWRAPAROUND,
segP->maxMsgNum = 500.
3. Backend wakes up, uses its cached stateP values and reads segP->maxMsgNum =
500 from main memory.  The patch's test finds no need to reset or process
invalidation messages.

That's the theoretical risk I wished to illustrate.  Though this appears
possible on an abstract x86_64 system, I think it's unrealistic to suppose that
a dirty cache line could persist *throughout* the issuance of more than 10^9
invalidation messages on a concrete implementation.

A way to think about the problem is that our read of segP->maxMsgNum is too new.
If we had used a snapshot of values as of the most recent lock
acquisition/release, there would be no problem.  It's the mix of a new-enough
stateP with an all-too-new segP that yields the anomaly.

-- 
Noah Misch                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: sinval synchronization considered harmful

From
Noah Misch
Date:
On Tue, Jul 26, 2011 at 01:36:26PM -0400, Robert Haas wrote:
> On Mon, Jul 25, 2011 at 6:24 PM, Noah Misch <noah@2ndquadrant.com> wrote:
> > On Fri, Jul 22, 2011 at 03:54:03PM -0400, Robert Haas wrote:
> >> On Fri, Jul 22, 2011 at 3:28 PM, Noah Misch <noah@2ndquadrant.com> wrote:
> >> > This is attractive, and I don't see any problems with it.  (In theory, you could
> >> > hit a case where the load of resetState gives an ancient "false" just as the
> >> > counters wrap to match.  Given that the wrap interval is 1000000x as long as the
> >> > reset interval, I'm not worried about problems on actual silicon.)
> >>
> >> It's actually 262,144 times as long - see MSGNUMWRAPAROUND.
> >
> > Ah, so it is.
> >
> >> It would be pretty easy to eliminate even the theoretical possibility
> >> of a race by getting rid of resetState altogether and using nextMsgNum
> >> = -1 to mean that.  Maybe I should go ahead and do that.
> >
> > Seems like a nice simplification.
> 
> On further reflection, I don't see that this helps: it just moves the
> problem around.

Alas, yes.

> With resetState as a separate variable, nextMsgNum is
> never changed by anyone other than the owner, so we can never have a
> stale load.

It's also changed when SICleanupQueue() decides to wrap everyone.  This could
probably be eliminated by using uint32 and letting overflow take care of wrap
implicitly, but ...
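
(For illustration, the implicit-wrap arithmetic would amount to something
like this hypothetical helper; unsigned subtraction is modular, so nobody
ever needs to renumber the counters:)

#include <stdint.h>

/*
 * Hypothetical helper, not from any posted patch: with uint32 message
 * numbers, "how far behind am I?" is just modular subtraction, which
 * stays well-defined across wraparound.
 */
static inline uint32_t
sinval_msgs_pending(uint32_t nextMsgNum, uint32_t maxMsgNum)
{
    return maxMsgNum - nextMsgNum;      /* modular, hence wrap-safe */
}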

> But if we overload nextMsgNum to also indicate whether
> our state has been reset, then there's a race between when we load
> nextMsgNum and when we load maxMsgNum (instead of code I posted
> previously, which has a race between when we load resetState and when
> we load maxMsgNum).  Now, as you say, it seems really, really
> difficult to hit that in practice, but I don't see a way of getting
> rid of the theoretical possibility without either (1) a spinlock or
> (2) a fence.  (Of course, on x86, the fence could be optimized down to
> a compiler barrier.)

No new ideas come to mind, here.  We could migrate back toward your original
proposal of making the counter a non-wrapping uint64; Florian described some
nice techniques for doing atomic 64-bit reads on x86:

http://archives.postgresql.org/message-id/7C94C386-122F-4918-8624-A14352E8DBC5@phlo.org

I'm not personally acquainted with those approaches, but they sound promising.
Since the overlap between 32-bit installations and performance-sensitive
installations is all but gone, we could even just use a spinlock under 32-bit.

> I guess the question is "should we worry about that?".

I'd lean toward "no".  I share Tom's unease about writing off a race condition
as being too improbable, but this is quite exceptionally improbable.  On the
other hand, some of the fixes don't look too invasive.

-- 
Noah Misch                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: sinval synchronization considered harmful

From
Tom Lane
Date:
Noah Misch <noah@2ndQuadrant.com> writes:
> On Tue, Jul 26, 2011 at 03:40:32PM -0400, Tom Lane wrote:
>> After some further reflection I believe this patch actually is pretty
>> safe, although Noah's explanation of why seems a bit confused.

> Here's the way it can fail:

> 1. Backend enters SIGetDataEntries() with main memory bearing stateP->resetState
> = false, stateP->nextMsgNum = 500, segP->maxMsgNum = 505.  The CPU has those
> latest stateP values in cache, but segP->maxMsgNum is *not* in cache.
> 2. Backend stalls for <long time>.  Meanwhile, other backends issue
> MSGNUMWRAPAROUND - 5 invalidation messages.  Main memory bears
> stateP->resetState = true, stateP->nextMsgNum = 500 - MSGNUMWRAPAROUND,
> segP->maxMsgNum = 500.
> 3. Backend wakes up, uses its cached stateP values and reads segP->maxMsgNum =
> 500 from main memory.  The patch's test finds no need to reset or process
> invalidation messages.

[ squint... ]  Hmm, you're right.  The case where this would break
things is if (some of) the five unprocessed messages relate to some
object we've just locked.  But the initial state you describe would be
valid right after obtaining such a lock.

> That's the theoretical risk I wished to illustrate.  Though this appears
> possible on an abstract x86_64 system, I think it's unrealistic to suppose that
> a dirty cache line could persist *throughout* the issuance of more than 10^9
> invalidation messages on a concrete implementation.

Dirty cache line, maybe not, but what if the assembly code commands the
CPU to load those variables into CPU registers before doing the
comparison?  If they're loaded with maxMsgNum coming in last (or at
least after resetState), I think you can have the problem without any
assumptions about cache line behavior at all.  You just need the process
to lose the CPU at the right time.

If we marked the pointers volatile, we could probably ensure that the
assembly code tests resetState last, but that's not sufficient to avoid
the stale-cache-line risk.
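
(For illustration only, a sketch of what the volatile form would look like;
it constrains the order of the compiler-generated loads, but does nothing
about a stale cache line:)

#include <stdbool.h>

/*
 * Sketch: accesses to volatile objects cannot be reordered against each
 * other by the compiler, so this form guarantees the message numbers are
 * read before resetState is examined.
 */
static bool
no_work_to_do(volatile int *nextMsgNum, volatile int *maxMsgNum,
              volatile bool *resetState)
{
    int     next = *nextMsgNum;
    int     max = *maxMsgNum;

    /* resetState is tested last, after both message numbers were read */
    return next == max && !*resetState;
}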
        regards, tom lane


Re: sinval synchronization considered harmful

From
Noah Misch
Date:
On Tue, Jul 26, 2011 at 05:05:15PM -0400, Tom Lane wrote:
> Noah Misch <noah@2ndQuadrant.com> writes:
> > That's the theoretical risk I wished to illustrate.  Though this appears
> > possible on an abstract x86_64 system, I think it's unrealistic to suppose that
> > a dirty cache line could persist *throughout* the issuance of more than 10^9
> > invalidation messages on a concrete implementation.
> 
> Dirty cache line, maybe not, but what if the assembly code commands the
> CPU to load those variables into CPU registers before doing the
> comparison?  If they're loaded with maxMsgNum coming in last (or at
> least after resetState), I think you can have the problem without any
> assumptions about cache line behavior at all.  You just need the process
> to lose the CPU at the right time.

True.  If the compiler places the resetState load first, you could hit the
anomaly by "merely" setting a breakpoint on the next instruction, waiting for
exactly MSGNUMWRAPAROUND messages to enqueue, and letting the backend continue.
I think, though, we should either plug that _and_ the cache incoherency case or
worry about neither.

-- 
Noah Misch                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: sinval synchronization considered harmful

From
Tom Lane
Date:
Noah Misch <noah@2ndQuadrant.com> writes:
> On Tue, Jul 26, 2011 at 05:05:15PM -0400, Tom Lane wrote:
>> Dirty cache line, maybe not, but what if the assembly code commands the
>> CPU to load those variables into CPU registers before doing the
>> comparison?  If they're loaded with maxMsgNum coming in last (or at
>> least after resetState), I think you can have the problem without any
>> assumptions about cache line behavior at all.  You just need the process
>> to lose the CPU at the right time.

> True.  If the compiler places the resetState load first, you could hit the
> anomaly by "merely" setting a breakpoint on the next instruction, waiting for
> exactly MSGNUMWRAPAROUND messages to enqueue, and letting the backend continue.
> I think, though, we should either plug that _and_ the cache incoherency case or
> worry about neither.

How do you figure that?  The poor-assembly-code-order risk is both a lot
easier to fix and a lot higher probability.  Admittedly, it's still way
way down there, but you only need a precisely-timed sleep, not a
precisely-timed sleep *and* a cache line that somehow remained stale.
        regards, tom lane


Re: sinval synchronization considered harmful

From
Robert Haas
Date:
On Tue, Jul 26, 2011 at 3:34 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> You might be right, but I think we have little knowledge of how some
> memory barrier code you haven't written yet affects performance on
> various architectures.
>
> A spinlock per backend would cache very nicely, now you mention it. So
> my money would be on the multiple copies.

Maybe so, but you can see from the numbers in my OP that the results
still leave something to be desired.

> It's not completely clear to me that updating N copies would be more
> expensive. Accessing N low contention copies rather than 1
> high-contention value might actually be a win.

Yeah, I haven't tested that approach.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: sinval synchronization considered harmful

From
Robert Haas
Date:
On Tue, Jul 26, 2011 at 3:25 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Tue, Jul 26, 2011 at 8:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> It wouldn't, although it might be bad in the case where there are lots
>> of temp tables being created and dropped.
>
> Do temp tables cause relcache invalidations?
>
> That seems like something we'd want to change in itself.

I agree.  Unfortunately, I think it's a non-trivial fix.

I've also been wondering if we could avoid taking an
AccessExclusiveLock on a newly created (temporary?) table.  It seems
like no one should be able to see it until commit, at which point we'd
be releasing the lock anyway.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: sinval synchronization considered harmful

From
Robert Haas
Date:
On Tue, Jul 26, 2011 at 4:38 PM, Noah Misch <noah@2ndquadrant.com> wrote:
> No new ideas come to mind, here.

OK, I have a new idea.  :-)

1. Add a new flag to each procState called something like "timeToPayAttention".
2. Each call to SIInsertDataEntries() iterates over all the ProcStates
whose index is < lastBackend and sets stateP->timeToPayAttention =
TRUE for each.
3. At the beginning of SIGetDataEntries(), we do an unlocked if
(!stateP->timeToPayAttention) return 0.
4. Immediately following that if statement and before acquiring any
locks, we set stateP->timeToPayAttention = FALSE.

The LWLockRelease() in SIInsertDataEntries() will be sufficient to force
all of the stateP->timeToPayAttention writes out to main memory, and
the read side is OK because we either just took a lock (which acted as
a fence) or else there's a race anyway.  But unlike my previous
proposal, it doesn't involve *comparing* anything.  We needn't worry
about whether we could read two different values that are through
great misfortune out of sync because we're only reading one value.

If, by chance, the value is set to true just after we set it to false,
that won't mess us up either: we'll still read all the messages after
acquiring the lock.

Now, there's some downside to this approach - it involves doing O(N)
work for each invalidation message we receive.  But as long as it's
only O(N) stores and not O(N) lock acquisitions and releases, I think
that might be OK.
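
(To spell it out, here's a stand-alone sketch of steps 2 through 4, with
stand-in types; an illustration of the idea, not actual PostgreSQL code:)

#include <stdbool.h>

#define MAX_BACKENDS_SKETCH 64          /* illustrative bound only */

/* Stand-ins for the shared structures (hypothetical field subset). */
typedef struct
{
    volatile bool   timeToPayAttention;
} ProcStateSketch;

typedef struct
{
    int             lastBackend;
    ProcStateSketch procState[MAX_BACKENDS_SKETCH];
} SISegSketch;

/*
 * Writer side (step 2): after queueing new messages, poke every backend's
 * flag.  These are plain stores; the writer's eventual LWLockRelease() is
 * what pushes them out to main memory.
 */
static void
sinval_notify_all(SISegSketch *segP)
{
    for (int i = 0; i < segP->lastBackend; i++)
        segP->procState[i].timeToPayAttention = true;
}

/*
 * Reader side (steps 3 and 4): unlocked early exit, then clear our own
 * flag before acquiring any locks.  If a writer sets the flag again just
 * after we clear it, we still read its messages once we take the lock.
 */
static bool
sinval_reader_has_work(ProcStateSketch *stateP)
{
    if (!stateP->timeToPayAttention)
        return false;               /* nothing to do; skip locking */

    stateP->timeToPayAttention = false;
    return true;                    /* go take the lock and read messages */
}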

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: sinval synchronization considered harmful

From
Noah Misch
Date:
On Tue, Jul 26, 2011 at 06:04:16PM -0400, Tom Lane wrote:
> Noah Misch <noah@2ndQuadrant.com> writes:
> > On Tue, Jul 26, 2011 at 05:05:15PM -0400, Tom Lane wrote:
> >> Dirty cache line, maybe not, but what if the assembly code commands the
> >> CPU to load those variables into CPU registers before doing the
> >> comparison?  If they're loaded with maxMsgNum coming in last (or at
> >> least after resetState), I think you can have the problem without any
> >> assumptions about cache line behavior at all.  You just need the process
> >> to lose the CPU at the right time.
> 
> > True.  If the compiler places the resetState load first, you could hit the
> > anomaly by "merely" setting a breakpoint on the next instruction, waiting for
> > exactly MSGNUMWRAPAROUND messages to enqueue, and letting the backend continue.
> > I think, though, we should either plug that _and_ the cache incoherency case or
> > worry about neither.
> 
> How do you figure that?  The poor-assembly-code-order risk is both a lot
> easier to fix and a lot higher probability.  Admittedly, it's still way
> way down there, but you only need a precisely-timed sleep, not a
> precisely-timed sleep *and* a cache line that somehow remained stale.

I think both probabilities are too low to usefully distinguish.  An sinval
wraparound takes a long time even in a deliberate test setup: almost 30 hours @
10k messages/sec.  To get a backend to sleep that long, you'll probably need
something like SIGSTOP or a debugger attach.  The sleep has to fall within the
space of no more than a few instructions.  Then, you'd need to release the
process at the exact moment for it to observe wrapped equality.  In other words,
you get one split-millisecond opportunity every 30 hours of process sleep time.
If your backends don't have multi-hour sleeps, it can't ever happen.

Even so, all the better if we settle on an approach that has neither hazard.

-- 
Noah Misch                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: sinval synchronization considered harmful

From
Robert Haas
Date:
On Tue, Jul 26, 2011 at 9:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jul 26, 2011 at 4:38 PM, Noah Misch <noah@2ndquadrant.com> wrote:
>> No new ideas come to mind, here.
>
> OK, I have a new idea.  :-)
>
> 1. Add a new flag to each procState called something like "timeToPayAttention".
> 2. Each call to SIInsertDataEntries() iterates over all the ProcStates
> whose index is < lastBackend and sets stateP->timeToPayAttention =
> TRUE for each.
> 3. At the beginning of SIGetDataEntries(), we do an unlocked if
> (!stateP->timeToPayAttention) return 0.
> 4. Immediately following that if statement and before acquiring any
> locks, we set stateP->timeToPayAttention = FALSE.
>
> The LWLockRelease() in SIInsertDataEntries() will be sufficient to force
> all of the stateP->timeToPayAttention writes out to main memory, and
> the read side is OK because we either just took a lock (which acted as
> a fence) or else there's a race anyway.  But unlike my previous
> proposal, it doesn't involve *comparing* anything.  We needn't worry
> about whether we could read two different values that are through
> great misfortune out of sync because we're only reading one value.
>
> If, by chance, the value is set to true just after we set it to false,
> that won't mess us up either: we'll still read all the messages after
> acquiring the lock.

There turned out to be a little bit of further subtlety to this, but
it seems to work.  Patch attached.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: sinval synchronization considered harmful

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Jul 26, 2011 at 9:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> 1. Add a new flag to each procState called something like "timeToPayAttention".
>> 2. Each call to SIInsertDataEntries() iterates over all the ProcStates
>> whose index is < lastBackend and sets stateP->timeToPayAttention =
>> TRUE for each.
>> 3. At the beginning of SIGetDataEntries(), we do an unlocked if
>> (!stateP->timeToPayAttention) return 0.
>> 4. Immediately following that if statement and before acquiring any
>> locks, we set stateP->timeToPayAttention = FALSE.

> There turned out to be a little bit of further subtlety to this, but
> it seems to work.  Patch attached.

And?

It didn't sound to me like this could possibly be a performance win,
but I await some numbers ...
        regards, tom lane


Re: sinval synchronization considered harmful

From
"Kevin Grittner"
Date:
Robert Haas <robertmhaas@gmail.com> wrote:
> There turned out to be a little bit of further subtlety to this,
> but it seems to work.  Patch attached.
Stylistic question: Why is stateP->hasMessages set to false in one
place and FALSE and TRUE in others?  It seems like it would be less
confusing to be consistent.
A quick count of the usage of both forms in the PostgreSQL source
code shows the lowercase is used about 10 times more often:
kgrittn@kgrittn-desktop:~/git/postgresql/kgrittn$ find -name '*.h'
-or -name '*.c' | egrep -v '/tmp_check/' | xargs cat | egrep -c
'\b(FALSE|TRUE)\b'
1670
kgrittn@kgrittn-desktop:~/git/postgresql/kgrittn$ find -name '*.h'
-or -name '*.c' | egrep -v '/tmp_check/' | xargs cat | egrep -c
'\b(false|true)\b'
17349
-Kevin


Re: sinval synchronization considered harmful

From
Robert Haas
Date:
On Wed, Jul 27, 2011 at 12:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Tue, Jul 26, 2011 at 9:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> 1. Add a new flag to each procState called something like "timeToPayAttention".
>>> 2. Each call to SIInsertDataEntries() iterates over all the ProcStates
>>> whose index is < lastBackend and sets stateP->timeToPayAttention =
>>> TRUE for each.
>>> 3. At the beginning of SIGetDataEntries(), we do an unlocked if
>>> (!stateP->timeToPayAttention) return 0.
>>> 4. Immediately following that if statement and before acquiring any
>>> locks, we set stateP->timeToPayAttention = FALSE.
>
>> There turned out to be a little bit of further subtlety to this, but
>> it seems to work.  Patch attached.
>
> And?
>
> It didn't sound to me like this could possibly be a performance win,
> but I await some numbers ...

On master, with the patch, scale factor 100, pgbench -S -c $CLIENTS -j
$CLIENTS -T 300 results for different client counts, on a 32-core
machine with 128GB RAM, shared_buffers = 8GB:

01 tps = 4444.280459 (including connections establishing)
01 tps = 4438.365576 (including connections establishing)
01 tps = 4423.718466 (including connections establishing)
08 tps = 33187.827872 (including connections establishing)
08 tps = 33288.247330 (including connections establishing)
08 tps = 33304.307835 (including connections establishing)
32 tps = 178876.350559 (including connections establishing)
32 tps = 177293.082295 (including connections establishing)
32 tps = 175577.058885 (including connections establishing)
80 tps = 159202.449993 (including connections establishing)
80 tps = 151541.717088 (including connections establishing)
80 tps = 150454.658132 (including connections establishing)

Without the patch:

01 tps = 4447.660101 (including connections establishing)
01 tps = 4440.830713 (including connections establishing)
01 tps = 4411.610348 (including connections establishing)
08 tps = 33135.773476 (including connections establishing)
08 tps = 33365.387051 (including connections establishing)
08 tps = 33364.859705 (including connections establishing)
32 tps = 175834.280471 (including connections establishing)
32 tps = 176713.118271 (including connections establishing)
32 tps = 176830.687087 (including connections establishing)
80 tps = 135211.036587 (including connections establishing)
80 tps = 130642.264016 (including connections establishing)
80 tps = 133621.549513 (including connections establishing)

I'm fairly certain the results will be even more dramatic with the
lazy-vxid patch applied, since master still has to fight with the lock
manager on this workload.  I haven't tested that yet, but there's not
much reason to suppose that the results will be dramatically different
from any of the other methods of reducing the sinval overhead that
I've benchmarked, many of which I've already posted about.  See, for
example, the OP on this thread.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: sinval synchronization considered harmful

From
Noah Misch
Date:
On Tue, Jul 26, 2011 at 09:57:10PM -0400, Robert Haas wrote:
> On Tue, Jul 26, 2011 at 4:38 PM, Noah Misch <noah@2ndquadrant.com> wrote:
> > No new ideas come to mind, here.
> 
> OK, I have a new idea.  :-)
> 
> 1. Add a new flag to each procState called something like "timeToPayAttention".
> 2. Each call to SIInsertDataEntries() iterates over all the ProcStates
> whose index is < lastBackend and sets stateP->timeToPayAttention =
> TRUE for each.
> 3. At the beginning of SIGetDataEntries(), we do an unlocked if
> (!stateP->timeToPayAttention) return 0.
> 4. Immediately following that if statement and before acquiring any
> locks, we set stateP->timeToPayAttention = FALSE.
> 
> The LWLockRelease() in SIInsertDataEntries() will be sufficient to force
> all of the stateP->timeToPayAttention writes out to main memory, and
> the read side is OK because we either just took a lock (which acted as
> a fence) or else there's a race anyway.  But unlike my previous
> proposal, it doesn't involve *comparing* anything.  We needn't worry
> about whether we could read two different values that are through
> great misfortune out of sync because we're only reading one value.
> 
> If, by chance, the value is set to true just after we set it to false,
> that won't mess us up either: we'll still read all the messages after
> acquiring the lock.

This approach would work if a spinlock release constrained the global stores
timeline.  It makes a weaker guarantee: all stores preceding the lock release
in program order will precede it globally.  Consequently, no processor will
observe the spinlock to be available without also observing all stores made by
the last holding processor before that processor released it.  That guarantee
is not enough for this sequence of events:

Backend 0:
        add a message for table "a"
        store 5 => maxMsgNum
        store true => timeToPayAttention
        LWLockRelease(SInvalWriteLock)
<plenty of time passes>
Backend 2:
        LOCK TABLE a;
        load timeToPayAttention => true
Backend 1:
        add a message for table "b"
        store 6 => maxMsgNum
        SpinLockRelease(&vsegP->msgnumLock);
        store true => timeToPayAttention
Backend 2:
        store false => timeToPayAttention
        LWLockAcquire(SInvalReadLock, LW_SHARED)
        load maxMsgNum => 5 [!]
Backend 1:
        LWLockRelease(SInvalWriteLock);
Backend 2:
        LOCK TABLE b;
        load timeToPayAttention => false
        <use b without processing updates>

The "SpinLockRelease(&vsegP->msgnumLock)" guarantees that any process
observing the backend 2 store of timeToPayAttention will also observe
maxMsgNum = 6.  However, nothing constrains which timeToPayAttention store
will "win" in main memory.  Backend 1 can observe neither update from backend
2, yet still have its store appear later than the backend 1 stores in the
global timeline.

> Now, there's some downside to this approach - it involves doing O(N)
> work for each invalidation message we receive.  But as long as it's
> only O(N) stores and not O(N) lock acquisitions and releases, I think
> that might be OK.

I think a benchmark is in order, something like 900 idle connections and 80
connections running small transactions that create a few temporary tables.  If
that shows no statistically significant regression, then we're probably fine
here.  I'm not sure what result to expect, honestly.

What did you think of making the message number a uint64, removing wraparound
handling, and retaining SISeg.msgnumLock for 32-bit only?  You could isolate the
variant logic in READ_MSGNUM and WRITE_MSGNUM macros.
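
(Roughly what I have in mind, as a stand-alone sketch; read_msgnum and
write_msgnum are hypothetical names for that variant logic, and a pthread
mutex stands in for msgnumLock:)

#include <stdint.h>
#include <pthread.h>

/* Stand-in for the relevant SISeg fields (hypothetical subset). */
typedef struct
{
    volatile uint64_t   maxMsgNum;
    pthread_mutex_t     msgnumLock;
} SISegSketch;

/*
 * Sketch of the split: on 64-bit platforms an aligned 8-byte load or
 * store is assumed atomic, so no lock is taken; 32-bit builds keep the
 * lock.  Illustration only, not the sinvaladt.c implementation.
 */
static uint64_t
read_msgnum(SISegSketch *segP)
{
#if UINTPTR_MAX > 0xffffffffUL              /* 64-bit build */
    return segP->maxMsgNum;
#else                                       /* 32-bit build: keep the lock */
    uint64_t    val;

    pthread_mutex_lock(&segP->msgnumLock);
    val = segP->maxMsgNum;
    pthread_mutex_unlock(&segP->msgnumLock);
    return val;
#endif
}

static void
write_msgnum(SISegSketch *segP, uint64_t val)
{
#if UINTPTR_MAX > 0xffffffffUL
    segP->maxMsgNum = val;
#else
    pthread_mutex_lock(&segP->msgnumLock);
    segP->maxMsgNum = val;
    pthread_mutex_unlock(&segP->msgnumLock);
#endif
}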

-- 
Noah Misch                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: sinval synchronization considered harmful

From
Robert Haas
Date:
On Wed, Jul 27, 2011 at 12:55 PM, Noah Misch <noah@2ndquadrant.com> wrote:
> This approach would work if a spinlock release constrained the global stores
> timeline.  It makes a weaker guarantee: all stores preceding the lock release
> in program order will precede it globally.  Consequently, no processor will
> observe the spinlock to be available without also observing all stores made by
> the last holding processor before that processor released it.  That guarantee
> is not enough for this sequence of events:
>
> Backend 0:
>        add a message for table "a"
>        store 5 => maxMsgNum
>        store true => timeToPayAttention
>        LWLockRelease(SInvalWriteLock)
> <plenty of time passes>
> Backend 2:
>        LOCK TABLE a;
>        load timeToPayAttention => true
> Backend 1:
>        add a message for table "b"
>        store 6 => maxMsgNum
>        SpinLockRelease(&vsegP->msgnumLock);
>        store true => timeToPayAttention
> Backend 2:
>        store false => timeToPayAttention
>        LWLockAcquire(SInvalReadLock, LW_SHARED)
>        load maxMsgNum => 5 [!]

Eh, how can this possibly happen?  You have to hold msgNumLock to
set maxMsgNum and msgNumLock to read maxMsgNum.  If that's not enough
to guarantee that we never read a stale value, what is?

> I think a benchmark is in order, something like 900 idle connections and 80
> connections running small transactions that create a few temporary tables.  If
> that shows no statistically significant regression, then we're probably fine
> here.  I'm not sure what result to expect, honestly.

That's setting the bar pretty high.  I don't mind doing the
experiment, but I'm not sure that's the case we should be optimizing
for.

> What did you think of making the message number a uint64, removing wraparound
> handling, and retaining SISeg.msgnumLock for 32-bit only?  You could isolate the
> variant logic in READ_MSGNUM and WRITE_MSGNUM macros.

Well, what you really need to know is whether the platform has 8-byte
atomic stores, which doesn't seem trivial to figure out, plus you need
a memory fence.  If that's the only method of fixing this problem we
can agree on, I'm willing to work on it, but an
architecture-independent fix would be nicer.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: sinval synchronization considered harmful

From
Noah Misch
Date:
On Wed, Jul 27, 2011 at 01:30:47PM -0400, Robert Haas wrote:
> On Wed, Jul 27, 2011 at 12:55 PM, Noah Misch <noah@2ndquadrant.com> wrote:
> > [wrong objection]
> 
> Eh, how can this possibly happen?  You have to hold msgNumLock to
> set maxMsgNum and msgNumLock to read maxMsgNum.  If that's not enough
> to guarantee that we never read a stale value, what is?

Indeed, my analysis was all wrong.

> > I think a benchmark is in order, something like 900 idle connections and 80
> > connections running small transactions that create a few temporary tables.  If
> > that shows no statistically significant regression, then we're probably fine
> > here.  I'm not sure what result to expect, honestly.
> 
> That's setting the bar pretty high.  I don't mind doing the
> experiment, but I'm not sure that's the case we should be optimizing
> for.

Granted.  How about 32 clients running the temporary table transaction, no idle
connections?  Given the meager benefit of this patch compared to your previous
version, it would be hard to justify a notable performance drop in return.

> > What did you think of making the message number a uint64, removing wraparound
> > handling, and retaining SISeg.msgnumLock for 32-bit only?  You could isolate the
> > variant logic in READ_MSGNUM and WRITE_MSGNUM macros.
> 
> Well, what you really need to know is whether the platform has 8-byte
> atomic stores, which doesn't seem trivial to figure out, plus you need
> a memory fence.  If that's the only method of fixing this problem we
> can agree on, I'm willing to work on it, but an
> architecture-independent fix would be nicer.

Fair enough.

Thanks,
nm

-- 
Noah Misch                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: sinval synchronization considered harmful

From
Robert Haas
Date:
On Wed, Jul 27, 2011 at 1:58 PM, Noah Misch <noah@2ndquadrant.com> wrote:
>> > I think a benchmark is in order, something like 900 idle connections and 80
>> > connections running small transactions that create a few temporary tables.  If
>> > that shows no statistically significant regression, then we're probably fine
>> > here.  I'm not sure what result to expect, honestly.
>>
>> That's setting the bar pretty high.  I don't mind doing the
>> experiment, but I'm not sure that's the case we should be optimizing
>> for.
>
> Granted.  How about 32 clients running the temporary table transaction, no idle
> connections?  Given the meager benefit of this patch compared to your previous
> version, it would be hard to justify a notable performance drop in return.

The reason the benefit is smaller is, I believe, because the previous
numbers were generated with the lazy vxid locks patch applied, and
these were generated against master.  With the lock manager as a
bottleneck, the sinval stuff doesn't get hit quite as hard, so the
benefit is less.  I can regenerate the numbers with the lazy vxid
patch applied; I suspect they'll be similar to what we saw before.
I'll also test out creating and dropping some tables.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: sinval synchronization considered harmful

From
Robert Haas
Date:
On Wed, Jul 27, 2011 at 2:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> The reason the benefit is smaller is, I believe, because the previous
> numbers were generated with the lazy vxid locks patch applied, and
> these were generated against master.  With the lock manager as a
> bottleneck, the sinval stuff doesn't get hit quite as hard, so the
> benefit is less.  I can regenerate the numbers with the lazy vxid
> patch applied; I suspect they'll be similar to what we saw before.

Yep.  Here's with both lazy-vxids and sinval-hasmessages:

01 tps = 4470.133776 (including connections establishing)
01 tps = 4471.450650 (including connections establishing)
01 tps = 4490.833194 (including connections establishing)
32 tps = 191416.960956 (including connections establishing)
32 tps = 190653.742400 (including connections establishing)
32 tps = 191832.231458 (including connections establishing)
80 tps = 189348.509378 (including connections establishing)
80 tps = 191080.641878 (including connections establishing)
80 tps = 191366.728275 (including connections establishing)

And with just lazy vxids:

01 tps = 4458.667013 (including connections establishing)
01 tps = 4526.922638 (including connections establishing)
01 tps = 4480.415099 (including connections establishing)
32 tps = 193273.434028 (including connections establishing)
32 tps = 190661.279391 (including connections establishing)
32 tps = 189526.560031 (including connections establishing)
80 tps = 150572.020250 (including connections establishing)
80 tps = 118643.970622 (including connections establishing)
80 tps = 119211.643930 (including connections establishing)

Same select-only, scale-factor-100 pgbench test, same 32-core machine
as I've been using for my other recent tests.

> I'll also test out creating and dropping some tables.

Still need to work on this one.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: sinval synchronization considered harmful

From
Robert Haas
Date:
On Thu, Jul 28, 2011 at 10:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I'll also test out creating and dropping some tables.
>
> Still need to work on this one.

And the results are in.  I set up the following sophisticated test
script for pgbench:

CREATE TEMP TABLE foo (a int);
DROP TABLE foo;

And then did 5-minute test runs with varying numbers of clients on
Nate Boley's 32-core machine with (a) master, (b) master +
sinval-hasmessages, (c) master + lazy-vxid, and (d) master + lazy-vxid
+ sinval-hasmessages.  I held my breath until the results started
popping out, then felt much better.  In the table below, results with
L are with the lazy-vxid patch, S is with the sinval-hasmessages
patch, LS with both, and results with no letters are neither.  The
numbers are the client count.  Long story short, it seems that the
patch actually makes this test case faster at any client count, though
the improvement at 1 and 8 clients might be in the noise:

01L tps = 514.880290 (including connections establishing)
01L tps = 525.097199 (including connections establishing)
01L tps = 508.319588 (including connections establishing)
08L tps = 1834.259810 (including connections establishing)
08L tps = 1846.846089 (including connections establishing)
08L tps = 1841.402433 (including connections establishing)
32L tps = 1463.822936 (including connections establishing)
32L tps = 1481.169483 (including connections establishing)
32L tps = 1393.780335 (including connections establishing)
80L tps = 1192.768020 (including connections establishing)
80L tps = 1165.545010 (including connections establishing)
80L tps = 1169.776066 (including connections establishing)

01LS tps = 517.624068 (including connections establishing)
01LS tps = 524.507723 (including connections establishing)
01LS tps = 507.847622 (including connections establishing)
08LS tps = 1831.248178 (including connections establishing)
08LS tps = 1873.932133 (including connections establishing)
08LS tps = 1863.048113 (including connections establishing)
32LS tps = 1851.143407 (including connections establishing)
32LS tps = 1754.683356 (including connections establishing)
32LS tps = 1785.926527 (including connections establishing)
80LS tps = 1510.778084 (including connections establishing)
80LS tps = 1484.423486 (including connections establishing)
80LS tps = 1480.692051 (including connections establishing)

01 tps = 511.572832 (including connections establishing)
01 tps = 499.389527 (including connections establishing)
01 tps = 495.697080 (including connections establishing)
08 tps = 1832.762548 (including connections establishing)
08 tps = 1819.884564 (including connections establishing)
08 tps = 1835.608561 (including connections establishing)
32 tps = 1417.168790 (including connections establishing)
32 tps = 1447.478971 (including connections establishing)
32 tps = 1427.489879 (including connections establishing)
80 tps = 1154.272515 (including connections establishing)
80 tps = 1168.805881 (including connections establishing)
80 tps = 1173.971801 (including connections establishing)

01S tps = 519.860218 (including connections establishing)
01S tps = 510.759087 (including connections establishing)
01S tps = 517.159276 (including connections establishing)
08S tps = 1880.179600 (including connections establishing)
08S tps = 1829.693454 (including connections establishing)
08S tps = 1886.168401 (including connections establishing)
32S tps = 1809.950929 (including connections establishing)
32S tps = 1809.474070 (including connections establishing)
32S tps = 1798.620842 (including connections establishing)
80S tps = 1483.037788 (including connections establishing)
80S tps = 1481.059504 (including connections establishing)
80S tps = 1487.215748 (including connections establishing)

So, apparently, the extra work in SIInsertDataEntries() is more than
paid for by the speedup in SIGetDataEntries().  I'm guessing that at
high client counts you win because of reduced spinlock contention, and
at low client counts you still win a little bit because
SIGetDataEntries() is called multiple times per transaction, whereas
SIInsertDataEntries() is only called once.  I could be all wet on the
reason, but at any rate the numbers look pretty good.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: sinval synchronization considered harmful

From
Noah Misch
Date:
On Thu, Jul 28, 2011 at 03:03:05PM -0400, Robert Haas wrote:
> On Thu, Jul 28, 2011 at 10:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> >> I'll also test out creating and dropping some tables.
> >
> > Still need to work on this one.
> 
> And the results are in.  I set up the following sophisticated test
> script for pgbench:
> 
> CREATE TEMP TABLE foo (a int);
> DROP TABLE foo;
> 
> And then did 5-minute test runs with varying numbers of clients on
> Nate Boley's 32-core machine with (a) master, (b) master +
> sinval-hasmessages, (c) master + lazy-vxid, and (d) master + lazy-vxid
> + sinval-hasmessages.

The comparison I had in mind was (a) master + lazy-vxid + [1]sinval-fastpath
vs. (b) master + lazy-vxid + [2]sinval-hasmessages.  The only claimed benefit of
[2] over [1], as far as I can see, is invulnerability to the hazard described in
[3].  Before selecting [2] over [1], it would be instructive to know how much
performance we exchange for its benefit.

I did a bit of (relatively unrigorous) poking at this workload with oprofile,
and I never saw SIInsertDataEntries take more than 0.26% of CPU time.  It's
perhaps too cheap relative to the workload's other costs to matter.  That's good
enough for me.

[1] http://archives.postgresql.org/message-id/CA+TgmoZc8Z_JTj44xvpWpXKQt2jGjB5YGCZ3T9u5-QZVdBmyCA@mail.gmail.com
[2] http://archives.postgresql.org/message-id/CA+TgmoZjo1bkP6eX=xOX3aHuPYbmJT9+PKW6qubQzN7sukK3Dg@mail.gmail.com
[3] http://archives.postgresql.org/message-id/20110727033537.GB18910@tornado.leadboat.com

> 80L tps = 1192.768020 (including connections establishing)
> 80L tps = 1165.545010 (including connections establishing)
> 80L tps = 1169.776066 (including connections establishing)

> 80LS tps = 1510.778084 (including connections establishing)
> 80LS tps = 1484.423486 (including connections establishing)
> 80LS tps = 1480.692051 (including connections establishing)

> 80 tps = 1154.272515 (including connections establishing)
> 80 tps = 1168.805881 (including connections establishing)
> 80 tps = 1173.971801 (including connections establishing)

> 80S tps = 1483.037788 (including connections establishing)
> 80S tps = 1481.059504 (including connections establishing)
> 80S tps = 1487.215748 (including connections establishing)
> 
> So, apparently, the extra work in SIInsertDataEntries() is more than
> paid for by the speedup in SIGetDataEntries().  I'm guessing that at
> high client counts you win because of reduced spinlock contention, and
> at low client counts you still win a little bit because
> SIGetDataEntries() is called multiple times per transaction, whereas
> SIInsertDataEntries() is only called once.  I could be all wet on the
> reason, but at any rate the numbers look pretty good.

Interesting.  I had hypothesized that at two clients per core, nearly every call
to SIGetDataEntries() would find work to do, making the patch a strict loss.  A
bit of instrumented testing showed otherwise: at two clients per core,
sinval-hasmessages optimized away 93% of the calls.  At fifty clients per core,
it still optimized away more than half of the calls.

-- 
Noah Misch                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: sinval synchronization considered harmful

From
Robert Haas
Date:
On Thu, Jul 28, 2011 at 11:05 PM, Noah Misch <noah@2ndquadrant.com> wrote:
> The comparison I had in mind was (a) master + lazy-vxid + [1]sinval-fastpath
> vs. (b) master + lazy-vxid + [2]sinval-hasmessages.  The only claimed benefit of
> [2] over [1], as far as I can see, is invulnerability to the hazard described in
> [3].  Before selecting [2] over [1], it would be instructive to know how much
> performance we exchange for its benefit.
>
> I did a bit of (relatively unrigorous) poking at this workload with oprofile,
> and I never saw SIInsertDataEntries take more than 0.26% of CPU time.  It's
> perhaps too cheap relative to the workload's other costs to matter.  That's good
> enough for me.

Me, too.  There's another possible benefit, though: in
sinval-fastpath, everybody's got to read maxMsgNum every time through;
whereas in sinval-hasmessages, everyone's reading their own flag.  I'm
not sure how much it costs to have memory that gets read by multiple
readers rather than just a single reader, but I think I've seen some
things that indicate it might not always be free.

> Interesting.  I had hypothesized that at two clients per core, nearly every call
> to SIGetDataEntries() would find work to do, making the patch a strict loss.  A
> bit of instrumented testing showed otherwise: at two clients per core,
> sinval-hasmessages optimized away 93% of the calls.  At fifty clients per core,
> it still optimized away more than half of the calls.

Wow.  That's better than I expected.  Great idea for a test, too.

Unless someone has another concern here, I think we've got this one nailed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company