Thread: [PATCH] Improve spinlock inline assembly for x86.

[PATCH] Improve spinlock inline assembly for x86.

From: Andreas Seltenreich
Hi,

I'm currently experimenting with just-in-time compilation using libfirm.
While discussing issues with its developers, it was pointed out to me
that our spinlock inline assembly is less than optimal.  Attached is a
patch that addresses this.

,----
| Remove the LOCK prefix from the XCHG instruction.  Locking is implicit
| with XCHG and the prefix wastes a byte.  Also remove the "cc" register
| from the clobber list as the XCHG instruction does not modify any flags.
|
| Reported by Christoph Mallon.
`----

regards,
Andreas

From c836b4f3e0b60d070481d4061e6fe0ffbe488495 Mon Sep 17 00:00:00 2001
From: Andreas Seltenreich <seltenreich@gmx.de>
Date: Sun, 17 Jan 2016 11:51:53 +0100
Subject: [PATCH] Improve spinlock inline assembly for x86.

Remove the LOCK prefix from the XCHG instruction.  Locking is implicit
with XCHG and the prefix wastes a byte.  Also remove the "cc" register
from the clobber list as the XCHG instruction does not modify any
flags.

Reported by Christoph Mallon.
---
 src/include/storage/s_lock.h | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index 8b240cd..933bb76 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -158,7 +158,6 @@ tas(volatile slock_t *lock)
     __asm__ __volatile__(
         "    cmpb    $0,%1    \n"
         "    jne        1f        \n"
-        "    lock            \n"
         "    xchgb    %0,%1    \n"
         "1: \n"
 :        "+q"(_res), "+m"(*lock)
@@ -226,11 +225,10 @@ tas(volatile slock_t *lock)
     register slock_t _res = 1;

     __asm__ __volatile__(
-        "    lock            \n"
         "    xchgb    %0,%1    \n"
 :        "+q"(_res), "+m"(*lock)
 :        /* no inputs */
-:        "memory", "cc");
+:        "memory");
     return (int) _res;
 }

--
2.1.4
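
For readers skimming the thread, here is a minimal compilable sketch of what
the second (x86-64) tas() looks like with the patch applied.  The asm body is
taken from the hunk above; the one-byte slock_t typedef and the function
framing are assumptions added only so the snippet stands alone:

    typedef unsigned char slock_t;     /* one-byte lock; an assumption here */

    static inline int
    tas(volatile slock_t *lock)
    {
        slock_t  _res = 1;

        __asm__ __volatile__(
            "   xchgb   %0,%1   \n"    /* implicitly locked; no LOCK prefix */
            : "+q"(_res), "+m"(*lock)
            :                          /* no inputs */
            : "memory");               /* no "cc": xchg leaves the flags alone */
        return (int) _res;             /* 0 means the lock was free and is now ours */
    }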


Re: [PATCH] Improve spinlock inline assembly for x86.

From: Robert Haas
On Sun, Jan 17, 2016 at 6:38 AM, Andreas Seltenreich <seltenreich@gmx.de> wrote:
> I'm currently experimenting with just-in-time compilation using libfirm.
> While discussing issues with its developers, it was pointed out to me
> that our spinlock inline assembly is less than optimal.  Attached is a
> patch that addresses this.
>
> ,----
> | Remove the LOCK prefix from the XCHG instruction.  Locking is implicit
> | with XCHG and the prefix wastes a byte.  Also remove the "cc" register
> | from the clobber list as the XCHG instruction does not modify any flags.
> |
> | Reported by Christoph Mallon.
> `----

I did a Google search and everybody seems to agree that the LOCK
prefix is redundant.  I found a source agreeing with the idea that it
doesn't clobber registers, too:

http://www.oopweb.com/Assembly/Documents/ArtOfAssembly/Volume/Chapter_6/CH06-1.html#HEADING1-85

So I guess it would be safe to change this.  Scares me slightly, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [PATCH] Improve spinlock inline assembly for x86.

From: Andres Freund
On January 18, 2016 7:27:59 PM GMT+01:00, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Jan 17, 2016 at 6:38 AM, Andreas Seltenreich <seltenreich@gmx.de> wrote:
>> I'm currently experimenting with just-in-time compilation using libfirm.
>> While discussing issues with its developers, it was pointed out to me
>> that our spinlock inline assembly is less than optimal.  Attached is a
>> patch that addresses this.
>>
>> ,----
>> | Remove the LOCK prefix from the XCHG instruction.  Locking is implicit
>> | with XCHG and the prefix wastes a byte.  Also remove the "cc" register
>> | from the clobber list as the XCHG instruction does not modify any flags.
>> |
>> | Reported by Christoph Mallon.
>> `----
>
> I did a Google search and everybody seems to agree that the LOCK
> prefix is redundant.  I found a source agreeing with the idea that it
> doesn't clobber registers, too:
>
> http://www.oopweb.com/Assembly/Documents/ArtOfAssembly/Volume/Chapter_6/CH06-1.html#HEADING1-85
>
> So I guess it would be safe to change this.  Scares me slightly, though.

Clobbers IIRC are implicit on x86 anyway. Rather doubt that the space for
the prefix makes any sort of practical difference, but there indeed seems no
reason to have it.
 

Andres

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.



Re: [PATCH] Improve spinlock inline assembly for x86.

From: Kevin Grittner
On Mon, Jan 18, 2016 at 3:04 PM, Andres Freund <andres@anarazel.de> wrote:
> On January 18, 2016 7:27:59 PM GMT+01:00, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sun, Jan 17, 2016 at 6:38 AM, Andreas Seltenreich <seltenreich@gmx.de> wrote:

>>> While discussing issues with its developers, it was pointed out to me
>>> that our spinlock inline assembly is less than optimal.  Attached is
>>> a patch that addresses this.

>> I did a Google search and everybody seems to agree that the LOCK
>> prefix is redundant.  I found a source agreeing with the idea that it
>> doesn't clobber registers

>> So I guess it would be safe to change this.  Scares me slightly,
>> though.
>
> Clobbers IIRC are implicit on x86 anyway. Rather doubt that the
> space for the prefix makes any sort of practical difference, but
> there indeed seems no reason to have it.

I took a look at this and agree that the shorter, simpler code
proposed in this patch should make no *logical* difference, and
looks like it *should* have a neutral or beneficial effect; but
performance tuning in general, and spinlock performance in
particular, is full of surprises.  We have seen customers suffer
poor scaling on their brand new monster machines because of the
interaction between NUMA scheduling and our spinlock
implementation, and seen those problems go away with an upgrade
from pre-3.8 to post-3.8 kernels.  I would be hesitant to accept
this change without seeing a benchmark on a large NUMA machine with
4 or more memory nodes, on Linux kernels both before and after 3.8,
to make sure that the effects are at least neutral.
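
To make the kind of test concrete, below is a hypothetical, self-contained
microbenchmark sketch (not from the thread; the file name, thread cap and
iteration count are made up) that hammers the patched xchgb sequence from
many threads.  Note Andres's point later in the thread that a toy loop like
this mostly measures code layout; the request above is really for a
pgbench-style run on a large NUMA machine.

    /* spin_bench.c -- hypothetical sketch; build with: cc -O2 -pthread spin_bench.c */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef unsigned char slock_t;

    static volatile slock_t lock_byte = 0;
    static long shared_counter = 0;        /* protected by lock_byte */

    #define ITERATIONS 1000000L
    #define MAX_THREADS 64

    static inline int
    tas(volatile slock_t *lock)
    {
        slock_t  res = 1;

        __asm__ __volatile__(
            "   xchgb   %0,%1   \n"        /* the patched, prefix-less form */
            : "+q"(res), "+m"(*lock)
            :
            : "memory");
        return (int) res;
    }

    static void *
    worker(void *arg)
    {
        (void) arg;
        for (long i = 0; i < ITERATIONS; i++)
        {
            while (tas(&lock_byte))
                ;                          /* spin until the byte was 0 */
            shared_counter++;              /* trivial critical section */
            __asm__ __volatile__("" ::: "memory");  /* compiler barrier */
            lock_byte = 0;                 /* plain store releases on x86 */
        }
        return NULL;
    }

    int
    main(int argc, char **argv)
    {
        int        nthreads = (argc > 1) ? atoi(argv[1]) : 4;
        pthread_t  threads[MAX_THREADS];

        if (nthreads < 1 || nthreads > MAX_THREADS)
            nthreads = 4;

        for (int i = 0; i < nthreads; i++)
            pthread_create(&threads[i], NULL, worker, NULL);
        for (int i = 0; i < nthreads; i++)
            pthread_join(threads[i], NULL);

        /* should print nthreads * ITERATIONS if the lock really excludes */
        printf("counter = %ld\n", shared_counter);
        return 0;
    }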

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [PATCH] Improve spinlock inline assembly for x86.

From: Andres Freund
On January 18, 2016 10:42:42 PM GMT+01:00, Kevin Grittner <kgrittn@gmail.com> wrote:
> On Mon, Jan 18, 2016 at 3:04 PM, Andres Freund <andres@anarazel.de> wrote:
>> On January 18, 2016 7:27:59 PM GMT+01:00, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Sun, Jan 17, 2016 at 6:38 AM, Andreas Seltenreich <seltenreich@gmx.de> wrote:
>
>>>> While discussing issues with its developers, it was pointed out to me
>>>> that our spinlock inline assembly is less than optimal.  Attached is
>>>> a patch that addresses this.
>
>>> I did a Google search and everybody seems to agree that the LOCK
>>> prefix is redundant.  I found a source agreeing with the idea that it
>>> doesn't clobber registers
>
>>> So I guess it would be safe to change this.  Scares me slightly,
>>> though.
>>
>> Clobbers IIRC are implicit on x86 anyway. Rather doubt that the
>> space for the prefix makes any sort of practical difference, but
>> there indeed seems no reason to have it.
>
> I took a look at this and agree that the shorter, simpler code
> proposed in this patch should make no *logical* difference, and
> looks like it *should* have a neutral or beneficial effect; but
> performance tuning in general, and spinlock performance in
> particular, is full of surprises.  We have seen customers suffer
> poor scaling on their brand new monster machines because of the
> interaction between NUMA scheduling and our spinlock
> implementation, and seen those problems go away with an upgrade
> from pre-3.8 to post-3.8 kernels.  I would be hesitant to accept
> this change without seeing a benchmark on a large NUMA machine with
> 4 or more memory nodes, on Linux kernels both before and after 3.8,
> to make sure that the effects are at least neutral.

Unconvinced. You get just as much churn by changing code elsewhere, which
often causes code movement and alignment changes.

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.



Re: [PATCH] Improve spinlock inline assembly for x86.

From: Kevin Grittner
On Mon, Jan 18, 2016 at 3:47 PM, Andres Freund <andres@anarazel.de> wrote:
> On January 18, 2016 10:42:42 PM GMT+01:00, Kevin Grittner <kgrittn@gmail.com> wrote:

>> I took a look at this and agree that the shorter, simpler code
>> proposed in this patch should make no *logical* difference, and
>> looks like it *should* have a neutral or beneficial effect; but
>> performance tuning in general, and spinlock performance in
>> particular, is full of surprises.  We have seen customers suffer
>> poor scaling on their brand new monster machines because of the
>> interaction between NUMA scheduling and our spinlock
>> implementation, and seen those problems go away with an upgrade
>> from pre-3.8 to post-3.8 kernels.  I would be hesitant to accept
>> this change without seeing a benchmark on a large NUMA machine with
>> 4 or more memory nodes, on Linux kernels both before and after 3.8,
>> to make sure that the effects are at least neutral.
>
> Unconvinced.

Unconvinced that we should do performance testing on a proposed
performance patch before accepting it, that the changes in NUMA
scheduling in the Linux 3.8 kernel have a major effect on how well
our code performs at high concurrency on NUMA machines with a lot
of memory nodes, that patches to improve performance sometimes
cause surprising regressions, that the results will come out any
particular way, or that the difference will be out of the noise?
Personally, I'm only convinced of the first three of those.

> You get just as much churn by changing code elsewhere, which
> often causes code movement and alignment changes.

It's hard to understand quite what you're saying there.  If you're
saying that code changes that should be performance neutral can
sometimes affect performance because of alignment of code with
cache line boundaries -- I absolutely agree; is that an argument
against performance testing performance patches?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [PATCH] Improve spinlock inline assembly for x86.

From: Peter Geoghegan
On Mon, Jan 18, 2016 at 2:14 PM, Kevin Grittner <kgrittn@gmail.com> wrote:
>> You get just as much churn by changing code elsewhere, which
>> often causes code movement and alignment changes.
>
> It's hard to understand quite what you're saying there.  If you're
> saying that code changes that should be performance neutral can
> sometimes affect performance because of alignment of code with
> cache line boundaries -- I absolutely agree; is that an argument
> against performance testing performance patches?

No, it isn't an argument against performance testing patches like
this, but I don't think anyone suggested otherwise. Of course every
performance-related patch should be tested to make sure it meets its
goals at acceptable cost, but I don't think that Andreas' patch is
necessarily a performance patch. There can be value in removing
superfluous code; doing so sometimes clarifies intent and
understanding.

-- 
Peter Geoghegan



Re: [PATCH] Improve spinlock inline assembly for x86.

From: Kevin Grittner
On Mon, Jan 18, 2016 at 4:24 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Mon, Jan 18, 2016 at 2:14 PM, Kevin Grittner <kgrittn@gmail.com> wrote:

>> It's hard to understand quite what you're saying there.  If you're
>> saying that code changes that should be performance neutral can
>> sometimes affect performance because of alignment of code with
>> cache line boundaries -- I absolutely agree; is that an argument
>> against performance testing performance patches?
>
> No, it isn't an argument against performance testing patches like
> this, but I don't think anyone suggested otherwise.

I'm still not sure what argument he *was* making, but I'm glad to
hear it wasn't that.

> Of course every performance-related patch should be tested to
> make sure it meets its goals at acceptable cost,

I argued that it should be tested to ensure that it caused no
regression in a case which has been a problem for some of our
customers (spinlock contention at high concurrency on a machine
with 4 or more NUMA nodes).  We have been able to solve those
problems, but it has been a fussy business -- sometimes we have
tweaked spinlock code and sometimes that has not worked but an OS
upgrade has.  I am quite unconvinced about whether the change will
help, hurt, or have no impact on these problems; I'm only arguing
for testing to find out.

> but I don't think that Andreas' patch is necessarily a
> performance patch. There can be value in removing superfluous
> code; doing so sometimes clarifies intent and understanding.

Well, that's why I said I would be satisfied with a neutral
benchmark result -- when there is a tie, the shorter, simpler code
should generally win.  I'm not really sure what there was in what I
said to argue about; since then I've just been trying to figure that
out.  If we all agree, let's let it drop.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [PATCH] Improve spinlock inline assembly for x86.

From: Andres Freund
On 2016-01-18 16:14:05 -0600, Kevin Grittner wrote:
> Unconvinced that we should do performance testing on a proposed
> performance patch before accepting it

I'm unconvinced that it makes sense to view this as a performance
patch, and unconvinced that you can sanely measure it. The lock prefix
is a one-byte instruction prefix, and lock xchg and xchg are exactly
the same, instruction width aside. It's just a little bit less work
for the instruction decoder.

The point about alignment and such is that changing some code somewhere
is likely to have a bigger performance impact than the actual effect of
the removal of those few bytes. So when you benchmark, you'd just
benchmark a slightly changed code layout.
  objdump -d build/postgres/dev-assert/vpath/src/backend/postgres | grep 'lock xchg' | head -n1
    4b732f:   f0 86 01                lock xchg %al,(%rcx)
 
The f0 is the lock prefix.  In total there are 22 of them in the postgres
binary, when compiled with my flags/compiler.
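
A hypothetical way to see the encoding difference for yourself (the file
name and flags are made up, not from the thread): compile the two variants
below with "cc -O2 -c xchg_demo.c" and run "objdump -d xchg_demo.o"; the
disassembly should differ only by the one-byte f0 prefix.

    /* xchg_demo.c -- hypothetical demo, not part of the patch or the tree */
    typedef unsigned char slock_t;

    int
    tas_with_lock_prefix(volatile slock_t *lock)
    {
        slock_t  r = 1;

        /* explicit LOCK prefix, as emitted by the unpatched code */
        __asm__ __volatile__("lock; xchgb %0,%1"
                             : "+q"(r), "+m"(*lock)
                             :
                             : "memory");
        return (int) r;
    }

    int
    tas_without_lock_prefix(volatile slock_t *lock)
    {
        slock_t  r = 1;

        /* bare xchg; still atomic, since xchg with memory is implicitly locked */
        __asm__ __volatile__("xchgb %0,%1"
                             : "+q"(r), "+m"(*lock)
                             :
                             : "memory");
        return (int) r;
    }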

I think it's unrealistic to benchmark slight code movements on a regular
basis, particularly using a large machine. There's just not enough time
and hardware around for that.

Now I'm equally unconvinced that it's worthwhile to do anything
here. I just don't think benchmarking plays a role either way.

>, that the changes in NUMA
> scheduling in the Linux 3.8 kernel have a major effect on how well
> our code performs at high concurrency on NUMA machines with a lot
> of memory nodes

That I believe immediately.



Re: [PATCH] Improve spinlock inline assembly for x86.

From
Peter Geoghegan
Date:
On Mon, Jan 18, 2016 at 2:39 PM, Kevin Grittner <kgrittn@gmail.com> wrote:
>> but I don't think that Andreas' patch is necessarily a
>> performance patch. There can be value in removing superfluous
>> code; doing so sometimes clarifies intent and understanding.
>
> Well, that's why I said I would be satisfied with a neutral
> benchmark result -- when there is a tie, the shorter, simpler code
> should generally win.  I'm not really sure what there was in what I
> said to argue about; since then I've just been trying to figure that
> out.  If we all agree, let's let it drop.

If we don't want to apply this, then I think that a sensible
compromise would be to add a code comment that says that we don't
believe the LOCK prefix matters.
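
For what it's worth, a purely hypothetical wording for such a comment next
to the xchgb in s_lock.h might be:

        /*
         * Note: the LOCK prefix on the xchg below is redundant -- XCHG with
         * a memory operand is always locked -- but we keep it for clarity;
         * it costs one byte of encoding and has no semantic effect.  XCHG
         * does not modify the flags either way.
         */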

-- 
Peter Geoghegan



Re: [PATCH] Improve spinlock inline assembly for x86.

From: Kevin Grittner
On Mon, Jan 18, 2016 at 4:50 PM, Andres Freund <andres@anarazel.de> wrote:

> Now I'm equally unconvinced that it's worthwhile to do anything
> here. I just don't think benchmarking plays a role either way.

Well, that would be the crucial point on which we differ -- the
rest is all agreement.  I don't think we should accept the patch
*in the absence* of benchmarking to show a result that is neutral
or better.  Spinlocks are just too performance-critical and too
fussy to accept a change on the basis that "the source code looks
fine".  IMO, anyway.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [PATCH] Improve spinlock inline assembly for x86.

From: Andres Freund
On 2016-01-18 16:56:22 -0600, Kevin Grittner wrote:
> On Mon, Jan 18, 2016 at 4:50 PM, Andres Freund <andres@anarazel.de> wrote:
> 
> > Now I'm equally unconvinced that it's worthwhile to do anything
> > here. I just don't think benchmarking plays a role either way.
> 
> Well, that would be the crucial point on which we differ -- the
> rest is all agreement.  I don't think we should accept the patch
> *in the absence* of benchmarking to show a result that is neutral
> or better.  Spinlocks are just too performance-critical and too
> fussy to accept a change on the basis that "the source code looks
> fine".  IMO, anyway.

By that justification we would need to start benchmarking the addition of
new variables on the stack; most of the time that will have a bigger
performance impact than this.  Benchmarking minor source code cleanups is
just not worth the time.