Thread: Suboptimal spinlock code due to volatile

Suboptimal spinlock code due to volatile

From: Andres Freund

Hi,

As part of [1] I was staring at the assembly code generated for
SpinLockAcquire(), fairly randomly using GetRecoveryState() as the example.

On master, in an optimized build this generates the following code (gcc 12 in
this case, but it doesn't really matter):

0000000000004220 <GetRecoveryState>:
    4220:    55                       push   %rbp
    4221:    48 8b 05 00 00 00 00     mov    0x0(%rip),%rax        # 4228 <GetRecoveryState+0x8>
    4228:    ba 01 00 00 00           mov    $0x1,%edx
    422d:    48 89 e5                 mov    %rsp,%rbp
    4230:    48 05 c0 01 00 00        add    $0x1c0,%rax
    4236:    f0 86 10                 lock xchg %dl,(%rax)
    4239:    84 d2                    test   %dl,%dl
    423b:    75 23                    jne    4260 <GetRecoveryState+0x40>
    423d:    48 8b 05 00 00 00 00     mov    0x0(%rip),%rax        # 4244 <GetRecoveryState+0x24>
    4244:    8b 80 44 01 00 00        mov    0x144(%rax),%eax
    424a:    48 8b 15 00 00 00 00     mov    0x0(%rip),%rdx        # 4251 <GetRecoveryState+0x31>
    4251:    c6 82 c0 01 00 00 00     movb   $0x0,0x1c0(%rdx)
    4258:    5d                       pop    %rbp
    4259:    c3                       ret
    425a:    66 0f 1f 44 00 00        nopw   0x0(%rax,%rax,1)
    4260:    48 8b 05 00 00 00 00     mov    0x0(%rip),%rax        # 4267 <GetRecoveryState+0x47>
    4267:    48 8d 0d 00 00 00 00     lea    0x0(%rip),%rcx        # 426e <GetRecoveryState+0x4e>
    426e:    48 8d b8 c0 01 00 00     lea    0x1c0(%rax),%rdi
    4275:    ba c8 18 00 00           mov    $0x18c8,%edx
    427a:    48 8d 35 00 00 00 00     lea    0x0(%rip),%rsi        # 4281 <GetRecoveryState+0x61>
    4281:    ff 15 00 00 00 00        call   *0x0(%rip)        # 4287 <GetRecoveryState+0x67>
    4287:    eb b4                    jmp    423d <GetRecoveryState+0x1d>

The main thing I want to draw attention to is the following bit:
    add    $0x1c0,%rax
    lock xchg %dl,(%rax)

0x1c0 is the offset of info_lck in XLogCtlData. So the code first computes the
address of the lock in %rax and then does the xchg on that.

That's pretty odd, because on x86 the offset could simply be encoded in the
instruction's memory operand - as shown in the code for the unlock a bit later:
    4251:       c6 82 c0 01 00 00 00    movb   $0x0,0x1c0(%rdx)


After being confused for a while, the explanation is fairly simple: We use
volatile and dereference the address:

static __inline__ int
tas(volatile slock_t *lock)
{
    slock_t        _res = 1;

    __asm__ __volatile__(
        "    lock            \n"
        "    xchgb    %0,%1    \n"
:        "+q"(_res), "+m"(*lock)
:        /* no inputs */
:        "memory", "cc");
    return (int) _res;
}

(note the (*lock) and the volatile in the signature).

I think it'd be just as defensible to not emit a separate load here, despite
the volatile, and indeed clang doesn't emit a separate load. But it also does
seem defensible to translate the code very literally, as gcc does.


If I remove the volatile from the signature or cast it away, gcc indeed
generates the offset version:
    4230:    f0 86 82 c0 01 00 00     lock xchg %al,0x1c0(%rdx)
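
For illustration, the cast-away variant might look like this (a sketch of the
idea only, not the committed change):

static __inline__ int
tas(volatile slock_t *lock)
{
    slock_t        _res = 1;

    /*
     * Cast the volatile away before forming the memory operand; the "+m"
     * constraint plus the "memory" clobber already force the access, so the
     * qualifier only pessimizes gcc's address generation here.
     */
    slock_t       *nv_lock = (slock_t *) lock;

    __asm__ __volatile__(
        "    lock            \n"
        "    xchgb    %0,%1    \n"
:        "+q"(_res), "+m"(*nv_lock)
:        /* no inputs */
:        "memory", "cc");
    return (int) _res;
}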


A second, even smaller, issue is that we use "lock xchgb" despite xchg with
a memory operand having had an implicit lock approximately forever ([2]).
That makes the code slightly wider than necessary (the lock prefix is one
byte).
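
Dropping the prefix would just mean deleting that line from the asm template;
a sketch, assuming no supported toolchain still wants the explicit prefix:

    __asm__ __volatile__(
        /* xchg with a memory operand asserts LOCK implicitly */
        "    xchgb    %0,%1    \n"
:        "+q"(_res), "+m"(*lock)
:        /* no inputs */
:        "memory", "cc");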


I doubt there are many situations where these end up having a meaningful
performance impact, but it still seems suboptimal. I may be seeing a *small*
gain in a workload inserting lots of tiny records, but it's hard to be sure
it's above the noise floor.


I'm wondering in how many places our fairly broad use of volatile causes
substantially worse code to be generated.

Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/20240729165154.56zqyg34x2ywkpsh%40awork3.anarazel.de
[2] https://www.felixcloutier.com/x86/xchg#description



Re: Suboptimal spinlock code due to volatile

From: Heikki Linnakangas

On 29/07/2024 22:59, Andres Freund wrote:
> After being confused for a while, the explanation is fairly simple: We use
> volatile and dereference the address:
> 
> static __inline__ int
> tas(volatile slock_t *lock)
> {
>     slock_t        _res = 1;
> 
>     __asm__ __volatile__(
>         "    lock            \n"
>         "    xchgb    %0,%1    \n"
> :        "+q"(_res), "+m"(*lock)
> :        /* no inputs */
> :        "memory", "cc");
>     return (int) _res;
> }
> 
> (note the (*lock) and the volatile in the signature).
> 
> I think it'd be just as defensible to not emit a separate load here, despite
> the volatile, and indeed clang doesn't emit a separate load. But it also does
> seem defensible to translate the code very literally, as gcc does.
> 
> 
> If I remove the volatile from the signature or cast it away, gcc indeed
> generates the offset version:
>      4230:    f0 86 82 c0 01 00 00     lock xchg %al,0x1c0(%rdx)

Good catch. Seems safe to just remove the volatile.

> A second, even smaller, issue is that we use "lock xchgb" despite xchg with
> a memory operand having had an implicit lock approximately forever ([2]).
> That makes the code slightly wider than necessary (the lock prefix is one
> byte).
> 
> 
> I doubt there are many situations where these end up having a meaningful
> performance impact, but it still seems suboptimal. I may be seeing a *small*
> gain in a workload inserting lots of tiny records, but it's hard to be sure
> it's above the noise floor.
> 
> 
> I'm wondering in how many places our fairly broad use of volatile causes
> substantially worse code to be generated.

Aside from performance, I find "volatile" difficult to reason about. I 
feel more comfortable with atomics and memory barriers.
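
For comparison, the same acquire/release pair written with C11 atomics could
look like the sketch below (illustrative only; the function names are made up
and this isn't a proposal for s_lock.h):

#include <stdatomic.h>

static inline void
spin_acquire(atomic_flag *lock)
{
    /* test-and-set loop; acquire ordering on the successful iteration */
    while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
        ;                        /* spin */
}

static inline void
spin_release(atomic_flag *lock)
{
    /* release store, analogous to the movb $0x0 unlock above */
    atomic_flag_clear_explicit(lock, memory_order_release);
}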

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Suboptimal spinlock code due to volatile

From: Robert Haas

On Tue, Jul 30, 2024 at 3:46 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Aside from performance, I find "volatile" difficult to reason about. I
> feel more comfortable with atomics and memory barriers.

I think nearly everyone feels more comfortable with atomics and memory
barriers. The semantics of volatile are terrible: it almost always does
either more or less than what you actually wanted, sometimes both.

Reading Andres's original message, I couldn't help wondering if this
is an argument against rolling our own spinlock implementations.
Presumably a compiler intrinsic wouldn't cause this kind of
unfortunate artifact. Our position in the past has essentially been
"we know better," but this seems like a counterexample.

--
Robert Haas
EDB: http://www.enterprisedb.com