On 29/07/2024 22:59, Andres Freund wrote:
> After being confused for a while, the explanation is fairly simple: We use
> volatile and dereference the address:
>
> static __inline__ int
> tas(volatile slock_t *lock)
> {
> slock_t _res = 1;
>
> __asm__ __volatile__(
> " lock \n"
> " xchgb %0,%1 \n"
> : "+q"(_res), "+m"(*lock)
> : /* no inputs */
> : "memory", "cc");
> return (int) _res;
> }
>
> (note the (*lock) and the volatile in the signature).
>
> I think it'd be just as defensible to not emit a separate load here, despite
> the volatile, and indeed clang doesn't emit a separate load. But it also does
> seem defensible to take translate the code very literally, as gcc does.
>
>
> If I remove the volatile from the signature or cast it away, gcc indeed
> generates the offset version:
> 4230: f0 86 82 c0 01 00 00 lock xchg %al,0x1c0(%rdx)
Good catch. Seems safe to just remove the volatile.
> A second, even smaller, issue with the code is that we use "lock xchgb"
> despite xchg having implied lock approximately forever ([2]). That makes the code
> slightly wider than necessary (the lock prefix is one byte).
>
>
> I doubt there's a lot of situations where these end up having a meaningful
> performance impact, but it still seems suboptimal. I may be seeing a *small*
> gain in a workload inserting lots of tiny records, but it's hard to be sure if
> it's above the noise floor.
>
>
> I'm wondering in how many places our fairly broad use of volatiles causes
> more substantially worse code being generated.
Aside from performance, I find "volatile" difficult to reason about. I
feel more comfortable with atomics and memory barriers.
--
Heikki Linnakangas
Neon (https://neon.tech)