Thread: Re: [HACKERS] Fix performance of generic atomics

Re: [HACKERS] Fix performance of generic atomics

From

Sokolov Yura

Date:

25 May 2017, 16:39:22

A bit cleaner version of a patch.

Sokolov Yura писал 2017-05-25 15:22:
> Good day, everyone.
> 
> I've been played with pgbench on huge machine.
> (72 cores, 56 for postgresql, enough memory to fit base
> both into shared_buffers and file cache)
> (pgbench scale 500, unlogged tables, fsync=off,
> synchronous commit=off, wal_writer_flush_after=0).
> 
> With 200 clients performance is around 76000tps and main
> bottleneck in this dumb test is LWLockWaitListLock.
> 
> I added gcc specific implementation for pg_atomic_fetch_or_u32_impl
> (ie using __sync_fetch_and_or) and performance became 83000tps.
> 
> It were a bit strange at a first look, cause __sync_fetch_and_or
> compiles to almost same CAS loop.
> 
> Looking closely, I noticed that intrinsic performs doesn't do
> read in the loop body, but at loop initialization. It is correct
> behavior cause `lock cmpxchg` instruction stores old value in EAX
> register.
> 
> It is expected behavior, and pg_compare_and_exchange_*_impl does
> the same in all implementations. So there is no need to re-read
> value in the loop body:
> 
> Example diff for pg_atomic_exchange_u32_impl:
> 
>  static inline uint32
>  pg_atomic_exchange_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 
> xchg_)
>  {
>      uint32 old;
> +    old = pg_atomic_read_u32_impl(ptr);
>      while (true)
>      {
> -        old = pg_atomic_read_u32_impl(ptr);
>          if (pg_atomic_compare_exchange_u32_impl(ptr, &old, xchg_))
>              break;
>      }
>      return old;
>  }
> 
> After applying this change to all generic atomic functions
> (and for pg_atomic_fetch_or_u32_impl ), performance became
> equal to __sync_fetch_and_or intrinsic.
> 
> Attached patch contains patch for all generic atomic
> functions, and also __sync_fetch_and_(or|and) for gcc, cause
> I believe GCC optimize code around intrinsic better than
> around inline assembler.
> (final performance is around 86000tps, but difference between
> 83000tps and 86000tps is not so obvious in NUMA system).
> 
> With regards,

-- 
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

0001-Fix-performance-of-Atomics-generic-implementation.patch

Re: [HACKERS] Fix performance of generic atomics

From

Tom Lane

Date:

25 May 2017, 17:39:13

Sokolov Yura <funny.falcon@postgrespro.ru> writes:
@@ -382,12 +358,8 @@ static inline uint64pg_atomic_fetch_and_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 and_){
uint64old;
 
-    while (true)
-    {
-        old = pg_atomic_read_u64_impl(ptr);
-        if (pg_atomic_compare_exchange_u64_impl(ptr, &old, old & and_))
-            break;
-    }
+    old = pg_atomic_read_u64_impl(ptr);
+    while (!pg_atomic_compare_exchange_u64_impl(ptr, &old, old & and_));    return old;}#endif

FWIW, I do not think that writing the loops like that is good style.
It looks like a typo and will confuse readers.  You could perhaps
write the same code with better formatting, eg
while (!pg_atomic_compare_exchange_u64_impl(ptr, &old, old & and_))    /* skip */ ;

but why not leave the formulation with while(true) and a break alone?

(I take no position on whether moving the read of "old" outside the
loop is a valid optimization.)
        regards, tom lane

Re: [HACKERS] Fix performance of generic atomics

From

Sokolov Yura

Date:

25 May 2017, 18:12:44

Hello, Tom.

I agree that lonely semicolon looks bad.
Applied your suggestion for empty loop body (/* skip */).

Patch in first letter had while(true), but I removed it cause
I think it is uglier:
- `while(true)` was necessary for grouping read with `if`,
- but now there is single statement in a loop body and it is
   condition for loop exit, so it is clearly just a loop.

Optimization is valid cause compare_exchange always store old value
in `old` variable in a same atomic manner as atomic read.

Tom Lane wrote 2017-05-25 17:39:
> Sokolov Yura <funny.falcon@postgrespro.ru> writes:
> @@ -382,12 +358,8 @@ static inline uint64
>  pg_atomic_fetch_and_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 
> and_)
>  {
>      uint64 old;
> -    while (true)
> -    {
> -        old = pg_atomic_read_u64_impl(ptr);
> -        if (pg_atomic_compare_exchange_u64_impl(ptr, &old, old & and_))
> -            break;
> -    }
> +    old = pg_atomic_read_u64_impl(ptr);
> +    while (!pg_atomic_compare_exchange_u64_impl(ptr, &old, old & and_));
>      return old;
>  }
>  #endif
> 
> FWIW, I do not think that writing the loops like that is good style.
> It looks like a typo and will confuse readers.  You could perhaps
> write the same code with better formatting, eg
> 
>     while (!pg_atomic_compare_exchange_u64_impl(ptr, &old, old & and_))
>         /* skip */ ;
> 
> but why not leave the formulation with while(true) and a break alone?
> 
> (I take no position on whether moving the read of "old" outside the
> loop is a valid optimization.)
> 
>             regards, tom lane

With regards,
-- 
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

0001-Fix-performance-of-Atomics-generic-implementation.patch

Re: [HACKERS] Fix performance of generic atomics

From

Sokolov Yura

Date:

04 June 2017, 16:10:16

Good day, every one.

I'm just posting benchmark numbers for atomics patch.

Hardware: 4 socket 72 core (144HT) x86_64 Centos 7.1
postgresql.conf tuning:
shared_buffers = 32GB
fsync = on
synchronous_commit = on
full_page_writes = off
wal_buffers = 16MB
wal_writer_flush_after = 16MB
commit_delay = 2
max_wal_size = 16GB

Results:
pgbench -i -s 300 + pgbench --skip-some-updates

Clients |  master |  atomics
========+=========+=======
    50   |  53.1k  |  53.2k
   100   | 101.2k  | 103.5k
   150   | 119.1k  | 121.9k
   200   | 128.7k  | 132.5k
   252   | 120.2k  | 130.0k
   304   | 100.8k  | 115.9k
   356   |  78.1k  |  90.1k
   395   |  70.2k  |  79.0k
   434   |  61.6k  |  70.7k

Also graph with more points attached.

On 2017-05-25 18:12, Sokolov Yura wrote:
> Hello, Tom.
> 
> I agree that lonely semicolon looks bad.
> Applied your suggestion for empty loop body (/* skip */).
> 
> Patch in first letter had while(true), but I removed it cause
> I think it is uglier:
> - `while(true)` was necessary for grouping read with `if`,
> - but now there is single statement in a loop body and it is
>   condition for loop exit, so it is clearly just a loop.
> 
> Optimization is valid cause compare_exchange always store old value
> in `old` variable in a same atomic manner as atomic read.
> 
> Tom Lane wrote 2017-05-25 17:39:
>> Sokolov Yura <funny.falcon@postgrespro.ru> writes:
>> @@ -382,12 +358,8 @@ static inline uint64
>>  pg_atomic_fetch_and_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 
>> and_)
>>  {
>>      uint64 old;
>> -    while (true)
>> -    {
>> -        old = pg_atomic_read_u64_impl(ptr);
>> -        if (pg_atomic_compare_exchange_u64_impl(ptr, &old, old & and_))
>> -            break;
>> -    }
>> +    old = pg_atomic_read_u64_impl(ptr);
>> +    while (!pg_atomic_compare_exchange_u64_impl(ptr, &old, old & and_));
>>      return old;
>>  }
>>  #endif
>> 
>> FWIW, I do not think that writing the loops like that is good style.
>> It looks like a typo and will confuse readers.  You could perhaps
>> write the same code with better formatting, eg
>> 
>>     while (!pg_atomic_compare_exchange_u64_impl(ptr, &old, old & and_))
>>         /* skip */ ;
>> 
>> but why not leave the formulation with while(true) and a break alone?
>> 
>> (I take no position on whether moving the read of "old" outside the
>> loop is a valid optimization.)
>> 
>>             regards, tom lane

With regards,
-- 
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

compare.png

Re: [HACKERS] Fix performance of generic atomics

From

Jesper Pedersen

Date:

05 September 2017, 20:47:38

Hi,

On 05/25/2017 11:12 AM, Sokolov Yura wrote:
> I agree that lonely semicolon looks bad.
> Applied your suggestion for empty loop body (/* skip */).
> 
> Patch in first letter had while(true), but I removed it cause
> I think it is uglier:
> - `while(true)` was necessary for grouping read with `if`,
> - but now there is single statement in a loop body and it is
>    condition for loop exit, so it is clearly just a loop.
> 
> Optimization is valid cause compare_exchange always store old value
> in `old` variable in a same atomic manner as atomic read.
> 

I have tested this patch on a 2-socket machine, but don't see any 
performance change in the various runs. However, there is no regression 
either in all cases.

As such, I have marked the entry "Ready for Committer".

Remember to add a version postfix to your patches such that is easy to 
identify which is the latest version.

Best regards, Jesper

Re: [HACKERS] Fix performance of generic atomics

From

Jesper Pedersen

Date:

05 September 2017, 21:09:19

Hi,

On 05/25/2017 11:12 AM, Sokolov Yura wrote:
> I agree that lonely semicolon looks bad.
> Applied your suggestion for empty loop body (/* skip */).
> 
> Patch in first letter had while(true), but I removed it cause
> I think it is uglier:
> - `while(true)` was necessary for grouping read with `if`,
> - but now there is single statement in a loop body and it is
>    condition for loop exit, so it is clearly just a loop.
> 
> Optimization is valid cause compare_exchange always store old value
> in `old` variable in a same atomic manner as atomic read.
> 

I have tested this patch on a 2-socket machine, but don't see any 
performance change in the various runs. However, there is no regression 
either in all cases.

As such, I have marked the entry "Ready for Committer".

Remember to add a version postfix to your patches such that is easy to 
identify which is the latest version.

Best regards, Jesper

Re: [HACKERS] Fix performance of generic atomics

From

Tom Lane

Date:

05 September 2017, 21:24:38

Jesper Pedersen <jesper.pedersen@redhat.com> writes:
> I have tested this patch on a 2-socket machine, but don't see any 
> performance change in the various runs. However, there is no regression 
> either in all cases.

Hm, so if we can't demonstrate a performance win, it's hard to justify
risking touching this code.  What test case(s) did you use?
        regards, tom lane

Re: [HACKERS] Fix performance of generic atomics

From

Jesper Pedersen

Date:

05 September 2017, 21:39:05

On 09/05/2017 02:24 PM, Tom Lane wrote:
> Jesper Pedersen <jesper.pedersen@redhat.com> writes:
>> I have tested this patch on a 2-socket machine, but don't see any
>> performance change in the various runs. However, there is no regression
>> either in all cases.
> 
> Hm, so if we can't demonstrate a performance win, it's hard to justify
> risking touching this code.  What test case(s) did you use?
> 

I ran pgbench (-M prepared) with synchronous_commit 'on' and 'off' using 
both logged and unlogged tables. Also ran an internal benchmark which 
didn't show anything either.

Setting the entry back to "Needs Review" for additional feedback from 
others.

Best regards, Jesper

Re: [HACKERS] Fix performance of generic atomics

From

Tom Lane

Date:

05 September 2017, 21:51:51

Jesper Pedersen <jesper.pedersen@redhat.com> writes:
> On 09/05/2017 02:24 PM, Tom Lane wrote:
>> Hm, so if we can't demonstrate a performance win, it's hard to justify
>> risking touching this code.  What test case(s) did you use?

> I ran pgbench (-M prepared) with synchronous_commit 'on' and 'off' using 
> both logged and unlogged tables. Also ran an internal benchmark which 
> didn't show anything either.

That may just mean that pgbench isn't stressing any atomic ops very
hard (at least in the default scenario).

I'm tempted to write a little C function that just hits the relevant
atomic ops in a tight loop, and see how long it takes to do a few
million iterations.  That would be erring in the opposite direction,
of overstating the importance of atomic ops to real-world scenarios
--- but if we didn't get any win that way, then it's surely in the noise.
        regards, tom lane

Re: [HACKERS] Fix performance of generic atomics

From

Jeff Janes

Date:

05 September 2017, 22:47:41

On Tue, Sep 5, 2017 at 11:39 AM, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:

On 09/05/2017 02:24 PM, Tom Lane wrote:
Jesper Pedersen <jesper.pedersen@redhat.com> writes:
I have tested this patch on a 2-socket machine, but don't see any
performance change in the various runs. However, there is no regression
either in all cases.

Hm, so if we can't demonstrate a performance win, it's hard to justify
risking touching this code. What test case(s) did you use?

I ran pgbench (-M prepared) with synchronous_commit 'on' and 'off' using both logged and unlogged tables. Also ran an internal benchmark which didn't show anything either.

What scale factor and client count? How many cores per socket? It looks like Sokolov was just starting to see gains at 200 clients on 72 cores, using -N transaction.

Cheers,

Jeff

Re: [HACKERS] Fix performance of generic atomics

From

Tom Lane

Date:

06 September 2017, 07:23:25

Jeff Janes <jeff.janes@gmail.com> writes:
> What scale factor and client count? How many cores per socket?  It looks
> like Sokolov was just starting to see gains at 200 clients on 72 cores,
> using -N transaction.

I spent some time poking at this with the test scaffolding I showed at
https://www.postgresql.org/message-id/17694.1504665058%40sss.pgh.pa.us

AFAICT, there are not huge differences between different coding methods
even for two processes in direct competition in the tightest possible
atomic-ops loops.  So if you're testing SQL operations, you're definitely
not going to see anything without a whole lot of concurrent processes.

Moreover, it matters which primitive you're testing, on which platform,
with which compiler, because we have a couple of layers of atomic ops
implementations.

I started out testing pg_atomic_fetch_add_u32(), as shown in the above-
mentioned message.  On x86_x64 with gcc, that is implemented by the
code for it in src/include/port/atomics/arch-x86.h.  If you dike out
that support, it falls back to the version in atomics/generic-gcc.h,
which seems no worse and possibly better.  Only if you also dike out
generic-gcc.h do you get to the version in atomics/generic.h that this
patch wants to change.  However, at that point you're also down to a
spinlock-based implementation of the underlying pg_atomic_read_u32_impl
and pg_atomic_compare_exchange_u32_impl, which means that performance
is going to be less than great anyway.  No platform that we consider
well-supported ought to be hitting the spinlock paths.  This means
that Sokolov's proposed changes in atomics/generic.h ought to be
uninteresting for performance on any platform we care about --- at
least for pg_atomic_fetch_add_u32().

However, Sokolov also proposes adding gcc-intrinsic support for
pg_atomic_fetch_and_xxx and pg_atomic_fetch_or_xxx.  This is a
different kettle of fish.  I repeated the experiments I'd done
for pg_atomic_fetch_add_u32(), per the above message, using the "or"
primitive, and found largely the same results as for "add": it's
slightly better under contention than the generic code, and
significantly better if the result value of the atomic op isn't needed.

So I think we should definitely take the part of the patch that
changes generic-gcc.h --- and we should check to see if we're missing
out on any other gcc primitives we should be using there.

I'm less excited about the proposed changes in generic.h.  There's
nothing wrong with them in principle, but they mostly shouldn't
make any performance difference on any platform we care about,
because we shouldn't get to that code in the first place on any
platform we care about.  If we are getting to that code for any
specific primitive, then either there's a gap in the platform-specific
or compiler-specific support, or it's debatable that we ought to
consider that primitive to be very primitive.
        regards, tom lane

Re: [HACKERS] Fix performance of generic atomics

From

Sokolov Yura

Date:

06 September 2017, 14:54:55

On 2017-09-06 07:23, Tom Lane wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> What scale factor and client count? How many cores per socket?  It 
>> looks
>> like Sokolov was just starting to see gains at 200 clients on 72 
>> cores,
>> using -N transaction.
> This means that Sokolov's proposed changes in atomics/generic.h
> ought to be uninteresting for performance on any platform we care about 
> --- at
> least for pg_atomic_fetch_add_u32().

May I cite you this way:
"Tom Lane says PostgreSQL core team doesn't care for 4 socket 72 core
Intel Xeon servers"
???

To be honestly, I didn't measure exact impact of changing 
pg_atomic_fetch_add_u32.
I found issue with pg_atomic_fetch_or_u32, cause it is highly contended
both in LWLock, and in buffer page state management. I found changing
loop for generic pg_atomic_fetch_or_u32 already gives improvement.
And I changed loop for other generic atomic functions to make
all functions uniform.

It is bad style guide not to changing worse (but correct) code into
better (and also correct) just because you can't measure difference
(in my opinion. Perhaps i am mistaken).

(and yes: gcc intrinsic gives more improvement).

--
With regards,
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company

Re: [HACKERS] Fix performance of generic atomics

From

Sokolov Yura

Date:

06 September 2017, 15:21:04

On 2017-09-06 14:54, Sokolov Yura wrote:
> On 2017-09-06 07:23, Tom Lane wrote:
>> Jeff Janes <jeff.janes@gmail.com> writes:
>>> What scale factor and client count? How many cores per socket?  It 
>>> looks
>>> like Sokolov was just starting to see gains at 200 clients on 72 
>>> cores,
>>> using -N transaction.
>> This means that Sokolov's proposed changes in atomics/generic.h
>> ought to be uninteresting for performance on any platform we care 
>> about --- at
>> least for pg_atomic_fetch_add_u32().
> 
> May I cite you this way:

I apologize for tone of my previous message.
My tongue is my enemy.

-- 
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company

Re: [HACKERS] Fix performance of generic atomics

From

Simon Riggs

Date:

06 September 2017, 15:34:46

On 5 September 2017 at 21:23, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> What scale factor and client count? How many cores per socket?  It looks
>> like Sokolov was just starting to see gains at 200 clients on 72 cores,
>> using -N transaction.

...
> Moreover, it matters which primitive you're testing, on which platform,
> with which compiler, because we have a couple of layers of atomic ops
> implementations.
...

I think Sokolov was aiming at 4-socket servers specifically, rather
than as a general performance gain.

If there is no gain on 2-socket, at least there is no loss either.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] Fix performance of generic atomics

From

Tom Lane

Date:

06 September 2017, 15:56:20

Simon Riggs <simon@2ndquadrant.com> writes:
> On 5 September 2017 at 21:23, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Moreover, it matters which primitive you're testing, on which platform,
>> with which compiler, because we have a couple of layers of atomic ops
>> implementations.

> If there is no gain on 2-socket, at least there is no loss either.

The point I'm trying to make is that if tweaking generic.h improves
performance then it's an indicator of missed cases in the less-generic
atomics code, and the latter is where our attention should be focused.
I think basically all of the improvement Sokolov got was from upgrading
the coverage of generic-gcc.h.
        regards, tom lane

Re: [HACKERS] Fix performance of generic atomics

From

Sokolov Yura

Date:

06 September 2017, 16:31:53

On 2017-09-06 15:56, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
>> On 5 September 2017 at 21:23, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Moreover, it matters which primitive you're testing, on which 
>>> platform,
>>> with which compiler, because we have a couple of layers of atomic ops
>>> implementations.
> 
>> If there is no gain on 2-socket, at least there is no loss either.
> 
> The point I'm trying to make is that if tweaking generic.h improves
> performance then it's an indicator of missed cases in the less-generic
> atomics code, and the latter is where our attention should be focused.
> I think basically all of the improvement Sokolov got was from upgrading
> the coverage of generic-gcc.h.
> 
>             regards, tom lane

Not exactly. I've checked, that new version of generic 
pg_atomic_fetch_or_u32
loop also gives improvement. Without that check I'd not suggest to fix
generic atomic functions. Of course, gcc intrinsic gives more gain.

-- 
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company

Re: [HACKERS] Fix performance of generic atomics

From

Tom Lane

Date:

06 September 2017, 16:36:36

Sokolov Yura <funny.falcon@postgrespro.ru> writes:
> On 2017-09-06 15:56, Tom Lane wrote:
>> The point I'm trying to make is that if tweaking generic.h improves
>> performance then it's an indicator of missed cases in the less-generic
>> atomics code, and the latter is where our attention should be focused.
>> I think basically all of the improvement Sokolov got was from upgrading
>> the coverage of generic-gcc.h.

> Not exactly. I've checked, that new version of generic 
> pg_atomic_fetch_or_u32
> loop also gives improvement.

But once you put in the generic-gcc version, that's not reached anymore.
        regards, tom lane

Re: [HACKERS] Fix performance of generic atomics

From

Sokolov Yura

Date:

06 September 2017, 16:42:29

On 2017-09-06 16:36, Tom Lane wrote:
> Sokolov Yura <funny.falcon@postgrespro.ru> writes:
>> On 2017-09-06 15:56, Tom Lane wrote:
>>> The point I'm trying to make is that if tweaking generic.h improves
>>> performance then it's an indicator of missed cases in the 
>>> less-generic
>>> atomics code, and the latter is where our attention should be 
>>> focused.
>>> I think basically all of the improvement Sokolov got was from 
>>> upgrading
>>> the coverage of generic-gcc.h.
> 
>> Not exactly. I've checked, that new version of generic
>> pg_atomic_fetch_or_u32
>> loop also gives improvement.
> 
> But once you put in the generic-gcc version, that's not reached 
> anymore.
> 

Yes, you're right.

But I think, generic version still should be "fixed".
If generic version is not reached on any platform, then why it is kept?
If it is reached somewhere, then it should be improved.

-- 
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company

Re: [HACKERS] Fix performance of generic atomics

From

Robert Haas

Date:

06 September 2017, 18:35:00

On Wed, Sep 6, 2017 at 9:42 AM, Sokolov Yura
<funny.falcon@postgrespro.ru> wrote:
> Yes, you're right.
>
> But I think, generic version still should be "fixed".
> If generic version is not reached on any platform, then why it is kept?
> If it is reached somewhere, then it should be improved.

This seems like a pretty sound argument to me.  I think Tom's probably
right that the changes in generic-gcc.h are the important ones, but
I'm not sure that's an argument against patching generics.h.  Given
that pg_atomic_compare_exchange_u32_impl is defined to update *old
there seems to be no reason to call pg_atomic_read_u32_impl every time
through the loop.  Whether doing so hurts performance in practice is
likely to depend on a bunch of stuff, but we don't normally refuse to
remove unnecessary code on the grounds that it will rarely be reached.
Given that CPUs are weird, it's possible that there is some system
where throwing an extra read of a value we already have into the loop
works out to a win, but in the absence of evidence I'm reluctant to
presume it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] Fix performance of generic atomics

From

Tom Lane

Date:

06 September 2017, 19:01:24

Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Sep 6, 2017 at 9:42 AM, Sokolov Yura
> <funny.falcon@postgrespro.ru> wrote:
>> But I think, generic version still should be "fixed".
>> If generic version is not reached on any platform, then why it is kept?
>> If it is reached somewhere, then it should be improved.

> This seems like a pretty sound argument to me.  I think Tom's probably
> right that the changes in generic-gcc.h are the important ones, but
> I'm not sure that's an argument against patching generics.h.  Given
> that pg_atomic_compare_exchange_u32_impl is defined to update *old
> there seems to be no reason to call pg_atomic_read_u32_impl every time
> through the loop.

Probably not.  I'm not quite 100% convinced of that, because of my
observation that gcc is capable of generating different and better
code for some of these primitives if it can prove that the return
value is not needed.  It's not clear that that could apply in any
of these uses of pg_atomic_compare_exchange_u32_impl, though.
In any case, by my own argument, it shouldn't matter, because if
any of these are really performance-critical then we should be
looking for better ways.
        regards, tom lane

Re: [HACKERS] Fix performance of generic atomics

From

Robert Haas

Date:

06 September 2017, 19:09:11

On Wed, Sep 6, 2017 at 12:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> This seems like a pretty sound argument to me.  I think Tom's probably
>> right that the changes in generic-gcc.h are the important ones, but
>> I'm not sure that's an argument against patching generics.h.  Given
>> that pg_atomic_compare_exchange_u32_impl is defined to update *old
>> there seems to be no reason to call pg_atomic_read_u32_impl every time
>> through the loop.
>
> Probably not.  I'm not quite 100% convinced of that, because of my
> observation that gcc is capable of generating different and better
> code for some of these primitives if it can prove that the return
> value is not needed.  It's not clear that that could apply in any
> of these uses of pg_atomic_compare_exchange_u32_impl, though.
> In any case, by my own argument, it shouldn't matter, because if
> any of these are really performance-critical then we should be
> looking for better ways.

It's not a question of whether the return value is used, but of
whether the updated value of *old is used.

Unless I'm confused.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] Fix performance of generic atomics

From

Tom Lane

Date:

06 September 2017, 20:17:18

Robert Haas <robertmhaas@gmail.com> writes:
> It's not a question of whether the return value is used, but of
> whether the updated value of *old is used.

Right, but if we re-read "old" in the loop, and if the primitive
doesn't return "old" (or does, but call site ignores it) then in
principle the CAS might be strength-reduced a bit.  Whether that
can win enough to be better than removing the unlocked read,
I dunno.

In the case at hand, looking at a loop like
while (count-- > 0){    (void) pg_atomic_fetch_or_u32(myptr, 1);}

with our HEAD code and RHEL6's gcc I get this for the inner loop:

.L9:movl    (%rdx), %eaxmovl    %eax, %ecxorl    $1, %ecxlock                cmpxchgl    %ecx,(%rdx)    setz        %cl
      testb    %cl, %clje    .L9subq    $1, %rbxtestq    %rbx, %rbxjg    .L9

Applying the proposed generic.h patch, it becomes

.L10:movl    (%rsi), %eax
.L9:movl    %eax, %ecxorl    $1, %ecxlock                cmpxchgl    %ecx,(%rdx)    setz        %cl        testb
%cl,%clje    .L9subq    $1, %rbxtestq    %rbx, %rbxjg    .L10

Note that in both cases the cmpxchgl is coming out of the asm construct in
pg_atomic_compare_exchange_u32_impl from atomics/arch-x86.h, so that even
if a simpler assembly instruction were possible given that we don't need
%eax to be updated, there's no chance of that actually happening.  This
gets back to the point I made in the other thread that maybe the
arch-x86.h asm sequences are not an improvement over the gcc intrinsics.

The code is the same if the loop is modified to use the result of
pg_atomic_fetch_or_u32, so I won't bother showing that.

Adding the proposed generic-gcc.h patch, the loop reduces to

.L11:lock orl    $1, (%rax)subq    $1, %rbxtestq    %rbx, %rbxjg    .L11

or if we make the loop do    junk += pg_atomic_fetch_or_u32(myptr, 1);
then we get

.L13:movl    (%rsi), %eax
.L10:movl    %eax, %edimovl    %eax, %ecxorl    $1, %ecxlock cmpxchgl    %ecx, (%rdx)jne    .L10addl    %edi, %r8dsubq
 $1, %rbxtestq    %rbx, %rbxjg    .L13

So that last is slightly better than the generic.h + asm CAS version,
perhaps not meaningfully so --- but it's definitely better when
the compiler can see the result isn't used.

In short then, given the facts on the ground here, in particular the
asm-based CAS primitive at the heart of these loops, it's clear that
taking the read out of the loop can't hurt.  But that should be read
as a rather narrow conclusion.  With a different compiler and/or a
different version of pg_atomic_compare_exchange_u32_impl, maybe the
answer would be different.  And of course it's moot once the
generic-gcc.h patch is applied.

Anyway, I don't have a big objection to applying this.  My concern
is more that we need to be taking a harder look at other parts of
the atomics infrastructure, because tweaks there are likely to buy
much more.
        regards, tom lane

Re: [HACKERS] Fix performance of generic atomics

From

Jesper Pedersen

Date:

06 September 2017, 20:41:27

Hi Jeff,

On 09/05/2017 03:47 PM, Jeff Janes wrote:
>> I ran pgbench (-M prepared) with synchronous_commit 'on' and 'off' using
>> both logged and unlogged tables. Also ran an internal benchmark which
>> didn't show anything either.
>>
> 
> What scale factor and client count? How many cores per socket?  It looks
> like Sokolov was just starting to see gains at 200 clients on 72 cores,
> using -N transaction.

I have done a run with scale factor 300, and another with 3000 on a 
2S/28C/56T/256Gb w/ 2 x RAID10 SSD machine; up to 200 clients.

I would consider the runs as "noise" as I'm seeing +-1% for all client 
counts, so nothing like Yura is seeing in [1] for the higher client counts.

I did a run with -N too using scale factor 300, using the settings in 
[1], but with same result (+-1%).

[1] 
https://www.postgresql.org/message-id/d62d7d9d473d07e172d799d5a57e70be@postgrespro.ru

Best regards, Jesper

Re: [HACKERS] Fix performance of generic atomics

From

Tom Lane

Date:

06 September 2017, 21:31:26

I wrote:
> Anyway, I don't have a big objection to applying this.  My concern
> is more that we need to be taking a harder look at other parts of
> the atomics infrastructure, because tweaks there are likely to buy
> much more.

I went ahead and pushed these patches, adding __sync_fetch_and_sub
since gcc seems to provide that on the same footing as these other
functions.

Looking at these generic functions a bit closer, I notice that
most of them are coded like
old = pg_atomic_read_u32_impl(ptr);while (!pg_atomic_compare_exchange_u32_impl(ptr, &old, old | or_))    /* skip */;

but there's an exception: pg_atomic_exchange_u64_impl just does
old = ptr->value;while (!pg_atomic_compare_exchange_u64_impl(ptr, &old, xchg_))    /* skip */;

That's interesting.  Why not pg_atomic_read_u64_impl there?

I think that the simple read is actually okay as it stands: if we
are unlucky enough to get a torn read, the first compare/exchange
will fail to compare equal, and it will replace "old" with a valid
atomically-read value, and then the next loop iteration has a chance
to succeed.  Therefore there's no need to pay the extra cost of a
locked read instead of an unlocked one.

However, if that's the reasoning, why don't we make all of these
use simple reads?  It seems unlikely that a locked read is free.

If there's actually a reason for pg_atomic_exchange_u64_impl to be
different from the rest, it needs to have a comment explaining why.
        regards, tom lane

Re: [HACKERS] Fix performance of generic atomics

From

Andres Freund

Date:

06 September 2017, 21:45:37

Hi,

On 2017-09-06 14:31:26 -0400, Tom Lane wrote:
> I wrote:
> > Anyway, I don't have a big objection to applying this.  My concern
> > is more that we need to be taking a harder look at other parts of
> > the atomics infrastructure, because tweaks there are likely to buy
> > much more.
> 
> I went ahead and pushed these patches, adding __sync_fetch_and_sub
> since gcc seems to provide that on the same footing as these other
> functions.
> 
> Looking at these generic functions a bit closer, I notice that
> most of them are coded like
> 
>     old = pg_atomic_read_u32_impl(ptr);
>     while (!pg_atomic_compare_exchange_u32_impl(ptr, &old, old | or_))
>         /* skip */;
> 
> but there's an exception: pg_atomic_exchange_u64_impl just does
> 
>     old = ptr->value;
>     while (!pg_atomic_compare_exchange_u64_impl(ptr, &old, xchg_))
>         /* skip */;
> 
> That's interesting.  Why not pg_atomic_read_u64_impl there?

Because our platforms don't generally guaranatee that 64bit reads:
/* * 64 bit reads aren't safe on all platforms. In the generic * implementation implement them as a compare/exchange
with0. That'll * fail or succeed, but always return the old value. Possible might store * a 0, but only if the prev.
valuealso was a 0 - i.e. harmless. */pg_atomic_compare_exchange_u64_impl(ptr, &old, 0);

so I *guess* I decided doing an unnecessary cmpxchg wasn't useful. But
since then we've added an optimization for platforms where we know 64bit
reads have single copy atomicity. So we could change that now.

> I think that the simple read is actually okay as it stands: if we
> are unlucky enough to get a torn read, the first compare/exchange
> will fail to compare equal, and it will replace "old" with a valid
> atomically-read value, and then the next loop iteration has a chance
> to succeed.  Therefore there's no need to pay the extra cost of a
> locked read instead of an unlocked one.
> 
> However, if that's the reasoning, why don't we make all of these
> use simple reads?  It seems unlikely that a locked read is free.

We don't really use locked reads? All the _atomic_ wrapper forces is an
actual read from memory rather than a register.

Greetings,

Andres Freund

Re: [HACKERS] Fix performance of generic atomics

From

Tom Lane

Date:

06 September 2017, 22:12:13

Andres Freund <andres@anarazel.de> writes:
> On 2017-09-06 14:31:26 -0400, Tom Lane wrote:
>> However, if that's the reasoning, why don't we make all of these
>> use simple reads?  It seems unlikely that a locked read is free.

> We don't really use locked reads? All the _atomic_ wrapper forces is an
> actual read from memory rather than a register.

It looks to me like two of the three implementations promise no such
thing.  Even if they somehow do, it hardly matters given that the cmpxchg
loop would be self-correcting.  Mostly, though, I'm looking at the
fallback pg_atomic_read_u64_impl implementation (with a CAS), which
seems far more expensive than can be justified for this.
        regards, tom lane

Re: [HACKERS] Fix performance of generic atomics

From

Andres Freund

Date:

06 September 2017, 22:19:05

On 2017-09-06 15:12:13 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2017-09-06 14:31:26 -0400, Tom Lane wrote:
> >> However, if that's the reasoning, why don't we make all of these
> >> use simple reads?  It seems unlikely that a locked read is free.
> 
> > We don't really use locked reads? All the _atomic_ wrapper forces is an
> > actual read from memory rather than a register.
> 
> It looks to me like two of the three implementations promise no such
> thing.

They're volatile vars, so why not?


> Even if they somehow do, it hardly matters given that the cmpxchg loop
> would be self-correcting.

Well, in this one instance maybe, hardly in others.


> Mostly, though, I'm looking at the fallback pg_atomic_read_u64_impl
> implementation (with a CAS), which seems far more expensive than can
> be justified for this.

What are you suggesting as an alternative?

- Andres

Re: [HACKERS] Fix performance of generic atomics

From

Tom Lane

Date:

06 September 2017, 22:25:20

Andres Freund <andres@anarazel.de> writes:
> On 2017-09-06 15:12:13 -0400, Tom Lane wrote:
>> It looks to me like two of the three implementations promise no such
>> thing.

> They're volatile vars, so why not?

Yeah, but so are the caller's variables.  That is, in

pg_atomic_exchange_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 xchg_)
{uint64 old;old = ptr->value;

ISTM that the compiler is required to actually fetch ptr->value, not
rely on some previous read of it.  I do not think that (the first
version of) pg_atomic_read_u64_impl is adding any guarantee that wasn't
there already.

>> Even if they somehow do, it hardly matters given that the cmpxchg loop
>> would be self-correcting.

> Well, in this one instance maybe, hardly in others.

All the functions involved use nigh-identical cmpxchg loops.

> What are you suggesting as an alternative?

I think we can just use "old = ptr->value" to set up for the cmpxchg
loop in every generic.h function that uses such a loop.
        regards, tom lane

Re: [HACKERS] Fix performance of generic atomics

From

Andres Freund

Date:

06 September 2017, 22:27:46

On 2017-09-06 15:25:20 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2017-09-06 15:12:13 -0400, Tom Lane wrote:
> >> It looks to me like two of the three implementations promise no such
> >> thing.
> 
> > They're volatile vars, so why not?
> 
> Yeah, but so are the caller's variables.  That is, in
> 
> pg_atomic_exchange_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 xchg_)
> {
>     uint64 old;
>     old = ptr->value;
> 
> ISTM that the compiler is required to actually fetch ptr->value, not
> rely on some previous read of it.  I do not think that (the first
> version of) pg_atomic_read_u64_impl is adding any guarantee that wasn't
> there already.
> 
> >> Even if they somehow do, it hardly matters given that the cmpxchg loop
> >> would be self-correcting.
> 
> > Well, in this one instance maybe, hardly in others.
> 
> All the functions involved use nigh-identical cmpxchg loops.
> 
> > What are you suggesting as an alternative?
> 
> I think we can just use "old = ptr->value" to set up for the cmpxchg
> loop in every generic.h function that uses such a loop.

I think we might have been talking past each other - I thought you were
talking about changing the pg_atomic_read_u64_impl implementation for
external users.

Greetings,

Andres Freund

Re: [HACKERS] Fix performance of generic atomics

From

Tom Lane

Date:

06 September 2017, 22:34:09

Andres Freund <andres@anarazel.de> writes:
> On 2017-09-06 15:25:20 -0400, Tom Lane wrote:
>> I think we can just use "old = ptr->value" to set up for the cmpxchg
>> loop in every generic.h function that uses such a loop.

> I think we might have been talking past each other - I thought you were
> talking about changing the pg_atomic_read_u64_impl implementation for
> external users.

Ah.  I was not thinking of touching pg_atomic_read_u32/u64_impl,
although now that you mention it, it's not clear to me why we
couldn't simplify

-    return *(&ptr->value);
+    return ptr->value;

AFAIK, the compiler is entitled to, and does, simplify away that
take-a-pointer-and-immediately-dereference-it dance.  If it did
not, a whole lot of standard array locutions would be much less
efficient than they should be.  What matters here is the volatile
qualifier, which we've already got.
        regards, tom lane

Re: [HACKERS] Fix performance of generic atomics

From

Tom Lane

Date:

06 September 2017, 22:54:20

I wrote:
> Ah.  I was not thinking of touching pg_atomic_read_u32/u64_impl,
> although now that you mention it, it's not clear to me why we
> couldn't simplify

> -    return *(&ptr->value);
> +    return ptr->value;

Just to check, I applied that change to pg_atomic_read_u32_impl and
pg_atomic_read_u64_impl, and recompiled.  I get bit-for-bit the
same backend executable.  Maybe it would have an effect on some
other compiler, but I doubt it, except perhaps at -O0.
        regards, tom lane