From Tom Lane
Subject Re: Possible performance regression in version 10.1 with pgbench read-write tests.
Date
Msg-id 13622.1532119413@sss.pgh.pa.us
In response to Re: Possible performance regression in version 10.1 with pgbench read-write tests.  (Andres Freund <andres@anarazel.de>)
Responses Re: Possible performance regression in version 10.1 with pgbench read-write tests.  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Andres Freund <andres@anarazel.de> writes:
> On 2018-07-20 15:35:39 -0400, Tom Lane wrote:
>> In any case, I strongly resist making performance-based changes on
>> the basis of one test on one kernel and one hardware platform.

> Sure, it'd be good to do more of that. But from a theoretical POV it's
> quite logical that posix semas sharing cachelines is bad for
> performance, if there's any contention. When backed by futexes -
i.e. all non-ancient linux machines - the hot path just does a cmpxchg
> of the *userspace* data (I've copied the relevant code below).
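
(Roughly, that fast path amounts to the sketch below.  This is an
illustration only, not the glibc code Andres quoted; the struct layout
and names are invented.  The point is just that the uncontended path is
a single userspace cmpxchg and never enters the kernel.)

#include <stdbool.h>
#include <stdint.h>

typedef struct
{
    uint32_t value;             /* semaphore count, shared between processes */
} futex_sem;

/*
 * Try to take a token without any syscall.  Returns false when the count
 * is zero, in which case the caller has to futex-wait in the kernel.
 */
static bool
sem_trywait_fast(futex_sem *s)
{
    uint32_t cur = __atomic_load_n(&s->value, __ATOMIC_RELAXED);

    while (cur > 0)
    {
        /* cmpxchg on the userspace word; the kernel is never involved */
        if (__atomic_compare_exchange_n(&s->value, &cur, cur - 1,
                                        false,
                                        __ATOMIC_ACQUIRE,
                                        __ATOMIC_RELAXED))
            return true;
        /* cur now holds the value we raced against; retry */
    }
    return false;
}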

Here's the thing: the hot path is of little or no interest, because
if we are in the sema code at all, we are expecting to block.  The
only case where we wouldn't block is if the lock manager decided the
current process needs to sleep, but some other process already released
us by the time we reach the futex/kernel call.  Certainly that will happen
some of the time, but it's not likely to be the way to bet.  So I'm very
dubious of any arguments based on the speed of the "uncontended" path.
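
To spell that out with plain POSIX calls (the function name here is
mine, not anything in Postgres): by the time a backend reaches this,
the lock manager has already decided it must sleep, so the expected
outcome is the blocking branch; the trywait branch wins only when some
other process posted the semaphore in the window before we arrive.

#include <errno.h>
#include <semaphore.h>

static void
wait_for_wakeup(sem_t *sem)
{
    /* Rare case: we were already released; no kernel sleep needed. */
    if (sem_trywait(sem) == 0)
        return;

    /* Expected case: block in the kernel until someone posts us. */
    while (sem_wait(sem) < 0 && errno == EINTR)
        ;                       /* retry if interrupted by a signal */
}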

It's possible that the bigger picture here is that the kernel boys
optimized for the "uncontended" path to the point where they broke
performance of the blocking path.  It's hard to see how they could
have broken it to the point of being slower than the SysV sema API,
though.

Anyway, I think we need to test first and patch second.  I'm working
on getting some numbers on my own machines now.

On my RHEL6 machine, with unmodified HEAD and 8 sessions (since I've
only got 8 cores) but other parameters matching Mithun's example,
I just got

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 300
query mode: prepared
number of clients: 8
number of threads: 8
duration: 1800 s
number of transactions actually processed: 29001016
latency average = 0.497 ms
tps = 16111.575661 (including connections establishing)
tps = 16111.623329 (excluding connections establishing)

which is interesting because vmstat was pretty consistently reporting
around 500000 context switches/second during the run, or circa 30
cs/transaction.  We'd expect a minimum of 14 cs/transaction just between
client and server (each of TPC-B's seven SQL commands per transaction
costs at least one switch into the server and one back), so the excess
over that floor seems on the low side; not a lot of lock contention
here, it seems.  I wonder what the corresponding ratio was in Mithun's
runs.

            regards, tom lane

