Thread: Possible performance regression in version 10.1 with pgbench read-write tests.

Hi all,

While doing read-write pgbench benchmarking of PostgreSQL 9.6.6
vs. 10.1, I found that PostgreSQL 10.1 regresses against 9.6.6 in some
cases.

Non-default settings and test
=============================
Server:
./postgres -c shared_buffers=8GB -N 200 -c min_wal_size=15GB -c
max_wal_size=20GB -c checkpoint_timeout=900 -c
maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 &

Pgbench:
CASE 1: when data fits shared buffers.
./pgbench -i -s 1000 postgres

CASE 2: when data exceeds shared buffers.
./pgbench -i -s 1000 postgres

./pgbench -c $threads -j $threads -T 1800 -M prepared postgres

The script "perf_buff_mgmt_write-2.sh", attached below, can be used to run the same tests.


Machine: "cthulhu", an 8-node NUMA machine with 128 hyperthreads.
===================================================
>numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 65 66 67 68 69 70 71 96 97 98 99 100 101 102 103
node 0 size: 65498 MB
node 0 free: 37885 MB
node 1 cpus: 72 73 74 75 76 77 78 79 104 105 106 107 108 109 110 111
node 1 size: 65536 MB
node 1 free: 31215 MB
node 2 cpus: 80 81 82 83 84 85 86 87 112 113 114 115 116 117 118 119
node 2 size: 65536 MB
node 2 free: 15331 MB
node 3 cpus: 88 89 90 91 92 93 94 95 120 121 122 123 124 125 126 127
node 3 size: 65536 MB
node 3 free: 36774 MB
node 4 cpus: 1 2 3 4 5 6 7 8 33 34 35 36 37 38 39 40
node 4 size: 65536 MB
node 4 free: 62 MB
node 5 cpus: 9 10 11 12 13 14 15 16 41 42 43 44 45 46 47 48
node 5 size: 65536 MB
node 5 free: 9653 MB
node 6 cpus: 17 18 19 20 21 22 23 24 49 50 51 52 53 54 55 56
node 6 size: 65536 MB
node 6 free: 50209 MB
node 7 cpus: 25 26 27 28 29 30 31 32 57 58 59 60 61 62 63 64
node 7 size: 65536 MB
node 7 free: 43966 MB

CASE 1:
In 9.6.6, peak performance is achieved at 72 concurrent clients with a
TPS of 35554.573858; in 10.1 at 72 clients, TPS dips to 26882.828133,
roughly a 24% decrease.

CASE 2:
In 9.6.6, peak performance is achieved at 72 concurrent clients with a
TPS of 24861.074079; in 10.1 at 72 clients, TPS dips to 18372.565663,
about a 26% decrease.

I have attached "Postgresql_benchmarking_9.6vs10.ods", which gives more
detailed TPS numbers; each TPS figure is the median of 3 runs.

I have not run bisect yet to find what has caused the issue.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com

On Wed, Jan 24, 2018 at 12:06 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> Hi all,
>
> When I was trying to do read-write pgbench bench-marking of PostgreSQL
> 9.6.6 vs 10.1 I found PostgreSQL 10.1 regresses against 9.6.6 in some
> cases.
>
> Non Default settings and test
> ======================
> Server:
> ./postgres -c shared_buffers=8GB -N 200 -c min_wal_size=15GB -c
> max_wal_size=20GB -c checkpoint_timeout=900 -c
> maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 &
>
> Pgbench:
> CASE 1: when data fits shared buffers.
> ./pgbench -i -s 1000 postgres
>
> CASE 2: when data exceeds shared buffers.
> ./pgbench -i -s 1000 postgres
>

Both the cases look identical, but from the document attached, it
seems the case-1 is for scale factor 300.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


On Wed, Jan 24, 2018 at 7:36 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

> Both the cases look identical, but from the document attached, it
> seems the case-1 is for scale factor 300.

Oops, sorry, that was a typo. CASE 1 is scale factor 300, which will fit
in shared_buffers = 8GB.



-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


On Wed, Feb 21, 2018 at 10:03 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> Seeing futex in the call stack, Andres suggested that the following
> commit could be the reason for the regression:
>
> commit ecb0d20a9d2e09b7112d3b192047f711f9ff7e59
> Author: Tom Lane <tgl@sss.pgh.pa.us>
> Date:   2016-10-09 18:03:45 -0400
>
>     Use unnamed POSIX semaphores, if available, on Linux and FreeBSD.
>
> Commenting out the same in src/template/linux, I ran the benchmark
> tests again; performance improved from 26871.567326 to 34286.620251
> TPS (both medians of 3 runs).

Hmm.  So that commit might not have been the greatest idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Thu, Feb 22, 2018 at 7:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Feb 21, 2018 at 10:03 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
>> seeing futex in the call stack andres suggested that following commit could
>> be the reason for regression
>>
>> commit ecb0d20a9d2e09b7112d3b192047f711f9ff7e59
>> Author: Tom Lane <tgl@sss.pgh.pa.us>
>> Date:   2016-10-09 18:03:45 -0400
>>
>>     Use unnamed POSIX semaphores, if available, on Linux and FreeBSD.
>>
>> Commenting out same in src/template/linux I did run the benchmark tests
>> again
>> performance improved from 26871.567326 to 34286.620251 (both median of 3
>> TPS).
>
> Hmm.  So that commit might not have been the greatest idea.
>

It appears so.  I think we should do something about it as the
regression is quite noticeable.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


On 2018-Jul-19, Amit Kapila wrote:

> On Thu, Feb 22, 2018 at 7:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Wed, Feb 21, 2018 at 10:03 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> >> seeing futex in the call stack andres suggested that following commit could
> >> be the reason for regression
> >>
> >> commit ecb0d20a9d2e09b7112d3b192047f711f9ff7e59
> >> Author: Tom Lane <tgl@sss.pgh.pa.us>
> >> Date:   2016-10-09 18:03:45 -0400
> >>
> >>     Use unnamed POSIX semaphores, if available, on Linux and FreeBSD.

> > Hmm.  So that commit might not have been the greatest idea.
> 
> It appears so.  I think we should do something about it as the
> regression is quite noticeable.

So the fix is just to revert the change for the linux makefile?  Sounds
easy enough, code-wise.  Do we need more evidence that it's harmful?

Since it was changed in pg10 not 11, I don't think this is an open-item
per se.  (Maybe an "older bug", if we must really have it there.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Hi,

On 2018-01-24 00:06:44 +0530, Mithun Cy wrote:
> Server:
> ./postgres -c shared_buffers=8GB -N 200 -c min_wal_size=15GB -c
> max_wal_size=20GB -c checkpoint_timeout=900 -c
> maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 &

Which kernel & glibc version does this server have?

Greetings,

Andres Freund


Hi,

On 2018-07-19 15:39:44 -0400, Alvaro Herrera wrote:
> On 2018-Jul-19, Amit Kapila wrote:
> 
> > On Thu, Feb 22, 2018 at 7:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > > On Wed, Feb 21, 2018 at 10:03 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> > >> seeing futex in the call stack andres suggested that following commit could
> > >> be the reason for regression
> > >>
> > >> commit ecb0d20a9d2e09b7112d3b192047f711f9ff7e59
> > >> Author: Tom Lane <tgl@sss.pgh.pa.us>
> > >> Date:   2016-10-09 18:03:45 -0400
> > >>
> > >>     Use unnamed POSIX semaphores, if available, on Linux and FreeBSD.
> 
> > > Hmm.  So that commit might not have been the greatest idea.
> > 
> > It appears so.  I think we should do something about it as the
> > regression is quite noticeable.
> 
> So the fix is just to revert the change for the linux makefile?  Sounds
> easy enough, code-wise.  Do we need more evidence that it's harmful?
> 
> Since it was changed in pg10 not 11, I don't think this is an open-item
> per se.  (Maybe an "older bug", if we must really have it there.)

I'm a bit hesitant to just revert without further evaluation - it's just
about as likely we'll regress on other hardware / kernel
versions. Except it'd be in a minor release, whereas the current issue
was in a major release.  It'd also suddenly make some installations not
start, due to sysv semaphore # limitations.

There've been a few annoying, and a few embarrassing, issues with
futexes, but they receive far more attention from a performance POV.

Greetings,

Andres Freund


Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2018-Jul-19, Amit Kapila wrote:
>> It appears so.  I think we should do something about it as the
>> regression is quite noticeable.

It's not *that* noticeable, as I failed to demonstrate any performance
difference before committing the patch.  I think some more investigation
is warranted to find out why some other people are getting different
results.

> So the fix is just to revert the change for the linux makefile?  Sounds
> easy enough, code-wise.  Do we need more evidence that it's harmful?

Some fraction of 3d21f08bc would also need to be undone.  It's also worth
contemplating that we'd be re-introducing old problems with not-enough-
SysV-semaphores, so even if there is a performance benefit to be had,
it's very far from being free.

            regards, tom lane


Hi Andres,

On Fri, Jul 20, 2018 at 1:21 AM, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2018-01-24 00:06:44 +0530, Mithun Cy wrote:
> > Server:
> > ./postgres -c shared_buffers=8GB -N 200 -c min_wal_size=15GB -c
> > max_wal_size=20GB -c checkpoint_timeout=900 -c
> > maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 &
>
> Which kernel & glibc version does this server have?

[mithun.cy@cthulhu ~]$ cat /proc/version
Linux version 3.10.0-693.5.2.el7.x86_64 (builder@kbuilder.dev.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Fri Oct 20 20:32:50 UTC 2017

[mithun.cy@cthulhu ~]$ ldd --version
ldd (GNU libc) 2.17
 

--
Thanks and Regards
Mithun C Y

Andres Freund <andres@anarazel.de> writes:
> I'm a bit hesitant to just revert without further evaluation - it's just
> about as likely we'll regress on other hardware / kernel
> versions.

I looked into the archives for the discussion that led up to ecb0d20a9,
and found it here:

https://www.postgresql.org/message-id/flat/8536.1475704230%40sss.pgh.pa.us

The test cases I tried in that thread said that POSIX semas were *faster*
... by single-digit percentages, but still faster.  So I think we really
need to study this issue, rather than just take one contrary result as
being gospel.

            regards, tom lane


On Fri, Jul 20, 2018 at 7:56 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> On 2018-Jul-19, Amit Kapila wrote:
>>> It appears so.  I think we should do something about it as the
>>> regression is quite noticeable.
>
> It's not *that* noticeable, as I failed to demonstrate any performance
> difference before committing the patch.  I think some more investigation
> is warranted to find out why some other people are getting different
> results.

Maybe false sharing is a factor, since sizeof(sem_t) is 32 bytes on
Linux/amd64 and we're probably hitting elements clustered at one end
of the array?  Let's see... I tried sticking padding into
PGSemaphoreData and I got ~8% more TPS (72 clients on a multi-socket
box, pgbench scale 100, only running for a minute but otherwise the
same settings that Mithun showed).

--- a/src/backend/port/posix_sema.c
+++ b/src/backend/port/posix_sema.c
@@ -45,6 +45,7 @@
 typedef struct PGSemaphoreData
 {
        sem_t           pgsem;
+       char            padding[PG_CACHE_LINE_SIZE - sizeof(sem_t)];
 } PGSemaphoreData;

That's probably not the right idiom and my tests probably weren't long
enough, but there seems to be some effect here.

-- 
Thomas Munro
http://www.enterprisedb.com


On Fri, Jul 20, 2018 at 10:52 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
On Fri, Jul 20, 2018 at 7:56 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> It's not *that* noticeable, as I failed to demonstrate any performance
> difference before committing the patch.  I think some more investigation
> is warranted to find out why some other people are getting different
> results
> Maybe false sharing is a factor, since sizeof(sem_t) is 32 bytes on
> Linux/amd64 and we're probably hitting elements clustered at one end
> of the array?  Let's see... I tried sticking padding into
> PGSemaphoreData and I got ~8% more TPS (72 clients on a multi-socket
> box, pgbench scale 100, only running for a minute but otherwise the
> same settings that Mithun showed).
>
> --- a/src/backend/port/posix_sema.c
> +++ b/src/backend/port/posix_sema.c
> @@ -45,6 +45,7 @@
>  typedef struct PGSemaphoreData
>  {
>         sem_t           pgsem;
> +       char            padding[PG_CACHE_LINE_SIZE - sizeof(sem_t)];
>  } PGSemaphoreData;
>
> That's probably not the right idiom and my tests probably weren't long
> enough, but there seems to be some effect here.

I did a quick test applying the patch with the same settings as in my initial mail (on the latest PostgreSQL 10 code):
72 clients

CASE 1:
Without Patch : TPS 29269.823540

With Patch : TPS 36005.544960.    --- 23% jump

Just disabling unnamed POSIX semaphores: TPS 34481.207959

So it seems that is indeed the issue, given the test is run on an 8-node NUMA machine.
I also came across a presentation [1] (slide 20) which says that futex architecture is bad for NUMA machines. I am not sure whether the fix for that is included in Linux 3.10.0-693.5.2.el7.x86_64, which is what my test machine runs.




--
Thanks and Regards
Mithun C Y

Hi,

On 2018-07-21 00:53:28 +0530, Mithun Cy wrote:
> On Fri, Jul 20, 2018 at 10:52 AM, Thomas Munro <
> thomas.munro@enterprisedb.com> wrote:
> 
> > On Fri, Jul 20, 2018 at 7:56 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > >
> > > It's not *that* noticeable, as I failed to demonstrate any performance
> > > difference before committing the patch.  I think some more investigation
> > > is warranted to find out why some other people are getting different
> > > results
> > Maybe false sharing is a factor, since sizeof(sem_t) is 32 bytes on
> > Linux/amd64 and we're probably hitting elements clustered at one end
> > of the array?  Let's see... I tried sticking padding into
> > PGSemaphoreData and I got ~8% more TPS (72 client on multi socket
> > box, pgbench scale 100, only running for a minute but otherwise the
> > same settings that Mithun showed).
> >
> > --- a/src/backend/port/posix_sema.c
> > +++ b/src/backend/port/posix_sema.c
> > @@ -45,6 +45,7 @@
> >  typedef struct PGSemaphoreData
> >  {
> >         sem_t           pgsem;
> > +       char            padding[PG_CACHE_LINE_SIZE - sizeof(sem_t)];
> >  } PGSemaphoreData;
> >
> > That's probably not the right idiom and my tests probably weren't long
> > enough, but there seems to be some effect here.
> >
> 
> I did a quick test applying the patch with same settings as initial mail I
> have reported  (On postgresql 10 latest code)
> 72 clients
> 
> CASE 1:
> Without Patch : TPS 29269.823540
> 
> With Patch : TPS 36005.544960.    --- 23% jump
> 
> Just Disabling using unnamed POSIX semaphores: TPS 34481.207959

> So it seems that is the issue as the test is being run on 8 node numa
> machine.

Cool. I think we should just backpatch that then.  Does anybody want to
argue against?


> I also came across a presentation [1] : slide 20 which says one of those
> futex architecture is bad for NUMA machine. I am not sure the new fix for
> same is included as part of Linux version 3.10.0-693.5.2.el7.x86_64 which
> is on my test machine.

Similar issues are also present internally for sysv semas, so I don't
think this really means that much.

Greetings,

Andres Freund


Andres Freund <andres@anarazel.de> writes:
> On 2018-07-21 00:53:28 +0530, Mithun Cy wrote:
>> I did a quick test applying the patch with same settings as initial mail I
>> have reported  (On postgresql 10 latest code)
>> 72 clients
>> 
>> CASE 1:
>> Without Patch : TPS 29269.823540
>> 
>> With Patch : TPS 36005.544960.    --- 23% jump
>> 
>> Just Disabling using unnamed POSIX semaphores: TPS 34481.207959

>> So it seems that is the issue as the test is being run on 8 node numa
>> machine.

> Cool. I think we should just backpatch that then.  Does anybody want to
> argue against?

Not entirely clear to me what change is being proposed here?

In any case, I strongly resist making performance-based changes on
the basis of one test on one kernel and one hardware platform.
We should reproduce the results on a few different machines before
we even think of committing anything.  I'm happy to test on what
I have, although I'd be the first to agree that what I'm checking
is relatively low-end cases.  (Too bad hydra's gone.)

            regards, tom lane


On 2018-07-20 15:35:39 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2018-07-21 00:53:28 +0530, Mithun Cy wrote:
> >> I did a quick test applying the patch with same settings as initial mail I
> >> have reported  (On postgresql 10 latest code)
> >> 72 clients
> >> 
> >> CASE 1:
> >> Without Patch : TPS 29269.823540
> >> 
> >> With Patch : TPS 36005.544960.    --- 23% jump
> >> 
> >> Just Disabling using unnamed POSIX semaphores: TPS 34481.207959
> 
> >> So it seems that is the issue as the test is being run on 8 node numa
> >> machine.
> 
> > Cool. I think we should just backpatch that then.  Does anybody want to
> > argue against?
> 
> Not entirely clear to me what change is being proposed here?

Adding padding to struct PGSemaphoreData, so multiple semas don't share
a cacheline.


> In any case, I strongly resist making performance-based changes on
> the basis of one test on one kernel and one hardware platform.
> We should reproduce the results on a few different machines before
> we even think of committing anything.  I'm happy to test on what
> I have, although I'd be the first to agree that what I'm checking
> is relatively low-end cases.  (Too bad hydra's gone.)

Sure, it'd be good to do more of that. But from a theoretical POV it's
quite logical that posix semas sharing cachelines is bad for
performance, if there's any contention. When backed by futexes -
i.e. all non ancient linux machines - the hot path just does a cmpxchg
of the *userspace* data (I've copied the relevant code below).  Given
that we don't have a large number of semas these days, that there's
reasons to make the change even without measuring it, that we have
benchmark results, and that it's hard to see how it'd cause regressions,
I don't think going for a fix quickly is unreasonable.

You could argue it'd be better to make semaphores be embeddable in
bigger structures like PGPROC, rather than allocated in an array. While
I suspect you'd get a bit of a performance benefit from that, it seems
like a far bigger change we'd want to do in a minor release.

int
__new_sem_wait (sem_t *sem)
{
  /* We need to check whether we need to act upon a cancellation request here
     because POSIX specifies that cancellation points "shall occur" in
     sem_wait and sem_timedwait, which also means that they need to check
     this regardless whether they block or not (unlike "may occur"
     functions).  See the POSIX Rationale for this requirement: Section
     "Thread Cancellation Overview" [1] and austin group issue #1076 [2]
     for thoughs on why this may be a suboptimal design.

     [1] http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xsh_chap02.html
     [2] http://austingroupbugs.net/view.php?id=1076 for thoughts on why this
   */
  __pthread_testcancel ();

  if (__new_sem_wait_fast ((struct new_sem *) sem, 0) == 0)
    return 0;
  else
    return __new_sem_wait_slow((struct new_sem *) sem, NULL);
}

/* Fast path: Try to grab a token without blocking.  */
static int
__new_sem_wait_fast (struct new_sem *sem, int definitive_result)
{
  /* We need acquire MO if we actually grab a token, so that this
     synchronizes with all token providers (i.e., the RMW operation we read
     from or all those before it in modification order; also see sem_post).
     We do not need to guarantee any ordering if we observed that there is
     no token (POSIX leaves it unspecified whether functions that fail
     synchronize memory); thus, relaxed MO is sufficient for the initial load
     and the failure path of the CAS.  If the weak CAS fails and we need a
     definitive result, retry.  */
#if __HAVE_64B_ATOMICS
  uint64_t d = atomic_load_relaxed (&sem->data);
  do
    {
      if ((d & SEM_VALUE_MASK) == 0)
    break;
      if (atomic_compare_exchange_weak_acquire (&sem->data, &d, d - 1))
    return 0;
    }
  while (definitive_result);
  return -1;
#else
  unsigned int v = atomic_load_relaxed (&sem->value);
  do
    {
      if ((v >> SEM_VALUE_SHIFT) == 0)
    break;
      if (atomic_compare_exchange_weak_acquire (&sem->value,
      &v, v - (1 << SEM_VALUE_SHIFT)))
    return 0;
    }
  while (definitive_result);
  return -1;
#endif
}

/* See sem_wait for an explanation of the algorithm.  */
int
__new_sem_post (sem_t *sem)
{
  struct new_sem *isem = (struct new_sem *) sem;
  int private = isem->private;

#if __HAVE_64B_ATOMICS
  /* Add a token to the semaphore.  We use release MO to make sure that a
     thread acquiring this token synchronizes with us and other threads that
     added tokens before (the release sequence includes atomic RMW operations
     by other threads).  */
  /* TODO Use atomic_fetch_add to make it scale better than a CAS loop?  */
  uint64_t d = atomic_load_relaxed (&isem->data);
  do
    {
      if ((d & SEM_VALUE_MASK) == SEM_VALUE_MAX)
    {
      __set_errno (EOVERFLOW);
      return -1;
    }
    }
  while (!atomic_compare_exchange_weak_release (&isem->data, &d, d + 1));

  /* If there is any potentially blocked waiter, wake one of them.  */
  if ((d >> SEM_NWAITERS_SHIFT) > 0)
    futex_wake (((unsigned int *) &isem->data) + SEM_VALUE_OFFSET, 1, private);
#else
  /* Add a token to the semaphore.  Similar to 64b version.  */
  unsigned int v = atomic_load_relaxed (&isem->value);
  do
    {
      if ((v >> SEM_VALUE_SHIFT) == SEM_VALUE_MAX)
    {
      __set_errno (EOVERFLOW);
      return -1;
    }
    }
  while (!atomic_compare_exchange_weak_release
     (&isem->value, &v, v + (1 << SEM_VALUE_SHIFT)));

  /* If there is any potentially blocked waiter, wake one of them.  */
  if ((v & SEM_NWAITERS_MASK) != 0)
    futex_wake (&isem->value, 1, private);
#endif

  return 0;
}


Greetings,

Andres Freund


Andres Freund <andres@anarazel.de> writes:
> On 2018-07-20 15:35:39 -0400, Tom Lane wrote:
>> In any case, I strongly resist making performance-based changes on
>> the basis of one test on one kernel and one hardware platform.

> Sure, it'd be good to do more of that. But from a theoretical POV it's
> quite logical that posix semas sharing cachelines is bad for
> performance, if there's any contention. When backed by futexes -
> i.e. all non ancient linux machines - the hot path just does a cmpxchg
> of the *userspace* data (I've copied the relevant code below).

Here's the thing: the hot path is of little or no interest, because
if we are in the sema code at all, we are expecting to block.  The
only case where we wouldn't block is if the lock manager decided the
current process needs to sleep, but some other process already released
us by the time we reach the futex/kernel call.  Certainly that will happen
some of the time, but it's not likely to be the way to bet.  So I'm very
dubious of any arguments based on the speed of the "uncontended" path.

It's possible that the bigger picture here is that the kernel boys
optimized for the "uncontended" path to the point where they broke
performance of the blocking path.  It's hard to see how they could
have broke it to the point of being slower than the SysV sema API,
though.

Anyway, I think we need to test first and patch second.  I'm working
on getting some numbers on my own machines now.

On my RHEL6 machine, with unmodified HEAD and 8 sessions (since I've
only got 8 cores) but other parameters matching Mithun's example,
I just got

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 300
query mode: prepared
number of clients: 8
number of threads: 8
duration: 1800 s
number of transactions actually processed: 29001016
latency average = 0.497 ms
tps = 16111.575661 (including connections establishing)
tps = 16111.623329 (excluding connections establishing)

which is interesting because vmstat was pretty consistently reporting
around 500000 context swaps/second during the run, or circa 30
cs/transaction.  We'd have a minimum of 14 cs/transaction just between
client and server (due to seven SQL commands per transaction in TPC-B)
so that seems on the low side; not a lot of lock contention here it
seems.  I wonder what the corresponding ratio was in Mithun's runs.

            regards, tom lane


Hi,

On 2018-07-20 16:43:33 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2018-07-20 15:35:39 -0400, Tom Lane wrote:
> >> In any case, I strongly resist making performance-based changes on
> >> the basis of one test on one kernel and one hardware platform.
> 
> > Sure, it'd be good to do more of that. But from a theoretical POV it's
> > quite logical that posix semas sharing cachelines is bad for
> > performance, if there's any contention. When backed by futexes -
> > i.e. all non ancient linux machines - the hot path just does a cmpxchg
> > of the *userspace* data (I've copied the relevant code below).
> 
> Here's the thing: the hot path is of little or no interest, because
> if we are in the sema code at all, we are expecting to block.

Note that we're also using semas for ProcArrayGroupClearXid(), which is
pretty commonly hot for pgbench style workloads, and where the expected
wait times are very short.


> It's possible that the bigger picture here is that the kernel boys
> optimized for the "uncontended" path to the point where they broke
> performance of the blocking path.  It's hard to see how they could
> have broke it to the point of being slower than the SysV sema API,
> though.

I don't see how this is a likely proposition, given that adding padding
to the *userspace* portion of futexes increased the performance quite
significantly.


> On my RHEL6 machine, with unmodified HEAD and 8 sessions (since I've
> only got 8 cores) but other parameters matching Mithun's example,
> I just got

It's *really* common to have more actual clients than cpus for oltp
workloads, so I don't think it's insane to test with more clients.

Greetings,

Andres Freund


Andres Freund <andres@anarazel.de> writes:
> On 2018-07-20 16:43:33 -0400, Tom Lane wrote:
>> On my RHEL6 machine, with unmodified HEAD and 8 sessions (since I've
>> only got 8 cores) but other parameters matching Mithun's example,
>> I just got

> It's *really* common to have more actual clients than cpus for oltp
> workloads, so I don't think it's insane to test with more clients.

I finished a set of runs using similar parameters to Mithun's test except
for using 8 clients, and another set using 72 clients (but, being
impatient, 5-minute runtime) just to verify that the results wouldn't
be markedly different.  I got TPS numbers like this:

                        8 clients       72 clients

unmodified HEAD             16112           16284
with padding patch          16096           16283
with SysV semas             15926           16064
with padding+SysV           15949           16085

This is on RHEL6 (kernel 2.6.32-754.2.1.el6.x86_64), hardware is dual
4-core Intel E5-2609 (Sandy Bridge era).  This hardware does show NUMA
effects, although no doubt less strongly than Mithun's machine.

I would like to see some other results with a newer kernel.  I tried to
repeat this test on a laptop running Fedora 28, but soon concluded that
anything beyond very short runs was mainly going to tell me about thermal
throttling :-(.  I could possibly get repeatable numbers from, say,
1-minute SELECT-only runs, but that would be a different test scenario,
likely one with a lot less lock contention.

Anyway, for me, the padding change is a don't-care.  Given that both
Mithun and Thomas showed some positive effect from it, I'm not averse
to applying it.  I'm still -1 on going back to SysV semas.

            regards, tom lane


On Sun, Jul 22, 2018 at 8:19 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On 2018-07-20 16:43:33 -0400, Tom Lane wrote:
>>> On my RHEL6 machine, with unmodified HEAD and 8 sessions (since I've
>>> only got 8 cores) but other parameters matching Mithun's example,
>>> I just got
>
>> It's *really* common to have more actual clients than cpus for oltp
>> workloads, so I don't think it's insane to test with more clients.
>
> I finished a set of runs using similar parameters to Mithun's test except
> for using 8 clients, and another set using 72 clients (but, being
> impatient, 5-minute runtime) just to verify that the results wouldn't
> be markedly different.  I got TPS numbers like this:
>
>                                 8 clients       72 clients
>
> unmodified HEAD                 16112           16284
> with padding patch              16096           16283
> with SysV semas                 15926           16064
> with padding+SysV               15949           16085
>
> This is on RHEL6 (kernel 2.6.32-754.2.1.el6.x86_64), hardware is dual
> 4-core Intel E5-2609 (Sandy Bridge era).  This hardware does show NUMA
> effects, although no doubt less strongly than Mithun's machine.
>
> I would like to see some other results with a newer kernel.  I tried to
> repeat this test on a laptop running Fedora 28, but soon concluded that
> anything beyond very short runs was mainly going to tell me about thermal
> throttling :-(.  I could possibly get repeatable numbers from, say,
> 1-minute SELECT-only runs, but that would be a different test scenario,
> likely one with a lot less lock contention.

I did some testing on 2-node, 4-node and 8-node systems running Linux
3.10.something (slightly newer but still ancient).  Only the 8-node
box (= same one Mithun used) shows the large effect (the 2-node box
may be a tiny bit faster patched but I'm calling that noise for now...
it's not slower, anyway).

On the problematic box, I also tried some different strides (char
padding[N - sizeof(sem_t)]) and was surprised by the result:

Unpatched = ~35k TPS
64 byte stride = ~35k TPS
128 byte stride = ~42k TPS
4096 byte stride = ~47k TPS

Huh.  PG_CACHE_LINE_SIZE is 128, but the true cache line size on this
system is 64 bytes.  That exaggeration turned out to do something
useful, though I can't explain it.

While looking for discussion of 128 byte cache effects I came across
the Intel "L2 adjacent cache line prefetcher"[1].  Maybe this, or some
of the other prefetchers (enabled in the BIOS) or related stuff could
be at work here.  It could be microarchitecture-dependent (this is an
old Westmere box), though I found a fairly recent discussion about a
similar effect[2] that mentions more recent hardware.  The spatial
prefetcher reference can be found in the Optimization Manual[3].

[1] https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors
[2] https://groups.google.com/forum/#!msg/mechanical-sympathy/i3-M2uCYTJE/P7vyoOTIAgAJ
[3]
https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

-- 
Thomas Munro
http://www.enterprisedb.com


On Mon, Jul 23, 2018 at 3:40 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> I did some testing on 2-node, 4-node and 8-node systems running Linux
> 3.10.something (slightly newer but still ancient).  Only the 8-node
> box (= same one Mithun used) shows the large effect (the 2-node box
> may be a tiny bit faster patched but I'm calling that noise for now...
> it's not slower, anyway).

(I forgot to add that the 4 node system that showed no change is POWER
architecture.)

-- 
Thomas Munro
http://www.enterprisedb.com


On Mon, Jul 23, 2018 at 3:40 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Sun, Jul 22, 2018 at 8:19 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>                                 8 clients       72 clients
>>
>> unmodified HEAD                 16112           16284
>> with padding patch              16096           16283
>> with SysV semas                 15926           16064
>> with padding+SysV               15949           16085
>>
>> This is on RHEL6 (kernel 2.6.32-754.2.1.el6.x86_64), hardware is dual
>> 4-core Intel E5-2609 (Sandy Bridge era).  This hardware does show NUMA
>> effects, although no doubt less strongly than Mithun's machine.
>>
>> I would like to see some other results with a newer kernel.  I tried to
>> repeat this test on a laptop running Fedora 28, but soon concluded that
>> anything beyond very short runs was mainly going to tell me about thermal
>> throttling :-(.  I could possibly get repeatable numbers from, say,
>> 1-minute SELECT-only runs, but that would be a different test scenario,
>> likely one with a lot less lock contention.
>
> I did some testing on 2-node, 4-node and 8-node systems running Linux
> 3.10.something (slightly newer but still ancient).  Only the 8-node
> box (= same one Mithun used) shows the large effect (the 2-node box
> may be a tiny bit faster patched but I'm calling that noise for now...
> it's not slower, anyway).

Here's an attempt to use existing style better: a union, like
LWLockPadded and WALInsertLockPadded.  I think we should back-patch to
10.  Thoughts?

-- 
Thomas Munro
http://www.enterprisedb.com

On Mon, Jul 23, 2018 at 10:06 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Here's an attempt to use existing style better: a union, like
> LWLockPadded and WALInsertLockPadded.  I think we should back-patch to
> 10.  Thoughts?

Pushed to 10, 11, master.

It's interesting that I could see a further ~12% speedup by using VM
page-size stride on that 8 socket machine, but that's something to
look at another day.  The PG_CACHE_LINE_SIZE padding change gets us
back to approximately where we were in 9.6.

/me . o O ( Gee, it'd be really nice to see this change on a graph on
a web page that tracks a suite of tests on a farm of interesting
machines on each branch over time. )

-- 
Thomas Munro
http://www.enterprisedb.com