Re: Possible performance regression in version 10.1 with pgbench read-write tests. - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Possible performance regression in version 10.1 with pgbench read-write tests.
Date 2018-07-20
Msg-id 20180720195042.qf3ne7z3qyame5im@alap3.anarazel.de
In response to Re: Possible performance regression in version 10.1 with pgbench read-write tests.  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Possible performance regression in version 10.1 with pgbench read-write tests.  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On 2018-07-20 15:35:39 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2018-07-21 00:53:28 +0530, Mithun Cy wrote:
> >> I did a quick test applying the patch with same settings as initial mail I
> >> have reported  (On postgresql 10 latest code)
> >> 72 clients
> >> 
> >> CASE 1:
> >> Without Patch : TPS 29269.823540
> >> 
> >> With Patch : TPS 36005.544960.    --- 23% jump
> >> 
> >> Just Disabling using unnamed POSIX semaphores: TPS 34481.207959
> 
> >> So it seems that is the issue as the test is being run on 8 node numa
> >> machine.
> 
> > Cool. I think we should just backpatch that then.  Does anybody want to
> > argue against?
> 
> Not entirely clear to me what change is being proposed here?

Adding padding to struct PGSemaphoreData, so multiple semas don't share
a cacheline.
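
To be concrete - this is just an illustrative sketch, not the actual
patch - the idea is to pad each entry out to a full cache line, roughly
like this (the 64-byte size and the names are assumptions for the
example; the real change would use PostgreSQL's own cache-line-size
constant):

#include <semaphore.h>

/* Hypothetical sketch: pad each unnamed POSIX semaphore so that two
   adjacent array entries never share a cache line.  64 bytes is an
   assumed cache-line size, for illustration only. */
typedef union PaddedSem
{
  sem_t pgsem;                  /* the semaphore itself */
  char  pad[64];                /* assumed cache-line size */
} PaddedSem;

/* With entries padded like this, a cmpxchg on one backend's semaphore
   no longer dirties the cache line holding its neighbors' semaphores. */
static PaddedSem *sharedSemas;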


> In any case, I strongly resist making performance-based changes on
> the basis of one test on one kernel and one hardware platform.
> We should reproduce the results on a few different machines before
> we even think of committing anything.  I'm happy to test on what
> I have, although I'd be the first to agree that what I'm checking
> is relatively low-end cases.  (Too bad hydra's gone.)

Sure, it'd be good to do more of that. But from a theoretical POV it's
quite logical that posix semas sharing cachelines is bad for
performance, if there's any contention. When backed by futexes -
i.e. all non-ancient Linux machines - the hot path just does a cmpxchg
of the *userspace* data (I've copied the relevant code below).  Given
that we don't have a large number of semas these days, that there are
reasons to make the change even without measuring it, that we have
benchmark results, and that it's hard to see how it'd cause regressions,
I don't think going for a fix quickly is unreasonable.

You could argue it'd be better to make semaphores embeddable in
bigger structures like PGPROC, rather than allocated in an array. While
I suspect you'd get a bit of a performance benefit from that, it seems
like a far bigger change than we'd want to make in a minor release.
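
To make the comparison concrete, that alternative would look roughly
like the sketch below - purely illustrative, with made-up field names,
not a proposal for the actual PGPROC layout:

#include <semaphore.h>

/* Illustrative only: embed the sem_t in the (much larger) per-backend
   struct instead of handing out entries from a dense array.  Since the
   real per-backend struct spans many cache lines, different backends'
   semaphores naturally land on different lines without extra padding. */
struct BackendEntry             /* hypothetical stand-in for PGPROC */
{
  int   pid;                    /* ... lots of other per-backend state ... */
  sem_t sem;                    /* semaphore embedded in the struct */
};

Anyway, here's the glibc fast path I was referring to: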

int
__new_sem_wait (sem_t *sem)
{
  /* We need to check whether we need to act upon a cancellation request here
     because POSIX specifies that cancellation points "shall occur" in
     sem_wait and sem_timedwait, which also means that they need to check
     this regardless whether they block or not (unlike "may occur"
     functions).  See the POSIX Rationale for this requirement: Section
     "Thread Cancellation Overview" [1] and austin group issue #1076 [2]
     for thoughts on why this may be a suboptimal design.

     [1] http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xsh_chap02.html
     [2] http://austingroupbugs.net/view.php?id=1076  */
  __pthread_testcancel ();

  if (__new_sem_wait_fast ((struct new_sem *) sem, 0) == 0)
    return 0;
  else
    return __new_sem_wait_slow((struct new_sem *) sem, NULL);
}

/* Fast path: Try to grab a token without blocking.  */
static int
__new_sem_wait_fast (struct new_sem *sem, int definitive_result)
{
  /* We need acquire MO if we actually grab a token, so that this
     synchronizes with all token providers (i.e., the RMW operation we read
     from or all those before it in modification order; also see sem_post).
     We do not need to guarantee any ordering if we observed that there is
     no token (POSIX leaves it unspecified whether functions that fail
     synchronize memory); thus, relaxed MO is sufficient for the initial load
     and the failure path of the CAS.  If the weak CAS fails and we need a
     definitive result, retry.  */
#if __HAVE_64B_ATOMICS
  uint64_t d = atomic_load_relaxed (&sem->data);
  do
    {
      if ((d & SEM_VALUE_MASK) == 0)
        break;
      if (atomic_compare_exchange_weak_acquire (&sem->data, &d, d - 1))
        return 0;
    }
  while (definitive_result);
  return -1;
#else
  unsigned int v = atomic_load_relaxed (&sem->value);
  do
    {
      if ((v >> SEM_VALUE_SHIFT) == 0)
        break;
      if (atomic_compare_exchange_weak_acquire (&sem->value,
                                                &v, v - (1 << SEM_VALUE_SHIFT)))
        return 0;
    }
  while (definitive_result);
  return -1;
#endif
}

/* See sem_wait for an explanation of the algorithm.  */
int
__new_sem_post (sem_t *sem)
{
  struct new_sem *isem = (struct new_sem *) sem;
  int private = isem->private;

#if __HAVE_64B_ATOMICS
  /* Add a token to the semaphore.  We use release MO to make sure that a
     thread acquiring this token synchronizes with us and other threads that
     added tokens before (the release sequence includes atomic RMW operations
     by other threads).  */
  /* TODO Use atomic_fetch_add to make it scale better than a CAS loop?  */
  uint64_t d = atomic_load_relaxed (&isem->data);
  do
    {
      if ((d & SEM_VALUE_MASK) == SEM_VALUE_MAX)
        {
          __set_errno (EOVERFLOW);
          return -1;
        }
    }
  while (!atomic_compare_exchange_weak_release (&isem->data, &d, d + 1));

  /* If there is any potentially blocked waiter, wake one of them.  */
  if ((d >> SEM_NWAITERS_SHIFT) > 0)
    futex_wake (((unsigned int *) &isem->data) + SEM_VALUE_OFFSET, 1, private);
#else
  /* Add a token to the semaphore.  Similar to 64b version.  */
  unsigned int v = atomic_load_relaxed (&isem->value);
  do
    {
      if ((v >> SEM_VALUE_SHIFT) == SEM_VALUE_MAX)
        {
          __set_errno (EOVERFLOW);
          return -1;
        }
    }
  while (!atomic_compare_exchange_weak_release
         (&isem->value, &v, v + (1 << SEM_VALUE_SHIFT)));

  /* If there is any potentially blocked waiter, wake one of them.  */
  if ((v & SEM_NWAITERS_MASK) != 0)
    futex_wake (&isem->value, 1, private);
#endif

  return 0;
}


Greetings,

Andres Freund

