Re: Patch: fix lock contention for HASHHDR.mutex - Mailing list pgsql-hackers

From Aleksander Alekseev
Subject Re: Patch: fix lock contention for HASHHDR.mutex
Date
Msg-id 20160212183445.716b2845@fujitsu
Whole thread Raw
In response to Re: Patch: fix lock contention for HASHHDR.mutex  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Patch: fix lock contention for HASHHDR.mutex
List pgsql-hackers
Hello, Robert

> It also strikes me that it's probably quite likely that slock_t
> mutex[NUM_FREELISTS] is a poor way to lay out this data in memory.
> For example, on a system where slock_t is just one byte, most likely
> all of those mutexes are going to be in the same cache line, which
> means you're going to get a LOT of false sharing.  It seems like it
> would be sensible to define a new struct that contains an slock_t, a
> long, and a HASHELEMENT *, and then make an array of those structs.
> That wouldn't completely eliminate false sharing, but it would reduce
> it quite a bit.  My guess is that if you did that, you could reduce
> the number of freelists to 8 or less and get pretty much the same
> performance benefit that you're getting right now with 32. And that,
> too, seems likely to be good for single-client performance.

I experimented for a while trying to fit every spinlock in a separate
cache line. Indeed we can gain some speedup this way. Here are
benchmark results on 12-core server for NUM_LOCK_PARTITIONS = 32 (in
this case difference is more notable):

| FREELISTS | SIZE =  32 | SIZE =  64 | SIZE = 128 | SIZE = 256 |
|-----------|------------|------------|------------|------------|
|     4     |   +25.4%   |   +28.7%   |   +28.4%   |   +28.3%   |
|     8     |   +28.2%   |   +29.4%   |   +31.7%   |   +31.4%   |
|    16     |   +29.9%   |   +32.6%   |   +31.6%   |   +30.8%   |
|    32     |   +33.7%   |   +34.2%   |   +33.6%   |   +32.6%   |

Here SIZE is short for FREELIST_BUFFER_SIZE (part of a hack I used to
align FREELIST structure, see attached patch). Cache line on this CPU
is 64 bytes according to /sys/devices/system/cpu/cpu0/cache/index0/*

I would like to remind that without these changes we had the following
picture (please note "NLP = 32" column):

pgbench -f pgbench.sql -T 150 -P 1 -c 40 -j 12

 DMN | NLP = 16 | NLP = 32 | NLP = 64 | NLP = 128
-----|----------|----------|----------|----------
   8 |  +15.1%  |  +28.2%  |  +34.1%  |  +33.7%
  16 |  +16.6%  |  +30.9%  |  +37.0%  |  +40.8%
  32 |  +15.1%  |  +33.9%  |  +39.5%  |  +41.9%
  64 |  +15.0%  |  +31.9%  |  +40.1%  |  +42.9%
 128 |   +7.7%  |  +24.7%  |  +29.6%  |  +31.6%

* NLP = NUM_LOCK_PARTITIONS
* DMN = DEFAULT_MUTEXES_NUM

So its +34.2% vs +33.9% and the number of freeLists remains the same.

> I am however wondering if it to set the freelist affinity based on
> something other than the hash value, like say the PID, so that the
> same process doesn't keep switching to a different freelist for every
> buffer eviction.

Also I tried to use PID to determine freeList number:

```
#include <sys/types.h>
#include <unistd.h>

...

#define FREELIST_IDX(hctl, hashcode) \
  (IS_PARTITIONED(hctl) ? ((uint32)getpid()) % NUM_FREELISTS : 0)

...

// now nentries could be negative in this case
// Assert(FREELIST(hctl, freelist_idx).nentries > 0);

```

Unfortunately this approach gives +33.9% TPS in best case. Also there
is a possibility that nentries will overflow. So far I don't have a
clear idea how to solve this issue effectively.

I'm not sure if further improvements worth an effort.
Attachment

pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: GinPageIs* don't actually return a boolean
Next
From: Robert Haas
Date:
Subject: Re: GinPageIs* don't actually return a boolean