Re: Patch: fix lock contention for HASHHDR.mutex - Mailing list pgsql-hackers
From | Aleksander Alekseev |
---|---|
Subject | Re: Patch: fix lock contention for HASHHDR.mutex |
Date | |
Msg-id | 20160212183445.716b2845@fujitsu |
In response to | Re: Patch: fix lock contention for HASHHDR.mutex (Robert Haas <robertmhaas@gmail.com>) |
Responses | Re: Patch: fix lock contention for HASHHDR.mutex |
List | pgsql-hackers |
Hello, Robert

> It also strikes me that it's probably quite likely that slock_t
> mutex[NUM_FREELISTS] is a poor way to lay out this data in memory.
> For example, on a system where slock_t is just one byte, most likely
> all of those mutexes are going to be in the same cache line, which
> means you're going to get a LOT of false sharing. It seems like it
> would be sensible to define a new struct that contains an slock_t, a
> long, and a HASHELEMENT *, and then make an array of those structs.
> That wouldn't completely eliminate false sharing, but it would reduce
> it quite a bit. My guess is that if you did that, you could reduce
> the number of freelists to 8 or less and get pretty much the same
> performance benefit that you're getting right now with 32. And that,
> too, seems likely to be good for single-client performance.

I experimented for a while with fitting every spinlock into a separate
cache line. Indeed, we can gain some speedup this way. Here are
benchmark results on a 12-core server for NUM_LOCK_PARTITIONS = 32
(the case where the difference is most noticeable):

| FREELISTS | SIZE = 32 | SIZE = 64 | SIZE = 128 | SIZE = 256 |
|-----------|-----------|-----------|------------|------------|
| 4         | +25.4%    | +28.7%    | +28.4%     | +28.3%     |
| 8         | +28.2%    | +29.4%    | +31.7%     | +31.4%     |
| 16        | +29.9%    | +32.6%    | +31.6%     | +30.8%     |
| 32        | +33.7%    | +34.2%    | +33.6%     | +32.6%     |

Here SIZE is short for FREELIST_BUFFER_SIZE, part of a hack I used to
align the FREELIST structure (see the attached patch; a sketch of the
idea follows at the end of this message). The cache line on this CPU
is 64 bytes, according to /sys/devices/system/cpu/cpu0/cache/index0/*.

As a reminder, without these changes we had the following picture
(please note the "NLP = 32" column):

pgbench -f pgbench.sql -T 150 -P 1 -c 40 -j 12

DMN  | NLP = 16 | NLP = 32 | NLP = 64 | NLP = 128
-----|----------|----------|----------|----------
  8  |  +15.1%  |  +28.2%  |  +34.1%  |  +33.7%
 16  |  +16.6%  |  +30.9%  |  +37.0%  |  +40.8%
 32  |  +15.1%  |  +33.9%  |  +39.5%  |  +41.9%
 64  |  +15.0%  |  +31.9%  |  +40.1%  |  +42.9%
128  |   +7.7%  |  +24.7%  |  +29.6%  |  +31.6%

* NLP = NUM_LOCK_PARTITIONS
* DMN = DEFAULT_MUTEXES_NUM

So it's +34.2% vs +33.9%, and the number of freelists stays the same.

> I am however wondering if it would be better to set the freelist
> affinity based on something other than the hash value, like say the
> PID, so that the same process doesn't keep switching to a different
> freelist for every buffer eviction.

Following this suggestion, I also tried using the PID to determine the
freelist number:

```
#include <sys/types.h>
#include <unistd.h>

...

#define FREELIST_IDX(hctl, hashcode) \
	(IS_PARTITIONED(hctl) ? ((uint32) getpid()) % NUM_FREELISTS : 0)

...

// nentries could be negative in this case, so this assert has to go:
// an entry allocated from one backend's freelist may be freed to
// another backend's freelist, so the per-freelist counters no longer
// balance.
// Assert(FREELIST(hctl, freelist_idx).nentries > 0);
```

Unfortunately, this approach gives +33.9% TPS in the best case. There
is also a possibility that nentries will overflow, and so far I don't
have a clear idea how to solve this effectively. I'm not sure whether
further improvements are worth the effort.
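To make the alignment hack concrete, here is a minimal sketch of the
idea. The struct fields follow Robert's suggestion (an slock_t, a
long, and a HASHELEMENT *); the union-based padding and the name
FREELIST_SLOT are illustrative assumptions, not the exact patch, and
slock_t / HASHELEMENT come from the PostgreSQL headers:

```
/*
 * Sketch only -- the attached patch is authoritative.
 * Requires storage/s_lock.h (slock_t) and utils/hsearch.h (HASHELEMENT).
 */
#define FREELIST_BUFFER_SIZE 64		/* cache line size on this CPU */

typedef struct
{
	slock_t		mutex;			/* spinlock protecting this freelist */
	long		nentries;		/* number of entries accounted here */
	HASHELEMENT *freeList;		/* chain of free elements */
} FREELIST;

/*
 * Pad each freelist out to a full cache line so that two spinlocks
 * never share a line and cause false sharing between backends.  Note
 * this only rounds the element size up; it does not by itself force
 * the array base to a 64-byte boundary.
 */
typedef union
{
	FREELIST	freelist;
	char		pad[FREELIST_BUFFER_SIZE];
} FREELIST_SLOT;

/* In HASHHDR, instead of "slock_t mutex[NUM_FREELISTS]": */
/* FREELIST_SLOT freelists[NUM_FREELISTS]; */
```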
Attachment