Re: Patch: fix lock contention for HASHHDR.mutex - Mailing list pgsql-hackers

From Aleksander Alekseev
Subject Re: Patch: fix lock contention for HASHHDR.mutex
Msg-id 20151222183953.771cb58b@fujitsu
In response to Re: Patch: fix lock contention for HASHHDR.mutex  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Patch: fix lock contention for HASHHDR.mutex
List pgsql-hackers
> > Actually, I'd like to improve all partitioned hashes instead of
> > improve only one case.
>
> Yeah.  I'm not sure that should be an LWLock rather than a spinlock,
> but we can benchmark it both ways.

I would like to share some preliminary results. I tested four
implementations:

- no locks and no element stealing from other partitions;
- single LWLock per partitioned table;
- single spinlock per partitioned table;
- NUM_LOCK_PARTITIONS spinlocks per partitioned table;
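To make the last variant concrete, here is a minimal sketch of an array of
spinlock-protected freelists, written with C11 atomics rather than
PostgreSQL's own s_lock primitives. All names (FreeListArray, freelist_push,
freelist_pop, NUM_FREELISTS) are hypothetical and are not taken from the
attached patches:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NUM_FREELISTS 16    /* hypothetical; stands in for NUM_LOCK_PARTITIONS */

typedef struct Elem { struct Elem *next; } Elem;

typedef struct {
    atomic_flag mutex;      /* one spinlock per freelist */
    Elem       *head;
} FreeList;

typedef struct { FreeList lists[NUM_FREELISTS]; } FreeListArray;

static void freelists_init(FreeListArray *a)
{
    for (int i = 0; i < NUM_FREELISTS; i++) {
        atomic_flag_clear(&a->lists[i].mutex);
        a->lists[i].head = NULL;
    }
}

static void spin_acquire(atomic_flag *f)
{
    /* Busy-wait until the flag was previously clear. */
    while (atomic_flag_test_and_set_explicit(f, memory_order_acquire))
        ;
}

static void spin_release(atomic_flag *f)
{
    atomic_flag_clear_explicit(f, memory_order_release);
}

/* Return an element to the freelist selected by the hash code. */
static void freelist_push(FreeListArray *a, unsigned hashcode, Elem *e)
{
    FreeList *fl = &a->lists[hashcode % NUM_FREELISTS];

    assert(e != NULL);
    spin_acquire(&fl->mutex);
    e->next = fl->head;
    fl->head = e;
    spin_release(&fl->mutex);
}

/* Take an element from the freelist selected by the hash code, or NULL. */
static Elem *freelist_pop(FreeListArray *a, unsigned hashcode)
{
    FreeList *fl = &a->lists[hashcode % NUM_FREELISTS];
    Elem     *e;

    spin_acquire(&fl->mutex);
    e = fl->head;
    if (e != NULL)
        fl->head = e->next;
    spin_release(&fl->mutex);
    return e;
}
```

The "single spinlock" variant corresponds to NUM_FREELISTS == 1 in this
sketch, so every backend contends on the same flag.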

Interestingly, the "Shared Buffer Lookup Table" (see buf_table.c) has
128 partitions. The constant NUM_BUFFER_PARTITIONS was increased from 16
to 128 in commit 3acc10c9:


http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=3acc10c997f916f6a741d0b4876126b7b08e3892;hp=952872698d9443fdf9b808a1376017f00c91065a

Obviously, once a freelist is split into NUM_LOCK_PARTITIONS
partitions (and assuming that all necessary locking/unlocking is done
on the calling side), a table can't have more than NUM_LOCK_PARTITIONS
partitions, because that would cause race conditions. For this reason I
had to define NUM_BUFFER_PARTITIONS as NUM_LOCK_PARTITIONS and compare
the behaviour of PostgreSQL for different values of
NUM_LOCK_PARTITIONS.
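The race can be illustrated with a small sketch. It assumes that both the
partition number and the freelist number are derived from the hash code by a
simple modulo; that mapping, and the names partition_of, freelist_of and
NUM_FREELISTS, are assumptions made for illustration only:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_FREELISTS 16u   /* hypothetical number of freelists */

/* Partition whose lock the caller takes before touching the table. */
static uint32_t partition_of(uint32_t hashcode, uint32_t num_partitions)
{
    assert(num_partitions > 0);
    return hashcode % num_partitions;
}

/* Freelist that is used for this hash code. */
static uint32_t freelist_of(uint32_t hashcode)
{
    return hashcode % NUM_FREELISTS;
}

/*
 * True if two hash codes are guarded by different partition locks but
 * still share a freelist, i.e. two backends could modify the same
 * freelist concurrently without excluding each other.
 */
static bool unprotected_sharing(uint32_t h1, uint32_t h2,
                                uint32_t num_partitions)
{
    return partition_of(h1, num_partitions) != partition_of(h2, num_partitions)
        && freelist_of(h1) == freelist_of(h2);
}
```

With 128 table partitions and 16 freelists, hash codes 0 and 16 fall into
different partitions but the same freelist, so unprotected_sharing(0, 16, 128)
is true. With 16 partitions the two mappings coincide and the function is
false for every pair.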

Here are the results:

Core i7, pgbench -j 8 -c 8 -T 30 pgbench
(3 tests, TPS excluding connections establishing)

NUM_LOCK_  |  master  | no locks |  lwlock  | spinlock | spinlock
PARTITIONS | (99ccb2) |          |          |          |  array
-----------|----------|----------|----------|----------|----------
           |  295.4   |  297.4   |  299.4   |  285.6   |  302.7
(1 << 4)   |  286.1   |  300.5   |  283.4   |  300.9   |  300.4
           |  300.0   |  300.0   |  302.1   |  300.7   |  300.3
-----------|----------|----------|----------|----------|----------
           |          |  296.7   |  299.9   |  298.8   |  298.3
(1 << 5)   |   ----   |  301.9   |  302.2   |  305.7   |  306.3
           |          |  287.7   |  301.0   |  303.0   |  304.5
-----------|----------|----------|----------|----------|----------
           |          |  296.4   |  300.5   |  302.9   |  304.6
(1 << 6)   |   ----   |  301.7   |  305.6   |  306.4   |  302.3
           |          |  299.6   |  304.5   |  306.6   |  300.4
-----------|----------|----------|----------|----------|----------
           |          |  295.9   |  298.7   |  295.3   |  305.0
(1 << 7)   |   ----   |  299.5   |  300.5   |  299.0   |  310.2
           |          |  287.8   |  285.9   |  300.2   |  302.2

Core i7, pgbench -j 8 -c 8 -f big_table.sql -T 30 my_database
(3 tests, TPS excluding connections establishing)

NUM_LOCK_  |  master  | no locks |  lwlock  | spinlock | spinlock
PARTITIONS | (99ccb2) |          |          |          |  array
-----------|----------|----------|----------|----------|----------
           |  505.1   |  521.3   |  511.1   |  524.4   |  501.6
(1 << 4)   |  452.4   |  467.4   |  509.2   |  472.3   |  453.7
           |  435.2   |  462.4   |  445.8   |  467.9   |  467.0
-----------|----------|----------|----------|----------|----------
           |          |  514.8   |  476.3   |  507.9   |  510.6
(1 << 5)   |   ----   |  457.5   |  491.2   |  464.6   |  431.7
           |          |  442.2   |  457.0   |  495.5   |  448.2
-----------|----------|----------|----------|----------|----------
           |          |  516.4   |  502.5   |  468.0   |  521.3
(1 << 6)   |   ----   |  463.6   |  438.7   |  488.8   |  455.4
           |          |  434.2   |  468.1   |  484.7   |  433.5
-----------|----------|----------|----------|----------|----------
           |          |  513.6   |  459.4   |  519.6   |  510.3
(1 << 7)   |   ----   |  470.1   |  454.6   |  445.5   |  415.9
           |          |  459.4   |  489.7   |  457.1   |  452.8

60-core server, pgbench -j 64 -c 64 -T 30 pgbench
(3 tests, TPS excluding connections establishing)

NUM_LOCK_  |  master  | no locks |  lwlock  | spinlock | spinlock
PARTITIONS | (99ccb2) |          |          |          |  array
-----------|----------|----------|----------|----------|----------
           |  3156.2  |  3157.9  |  3542.0  |  3444.3  |  3472.4
(1 << 4)   |  3268.5  |  3444.7  |  3485.7  |  3486.0  |  3500.5
           |  3251.2  |  3482.3  |  3398.7  |  3587.1  |  3557.7
-----------|----------|----------|----------|----------|----------
           |          |  3352.7  |  3556.0  |  3543.3  |  3526.8
(1 << 5)   |   ----   |  3465.0  |  3475.2  |  3486.9  |  3528.4
           |          |  3410.0  |  3482.0  |  3493.7  |  3444.9
-----------|----------|----------|----------|----------|----------
           |          |  3437.8  |  3413.1  |  3445.8  |  3481.6
(1 << 6)   |   ----   |  3470.1  |  3478.4  |  3538.5  |  3579.9
           |          |  3450.8  |  3431.1  |  3509.0  |  3512.5
-----------|----------|----------|----------|----------|----------
           |          |  3425.4  |  3534.6  |  3414.7  |  3517.1
(1 << 7)   |   ----   |  3436.5  |  3430.0  |  3428.0  |  3536.4
           |          |  3455.6  |  3479.7  |  3573.4  |  3543.0

60-core server, pgbench -j 64 -c 64 -f big_table.sql -T 30 my_database
(3 tests, TPS excluding connections establishing)

NUM_LOCK_  |  master  | no locks |  lwlock  | spinlock | spinlock
PARTITIONS | (99ccb2) |          |          |          |  array
-----------|----------|----------|----------|----------|----------
           |  661.1   |  4639.6  |  1435.2  |  445.9   |  1589.6
(1 << 4)   |  642.9   |  4566.7  |  1410.3  |  457.1   |  1601.7
           |  643.9   |  4621.8  |  1404.8  |  489.0   |  1592.6
-----------|----------|----------|----------|----------|----------
           |          |  4721.9  |  1543.1  |  499.1   |  1596.9
(1 << 5)   |   ----   |  4506.8  |  1513.0  |  528.3   |  1594.7
           |          |  4744.7  |  1540.3  |  524.0   |  1593.0
-----------|----------|----------|----------|----------|----------
           |          |  4649.1  |  1564.5  |  475.9   |  1580.1
(1 << 6)   |   ----   |  4671.0  |  1560.5  |  485.6   |  1589.1
           |          |  4751.0  |  1557.4  |  505.1   |  1580.3
-----------|----------|----------|----------|----------|----------
           |          |  4657.7  |  1551.8  |  534.7   |  1585.1
(1 << 7)   |   ----   |  4616.8  |  1546.8  |  495.8   |  1623.4
           |          |  4779.2  |  1538.5  |  537.4   |  1588.5

All four implementations (W.I.P. quality --- dirty code, no comments,
etc) are attached to this message. The schema of my_database and the
big_table.sql file are attached to the first message of this thread.

The large spread of TPS on the Core i7 is due to the fact that it's
actually my laptop, with other applications running beside PostgreSQL.
Still, we see that all solutions are equally good on this CPU and there
is no performance degradation.

Now regarding the 60-core server:

- One spinlock per hash table doesn't scale. I personally was expecting
  this;
- LWLocks and the array of spinlocks do scale on NUMA up to a certain
  point;
- The best results are shown by "no locks".
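As a quick check of the gap on the 60-core server, the three big_table.sql
runs for NUM_LOCK_PARTITIONS = (1 << 4) can be averaged and compared. The
numbers below are copied from the table above; the helper names are
hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* TPS, 60-core server, big_table.sql, NUM_LOCK_PARTITIONS = (1 << 4). */
static const double tps_master[]     = { 661.1,  642.9,  643.9 };
static const double tps_no_locks[]   = { 4639.6, 4566.7, 4621.8 };
static const double tps_spin_array[] = { 1589.6, 1601.7, 1592.6 };

/* Arithmetic mean of n samples. */
static double mean(const double *v, size_t n)
{
    double sum = 0.0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum / (double) n;
}

/* Ratio of the mean TPS of implementation a to that of implementation b. */
static double speedup(const double *a, const double *b, size_t n)
{
    return mean(a, n) / mean(b, n);
}
```

For these rows, speedup(tps_no_locks, tps_spin_array, 3) comes out at about
2.9 and speedup(tps_no_locks, tps_master, 3) at about 7.1.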

I believe that the "no locks" implementation is the best one, since it
is roughly 3 times faster on NUMA than any other implementation. It is
also simpler and doesn't have stealing-from-other-freelists logic, which
executes rarely and is therefore a likely source of bugs. Regarding the
~16 freelist elements that in some corner cases could be used but won't
be --- as I mentioned before, I believe it's not such a big problem.
It's also a small price to pay for 3 times more TPS.
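The trade-off can be sketched as follows. This is a single-threaded
illustration only, not the code from the attached patches; the names
(Table, get_free_no_steal, get_free_steal, NUM_PARTS) are hypothetical, and
a real stealing variant would also have to lock the other freelists:

```c
#include <assert.h>
#include <stddef.h>

#define NUM_PARTS 4     /* hypothetical; kept small for illustration */

typedef struct Element { struct Element *next; } Element;

typedef struct {
    Element *freelist[NUM_PARTS];   /* one freelist per partition */
} Table;

/* Return an element to the freelist of its partition. */
static void put_free(Table *t, unsigned part, Element *e)
{
    assert(part < NUM_PARTS && e != NULL);
    e->next = t->freelist[part];
    t->freelist[part] = e;
}

/*
 * "No locks" variant: the caller already holds the lock of partition
 * `part`, so no extra locking is needed. Returns NULL when this
 * partition's freelist is empty, even if other partitions still have
 * free elements.
 */
static Element *get_free_no_steal(Table *t, unsigned part)
{
    Element *e;

    assert(part < NUM_PARTS);
    e = t->freelist[part];
    if (e != NULL)
        t->freelist[part] = e->next;
    return e;
}

/* Stealing variant: fall back to the other freelists when ours is empty. */
static Element *get_free_steal(Table *t, unsigned part)
{
    for (unsigned i = 0; i < NUM_PARTS; i++) {
        Element *e = get_free_no_steal(t, (part + i) % NUM_PARTS);

        if (e != NULL)
            return e;
    }
    return NULL;
}
```

The elements that get_free_no_steal leaves behind in other partitions are
the ones that "could but wouldn't be used"; get_free_steal reaches them at
the cost of the extra, rarely executed code path.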

Regarding NUM_LOCK_PARTITIONS (and NUM_BUFFER_PARTITIONS) I have some
doubts. Surely Robert had a good reason for committing 3acc10c9.
Unfortunately I'm not familiar with the story behind this commit. What
do you think?