
From Dilip Kumar
Subject SLRU optimization - configurable buffer pool and partitioning the SLRU lock
Date
Msg-id CAFiTN-vzDvNz=ExGXz6gdyjtzGixKSqs0mKHMmaQ8sOSEFZ33A@mail.gmail.com
List pgsql-hackers
The small size of the SLRU buffer pools can sometimes become a
performance problem because it’s not difficult to have a workload
where the number of buffers actively in use is larger than the
fixed-size buffer pool. However, just increasing the size of the
buffer pool doesn’t necessarily help, because the linear search that
we use for buffer replacement doesn’t scale, and also because
contention on the single centralized lock limits scalability.

There are a couple of patches proposed in the past to address the
problem of increasing the buffer pool size.  One of them [1] was
proposed by Thomas Munro; it makes the size of the buffer pool
configurable and, in order to deal with the linear search in a large
buffer pool, divides the SLRU buffer pool into associative banks so
that searching in the buffer pool is not affected by the large size of
the pool.  This works well for workloads that are mainly impacted by
frequent buffer replacement, but it still does not help workloads
where the centralized control lock is the bottleneck.

So I have taken that patch as my base patch (v1-0001) and added 2 more
improvements on top of it: 1) in v1-0002, instead of a centralized
control lock for the SLRU, I have introduced a bank-wise control lock,
and 2) in v1-0003, I have removed the global LRU counter and
introduced a bank-wise counter.  The second change (v1-0003) is meant
to avoid the CPU/OS cache invalidation caused by frequent updates of a
single shared variable; later, in my performance tests, I will show
how much gain we get from these 2 changes.
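
To make the idea concrete, here is a rough illustrative sketch of the
banking idea (hypothetical names and an assumed bank size, not the
actual patch code): the buffer pool is split into fixed-size
associative banks, and each bank gets its own control lock and its own
LRU counter, so both the victim search and the lock traffic stay local
to one bank.

#include "postgres.h"
#include "storage/lwlock.h"

#define SLRU_BANK_SIZE  16          /* slots per associative bank (assumed) */

typedef struct SlruBankSketch
{
    LWLock      lock;               /* bank-wise control lock, replaces the single SLRU lock */
    int         cur_lru_count;      /* bank-wise LRU clock, replaces the global counter */
} SlruBankSketch;

/* A page number maps to exactly one bank ... */
static inline int
SlruPageToBank(int64 pageno, int nbanks)
{
    return (int) (pageno % nbanks);
}

/* ... and the victim search scans only that bank's SLRU_BANK_SIZE slots,
 * so the search cost does not grow with the total size of the pool. */
static inline int
SlruBankStartSlot(int bankno)
{
    return bankno * SLRU_BANK_SIZE;
}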

Note: This is going to be a long email, but I have summarised the main
idea above this point.  Below I discuss more internal details in order
to show that the design is valid, and I present 2 performance tests:
one specific to contention on the centralized lock, and the other
mainly about contention due to frequent buffer replacement in the SLRU
buffer pool.  We are getting ~2x TPS compared to head with these
patches, and in later sections I discuss this in more detail, i.e. the
exact performance numbers and an analysis of why we see the gain.

I faced some problems while converting this centralized control lock
to a bank-wise lock, mainly because this lock is (mis)used for
different purposes.  The main purpose of this control lock, as I
understand it, is to protect the in-memory access (read/write) of the
buffers in the SLRU buffer pool.

Here is the list of some problems and their analysis:

1) In some of the SLRUs, we use this lock for protecting members
inside the control structure which are specific to that SLRU layer,
e.g. the SerialControlData members are protected by SerialSLRULock.  I
don't think that is the right use of this lock, so for this purpose I
have introduced another lock called SerialControlLock.  Based on my
analysis there is no reason to protect these members and the SLRU
buffer access with the same lock.
2) The member called 'latest_page_number' inside SlruSharedData is
also protected by the SLRULock.  I would not say this use case is
wrong, but since this is a common variable and not a per-bank variable
it can no longer be protected by the bank-wise lock.  The only use of
this variable is to track the latest page in an SLRU so that we do not
evict the latest page during victim page selection.  So I have
converted it to an atomic variable, as it is completely independent of
the SLRU buffer access (see the short sketch after this list).
3) SlruScanDirectory(): when it is called from DeactivateCommitTs() it
runs under the SLRU control lock, but from all other places
SlruScanDirectory() is called without the lock, because those callers
run in contexts that are not executed concurrently (i.e. startup,
checkpoint).  DeactivateCommitTs() is also called only during startup,
so there doesn't seem to be any need to call this under the SLRU
control lock.  Currently, I have called it under an all-bank lock,
because this is not a performance path and that way we keep it
consistent with the current logic; but if others also think that we do
not need a lock at this place then we can remove it, and then we don't
need this all-bank lock anywhere.
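
As a side note on point (2) above, here is a minimal sketch of the
latest_page_number change (the struct and function names are
hypothetical; only the pg_atomic_* primitives are the real ones):

#include "postgres.h"
#include "port/atomics.h"

typedef struct SlruSharedSketch
{
    /* was a plain variable protected by the SLRU control lock */
    pg_atomic_uint64    latest_page_number;
} SlruSharedSketch;

/* Writer side: remember the newest page when it is created.
 * (pg_atomic_init_u64() is assumed to have been done at SLRU init time.) */
static void
SlruSetLatestPage(SlruSharedSketch *shared, uint64 pageno)
{
    pg_atomic_write_u64(&shared->latest_page_number, pageno);
}

/* Reader side: victim selection simply skips the latest page. */
static bool
SlruIsLatestPage(SlruSharedSketch *shared, uint64 pageno)
{
    return pg_atomic_read_u64(&shared->latest_page_number) == pageno;
}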

There are some other uses of this lock where we might think it will be
a problem if we divide it into bank-wise locks, but it is not, and I
have given my analysis for each case below.

1) SimpleLruTruncate: We might worry that converting to a bank-wise
lock could be an issue here, because we would need to release and
acquire different locks as we scan different banks.  But as per my
analysis this is not an issue because a) with the current code we
already release and acquire the centralized lock multiple times in
order to perform the I/O on the buffer slot, so the behavior is not
changed, and, more importantly, b) all SLRU layers take care that this
function is not accessed concurrently.  I have verified all accesses
to this function and that is true, and the function header also says
the same.  So this is not an issue as per my analysis.

2) Extending or adding a new page to an SLRU:  I have noticed that
this is also protected either by some other exclusive lock or done
only during startup.  So, in short, the SLRULock is just used for
protecting access to the buffers in the buffer pool; it is not what
guarantees exclusive execution of these functions, because that is
taken care of in some other way.

3) Another thing that I noticed while writing this and thought it
would be good to make a note of: the CLOG group update of the xid
status.  There, if we do not get the control lock on the SLRU, we add
ourselves to a group and the group leader does the job for all the
members in the group.  One might think that different pages in the
group might belong to different SLRU banks, so the leader might need
to acquire/release the lock as it processes the requests in the group.
Yes, that is true and it is taken care of, but we don't need to worry
about this case: as per the implementation of the group update, we try
to put members with the same page request in one group, and only due
to some exception could there be members with different page requests.
So the point is that with a bank-wise lock we handle that exception
case, but it is not the regular case that we need to acquire/release
the lock multiple times (see the sketch after this list).  So
design-wise we are good, and performance-wise there should not be any
problem, because most of the time we will be updating pages from the
same bank, and if in some cases we have updates for old transactions
due to long-running transactions then we should do better by not
having a centralized lock.
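
For reference, here is a simplified sketch of the group-update loop
described in point (3) above (all the names are hypothetical; only
LWLockAcquire/LWLockRelease are the real primitives): the leader
switches locks only when the next member's page maps to a different
bank, so in the common case where all members are on the same page the
bank lock is taken exactly once.

#include "postgres.h"
#include "storage/lwlock.h"

typedef struct GroupMemberSketch
{
    int64       pageno;             /* page holding this member's xid status */
    /* ... xid, status, etc. ... */
} GroupMemberSketch;

extern LWLock slru_bank_locks[];    /* hypothetical array of bank-wise locks */
extern void UpdateXidStatusOnPage(GroupMemberSketch *member);   /* hypothetical */

static void
GroupUpdateXidStatusSketch(GroupMemberSketch *members, int nmembers, int nbanks)
{
    int         cur_bank = -1;

    for (int i = 0; i < nmembers; i++)
    {
        int     bank = (int) (members[i].pageno % nbanks);

        /* Re-acquire only when the bank changes; usually it never does. */
        if (bank != cur_bank)
        {
            if (cur_bank >= 0)
                LWLockRelease(&slru_bank_locks[cur_bank]);
            LWLockAcquire(&slru_bank_locks[bank], LW_EXCLUSIVE);
            cur_bank = bank;
        }
        UpdateXidStatusOnPage(&members[i]);
    }

    if (cur_bank >= 0)
        LWLockRelease(&slru_bank_locks[cur_bank]);
}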

Performance Test:
Exp1: Shows the problem caused by CPU/OS cache invalidation due to
frequent updates of the centralized lock and the common LRU counter.
Here we run, in parallel with the pgbench script, a transaction which
frequently creates subtransaction overflow, which forces the
visibility-check mechanism to access the subtrans SLRU.
Test machine: 8 CPU / 64 core / 128 with HT / 512 GB RAM / SSD
scale factor: 300
shared_buffers=20GB
checkpoint_timeout=40min
max_wal_size=20GB
max_connections=200

Workload: Run these 2 scripts in parallel:
./pgbench -c $ -j $ -T 600 -P5 -M prepared postgres
./pgbench -c 1 -j 1 -T 600 -f savepoint.sql postgres

savepoint.sql (creates subtransaction overflow)
BEGIN;
SAVEPOINT S1;
INSERT INTO test VALUES(1);
← repeat 70 times →
SELECT pg_sleep(1);
COMMIT;

Code under test:
Head: PostgreSQL head code
SlruBank: the first patch (0001) applied, which converts the SLRU
buffer pool into banks
SlruBank+BankwiseLockAndLru: 0001+0002+0003 applied

Results:
Clients    Head     SlruBank   SlruBank+BankwiseLockAndLru
1           457       491          475
8          3753      3819         3782
32        14594     14328        17028
64        15600     16243        25944
128       15957     16272        31731

So we can see that at 128 clients we get ~2x TPS (with SlruBank +
bank-wise lock and bank-wise LRU counter) compared to HEAD.  One might
wonder why we do not see much gain with the SlruBank patch alone.  The
reason is that in this particular test case there is not much load on
buffer replacement; in fact, the wait events do not show contention on
any lock either.  Instead, the main load comes from frequently
modifying common variables like the centralized control lock and the
centralized LRU counter.  That is evident in the perf data shown
below.

+  74.72%   0.06% postgres postgres  [.] XidInMVCCSnapshot
+  74.08%   0.02% postgres postgres  [.] SubTransGetTopmostTransaction
+  74.04%   0.07% postgres postgres  [.] SubTransGetParent
+  57.66%   0.04% postgres postgres  [.] LWLockAcquire
+  57.64%   0.26% postgres postgres  [.] SimpleLruReadPage_ReadOnly
……
+  16.53%   0.07% postgres postgres  [.] LWLockRelease
+  16.36%   0.04% postgres postgres  [.] pg_atomic_sub_fetch_u32
+  16.31%  16.24% postgres postgres  [.] pg_atomic_fetch_sub_u32_impl

We can notice that the main load is on the atomic variable within
LWLockAcquire and LWLockRelease.  Once we apply the bank-wise lock
patch (v1-0002), the same problem becomes visible on the cur_lru_count
update in the SlruRecentlyUsed[2] macro (not shown here, but it was
visible in my perf report).  And that is resolved by introducing a
bank-wise counter.

[2]
#define SlruRecentlyUsed(shared, slotno) \
do { \
..
(shared)->cur_lru_count = ++new_lru_count; \
..
} \
} while (0)
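
The bank-wise counter change (v1-0003) essentially replaces the shared
cur_lru_count above with one counter per bank.  An illustrative
variant (the bank_cur_lru_count field and SLRU_BANK_SIZE are
hypothetical, not the actual patch code) would look something like
this:

#define SlruRecentlyUsedBankwise(shared, slotno) \
do { \
    int bankno = (slotno) / SLRU_BANK_SIZE; \
    int new_lru_count = (shared)->bank_cur_lru_count[bankno]; \
    if (new_lru_count != (shared)->page_lru_count[slotno]) \
    { \
        (shared)->bank_cur_lru_count[bankno] = ++new_lru_count; \
        (shared)->page_lru_count[slotno] = new_lru_count; \
    } \
} while (0)

This keeps the hot counter updates confined to the bank that the slot
belongs to, instead of having every backend dirty the same cache line.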


Exp2: This test shows the load of frequent SLRU buffer replacement.
In this test, we run a pgbench-like script which frequently generates
multixact-ids, and in parallel we repeatedly start and commit a
long-running transaction so that the multixact-ids are not immediately
cleaned up by vacuum, which creates contention on the SLRU buffer
pool.  I am not leaving the long-running transaction running forever,
as that would start to show a different problem (bloat) and we would
lose the purpose of what I am trying to show here.

Note: the test configuration is the same as in Exp1, only the workload
is different; we run the 2 scripts below.  In addition, the new
configuration parameter slru_buffers_size_scale=4 (added in v1-0001)
is set, which means NUM_MULTIXACTOFFSET_BUFFERS will be 64 (vs 16 in
head) and NUM_MULTIXACTMEMBER_BUFFERS will be 128 (vs 32 in head).

./pgbench -c $ -j $ -T 600 -P5 -M prepared -f multixact.sql postgres
./pgbench -c 1 -j 1 -T 600 -f longrunning.sql postgres

cat > multixact.sql <<EOF
\set aid random(1, 100000 * :scale)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
SELECT FROM pgbench_accounts WHERE aid = :aid FOR UPDATE;
SAVEPOINT S1;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES
(:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;
EOF

cat > longrunning.sql << EOF
BEGIN;
INSERT INTO pgbench_test VALUES(1);
select pg_sleep(10);
COMMIT;
EOF

Results:
Clients    Head     SlruBank   SlruBank+BankwiseLock
1           528       513          531
8          3870      4239         4157
32        13945     14470        14556
64        10086     19034        24482
128        6909     15627        18161

Here we can see a good improvement with the SlruBank patch itself,
because of the increased SLRU buffer pool size: in this workload there
is a lot of contention due to buffer replacement.  As shown below, we
can see a lot of load on MultiXactOffsetSLRU as well as on
MultiXactOffsetBuffer, which shows that there are frequent buffer
evictions in this workload.  Increasing the SLRU buffer pool size
helps a lot, and further dividing the SLRU lock into bank-wise locks
gives an additional gain.  So in total we are seeing ~2.5x TPS at 64
and 128 clients compared to head.

   3401  LWLock          | MultiXactOffsetSLRU
   2031  LWLock          | MultiXactOffsetBuffer
    687                         |
    427  LWLock          | BufferContent

Credits:
- The base patch v1-0001 is authored by Thomas Munro and I have just rebased it.
- 0002 and 0003 are new patches written by me, based on design ideas
from Robert and myself.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

