SLRU optimization - configurable buffer pool and partitioning the SLRU lock - Mailing list pgsql-hackers
From: Dilip Kumar
Subject: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
Msg-id: CAFiTN-vzDvNz=ExGXz6gdyjtzGixKSqs0mKHMmaQ8sOSEFZ33A@mail.gmail.com
List: pgsql-hackers
The small size of the SLRU buffer pools can sometimes become a performance problem, because it is not difficult to construct a workload where the number of buffers actively in use is larger than the fixed-size buffer pool. However, just increasing the size of the buffer pool doesn't necessarily help, because the linear search we use for buffer replacement doesn't scale, and because contention on the single centralized lock limits scalability.

A couple of patches have been proposed in the past to address the buffer pool size problem. One of them [1], by Thomas Munro, makes the size of the buffer pool configurable and, in order to deal with the linear search in a large buffer pool, divides the SLRU buffer pool into associative banks so that searching is not affected by the overall pool size. That does well for workloads which are mainly hurt by frequent buffer replacement, but it still does not help workloads where the centralized control lock is the bottleneck. So I have taken this patch as my base patch (v1-0001) and added 2 more improvements on top of it: 1) in v1-0002, instead of a centralized control lock for the SLRU, I have introduced a bank-wise control lock, and 2) in v1-0003, I have removed the global LRU counter and introduced a bank-wise counter. The second change (v1-0003) is there to avoid the CPU/OS cache-line invalidation caused by frequent updates of a single shared variable. Later, in my performance tests, I will show how much we gain from these 2 changes.
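To make the bank idea concrete, below is a minimal, self-contained sketch of how a page number could be mapped to a bank so that buffer lookup and victim selection only have to scan that bank's slots. This is not the actual patch code; the names and sizes (NUM_BANKS, BANK_SIZE, bank_of) are purely illustrative, and the exact mapping in the patch may differ.

#include <stdio.h>

#define BANK_SIZE 16   /* slots per bank (illustrative) */
#define NUM_BANKS 8    /* whole pool = NUM_BANKS * BANK_SIZE slots */

/* which bank owns a given SLRU page (illustrative mapping) */
static int
bank_of(long pageno)
{
    return (int) (pageno % NUM_BANKS);
}

int
main(void)
{
    long pageno = 12345;
    int  bankno = bank_of(pageno);
    int  first  = bankno * BANK_SIZE;

    /* lookup and victim search scan only these BANK_SIZE slots */
    printf("page %ld -> bank %d, slots [%d, %d)\n",
           pageno, bankno, first, first + BANK_SIZE);
    return 0;
}

With the bank-wise control lock of v1-0002, the same bank number would also select which lock to take, so backends touching pages that fall in different banks no longer contend on a single LWLock.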
Note: this is going to be a long email, but the main idea is summarised above. From here on I discuss the internals in order to show that the design idea is valid, and then present 2 performance tests: one that specifically targets contention on the centralized lock, and one where the contention is mainly due to frequent buffer replacement in the SLRU buffer pool. We are getting ~2x TPS compared to head with these patches; in the later sections I discuss the exact numbers and the analysis of why we see the gain.

There are some problems I faced while converting this centralized control lock into a bank-wise lock, mainly because the lock is (mis)used for different purposes. The main purpose of the control lock, as I understand it, is to protect in-memory access (read/write) of the buffers in the SLRU buffer pool. Here is a list of the problems and their analysis:

1) In some SLRUs we use this lock to protect members of the control structure specific to that SLRU layer, e.g. the SerialControlData members are protected by SerialSLRULock. I don't think that is the right use of this lock, so I have introduced another lock, SerialControlLock, for this specific purpose. Based on my analysis there is no reason to protect these members and the SLRU buffer access with the same lock.

2) The member 'latest_page_number' inside SlruSharedData is also protected by the SLRU lock. I would not say this use is wrong, but since it is a common variable and not a per-bank variable, it can no longer be protected by a bank-wise lock. Its only purpose is to track the latest page of an SLRU so that we do not evict the latest page during victim page selection, so I have converted it to an atomic variable; it is completely independent of the SLRU buffer access.

3) SlruScanDirectory() is called under the SLRU control lock from DeactivateCommitTs(), but from all other places it is called without the lock, because its callers run in contexts that are not executed concurrently (startup, checkpoint). DeactivateCommitTs() is also called only during startup, so there doesn't seem to be any point in calling it under the SLRU control lock. Currently I call it while holding all the bank locks, because this is not a performance path and that keeps the behaviour consistent with the current logic; but if others also think we do not need a lock here, we can remove it, and then we don't need the all-bank lock anywhere.

There are some other uses of this lock where one might think a bank-wise lock would be a problem, but it is not, and here is my analysis for each:

1) SimpleLruTruncate: we might worry that with a bank-wise lock we would need to release and acquire different locks as we scan different banks. As per my analysis this is not an issue, because a) the current code already releases and reacquires the centralized lock multiple times in order to perform I/O on a buffer slot, so the behaviour is not really changed, and, more importantly, b) all SLRU layers take care that this function is not accessed concurrently; I have verified all callers, and the function's header comment says the same. So this is not an issue.

2) Extending or adding a new page to an SLRU: I have noticed that this is also protected either by some other exclusive lock or is only done during startup. So, in short, the SLRU lock is only used to protect access to the buffers in the buffer pool; it is not what guarantees exclusive execution of these functions, because that is taken care of in some other way.

3) Another thing I noticed while writing this, and thought worth noting, is the CLOG group update of xid status. There, if we do not get the SLRU control lock, we add ourselves to a group and the group leader does the job for all members of the group. One might think that different pages in the group could belong to different SLRU banks, so the leader would need to acquire/release the lock as it processes each request in the group. That is true, and it is taken care of, but we do not need to worry about this case much: as per the implementation of the group update, we try to put members requesting the same page into one group, and only as an exception will there be members with different page requests. So with a bank-wise lock we handle that exceptional case, but acquiring/releasing the lock multiple times is not the regular case (see the sketch below). Design-wise we are fine, and performance-wise there should not be a problem, because most of the time we will be updating pages from the same bank; and if in some cases we have updates for old transactions due to long-running transactions, we should do better by not having a centralized lock.
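To illustrate the group-update point, here is a small self-contained sketch, in plain C with stand-in functions rather than the actual clog.c code, of the pattern the leader follows with bank-wise locks: it only switches locks when the next member's page falls in a different bank, so in the common case where all members want the same page there is exactly one acquire/release.

#include <stdio.h>

#define NUM_BANKS 8

/* illustrative page-to-bank mapping, as in the earlier sketch */
static int bank_of(long pageno) { return (int) (pageno % NUM_BANKS); }

/* stand-ins for LWLockAcquire/LWLockRelease on a per-bank lock */
static void acquire_bank_lock(int bankno) { printf("acquire bank %d\n", bankno); }
static void release_bank_lock(int bankno) { printf("release bank %d\n", bankno); }

static void
leader_group_update(const long *pages, int nmembers)
{
    int held = -1;    /* bank whose lock is currently held, -1 if none */

    for (int i = 0; i < nmembers; i++)
    {
        int bankno = bank_of(pages[i]);

        if (bankno != held)   /* a lock switch happens only across banks */
        {
            if (held >= 0)
                release_bank_lock(held);
            acquire_bank_lock(bankno);
            held = bankno;
        }
        /* ... update the xid status on pages[i] under its bank lock ... */
    }
    if (held >= 0)
        release_bank_lock(held);
}

int
main(void)
{
    /* common case: everyone wants the same page; one exception at the end */
    long pages[] = {42, 42, 42, 7};

    leader_group_update(pages, 4);
    return 0;
}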
Performance Test:

Exp1: Show the problem of CPU/OS cache-line invalidation due to frequent updates of the centralized lock and the common LRU counter.

Here we run, in parallel with the pgbench script, a transaction that frequently creates a subtransaction overflow, which forces the visibility-check mechanism to access the subtrans SLRU.

Test machine: 8 CPU / 64 core / 128 with HT / 512 GB RAM / SSD
scale factor: 300
shared_buffers=20GB
checkpoint_timeout=40min
max_wal_size=20GB
max_connections=200

Workload: run these 2 scripts in parallel:
./pgbench -c $ -j $ -T 600 -P5 -M prepared postgres
./pgbench -c 1 -j 1 -T 600 -f savepoint.sql postgres

savepoint.sql (creates subtransaction overflow):
BEGIN;
SAVEPOINT S1;
INSERT INTO test VALUES(1)  ← repeat 70 times →
SELECT pg_sleep(1);
COMMIT;

Code under test:
Head: PostgreSQL head code
SlruBank: only the first patch applied, converting the SLRU buffer pool into banks (0001)
SlruBank+BankwiseLockAndLru: 0001+0002+0003 applied

Results (TPS):
Clients    Head     SlruBank    SlruBank+BankwiseLockAndLru
1           457          491            475
8          3753         3819           3782
32        14594        14328          17028
64        15600        16243          25944
128       15957        16272          31731

So at 128 clients we get ~2x TPS (with SlruBank + the bank-wise lock and bank-wise LRU counter) compared to HEAD. One might wonder why we do not see much gain from the SlruBank patch alone. The reason is that this particular test does not put much load on buffer replacement; in fact, the wait events do not show contention on any lock. The main load comes from frequently modifying shared variables, namely the centralized control lock and the centralized LRU counter. That is evident in the perf data shown below:

+   74.72%  0.06%  postgres  postgres  [.] XidInMVCCSnapshot
+   74.08%  0.02%  postgres  postgres  [.] SubTransGetTopmostTransaction
+   74.04%  0.07%  postgres  postgres  [.] SubTransGetParent
+   57.66%  0.04%  postgres  postgres  [.] LWLockAcquire
+   57.64%  0.26%  postgres  postgres  [.] SimpleLruReadPage_ReadOnly
……
+   16.53%  0.07%  postgres  postgres  [.] LWLockRelease
+   16.36%  0.04%  postgres  postgres  [.] pg_atomic_sub_fetch_u32
+   16.31% 16.24%  postgres  postgres  [.] pg_atomic_fetch_sub_u32_impl

We can see that the main load is on the atomic variable inside LWLockAcquire and LWLockRelease. Once the bank-wise lock patch (v1-0002) is applied, the same problem shows up on the cur_lru_count update in the SlruRecentlyUsed macro [2] (not shown here, but it was visible in my perf report), and that is resolved by the bank-wise counter (v1-0003).

[2]
#define SlruRecentlyUsed(shared, slotno) \
    do { \
        ..
        (shared)->cur_lru_count = ++new_lru_count; \
        ..
        } \
    } while (0)
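As a side note, here is a minimal self-contained sketch of the bank-wise LRU counter idea in v1-0003. The names and the slot-to-bank mapping are illustrative, not the patch's slru.c code; the point is only that each bank bumps its own counter instead of one pool-wide cur_lru_count, so backends working on different banks stop dirtying the single hot cache line that shows up in the profile above.

#include <stdio.h>

#define NUM_BANKS 8
#define BANK_SIZE 16

static int bank_lru_count[NUM_BANKS];              /* one recency counter per bank */
static int slot_lru_count[NUM_BANKS * BANK_SIZE];  /* per-slot recency stamp */

static void
slru_recently_used(int slotno)
{
    int bankno = slotno / BANK_SIZE;   /* illustrative slot-to-bank mapping */

    /* only this bank's counter is updated, not a pool-wide one */
    slot_lru_count[slotno] = ++bank_lru_count[bankno];
}

int
main(void)
{
    slru_recently_used(3);     /* a slot in bank 0 */
    slru_recently_used(20);    /* a slot in bank 1 */

    printf("bank 0 count = %d, bank 1 count = %d\n",
           bank_lru_count[0], bank_lru_count[1]);
    return 0;
}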
Exp2: This test shows the load of frequent SLRU buffer replacement. Here we run a pgbench-like script that frequently generates multixact-ids, and in parallel we repeatedly start and commit a long-running transaction, so that the multixact-ids are not immediately cleaned up by vacuum and we create contention on the SLRU buffer pool. I am not leaving the long-running transaction open forever, as that would start to show a different problem (bloat) and we would lose the point of what I am trying to show here.

Note: the test configuration is the same as in Exp1, only the workload differs; we run the 2 scripts below, with the new configuration parameter (added in v1-0001) slru_buffers_size_scale=4, which means NUM_MULTIXACTOFFSET_BUFFERS is 64 (16 in head) and NUM_MULTIXACTMEMBER_BUFFERS is 128 (32 in head).

./pgbench -c $ -j $ -T 600 -P5 -M prepared -f multixact.sql postgres
./pgbench -c 1 -j 1 -T 600 -f longrunning.sql postgres

cat > multixact.sql <<EOF
\set aid random(1, 100000 * :scale)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
SELECT FROM pgbench_accounts WHERE aid = :aid FOR UPDATE;
SAVEPOINT S1;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;
EOF

cat > longrunning.sql << EOF
BEGIN;
INSERT INTO pgbench_test VALUES(1);
select pg_sleep(10);
COMMIT;
EOF

Results (TPS):
Clients    Head     SlruBank    SlruBank+BankwiseLock
1           528          513            531
8          3870         4239           4157
32        13945        14470          14556
64        10086        19034          24482
128        6909        15627          18161

Here we see a good improvement with the SlruBank patch alone, because it increases the SLRU buffer pool size and this workload suffers a lot of contention from buffer replacement. As shown below, there is heavy load on MultiXactOffsetSLRU as well as on MultiXactOffsetBuffer, which indicates frequent buffer eviction in this workload. Increasing the SLRU buffer pool size helps a lot, and dividing the SLRU lock into bank-wise locks gives a further gain. In total we see ~2.5x TPS at 64 and 128 clients compared to head.

3401  LWLock  | MultiXactOffsetSLRU
2031  LWLock  | MultiXactOffsetBuffer
 687          |
 427  LWLock  | BufferContent

Credits:
- The base patch v1-0001 is authored by Thomas Munro; I have just rebased it.
- 0002 and 0003 are new patches written by me, based on design ideas from Robert and myself.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com