Thread: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
The small size of the SLRU buffer pools can sometimes become a performance problem because it’s not difficult to have a workload where the number of buffers actively in use is larger than the fixed-size buffer pool. However, just increasing the size of the buffer pool doesn’t necessarily help, because the linear search that we use for buffer replacement doesn’t scale, and also because contention on the single centralized lock limits scalability.

A couple of patches have been proposed in the past to address the problem of increasing the buffer pool size. One of them [1], proposed by Thomas Munro, makes the size of the buffer pool configurable. And, in order to deal with the linear search in the larger buffer pool, it divides the SLRU buffer pool into associative banks so that searching in the buffer pool doesn’t get slower as the pool grows. This works well for workloads that are mainly impacted by frequent buffer replacement, but it still doesn’t stand up well for workloads where the centralized control lock is the bottleneck.

So I have taken this patch as my base patch (v1-0001) and added two more improvements on top of it:
1) In v1-0002, instead of a centralized control lock for the SLRU, I have introduced a bank-wise control lock.
2) In v1-0003, I have removed the global LRU counter and introduced a bank-wise counter.

The second change (v1-0003) is meant to avoid the CPU/OS cache invalidation caused by frequent updates of a single shared variable; later in my performance tests I will show how much gain we get from these two changes.

Note: This is going to be a long email, but I have summarised the main idea above this point. Now I am going to discuss more internal details in order to show that the design idea is valid, and also present two performance tests: one specific to contention on the centralized lock, and the other mainly about contention due to frequent buffer replacement in the SLRU buffer pool. We are getting ~2x TPS compared to head with these patches, and in later sections I will discuss this in more detail, i.e. exact performance numbers and an analysis of why we are seeing the gain.

There are some problems I faced while converting this centralized control lock to a bank-wise lock, mainly because this lock is (mis)used for different purposes. The main purpose of this control lock, as I understand it, is to protect in-memory access (read/write) of the buffers in the SLRU buffer pool. Here is a list of some of the problems and my analysis of them:

1) In some of the SLRUs, we use this lock for protecting members inside the control structure which is specific to that SLRU layer, i.e. the SerialControlData members are protected by SerialSLRULock. I don’t think that is the right use of this lock, so I have introduced another lock called SerialControlLock for this specific purpose. Based on my analysis there is no reason to protect these members and the SLRU buffer access with the same lock.

2) The member called 'latest_page_number' inside SlruSharedData is also protected by the SLRULock. I would not say this use case is wrong, but since this is a common variable and not a per-bank variable, it cannot be protected by the bank-wise lock anymore. The usage of this variable is just to track the latest page in an SLRU so that we do not evict the latest page during victim page selection. So I have converted it to an atomic variable, as it is completely independent of the SLRU buffer access.
3) SlruScanDirectory() is called under the SLRU control lock from DeactivateCommitTs(), but from all other places SlruScanDirectory() is called without the lock, because those callers run from places which are not executed concurrently (i.e. startup, checkpoint). DeactivateCommitTs() is also called only during startup, so there doesn't seem to be any need to call it under the SLRU control lock. Currently, I have called it under the all-bank lock, because this is not a performance path and that way we keep it consistent with the current logic, but if others also think that we do not need a lock at this place then we could remove it, and then we wouldn't need the all-bank lock anywhere.

There are some other uses of this lock where one might think a bank-wise lock will be a problem, but it isn't, and I have given my analysis for each:

1) SimpleLruTruncate: We might worry that with a bank-wise lock we would need to release and acquire different locks as we scan different banks. As per my analysis this is not an issue, because a) the current code also releases and reacquires the centralized lock multiple times in order to perform I/O on a buffer slot, so the behaviour is not changed, and, more importantly, b) all SLRU layers take care that this function is not accessed concurrently; I have verified all callers of this function and it is true, and the function header also says the same. So this is not an issue as per my analysis.

2) Extending or adding a new page to an SLRU: I have noticed that this is also protected either by some other exclusive lock or is only done during startup. So, in short, the SLRULock is just used for protecting access to the buffers in the buffer pool; it is not what guarantees exclusive access inside these functions, because that is taken care of in some other way.

3) Another thing I noticed while writing this, and which seems worth noting, concerns the CLOG group update of xid status. There, if we do not get the control lock on the SLRU, we add ourselves to a group and the group leader does the job for all the members in the group. One might think that different pages in the group might belong to different SLRU banks, so the leader might need to acquire/release the lock as it processes the requests in the group. That is true, and it is taken care of, but we don’t need to worry about this case: as per the implementation of the group update, we try to put members with the same page request in one group, and only as an exception could there be members with a different page request. So with a bank-wise lock we are handling that exception case, but it is not the regular case where we would need to acquire/release locks multiple times. Design-wise we are good, and performance-wise there should not be any problem, because most of the time we will be updating pages from the same bank, and if in some cases we have updates for old transactions due to long-running transactions then we should do better by not having a centralized lock.
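To make the bank-wise design concrete, here is a minimal sketch of how a page number could map to a bank and to its bank-wise lock. This is only an illustration of the idea: SLRU_BANK_SIZE and the bank_locks array are assumed names for the sketch, not necessarily what the patches use, while SlruShared, num_slots, LWLockAcquire and LW_EXCLUSIVE are the existing slru/lwlock notions.

/*
 * Sketch only: the pool is split into fixed-size banks, a page may live
 * only in its own bank, and each bank has its own control lock (and,
 * with v1-0003, its own LRU counter).
 */
#define SLRU_BANK_SIZE  16              /* slots per bank (illustrative) */

static inline int
SlruBankOfPage(SlruShared shared, int pageno)
{
    int     nbanks = shared->num_slots / SLRU_BANK_SIZE;

    return pageno % nbanks;
}

static inline LWLock *
SlruBankLock(SlruShared shared, int pageno)
{
    /* bank_locks[] is an assumed array: one lock per bank, replacing SLRULock */
    return &shared->bank_locks[SlruBankOfPage(shared, pageno)].lock;
}

/*
 * A reader then does roughly:
 *   LWLockAcquire(SlruBankLock(shared, pageno), LW_EXCLUSIVE);
 *   ... search only the SLRU_BANK_SIZE slots of that bank, or replace a
 *   victim chosen from that bank ...
 *   LWLockRelease(SlruBankLock(shared, pageno));
 */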
Performance Test:

Exp1: Show the problem caused by CPU/OS cache invalidation due to frequent updates of the centralized lock and of a common LRU counter. Here we run, in parallel with the pgbench script, a transaction that frequently creates subtransaction overflow and thereby forces the visibility-check mechanism to access the subtrans SLRU.

Test machine: 8 CPU/ 64 core/ 128 with HT/ 512 GB RAM / SSD
scale factor: 300
shared_buffers=20GB
checkpoint_timeout=40min
max_wal_size=20GB
max_connections=200

Workload: run these 2 scripts in parallel:
./pgbench -c $ -j $ -T 600 -P5 -M prepared postgres
./pgbench -c 1 -j 1 -T 600 -f savepoint.sql postgres

savepoint.sql (creates subtransaction overflow):
BEGIN;
SAVEPOINT S1;
INSERT INTO test VALUES(1)
← repeat 70 times →
SELECT pg_sleep(1);
COMMIT;

Code under test:
Head: PostgreSQL head code
SlruBank: The first patch applied, which converts the SLRU buffer pool into banks (0001)
SlruBank+BankwiseLockAndLru: 0001+0002+0003 applied

Results (TPS):
Clients    Head     SlruBank   SlruBank+BankwiseLockAndLru
1          457      491        475
8          3753     3819       3782
32         14594    14328      17028
64         15600    16243      25944
128        15957    16272      31731

So we can see that at 128 clients we get ~2x TPS (with SlruBank + bank-wise lock and bank-wise LRU counter) compared to HEAD. One might wonder why we do not see much gain with only the SlruBank patch. The reason is that in this particular test case there is not much load on buffer replacement; in fact, the wait events don't show contention on any lock either. Instead, the main load comes from frequently modifying the common variables, namely the centralized control lock and the centralized LRU counter. That is evident in the perf data shown below:

+   74.72%  0.06%  postgres  postgres  [.] XidInMVCCSnapshot
+   74.08%  0.02%  postgres  postgres  [.] SubTransGetTopmostTransaction
+   74.04%  0.07%  postgres  postgres  [.] SubTransGetParent
+   57.66%  0.04%  postgres  postgres  [.] LWLockAcquire
+   57.64%  0.26%  postgres  postgres  [.] SimpleLruReadPage_ReadOnly
......
+   16.53%  0.07%  postgres  postgres  [.] LWLockRelease
+   16.36%  0.04%  postgres  postgres  [.] pg_atomic_sub_fetch_u32
+   16.31% 16.24%  postgres  postgres  [.] pg_atomic_fetch_sub_u32_impl

We can see that the main load is on the atomic variable inside LWLockAcquire and LWLockRelease. Once we apply the bank-wise lock patch (v1-0002), the same problem shows up on the cur_lru_count update in the SlruRecentlyUsed macro [2] (not shown here, but it was visible in my perf report), and that is resolved by implementing a bank-wise counter.

[2]
#define SlruRecentlyUsed(shared, slotno) \
    do { \
        ..
            (shared)->cur_lru_count = ++new_lru_count; \
        ..
        } \
    } while (0)
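For illustration, the bank-wise counter of v1-0003 could make that macro look roughly like the sketch below, with one counter per bank instead of the single shared cur_lru_count. SLRU_BANK_SIZE and bank_cur_lru_count are assumed names used only for this sketch (page_lru_count is the existing per-slot counter); it is not the exact code from the patch.

/* sketch only: per-bank LRU counter so hot updates stay within one bank */
#define SlruRecentlyUsedBankwise(shared, slotno) \
    do { \
        int     bankno = (slotno) / SLRU_BANK_SIZE; \
        int     new_lru_count = (shared)->bank_cur_lru_count[bankno]; \
        if (new_lru_count != (shared)->page_lru_count[slotno]) \
        { \
            (shared)->bank_cur_lru_count[bankno] = ++new_lru_count; \
            (shared)->page_lru_count[slotno] = new_lru_count; \
        } \
    } while (0)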
Exp2: This test shows the load caused by frequent SLRU buffer replacement. Here we run a pgbench-like script which frequently generates multixact-ids, and in parallel we repeatedly start and commit a long-running transaction so that the multixact-ids are not immediately cleaned up by vacuum, which creates contention on the SLRU buffer pool. I am not leaving the long-running transaction open forever, as that would start to show a different problem (bloat) and we would lose the point of what I am trying to show here.

Note: the test configuration is the same as in Exp1; only the workload is different. We run the two scripts below, with the new config parameter (added in v1-0001) slru_buffers_size_scale=4, which means NUM_MULTIXACTOFFSET_BUFFERS will be 64 (16 on head) and NUM_MULTIXACTMEMBER_BUFFERS will be 128 (32 on head).

./pgbench -c $ -j $ -T 600 -P5 -M prepared -f multixact.sql postgres
./pgbench -c 1 -j 1 -T 600 -f longrunning.sql postgres

cat > multixact.sql <<EOF
\set aid random(1, 100000 * :scale)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
SELECT FROM pgbench_accounts WHERE aid = :aid FOR UPDATE;
SAVEPOINT S1;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;
EOF

cat > longrunning.sql << EOF
BEGIN;
INSERT INTO pgbench_test VALUES(1);
select pg_sleep(10);
COMMIT;
EOF

Results (TPS):
Clients    Head     SlruBank   SlruBank+BankwiseLock
1          528      513        531
8          3870     4239       4157
32         13945    14470      14556
64         10086    19034      24482
128        6909     15627      18161

Here we can see a good improvement with the SlruBank patch alone, because of the increased SLRU buffer pool: in this workload there is a lot of contention due to buffer replacement. As shown below, there is a lot of load on MultiXactOffsetSLRU as well as on MultiXactOffsetBuffer, which shows that buffer evictions are frequent in this workload. Increasing the SLRU buffer pool size helps a lot, and dividing the SLRU lock into bank-wise locks gives a further gain. In total we are seeing ~2.5x TPS at 64 and 128 clients compared to head.

3401  LWLock  | MultiXactOffsetSLRU
2031  LWLock  | MultiXactOffsetBuffer
 687          |
 427  LWLock  | BufferContent

Credits:
- The base patch v1-0001 is authored by Thomas Munro; I have just rebased it.
- 0002 and 0003 are new patches written by me, based on design ideas from Robert and myself.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From: Dilip Kumar
On Wed, Oct 11, 2023 at 4:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> The small size of the SLRU buffer pools can sometimes become a performance problem because it’s not difficult to have a workload where the number of buffers actively in use is larger than the fixed-size buffer pool. However, just increasing the size of the buffer pool doesn’t necessarily help, because the linear search that we use for buffer replacement doesn’t scale, and also because contention on the single centralized lock limits scalability.
>
> A couple of patches have been proposed in the past to address the problem of increasing the buffer pool size. One of them [1], proposed by Thomas Munro, makes the size of the buffer pool configurable.

In my last email, I forgot to give the link from where I have taken the base patch for dividing the buffer pool into banks, so giving it here [1]. Looking at this again, it seems that the idea of that patch was from Andrey M. Borodin, and the idea of the SLRU scale factor was introduced by Yura Sokolov and Ivan Lazarev. Apologies for missing that in the first email.

[1] https://commitfest.postgresql.org/43/2627/

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From: Dilip Kumar
On Wed, Oct 11, 2023 at 5:57 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Oct 11, 2023 at 4:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> In my last email, I forgot to give the link from where I have taken the base patch for dividing the buffer pool into banks, so giving it here [1]. Looking at this again, it seems that the idea of that patch was from Andrey M. Borodin, and the idea of the SLRU scale factor was introduced by Yura Sokolov and Ivan Lazarev. Apologies for missing that in the first email.
>
> [1] https://commitfest.postgresql.org/43/2627/

In my last email I had just rebased the base patch; now, while reading through that patch, I realized that some refactoring was needed and there were some unused functions, so I have removed those and also added some comments. I also did some refactoring of my own patches. So I am reposting the patch series.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From: Amit Kapila
On Wed, Oct 11, 2023 at 4:35 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> The small size of the SLRU buffer pools can sometimes become a performance problem because it’s not difficult to have a workload where the number of buffers actively in use is larger than the fixed-size buffer pool. However, just increasing the size of the buffer pool doesn’t necessarily help, because the linear search that we use for buffer replacement doesn’t scale, and also because contention on the single centralized lock limits scalability.
>
> A couple of patches have been proposed in the past to address the problem of increasing the buffer pool size. One of them [1], proposed by Thomas Munro, makes the size of the buffer pool configurable. And, in order to deal with the linear search in the larger buffer pool, it divides the SLRU buffer pool into associative banks so that searching in the buffer pool doesn’t get slower as the pool grows. This works well for workloads that are mainly impacted by frequent buffer replacement, but it still doesn’t stand up well for workloads where the centralized control lock is the bottleneck.
>
> So I have taken this patch as my base patch (v1-0001) and added two more improvements on top of it:
> 1) In v1-0002, instead of a centralized control lock for the SLRU, I have introduced a bank-wise control lock.
> 2) In v1-0003, I have removed the global LRU counter and introduced a bank-wise counter.
> The second change (v1-0003) is meant to avoid the CPU/OS cache invalidation caused by frequent updates of a single shared variable; later in my performance tests I will show how much gain we get from these two changes.
>
> Note: This is going to be a long email, but I have summarised the main idea above this point. Now I am going to discuss more internal details in order to show that the design idea is valid, and also present two performance tests: one specific to contention on the centralized lock, and the other mainly about contention due to frequent buffer replacement in the SLRU buffer pool. We are getting ~2x TPS compared to head with these patches, and in later sections I will discuss this in more detail, i.e. exact performance numbers and an analysis of why we are seeing the gain.
>
> ...
>
> Performance Test:
>
> Exp1: Show the problem caused by CPU/OS cache invalidation due to frequent updates of the centralized lock and of a common LRU counter. Here we run, in parallel with the pgbench script, a transaction that frequently creates subtransaction overflow and thereby forces the visibility-check mechanism to access the subtrans SLRU.
>
> Test machine: 8 CPU/ 64 core/ 128 with HT/ 512 GB RAM / SSD
> scale factor: 300
> shared_buffers=20GB
> checkpoint_timeout=40min
> max_wal_size=20GB
> max_connections=200
>
> Workload: run these 2 scripts in parallel:
> ./pgbench -c $ -j $ -T 600 -P5 -M prepared postgres
> ./pgbench -c 1 -j 1 -T 600 -f savepoint.sql postgres
>
> savepoint.sql (creates subtransaction overflow):
> BEGIN;
> SAVEPOINT S1;
> INSERT INTO test VALUES(1)
> ← repeat 70 times →
> SELECT pg_sleep(1);
> COMMIT;
>
> Code under test:
> Head: PostgreSQL head code
> SlruBank: The first patch applied, which converts the SLRU buffer pool into banks (0001)
> SlruBank+BankwiseLockAndLru: 0001+0002+0003 applied
>
> Results (TPS):
> Clients    Head     SlruBank   SlruBank+BankwiseLockAndLru
> 1          457      491        475
> 8          3753     3819       3782
> 32         14594    14328      17028
> 64         15600    16243      25944
> 128        15957    16272      31731
>
> So we can see that at 128 clients we get ~2x TPS (with SlruBank + bank-wise lock and bank-wise LRU counter) compared to HEAD.

This and other results shared by you look promising. Will there be any improvement in workloads related to clog buffer usage?

BTW, I remember that there was also a discussion of moving SLRU into a regular buffer pool [1]. You have not provided any explanation as to whether that approach will have any merits after we do this or whether that approach is not worth pursuing at all.

[1] - https://commitfest.postgresql.org/43/3514/

--
With Regards,
Amit Kapila.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From: Dilip Kumar
On Sat, Oct 14, 2023 at 9:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> This and other results shared by you look promising. Will there be any improvement in workloads related to clog buffer usage?

I did not understand this question, can you explain it a bit? In short, if it is about performance, then we will see an improvement for all the SLRUs, because the control lock is not centralized anymore; instead it is a bank-wise lock.

> BTW, I remember that there was also a discussion of moving SLRU into a regular buffer pool [1]. You have not provided any explanation as to whether that approach will have any merits after we do this or whether that approach is not worth pursuing at all.
>
> [1] - https://commitfest.postgresql.org/43/3514/

Yeah, I haven't read that thread in detail with respect to the performance numbers. But these two approaches cannot coexist, because this patch improves SLRU buffer pool access (configurable size) as well as lock contention. If we move SLRU into the main buffer pool then we might not have the same problems, but there could be other ones, such as SLRU buffers getting evicted due to competition from relation buffers; OTOH the advantage of that approach would be that we can simply use a bigger buffer pool and SLRU can take advantage of that too. But in my opinion, most of the time SLRU access touches a limited set of pages, and the SLRU buffer access pattern is quite different from the access pattern for relation pages, so keeping them in the same buffer pool and competing against relation pages for victim buffer selection might cause other problems. But anyway, those points are probably better discussed in that thread.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From: Alvaro Herrera
On 2023-Oct-11, Dilip Kumar wrote:

> In my last email, I forgot to give the link from where I have taken the base patch for dividing the buffer pool into banks, so giving it here [1]. Looking at this again, it seems that the idea of that patch was from Andrey M. Borodin, and the idea of the SLRU scale factor was introduced by Yura Sokolov and Ivan Lazarev. Apologies for missing that in the first email.

You mean [1].
[1] https://postgr.es/m/452d01f7e331458f56ad79bef537c31b%40postgrespro.ru

I don't like this idea very much, because of the magic numbers that act as ratios for numbers of buffers on each SLRU compared to other SLRUs. These values, which I took from the documentation part of the patch, appear to have been selected by throwing darts at the wall:

NUM_CLOG_BUFFERS            = Min(128 << slru_buffers_size_scale, shared_buffers/256)
NUM_COMMIT_TS_BUFFERS       = Min(128 << slru_buffers_size_scale, shared_buffers/256)
NUM_SUBTRANS_BUFFERS        = Min(64 << slru_buffers_size_scale, shared_buffers/256)
NUM_NOTIFY_BUFFERS          = Min(32 << slru_buffers_size_scale, shared_buffers/256)
NUM_SERIAL_BUFFERS          = Min(32 << slru_buffers_size_scale, shared_buffers/256)
NUM_MULTIXACTOFFSET_BUFFERS = Min(32 << slru_buffers_size_scale, shared_buffers/256)
NUM_MULTIXACTMEMBER_BUFFERS = Min(64 << slru_buffers_size_scale, shared_buffers/256)

... which look pretty random already, if similar enough to the current hardcoded values. In reality, the code implements different values than what the documentation says.

I don't see why CLOG would have the same number as COMMIT_TS, when the size for elements of the latter is something like 32 times bigger -- yet the frequency of reads for COMMIT_TS is perhaps 1000x smaller than for CLOG. SUBTRANS is half of CLOG, yet it is 16 times larger and it covers the same range. The MULTIXACT ones appear to keep the current ratio among them (8/16 gets changed to 32/64).

... and this whole mess is scaled exponentially without regard to the size that each SLRU requires. This is just betting that enough memory can be wasted across all SLRUs up to the point where the one that is actually contended has sufficient memory. This doesn't sound sensible to me.

Like everybody else, I like having fewer GUCs to configure, but going this far to avoid them looks rather disastrous to me. IMO we should just use Munro's older patches that gave one GUC per SLRU, and users only need to increase the one that shows up in pg_wait_event sampling. Someday we will get the (much more complicated) patches to move these buffers to steal memory from shared buffers, and that'll hopefully let us get rid of all this complexity.

I'm inclined to use Borodin's patch last posted here [2] instead of your proposed 0001.
[2] https://postgr.es/m/93236D36-B91C-4DFA-AF03-99C083840378@yandex-team.ru

I did skim patches 0002 and 0003 without going into too much detail; they look like reasonable ideas. I have not tried to reproduce the claimed performance benefits. I think measuring this patch set with the tests posted by Shawn Debnath in [3] is important, too.
[3] https://postgr.es/m/YemDdpMrsoJFQJnU@f01898859afd.ant.amazon.com

On the other hand, here's a somewhat crazy idea. What if, instead of stealing buffers from shared_buffers (which causes a lot of complexity), we allocate a common pool for all SLRUs to use? We provide a single knob -- say, non_relational_buffers=32MB as default -- and we use an LRU algorithm (or something) to distribute that memory across all the SLRUs. So the ratio to use for this SLRU or that one would depend on the nature of the workload: maybe more for multixact in this server here, but more for subtrans in that server there; it's just the total amount that the user would have to configure, side by side with shared_buffers (and perhaps scale with it like wal_buffers), and the LRU would handle the rest. The "only" problem here is finding a distribution algorithm that doesn't further degrade performance, of course ...

--
Álvaro Herrera               Breisgau, Deutschland — https://www.EnterpriseDB.com/
"The problem with the facetime model is not just that it's demoralizing, but
that the people pretending to work interrupt the ones actually working."
                  -- Paul Graham, http://www.paulgraham.com/opensource.html
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From: Nathan Bossart
On Tue, Oct 24, 2023 at 06:04:13PM +0200, Alvaro Herrera wrote:
> Like everybody else, I like having fewer GUCs to configure, but going this far to avoid them looks rather disastrous to me. IMO we should just use Munro's older patches that gave one GUC per SLRU, and users only need to increase the one that shows up in pg_wait_event sampling. Someday we will get the (much more complicated) patches to move these buffers to steal memory from shared buffers, and that'll hopefully let us get rid of all this complexity.

+1

> On the other hand, here's a somewhat crazy idea. What if, instead of stealing buffers from shared_buffers (which causes a lot of complexity), we allocate a common pool for all SLRUs to use? We provide a single knob -- say, non_relational_buffers=32MB as default -- and we use an LRU algorithm (or something) to distribute that memory across all the SLRUs. So the ratio to use for this SLRU or that one would depend on the nature of the workload: maybe more for multixact in this server here, but more for subtrans in that server there; it's just the total amount that the user would have to configure, side by side with shared_buffers (and perhaps scale with it like wal_buffers), and the LRU would handle the rest. The "only" problem here is finding a distribution algorithm that doesn't further degrade performance, of course ...

I think it's worth a try. It does seem simpler, and it might allow us to sidestep some concerns about scaling when the SLRU pages are in shared_buffers [0].

[0] https://postgr.es/m/ZPsaEGRvllitxB3v%40tamriel.snowman.net

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From: Dilip Kumar
On Tue, Oct 24, 2023 at 9:34 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2023-Oct-11, Dilip Kumar wrote:
>
> > In my last email, I forgot to give the link from where I have taken the base patch for dividing the buffer pool into banks, so giving it here [1]. Looking at this again, it seems that the idea of that patch was from Andrey M. Borodin, and the idea of the SLRU scale factor was introduced by Yura Sokolov and Ivan Lazarev. Apologies for missing that in the first email.
>
> You mean [1].
> [1] https://postgr.es/m/452d01f7e331458f56ad79bef537c31b%40postgrespro.ru
>
> I don't like this idea very much, because of the magic numbers that act as ratios for numbers of buffers on each SLRU compared to other SLRUs. These values, which I took from the documentation part of the patch, appear to have been selected by throwing darts at the wall:
>
> NUM_CLOG_BUFFERS            = Min(128 << slru_buffers_size_scale, shared_buffers/256)
> NUM_COMMIT_TS_BUFFERS       = Min(128 << slru_buffers_size_scale, shared_buffers/256)
> NUM_SUBTRANS_BUFFERS        = Min(64 << slru_buffers_size_scale, shared_buffers/256)
> NUM_NOTIFY_BUFFERS          = Min(32 << slru_buffers_size_scale, shared_buffers/256)
> NUM_SERIAL_BUFFERS          = Min(32 << slru_buffers_size_scale, shared_buffers/256)
> NUM_MULTIXACTOFFSET_BUFFERS = Min(32 << slru_buffers_size_scale, shared_buffers/256)
> NUM_MULTIXACTMEMBER_BUFFERS = Min(64 << slru_buffers_size_scale, shared_buffers/256)
>
> ... which look pretty random already, if similar enough to the current hardcoded values. In reality, the code implements different values than what the documentation says.
>
> I don't see why CLOG would have the same number as COMMIT_TS, when the size for elements of the latter is something like 32 times bigger -- yet the frequency of reads for COMMIT_TS is perhaps 1000x smaller than for CLOG. SUBTRANS is half of CLOG, yet it is 16 times larger and it covers the same range. The MULTIXACT ones appear to keep the current ratio among them (8/16 gets changed to 32/64).
>
> ... and this whole mess is scaled exponentially without regard to the size that each SLRU requires. This is just betting that enough memory can be wasted across all SLRUs up to the point where the one that is actually contended has sufficient memory. This doesn't sound sensible to me.
>
> Like everybody else, I like having fewer GUCs to configure, but going this far to avoid them looks rather disastrous to me. IMO we should just use Munro's older patches that gave one GUC per SLRU, and users only need to increase the one that shows up in pg_wait_event sampling. Someday we will get the (much more complicated) patches to move these buffers to steal memory from shared buffers, and that'll hopefully let us get rid of all this complexity.

Overall I agree with your comments. Actually, I hadn't put that much thought into the GUC part and how it scales the SLRU buffers w.r.t. this single configurable parameter. So I think it is better that we take the older patch version as our base patch, where we have a separate GUC per SLRU.

> I'm inclined to use Borodin's patch last posted here [2] instead of your proposed 0001.
> [2] https://postgr.es/m/93236D36-B91C-4DFA-AF03-99C083840378@yandex-team.ru

I will rebase my patches on top of this.

> I did skim patches 0002 and 0003 without going into too much detail; they look like reasonable ideas. I have not tried to reproduce the claimed performance benefits. I think measuring this patch set with the tests posted by Shawn Debnath in [3] is important, too.
> [3] https://postgr.es/m/YemDdpMrsoJFQJnU@f01898859afd.ant.amazon.com

Thanks for taking a look.

> On the other hand, here's a somewhat crazy idea. What if, instead of stealing buffers from shared_buffers (which causes a lot of complexity),

Currently, we do not steal buffers from shared_buffers, although the computation does depend on NBuffers. I mean, for each SLRU we compute separate memory, which is in addition to shared_buffers, no?

> we allocate a common pool for all SLRUs to use? We provide a single knob -- say, non_relational_buffers=32MB as default -- and we use an LRU algorithm (or something) to distribute that memory across all the SLRUs. So the ratio to use for this SLRU or that one would depend on the nature of the workload: maybe more for multixact in this server here, but more for subtrans in that server there; it's just the total amount that the user would have to configure, side by side with shared_buffers (and perhaps scale with it like wal_buffers), and the LRU would handle the rest. The "only" problem here is finding a distribution algorithm that doesn't further degrade performance, of course ...

Yeah, this could be an idea, but are you saying that all the SLRUs would share a single buffer pool, and the LRU algorithm would decide which pages stay in the buffer pool and which get evicted? Wouldn't that create another issue, with different SLRUs starting to contend on the same lock if we have a common buffer pool for all of them? Or am I missing something? Or are you saying that although there is a common buffer pool, each SLRU would have its own boundaries within it, protected by a separate lock, with those boundaries changing dynamically based on the workload? I haven't put much thought into how practical the idea is; I am just trying to understand what you have in mind.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From: Amit Kapila
On Fri, Oct 20, 2023 at 9:40 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sat, Oct 14, 2023 at 9:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > This and other results shared by you look promising. Will there be any improvement in workloads related to clog buffer usage?
>
> I did not understand this question, can you explain it a bit?

I meant to ask about the impact of this patch on accessing transaction status via TransactionIdGetStatus(). Shouldn't we expect some improvement in accessing CLOG buffers?

--
With Regards,
Amit Kapila.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From: Dilip Kumar
On Wed, Oct 25, 2023 at 5:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Oct 20, 2023 at 9:40 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Sat, Oct 14, 2023 at 9:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > This and other results shared by you look promising. Will there be any improvement in workloads related to clog buffer usage?
> >
> > I did not understand this question, can you explain it a bit?
>
> I meant to ask about the impact of this patch on accessing transaction status via TransactionIdGetStatus(). Shouldn't we expect some improvement in accessing CLOG buffers?

Yes, there should be, because 1) there is no common lock anymore, so contention on a centralized control lock is reduced when we access transaction status from pages falling in different SLRU banks, and 2) the buffer pool size is configurable, so if the workload accesses transaction status across a wide range of transactions it will help by reducing frequent buffer eviction, though that might not be the most common case.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From: Dilip Kumar
On Wed, Oct 25, 2023 at 10:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Oct 24, 2023 at 9:34 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> Overall I agree with your comments. Actually, I hadn't put that much thought into the GUC part and how it scales the SLRU buffers w.r.t. this single configurable parameter. So I think it is better that we take the older patch version as our base patch, where we have a separate GUC per SLRU.
>
> > I'm inclined to use Borodin's patch last posted here [2] instead of your proposed 0001.
> > [2] https://postgr.es/m/93236D36-B91C-4DFA-AF03-99C083840378@yandex-team.ru
>
> I will rebase my patches on top of this.

I have taken 0001 and 0002 from [1], done some bug fixes in 0001, and changed the logic of SlruAdjustNSlots() in 0002, such that now it starts with the next power-of-2 value of the configured slots and keeps doubling the number of banks until either we reach the maximum number of banks, SLRU_MAX_BANKS (128), or the bank size would drop below SLRU_MIN_BANK_SIZE (8). By doing so, we ensure we don't have too many banks, but also that we don't have very large banks.

There was also a patch 0003 in that thread, but I haven't taken it, as it is another optimization (merging some structure members); I will analyze its performance characteristics and try to add it on top of the complete patch series.

Patch details:
0001 - GUC parameter for each SLRU
0002 - Divide the SLRU pool into banks
(The above 2 are taken from [1] with some modification and rebasing by me)
0003 - Implement bank-wise SLRU lock as described in the first email of this thread
0004 - Implement bank-wise LRU counter as described in the first email of this thread
0005 - Some other optimization suggested offlist by Alvaro, i.e. merging buffer locks and bank locks into the same array so that the bank-wise LRU counter does not fetch the next cache line in the hot function SlruRecentlyUsed()

Note: I think 0003, 0004 and 0005 can be merged together, but I have kept them separate so that we can review them independently and see how useful each of them is.

[1] https://postgr.es/m/93236D36-B91C-4DFA-AF03-99C083840378@yandex-team.ru

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
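For reference, here is a minimal sketch of the slot/bank adjustment rule described above, under the stated constants SLRU_MAX_BANKS = 128 and SLRU_MIN_BANK_SIZE = 8. This is only an illustration of the rule, not the code of 0002.

/*
 * Sketch: round the configured slot count up to a power of 2, then keep
 * doubling the bank count as long as we stay under SLRU_MAX_BANKS and the
 * per-bank size stays at or above SLRU_MIN_BANK_SIZE.
 */
#define SLRU_MAX_BANKS      128
#define SLRU_MIN_BANK_SIZE  8

static int
SlruAdjustNSlotsSketch(int nslots, int *banksize, int *nbanks)
{
    int     slots = 1;

    while (slots < nslots)
        slots <<= 1;            /* next power of 2 >= configured value */

    *banksize = slots;
    *nbanks = 1;
    while (*nbanks < SLRU_MAX_BANKS && *banksize / 2 >= SLRU_MIN_BANK_SIZE)
    {
        *banksize /= 2;
        *nbanks *= 2;
    }

    return slots;               /* adjusted (power-of-2) number of slots */
}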
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From: Dilip Kumar
On Mon, Oct 30, 2023 at 11:50 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>

Based on some offlist discussions with Alvaro and Robert (in separate conversations), Alvaro and I came to the same conclusion: if a user sets a very high value for the number of slots (say 1GB worth), then the number of slots in each bank will be 1024 (considering the maximum number of banks, 128), and if we continue with the sequential search for finding the buffer for a page, that could be costly in such cases. But later, in one of the conversations with Robert, I realized that we can have this bank-wise lock approach along with a partitioned hash table.

So the idea is that we will use a buffer mapping hash table, something like Thomas used in one of his patches [1], but instead of a normal hash table we will use a partitioned hash table. The SLRU buffer pool is still divided as in the bank-wise approach, and there will be separate locks for each slot range. So now we get the benefit of both approaches: 1) by having a mapping hash we can avoid the sequential search, 2) by dividing the buffer pool into banks and keeping the victim buffer search within those banks we avoid locking all the partitions during the victim buffer search, and 3) we can also maintain a bank-wise LRU counter so that we avoid contention on a single variable, as discussed in my first email of this thread.

Please find the updated patch set details and patches attached to the email.

[1] 0001-Make-all-SLRU-buffer-sizes-configurable: This is the same patch as in the previous patch set
[2] 0002-Add-a-buffer-mapping-table-for-SLRUs: Patch to introduce the buffer mapping hash table
[3] 0003-Partition-wise-slru-locks: Partition the hash table and also introduce partition-wise locks: this is a merge of 0003 and 0004 from the previous patch set, but instead of bank-wise locks it has partition-wise locks and an LRU counter.
[4] 0004-Merge-partition-locks-array-with-buffer-locks-array: merge buffer locks and bank locks into the same array so that the bank-wise LRU counter does not fetch the next cache line in the hot function SlruRecentlyUsed() (same as 0005 from the previous patch set)
[5] 0005-Ensure-slru-buffer-slots-are-in-multiple-of-number-of: Ensure that the number of slots is a multiple of the number of banks

With this approach, I have also made the number of banks constant (i.e. 8) so that some of the computations are easy. I think with a buffer mapping hash table we should not have much of a problem keeping this fixed: even with a very extreme configuration and a very high number of slots we do not have performance problems, because we are not doing a sequential search thanks to the buffer mapping hash, and if the number of slots is set that high then victim buffer search should not be frequent either, so we should not be worried about the sequential search within a bank for victim buffer selection.

I have also changed the default value of the number of slots to 64 and the minimum value to 16. I think this is a reasonable default because the existing values are too low considering modern hardware, and since these parameters are configurable the user can set them to a low value when running with very little memory.

[1] https://www.postgresql.org/message-id/CA%2BhUKGLCLDtgDj2Xsf0uBk5WXDCeHxBDDJPsyY7m65Fde-%3Dpyg%40mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
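To illustrate the combination described above, here is a minimal sketch of a page lookup with a partitioned buffer-mapping hash plus a bank-confined victim search. SLRU_NUM_PARTITIONS, part_locks, SlruHashLookup and SlruReplaceInBank are hypothetical names used only for this sketch; LWLockAcquire/LWLockRelease and SlruShared are the existing primitives.

/*
 * Sketch only: map a pageno to a mapping-hash partition lock, and keep
 * the victim search confined to the page's bank.
 */
#define SLRU_NUM_PARTITIONS     8

static inline int
SlruPartitionForPage(int pageno)
{
    /* pageno is already well distributed, so a simple modulo is enough */
    return pageno % SLRU_NUM_PARTITIONS;
}

static int
SlruReadPageSketch(SlruShared shared, int pageno)
{
    LWLock     *partlock = &shared->part_locks[SlruPartitionForPage(pageno)].lock;
    int         slotno;

    LWLockAcquire(partlock, LW_SHARED);
    slotno = SlruHashLookup(shared, pageno);    /* hypothetical hash lookup */
    LWLockRelease(partlock);

    if (slotno >= 0)
        return slotno;          /* cache hit */

    /*
     * Miss: choose a victim only within pageno's bank, then re-acquire the
     * partition lock exclusively and insert the new pageno -> slot mapping.
     */
    return SlruReplaceInBank(shared, pageno);   /* hypothetical */
}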
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From: "Andrey M. Borodin"
> On 30 Oct 2023, at 09:20, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> changed the logic of SlruAdjustNSlots() in 0002, such that now it starts with the next power-of-2 value of the configured slots and keeps doubling the number of banks until either we reach the maximum number of banks, SLRU_MAX_BANKS (128), or the bank size would drop below SLRU_MIN_BANK_SIZE (8). By doing so, we ensure we don't have too many banks

There was nothing wrong with having too many banks. Until bank-wise locks and counters were added in later patchsets.

Having hashtable to find SLRU page in the buffer IMV is too slow. Some comments on this approach can be found here [0].
I'm OK with having HTAB for that if we are sure performance does not degrade significantly, but I really doubt this is the case.
I even think SLRU buffers used HTAB in some ancient times, but I could not find commit when it was changed to linear search.
Maybe we could decouple locks and counters from SLRU banks? Banks were meant to be small to exploit performance of local linear search. Lock partitions have to be bigger for sure.
> On 30 Oct 2023, at 09:20, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I have taken 0001 and 0002 from [1], done some bug fixes in 0001
BTW can you please describe in more detail what kind of bugs?
Thanks for working on this!
Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From: Dilip Kumar
On Sun, Nov 5, 2023 at 1:37 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>
> > On 30 Oct 2023, at 09:20, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > changed the logic of SlruAdjustNSlots() in 0002, such that now it starts with the next power-of-2 value of the configured slots and keeps doubling the number of banks until either we reach the maximum number of banks, SLRU_MAX_BANKS (128), or the bank size would drop below SLRU_MIN_BANK_SIZE (8). By doing so, we ensure we don't have too many banks
>
> There was nothing wrong with having too many banks. Until bank-wise locks and counters were added in later patchsets.

I agree with that, but I feel that with bank-wise locks we are removing major contention from the centralized control lock, and my first email shows how much benefit we can get in one of the simple test cases, when we create subtransaction overflow.

> Having hashtable to find SLRU page in the buffer IMV is too slow. Some comments on this approach can be found here [0].
> I'm OK with having HTAB for that if we are sure performance does not degrade significantly, but I really doubt this is the case.
> I even think SLRU buffers used HTAB in some ancient times, but I could not find commit when it was changed to linear search.

The main intention of having this buffer mapping hash is to find the SLRU page faster than a sequential search when the banks are relatively big, but if we find cases where the hash creates more overhead than gain then I am fine with removing it, because the whole purpose of adding the hash here is to make the lookup faster. So far in my tests I did not find any slowness. Do you or anyone else have a test case, based on the previous research, showing whether it creates any slowness?

> Maybe we could decouple locks and counters from SLRU banks? Banks were meant to be small to exploit performance of local linear search. Lock partitions have to be bigger for sure.

Yeah, that could also be an idea if we plan to drop the hash. I mean, the bank-wise counter is fine, as we find a victim buffer within a bank itself, but each lock could cover more slots than one bank, or in other words it could protect multiple banks. Let's hear more opinions on this.

> > On 30 Oct 2023, at 09:20, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I have taken 0001 and 0002 from [1], done some bug fixes in 0001
>
> BTW can you please describe in more detail what kind of bugs?

Yeah, actually that patch was using the same GUC (multixact_offsets_buffers) in SimpleLruInit() for MultiXactOffsetCtl as well as for MultiXactMemberCtl; see the snippet below from the original patch.

@@ -1851,13 +1851,13 @@ MultiXactShmemInit(void)
 	MultiXactMemberCtl->PagePrecedes = MultiXactMemberPagePrecedes;

 	SimpleLruInit(MultiXactOffsetCtl,
-				  "MultiXactOffset", NUM_MULTIXACTOFFSET_BUFFERS, 0,
+				  "MultiXactOffset", multixact_offsets_buffers, 0,
 				  MultiXactOffsetSLRULock, "pg_multixact/offsets",
 				  LWTRANCHE_MULTIXACTOFFSET_BUFFER,
 				  SYNC_HANDLER_MULTIXACT_OFFSET);
 	SlruPagePrecedesUnitTests(MultiXactOffsetCtl, MULTIXACT_OFFSETS_PER_PAGE);
 	SimpleLruInit(MultiXactMemberCtl,
-				  "MultiXactMember", NUM_MULTIXACTMEMBER_BUFFERS, 0,
+				  "MultiXactMember", multixact_offsets_buffers, 0,
 				  MultiXactMemberSLRULock, "pg_multixact/members",
 				  LWTRANCHE_MULTIXACTMEMBER_BUFFER,
 				  SYNC_HANDLER_MULTIXACT_MEMBER);

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From: "Andrey M. Borodin"
> On 6 Nov 2023, at 09:09, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
>> Having hashtable to find SLRU page in the buffer IMV is too slow. Some comments on this approach can be found here [0].
>> I'm OK with having HTAB for that if we are sure performance does not degrade significantly, but I really doubt this is the case.
>> I even think SLRU buffers used HTAB in some ancient times, but I could not find commit when it was changed to linear search.
>
> The main intention of having this buffer mapping hash is to find the SLRU page faster than a sequential search when the banks are relatively big, but if we find cases where the hash creates more overhead than gain then I am fine with removing it, because the whole purpose of adding the hash here is to make the lookup faster. So far in my tests I did not find any slowness. Do you or anyone else have a test case, based on the previous research, showing whether it creates any slowness?

PFA a test, benchmark_slru_page_readonly(). In this test we run SimpleLruReadPage_ReadOnly() (the essential part of TransactionIdGetStatus()).

Before introducing the HTAB for buffer mapping I get
Time: 14837.851 ms (00:14.838)
With the buffer HTAB I get
Time: 22723.243 ms (00:22.723)

This hash table makes getting the transaction status ~50% slower.

Benchmark script I used:

make -C $HOME/postgresMX -j 8 install && (pkill -9 postgres; rm -rf test; ./initdb test && echo "shared_preload_libraries = 'test_slru'" >> test/postgresql.conf && ./pg_ctl -D test start && ./psql -c 'create extension test_slru' postgres && ./pg_ctl -D test restart && ./psql -c "SELECT count(test_slru_page_write(a, 'Test SLRU')) FROM generate_series(12346, 12393, 1) as a;" -c '\timing' -c "SELECT benchmark_slru_page_readonly(12377);" postgres)

>> Maybe we could decouple locks and counters from SLRU banks? Banks were meant to be small to exploit performance of local linear search. Lock partitions have to be bigger for sure.
>
> Yeah, that could also be an idea if we plan to drop the hash. I mean, the bank-wise counter is fine, as we find a victim buffer within a bank itself, but each lock could cover more slots than one bank, or in other words it could protect multiple banks. Let's hear more opinions on this.

+1

>> On 30 Oct 2023, at 09:20, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>
>> I have taken 0001 and 0002 from [1], done some bug fixes in 0001
>
> BTW can you please describe in more detail what kind of bugs?
>
> Yeah, actually that patch was using the same GUC (multixact_offsets_buffers) in SimpleLruInit() for MultiXactOffsetCtl as well as for MultiXactMemberCtl; see the snippet below from the original patch.

Ouch. We were running this for several years with this bug... Thanks!

Best regards, Andrey Borodin.
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From: Dilip Kumar
On Mon, Nov 6, 2023 at 1:05 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>
> > On 6 Nov 2023, at 09:09, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> >> Having hashtable to find SLRU page in the buffer IMV is too slow. Some comments on this approach can be found here [0].
> >> I'm OK with having HTAB for that if we are sure performance does not degrade significantly, but I really doubt this is the case.
> >> I even think SLRU buffers used HTAB in some ancient times, but I could not find commit when it was changed to linear search.
> >
> > The main intention of having this buffer mapping hash is to find the SLRU page faster than a sequential search when the banks are relatively big, but if we find cases where the hash creates more overhead than gain then I am fine with removing it, because the whole purpose of adding the hash here is to make the lookup faster. So far in my tests I did not find any slowness. Do you or anyone else have a test case, based on the previous research, showing whether it creates any slowness?
>
> PFA a test, benchmark_slru_page_readonly(). In this test we run SimpleLruReadPage_ReadOnly() (the essential part of TransactionIdGetStatus()).
>
> Before introducing the HTAB for buffer mapping I get
> Time: 14837.851 ms (00:14.838)
> With the buffer HTAB I get
> Time: 22723.243 ms (00:22.723)
>
> This hash table makes getting the transaction status ~50% slower.
>
> Benchmark script I used:
> make -C $HOME/postgresMX -j 8 install && (pkill -9 postgres; rm -rf test; ./initdb test && echo "shared_preload_libraries = 'test_slru'" >> test/postgresql.conf && ./pg_ctl -D test start && ./psql -c 'create extension test_slru' postgres && ./pg_ctl -D test restart && ./psql -c "SELECT count(test_slru_page_write(a, 'Test SLRU')) FROM generate_series(12346, 12393, 1) as a;" -c '\timing' -c "SELECT benchmark_slru_page_readonly(12377);" postgres)

With this test, I got the numbers below:

nslots    no-hash    hash
8         10s        13s
16        10s        13s
32        15s        13s
64        17s        13s

Yeah, so we can see that with a small bank size (<= 16 slots) fetching a page with the hash is ~30% slower than the sequential search, but beyond 32 slots the sequential search becomes slower as you grow the number of slots, whereas with the hash it stays constant, as expected. But now, as you said, if we keep the lock partition range different from the bank size then we might not have any problem with having more banks, and with that we can keep the bank size small, like 16. Let me put some more thought into this and get back. Any other opinions on this?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From: Alvaro Herrera
On 2023-Nov-06, Dilip Kumar wrote:

> Yeah, so we can see that with a small bank size (<= 16 slots) fetching a page with the hash is ~30% slower than the sequential search, but beyond 32 slots the sequential search becomes slower as you grow the number of slots, whereas with the hash it stays constant, as expected. But now, as you said, if we keep the lock partition range different from the bank size then we might not have any problem with having more banks, and with that we can keep the bank size small, like 16. Let me put some more thought into this and get back. Any other opinions on this?

dynahash is notoriously slow, which is why we have simplehash.h since commit b30d3ea824c5. Maybe we could use that instead.

--
Álvaro Herrera               Breisgau, Deutschland — https://www.EnterpriseDB.com/
"Listen and you will forget; see and you will remember; do and you will understand" (Confucius)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From: "Andrey M. Borodin"
> On 6 Nov 2023, at 14:31, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> dynahash is notoriously slow, which is why we have simplehash.h since commit b30d3ea824c5. Maybe we could use that instead.

Dynahash has lock partitioning. Simplehash has not, AFAIK.

The thing is, we do not really need a hash function - pageno is already the best hash function itself. And we do not need to cope with collisions much - we can evict a collided buffer.

Given this, we do not need a hashtable at all. That's the exact reasoning from which banks emerged: I started implementing a dynahash patch in April 2021 and found out that the "banks" approach is cleaner. However, the term "bank" is not common in software; it's taken from hardware caches.

Best regards, Andrey Borodin.
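To spell that argument out, here is a minimal sketch of a bank lookup that needs no hash table at all. num_banks and bank_size are assumed fields used only for this sketch; page_status, page_number and SLRU_PAGE_EMPTY are the existing slru.c notions.

/*
 * Sketch only: pageno acts as its own hash. Its bank index is derived
 * directly from it, the lookup is a short linear scan bounded by the bank
 * size, and a miss simply evicts some slot of the same bank (no separate
 * mapping table to maintain).
 */
static int
SlruLookupInBankSketch(SlruShared shared, int pageno)
{
    int     bankno = pageno % shared->num_banks;        /* assumed field */
    int     bankstart = bankno * shared->bank_size;     /* assumed field */
    int     bankend = bankstart + shared->bank_size;
    int     slotno;

    for (slotno = bankstart; slotno < bankend; slotno++)
    {
        if (shared->page_status[slotno] != SLRU_PAGE_EMPTY &&
            shared->page_number[slotno] == pageno)
            return slotno;      /* hit */
    }

    /*
     * Miss: the page can only live in this bank, so pick the least
     * recently used slot of the bank as the victim and read the page
     * into it (victim selection and I/O elided here).
     */
    return -1;
}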
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From: Dilip Kumar
On Mon, Nov 6, 2023 at 4:44 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>
> > On 6 Nov 2023, at 14:31, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> >
> > dynahash is notoriously slow, which is why we have simplehash.h since commit b30d3ea824c5. Maybe we could use that instead.
>
> Dynahash has lock partitioning. Simplehash has not, AFAIK.

Yeah, simplehash doesn't have partitioning, so with simplehash we would be stuck with the centralized control lock, which is one of the main problems we are trying to solve here.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From: Amul Sul

On Mon, Nov 6, 2023 at 4:44 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
> > On 6 Nov 2023, at 14:31, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> >
> > dynahash is notoriously slow, which is why we have simplehash.h since commit b30d3ea824c5. Maybe we could use that instead.
>
> Dynahash has lock partitioning. Simplehash has not, AFAIK.
>
> The thing is, we do not really need a hash function - pageno is already the best hash function itself. And we do not need to cope with collisions much - we can evict a collided buffer.
>
> Given this, we do not need a hashtable at all. That's the exact reasoning from which banks emerged: I started implementing a dynahash patch in April 2021 and found out that the "banks" approach is cleaner. However, the term "bank" is not common in software; it's taken from hardware caches.
I agree that we don't need a hash function to generate a hash value out of
pageno, which is itself sufficient, but I don't understand how we can get rid of
the hash table itself -- how would we map the pageno to the slot number?
Or is that mapping not needed at all?
Regards,
Amul
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From: Amul Sul

On Fri, Nov 3, 2023 at 10:59 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Oct 30, 2023 at 11:50 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > [...]
>
> [1] 0001-Make-all-SLRU-buffer-sizes-configurable: This is the same patch as in the previous patch set
> [2] 0002-Add-a-buffer-mapping-table-for-SLRUs: Patch to introduce the buffer mapping hash table
> [3] 0003-Partition-wise-slru-locks: Partition the hash table and also introduce partition-wise locks: this is a merge of 0003 and 0004 from the previous patch set, but instead of bank-wise locks it has partition-wise locks and an LRU counter.
> [4] 0004-Merge-partition-locks-array-with-buffer-locks-array: merge buffer locks and bank locks into the same array so that the bank-wise LRU counter does not fetch the next cache line in the hot function SlruRecentlyUsed() (same as 0005 from the previous patch set)
> [5] 0005-Ensure-slru-buffer-slots-are-in-multiple-of-number-of: Ensure that the number of slots is a multiple of the number of banks
> [...]
Here are some minor comments:
+ * By default, we'll use 1MB of for every 1GB of shared buffers, up to the
+ * maximum value that slru.c will allow, but always at least 16 buffers.
*/
Size
CommitTsShmemBuffers(void)
{
- return Min(256, Max(4, NBuffers / 256));
+ /* Use configured value if provided. */
+ if (commit_ts_buffers > 0)
+ return Max(16, commit_ts_buffers);
+ return Min(256, Max(16, NBuffers / 256));
Do you mean "4MB of for every 1GB" in the comment?
--
diff --git a/src/include/access/commit_ts.h b/src/include/access/commit_ts.h
index 5087cdce51..78d017ad85 100644
--- a/src/include/access/commit_ts.h
+++ b/src/include/access/commit_ts.h
@@ -16,7 +16,6 @@
#include "replication/origin.h"
#include "storage/sync.h"
-
extern PGDLLIMPORT bool track_commit_timestamp;
A spurious change.
--
@@ -168,10 +180,19 @@ SimpleLruShmemSize(int nslots, int nlsns)
if (nlsns > 0)
sz += MAXALIGN(nslots * nlsns * sizeof(XLogRecPtr)); /* group_lsn[] */
-
return BUFFERALIGN(sz) + BLCKSZ * nslots;
}
Another spurious change in 0002 patch.
--
+/*
+ * The slru buffer mapping table is partitioned to reduce contention. To
+ * determine which partition lock a given pageno requires, compute the pageno's
+ * hash code with SlruBufTableHashCode(), then apply SlruPartitionLock().
+ */
I didn't see SlruBufTableHashCode() & SlruPartitionLock() functions anywhere in
your patches, is that outdated comment?
--
- sz += MAXALIGN(nslots * sizeof(LWLockPadded)); /* buffer_locks[] */
- sz += MAXALIGN(SLRU_NUM_PARTITIONS * sizeof(LWLockPadded)); /* part_locks[] */
+ sz += MAXALIGN((nslots + SLRU_NUM_PARTITIONS) * sizeof(LWLockPadded)); /* locks[] */
I am a bit uncomfortable with these changes, merging parts and buffer locks
making it hard to understand the code. Not sure what we were getting out of
this?
--
Subject: [PATCH v4 5/5] Ensure slru buffer slots are in multiple of number of partitions
I think the 0005 patch can be merged to 0001.
Regards,
Amul
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Wed, Nov 8, 2023 at 10:52 AM Amul Sul <sulamul@gmail.com> wrote: Thanks for review Amul, > Here are some minor comments: > > + * By default, we'll use 1MB of for every 1GB of shared buffers, up to the > + * maximum value that slru.c will allow, but always at least 16 buffers. > */ > Size > CommitTsShmemBuffers(void) > { > - return Min(256, Max(4, NBuffers / 256)); > + /* Use configured value if provided. */ > + if (commit_ts_buffers > 0) > + return Max(16, commit_ts_buffers); > + return Min(256, Max(16, NBuffers / 256)); > > Do you mean "4MB of for every 1GB" in the comment? You are right > -- > > diff --git a/src/include/access/commit_ts.h b/src/include/access/commit_ts.h > index 5087cdce51..78d017ad85 100644 > --- a/src/include/access/commit_ts.h > +++ b/src/include/access/commit_ts.h > @@ -16,7 +16,6 @@ > #include "replication/origin.h" > #include "storage/sync.h" > > - > extern PGDLLIMPORT bool track_commit_timestamp; > > A spurious change. Will fix > -- > > @@ -168,10 +180,19 @@ SimpleLruShmemSize(int nslots, int nlsns) > > if (nlsns > 0) > sz += MAXALIGN(nslots * nlsns * sizeof(XLogRecPtr)); /* group_lsn[] */ > - > return BUFFERALIGN(sz) + BLCKSZ * nslots; > } > > Another spurious change in 0002 patch. Will fix > -- > > +/* > + * The slru buffer mapping table is partitioned to reduce contention. To > + * determine which partition lock a given pageno requires, compute the pageno's > + * hash code with SlruBufTableHashCode(), then apply SlruPartitionLock(). > + */ > > I didn't see SlruBufTableHashCode() & SlruPartitionLock() functions anywhere in > your patches, is that outdated comment? Yes will fix it, actually, there are some major design changes to this. > -- > > - sz += MAXALIGN(nslots * sizeof(LWLockPadded)); /* buffer_locks[] */ > - sz += MAXALIGN(SLRU_NUM_PARTITIONS * sizeof(LWLockPadded)); /* part_locks[] */ > + sz += MAXALIGN((nslots + SLRU_NUM_PARTITIONS) * sizeof(LWLockPadded)); /* locks[] */ > > I am a bit uncomfortable with these changes, merging parts and buffer locks > making it hard to understand the code. Not sure what we were getting out of > this? Yes, even I do not like this much because it is confusing. But the advantage of this is that we are using a single pointer for the lock which means the next variable for the LRU counter will come in the same cacheline and frequent updates of lru counter will be benefitted from this. Although I don't have any number which proves this. Currently, I want to focus on all the base patches and keep this patch as add on and later if we find its useful and want to pursue this then we will see how to make it better readable. > > Subject: [PATCH v4 5/5] Ensure slru buffer slots are in multiple of numbe of > partitions > > I think the 0005 patch can be merged to 0001. Yeah in the next version, it is done that way. Planning to post end of the day. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Ants Aasma
Date:
On Sat, 4 Nov 2023 at 22:08, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
There was nothing wrong with having too many banks. Until bank-wise locks and counters were added in later patchsets.

On 30 Oct 2023, at 09:20, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> changed the logic of SlruAdjustNSlots() in 0002, such that now it
> starts with the next power of 2 value of the configured slots and
> keeps doubling the number of banks until we reach the number of banks
> to the max SLRU_MAX_BANKS(128) and bank size is bigger than
> SLRU_MIN_BANK_SIZE (8). By doing so, we will ensure we don't have too
> many banks

Having hashtable to find SLRU page in the buffer IMV is too slow. Some comments on this approach can be found here [0]. I'm OK with having HTAB for that if we are sure performance does not degrade significantly, but I really doubt this is the case. I even think SLRU buffers used HTAB in some ancient times, but I could not find commit when it was changed to linear search.

Maybe we could decouple locks and counters from SLRU banks? Banks were meant to be small to exploit performance of local linear search. Lock partitions have to be bigger for sure.
Is there a particular reason why lock partitions need to be bigger? We have one lock per buffer anyway, bankwise locks will increase the number of locks < 10%.
I am working on trying out a SIMD based LRU mechanism that uses a 16 entry bank. The data layout is:
struct CacheBank {
int page_numbers[16];
char access_age[16];
}
The first part uses up one cache line, and the second line has 48 bytes of space left over that could fit a lwlock and page_status, page_dirty arrays.
Lookup + LRU maintenance has 20 instructions/14 cycle latency and the only branch is for found/not found. Hoping to have a working prototype of SLRU on top in the next couple of days.
Regards,
Ants Aasma
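As a reference for readers, here is a hedged scalar rendering of the lookup and LRU maintenance that the 16-entry layout Ants sketches above allows; the actual prototype he describes would do the 16 comparisons and the age updates with SIMD instructions, and everything below is illustrative rather than his code.

#include <stdbool.h>

/* Scalar stand-in for the SIMD bank sketched above; field names follow that sketch. */
struct CacheBank
{
    int  page_numbers[16];      /* one cache line of keys */
    char access_age[16];        /* smaller age = more recently used */
};

/* Look a page up and refresh its age so the bank keeps a total LRU order. */
static bool
cache_bank_lookup(struct CacheBank *bank, int pageno, int *slot)
{
    int hit = -1;

    for (int i = 0; i < 16; i++)
        if (bank->page_numbers[i] == pageno)
            hit = i;

    if (hit < 0)
        return false;           /* caller would evict the entry with the largest age */

    /* Everything that was more recent than the hit entry ages by one,
     * and the hit entry becomes the most recent. */
    for (int i = 0; i < 16; i++)
        if (bank->access_age[i] < bank->access_age[hit])
            bank->access_age[i]++;
    bank->access_age[hit] = 0;

    *slot = hit;
    return true;
}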
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
"Andrey M. Borodin"
Date:
> On 8 Nov 2023, at 14:17, Ants Aasma <ants@cybertec.at> wrote:
>
> Is there a particular reason why lock partitions need to be bigger? We have one lock per buffer anyway, bankwise locks will increase the number of locks < 10%.

The problem was not attracting much attention for some years. So my reasoning was that the solution should not have any costs at all. The initial patchset with banks did not add any memory footprint.

> On 8 Nov 2023, at 14:17, Ants Aasma <ants@cybertec.at> wrote:
>
> I am working on trying out a SIMD based LRU mechanism that uses a 16 entry bank.

FWIW I tried to pack struct parts together to minimize cache lines touched, see step 3 in [0]. So far I could not prove any performance benefits of this approach. But maybe your implementation will be more efficient. Thanks!

Best regards, Andrey Borodin.

[0] https://www.postgresql.org/message-id/93236D36-B91C-4DFA-AF03-99C083840378@yandex-team.ru
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Mon, Nov 6, 2023 at 9:39 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sun, Nov 5, 2023 at 1:37 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
> >
> > Maybe we could decouple locks and counters from SLRU banks? Banks were meant to be small to exploit performance of local linear search. Lock partitions have to be bigger for sure.
>
> Yeah, that could also be an idea if we plan to drop the hash. I mean
> bank-wise counter is fine as we are finding a victim buffer within a
> bank itself, but each lock could cover more slots than one bank size
> or in other words, it can protect multiple banks. Let's hear more
> opinion on this.

Here is the updated version of the patch. Here I have taken the approach suggested by Andrey, and I discussed the same with Alvaro offlist and he also agrees with it. The idea is that we keep the bank size fixed at 16 buffers per bank, and the configured GUC value for each SLRU buffer pool must be a multiple of the bank size. We have removed the centralized lock, but instead of strictly one lock per bank we have put a maximum limit of 128 on the number of bank locks. We kept the max limit at 128 because, in one of the operations (i.e. ActivateCommitTs), we need to acquire all the bank locks (though this is not a performance path at all), and at a time we can acquire a max of 200 LWLocks, so we think this limit of 128 is good. So if the number of banks is <= 128 then we use one lock per bank; otherwise one lock may protect access to buffers in multiple banks.

We might argue that we should keep the max number of locks lower than 128, i.e. 64 or 32, and I am open to that; we can do more experiments with a very large buffer pool and a very heavy workload to see whether having up to 128 locks is helpful or not.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
IMO the whole area of SLRU buffering is in horrible shape and many users are struggling with overall PG performance because of it. An improvement doesn't have to be perfect -- it just has to be much better than the current situation, which should be easy enough. We can continue to improve later, using more scalable algorithms or ones that allow us to raise the limits higher. The only point on which we do not have full consensus yet is the need to have one GUC per SLRU, and a lot of effort seems focused on trying to fix the problem without adding so many GUCs (for example, using shared buffers instead, or use a single "scaling" GUC). I think that hinders progress. Let's just add multiple GUCs, and users can leave most of them alone and only adjust the one with which they have a performance problems; it's not going to be the same one for everybody. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Sallah, I said NO camels! That's FIVE camels; can't you count?" (Indiana Jones)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Robert Haas
Date:
On Wed, Nov 8, 2023 at 6:41 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Here is the updated version of the patch, here I have taken the > approach suggested by Andrey and I discussed the same with Alvaro > offlist and he also agrees with it. So the idea is that we will keep > the bank size fixed which is 16 buffers per bank and the allowed GUC > value for each slru buffer must be in multiple of the bank size. We > have removed the centralized lock but instead of one lock per bank, we > have kept the maximum limit on the number of bank locks which is 128. > We kept the max limit as 128 because, in one of the operations (i.e. > ActivateCommitTs), we need to acquire all the bank locks (but this is > not a performance path at all) and at a time we can acquire a max of > 200 LWlocks, so we think this limit of 128 is good. So now if the > number of banks is <= 128 then we will be using one lock per bank > otherwise the one lock may protect access of buffer in multiple banks. Just so I understand, I guess this means that an SLRU is limited to 16*128 = 2k buffers = 16MB? When we were talking about this earlier, I suggested fixing the number of banks and allowing the number of buffers per bank to scale depending on the setting. That seemed simpler than allowing both the number of banks and the number of buffers to vary, and it might allow the compiler to optimize some code better, by converting a calculation like page_no%number_of_banks into a masking operation like page_no&0xf or whatever. However, because it allows an individual bank to become arbitrarily large, it more or less requires us to use a buffer mapping table. Some of the performance problems mentioned could be alleviated by omitting the hash table when the number of buffers per bank is small, and we could also create the dynahash with a custom hash function that just does modular arithmetic on the page number rather than a real hashing operation. However, maybe we don't really need to do any of that. I agree that dynahash is clunky on a good day. I hadn't realized the impact would be so noticeable. This proposal takes the opposite approach of fixing the number of buffers per bank, letting the number of banks vary. I think that's probably fine, although it does reduce the effective associativity of the cache. If there are more hot buffers in a bank than the bank size, the bank will be contended, even if other banks are cold. However, given the way SLRUs are accessed, it seems hard to imagine this being a real problem in practice. There aren't likely to be say 20 hot buffers that just so happen to all be separated from one another by a number of pages that is a multiple of the configured number of banks. And in the seemingly very unlikely event that you have a workload that behaves like that, you could always adjust the number of banks up or down by one, and the problem would go away. So this seems OK to me. I also agree with a couple of points that Alvaro made, specifically that (1) this doesn't have to be perfect, just better than now and (2) separate GUCs for each SLRU is fine. On the latter point, it's worth keeping in mind that the cost of a GUC that most people don't need to tune is fairly low. GUCs like work_mem and shared_buffers are "expensive" because everybody more or less needs to understand what they are and how to set them and getting the right value can tricky -- but a GUC like autovacuum_naptime is a lot cheaper, because almost nobody needs to change it. 
It seems to me that these GUCs will fall into the latter category. Users can hopefully just ignore them except if they see a contention on the SLRU bank locks -- and then they can consider increasing the number of banks for that particular SLRU. That seems simple enough. As with autovacuum_naptime, there is a danger that people will configure a ridiculous value of the parameter for no good reason and get bad results, so it would be nice if someday we had a magical system that just got all of this right without the user needing to configure anything. But in the meantime, it's better to have a somewhat manual system to relieve pressure on these locks than no system at all. -- Robert Haas EDB: http://www.enterprisedb.com
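A small aside on the modulo-versus-mask point above: the two forms are interchangeable only when the divisor is a power of two, which is why fixing the bank count (or rounding it to a power of two) lets the compiler emit the cheaper masking form. A hedged illustration, not patch code:

/* Illustration only; assumes nbanks > 0. */
static inline int
bank_of(int pageno, int nbanks)
{
    if ((nbanks & (nbanks - 1)) == 0)   /* power of two? */
        return pageno & (nbanks - 1);   /* e.g. pageno & 0xf when nbanks == 16 */
    return pageno % nbanks;             /* general case, needs a division */
}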
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Thu, Nov 9, 2023 at 9:39 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Nov 8, 2023 at 6:41 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > Here is the updated version of the patch, here I have taken the > > approach suggested by Andrey and I discussed the same with Alvaro > > offlist and he also agrees with it. So the idea is that we will keep > > the bank size fixed which is 16 buffers per bank and the allowed GUC > > value for each slru buffer must be in multiple of the bank size. We > > have removed the centralized lock but instead of one lock per bank, we > > have kept the maximum limit on the number of bank locks which is 128. > > We kept the max limit as 128 because, in one of the operations (i.e. > > ActivateCommitTs), we need to acquire all the bank locks (but this is > > not a performance path at all) and at a time we can acquire a max of > > 200 LWlocks, so we think this limit of 128 is good. So now if the > > number of banks is <= 128 then we will be using one lock per bank > > otherwise the one lock may protect access of buffer in multiple banks. > > Just so I understand, I guess this means that an SLRU is limited to > 16*128 = 2k buffers = 16MB? Not really, because 128 is the maximum limit on the number of bank locks not on the number of banks. So for example, if you have 16*128 = 2k buffers then each lock will protect one bank, and likewise when you have 16 * 512 = 8k buffers then each lock will protect 4 banks. So in short we can get the lock for each bank by simple computation (banklockno = bankno % 128 ) > When we were talking about this earlier, I suggested fixing the number > of banks and allowing the number of buffers per bank to scale > depending on the setting. That seemed simpler than allowing both the > number of banks and the number of buffers to vary, and it might allow > the compiler to optimize some code better, by converting a calculation > like page_no%number_of_banks into a masking operation like page_no&0xf > or whatever. However, because it allows an individual bank to become > arbitrarily large, it more or less requires us to use a buffer mapping > table. Some of the performance problems mentioned could be alleviated > by omitting the hash table when the number of buffers per bank is > small, and we could also create the dynahash with a custom hash > function that just does modular arithmetic on the page number rather > than a real hashing operation. However, maybe we don't really need to > do any of that. I agree that dynahash is clunky on a good day. I > hadn't realized the impact would be so noticeable. Yes, so one idea is that we keep the number of banks fixed and with that, as you pointed out correctly with a large number of buffers, the bank size can be quite big and for that, we would need a hash table and OTOH what I am doing here is keeping the bank size fixed and smaller (16 buffers per bank) and with that we can have large numbers of the bank when the buffer pool size is quite big. But I feel having more banks is not really a problem if we grow the number of locks beyond a certain limit as in some corner cases we need to acquire all locks together and there is a limit on that. 
So I like this idea of sharing locks across the banks with that 1) We can have enough locks so that lock contention or cache invalidation due to a common lock should not be a problem anymore 2) We can keep a small bank size with that seq search within the bank is quite fast so reads are fast 3) With small bank size victim buffer search which has to be sequential is quite fast. > This proposal takes the opposite approach of fixing the number of > buffers per bank, letting the number of banks vary. I think that's > probably fine, although it does reduce the effective associativity of > the cache. If there are more hot buffers in a bank than the bank size, > the bank will be contended, even if other banks are cold. However, > given the way SLRUs are accessed, it seems hard to imagine this being > a real problem in practice. There aren't likely to be say 20 hot > buffers that just so happen to all be separated from one another by a > number of pages that is a multiple of the configured number of banks. > And in the seemingly very unlikely event that you have a workload that > behaves like that, you could always adjust the number of banks up or > down by one, and the problem would go away. So this seems OK to me. I agree with this > I also agree with a couple of points that Alvaro made, specifically > that (1) this doesn't have to be perfect, just better than now and (2) > separate GUCs for each SLRU is fine. On the latter point, it's worth > keeping in mind that the cost of a GUC that most people don't need to > tune is fairly low. GUCs like work_mem and shared_buffers are > "expensive" because everybody more or less needs to understand what > they are and how to set them and getting the right value can tricky -- > but a GUC like autovacuum_naptime is a lot cheaper, because almost > nobody needs to change it. It seems to me that these GUCs will fall > into the latter category. Users can hopefully just ignore them except > if they see a contention on the SLRU bank locks -- and then they can > consider increasing the number of banks for that particular SLRU. That > seems simple enough. As with autovacuum_naptime, there is a danger > that people will configure a ridiculous value of the parameter for no > good reason and get bad results, so it would be nice if someday we had > a magical system that just got all of this right without the user > needing to configure anything. But in the meantime, it's better to > have a somewhat manual system to relieve pressure on these locks than > no system at all. +1 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
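To spell out the arithmetic described in the last two messages, here is a sketch of the page-to-bank and bank-to-lock mapping; the constant names are illustrative and may not match the patch.

#define SLRU_BANK_SIZE      16      /* buffers per bank, fixed by the proposal */
#define SLRU_MAX_BANKLOCKS  128     /* upper bound on the number of bank locks */

/* A page maps to a bank, and a bank maps to one of at most 128 locks.
 * With <= 128 banks this gives one lock per bank; with more banks one
 * lock covers several of them (e.g. 512 banks -> 4 banks per lock). */
static inline int
slru_bank_of(int pageno, int nbanks)
{
    return pageno % nbanks;
}

static inline int
slru_banklock_of(int bankno)
{
    return bankno % SLRU_MAX_BANKLOCKS;
}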
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Thu, Nov 9, 2023 at 4:55 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > IMO the whole area of SLRU buffering is in horrible shape and many users > are struggling with overall PG performance because of it. An > improvement doesn't have to be perfect -- it just has to be much better > than the current situation, which should be easy enough. We can > continue to improve later, using more scalable algorithms or ones that > allow us to raise the limits higher. I agree with this. > The only point on which we do not have full consensus yet is the need to > have one GUC per SLRU, and a lot of effort seems focused on trying to > fix the problem without adding so many GUCs (for example, using shared > buffers instead, or use a single "scaling" GUC). I think that hinders > progress. Let's just add multiple GUCs, and users can leave most of > them alone and only adjust the one with which they have a performance > problems; it's not going to be the same one for everybody. +1 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Nathan Bossart
Date:
On Fri, Nov 10, 2023 at 10:17:49AM +0530, Dilip Kumar wrote: > On Thu, Nov 9, 2023 at 4:55 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: >> The only point on which we do not have full consensus yet is the need to >> have one GUC per SLRU, and a lot of effort seems focused on trying to >> fix the problem without adding so many GUCs (for example, using shared >> buffers instead, or use a single "scaling" GUC). I think that hinders >> progress. Let's just add multiple GUCs, and users can leave most of >> them alone and only adjust the one with which they have a performance >> problems; it's not going to be the same one for everybody. > > +1 +1 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
I just noticed that 0003 does some changes to TransactionGroupUpdateXidStatus() that haven't been adequately explained AFAICS. How do you know that these changes are safe? 0001 contains one typo in the docs, "cotents". I'm not a fan of the fact that some CLOG sizing macros moved to clog.h, leaving others in clog.c. Maybe add commentary cross-linking both. Alternatively, perhaps allowing xact_buffers to grow beyond 65536 up to the slru.h-defined limit of 131072 is not that bad, even if it's more than could possibly be needed for xact_buffers; nobody is going to use 64k buffers, since useful values are below a couple thousand anyhow. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ Tom: There seems to be something broken here. Teodor: I'm in sackcloth and ashes... Fixed. http://postgr.es/m/482D1632.8010507@sigaev.ru
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Thu, Nov 16, 2023 at 3:11 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > I just noticed that 0003 does some changes to > TransactionGroupUpdateXidStatus() that haven't been adequately > explained AFAICS. How do you know that these changes are safe? IMHO this is safe as well as logical to do w.r.t. performance. It's safe because whenever we are updating any page in a group we are acquiring the respective bank lock in exclusive mode and in extreme cases if there are pages from different banks then we do switch the lock as well before updating the pages from different groups. And, we do not wake any process in a group unless we have done the status update for all the processes so there could not be any race condition as well. Also, It should not affect the performance adversely as well and this will not remove the need for group updates. The main use case of group update is that it will optimize a situation when most of the processes are contending for status updates on the same page and processes that are waiting for status updates on different pages will go to different groups w.r.t. that page, so in short in a group on best effort basis we are trying to have the processes which are waiting to update the same clog page that mean logically all the processes in the group will be waiting on the same bank lock. In an extreme situation if there are processes in the group that are trying to update different pages or even pages from different banks then we are handling it well by changing the lock. Although someone may raise a concern that in cases where there are processes that are waiting for different bank locks then after releasing one lock why not wake up those processes, I think that is not required because that is the situation we are trying to avoid where there are processes trying to update different are in the same group so there is no point in adding complexity to optimize that case. > 0001 contains one typo in the docs, "cotents". > > I'm not a fan of the fact that some CLOG sizing macros moved to clog.h, > leaving others in clog.c. Maybe add commentary cross-linking both. > Alternatively, perhaps allowing xact_buffers to grow beyond 65536 up to > the slru.h-defined limit of 131072 is not that bad, even if it's more > than could possibly be needed for xact_buffers; nobody is going to use > 64k buffers, since useful values are below a couple thousand anyhow. I agree, that allowing xact_buffers to grow beyond 65536 up to the slru.h-defined limit of 131072 is not that bad, so I will change that in the next version. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
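A very rough sketch of the lock-switching behaviour described above, to make the argument easier to follow. This is not the patch's code: the flattened member list here is hypothetical (the real code walks a PGPROC queue), and SimpleLruGetBankLock() is the helper name discussed later in this thread.

#include "postgres.h"
#include "access/slru.h"        /* SlruCtl; SimpleLruGetBankLock() per this patch */
#include "storage/lwlock.h"     /* LWLockAcquire(), LWLockRelease() */

/* Hypothetical flattened view of one queued group member. */
typedef struct GroupMemberSketch
{
    TransactionId xid;
    int           pageno;
} GroupMemberSketch;

/* The leader keeps holding one bank lock while consecutive members hit the
 * same bank, and switches locks only when a member's page lives elsewhere.
 * Followers are woken only after every status update is done. */
static void
group_update_sketch(SlruCtl ctl, GroupMemberSketch *members, int nmembers)
{
    LWLock *curlock = NULL;

    for (int i = 0; i < nmembers; i++)
    {
        LWLock *banklock = SimpleLruGetBankLock(ctl, members[i].pageno);

        if (banklock != curlock)
        {
            if (curlock != NULL)
                LWLockRelease(curlock);
            LWLockAcquire(banklock, LW_EXCLUSIVE);
            curlock = banklock;
        }
        /* ... set the commit status of members[i].xid on its CLOG page ... */
    }

    if (curlock != NULL)
        LWLockRelease(curlock);
}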
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Fri, Nov 17, 2023 at 1:09 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Nov 16, 2023 at 3:11 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: PFA, updated patch version, this fixes the comment given by Alvaro and also improves some of the comments. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
In SlruSharedData, a new comment is added that starts: "Instead of global counter we maintain a bank-wise lru counter because ..." You don't need to explain what we don't do. Just explain what we do do. So remove the words "Instead of a global counter" from there, because they offer no wisdom. Same with the phrase "so there is no point to ...". I think "The oldest page is therefore" should say "The oldest page *in the bank* is therefore", for extra clarity. I wonder what's the deal with false sharing in the new bank_cur_lru_count array. Maybe instead of using LWLockPadded for bank_locks, we should have a new struct, with both the LWLock and the LRU counter; then pad *that* to the cacheline size. This way, both the lwlock and the counter come to the CPU running this code together. Looking at SlruRecentlyUsed, which was a macro and is now a function. The comment about "multiple evaluation of arguments" no longer applies, so it needs to be removed; and it shouldn't talk about itself being a macro. Using "Size" as type for bank_mask looks odd. For a bitmask, maybe it's be more appropriate to use bits64 if we do need a 64-bit mask (we don't have bits64, but it's easy to add a typedef). I bet we don't really need a 64 bit mask, and a 32-bit or even a 16-bit is sufficient, given the other limitations on number of buffers. I think SimpleLruReadPage should have this assert at start: + Assert(LWLockHeldByMe(SimpleLruGetSLRUBankLock(ctl, pageno))); Do we really need one separate lwlock tranche for each SLRU? -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "Cuando mañana llegue pelearemos segun lo que mañana exija" (Mowgli)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
On 2023-Nov-17, Dilip Kumar wrote: > On Thu, Nov 16, 2023 at 3:11 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > I just noticed that 0003 does some changes to > > TransactionGroupUpdateXidStatus() that haven't been adequately > > explained AFAICS. How do you know that these changes are safe? > > IMHO this is safe as well as logical to do w.r.t. performance. It's > safe because whenever we are updating any page in a group we are > acquiring the respective bank lock in exclusive mode and in extreme > cases if there are pages from different banks then we do switch the > lock as well before updating the pages from different groups. Looking at the coverage for this code, https://coverage.postgresql.org/src/backend/access/transam/clog.c.gcov.html#413 it seems in our test suites we never hit the case where there is anything in the "nextidx" field for commit groups. To be honest, I don't understand this group stuff, and so I'm doubly hesitant to go without any testing here. Maybe it'd be possible to use Michael Paquier's injection points somehow? I think in the code comments where you use "w.r.t.", that acronym can be replaced with "for", which improves readability. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "All rings of power are equal, But some rings of power are more equal than others." (George Orwell's The Lord of the Rings)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Fri, Nov 17, 2023 at 6:16 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: Thanks for the review, all comments looks fine to me, replying to those that need some clarification > I wonder what's the deal with false sharing in the new > bank_cur_lru_count array. Maybe instead of using LWLockPadded for > bank_locks, we should have a new struct, with both the LWLock and the > LRU counter; then pad *that* to the cacheline size. This way, both the > lwlock and the counter come to the CPU running this code together. Actually, the array lengths of both LWLock and the LRU counter are different so I don't think we can move them to a common structure. The length of the *buffer_locks array is equal to the number of slots, the length of the *bank_locks array is Min (number_of_banks, 128), and the length of the *bank_cur_lru_count array is number_of_banks. > Looking at the coverage for this code, > https://coverage.postgresql.org/src/backend/access/transam/clog.c.gcov.html#413 > it seems in our test suites we never hit the case where there is > anything in the "nextidx" field for commit groups. To be honest, I > don't understand this group stuff, and so I'm doubly hesitant to go > without any testing here. Maybe it'd be possible to use Michael > Paquier's injection points somehow? Sorry, but I am not aware of "Michael Paquier's injection" Is it something already in the repo? Can you redirect me to some of the example test cases if we already have them? Then I will try it out. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
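For readers keeping track, the three arrays Dilip enumerates have roughly these shapes; this is a sketch of the idea, not the literal struct from the patch.

#include "storage/lwlock.h"     /* LWLockPadded */

/* Sketch of the relevant SlruSharedData members as described above. */
typedef struct SlruSharedDataSketch
{
    int           num_slots;            /* configured buffer count */
    int           num_banks;            /* num_slots / bank size (16) */

    LWLockPadded *buffer_locks;         /* one per slot: num_slots entries */
    LWLockPadded *bank_locks;           /* Min(num_banks, 128) entries */
    int          *bank_cur_lru_count;   /* one LRU counter per bank: num_banks entries */
} SlruSharedDataSketch;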
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
On 2023-Nov-18, Dilip Kumar wrote: > On Fri, Nov 17, 2023 at 6:16 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > I wonder what's the deal with false sharing in the new > > bank_cur_lru_count array. Maybe instead of using LWLockPadded for > > bank_locks, we should have a new struct, with both the LWLock and the > > LRU counter; then pad *that* to the cacheline size. This way, both the > > lwlock and the counter come to the CPU running this code together. > > Actually, the array lengths of both LWLock and the LRU counter are > different so I don't think we can move them to a common structure. > The length of the *buffer_locks array is equal to the number of slots, > the length of the *bank_locks array is Min (number_of_banks, 128), and > the length of the *bank_cur_lru_count array is number_of_banks. Oh. > > Looking at the coverage for this code, > > https://coverage.postgresql.org/src/backend/access/transam/clog.c.gcov.html#413 > > it seems in our test suites we never hit the case where there is > > anything in the "nextidx" field for commit groups. To be honest, I > > don't understand this group stuff, and so I'm doubly hesitant to go > > without any testing here. Maybe it'd be possible to use Michael > > Paquier's injection points somehow? > > Sorry, but I am not aware of "Michael Paquier's injection" Is it > something already in the repo? Can you redirect me to some of the > example test cases if we already have them? Then I will try it out. https://postgr.es/ZVWufO_YKzTJHEHW@paquier.xyz -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Sallah, I said NO camels! That's FIVE camels; can't you count?" (Indiana Jones)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
"Andrey M. Borodin"
Date:
> On 17 Nov 2023, at 16:11, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Nov 17, 2023 at 1:09 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>
>> On Thu, Nov 16, 2023 at 3:11 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> PFA, updated patch version, this fixes the comment given by Alvaro and
> also improves some of the comments.

I’ve skimmed through the patch set. Here are some minor notes.

1. The cycles “for (slotno = bankstart; slotno < bankend; slotno++)” in SlruSelectLRUPage() and SimpleLruReadPage_ReadOnly() now have identical comments. I think a little copy-paste is OK. But SimpleLruReadPage_ReadOnly() does pgstat_count_slru_page_hit(), while SlruSelectLRUPage() does not. This is not related to the patch set, just code nearby.

2. Do we really want these functions all doing the same thing?
extern bool check_multixact_offsets_buffers(int *newval, void **extra, GucSource source);
extern bool check_multixact_members_buffers(int *newval, void **extra, GucSource source);
extern bool check_subtrans_buffers(int *newval, void **extra, GucSource source);
extern bool check_notify_buffers(int *newval, void **extra, GucSource source);
extern bool check_serial_buffers(int *newval, void **extra, GucSource source);
extern bool check_xact_buffers(int *newval, void **extra, GucSource source);
extern bool check_commit_ts_buffers(int *newval, void **extra, GucSource source);

3. The name SimpleLruGetSLRUBankLock() contains the meaning of SLRU twice. I’d suggest truncating the prefix or the infix.

I do not have a hard opinion on any of these items.

Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Sun, Nov 19, 2023 at 12:39 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: > > I’ve skimmed through the patch set. Here are some minor notes. Thanks for the review > > 1. Cycles “for (slotno = bankstart; slotno < bankend; slotno++)” in SlruSelectLRUPage() and SimpleLruReadPage_ReadOnly()now have identical comments. I think a little of copy-paste is OK. > But SimpleLruReadPage_ReadOnly() does pgstat_count_slru_page_hit(), while SlruSelectLRUPage() does not. This is not relatedto the patch set, just a code nearby. Do you mean to say we need to modify the comments or you are saying pgstat_count_slru_page_hit() is missing in SlruSelectLRUPage(), if it is later then I can see the caller of SlruSelectLRUPage() is calling pgstat_count_slru_page_hit() and the SlruRecentlyUsed(). > 2. Do we really want these functions doing all the same? > extern bool check_multixact_offsets_buffers(int *newval, void **extra,GucSource source); > extern bool check_multixact_members_buffers(int *newval, void **extra,GucSource source); > extern bool check_subtrans_buffers(int *newval, void **extra,GucSource source); > extern bool check_notify_buffers(int *newval, void **extra, GucSource source); > extern bool check_serial_buffers(int *newval, void **extra, GucSource source); > extern bool check_xact_buffers(int *newval, void **extra, GucSource source); > extern bool check_commit_ts_buffers(int *newval, void **extra,GucSource source); I tried duplicating these by doing all the work inside the check_slru_buffers() function. But I think it is hard to make them a single function because there is no option to pass an SLRU name in the GUC check hook and IMHO in the check hook we need to print the GUC name, any suggestions on how we can avoid having so many functions? > 3. The name SimpleLruGetSLRUBankLock() contains meaning of SLRU twice. I’d suggest truncating prefix of infix. > > I do not have hard opinion on any of this items. > I prefer SimpleLruGetBankLock() so that it is consistent with other external functions starting with "SimpleLruGet", are you fine with this name? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
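One way the duplication could be reduced, sketched here under the constraint Dilip mentions (the GUC check-hook signature has no slot for a name): keep a shared helper and let each per-SLRU hook pass its own GUC name for the error message. The helper, the validation rule, and the GUC names below are illustrative only.

#include "postgres.h"
#include "utils/guc.h"          /* GucSource, GUC_check_errdetail() */

/* Hypothetical shared helper; each thin wrapper only supplies the name. */
static bool
check_slru_buffers(const char *name, int *newval)
{
    /* Illustrative rule: values must be multiples of the bank size (16). */
    if (*newval % 16 == 0)
        return true;

    GUC_check_errdetail("\"%s\" must be a multiple of 16.", name);
    return false;
}

bool
check_commit_ts_buffers(int *newval, void **extra, GucSource source)
{
    return check_slru_buffers("commit_timestamp_buffers", newval);
}

bool
check_notify_buffers(int *newval, void **extra, GucSource source)
{
    return check_slru_buffers("notify_buffers", newval);
}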
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Fri, Nov 17, 2023 at 7:28 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2023-Nov-17, Dilip Kumar wrote: I think I need some more clarification for some of the review comments > > On Thu, Nov 16, 2023 at 3:11 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > > > I just noticed that 0003 does some changes to > > > TransactionGroupUpdateXidStatus() that haven't been adequately > > > explained AFAICS. How do you know that these changes are safe? > > > > IMHO this is safe as well as logical to do w.r.t. performance. It's > > safe because whenever we are updating any page in a group we are > > acquiring the respective bank lock in exclusive mode and in extreme > > cases if there are pages from different banks then we do switch the > > lock as well before updating the pages from different groups. > > Looking at the coverage for this code, > https://coverage.postgresql.org/src/backend/access/transam/clog.c.gcov.html#413 > it seems in our test suites we never hit the case where there is > anything in the "nextidx" field for commit groups. 1) I was looking into your coverage report and I have attached a screenshot from the same, it seems we do hit the block where nextidx is not INVALID_PGPROCNO, which means there is some other process other than the group leader. Although I have already started exploring the injection point but just wanted to be sure what is your main concern point about the coverage so though of checking that first. 470 : /* 471 : * If the list was not empty, the leader will update the status of our 472 : * XID. It is impossible to have followers without a leader because the 473 : * first process that has added itself to the list will always have 474 : * nextidx as INVALID_PGPROCNO. 475 : */ 476 98 : if (nextidx != INVALID_PGPROCNO) 477 : { 478 2 : int extraWaits = 0; 479 : 480 : /* Sleep until the leader updates our XID status. */ 481 2 : pgstat_report_wait_start(WAIT_EVENT_XACT_GROUP_UPDATE); 482 : for (;;) 483 : { 484 : /* acts as a read barrier */ 485 2 : PGSemaphoreLock(proc->sem); 486 2 : if (!proc->clogGroupMember) 487 2 : break; 488 0 : extraWaits++; 489 : } 2) Do we really need one separate lwlock tranche for each SLRU? IMHO if we use the same lwlock tranche then the wait event will show the same wait event name, right? And that would be confusing for the user, whether we are waiting for Subtransaction or Multixact or anything else. Is my understanding no correct here? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
"Andrey M. Borodin"
Date:
> On 20 Nov 2023, at 13:51, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> 2) Do we really need one separate lwlock tranche for each SLRU?
>
> IMHO if we use the same lwlock tranche then the wait event will show
> the same wait event name, right? And that would be confusing for the
> user, whether we are waiting for Subtransaction or Multixact or
> anything else. Is my understanding no correct here?

If we give a user multiple GUCs to tweak, I think we should give a way to understand which GUC to tweak when they observe wait times.

Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Mon, Nov 20, 2023 at 2:37 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: > > On 20 Nov 2023, at 13:51, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > 2) Do we really need one separate lwlock tranche for each SLRU? > > > > IMHO if we use the same lwlock tranche then the wait event will show > > the same wait event name, right? And that would be confusing for the > > user, whether we are waiting for Subtransaction or Multixact or > > anything else. Is my understanding no correct here? > > If we give to a user multiple GUCs to tweak, I think we should give a way to understand which GUC to tweak when they observewait times. +1 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Mon, Nov 20, 2023 at 4:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Nov 20, 2023 at 2:37 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: > > > > On 20 Nov 2023, at 13:51, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > 2) Do we really need one separate lwlock tranche for each SLRU? > > > > > > IMHO if we use the same lwlock tranche then the wait event will show > > > the same wait event name, right? And that would be confusing for the > > > user, whether we are waiting for Subtransaction or Multixact or > > > anything else. Is my understanding no correct here? > > > > If we give to a user multiple GUCs to tweak, I think we should give a way to understand which GUC to tweak when theyobserve wait times. PFA, updated patch set, I have worked on review comments by Alvaro and Andrey. So the only open comments are about clog group commit testing, for that my question was as I sent in the previous email exactly what part we are worried about in the coverage report? The second point is, if we want to generate a group update we will have to create the injection point after we hold the control lock so that other processes go for group update and then for waking up the waiting process who is holding the SLRU control lock in the exclusive mode we would need to call a function ('test_injection_points_wake()') to wake that up and for calling the function we would need to again acquire the SLRU lock in read mode for visibility check in the catalog for fetching the procedure row and now this wake up session will block on control lock for the session which is waiting on injection point so now it will create a deadlock. Maybe with bank-wise lock we can create a lot of transaction so that these 2 falls in different banks and then we can somehow test this, but then we will have to generate 16 * 4096 = 64k transaction so that the SLRU banks are different for the transaction which inserted procedure row in system table from the transaction in which we are trying to do the group commit -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Tue, Nov 21, 2023 at 2:03 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Nov 20, 2023 at 4:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Mon, Nov 20, 2023 at 2:37 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: > > > > > > On 20 Nov 2023, at 13:51, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > 2) Do we really need one separate lwlock tranche for each SLRU? > > > > > > > > IMHO if we use the same lwlock tranche then the wait event will show > > > > the same wait event name, right? And that would be confusing for the > > > > user, whether we are waiting for Subtransaction or Multixact or > > > > anything else. Is my understanding no correct here? > > > > > > If we give to a user multiple GUCs to tweak, I think we should give a way to understand which GUC to tweak when theyobserve wait times. > > PFA, updated patch set, I have worked on review comments by Alvaro and > Andrey. So the only open comments are about clog group commit > testing, for that my question was as I sent in the previous email > exactly what part we are worried about in the coverage report? > > The second point is, if we want to generate a group update we will > have to create the injection point after we hold the control lock so > that other processes go for group update and then for waking up the > waiting process who is holding the SLRU control lock in the exclusive > mode we would need to call a function ('test_injection_points_wake()') > to wake that up and for calling the function we would need to again > acquire the SLRU lock in read mode for visibility check in the catalog > for fetching the procedure row and now this wake up session will block > on control lock for the session which is waiting on injection point so > now it will create a deadlock. Maybe with bank-wise lock we can > create a lot of transaction so that these 2 falls in different banks > and then we can somehow test this, but then we will have to generate > 16 * 4096 = 64k transaction so that the SLRU banks are different for > the transaction which inserted procedure row in system table from the > transaction in which we are trying to do the group commit I have attached a POC patch for testing the group update using the injection point framework. This is just for testing the group update part and is not yet a committable test. I have added a bunch of logs in the code so that we can see what's going on with the group update. From the below logs, we can see that multiple processes are getting accumulated for the group update and the leader is updating their xid status. Note: With this testing, we have found a bug in the bank-wise approach, basically we are clearing a procglobal->clogGroupFirst, even before acquiring the bank lock that means in most of the cases there will be a single process in each group as a group leader (I think this is what Alvaro was pointing in his coverage report). I have added this fix in this POC just for testing purposes but in my next version I will add this fix to my proper patch version after a proper review and a bit more testing. 
here is the output after running the test ============== 2023-11-23 05:55:29.399 UTC [93367] 003_clog_group_commit.pl LOG: procno 6 got the lock 2023-11-23 05:55:29.399 UTC [93367] 003_clog_group_commit.pl STATEMENT: SELECT txid_current(); 2023-11-23 05:55:29.406 UTC [93369] 003_clog_group_commit.pl LOG: statement: SELECT test_injection_points_attach('ClogGroupCommit', 'wait'); 2023-11-23 05:55:29.415 UTC [93371] 003_clog_group_commit.pl LOG: statement: INSERT INTO test VALUES(1); 2023-11-23 05:55:29.416 UTC [93371] 003_clog_group_commit.pl LOG: procno 4 got the lock 2023-11-23 05:55:29.416 UTC [93371] 003_clog_group_commit.pl STATEMENT: INSERT INTO test VALUES(1); 2023-11-23 05:55:29.424 UTC [93373] 003_clog_group_commit.pl LOG: statement: INSERT INTO test VALUES(2); 2023-11-23 05:55:29.425 UTC [93373] 003_clog_group_commit.pl LOG: procno 3 for xid 128742 added for group update 2023-11-23 05:55:29.425 UTC [93373] 003_clog_group_commit.pl STATEMENT: INSERT INTO test VALUES(2); 2023-11-23 05:55:29.431 UTC [93376] 003_clog_group_commit.pl LOG: statement: INSERT INTO test VALUES(3); 2023-11-23 05:55:29.438 UTC [93378] 003_clog_group_commit.pl LOG: statement: INSERT INTO test VALUES(4); 2023-11-23 05:55:29.438 UTC [93376] 003_clog_group_commit.pl LOG: procno 2 for xid 128743 added for group update 2023-11-23 05:55:29.438 UTC [93376] 003_clog_group_commit.pl STATEMENT: INSERT INTO test VALUES(3); 2023-11-23 05:55:29.438 UTC [93376] 003_clog_group_commit.pl LOG: procno 2 is follower and wait for group leader to update commit status of xid 128743 2023-11-23 05:55:29.438 UTC [93376] 003_clog_group_commit.pl STATEMENT: INSERT INTO test VALUES(3); 2023-11-23 05:55:29.439 UTC [93378] 003_clog_group_commit.pl LOG: procno 1 for xid 128744 added for group update 2023-11-23 05:55:29.439 UTC [93378] 003_clog_group_commit.pl STATEMENT: INSERT INTO test VALUES(4); 2023-11-23 05:55:29.439 UTC [93378] 003_clog_group_commit.pl LOG: procno 1 is follower and wait for group leader to update commit status of xid 128744 2023-11-23 05:55:29.439 UTC [93378] 003_clog_group_commit.pl STATEMENT: INSERT INTO test VALUES(4); 2023-11-23 05:55:29.445 UTC [93380] 003_clog_group_commit.pl LOG: statement: INSERT INTO test VALUES(5); 2023-11-23 05:55:29.446 UTC [93380] 003_clog_group_commit.pl LOG: procno 0 for xid 128745 added for group update 2023-11-23 05:55:29.446 UTC [93380] 003_clog_group_commit.pl STATEMENT: INSERT INTO test VALUES(5); 2023-11-23 05:55:29.446 UTC [93380] 003_clog_group_commit.pl LOG: procno 0 is follower and wait for group leader to update commit status of xid 128745 2023-11-23 05:55:29.446 UTC [93380] 003_clog_group_commit.pl STATEMENT: INSERT INTO test VALUES(5); 2023-11-23 05:55:29.451 UTC [93382] 003_clog_group_commit.pl LOG: statement: SELECT test_injection_points_wake(); 2023-11-23 05:55:29.460 UTC [93384] 003_clog_group_commit.pl LOG: statement: SELECT test_injection_points_detach('ClogGroupCommit'); ============= -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Thu, Nov 23, 2023 at 11:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > Note: With this testing, we have found a bug in the bank-wise > approach, basically we are clearing a procglobal->clogGroupFirst, even > before acquiring the bank lock that means in most of the cases there > will be a single process in each group as a group leader I realized that the bug fix I have done is not proper, so will send the updated patch set with the proper fix soon. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Fri, Nov 24, 2023 at 10:17 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Nov 23, 2023 at 11:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > Note: With this testing, we have found a bug in the bank-wise > > approach, basically we are clearing a procglobal->clogGroupFirst, even > > before acquiring the bank lock that means in most of the cases there > > will be a single process in each group as a group leader > > I realized that the bug fix I have done is not proper, so will send > the updated patch set with the proper fix soon. PFA, updated patch set fixes the bug found during the testing of the group update using the injection point. Also attached a path to test the injection point but for that, we need to apply the injection point patches [1] [1] https://www.postgresql.org/message-id/ZWACtHPetBFIvP61%40paquier.xyz -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
tender wang
Date:
The v8-0001 patch failed to apply in my local repo as below:
git apply v8-0001-Make-all-SLRU-buffer-sizes-configurable.patch
error: patch failed: src/backend/access/transam/multixact.c:1851
error: src/backend/access/transam/multixact.c: patch does not apply
error: patch failed: src/backend/access/transam/subtrans.c:184
error: src/backend/access/transam/subtrans.c: patch does not apply
error: patch failed: src/backend/commands/async.c:117
error: src/backend/commands/async.c: patch does not apply
error: patch failed: src/backend/storage/lmgr/predicate.c:808
error: src/backend/storage/lmgr/predicate.c: patch does not apply
error: patch failed: src/include/commands/async.h:15
error: src/include/commands/async.h: patch does not apply
My local head commit is 15c9ac36299. Is there something I missed?
Dilip Kumar <dilipbalaut@gmail.com> 于2023年11月24日周五 17:08写道:
On Fri, Nov 24, 2023 at 10:17 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Nov 23, 2023 at 11:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > Note: With this testing, we have found a bug in the bank-wise
> > approach, basically we are clearing a procglobal->clogGroupFirst, even
> > before acquiring the bank lock that means in most of the cases there
> > will be a single process in each group as a group leader
>
> I realized that the bug fix I have done is not proper, so will send
> the updated patch set with the proper fix soon.
PFA, updated patch set fixes the bug found during the testing of the
group update using the injection point. Also attached a path to test
the injection point but for that, we need to apply the injection point
patches [1]
[1] https://www.postgresql.org/message-id/ZWACtHPetBFIvP61%40paquier.xyz
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
On 2023-Nov-29, tender wang wrote: > The v8-0001 patch failed to apply in my local repo as below: > > git apply v8-0001-Make-all-SLRU-buffer-sizes-configurable.patch > error: patch failed: src/backend/access/transam/multixact.c:1851 > error: src/backend/access/transam/multixact.c: patch does not apply > error: patch failed: src/backend/access/transam/subtrans.c:184 > error: src/backend/access/transam/subtrans.c: patch does not apply > error: patch failed: src/backend/commands/async.c:117 > error: src/backend/commands/async.c: patch does not apply > error: patch failed: src/backend/storage/lmgr/predicate.c:808 > error: src/backend/storage/lmgr/predicate.c: patch does not apply > error: patch failed: src/include/commands/async.h:15 > error: src/include/commands/async.h: patch does not apply Yeah, this patch series conflicts with today's commit 4ed8f0913bfd. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ Syntax error: function hell() needs an argument. Please choose what hell you want to involve.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Wed, Nov 29, 2023 at 3:29 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2023-Nov-29, tender wang wrote: > > > The v8-0001 patch failed to apply in my local repo as below: > > > > git apply v8-0001-Make-all-SLRU-buffer-sizes-configurable.patch > > error: patch failed: src/backend/access/transam/multixact.c:1851 > > error: src/backend/access/transam/multixact.c: patch does not apply > > error: patch failed: src/backend/access/transam/subtrans.c:184 > > error: src/backend/access/transam/subtrans.c: patch does not apply > > error: patch failed: src/backend/commands/async.c:117 > > error: src/backend/commands/async.c: patch does not apply > > error: patch failed: src/backend/storage/lmgr/predicate.c:808 > > error: src/backend/storage/lmgr/predicate.c: patch does not apply > > error: patch failed: src/include/commands/async.h:15 > > error: src/include/commands/async.h: patch does not apply > > Yeah, this patch series conflicts with today's commit 4ed8f0913bfd. I will send a rebased version by tomorrow. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Wed, Nov 29, 2023 at 4:58 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Nov 29, 2023 at 3:29 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > On 2023-Nov-29, tender wang wrote: > > > > > The v8-0001 patch failed to apply in my local repo as below: > > > > > > git apply v8-0001-Make-all-SLRU-buffer-sizes-configurable.patch > > > error: patch failed: src/backend/access/transam/multixact.c:1851 > > > error: src/backend/access/transam/multixact.c: patch does not apply > > > error: patch failed: src/backend/access/transam/subtrans.c:184 > > > error: src/backend/access/transam/subtrans.c: patch does not apply > > > error: patch failed: src/backend/commands/async.c:117 > > > error: src/backend/commands/async.c: patch does not apply > > > error: patch failed: src/backend/storage/lmgr/predicate.c:808 > > > error: src/backend/storage/lmgr/predicate.c: patch does not apply > > > error: patch failed: src/include/commands/async.h:15 > > > error: src/include/commands/async.h: patch does not apply > > > > Yeah, this patch series conflicts with today's commit 4ed8f0913bfd. > > I will send a rebased version by tomorrow. PFA, a rebased version of the patch, I have avoided attaching because a) that patch is POC to show the coverage and it has a dependency on the other thread b) the old patch still applies so it doesn't need rebase. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Thu, Nov 30, 2023 at 3:30 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Nov 29, 2023 at 4:58 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: Here is the updated patch based on some comments by tender wang (those comments were sent only to me) -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
[Added Andrey again in CC, because as I understand they are using this code or something like it in production. Please don't randomly remove people from CC lists.] I've been looking at this some more, and I'm not confident in that the group clog update stuff is correct. I think the injection points test case was good enough to discover a problem, but it's hard to get peace of mind that there aren't other, more subtle problems. The problem I see is that the group update mechanism is designed around contention of the global xact-SLRU control lock; it uses atomics to coordinate a single queue when the lock is contended. So if we split up the global SLRU control lock using banks, then multiple processes using different bank locks might not contend. OK, this is fine, but what happens if two separate groups of processes encounter contention on two different bank locks? I think they will both try to update the same queue, and coordinate access to that *using different bank locks*. I don't see how can this work correctly. I suspect the first part of that algorithm, where atomics are used to create the list without a lock, might work fine. But will each "leader" process, each of which is potentially using a different bank lock, coordinate correctly? Maybe this works correctly because only one process will find the queue head not empty? If this is what happens, then there needs to be comments about it. Without any explanation, this seems broken and potentially dangerous, as some transaction commit bits might become lost given high enough concurrency and bad luck. Maybe this can be made to work by having one more lwlock that we use solely to coordinate this task. Though we would have to demonstrate that coordinating this task with a different lock works correctly in conjunction with the per-bank lwlock usage in the regular slru.c paths. Andrey, do you have any stress tests or anything else that you used to gain confidence in this code? -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "El sabio habla porque tiene algo que decir; el tonto, porque tiene que decir algo" (Platon).
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Tue, Dec 12, 2023 at 6:58 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > [Added Andrey again in CC, because as I understand they are using this > code or something like it in production. Please don't randomly remove > people from CC lists.] Oh, glad to know that. Yeah, I generally do not remove people, but I have noticed that in the mail chain some of the reviewers replied only to me and the hackers' list, and from that point onwards I lost track of the CC list. > I've been looking at this some more, and I'm not confident in that the > group clog update stuff is correct. I think the injection points test > case was good enough to discover a problem, but it's hard to get peace > of mind that there aren't other, more subtle problems. Yeah, I agree. > The problem I see is that the group update mechanism is designed around > contention of the global xact-SLRU control lock; it uses atomics to > coordinate a single queue when the lock is contended. So if we split up > the global SLRU control lock using banks, then multiple processes using > different bank locks might not contend. OK, this is fine, but what > happens if two separate groups of processes encounter contention on two > different bank locks? I think they will both try to update the same > queue, and coordinate access to that *using different bank locks*. I > don't see how can this work correctly. Let's back up a bit and start from the current design with the centralized lock. With that design, if one process is holding the lock, the other processes will try to perform the group update; but if there is already a group that still hasn't got the lock and is trying to update a different CLOG page than the one this process wants to update, then the process will not add itself to the group update and will instead fall back to the normal lock wait. Now, in another situation, it may happen that the group leader of the other group has already got the control lock, in which case it would have cleared 'procglobal->clogGroupFirst', which means we will now start forming a different group. So, logically, if we talk only about the optimization part, the assumption is that when a lot of concurrent xids are being committed at the same time, those xids are mostly in the same range and will fall on the same SLRU page, so the group update will help them. But if we get an out-of-range xid of some long-running transaction, it might not even go for the group update because the page number will be different. Although the situation might be better here with bank-wise locks, because if those xids fall in an altogether different bank they might not even contend. Now, let's talk about correctness. I think even though we are getting processes that might be contending on different bank locks, we are still ensuring that in a group all the processes are trying to update the same SLRU page (i.e. the same bank as well; we will talk about the exception later [1]). One of the processes becomes the leader, and as soon as the leader gets the lock it detaches the queue from 'procglobal->clogGroupFirst' by setting it to INVALID_PGPROCNO, so that other group-update requesters can now form another parallel group. But here I do not see a problem with correctness. I agree someone might say that, since there is now a possibility that different groups get formed for different bank locks, no other group can start forming until we get the lock for our bank, because we do not clear 'procglobal->clogGroupFirst' before we get the lock.
Other requesters might want to update pages in different banks, so why block them? But the thing is, the group update design is optimized for the cases where all requesters are trying to update the status of xids generated in nearly the same range. > I suspect the first part of that algorithm, where atomics are used to > create the list without a lock, might work fine. But will each "leader" > process, each of which is potentially using a different bank lock, > coordinate correctly? Maybe this works correctly because only one > process will find the queue head not empty? If this is what happens, > then there needs to be comments about it. Yes, you got it right; I will try to comment on it better. > Without any explanation, > this seems broken and potentially dangerous, as some transaction commit > bits might become lost given high enough concurrency and bad luck. > > Maybe this can be made to work by having one more lwlock that we use > solely to coordinate this task. Do you mean a different lock for adding/removing entries in the list instead of the atomic operation? I think then we would lose the benefit we got from the group update, by having contention on yet another lock. [1] I think we already know about the exception case, and I have explained in the comments as well that in some cases we might add different clog page update requests to the same group; to handle that exceptional case we check the respective bank lock for each page, and if the exception occurs we release the old bank lock and acquire the new one. This case might not be performant, because now it is possible that after getting the lock the leader might need to wait again on another bank lock, but this is an extremely exceptional case so we should not be worried about its performance, and I do not see any correctness issue here either. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
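To make the queue mechanics described above easier to follow, here is a minimal standalone C sketch of the lock-free group-update list. The names (join_group, detach_group, groupFirst) are illustrative rather than the actual clog.c identifiers, and the real code also records the target page for each waiter and takes the appropriate bank lock before servicing the detached list.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define NPROCS       8
#define INVALID_PROC (-1)

typedef struct
{
    int xid;    /* request payload: xid whose status the leader must set */
    int next;   /* next proc in the group list, or INVALID_PROC */
} Proc;

static Proc procs[NPROCS];
static atomic_int groupFirst = INVALID_PROC;  /* analogue of procglobal->clogGroupFirst */

/* Push ourselves onto the group list; return true if we became the leader. */
static bool
join_group(int myproc, int xid)
{
    int head = atomic_load(&groupFirst);

    procs[myproc].xid = xid;
    do
    {
        procs[myproc].next = head;
    } while (!atomic_compare_exchange_weak(&groupFirst, &head, myproc));

    /* The process that found the list empty is the group leader. */
    return head == INVALID_PROC;
}

/* Leader: detach the whole list so that a new group can start forming. */
static int
detach_group(void)
{
    return atomic_exchange(&groupFirst, INVALID_PROC);
}

int
main(void)
{
    /* simulate three backends queuing requests; only the first is leader */
    printf("proc 0 leader? %d\n", join_group(0, 100));
    printf("proc 1 leader? %d\n", join_group(1, 101));
    printf("proc 2 leader? %d\n", join_group(2, 102));

    /* the leader walks the detached list and services every request */
    for (int p = detach_group(); p != INVALID_PROC; p = procs[p].next)
        printf("leader updates status of xid %d (proc %d)\n", procs[p].xid, p);

    return 0;
}

The single atomic head is what lets two leaders holding different bank locks coexist safely in this sketch: each leader only ever services the list it detached.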
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
"Andrey M. Borodin"
Date:
> On 12 Dec 2023, at 18:28, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > Andrey, do you have any stress tests or anything else that you used to > gain confidence in this code? We are using only first two steps of the patchset, these steps do not touch locking stuff. We’ve got some confidence after Yura Sokolov’s benchmarks [0]. Thanks! Best regards, Andrey Borodin. [0] https://www.postgresql.org/message-id/flat/e46cdea96979545b2d8a13b451d8b1ce61dc7238.camel%40postgrespro.ru#0ed2cad11470825d464093fe6b8ef6a3
On Mon, Dec 11, 2023 at 10:42 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Thu, Nov 30, 2023 at 3:30 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Nov 29, 2023 at 4:58 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
Here is the updated patch based on some comments by tender wang (those
comments were sent only to me)
few nitpicks:
+ /*
+ * Mask for slotno banks, considering 1GB SLRU buffer pool size and the
+ * SLRU_BANK_SIZE bits16 should be sufficient for the bank mask.
+ */
+ bits16 bank_mask;
} SlruCtlData;
...
...
+ int bankno = pageno & ctl->bank_mask;
I am a bit uncomfortable seeing it as a mask, why can't it be simply a number
of banks (num_banks) and get the bank number through modulus op (pageno %
num_banks) instead of bitwise & operation (pageno & ctl->bank_mask) which is a
bit difficult to read compared to modulus op which is quite simple,
straightforward and much common practice in hashing.
Are there any advantages of using & over % ?
Also, a few places in 0002 and 0003 patch, need the bank number, it is better
to have a macro for that.
---
extern bool SlruScanDirCbDeleteAll(SlruCtl ctl, char *filename, int64 segpage,
void *data);
-
+extern bool check_slru_buffers(const char *name, int *newval);
#endif /* SLRU_H */
Add an empty line after the declaration, in 0002 patch.
---
-TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn, int slotno)
+TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn,
+ int slotno)
Unrelated change for 0003 patch.
---
Regards,
Amul
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Thu, Dec 14, 2023 at 8:43 AM Amul Sul <sulamul@gmail.com> wrote: > > On Mon, Dec 11, 2023 at 10:42 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: >> >> On Thu, Nov 30, 2023 at 3:30 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: >> > >> > On Wed, Nov 29, 2023 at 4:58 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: >> >> Here is the updated patch based on some comments by tender wang (those >> comments were sent only to me) > > > few nitpicks: > > + > + /* > + * Mask for slotno banks, considering 1GB SLRU buffer pool size and the > + * SLRU_BANK_SIZE bits16 should be sufficient for the bank mask. > + */ > + bits16 bank_mask; > } SlruCtlData; > > ... > ... > > + int bankno = pageno & ctl->bank_mask; > > I am a bit uncomfortable seeing it as a mask, why can't it be simply a number > of banks (num_banks) and get the bank number through modulus op (pageno % > num_banks) instead of bitwise & operation (pageno & ctl->bank_mask) which is a > bit difficult to read compared to modulus op which is quite simple, > straightforward and much common practice in hashing. > > Are there any advantages of using & over % ? I am not sure either but since this change in 0002 is by Andrey, I will let him comment on this before we change or take any decision. > Also, a few places in 0002 and 0003 patch, need the bank number, it is better > to have a macro for that. > --- > > extern bool SlruScanDirCbDeleteAll(SlruCtl ctl, char *filename, int64 segpage, > void *data); > - > +extern bool check_slru_buffers(const char *name, int *newval); > #endif /* SLRU_H */ > > > Add an empty line after the declaration, in 0002 patch. > --- > > -TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn, int slotno) > +TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn, > + int slotno) > > Unrelated change for 0003 patch. Fixed Thanks for your review, PFA updated version. I have added @Amit Kapila to the list to view his opinion about whether anything can break in the clog group update with our changes of bank-wise SLRU lock. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
"Andrey M. Borodin"
Date:
> On 14 Dec 2023, at 08:12, Amul Sul <sulamul@gmail.com> wrote: > > > + int bankno = pageno & ctl->bank_mask; > > I am a bit uncomfortable seeing it as a mask, why can't it be simply a number > of banks (num_banks) and get the bank number through modulus op (pageno % > num_banks) instead of bitwise & operation (pageno & ctl->bank_mask) which is a > bit difficult to read compared to modulus op which is quite simple, > straightforward and much common practice in hashing. > > Are there any advantages of using & over % ? The instruction AND is ~20 times faster than IDIV [0]. This is a relatively hot function, worth sacrificing some readability to save ~ten nanoseconds on each check of a status of a transaction. [0] https://www.agner.org/optimize/instruction_tables.pdf
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
tender wang
Date:
Andrey M. Borodin <x4mmm@yandex-team.ru> wrote on Thursday, 14 December 2023 at 17:02:
> On 14 Dec 2023, at 08:12, Amul Sul <sulamul@gmail.com> wrote:
>
>
> + int bankno = pageno & ctl->bank_mask;
>
> I am a bit uncomfortable seeing it as a mask, why can't it be simply a number
> of banks (num_banks) and get the bank number through modulus op (pageno %
> num_banks) instead of bitwise & operation (pageno & ctl->bank_mask) which is a
> bit difficult to read compared to modulus op which is quite simple,
> straightforward and much common practice in hashing.
>
> Are there any advantages of using & over % ?
Using the Compiler Explorer [1] tool, '%' produces more assembly instructions than '&':
int GetBankno1(int pageno) {
    return pageno & 127;
}
int GetBankno2(int pageno) {
    return pageno % 127;
}
under clang 13.0:
GetBankno1: # @GetBankno1
push rbp
mov rbp, rsp
mov dword ptr [rbp - 4], edi
mov eax, dword ptr [rbp - 4]
and eax, 127
pop rbp
ret
GetBankno2: # @GetBankno2
push rbp
mov rbp, rsp
mov dword ptr [rbp - 4], edi
mov eax, dword ptr [rbp - 4]
mov ecx, 127
cdq
idiv ecx
mov eax, edx
pop rbp
ret
under gcc 13.2:
GetBankno1:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
mov eax, DWORD PTR [rbp-4]
and eax, 127
pop rbp
ret
GetBankno2:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
mov eax, DWORD PTR [rbp-4]
movsx rdx, eax
imul rdx, rdx, -2130574327
shr rdx, 32
add edx, eax
mov ecx, edx
sar ecx, 6
cdq
sub ecx, edx
mov edx, ecx
sal edx, 7
sub edx, ecx
sub eax, edx
mov ecx, eax
mov eax, ecx
pop rbp
ret
> The instruction AND is ~20 times faster than IDIV [0]. This is a relatively hot function, worth sacrificing some readability to save ~ten nanoseconds on each check of a status of a transaction.
Now that AND is faster, can we replace the '% SLRU_MAX_BANKLOCKS' operation in SimpleLruGetBankLock() with '& 127':
SimpleLruGetBankLock()
{
    int banklockno = (pageno & ctl->bank_mask) % SLRU_MAX_BANKLOCKS;   /* use '&' here too? */
    return &(ctl->shared->bank_locks[banklockno].lock);
}
Thoughts?
[0] https://www.agner.org/optimize/instruction_tables.pdf
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
"Andrey M. Borodin"
Date:
> On 14 Dec 2023, at 14:28, tender wang <tndrwang@gmail.com> wrote:
>
> Now that AND is faster, can we replace the '% SLRU_MAX_BANKLOCKS' operation in SimpleLruGetBankLock() with '&127'
unsigned int GetBankno1(unsigned int pageno) {
    return pageno & 127;
}
unsigned int GetBankno2(unsigned int pageno) {
    return pageno % 128;
}
Generates with -O2:
GetBankno1(unsigned int):
    mov eax, edi
    and eax, 127
    ret
GetBankno2(unsigned int):
    mov eax, edi
    and eax, 127
    ret
The compiler is smart enough with constants.
Best regards, Andrey Borodin.
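One caveat worth keeping in mind, shown as a small sketch for Compiler Explorer (the function names here are made up): the folding of '%' into AND only happens when the divisor is a power-of-two compile-time constant. When the bank count is only known at run time, as with a field in the SLRU control struct, '%' stays an integer division, while '&' with a precomputed mask remains a single AND.

unsigned int GetBanknoByRuntimeMod(unsigned int pageno, unsigned int nbanks)
{
    return pageno % nbanks;     /* divisor is not a constant: compiles to a DIV */
}

unsigned int GetBanknoByMask(unsigned int pageno, unsigned int bank_mask)
{
    return pageno & bank_mask;  /* always a single AND */
}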
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Wed, Dec 13, 2023 at 5:49 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: > > On 12 Dec 2023, at 18:28, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > Andrey, do you have any stress tests or anything else that you used to > > gain confidence in this code? > > We are using only first two steps of the patchset, these steps do not touch locking stuff. > > We’ve got some confidence after Yura Sokolov’s benchmarks [0]. Thanks! > I have run this test [1]; instead of comparing against the master, I have compared the effect of patch-1 (0001+0002, SLRU buffer banks) vs patch-2 (0001+0002+0003, SLRU buffer banks + bank-wise locks), and here are the results of benchmark-1 and benchmark-2. I have noticed a very good improvement with the addition of patch 0003.
Machine information: Architecture: x86_64, CPU op-mode(s): 32-bit, 64-bit, Byte Order: Little Endian, CPU(s): 128, On-line CPU(s) list: 0-127
configurations: max_wal_size=20GB shared_buffers=20GB checkpoint_timeout=40min max_connections=700 maintenance_work_mem=1GB subtrans_buffers=$variable multixact_offsets_buffers=$variable multixact_members_buffers=$variable
benchmark-1
version | subtrans | multixact | tps
        | buffers  | offs/memb | func+ballast
--------+----------+-----------+--------------
patch-1 |       64 |    64/128 |  87 + 58
patch-2 |       64 |    64/128 | 128 + 83
patch-1 |     1024 |  512/1024 |  96 + 64
patch-2 |     1024 |  512/1024 | 163 + 108
benchmark-2
version | subtrans | multixact | tps
        | buffers  | offs/memb | func
--------+----------+-----------+------
patch-1 |       64 |    64/128 |  10
patch-2 |       64 |    64/128 |  12
patch-1 |     1024 |  512/1024 |  44
patch-2 |     1024 |  512/1024 |  72
[1] https://www.postgresql.org/message-id/flat/e46cdea96979545b2d8a13b451d8b1ce61dc7238.camel%40postgrespro.ru#0ed2cad11470825d464093fe6b8ef6a3 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
"Andrey M. Borodin"
Date:
> On 14 Dec 2023, at 16:06, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > I have noticed > a very good improvement with the addition of patch 0003. Indeed, very impressive results! It’s almost a 2x performance gain in the high-contention scenario, on top of the previous improvements. Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
tender wang
Date:
Andrey M. Borodin <x4mmm@yandex-team.ru> wrote on Thursday, 14 December 2023 at 17:35:
> On 14 Dec 2023, at 14:28, tender wang <tndrwang@gmail.com> wrote:
>
> Now that AND is faster, can we replace the '% SLRU_MAX_BANKLOCKS' operation in SimpleLruGetBankLock() with '& 127'
unsigned int GetBankno1(unsigned int pageno) {
return pageno & 127;
}
unsigned int GetBankno2(unsigned int pageno) {
return pageno % 128;
}
Generates with -O2
GetBankno1(unsigned int):
mov eax, edi
and eax, 127
ret
GetBankno2(unsigned int):
mov eax, edi
and eax, 127
ret
Compiler is smart enough with constants.
Yeah, that's true.
int GetBankno(long pageno) {
    unsigned short bank_mask = 128;
    int bankno = (pageno & bank_mask) % 128;
    return bankno;
}
With -O2 enabled, only one instruction:
xor eax, eax
But if we use '%' in both places, things change as below:
int GetBankno(long pageno) {
    unsigned short bank_mask = 128;
    int bankno = (pageno % bank_mask) % 128;
    return bankno;
}
mov rdx, rdi
sar rdx, 63
shr rdx, 57
lea rax, [rdi+rdx]
and eax, 127
sub eax, edx
sar rdx, 63
shr rdx, 57
lea rax, [rdi+rdx]
and eax, 127
sub eax, edx
Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
"Andrey M. Borodin"
Date:
> On 14 Dec 2023, at 16:32, tender wang <tndrwang@gmail.com> wrote: > > With -O2 enabled, only one instruction: > xor eax, eax This is not fast code. This is how a friendly C compiler suggests to you that the mask must be 127, not 128. Best regards, Andrey Borodin.
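A tiny self-contained example of the point being made here (illustrative constants, not slru.c code): with a power-of-two bank count the mask has to be nbanks - 1, and then '&' and '%' agree for every page number.

#include <assert.h>
#include <stdio.h>

#define NUM_BANKS 128                 /* must be a power of two */
#define BANK_MASK (NUM_BANKS - 1)     /* 127, not 128 */

static int GetBankno(unsigned long pageno)
{
    /* both forms are equivalent because NUM_BANKS is a power of two */
    assert((pageno % NUM_BANKS) == (pageno & BANK_MASK));
    return (int) (pageno & BANK_MASK);
}

int main(void)
{
    for (unsigned long pageno = 0; pageno < 100000; pageno++)
        (void) GetBankno(pageno);
    printf("bank of page 300: %d\n", GetBankno(300));
    return 0;
}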
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Thu, Dec 14, 2023 at 4:36 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Dec 13, 2023 at 5:49 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: > > > > On 12 Dec 2023, at 18:28, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > > > Andrey, do you have any stress tests or anything else that you used to > > > gain confidence in this code? > > I have done some more testing for the clog group update as the attached test file executes two concurrent scripts executed with pgbench, the first script is the slow script which will run 10-second long transactions and the second script is a very fast transaction with ~10000 transactions per second. Along with that, I have also changed the bank size such that each bank will contain just 1 page i.e. 32k transactions per bank. I have done this way so that we do not need to keep long-running transactions running for very long in order to get the transactions from different banks committed during the same time. With this test, I have got that behavior and the below logs shows that multiple transaction range which is in different slru-bank (considering 32k transactions per bank) are doing group update at the same time. e.g. in the below logs, we can see xid range around 70600, 70548, and 70558, and xid range around 755, and 752 are getting group updates by different leaders but near the same time. It is running fine when running for a long duration, but I am not sure how to validate the sanity of this kind of test. 2023-12-14 14:43:31.813 GMT [3306] LOG: group leader procno 606 updated status of procno 606 xid 70600 2023-12-14 14:43:31.816 GMT [3326] LOG: procno 586 for xid 70548 added for group update 2023-12-14 14:43:31.816 GMT [3326] LOG: procno 586 is group leader and got the lock 2023-12-14 14:43:31.816 GMT [3326] LOG: group leader procno 586 updated status of procno 586 xid 70548 2023-12-14 14:43:31.818 GMT [3327] LOG: procno 585 for xid 70558 added for group update 2023-12-14 14:43:31.818 GMT [3327] LOG: procno 585 is group leader and got the lock 2023-12-14 14:43:31.818 GMT [3327] LOG: group leader procno 585 updated status of procno 585 xid 70558 2023-12-14 14:43:31.829 GMT [3155] LOG: procno 687 for xid 752 added for group update 2023-12-14 14:43:31.829 GMT [3207] LOG: procno 669 for xid 755 added for group update 2023-12-14 14:43:31.829 GMT [3155] LOG: procno 687 is group leader and got the lock 2023-12-14 14:43:31.829 GMT [3155] LOG: group leader procno 687 updated status of procno 669 xid 755 2023-12-14 14:43:31.829 GMT [3155] LOG: group leader procno 687 updated status of procno 687 xid 752 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Thu, Dec 14, 2023 at 1:53 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Dec 14, 2023 at 8:43 AM Amul Sul <sulamul@gmail.com> wrote: > > > > On Mon, Dec 11, 2023 at 10:42 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > >> > >> On Thu, Nov 30, 2023 at 3:30 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > >> > > >> > On Wed, Nov 29, 2023 at 4:58 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > >> > >> Here is the updated patch based on some comments by tender wang (those > >> comments were sent only to me) > > > > > > few nitpicks: > > > > + > > + /* > > + * Mask for slotno banks, considering 1GB SLRU buffer pool size and the > > + * SLRU_BANK_SIZE bits16 should be sufficient for the bank mask. > > + */ > > + bits16 bank_mask; > > } SlruCtlData; > > > > ... > > ... > > > > + int bankno = pageno & ctl->bank_mask; > > > > I am a bit uncomfortable seeing it as a mask, why can't it be simply a number > > of banks (num_banks) and get the bank number through modulus op (pageno % > > num_banks) instead of bitwise & operation (pageno & ctl->bank_mask) which is a > > bit difficult to read compared to modulus op which is quite simple, > > straightforward and much common practice in hashing. > > > > Are there any advantages of using & over % ? > > I am not sure either but since this change in 0002 is by Andrey, I > will let him comment on this before we change or take any decision. > > > Also, a few places in 0002 and 0003 patch, need the bank number, it is better > > to have a macro for that. > > --- > > > > extern bool SlruScanDirCbDeleteAll(SlruCtl ctl, char *filename, int64 segpage, > > void *data); > > - > > +extern bool check_slru_buffers(const char *name, int *newval); > > #endif /* SLRU_H */ > > > > > > Add an empty line after the declaration, in 0002 patch. > > --- > > > > -TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn, int slotno) > > +TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn, > > + int slotno) > > > > Unrelated change for 0003 patch. > > Fixed > > Thanks for your review, PFA updated version. > > I have added @Amit Kapila to the list to view his opinion about > whether anything can break in the clog group update with our changes > of bank-wise SLRU lock. Updated the comments about group commit safety based on the off-list suggestion by Alvaro. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Robert Haas
Date:
On Tue, Dec 12, 2023 at 8:29 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > The problem I see is that the group update mechanism is designed around > contention of the global xact-SLRU control lock; it uses atomics to > coordinate a single queue when the lock is contended. So if we split up > the global SLRU control lock using banks, then multiple processes using > different bank locks might not contend. OK, this is fine, but what > happens if two separate groups of processes encounter contention on two > different bank locks? I think they will both try to update the same > queue, and coordinate access to that *using different bank locks*. I > don't see how can this work correctly. > > I suspect the first part of that algorithm, where atomics are used to > create the list without a lock, might work fine. But will each "leader" > process, each of which is potentially using a different bank lock, > coordinate correctly? Maybe this works correctly because only one > process will find the queue head not empty? If this is what happens, > then there needs to be comments about it. Without any explanation, > this seems broken and potentially dangerous, as some transaction commit > bits might become lost given high enough concurrency and bad luck. I don't want to be dismissive of this concern, but I took a look at this part of the patch set and I don't see a correctness problem. I think the idea of the mechanism is that we have a single linked list in shared memory that can accumulate those waiters. At some point a process pops the entire list of waiters, which allows a new group of waiters to begin accumulating. The process that pops the list must perform the updates for every process in the just-popped list without failing, else updates would be lost. In theory, there can be any number of lists that some process has popped and is currently working its way through at the same time, although in practice I bet it's quite rare for there to be more than one. The only correctness problem is if it's possible for a process that popped the list to error out before it finishes doing the updates that it "promised" to do by popping the list. Having individual bank locks doesn't really change anything here. That's just a matter of what lock has to be held in order to perform the update that was promised, and the algorithm described in the previous paragraph doesn't really care about that. Nor is there a deadlock hazard here as long as processes only take one lock at a time, which I believe is the case. The only potential issue that I see here is with performance. I've heard some questions about whether this machinery performs well even as it stands, but certainly if we divide up the lock into a bunch of bankwise locks then that should tend in the direction of making a mechanism like this less valuable, because both mechanisms are trying to alleviate contention, and so in a certain sense they are competing for the same job. However, they do aim to alleviate different TYPES of contention: the group XID update stuff should be most valuable when lots of processes are trying to update the same page, and the banks should be most valuable when there is simultaneous access to a bunch of different pages. So I'm not convinced that this patch is a reason to remove the group XID update mechanism, but someone might argue otherwise. 
A related concern is that, if by chance we do end up with multiple updaters from different pages in the same group, it will now be more expensive to sort that out because we'll have to potentially keep switching locks. So that could make the group XID update mechanism less performant than it is currently. I think that's probably not an issue because I think it should be a rare occurrence, as the comments say. A bit more cost in a code path that is almost never taken won't matter. But if that path is more commonly taken than I think, then maybe making it more expensive could hurt. It might be worth adding some debugging to see how often we actually go down that path in a highly stressed system. BTW: + * as well. The main reason for now allowing requesters of different pages now -> not -- Robert Haas EDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Robert Haas
Date:
On Mon, Dec 18, 2023 at 12:04 PM Robert Haas <robertmhaas@gmail.com> wrote: > certain sense they are competing for the same job. However, they do > aim to alleviate different TYPES of contention: the group XID update > stuff should be most valuable when lots of processes are trying to > update the same page, and the banks should be most valuable when there > is simultaneous access to a bunch of different pages. So I'm not > convinced that this patch is a reason to remove the group XID update > mechanism, but someone might argue otherwise. Hmm, but, on the other hand: Currently all readers and writers are competing for the same LWLock. But with this change, the readers will (mostly) no longer be competing with the writers. So, in theory, that might reduce lock contention enough to make the group update mechanism pointless. -- Robert Haas EDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
"Andrey M. Borodin"
Date:
> On 18 Dec 2023, at 22:30, Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Dec 18, 2023 at 12:04 PM Robert Haas <robertmhaas@gmail.com> wrote: >> certain sense they are competing for the same job. However, they do >> aim to alleviate different TYPES of contention: the group XID update >> stuff should be most valuable when lots of processes are trying to >> update the same page, and the banks should be most valuable when there >> is simultaneous access to a bunch of different pages. So I'm not >> convinced that this patch is a reason to remove the group XID update >> mechanism, but someone might argue otherwise. > > Hmm, but, on the other hand: > > Currently all readers and writers are competing for the same LWLock. > But with this change, the readers will (mostly) no longer be competing > with the writers. So, in theory, that might reduce lock contention > enough to make the group update mechanism pointless. One page still accommodates 32K transaction statuses under one lock. It feels like a lot. About 1 second of transactions on a typical installation. When the group commit was committed, did we have a benchmark to estimate the efficiency of this technology? Can we repeat that test again? Best regards, Andrey Borodin.
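For reference, a small sketch of the arithmetic behind the "32K statuses per page" remark, assuming the default 8 kB block size and two clog status bits per transaction:

#include <stdio.h>

int main(void)
{
    const int page_size = 8192;    /* BLCKSZ, assumed default */
    const int xacts_per_byte = 4;  /* 2 status bits per transaction */

    /* 8192 * 4 = 32768 transaction statuses share one SLRU page;
     * at ~30k write transactions per second that is roughly one
     * second's worth of transactions behind a single bank lock. */
    printf("xacts per clog page: %d\n", page_size * xacts_per_byte);
    return 0;
}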
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Robert Haas
Date:
On Mon, Dec 18, 2023 at 12:53 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: > One page still accommodates 32K transaction statuses under one lock. It feels like a lot. About 1 second of transactionson a typical installation. > > When the group commit was committed did we have a benchmark to estimate efficiency of this technology? Can we repeat thattest again? I think we did, but it might take some research to find it in the archives. If we can, I agree that repeating it feels like a good idea. -- Robert Haas EDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Mon, Dec 18, 2023 at 11:00 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Dec 18, 2023 at 12:04 PM Robert Haas <robertmhaas@gmail.com> wrote: > > certain sense they are competing for the same job. However, they do > > aim to alleviate different TYPES of contention: the group XID update > > stuff should be most valuable when lots of processes are trying to > > update the same page, and the banks should be most valuable when there > > is simultaneous access to a bunch of different pages. So I'm not > > convinced that this patch is a reason to remove the group XID update > > mechanism, but someone might argue otherwise. > > Hmm, but, on the other hand: > > Currently all readers and writers are competing for the same LWLock. > But with this change, the readers will (mostly) no longer be competing > with the writers. So, in theory, that might reduce lock contention > enough to make the group update mechanism pointless. Thanks for your feedback, I agree that with a bank-wise lock, we might not need group updates for some of the use cases as you said where readers and writers are contenting on the centralized lock because, in most of the cases, readers will be distributed across different banks. OTOH there are use cases where the writer commit is the bottleneck (on SLRU lock) like pgbench simple-update or TPC-B then we will still benefit by group update. During group update testing we have seen benefits with such a scenario[1] with high client counts. So as per my understanding by distributing the SLRU locks there are scenarios where we will not need group update anymore but OTOH there are also scenarios where we will still benefit from the group update. [1] https://www.postgresql.org/message-id/CAFiTN-u-XEzhd%3DhNGW586fmQwdTy6Qy6_SXe09tNB%3DgBcVzZ_A%40mail.gmail.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
"Andrey M. Borodin"
Date:
On 19 Dec 2023, at 10:34, Dilip Kumar <dilipbalaut@gmail.com> wrote:
Just a side note.
It seems like commit log is kind of antipattern of data contention. Even when we build a super-optimized SLRU. Nearby **bits** are written by different CPUs.
I think that banks and locks are good thing. But also we could reorganize log so that
status of transaction 0 is on a page 0 at bit offset 0
status of transaction 1 is on a page 1 at bit offset 0
status of transaction 2 is on a page 2 at bit offset 0
status of transaction 3 is on a page 3 at bit offset 0
status of transaction 4 is on a page 0 at bit offset 2
status of transaction 5 is on a page 1 at bit offset 2
status of transaction 6 is on a page 2 at bit offset 2
status of transaction 7 is on a page 3 at bit offset 2
etc...
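Expressed as code, the striping enumerated above maps an xid to a page and bit offset roughly like this (a sketch with assumed constants, not a proposal-quality patch):

#include <stdio.h>

#define NUM_STRIPE_PAGES 4   /* assumed stripe width for the example */
#define BITS_PER_XACT    2   /* two clog status bits per transaction */

static int striped_pageno(unsigned long xid)
{
    return (int) (xid % NUM_STRIPE_PAGES);                    /* 0,1,2,3,0,1,... */
}

static int striped_bit_offset(unsigned long xid)
{
    return (int) ((xid / NUM_STRIPE_PAGES) * BITS_PER_XACT);  /* 0,0,0,0,2,2,... */
}

int main(void)
{
    for (unsigned long xid = 0; xid < 8; xid++)
        printf("xid %lu -> page %d, bit offset %d\n",
               xid, striped_pageno(xid), striped_bit_offset(xid));
    return 0;
}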
And it would be even better if page for transaction statuses would be determined by backend id somehow. Or at least cache line. Can we allocate a range (sizeof(cacheline)) of xids\subxids\multixacts\whatever for each backend?
This does not matter much because
0. Patch set in current thread produces robust SLRU anyway
1. One day we are going to throw away SLRU anyway
Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Robert Haas
Date:
On Fri, Dec 22, 2023 at 8:14 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: > Just a side note. > It seems like commit log is kind of antipattern of data contention. Even when we build a super-optimized SLRU. Nearby **bits** are written by different CPUs. > I think that banks and locks are good thing. But also we could reorganize log so that > status of transaction 0 is on a page 0 at bit offset 0 > status of transaction 1 is on a page 1 at bit offset 0 > status of transaction 2 is on a page 2 at bit offset 0 > status of transaction 3 is on a page 3 at bit offset 0 > status of transaction 4 is on a page 0 at bit offset 2 > status of transaction 5 is on a page 1 at bit offset 2 > status of transaction 6 is on a page 2 at bit offset 2 > status of transaction 7 is on a page 3 at bit offset 2 > etc... This is an interesting idea. A variant would be to stripe across cachelines within the same page rather than across pages. If we do stripe across pages as proposed here, we'd probably want to rethink the way the SLRU is extended -- page at a time wouldn't really make sense, but preallocating an entire file might. > And it would be even better if page for transaction statuses would be determined by backend id somehow. Or at least cacheline. Can we allocate a range (sizeof(cacheline)) of xids\subxids\multixacts\whatever for each backend? I don't understand how this could work. We need to be able to look up transaction status by XID, not backend ID. -- Robert Haas EDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
"Andrey M. Borodin"
Date:
> On 2 Jan 2024, at 19:23, Robert Haas <robertmhaas@gmail.com> wrote: > >> >> And it would be even better if page for transaction statuses would be determined by backend id somehow. Or at least cacheline. Can we allocate a range (sizeof(cacheline)) of xids\subxids\multixacts\whatever for each backend? > > I don't understand how this could work. We need to be able to look up > transaction status by XID, not backend ID. When GetNewTransactionId() is called we can reserve 256 xids in backend local memory. These values will be reused by transactions or subtransactions of this backend. Here 256 == sizeof(CacheLine). This would ensure that different backends touch different cache lines. But this approach would dramatically increase xid consumption speed on patterns where the client reconnects after several transactions. So we can keep unused xids in the procarray for future reuse. I doubt we can find a significant performance improvement here, because false cache line sharing cannot be _that_ bad. Best regards, Andrey Borodin.
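A sketch of the arithmetic behind the "256 == sizeof(CacheLine)" remark, assuming 2 clog bits per xid and a 64-byte cache line:

#include <stdio.h>

int main(void)
{
    const int xids_per_backend = 256;
    const int bits_per_xid = 2;      /* clog status bits */
    const int cacheline_bytes = 64;

    /* 256 xids * 2 bits = 64 bytes: exactly one cache line of the clog page,
     * so a backend owning a run of 256 consecutive xids writes only to
     * its own cache line. */
    printf("clog bytes covered by %d xids: %d (cache line: %d bytes)\n",
           xids_per_backend, xids_per_backend * bits_per_xid / 8,
           cacheline_bytes);
    return 0;
}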
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Robert Haas
Date:
On Tue, Jan 2, 2024 at 1:10 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: > > On 2 Jan 2024, at 19:23, Robert Haas <robertmhaas@gmail.com> wrote: > >> And it would be even better if page for transaction statuses would be determined by backend id somehow. Or at leastcache line. Can we allocate a range (sizeof(cacheline)) of xids\subxids\multixacts\whatever for each backend? > > > > I don't understand how this could work. We need to be able to look up > > transaction status by XID, not backend ID. > > When GetNewTransactionId() is called we can reserve 256 xids in backend local memory. This values will be reused by transactionsor subtransactions of this backend. Here 256 == sizeof(CacheLine). > This would ensure that different backends touch different cache lines. > > But this approach would dramatically increase xid consumption speed on patterns where client reconnects after several transactions.So we can keep unused xids in procarray for future reuse. > > I doubt we can find significant performance improvement here, because false cache line sharing cannot be _that_ bad. Yeah, this seems way too complicated for what we'd potentially gain from it. An additional problem is that the xmin horizon computation assumes that XIDs are assigned in monotonically increasing fashion; breaking that would be messy. And even an occasional leak of XIDs could precipitate enough additional vacuuming to completely outweigh any gains we could hope to achieve here. -- Robert Haas EDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Tue, Jan 2, 2024 at 7:53 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Dec 22, 2023 at 8:14 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: > > Just a side node. > > It seems like commit log is kind of antipattern of data contention. Even when we build a super-optimized SLRU. Nearby**bits** are written by different CPUs. > > I think that banks and locks are good thing. But also we could reorganize log so that > > status of transaction 0 is on a page 0 at bit offset 0 > > status of transaction 1 is on a page 1 at bit offset 0 > > status of transaction 2 is on a page 2 at bit offset 0 > > status of transaction 3 is on a page 3 at bit offset 0 > > status of transaction 4 is on a page 0 at bit offset 2 > > status of transaction 5 is on a page 1 at bit offset 2 > > status of transaction 6 is on a page 2 at bit offset 2 > > status of transaction 7 is on a page 3 at bit offset 2 > > etc... > > This is an interesting idea. A variant would be to stripe across > cachelines within the same page rather than across pages. If we do > stripe across pages as proposed here, we'd probably want to rethink > the way the SLRU is extended -- page at a time wouldn't really make > sense, but preallocating an entire file might. Yeah, this is indeed an interesting idea. So I think if we are interested in working in this direction maybe this can be submitted in a different thread, IMHO. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Robert Haas
Date:
On Wed, Jan 3, 2024 at 12:08 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Yeah, this is indeed an interesting idea. So I think if we are > interested in working in this direction maybe this can be submitted in > a different thread, IMHO. Yeah, that's something quite different from the patch before us. -- Robert Haas EDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
The more I look at TransactionGroupUpdateXidStatus, the more I think it's broken, and while we do have some tests, I don't have confidence that they cover all possible cases. Or, at least, if this code is good, then it hasn't been sufficiently explained. If we have multiple processes trying to write bits to clog, and they are using different banks, then the LWLockConditionalAcquire will be able to acquire the bank lock -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "The saddest aspect of life right now is that science gathers knowledge faster than society gathers wisdom." (Isaac Asimov)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Mon, Jan 8, 2024 at 4:55 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > The more I look at TransactionGroupUpdateXidStatus, the more I think > it's broken, and while we do have some tests, I don't have confidence > that they cover all possible cases. > > Or, at least, if this code is good, then it hasn't been sufficiently > explained. Any thought about a case in which you think it might be broken, I mean any vague thought might also help where you think it might not work as expected so that I can also think in that direction. It might be possible that I might not be thinking of some perspective that you are thinking and comments might be lacking from that point of view. > If we have multiple processes trying to write bits to clog, and they are > using different banks, then the LWLockConditionalAcquire will be able to > acquire the bank lock Do you think there is a problem with multiple processes getting the lock? I mean they are modifying different CLOG pages so that can be done concurrently right? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
On 2024-Jan-08, Dilip Kumar wrote: > On Mon, Jan 8, 2024 at 4:55 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > The more I look at TransactionGroupUpdateXidStatus, the more I think > > it's broken, and while we do have some tests, I don't have confidence > > that they cover all possible cases. > > > > Or, at least, if this code is good, then it hasn't been sufficiently > > explained. > > Any thought about a case in which you think it might be broken, I mean > any vague thought might also help where you think it might not work as > expected so that I can also think in that direction. It might be > possible that I might not be thinking of some perspective that you are > thinking and comments might be lacking from that point of view. Eh, apologies. This email was an unfinished draft that I had lying around from before the holidays, which I intended to discard but somehow kept around, and just now I happened to press the wrong key combination and it ended up being sent instead. We had some further discussion, after which I no longer think that there is a problem here, so please ignore this email. I'll come back to this patch later this week. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "El que vive para el futuro es un iluso, y el que vive para el pasado, un imbécil" (Luis Adler, "Los tripulantes de la noche")
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Mon, Jan 8, 2024 at 9:12 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > Eh, apologies. This email was an unfinished draft that I had laying > around before the holidays which I intended to discard it but somehow > kept around, and just now I happened to press the wrong key combination > and it ended up being sent instead. We had some further discussion, > after which I no longer think that there is a problem here, so please > ignore this email. > > I'll come back to this patch later this week. No problem The patch was facing some compilation issues after some recent commits, so I have changed it. Reported by Julien Tachoires (offlist) -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Wed, Jan 10, 2024 at 6:50 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Jan 8, 2024 at 9:12 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > Eh, apologies. This email was an unfinished draft that I had lying > > around from before the holidays, which I intended to discard but somehow > > kept around, and just now I happened to press the wrong key combination > > and it ended up being sent instead. We had some further discussion, > > after which I no longer think that there is a problem here, so please > > ignore this email. > > > > I'll come back to this patch later this week. > > No problem > > The patch was facing some compilation issues after some recent > commits, so I have changed it. Reported by Julien Tachoires (offlist) The last patch conflicted with some of the recent commits, so here is the updated version of the patch. I also noticed that the slru bank-lock wait event details were missing from the wait_event_names.txt file, so I have added those as well. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
Here's a touched-up version of this patch. First, PROC_GLOBAL->clogGroupFirst and SlruCtl->latest_page_number change from being protected by locks to being atomics, but there's no mention of considering memory barriers that they should have. Looking at the code, I think the former doesn't need any additional barriers, but latest_page_number is missing some, which I have added. This deserves another careful look. Second, and most user-visibly, the GUC names seem to have been chosen based on the source-code variables, which were never meant to be user-visible. So I renamed a few:
xact_buffers -> transaction_buffers
subtrans_buffers -> subtransaction_buffers
serial_buffers -> serializable_buffers
commit_ts_buffers -> commit_timestamp_buffers
(unchanged: multixact_offsets_buffers, multixact_members_buffers, notify_buffers)
I did this explicitly trying to avoid using the term SLRU in a user-visible manner, because what do users care? But immediately after doing this I realized that we already have pg_stat_slru, so maybe the cat is already out of the bag, and so perhaps we should name these GUCs as, say, slru_transaction_buffers? That may make the connection between these things a little more explicit. (I do think we need to add cross-links in the documentation of those GUCs to the pg_stat_slru docs and vice-versa.) Another thing that bothered me a bit is that we have auto-tuning for transaction_buffers and commit_timestamp_buffers, but not for subtransaction_buffers. (Autotuning means you set the GUC to 0 and it scales with shared_buffers). I don't quite understand the reason for the omission, so I added it for subtrans too. I think it may make sense to do likewise for the multixact ones too, not sure. It doesn't seem worth having that for pg_serial and notify. While messing about with these GUCs I realized that the usage of the show_hook to print what the current number is, when autotuning is used, was bogus: SHOW would print the number of blocks for (say) transaction_buffers, but if you asked it to print (say) multixact_offsets_buffers, it would give a size in MB or kB. I'm sure such an inconsistency would bite us. So, digging around I found that a good way to handle this is to remove the show_hook, and instead call SetConfigOption() at the time when the ShmemInit function is called, with the correct number of buffers determined. This is pretty much what is used for XLOGbuffers, and it works correctly as far as my testing shows. Still with these auto-tuning GUCs, I noticed that the auto-tuning code would continue to grow the buffer sizes with shared_buffers to arbitrarily large values. I added an arbitrary maximum of 1024 (8 MB), which is much higher than the current value of 128; but if you have (say) 30 GB of shared_buffers (not uncommon these days), do you really need 30 MB of pg_clog cache? It seems mostly unnecessary ... and you can still set it manually that way if you need it. So, largely I just rewrote those small functions completely. I also made the SGML documentation and postgresql.conf.sample all match what the code actually does. The whole thing wasn't particularly consistent. I rewrote a bunch of code comments and moved stuff around to appear in alphabetical order, etc. More comment rewriting still pending. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
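To make the auto-tuning behaviour described here concrete, a rough sketch of the kind of rule involved follows; the scaling divisor and the minimum clamp are assumptions for illustration only, and the actual patch computes the value in the ShmemInit path and publishes it with SetConfigOption().

/* Sketch only: GUC value 0 means "scale with shared_buffers", clamped so
 * that a huge shared_buffers does not imply a huge SLRU cache. */
static int
SLRUAutotunedBuffers(int guc_value, int shared_buffers)
{
    const int max_autotuned = 1024;   /* 8 MB of 8 kB pages, the cap mentioned above */
    const int min_buffers = 16;       /* one bank (assumption) */

    if (guc_value != 0)
        return guc_value;             /* explicit setting wins */

    int n = shared_buffers / 512;     /* assumed scaling ratio */
    if (n < min_buffers)
        n = min_buffers;
    if (n > max_autotuned)
        n = max_autotuned;
    return n;
}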
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
On 2024-Jan-25, Alvaro Herrera wrote: > Here's a touched-up version of this patch. > diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c > index 98fa6035cc..4a5e05d5e4 100644 > --- a/src/backend/storage/lmgr/lwlock.c > +++ b/src/backend/storage/lmgr/lwlock.c > @@ -163,6 +163,13 @@ static const char *const BuiltinTrancheNames[] = { > [LWTRANCHE_LAUNCHER_HASH] = "LogicalRepLauncherHash", > [LWTRANCHE_DSM_REGISTRY_DSA] = "DSMRegistryDSA", > [LWTRANCHE_DSM_REGISTRY_HASH] = "DSMRegistryHash", > + [LWTRANCHE_COMMITTS_SLRU] = "CommitTSSLRU", > + [LWTRANCHE_MULTIXACTOFFSET_SLRU] = "MultixactOffsetSLRU", > + [LWTRANCHE_MULTIXACTMEMBER_SLRU] = "MultixactMemberSLRU", > + [LWTRANCHE_NOTIFY_SLRU] = "NotifySLRU", > + [LWTRANCHE_SERIAL_SLRU] = "SerialSLRU" > + [LWTRANCHE_SUBTRANS_SLRU] = "SubtransSLRU", > + [LWTRANCHE_XACT_SLRU] = "XactSLRU", > }; Eeek. Last minute changes ... Fixed here. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "La primera ley de las demostraciones en vivo es: no trate de usar el sistema. Escriba un guión que no toque nada para no causar daños." (Jakob Nielsen)
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Robert Haas
Date:
On Thu, Jan 25, 2024 at 11:22 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Still with these auto-tuning GUCs, I noticed that the auto-tuning code > would continue to grow the buffer sizes with shared_buffers to > arbitrarily large values. I added an arbitrary maximum of 1024 (8 MB), > which is much higher than the current value of 128; but if you have > (say) 30 GB of shared_buffers (not uncommon these days), do you really > need 30MB of pg_clog cache? It seems mostly unnecessary ... and you can > still set it manually that way if you need it. So, largely I just > rewrote those small functions completely. Yeah, I think that if we're going to scale with shared_buffers, it should be capped. -- Robert Haas EDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
I've continued to review this and decided that I don't like the mess this patch proposes in order to support pg_commit_ts's deletion of all files. (Yes, I know that I was the one that proposed this idea. It's still a bad one). I'd like to change that code by removing the limit that we can only have 128 bank locks in a SLRU. To recap, the reason we did this is that commit_ts sometimes wants to delete all files while running (DeactivateCommitTs), and for this it needs to acquire all bank locks. Since going above the MAX_SIMUL_LWLOCKS limit is disallowed, we added the SLRU limit making multiple banks share lwlocks. I propose two alternative solutions: 1. The easiest is to have DeactivateCommitTs continue to hold CommitTsLock until the end, including while SlruScanDirectory does its thing. This sounds terrible, but considering that this code only runs when the module is being deactivated, I don't think it's really all that bad in practice. I mean, if you deactivate the commit_ts module and then try to read commit timestamp data, you deserve to wait for a few seconds just as a punishment for your stupidity. AFAICT the cases where anything is done in the module mostly check without locking that commitTsActive is set, so we're not slowing down any critical operations. Also, if you don't like to be delayed for a couple of seconds, just don't deactivate the module. 2. If we want some refinement, the other idea is to change SlruScanDirCbDeleteAll (the callback that SlruScanDirectory uses in this case) so that it acquires the bank lock appropriate for all the slots used by the file that's going to be deleted. This is OK because in the default compilation each file only has 32 segments, so that requires only 32 lwlocks held at once while the file is being deleted. I think we don't need to bother with this, but it's an option if we see the above as unworkable for whatever reason. The only other user of SlruScanDirCbDeleteAll is async.c (the LISTEN/ NOTIFY code), and what that does is delete all the files at server start, where nothing is running concurrently anyway. So whatever we do for commit_ts, it won't affect async.c. So, if we do away with the SLRU_MAX_BANKLOCKS idea, then we're assured one LWLock per bank (instead of sharing some lwlocks to multiple banks), and that makes the code simpler to reason about. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "In fact, the basic problem with Perl 5's subroutines is that they're not crufty enough, so the cruft leaks out into user-defined code instead, by the Conservation of Cruft Principle." (Larry Wall, Apocalypse 6)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Andrey Borodin
Date:
> On 26 Jan 2024, at 22:38, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > This is OK because in the > default compilation each file only has 32 segments, so that requires > only 32 lwlocks held at once while the file is being deleted. Do we somehow account for the possibility that different subsystems together accumulate more than MAX_SIMUL_LWLOCKS? E.g. GiST during a split can combine 75 locks, and somehow commit_ts will be deactivated by this backend at the same moment and add 32 locks more :) I understand that this sounds fantastic, these subsystems do not interfere. But this is fantastic only until something like that actually happens. If possible, I'd prefer one lock at a time, and maybe sometimes two or three with some guarantees that this is safe. So, from my POV the first solution that you proposed seems much better to me. Thanks for working on this! Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Thu, Jan 25, 2024 at 10:03 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2024-Jan-25, Alvaro Herrera wrote: > > > Here's a touched-up version of this patch. > > > diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c > > index 98fa6035cc..4a5e05d5e4 100644 > > --- a/src/backend/storage/lmgr/lwlock.c > > +++ b/src/backend/storage/lmgr/lwlock.c > > @@ -163,6 +163,13 @@ static const char *const BuiltinTrancheNames[] = { > > [LWTRANCHE_LAUNCHER_HASH] = "LogicalRepLauncherHash", > > [LWTRANCHE_DSM_REGISTRY_DSA] = "DSMRegistryDSA", > > [LWTRANCHE_DSM_REGISTRY_HASH] = "DSMRegistryHash", > > + [LWTRANCHE_COMMITTS_SLRU] = "CommitTSSLRU", > > + [LWTRANCHE_MULTIXACTOFFSET_SLRU] = "MultixactOffsetSLRU", > > + [LWTRANCHE_MULTIXACTMEMBER_SLRU] = "MultixactMemberSLRU", > > + [LWTRANCHE_NOTIFY_SLRU] = "NotifySLRU", > > + [LWTRANCHE_SERIAL_SLRU] = "SerialSLRU" > > + [LWTRANCHE_SUBTRANS_SLRU] = "SubtransSLRU", > > + [LWTRANCHE_XACT_SLRU] = "XactSLRU", > > }; > > Eeek. Last minute changes ... Fixed here. Thank you for working on this. There is one thing that I feel is problematic. We have kept the allowed values for these GUCs to be in multiple of SLRU_BANK_SIZE i.e. 16 and that's the reason the min values were changed to 16 but in this refactoring patch for some of the buffers you have changed that to 8 so I think that's not good. + { + {"multixact_offsets_buffers", PGC_POSTMASTER, RESOURCES_MEM, + gettext_noop("Sets the size of the dedicated buffer pool used for the MultiXact offset cache."), + NULL, + GUC_UNIT_BLOCKS + }, + &multixact_offsets_buffers, + 16, 8, SLRU_MAX_ALLOWED_BUFFERS, + check_multixact_offsets_buffers, NULL, NULL + }, Other than this patch looks good to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
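For context, the check hook named in the quoted GUC entry presumably just enforces the multiple-of-bank-size rule that the minimum values reflect. A minimal sketch, assuming a shared helper; the real hooks in the patch may be organized differently:

    /* Hypothetical common helper used by the per-SLRU GUC check hooks. */
    static bool
    check_slru_buffers(const char *name, int *newval)
    {
        /* Valid values are multiples of SLRU_BANK_SIZE (16). */
        if (*newval % SLRU_BANK_SIZE == 0)
            return true;

        GUC_check_errdetail("\"%s\" must be a multiple of %d.",
                            name, SLRU_BANK_SIZE);
        return false;
    }

    static bool
    check_multixact_offsets_buffers(int *newval, void **extra, GucSource source)
    {
        return check_slru_buffers("multixact_offsets_buffers", newval);
    }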
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Fri, Jan 26, 2024 at 11:08 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > I've continued to review this and decided that I don't like the mess > this patch proposes in order to support pg_commit_ts's deletion of all > files. (Yes, I know that I was the one that proposed this idea. It's > still a bad one). I'd like to change that code by removing the limit > that we can only have 128 bank locks in a SLRU. To recap, the reason we > did this is that commit_ts sometimes wants to delete all files while > running (DeactivateCommitTs), and for this it needs to acquire all bank > locks. Since going above the MAX_SIMUL_LWLOCKS limit is disallowed, we > added the SLRU limit making multiple banks share lwlocks. > > I propose two alternative solutions: > > 1. The easiest is to have DeactivateCommitTs continue to hold > CommitTsLock until the end, including while SlruScanDirectory does its > thing. This sounds terrible, but considering that this code only runs > when the module is being deactivated, I don't think it's really all that > bad in practice. I mean, if you deactivate the commit_ts module and > then try to read commit timestamp data, you deserve to wait for a few > seconds just as a punishment for your stupidity. I think this idea looks reasonable. I agree that if we are trying to read commit_ts after deactivating it then it's fine to make it wait. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
On 2024-Jan-29, Dilip Kumar wrote: > Thank you for working on this. There is one thing that I feel is > problematic. We have kept the allowed values for these GUCs to be in > multiple of SLRU_BANK_SIZE i.e. 16 and that's the reason the min > values were changed to 16 but in this refactoring patch for some of > the buffers you have changed that to 8 so I think that's not good. Oh, absolutely, you're right. Restored the minimum to 16. So, here's the patchset as two pieces. 0001 converts SlruSharedData->latest_page_number to use atomics. I don't see any reason to mix this in with the rest of the patch, and though it likely won't have any performance advantage by itself (since the lock acquisition is pretty much the same), it seems better to get it in ahead of the rest -- I think that simplifies matters for the second patch, which is large enough. So, 0002 introduces the rest of the feature. I removed the use of a different number of bank locks than banks, and I made commit_ts use a longer lwlock acquisition at truncation time, rather than forcing all-lwlock acquisition. The more I look at 0002, the more I notice that some comments badly need to be updated, so please don't read too much into it yet. But I wanted to post it anyway for archives and cfbot purposes. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
Hah: postgres -c lc_messages=C -c shared_buffers=$((512*17)) 2024-02-01 10:48:13.548 CET [1535379] FATAL: invalid value for parameter "transaction_buffers": 17 2024-02-01 10:48:13.548 CET [1535379] DETAIL: "transaction_buffers" must be a multiple of 16 -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ <Schwern> It does it in a really, really complicated way <crab> why does it need to be complicated? <Schwern> Because it's MakeMaker.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Thu, Feb 1, 2024 at 3:19 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > Hah: > > postgres -c lc_messages=C -c shared_buffers=$((512*17)) > > 2024-02-01 10:48:13.548 CET [1535379] FATAL: invalid value for parameter "transaction_buffers": 17 > 2024-02-01 10:48:13.548 CET [1535379] DETAIL: "transaction_buffers" must be a multiple of 16 Maybe we should resize it to the next multiple of the SLRU_BANK_SIZE instead of giving an error? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
On 2024-Feb-01, Dilip Kumar wrote: > On Thu, Feb 1, 2024 at 3:19 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > postgres -c lc_messages=C -c shared_buffers=$((512*17)) > > > > 2024-02-01 10:48:13.548 CET [1535379] FATAL: invalid value for parameter "transaction_buffers": 17 > > 2024-02-01 10:48:13.548 CET [1535379] DETAIL: "transaction_buffers" must be a multiple of 16 > > Maybe we should resize it to the next multiple of the SLRU_BANK_SIZE > instead of giving an error? Since this is the auto-tuning feature, I think it should use the previous multiple rather than the next, but yeah, something like that. While I have your attention -- if you could give a look to the 0001 patch I posted, I would appreciate it. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "Los trabajadores menos efectivos son sistematicamente llevados al lugar donde pueden hacer el menor daño posible: gerencia." (El principio Dilbert)
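The "previous multiple" adjustment discussed here would be a small tweak in the auto-tuning path. A sketch, using a hypothetical helper name (this is illustrative, not the committed code):

    /*
     * Derive an SLRU buffer count from shared_buffers, rounded down to a
     * whole number of banks, but never below a single bank.
     */
    static int
    SimpleLruAutotuneBuffers(int divisor, int max)
    {
        int     nbuffers = Min(NBuffers / divisor, max);

        nbuffers -= nbuffers % SLRU_BANK_SIZE;  /* previous multiple of 16 */
        return Max(nbuffers, SLRU_BANK_SIZE);
    }

With the example above (shared_buffers = 512*17), a transaction_buffers value derived as 17 would be rounded down to 16 instead of raising the FATAL error.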
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Thu, Feb 1, 2024 at 3:44 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2024-Feb-01, Dilip Kumar wrote: > > > On Thu, Feb 1, 2024 at 3:19 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > > > postgres -c lc_messages=C -c shared_buffers=$((512*17)) > > > > > > 2024-02-01 10:48:13.548 CET [1535379] FATAL: invalid value for parameter "transaction_buffers": 17 > > > 2024-02-01 10:48:13.548 CET [1535379] DETAIL: "transaction_buffers" must be a multiple of 16 > > > > Maybe we should resize it to the next multiple of the SLRU_BANK_SIZE > > instead of giving an error? > > Since this is the auto-tuning feature, I think it should use the > previous multiple rather than the next, but yeah, something like that. Okay. > > While I have your attention -- if you could give a look to the 0001 > patch I posted, I would appreciate it. > I will look into it. Thanks. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Thu, Feb 1, 2024 at 4:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Feb 1, 2024 at 3:44 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Okay. > > > > While I have your attention -- if you could give a look to the 0001 > > patch I posted, I would appreciate it. > > > > I will look into it. Thanks. Some quick observations, Do we need below two write barriers at the end of the function? because the next instruction is separated by the function boundary @@ -766,14 +766,11 @@ StartupCLOG(void) .. - XactCtl->shared->latest_page_number = pageno; - - LWLockRelease(XactSLRULock); + pg_atomic_init_u64(&XactCtl->shared->latest_page_number, pageno); + pg_write_barrier(); } /* * Initialize member's idea of the latest page number. */ pageno = MXOffsetToMemberPage(offset); - MultiXactMemberCtl->shared->latest_page_number = pageno; + pg_atomic_init_u64(&MultiXactMemberCtl->shared->latest_page_number, + pageno); + + pg_write_barrier(); } I am looking more into this from the concurrency point of view and will update you soon. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Thu, Feb 1, 2024 at 4:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Feb 1, 2024 at 4:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Thu, Feb 1, 2024 at 3:44 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > Okay. > > > > > > While I have your attention -- if you could give a look to the 0001 > > > patch I posted, I would appreciate it. > > > > > > > I will look into it. Thanks. > > Some quick observations, > > Do we need below two write barriers at the end of the function? > because the next instruction is separated by the function boundary > > @@ -766,14 +766,11 @@ StartupCLOG(void) > .. > - XactCtl->shared->latest_page_number = pageno; > - > - LWLockRelease(XactSLRULock); > + pg_atomic_init_u64(&XactCtl->shared->latest_page_number, pageno); > + pg_write_barrier(); > } > > /* > * Initialize member's idea of the latest page number. > */ > pageno = MXOffsetToMemberPage(offset); > - MultiXactMemberCtl->shared->latest_page_number = pageno; > + pg_atomic_init_u64(&MultiXactMemberCtl->shared->latest_page_number, > + pageno); > + > + pg_write_barrier(); > } > I have checked the patch and it looks fine to me, other than the above question related to memory barrier usage. One more question about the same: basically, the two instances below, 1 and 2, look similar, but in 1 you are not using the memory write_barrier whereas in 2 you are using the write_barrier; why is that? I mean, why can the reordering not happen in 1 while it may happen in 2? 1. + pg_atomic_write_u64(&CommitTsCtl->shared->latest_page_number, + trunc->pageno); SimpleLruTruncate(CommitTsCtl, trunc->pageno); vs 2. - shared->latest_page_number = pageno; + pg_atomic_write_u64(&shared->latest_page_number, pageno); + pg_write_barrier(); /* update the stats counter of zeroed pages */ pgstat_count_slru_page_zeroed(shared->slru_stats_idx); -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
On 2024-Feb-02, Dilip Kumar wrote: > I have checked the patch and it looks fine to me other than the above > question related to memory barrier usage one more question about the > same, basically below to instances 1 and 2 look similar but in 1 you > are not using the memory write_barrier whereas in 2 you are using the > write_barrier, why is it so? I mean why the reordering can not happen > in 1 and it may happen in 2? What I was thinking is that there's a lwlock operation just below, which acts as a barrier. But I realized something more important: there are only two places that matter, which are SlruSelectLRUPage and SimpleLruZeroPage. The others are all initialization code that run at a point where there's no going to be any concurrency in SLRU access, so we don't need barriers anyway. In SlruSelectLRUPage we definitely don't want to evict the page that SimpleLruZeroPage has initialized, starting from the point where it returns that new page to its caller. But if you consider the code of those two routines, you realize that the only time an equality between latest_page_number and "this_page_number" is going to occur, is when both pages are in the same bank ... and both routines are required to be holding the bank lock while they run, so in practice this is never a problem. We need the atomic write and atomic read so that multiple processes processing pages in different banks can update latest_page_number simultaneously. But the equality condition that we're looking for? it can never happen concurrently. In other words, these barriers are fully useless. (We also have SimpleLruTruncate, but I think it's not as critical to have a barrier there anyhow: accessing a slightly outdated page number could only be a problem if a bug elsewhere causes us to try to truncate in the current page. I think we only have this code there because we did have such bugs in the past, but IIUC this shouldn't happen anymore.) -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
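To make the argument concrete, the two sites that matter reduce to roughly the following (a simplified sketch; variable and field names follow slru.c and the 0001 patch, but this is not the literal diff):

    /* SimpleLruZeroPage(), called with the bank lock for 'pageno' held: */
    pg_atomic_write_u64(&shared->latest_page_number, pageno);

    /*
     * SlruSelectLRUPage(), also called with the bank lock held.  The only
     * slots it can compare against are in the same bank, so a plain atomic
     * read is enough and no memory barrier is needed.
     */
    if (shared->page_number[slotno] ==
        pg_atomic_read_u64(&shared->latest_page_number))
        continue;               /* never evict the latest page */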
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
In short, I propose the attached. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
Sorry, brown paper bag bug there. Here's the correct one. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "I can't go to a restaurant and order food because I keep looking at the fonts on the menu. Five minutes later I realize that it's also talking about food" (Donald Knuth)
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
"Andrey M. Borodin"
Date:
> On 4 Feb 2024, at 18:38, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > In other words, these barriers are fully useless. +1. I've tried to understand the ideas behind the barriers, but latest_page_number is a heuristic that does not need any guarantees at all. It's also used in a safety check which can fire only when everything is already broken beyond any repair. (Though using atomic access seems a good idea anyway.) This patch uses the wording "banks" in comments before banks start to exist. But as far as I understand, it is expected to be committed before the "banks" patch. Besides this, the patch looks good to me. Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Sun, Feb 4, 2024 at 7:10 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2024-Feb-02, Dilip Kumar wrote: > > > I have checked the patch and it looks fine to me other than the above > > question related to memory barrier usage one more question about the > > same, basically below to instances 1 and 2 look similar but in 1 you > > are not using the memory write_barrier whereas in 2 you are using the > > write_barrier, why is it so? I mean why the reordering can not happen > > in 1 and it may happen in 2? > > What I was thinking is that there's a lwlock operation just below, which > acts as a barrier. But I realized something more important: there are > only two places that matter, which are SlruSelectLRUPage and > SimpleLruZeroPage. The others are all initialization code that run at a > point where there's no going to be any concurrency in SLRU access, so we > don't need barriers anyway. In SlruSelectLRUPage we definitely don't > want to evict the page that SimpleLruZeroPage has initialized, starting > from the point where it returns that new page to its caller. > But if you consider the code of those two routines, you realize that the > only time an equality between latest_page_number and "this_page_number" > is going to occur, is when both pages are in the same bank ... and both > routines are required to be holding the bank lock while they run, so in > practice this is never a problem. Right; in fact, when I first converted this 'latest_page_number' to an atomic, the thinking was to protect it from concurrent setting of the value in SimpleLruZeroPage(), and also so that a concurrent read in SlruSelectLRUPage() would not see a corrupted value. All other usages were during the initialization phase, where we do not need any protection. > > We need the atomic write and atomic read so that multiple processes > processing pages in different banks can update latest_page_number > simultaneously. But the equality condition that we're looking for? > it can never happen concurrently. Yeah, that's right; after you pointed it out I also realized that the case is protected by the bank lock. Earlier I didn't think about this case. > In other words, these barriers are fully useless. > > (We also have SimpleLruTruncate, but I think it's not as critical to > have a barrier there anyhow: accessing a slightly outdated page number > could only be a problem if a bug elsewhere causes us to try to truncate > in the current page. I think we only have this code there because we > did have such bugs in the past, but IIUC this shouldn't happen anymore.) +1, I agree with this theory in general. But the below comment in SimpleLruTruncate in your v3 patch doesn't seem correct: here we are checking whether the latest_page_number is smaller than the cutoff, and if so we log it as wraparound and skip the whole thing. That is fine even if we are reading an atomic variable, and a slightly outdated value should not be a problem; but the comment claims that this is safe because we hold the same bank lock as SimpleLruZeroPage(), and that's not true: here we will be acquiring different bank locks one by one based on which slotno we are checking. Am I missing something? + * An important safety check: the current endpoint page must not be + * eligible for removal. Like SlruSelectLRUPage, we don't need a + * memory barrier here because for the affected page to be relevant, + * we'd have to have the same bank lock as SimpleLruZeroPage.
*/ - if (ctl->PagePrecedes(shared->latest_page_number, cutoffPage)) + if (ctl->PagePrecedes(pg_atomic_read_u64(&shared->latest_page_number), + cutoffPage)) -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
On 2024-Feb-04, Andrey M. Borodin wrote: > This patch uses wording "banks" in comments before banks start to > exist. But as far as I understand, it is expected to be committed > before "banks" patch. True -- changed that to use ControlLock. > Besides this patch looks good to me. Many thanks for reviewing. On 2024-Feb-05, Dilip Kumar wrote: > > (We also have SimpleLruTruncate, but I think it's not as critical to > > have a barrier there anyhow: accessing a slightly outdated page number > > could only be a problem if a bug elsewhere causes us to try to truncate > > in the current page. I think we only have this code there because we > > did have such bugs in the past, but IIUC this shouldn't happen anymore.) > > +1, I agree with this theory in general. But the below comment in > SimpleLruTrucate in your v3 patch doesn't seem correct, because here > we are checking if the latest_page_number is smaller than the cutoff > if so we log it as wraparound and skip the whole thing and that is > fine even if we are reading with atomic variable and slightly outdated > value should not be a problem but the comment claim that this safe > because we have the same bank lock as SimpleLruZeroPage(), but that's > not true here we will be acquiring different bank locks one by one > based on which slotno we are checking. Am I missing something? I think you're correct. I reworded this comment, so now it says this: /* * An important safety check: the current endpoint page must not be * eligible for removal. This check is just a backstop against wraparound * bugs elsewhere in SLRU handling, so we don't care if we read a slightly * outdated value; therefore we don't add a memory barrier. */ Pushed with those changes. Thank you! Now I'll go rebase the rest of the patch on top. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Having your biases confirmed independently is how scientific progress is made, and hence made our great society what it is today" (Mary Gardiner)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Tue, Feb 6, 2024 at 4:23 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > (We also have SimpleLruTruncate, but I think it's not as critical to > > > have a barrier there anyhow: accessing a slightly outdated page number > > > could only be a problem if a bug elsewhere causes us to try to truncate > > > in the current page. I think we only have this code there because we > > > did have such bugs in the past, but IIUC this shouldn't happen anymore.) > > > > +1, I agree with this theory in general. But the below comment in > > SimpleLruTrucate in your v3 patch doesn't seem correct, because here > > we are checking if the latest_page_number is smaller than the cutoff > > if so we log it as wraparound and skip the whole thing and that is > > fine even if we are reading with atomic variable and slightly outdated > > value should not be a problem but the comment claim that this safe > > because we have the same bank lock as SimpleLruZeroPage(), but that's > > not true here we will be acquiring different bank locks one by one > > based on which slotno we are checking. Am I missing something? > > I think you're correct. I reworded this comment, so now it says this: > > /* > * An important safety check: the current endpoint page must not be > * eligible for removal. This check is just a backstop against wraparound > * bugs elsewhere in SLRU handling, so we don't care if we read a slightly > * outdated value; therefore we don't add a memory barrier. > */ > > Pushed with those changes. Thank you! Yeah, this looks perfect, thanks. > Now I'll go rebase the rest of the patch on top. Okay, I will review and test after that. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
Here's the rest of it rebased on top of current master. I think it makes sense to have this as one individual commit. I made CLOGShmemBuffers, CommitTsShmemBuffers and SUBTRANSShmemBuffers compute a number that's multiple of SLRU_BANK_SIZE. But it's a crock, because we don't have that macro at that point, so I just used constant 16. Obviously need a better solution for this. I also changed the location of bank_mask in SlruCtlData for better packing, as advised by pahole; and renamed SLRU_SLOTNO_GET_BANKLOCKNO() to SlotGetBankNumber(). Some very critical comments still need to be updated to the new design, particularly anything that mentions "control lock"; but also the overall model needs to be explained in some central location, rather than incongruently some pieces here and other pieces there. I'll see about this later. But at least this is code you should be able to play with. I've been wondering whether we should add a "slru" to the name of the GUCs: commit_timestamp_slru_buffers transaction_slru_buffers etc -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "Aprender sin pensar es inútil; pensar sin aprender, peligroso" (Confucio)
Attachment
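For readers following along, the slot/bank arithmetic mentioned above boils down to something like the sketch below. It assumes banks are contiguous runs of SLRU_BANK_SIZE slots and that pages map onto banks by page number; the field names (bank_locks, num_slots) mirror slru.h and the patch as I understand them, but may not match exactly. The posted patch keeps a precomputed bank_mask in SlruCtlData for this purpose; the modulo below is just the simplest way to express the same mapping.

    #define SLRU_BANK_SIZE  16

    /* Bank that owns a given buffer slot. */
    static inline int
    SlotGetBankNumber(int slotno)
    {
        return slotno / SLRU_BANK_SIZE;
    }

    /* LWLock protecting the bank in which a given page may reside. */
    static inline LWLock *
    SimpleLruGetBankLock(SlruCtl ctl, int64 pageno)
    {
        int     nbanks = ctl->shared->num_slots / SLRU_BANK_SIZE;
        int     bankno = pageno % nbanks;

        return &(ctl->shared->bank_locks[bankno].lock);
    }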
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Tue, Feb 6, 2024 at 8:55 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > Here's the rest of it rebased on top of current master. I think it > makes sense to have this as one individual commit. > > I made CLOGShmemBuffers, CommitTsShmemBuffers and SUBTRANSShmemBuffers > compute a number that's multiple of SLRU_BANK_SIZE. But it's a crock, > because we don't have that macro at that point, so I just used constant > 16. Obviously need a better solution for this. If we define SLRU_BANK_SIZE in slru.h then we can use it here, right? Because these files include slru.h anyway. > > I also changed the location of bank_mask in SlruCtlData for better > packing, as advised by pahole; and renamed SLRU_SLOTNO_GET_BANKLOCKNO() > to SlotGetBankNumber(). Okay. > Some very critical comments still need to be updated to the new design, > particularly anything that mentions "control lock"; but also the overall > model needs to be explained in some central location, rather than > incongruently some pieces here and other pieces there. I'll see about > this later. But at least this is code you should be able to play with. Okay, I will review and test this. > I've been wondering whether we should add a "slru" to the name of the > GUCs: > > commit_timestamp_slru_buffers > transaction_slru_buffers > etc I am not sure we are exposing anything related to SLRU to the user. I mean, transaction_buffers should make sense to the user in that it stores transaction-related data in some buffer pool, but whether that buffer pool is called SLRU or not doesn't matter much to the user IMHO. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
"Andrey M. Borodin"
Date:
> On 7 Feb 2024, at 10:58, Dilip Kumar <dilipbalaut@gmail.com> wrote: > >> commit_timestamp_slru_buffers >> transaction_slru_buffers >> etc > > I am not sure we are exposing anything related to SLRU to the user, I think we already tell the user something about SLRU. I’d rather consider whether “transaction_slru_buffers” is easier to understand than “transaction_buffers”... IMO transaction_buffers is clearer. But I do not have a strong opinion. > I > mean transaction_buffers should make sense for the user that it stores > transaction-related data in some buffers pool but whether that buffer > pool is called SLRU or not doesn't matter much to the user IMHO. +1 Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
On 2024-Feb-07, Dilip Kumar wrote: > On Tue, Feb 6, 2024 at 8:55 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > I made CLOGShmemBuffers, CommitTsShmemBuffers and SUBTRANSShmemBuffers > > compute a number that's multiple of SLRU_BANK_SIZE. But it's a crock, > > because we don't have that macro at that point, so I just used constant > > 16. Obviously need a better solution for this. > > If we define SLRU_BANK_SIZE in slur.h then we can use it here right, > because these files are anyway include slur.h so. Sure, but is that really what we want? > > I've been wondering whether we should add a "slru" to the name of the > > GUCs: > > > > commit_timestamp_slru_buffers > > transaction_slru_buffers > > etc > > I am not sure we are exposing anything related to SLRU to the user, We do -- we have pg_stat_slru already. > I mean transaction_buffers should make sense for the user that it > stores transaction-related data in some buffers pool but whether that > buffer pool is called SLRU or not doesn't matter much to the user > IMHO. Yeah, that's exactly what my initial argument was for naming these this way. But since the term slru already escaped into the wild via the pg_stat_slru view, perhaps it helps users make the connection between these things. Alternatively, we can cross-reference each term from the other's documentation and call it a day. Another painful point is that pg_stat_slru uses internal names in the data it outputs, which obviously do not match the new GUCs. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "Uno puede defenderse de los ataques; contra los elogios se esta indefenso"
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Wed, Feb 7, 2024 at 3:49 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2024-Feb-07, Dilip Kumar wrote: > > > On Tue, Feb 6, 2024 at 8:55 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > > > I made CLOGShmemBuffers, CommitTsShmemBuffers and SUBTRANSShmemBuffers > > > compute a number that's multiple of SLRU_BANK_SIZE. But it's a crock, > > > because we don't have that macro at that point, so I just used constant > > > 16. Obviously need a better solution for this. > > > > If we define SLRU_BANK_SIZE in slur.h then we can use it here right, > > because these files are anyway include slur.h so. > > Sure, but is that really what we want? So your question is do we want these buffers to be in multiple of SLRU_BANK_SIZE? Maybe we can have the last bank to be partial, I don't think it should create any problem logically. I mean we can look again in the patch to see if we have made any such assumptions but that should be fairly easy to fix, then maybe if we are going in this way we should get rid of the check_slru_buffers() function as well. > > > I've been wondering whether we should add a "slru" to the name of the > > > GUCs: > > > > > > commit_timestamp_slru_buffers > > > transaction_slru_buffers > > > etc > > > > I am not sure we are exposing anything related to SLRU to the user, > > We do -- we have pg_stat_slru already. > > > I mean transaction_buffers should make sense for the user that it > > stores transaction-related data in some buffers pool but whether that > > buffer pool is called SLRU or not doesn't matter much to the user > > IMHO. > > Yeah, that's exactly what my initial argument was for naming these this > way. But since the term slru already escaped into the wild via the > pg_stat_slru view, perhaps it helps users make the connection between > these things. Alternatively, we can cross-reference each term from the > other's documentation and call it a day. Yeah, that's true I forgot this point about the pg_stat_slru, from this pov if the configuration has the name slru they would be able to make a better connection with the configured value, and the stats in this view based on that they can take call if they need to somehow increase the size of these slru buffers. > Another painful point is that pg_stat_slru uses internal names in the > data it outputs, which obviously do not match the new GUCs. Yeah, that's true, but I think this could be explained somewhere not sure what is the right place for this. FYI, I have also repeated all the performance tests I performed in my first email[1], and I am seeing a similar gain. Some comments on v18 in my first pass of the review. 1. @@ -665,7 +765,7 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn) lsnindex = GetLSNIndex(slotno, xid); *lsn = XactCtl->shared->group_lsn[lsnindex]; - LWLockRelease(XactSLRULock); + LWLockRelease(SimpleLruGetBankLock(XactCtl, pageno)); Maybe here we can add an assert before releasing the lock for a safety check Assert(LWLockHeldByMe(SimpleLruGetBankLock(XactCtl, pageno))); 2. + * + * XXX could we make the LSNs to be bank-based? */ XLogRecPtr *group_lsn; IMHO, the flush still happens at the page level so up to which LSN should be flush before flushing the particular clog page should also be maintained at the page level. [1] https://www.postgresql.org/message-id/CAFiTN-vzDvNz%3DExGXz6gdyjtzGixKSqs0mKHMmaQ8sOSEFZ33A%40mail.gmail.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
On 2024-Feb-07, Dilip Kumar wrote: > On Wed, Feb 7, 2024 at 3:49 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > Sure, but is that really what we want? > > So your question is do we want these buffers to be in multiple of > SLRU_BANK_SIZE? Maybe we can have the last bank to be partial, I > don't think it should create any problem logically. I mean we can > look again in the patch to see if we have made any such assumptions > but that should be fairly easy to fix, then maybe if we are going in > this way we should get rid of the check_slru_buffers() function as > well. Not really, I just don't think the macro should be in slru.h. Another thing I've been thinking is that perhaps it would be useful to make the banks smaller, when the total number of buffers is small. For example, if you have 16 or 32 buffers, it's not really clear to me that it makes sense to have just 1 bank or 2 banks. It might be more sensible to have 4 banks with 4 or 8 buffers instead. That should make the algorithm scale down as well as up ... I haven't done either of those things in the attached v19 version. I did go over the comments once again and rewrote the parts I was unhappy with, including some existing ones. I think it's OK now from that point of view ... at some point I thought about creating a separate README, but in the end I thought it not necessary. I did add a bunch of Assert()s to make sure the locks that are supposed to be held are actually held. This led me to testing the page status to be not EMPTY during SimpleLruWriteAll() before calling SlruInternalWritePage(), because the assert was firing. The previous code is not really *buggy*, but to me it's weird to call WritePage() on a slot with no contents. Another change was in TransactionGroupUpdateXidStatus: the original code had the leader doing pg_atomic_read_u32(&procglobal->clogGroupFirst) to know which bank to lock. I changed it to simply be the page used by the leader process; this doesn't need an atomic read, and should be the same page anyway. (If it isn't, it's no big deal). But what's more: even if we do read ->clogGroupFirst at that point, there's no guarantee that this is going to be exactly for the same process that ends up being the first in the list, because since we have not set it to INVALID by the time we grab the bank lock, it is quite possible for more processes to add themselves to the list. I realized all this while rewriting the comments in a way that would let me understand what was going on ... so IMO the effort was worthwhile. Anyway, what I send now should be pretty much final, modulo the change to the check_slru_buffers routines and documentation additions to match pg_stat_slru to the new GUC names. > > Another painful point is that pg_stat_slru uses internal names in the > > data it outputs, which obviously do not match the new GUCs. > > Yeah, that's true, but I think this could be explained somewhere not > sure what is the right place for this. Or we can change those names in the view. > FYI, I have also repeated all the performance tests I performed in my > first email[1], and I am seeing a similar gain. Excellent, thanks for doing that. > Some comments on v18 in my first pass of the review. > > 1. 
> @@ -665,7 +765,7 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn) > lsnindex = GetLSNIndex(slotno, xid); > *lsn = XactCtl->shared->group_lsn[lsnindex]; > > - LWLockRelease(XactSLRULock); > + LWLockRelease(SimpleLruGetBankLock(XactCtl, pageno)); > > Maybe here we can add an assert before releasing the lock for a safety check > > Assert(LWLockHeldByMe(SimpleLruGetBankLock(XactCtl, pageno))); Hmm, I think this would just throw a warning or error "you don't hold such lwlock", so it doesn't seem necessary. > 2. > + * > + * XXX could we make the LSNs to be bank-based? > */ > XLogRecPtr *group_lsn; > > IMHO, the flush still happens at the page level so up to which LSN > should be flush before flushing the particular clog page should also > be maintained at the page level. Yeah, this was a misguided thought, nevermind. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ <Schwern> It does it in a really, really complicated way <crab> why does it need to be complicated? <Schwern> Because it's MakeMaker.
Attachment
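The TransactionGroupUpdateXidStatus change described above amounts to the leader locking the bank of its own page rather than re-reading the group list head. A sketch (clogGroupMemberPage is the existing PGPROC field; the surrounding group-walk code is omitted):

    LWLock     *lock;

    /*
     * We are the group leader.  The bank we need is the one containing our
     * own page; re-reading ProcGlobal->clogGroupFirst here would be
     * pointless, since more processes may keep adding themselves to the
     * list until we close it out under the bank lock.
     */
    lock = SimpleLruGetBankLock(XactCtl, MyProc->clogGroupMemberPage);
    LWLockAcquire(lock, LW_EXCLUSIVE);

    /* ... walk the group members and update each XID's status ... */

    LWLockRelease(lock);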
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Fri, Feb 23, 2024 at 1:48 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2024-Feb-07, Dilip Kumar wrote: > > > On Wed, Feb 7, 2024 at 3:49 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > Sure, but is that really what we want? > > > > So your question is do we want these buffers to be in multiple of > > SLRU_BANK_SIZE? Maybe we can have the last bank to be partial, I > > don't think it should create any problem logically. I mean we can > > look again in the patch to see if we have made any such assumptions > > but that should be fairly easy to fix, then maybe if we are going in > > this way we should get rid of the check_slru_buffers() function as > > well. > > Not really, I just don't think the macro should be in slru.h. Okay > Another thing I've been thinking is that perhaps it would be useful to > make the banks smaller, when the total number of buffers is small. For > example, if you have 16 or 32 buffers, it's not really clear to me that > it makes sense to have just 1 bank or 2 banks. It might be more > sensible to have 4 banks with 4 or 8 buffers instead. That should make > the algorithm scale down as well as up ... It might be helpful to have small-size banks when SLRU buffers are set to a very low value and we are only accessing a couple of pages at a time (i.e. no buffer replacement) because in such cases most of the contention will be on SLRU Bank lock. Although I am not sure how practical such a use case would be, I mean if someone is using multi-xact very heavily or creating frequent subtransaction overflow then wouldn't they should set this buffer limit to some big enough value? By doing this we would lose some simplicity of the patch I mean instead of using the simple macro i.e. SLRU_BANK_SIZE we would need to compute this and store it in SlruShared. Maybe that's not that bad. > > I haven't done either of those things in the attached v19 version. I > did go over the comments once again and rewrote the parts I was unhappy > with, including some existing ones. I think it's OK now from that point > of view ... at some point I thought about creating a separate README, > but in the end I thought it not necessary. Thanks, I will review those changes. > I did add a bunch of Assert()s to make sure the locks that are supposed > to be held are actually held. This led me to testing the page status to > be not EMPTY during SimpleLruWriteAll() before calling > SlruInternalWritePage(), because the assert was firing. The previous > code is not really *buggy*, but to me it's weird to call WritePage() on > a slot with no contents. Okay, I mean internally SlruInternalWritePage() will flush only if the status is SLRU_PAGE_VALID, but it is better the way you have done. > Another change was in TransactionGroupUpdateXidStatus: the original code > had the leader doing pg_atomic_read_u32(&procglobal->clogGroupFirst) to > know which bank to lock. I changed it to simply be the page used by the > leader process; this doesn't need an atomic read, and should be the same > page anyway. (If it isn't, it's no big deal). But what's more: even if > we do read ->clogGroupFirst at that point, there's no guarantee that > this is going to be exactly for the same process that ends up being the > first in the list, because since we have not set it to INVALID by the > time we grab the bank lock, it is quite possible for more processes to > add themselves to the list. 
Yeah, this looks better > I realized all this while rewriting the comments in a way that would let > me understand what was going on ... so IMO the effort was worthwhile. +1 I will review and do some more testing early next week and share my feedback. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
"Andrey M. Borodin"
Date:
> On 23 Feb 2024, at 12:36, Dilip Kumar <dilipbalaut@gmail.com> wrote: > >> Another thing I've been thinking is that perhaps it would be useful to >> make the banks smaller, when the total number of buffers is small. For >> example, if you have 16 or 32 buffers, it's not really clear to me that >> it makes sense to have just 1 bank or 2 banks. It might be more >> sensible to have 4 banks with 4 or 8 buffers instead. That should make >> the algorithm scale down as well as up ... > > It might be helpful to have small-size banks when SLRU buffers are set > to a very low value and we are only accessing a couple of pages at a > time (i.e. no buffer replacement) because in such cases most of the > contention will be on SLRU Bank lock. Although I am not sure how > practical such a use case would be, I mean if someone is using > multi-xact very heavily or creating frequent subtransaction overflow > then wouldn't they should set this buffer limit to some big enough > value? By doing this we would lose some simplicity of the patch I > mean instead of using the simple macro i.e. SLRU_BANK_SIZE we would > need to compute this and store it in SlruShared. Maybe that's not that > bad. I'm sure anyone with multiple CPUs should increase, not decrease, the previous default of 128 buffers (with 512MB shared buffers). Having more CPUs (the only way to benefit from more locks) implies bigger transaction buffers. IMO making the bank size variable adds unneeded computation overhead; bank search loops should be unrollable by the compiler, etc. Originally there was a patch set step that packed a bank's page addresses together in one array. It was done to make the bank search a SIMD instruction. Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Fri, Feb 23, 2024 at 1:06 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Fri, Feb 23, 2024 at 1:48 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > On 2024-Feb-07, Dilip Kumar wrote: > > > > > On Wed, Feb 7, 2024 at 3:49 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > > > Sure, but is that really what we want? > > > > > > So your question is do we want these buffers to be in multiple of > > > SLRU_BANK_SIZE? Maybe we can have the last bank to be partial, I > > > don't think it should create any problem logically. I mean we can > > > look again in the patch to see if we have made any such assumptions > > > but that should be fairly easy to fix, then maybe if we are going in > > > this way we should get rid of the check_slru_buffers() function as > > > well. > > > > Not really, I just don't think the macro should be in slru.h. > > Okay > > > Another thing I've been thinking is that perhaps it would be useful to > > make the banks smaller, when the total number of buffers is small. For > > example, if you have 16 or 32 buffers, it's not really clear to me that > > it makes sense to have just 1 bank or 2 banks. It might be more > > sensible to have 4 banks with 4 or 8 buffers instead. That should make > > the algorithm scale down as well as up ... > > It might be helpful to have small-size banks when SLRU buffers are set > to a very low value and we are only accessing a couple of pages at a > time (i.e. no buffer replacement) because in such cases most of the > contention will be on SLRU Bank lock. Although I am not sure how > practical such a use case would be, I mean if someone is using > multi-xact very heavily or creating frequent subtransaction overflow > then wouldn't they should set this buffer limit to some big enough > value? By doing this we would lose some simplicity of the patch I > mean instead of using the simple macro i.e. SLRU_BANK_SIZE we would > need to compute this and store it in SlruShared. Maybe that's not that > bad. > > > > > I haven't done either of those things in the attached v19 version. I > > did go over the comments once again and rewrote the parts I was unhappy > > with, including some existing ones. I think it's OK now from that point > > of view ... at some point I thought about creating a separate README, > > but in the end I thought it not necessary. > > Thanks, I will review those changes. Few other things I noticed while reading through the patch, I haven't read it completely yet but this is what I got for now. 1. + * If no process is already in the list, we're the leader; our first step + * is to "close out the group" by resetting the list pointer from + * ProcGlobal->clogGroupFirst (this lets other processes set up other + * groups later); then we lock the SLRU bank corresponding to our group's + * page, do the SLRU updates, release the SLRU bank lock, and wake up the + * sleeping processes. I think here we are saying that we "close out the group" before acquiring the SLRU lock but that's not true. We keep the group open until we gets the lock so that we can get maximum members in while we are anyway waiting for the lock. 2. static void TransactionIdSetCommitTs(TransactionId xid, TimestampTz ts, RepOriginId nodeid, int slotno) { - Assert(TransactionIdIsNormal(xid)); + if (!TransactionIdIsNormal(xid)) + return; + + entryno = TransactionIdToCTsEntry(xid); I do not understand why we need this change. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
On 2024-Feb-23, Andrey M. Borodin wrote: > I'm sure anyone with multiple CPUs should increase, not decrease > previous default of 128 buffers (with 512MB shared buffers). Having > more CPUs (the only way to benefit from more locks) implies bigger > transaction buffers. Sure. > IMO making bank size variable adds unneeded computation overhead, bank > search loops should be unrollable by compiler etc. Makes sense. > Originally there was a patch set step, that packed bank's page > addresses together in one array. It was done to make bank search a > SIMD instruction. Ants Aasma had proposed a rework of the LRU code for better performance. He told me it depended on bank size being 16, so you're right that it's probably not a good idea to make it variable. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
On 2024-Feb-23, Dilip Kumar wrote: > 1. > + * If no process is already in the list, we're the leader; our first step > + * is to "close out the group" by resetting the list pointer from > + * ProcGlobal->clogGroupFirst (this lets other processes set up other > + * groups later); then we lock the SLRU bank corresponding to our group's > + * page, do the SLRU updates, release the SLRU bank lock, and wake up the > + * sleeping processes. > > I think here we are saying that we "close out the group" before > acquiring the SLRU lock but that's not true. We keep the group open > until we gets the lock so that we can get maximum members in while we > are anyway waiting for the lock. Absolutely right. Reworded that. > 2. > static void > TransactionIdSetCommitTs(TransactionId xid, TimestampTz ts, > RepOriginId nodeid, int slotno) > { > - Assert(TransactionIdIsNormal(xid)); > + if (!TransactionIdIsNormal(xid)) > + return; > + > + entryno = TransactionIdToCTsEntry(xid); > > I do not understand why we need this change. Ah yeah, I was bothered by the fact that if you pass Xid values earlier than NormalXid to this function, we'd reply with some nonsensical values instead of throwing an error. But you're right that it doesn't belong in this patch, so I removed that. Here's a version with these fixes, where I also added some text to the pg_stat_slru documentation: + <para> + For each <literal>SLRU</literal> area that's part of the core server, + there is a configuration parameter that controls its size, with the suffix + <literal>_buffers</literal> appended. For historical + reasons, the names are not exact matches, but <literal>Xact</literal> + corresponds to <literal>transaction_buffers</literal> and the rest should + be obvious. + <!-- Should we edit pgstat_internal.h::slru_names so that the "name" matches + the GUC name?? --> + </para> I think I would like to suggest renaming the GUCs to have the _slru_ bit in the middle: +# - SLRU Buffers (change requires restart) - + +#commit_timestamp_slru_buffers = 0 # memory for pg_commit_ts (0 = auto) +#multixact_offsets_slru_buffers = 16 # memory for pg_multixact/offsets +#multixact_members_slru_buffers = 32 # memory for pg_multixact/members +#notify_slru_buffers = 16 # memory for pg_notify +#serializable_slru_buffers = 32 # memory for pg_serial +#subtransaction_slru_buffers = 0 # memory for pg_subtrans (0 = auto) +#transaction_slru_buffers = 0 # memory for pg_xact (0 = auto) and the pgstat_internal.h table: static const char *const slru_names[] = { "commit_timestamp", "multixact_members", "multixact_offsets", "notify", "serializable", "subtransaction", "transaction", "other" /* has to be last */ }; This way they match perfectly. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "All rings of power are equal, But some rings of power are more equal than others." (George Orwell's The Lord of the Rings)
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Mon, Feb 26, 2024 at 9:46 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2024-Feb-23, Dilip Kumar wrote:
> + <para>
> + For each <literal>SLRU</literal> area that's part of the core server,
> + there is a configuration parameter that controls its size, with the suffix
> + <literal>_buffers</literal> appended. For historical
> + reasons, the names are not exact matches, but <literal>Xact</literal>
> + corresponds to <literal>transaction_buffers</literal> and the rest should
> + be obvious.
> + <!-- Should we edit pgstat_internal.h::slru_names so that the "name" matches
> + the GUC name?? -->
> + </para>
> I think I would like to suggest renaming the GUCs to have the _slru_ bit
> in the middle:
> +# - SLRU Buffers (change requires restart) -
> +
> +#commit_timestamp_slru_buffers = 0 # memory for pg_commit_ts (0 = auto)
> +#multixact_offsets_slru_buffers = 16 # memory for pg_multixact/offsets
> +#multixact_members_slru_buffers = 32 # memory for pg_multixact/members
> +#notify_slru_buffers = 16 # memory for pg_notify
> +#serializable_slru_buffers = 32 # memory for pg_serial
> +#subtransaction_slru_buffers = 0 # memory for pg_subtrans (0 = auto)
> +#transaction_slru_buffers = 0 # memory for pg_xact (0 = auto)
> and the pgstat_internal.h table:
> static const char *const slru_names[] = {
> "commit_timestamp",
> "multixact_members",
> "multixact_offsets",
> "notify",
> "serializable",
> "subtransaction",
> "transaction",
> "other" /* has to be last */
> };
> This way they match perfectly.
Yeah, I think this looks fine to me.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
On 2024-Feb-27, Dilip Kumar wrote: > > static const char *const slru_names[] = { > > "commit_timestamp", > > "multixact_members", > > "multixact_offsets", > > "notify", > > "serializable", > > "subtransaction", > > "transaction", > > "other" /* has to be last > > */ > > }; Here's a patch for the renaming part. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "No nos atrevemos a muchas cosas porque son difíciles, pero son difíciles porque no nos atrevemos a hacerlas" (Séneca)
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
"Andrey M. Borodin"
Date:
> On 27 Feb 2024, at 21:03, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2024-Feb-27, Dilip Kumar wrote: > >>> static const char *const slru_names[] = { >>> "commit_timestamp", >>> "multixact_members", >>> "multixact_offsets", >>> "notify", >>> "serializable", >>> "subtransaction", >>> "transaction", >>> "other" /* has to be last >>> */ >>> }; > > Here's a patch for the renaming part. Sorry for the late reply; I have one nit. Are you sure that multixact_members and multixact_offsets should be plural, while transaction and commit_timestamp are singular? Maybe multixact_members and multixact_offset? Because there are many members and one offset for a given multixact? Users certainly do not care, though... Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
On 2024-Feb-27, Andrey M. Borodin wrote: > Sorry for the late reply, I have one nit. Are you sure that > multixact_members and multixact_offsets are plural, while transaction > and commit_timestamp are singular? > Maybe multixact_members and multixact_offset? Because there are many > members and one offset for a givent multixact? Users certainly do not > care, thought... I asked myself the same question actually, and thought about putting them both in the singular. I only backed off because I noticed that the directories themselves are in plural (an old mistake of mine, evidently). Maybe we should follow that instinct and use the singular for these. If we do that, we can rename the directories to also appear in singular when/if the patch to add standard page headers to the SLRUs lands -- which is going to need code to rewrite the files during pg_upgrade anyway, so the rename is not going to be a big deal. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "Crear es tan difícil como ser libre" (Elsa Triolet)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
Here's the complete set, with these two names using the singular. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Uno puede defenderse de los ataques; contra los elogios se esta indefenso"
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
"Andrey M. Borodin"
Date:
> On 27 Feb 2024, at 22:33, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > <v21-0001-Rename-SLRU-elements-in-pg_stat_slru.patch><v21-0002-Make-SLRU-buffer-sizes-configurable.patch> These patches look amazing! Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
On 2024-Feb-27, Alvaro Herrera wrote: > Here's the complete set, with these two names using the singular. BTW one thing I had not noticed is that before this patch we have a minimum shmem size that's lower than the lowest you can go with the new code. This means Postgres may no longer start under extremely tight memory restrictions (and of course it uses more memory even when idle or with small databases). I wonder to what extent we should make an effort to relax that. For small, largely inactive servers, this is just memory we use for no good reason. However, anything we do here will impact performance on the high end, because as Andrey says this will add calculations and jumps where there are none today. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "We’ve narrowed the problem down to the customer’s pants being in a situation of vigorous combustion" (Robert Haas, Postgres expert extraordinaire)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Tue, Feb 27, 2024 at 11:41 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2024-Feb-27, Alvaro Herrera wrote:
> > Here's the complete set, with these two names using the singular.
> BTW one thing I had not noticed is that before this patch we have
> minimum shmem size that's lower than the lowest you can go with the new
> code.
> This means Postgres may no longer start when extremely tight memory
> restrictions (and of course use more memory even when idle or with small
> databases). I wonder to what extent should we make an effort to relax
> that. For small, largely inactive servers, this is just memory we use
> for no good reason. However, anything we do here will impact
> performance on the high end, because as Andrey says this will add
> calculations and jumps where there are none today.
I was just comparing the minimum memory required for SLRU when the system is minimally configured, correct me if I am wrong.
SLRU unpatched patched
commit_timestamp_buffers 4 16
subtransaction_buffers 32 16
transaction_buffers 4 16
multixact_offset_buffers 8 16
multixact_member_buffers 16 16
notify_buffers 8 16
serializable_buffers 16 16
-------------------------------------------------------------------------------------
total buffers 88 112
so that is < 200kB of extra memory on a minimally configured system, IMHO this should not matter.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Kyotaro Horiguchi
Date:
At Tue, 27 Feb 2024 18:33:18 +0100, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in > Here's the complete set, with these two names using the singular. The commit for the second patch added several GUC descriptions: > Sets the size of the dedicated buffer pool used for the commit timestamp cache. Some of them (commit_timestamp_buffers, transaction_buffers, subtransaction_buffers) use 0 to mean auto-tuning based on the shared_buffers size. I think it's worth adding an extra_desc such as "0 to automatically determine this value based on the shared buffer size". regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
On 2024-Feb-29, Kyotaro Horiguchi wrote: > At Tue, 27 Feb 2024 18:33:18 +0100, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in > > Here's the complete set, with these two names using the singular. > > The commit by the second patch added several GUC descriptions: > > > Sets the size of the dedicated buffer pool used for the commit timestamp cache. > > Some of them, commit_timestamp_buffers, transaction_buffers, > subtransaction_buffers use 0 to mean auto-tuning based on > shared-buffer size. I think it's worth adding an extra_desc such as "0 > to automatically determine this value based on the shared buffer > size". How about this? -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "La victoria es para quien se atreve a estar solo"
Attachment
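The attached patch is not reproduced inline here; purely as an illustration of where such an extra_desc would live, a guc_tables.c-style entry might look roughly like this. The field layout follows the usual config_int convention; the upper bound, the GUC_UNIT_BLOCKS flag and the NULL hook slots are assumptions for the sketch, not the committed values:

/*
 * Illustrative guc_tables.c entry only -- not the committed patch.
 * short_desc is quoted from the thread; extra_desc is Kyotaro's suggested
 * wording; the bounds and NULL hooks are placeholders.
 */
{
	{"commit_timestamp_buffers", PGC_POSTMASTER, RESOURCES_MEM,
		gettext_noop("Sets the size of the dedicated buffer pool used for the commit timestamp cache."),
		gettext_noop("0 to automatically determine this value based on the shared buffer size."),
		GUC_UNIT_BLOCKS
	},
	&commit_timestamp_buffers,
	0, 0, 1024 * 1024,			/* 0 means auto-tune from shared_buffers; upper bound is a placeholder */
	NULL, NULL, NULL
},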
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
On 2024-Feb-29, Alvaro Herrera wrote: > On 2024-Feb-29, Kyotaro Horiguchi wrote: > > Some of them, commit_timestamp_buffers, transaction_buffers, > > subtransaction_buffers use 0 to mean auto-tuning based on > > shared-buffer size. I think it's worth adding an extra_desc such as "0 > > to automatically determine this value based on the shared buffer > > size". > > How about this? Pushed that way, but we can discuss further wording improvements/changes if someone wants to propose any. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "La rebeldía es la virtud original del hombre" (Arthur Schopenhauer)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > Pushed that way, but we can discuss further wording improvements/changes > if someone wants to propose any. I just noticed that drongo is complaining about two lines added by commit 53c2a97a9:
drongo | 2024-03-04 14:34:52 | ../pgsql/src/backend/access/transam/slru.c(436): warning C4047: '!=': 'SlruPageStatus*' differs in levels of indirection from 'int'
drongo | 2024-03-04 14:34:52 | ../pgsql/src/backend/access/transam/slru.c(717): warning C4047: '!=': 'SlruPageStatus*' differs in levels of indirection from 'int'
These lines are
Assert(&shared->page_status[slotno] != SLRU_PAGE_EMPTY);
Assert(&ctl->shared->page_status[slotno] != SLRU_PAGE_EMPTY);
These are comparing the address of something with an enum value, which surely cannot be sane. Is the "&" operator incorrect? It looks like SLRU_PAGE_EMPTY has (by chance, or deliberately) the numeric value of zero, so I guess the majority of our BF animals are understanding this as "address != NULL". But that doesn't look like a useful test to be making. regards, tom lane
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Tom Lane
Date:
I wrote: > It looks like SLRU_PAGE_EMPTY has (by chance, or deliberately) > the numeric value of zero, so I guess the majority of our BF > animals are understanding this as "address != NULL". But that > doesn't look like a useful test to be making. In hopes of noticing whether there are other similar thinkos, I permuted the order of the SlruPageStatus enum values, and now I get the expected warnings from gcc:
In file included from ../../../../src/include/postgres.h:45, from slru.c:59:
slru.c: In function ‘SimpleLruWaitIO’:
slru.c:436:38: warning: comparison between pointer and integer
    Assert(&shared->page_status[slotno] != SLRU_PAGE_EMPTY);
../../../../src/include/c.h:862:9: note: in definition of macro ‘Assert’
    if (!(condition)) \
slru.c: In function ‘SimpleLruWritePage’:
slru.c:717:43: warning: comparison between pointer and integer
    Assert(&ctl->shared->page_status[slotno] != SLRU_PAGE_EMPTY);
../../../../src/include/c.h:862:9: note: in definition of macro ‘Assert’
    if (!(condition)) \
So it looks like it's just these two places. regards, tom lane
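Presumably the fix is simply to drop the stray "&" so that the status value itself is tested; a minimal sketch of what the corrected assertions would look like (surrounding code omitted):

/* slru.c: compare the page status value, not its address */
Assert(shared->page_status[slotno] != SLRU_PAGE_EMPTY);
Assert(ctl->shared->page_status[slotno] != SLRU_PAGE_EMPTY);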
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
On 2024-Mar-04, Tom Lane wrote: > In hopes of noticing whether there are other similar thinkos, > I permuted the order of the SlruPageStatus enum values, and > now I get the expected warnings from gcc: Thanks for checking! I pushed the fixes. Maybe we should assign a nonzero value (= 1) to the first element of enums, to avoid this kind of mistake. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
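A sketch of what that suggestion would mean for this particular enum; the member names follow slru.h, and only the explicit "= 1" is the proposed change:

/*
 * With a nonzero first value, an accidental pointer-vs-enum comparison
 * against SLRU_PAGE_EMPTY can no longer silently behave like "!= NULL".
 */
typedef enum
{
	SLRU_PAGE_EMPTY = 1,		/* buffer is not in use */
	SLRU_PAGE_READ_IN_PROGRESS, /* page is being read in */
	SLRU_PAGE_VALID,			/* page is valid and not being written */
	SLRU_PAGE_WRITE_IN_PROGRESS /* page is being written out */
} SlruPageStatus;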
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alexander Lakhin
Date:
Hello Alvaro, 27.02.2024 20:33, Alvaro Herrera wrote: > Here's the complete set, with these two names using the singular. I've managed to trigger an assert added by 53c2a97a9. Please try the following script against a server compiled with -DTEST_SUMMARIZE_SERIAL (initially I observed this failure without the define, it just simplifies reproducing...):
# initdb & start
...
createdb test
echo "
SELECT pg_current_xact_id() AS tx \gset
SELECT format('CREATE TABLE t%s(i int)', g) FROM generate_series(1, 1022 - :tx) g
\gexec
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT pg_current_xact_id();
SELECT pg_sleep(5);
" | psql test &
echo "
SELECT pg_sleep(1);
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT 1 INTO a;
COMMIT;
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT 2 INTO b;
" | psql test
It fails for me with the following stack trace:
TRAP: failed Assert("LWLockHeldByMeInMode(SimpleLruGetBankLock(ctl, pageno), LW_EXCLUSIVE)"), File: "slru.c", Line: 366, PID: 21711
ExceptionalCondition at assert.c:52:13
SimpleLruZeroPage at slru.c:369:11
SerialAdd at predicate.c:921:20
SummarizeOldestCommittedSxact at predicate.c:1521:2
GetSerializableTransactionSnapshotInt at predicate.c:1787:16
GetSerializableTransactionSnapshot at predicate.c:1691:1
GetTransactionSnapshot at snapmgr.c:264:21
exec_simple_query at postgres.c:1162:4
...
Best regards, Alexander
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
Hello, On 2024-Apr-03, Alexander Lakhin wrote: > I've managed to trigger an assert added by 53c2a97a9. > Please try the following script against a server compiled with > -DTEST_SUMMARIZE_SERIAL (initially I observed this failure without the > define, it just simplifies reproducing...): Ah yes, absolutely, we're failing to switch to the correct SLRU bank lock there. This rewrite of that small piece should fix it. Thanks for reporting this. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Pido que me den el Nobel por razones humanitarias" (Nicanor Parra)
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Wed, Apr 3, 2024 at 7:40 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > Hello, > > On 2024-Apr-03, Alexander Lakhin wrote: > > > I've managed to trigger an assert added by 53c2a97a9. > > Please try the following script against a server compiled with > > -DTEST_SUMMARIZE_SERIAL (initially I observed this failure without the > > define, it just simplifies reproducing...): > > Ah yes, absolutely, we're missing to trade the correct SLRU bank lock > there. This rewrite of that small piece should fix it. Thanks for > reporting this. > Yeah, we missed acquiring the bank lock w.r.t. intervening pages, thanks for reporting. Your fix looks correct to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
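To make the shape of the problem concrete: a hedged sketch (not the committed fix) of how zeroing the intervening pages in SerialAdd() has to take each page's own bank lock, since consecutive pages can fall into different banks. The variable names and the nextPage() page-advance helper are illustrative assumptions:

/*
 * Illustration only: each intervening page may live in a different SLRU
 * bank, so the matching bank lock must be held while that page is zeroed.
 */
for (int64 page = firstZeroPage; page != targetPage; page = nextPage(page))
{
	LWLock	   *banklock = SimpleLruGetBankLock(SerialSlruCtl, page);

	LWLockAcquire(banklock, LW_EXCLUSIVE);
	(void) SimpleLruZeroPage(SerialSlruCtl, page);	/* asserts the bank lock is held */
	LWLockRelease(banklock);
}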
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
On 2024-Apr-03, Dilip Kumar wrote: > Yeah, we missed acquiring the bank lock w.r.t. intervening pages, > thanks for reporting. Your fix looks correct to me. Thanks for the quick review! And thanks to Alexander for the report. Pushed the fix. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "No hay hombre que no aspire a la plenitud, es decir, la suma de experiencias de que un hombre es capaz"