Thread: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
The small size of the SLRU buffer pools can sometimes become a performance problem because it’s not difficult to have a workload where the number of buffers actively in use is larger than the fixed-size buffer pool. However, just increasing the size of the buffer pool doesn’t necessarily help, because the linear search that we use for buffer replacement doesn’t scale, and also because contention on the single centralized lock limits scalability.

A couple of patches have been proposed in the past to address the problem of increasing the buffer pool size. One of them [1], proposed by Thomas Munro, makes the size of the buffer pool configurable. And, in order to deal with the linear search in the larger buffer pool, it divides the SLRU buffer pool into associative banks so that searching in the buffer pool doesn’t get slower as the pool grows. This works well for workloads that are mainly impacted by frequent buffer replacement, but it still doesn’t stand up well for workloads where the centralized control lock is the bottleneck.

So I have taken this patch as my base patch (v1-0001) and added two more improvements on top of it:
1) In v1-0002, instead of a centralized control lock for the SLRU, I have introduced a bank-wise control lock.
2) In v1-0003, I have removed the global LRU counter and introduced a bank-wise counter.

The second change (v1-0003) is meant to avoid the CPU/OS cache invalidation caused by frequent updates of a single shared variable; later in my performance tests I will show how much gain we get from these two changes.

Note: This is going to be a long email, but I have summarised the main idea above this point. Now I am going to discuss more internal details in order to show that the design idea is valid, and also present two performance tests: one specific to contention on the centralized lock, and the other mainly about contention due to frequent buffer replacement in the SLRU buffer pool. We are getting ~2x TPS compared to head with these patches, and in later sections I will discuss this in more detail, i.e. exact performance numbers and an analysis of why we are seeing the gain.

There are some problems I faced while converting this centralized control lock to a bank-wise lock, mainly because this lock is (mis)used for different purposes. The main purpose of this control lock, as I understand it, is to protect in-memory access (read/write) of the buffers in the SLRU buffer pool. Here is a list of some of the problems and my analysis of them:

1) In some of the SLRUs, we use this lock for protecting members inside the control structure which is specific to that SLRU layer, i.e. the SerialControlData members are protected by SerialSLRULock. I don’t think that is the right use of this lock, so I have introduced another lock called SerialControlLock for this specific purpose. Based on my analysis there is no reason to protect these members and the SLRU buffer access with the same lock.

2) The member called 'latest_page_number' inside SlruSharedData is also protected by the SLRULock. I would not say this use case is wrong, but since this is a common variable and not a per-bank variable, it cannot be protected by the bank-wise lock anymore. The usage of this variable is just to track the latest page in an SLRU so that we do not evict the latest page during victim page selection. So I have converted it to an atomic variable, as it is completely independent of the SLRU buffer access.
3) SlruScanDirectory() is called under the SLRU control lock from DeactivateCommitTs(), but from all other places SlruScanDirectory() is called without the lock, because those callers run from places which are not executed concurrently (i.e. startup, checkpoint). DeactivateCommitTs() is also called only during startup, so there doesn't seem to be any need to call it under the SLRU control lock. Currently, I have called it under the all-bank lock, because this is not a performance path and that way we keep it consistent with the current logic, but if others also think that we do not need a lock at this place then we could remove it, and then we wouldn't need the all-bank lock anywhere.

There are some other uses of this lock where one might think a bank-wise lock will be a problem, but it isn't, and I have given my analysis for each:

1) SimpleLruTruncate: We might worry that with a bank-wise lock we would need to release and acquire different locks as we scan different banks. As per my analysis this is not an issue, because a) the current code also releases and reacquires the centralized lock multiple times in order to perform I/O on a buffer slot, so the behaviour is not changed, and, more importantly, b) all SLRU layers take care that this function is not accessed concurrently; I have verified all callers of this function and it is true, and the function header also says the same. So this is not an issue as per my analysis.

2) Extending or adding a new page to an SLRU: I have noticed that this is also protected either by some other exclusive lock or is only done during startup. So, in short, the SLRULock is just used for protecting access to the buffers in the buffer pool; it is not what guarantees exclusive access inside these functions, because that is taken care of in some other way.

3) Another thing I noticed while writing this, and which seems worth noting, concerns the CLOG group update of xid status. There, if we do not get the control lock on the SLRU, we add ourselves to a group and the group leader does the job for all the members in the group. One might think that different pages in the group might belong to different SLRU banks, so the leader might need to acquire/release the lock as it processes the requests in the group. That is true, and it is taken care of, but we don’t need to worry about this case: as per the implementation of the group update, we try to put members with the same page request in one group, and only as an exception could there be members with a different page request. So with a bank-wise lock we are handling that exception case, but it is not the regular case where we would need to acquire/release locks multiple times. Design-wise we are good, and performance-wise there should not be any problem, because most of the time we will be updating pages from the same bank, and if in some cases we have updates for old transactions due to long-running transactions then we should do better by not having a centralized lock.
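To make the bank-wise design concrete, here is a minimal sketch of how a page number could map to a bank and to its bank-wise lock. This is only an illustration of the idea: SLRU_BANK_SIZE and the bank_locks array are assumed names for the sketch, not necessarily what the patches use, while SlruShared, num_slots, LWLockAcquire and LW_EXCLUSIVE are the existing slru/lwlock notions.

/*
 * Sketch only: the pool is split into fixed-size banks, a page may live
 * only in its own bank, and each bank has its own control lock (and,
 * with v1-0003, its own LRU counter).
 */
#define SLRU_BANK_SIZE  16              /* slots per bank (illustrative) */

static inline int
SlruBankOfPage(SlruShared shared, int pageno)
{
    int     nbanks = shared->num_slots / SLRU_BANK_SIZE;

    return pageno % nbanks;
}

static inline LWLock *
SlruBankLock(SlruShared shared, int pageno)
{
    /* bank_locks[] is an assumed array: one lock per bank, replacing SLRULock */
    return &shared->bank_locks[SlruBankOfPage(shared, pageno)].lock;
}

/*
 * A reader then does roughly:
 *   LWLockAcquire(SlruBankLock(shared, pageno), LW_EXCLUSIVE);
 *   ... search only the SLRU_BANK_SIZE slots of that bank, or replace a
 *   victim chosen from that bank ...
 *   LWLockRelease(SlruBankLock(shared, pageno));
 */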
Performance Test:

Exp1: Show the problem caused by CPU/OS cache invalidation due to frequent updates of the centralized lock and of a common LRU counter. Here we run, in parallel with the pgbench script, a transaction that frequently creates subtransaction overflow and thereby forces the visibility-check mechanism to access the subtrans SLRU.

Test machine: 8 CPU/ 64 core/ 128 with HT/ 512 GB RAM / SSD
scale factor: 300
shared_buffers=20GB
checkpoint_timeout=40min
max_wal_size=20GB
max_connections=200

Workload: run these 2 scripts in parallel:
./pgbench -c $ -j $ -T 600 -P5 -M prepared postgres
./pgbench -c 1 -j 1 -T 600 -f savepoint.sql postgres

savepoint.sql (creates subtransaction overflow):
BEGIN;
SAVEPOINT S1;
INSERT INTO test VALUES(1)
← repeat 70 times →
SELECT pg_sleep(1);
COMMIT;

Code under test:
Head: PostgreSQL head code
SlruBank: The first patch applied, which converts the SLRU buffer pool into banks (0001)
SlruBank+BankwiseLockAndLru: 0001+0002+0003 applied

Results (TPS):
Clients    Head     SlruBank   SlruBank+BankwiseLockAndLru
1          457      491        475
8          3753     3819       3782
32         14594    14328      17028
64         15600    16243      25944
128        15957    16272      31731

So we can see that at 128 clients we get ~2x TPS (with SlruBank + bank-wise lock and bank-wise LRU counter) compared to HEAD. One might wonder why we do not see much gain with only the SlruBank patch. The reason is that in this particular test case there is not much load on buffer replacement; in fact, the wait events don't show contention on any lock either. Instead, the main load comes from frequently modifying the common variables, namely the centralized control lock and the centralized LRU counter. That is evident in the perf data shown below:

+   74.72%  0.06%  postgres  postgres  [.] XidInMVCCSnapshot
+   74.08%  0.02%  postgres  postgres  [.] SubTransGetTopmostTransaction
+   74.04%  0.07%  postgres  postgres  [.] SubTransGetParent
+   57.66%  0.04%  postgres  postgres  [.] LWLockAcquire
+   57.64%  0.26%  postgres  postgres  [.] SimpleLruReadPage_ReadOnly
......
+   16.53%  0.07%  postgres  postgres  [.] LWLockRelease
+   16.36%  0.04%  postgres  postgres  [.] pg_atomic_sub_fetch_u32
+   16.31% 16.24%  postgres  postgres  [.] pg_atomic_fetch_sub_u32_impl

We can see that the main load is on the atomic variable inside LWLockAcquire and LWLockRelease. Once we apply the bank-wise lock patch (v1-0002), the same problem shows up on the cur_lru_count update in the SlruRecentlyUsed macro [2] (not shown here, but it was visible in my perf report), and that is resolved by implementing a bank-wise counter.

[2]
#define SlruRecentlyUsed(shared, slotno) \
    do { \
        ..
            (shared)->cur_lru_count = ++new_lru_count; \
        ..
        } \
    } while (0)
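For illustration, the bank-wise counter of v1-0003 could make that macro look roughly like the sketch below, with one counter per bank instead of the single shared cur_lru_count. SLRU_BANK_SIZE and bank_cur_lru_count are assumed names used only for this sketch (page_lru_count is the existing per-slot counter); it is not the exact code from the patch.

/* sketch only: per-bank LRU counter so hot updates stay within one bank */
#define SlruRecentlyUsedBankwise(shared, slotno) \
    do { \
        int     bankno = (slotno) / SLRU_BANK_SIZE; \
        int     new_lru_count = (shared)->bank_cur_lru_count[bankno]; \
        if (new_lru_count != (shared)->page_lru_count[slotno]) \
        { \
            (shared)->bank_cur_lru_count[bankno] = ++new_lru_count; \
            (shared)->page_lru_count[slotno] = new_lru_count; \
        } \
    } while (0)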
Exp2: This test shows the load caused by frequent SLRU buffer replacement. Here we run a pgbench-like script which frequently generates multixact-ids, and in parallel we repeatedly start and commit a long-running transaction so that the multixact-ids are not immediately cleaned up by vacuum, which creates contention on the SLRU buffer pool. I am not leaving the long-running transaction open forever, as that would start to show a different problem (bloat) and we would lose the point of what I am trying to show here.

Note: the test configuration is the same as in Exp1; only the workload is different. We run the two scripts below, with the new config parameter (added in v1-0001) slru_buffers_size_scale=4, which means NUM_MULTIXACTOFFSET_BUFFERS will be 64 (16 on head) and NUM_MULTIXACTMEMBER_BUFFERS will be 128 (32 on head).

./pgbench -c $ -j $ -T 600 -P5 -M prepared -f multixact.sql postgres
./pgbench -c 1 -j 1 -T 600 -f longrunning.sql postgres

cat > multixact.sql <<EOF
\set aid random(1, 100000 * :scale)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
SELECT FROM pgbench_accounts WHERE aid = :aid FOR UPDATE;
SAVEPOINT S1;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;
EOF

cat > longrunning.sql << EOF
BEGIN;
INSERT INTO pgbench_test VALUES(1);
select pg_sleep(10);
COMMIT;
EOF

Results (TPS):
Clients    Head     SlruBank   SlruBank+BankwiseLock
1          528      513        531
8          3870     4239       4157
32         13945    14470      14556
64         10086    19034      24482
128        6909     15627      18161

Here we can see a good improvement with the SlruBank patch alone, because of the increased SLRU buffer pool: in this workload there is a lot of contention due to buffer replacement. As shown below, there is a lot of load on MultiXactOffsetSLRU as well as on MultiXactOffsetBuffer, which shows that buffer evictions are frequent in this workload. Increasing the SLRU buffer pool size helps a lot, and dividing the SLRU lock into bank-wise locks gives a further gain. In total we are seeing ~2.5x TPS at 64 and 128 clients compared to head.

3401  LWLock  | MultiXactOffsetSLRU
2031  LWLock  | MultiXactOffsetBuffer
 687          |
 427  LWLock  | BufferContent

Credits:
- The base patch v1-0001 is authored by Thomas Munro; I have just rebased it.
- 0002 and 0003 are new patches written by me, based on design ideas from Robert and myself.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From: Dilip Kumar
On Wed, Oct 11, 2023 at 4:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> The small size of the SLRU buffer pools can sometimes become a performance problem because it’s not difficult to have a workload where the number of buffers actively in use is larger than the fixed-size buffer pool. However, just increasing the size of the buffer pool doesn’t necessarily help, because the linear search that we use for buffer replacement doesn’t scale, and also because contention on the single centralized lock limits scalability.
>
> A couple of patches have been proposed in the past to address the problem of increasing the buffer pool size. One of them [1], proposed by Thomas Munro, makes the size of the buffer pool configurable.

In my last email, I forgot to give the link from where I have taken the base patch for dividing the buffer pool into banks, so giving it here [1]. Looking at this again, it seems that the idea of that patch was from Andrey M. Borodin, and the idea of the SLRU scale factor was introduced by Yura Sokolov and Ivan Lazarev. Apologies for missing that in the first email.

[1] https://commitfest.postgresql.org/43/2627/

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From: Dilip Kumar
On Wed, Oct 11, 2023 at 5:57 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Oct 11, 2023 at 4:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> In my last email, I forgot to give the link from where I have taken the base patch for dividing the buffer pool into banks, so giving it here [1]. Looking at this again, it seems that the idea of that patch was from Andrey M. Borodin, and the idea of the SLRU scale factor was introduced by Yura Sokolov and Ivan Lazarev. Apologies for missing that in the first email.
>
> [1] https://commitfest.postgresql.org/43/2627/

In my last email I had just rebased the base patch; now, while reading through that patch, I realized that some refactoring was needed and there were some unused functions, so I have removed those and also added some comments. I also did some refactoring of my own patches. So I am reposting the patch series.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From: Amit Kapila
On Wed, Oct 11, 2023 at 4:35 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> The small size of the SLRU buffer pools can sometimes become a performance problem because it’s not difficult to have a workload where the number of buffers actively in use is larger than the fixed-size buffer pool. However, just increasing the size of the buffer pool doesn’t necessarily help, because the linear search that we use for buffer replacement doesn’t scale, and also because contention on the single centralized lock limits scalability.
>
> A couple of patches have been proposed in the past to address the problem of increasing the buffer pool size. One of them [1], proposed by Thomas Munro, makes the size of the buffer pool configurable. And, in order to deal with the linear search in the larger buffer pool, it divides the SLRU buffer pool into associative banks so that searching in the buffer pool doesn’t get slower as the pool grows. This works well for workloads that are mainly impacted by frequent buffer replacement, but it still doesn’t stand up well for workloads where the centralized control lock is the bottleneck.
>
> So I have taken this patch as my base patch (v1-0001) and added two more improvements on top of it:
> 1) In v1-0002, instead of a centralized control lock for the SLRU, I have introduced a bank-wise control lock.
> 2) In v1-0003, I have removed the global LRU counter and introduced a bank-wise counter.
> The second change (v1-0003) is meant to avoid the CPU/OS cache invalidation caused by frequent updates of a single shared variable; later in my performance tests I will show how much gain we get from these two changes.
>
> Note: This is going to be a long email, but I have summarised the main idea above this point. Now I am going to discuss more internal details in order to show that the design idea is valid, and also present two performance tests: one specific to contention on the centralized lock, and the other mainly about contention due to frequent buffer replacement in the SLRU buffer pool. We are getting ~2x TPS compared to head with these patches, and in later sections I will discuss this in more detail, i.e. exact performance numbers and an analysis of why we are seeing the gain.
>
> ...
>
> Performance Test:
>
> Exp1: Show the problem caused by CPU/OS cache invalidation due to frequent updates of the centralized lock and of a common LRU counter. Here we run, in parallel with the pgbench script, a transaction that frequently creates subtransaction overflow and thereby forces the visibility-check mechanism to access the subtrans SLRU.
>
> Test machine: 8 CPU/ 64 core/ 128 with HT/ 512 GB RAM / SSD
> scale factor: 300
> shared_buffers=20GB
> checkpoint_timeout=40min
> max_wal_size=20GB
> max_connections=200
>
> Workload: run these 2 scripts in parallel:
> ./pgbench -c $ -j $ -T 600 -P5 -M prepared postgres
> ./pgbench -c 1 -j 1 -T 600 -f savepoint.sql postgres
>
> savepoint.sql (creates subtransaction overflow):
> BEGIN;
> SAVEPOINT S1;
> INSERT INTO test VALUES(1)
> ← repeat 70 times →
> SELECT pg_sleep(1);
> COMMIT;
>
> Code under test:
> Head: PostgreSQL head code
> SlruBank: The first patch applied, which converts the SLRU buffer pool into banks (0001)
> SlruBank+BankwiseLockAndLru: 0001+0002+0003 applied
>
> Results (TPS):
> Clients    Head     SlruBank   SlruBank+BankwiseLockAndLru
> 1          457      491        475
> 8          3753     3819       3782
> 32         14594    14328      17028
> 64         15600    16243      25944
> 128        15957    16272      31731
>
> So we can see that at 128 clients we get ~2x TPS (with SlruBank + bank-wise lock and bank-wise LRU counter) compared to HEAD.

This and other results shared by you look promising. Will there be any improvement in workloads related to clog buffer usage?

BTW, I remember that there was also a discussion of moving SLRU into a regular buffer pool [1]. You have not provided any explanation as to whether that approach will have any merits after we do this or whether that approach is not worth pursuing at all.

[1] - https://commitfest.postgresql.org/43/3514/

--
With Regards,
Amit Kapila.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From: Dilip Kumar
On Sat, Oct 14, 2023 at 9:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> This and other results shared by you look promising. Will there be any improvement in workloads related to clog buffer usage?

I did not understand this question, can you explain it a bit? In short, if it is about performance, then we will see an improvement for all the SLRUs, because the control lock is not centralized anymore; instead it is a bank-wise lock.

> BTW, I remember that there was also a discussion of moving SLRU into a regular buffer pool [1]. You have not provided any explanation as to whether that approach will have any merits after we do this or whether that approach is not worth pursuing at all.
>
> [1] - https://commitfest.postgresql.org/43/3514/

Yeah, I haven't read that thread in detail with respect to the performance numbers. But these two approaches cannot coexist, because this patch improves SLRU buffer pool access (configurable size) as well as lock contention. If we move SLRU into the main buffer pool then we might not have the same problems, but there could be other ones, such as SLRU buffers getting evicted due to competition from relation buffers; OTOH the advantage of that approach would be that we can simply use a bigger buffer pool and SLRU can take advantage of that too. But in my opinion, most of the time SLRU access touches a limited set of pages, and the SLRU buffer access pattern is quite different from the access pattern for relation pages, so keeping them in the same buffer pool and competing against relation pages for victim buffer selection might cause other problems. But anyway, those points are probably better discussed in that thread.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From: Alvaro Herrera
On 2023-Oct-11, Dilip Kumar wrote:

> In my last email, I forgot to give the link from where I have taken the base patch for dividing the buffer pool into banks, so giving it here [1]. Looking at this again, it seems that the idea of that patch was from Andrey M. Borodin, and the idea of the SLRU scale factor was introduced by Yura Sokolov and Ivan Lazarev. Apologies for missing that in the first email.

You mean [1].
[1] https://postgr.es/m/452d01f7e331458f56ad79bef537c31b%40postgrespro.ru

I don't like this idea very much, because of the magic numbers that act as ratios for numbers of buffers on each SLRU compared to other SLRUs. These values, which I took from the documentation part of the patch, appear to have been selected by throwing darts at the wall:

NUM_CLOG_BUFFERS            = Min(128 << slru_buffers_size_scale, shared_buffers/256)
NUM_COMMIT_TS_BUFFERS       = Min(128 << slru_buffers_size_scale, shared_buffers/256)
NUM_SUBTRANS_BUFFERS        = Min(64 << slru_buffers_size_scale, shared_buffers/256)
NUM_NOTIFY_BUFFERS          = Min(32 << slru_buffers_size_scale, shared_buffers/256)
NUM_SERIAL_BUFFERS          = Min(32 << slru_buffers_size_scale, shared_buffers/256)
NUM_MULTIXACTOFFSET_BUFFERS = Min(32 << slru_buffers_size_scale, shared_buffers/256)
NUM_MULTIXACTMEMBER_BUFFERS = Min(64 << slru_buffers_size_scale, shared_buffers/256)

... which look pretty random already, if similar enough to the current hardcoded values. In reality, the code implements different values than what the documentation says.

I don't see why CLOG would have the same number as COMMIT_TS, when the size for elements of the latter is something like 32 times bigger -- yet the frequency of reads for COMMIT_TS is perhaps 1000x smaller than for CLOG. SUBTRANS is half of CLOG, yet it is 16 times larger and it covers the same range. The MULTIXACT ones appear to keep the current ratio among them (8/16 gets changed to 32/64).

... and this whole mess is scaled exponentially without regard to the size that each SLRU requires. This is just betting that enough memory can be wasted across all SLRUs up to the point where the one that is actually contended has sufficient memory. This doesn't sound sensible to me.

Like everybody else, I like having fewer GUCs to configure, but going this far to avoid them looks rather disastrous to me. IMO we should just use Munro's older patches that gave one GUC per SLRU, and users only need to increase the one that shows up in pg_wait_event sampling. Someday we will get the (much more complicated) patches to move these buffers to steal memory from shared buffers, and that'll hopefully let us get rid of all this complexity.

I'm inclined to use Borodin's patch last posted here [2] instead of your proposed 0001.
[2] https://postgr.es/m/93236D36-B91C-4DFA-AF03-99C083840378@yandex-team.ru

I did skim patches 0002 and 0003 without going into too much detail; they look like reasonable ideas. I have not tried to reproduce the claimed performance benefits. I think measuring this patch set with the tests posted by Shawn Debnath in [3] is important, too.
[3] https://postgr.es/m/YemDdpMrsoJFQJnU@f01898859afd.ant.amazon.com

On the other hand, here's a somewhat crazy idea. What if, instead of stealing buffers from shared_buffers (which causes a lot of complexity), we allocate a common pool for all SLRUs to use? We provide a single knob -- say, non_relational_buffers=32MB as default -- and we use an LRU algorithm (or something) to distribute that memory across all the SLRUs. So the ratio to use for this SLRU or that one would depend on the nature of the workload: maybe more for multixact in this server here, but more for subtrans in that server there; it's just the total amount that the user would have to configure, side by side with shared_buffers (and perhaps scale with it like wal_buffers), and the LRU would handle the rest. The "only" problem here is finding a distribution algorithm that doesn't further degrade performance, of course ...

--
Álvaro Herrera               Breisgau, Deutschland — https://www.EnterpriseDB.com/
"The problem with the facetime model is not just that it's demoralizing, but
that the people pretending to work interrupt the ones actually working."
                  -- Paul Graham, http://www.paulgraham.com/opensource.html
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From: Nathan Bossart
On Tue, Oct 24, 2023 at 06:04:13PM +0200, Alvaro Herrera wrote:
> Like everybody else, I like having fewer GUCs to configure, but going this far to avoid them looks rather disastrous to me. IMO we should just use Munro's older patches that gave one GUC per SLRU, and users only need to increase the one that shows up in pg_wait_event sampling. Someday we will get the (much more complicated) patches to move these buffers to steal memory from shared buffers, and that'll hopefully let us get rid of all this complexity.

+1

> On the other hand, here's a somewhat crazy idea. What if, instead of stealing buffers from shared_buffers (which causes a lot of complexity), we allocate a common pool for all SLRUs to use? We provide a single knob -- say, non_relational_buffers=32MB as default -- and we use an LRU algorithm (or something) to distribute that memory across all the SLRUs. So the ratio to use for this SLRU or that one would depend on the nature of the workload: maybe more for multixact in this server here, but more for subtrans in that server there; it's just the total amount that the user would have to configure, side by side with shared_buffers (and perhaps scale with it like wal_buffers), and the LRU would handle the rest. The "only" problem here is finding a distribution algorithm that doesn't further degrade performance, of course ...

I think it's worth a try. It does seem simpler, and it might allow us to sidestep some concerns about scaling when the SLRU pages are in shared_buffers [0].

[0] https://postgr.es/m/ZPsaEGRvllitxB3v%40tamriel.snowman.net

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From: Dilip Kumar
On Tue, Oct 24, 2023 at 9:34 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2023-Oct-11, Dilip Kumar wrote:
>
> > In my last email, I forgot to give the link from where I have taken the base patch for dividing the buffer pool into banks, so giving it here [1]. Looking at this again, it seems that the idea of that patch was from Andrey M. Borodin, and the idea of the SLRU scale factor was introduced by Yura Sokolov and Ivan Lazarev. Apologies for missing that in the first email.
>
> You mean [1].
> [1] https://postgr.es/m/452d01f7e331458f56ad79bef537c31b%40postgrespro.ru
>
> I don't like this idea very much, because of the magic numbers that act as ratios for numbers of buffers on each SLRU compared to other SLRUs. These values, which I took from the documentation part of the patch, appear to have been selected by throwing darts at the wall:
>
> NUM_CLOG_BUFFERS            = Min(128 << slru_buffers_size_scale, shared_buffers/256)
> NUM_COMMIT_TS_BUFFERS       = Min(128 << slru_buffers_size_scale, shared_buffers/256)
> NUM_SUBTRANS_BUFFERS        = Min(64 << slru_buffers_size_scale, shared_buffers/256)
> NUM_NOTIFY_BUFFERS          = Min(32 << slru_buffers_size_scale, shared_buffers/256)
> NUM_SERIAL_BUFFERS          = Min(32 << slru_buffers_size_scale, shared_buffers/256)
> NUM_MULTIXACTOFFSET_BUFFERS = Min(32 << slru_buffers_size_scale, shared_buffers/256)
> NUM_MULTIXACTMEMBER_BUFFERS = Min(64 << slru_buffers_size_scale, shared_buffers/256)
>
> ... which look pretty random already, if similar enough to the current hardcoded values. In reality, the code implements different values than what the documentation says.
>
> I don't see why CLOG would have the same number as COMMIT_TS, when the size for elements of the latter is something like 32 times bigger -- yet the frequency of reads for COMMIT_TS is perhaps 1000x smaller than for CLOG. SUBTRANS is half of CLOG, yet it is 16 times larger and it covers the same range. The MULTIXACT ones appear to keep the current ratio among them (8/16 gets changed to 32/64).
>
> ... and this whole mess is scaled exponentially without regard to the size that each SLRU requires. This is just betting that enough memory can be wasted across all SLRUs up to the point where the one that is actually contended has sufficient memory. This doesn't sound sensible to me.
>
> Like everybody else, I like having fewer GUCs to configure, but going this far to avoid them looks rather disastrous to me. IMO we should just use Munro's older patches that gave one GUC per SLRU, and users only need to increase the one that shows up in pg_wait_event sampling. Someday we will get the (much more complicated) patches to move these buffers to steal memory from shared buffers, and that'll hopefully let us get rid of all this complexity.

Overall I agree with your comments. Actually, I hadn't put that much thought into the GUC part and how it scales the SLRU buffers w.r.t. this single configurable parameter. So I think it is better that we take the older patch version as our base patch, where we have a separate GUC per SLRU.

> I'm inclined to use Borodin's patch last posted here [2] instead of your proposed 0001.
> [2] https://postgr.es/m/93236D36-B91C-4DFA-AF03-99C083840378@yandex-team.ru

I will rebase my patches on top of this.

> I did skim patches 0002 and 0003 without going into too much detail; they look like reasonable ideas. I have not tried to reproduce the claimed performance benefits. I think measuring this patch set with the tests posted by Shawn Debnath in [3] is important, too.
> [3] https://postgr.es/m/YemDdpMrsoJFQJnU@f01898859afd.ant.amazon.com

Thanks for taking a look.

> On the other hand, here's a somewhat crazy idea. What if, instead of stealing buffers from shared_buffers (which causes a lot of complexity),

Currently, we do not steal buffers from shared_buffers, although the computation does depend on NBuffers. I mean, for each SLRU we compute separate memory, which is in addition to shared_buffers, no?

> we allocate a common pool for all SLRUs to use? We provide a single knob -- say, non_relational_buffers=32MB as default -- and we use an LRU algorithm (or something) to distribute that memory across all the SLRUs. So the ratio to use for this SLRU or that one would depend on the nature of the workload: maybe more for multixact in this server here, but more for subtrans in that server there; it's just the total amount that the user would have to configure, side by side with shared_buffers (and perhaps scale with it like wal_buffers), and the LRU would handle the rest. The "only" problem here is finding a distribution algorithm that doesn't further degrade performance, of course ...

Yeah, this could be an idea, but are you saying that all the SLRUs would share a single buffer pool, and the LRU algorithm would decide which pages stay in the buffer pool and which get evicted? Wouldn't that create another issue, with different SLRUs starting to contend on the same lock if we have a common buffer pool for all of them? Or am I missing something? Or are you saying that although there is a common buffer pool, each SLRU would have its own boundaries within it, protected by a separate lock, with those boundaries changing dynamically based on the workload? I haven't put much thought into how practical the idea is; I am just trying to understand what you have in mind.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From: Amit Kapila
On Fri, Oct 20, 2023 at 9:40 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sat, Oct 14, 2023 at 9:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > This and other results shared by you look promising. Will there be any improvement in workloads related to clog buffer usage?
>
> I did not understand this question, can you explain it a bit?

I meant to ask about the impact of this patch on accessing transaction status via TransactionIdGetStatus(). Shouldn't we expect some improvement in accessing CLOG buffers?

--
With Regards,
Amit Kapila.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From: Dilip Kumar
On Wed, Oct 25, 2023 at 5:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Oct 20, 2023 at 9:40 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Sat, Oct 14, 2023 at 9:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > This and other results shared by you look promising. Will there be any improvement in workloads related to clog buffer usage?
> >
> > I did not understand this question, can you explain it a bit?
>
> I meant to ask about the impact of this patch on accessing transaction status via TransactionIdGetStatus(). Shouldn't we expect some improvement in accessing CLOG buffers?

Yes, there should be, because 1) there is no common lock anymore, so contention on a centralized control lock is reduced when we access transaction status from pages falling in different SLRU banks, and 2) the buffer pool size is configurable, so if the workload accesses transaction status across a wide range of transactions it will help by reducing frequent buffer eviction, though that might not be the most common case.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From: Dilip Kumar
On Wed, Oct 25, 2023 at 10:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Oct 24, 2023 at 9:34 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> Overall I agree with your comments. Actually, I hadn't put that much thought into the GUC part and how it scales the SLRU buffers w.r.t. this single configurable parameter. So I think it is better that we take the older patch version as our base patch, where we have a separate GUC per SLRU.
>
> > I'm inclined to use Borodin's patch last posted here [2] instead of your proposed 0001.
> > [2] https://postgr.es/m/93236D36-B91C-4DFA-AF03-99C083840378@yandex-team.ru
>
> I will rebase my patches on top of this.

I have taken 0001 and 0002 from [1], done some bug fixes in 0001, and changed the logic of SlruAdjustNSlots() in 0002, such that now it starts with the next power-of-2 value of the configured slots and keeps doubling the number of banks until either we reach the maximum number of banks, SLRU_MAX_BANKS (128), or the bank size would drop below SLRU_MIN_BANK_SIZE (8). By doing so, we ensure we don't have too many banks, but also that we don't have very large banks.

There was also a patch 0003 in that thread, but I haven't taken it, as it is another optimization (merging some structure members); I will analyze its performance characteristics and try to add it on top of the complete patch series.

Patch details:
0001 - GUC parameter for each SLRU
0002 - Divide the SLRU pool into banks
(The above 2 are taken from [1] with some modification and rebasing by me)
0003 - Implement bank-wise SLRU lock as described in the first email of this thread
0004 - Implement bank-wise LRU counter as described in the first email of this thread
0005 - Some other optimization suggested offlist by Alvaro, i.e. merging buffer locks and bank locks into the same array so that the bank-wise LRU counter does not fetch the next cache line in the hot function SlruRecentlyUsed()

Note: I think 0003, 0004 and 0005 can be merged together, but I have kept them separate so that we can review them independently and see how useful each of them is.

[1] https://postgr.es/m/93236D36-B91C-4DFA-AF03-99C083840378@yandex-team.ru

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
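For reference, here is a minimal sketch of the slot/bank adjustment rule described above, under the stated constants SLRU_MAX_BANKS = 128 and SLRU_MIN_BANK_SIZE = 8. This is only an illustration of the rule, not the code of 0002.

/*
 * Sketch: round the configured slot count up to a power of 2, then keep
 * doubling the bank count as long as we stay under SLRU_MAX_BANKS and the
 * per-bank size stays at or above SLRU_MIN_BANK_SIZE.
 */
#define SLRU_MAX_BANKS      128
#define SLRU_MIN_BANK_SIZE  8

static int
SlruAdjustNSlotsSketch(int nslots, int *banksize, int *nbanks)
{
    int     slots = 1;

    while (slots < nslots)
        slots <<= 1;            /* next power of 2 >= configured value */

    *banksize = slots;
    *nbanks = 1;
    while (*nbanks < SLRU_MAX_BANKS && *banksize / 2 >= SLRU_MIN_BANK_SIZE)
    {
        *banksize /= 2;
        *nbanks *= 2;
    }

    return slots;               /* adjusted (power-of-2) number of slots */
}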
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From: Dilip Kumar
On Mon, Oct 30, 2023 at 11:50 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>

Based on some offlist discussions with Alvaro and Robert (in separate conversations), Alvaro and I came to the same conclusion: if a user sets a very high value for the number of slots (say 1GB worth), then the number of slots in each bank will be 1024 (considering the maximum number of banks, 128), and if we continue with the sequential search for finding the buffer for a page, that could be costly in such cases. But later, in one of the conversations with Robert, I realized that we can have this bank-wise lock approach along with a partitioned hash table.

So the idea is that we will use a buffer mapping hash table, something like Thomas used in one of his patches [1], but instead of a normal hash table we will use a partitioned hash table. The SLRU buffer pool is still divided as in the bank-wise approach, and there will be separate locks for each slot range. So now we get the benefit of both approaches: 1) by having a mapping hash we can avoid the sequential search, 2) by dividing the buffer pool into banks and keeping the victim buffer search within those banks we avoid locking all the partitions during the victim buffer search, and 3) we can also maintain a bank-wise LRU counter so that we avoid contention on a single variable, as discussed in my first email of this thread.

Please find the updated patch set details and patches attached to the email.

[1] 0001-Make-all-SLRU-buffer-sizes-configurable: This is the same patch as in the previous patch set
[2] 0002-Add-a-buffer-mapping-table-for-SLRUs: Patch to introduce the buffer mapping hash table
[3] 0003-Partition-wise-slru-locks: Partition the hash table and also introduce partition-wise locks: this is a merge of 0003 and 0004 from the previous patch set, but instead of bank-wise locks it has partition-wise locks and an LRU counter.
[4] 0004-Merge-partition-locks-array-with-buffer-locks-array: merge buffer locks and bank locks into the same array so that the bank-wise LRU counter does not fetch the next cache line in the hot function SlruRecentlyUsed() (same as 0005 from the previous patch set)
[5] 0005-Ensure-slru-buffer-slots-are-in-multiple-of-number-of: Ensure that the number of slots is a multiple of the number of banks

With this approach, I have also made the number of banks constant (i.e. 8) so that some of the computations are easy. I think with a buffer mapping hash table we should not have much of a problem keeping this fixed: even with a very extreme configuration and a very high number of slots we do not have performance problems, because we are not doing a sequential search thanks to the buffer mapping hash, and if the number of slots is set that high then victim buffer search should not be frequent either, so we should not be worried about the sequential search within a bank for victim buffer selection.

I have also changed the default value of the number of slots to 64 and the minimum value to 16. I think this is a reasonable default because the existing values are too low considering modern hardware, and since these parameters are configurable the user can set them to a low value when running with very little memory.

[1] https://www.postgresql.org/message-id/CA%2BhUKGLCLDtgDj2Xsf0uBk5WXDCeHxBDDJPsyY7m65Fde-%3Dpyg%40mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
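To illustrate the combination described above, here is a minimal sketch of a page lookup with a partitioned buffer-mapping hash plus a bank-confined victim search. SLRU_NUM_PARTITIONS, part_locks, SlruHashLookup and SlruReplaceInBank are hypothetical names used only for this sketch; LWLockAcquire/LWLockRelease and SlruShared are the existing primitives.

/*
 * Sketch only: map a pageno to a mapping-hash partition lock, and keep
 * the victim search confined to the page's bank.
 */
#define SLRU_NUM_PARTITIONS     8

static inline int
SlruPartitionForPage(int pageno)
{
    /* pageno is already well distributed, so a simple modulo is enough */
    return pageno % SLRU_NUM_PARTITIONS;
}

static int
SlruReadPageSketch(SlruShared shared, int pageno)
{
    LWLock     *partlock = &shared->part_locks[SlruPartitionForPage(pageno)].lock;
    int         slotno;

    LWLockAcquire(partlock, LW_SHARED);
    slotno = SlruHashLookup(shared, pageno);    /* hypothetical hash lookup */
    LWLockRelease(partlock);

    if (slotno >= 0)
        return slotno;          /* cache hit */

    /*
     * Miss: choose a victim only within pageno's bank, then re-acquire the
     * partition lock exclusively and insert the new pageno -> slot mapping.
     */
    return SlruReplaceInBank(shared, pageno);   /* hypothetical */
}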
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From: "Andrey M. Borodin"
> On 30 Oct 2023, at 09:20, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> changed the logic of SlruAdjustNSlots() in 0002, such that now it starts with the next power-of-2 value of the configured slots and keeps doubling the number of banks until either we reach the maximum number of banks, SLRU_MAX_BANKS (128), or the bank size would drop below SLRU_MIN_BANK_SIZE (8). By doing so, we ensure we don't have too many banks

There was nothing wrong with having too many banks. Until bank-wise locks and counters were added in later patchsets.

Having hashtable to find SLRU page in the buffer IMV is too slow. Some comments on this approach can be found here [0].
I'm OK with having HTAB for that if we are sure performance does not degrade significantly, but I really doubt this is the case.
I even think SLRU buffers used HTAB in some ancient times, but I could not find commit when it was changed to linear search.
Maybe we could decouple locks and counters from SLRU banks? Banks were meant to be small to exploit performance of local linear search. Lock partitions have to be bigger for sure.
> On 30 Oct 2023, at 09:20, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I have taken 0001 and 0002 from [1], done some bug fixes in 0001
BTW can you please describe in more detail what kind of bugs?
Thanks for working on this!
Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From: Dilip Kumar
On Sun, Nov 5, 2023 at 1:37 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>
> > On 30 Oct 2023, at 09:20, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > changed the logic of SlruAdjustNSlots() in 0002, such that now it starts with the next power-of-2 value of the configured slots and keeps doubling the number of banks until either we reach the maximum number of banks, SLRU_MAX_BANKS (128), or the bank size would drop below SLRU_MIN_BANK_SIZE (8). By doing so, we ensure we don't have too many banks
>
> There was nothing wrong with having too many banks. Until bank-wise locks and counters were added in later patchsets.

I agree with that, but I feel that with bank-wise locks we are removing major contention from the centralized control lock, and my first email shows how much benefit we can get in one of the simple test cases, when we create subtransaction overflow.

> Having hashtable to find SLRU page in the buffer IMV is too slow. Some comments on this approach can be found here [0].
> I'm OK with having HTAB for that if we are sure performance does not degrade significantly, but I really doubt this is the case.
> I even think SLRU buffers used HTAB in some ancient times, but I could not find commit when it was changed to linear search.

The main intention of having this buffer mapping hash is to find the SLRU page faster than a sequential search when the banks are relatively big, but if we find cases where the hash creates more overhead than gain then I am fine with removing it, because the whole purpose of adding the hash here is to make the lookup faster. So far in my tests I did not find any slowness. Do you or anyone else have a test case, based on the previous research, showing whether it creates any slowness?

> Maybe we could decouple locks and counters from SLRU banks? Banks were meant to be small to exploit performance of local linear search. Lock partitions have to be bigger for sure.

Yeah, that could also be an idea if we plan to drop the hash. I mean, the bank-wise counter is fine, as we find a victim buffer within a bank itself, but each lock could cover more slots than one bank, or in other words it could protect multiple banks. Let's hear more opinions on this.

> > On 30 Oct 2023, at 09:20, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I have taken 0001 and 0002 from [1], done some bug fixes in 0001
>
> BTW can you please describe in more detail what kind of bugs?

Yeah, actually that patch was using the same GUC (multixact_offsets_buffers) in SimpleLruInit() for MultiXactOffsetCtl as well as for MultiXactMemberCtl; see the snippet below from the original patch.

@@ -1851,13 +1851,13 @@ MultiXactShmemInit(void)
 	MultiXactMemberCtl->PagePrecedes = MultiXactMemberPagePrecedes;

 	SimpleLruInit(MultiXactOffsetCtl,
-				  "MultiXactOffset", NUM_MULTIXACTOFFSET_BUFFERS, 0,
+				  "MultiXactOffset", multixact_offsets_buffers, 0,
 				  MultiXactOffsetSLRULock, "pg_multixact/offsets",
 				  LWTRANCHE_MULTIXACTOFFSET_BUFFER,
 				  SYNC_HANDLER_MULTIXACT_OFFSET);
 	SlruPagePrecedesUnitTests(MultiXactOffsetCtl, MULTIXACT_OFFSETS_PER_PAGE);
 	SimpleLruInit(MultiXactMemberCtl,
-				  "MultiXactMember", NUM_MULTIXACTMEMBER_BUFFERS, 0,
+				  "MultiXactMember", multixact_offsets_buffers, 0,
 				  MultiXactMemberSLRULock, "pg_multixact/members",
 				  LWTRANCHE_MULTIXACTMEMBER_BUFFER,
 				  SYNC_HANDLER_MULTIXACT_MEMBER);

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From: "Andrey M. Borodin"
> On 6 Nov 2023, at 09:09, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
>> Having hashtable to find SLRU page in the buffer IMV is too slow. Some comments on this approach can be found here [0].
>> I'm OK with having HTAB for that if we are sure performance does not degrade significantly, but I really doubt this is the case.
>> I even think SLRU buffers used HTAB in some ancient times, but I could not find commit when it was changed to linear search.
>
> The main intention of having this buffer mapping hash is to find the SLRU page faster than a sequential search when the banks are relatively big, but if we find cases where the hash creates more overhead than gain then I am fine with removing it, because the whole purpose of adding the hash here is to make the lookup faster. So far in my tests I did not find any slowness. Do you or anyone else have a test case, based on the previous research, showing whether it creates any slowness?

PFA a test, benchmark_slru_page_readonly(). In this test we run SimpleLruReadPage_ReadOnly() (the essential part of TransactionIdGetStatus()).

Before introducing the HTAB for buffer mapping I get
Time: 14837.851 ms (00:14.838)
With the buffer HTAB I get
Time: 22723.243 ms (00:22.723)

This hash table makes getting the transaction status ~50% slower.

Benchmark script I used:

make -C $HOME/postgresMX -j 8 install && (pkill -9 postgres; rm -rf test; ./initdb test && echo "shared_preload_libraries = 'test_slru'" >> test/postgresql.conf && ./pg_ctl -D test start && ./psql -c 'create extension test_slru' postgres && ./pg_ctl -D test restart && ./psql -c "SELECT count(test_slru_page_write(a, 'Test SLRU')) FROM generate_series(12346, 12393, 1) as a;" -c '\timing' -c "SELECT benchmark_slru_page_readonly(12377);" postgres)

>> Maybe we could decouple locks and counters from SLRU banks? Banks were meant to be small to exploit performance of local linear search. Lock partitions have to be bigger for sure.
>
> Yeah, that could also be an idea if we plan to drop the hash. I mean, the bank-wise counter is fine, as we find a victim buffer within a bank itself, but each lock could cover more slots than one bank, or in other words it could protect multiple banks. Let's hear more opinions on this.

+1

>> On 30 Oct 2023, at 09:20, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>
>> I have taken 0001 and 0002 from [1], done some bug fixes in 0001
>
> BTW can you please describe in more detail what kind of bugs?
>
> Yeah, actually that patch was using the same GUC (multixact_offsets_buffers) in SimpleLruInit() for MultiXactOffsetCtl as well as for MultiXactMemberCtl; see the snippet below from the original patch.

Ouch. We were running this for several years with this bug... Thanks!

Best regards, Andrey Borodin.
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From: Dilip Kumar
On Mon, Nov 6, 2023 at 1:05 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>
> > On 6 Nov 2023, at 09:09, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> >> Having hashtable to find SLRU page in the buffer IMV is too slow. Some comments on this approach can be found here [0].
> >> I'm OK with having HTAB for that if we are sure performance does not degrade significantly, but I really doubt this is the case.
> >> I even think SLRU buffers used HTAB in some ancient times, but I could not find commit when it was changed to linear search.
> >
> > The main intention of having this buffer mapping hash is to find the SLRU page faster than a sequential search when the banks are relatively big, but if we find cases where the hash creates more overhead than gain then I am fine with removing it, because the whole purpose of adding the hash here is to make the lookup faster. So far in my tests I did not find any slowness. Do you or anyone else have a test case, based on the previous research, showing whether it creates any slowness?
>
> PFA a test, benchmark_slru_page_readonly(). In this test we run SimpleLruReadPage_ReadOnly() (the essential part of TransactionIdGetStatus()).
>
> Before introducing the HTAB for buffer mapping I get
> Time: 14837.851 ms (00:14.838)
> With the buffer HTAB I get
> Time: 22723.243 ms (00:22.723)
>
> This hash table makes getting the transaction status ~50% slower.
>
> Benchmark script I used:
> make -C $HOME/postgresMX -j 8 install && (pkill -9 postgres; rm -rf test; ./initdb test && echo "shared_preload_libraries = 'test_slru'" >> test/postgresql.conf && ./pg_ctl -D test start && ./psql -c 'create extension test_slru' postgres && ./pg_ctl -D test restart && ./psql -c "SELECT count(test_slru_page_write(a, 'Test SLRU')) FROM generate_series(12346, 12393, 1) as a;" -c '\timing' -c "SELECT benchmark_slru_page_readonly(12377);" postgres)

With this test, I got the numbers below:

nslots    no-hash    hash
8         10s        13s
16        10s        13s
32        15s        13s
64        17s        13s

Yeah, so we can see that with a small bank size (<= 16 slots) fetching a page with the hash is ~30% slower than the sequential search, but beyond 32 slots the sequential search becomes slower as you grow the number of slots, whereas with the hash it stays constant, as expected. But now, as you said, if we keep the lock partition range different from the bank size then we might not have any problem with having more banks, and with that we can keep the bank size small, like 16. Let me put some more thought into this and get back. Any other opinions on this?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From: Alvaro Herrera
On 2023-Nov-06, Dilip Kumar wrote:

> Yeah, so we can see that with a small bank size (<= 16 slots) fetching a page with the hash is ~30% slower than the sequential search, but beyond 32 slots the sequential search becomes slower as you grow the number of slots, whereas with the hash it stays constant, as expected. But now, as you said, if we keep the lock partition range different from the bank size then we might not have any problem with having more banks, and with that we can keep the bank size small, like 16. Let me put some more thought into this and get back. Any other opinions on this?

dynahash is notoriously slow, which is why we have simplehash.h since commit b30d3ea824c5. Maybe we could use that instead.

--
Álvaro Herrera               Breisgau, Deutschland — https://www.EnterpriseDB.com/
"Listen and you will forget; see and you will remember; do and you will understand" (Confucius)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From: "Andrey M. Borodin"
> On 6 Nov 2023, at 14:31, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> dynahash is notoriously slow, which is why we have simplehash.h since commit b30d3ea824c5. Maybe we could use that instead.

Dynahash has lock partitioning. Simplehash has not, AFAIK.

The thing is, we do not really need a hash function - pageno is already the best hash function itself. And we do not need to cope with collisions much - we can evict a collided buffer.

Given this, we do not need a hashtable at all. That's the exact reasoning from which banks emerged: I started implementing a dynahash patch in April 2021 and found out that the "banks" approach is cleaner. However, the term "bank" is not common in software; it's taken from hardware caches.

Best regards, Andrey Borodin.
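To spell that argument out, here is a minimal sketch of a bank lookup that needs no hash table at all. num_banks and bank_size are assumed fields used only for this sketch; page_status, page_number and SLRU_PAGE_EMPTY are the existing slru.c notions.

/*
 * Sketch only: pageno acts as its own hash. Its bank index is derived
 * directly from it, the lookup is a short linear scan bounded by the bank
 * size, and a miss simply evicts some slot of the same bank (no separate
 * mapping table to maintain).
 */
static int
SlruLookupInBankSketch(SlruShared shared, int pageno)
{
    int     bankno = pageno % shared->num_banks;        /* assumed field */
    int     bankstart = bankno * shared->bank_size;     /* assumed field */
    int     bankend = bankstart + shared->bank_size;
    int     slotno;

    for (slotno = bankstart; slotno < bankend; slotno++)
    {
        if (shared->page_status[slotno] != SLRU_PAGE_EMPTY &&
            shared->page_number[slotno] == pageno)
            return slotno;      /* hit */
    }

    /*
     * Miss: the page can only live in this bank, so pick the least
     * recently used slot of the bank as the victim and read the page
     * into it (victim selection and I/O elided here).
     */
    return -1;
}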
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From: Dilip Kumar
On Mon, Nov 6, 2023 at 4:44 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>
> > On 6 Nov 2023, at 14:31, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> >
> > dynahash is notoriously slow, which is why we have simplehash.h since commit b30d3ea824c5. Maybe we could use that instead.
>
> Dynahash has lock partitioning. Simplehash has not, AFAIK.

Yeah, simplehash doesn't have partitioning, so with simplehash we would be stuck with the centralized control lock, which is one of the main problems we are trying to solve here.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From: Amul Sul

On Mon, Nov 6, 2023 at 4:44 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
> > On 6 Nov 2023, at 14:31, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> >
> > dynahash is notoriously slow, which is why we have simplehash.h since commit b30d3ea824c5. Maybe we could use that instead.
>
> Dynahash has lock partitioning. Simplehash has not, AFAIK.
>
> The thing is, we do not really need a hash function - pageno is already the best hash function itself. And we do not need to cope with collisions much - we can evict a collided buffer.
>
> Given this, we do not need a hashtable at all. That's the exact reasoning from which banks emerged: I started implementing a dynahash patch in April 2021 and found out that the "banks" approach is cleaner. However, the term "bank" is not common in software; it's taken from hardware caches.
I agree that we don't need a hash function to generate a hash value out of
pageno, which is itself sufficient, but I don't understand how we can get rid of
the hash table itself -- how would we map the pageno to the slot number?
Or is that mapping not needed at all?
Regards,
Amul
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From: Amul Sul

On Fri, Nov 3, 2023 at 10:59 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Oct 30, 2023 at 11:50 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > [...]
>
> [1] 0001-Make-all-SLRU-buffer-sizes-configurable: This is the same patch as in the previous patch set
> [2] 0002-Add-a-buffer-mapping-table-for-SLRUs: Patch to introduce the buffer mapping hash table
> [3] 0003-Partition-wise-slru-locks: Partition the hash table and also introduce partition-wise locks: this is a merge of 0003 and 0004 from the previous patch set, but instead of bank-wise locks it has partition-wise locks and an LRU counter.
> [4] 0004-Merge-partition-locks-array-with-buffer-locks-array: merge buffer locks and bank locks into the same array so that the bank-wise LRU counter does not fetch the next cache line in the hot function SlruRecentlyUsed() (same as 0005 from the previous patch set)
> [5] 0005-Ensure-slru-buffer-slots-are-in-multiple-of-number-of: Ensure that the number of slots is a multiple of the number of banks
> [...]
Here are some minor comments:
+ * By default, we'll use 1MB of for every 1GB of shared buffers, up to the
+ * maximum value that slru.c will allow, but always at least 16 buffers.
*/
Size
CommitTsShmemBuffers(void)
{
- return Min(256, Max(4, NBuffers / 256));
+ /* Use configured value if provided. */
+ if (commit_ts_buffers > 0)
+ return Max(16, commit_ts_buffers);
+ return Min(256, Max(16, NBuffers / 256));
Do you mean "4MB of for every 1GB" in the comment?
--
diff --git a/src/include/access/commit_ts.h b/src/include/access/commit_ts.h
index 5087cdce51..78d017ad85 100644
--- a/src/include/access/commit_ts.h
+++ b/src/include/access/commit_ts.h
@@ -16,7 +16,6 @@
#include "replication/origin.h"
#include "storage/sync.h"
-
extern PGDLLIMPORT bool track_commit_timestamp;
A spurious change.
--
@@ -168,10 +180,19 @@ SimpleLruShmemSize(int nslots, int nlsns)
if (nlsns > 0)
sz += MAXALIGN(nslots * nlsns * sizeof(XLogRecPtr)); /* group_lsn[] */
-
return BUFFERALIGN(sz) + BLCKSZ * nslots;
}
Another spurious change in 0002 patch.
--
+/*
+ * The slru buffer mapping table is partitioned to reduce contention. To
+ * determine which partition lock a given pageno requires, compute the pageno's
+ * hash code with SlruBufTableHashCode(), then apply SlruPartitionLock().
+ */
I didn't see SlruBufTableHashCode() & SlruPartitionLock() functions anywhere in
your patches, is that outdated comment?
--
- sz += MAXALIGN(nslots * sizeof(LWLockPadded)); /* buffer_locks[] */
- sz += MAXALIGN(SLRU_NUM_PARTITIONS * sizeof(LWLockPadded)); /* part_locks[] */
+ sz += MAXALIGN((nslots + SLRU_NUM_PARTITIONS) * sizeof(LWLockPadded)); /* locks[] */
I am a bit uncomfortable with these changes, merging parts and buffer locks
making it hard to understand the code. Not sure what we were getting out of
this?
--
Subject: [PATCH v4 5/5] Ensure slru buffer slots are in multiple of number of partitions
I think the 0005 patch can be merged to 0001.
Regards,
Amul
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Wed, Nov 8, 2023 at 10:52 AM Amul Sul <sulamul@gmail.com> wrote: Thanks for review Amul, > Here are some minor comments: > > + * By default, we'll use 1MB of for every 1GB of shared buffers, up to the > + * maximum value that slru.c will allow, but always at least 16 buffers. > */ > Size > CommitTsShmemBuffers(void) > { > - return Min(256, Max(4, NBuffers / 256)); > + /* Use configured value if provided. */ > + if (commit_ts_buffers > 0) > + return Max(16, commit_ts_buffers); > + return Min(256, Max(16, NBuffers / 256)); > > Do you mean "4MB of for every 1GB" in the comment? You are right > -- > > diff --git a/src/include/access/commit_ts.h b/src/include/access/commit_ts.h > index 5087cdce51..78d017ad85 100644 > --- a/src/include/access/commit_ts.h > +++ b/src/include/access/commit_ts.h > @@ -16,7 +16,6 @@ > #include "replication/origin.h" > #include "storage/sync.h" > > - > extern PGDLLIMPORT bool track_commit_timestamp; > > A spurious change. Will fix > -- > > @@ -168,10 +180,19 @@ SimpleLruShmemSize(int nslots, int nlsns) > > if (nlsns > 0) > sz += MAXALIGN(nslots * nlsns * sizeof(XLogRecPtr)); /* group_lsn[] */ > - > return BUFFERALIGN(sz) + BLCKSZ * nslots; > } > > Another spurious change in 0002 patch. Will fix > -- > > +/* > + * The slru buffer mapping table is partitioned to reduce contention. To > + * determine which partition lock a given pageno requires, compute the pageno's > + * hash code with SlruBufTableHashCode(), then apply SlruPartitionLock(). > + */ > > I didn't see SlruBufTableHashCode() & SlruPartitionLock() functions anywhere in > your patches, is that outdated comment? Yes will fix it, actually, there are some major design changes to this. > -- > > - sz += MAXALIGN(nslots * sizeof(LWLockPadded)); /* buffer_locks[] */ > - sz += MAXALIGN(SLRU_NUM_PARTITIONS * sizeof(LWLockPadded)); /* part_locks[] */ > + sz += MAXALIGN((nslots + SLRU_NUM_PARTITIONS) * sizeof(LWLockPadded)); /* locks[] */ > > I am a bit uncomfortable with these changes, merging parts and buffer locks > making it hard to understand the code. Not sure what we were getting out of > this? Yes, even I do not like this much because it is confusing. But the advantage of this is that we are using a single pointer for the lock which means the next variable for the LRU counter will come in the same cacheline and frequent updates of lru counter will be benefitted from this. Although I don't have any number which proves this. Currently, I want to focus on all the base patches and keep this patch as add on and later if we find its useful and want to pursue this then we will see how to make it better readable. > > Subject: [PATCH v4 5/5] Ensure slru buffer slots are in multiple of numbe of > partitions > > I think the 0005 patch can be merged to 0001. Yeah in the next version, it is done that way. Planning to post end of the day. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Ants Aasma
Date:
On Sat, 4 Nov 2023 at 22:08, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
There was nothing wrong with having too many banks. Until bank-wise locks and counters were added in later patchsets.

On 30 Oct 2023, at 09:20, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> changed the logic of SlruAdjustNSlots() in 0002, such that now it
> starts with the next power of 2 value of the configured slots and
> keeps doubling the number of banks until we reach the number of banks
> to the max SLRU_MAX_BANKS(128) and bank size is bigger than
> SLRU_MIN_BANK_SIZE (8). By doing so, we will ensure we don't have too
> many banks

Having hashtable to find SLRU page in the buffer IMV is too slow. Some comments on this approach can be found here [0]. I'm OK with having HTAB for that if we are sure performance does not degrade significantly, but I really doubt this is the case. I even think SLRU buffers used HTAB in some ancient times, but I could not find commit when it was changed to linear search.

Maybe we could decouple locks and counters from SLRU banks? Banks were meant to be small to exploit performance of local linear search. Lock partitions have to be bigger for sure.
Is there a particular reason why lock partitions need to be bigger? We have one lock per buffer anyway, bankwise locks will increase the number of locks < 10%.
I am working on trying out a SIMD based LRU mechanism that uses a 16 entry bank. The data layout is:
struct CacheBank {
int page_numbers[16];
char access_age[16];
}
The first part uses up one cache line, and the second line has 48 bytes of space left over that could fit a lwlock and page_status, page_dirty arrays.
Lookup + LRU maintenance has 20 instructions/14 cycle latency and the only branch is for found/not found. Hoping to have a working prototype of SLRU on top in the next couple of days.
Regards,
Ants Aasma
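As a reference for readers, here is a hedged scalar rendering of the lookup and LRU maintenance that the 16-entry layout Ants sketches above allows; the actual prototype he describes would do the 16 comparisons and the age updates with SIMD instructions, and everything below is illustrative rather than his code.

#include <stdbool.h>

/* Scalar stand-in for the SIMD bank sketched above; field names follow that sketch. */
struct CacheBank
{
    int  page_numbers[16];      /* one cache line of keys */
    char access_age[16];        /* smaller age = more recently used */
};

/* Look a page up and refresh its age so the bank keeps a total LRU order. */
static bool
cache_bank_lookup(struct CacheBank *bank, int pageno, int *slot)
{
    int hit = -1;

    for (int i = 0; i < 16; i++)
        if (bank->page_numbers[i] == pageno)
            hit = i;

    if (hit < 0)
        return false;           /* caller would evict the entry with the largest age */

    /* Everything that was more recent than the hit entry ages by one,
     * and the hit entry becomes the most recent. */
    for (int i = 0; i < 16; i++)
        if (bank->access_age[i] < bank->access_age[hit])
            bank->access_age[i]++;
    bank->access_age[hit] = 0;

    *slot = hit;
    return true;
}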
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
"Andrey M. Borodin"
Date:
> On 8 Nov 2023, at 14:17, Ants Aasma <ants@cybertec.at> wrote:
>
> Is there a particular reason why lock partitions need to be bigger? We have one lock per buffer anyway, bankwise locks will increase the number of locks < 10%.

The problem was not attracting much attention for some years. So my reasoning was that the solution should not have any costs at all. The initial patchset with banks did not add any memory footprint.

> On 8 Nov 2023, at 14:17, Ants Aasma <ants@cybertec.at> wrote:
>
> I am working on trying out a SIMD based LRU mechanism that uses a 16 entry bank.

FWIW I tried to pack struct parts together to minimize cache lines touched, see step 3 in [0]. So far I could not prove any performance benefits of this approach. But maybe your implementation will be more efficient. Thanks!

Best regards, Andrey Borodin.

[0] https://www.postgresql.org/message-id/93236D36-B91C-4DFA-AF03-99C083840378@yandex-team.ru
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Mon, Nov 6, 2023 at 9:39 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sun, Nov 5, 2023 at 1:37 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
> >
> > Maybe we could decouple locks and counters from SLRU banks? Banks were meant to be small to exploit performance of local linear search. Lock partitions have to be bigger for sure.
>
> Yeah, that could also be an idea if we plan to drop the hash. I mean
> bank-wise counter is fine as we are finding a victim buffer within a
> bank itself, but each lock could cover more slots than one bank size
> or in other words, it can protect multiple banks. Let's hear more
> opinion on this.

Here is the updated version of the patch. Here I have taken the approach suggested by Andrey, and I discussed the same with Alvaro offlist and he also agrees with it. The idea is that we keep the bank size fixed at 16 buffers per bank, and the configured GUC value for each SLRU buffer pool must be a multiple of the bank size. We have removed the centralized lock, but instead of strictly one lock per bank we have put a maximum limit of 128 on the number of bank locks. We kept the max limit at 128 because, in one of the operations (i.e. ActivateCommitTs), we need to acquire all the bank locks (though this is not a performance path at all), and at a time we can acquire a max of 200 LWLocks, so we think this limit of 128 is good. So if the number of banks is <= 128 then we use one lock per bank; otherwise one lock may protect access to buffers in multiple banks.

We might argue that we should keep the max number of locks lower than 128, i.e. 64 or 32, and I am open to that; we can do more experiments with a very large buffer pool and a very heavy workload to see whether having up to 128 locks is helpful or not.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
IMO the whole area of SLRU buffering is in horrible shape and many users are struggling with overall PG performance because of it. An improvement doesn't have to be perfect -- it just has to be much better than the current situation, which should be easy enough. We can continue to improve later, using more scalable algorithms or ones that allow us to raise the limits higher. The only point on which we do not have full consensus yet is the need to have one GUC per SLRU, and a lot of effort seems focused on trying to fix the problem without adding so many GUCs (for example, using shared buffers instead, or use a single "scaling" GUC). I think that hinders progress. Let's just add multiple GUCs, and users can leave most of them alone and only adjust the one with which they have a performance problems; it's not going to be the same one for everybody. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Sallah, I said NO camels! That's FIVE camels; can't you count?" (Indiana Jones)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Robert Haas
Date:
On Wed, Nov 8, 2023 at 6:41 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Here is the updated version of the patch, here I have taken the > approach suggested by Andrey and I discussed the same with Alvaro > offlist and he also agrees with it. So the idea is that we will keep > the bank size fixed which is 16 buffers per bank and the allowed GUC > value for each slru buffer must be in multiple of the bank size. We > have removed the centralized lock but instead of one lock per bank, we > have kept the maximum limit on the number of bank locks which is 128. > We kept the max limit as 128 because, in one of the operations (i.e. > ActivateCommitTs), we need to acquire all the bank locks (but this is > not a performance path at all) and at a time we can acquire a max of > 200 LWlocks, so we think this limit of 128 is good. So now if the > number of banks is <= 128 then we will be using one lock per bank > otherwise the one lock may protect access of buffer in multiple banks. Just so I understand, I guess this means that an SLRU is limited to 16*128 = 2k buffers = 16MB? When we were talking about this earlier, I suggested fixing the number of banks and allowing the number of buffers per bank to scale depending on the setting. That seemed simpler than allowing both the number of banks and the number of buffers to vary, and it might allow the compiler to optimize some code better, by converting a calculation like page_no%number_of_banks into a masking operation like page_no&0xf or whatever. However, because it allows an individual bank to become arbitrarily large, it more or less requires us to use a buffer mapping table. Some of the performance problems mentioned could be alleviated by omitting the hash table when the number of buffers per bank is small, and we could also create the dynahash with a custom hash function that just does modular arithmetic on the page number rather than a real hashing operation. However, maybe we don't really need to do any of that. I agree that dynahash is clunky on a good day. I hadn't realized the impact would be so noticeable. This proposal takes the opposite approach of fixing the number of buffers per bank, letting the number of banks vary. I think that's probably fine, although it does reduce the effective associativity of the cache. If there are more hot buffers in a bank than the bank size, the bank will be contended, even if other banks are cold. However, given the way SLRUs are accessed, it seems hard to imagine this being a real problem in practice. There aren't likely to be say 20 hot buffers that just so happen to all be separated from one another by a number of pages that is a multiple of the configured number of banks. And in the seemingly very unlikely event that you have a workload that behaves like that, you could always adjust the number of banks up or down by one, and the problem would go away. So this seems OK to me. I also agree with a couple of points that Alvaro made, specifically that (1) this doesn't have to be perfect, just better than now and (2) separate GUCs for each SLRU is fine. On the latter point, it's worth keeping in mind that the cost of a GUC that most people don't need to tune is fairly low. GUCs like work_mem and shared_buffers are "expensive" because everybody more or less needs to understand what they are and how to set them and getting the right value can tricky -- but a GUC like autovacuum_naptime is a lot cheaper, because almost nobody needs to change it. 
It seems to me that these GUCs will fall into the latter category. Users can hopefully just ignore them except if they see a contention on the SLRU bank locks -- and then they can consider increasing the number of banks for that particular SLRU. That seems simple enough. As with autovacuum_naptime, there is a danger that people will configure a ridiculous value of the parameter for no good reason and get bad results, so it would be nice if someday we had a magical system that just got all of this right without the user needing to configure anything. But in the meantime, it's better to have a somewhat manual system to relieve pressure on these locks than no system at all. -- Robert Haas EDB: http://www.enterprisedb.com
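A small aside on the modulo-versus-mask point above: the two forms are interchangeable only when the divisor is a power of two, which is why fixing the bank count (or rounding it to a power of two) lets the compiler emit the cheaper masking form. A hedged illustration, not patch code:

/* Illustration only; assumes nbanks > 0. */
static inline int
bank_of(int pageno, int nbanks)
{
    if ((nbanks & (nbanks - 1)) == 0)   /* power of two? */
        return pageno & (nbanks - 1);   /* e.g. pageno & 0xf when nbanks == 16 */
    return pageno % nbanks;             /* general case, needs a division */
}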
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Thu, Nov 9, 2023 at 9:39 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Nov 8, 2023 at 6:41 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > Here is the updated version of the patch, here I have taken the > > approach suggested by Andrey and I discussed the same with Alvaro > > offlist and he also agrees with it. So the idea is that we will keep > > the bank size fixed which is 16 buffers per bank and the allowed GUC > > value for each slru buffer must be in multiple of the bank size. We > > have removed the centralized lock but instead of one lock per bank, we > > have kept the maximum limit on the number of bank locks which is 128. > > We kept the max limit as 128 because, in one of the operations (i.e. > > ActivateCommitTs), we need to acquire all the bank locks (but this is > > not a performance path at all) and at a time we can acquire a max of > > 200 LWlocks, so we think this limit of 128 is good. So now if the > > number of banks is <= 128 then we will be using one lock per bank > > otherwise the one lock may protect access of buffer in multiple banks. > > Just so I understand, I guess this means that an SLRU is limited to > 16*128 = 2k buffers = 16MB? Not really, because 128 is the maximum limit on the number of bank locks not on the number of banks. So for example, if you have 16*128 = 2k buffers then each lock will protect one bank, and likewise when you have 16 * 512 = 8k buffers then each lock will protect 4 banks. So in short we can get the lock for each bank by simple computation (banklockno = bankno % 128 ) > When we were talking about this earlier, I suggested fixing the number > of banks and allowing the number of buffers per bank to scale > depending on the setting. That seemed simpler than allowing both the > number of banks and the number of buffers to vary, and it might allow > the compiler to optimize some code better, by converting a calculation > like page_no%number_of_banks into a masking operation like page_no&0xf > or whatever. However, because it allows an individual bank to become > arbitrarily large, it more or less requires us to use a buffer mapping > table. Some of the performance problems mentioned could be alleviated > by omitting the hash table when the number of buffers per bank is > small, and we could also create the dynahash with a custom hash > function that just does modular arithmetic on the page number rather > than a real hashing operation. However, maybe we don't really need to > do any of that. I agree that dynahash is clunky on a good day. I > hadn't realized the impact would be so noticeable. Yes, so one idea is that we keep the number of banks fixed and with that, as you pointed out correctly with a large number of buffers, the bank size can be quite big and for that, we would need a hash table and OTOH what I am doing here is keeping the bank size fixed and smaller (16 buffers per bank) and with that we can have large numbers of the bank when the buffer pool size is quite big. But I feel having more banks is not really a problem if we grow the number of locks beyond a certain limit as in some corner cases we need to acquire all locks together and there is a limit on that. 
So I like this idea of sharing locks across the banks with that 1) We can have enough locks so that lock contention or cache invalidation due to a common lock should not be a problem anymore 2) We can keep a small bank size with that seq search within the bank is quite fast so reads are fast 3) With small bank size victim buffer search which has to be sequential is quite fast. > This proposal takes the opposite approach of fixing the number of > buffers per bank, letting the number of banks vary. I think that's > probably fine, although it does reduce the effective associativity of > the cache. If there are more hot buffers in a bank than the bank size, > the bank will be contended, even if other banks are cold. However, > given the way SLRUs are accessed, it seems hard to imagine this being > a real problem in practice. There aren't likely to be say 20 hot > buffers that just so happen to all be separated from one another by a > number of pages that is a multiple of the configured number of banks. > And in the seemingly very unlikely event that you have a workload that > behaves like that, you could always adjust the number of banks up or > down by one, and the problem would go away. So this seems OK to me. I agree with this > I also agree with a couple of points that Alvaro made, specifically > that (1) this doesn't have to be perfect, just better than now and (2) > separate GUCs for each SLRU is fine. On the latter point, it's worth > keeping in mind that the cost of a GUC that most people don't need to > tune is fairly low. GUCs like work_mem and shared_buffers are > "expensive" because everybody more or less needs to understand what > they are and how to set them and getting the right value can tricky -- > but a GUC like autovacuum_naptime is a lot cheaper, because almost > nobody needs to change it. It seems to me that these GUCs will fall > into the latter category. Users can hopefully just ignore them except > if they see a contention on the SLRU bank locks -- and then they can > consider increasing the number of banks for that particular SLRU. That > seems simple enough. As with autovacuum_naptime, there is a danger > that people will configure a ridiculous value of the parameter for no > good reason and get bad results, so it would be nice if someday we had > a magical system that just got all of this right without the user > needing to configure anything. But in the meantime, it's better to > have a somewhat manual system to relieve pressure on these locks than > no system at all. +1 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
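To spell out the arithmetic described in the last two messages, here is a sketch of the page-to-bank and bank-to-lock mapping; the constant names are illustrative and may not match the patch.

#define SLRU_BANK_SIZE      16      /* buffers per bank, fixed by the proposal */
#define SLRU_MAX_BANKLOCKS  128     /* upper bound on the number of bank locks */

/* A page maps to a bank, and a bank maps to one of at most 128 locks.
 * With <= 128 banks this gives one lock per bank; with more banks one
 * lock covers several of them (e.g. 512 banks -> 4 banks per lock). */
static inline int
slru_bank_of(int pageno, int nbanks)
{
    return pageno % nbanks;
}

static inline int
slru_banklock_of(int bankno)
{
    return bankno % SLRU_MAX_BANKLOCKS;
}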
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Thu, Nov 9, 2023 at 4:55 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > IMO the whole area of SLRU buffering is in horrible shape and many users > are struggling with overall PG performance because of it. An > improvement doesn't have to be perfect -- it just has to be much better > than the current situation, which should be easy enough. We can > continue to improve later, using more scalable algorithms or ones that > allow us to raise the limits higher. I agree with this. > The only point on which we do not have full consensus yet is the need to > have one GUC per SLRU, and a lot of effort seems focused on trying to > fix the problem without adding so many GUCs (for example, using shared > buffers instead, or use a single "scaling" GUC). I think that hinders > progress. Let's just add multiple GUCs, and users can leave most of > them alone and only adjust the one with which they have a performance > problems; it's not going to be the same one for everybody. +1 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Nathan Bossart
Date:
On Fri, Nov 10, 2023 at 10:17:49AM +0530, Dilip Kumar wrote: > On Thu, Nov 9, 2023 at 4:55 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: >> The only point on which we do not have full consensus yet is the need to >> have one GUC per SLRU, and a lot of effort seems focused on trying to >> fix the problem without adding so many GUCs (for example, using shared >> buffers instead, or use a single "scaling" GUC). I think that hinders >> progress. Let's just add multiple GUCs, and users can leave most of >> them alone and only adjust the one with which they have a performance >> problems; it's not going to be the same one for everybody. > > +1 +1 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
I just noticed that 0003 does some changes to TransactionGroupUpdateXidStatus() that haven't been adequately explained AFAICS. How do you know that these changes are safe? 0001 contains one typo in the docs, "cotents". I'm not a fan of the fact that some CLOG sizing macros moved to clog.h, leaving others in clog.c. Maybe add commentary cross-linking both. Alternatively, perhaps allowing xact_buffers to grow beyond 65536 up to the slru.h-defined limit of 131072 is not that bad, even if it's more than could possibly be needed for xact_buffers; nobody is going to use 64k buffers, since useful values are below a couple thousand anyhow. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ Tom: There seems to be something broken here. Teodor: I'm in sackcloth and ashes... Fixed. http://postgr.es/m/482D1632.8010507@sigaev.ru
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Thu, Nov 16, 2023 at 3:11 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > I just noticed that 0003 does some changes to > TransactionGroupUpdateXidStatus() that haven't been adequately > explained AFAICS. How do you know that these changes are safe? IMHO this is safe as well as logical to do w.r.t. performance. It's safe because whenever we are updating any page in a group we are acquiring the respective bank lock in exclusive mode and in extreme cases if there are pages from different banks then we do switch the lock as well before updating the pages from different groups. And, we do not wake any process in a group unless we have done the status update for all the processes so there could not be any race condition as well. Also, It should not affect the performance adversely as well and this will not remove the need for group updates. The main use case of group update is that it will optimize a situation when most of the processes are contending for status updates on the same page and processes that are waiting for status updates on different pages will go to different groups w.r.t. that page, so in short in a group on best effort basis we are trying to have the processes which are waiting to update the same clog page that mean logically all the processes in the group will be waiting on the same bank lock. In an extreme situation if there are processes in the group that are trying to update different pages or even pages from different banks then we are handling it well by changing the lock. Although someone may raise a concern that in cases where there are processes that are waiting for different bank locks then after releasing one lock why not wake up those processes, I think that is not required because that is the situation we are trying to avoid where there are processes trying to update different are in the same group so there is no point in adding complexity to optimize that case. > 0001 contains one typo in the docs, "cotents". > > I'm not a fan of the fact that some CLOG sizing macros moved to clog.h, > leaving others in clog.c. Maybe add commentary cross-linking both. > Alternatively, perhaps allowing xact_buffers to grow beyond 65536 up to > the slru.h-defined limit of 131072 is not that bad, even if it's more > than could possibly be needed for xact_buffers; nobody is going to use > 64k buffers, since useful values are below a couple thousand anyhow. I agree, that allowing xact_buffers to grow beyond 65536 up to the slru.h-defined limit of 131072 is not that bad, so I will change that in the next version. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
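A very rough sketch of the lock-switching behaviour described above, to make the argument easier to follow. This is not the patch's code: the flattened member list here is hypothetical (the real code walks a PGPROC queue), and SimpleLruGetBankLock() is the helper name discussed later in this thread.

#include "postgres.h"
#include "access/slru.h"        /* SlruCtl; SimpleLruGetBankLock() per this patch */
#include "storage/lwlock.h"     /* LWLockAcquire(), LWLockRelease() */

/* Hypothetical flattened view of one queued group member. */
typedef struct GroupMemberSketch
{
    TransactionId xid;
    int           pageno;
} GroupMemberSketch;

/* The leader keeps holding one bank lock while consecutive members hit the
 * same bank, and switches locks only when a member's page lives elsewhere.
 * Followers are woken only after every status update is done. */
static void
group_update_sketch(SlruCtl ctl, GroupMemberSketch *members, int nmembers)
{
    LWLock *curlock = NULL;

    for (int i = 0; i < nmembers; i++)
    {
        LWLock *banklock = SimpleLruGetBankLock(ctl, members[i].pageno);

        if (banklock != curlock)
        {
            if (curlock != NULL)
                LWLockRelease(curlock);
            LWLockAcquire(banklock, LW_EXCLUSIVE);
            curlock = banklock;
        }
        /* ... set the commit status of members[i].xid on its CLOG page ... */
    }

    if (curlock != NULL)
        LWLockRelease(curlock);
}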
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Fri, Nov 17, 2023 at 1:09 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Nov 16, 2023 at 3:11 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: PFA, updated patch version, this fixes the comment given by Alvaro and also improves some of the comments. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
In SlruSharedData, a new comment is added that starts: "Instead of global counter we maintain a bank-wise lru counter because ..." You don't need to explain what we don't do. Just explain what we do do. So remove the words "Instead of a global counter" from there, because they offer no wisdom. Same with the phrase "so there is no point to ...". I think "The oldest page is therefore" should say "The oldest page *in the bank* is therefore", for extra clarity. I wonder what's the deal with false sharing in the new bank_cur_lru_count array. Maybe instead of using LWLockPadded for bank_locks, we should have a new struct, with both the LWLock and the LRU counter; then pad *that* to the cacheline size. This way, both the lwlock and the counter come to the CPU running this code together. Looking at SlruRecentlyUsed, which was a macro and is now a function. The comment about "multiple evaluation of arguments" no longer applies, so it needs to be removed; and it shouldn't talk about itself being a macro. Using "Size" as type for bank_mask looks odd. For a bitmask, maybe it's be more appropriate to use bits64 if we do need a 64-bit mask (we don't have bits64, but it's easy to add a typedef). I bet we don't really need a 64 bit mask, and a 32-bit or even a 16-bit is sufficient, given the other limitations on number of buffers. I think SimpleLruReadPage should have this assert at start: + Assert(LWLockHeldByMe(SimpleLruGetSLRUBankLock(ctl, pageno))); Do we really need one separate lwlock tranche for each SLRU? -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "Cuando mañana llegue pelearemos segun lo que mañana exija" (Mowgli)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
On 2023-Nov-17, Dilip Kumar wrote: > On Thu, Nov 16, 2023 at 3:11 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > I just noticed that 0003 does some changes to > > TransactionGroupUpdateXidStatus() that haven't been adequately > > explained AFAICS. How do you know that these changes are safe? > > IMHO this is safe as well as logical to do w.r.t. performance. It's > safe because whenever we are updating any page in a group we are > acquiring the respective bank lock in exclusive mode and in extreme > cases if there are pages from different banks then we do switch the > lock as well before updating the pages from different groups. Looking at the coverage for this code, https://coverage.postgresql.org/src/backend/access/transam/clog.c.gcov.html#413 it seems in our test suites we never hit the case where there is anything in the "nextidx" field for commit groups. To be honest, I don't understand this group stuff, and so I'm doubly hesitant to go without any testing here. Maybe it'd be possible to use Michael Paquier's injection points somehow? I think in the code comments where you use "w.r.t.", that acronym can be replaced with "for", which improves readability. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "All rings of power are equal, But some rings of power are more equal than others." (George Orwell's The Lord of the Rings)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Fri, Nov 17, 2023 at 6:16 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: Thanks for the review, all comments looks fine to me, replying to those that need some clarification > I wonder what's the deal with false sharing in the new > bank_cur_lru_count array. Maybe instead of using LWLockPadded for > bank_locks, we should have a new struct, with both the LWLock and the > LRU counter; then pad *that* to the cacheline size. This way, both the > lwlock and the counter come to the CPU running this code together. Actually, the array lengths of both LWLock and the LRU counter are different so I don't think we can move them to a common structure. The length of the *buffer_locks array is equal to the number of slots, the length of the *bank_locks array is Min (number_of_banks, 128), and the length of the *bank_cur_lru_count array is number_of_banks. > Looking at the coverage for this code, > https://coverage.postgresql.org/src/backend/access/transam/clog.c.gcov.html#413 > it seems in our test suites we never hit the case where there is > anything in the "nextidx" field for commit groups. To be honest, I > don't understand this group stuff, and so I'm doubly hesitant to go > without any testing here. Maybe it'd be possible to use Michael > Paquier's injection points somehow? Sorry, but I am not aware of "Michael Paquier's injection" Is it something already in the repo? Can you redirect me to some of the example test cases if we already have them? Then I will try it out. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
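For readers keeping track, the three arrays Dilip enumerates have roughly these shapes; this is a sketch of the idea, not the literal struct from the patch.

#include "storage/lwlock.h"     /* LWLockPadded */

/* Sketch of the relevant SlruSharedData members as described above. */
typedef struct SlruSharedDataSketch
{
    int           num_slots;            /* configured buffer count */
    int           num_banks;            /* num_slots / bank size (16) */

    LWLockPadded *buffer_locks;         /* one per slot: num_slots entries */
    LWLockPadded *bank_locks;           /* Min(num_banks, 128) entries */
    int          *bank_cur_lru_count;   /* one LRU counter per bank: num_banks entries */
} SlruSharedDataSketch;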
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
On 2023-Nov-18, Dilip Kumar wrote: > On Fri, Nov 17, 2023 at 6:16 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > I wonder what's the deal with false sharing in the new > > bank_cur_lru_count array. Maybe instead of using LWLockPadded for > > bank_locks, we should have a new struct, with both the LWLock and the > > LRU counter; then pad *that* to the cacheline size. This way, both the > > lwlock and the counter come to the CPU running this code together. > > Actually, the array lengths of both LWLock and the LRU counter are > different so I don't think we can move them to a common structure. > The length of the *buffer_locks array is equal to the number of slots, > the length of the *bank_locks array is Min (number_of_banks, 128), and > the length of the *bank_cur_lru_count array is number_of_banks. Oh. > > Looking at the coverage for this code, > > https://coverage.postgresql.org/src/backend/access/transam/clog.c.gcov.html#413 > > it seems in our test suites we never hit the case where there is > > anything in the "nextidx" field for commit groups. To be honest, I > > don't understand this group stuff, and so I'm doubly hesitant to go > > without any testing here. Maybe it'd be possible to use Michael > > Paquier's injection points somehow? > > Sorry, but I am not aware of "Michael Paquier's injection" Is it > something already in the repo? Can you redirect me to some of the > example test cases if we already have them? Then I will try it out. https://postgr.es/ZVWufO_YKzTJHEHW@paquier.xyz -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Sallah, I said NO camels! That's FIVE camels; can't you count?" (Indiana Jones)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
"Andrey M. Borodin"
Date:
> On 17 Nov 2023, at 16:11, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Nov 17, 2023 at 1:09 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>
>> On Thu, Nov 16, 2023 at 3:11 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> PFA, updated patch version, this fixes the comment given by Alvaro and
> also improves some of the comments.

I’ve skimmed through the patch set. Here are some minor notes.

1. The cycles “for (slotno = bankstart; slotno < bankend; slotno++)” in SlruSelectLRUPage() and SimpleLruReadPage_ReadOnly() now have identical comments. I think a little copy-paste is OK. But SimpleLruReadPage_ReadOnly() does pgstat_count_slru_page_hit(), while SlruSelectLRUPage() does not. This is not related to the patch set, just code nearby.

2. Do we really want these functions all doing the same thing?
extern bool check_multixact_offsets_buffers(int *newval, void **extra, GucSource source);
extern bool check_multixact_members_buffers(int *newval, void **extra, GucSource source);
extern bool check_subtrans_buffers(int *newval, void **extra, GucSource source);
extern bool check_notify_buffers(int *newval, void **extra, GucSource source);
extern bool check_serial_buffers(int *newval, void **extra, GucSource source);
extern bool check_xact_buffers(int *newval, void **extra, GucSource source);
extern bool check_commit_ts_buffers(int *newval, void **extra, GucSource source);

3. The name SimpleLruGetSLRUBankLock() contains the meaning of SLRU twice. I’d suggest truncating the prefix or the infix.

I do not have a hard opinion on any of these items.

Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Sun, Nov 19, 2023 at 12:39 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: > > I’ve skimmed through the patch set. Here are some minor notes. Thanks for the review > > 1. Cycles “for (slotno = bankstart; slotno < bankend; slotno++)” in SlruSelectLRUPage() and SimpleLruReadPage_ReadOnly()now have identical comments. I think a little of copy-paste is OK. > But SimpleLruReadPage_ReadOnly() does pgstat_count_slru_page_hit(), while SlruSelectLRUPage() does not. This is not relatedto the patch set, just a code nearby. Do you mean to say we need to modify the comments or you are saying pgstat_count_slru_page_hit() is missing in SlruSelectLRUPage(), if it is later then I can see the caller of SlruSelectLRUPage() is calling pgstat_count_slru_page_hit() and the SlruRecentlyUsed(). > 2. Do we really want these functions doing all the same? > extern bool check_multixact_offsets_buffers(int *newval, void **extra,GucSource source); > extern bool check_multixact_members_buffers(int *newval, void **extra,GucSource source); > extern bool check_subtrans_buffers(int *newval, void **extra,GucSource source); > extern bool check_notify_buffers(int *newval, void **extra, GucSource source); > extern bool check_serial_buffers(int *newval, void **extra, GucSource source); > extern bool check_xact_buffers(int *newval, void **extra, GucSource source); > extern bool check_commit_ts_buffers(int *newval, void **extra,GucSource source); I tried duplicating these by doing all the work inside the check_slru_buffers() function. But I think it is hard to make them a single function because there is no option to pass an SLRU name in the GUC check hook and IMHO in the check hook we need to print the GUC name, any suggestions on how we can avoid having so many functions? > 3. The name SimpleLruGetSLRUBankLock() contains meaning of SLRU twice. I’d suggest truncating prefix of infix. > > I do not have hard opinion on any of this items. > I prefer SimpleLruGetBankLock() so that it is consistent with other external functions starting with "SimpleLruGet", are you fine with this name? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
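One way the duplication could be reduced, sketched here under the constraint Dilip mentions (the GUC check-hook signature has no slot for a name): keep a shared helper and let each per-SLRU hook pass its own GUC name for the error message. The helper, the validation rule, and the GUC names below are illustrative only.

#include "postgres.h"
#include "utils/guc.h"          /* GucSource, GUC_check_errdetail() */

/* Hypothetical shared helper; each thin wrapper only supplies the name. */
static bool
check_slru_buffers(const char *name, int *newval)
{
    /* Illustrative rule: values must be multiples of the bank size (16). */
    if (*newval % 16 == 0)
        return true;

    GUC_check_errdetail("\"%s\" must be a multiple of 16.", name);
    return false;
}

bool
check_commit_ts_buffers(int *newval, void **extra, GucSource source)
{
    return check_slru_buffers("commit_timestamp_buffers", newval);
}

bool
check_notify_buffers(int *newval, void **extra, GucSource source)
{
    return check_slru_buffers("notify_buffers", newval);
}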
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Fri, Nov 17, 2023 at 7:28 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2023-Nov-17, Dilip Kumar wrote: I think I need some more clarification for some of the review comments > > On Thu, Nov 16, 2023 at 3:11 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > > > I just noticed that 0003 does some changes to > > > TransactionGroupUpdateXidStatus() that haven't been adequately > > > explained AFAICS. How do you know that these changes are safe? > > > > IMHO this is safe as well as logical to do w.r.t. performance. It's > > safe because whenever we are updating any page in a group we are > > acquiring the respective bank lock in exclusive mode and in extreme > > cases if there are pages from different banks then we do switch the > > lock as well before updating the pages from different groups. > > Looking at the coverage for this code, > https://coverage.postgresql.org/src/backend/access/transam/clog.c.gcov.html#413 > it seems in our test suites we never hit the case where there is > anything in the "nextidx" field for commit groups. 1) I was looking into your coverage report and I have attached a screenshot from the same, it seems we do hit the block where nextidx is not INVALID_PGPROCNO, which means there is some other process other than the group leader. Although I have already started exploring the injection point but just wanted to be sure what is your main concern point about the coverage so though of checking that first. 470 : /* 471 : * If the list was not empty, the leader will update the status of our 472 : * XID. It is impossible to have followers without a leader because the 473 : * first process that has added itself to the list will always have 474 : * nextidx as INVALID_PGPROCNO. 475 : */ 476 98 : if (nextidx != INVALID_PGPROCNO) 477 : { 478 2 : int extraWaits = 0; 479 : 480 : /* Sleep until the leader updates our XID status. */ 481 2 : pgstat_report_wait_start(WAIT_EVENT_XACT_GROUP_UPDATE); 482 : for (;;) 483 : { 484 : /* acts as a read barrier */ 485 2 : PGSemaphoreLock(proc->sem); 486 2 : if (!proc->clogGroupMember) 487 2 : break; 488 0 : extraWaits++; 489 : } 2) Do we really need one separate lwlock tranche for each SLRU? IMHO if we use the same lwlock tranche then the wait event will show the same wait event name, right? And that would be confusing for the user, whether we are waiting for Subtransaction or Multixact or anything else. Is my understanding no correct here? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
"Andrey M. Borodin"
Date:
> On 20 Nov 2023, at 13:51, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> 2) Do we really need one separate lwlock tranche for each SLRU?
>
> IMHO if we use the same lwlock tranche then the wait event will show
> the same wait event name, right? And that would be confusing for the
> user, whether we are waiting for Subtransaction or Multixact or
> anything else. Is my understanding no correct here?

If we give a user multiple GUCs to tweak, I think we should give a way to understand which GUC to tweak when they observe wait times.

Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Mon, Nov 20, 2023 at 2:37 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: > > On 20 Nov 2023, at 13:51, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > 2) Do we really need one separate lwlock tranche for each SLRU? > > > > IMHO if we use the same lwlock tranche then the wait event will show > > the same wait event name, right? And that would be confusing for the > > user, whether we are waiting for Subtransaction or Multixact or > > anything else. Is my understanding no correct here? > > If we give to a user multiple GUCs to tweak, I think we should give a way to understand which GUC to tweak when they observewait times. +1 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Mon, Nov 20, 2023 at 4:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Nov 20, 2023 at 2:37 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: > > > > On 20 Nov 2023, at 13:51, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > 2) Do we really need one separate lwlock tranche for each SLRU? > > > > > > IMHO if we use the same lwlock tranche then the wait event will show > > > the same wait event name, right? And that would be confusing for the > > > user, whether we are waiting for Subtransaction or Multixact or > > > anything else. Is my understanding no correct here? > > > > If we give to a user multiple GUCs to tweak, I think we should give a way to understand which GUC to tweak when theyobserve wait times. PFA, updated patch set, I have worked on review comments by Alvaro and Andrey. So the only open comments are about clog group commit testing, for that my question was as I sent in the previous email exactly what part we are worried about in the coverage report? The second point is, if we want to generate a group update we will have to create the injection point after we hold the control lock so that other processes go for group update and then for waking up the waiting process who is holding the SLRU control lock in the exclusive mode we would need to call a function ('test_injection_points_wake()') to wake that up and for calling the function we would need to again acquire the SLRU lock in read mode for visibility check in the catalog for fetching the procedure row and now this wake up session will block on control lock for the session which is waiting on injection point so now it will create a deadlock. Maybe with bank-wise lock we can create a lot of transaction so that these 2 falls in different banks and then we can somehow test this, but then we will have to generate 16 * 4096 = 64k transaction so that the SLRU banks are different for the transaction which inserted procedure row in system table from the transaction in which we are trying to do the group commit -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Tue, Nov 21, 2023 at 2:03 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Nov 20, 2023 at 4:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Mon, Nov 20, 2023 at 2:37 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: > > > > > > On 20 Nov 2023, at 13:51, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > 2) Do we really need one separate lwlock tranche for each SLRU? > > > > > > > > IMHO if we use the same lwlock tranche then the wait event will show > > > > the same wait event name, right? And that would be confusing for the > > > > user, whether we are waiting for Subtransaction or Multixact or > > > > anything else. Is my understanding no correct here? > > > > > > If we give to a user multiple GUCs to tweak, I think we should give a way to understand which GUC to tweak when theyobserve wait times. > > PFA, updated patch set, I have worked on review comments by Alvaro and > Andrey. So the only open comments are about clog group commit > testing, for that my question was as I sent in the previous email > exactly what part we are worried about in the coverage report? > > The second point is, if we want to generate a group update we will > have to create the injection point after we hold the control lock so > that other processes go for group update and then for waking up the > waiting process who is holding the SLRU control lock in the exclusive > mode we would need to call a function ('test_injection_points_wake()') > to wake that up and for calling the function we would need to again > acquire the SLRU lock in read mode for visibility check in the catalog > for fetching the procedure row and now this wake up session will block > on control lock for the session which is waiting on injection point so > now it will create a deadlock. Maybe with bank-wise lock we can > create a lot of transaction so that these 2 falls in different banks > and then we can somehow test this, but then we will have to generate > 16 * 4096 = 64k transaction so that the SLRU banks are different for > the transaction which inserted procedure row in system table from the > transaction in which we are trying to do the group commit I have attached a POC patch for testing the group update using the injection point framework. This is just for testing the group update part and is not yet a committable test. I have added a bunch of logs in the code so that we can see what's going on with the group update. From the below logs, we can see that multiple processes are getting accumulated for the group update and the leader is updating their xid status. Note: With this testing, we have found a bug in the bank-wise approach, basically we are clearing a procglobal->clogGroupFirst, even before acquiring the bank lock that means in most of the cases there will be a single process in each group as a group leader (I think this is what Alvaro was pointing in his coverage report). I have added this fix in this POC just for testing purposes but in my next version I will add this fix to my proper patch version after a proper review and a bit more testing. 
here is the output after running the test ============== 2023-11-23 05:55:29.399 UTC [93367] 003_clog_group_commit.pl LOG: procno 6 got the lock 2023-11-23 05:55:29.399 UTC [93367] 003_clog_group_commit.pl STATEMENT: SELECT txid_current(); 2023-11-23 05:55:29.406 UTC [93369] 003_clog_group_commit.pl LOG: statement: SELECT test_injection_points_attach('ClogGroupCommit', 'wait'); 2023-11-23 05:55:29.415 UTC [93371] 003_clog_group_commit.pl LOG: statement: INSERT INTO test VALUES(1); 2023-11-23 05:55:29.416 UTC [93371] 003_clog_group_commit.pl LOG: procno 4 got the lock 2023-11-23 05:55:29.416 UTC [93371] 003_clog_group_commit.pl STATEMENT: INSERT INTO test VALUES(1); 2023-11-23 05:55:29.424 UTC [93373] 003_clog_group_commit.pl LOG: statement: INSERT INTO test VALUES(2); 2023-11-23 05:55:29.425 UTC [93373] 003_clog_group_commit.pl LOG: procno 3 for xid 128742 added for group update 2023-11-23 05:55:29.425 UTC [93373] 003_clog_group_commit.pl STATEMENT: INSERT INTO test VALUES(2); 2023-11-23 05:55:29.431 UTC [93376] 003_clog_group_commit.pl LOG: statement: INSERT INTO test VALUES(3); 2023-11-23 05:55:29.438 UTC [93378] 003_clog_group_commit.pl LOG: statement: INSERT INTO test VALUES(4); 2023-11-23 05:55:29.438 UTC [93376] 003_clog_group_commit.pl LOG: procno 2 for xid 128743 added for group update 2023-11-23 05:55:29.438 UTC [93376] 003_clog_group_commit.pl STATEMENT: INSERT INTO test VALUES(3); 2023-11-23 05:55:29.438 UTC [93376] 003_clog_group_commit.pl LOG: procno 2 is follower and wait for group leader to update commit status of xid 128743 2023-11-23 05:55:29.438 UTC [93376] 003_clog_group_commit.pl STATEMENT: INSERT INTO test VALUES(3); 2023-11-23 05:55:29.439 UTC [93378] 003_clog_group_commit.pl LOG: procno 1 for xid 128744 added for group update 2023-11-23 05:55:29.439 UTC [93378] 003_clog_group_commit.pl STATEMENT: INSERT INTO test VALUES(4); 2023-11-23 05:55:29.439 UTC [93378] 003_clog_group_commit.pl LOG: procno 1 is follower and wait for group leader to update commit status of xid 128744 2023-11-23 05:55:29.439 UTC [93378] 003_clog_group_commit.pl STATEMENT: INSERT INTO test VALUES(4); 2023-11-23 05:55:29.445 UTC [93380] 003_clog_group_commit.pl LOG: statement: INSERT INTO test VALUES(5); 2023-11-23 05:55:29.446 UTC [93380] 003_clog_group_commit.pl LOG: procno 0 for xid 128745 added for group update 2023-11-23 05:55:29.446 UTC [93380] 003_clog_group_commit.pl STATEMENT: INSERT INTO test VALUES(5); 2023-11-23 05:55:29.446 UTC [93380] 003_clog_group_commit.pl LOG: procno 0 is follower and wait for group leader to update commit status of xid 128745 2023-11-23 05:55:29.446 UTC [93380] 003_clog_group_commit.pl STATEMENT: INSERT INTO test VALUES(5); 2023-11-23 05:55:29.451 UTC [93382] 003_clog_group_commit.pl LOG: statement: SELECT test_injection_points_wake(); 2023-11-23 05:55:29.460 UTC [93384] 003_clog_group_commit.pl LOG: statement: SELECT test_injection_points_detach('ClogGroupCommit'); ============= -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Thu, Nov 23, 2023 at 11:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > Note: With this testing, we have found a bug in the bank-wise > approach, basically we are clearing a procglobal->clogGroupFirst, even > before acquiring the bank lock that means in most of the cases there > will be a single process in each group as a group leader I realized that the bug fix I have done is not proper, so will send the updated patch set with the proper fix soon. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Fri, Nov 24, 2023 at 10:17 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Nov 23, 2023 at 11:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > Note: With this testing, we have found a bug in the bank-wise > > approach, basically we are clearing a procglobal->clogGroupFirst, even > > before acquiring the bank lock that means in most of the cases there > > will be a single process in each group as a group leader > > I realized that the bug fix I have done is not proper, so will send > the updated patch set with the proper fix soon. PFA, updated patch set fixes the bug found during the testing of the group update using the injection point. Also attached a path to test the injection point but for that, we need to apply the injection point patches [1] [1] https://www.postgresql.org/message-id/ZWACtHPetBFIvP61%40paquier.xyz -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
tender wang
Date:
The v8-0001 patch failed to apply in my local repo as below:
git apply v8-0001-Make-all-SLRU-buffer-sizes-configurable.patch
error: patch failed: src/backend/access/transam/multixact.c:1851
error: src/backend/access/transam/multixact.c: patch does not apply
error: patch failed: src/backend/access/transam/subtrans.c:184
error: src/backend/access/transam/subtrans.c: patch does not apply
error: patch failed: src/backend/commands/async.c:117
error: src/backend/commands/async.c: patch does not apply
error: patch failed: src/backend/storage/lmgr/predicate.c:808
error: src/backend/storage/lmgr/predicate.c: patch does not apply
error: patch failed: src/include/commands/async.h:15
error: src/include/commands/async.h: patch does not apply
My local head commit is 15c9ac36299. Is there something I missed?
Dilip Kumar <dilipbalaut@gmail.com> 于2023年11月24日周五 17:08写道:
On Fri, Nov 24, 2023 at 10:17 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Nov 23, 2023 at 11:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > Note: With this testing, we have found a bug in the bank-wise
> > approach, basically we are clearing a procglobal->clogGroupFirst, even
> > before acquiring the bank lock that means in most of the cases there
> > will be a single process in each group as a group leader
>
> I realized that the bug fix I have done is not proper, so will send
> the updated patch set with the proper fix soon.
PFA, updated patch set fixes the bug found during the testing of the
group update using the injection point. Also attached a path to test
the injection point but for that, we need to apply the injection point
patches [1]
[1] https://www.postgresql.org/message-id/ZWACtHPetBFIvP61%40paquier.xyz
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
On 2023-Nov-29, tender wang wrote: > The v8-0001 patch failed to apply in my local repo as below: > > git apply v8-0001-Make-all-SLRU-buffer-sizes-configurable.patch > error: patch failed: src/backend/access/transam/multixact.c:1851 > error: src/backend/access/transam/multixact.c: patch does not apply > error: patch failed: src/backend/access/transam/subtrans.c:184 > error: src/backend/access/transam/subtrans.c: patch does not apply > error: patch failed: src/backend/commands/async.c:117 > error: src/backend/commands/async.c: patch does not apply > error: patch failed: src/backend/storage/lmgr/predicate.c:808 > error: src/backend/storage/lmgr/predicate.c: patch does not apply > error: patch failed: src/include/commands/async.h:15 > error: src/include/commands/async.h: patch does not apply Yeah, this patch series conflicts with today's commit 4ed8f0913bfd. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ Syntax error: function hell() needs an argument. Please choose what hell you want to involve.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Wed, Nov 29, 2023 at 3:29 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2023-Nov-29, tender wang wrote: > > > The v8-0001 patch failed to apply in my local repo as below: > > > > git apply v8-0001-Make-all-SLRU-buffer-sizes-configurable.patch > > error: patch failed: src/backend/access/transam/multixact.c:1851 > > error: src/backend/access/transam/multixact.c: patch does not apply > > error: patch failed: src/backend/access/transam/subtrans.c:184 > > error: src/backend/access/transam/subtrans.c: patch does not apply > > error: patch failed: src/backend/commands/async.c:117 > > error: src/backend/commands/async.c: patch does not apply > > error: patch failed: src/backend/storage/lmgr/predicate.c:808 > > error: src/backend/storage/lmgr/predicate.c: patch does not apply > > error: patch failed: src/include/commands/async.h:15 > > error: src/include/commands/async.h: patch does not apply > > Yeah, this patch series conflicts with today's commit 4ed8f0913bfd. I will send a rebased version by tomorrow. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Wed, Nov 29, 2023 at 4:58 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Nov 29, 2023 at 3:29 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > On 2023-Nov-29, tender wang wrote: > > > > > The v8-0001 patch failed to apply in my local repo as below: > > > > > > git apply v8-0001-Make-all-SLRU-buffer-sizes-configurable.patch > > > error: patch failed: src/backend/access/transam/multixact.c:1851 > > > error: src/backend/access/transam/multixact.c: patch does not apply > > > error: patch failed: src/backend/access/transam/subtrans.c:184 > > > error: src/backend/access/transam/subtrans.c: patch does not apply > > > error: patch failed: src/backend/commands/async.c:117 > > > error: src/backend/commands/async.c: patch does not apply > > > error: patch failed: src/backend/storage/lmgr/predicate.c:808 > > > error: src/backend/storage/lmgr/predicate.c: patch does not apply > > > error: patch failed: src/include/commands/async.h:15 > > > error: src/include/commands/async.h: patch does not apply > > > > Yeah, this patch series conflicts with today's commit 4ed8f0913bfd. > > I will send a rebased version by tomorrow. PFA, a rebased version of the patch, I have avoided attaching because a) that patch is POC to show the coverage and it has a dependency on the other thread b) the old patch still applies so it doesn't need rebase. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Thu, Nov 30, 2023 at 3:30 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Nov 29, 2023 at 4:58 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: Here is the updated patch based on some comments by tender wang (those comments were sent only to me) -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
[Added Andrey again in CC, because as I understand they are using this code or something like it in production. Please don't randomly remove people from CC lists.] I've been looking at this some more, and I'm not confident in that the group clog update stuff is correct. I think the injection points test case was good enough to discover a problem, but it's hard to get peace of mind that there aren't other, more subtle problems. The problem I see is that the group update mechanism is designed around contention of the global xact-SLRU control lock; it uses atomics to coordinate a single queue when the lock is contended. So if we split up the global SLRU control lock using banks, then multiple processes using different bank locks might not contend. OK, this is fine, but what happens if two separate groups of processes encounter contention on two different bank locks? I think they will both try to update the same queue, and coordinate access to that *using different bank locks*. I don't see how can this work correctly. I suspect the first part of that algorithm, where atomics are used to create the list without a lock, might work fine. But will each "leader" process, each of which is potentially using a different bank lock, coordinate correctly? Maybe this works correctly because only one process will find the queue head not empty? If this is what happens, then there needs to be comments about it. Without any explanation, this seems broken and potentially dangerous, as some transaction commit bits might become lost given high enough concurrency and bad luck. Maybe this can be made to work by having one more lwlock that we use solely to coordinate this task. Though we would have to demonstrate that coordinating this task with a different lock works correctly in conjunction with the per-bank lwlock usage in the regular slru.c paths. Andrey, do you have any stress tests or anything else that you used to gain confidence in this code? -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "El sabio habla porque tiene algo que decir; el tonto, porque tiene que decir algo" (Platon).
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Tue, Dec 12, 2023 at 6:58 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > [Added Andrey again in CC, because as I understand they are using this > code or something like it in production. Please don't randomly remove > people from CC lists.] Oh, glad to know that. Yeah, I generally do not remove people, but I have noticed that in the mail chain some of the reviewers replied only to me and the hackers' list, and from that point onwards I lost track of the CC list. > I've been looking at this some more, and I'm not confident in that the > group clog update stuff is correct. I think the injection points test > case was good enough to discover a problem, but it's hard to get peace > of mind that there aren't other, more subtle problems. Yeah, I agree. > The problem I see is that the group update mechanism is designed around > contention of the global xact-SLRU control lock; it uses atomics to > coordinate a single queue when the lock is contended. So if we split up > the global SLRU control lock using banks, then multiple processes using > different bank locks might not contend. OK, this is fine, but what > happens if two separate groups of processes encounter contention on two > different bank locks? I think they will both try to update the same > queue, and coordinate access to that *using different bank locks*. I > don't see how can this work correctly. Let's back up a bit and start from the current design with the centralized lock. With that design, if one process is holding the lock, the other processes will try to perform the group update; but if there is already a group that still hasn't got the lock and is trying to update a different CLOG page than the one this process wants to update, then the process will not add itself to the group update and will instead fall back to the normal lock wait. Now, in another situation, it may happen that the group leader of the other group has already got the control lock, in which case it would have cleared 'procglobal->clogGroupFirst', which means we will now start forming a different group. So, logically, if we talk only about the optimization part, the assumption is that when a lot of concurrent xids are being committed at the same time, those xids are mostly in the same range and will fall on the same SLRU page, so the group update will help them. But if we get an out-of-range xid of some long-running transaction, it might not even go for the group update because the page number will be different. Although the situation might be better here with bank-wise locks, because if those xids fall in an altogether different bank they might not even contend. Now, let's talk about correctness. I think even though we are getting processes that might be contending on different bank locks, we are still ensuring that in a group all the processes are trying to update the same SLRU page (i.e. the same bank as well; we will talk about the exception later [1]). One of the processes becomes the leader, and as soon as the leader gets the lock it detaches the queue from 'procglobal->clogGroupFirst' by setting it to INVALID_PGPROCNO, so that other group-update requesters can now form another parallel group. But here I do not see a problem with correctness. I agree someone might say that, since there is now a possibility that different groups get formed for different bank locks, no other group can start forming until we get the lock for our bank, because we do not clear 'procglobal->clogGroupFirst' before we get the lock.
Other requesters might want to update pages in different banks, so why block them? But the thing is, the group update design is optimized for the cases where all requesters are trying to update the status of xids generated in nearly the same range. > I suspect the first part of that algorithm, where atomics are used to > create the list without a lock, might work fine. But will each "leader" > process, each of which is potentially using a different bank lock, > coordinate correctly? Maybe this works correctly because only one > process will find the queue head not empty? If this is what happens, > then there needs to be comments about it. Yes, you got it right; I will try to comment on it better. > Without any explanation, > this seems broken and potentially dangerous, as some transaction commit > bits might become lost given high enough concurrency and bad luck. > > Maybe this can be made to work by having one more lwlock that we use > solely to coordinate this task. Do you mean a different lock for adding/removing entries in the list instead of the atomic operation? I think then we would lose the benefit we got from the group update, by having contention on yet another lock. [1] I think we already know about the exception case, and I have explained in the comments as well that in some cases we might add different clog page update requests to the same group; to handle that exceptional case we check the respective bank lock for each page, and if the exception occurs we release the old bank lock and acquire the new one. This case might not be performant, because now it is possible that after getting the lock the leader might need to wait again on another bank lock, but this is an extremely exceptional case so we should not be worried about its performance, and I do not see any correctness issue here either. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
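To make the queue mechanics described above easier to follow, here is a minimal standalone C sketch of the lock-free group-update list. The names (join_group, detach_group, groupFirst) are illustrative rather than the actual clog.c identifiers, and the real code also records the target page for each waiter and takes the appropriate bank lock before servicing the detached list.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define NPROCS       8
#define INVALID_PROC (-1)

typedef struct
{
    int xid;    /* request payload: xid whose status the leader must set */
    int next;   /* next proc in the group list, or INVALID_PROC */
} Proc;

static Proc procs[NPROCS];
static atomic_int groupFirst = INVALID_PROC;  /* analogue of procglobal->clogGroupFirst */

/* Push ourselves onto the group list; return true if we became the leader. */
static bool
join_group(int myproc, int xid)
{
    int head = atomic_load(&groupFirst);

    procs[myproc].xid = xid;
    do
    {
        procs[myproc].next = head;
    } while (!atomic_compare_exchange_weak(&groupFirst, &head, myproc));

    /* The process that found the list empty is the group leader. */
    return head == INVALID_PROC;
}

/* Leader: detach the whole list so that a new group can start forming. */
static int
detach_group(void)
{
    return atomic_exchange(&groupFirst, INVALID_PROC);
}

int
main(void)
{
    /* simulate three backends queuing requests; only the first is leader */
    printf("proc 0 leader? %d\n", join_group(0, 100));
    printf("proc 1 leader? %d\n", join_group(1, 101));
    printf("proc 2 leader? %d\n", join_group(2, 102));

    /* the leader walks the detached list and services every request */
    for (int p = detach_group(); p != INVALID_PROC; p = procs[p].next)
        printf("leader updates status of xid %d (proc %d)\n", procs[p].xid, p);

    return 0;
}

The single atomic head is what lets two leaders holding different bank locks coexist safely in this sketch: each leader only ever services the list it detached.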
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
"Andrey M. Borodin"
Date:
> On 12 Dec 2023, at 18:28, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > Andrey, do you have any stress tests or anything else that you used to > gain confidence in this code? We are using only first two steps of the patchset, these steps do not touch locking stuff. We’ve got some confidence after Yura Sokolov’s benchmarks [0]. Thanks! Best regards, Andrey Borodin. [0] https://www.postgresql.org/message-id/flat/e46cdea96979545b2d8a13b451d8b1ce61dc7238.camel%40postgrespro.ru#0ed2cad11470825d464093fe6b8ef6a3
On Mon, Dec 11, 2023 at 10:42 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Thu, Nov 30, 2023 at 3:30 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Nov 29, 2023 at 4:58 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
Here is the updated patch based on some comments by tender wang (those
comments were sent only to me)
few nitpicks:
+ /*
+ * Mask for slotno banks, considering 1GB SLRU buffer pool size and the
+ * SLRU_BANK_SIZE bits16 should be sufficient for the bank mask.
+ */
+ bits16 bank_mask;
} SlruCtlData;
...
...
+ int bankno = pageno & ctl->bank_mask;
I am a bit uncomfortable seeing it as a mask, why can't it be simply a number
of banks (num_banks) and get the bank number through modulus op (pageno %
num_banks) instead of bitwise & operation (pageno & ctl->bank_mask) which is a
bit difficult to read compared to modulus op which is quite simple,
straightforward and much common practice in hashing.
Are there any advantages of using & over % ?
Also, a few places in 0002 and 0003 patch, need the bank number, it is better
to have a macro for that.
---
extern bool SlruScanDirCbDeleteAll(SlruCtl ctl, char *filename, int64 segpage,
void *data);
-
+extern bool check_slru_buffers(const char *name, int *newval);
#endif /* SLRU_H */
Add an empty line after the declaration, in 0002 patch.
---
-TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn, int slotno)
+TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn,
+ int slotno)
Unrelated change for 0003 patch.
---
Regards,
Amul
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Thu, Dec 14, 2023 at 8:43 AM Amul Sul <sulamul@gmail.com> wrote: > > On Mon, Dec 11, 2023 at 10:42 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: >> >> On Thu, Nov 30, 2023 at 3:30 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: >> > >> > On Wed, Nov 29, 2023 at 4:58 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: >> >> Here is the updated patch based on some comments by tender wang (those >> comments were sent only to me) > > > few nitpicks: > > + > + /* > + * Mask for slotno banks, considering 1GB SLRU buffer pool size and the > + * SLRU_BANK_SIZE bits16 should be sufficient for the bank mask. > + */ > + bits16 bank_mask; > } SlruCtlData; > > ... > ... > > + int bankno = pageno & ctl->bank_mask; > > I am a bit uncomfortable seeing it as a mask, why can't it be simply a number > of banks (num_banks) and get the bank number through modulus op (pageno % > num_banks) instead of bitwise & operation (pageno & ctl->bank_mask) which is a > bit difficult to read compared to modulus op which is quite simple, > straightforward and much common practice in hashing. > > Are there any advantages of using & over % ? I am not sure either but since this change in 0002 is by Andrey, I will let him comment on this before we change or take any decision. > Also, a few places in 0002 and 0003 patch, need the bank number, it is better > to have a macro for that. > --- > > extern bool SlruScanDirCbDeleteAll(SlruCtl ctl, char *filename, int64 segpage, > void *data); > - > +extern bool check_slru_buffers(const char *name, int *newval); > #endif /* SLRU_H */ > > > Add an empty line after the declaration, in 0002 patch. > --- > > -TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn, int slotno) > +TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn, > + int slotno) > > Unrelated change for 0003 patch. Fixed Thanks for your review, PFA updated version. I have added @Amit Kapila to the list to view his opinion about whether anything can break in the clog group update with our changes of bank-wise SLRU lock. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
"Andrey M. Borodin"
Date:
> On 14 Dec 2023, at 08:12, Amul Sul <sulamul@gmail.com> wrote: > > > + int bankno = pageno & ctl->bank_mask; > > I am a bit uncomfortable seeing it as a mask, why can't it be simply a number > of banks (num_banks) and get the bank number through modulus op (pageno % > num_banks) instead of bitwise & operation (pageno & ctl->bank_mask) which is a > bit difficult to read compared to modulus op which is quite simple, > straightforward and much common practice in hashing. > > Are there any advantages of using & over % ? The instruction AND is ~20 times faster than IDIV [0]. This is a relatively hot function, worth sacrificing some readability to save ~ten nanoseconds on each check of a status of a transaction. [0] https://www.agner.org/optimize/instruction_tables.pdf
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
tender wang
Date:
Andrey M. Borodin <x4mmm@yandex-team.ru> wrote on Thursday, 14 December 2023 at 17:02:
> On 14 Dec 2023, at 08:12, Amul Sul <sulamul@gmail.com> wrote:
>
>
> + int bankno = pageno & ctl->bank_mask;
>
> I am a bit uncomfortable seeing it as a mask, why can't it be simply a number
> of banks (num_banks) and get the bank number through modulus op (pageno %
> num_banks) instead of bitwise & operation (pageno & ctl->bank_mask) which is a
> bit difficult to read compared to modulus op which is quite simple,
> straightforward and much common practice in hashing.
>
> Are there any advantages of using & over % ?
Using the Compiler Explorer [1] tool, '%' produces more assembly instructions than '&':
int GetBankno1(int pageno) {
    return pageno & 127;
}
int GetBankno2(int pageno) {
    return pageno % 127;
}
under clang 13.0:
GetBankno1: # @GetBankno1
push rbp
mov rbp, rsp
mov dword ptr [rbp - 4], edi
mov eax, dword ptr [rbp - 4]
and eax, 127
pop rbp
ret
GetBankno2: # @GetBankno2
push rbp
mov rbp, rsp
mov dword ptr [rbp - 4], edi
mov eax, dword ptr [rbp - 4]
mov ecx, 127
cdq
idiv ecx
mov eax, edx
pop rbp
ret
under gcc 13.2:
GetBankno1:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
mov eax, DWORD PTR [rbp-4]
and eax, 127
pop rbp
ret
GetBankno2:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
mov eax, DWORD PTR [rbp-4]
movsx rdx, eax
imul rdx, rdx, -2130574327
shr rdx, 32
add edx, eax
mov ecx, edx
sar ecx, 6
cdq
sub ecx, edx
mov edx, ecx
sal edx, 7
sub edx, ecx
sub eax, edx
mov ecx, eax
mov eax, ecx
pop rbp
ret
> The instruction AND is ~20 times faster than IDIV [0]. This is a relatively hot function, worth sacrificing some readability to save ~ten nanoseconds on each check of a status of a transaction.
Now that AND is faster, can we replace the '% SLRU_MAX_BANKLOCKS' operation in SimpleLruGetBankLock() with '& 127':
SimpleLruGetBankLock()
{
    int banklockno = (pageno & ctl->bank_mask) % SLRU_MAX_BANKLOCKS;   /* use '&' here too? */
    return &(ctl->shared->bank_locks[banklockno].lock);
}
Thoughts?
[0] https://www.agner.org/optimize/instruction_tables.pdf
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
"Andrey M. Borodin"
Date:
> On 14 Dec 2023, at 14:28, tender wang <tndrwang@gmail.com> wrote:
>
> Now that AND is faster, can we replace the '% SLRU_MAX_BANKLOCKS' operation in SimpleLruGetBankLock() with '&127'
unsigned int GetBankno1(unsigned int pageno) {
    return pageno & 127;
}
unsigned int GetBankno2(unsigned int pageno) {
    return pageno % 128;
}
Generates with -O2:
GetBankno1(unsigned int):
    mov eax, edi
    and eax, 127
    ret
GetBankno2(unsigned int):
    mov eax, edi
    and eax, 127
    ret
The compiler is smart enough with constants.
Best regards, Andrey Borodin.
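One caveat worth keeping in mind, shown as a small sketch for Compiler Explorer (the function names here are made up): the folding of '%' into AND only happens when the divisor is a power-of-two compile-time constant. When the bank count is only known at run time, as with a field in the SLRU control struct, '%' stays an integer division, while '&' with a precomputed mask remains a single AND.

unsigned int GetBanknoByRuntimeMod(unsigned int pageno, unsigned int nbanks)
{
    return pageno % nbanks;     /* divisor is not a constant: compiles to a DIV */
}

unsigned int GetBanknoByMask(unsigned int pageno, unsigned int bank_mask)
{
    return pageno & bank_mask;  /* always a single AND */
}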
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Wed, Dec 13, 2023 at 5:49 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: > > On 12 Dec 2023, at 18:28, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > Andrey, do you have any stress tests or anything else that you used to > > gain confidence in this code? > > We are using only first two steps of the patchset, these steps do not touch locking stuff. > > We’ve got some confidence after Yura Sokolov’s benchmarks [0]. Thanks! > I have run this test [1]; instead of comparing against the master, I have compared the effect of patch-1 (0001+0002, SLRU buffer banks) vs patch-2 (0001+0002+0003, SLRU buffer banks + bank-wise locks), and here are the results of benchmark-1 and benchmark-2. I have noticed a very good improvement with the addition of patch 0003.
Machine information: Architecture: x86_64, CPU op-mode(s): 32-bit, 64-bit, Byte Order: Little Endian, CPU(s): 128, On-line CPU(s) list: 0-127
configurations: max_wal_size=20GB shared_buffers=20GB checkpoint_timeout=40min max_connections=700 maintenance_work_mem=1GB subtrans_buffers=$variable multixact_offsets_buffers=$variable multixact_members_buffers=$variable
benchmark-1
version | subtrans | multixact | tps
        | buffers  | offs/memb | func+ballast
--------+----------+-----------+--------------
patch-1 |       64 |    64/128 |  87 + 58
patch-2 |       64 |    64/128 | 128 + 83
patch-1 |     1024 |  512/1024 |  96 + 64
patch-2 |     1024 |  512/1024 | 163 + 108
benchmark-2
version | subtrans | multixact | tps
        | buffers  | offs/memb | func
--------+----------+-----------+------
patch-1 |       64 |    64/128 |  10
patch-2 |       64 |    64/128 |  12
patch-1 |     1024 |  512/1024 |  44
patch-2 |     1024 |  512/1024 |  72
[1] https://www.postgresql.org/message-id/flat/e46cdea96979545b2d8a13b451d8b1ce61dc7238.camel%40postgrespro.ru#0ed2cad11470825d464093fe6b8ef6a3 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
"Andrey M. Borodin"
Date:
> On 14 Dec 2023, at 16:06, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > I have noticed > a very good improvement with the addition of patch 0003. Indeed, very impressive results! It’s almost a 2x performance gain in the high-contention scenario, on top of the previous improvements. Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
tender wang
Date:
Andrey M. Borodin <x4mmm@yandex-team.ru> wrote on Thursday, 14 December 2023 at 17:35:
> On 14 Dec 2023, at 14:28, tender wang <tndrwang@gmail.com> wrote:
>
> Now that AND is faster, can we replace the '% SLRU_MAX_BANKLOCKS' operation in SimpleLruGetBankLock() with '& 127'
unsigned int GetBankno1(unsigned int pageno) {
return pageno & 127;
}
unsigned int GetBankno2(unsigned int pageno) {
return pageno % 128;
}
Generates with -O2
GetBankno1(unsigned int):
mov eax, edi
and eax, 127
ret
GetBankno2(unsigned int):
mov eax, edi
and eax, 127
ret
Compiler is smart enough with constants.
Yeah, that's true.
int GetBankno(long pageno) {
    unsigned short bank_mask = 128;
    int bankno = (pageno & bank_mask) % 128;
    return bankno;
}
With -O2 enabled, only one instruction:
xor eax, eax
But if we use '%' in both places, things change as below:
int GetBankno(long pageno) {
    unsigned short bank_mask = 128;
    int bankno = (pageno % bank_mask) % 128;
    return bankno;
}
mov rdx, rdi
sar rdx, 63
shr rdx, 57
lea rax, [rdi+rdx]
and eax, 127
sub eax, edx
sar rdx, 63
shr rdx, 57
lea rax, [rdi+rdx]
and eax, 127
sub eax, edx
Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
"Andrey M. Borodin"
Date:
> On 14 Dec 2023, at 16:32, tender wang <tndrwang@gmail.com> wrote: > > With -O2 enabled, only one instruction: > xor eax, eax This is not fast code. This is how a friendly C compiler suggests to you that the mask must be 127, not 128. Best regards, Andrey Borodin.
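A tiny self-contained example of the point being made here (illustrative constants, not slru.c code): with a power-of-two bank count the mask has to be nbanks - 1, and then '&' and '%' agree for every page number.

#include <assert.h>
#include <stdio.h>

#define NUM_BANKS 128                 /* must be a power of two */
#define BANK_MASK (NUM_BANKS - 1)     /* 127, not 128 */

static int GetBankno(unsigned long pageno)
{
    /* both forms are equivalent because NUM_BANKS is a power of two */
    assert((pageno % NUM_BANKS) == (pageno & BANK_MASK));
    return (int) (pageno & BANK_MASK);
}

int main(void)
{
    for (unsigned long pageno = 0; pageno < 100000; pageno++)
        (void) GetBankno(pageno);
    printf("bank of page 300: %d\n", GetBankno(300));
    return 0;
}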
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Thu, Dec 14, 2023 at 4:36 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Dec 13, 2023 at 5:49 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: > > > > On 12 Dec 2023, at 18:28, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > > > Andrey, do you have any stress tests or anything else that you used to > > > gain confidence in this code? > > I have done some more testing for the clog group update as the attached test file executes two concurrent scripts executed with pgbench, the first script is the slow script which will run 10-second long transactions and the second script is a very fast transaction with ~10000 transactions per second. Along with that, I have also changed the bank size such that each bank will contain just 1 page i.e. 32k transactions per bank. I have done this way so that we do not need to keep long-running transactions running for very long in order to get the transactions from different banks committed during the same time. With this test, I have got that behavior and the below logs shows that multiple transaction range which is in different slru-bank (considering 32k transactions per bank) are doing group update at the same time. e.g. in the below logs, we can see xid range around 70600, 70548, and 70558, and xid range around 755, and 752 are getting group updates by different leaders but near the same time. It is running fine when running for a long duration, but I am not sure how to validate the sanity of this kind of test. 2023-12-14 14:43:31.813 GMT [3306] LOG: group leader procno 606 updated status of procno 606 xid 70600 2023-12-14 14:43:31.816 GMT [3326] LOG: procno 586 for xid 70548 added for group update 2023-12-14 14:43:31.816 GMT [3326] LOG: procno 586 is group leader and got the lock 2023-12-14 14:43:31.816 GMT [3326] LOG: group leader procno 586 updated status of procno 586 xid 70548 2023-12-14 14:43:31.818 GMT [3327] LOG: procno 585 for xid 70558 added for group update 2023-12-14 14:43:31.818 GMT [3327] LOG: procno 585 is group leader and got the lock 2023-12-14 14:43:31.818 GMT [3327] LOG: group leader procno 585 updated status of procno 585 xid 70558 2023-12-14 14:43:31.829 GMT [3155] LOG: procno 687 for xid 752 added for group update 2023-12-14 14:43:31.829 GMT [3207] LOG: procno 669 for xid 755 added for group update 2023-12-14 14:43:31.829 GMT [3155] LOG: procno 687 is group leader and got the lock 2023-12-14 14:43:31.829 GMT [3155] LOG: group leader procno 687 updated status of procno 669 xid 755 2023-12-14 14:43:31.829 GMT [3155] LOG: group leader procno 687 updated status of procno 687 xid 752 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Thu, Dec 14, 2023 at 1:53 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Dec 14, 2023 at 8:43 AM Amul Sul <sulamul@gmail.com> wrote: > > > > On Mon, Dec 11, 2023 at 10:42 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > >> > >> On Thu, Nov 30, 2023 at 3:30 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > >> > > >> > On Wed, Nov 29, 2023 at 4:58 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > >> > >> Here is the updated patch based on some comments by tender wang (those > >> comments were sent only to me) > > > > > > few nitpicks: > > > > + > > + /* > > + * Mask for slotno banks, considering 1GB SLRU buffer pool size and the > > + * SLRU_BANK_SIZE bits16 should be sufficient for the bank mask. > > + */ > > + bits16 bank_mask; > > } SlruCtlData; > > > > ... > > ... > > > > + int bankno = pageno & ctl->bank_mask; > > > > I am a bit uncomfortable seeing it as a mask, why can't it be simply a number > > of banks (num_banks) and get the bank number through modulus op (pageno % > > num_banks) instead of bitwise & operation (pageno & ctl->bank_mask) which is a > > bit difficult to read compared to modulus op which is quite simple, > > straightforward and much common practice in hashing. > > > > Are there any advantages of using & over % ? > > I am not sure either but since this change in 0002 is by Andrey, I > will let him comment on this before we change or take any decision. > > > Also, a few places in 0002 and 0003 patch, need the bank number, it is better > > to have a macro for that. > > --- > > > > extern bool SlruScanDirCbDeleteAll(SlruCtl ctl, char *filename, int64 segpage, > > void *data); > > - > > +extern bool check_slru_buffers(const char *name, int *newval); > > #endif /* SLRU_H */ > > > > > > Add an empty line after the declaration, in 0002 patch. > > --- > > > > -TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn, int slotno) > > +TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn, > > + int slotno) > > > > Unrelated change for 0003 patch. > > Fixed > > Thanks for your review, PFA updated version. > > I have added @Amit Kapila to the list to view his opinion about > whether anything can break in the clog group update with our changes > of bank-wise SLRU lock. Updated the comments about group commit safety based on the off-list suggestion by Alvaro. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Robert Haas
Date:
On Tue, Dec 12, 2023 at 8:29 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > The problem I see is that the group update mechanism is designed around > contention of the global xact-SLRU control lock; it uses atomics to > coordinate a single queue when the lock is contended. So if we split up > the global SLRU control lock using banks, then multiple processes using > different bank locks might not contend. OK, this is fine, but what > happens if two separate groups of processes encounter contention on two > different bank locks? I think they will both try to update the same > queue, and coordinate access to that *using different bank locks*. I > don't see how can this work correctly. > > I suspect the first part of that algorithm, where atomics are used to > create the list without a lock, might work fine. But will each "leader" > process, each of which is potentially using a different bank lock, > coordinate correctly? Maybe this works correctly because only one > process will find the queue head not empty? If this is what happens, > then there needs to be comments about it. Without any explanation, > this seems broken and potentially dangerous, as some transaction commit > bits might become lost given high enough concurrency and bad luck. I don't want to be dismissive of this concern, but I took a look at this part of the patch set and I don't see a correctness problem. I think the idea of the mechanism is that we have a single linked list in shared memory that can accumulate those waiters. At some point a process pops the entire list of waiters, which allows a new group of waiters to begin accumulating. The process that pops the list must perform the updates for every process in the just-popped list without failing, else updates would be lost. In theory, there can be any number of lists that some process has popped and is currently working its way through at the same time, although in practice I bet it's quite rare for there to be more than one. The only correctness problem is if it's possible for a process that popped the list to error out before it finishes doing the updates that it "promised" to do by popping the list. Having individual bank locks doesn't really change anything here. That's just a matter of what lock has to be held in order to perform the update that was promised, and the algorithm described in the previous paragraph doesn't really care about that. Nor is there a deadlock hazard here as long as processes only take one lock at a time, which I believe is the case. The only potential issue that I see here is with performance. I've heard some questions about whether this machinery performs well even as it stands, but certainly if we divide up the lock into a bunch of bankwise locks then that should tend in the direction of making a mechanism like this less valuable, because both mechanisms are trying to alleviate contention, and so in a certain sense they are competing for the same job. However, they do aim to alleviate different TYPES of contention: the group XID update stuff should be most valuable when lots of processes are trying to update the same page, and the banks should be most valuable when there is simultaneous access to a bunch of different pages. So I'm not convinced that this patch is a reason to remove the group XID update mechanism, but someone might argue otherwise. 
A related concern is that, if by chance we do end up with multiple updaters from different pages in the same group, it will now be more expensive to sort that out because we'll have to potentially keep switching locks. So that could make the group XID update mechanism less performant than it is currently. I think that's probably not an issue because I think it should be a rare occurrence, as the comments say. A bit more cost in a code path that is almost never taken won't matter. But if that path is more commonly taken than I think, then maybe making it more expensive could hurt. It might be worth adding some debugging to see how often we actually go down that path in a highly stressed system. BTW: + * as well. The main reason for now allowing requesters of different pages now -> not -- Robert Haas EDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Robert Haas
Date:
On Mon, Dec 18, 2023 at 12:04 PM Robert Haas <robertmhaas@gmail.com> wrote: > certain sense they are competing for the same job. However, they do > aim to alleviate different TYPES of contention: the group XID update > stuff should be most valuable when lots of processes are trying to > update the same page, and the banks should be most valuable when there > is simultaneous access to a bunch of different pages. So I'm not > convinced that this patch is a reason to remove the group XID update > mechanism, but someone might argue otherwise. Hmm, but, on the other hand: Currently all readers and writers are competing for the same LWLock. But with this change, the readers will (mostly) no longer be competing with the writers. So, in theory, that might reduce lock contention enough to make the group update mechanism pointless. -- Robert Haas EDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
"Andrey M. Borodin"
Date:
> On 18 Dec 2023, at 22:30, Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Dec 18, 2023 at 12:04 PM Robert Haas <robertmhaas@gmail.com> wrote: >> certain sense they are competing for the same job. However, they do >> aim to alleviate different TYPES of contention: the group XID update >> stuff should be most valuable when lots of processes are trying to >> update the same page, and the banks should be most valuable when there >> is simultaneous access to a bunch of different pages. So I'm not >> convinced that this patch is a reason to remove the group XID update >> mechanism, but someone might argue otherwise. > > Hmm, but, on the other hand: > > Currently all readers and writers are competing for the same LWLock. > But with this change, the readers will (mostly) no longer be competing > with the writers. So, in theory, that might reduce lock contention > enough to make the group update mechanism pointless. One page still accommodates 32K transaction statuses under one lock. It feels like a lot. About 1 second of transactions on a typical installation. When the group commit was committed, did we have a benchmark to estimate the efficiency of this technology? Can we repeat that test again? Best regards, Andrey Borodin.
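For reference, a small sketch of the arithmetic behind the "32K statuses per page" remark, assuming the default 8 kB block size and two clog status bits per transaction:

#include <stdio.h>

int main(void)
{
    const int page_size = 8192;    /* BLCKSZ, assumed default */
    const int xacts_per_byte = 4;  /* 2 status bits per transaction */

    /* 8192 * 4 = 32768 transaction statuses share one SLRU page;
     * at ~30k write transactions per second that is roughly one
     * second's worth of transactions behind a single bank lock. */
    printf("xacts per clog page: %d\n", page_size * xacts_per_byte);
    return 0;
}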
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Robert Haas
Date:
On Mon, Dec 18, 2023 at 12:53 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: > One page still accommodates 32K transaction statuses under one lock. It feels like a lot. About 1 second of transactionson a typical installation. > > When the group commit was committed did we have a benchmark to estimate efficiency of this technology? Can we repeat thattest again? I think we did, but it might take some research to find it in the archives. If we can, I agree that repeating it feels like a good idea. -- Robert Haas EDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Mon, Dec 18, 2023 at 11:00 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Dec 18, 2023 at 12:04 PM Robert Haas <robertmhaas@gmail.com> wrote: > > certain sense they are competing for the same job. However, they do > > aim to alleviate different TYPES of contention: the group XID update > > stuff should be most valuable when lots of processes are trying to > > update the same page, and the banks should be most valuable when there > > is simultaneous access to a bunch of different pages. So I'm not > > convinced that this patch is a reason to remove the group XID update > > mechanism, but someone might argue otherwise. > > Hmm, but, on the other hand: > > Currently all readers and writers are competing for the same LWLock. > But with this change, the readers will (mostly) no longer be competing > with the writers. So, in theory, that might reduce lock contention > enough to make the group update mechanism pointless. Thanks for your feedback, I agree that with a bank-wise lock, we might not need group updates for some of the use cases as you said where readers and writers are contenting on the centralized lock because, in most of the cases, readers will be distributed across different banks. OTOH there are use cases where the writer commit is the bottleneck (on SLRU lock) like pgbench simple-update or TPC-B then we will still benefit by group update. During group update testing we have seen benefits with such a scenario[1] with high client counts. So as per my understanding by distributing the SLRU locks there are scenarios where we will not need group update anymore but OTOH there are also scenarios where we will still benefit from the group update. [1] https://www.postgresql.org/message-id/CAFiTN-u-XEzhd%3DhNGW586fmQwdTy6Qy6_SXe09tNB%3DgBcVzZ_A%40mail.gmail.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
"Andrey M. Borodin"
Date:
On 19 Dec 2023, at 10:34, Dilip Kumar <dilipbalaut@gmail.com> wrote:
Just a side note.
It seems like commit log is kind of antipattern of data contention. Even when we build a super-optimized SLRU. Nearby **bits** are written by different CPUs.
I think that banks and locks are good thing. But also we could reorganize log so that
status of transaction 0 is on a page 0 at bit offset 0
status of transaction 1 is on a page 1 at bit offset 0
status of transaction 2 is on a page 2 at bit offset 0
status of transaction 3 is on a page 3 at bit offset 0
status of transaction 4 is on a page 0 at bit offset 2
status of transaction 5 is on a page 1 at bit offset 2
status of transaction 6 is on a page 2 at bit offset 2
status of transaction 7 is on a page 3 at bit offset 2
etc...
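Expressed as code, the striping enumerated above maps an xid to a page and bit offset roughly like this (a sketch with assumed constants, not a proposal-quality patch):

#include <stdio.h>

#define NUM_STRIPE_PAGES 4   /* assumed stripe width for the example */
#define BITS_PER_XACT    2   /* two clog status bits per transaction */

static int striped_pageno(unsigned long xid)
{
    return (int) (xid % NUM_STRIPE_PAGES);                    /* 0,1,2,3,0,1,... */
}

static int striped_bit_offset(unsigned long xid)
{
    return (int) ((xid / NUM_STRIPE_PAGES) * BITS_PER_XACT);  /* 0,0,0,0,2,2,... */
}

int main(void)
{
    for (unsigned long xid = 0; xid < 8; xid++)
        printf("xid %lu -> page %d, bit offset %d\n",
               xid, striped_pageno(xid), striped_bit_offset(xid));
    return 0;
}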
And it would be even better if page for transaction statuses would be determined by backend id somehow. Or at least cache line. Can we allocate a range (sizeof(cacheline)) of xids\subxids\multixacts\whatever for each backend?
This does not matter much because
0. Patch set in current thread produces robust SLRU anyway
1. One day we are going to throw away SLRU anyway
Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Robert Haas
Date:
On Fri, Dec 22, 2023 at 8:14 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: > Just a side note. > It seems like commit log is kind of antipattern of data contention. Even when we build a super-optimized SLRU. Nearby **bits** are written by different CPUs. > I think that banks and locks are good thing. But also we could reorganize log so that > status of transaction 0 is on a page 0 at bit offset 0 > status of transaction 1 is on a page 1 at bit offset 0 > status of transaction 2 is on a page 2 at bit offset 0 > status of transaction 3 is on a page 3 at bit offset 0 > status of transaction 4 is on a page 0 at bit offset 2 > status of transaction 5 is on a page 1 at bit offset 2 > status of transaction 6 is on a page 2 at bit offset 2 > status of transaction 7 is on a page 3 at bit offset 2 > etc... This is an interesting idea. A variant would be to stripe across cachelines within the same page rather than across pages. If we do stripe across pages as proposed here, we'd probably want to rethink the way the SLRU is extended -- page at a time wouldn't really make sense, but preallocating an entire file might. > And it would be even better if page for transaction statuses would be determined by backend id somehow. Or at least cacheline. Can we allocate a range (sizeof(cacheline)) of xids\subxids\multixacts\whatever for each backend? I don't understand how this could work. We need to be able to look up transaction status by XID, not backend ID. -- Robert Haas EDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
"Andrey M. Borodin"
Date:
> On 2 Jan 2024, at 19:23, Robert Haas <robertmhaas@gmail.com> wrote: > >> >> And it would be even better if page for transaction statuses would be determined by backend id somehow. Or at least cacheline. Can we allocate a range (sizeof(cacheline)) of xids\subxids\multixacts\whatever for each backend? > > I don't understand how this could work. We need to be able to look up > transaction status by XID, not backend ID. When GetNewTransactionId() is called we can reserve 256 xids in backend local memory. These values will be reused by transactions or subtransactions of this backend. Here 256 == sizeof(CacheLine). This would ensure that different backends touch different cache lines. But this approach would dramatically increase xid consumption speed on patterns where the client reconnects after several transactions. So we can keep unused xids in the procarray for future reuse. I doubt we can find a significant performance improvement here, because false cache line sharing cannot be _that_ bad. Best regards, Andrey Borodin.
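A sketch of the arithmetic behind the "256 == sizeof(CacheLine)" remark, assuming 2 clog bits per xid and a 64-byte cache line:

#include <stdio.h>

int main(void)
{
    const int xids_per_backend = 256;
    const int bits_per_xid = 2;      /* clog status bits */
    const int cacheline_bytes = 64;

    /* 256 xids * 2 bits = 64 bytes: exactly one cache line of the clog page,
     * so a backend owning a run of 256 consecutive xids writes only to
     * its own cache line. */
    printf("clog bytes covered by %d xids: %d (cache line: %d bytes)\n",
           xids_per_backend, xids_per_backend * bits_per_xid / 8,
           cacheline_bytes);
    return 0;
}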
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Robert Haas
Date:
On Tue, Jan 2, 2024 at 1:10 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: > > On 2 Jan 2024, at 19:23, Robert Haas <robertmhaas@gmail.com> wrote: > >> And it would be even better if page for transaction statuses would be determined by backend id somehow. Or at leastcache line. Can we allocate a range (sizeof(cacheline)) of xids\subxids\multixacts\whatever for each backend? > > > > I don't understand how this could work. We need to be able to look up > > transaction status by XID, not backend ID. > > When GetNewTransactionId() is called we can reserve 256 xids in backend local memory. This values will be reused by transactionsor subtransactions of this backend. Here 256 == sizeof(CacheLine). > This would ensure that different backends touch different cache lines. > > But this approach would dramatically increase xid consumption speed on patterns where client reconnects after several transactions.So we can keep unused xids in procarray for future reuse. > > I doubt we can find significant performance improvement here, because false cache line sharing cannot be _that_ bad. Yeah, this seems way too complicated for what we'd potentially gain from it. An additional problem is that the xmin horizon computation assumes that XIDs are assigned in monotonically increasing fashion; breaking that would be messy. And even an occasional leak of XIDs could precipitate enough additional vacuuming to completely outweigh any gains we could hope to achieve here. -- Robert Haas EDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Tue, Jan 2, 2024 at 7:53 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Dec 22, 2023 at 8:14 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: > > Just a side node. > > It seems like commit log is kind of antipattern of data contention. Even when we build a super-optimized SLRU. Nearby**bits** are written by different CPUs. > > I think that banks and locks are good thing. But also we could reorganize log so that > > status of transaction 0 is on a page 0 at bit offset 0 > > status of transaction 1 is on a page 1 at bit offset 0 > > status of transaction 2 is on a page 2 at bit offset 0 > > status of transaction 3 is on a page 3 at bit offset 0 > > status of transaction 4 is on a page 0 at bit offset 2 > > status of transaction 5 is on a page 1 at bit offset 2 > > status of transaction 6 is on a page 2 at bit offset 2 > > status of transaction 7 is on a page 3 at bit offset 2 > > etc... > > This is an interesting idea. A variant would be to stripe across > cachelines within the same page rather than across pages. If we do > stripe across pages as proposed here, we'd probably want to rethink > the way the SLRU is extended -- page at a time wouldn't really make > sense, but preallocating an entire file might. Yeah, this is indeed an interesting idea. So I think if we are interested in working in this direction maybe this can be submitted in a different thread, IMHO. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Robert Haas
Date:
On Wed, Jan 3, 2024 at 12:08 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Yeah, this is indeed an interesting idea. So I think if we are > interested in working in this direction maybe this can be submitted in > a different thread, IMHO. Yeah, that's something quite different from the patch before us. -- Robert Haas EDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
The more I look at TransactionGroupUpdateXidStatus, the more I think it's broken, and while we do have some tests, I don't have confidence that they cover all possible cases. Or, at least, if this code is good, then it hasn't been sufficiently explained. If we have multiple processes trying to write bits to clog, and they are using different banks, then the LWLockConditionalAcquire will be able to acquire the bank lock -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "The saddest aspect of life right now is that science gathers knowledge faster than society gathers wisdom." (Isaac Asimov)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Mon, Jan 8, 2024 at 4:55 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > The more I look at TransactionGroupUpdateXidStatus, the more I think > it's broken, and while we do have some tests, I don't have confidence > that they cover all possible cases. > > Or, at least, if this code is good, then it hasn't been sufficiently > explained. Any thought about a case in which you think it might be broken, I mean any vague thought might also help where you think it might not work as expected so that I can also think in that direction. It might be possible that I might not be thinking of some perspective that you are thinking and comments might be lacking from that point of view. > If we have multiple processes trying to write bits to clog, and they are > using different banks, then the LWLockConditionalAcquire will be able to > acquire the bank lock Do you think there is a problem with multiple processes getting the lock? I mean they are modifying different CLOG pages so that can be done concurrently right? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
On 2024-Jan-08, Dilip Kumar wrote: > On Mon, Jan 8, 2024 at 4:55 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > The more I look at TransactionGroupUpdateXidStatus, the more I think > > it's broken, and while we do have some tests, I don't have confidence > > that they cover all possible cases. > > > > Or, at least, if this code is good, then it hasn't been sufficiently > > explained. > > Any thought about a case in which you think it might be broken, I mean > any vague thought might also help where you think it might not work as > expected so that I can also think in that direction. It might be > possible that I might not be thinking of some perspective that you are > thinking and comments might be lacking from that point of view. Eh, apologies. This email was an unfinished draft that I had lying around from before the holidays, which I intended to discard but somehow kept around, and just now I happened to press the wrong key combination and it ended up being sent instead. We had some further discussion, after which I no longer think that there is a problem here, so please ignore this email. I'll come back to this patch later this week. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "El que vive para el futuro es un iluso, y el que vive para el pasado, un imbécil" (Luis Adler, "Los tripulantes de la noche")
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Mon, Jan 8, 2024 at 9:12 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > Eh, apologies. This email was an unfinished draft that I had laying > around before the holidays which I intended to discard it but somehow > kept around, and just now I happened to press the wrong key combination > and it ended up being sent instead. We had some further discussion, > after which I no longer think that there is a problem here, so please > ignore this email. > > I'll come back to this patch later this week. No problem The patch was facing some compilation issues after some recent commits, so I have changed it. Reported by Julien Tachoires (offlist) -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Wed, Jan 10, 2024 at 6:50 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Jan 8, 2024 at 9:12 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > Eh, apologies. This email was an unfinished draft that I had lying > > around from before the holidays, which I intended to discard but somehow > > kept around, and just now I happened to press the wrong key combination > > and it ended up being sent instead. We had some further discussion, > > after which I no longer think that there is a problem here, so please > > ignore this email. > > > > I'll come back to this patch later this week. > > No problem > > The patch was facing some compilation issues after some recent > commits, so I have changed it. Reported by Julien Tachoires (offlist) The last patch conflicted with some of the recent commits, so here is the updated version of the patch. I also noticed that the slru bank-lock wait event details were missing from the wait_event_names.txt file, so I have added those as well. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
Here's a touched-up version of this patch. First, PROC_GLOBAL->clogGroupFirst and SlruCtl->latest_page_number change from being protected by locks to being atomics, but there's no mention of considering memory barriers that they should have. Looking at the code, I think the former doesn't need any additional barriers, but latest_page_number is missing some, which I have added. This deserves another careful look. Second, and most user-visibly, the GUC names seem to have been chosen based on the source-code variables, which were never meant to be user-visible. So I renamed a few:
xact_buffers -> transaction_buffers
subtrans_buffers -> subtransaction_buffers
serial_buffers -> serializable_buffers
commit_ts_buffers -> commit_timestamp_buffers
(unchanged: multixact_offsets_buffers, multixact_members_buffers, notify_buffers)
I did this explicitly trying to avoid using the term SLRU in a user-visible manner, because what do users care? But immediately after doing this I realized that we already have pg_stat_slru, so maybe the cat is already out of the bag, and so perhaps we should name these GUCs as, say, slru_transaction_buffers? That may make the connection between these things a little more explicit. (I do think we need to add cross-links in the documentation of those GUCs to the pg_stat_slru docs and vice-versa.) Another thing that bothered me a bit is that we have auto-tuning for transaction_buffers and commit_timestamp_buffers, but not for subtransaction_buffers. (Autotuning means you set the GUC to 0 and it scales with shared_buffers). I don't quite understand the reason for the omission, so I added it for subtrans too. I think it may make sense to do likewise for the multixact ones too, not sure. It doesn't seem worth having that for pg_serial and notify. While messing about with these GUCs I realized that the usage of the show_hook to print what the current number is, when autotuning is used, was bogus: SHOW would print the number of blocks for (say) transaction_buffers, but if you asked it to print (say) multixact_offsets_buffers, it would give a size in MB or kB. I'm sure such an inconsistency would bite us. So, digging around I found that a good way to handle this is to remove the show_hook, and instead call SetConfigOption() at the time when the ShmemInit function is called, with the correct number of buffers determined. This is pretty much what is used for XLOGbuffers, and it works correctly as far as my testing shows. Still with these auto-tuning GUCs, I noticed that the auto-tuning code would continue to grow the buffer sizes with shared_buffers to arbitrarily large values. I added an arbitrary maximum of 1024 (8 MB), which is much higher than the current value of 128; but if you have (say) 30 GB of shared_buffers (not uncommon these days), do you really need 30 MB of pg_clog cache? It seems mostly unnecessary ... and you can still set it manually that way if you need it. So, largely I just rewrote those small functions completely. I also made the SGML documentation and postgresql.conf.sample all match what the code actually does. The whole thing wasn't particularly consistent. I rewrote a bunch of code comments and moved stuff around to appear in alphabetical order, etc. More comment rewriting still pending. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
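To make the auto-tuning behaviour described here concrete, a rough sketch of the kind of rule involved follows; the scaling divisor and the minimum clamp are assumptions for illustration only, and the actual patch computes the value in the ShmemInit path and publishes it with SetConfigOption().

/* Sketch only: GUC value 0 means "scale with shared_buffers", clamped so
 * that a huge shared_buffers does not imply a huge SLRU cache. */
static int
SLRUAutotunedBuffers(int guc_value, int shared_buffers)
{
    const int max_autotuned = 1024;   /* 8 MB of 8 kB pages, the cap mentioned above */
    const int min_buffers = 16;       /* one bank (assumption) */

    if (guc_value != 0)
        return guc_value;             /* explicit setting wins */

    int n = shared_buffers / 512;     /* assumed scaling ratio */
    if (n < min_buffers)
        n = min_buffers;
    if (n > max_autotuned)
        n = max_autotuned;
    return n;
}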
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
On 2024-Jan-25, Alvaro Herrera wrote: > Here's a touched-up version of this patch. > diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c > index 98fa6035cc..4a5e05d5e4 100644 > --- a/src/backend/storage/lmgr/lwlock.c > +++ b/src/backend/storage/lmgr/lwlock.c > @@ -163,6 +163,13 @@ static const char *const BuiltinTrancheNames[] = { > [LWTRANCHE_LAUNCHER_HASH] = "LogicalRepLauncherHash", > [LWTRANCHE_DSM_REGISTRY_DSA] = "DSMRegistryDSA", > [LWTRANCHE_DSM_REGISTRY_HASH] = "DSMRegistryHash", > + [LWTRANCHE_COMMITTS_SLRU] = "CommitTSSLRU", > + [LWTRANCHE_MULTIXACTOFFSET_SLRU] = "MultixactOffsetSLRU", > + [LWTRANCHE_MULTIXACTMEMBER_SLRU] = "MultixactMemberSLRU", > + [LWTRANCHE_NOTIFY_SLRU] = "NotifySLRU", > + [LWTRANCHE_SERIAL_SLRU] = "SerialSLRU" > + [LWTRANCHE_SUBTRANS_SLRU] = "SubtransSLRU", > + [LWTRANCHE_XACT_SLRU] = "XactSLRU", > }; Eeek. Last minute changes ... Fixed here. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "La primera ley de las demostraciones en vivo es: no trate de usar el sistema. Escriba un guión que no toque nada para no causar daños." (Jakob Nielsen)
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Robert Haas
Date:
On Thu, Jan 25, 2024 at 11:22 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Still with these auto-tuning GUCs, I noticed that the auto-tuning code > would continue to grow the buffer sizes with shared_buffers to > arbitrarily large values. I added an arbitrary maximum of 1024 (8 MB), > which is much higher than the current value of 128; but if you have > (say) 30 GB of shared_buffers (not uncommon these days), do you really > need 30MB of pg_clog cache? It seems mostly unnecessary ... and you can > still set it manually that way if you need it. So, largely I just > rewrote those small functions completely. Yeah, I think that if we're going to scale with shared_buffers, it should be capped. -- Robert Haas EDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
I've continued to review this and decided that I don't like the mess this patch proposes in order to support pg_commit_ts's deletion of all files. (Yes, I know that I was the one that proposed this idea. It's still a bad one). I'd like to change that code by removing the limit that we can only have 128 bank locks in a SLRU. To recap, the reason we did this is that commit_ts sometimes wants to delete all files while running (DeactivateCommitTs), and for this it needs to acquire all bank locks. Since going above the MAX_SIMUL_LWLOCKS limit is disallowed, we added the SLRU limit making multiple banks share lwlocks. I propose two alternative solutions: 1. The easiest is to have DeactivateCommitTs continue to hold CommitTsLock until the end, including while SlruScanDirectory does its thing. This sounds terrible, but considering that this code only runs when the module is being deactivated, I don't think it's really all that bad in practice. I mean, if you deactivate the commit_ts module and then try to read commit timestamp data, you deserve to wait for a few seconds just as a punishment for your stupidity. AFAICT the cases where anything is done in the module mostly check without locking that commitTsActive is set, so we're not slowing down any critical operations. Also, if you don't like to be delayed for a couple of seconds, just don't deactivate the module. 2. If we want some refinement, the other idea is to change SlruScanDirCbDeleteAll (the callback that SlruScanDirectory uses in this case) so that it acquires the bank lock appropriate for all the slots used by the file that's going to be deleted. This is OK because in the default compilation each file only has 32 segments, so that requires only 32 lwlocks held at once while the file is being deleted. I think we don't need to bother with this, but it's an option if we see the above as unworkable for whatever reason. The only other user of SlruScanDirCbDeleteAll is async.c (the LISTEN/ NOTIFY code), and what that does is delete all the files at server start, where nothing is running concurrently anyway. So whatever we do for commit_ts, it won't affect async.c. So, if we do away with the SLRU_MAX_BANKLOCKS idea, then we're assured one LWLock per bank (instead of sharing some lwlocks to multiple banks), and that makes the code simpler to reason about. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "In fact, the basic problem with Perl 5's subroutines is that they're not crufty enough, so the cruft leaks out into user-defined code instead, by the Conservation of Cruft Principle." (Larry Wall, Apocalypse 6)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Andrey Borodin
Date:
> On 26 Jan 2024, at 22:38, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > This is OK because in the > default compilation each file only has 32 segments, so that requires > only 32 lwlocks held at once while the file is being deleted. Do we somehow account for the possibility that different subsystems together accumulate more than MAX_SIMUL_LWLOCKS? E.g. GiST during a split can combine 75 locks, and somehow commit_ts will be deactivated by this backend at the same moment and add 32 locks more :) I understand that this sounds fantastic, these subsystems do not interfere. But this is fantastic only until something like that actually happens. If possible, I'd prefer one lock at a time, and maybe sometimes two or three with some guarantees that this is safe. So, from my POV the first solution that you proposed seems much better to me. Thanks for working on this! Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Thu, Jan 25, 2024 at 10:03 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2024-Jan-25, Alvaro Herrera wrote: > > > Here's a touched-up version of this patch. > > > diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c > > index 98fa6035cc..4a5e05d5e4 100644 > > --- a/src/backend/storage/lmgr/lwlock.c > > +++ b/src/backend/storage/lmgr/lwlock.c > > @@ -163,6 +163,13 @@ static const char *const BuiltinTrancheNames[] = { > > [LWTRANCHE_LAUNCHER_HASH] = "LogicalRepLauncherHash", > > [LWTRANCHE_DSM_REGISTRY_DSA] = "DSMRegistryDSA", > > [LWTRANCHE_DSM_REGISTRY_HASH] = "DSMRegistryHash", > > + [LWTRANCHE_COMMITTS_SLRU] = "CommitTSSLRU", > > + [LWTRANCHE_MULTIXACTOFFSET_SLRU] = "MultixactOffsetSLRU", > > + [LWTRANCHE_MULTIXACTMEMBER_SLRU] = "MultixactMemberSLRU", > > + [LWTRANCHE_NOTIFY_SLRU] = "NotifySLRU", > > + [LWTRANCHE_SERIAL_SLRU] = "SerialSLRU" > > + [LWTRANCHE_SUBTRANS_SLRU] = "SubtransSLRU", > > + [LWTRANCHE_XACT_SLRU] = "XactSLRU", > > }; > > Eeek. Last minute changes ... Fixed here. Thank you for working on this. There is one thing that I feel is problematic. We have kept the allowed values for these GUCs to be in multiple of SLRU_BANK_SIZE i.e. 16 and that's the reason the min values were changed to 16 but in this refactoring patch for some of the buffers you have changed that to 8 so I think that's not good. + { + {"multixact_offsets_buffers", PGC_POSTMASTER, RESOURCES_MEM, + gettext_noop("Sets the size of the dedicated buffer pool used for the MultiXact offset cache."), + NULL, + GUC_UNIT_BLOCKS + }, + &multixact_offsets_buffers, + 16, 8, SLRU_MAX_ALLOWED_BUFFERS, + check_multixact_offsets_buffers, NULL, NULL + }, Other than this patch looks good to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
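For context, the check hook named in the quoted GUC entry presumably just enforces the multiple-of-bank-size rule that the minimum values reflect. A minimal sketch, assuming a shared helper; the real hooks in the patch may be organized differently:

    /* Hypothetical common helper used by the per-SLRU GUC check hooks. */
    static bool
    check_slru_buffers(const char *name, int *newval)
    {
        /* Valid values are multiples of SLRU_BANK_SIZE (16). */
        if (*newval % SLRU_BANK_SIZE == 0)
            return true;

        GUC_check_errdetail("\"%s\" must be a multiple of %d.",
                            name, SLRU_BANK_SIZE);
        return false;
    }

    static bool
    check_multixact_offsets_buffers(int *newval, void **extra, GucSource source)
    {
        return check_slru_buffers("multixact_offsets_buffers", newval);
    }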
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Fri, Jan 26, 2024 at 11:08 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > I've continued to review this and decided that I don't like the mess > this patch proposes in order to support pg_commit_ts's deletion of all > files. (Yes, I know that I was the one that proposed this idea. It's > still a bad one). I'd like to change that code by removing the limit > that we can only have 128 bank locks in a SLRU. To recap, the reason we > did this is that commit_ts sometimes wants to delete all files while > running (DeactivateCommitTs), and for this it needs to acquire all bank > locks. Since going above the MAX_SIMUL_LWLOCKS limit is disallowed, we > added the SLRU limit making multiple banks share lwlocks. > > I propose two alternative solutions: > > 1. The easiest is to have DeactivateCommitTs continue to hold > CommitTsLock until the end, including while SlruScanDirectory does its > thing. This sounds terrible, but considering that this code only runs > when the module is being deactivated, I don't think it's really all that > bad in practice. I mean, if you deactivate the commit_ts module and > then try to read commit timestamp data, you deserve to wait for a few > seconds just as a punishment for your stupidity. I think this idea looks reasonable. I agree that if we are trying to read commit_ts after deactivating it then it's fine to make it wait. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
On 2024-Jan-29, Dilip Kumar wrote: > Thank you for working on this. There is one thing that I feel is > problematic. We have kept the allowed values for these GUCs to be in > multiple of SLRU_BANK_SIZE i.e. 16 and that's the reason the min > values were changed to 16 but in this refactoring patch for some of > the buffers you have changed that to 8 so I think that's not good. Oh, absolutely, you're right. Restored the minimum to 16. So, here's the patchset as two pieces. 0001 converts SlruSharedData->latest_page_number to use atomics. I don't see any reason to mix this in with the rest of the patch, and though it likely won't have any performance advantage by itself (since the lock acquisition is pretty much the same), it seems better to get it in ahead of the rest -- I think that simplifies matters for the second patch, which is large enough. So, 0002 introduces the rest of the feature. I removed the use of a different number of bank locks than banks, and I made commit_ts use a longer lwlock acquisition at truncation time, rather than forcing all-lwlock acquisition. The more I look at 0002, the more I notice that some comments badly need to be updated, so please don't read too much into it yet. But I wanted to post it anyway for archives and cfbot purposes. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
Hah: postgres -c lc_messages=C -c shared_buffers=$((512*17)) 2024-02-01 10:48:13.548 CET [1535379] FATAL: invalid value for parameter "transaction_buffers": 17 2024-02-01 10:48:13.548 CET [1535379] DETAIL: "transaction_buffers" must be a multiple of 16 -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ <Schwern> It does it in a really, really complicated way <crab> why does it need to be complicated? <Schwern> Because it's MakeMaker.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Thu, Feb 1, 2024 at 3:19 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > Hah: > > postgres -c lc_messages=C -c shared_buffers=$((512*17)) > > 2024-02-01 10:48:13.548 CET [1535379] FATAL: invalid value for parameter "transaction_buffers": 17 > 2024-02-01 10:48:13.548 CET [1535379] DETAIL: "transaction_buffers" must be a multiple of 16 Maybe we should resize it to the next multiple of the SLRU_BANK_SIZE instead of giving an error? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
On 2024-Feb-01, Dilip Kumar wrote: > On Thu, Feb 1, 2024 at 3:19 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > postgres -c lc_messages=C -c shared_buffers=$((512*17)) > > > > 2024-02-01 10:48:13.548 CET [1535379] FATAL: invalid value for parameter "transaction_buffers": 17 > > 2024-02-01 10:48:13.548 CET [1535379] DETAIL: "transaction_buffers" must be a multiple of 16 > > Maybe we should resize it to the next multiple of the SLRU_BANK_SIZE > instead of giving an error? Since this is the auto-tuning feature, I think it should use the previous multiple rather than the next, but yeah, something like that. While I have your attention -- if you could give a look to the 0001 patch I posted, I would appreciate it. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "Los trabajadores menos efectivos son sistematicamente llevados al lugar donde pueden hacer el menor daño posible: gerencia." (El principio Dilbert)
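The "previous multiple" adjustment discussed here would be a small tweak in the auto-tuning path. A sketch, using a hypothetical helper name (this is illustrative, not the committed code):

    /*
     * Derive an SLRU buffer count from shared_buffers, rounded down to a
     * whole number of banks, but never below a single bank.
     */
    static int
    SimpleLruAutotuneBuffers(int divisor, int max)
    {
        int     nbuffers = Min(NBuffers / divisor, max);

        nbuffers -= nbuffers % SLRU_BANK_SIZE;  /* previous multiple of 16 */
        return Max(nbuffers, SLRU_BANK_SIZE);
    }

With the example above (shared_buffers = 512*17), a transaction_buffers value derived as 17 would be rounded down to 16 instead of raising the FATAL error.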
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Thu, Feb 1, 2024 at 3:44 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2024-Feb-01, Dilip Kumar wrote: > > > On Thu, Feb 1, 2024 at 3:19 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > > > postgres -c lc_messages=C -c shared_buffers=$((512*17)) > > > > > > 2024-02-01 10:48:13.548 CET [1535379] FATAL: invalid value for parameter "transaction_buffers": 17 > > > 2024-02-01 10:48:13.548 CET [1535379] DETAIL: "transaction_buffers" must be a multiple of 16 > > > > Maybe we should resize it to the next multiple of the SLRU_BANK_SIZE > > instead of giving an error? > > Since this is the auto-tuning feature, I think it should use the > previous multiple rather than the next, but yeah, something like that. Okay. > > While I have your attention -- if you could give a look to the 0001 > patch I posted, I would appreciate it. > I will look into it. Thanks. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Thu, Feb 1, 2024 at 4:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Feb 1, 2024 at 3:44 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Okay. > > > > While I have your attention -- if you could give a look to the 0001 > > patch I posted, I would appreciate it. > > > > I will look into it. Thanks. Some quick observations, Do we need below two write barriers at the end of the function? because the next instruction is separated by the function boundary @@ -766,14 +766,11 @@ StartupCLOG(void) .. - XactCtl->shared->latest_page_number = pageno; - - LWLockRelease(XactSLRULock); + pg_atomic_init_u64(&XactCtl->shared->latest_page_number, pageno); + pg_write_barrier(); } /* * Initialize member's idea of the latest page number. */ pageno = MXOffsetToMemberPage(offset); - MultiXactMemberCtl->shared->latest_page_number = pageno; + pg_atomic_init_u64(&MultiXactMemberCtl->shared->latest_page_number, + pageno); + + pg_write_barrier(); } I am looking more into this from the concurrency point of view and will update you soon. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Thu, Feb 1, 2024 at 4:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Feb 1, 2024 at 4:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Thu, Feb 1, 2024 at 3:44 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > Okay. > > > > > > While I have your attention -- if you could give a look to the 0001 > > > patch I posted, I would appreciate it. > > > > > > > I will look into it. Thanks. > > Some quick observations, > > Do we need below two write barriers at the end of the function? > because the next instruction is separated by the function boundary > > @@ -766,14 +766,11 @@ StartupCLOG(void) > .. > - XactCtl->shared->latest_page_number = pageno; > - > - LWLockRelease(XactSLRULock); > + pg_atomic_init_u64(&XactCtl->shared->latest_page_number, pageno); > + pg_write_barrier(); > } > > /* > * Initialize member's idea of the latest page number. > */ > pageno = MXOffsetToMemberPage(offset); > - MultiXactMemberCtl->shared->latest_page_number = pageno; > + pg_atomic_init_u64(&MultiXactMemberCtl->shared->latest_page_number, > + pageno); > + > + pg_write_barrier(); > } > I have checked the patch and it looks fine to me, other than the above question related to memory barrier usage. One more question about the same: basically, the two instances below, 1 and 2, look similar, but in 1 you are not using the memory write_barrier whereas in 2 you are using the write_barrier; why is that? I mean, why can the reordering not happen in 1 while it may happen in 2? 1. + pg_atomic_write_u64(&CommitTsCtl->shared->latest_page_number, + trunc->pageno); SimpleLruTruncate(CommitTsCtl, trunc->pageno); vs 2. - shared->latest_page_number = pageno; + pg_atomic_write_u64(&shared->latest_page_number, pageno); + pg_write_barrier(); /* update the stats counter of zeroed pages */ pgstat_count_slru_page_zeroed(shared->slru_stats_idx); -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
On 2024-Feb-02, Dilip Kumar wrote: > I have checked the patch and it looks fine to me other than the above > question related to memory barrier usage one more question about the > same, basically below to instances 1 and 2 look similar but in 1 you > are not using the memory write_barrier whereas in 2 you are using the > write_barrier, why is it so? I mean why the reordering can not happen > in 1 and it may happen in 2? What I was thinking is that there's a lwlock operation just below, which acts as a barrier. But I realized something more important: there are only two places that matter, which are SlruSelectLRUPage and SimpleLruZeroPage. The others are all initialization code that run at a point where there's no going to be any concurrency in SLRU access, so we don't need barriers anyway. In SlruSelectLRUPage we definitely don't want to evict the page that SimpleLruZeroPage has initialized, starting from the point where it returns that new page to its caller. But if you consider the code of those two routines, you realize that the only time an equality between latest_page_number and "this_page_number" is going to occur, is when both pages are in the same bank ... and both routines are required to be holding the bank lock while they run, so in practice this is never a problem. We need the atomic write and atomic read so that multiple processes processing pages in different banks can update latest_page_number simultaneously. But the equality condition that we're looking for? it can never happen concurrently. In other words, these barriers are fully useless. (We also have SimpleLruTruncate, but I think it's not as critical to have a barrier there anyhow: accessing a slightly outdated page number could only be a problem if a bug elsewhere causes us to try to truncate in the current page. I think we only have this code there because we did have such bugs in the past, but IIUC this shouldn't happen anymore.) -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
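To make the argument concrete, the two sites that matter reduce to roughly the following (a simplified sketch; variable and field names follow slru.c and the 0001 patch, but this is not the literal diff):

    /* SimpleLruZeroPage(), called with the bank lock for 'pageno' held: */
    pg_atomic_write_u64(&shared->latest_page_number, pageno);

    /*
     * SlruSelectLRUPage(), also called with the bank lock held.  The only
     * slots it can compare against are in the same bank, so a plain atomic
     * read is enough and no memory barrier is needed.
     */
    if (shared->page_number[slotno] ==
        pg_atomic_read_u64(&shared->latest_page_number))
        continue;               /* never evict the latest page */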
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
In short, I propose the attached. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
Sorry, brown paper bag bug there. Here's the correct one. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "I can't go to a restaurant and order food because I keep looking at the fonts on the menu. Five minutes later I realize that it's also talking about food" (Donald Knuth)
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
"Andrey M. Borodin"
Date:
> On 4 Feb 2024, at 18:38, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > In other words, these barriers are fully useless. +1. I've tried to understand the ideas behind the barriers, but latest_page_number is a heuristic that does not need any guarantees at all. It's also used in a safety check which can fire only when everything is already broken beyond any repair. (Though using atomic access seems a good idea anyway.) This patch uses the wording "banks" in comments before banks start to exist. But as far as I understand, it is expected to be committed before the "banks" patch. Besides this, the patch looks good to me. Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Sun, Feb 4, 2024 at 7:10 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2024-Feb-02, Dilip Kumar wrote: > > > I have checked the patch and it looks fine to me other than the above > > question related to memory barrier usage one more question about the > > same, basically below to instances 1 and 2 look similar but in 1 you > > are not using the memory write_barrier whereas in 2 you are using the > > write_barrier, why is it so? I mean why the reordering can not happen > > in 1 and it may happen in 2? > > What I was thinking is that there's a lwlock operation just below, which > acts as a barrier. But I realized something more important: there are > only two places that matter, which are SlruSelectLRUPage and > SimpleLruZeroPage. The others are all initialization code that run at a > point where there's no going to be any concurrency in SLRU access, so we > don't need barriers anyway. In SlruSelectLRUPage we definitely don't > want to evict the page that SimpleLruZeroPage has initialized, starting > from the point where it returns that new page to its caller. > But if you consider the code of those two routines, you realize that the > only time an equality between latest_page_number and "this_page_number" > is going to occur, is when both pages are in the same bank ... and both > routines are required to be holding the bank lock while they run, so in > practice this is never a problem. Right; in fact, when I first converted this 'latest_page_number' to an atomic, the thinking was to protect it from concurrent setting of the value in SimpleLruZeroPage(), and also so that a concurrent read in SlruSelectLRUPage() would not see a corrupted value. All other usages were during the initialization phase, where we do not need any protection. > > We need the atomic write and atomic read so that multiple processes > processing pages in different banks can update latest_page_number > simultaneously. But the equality condition that we're looking for? > it can never happen concurrently. Yeah, that's right; after you pointed it out I also realized that the case is protected by the bank lock. Earlier I didn't think about this case. > In other words, these barriers are fully useless. > > (We also have SimpleLruTruncate, but I think it's not as critical to > have a barrier there anyhow: accessing a slightly outdated page number > could only be a problem if a bug elsewhere causes us to try to truncate > in the current page. I think we only have this code there because we > did have such bugs in the past, but IIUC this shouldn't happen anymore.) +1, I agree with this theory in general. But the below comment in SimpleLruTruncate in your v3 patch doesn't seem correct: here we are checking whether the latest_page_number is smaller than the cutoff, and if so we log it as wraparound and skip the whole thing. That is fine even if we are reading an atomic variable, and a slightly outdated value should not be a problem; but the comment claims that this is safe because we hold the same bank lock as SimpleLruZeroPage(), and that's not true: here we will be acquiring different bank locks one by one based on which slotno we are checking. Am I missing something? + * An important safety check: the current endpoint page must not be + * eligible for removal. Like SlruSelectLRUPage, we don't need a + * memory barrier here because for the affected page to be relevant, + * we'd have to have the same bank lock as SimpleLruZeroPage.
*/ - if (ctl->PagePrecedes(shared->latest_page_number, cutoffPage)) + if (ctl->PagePrecedes(pg_atomic_read_u64(&shared->latest_page_number), + cutoffPage)) -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
On 2024-Feb-04, Andrey M. Borodin wrote: > This patch uses wording "banks" in comments before banks start to > exist. But as far as I understand, it is expected to be committed > before "banks" patch. True -- changed that to use ControlLock. > Besides this patch looks good to me. Many thanks for reviewing. On 2024-Feb-05, Dilip Kumar wrote: > > (We also have SimpleLruTruncate, but I think it's not as critical to > > have a barrier there anyhow: accessing a slightly outdated page number > > could only be a problem if a bug elsewhere causes us to try to truncate > > in the current page. I think we only have this code there because we > > did have such bugs in the past, but IIUC this shouldn't happen anymore.) > > +1, I agree with this theory in general. But the below comment in > SimpleLruTrucate in your v3 patch doesn't seem correct, because here > we are checking if the latest_page_number is smaller than the cutoff > if so we log it as wraparound and skip the whole thing and that is > fine even if we are reading with atomic variable and slightly outdated > value should not be a problem but the comment claim that this safe > because we have the same bank lock as SimpleLruZeroPage(), but that's > not true here we will be acquiring different bank locks one by one > based on which slotno we are checking. Am I missing something? I think you're correct. I reworded this comment, so now it says this: /* * An important safety check: the current endpoint page must not be * eligible for removal. This check is just a backstop against wraparound * bugs elsewhere in SLRU handling, so we don't care if we read a slightly * outdated value; therefore we don't add a memory barrier. */ Pushed with those changes. Thank you! Now I'll go rebase the rest of the patch on top. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Having your biases confirmed independently is how scientific progress is made, and hence made our great society what it is today" (Mary Gardiner)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Tue, Feb 6, 2024 at 4:23 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > (We also have SimpleLruTruncate, but I think it's not as critical to > > > have a barrier there anyhow: accessing a slightly outdated page number > > > could only be a problem if a bug elsewhere causes us to try to truncate > > > in the current page. I think we only have this code there because we > > > did have such bugs in the past, but IIUC this shouldn't happen anymore.) > > > > +1, I agree with this theory in general. But the below comment in > > SimpleLruTrucate in your v3 patch doesn't seem correct, because here > > we are checking if the latest_page_number is smaller than the cutoff > > if so we log it as wraparound and skip the whole thing and that is > > fine even if we are reading with atomic variable and slightly outdated > > value should not be a problem but the comment claim that this safe > > because we have the same bank lock as SimpleLruZeroPage(), but that's > > not true here we will be acquiring different bank locks one by one > > based on which slotno we are checking. Am I missing something? > > I think you're correct. I reworded this comment, so now it says this: > > /* > * An important safety check: the current endpoint page must not be > * eligible for removal. This check is just a backstop against wraparound > * bugs elsewhere in SLRU handling, so we don't care if we read a slightly > * outdated value; therefore we don't add a memory barrier. > */ > > Pushed with those changes. Thank you! Yeah, this looks perfect, thanks. > Now I'll go rebase the rest of the patch on top. Okay, I will review and test after that. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
Here's the rest of it rebased on top of current master. I think it makes sense to have this as one individual commit. I made CLOGShmemBuffers, CommitTsShmemBuffers and SUBTRANSShmemBuffers compute a number that's multiple of SLRU_BANK_SIZE. But it's a crock, because we don't have that macro at that point, so I just used constant 16. Obviously need a better solution for this. I also changed the location of bank_mask in SlruCtlData for better packing, as advised by pahole; and renamed SLRU_SLOTNO_GET_BANKLOCKNO() to SlotGetBankNumber(). Some very critical comments still need to be updated to the new design, particularly anything that mentions "control lock"; but also the overall model needs to be explained in some central location, rather than incongruently some pieces here and other pieces there. I'll see about this later. But at least this is code you should be able to play with. I've been wondering whether we should add a "slru" to the name of the GUCs: commit_timestamp_slru_buffers transaction_slru_buffers etc -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "Aprender sin pensar es inútil; pensar sin aprender, peligroso" (Confucio)
Attachment
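For readers following along, the slot/bank arithmetic mentioned above boils down to something like the sketch below. It assumes banks are contiguous runs of SLRU_BANK_SIZE slots and that pages map onto banks by page number; the field names (bank_locks, num_slots) mirror slru.h and the patch as I understand them, but may not match exactly. The posted patch keeps a precomputed bank_mask in SlruCtlData for this purpose; the modulo below is just the simplest way to express the same mapping.

    #define SLRU_BANK_SIZE  16

    /* Bank that owns a given buffer slot. */
    static inline int
    SlotGetBankNumber(int slotno)
    {
        return slotno / SLRU_BANK_SIZE;
    }

    /* LWLock protecting the bank in which a given page may reside. */
    static inline LWLock *
    SimpleLruGetBankLock(SlruCtl ctl, int64 pageno)
    {
        int     nbanks = ctl->shared->num_slots / SLRU_BANK_SIZE;
        int     bankno = pageno % nbanks;

        return &(ctl->shared->bank_locks[bankno].lock);
    }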
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Tue, Feb 6, 2024 at 8:55 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > Here's the rest of it rebased on top of current master. I think it > makes sense to have this as one individual commit. > > I made CLOGShmemBuffers, CommitTsShmemBuffers and SUBTRANSShmemBuffers > compute a number that's multiple of SLRU_BANK_SIZE. But it's a crock, > because we don't have that macro at that point, so I just used constant > 16. Obviously need a better solution for this. If we define SLRU_BANK_SIZE in slru.h then we can use it here, right? Because these files include slru.h anyway. > > I also changed the location of bank_mask in SlruCtlData for better > packing, as advised by pahole; and renamed SLRU_SLOTNO_GET_BANKLOCKNO() > to SlotGetBankNumber(). Okay. > Some very critical comments still need to be updated to the new design, > particularly anything that mentions "control lock"; but also the overall > model needs to be explained in some central location, rather than > incongruently some pieces here and other pieces there. I'll see about > this later. But at least this is code you should be able to play with. Okay, I will review and test this. > I've been wondering whether we should add a "slru" to the name of the > GUCs: > > commit_timestamp_slru_buffers > transaction_slru_buffers > etc I am not sure we are exposing anything related to SLRU to the user. I mean, transaction_buffers should make sense to the user in that it stores transaction-related data in some buffer pool, but whether that buffer pool is called SLRU or not doesn't matter much to the user IMHO. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
"Andrey M. Borodin"
Date:
> On 7 Feb 2024, at 10:58, Dilip Kumar <dilipbalaut@gmail.com> wrote: > >> commit_timestamp_slru_buffers >> transaction_slru_buffers >> etc > > I am not sure we are exposing anything related to SLRU to the user, I think we already tell the user something about SLRU. I’d rather consider whether “transaction_slru_buffers” is easier to understand than “transaction_buffers”... IMO transaction_buffers is clearer. But I do not have a strong opinion. > I > mean transaction_buffers should make sense for the user that it stores > transaction-related data in some buffers pool but whether that buffer > pool is called SLRU or not doesn't matter much to the user IMHO. +1 Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
On 2024-Feb-07, Dilip Kumar wrote: > On Tue, Feb 6, 2024 at 8:55 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > I made CLOGShmemBuffers, CommitTsShmemBuffers and SUBTRANSShmemBuffers > > compute a number that's multiple of SLRU_BANK_SIZE. But it's a crock, > > because we don't have that macro at that point, so I just used constant > > 16. Obviously need a better solution for this. > > If we define SLRU_BANK_SIZE in slur.h then we can use it here right, > because these files are anyway include slur.h so. Sure, but is that really what we want? > > I've been wondering whether we should add a "slru" to the name of the > > GUCs: > > > > commit_timestamp_slru_buffers > > transaction_slru_buffers > > etc > > I am not sure we are exposing anything related to SLRU to the user, We do -- we have pg_stat_slru already. > I mean transaction_buffers should make sense for the user that it > stores transaction-related data in some buffers pool but whether that > buffer pool is called SLRU or not doesn't matter much to the user > IMHO. Yeah, that's exactly what my initial argument was for naming these this way. But since the term slru already escaped into the wild via the pg_stat_slru view, perhaps it helps users make the connection between these things. Alternatively, we can cross-reference each term from the other's documentation and call it a day. Another painful point is that pg_stat_slru uses internal names in the data it outputs, which obviously do not match the new GUCs. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "Uno puede defenderse de los ataques; contra los elogios se esta indefenso"
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Wed, Feb 7, 2024 at 3:49 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2024-Feb-07, Dilip Kumar wrote: > > > On Tue, Feb 6, 2024 at 8:55 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > > > I made CLOGShmemBuffers, CommitTsShmemBuffers and SUBTRANSShmemBuffers > > > compute a number that's multiple of SLRU_BANK_SIZE. But it's a crock, > > > because we don't have that macro at that point, so I just used constant > > > 16. Obviously need a better solution for this. > > > > If we define SLRU_BANK_SIZE in slur.h then we can use it here right, > > because these files are anyway include slur.h so. > > Sure, but is that really what we want? So your question is do we want these buffers to be in multiple of SLRU_BANK_SIZE? Maybe we can have the last bank to be partial, I don't think it should create any problem logically. I mean we can look again in the patch to see if we have made any such assumptions but that should be fairly easy to fix, then maybe if we are going in this way we should get rid of the check_slru_buffers() function as well. > > > I've been wondering whether we should add a "slru" to the name of the > > > GUCs: > > > > > > commit_timestamp_slru_buffers > > > transaction_slru_buffers > > > etc > > > > I am not sure we are exposing anything related to SLRU to the user, > > We do -- we have pg_stat_slru already. > > > I mean transaction_buffers should make sense for the user that it > > stores transaction-related data in some buffers pool but whether that > > buffer pool is called SLRU or not doesn't matter much to the user > > IMHO. > > Yeah, that's exactly what my initial argument was for naming these this > way. But since the term slru already escaped into the wild via the > pg_stat_slru view, perhaps it helps users make the connection between > these things. Alternatively, we can cross-reference each term from the > other's documentation and call it a day. Yeah, that's true I forgot this point about the pg_stat_slru, from this pov if the configuration has the name slru they would be able to make a better connection with the configured value, and the stats in this view based on that they can take call if they need to somehow increase the size of these slru buffers. > Another painful point is that pg_stat_slru uses internal names in the > data it outputs, which obviously do not match the new GUCs. Yeah, that's true, but I think this could be explained somewhere not sure what is the right place for this. FYI, I have also repeated all the performance tests I performed in my first email[1], and I am seeing a similar gain. Some comments on v18 in my first pass of the review. 1. @@ -665,7 +765,7 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn) lsnindex = GetLSNIndex(slotno, xid); *lsn = XactCtl->shared->group_lsn[lsnindex]; - LWLockRelease(XactSLRULock); + LWLockRelease(SimpleLruGetBankLock(XactCtl, pageno)); Maybe here we can add an assert before releasing the lock for a safety check Assert(LWLockHeldByMe(SimpleLruGetBankLock(XactCtl, pageno))); 2. + * + * XXX could we make the LSNs to be bank-based? */ XLogRecPtr *group_lsn; IMHO, the flush still happens at the page level so up to which LSN should be flush before flushing the particular clog page should also be maintained at the page level. [1] https://www.postgresql.org/message-id/CAFiTN-vzDvNz%3DExGXz6gdyjtzGixKSqs0mKHMmaQ8sOSEFZ33A%40mail.gmail.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
On 2024-Feb-07, Dilip Kumar wrote: > On Wed, Feb 7, 2024 at 3:49 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > Sure, but is that really what we want? > > So your question is do we want these buffers to be in multiple of > SLRU_BANK_SIZE? Maybe we can have the last bank to be partial, I > don't think it should create any problem logically. I mean we can > look again in the patch to see if we have made any such assumptions > but that should be fairly easy to fix, then maybe if we are going in > this way we should get rid of the check_slru_buffers() function as > well. Not really, I just don't think the macro should be in slru.h. Another thing I've been thinking is that perhaps it would be useful to make the banks smaller, when the total number of buffers is small. For example, if you have 16 or 32 buffers, it's not really clear to me that it makes sense to have just 1 bank or 2 banks. It might be more sensible to have 4 banks with 4 or 8 buffers instead. That should make the algorithm scale down as well as up ... I haven't done either of those things in the attached v19 version. I did go over the comments once again and rewrote the parts I was unhappy with, including some existing ones. I think it's OK now from that point of view ... at some point I thought about creating a separate README, but in the end I thought it not necessary. I did add a bunch of Assert()s to make sure the locks that are supposed to be held are actually held. This led me to testing the page status to be not EMPTY during SimpleLruWriteAll() before calling SlruInternalWritePage(), because the assert was firing. The previous code is not really *buggy*, but to me it's weird to call WritePage() on a slot with no contents. Another change was in TransactionGroupUpdateXidStatus: the original code had the leader doing pg_atomic_read_u32(&procglobal->clogGroupFirst) to know which bank to lock. I changed it to simply be the page used by the leader process; this doesn't need an atomic read, and should be the same page anyway. (If it isn't, it's no big deal). But what's more: even if we do read ->clogGroupFirst at that point, there's no guarantee that this is going to be exactly for the same process that ends up being the first in the list, because since we have not set it to INVALID by the time we grab the bank lock, it is quite possible for more processes to add themselves to the list. I realized all this while rewriting the comments in a way that would let me understand what was going on ... so IMO the effort was worthwhile. Anyway, what I send now should be pretty much final, modulo the change to the check_slru_buffers routines and documentation additions to match pg_stat_slru to the new GUC names. > > Another painful point is that pg_stat_slru uses internal names in the > > data it outputs, which obviously do not match the new GUCs. > > Yeah, that's true, but I think this could be explained somewhere not > sure what is the right place for this. Or we can change those names in the view. > FYI, I have also repeated all the performance tests I performed in my > first email[1], and I am seeing a similar gain. Excellent, thanks for doing that. > Some comments on v18 in my first pass of the review. > > 1. 
> @@ -665,7 +765,7 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn) > lsnindex = GetLSNIndex(slotno, xid); > *lsn = XactCtl->shared->group_lsn[lsnindex]; > > - LWLockRelease(XactSLRULock); > + LWLockRelease(SimpleLruGetBankLock(XactCtl, pageno)); > > Maybe here we can add an assert before releasing the lock for a safety check > > Assert(LWLockHeldByMe(SimpleLruGetBankLock(XactCtl, pageno))); Hmm, I think this would just throw a warning or error "you don't hold such lwlock", so it doesn't seem necessary. > 2. > + * > + * XXX could we make the LSNs to be bank-based? > */ > XLogRecPtr *group_lsn; > > IMHO, the flush still happens at the page level so up to which LSN > should be flush before flushing the particular clog page should also > be maintained at the page level. Yeah, this was a misguided thought, nevermind. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ <Schwern> It does it in a really, really complicated way <crab> why does it need to be complicated? <Schwern> Because it's MakeMaker.
Attachment
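The TransactionGroupUpdateXidStatus change described above amounts to the leader locking the bank of its own page rather than re-reading the group list head. A sketch (clogGroupMemberPage is the existing PGPROC field; the surrounding group-walk code is omitted):

    LWLock     *lock;

    /*
     * We are the group leader.  The bank we need is the one containing our
     * own page; re-reading ProcGlobal->clogGroupFirst here would be
     * pointless, since more processes may keep adding themselves to the
     * list until we close it out under the bank lock.
     */
    lock = SimpleLruGetBankLock(XactCtl, MyProc->clogGroupMemberPage);
    LWLockAcquire(lock, LW_EXCLUSIVE);

    /* ... walk the group members and update each XID's status ... */

    LWLockRelease(lock);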
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Fri, Feb 23, 2024 at 1:48 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2024-Feb-07, Dilip Kumar wrote: > > > On Wed, Feb 7, 2024 at 3:49 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > Sure, but is that really what we want? > > > > So your question is do we want these buffers to be in multiple of > > SLRU_BANK_SIZE? Maybe we can have the last bank to be partial, I > > don't think it should create any problem logically. I mean we can > > look again in the patch to see if we have made any such assumptions > > but that should be fairly easy to fix, then maybe if we are going in > > this way we should get rid of the check_slru_buffers() function as > > well. > > Not really, I just don't think the macro should be in slru.h. Okay > Another thing I've been thinking is that perhaps it would be useful to > make the banks smaller, when the total number of buffers is small. For > example, if you have 16 or 32 buffers, it's not really clear to me that > it makes sense to have just 1 bank or 2 banks. It might be more > sensible to have 4 banks with 4 or 8 buffers instead. That should make > the algorithm scale down as well as up ... It might be helpful to have small-size banks when SLRU buffers are set to a very low value and we are only accessing a couple of pages at a time (i.e. no buffer replacement) because in such cases most of the contention will be on SLRU Bank lock. Although I am not sure how practical such a use case would be, I mean if someone is using multi-xact very heavily or creating frequent subtransaction overflow then wouldn't they should set this buffer limit to some big enough value? By doing this we would lose some simplicity of the patch I mean instead of using the simple macro i.e. SLRU_BANK_SIZE we would need to compute this and store it in SlruShared. Maybe that's not that bad. > > I haven't done either of those things in the attached v19 version. I > did go over the comments once again and rewrote the parts I was unhappy > with, including some existing ones. I think it's OK now from that point > of view ... at some point I thought about creating a separate README, > but in the end I thought it not necessary. Thanks, I will review those changes. > I did add a bunch of Assert()s to make sure the locks that are supposed > to be held are actually held. This led me to testing the page status to > be not EMPTY during SimpleLruWriteAll() before calling > SlruInternalWritePage(), because the assert was firing. The previous > code is not really *buggy*, but to me it's weird to call WritePage() on > a slot with no contents. Okay, I mean internally SlruInternalWritePage() will flush only if the status is SLRU_PAGE_VALID, but it is better the way you have done. > Another change was in TransactionGroupUpdateXidStatus: the original code > had the leader doing pg_atomic_read_u32(&procglobal->clogGroupFirst) to > know which bank to lock. I changed it to simply be the page used by the > leader process; this doesn't need an atomic read, and should be the same > page anyway. (If it isn't, it's no big deal). But what's more: even if > we do read ->clogGroupFirst at that point, there's no guarantee that > this is going to be exactly for the same process that ends up being the > first in the list, because since we have not set it to INVALID by the > time we grab the bank lock, it is quite possible for more processes to > add themselves to the list. 
Yeah, this looks better > I realized all this while rewriting the comments in a way that would let > me understand what was going on ... so IMO the effort was worthwhile. +1 I will review and do some more testing early next week and share my feedback. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
"Andrey M. Borodin"
Date:
> On 23 Feb 2024, at 12:36, Dilip Kumar <dilipbalaut@gmail.com> wrote: > >> Another thing I've been thinking is that perhaps it would be useful to >> make the banks smaller, when the total number of buffers is small. For >> example, if you have 16 or 32 buffers, it's not really clear to me that >> it makes sense to have just 1 bank or 2 banks. It might be more >> sensible to have 4 banks with 4 or 8 buffers instead. That should make >> the algorithm scale down as well as up ... > > It might be helpful to have small-size banks when SLRU buffers are set > to a very low value and we are only accessing a couple of pages at a > time (i.e. no buffer replacement) because in such cases most of the > contention will be on SLRU Bank lock. Although I am not sure how > practical such a use case would be, I mean if someone is using > multi-xact very heavily or creating frequent subtransaction overflow > then wouldn't they should set this buffer limit to some big enough > value? By doing this we would lose some simplicity of the patch I > mean instead of using the simple macro i.e. SLRU_BANK_SIZE we would > need to compute this and store it in SlruShared. Maybe that's not that > bad. I'm sure anyone with multiple CPUs should increase, not decrease, the previous default of 128 buffers (with 512MB shared buffers). Having more CPUs (the only way to benefit from more locks) implies bigger transaction buffers. IMO making the bank size variable adds unneeded computation overhead; bank search loops should be unrollable by the compiler, etc. Originally there was a patch set step that packed a bank's page addresses together in one array. It was done to make the bank search a SIMD instruction. Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Fri, Feb 23, 2024 at 1:06 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Fri, Feb 23, 2024 at 1:48 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > On 2024-Feb-07, Dilip Kumar wrote: > > > > > On Wed, Feb 7, 2024 at 3:49 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > > > Sure, but is that really what we want? > > > > > > So your question is do we want these buffers to be in multiple of > > > SLRU_BANK_SIZE? Maybe we can have the last bank to be partial, I > > > don't think it should create any problem logically. I mean we can > > > look again in the patch to see if we have made any such assumptions > > > but that should be fairly easy to fix, then maybe if we are going in > > > this way we should get rid of the check_slru_buffers() function as > > > well. > > > > Not really, I just don't think the macro should be in slru.h. > > Okay > > > Another thing I've been thinking is that perhaps it would be useful to > > make the banks smaller, when the total number of buffers is small. For > > example, if you have 16 or 32 buffers, it's not really clear to me that > > it makes sense to have just 1 bank or 2 banks. It might be more > > sensible to have 4 banks with 4 or 8 buffers instead. That should make > > the algorithm scale down as well as up ... > > It might be helpful to have small-size banks when SLRU buffers are set > to a very low value and we are only accessing a couple of pages at a > time (i.e. no buffer replacement) because in such cases most of the > contention will be on SLRU Bank lock. Although I am not sure how > practical such a use case would be, I mean if someone is using > multi-xact very heavily or creating frequent subtransaction overflow > then wouldn't they should set this buffer limit to some big enough > value? By doing this we would lose some simplicity of the patch I > mean instead of using the simple macro i.e. SLRU_BANK_SIZE we would > need to compute this and store it in SlruShared. Maybe that's not that > bad. > > > > > I haven't done either of those things in the attached v19 version. I > > did go over the comments once again and rewrote the parts I was unhappy > > with, including some existing ones. I think it's OK now from that point > > of view ... at some point I thought about creating a separate README, > > but in the end I thought it not necessary. > > Thanks, I will review those changes. Few other things I noticed while reading through the patch, I haven't read it completely yet but this is what I got for now. 1. + * If no process is already in the list, we're the leader; our first step + * is to "close out the group" by resetting the list pointer from + * ProcGlobal->clogGroupFirst (this lets other processes set up other + * groups later); then we lock the SLRU bank corresponding to our group's + * page, do the SLRU updates, release the SLRU bank lock, and wake up the + * sleeping processes. I think here we are saying that we "close out the group" before acquiring the SLRU lock but that's not true. We keep the group open until we gets the lock so that we can get maximum members in while we are anyway waiting for the lock. 2. static void TransactionIdSetCommitTs(TransactionId xid, TimestampTz ts, RepOriginId nodeid, int slotno) { - Assert(TransactionIdIsNormal(xid)); + if (!TransactionIdIsNormal(xid)) + return; + + entryno = TransactionIdToCTsEntry(xid); I do not understand why we need this change. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
On 2024-Feb-23, Andrey M. Borodin wrote: > I'm sure anyone with multiple CPUs should increase, not decrease > previous default of 128 buffers (with 512MB shared buffers). Having > more CPUs (the only way to benefit from more locks) implies bigger > transaction buffers. Sure. > IMO making bank size variable adds unneeded computation overhead, bank > search loops should be unrollable by compiler etc. Makes sense. > Originally there was a patch set step, that packed bank's page > addresses together in one array. It was done to make bank search a > SIMD instruction. Ants Aasma had proposed a rework of the LRU code for better performance. He told me it depended on bank size being 16, so you're right that it's probably not a good idea to make it variable. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
On 2024-Feb-23, Dilip Kumar wrote: > 1. > + * If no process is already in the list, we're the leader; our first step > + * is to "close out the group" by resetting the list pointer from > + * ProcGlobal->clogGroupFirst (this lets other processes set up other > + * groups later); then we lock the SLRU bank corresponding to our group's > + * page, do the SLRU updates, release the SLRU bank lock, and wake up the > + * sleeping processes. > > I think here we are saying that we "close out the group" before > acquiring the SLRU lock but that's not true. We keep the group open > until we gets the lock so that we can get maximum members in while we > are anyway waiting for the lock. Absolutely right. Reworded that. > 2. > static void > TransactionIdSetCommitTs(TransactionId xid, TimestampTz ts, > RepOriginId nodeid, int slotno) > { > - Assert(TransactionIdIsNormal(xid)); > + if (!TransactionIdIsNormal(xid)) > + return; > + > + entryno = TransactionIdToCTsEntry(xid); > > I do not understand why we need this change. Ah yeah, I was bothered by the fact that if you pass Xid values earlier than NormalXid to this function, we'd reply with some nonsensical values instead of throwing an error. But you're right that it doesn't belong in this patch, so I removed that. Here's a version with these fixes, where I also added some text to the pg_stat_slru documentation: + <para> + For each <literal>SLRU</literal> area that's part of the core server, + there is a configuration parameter that controls its size, with the suffix + <literal>_buffers</literal> appended. For historical + reasons, the names are not exact matches, but <literal>Xact</literal> + corresponds to <literal>transaction_buffers</literal> and the rest should + be obvious. + <!-- Should we edit pgstat_internal.h::slru_names so that the "name" matches + the GUC name?? --> + </para> I think I would like to suggest renaming the GUCs to have the _slru_ bit in the middle: +# - SLRU Buffers (change requires restart) - + +#commit_timestamp_slru_buffers = 0 # memory for pg_commit_ts (0 = auto) +#multixact_offsets_slru_buffers = 16 # memory for pg_multixact/offsets +#multixact_members_slru_buffers = 32 # memory for pg_multixact/members +#notify_slru_buffers = 16 # memory for pg_notify +#serializable_slru_buffers = 32 # memory for pg_serial +#subtransaction_slru_buffers = 0 # memory for pg_subtrans (0 = auto) +#transaction_slru_buffers = 0 # memory for pg_xact (0 = auto) and the pgstat_internal.h table: static const char *const slru_names[] = { "commit_timestamp", "multixact_members", "multixact_offsets", "notify", "serializable", "subtransaction", "transaction", "other" /* has to be last */ }; This way they match perfectly. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "All rings of power are equal, But some rings of power are more equal than others." (George Orwell's The Lord of the Rings)
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Mon, Feb 26, 2024 at 9:46 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2024-Feb-23, Dilip Kumar wrote:
> + <para>
> + For each <literal>SLRU</literal> area that's part of the core server,
> + there is a configuration parameter that controls its size, with the suffix
> + <literal>_buffers</literal> appended. For historical
> + reasons, the names are not exact matches, but <literal>Xact</literal>
> + corresponds to <literal>transaction_buffers</literal> and the rest should
> + be obvious.
> + <!-- Should we edit pgstat_internal.h::slru_names so that the "name" matches
> + the GUC name?? -->
> + </para>
> I think I would like to suggest renaming the GUCs to have the _slru_ bit
> in the middle:
> +# - SLRU Buffers (change requires restart) -
> +
> +#commit_timestamp_slru_buffers = 0 # memory for pg_commit_ts (0 = auto)
> +#multixact_offsets_slru_buffers = 16 # memory for pg_multixact/offsets
> +#multixact_members_slru_buffers = 32 # memory for pg_multixact/members
> +#notify_slru_buffers = 16 # memory for pg_notify
> +#serializable_slru_buffers = 32 # memory for pg_serial
> +#subtransaction_slru_buffers = 0 # memory for pg_subtrans (0 = auto)
> +#transaction_slru_buffers = 0 # memory for pg_xact (0 = auto)
> and the pgstat_internal.h table:
> static const char *const slru_names[] = {
> "commit_timestamp",
> "multixact_members",
> "multixact_offsets",
> "notify",
> "serializable",
> "subtransaction",
> "transaction",
> "other" /* has to be last */
> };
> This way they match perfectly.
Yeah, I think this looks fine to me.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
On 2024-Feb-27, Dilip Kumar wrote: > > static const char *const slru_names[] = { > > "commit_timestamp", > > "multixact_members", > > "multixact_offsets", > > "notify", > > "serializable", > > "subtransaction", > > "transaction", > > "other" /* has to be last > > */ > > }; Here's a patch for the renaming part. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "No nos atrevemos a muchas cosas porque son difíciles, pero son difíciles porque no nos atrevemos a hacerlas" (Séneca)
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
"Andrey M. Borodin"
Date:
> On 27 Feb 2024, at 21:03, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2024-Feb-27, Dilip Kumar wrote: > >>> static const char *const slru_names[] = { >>> "commit_timestamp", >>> "multixact_members", >>> "multixact_offsets", >>> "notify", >>> "serializable", >>> "subtransaction", >>> "transaction", >>> "other" /* has to be last >>> */ >>> }; > > Here's a patch for the renaming part. Sorry for the late reply; I have one nit. Are you sure that multixact_members and multixact_offsets should be plural, while transaction and commit_timestamp are singular? Maybe multixact_members and multixact_offset? Because there are many members and one offset for a given multixact? Users certainly do not care, though... Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
On 2024-Feb-27, Andrey M. Borodin wrote: > Sorry for the late reply, I have one nit. Are you sure that > multixact_members and multixact_offsets are plural, while transaction > and commit_timestamp are singular? > Maybe multixact_members and multixact_offset? Because there are many > members and one offset for a givent multixact? Users certainly do not > care, thought... I asked myself the same question actually, and thought about putting them both in the singular. I only backed off because I noticed that the directories themselves are in plural (an old mistake of mine, evidently). Maybe we should follow that instinct and use the singular for these. If we do that, we can rename the directories to also appear in singular when/if the patch to add standard page headers to the SLRUs lands -- which is going to need code to rewrite the files during pg_upgrade anyway, so the rename is not going to be a big deal. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "Crear es tan difícil como ser libre" (Elsa Triolet)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
Here's the complete set, with these two names using the singular. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Uno puede defenderse de los ataques; contra los elogios se esta indefenso"
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
"Andrey M. Borodin"
Date:
> On 27 Feb 2024, at 22:33, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > <v21-0001-Rename-SLRU-elements-in-pg_stat_slru.patch><v21-0002-Make-SLRU-buffer-sizes-configurable.patch> These patches look amazing! Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
On 2024-Feb-27, Alvaro Herrera wrote: > Here's the complete set, with these two names using the singular. BTW one thing I had not noticed is that before this patch we have a minimum shmem size that's lower than the lowest you can go with the new code. This means Postgres may no longer start under extremely tight memory restrictions (and of course it uses more memory even when idle or with small databases). I wonder to what extent we should make an effort to relax that. For small, largely inactive servers, this is just memory we use for no good reason. However, anything we do here will impact performance on the high end, because as Andrey says this will add calculations and jumps where there are none today. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "We’ve narrowed the problem down to the customer’s pants being in a situation of vigorous combustion" (Robert Haas, Postgres expert extraordinaire)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Tue, Feb 27, 2024 at 11:41 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2024-Feb-27, Alvaro Herrera wrote:
> > Here's the complete set, with these two names using the singular.
> BTW one thing I had not noticed is that before this patch we have
> minimum shmem size that's lower than the lowest you can go with the new
> code.
> This means Postgres may no longer start when extremely tight memory
> restrictions (and of course use more memory even when idle or with small
> databases). I wonder to what extent should we make an effort to relax
> that. For small, largely inactive servers, this is just memory we use
> for no good reason. However, anything we do here will impact
> performance on the high end, because as Andrey says this will add
> calculations and jumps where there are none today.
I was just comparing the minimum memory required for SLRU when the system is minimally configured, correct me if I am wrong.
SLRU unpatched patched
commit_timestamp_buffers 4 16
subtransaction_buffers 32 16
transaction_buffers 4 16
multixact_offset_buffers 8 16
multixact_member_buffers 16 16
notify_buffers 8 16
serializable_buffers 16 16
-------------------------------------------------------------------------------------
total buffers 88 112
so that is < 200kB of extra memory on a minimally configured system, IMHO this should not matter.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Kyotaro Horiguchi
Date:
At Tue, 27 Feb 2024 18:33:18 +0100, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in > Here's the complete set, with these two names using the singular. The commit for the second patch added several GUC descriptions: > Sets the size of the dedicated buffer pool used for the commit timestamp cache. Some of them (commit_timestamp_buffers, transaction_buffers, subtransaction_buffers) use 0 to mean auto-tuning based on the shared_buffers size. I think it's worth adding an extra_desc such as "0 to automatically determine this value based on the shared buffer size". regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
On 2024-Feb-29, Kyotaro Horiguchi wrote: > At Tue, 27 Feb 2024 18:33:18 +0100, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in > > Here's the complete set, with these two names using the singular. > > The commit by the second patch added several GUC descriptions: > > > Sets the size of the dedicated buffer pool used for the commit timestamp cache. > > Some of them, commit_timestamp_buffers, transaction_buffers, > subtransaction_buffers use 0 to mean auto-tuning based on > shared-buffer size. I think it's worth adding an extra_desc such as "0 > to automatically determine this value based on the shared buffer > size". How about this? -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "La victoria es para quien se atreve a estar solo"
Attachment
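The attached patch is not reproduced inline here; purely as an illustration of where such an extra_desc would live, a guc_tables.c-style entry might look roughly like this. The field layout follows the usual config_int convention; the upper bound, the GUC_UNIT_BLOCKS flag and the NULL hook slots are assumptions for the sketch, not the committed values:

/*
 * Illustrative guc_tables.c entry only -- not the committed patch.
 * short_desc is quoted from the thread; extra_desc is Kyotaro's suggested
 * wording; the bounds and NULL hooks are placeholders.
 */
{
	{"commit_timestamp_buffers", PGC_POSTMASTER, RESOURCES_MEM,
		gettext_noop("Sets the size of the dedicated buffer pool used for the commit timestamp cache."),
		gettext_noop("0 to automatically determine this value based on the shared buffer size."),
		GUC_UNIT_BLOCKS
	},
	&commit_timestamp_buffers,
	0, 0, 1024 * 1024,			/* 0 means auto-tune from shared_buffers; upper bound is a placeholder */
	NULL, NULL, NULL
},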
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
On 2024-Feb-29, Alvaro Herrera wrote: > On 2024-Feb-29, Kyotaro Horiguchi wrote: > > Some of them, commit_timestamp_buffers, transaction_buffers, > > subtransaction_buffers use 0 to mean auto-tuning based on > > shared-buffer size. I think it's worth adding an extra_desc such as "0 > > to automatically determine this value based on the shared buffer > > size". > > How about this? Pushed that way, but we can discuss further wording improvements/changes if someone wants to propose any. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "La rebeldía es la virtud original del hombre" (Arthur Schopenhauer)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > Pushed that way, but we can discuss further wording improvements/changes > if someone wants to propose any. I just noticed that drongo is complaining about two lines added by commit 53c2a97a9:
drongo | 2024-03-04 14:34:52 | ../pgsql/src/backend/access/transam/slru.c(436): warning C4047: '!=': 'SlruPageStatus*' differs in levels of indirection from 'int'
drongo | 2024-03-04 14:34:52 | ../pgsql/src/backend/access/transam/slru.c(717): warning C4047: '!=': 'SlruPageStatus*' differs in levels of indirection from 'int'
These lines are
Assert(&shared->page_status[slotno] != SLRU_PAGE_EMPTY);
Assert(&ctl->shared->page_status[slotno] != SLRU_PAGE_EMPTY);
These are comparing the address of something with an enum value, which surely cannot be sane. Is the "&" operator incorrect? It looks like SLRU_PAGE_EMPTY has (by chance, or deliberately) the numeric value of zero, so I guess the majority of our BF animals are understanding this as "address != NULL". But that doesn't look like a useful test to be making. regards, tom lane
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Tom Lane
Date:
I wrote: > It looks like SLRU_PAGE_EMPTY has (by chance, or deliberately) > the numeric value of zero, so I guess the majority of our BF > animals are understanding this as "address != NULL". But that > doesn't look like a useful test to be making. In hopes of noticing whether there are other similar thinkos, I permuted the order of the SlruPageStatus enum values, and now I get the expected warnings from gcc:
In file included from ../../../../src/include/postgres.h:45, from slru.c:59:
slru.c: In function ‘SimpleLruWaitIO’:
slru.c:436:38: warning: comparison between pointer and integer
    Assert(&shared->page_status[slotno] != SLRU_PAGE_EMPTY);
../../../../src/include/c.h:862:9: note: in definition of macro ‘Assert’
    if (!(condition)) \
slru.c: In function ‘SimpleLruWritePage’:
slru.c:717:43: warning: comparison between pointer and integer
    Assert(&ctl->shared->page_status[slotno] != SLRU_PAGE_EMPTY);
../../../../src/include/c.h:862:9: note: in definition of macro ‘Assert’
    if (!(condition)) \
So it looks like it's just these two places. regards, tom lane
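Presumably the fix is simply to drop the stray "&" so that the status value itself is tested; a minimal sketch of what the corrected assertions would look like (surrounding code omitted):

/* slru.c: compare the page status value, not its address */
Assert(shared->page_status[slotno] != SLRU_PAGE_EMPTY);
Assert(ctl->shared->page_status[slotno] != SLRU_PAGE_EMPTY);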
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
On 2024-Mar-04, Tom Lane wrote: > In hopes of noticing whether there are other similar thinkos, > I permuted the order of the SlruPageStatus enum values, and > now I get the expected warnings from gcc: Thanks for checking! I pushed the fixes. Maybe we should assign a nonzero value (= 1) to the first element of enums, to avoid this kind of mistake. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
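A sketch of what that suggestion would mean for this particular enum; the member names follow slru.h, and only the explicit "= 1" is the proposed change:

/*
 * With a nonzero first value, an accidental pointer-vs-enum comparison
 * against SLRU_PAGE_EMPTY can no longer silently behave like "!= NULL".
 */
typedef enum
{
	SLRU_PAGE_EMPTY = 1,		/* buffer is not in use */
	SLRU_PAGE_READ_IN_PROGRESS, /* page is being read in */
	SLRU_PAGE_VALID,			/* page is valid and not being written */
	SLRU_PAGE_WRITE_IN_PROGRESS /* page is being written out */
} SlruPageStatus;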
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alexander Lakhin
Date:
Hello Alvaro, 27.02.2024 20:33, Alvaro Herrera wrote: > Here's the complete set, with these two names using the singular. I've managed to trigger an assert added by 53c2a97a9. Please try the following script against a server compiled with -DTEST_SUMMARIZE_SERIAL (initially I observed this failure without the define, it just simplifies reproducing...):
# initdb & start
...
createdb test
echo "
SELECT pg_current_xact_id() AS tx \gset
SELECT format('CREATE TABLE t%s(i int)', g) FROM generate_series(1, 1022 - :tx) g
\gexec
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT pg_current_xact_id();
SELECT pg_sleep(5);
" | psql test &
echo "
SELECT pg_sleep(1);
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT 1 INTO a;
COMMIT;
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT 2 INTO b;
" | psql test
It fails for me with the following stack trace:
TRAP: failed Assert("LWLockHeldByMeInMode(SimpleLruGetBankLock(ctl, pageno), LW_EXCLUSIVE)"), File: "slru.c", Line: 366, PID: 21711
ExceptionalCondition at assert.c:52:13
SimpleLruZeroPage at slru.c:369:11
SerialAdd at predicate.c:921:20
SummarizeOldestCommittedSxact at predicate.c:1521:2
GetSerializableTransactionSnapshotInt at predicate.c:1787:16
GetSerializableTransactionSnapshot at predicate.c:1691:1
GetTransactionSnapshot at snapmgr.c:264:21
exec_simple_query at postgres.c:1162:4
...
Best regards, Alexander
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
Hello, On 2024-Apr-03, Alexander Lakhin wrote: > I've managed to trigger an assert added by 53c2a97a9. > Please try the following script against a server compiled with > -DTEST_SUMMARIZE_SERIAL (initially I observed this failure without the > define, it just simplifies reproducing...): Ah yes, absolutely, we're failing to switch to the correct SLRU bank lock there. This rewrite of that small piece should fix it. Thanks for reporting this. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Pido que me den el Nobel por razones humanitarias" (Nicanor Parra)
Attachment
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Dilip Kumar
Date:
On Wed, Apr 3, 2024 at 7:40 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > Hello, > > On 2024-Apr-03, Alexander Lakhin wrote: > > > I've managed to trigger an assert added by 53c2a97a9. > > Please try the following script against a server compiled with > > -DTEST_SUMMARIZE_SERIAL (initially I observed this failure without the > > define, it just simplifies reproducing...): > > Ah yes, absolutely, we're missing to trade the correct SLRU bank lock > there. This rewrite of that small piece should fix it. Thanks for > reporting this. > Yeah, we missed acquiring the bank lock w.r.t. intervening pages, thanks for reporting. Your fix looks correct to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
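To make the shape of the problem concrete: a hedged sketch (not the committed fix) of how zeroing the intervening pages in SerialAdd() has to take each page's own bank lock, since consecutive pages can fall into different banks. The variable names and the nextPage() page-advance helper are illustrative assumptions:

/*
 * Illustration only: each intervening page may live in a different SLRU
 * bank, so the matching bank lock must be held while that page is zeroed.
 */
for (int64 page = firstZeroPage; page != targetPage; page = nextPage(page))
{
	LWLock	   *banklock = SimpleLruGetBankLock(SerialSlruCtl, page);

	LWLockAcquire(banklock, LW_EXCLUSIVE);
	(void) SimpleLruZeroPage(SerialSlruCtl, page);	/* asserts the bank lock is held */
	LWLockRelease(banklock);
}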
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
From
Alvaro Herrera
Date:
On 2024-Apr-03, Dilip Kumar wrote: > Yeah, we missed acquiring the bank lock w.r.t. intervening pages, > thanks for reporting. Your fix looks correct to me. Thanks for the quick review! And thanks to Alexander for the report. Pushed the fix. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "No hay hombre que no aspire a la plenitud, es decir, la suma de experiencias de que un hombre es capaz"