Thread: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Dilip Kumar
Date:
The small size of the SLRU buffer pools can sometimes become a
performance problem because it’s not difficult to have a workload
where the number of buffers actively in use is larger than the
fixed-size buffer pool. However, just increasing the size of the
buffer pool doesn’t necessarily help, because the linear search that
we use for buffer replacement doesn’t scale, and also because
contention on the single centralized lock limits scalability.

A couple of patches have been proposed in the past to address this by
increasing the buffer pool size; one of them [1] was proposed by Thomas
Munro and makes the size of the buffer pool configurable.  And, in order
to deal with the linear search in a large buffer pool, it divides the SLRU
buffer pool into associative banks so that the search cost does not grow
with the size of the pool.  This works well for workloads that are mainly
limited by frequent buffer replacement, but it still doesn't help
workloads where the centralized control lock is the bottleneck.
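
To make the bank idea concrete, here is a minimal sketch of the bank-wise
lookup.  The identifier names (nbanks, SLRU_BANK_SIZE, the function name)
are only illustrative, not necessarily what the patch uses; the point is
that the page number selects a bank and the linear search is confined to
that bank's slots:

/* Illustrative sketch only, not the patch's actual code. */
#define SLRU_BANK_SIZE	16		/* slots per associative bank (assumed) */

static int
SlruLookupSlotSketch(SlruShared shared, int pageno)
{
	int		bankno = pageno % shared->nbanks;	/* nbanks is assumed */
	int		bankstart = bankno * SLRU_BANK_SIZE;
	int		bankend = bankstart + SLRU_BANK_SIZE;

	/* the search cost is bounded by the bank size, not the pool size */
	for (int slotno = bankstart; slotno < bankend; slotno++)
	{
		if (shared->page_status[slotno] != SLRU_PAGE_EMPTY &&
			shared->page_number[slotno] == pageno)
			return slotno;
	}
	return -1;					/* not resident; caller must read it in */
}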

So I have taken this patch as my base patch (v1-0001) and added two more
improvements on top of it: 1) in v1-0002, instead of a centralized control
lock for the SLRU, I have introduced a bank-wise control lock; 2) in
v1-0003, I have removed the global LRU counter and introduced a bank-wise
counter.  The second change (v1-0003) is meant to avoid the CPU cache
invalidation caused by frequent updates of a single shared variable;
later, in my performance tests, I will show how much gain we get from
these 2 changes.
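
To make the bank-wise control lock concrete, a minimal sketch (again with
illustrative names such as bank_locks): the lock to take is derived from
the page number, so backends touching pages that fall in different banks
no longer serialize on a single LWLock.

/* Sketch only; the real patch may name and place this differently. */
static inline LWLock *
SlruGetBankLock(SlruShared shared, int pageno)
{
	int		bankno = pageno % shared->nbanks;

	return &shared->bank_locks[bankno].lock;	/* bank_locks is assumed */
}

Code paths that currently take the centralized control lock would instead
take SlruGetBankLock(shared, pageno) for the page they are operating on.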

Note: This is going to be a long email, but I have summarised the main
idea above this point.  Below I discuss more internal details to show that
the design is valid, and I present 2 performance tests: one specific to
contention on the centralized lock, and the other mainly about contention
due to frequent buffer replacement in the SLRU buffer pool.  We are
getting ~2x TPS compared to head with these patches; in the later sections
I discuss this in more detail, i.e. the exact performance numbers and an
analysis of why we are seeing the gain.

I faced some problems while converting this centralized control lock into
a bank-wise lock, mainly because the lock is (mis)used for different
purposes.  The main purpose of this control lock, as I understand it, is
to protect in-memory access (read/write) of the buffers in the SLRU buffer
pool.

Here is the list of some problems and their analysis:

1) In some SLRUs, this lock is used to protect members of the control
structure that are specific to that SLRU layer, e.g. the SerialControlData
members are protected by SerialSLRULock.  I don't think that is the right
use of this lock, so I have introduced another lock, SerialControlLock,
for this specific purpose.  Based on my analysis there is no reason to
protect these members and the SLRU buffer access with the same lock.
2) The member 'latest_page_number' inside SlruSharedData is also protected
by the SLRU lock.  I would not say this use case is wrong, but since it is
a common variable and not a per-bank variable, it can no longer be
protected by the bank-wise lock.  Its only use is to track the latest page
in an SLRU so that we do not evict that page during victim page selection.
So I have converted it to an atomic variable, as it is completely
independent of the SLRU buffer access (see the sketch after this list).
3) SlruScanDirectory(): the call from DeactivateCommitTs() is made under
the SLRU control lock, but from all other places SlruScanDirectory() is
called without the lock, because those callers run in contexts that are
not executed concurrently (i.e. startup, checkpoint).  DeactivateCommitTs()
is also called only during startup, so there doesn't seem to be any need
to call this under the SLRU control lock.  For now I call it under the
all-bank lock, because this is not a performance path and that keeps the
behavior consistent with the current logic; but if others also think that
we do not need a lock at this place, then we can remove it, and then we
don't need the all-bank lock anywhere.
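
Regarding point 2 above, here is a sketch of what the atomic
latest_page_number amounts to (the exact member type and the call sites
are approximations, not the patch verbatim):

/* In SlruSharedData (sketch): */
pg_atomic_uint64	latest_page_number;

/* When a new page is created (sketch): */
pg_atomic_write_u64(&shared->latest_page_number, pageno);

/* During victim-buffer selection (sketch): never evict the latest page. */
if (shared->page_number[slotno] ==
	pg_atomic_read_u64(&shared->latest_page_number))
	continue;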

There are some other uses of this lock where one might think the bank-wise
split could be a problem, but it is not, and I have given my analysis for
each of them below:

1) SimpleLruTruncate: We might worry that with a bank-wise lock this
becomes an issue, because we may need to release and acquire different
locks as we scan different banks.  But as per my analysis this is not a
problem, because a) with the current code we also release and re-acquire
the centralized lock multiple times in order to perform I/O on a buffer
slot, so the behavior is not changed, and, more importantly, b) all SLRU
layers already take care that this function is not executed concurrently;
I have verified all callers of this function and it is true, and the
function's header comment says the same.  So this is not an issue as per
my analysis.

2) Extending or adding a new page to an SLRU: I have noticed that this is
also protected either by some other exclusive lock or done only during
startup.  So, in short, the SLRU lock is just used for protecting access
to the buffers in the buffer pool; it is not what guarantees exclusive
execution of these functions, because that is taken care of in some other
way.

3) Another thing I noticed while writing this, which is worth noting: the
CLOG group update of xid status.  There, if we do not get the control lock
on the SLRU, we add ourselves to a group and the group leader does the job
for all members of the group.  One might think that different pages in the
group could belong to different SLRU banks, so the leader might need to
acquire/release the lock repeatedly as it processes the requests in the
group.  That is true, and it is taken care of, but we don't need to worry
much about this case: as per the implementation of the group update, we
try to put members with the same page request into one group, and only as
an exception can there be members with a different page request.  So with
a bank-wise lock we are handling that exceptional case, but it is not the
regular case that we need to acquire/release the lock multiple times.
Design-wise we are good, and performance-wise there should not be any
problem either, because most of the time we will be updating pages from
the same bank; and if in some cases we have updates for old transactions
due to long-running transactions, then we should do better by not having a
centralized lock.  (A simplified sketch of the leader's loop follows.)
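
Here is that simplified sketch.  This is not the actual clog.c code; it
reuses the hypothetical SlruGetBankLock() helper sketched earlier, and
member_pageno[]/nmembers stand in for the real proc-list walk.  The leader
switches locks only when a member's page differs from the previous one,
which, as noted, is the exception rather than the rule.

/* Simplified sketch of the group leader updating members' xid status. */
static void
GroupUpdateSketch(SlruShared shared, const int *member_pageno, int nmembers)
{
	int		prevpageno = -1;	/* no bank lock held yet */

	for (int i = 0; i < nmembers; i++)
	{
		int		pageno = member_pageno[i];

		if (pageno != prevpageno)
		{
			/* switch locks only when the page (and hence bank) changes */
			if (prevpageno != -1)
				LWLockRelease(SlruGetBankLock(shared, prevpageno));
			LWLockAcquire(SlruGetBankLock(shared, pageno), LW_EXCLUSIVE);
			prevpageno = pageno;
		}
		/* ... set this member's transaction status on the page ... */
	}

	if (prevpageno != -1)
		LWLockRelease(SlruGetBankLock(shared, prevpageno));
}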

Performance Test:
Exp1: Shows the problem of CPU cache invalidation due to frequent updates
of the centralized lock and the common LRU counter.  Here, in parallel
with the standard pgbench script, we run a transaction that frequently
creates subtransaction overflow, which forces the visibility-check
mechanism to access the subtrans SLRU.
Test machine: 8 CPU/ 64 core/ 128 with HT/ 512 GB RAM / SSD
scale factor: 300
shared_buffers=20GB
checkpoint_timeout=40min
max_wal_size=20GB
max_connections=200

Workload: Run these 2 scripts in parallel:
./pgbench -c $ -j $ -T 600 -P5 -M prepared postgres
./pgbench -c 1 -j 1 -T 600 -f savepoint.sql postgres

savepoint.sql (create subtransaction overflow)
BEGIN;
SAVEPOINT S1;
INSERT INTO test VALUES(1);
← repeat 70 times →
SELECT pg_sleep(1);
COMMIT;

Code under test:
Head: PostgreSQL head code
SlruBank: Head with the first patch applied, converting the SLRU buffer
pool into banks (0001)
SlruBank+BankwiseLockAndLru: Applied 0001+0002+0003

Results:
Clients Head SlruBank  SlruBank+BankwiseLockAndLru
1          457      491           475
8          3753    3819        3782
32      14594   14328      17028
64      15600    16243      25944
128    15957    16272      31731

So we can see that at 128 clients we get ~2x TPS (with SlruBank +
bank-wise lock and bank-wise LRU counter) compared to HEAD.  One might
wonder why we do not see much gain with the SlruBank patch alone.  The
reason is that in this particular test case there is not much load on
buffer replacement; in fact, the wait events also don't show contention on
any lock.  Instead, the main load comes from frequently modifying the
shared variables, i.e. the centralized control lock and the centralized
LRU counter.  That is evident in the perf data shown below.

+  74.72%   0.06% postgres postgres  [.] XidInMVCCSnapshot
+  74.08%   0.02% postgres postgres  [.] SubTransGetTopmostTransaction
+  74.04%   0.07% postgres postgres  [.] SubTransGetParent
+  57.66%   0.04% postgres postgres  [.] LWLockAcquire
+  57.64%   0.26% postgres postgres  [.] SimpleLruReadPage_ReadOnly
……
+  16.53%   0.07% postgres postgres  [.] LWLockRelease
+  16.36%   0.04% postgres postgres  [.] pg_atomic_sub_fetch_u32
+  16.31%  16.24% postgres postgres  [.] pg_atomic_fetch_sub_u32_impl

We can notice that the main load is on the atomic variable within
LWLockAcquire and LWLockRelease.  Once we apply the bank-wise lock patch
(v1-0002), the same problem becomes visible on the cur_lru_count update in
the SlruRecentlyUsed macro [2] (not shown here, but it was visible in my
perf report).  And that is resolved by implementing a bank-wise counter; a
sketch of the bank-wise variant follows the macro below.

[2]
#define SlruRecentlyUsed(shared, slotno) \
do { \
..
(shared)->cur_lru_count = ++new_lru_count; \
..
} \
} while (0)
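
And here is that sketch of the bank-wise variant introduced in v1-0003
(field names such as bank_cur_lru_count, and the SLRU_BANK_SIZE constant
from the earlier sketch, are illustrative): each bank keeps its own
counter, so recency updates for pages in different banks no longer dirty
the same cache line.

/* Sketch only; mirrors the macro above but with a per-bank counter. */
static inline void
SlruRecentlyUsedBankwise(SlruShared shared, int slotno)
{
	int		bankno = slotno / SLRU_BANK_SIZE;
	int		new_lru_count = shared->bank_cur_lru_count[bankno];

	if (new_lru_count != shared->page_lru_count[slotno])
	{
		shared->bank_cur_lru_count[bankno] = ++new_lru_count;
		shared->page_lru_count[slotno] = new_lru_count;
	}
}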


Exp2: This test shows the load of frequent buffer replacement in the SLRU.
Here we run a pgbench-like script that frequently generates multixact-ids,
and in parallel we repeatedly start and commit a longer-running
transaction so that the multixact-ids are not immediately cleaned up by
vacuum, which creates contention on the SLRU buffer pool.  I am not
leaving the long-running transaction open forever, as that would start to
show a different problem (bloat) and we would lose sight of what I am
trying to show here.

Note: the test configuration is the same as in Exp1, only the workload is
different; we run the 2 scripts below.  The new config parameter (added in
v1-0001) is set to slru_buffers_size_scale=4, which means
NUM_MULTIXACTOFFSET_BUFFERS will be 64 (16 on head) and
NUM_MULTIXACTMEMBER_BUFFERS will be 128 (32 on head).

./pgbench -c $ -j $ -T 600 -P5 -M prepared -f multixact.sql postgres
./pgbench -c 1 -j 1 -T 600 -f longrunning.sql postgres

cat > multixact.sql <<EOF
\set aid random(1, 100000 * :scale)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
SELECT FROM pgbench_accounts WHERE aid = :aid FOR UPDATE;
SAVEPOINT S1;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES
(:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;
EOF

cat > longrunning.sql << EOF
BEGIN;
INSERT INTO pgbench_test VALUES(1);
select pg_sleep(10);
COMMIT;
EOF

Results:
Clients     Head    SlruBank   SlruBank+BankwiseLock
1              528        513           531
8             3870     4239         4157
32         13945   14470       14556
64         10086   19034       24482
128         6909    15627      18161

Here we can see a good improvement with the SlruBank patch alone, because
of the increased SLRU buffer pool size; in this workload there is a lot of
contention due to buffer replacement.  As shown below, there is a lot of
load on MultiXactOffsetSLRU as well as on MultiXactOffsetBuffer, which
shows that buffer evictions are frequent in this workload.  Increasing the
SLRU buffer pool size helps a lot, and further dividing the SLRU lock into
bank-wise locks gives an additional gain.  So in total we are seeing ~2.5x
TPS at 64 and 128 clients compared to head.

   3401  LWLock          | MultiXactOffsetSLRU
   2031  LWLock          | MultiXactOffsetBuffer
    687                         |
    427  LWLock          | BufferContent

Credits:
- The base patch v1-0001 is authored by Thomas Munro; I have just rebased it.
- 0002 and 0003 are new patches written by me, based on design ideas from
Robert and myself.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment
On Wed, Oct 11, 2023 at 4:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> The small size of the SLRU buffer pools can sometimes become a
> performance problem because it’s not difficult to have a workload
> where the number of buffers actively in use is larger than the
> fixed-size buffer pool. However, just increasing the size of the
> buffer pool doesn’t necessarily help, because the linear search that
> we use for buffer replacement doesn’t scale, and also because
> contention on the single centralized lock limits scalability.
>
> There is a couple of patches proposed in the past to address the
> problem of increasing the buffer pool size, one of the patch [1] was
> proposed by Thomas Munro where we make the size of the buffer pool
> configurable.

In my last email, I forgot to give the link from which I took the base
patch for dividing the buffer pool into banks, so here it is [1].  And
looking at it again, it seems that the idea of that patch came from
Andrey M. Borodin, and the idea of the SLRU scale factor was introduced by
Yura Sokolov and Ivan Lazarev.  Apologies for missing that in the first
email.

[1] https://commitfest.postgresql.org/43/2627/

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Wed, Oct 11, 2023 at 5:57 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Oct 11, 2023 at 4:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

> In my last email, I forgot to give the link from where I have taken
> the base path for dividing the buffer pool in banks so giving the same
> here[1].  And looking at this again it seems that the idea of that
> patch was from
> Andrey M. Borodin and the idea of the SLRU scale factor were
> introduced by Yura Sokolov and Ivan Lazarev.  Apologies for missing
> that in the first email.
>
> [1] https://commitfest.postgresql.org/43/2627/

In my last email I had just rebased the base patch.  While reading through
it, I realized that some refactoring was needed and some unused functions
were left over, so I have removed those and also added some comments.  I
also did some refactoring of my own patches, so I am reposting the patch
series.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment
On Wed, Oct 11, 2023 at 4:35 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> The small size of the SLRU buffer pools can sometimes become a
> performance problem because it’s not difficult to have a workload
> where the number of buffers actively in use is larger than the
> fixed-size buffer pool. However, just increasing the size of the
> buffer pool doesn’t necessarily help, because the linear search that
> we use for buffer replacement doesn’t scale, and also because
> contention on the single centralized lock limits scalability.
>
> There is a couple of patches proposed in the past to address the
> problem of increasing the buffer pool size, one of the patch [1] was
> proposed by Thomas Munro where we make the size of the buffer pool
> configurable.  And, in order to deal with the linear search in the
> large buffer pool,  we divide the SLRU buffer pool into associative
> banks so that searching in the buffer pool doesn’t get affected by the
> large size of the buffer pool.  This does well for the workloads which
> are mainly impacted by the frequent buffer replacement but this still
> doesn’t stand well with the workloads where the centralized control
> lock is the bottleneck.
>
> So I have taken this patch as my base patch (v1-0001) and further
> added 2 more improvements to this 1) In v1-0002, Instead of a
> centralized control lock for the SLRU I have introduced a bank-wise
> control lock 2)In v1-0003, I have removed the global LRU counter and
> introduced a bank-wise counter.  The second change (v1-0003) is in
> order to avoid the CPU/OS cache invalidation due to frequent updates
> of the single variable, later in my performance test I will show how
> much gain we have gotten because of these 2 changes.
>
> Note: This is going to be a long email but I have summarised the main
> idea above this point and now I am going to discuss more internal
> information in order to show that the design idea is valid and also
> going to show 2 performance tests where one is specific to the
> contention on the centralized lock and other is mainly contention due
> to frequent buffer replacement in SLRU buffer pool. We are getting ~2x
> TPS compared to the head by these patches and in later sections, I am
> going discuss this in more detail i.e. exact performance numbers and
> analysis of why we are seeing the gain.
>
...
>
> Performance Test:
> Exp1: Show problems due to CPU/OS cache invalidation due to frequent
> updates of the centralized lock and a common LRU counter.  So here we
> are running a parallel transaction to pgbench script which frequently
> creates subtransaction overflow and that forces the visibility-check
> mechanism to access the subtrans SLRU.
> Test machine: 8 CPU/ 64 core/ 128 with HT/ 512 MB RAM / SSD
> scale factor: 300
> shared_buffers=20GB
> checkpoint_timeout=40min
> max_wal_size=20GB
> max_connections=200
>
> Workload: Run these 2 scripts parallelly:
> ./pgbench -c $ -j $ -T 600 -P5 -M prepared postgres
> ./pgbench -c 1 -j 1 -T 600 -f savepoint.sql postgres
>
> savepoint.sql (create subtransaction overflow)
> BEGIN;
> SAVEPOINT S1;
> INSERT INTO test VALUES(1)
> ← repeat 70 times →
> SELECT pg_sleep(1);
> COMMIT;
>
> Code under test:
> Head: PostgreSQL head code
> SlruBank: The first patch applied to convert the SLRU buffer pool into
> the bank (0001)
> SlruBank+BankwiseLockAndLru: Applied 0001+0002+0003
>
> Results:
> Clients Head SlruBank  SlruBank+BankwiseLockAndLru
> 1          457      491           475
> 8          3753    3819        3782
> 32      14594   14328      17028
> 64      15600    16243      25944
> 128    15957    16272      31731
>
> So we can see that at 128 clients, we get ~2x TPS(with SlruBank +
> BankwiseLock and bankwise LRU counter) as compared to HEAD.
>

This and other results shared by you look promising. Will there be any
improvement in workloads related to clog buffer usage? BTW, I remember
that there was also a discussion of moving SLRU into a regular buffer
pool [1]. You have not provided any explanation as to whether that
approach will have any merits after we do this or whether that
approach is not worth pursuing at all.

[1] - https://commitfest.postgresql.org/43/3514/

--
With Regards,
Amit Kapila.



On Sat, Oct 14, 2023 at 9:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> This and other results shared by you look promising. Will there be any
> improvement in workloads related to clog buffer usage?

I did not understand this question; can you explain it a bit?  In short,
if it is about performance, then we will see the benefit for all the
SLRUs, since the control lock is not centralized anymore but bank-wise.

> BTW, I remember
> that there was also a discussion of moving SLRU into a regular buffer
> pool [1]. You have not provided any explanation as to whether that
> approach will have any merits after we do this or whether that
> approach is not worth pursuing at all.
>
> [1] - https://commitfest.postgresql.org/43/3514/

Yeah, I haven't read that thread in detail about the performance numbers
and all.  But these two approaches cannot coexist, because this patch
improves the SLRU buffer pool access (a configurable size) and the lock
contention, whereas if we move SLRU into the main buffer pool then we
might not have a similar problem; instead there might be other problems,
like SLRU buffers getting evicted because of other relation buffers.  OTOH
the advantage of that approach would be that we can just use a bigger
buffer pool and SLRU can take advantage of it too.  But in my opinion,
most of the time we have limited page access in SLRU, and the SLRU buffer
access pattern is quite different from the relation page access pattern,
so keeping them in the same buffer pool and competing against relation
pages for victim buffer selection might cause different problems.  But
anyway, I would rather discuss those points in that thread.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Alvaro Herrera
Date:
On 2023-Oct-11, Dilip Kumar wrote:

> In my last email, I forgot to give the link from where I have taken
> the base path for dividing the buffer pool in banks so giving the same
> here[1].  And looking at this again it seems that the idea of that
> patch was from Andrey M. Borodin and the idea of the SLRU scale factor
> were introduced by Yura Sokolov and Ivan Lazarev.  Apologies for
> missing that in the first email.

You mean [1].
[1] https://postgr.es/m/452d01f7e331458f56ad79bef537c31b%40postgrespro.ru
I don't like this idea very much, because of the magic numbers that act
as ratios for numbers of buffers on each SLRU compared to other SLRUs.
These values, which I took from the documentation part of the patch,
appear to have been selected by throwing darts at the wall:

NUM_CLOG_BUFFERS            = Min(128 << slru_buffers_size_scale, shared_buffers/256)
NUM_COMMIT_TS_BUFFERS       = Min(128 << slru_buffers_size_scale, shared_buffers/256)
NUM_SUBTRANS_BUFFERS        = Min(64 << slru_buffers_size_scale, shared_buffers/256)
NUM_NOTIFY_BUFFERS          = Min(32 << slru_buffers_size_scale, shared_buffers/256)
NUM_SERIAL_BUFFERS          = Min(32 << slru_buffers_size_scale, shared_buffers/256)
NUM_MULTIXACTOFFSET_BUFFERS = Min(32 << slru_buffers_size_scale, shared_buffers/256)
NUM_MULTIXACTMEMBER_BUFFERS = Min(64 << slru_buffers_size_scale, shared_buffers/256)

... which look pretty random already, if similar enough to the current
hardcoded values.  In reality, the code implements different values than
what the documentation says.

I don't see why would CLOG have the same number as COMMIT_TS, when the
size for elements of the latter is like 32 times bigger -- however, the
frequency of reads for COMMIT_TS is like 1000x smaller than for CLOG.
SUBTRANS is half of CLOG, yet it is 16 times larger, and it covers the
same range.  The MULTIXACT ones appear to keep the current ratio among
them (8/16 gets changed to 32/64).

... and this whole mess is scaled exponentially without regard to the
size that each SLRU requires.  This is just betting that enough memory
can be wasted across all SLRUs up to the point where the one that is
actually contended has sufficient memory.  This doesn't sound sensible
to me.

Like everybody else, I like having less GUCs to configure, but going
this far to avoid them looks rather disastrous to me.  IMO we should
just use Munro's older patches that gave one GUC per SLRU, and users
only need to increase the one that shows up in pg_wait_event sampling.
Someday we will get the (much more complicated) patches to move these
buffers to steal memory from shared buffers, and that'll hopefully let
us get rid of all this complexity.


I'm inclined to use Borodin's patch last posted here [2] instead of your
proposed 0001.
[2] https://postgr.es/m/93236D36-B91C-4DFA-AF03-99C083840378@yandex-team.ru

I did skim patches 0002 and 0003 without going into too much detail;
they look reasonable ideas.  I have not tried to reproduce the claimed
performance benefits.  I think measuring this patch set with the tests
posted by Shawn Debnath in [3] is important, too.
[3] https://postgr.es/m/YemDdpMrsoJFQJnU@f01898859afd.ant.amazon.com


On the other hand, here's a somewhat crazy idea.  What if, instead of
stealing buffers from shared_buffers (which causes a lot of complexity),
we allocate a common pool for all SLRUs to use?  We provide a single
knob -- say, non_relational_buffers=32MB as default -- and we use a LRU
algorithm (or something) to distribute that memory across all the SLRUs.
So the ratio to use for this SLRU or that one would depend on the nature
of the workload: maybe more for multixact in this server here, but more
for subtrans in that server there; it's just the total amount that the
user would have to configure, side by side with shared_buffers (and
perhaps scale with it like wal_buffers), and the LRU would handle the
rest.  The "only" problem here is finding a distribution algorithm that
doesn't further degrade performance, of course ...

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"The problem with the facetime model is not just that it's demoralizing, but
that the people pretending to work interrupt the ones actually working."
                  -- Paul Graham, http://www.paulgraham.com/opensource.html



Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Nathan Bossart
Date:
On Tue, Oct 24, 2023 at 06:04:13PM +0200, Alvaro Herrera wrote:
> Like everybody else, I like having less GUCs to configure, but going
> this far to avoid them looks rather disastrous to me.  IMO we should
> just use Munro's older patches that gave one GUC per SLRU, and users
> only need to increase the one that shows up in pg_wait_event sampling.
> Someday we will get the (much more complicated) patches to move these
> buffers to steal memory from shared buffers, and that'll hopefully let
> use get rid of all this complexity.

+1

> On the other hand, here's a somewhat crazy idea.  What if, instead of
> stealing buffers from shared_buffers (which causes a lot of complexity),
> we allocate a common pool for all SLRUs to use?  We provide a single
> knob -- say, non_relational_buffers=32MB as default -- and we use a LRU
> algorithm (or something) to distribute that memory across all the SLRUs.
> So the ratio to use for this SLRU or that one would depend on the nature
> of the workload: maybe more for multixact in this server here, but more
> for subtrans in that server there; it's just the total amount that the
> user would have to configure, side by side with shared_buffers (and
> perhaps scale with it like wal_buffers), and the LRU would handle the
> rest.  The "only" problem here is finding a distribution algorithm that
> doesn't further degrade performance, of course ...

I think it's worth a try.  It does seem simpler, and it might allow us to
sidestep some concerns about scaling when the SLRU pages are in
shared_buffers [0].

[0] https://postgr.es/m/ZPsaEGRvllitxB3v%40tamriel.snowman.net

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



On Tue, Oct 24, 2023 at 9:34 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2023-Oct-11, Dilip Kumar wrote:
>
> > In my last email, I forgot to give the link from where I have taken
> > the base path for dividing the buffer pool in banks so giving the same
> > here[1].  And looking at this again it seems that the idea of that
> > patch was from Andrey M. Borodin and the idea of the SLRU scale factor
> > were introduced by Yura Sokolov and Ivan Lazarev.  Apologies for
> > missing that in the first email.
>
> You mean [1].
> [1] https://postgr.es/m/452d01f7e331458f56ad79bef537c31b%40postgrespro.ru
> I don't like this idea very much, because of the magic numbers that act
> as ratios for numbers of buffers on each SLRU compared to other SLRUs.
> These values, which I took from the documentation part of the patch,
> appear to have been selected by throwing darts at the wall:
>
> NUM_CLOG_BUFFERS            = Min(128 << slru_buffers_size_scale, shared_buffers/256)
> NUM_COMMIT_TS_BUFFERS       = Min(128 << slru_buffers_size_scale, shared_buffers/256)
> NUM_SUBTRANS_BUFFERS        = Min(64 << slru_buffers_size_scale, shared_buffers/256)
> NUM_NOTIFY_BUFFERS          = Min(32 << slru_buffers_size_scale, shared_buffers/256)
> NUM_SERIAL_BUFFERS          = Min(32 << slru_buffers_size_scale, shared_buffers/256)
> NUM_MULTIXACTOFFSET_BUFFERS = Min(32 << slru_buffers_size_scale, shared_buffers/256)
> NUM_MULTIXACTMEMBER_BUFFERS = Min(64 << slru_buffers_size_scale, shared_buffers/256)
>
> ... which look pretty random already, if similar enough to the current
> hardcoded values.  In reality, the code implements different values than
> what the documentation says.
>
> I don't see why would CLOG have the same number as COMMIT_TS, when the
> size for elements of the latter is like 32 times bigger -- however, the
> frequency of reads for COMMIT_TS is like 1000x smaller than for CLOG.
> SUBTRANS is half of CLOG, yet it is 16 times larger, and it covers the
> same range.  The MULTIXACT ones appear to keep the current ratio among
> them (8/16 gets changed to 32/64).
>
> ... and this whole mess is scaled exponentially without regard to the
> size that each SLRU requires.  This is just betting that enough memory
> can be wasted across all SLRUs up to the point where the one that is
> actually contended has sufficient memory.  This doesn't sound sensible
> to me.
>
> Like everybody else, I like having less GUCs to configure, but going
> this far to avoid them looks rather disastrous to me.  IMO we should
> just use Munro's older patches that gave one GUC per SLRU, and users
> only need to increase the one that shows up in pg_wait_event sampling.
> Someday we will get the (much more complicated) patches to move these
> buffers to steal memory from shared buffers, and that'll hopefully let
> use get rid of all this complexity.

Overall I agree with your comments.  Actually, I hadn't put that much
thought into the GUC part and how it scales the SLRU buffers w.r.t. this
single configurable parameter.  So I think it is better that we take the
older patch version as our base patch, where we have a separate GUC per
SLRU.

> I'm inclined to use Borodin's patch last posted here [2] instead of your
> proposed 0001.
> [2] https://postgr.es/m/93236D36-B91C-4DFA-AF03-99C083840378@yandex-team.ru

I will rebase my patches on top of this.

> I did skim patches 0002 and 0003 without going into too much detail;
> they look reasonable ideas.  I have not tried to reproduce the claimed
> performance benefits.  I think measuring this patch set with the tests
> posted by Shawn Debnath in [3] is important, too.
> [3] https://postgr.es/m/YemDdpMrsoJFQJnU@f01898859afd.ant.amazon.com

Thanks for taking a look.

>
> On the other hand, here's a somewhat crazy idea.  What if, instead of
> stealing buffers from shared_buffers (which causes a lot of complexity),

Currently, we do not steal buffers from shared_buffers, although the
computation does depend on NBuffers.  I mean, for each SLRU we are
computing separate memory, which is in addition to shared_buffers, no?

> we allocate a common pool for all SLRUs to use?  We provide a single
> knob -- say, non_relational_buffers=32MB as default -- and we use a LRU
> algorithm (or something) to distribute that memory across all the SLRUs.
> So the ratio to use for this SLRU or that one would depend on the nature
> of the workload: maybe more for multixact in this server here, but more
> for subtrans in that server there; it's just the total amount that the
> user would have to configure, side by side with shared_buffers (and
> perhaps scale with it like wal_buffers), and the LRU would handle the
> rest.  The "only" problem here is finding a distribution algorithm that
> doesn't further degrade performance, of course ...

Yeah, this could be an idea, but are you saying that all the SLRUs would
share a single buffer pool, and an LRU algorithm would decide which pages
stay in the pool and which get evicted?  Wouldn't that create another
issue, with different SLRUs starting to contend on the same lock, if we
have a common buffer pool for all the SLRUs?  Or am I missing something?
Or are you saying that although there is a common buffer pool, each SLRU
would have its own boundaries within it, protected by a separate lock, and
based on the workload those boundaries could change dynamically?  I
haven't put much thought into how practical the idea is; I am just trying
to understand what you have in mind.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Fri, Oct 20, 2023 at 9:40 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sat, Oct 14, 2023 at 9:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > This and other results shared by you look promising. Will there be any
> > improvement in workloads related to clog buffer usage?
>
> I did not understand this question can you explain this a bit?
>

I meant to ask about the impact of this patch on accessing transaction
status via TransactionIdGetStatus(). Shouldn't we expect some
improvement in accessing CLOG buffers?

--
With Regards,
Amit Kapila.



On Wed, Oct 25, 2023 at 5:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Oct 20, 2023 at 9:40 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Sat, Oct 14, 2023 at 9:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > This and other results shared by you look promising. Will there be any
> > > improvement in workloads related to clog buffer usage?
> >
> > I did not understand this question can you explain this a bit?
> >
>
> I meant to ask about the impact of this patch on accessing transaction
> status via TransactionIdGetStatus(). Shouldn't we expect some
> improvement in accessing CLOG buffers?

Yes, there should be, because 1) there is no common lock anymore, so
contention on a centralized control lock is reduced when we access the
transaction status of pages falling in different SLRU banks, and 2) the
buffer pool size is configurable, so if the workload accesses transaction
status across a wide range of pages then it helps with frequent buffer
eviction, although that might not be the most common case.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Wed, Oct 25, 2023 at 10:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Oct 24, 2023 at 9:34 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

> Overall I agree with your comments, actually, I haven't put that much
> thought into the GUC part and how it scales the SLRU buffers w.r.t.
> this single configurable parameter.  Yeah, so I think it is better
> that we take the older patch version as our base patch where we have
> separate GUC per SLRU.
>
> > I'm inclined to use Borodin's patch last posted here [2] instead of your
> > proposed 0001.
> > [2] https://postgr.es/m/93236D36-B91C-4DFA-AF03-99C083840378@yandex-team.ru
>
> I will rebase my patches on top of this.

I have taken 0001 and 0002 from [1], done some bug fixes in 0001, and
changed the logic of SlruAdjustNSlots() in 0002 so that it now starts with
the next power-of-2 value of the configured slots and keeps doubling the
number of banks until either we reach the maximum number of banks,
SLRU_MAX_BANKS (128), or the bank size would drop below SLRU_MIN_BANK_SIZE
(8).  By doing so, we ensure that we don't have too many banks, but also
that we don't have very large banks.  There was also a patch 0003 in that
thread, but I haven't taken it, as it is another optimization (merging
some structure members); I will analyze its performance characteristics
and try to add it on top of the complete patch series.  (A sketch of the
adjusted logic follows.)
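
Here is that sketch; it is my reading of the logic, not the patch
verbatim:

/*
 * Sketch of the slot/bank adjustment.  The constants follow the
 * discussion (SLRU_MAX_BANKS = 128, SLRU_MIN_BANK_SIZE = 8); the function
 * name and signature are assumptions.
 */
static int
SlruAdjustNSlotsSketch(int nslots, int *banksize)
{
	int		nbanks = 1;

	/* start from the next power of 2 of the configured slot count */
	nslots = pg_nextpower2_32(Max(nslots, SLRU_MIN_BANK_SIZE));

	/* keep doubling the banks while they stay few enough and big enough */
	while (nbanks < SLRU_MAX_BANKS && nslots / nbanks > SLRU_MIN_BANK_SIZE)
		nbanks *= 2;

	*banksize = nslots / nbanks;
	return nslots;
}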

Patch details:
0001 - GUC parameter for each SLRU
0002 - Divide the SLRU pool into banks
(The above 2 are taken from [1] with some modification and rebasing by me)
0003 - Implement bank-wise SLRU lock as described in the first email
of this thread
0004 - Implement bank-wise LRU counter as described in the first email
of this thread
0005 - Some other optimization suggested offlist by Alvaro, i.e.
merging buffer locks and bank locks in the same array so that the
bank-wise LRU counter does not fetch the next cache line in a hot
function SlruRecentlyUsed()

Note: I think 0003,0004 and 0005 can be merged together but kept
separate so that we can review them independently and see how useful
each of them is.

[1] https://postgr.es/m/93236D36-B91C-4DFA-AF03-99C083840378@yandex-team.ru

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Dilip Kumar
Date:
On Mon, Oct 30, 2023 at 11:50 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
Based on some offlist discussions with Alvaro and Robert in separate
conversations: Alvaro and I arrived at the same concern, namely that if a
user sets a very high value for the number of slots (say, 1GB worth of
buffers), then the number of slots in each bank will be 1024 (considering
a maximum of 128 banks), and if we keep the sequential search for finding
the buffer for a page, that could become costly.  But later, in one of the
conversations with Robert, I realized that we can keep the bank-wise lock
approach along with a partitioned hash table.

So the idea is that we will use a buffer mapping hash table, somewhat like
what Thomas used in one of his patches [1], but instead of a normal hash
table we will use a partitioned hash table.  The SLRU buffer pool is still
divided as in the bank-wise approach, and there are separate locks for
each slot range.  So now we get the benefit of both approaches: 1) by
having a mapping hash we avoid the sequential search, 2) by dividing the
buffer pool into banks and keeping the victim-buffer search within a bank
we avoid locking all the partitions during the victim-buffer search, and
3) we can also maintain a bank-wise LRU counter so that we avoid
contention on a single variable, as discussed in my first email of this
thread.  Please find the updated patch set details below and the patches
attached to this email.  (A small sketch of the mapping-table side
follows.)
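
Here is that sketch.  The names SlruBufTableHashCode() and
SlruPartitionLock() follow the comments in the patch, but the bodies below
are a simplified illustration (modeled loosely on the shared-buffer
mapping table), not the exact code:

#define SLRU_NUM_PARTITIONS		8	/* one lock per slot range (assumed) */

/* pageno already distributes well, so the hash code can be trivial */
static inline uint32
SlruBufTableHashCode(int pageno)
{
	return (uint32) pageno;
}

/* each partition of the mapping hash has its own lock */
static inline LWLock *
SlruPartitionLock(SlruShared shared, uint32 hashcode)
{
	return &shared->part_locks[hashcode % SLRU_NUM_PARTITIONS].lock;
}

A lookup then becomes: compute the hash code for the pageno, take the
corresponding partition lock, probe the hash table for pageno -> slotno,
and on a miss run the victim-buffer search confined to that partition's
slot range.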

[1] 0001-Make-all-SLRU-buffer-sizes-configurable: This is the same
patch as the previous patch set
[2] 0002-Add-a-buffer-mapping-table-for-SLRUs: Patch to introduce
buffer mapping hash table
[3] 0003-Partition-wise-slru-locks: Partition the hash table and also
introduce partition-wise locks: this is a merge of 0003 and 0004 from
the previous patch set but instead of bank-wise locks it has
partition-wise locks and LRU counter.
[4] 0004-Merge-partition-locks-array-with-buffer-locks-array: merging
buffer locks and bank locks in the same array so that the bank-wise
LRU counter does not fetch the next cache line in a hot function
SlruRecentlyUsed()(same as 0005 from the previous patch set)
[5] 0005-Ensure-slru-buffer-slots-are-in-multiple-of-number-of: Ensure
that the number of slots is in multiple of the number of banks

With this approach I have also made the number of banks constant (i.e. 8),
so that some of the computations become easy.  I think with a buffer
mapping hash table we should not have much of a problem keeping this
fixed, because even with a very extreme configuration and a very high
number of slots we do not have performance problems: we are not doing a
sequential search, thanks to the buffer mapping hash, and if the number of
slots is set that high then victim-buffer searches should not be frequent
either, so we need not worry about the sequential search within a bank for
victim buffer selection.  I have also changed the default value of the
number of slots to 64 and the minimum value to 16.  I think this is a
reasonable default, because the existing values are too low considering
modern hardware, and these parameters are configurable, so a user can set
them lower when running with very little memory.


[1] https://www.postgresql.org/message-id/CA%2BhUKGLCLDtgDj2Xsf0uBk5WXDCeHxBDDJPsyY7m65Fde-%3Dpyg%40mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
"Andrey M. Borodin"
Date:


On 30 Oct 2023, at 09:20, Dilip Kumar <dilipbalaut@gmail.com> wrote:

changed the logic of SlruAdjustNSlots() in 0002, such that now it
starts with the next power of 2 value of the configured slots and
keeps doubling the number of banks until we reach the number of banks
to the max SLRU_MAX_BANKS(128) and bank size is bigger than
SLRU_MIN_BANK_SIZE (8).  By doing so, we will ensure we don't have too
many banks
There was nothing wrong with having too many banks. Until bank-wise locks and counters were added in later patchsets.
Having hashtable to find SLRU page in the buffer IMV is too slow. Some comments on this approach can be found here [0].
I'm OK with having HTAB for that if we are sure performance does not degrade significantly, but I really doubt this is the case.
I even think SLRU buffers used HTAB in some ancient times, but I could not find commit when it was changed to linear search.

Maybe we could decouple locks and counters from SLRU banks? Banks were meant to be small to exploit performance of local linear search. Lock partitions have to be bigger for sure.



On 30 Oct 2023, at 09:20, Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have taken 0001 and 0002 from [1], done some bug fixes in 0001

BTW can you please describe in more detail what kind of bugs?


Thanks for working on this!


Best regards, Andrey Borodin.


Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Dilip Kumar
Date:
On Sun, Nov 5, 2023 at 1:37 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:

> On 30 Oct 2023, at 09:20, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> changed the logic of SlruAdjustNSlots() in 0002, such that now it
> starts with the next power of 2 value of the configured slots and
> keeps doubling the number of banks until we reach the number of banks
> to the max SLRU_MAX_BANKS(128) and bank size is bigger than
> SLRU_MIN_BANK_SIZE (8).  By doing so, we will ensure we don't have too
> many banks
>
> There was nothing wrong with having too many banks. Until bank-wise locks and counters were added in later patchsets.

I agree with that, but I feel that with bank-wise locks we remove the
major contention from the centralized control lock, and my first email
shows how much benefit we can get in one of the simple test cases when we
create subtransaction overflow.

> Having hashtable to find SLRU page in the buffer IMV is too slow. Some comments on this approach can be found here [0].
> I'm OK with having HTAB for that if we are sure performance does not degrade significantly, but I really doubt this is the case.
> I even think SLRU buffers used HTAB in some ancient times, but I could not find commit when it was changed to linear search.

The main intention of having this buffer mapping hash is to find the SLRU
page faster than a sequential search when banks are relatively big, but if
we find cases where having the hash creates more overhead than gain, then
I am fine with removing it, because the whole purpose of adding the hash
here is to make the lookup faster.  So far in my tests I did not find any
slowness.  Do you or anyone else have a test case, based on the previous
research, showing whether it creates any slowness?

> Maybe we could decouple locks and counters from SLRU banks? Banks were meant to be small to exploit performance of local linear search. Lock partitions have to be bigger for sure.

Yeah, that could also be an idea if we plan to drop the hash.  I mean, the
bank-wise counter is fine, as we are finding a victim buffer within a bank
itself, but each lock could cover more slots than one bank, or in other
words it could protect multiple banks (a rough sketch below).  Let's hear
more opinions on this.
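
Here is that rough sketch; all names and numbers are made up, just to
illustrate decoupling the lock partitions from the banks:

/* Purely illustrative; the bank stays small, one lock covers many banks. */
#define SLRU_BANK_SIZE			16	/* small, for fast local linear search */
#define SLRU_BANKS_PER_LOCK		8	/* one lock protects 8 banks (assumed) */

static inline LWLock *
SlruGetLockByPageSketch(SlruShared shared, int pageno)
{
	int		bankno = pageno % shared->nbanks;
	int		lockno = bankno / SLRU_BANKS_PER_LOCK;

	return &shared->part_locks[lockno].lock;
}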

>
> On 30 Oct 2023, at 09:20, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I have taken 0001 and 0002 from [1], done some bug fixes in 0001
>
>
> BTW can you please describe in more detail what kind of bugs?

Yeah, actually that patch was using the same GUC
(multixact_offsets_buffers) in SimpleLruInit() for MultiXactOffsetCtl as
well as for MultiXactMemberCtl; see the snippet below from the original
patch.

@@ -1851,13 +1851,13 @@ MultiXactShmemInit(void)
  MultiXactMemberCtl->PagePrecedes = MultiXactMemberPagePrecedes;

  SimpleLruInit(MultiXactOffsetCtl,
-   "MultiXactOffset", NUM_MULTIXACTOFFSET_BUFFERS, 0,
+   "MultiXactOffset", multixact_offsets_buffers, 0,
    MultiXactOffsetSLRULock, "pg_multixact/offsets",
    LWTRANCHE_MULTIXACTOFFSET_BUFFER,
    SYNC_HANDLER_MULTIXACT_OFFSET);
  SlruPagePrecedesUnitTests(MultiXactOffsetCtl, MULTIXACT_OFFSETS_PER_PAGE);
  SimpleLruInit(MultiXactMemberCtl,
-   "MultiXactMember", NUM_MULTIXACTMEMBER_BUFFERS, 0,
+   "MultiXactMember", multixact_offsets_buffers, 0,
    MultiXactMemberSLRULock, "pg_multixact/members",
    LWTRANCHE_MULTIXACTMEMBER_BUFFER,
    SYNC_HANDLER_MULTIXACT_MEMBER);
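
The fix is to give the members SLRU its own GUC; presumably something
along these lines (the GUC name multixact_members_buffers is assumed here
by analogy and may be named differently in the posted patches):

   SimpleLruInit(MultiXactMemberCtl,
-   "MultiXactMember", multixact_offsets_buffers, 0,
+   "MultiXactMember", multixact_members_buffers, 0,
     MultiXactMemberSLRULock, "pg_multixact/members",
     LWTRANCHE_MULTIXACTMEMBER_BUFFER,
     SYNC_HANDLER_MULTIXACT_MEMBER);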


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
"Andrey M. Borodin"
Date:

> On 6 Nov 2023, at 09:09, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
>
>> Having hashtable to find SLRU page in the buffer IMV is too slow. Some comments on this approach can be found here [0].
>> I'm OK with having HTAB for that if we are sure performance does not degrade significantly, but I really doubt this is the case.
>> I even think SLRU buffers used HTAB in some ancient times, but I could not find commit when it was changed to linear search.
>
> The main intention of having this buffer mapping hash is to find the
> SLRU page faster than sequence search when banks are relatively bigger
> in size, but if we find the cases where having hash creates more
> overhead than providing gain then I am fine to remove the hash because
> the whole purpose of adding hash here to make the lookup faster.  So
> far in my test I did not find the slowness.  Do you or anyone else
> have any test case based on the previous research on whether it
> creates any slowness?
PFA test benchmark_slru_page_readonly(). In this test we run SimpleLruReadPage_ReadOnly() (essential part of TransactionIdGetStatus()).
Before introducing HTAB for buffer mapping I get
Time: 14837.851 ms (00:14.838)
and with buffer HTAB I get
Time: 22723.243 ms (00:22.723)

This hash table makes getting transaction status ~50% slower.

Benchmark script I used:
make -C $HOME/postgresMX -j 8 install && (pkill -9 postgres; rm -rf test; ./initdb test && \
  echo "shared_preload_libraries = 'test_slru'" >> test/postgresql.conf && \
  ./pg_ctl -D test start && ./psql -c 'create extension test_slru' postgres && \
  ./pg_ctl -D test restart && \
  ./psql -c "SELECT count(test_slru_page_write(a, 'Test SLRU')) FROM generate_series(12346, 12393, 1) as a;" \
         -c '\timing' -c "SELECT benchmark_slru_page_readonly(12377);" postgres)

>
>> Maybe we could decouple locks and counters from SLRU banks? Banks were meant to be small to exploit performance of local linear search. Lock partitions have to be bigger for sure.
>
> Yeah, that could also be an idea if we plan to drop the hash.  I mean
> bank-wise counter is fine as we are finding a victim buffer within a
> bank itself, but each lock could cover more slots than one bank size
> or in other words, it can protect multiple banks.  Let's hear more
> opinion on this.
+1

>
>>
>> On 30 Oct 2023, at 09:20, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>
>> I have taken 0001 and 0002 from [1], done some bug fixes in 0001
>>
>>
>> BTW can you please describe in more detail what kind of bugs?
>
> Yeah, actually that patch was using the same GUC
> (multixact_offsets_buffers) in SimpleLruInit for MultiXactOffsetCtl as
> well as for  MultiXactMemberCtl, see the below patch snippet from the
> original patch.

Ouch. We were running this for several years with this bug... Thanks!


Best regards, Andrey Borodin.




Attachment

Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Dilip Kumar
Date:
On Mon, Nov 6, 2023 at 1:05 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:

> > On 6 Nov 2023, at 09:09, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> >
> >> Having hashtable to find SLRU page in the buffer IMV is too slow. Some comments on this approach can be found here [0].
> >> I'm OK with having HTAB for that if we are sure performance does not degrade significantly, but I really doubt this is the case.
> >> I even think SLRU buffers used HTAB in some ancient times, but I could not find commit when it was changed to linear search.
> >
> > The main intention of having this buffer mapping hash is to find the
> > SLRU page faster than sequence search when banks are relatively bigger
> > in size, but if we find the cases where having hash creates more
> > overhead than providing gain then I am fine to remove the hash because
> > the whole purpose of adding hash here to make the lookup faster.  So
> > far in my test I did not find the slowness.  Do you or anyone else
> > have any test case based on the previous research on whether it
> > creates any slowness?
> PFA test benchmark_slru_page_readonly(). In this test we run SimpleLruReadPage_ReadOnly() (essential part of TransactionIdGetStatus())
> before introducing HTAB for buffer mapping I get
> Time: 14837.851 ms (00:14.838)
> with buffer HTAB I get
> Time: 22723.243 ms (00:22.723)
>
> This hash table makes getting transaction status ~50% slower.
>
> Benchmark script I used:
> make -C $HOME/postgresMX -j 8 install && (pkill -9 postgres; rm -rf test; ./initdb test && \
>   echo "shared_preload_libraries = 'test_slru'" >> test/postgresql.conf && \
>   ./pg_ctl -D test start && ./psql -c 'create extension test_slru' postgres && \
>   ./pg_ctl -D test restart && \
>   ./psql -c "SELECT count(test_slru_page_write(a, 'Test SLRU')) FROM generate_series(12346, 12393, 1) as a;" \
>          -c '\timing' -c "SELECT benchmark_slru_page_readonly(12377);" postgres)

With this test, I got the numbers below:

nslots   no-hash   hash
8        10s       13s
16       10s       13s
32       15s       13s
64       17s       13s

Yeah, so we can see that with a small bank size (<=16 slots), fetching a
page with the hash is ~30% slower than the sequential search, but beyond
32 slots the sequential search becomes slower as the number of slots
grows, whereas with the hash it stays constant, as expected.  But now, as
you said, if we keep the lock partition range different from the bank
size, then we might not have any problem with having more banks, and with
that we can keep the bank size small, like 16.  Let me put some more
thought into this and get back.
Any other opinions on this?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Alvaro Herrera
Date:
On 2023-Nov-06, Dilip Kumar wrote:

> Yeah so we can see with a small bank size <=16 slots we are seeing
> that the fetching page with hash is 30% slower than the sequential
> search, but beyond 32 slots sequential search is become slower as you
> grow the number of slots whereas with hash it stays constant as
> expected.  But now as you told if keep the lock partition range
> different than the bank size then we might not have any problem by
> having more numbers of banks and with that, we can keep the bank size
> small like 16.  Let me put some more thought into this and get back.
> Any other opinions on this?

dynahash is notoriously slow, which is why we have simplehash.h since
commit b30d3ea824c5.  Maybe we could use that instead.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Escucha y olvidarás; ve y recordarás; haz y entenderás" (Confucio)



Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
"Andrey M. Borodin"
Date:

> On 6 Nov 2023, at 14:31, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> dynahash is notoriously slow, which is why we have simplehash.h since
> commit b30d3ea824c5.  Maybe we could use that instead.

Dynahash has lock partitioning. Simplehash has not, AFAIK.
The thing is we do not really need a hash function - pageno is already a best hash function itself. And we do not need to cope with collisions much - we can evict a collided buffer.

Given this we do not need a hashtable at all. That’s exact reasoning how banks emerged, I started implementing dynahash patch in April 2021 and found out that “banks” approach is cleaner. However the term “bank” is not common in software, it’s taken from hardware cache.


Best regards, Andrey Borodin.


Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Dilip Kumar
Date:
On Mon, Nov 6, 2023 at 4:44 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
> > On 6 Nov 2023, at 14:31, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> >
> > dynahash is notoriously slow, which is why we have simplehash.h since
> > commit b30d3ea824c5.  Maybe we could use that instead.
>
> Dynahash has lock partitioning. Simplehash has not, AFAIK.

Yeah, simplehash doesn't have partitioning, so with simplehash we would be
stuck with the centralized control lock, which is one of the main problems
we are trying to solve here.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com





On Mon, Nov 6, 2023 at 4:44 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:


> On 6 Nov 2023, at 14:31, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> dynahash is notoriously slow, which is why we have simplehash.h since
> commit b30d3ea824c5.  Maybe we could use that instead.

Dynahash has lock partitioning. Simplehash has not, AFAIK.
The thing is we do not really need a hash function - pageno is already a best hash function itself. And we do not need to cope with collisions much - we can evict a collided buffer.

Given this we do not need a hashtable at all. That’s exact reasoning how banks emerged, I started implementing dynahsh patch in April 2021 and found out that “banks” approach is cleaner. However the term “bank” is not common in software, it’s taken from hardware cache.

I agree that we don't need a hash function to generate a hash value out of
pageno, which is itself sufficient, but I don't understand how we can get
rid of the hash table itself -- how would we map the pageno to the slot
number? Is that mapping not needed at all?

Regards,
Amul


On Fri, Nov 3, 2023 at 10:59 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Mon, Oct 30, 2023 at 11:50 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> [...]
[1] 0001-Make-all-SLRU-buffer-sizes-configurable: This is the same
patch as the previous patch set
[2] 0002-Add-a-buffer-mapping-table-for-SLRUs: Patch to introduce
buffer mapping hash table
[3] 0003-Partition-wise-slru-locks: Partition the hash table and also
introduce partition-wise locks: this is a merge of 0003 and 0004 from
the previous patch set but instead of bank-wise locks it has
partition-wise locks and LRU counter.
[4] 0004-Merge-partition-locks-array-with-buffer-locks-array: merging
buffer locks and bank locks in the same array so that the bank-wise
LRU counter does not fetch the next cache line in a hot function
SlruRecentlyUsed()(same as 0005 from the previous patch set)
[5] 0005-Ensure-slru-buffer-slots-are-in-multiple-of-number-of: Ensure
that the number of slots is in multiple of the number of banks
[...]

Here are some minor comments:

+ * By default, we'll use 1MB of for every 1GB of shared buffers, up to the
+ * maximum value that slru.c will allow, but always at least 16 buffers.
  */
 Size
 CommitTsShmemBuffers(void)
 {
-   return Min(256, Max(4, NBuffers / 256));
+   /* Use configured value if provided. */
+   if (commit_ts_buffers > 0)
+       return Max(16, commit_ts_buffers);
+   return Min(256, Max(16, NBuffers / 256));

Do you mean "4MB of for every 1GB"  in the comment?
--

diff --git a/src/include/access/commit_ts.h b/src/include/access/commit_ts.h
index 5087cdce51..78d017ad85 100644
--- a/src/include/access/commit_ts.h
+++ b/src/include/access/commit_ts.h
@@ -16,7 +16,6 @@
 #include "replication/origin.h"
 #include "storage/sync.h"

-
 extern PGDLLIMPORT bool track_commit_timestamp;

A spurious change.
--

@@ -168,10 +180,19 @@ SimpleLruShmemSize(int nslots, int nlsns)
 
    if (nlsns > 0)
        sz += MAXALIGN(nslots * nlsns * sizeof(XLogRecPtr));    /* group_lsn[] */
-
    return BUFFERALIGN(sz) + BLCKSZ * nslots;
 }

Another spurious change in 0002 patch.
--

+/*
+ * The slru buffer mapping table is partitioned to reduce contention. To
+ * determine which partition lock a given pageno requires, compute the pageno's
+ * hash code with SlruBufTableHashCode(), then apply SlruPartitionLock().
+ */

I didn't see the SlruBufTableHashCode() & SlruPartitionLock() functions anywhere in
your patches; is that an outdated comment?
--

-   sz += MAXALIGN(nslots * sizeof(LWLockPadded));  /* buffer_locks[] */
-   sz += MAXALIGN(SLRU_NUM_PARTITIONS * sizeof(LWLockPadded)); /* part_locks[] */
+   sz += MAXALIGN((nslots + SLRU_NUM_PARTITIONS) * sizeof(LWLockPadded));  /* locks[] */

I am a bit uncomfortable with these changes; merging the partition and buffer
locks makes the code harder to understand. I'm not sure what we are getting
out of this.
--

Subject: [PATCH v4 5/5] Ensure slru buffer slots are in multiple of numbe of
 partitions

I think the 0005 patch can be merged to 0001.

Regards,
Amul 

Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Dilip Kumar
Date:
On Wed, Nov 8, 2023 at 10:52 AM Amul Sul <sulamul@gmail.com> wrote:

Thanks for the review, Amul.

> Here are some minor comments:
>
> + * By default, we'll use 1MB of for every 1GB of shared buffers, up to the
> + * maximum value that slru.c will allow, but always at least 16 buffers.
>   */
>  Size
>  CommitTsShmemBuffers(void)
>  {
> -   return Min(256, Max(4, NBuffers / 256));
> +   /* Use configured value if provided. */
> +   if (commit_ts_buffers > 0)
> +       return Max(16, commit_ts_buffers);
> +   return Min(256, Max(16, NBuffers / 256));
>
> Do you mean "4MB of for every 1GB"  in the comment?

You are right

> --
>
> diff --git a/src/include/access/commit_ts.h b/src/include/access/commit_ts.h
> index 5087cdce51..78d017ad85 100644
> --- a/src/include/access/commit_ts.h
> +++ b/src/include/access/commit_ts.h
> @@ -16,7 +16,6 @@
>  #include "replication/origin.h"
>  #include "storage/sync.h"
>
> -
>  extern PGDLLIMPORT bool track_commit_timestamp;
>
> A spurious change.

Will fix

> --
>
> @@ -168,10 +180,19 @@ SimpleLruShmemSize(int nslots, int nlsns)
>
>     if (nlsns > 0)
>         sz += MAXALIGN(nslots * nlsns * sizeof(XLogRecPtr));    /* group_lsn[] */
> -
>     return BUFFERALIGN(sz) + BLCKSZ * nslots;
>  }
>
> Another spurious change in 0002 patch.

Will fix

> --
>
> +/*
> + * The slru buffer mapping table is partitioned to reduce contention. To
> + * determine which partition lock a given pageno requires, compute the pageno's
> + * hash code with SlruBufTableHashCode(), then apply SlruPartitionLock().
> + */
>
> I didn't see SlruBufTableHashCode() & SlruPartitionLock() functions anywhere in
> your patches, is that outdated comment?

Yes, I will fix it; actually, there are some major design changes related to this.

> --
>
> -   sz += MAXALIGN(nslots * sizeof(LWLockPadded));  /* buffer_locks[] */
> -   sz += MAXALIGN(SLRU_NUM_PARTITIONS * sizeof(LWLockPadded)); /* part_locks[] */
> +   sz += MAXALIGN((nslots + SLRU_NUM_PARTITIONS) * sizeof(LWLockPadded));  /* locks[] */
>
> I am a bit uncomfortable with these changes, merging parts and buffer locks
> making it hard to understand the code. Not sure what we were getting out of
> this?

Yes, even I do not like this much because it is confusing.  But the
advantage is that we are using a single pointer for the locks, which
means the next variable, the LRU counter, falls in the same cache line,
so frequent updates of the LRU counter benefit from this.  Although I
don't have any numbers that prove this.  Currently, I want to focus on
the base patches and keep this patch as an add-on; later, if we find it
useful and want to pursue it, we will see how to make it more readable.


>
> Subject: [PATCH v4 5/5] Ensure slru buffer slots are in multiple of numbe of
>  partitions
>
> I think the 0005 patch can be merged to 0001.

Yeah, in the next version it is done that way.  Planning to post by the end of the day.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Sat, 4 Nov 2023 at 22:08, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
On 30 Oct 2023, at 09:20, Dilip Kumar <dilipbalaut@gmail.com> wrote:

changed the logic of SlruAdjustNSlots() in 0002, such that now it
starts with the next power of 2 value of the configured slots and
keeps doubling the number of banks until we reach the number of banks
to the max SLRU_MAX_BANKS(128) and bank size is bigger than
SLRU_MIN_BANK_SIZE (8).  By doing so, we will ensure we don't have too
many banks
There was nothing wrong with having too many banks. Until bank-wise locks and counters were added in later patchsets.
Having hashtable to find SLRU page in the buffer IMV is too slow. Some comments on this approach can be found here [0].
I'm OK with having HTAB for that if we are sure performance does not degrade significantly, but I really doubt this is the case.
I even think SLRU buffers used HTAB in some ancient times, but I could not find commit when it was changed to linear search.

Maybe we could decouple locks and counters from SLRU banks? Banks were meant to be small to exploit performance of local linear search. Lock partitions have to be bigger for sure.

Is there a particular reason why lock partitions need to be bigger? We have one lock per buffer anyway, bankwise locks will increase the number of locks < 10%.

I am working on trying out a SIMD based LRU mechanism that uses a 16 entry bank. The data layout is:

struct CacheBank {
    int page_numbers[16];
    char access_age[16];
};

The first part uses up one cache line, and the second line has 48 bytes of space left over that could fit an LWLock and the page_status and page_dirty arrays.

Lookup + LRU maintenance has 20 instructions/14 cycle latency and the only branch is for found/not found. Hoping to have a working prototype of SLRU on top in the next couple of days.
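
As a rough sketch of the lookup side (SSE2 intrinsics plus a GCC builtin; cache_bank_lookup and its exact shape are illustrative assumptions on my part, not the actual prototype):

#include <emmintrin.h>

/* Illustrative only: find pageno among the 16 entries of one bank. */
static int
cache_bank_lookup(const struct CacheBank *bank, int pageno)
{
    __m128i     key = _mm_set1_epi32(pageno);
    unsigned int found = 0;

    for (int i = 0; i < 16; i += 4)
    {
        __m128i     entries = _mm_loadu_si128((const __m128i *) &bank->page_numbers[i]);
        __m128i     cmp = _mm_cmpeq_epi32(entries, key);

        /* collect one bit per 32-bit lane that matched */
        found |= (unsigned int) _mm_movemask_ps(_mm_castsi128_ps(cmp)) << i;
    }

    /* the single found/not-found branch */
    return found ? __builtin_ctz(found) : -1;
}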

Regards,
Ants Aasma

Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
"Andrey M. Borodin"
Date:

> On 8 Nov 2023, at 14:17, Ants Aasma <ants@cybertec.at> wrote:
>
> Is there a particular reason why lock partitions need to be bigger? We have one lock per buffer anyway, bankwise
locks will increase the number of locks < 10%.

The problem was not attracting much attention for some years. So my reasoning was that the solution should not have any
costs at all. The initial patchset with banks did not add any memory footprint.


> On 8 Nov 2023, at 14:17, Ants Aasma <ants@cybertec.at> wrote:
>
> I am working on trying out a SIMD based LRU mechanism that uses a 16 entry bank.

FWIW I tried to pack struct parts together to minimize cache lines touched, see step 3 in [0]. So far I could not prove
any performance benefits of this approach. But maybe your implementation will be more efficient.


Thanks!


Best regards, Andrey Borodin.


[0] https://www.postgresql.org/message-id/93236D36-B91C-4DFA-AF03-99C083840378@yandex-team.ru




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Dilip Kumar
Date:
On Mon, Nov 6, 2023 at 9:39 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sun, Nov 5, 2023 at 1:37 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>
> > Maybe we could decouple locks and counters from SLRU banks? Banks were meant to be small to exploit performance of
local linear search. Lock partitions have to be bigger for sure.
>
> Yeah, that could also be an idea if we plan to drop the hash.  I mean
> bank-wise counter is fine as we are finding a victim buffer within a
> bank itself, but each lock could cover more slots than one bank size
> or in other words, it can protect multiple banks.  Let's hear more
> opinion on this.

Here is the updated version of the patch.  Here I have taken the
approach suggested by Andrey; I discussed the same with Alvaro offlist
and he also agrees with it.  The idea is that we keep the bank size
fixed at 16 buffers per bank, and the allowed GUC value for each SLRU
buffer pool must be a multiple of the bank size.  We have removed the
centralized lock, but instead of one lock per bank we have put a
maximum limit of 128 on the number of bank locks.  We kept the max
limit at 128 because, in one of the operations (i.e.
ActivateCommitTs), we need to acquire all the bank locks (this is not
a performance path at all), and at a time we can acquire a maximum of
200 LWLocks, so we think this limit of 128 is good.  So now, if the
number of banks is <= 128 then we use one lock per bank; otherwise one
lock may protect buffers in multiple banks.

We might argue that we should keep the maximum number of locks lower
than 128, i.e. 64 or 32, and I am open to that; we can do more
experiments with a very large buffer pool and a very heavy workload to
see whether having up to 128 locks is helpful or not.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Alvaro Herrera
Date:
IMO the whole area of SLRU buffering is in horrible shape and many users
are struggling with overall PG performance because of it.  An
improvement doesn't have to be perfect -- it just has to be much better
than the current situation, which should be easy enough.  We can
continue to improve later, using more scalable algorithms or ones that
allow us to raise the limits higher.

The only point on which we do not have full consensus yet is the need to
have one GUC per SLRU, and a lot of effort seems focused on trying to
fix the problem without adding so many GUCs (for example, using shared
buffers instead, or use a single "scaling" GUC).  I think that hinders
progress.  Let's just add multiple GUCs, and users can leave most of
them alone and only adjust the one with which they have a performance
problem; it's not going to be the same one for everybody.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Sallah, I said NO camels! That's FIVE camels; can't you count?"
(Indiana Jones)



Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Robert Haas
Date:
On Wed, Nov 8, 2023 at 6:41 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Here is the updated version of the patch, here I have taken the
> approach suggested by Andrey and I discussed the same with Alvaro
> offlist and he also agrees with it.  So the idea is that we will keep
> the bank size fixed which is 16 buffers per bank and the allowed GUC
> value for each slru buffer must be in multiple of the bank size.  We
> have removed the centralized lock but instead of one lock per bank, we
> have kept the maximum limit on the number of bank locks which is 128.
> We kept the max limit as 128 because, in one of the operations (i.e.
> ActivateCommitTs), we need to acquire all the bank locks (but this is
> not a performance path at all) and at a time we can acquire a max of
> 200 LWlocks, so we think this limit of 128 is good.  So now if the
> number of banks is <= 128 then we will be using one lock per bank
> otherwise the one lock may protect access of buffer in multiple banks.

Just so I understand, I guess this means that an SLRU is limited to
16*128 = 2k buffers = 16MB?

When we were talking about this earlier, I suggested fixing the number
of banks and allowing the number of buffers per bank to scale
depending on the setting. That seemed simpler than allowing both the
number of banks and the number of buffers to vary, and it might allow
the compiler to optimize some code better, by converting a calculation
like page_no%number_of_banks into a masking operation like page_no&0xf
or whatever. However, because it allows an individual bank to become
arbitrarily large, it more or less requires us to use a buffer mapping
table. Some of the performance problems mentioned could be alleviated
by omitting the hash table when the number of buffers per bank is
small, and we could also create the dynahash with a custom hash
function that just does modular arithmetic on the page number rather
than a real hashing operation. However, maybe we don't really need to
do any of that. I agree that dynahash is clunky on a good day. I
hadn't realized the impact would be so noticeable.
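
As a rough illustration of that last idea -- not from any posted patch -- such a
dynahash table could be set up like the sketch below, assuming 64-bit page
numbers and a made-up SlruMappingEnt entry type:

/* hypothetical mapping-table entry */
typedef struct SlruMappingEnt
{
    int64       pageno;         /* hash key */
    int         slotno;         /* buffer slot currently holding the page */
} SlruMappingEnt;

/* "hash" that just returns the page number; it already spreads evenly */
static uint32
slru_pageno_hash(const void *key, Size keysize)
{
    Assert(keysize == sizeof(int64));
    return (uint32) *((const int64 *) key);
}

static HTAB *
slru_mapping_init(const char *name, int nslots)
{
    HASHCTL     info;

    info.keysize = sizeof(int64);
    info.entrysize = sizeof(SlruMappingEnt);
    info.hash = slru_pageno_hash;

    return ShmemInitHash(name, nslots, nslots, &info,
                         HASH_ELEM | HASH_FUNCTION);
}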

This proposal takes the opposite approach of fixing the number of
buffers per bank, letting the number of banks vary. I think that's
probably fine, although it does reduce the effective associativity of
the cache. If there are more hot buffers in a bank than the bank size,
the bank will be contended, even if other banks are cold. However,
given the way SLRUs are accessed, it seems hard to imagine this being
a real problem in practice. There aren't likely to be say 20 hot
buffers that just so happen to all be separated from one another by a
number of pages that is a multiple of the configured number of banks.
And in the seemingly very unlikely event that you have a workload that
behaves like that, you could always adjust the number of banks up or
down by one, and the problem would go away. So this seems OK to me.

I also agree with a couple of points that Alvaro made, specifically
that (1) this doesn't have to be perfect, just better than now and (2)
separate GUCs for each SLRU is fine. On the latter point, it's worth
keeping in mind that the cost of a GUC that most people don't need to
tune is fairly low. GUCs like work_mem and shared_buffers are
"expensive" because everybody more or less needs to understand what
they are and how to set them, and getting the right value can be tricky --
but a GUC like autovacuum_naptime is a lot cheaper, because almost
nobody needs to change it. It seems to me that these GUCs will fall
into the latter category. Users can hopefully just ignore them except
if they see a contention on the SLRU bank locks -- and then they can
consider increasing the number of banks for that particular SLRU. That
seems simple enough. As with autovacuum_naptime, there is a danger
that people will configure a ridiculous value of the parameter for no
good reason and get bad results, so it would be nice if someday we had
a magical system that just got all of this right without the user
needing to configure anything. But in the meantime, it's better to
have a somewhat manual system to relieve pressure on these locks than
no system at all.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Dilip Kumar
Date:
On Thu, Nov 9, 2023 at 9:39 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Nov 8, 2023 at 6:41 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > Here is the updated version of the patch, here I have taken the
> > approach suggested by Andrey and I discussed the same with Alvaro
> > offlist and he also agrees with it.  So the idea is that we will keep
> > the bank size fixed which is 16 buffers per bank and the allowed GUC
> > value for each slru buffer must be in multiple of the bank size.  We
> > have removed the centralized lock but instead of one lock per bank, we
> > have kept the maximum limit on the number of bank locks which is 128.
> > We kept the max limit as 128 because, in one of the operations (i.e.
> > ActivateCommitTs), we need to acquire all the bank locks (but this is
> > not a performance path at all) and at a time we can acquire a max of
> > 200 LWlocks, so we think this limit of 128 is good.  So now if the
> > number of banks is <= 128 then we will be using one lock per bank
> > otherwise the one lock may protect access of buffer in multiple banks.
>
> Just so I understand, I guess this means that an SLRU is limited to
> 16*128 = 2k buffers = 16MB?

Not really, because 128 is the maximum limit on the number of bank
locks, not on the number of banks.  So, for example, if you have 16*128
= 2k buffers then each lock will protect one bank, and likewise when
you have 16*512 = 8k buffers then each lock will protect 4 banks.  In
short, we can get the lock for each bank by a simple computation
(banklockno = bankno % 128).
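
To spell that computation out as code (the names here are placeholders,
not necessarily what the patch uses):

#define SLRU_BANK_SIZE      16      /* buffers per bank, fixed */
#define SLRU_MAX_BANKLOCKS  128     /* upper limit on the number of bank locks */

/* bank holding a given page, with nbanks = nslots / SLRU_BANK_SIZE */
static inline int
SlruPageToBank(int pageno, int nbanks)
{
    return pageno % nbanks;
}

/* lock protecting a given bank; locks get shared once nbanks > 128 */
static inline int
SlruBankToLock(int bankno, int nbanks)
{
    return bankno % Min(nbanks, SLRU_MAX_BANKLOCKS);
}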

> When we were talking about this earlier, I suggested fixing the number
> of banks and allowing the number of buffers per bank to scale
> depending on the setting. That seemed simpler than allowing both the
> number of banks and the number of buffers to vary, and it might allow
> the compiler to optimize some code better, by converting a calculation
> like page_no%number_of_banks into a masking operation like page_no&0xf
> or whatever. However, because it allows an individual bank to become
> arbitrarily large, it more or less requires us to use a buffer mapping
> table. Some of the performance problems mentioned could be alleviated
> by omitting the hash table when the number of buffers per bank is
> small, and we could also create the dynahash with a custom hash
> function that just does modular arithmetic on the page number rather
> than a real hashing operation. However, maybe we don't really need to
> do any of that. I agree that dynahash is clunky on a good day. I
> hadn't realized the impact would be so noticeable.

Yes, so one idea is that we keep the number of banks fixed; with that,
as you correctly pointed out, with a large number of buffers the bank
size can become quite big, and for that we would need a hash table.
OTOH, what I am doing here is keeping the bank size fixed and small (16
buffers per bank), and with that we can have a large number of banks
when the buffer pool is quite big.  But I feel having more banks is not
really a problem; the problem is only if we grow the number of locks
beyond a certain limit, as in some corner cases we need to acquire all
the locks together and there is a limit on that.  So I like this idea
of sharing locks across the banks.  With that: 1) we can have enough
locks that lock contention, or cache invalidation due to a common lock,
should not be a problem anymore; 2) we can keep a small bank size so
that the sequential search within a bank is quite fast, so reads are
fast; 3) with a small bank size the victim buffer search, which has to
be sequential, is quite fast.

> This proposal takes the opposite approach of fixing the number of
> buffers per bank, letting the number of banks vary. I think that's
> probably fine, although it does reduce the effective associativity of
> the cache. If there are more hot buffers in a bank than the bank size,
> the bank will be contended, even if other banks are cold. However,
> given the way SLRUs are accessed, it seems hard to imagine this being
> a real problem in practice. There aren't likely to be say 20 hot
> buffers that just so happen to all be separated from one another by a
> number of pages that is a multiple of the configured number of banks.
> And in the seemingly very unlikely event that you have a workload that
> behaves like that, you could always adjust the number of banks up or
> down by one, and the problem would go away. So this seems OK to me.

I agree with this

> I also agree with a couple of points that Alvaro made, specifically
> that (1) this doesn't have to be perfect, just better than now and (2)
> separate GUCs for each SLRU is fine. On the latter point, it's worth
> keeping in mind that the cost of a GUC that most people don't need to
> tune is fairly low. GUCs like work_mem and shared_buffers are
> "expensive" because everybody more or less needs to understand what
> they are and how to set them and getting the right value can tricky --
> but a GUC like autovacuum_naptime is a lot cheaper, because almost
> nobody needs to change it. It seems to me that these GUCs will fall
> into the latter category. Users can hopefully just ignore them except
> if they see a contention on the SLRU bank locks -- and then they can
> consider increasing the number of banks for that particular SLRU. That
> seems simple enough. As with autovacuum_naptime, there is a danger
> that people will configure a ridiculous value of the parameter for no
> good reason and get bad results, so it would be nice if someday we had
> a magical system that just got all of this right without the user
> needing to configure anything. But in the meantime, it's better to
> have a somewhat manual system to relieve pressure on these locks than
> no system at all.

+1

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Dilip Kumar
Date:
On Thu, Nov 9, 2023 at 4:55 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> IMO the whole area of SLRU buffering is in horrible shape and many users
> are struggling with overall PG performance because of it.  An
> improvement doesn't have to be perfect -- it just has to be much better
> than the current situation, which should be easy enough.  We can
> continue to improve later, using more scalable algorithms or ones that
> allow us to raise the limits higher.

I agree with this.

> The only point on which we do not have full consensus yet is the need to
> have one GUC per SLRU, and a lot of effort seems focused on trying to
> fix the problem without adding so many GUCs (for example, using shared
> buffers instead, or use a single "scaling" GUC).  I think that hinders
> progress.  Let's just add multiple GUCs, and users can leave most of
> them alone and only adjust the one with which they have a performance
> problems; it's not going to be the same one for everybody.

+1

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Nathan Bossart
Date:
On Fri, Nov 10, 2023 at 10:17:49AM +0530, Dilip Kumar wrote:
> On Thu, Nov 9, 2023 at 4:55 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>> The only point on which we do not have full consensus yet is the need to
>> have one GUC per SLRU, and a lot of effort seems focused on trying to
>> fix the problem without adding so many GUCs (for example, using shared
>> buffers instead, or use a single "scaling" GUC).  I think that hinders
>> progress.  Let's just add multiple GUCs, and users can leave most of
>> them alone and only adjust the one with which they have a performance
>> problems; it's not going to be the same one for everybody.
> 
> +1

+1

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Alvaro Herrera
Date:
I just noticed that 0003 does some changes to
TransactionGroupUpdateXidStatus() that haven't been adequately
explained AFAICS.  How do you know that these changes are safe?

0001 contains one typo in the docs, "cotents".

I'm not a fan of the fact that some CLOG sizing macros moved to clog.h,
leaving others in clog.c.  Maybe add commentary cross-linking both.
Alternatively, perhaps allowing xact_buffers to grow beyond 65536 up to
the slru.h-defined limit of 131072 is not that bad, even if it's more
than could possibly be needed for xact_buffers; nobody is going to use
64k buffers, since useful values are below a couple thousand anyhow.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
Tom: There seems to be something broken here.
Teodor: I'm in sackcloth and ashes...  Fixed.
                               http://postgr.es/m/482D1632.8010507@sigaev.ru



Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Dilip Kumar
Date:
On Thu, Nov 16, 2023 at 3:11 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> I just noticed that 0003 does some changes to
> TransactionGroupUpdateXidStatus() that haven't been adequately
> explained AFAICS.  How do you know that these changes are safe?

IMHO this is safe, as well as logical from a performance point of view.
It's safe because whenever we update any page in a group we acquire the
respective bank lock in exclusive mode, and in the extreme case where
the group contains pages from different banks, we switch the lock before
updating pages that fall under a different bank lock.  And we do not
wake any process in a group until we have done the status update for
all the processes, so there cannot be any race condition either.  It
should not affect performance adversely, and it does not remove the
need for group updates.  The main use case of the group update is to
optimize the situation where most of the processes are contending for
status updates on the same page, while processes waiting for status
updates on different pages go, on a best-effort basis, to different
groups for that page.  In short, a group is made up of processes
waiting to update the same clog page, which means logically all the
processes in the group will be waiting on the same bank lock.  In the
extreme situation where there are processes in the group trying to
update different pages, or even pages from different banks, we handle
it by switching the lock.  Someone may raise the concern that, where
there are processes waiting for different bank locks, we could wake
those processes up after releasing one lock; I think that is not
required, because that is exactly the situation we are trying to avoid
-- processes trying to update different pages ending up in the same
group -- so there is no point in adding complexity to optimize that
case.
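
To make the lock switching concrete, the idea is conceptually like the
sketch below (made-up types and helper names for illustration; this is
not the actual TransactionGroupUpdateXidStatus() code):

typedef struct GroupMember
{
    TransactionId xid;
    int         pageno;         /* clog page this xid lives on */
} GroupMember;

static void
group_update_sketch(GroupMember *members, int nmembers)
{
    LWLock     *currlock = NULL;

    for (int i = 0; i < nmembers; i++)
    {
        LWLock     *lock = SimpleLruGetBankLock(XactCtl, members[i].pageno);

        /* switch bank locks only when the next page falls under another lock */
        if (lock != currlock)
        {
            if (currlock != NULL)
                LWLockRelease(currlock);
            LWLockAcquire(lock, LW_EXCLUSIVE);
            currlock = lock;
        }

        /* ... set members[i].xid's status on its clog page here ... */
    }

    if (currlock != NULL)
        LWLockRelease(currlock);

    /* followers are woken only after every status update is done */
}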


> 0001 contains one typo in the docs, "cotents".
>
> I'm not a fan of the fact that some CLOG sizing macros moved to clog.h,
> leaving others in clog.c.  Maybe add commentary cross-linking both.
> Alternatively, perhaps allowing xact_buffers to grow beyond 65536 up to
> the slru.h-defined limit of 131072 is not that bad, even if it's more
> than could possibly be needed for xact_buffers; nobody is going to use
> 64k buffers, since useful values are below a couple thousand anyhow.

I agree that allowing xact_buffers to grow beyond 65536 up to the
slru.h-defined limit of 131072 is not that bad, so I will change that
in the next version.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Dilip Kumar
Date:
On Fri, Nov 17, 2023 at 1:09 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Nov 16, 2023 at 3:11 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

PFA, updated patch version, this fixes the comment given by Alvaro and
also improves some of the comments.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Alvaro Herrera
Date:
In SlruSharedData, a new comment is added that starts:
  "Instead of global counter we maintain a bank-wise lru counter because ..." 
You don't need to explain what we don't do.  Just explain what we do do.
So remove the words "Instead of a global counter" from there, because
they offer no wisdom.  Same with the phrase "so there is no point to ...".
I think "The oldest page is therefore" should say "The oldest page *in
the bank* is therefore", for extra clarity.

I wonder what's the deal with false sharing in the new
bank_cur_lru_count array.  Maybe instead of using LWLockPadded for
bank_locks, we should have a new struct, with both the LWLock and the
LRU counter; then pad *that* to the cacheline size.  This way, both the
lwlock and the counter come to the CPU running this code together.
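
Something along the lines of this sketch (the type and field names are
illustrative only):

/* sketch: keep each bank's lock and LRU counter on one cache line */
typedef union SlruBankPadded
{
    struct
    {
        LWLock      lock;
        int         cur_lru_count;
    }           bank;
    char        pad[PG_CACHE_LINE_SIZE];
} SlruBankPadded;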

Looking at SlruRecentlyUsed, which was a macro and is now a function.
The comment about "multiple evaluation of arguments" no longer applies,
so it needs to be removed; and it shouldn't talk about itself being a
macro.

Using "Size" as type for bank_mask looks odd.  For a bitmask, maybe it's
be more appropriate to use bits64 if we do need a 64-bit mask (we don't
have bits64, but it's easy to add a typedef).  I bet we don't really
need a 64 bit mask, and a 32-bit or even a 16-bit is sufficient, given
the other limitations on number of buffers.

I think SimpleLruReadPage should have this assert at start:

+   Assert(LWLockHeldByMe(SimpleLruGetSLRUBankLock(ctl, pageno)));

Do we really need one separate lwlock tranche for each SLRU?

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Cuando mañana llegue pelearemos segun lo que mañana exija" (Mowgli)



Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Alvaro Herrera
Date:
On 2023-Nov-17, Dilip Kumar wrote:

> On Thu, Nov 16, 2023 at 3:11 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> >
> > I just noticed that 0003 does some changes to
> > TransactionGroupUpdateXidStatus() that haven't been adequately
> > explained AFAICS.  How do you know that these changes are safe?
> 
> IMHO this is safe as well as logical to do w.r.t. performance.  It's
> safe because whenever we are updating any page in a group we are
> acquiring the respective bank lock in exclusive mode and in extreme
> cases if there are pages from different banks then we do switch the
> lock as well before updating the pages from different groups.

Looking at the coverage for this code,
https://coverage.postgresql.org/src/backend/access/transam/clog.c.gcov.html#413
it seems in our test suites we never hit the case where there is
anything in the "nextidx" field for commit groups.  To be honest, I
don't understand this group stuff, and so I'm doubly hesitant to go
without any testing here.  Maybe it'd be possible to use Michael
Paquier's injection points somehow?


I think in the code comments where you use "w.r.t.", that acronym can be
replaced with "for", which improves readability.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"All rings of power are equal,
But some rings of power are more equal than others."
                                 (George Orwell's The Lord of the Rings)



Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Dilip Kumar
Date:
On Fri, Nov 17, 2023 at 6:16 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

Thanks for the review; all comments look fine to me.  Replying to
those that need some clarification:

> I wonder what's the deal with false sharing in the new
> bank_cur_lru_count array.  Maybe instead of using LWLockPadded for
> bank_locks, we should have a new struct, with both the LWLock and the
> LRU counter; then pad *that* to the cacheline size.  This way, both the
> lwlock and the counter come to the CPU running this code together.

Actually, the array lengths of the LWLocks and the LRU counters are
different, so I don't think we can move them into a common structure.
The length of the *buffer_locks array is equal to the number of slots,
the length of the *bank_locks array is Min (number_of_banks, 128), and
the length of the *bank_cur_lru_count array is number_of_banks.

> Looking at the coverage for this code,
> https://coverage.postgresql.org/src/backend/access/transam/clog.c.gcov.html#413
> it seems in our test suites we never hit the case where there is
> anything in the "nextidx" field for commit groups.  To be honest, I
> don't understand this group stuff, and so I'm doubly hesitant to go
> without any testing here.  Maybe it'd be possible to use Michael
> Paquier's injection points somehow?

Sorry, but I am not aware of "Michael Paquier's injection points".  Is it
something already in the repo?  Can you point me to some example test
cases if we already have them?  Then I will try it out.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Alvaro Herrera
Date:
On 2023-Nov-18, Dilip Kumar wrote:

> On Fri, Nov 17, 2023 at 6:16 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

> > I wonder what's the deal with false sharing in the new
> > bank_cur_lru_count array.  Maybe instead of using LWLockPadded for
> > bank_locks, we should have a new struct, with both the LWLock and the
> > LRU counter; then pad *that* to the cacheline size.  This way, both the
> > lwlock and the counter come to the CPU running this code together.
> 
> Actually, the array lengths of both LWLock and the LRU counter are
> different so I don't think we can move them to a common structure.
> The length of the *buffer_locks array is equal to the number of slots,
> the length of the *bank_locks array is Min (number_of_banks, 128), and
> the length of the *bank_cur_lru_count array is number_of_banks.

Oh.

> > Looking at the coverage for this code,
> > https://coverage.postgresql.org/src/backend/access/transam/clog.c.gcov.html#413
> > it seems in our test suites we never hit the case where there is
> > anything in the "nextidx" field for commit groups.  To be honest, I
> > don't understand this group stuff, and so I'm doubly hesitant to go
> > without any testing here.  Maybe it'd be possible to use Michael
> > Paquier's injection points somehow?
> 
> Sorry, but I am not aware of "Michael Paquier's injection" Is it
> something already in the repo? Can you redirect me to some of the
> example test cases if we already have them? Then I will try it out.

https://postgr.es/ZVWufO_YKzTJHEHW@paquier.xyz

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Sallah, I said NO camels! That's FIVE camels; can't you count?"
(Indiana Jones)



Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
"Andrey M. Borodin"
Date:

> On 17 Nov 2023, at 16:11, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Nov 17, 2023 at 1:09 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>
>> On Thu, Nov 16, 2023 at 3:11 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> PFA, updated patch version, this fixes the comment given by Alvaro and
> also improves some of the comments.

I’ve skimmed through the patch set. Here are some minor notes.

1. The loops “for (slotno = bankstart; slotno < bankend; slotno++)” in SlruSelectLRUPage() and
SimpleLruReadPage_ReadOnly() now have identical comments. I think a little copy-paste is OK.
But SimpleLruReadPage_ReadOnly() does pgstat_count_slru_page_hit(), while SlruSelectLRUPage() does not. This is not
related to the patch set, just code nearby.

2. Do we really want these functions doing all the same?
extern bool check_multixact_offsets_buffers(int *newval, void **extra,GucSource source);
extern bool check_multixact_members_buffers(int *newval, void **extra,GucSource source);
extern bool check_subtrans_buffers(int *newval, void **extra,GucSource source);
extern bool check_notify_buffers(int *newval, void **extra, GucSource source);
extern bool check_serial_buffers(int *newval, void **extra, GucSource source);
extern bool check_xact_buffers(int *newval, void **extra, GucSource source);
extern bool check_commit_ts_buffers(int *newval, void **extra,GucSource source);

3. The name SimpleLruGetSLRUBankLock() contains the meaning of SLRU twice. I’d suggest truncating the prefix or the infix.

I do not have a hard opinion on any of these items.


Best regards, Andrey Borodin.


Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Dilip Kumar
Date:
On Sun, Nov 19, 2023 at 12:39 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>
> I’ve skimmed through the patch set. Here are some minor notes.

Thanks for the review
>
> 1. Cycles “for (slotno = bankstart; slotno < bankend; slotno++)” in SlruSelectLRUPage() and
SimpleLruReadPage_ReadOnly() now have identical comments. I think a little of copy-paste is OK.
> But SimpleLruReadPage_ReadOnly() does pgstat_count_slru_page_hit(), while SlruSelectLRUPage() does not. This is not
related to the patch set, just a code nearby.

Do you mean to say that we need to modify the comments, or are you saying
pgstat_count_slru_page_hit() is missing in SlruSelectLRUPage()?  If it
is the latter, then I can see that the caller of SlruSelectLRUPage() is
calling pgstat_count_slru_page_hit() as well as SlruRecentlyUsed().

> 2. Do we really want these functions doing all the same?
> extern bool check_multixact_offsets_buffers(int *newval, void **extra,GucSource source);
> extern bool check_multixact_members_buffers(int *newval, void **extra,GucSource source);
> extern bool check_subtrans_buffers(int *newval, void **extra,GucSource source);
> extern bool check_notify_buffers(int *newval, void **extra, GucSource source);
> extern bool check_serial_buffers(int *newval, void **extra, GucSource source);
> extern bool check_xact_buffers(int *newval, void **extra, GucSource source);
> extern bool check_commit_ts_buffers(int *newval, void **extra,GucSource source);

I tried to deduplicate these by doing all the work inside the
check_slru_buffers() function.  But I think it is hard to make them a
single function, because there is no option to pass the SLRU name to
the GUC check hook, and IMHO in the check hook we need to print the GUC
name.  Any suggestions on how we can avoid having so many functions?
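
Roughly, the shape I tried looks like the sketch below (SLRU_BANK_SIZE
and the GUC name string are illustrative assumptions here, not the
exact patch code):

/* shared helper: the SLRU name has to be passed in explicitly */
static bool
check_slru_buffers(const char *name, int *newval)
{
    /* valid if the configured size is a multiple of the bank size */
    if (*newval % SLRU_BANK_SIZE == 0)
        return true;

    GUC_check_errdetail("\"%s\" must be a multiple of %d.",
                        name, SLRU_BANK_SIZE);
    return false;
}

bool
check_notify_buffers(int *newval, void **extra, GucSource source)
{
    return check_slru_buffers("notify_buffers", newval);
}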

> 3. The name SimpleLruGetSLRUBankLock() contains meaning of SLRU twice. I’d suggest truncating prefix of infix.
>
> I do not have hard opinion on any of this items.
>

I prefer SimpleLruGetBankLock() so that it is consistent with the other
external functions starting with "SimpleLruGet".  Are you fine with this
name?


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Dilip Kumar
Date:
On Fri, Nov 17, 2023 at 7:28 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2023-Nov-17, Dilip Kumar wrote:

I think I need some more clarification for some of the review comments

> > On Thu, Nov 16, 2023 at 3:11 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > >
> > > I just noticed that 0003 does some changes to
> > > TransactionGroupUpdateXidStatus() that haven't been adequately
> > > explained AFAICS.  How do you know that these changes are safe?
> >
> > IMHO this is safe as well as logical to do w.r.t. performance.  It's
> > safe because whenever we are updating any page in a group we are
> > acquiring the respective bank lock in exclusive mode and in extreme
> > cases if there are pages from different banks then we do switch the
> > lock as well before updating the pages from different groups.
>
> Looking at the coverage for this code,
> https://coverage.postgresql.org/src/backend/access/transam/clog.c.gcov.html#413
> it seems in our test suites we never hit the case where there is
> anything in the "nextidx" field for commit groups.

1)
I was looking into your coverage report and I have attached a
screenshot from it.  It seems we do hit the block where nextidx is not
INVALID_PGPROCNO, which means there is some process other than the
group leader.  I have already started exploring the injection points,
but I just wanted to be sure what your main concern about the coverage
is, so I thought of checking that first.

     470             :     /*
     471             :      * If the list was not empty, the leader will update the status of our
     472             :      * XID. It is impossible to have followers without a leader because the
     473             :      * first process that has added itself to the list will always have
     474             :      * nextidx as INVALID_PGPROCNO.
     475             :      */
     476          98 :     if (nextidx != INVALID_PGPROCNO)
     477             :     {
     478           2 :         int         extraWaits = 0;
     479             :
     480             :         /* Sleep until the leader updates our XID status. */
     481           2 :         pgstat_report_wait_start(WAIT_EVENT_XACT_GROUP_UPDATE);
     482             :         for (;;)
     483             :         {
     484             :             /* acts as a read barrier */
     485           2 :             PGSemaphoreLock(proc->sem);
     486           2 :             if (!proc->clogGroupMember)
     487           2 :                 break;
     488           0 :             extraWaits++;
     489             :         }

2) Do we really need one separate lwlock tranche for each SLRU?

IMHO if we use the same lwlock tranche then the wait event will show
the same wait event name, right? And that would be confusing for the
user, whether we are waiting for Subtransaction or Multixact or
anything else.  Is my understanding not correct here?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
"Andrey M. Borodin"
Date:

> On 20 Nov 2023, at 13:51, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> 2) Do we really need one separate lwlock tranche for each SLRU?
>
> IMHO if we use the same lwlock tranche then the wait event will show
> the same wait event name, right? And that would be confusing for the
> user, whether we are waiting for Subtransaction or Multixact or
> anything else.  Is my understanding no correct here?

If we give a user multiple GUCs to tweak, I think we should give them a way to understand which GUC to tweak when they
observe wait times.

Best regards, Andrey Borodin.


Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Dilip Kumar
Date:
On Mon, Nov 20, 2023 at 2:37 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:

> > On 20 Nov 2023, at 13:51, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > 2) Do we really need one separate lwlock tranche for each SLRU?
> >
> > IMHO if we use the same lwlock tranche then the wait event will show
> > the same wait event name, right? And that would be confusing for the
> > user, whether we are waiting for Subtransaction or Multixact or
> > anything else.  Is my understanding no correct here?
>
> If we give to a user multiple GUCs to tweak, I think we should give a way to understand which GUC to tweak when they
observe wait times.

+1

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Dilip Kumar
Date:
On Mon, Nov 20, 2023 at 4:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Nov 20, 2023 at 2:37 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>
> > > On 20 Nov 2023, at 13:51, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > 2) Do we really need one separate lwlock tranche for each SLRU?
> > >
> > > IMHO if we use the same lwlock tranche then the wait event will show
> > > the same wait event name, right? And that would be confusing for the
> > > user, whether we are waiting for Subtransaction or Multixact or
> > > anything else.  Is my understanding no correct here?
> >
> > If we give to a user multiple GUCs to tweak, I think we should give a way to understand which GUC to tweak when
they observe wait times.

PFA, the updated patch set; I have worked on the review comments by
Alvaro and Andrey.  So the only open comment is about clog group commit
testing, and for that my question, as I sent in the previous email, is
exactly what part we are worried about in the coverage report.

The second point is that if we want to generate a group update, we will
have to create the injection point after we hold the control lock, so
that other processes go for the group update.  Then, to wake up the
waiting process that is holding the SLRU control lock in exclusive
mode, we would need to call a function ('test_injection_points_wake()'),
and for calling that function we would need to acquire the SLRU lock
again in read mode for the visibility check in the catalog while
fetching the procedure row.  Now this wake-up session will block on the
control lock held by the session that is waiting on the injection
point, so it will create a deadlock.  Maybe with bank-wise locks we can
create a lot of transactions so that these two fall in different banks
and then we can somehow test this, but then we will have to generate
16 * 4096 = 64k transactions so that the SLRU bank for the transaction
that inserted the procedure row into the system table is different from
the bank for the transaction in which we are trying to do the group
commit.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Dilip Kumar
Date:
On Tue, Nov 21, 2023 at 2:03 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Nov 20, 2023 at 4:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Nov 20, 2023 at 2:37 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
> >
> > > > On 20 Nov 2023, at 13:51, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > 2) Do we really need one separate lwlock tranche for each SLRU?
> > > >
> > > > IMHO if we use the same lwlock tranche then the wait event will show
> > > > the same wait event name, right? And that would be confusing for the
> > > > user, whether we are waiting for Subtransaction or Multixact or
> > > > anything else.  Is my understanding no correct here?
> > >
> > > If we give to a user multiple GUCs to tweak, I think we should give a way to understand which GUC to tweak when
they observe wait times.
>
> PFA, updated patch set, I have worked on review comments by Alvaro and
> Andrey.  So the only open comments are about clog group commit
> testing, for that my question was as I sent in the previous email
> exactly what part we are worried about in the coverage report?
>
> The second point is, if we want to generate a group update we will
> have to create the injection point after we hold the control lock so
> that other processes go for group update and then for waking up the
> waiting process who is holding the SLRU control lock in the exclusive
> mode we would need to call a function ('test_injection_points_wake()')
> to wake that up and for calling the function we would need to again
> acquire the SLRU lock in read mode for visibility check in the catalog
> for fetching the procedure row and now this wake up session will block
> on control lock for the session which is waiting on injection point so
> now it will create a deadlock.   Maybe with bank-wise lock we can
> create a lot of transaction so that these 2 falls in different banks
> and then we can somehow test this, but then we will have to generate
> 16 * 4096 = 64k transaction so that the SLRU banks are different for
> the transaction which inserted procedure row in system table from the
> transaction in which we are trying to do the group commit

I have attached a POC patch for testing the group update using the
injection point framework.  This is just for testing the group update
part and is not yet a committable test.  I have added a bunch of logs
in the code so that we can see what's going on with the group update.
From the below logs, we can see that multiple processes are getting
accumulated for the group update and the leader is updating their xid
status.


Note: With this testing, we found a bug in the bank-wise approach:
basically, we were clearing procglobal->clogGroupFirst even before
acquiring the bank lock, which means that in most cases there would be
a single process in each group acting as the group leader (I think this
is what Alvaro was pointing at in his coverage report).  I have added
this fix to this POC just for testing purposes, but in my next version
I will add the fix to the proper patch after a proper review and a bit
more testing.


here is the output after running the test
==============
2023-11-23 05:55:29.399 UTC [93367] 003_clog_group_commit.pl LOG:
procno 6 got the lock
2023-11-23 05:55:29.399 UTC [93367] 003_clog_group_commit.pl
STATEMENT:  SELECT txid_current();
2023-11-23 05:55:29.406 UTC [93369] 003_clog_group_commit.pl LOG:
statement: SELECT test_injection_points_attach('ClogGroupCommit',
'wait');
2023-11-23 05:55:29.415 UTC [93371] 003_clog_group_commit.pl LOG:
statement: INSERT INTO test VALUES(1);
2023-11-23 05:55:29.416 UTC [93371] 003_clog_group_commit.pl LOG:
procno 4 got the lock
2023-11-23 05:55:29.416 UTC [93371] 003_clog_group_commit.pl
STATEMENT:  INSERT INTO test VALUES(1);
2023-11-23 05:55:29.424 UTC [93373] 003_clog_group_commit.pl LOG:
statement: INSERT INTO test VALUES(2);
2023-11-23 05:55:29.425 UTC [93373] 003_clog_group_commit.pl LOG:
procno 3 for xid 128742 added for group update
2023-11-23 05:55:29.425 UTC [93373] 003_clog_group_commit.pl
STATEMENT:  INSERT INTO test VALUES(2);
2023-11-23 05:55:29.431 UTC [93376] 003_clog_group_commit.pl LOG:
statement: INSERT INTO test VALUES(3);
2023-11-23 05:55:29.438 UTC [93378] 003_clog_group_commit.pl LOG:
statement: INSERT INTO test VALUES(4);
2023-11-23 05:55:29.438 UTC [93376] 003_clog_group_commit.pl LOG:
procno 2 for xid 128743 added for group update
2023-11-23 05:55:29.438 UTC [93376] 003_clog_group_commit.pl
STATEMENT:  INSERT INTO test VALUES(3);
2023-11-23 05:55:29.438 UTC [93376] 003_clog_group_commit.pl LOG:
procno 2 is follower and wait for group leader to update commit status
of xid 128743
2023-11-23 05:55:29.438 UTC [93376] 003_clog_group_commit.pl
STATEMENT:  INSERT INTO test VALUES(3);
2023-11-23 05:55:29.439 UTC [93378] 003_clog_group_commit.pl LOG:
procno 1 for xid 128744 added for group update
2023-11-23 05:55:29.439 UTC [93378] 003_clog_group_commit.pl
STATEMENT:  INSERT INTO test VALUES(4);
2023-11-23 05:55:29.439 UTC [93378] 003_clog_group_commit.pl LOG:
procno 1 is follower and wait for group leader to update commit status
of xid 128744
2023-11-23 05:55:29.439 UTC [93378] 003_clog_group_commit.pl
STATEMENT:  INSERT INTO test VALUES(4);
2023-11-23 05:55:29.445 UTC [93380] 003_clog_group_commit.pl LOG:
statement: INSERT INTO test VALUES(5);
2023-11-23 05:55:29.446 UTC [93380] 003_clog_group_commit.pl LOG:
procno 0 for xid 128745 added for group update
2023-11-23 05:55:29.446 UTC [93380] 003_clog_group_commit.pl
STATEMENT:  INSERT INTO test VALUES(5);
2023-11-23 05:55:29.446 UTC [93380] 003_clog_group_commit.pl LOG:
procno 0 is follower and wait for group leader to update commit status
of xid 128745
2023-11-23 05:55:29.446 UTC [93380] 003_clog_group_commit.pl
STATEMENT:  INSERT INTO test VALUES(5);
2023-11-23 05:55:29.451 UTC [93382] 003_clog_group_commit.pl LOG:
statement: SELECT test_injection_points_wake();
2023-11-23 05:55:29.460 UTC [93384] 003_clog_group_commit.pl LOG:
statement: SELECT test_injection_points_detach('ClogGroupCommit');

=============

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Dilip Kumar
Date:
On Thu, Nov 23, 2023 at 11:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Note: With this testing, we have found a bug in the bank-wise
> approach, basically we are clearing a procglobal->clogGroupFirst, even
> before acquiring the bank lock that means in most of the cases there
> will be a single process in each group as a group leader

I realized that the bug fix I have done is not proper, so I will send
the updated patch set with the proper fix soon.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Dilip Kumar
Date:
On Fri, Nov 24, 2023 at 10:17 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Nov 23, 2023 at 11:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > Note: With this testing, we have found a bug in the bank-wise
> > approach, basically we are clearing a procglobal->clogGroupFirst, even
> > before acquiring the bank lock that means in most of the cases there
> > will be a single process in each group as a group leader
>
> I realized that the bug fix I have done is not proper, so will send
> the updated patch set with the proper fix soon.

PFA, the updated patch set, which fixes the bug found during testing of
the group update using the injection point.  Also attached is a patch
to test the injection point, but for that we need to apply the
injection point patches [1]

[1] https://www.postgresql.org/message-id/ZWACtHPetBFIvP61%40paquier.xyz

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
tender wang
Date:
The v8-0001 patch failed to apply in my local repo as below:

git apply v8-0001-Make-all-SLRU-buffer-sizes-configurable.patch
error: patch failed: src/backend/access/transam/multixact.c:1851
error: src/backend/access/transam/multixact.c: patch does not apply
error: patch failed: src/backend/access/transam/subtrans.c:184
error: src/backend/access/transam/subtrans.c: patch does not apply
error: patch failed: src/backend/commands/async.c:117
error: src/backend/commands/async.c: patch does not apply
error: patch failed: src/backend/storage/lmgr/predicate.c:808
error: src/backend/storage/lmgr/predicate.c: patch does not apply
error: patch failed: src/include/commands/async.h:15
error: src/include/commands/async.h: patch does not apply

My local head commit is 15c9ac36299.  Is there something I missed?

Dilip Kumar <dilipbalaut@gmail.com> 于2023年11月24日周五 17:08写道:
On Fri, Nov 24, 2023 at 10:17 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Nov 23, 2023 at 11:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > Note: With this testing, we have found a bug in the bank-wise
> > approach, basically we are clearing a procglobal->clogGroupFirst, even
> > before acquiring the bank lock that means in most of the cases there
> > will be a single process in each group as a group leader
>
> I realized that the bug fix I have done is not proper, so will send
> the updated patch set with the proper fix soon.

PFA, updated patch set fixes the bug found during the testing of the
group update using the injection point.  Also attached a path to test
the injection point but for that, we need to apply the injection point
patches [1]

[1] https://www.postgresql.org/message-id/ZWACtHPetBFIvP61%40paquier.xyz

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Alvaro Herrera
Date:
On 2023-Nov-29, tender wang wrote:

> The v8-0001 patch failed to apply in my local repo as below:
> 
> git apply v8-0001-Make-all-SLRU-buffer-sizes-configurable.patch
> error: patch failed: src/backend/access/transam/multixact.c:1851
> error: src/backend/access/transam/multixact.c: patch does not apply
> error: patch failed: src/backend/access/transam/subtrans.c:184
> error: src/backend/access/transam/subtrans.c: patch does not apply
> error: patch failed: src/backend/commands/async.c:117
> error: src/backend/commands/async.c: patch does not apply
> error: patch failed: src/backend/storage/lmgr/predicate.c:808
> error: src/backend/storage/lmgr/predicate.c: patch does not apply
> error: patch failed: src/include/commands/async.h:15
> error: src/include/commands/async.h: patch does not apply

Yeah, this patch series conflicts with today's commit 4ed8f0913bfd.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
Syntax error: function hell() needs an argument.
Please choose what hell you want to involve.



Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Dilip Kumar
Date:
On Wed, Nov 29, 2023 at 3:29 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2023-Nov-29, tender wang wrote:
>
> > The v8-0001 patch failed to apply in my local repo as below:
> >
> > git apply v8-0001-Make-all-SLRU-buffer-sizes-configurable.patch
> > error: patch failed: src/backend/access/transam/multixact.c:1851
> > error: src/backend/access/transam/multixact.c: patch does not apply
> > error: patch failed: src/backend/access/transam/subtrans.c:184
> > error: src/backend/access/transam/subtrans.c: patch does not apply
> > error: patch failed: src/backend/commands/async.c:117
> > error: src/backend/commands/async.c: patch does not apply
> > error: patch failed: src/backend/storage/lmgr/predicate.c:808
> > error: src/backend/storage/lmgr/predicate.c: patch does not apply
> > error: patch failed: src/include/commands/async.h:15
> > error: src/include/commands/async.h: patch does not apply
>
> Yeah, this patch series conflicts with today's commit 4ed8f0913bfd.

I will send a rebased version by tomorrow.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Dilip Kumar
Date:
On Wed, Nov 29, 2023 at 4:58 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Nov 29, 2023 at 3:29 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> >
> > On 2023-Nov-29, tender wang wrote:
> >
> > > The v8-0001 patch failed to apply in my local repo as below:
> > >
> > > git apply v8-0001-Make-all-SLRU-buffer-sizes-configurable.patch
> > > error: patch failed: src/backend/access/transam/multixact.c:1851
> > > error: src/backend/access/transam/multixact.c: patch does not apply
> > > error: patch failed: src/backend/access/transam/subtrans.c:184
> > > error: src/backend/access/transam/subtrans.c: patch does not apply
> > > error: patch failed: src/backend/commands/async.c:117
> > > error: src/backend/commands/async.c: patch does not apply
> > > error: patch failed: src/backend/storage/lmgr/predicate.c:808
> > > error: src/backend/storage/lmgr/predicate.c: patch does not apply
> > > error: patch failed: src/include/commands/async.h:15
> > > error: src/include/commands/async.h: patch does not apply
> >
> > Yeah, this patch series conflicts with today's commit 4ed8f0913bfd.
>
> I will send a rebased version by tomorrow.

PFA, a rebased version of the patch set.  I have not re-attached the
injection-point test patch because a) that patch is a POC to show the
coverage and it depends on the other thread, and b) the old patch
still applies, so it doesn't need a rebase.


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Dilip Kumar
Date:
On Thu, Nov 30, 2023 at 3:30 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Nov 29, 2023 at 4:58 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Here is the updated patch based on some comments by tender wang (those
comments were sent only to me)

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Alvaro Herrera
Date:
[Added Andrey again in CC, because as I understand they are using this
code or something like it in production.  Please don't randomly remove
people from CC lists.]

I've been looking at this some more, and I'm not confident that the
group clog update stuff is correct.  I think the injection points test
case was good enough to discover a problem, but it's hard to get peace
of mind that there aren't other, more subtle problems.

The problem I see is that the group update mechanism is designed around
contention of the global xact-SLRU control lock; it uses atomics to
coordinate a single queue when the lock is contended.  So if we split up
the global SLRU control lock using banks, then multiple processes using
different bank locks might not contend.  OK, this is fine, but what
happens if two separate groups of processes encounter contention on two
different bank locks?  I think they will both try to update the same
queue, and coordinate access to that *using different bank locks*.  I
don't see how this can work correctly.

I suspect the first part of that algorithm, where atomics are used to
create the list without a lock, might work fine.  But will each "leader"
process, each of which is potentially using a different bank lock,
coordinate correctly?  Maybe this works correctly because only one
process will find the queue head not empty?  If this is what happens,
then there needs to be comments about it.  Without any explanation,
this seems broken and potentially dangerous, as some transaction commit
bits might become lost given high enough concurrency and bad luck.

Maybe this can be made to work by having one more lwlock that we use
solely to coordinate this task.  Though we would have to demonstrate
that coordinating this task with a different lock works correctly in
conjunction with the per-bank lwlock usage in the regular slru.c paths.


Andrey, do you have any stress tests or anything else that you used to
gain confidence in this code?

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"El sabio habla porque tiene algo que decir;
el tonto, porque tiene que decir algo" (Platon).



Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Dilip Kumar
Date:
On Tue, Dec 12, 2023 at 6:58 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> [Added Andrey again in CC, because as I understand they are using this
> code or something like it in production.  Please don't randomly remove
> people from CC lists.]

Oh, glad to know that.  Yeah, I generally do not remove but I have
noticed that in the mail chain, some of the reviewers just replied to
me and the hackers' list, and from that point onwards I lost track of
the CC list.

> I've been looking at this some more, and I'm not confident that the
> group clog update stuff is correct.  I think the injection points test
> case was good enough to discover a problem, but it's hard to get peace
> of mind that there aren't other, more subtle problems.

Yeah, I agree.

> The problem I see is that the group update mechanism is designed around
> contention of the global xact-SLRU control lock; it uses atomics to
> coordinate a single queue when the lock is contended.  So if we split up
> the global SLRU control lock using banks, then multiple processes using
> different bank locks might not contend.  OK, this is fine, but what
> happens if two separate groups of processes encounter contention on two
> different bank locks?  I think they will both try to update the same
> queue, and coordinate access to that *using different bank locks*.  I
> don't see how this can work correctly.

Let's back up a bit and start from the current design with the
centralized lock.  With that, if one process is holding the lock, the
other processes will try to perform a group update.  If there is
already a group that hasn't yet got the lock but is trying to update a
different CLOG page than the one this process wants to update, then
the process will not add itself to that group; instead it will fall
back to the normal lock wait.  In another situation, it may happen
that the leader of the other group has already got the control lock,
in which case it would have cleared 'procglobal->clogGroupFirst',
which means we will now start forming a different group.  So, talking
only about the optimization part, the assumption is that the xids
being committed concurrently at any given time mostly fall in the same
range and therefore on the same SLRU page, so the group update helps
them.  But if we get some out-of-range xid from a long-running
transaction, it might not even go for the group update, as its page
number will be different.  That situation might actually be better
with a bank-wise lock, because if those xids fall in an altogether
different bank they might not even contend.

Now, about correctness: even though the processes might be contending
on different bank locks, we still ensure that within a group all the
processes are trying to update the same SLRU page (and hence the same
bank as well; the exception is discussed later [1]).  One of the
processes becomes the leader, and as soon as the leader gets the lock
it detaches the queue from 'procglobal->clogGroupFirst' by setting it
to INVALID_PGPROCNO, so that other group-update requesters can form
another parallel group.  Here I do not see a problem with correctness.

I agree someone might object that, since different groups can now form
for different bank locks, no other group can start forming until we
get the lock for our bank, because we do not clear
'procglobal->clogGroupFirst' before we get the lock.  Other requesters
might want to update pages in different banks, so why block them?  But
the group update design is optimized for the case where all requesters
are trying to update the status of xids generated in roughly the same
range.


> I suspect the first part of that algorithm, where atomics are used to
> create the list without a lock, might work fine.  But will each "leader"
> process, each of which is potentially using a different bank lock,
> coordinate correctly?  Maybe this works correctly because only one
> process will find the queue head not empty?  If this is what happens,
> then there needs to be comments about it.

Yes, you got it right, I will try to comment on it better.

> Without any explanation,
> this seems broken and potentially dangerous, as some transaction commit
> bits might become lost given high enough concurrency and bad luck.
>
> Maybe this can be made to work by having one more lwlock that we use
> solely to coordinate this task.

Do you mean a separate lock for adding/removing entries in the list,
instead of the atomic operations?  I think we would then lose the
benefit we get from the group update, by having contention on another
lock.

[1] I think we already know about the exception case, and I have
explained in the comments as well that in some cases we might add
update requests for different clog pages to the same group.  To handle
that exceptional case we check the respective bank lock for each page,
and if the exception occurs we release the old bank lock and acquire
the new one.  This case might not be performant, because after getting
the lock the leader might need to wait again on another bank lock, but
it is an extremely exceptional case, so we should not worry about its
performance, and I do not see any correctness issue here either.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
"Andrey M. Borodin"
Date:

> On 12 Dec 2023, at 18:28, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> Andrey, do you have any stress tests or anything else that you used to
> gain confidence in this code?

We are using only the first two steps of the patchset; these steps do not touch the locking stuff.

We’ve got some confidence after Yura Sokolov’s benchmarks [0]. Thanks!


Best regards, Andrey Borodin.

[0]
https://www.postgresql.org/message-id/flat/e46cdea96979545b2d8a13b451d8b1ce61dc7238.camel%40postgrespro.ru#0ed2cad11470825d464093fe6b8ef6a3



Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Amul Sul
Date:
On Mon, Dec 11, 2023 at 10:42 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Thu, Nov 30, 2023 at 3:30 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, Nov 29, 2023 at 4:58 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Here is the updated patch based on some comments by tender wang (those
> comments were sent only to me)

few nitpicks:

+
+   /*  
+    * Mask for slotno banks, considering 1GB SLRU buffer pool size and the
+    * SLRU_BANK_SIZE bits16 should be sufficient for the bank mask.
+    */  
+   bits16      bank_mask;
 } SlruCtlData;

...
...

+ int bankno = pageno & ctl->bank_mask;

I am a bit uncomfortable seeing it as a mask, why can't it be simply a number
of banks (num_banks) and get the bank number through modulus op (pageno %
num_banks) instead of bitwise & operation (pageno & ctl->bank_mask) which is a
bit difficult to read compared to modulus op which is quite simple,
straightforward and much common practice in hashing.

Are there any advantages of using &  over % ?

Also, a few places in 0002 and 0003 patch, need the bank number, it is better
to have a macro for that.
---

 extern bool SlruScanDirCbDeleteAll(SlruCtl ctl, char *filename, int64 segpage,
                                   void *data);
-
+extern bool check_slru_buffers(const char *name, int *newval);
 #endif                         /* SLRU_H */


Add an empty line after the declaration, in 0002 patch.
---

-TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn, int slotno)
+TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn,
+                         int slotno)

Unrelated change for 0003 patch. 
---

Regards,
Amul 

Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Dilip Kumar
Date:
On Thu, Dec 14, 2023 at 8:43 AM Amul Sul <sulamul@gmail.com> wrote:
>
> On Mon, Dec 11, 2023 at 10:42 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>
>> On Thu, Nov 30, 2023 at 3:30 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> >
>> > On Wed, Nov 29, 2023 at 4:58 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>
>> Here is the updated patch based on some comments by tender wang (those
>> comments were sent only to me)
>
>
> few nitpicks:
>
> +
> +   /*
> +    * Mask for slotno banks, considering 1GB SLRU buffer pool size and the
> +    * SLRU_BANK_SIZE bits16 should be sufficient for the bank mask.
> +    */
> +   bits16      bank_mask;
>  } SlruCtlData;
>
> ...
> ...
>
> + int bankno = pageno & ctl->bank_mask;
>
> I am a bit uncomfortable seeing it as a mask, why can't it be simply a number
> of banks (num_banks) and get the bank number through modulus op (pageno %
> num_banks) instead of bitwise & operation (pageno & ctl->bank_mask) which is a
> bit difficult to read compared to modulus op which is quite simple,
> straightforward and much common practice in hashing.
>
> Are there any advantages of using &  over % ?

I am not sure either but since this change in 0002 is by Andrey, I
will let him comment on this before we change or take any decision.

> Also, a few places in 0002 and 0003 patch, need the bank number, it is better
> to have a macro for that.
> ---
>
>  extern bool SlruScanDirCbDeleteAll(SlruCtl ctl, char *filename, int64 segpage,
>                                    void *data);
> -
> +extern bool check_slru_buffers(const char *name, int *newval);
>  #endif                         /* SLRU_H */
>
>
> Add an empty line after the declaration, in 0002 patch.
> ---
>
> -TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn, int slotno)
> +TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn,
> +                         int slotno)
>
> Unrelated change for 0003 patch.

 Fixed

Thanks for your review, PFA updated version.

I have added @Amit Kapila to the list to get his opinion on whether
anything can break in the clog group update with our change to
bank-wise SLRU locks.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
"Andrey M. Borodin"
Date:

> On 14 Dec 2023, at 08:12, Amul Sul <sulamul@gmail.com> wrote:
>
>
> + int bankno = pageno & ctl->bank_mask;
>
> I am a bit uncomfortable seeing it as a mask, why can't it be simply a number
> of banks (num_banks) and get the bank number through modulus op (pageno %
> num_banks) instead of bitwise & operation (pageno & ctl->bank_mask) which is a
> bit difficult to read compared to modulus op which is quite simple,
> straightforward and much common practice in hashing.
>
> Are there any advantages of using &  over % ?

The instruction AND is ~20 times faster than IDIV [0]. This is a relatively hot function, so it is worth sacrificing
some readability to save ~ten nanoseconds on each check of a transaction's status.


[0] https://www.agner.org/optimize/instruction_tables.pdf




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
tender wang
Date:


Andrey M. Borodin <x4mmm@yandex-team.ru> wrote on Thursday, December 14, 2023 at 17:02:


> On 14 Dec 2023, at 08:12, Amul Sul <sulamul@gmail.com> wrote:
>
>
> + int bankno = pageno & ctl->bank_mask;
>
> I am a bit uncomfortable seeing it as a mask, why can't it be simply a number
> of banks (num_banks) and get the bank number through modulus op (pageno %
> num_banks) instead of bitwise & operation (pageno & ctl->bank_mask) which is a
> bit difficult to read compared to modulus op which is quite simple,
> straightforward and much common practice in hashing.
>
> Are there any advantages of using &  over % ?
 
Using the Compiler Explorer [1] tool, '%' produces more assembly instructions than '&'.
int GetBankno1(int pageno) {
    return pageno & 127;
}

int GetBankno2(int pageno) {
    return pageno % 127;
}
under clang 13.0
GetBankno1:                             # @GetBankno1
        push    rbp
        mov     rbp, rsp
        mov     dword ptr [rbp - 4], edi
        mov     eax, dword ptr [rbp - 4]
        and     eax, 127
        pop     rbp
        ret
GetBankno2:                             # @GetBankno2
        push    rbp
        mov     rbp, rsp
        mov     dword ptr [rbp - 4], edi
        mov     eax, dword ptr [rbp - 4]
        mov     ecx, 127
        cdq
        idiv    ecx
        mov     eax, edx
        pop     rbp
        ret
under gcc 13.2
GetBankno1:
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-4], edi
        mov     eax, DWORD PTR [rbp-4]
        and     eax, 127
        pop     rbp
        ret
GetBankno2:
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-4], edi
        mov     eax, DWORD PTR [rbp-4]
        movsx   rdx, eax
        imul    rdx, rdx, -2130574327
        shr     rdx, 32
        add     edx, eax
        mov     ecx, edx
        sar     ecx, 6
        cdq
        sub     ecx, edx
        mov     edx, ecx
        sal     edx, 7
        sub     edx, ecx
        sub     eax, edx
        mov     ecx, eax
        mov     eax, ecx
        pop     rbp
        ret



> The instruction AND is ~20 times faster than IDIV [0]. This is a relatively hot function, so it is worth sacrificing some readability to save ~ten nanoseconds on each check of a transaction's status.

  Now that AND is faster, can we replace the '% SLRU_MAX_BANKLOCKS' operation in SimpleLruGetBankLock() with '& 127'?
  SimpleLruGetBankLock()
{
       int banklockno = (pageno & ctl->bank_mask) % SLRU_MAX_BANKLOCKS;
                                                    /* use '&' here too? */
       return &(ctl->shared->bank_locks[banklockno].lock);
}
Thoughts?

[0] https://www.agner.org/optimize/instruction_tables.pdf

Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
"Andrey M. Borodin"
Date:

> On 14 Dec 2023, at 14:28, tender wang <tndrwang@gmail.com> wrote:
>
>   Now that AND is faster, can we replace the '% SLRU_MAX_BANKLOCKS' operation in SimpleLruGetBankLock() with '& 127'

unsigned int GetBankno1(unsigned int pageno) {
return pageno & 127;
}

unsigned int GetBankno2(unsigned int pageno) {
return pageno % 128;
}

Generates with -O2

GetBankno1(unsigned int):
mov eax, edi
and eax, 127
ret
GetBankno2(unsigned int):
mov eax, edi
and eax, 127
ret


The compiler is smart enough with constants.
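
For illustration, the case the patch actually has to deal with is a
runtime divisor, since the number of banks is derived from a
configurable buffer count rather than a compile-time constant.  A small
sketch (the function names are made up, and it assumes, as the mask
form implies, that the number of banks is a power of two):

/*
 * Illustrative only: with a runtime divisor, '%' still compiles to a
 * division, while a precomputed mask (nbanks - 1, nbanks a power of
 * two) stays a single AND.
 */
unsigned int
GetBanknoByMod(unsigned int pageno, unsigned int nbanks)
{
    return pageno % nbanks;        /* compiles to a divide */
}

unsigned int
GetBanknoByMask(unsigned int pageno, unsigned int bank_mask)
{
    return pageno & bank_mask;     /* bank_mask = nbanks - 1 */
}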


Best regards, Andrey Borodin.


Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Dilip Kumar
Date:
On Wed, Dec 13, 2023 at 5:49 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:

> > On 12 Dec 2023, at 18:28, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> >
> > Andrey, do you have any stress tests or anything else that you used to
> > gain confidence in this code?
>
> We are using only first two steps of the patchset, these steps do not touch locking stuff.
>
> We’ve got some confidence after Yura Sokolov’s benchmarks [0]. Thanks!
>

I have run this test [1].  Instead of comparing against master, I have
compared the effect of patch-1 (0001+0002, the SLRU buffer banks) vs
patch-2 (0001+0002+0003, the SLRU buffer banks + bank-wise locks); here
are the results of benchmark-1 and benchmark-2.  I have noticed a very
good improvement with the addition of patch 0003.

Machine information:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127

configurations:

max_wal_size=20GB
shared_buffers=20GB
checkpoint_timeout=40min
max_connections=700
maintenance_work_mem=1GB

subtrans_buffers=$variable
multixact_offsets_buffers=$variable
multixact_members_buffers=$variable

benchmark-1

version | subtrans | multixact | tps
        | buffers  | offs/memb | func + ballast
--------+----------+-----------+---------------
patch-1 |       64 |    64/128 |  87 + 58
patch-2 |       64 |    64/128 | 128 + 83
patch-1 |     1024 |  512/1024 |  96 + 64
patch-2 |     1024 |  512/1024 | 163 + 108

benchmark-2

version | subtrans | multixact | tps
        | buffers  | offs/memb | func
--------+----------+-----------+------
patch-1 |       64 |    64/128 |  10
patch-2 |       64 |    64/128 |  12
patch-1 |     1024 |  512/1024 |  44
patch-2 |     1024 |  512/1024 |  72


[1]
https://www.postgresql.org/message-id/flat/e46cdea96979545b2d8a13b451d8b1ce61dc7238.camel%40postgrespro.ru#0ed2cad11470825d464093fe6b8ef6a3


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
"Andrey M. Borodin"
Date:

> On 14 Dec 2023, at 16:06, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
>  I have noticed
> a very good improvement with the addition of patch 0003.

Indeed, very impressive results! It’s almost 2x the performance in the high-contention scenario, on top of the
previous improvements.

Best regards, Andrey Borodin.


Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
tender wang
Date:


Andrey M. Borodin <x4mmm@yandex-team.ru> wrote on Thursday, December 14, 2023 at 17:35:


> On 14 Dec 2023, at 14:28, tender wang <tndrwang@gmail.com> wrote:
>
>   Now that AND is faster, can we replace the '% SLRU_MAX_BANKLOCKS' operation in SimpleLruGetBankLock() with '& 127'

unsigned int GetBankno1(unsigned int pageno) {
return pageno & 127;
}

unsigned int GetBankno2(unsigned int pageno) {
return pageno % 128;
}

Generates with -O2

GetBankno1(unsigned int):
mov eax, edi
and eax, 127
ret
GetBankno2(unsigned int):
mov eax, edi
and eax, 127
ret


Compiler is smart enough with constants.

Yeah, that's true.

int GetBankno(long pageno) {
    unsigned short bank_mask = 128;
    int bankno = (pageno & bank_mask) % 128;
    return bankno;
}
With -O2 enabled, only one instruction:
xor     eax, eax

But if we use '%' in both places, things change as below:
int GetBankno(long pageno) {
    unsigned short bank_mask = 128;
    int bankno = (pageno % bank_mask) % 128;
    return bankno;
}
        mov     rdx, rdi
        sar     rdx, 63
        shr     rdx, 57
        lea     rax, [rdi+rdx]
        and     eax, 127
        sub     eax, edx


Best regards, Andrey Borodin.

Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
"Andrey M. Borodin"
Date:

> On 14 Dec 2023, at 16:32, tender wang <tndrwang@gmail.com> wrote:
>
> enable -O2, only one  instruction:
> xor     eax, eax

This is not fast code.  This is the friendly C compiler hinting that the mask must be 127, not 128: with a mask of 128
only one bit survives the AND, so (pageno & 128) % 128 is always zero, and the whole function folds to 'xor eax, eax'.


Best regards, Andrey Borodin.


Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Dilip Kumar
Date:
On Thu, Dec 14, 2023 at 4:36 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Dec 13, 2023 at 5:49 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>
> > > On 12 Dec 2023, at 18:28, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > >
> > > Andrey, do you have any stress tests or anything else that you used to
> > > gain confidence in this code?
> >

I have done some more testing of the clog group update.  The attached
test file executes two concurrent scripts with pgbench: the first is a
slow script that runs 10-second-long transactions, and the second is a
very fast script doing ~10000 transactions per second.  Along with
that, I have also changed the bank size so that each bank contains
just 1 page, i.e. 32k transactions per bank.  I did it this way so
that we do not need to keep the long-running transactions open for
very long in order to get transactions from different banks committed
around the same time.  With this test I did get that behavior, and the
logs below show that multiple xid ranges that live in different SLRU
banks (considering 32k transactions per bank) do group updates at the
same time.  E.g. in the logs below, we can see the xid ranges around
70600, 70548, and 70558, and the xid ranges around 755 and 752,
getting group updates from different leaders at nearly the same time.

It runs fine for a long duration, but I am not sure how to validate
the sanity of this kind of test.

2023-12-14 14:43:31.813 GMT [3306] LOG:  group leader procno 606
updated status of procno 606 xid 70600
2023-12-14 14:43:31.816 GMT [3326] LOG:  procno 586 for xid 70548
added for group update
2023-12-14 14:43:31.816 GMT [3326] LOG:  procno 586 is group leader
and got the lock
2023-12-14 14:43:31.816 GMT [3326] LOG:  group leader procno 586
updated status of procno 586 xid 70548
2023-12-14 14:43:31.818 GMT [3327] LOG:  procno 585 for xid 70558
added for group update
2023-12-14 14:43:31.818 GMT [3327] LOG:  procno 585 is group leader
and got the lock
2023-12-14 14:43:31.818 GMT [3327] LOG:  group leader procno 585
updated status of procno 585 xid 70558
2023-12-14 14:43:31.829 GMT [3155] LOG:  procno 687 for xid 752 added
for group update
2023-12-14 14:43:31.829 GMT [3207] LOG:  procno 669 for xid 755 added
for group update
2023-12-14 14:43:31.829 GMT [3155] LOG:  procno 687 is group leader
and got the lock
2023-12-14 14:43:31.829 GMT [3155] LOG:  group leader procno 687
updated status of procno 669 xid 755
2023-12-14 14:43:31.829 GMT [3155] LOG:  group leader procno 687
updated status of procno 687 xid 752
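
For reference, the 32k-transactions-per-bank figure above follows from
the clog page layout (2 status bits per xid, so 4 xids per byte, and
one BLCKSZ-sized page).  A quick sketch of the arithmetic, assuming the
default BLCKSZ; the macro names mirror the ones in clog.c:

#include <stdio.h>

/* One page per bank == 32k transactions per bank. */
#define BLCKSZ               8192
#define CLOG_BITS_PER_XACT   2
#define CLOG_XACTS_PER_BYTE  (8 / CLOG_BITS_PER_XACT)        /* 4 */
#define CLOG_XACTS_PER_PAGE  (BLCKSZ * CLOG_XACTS_PER_BYTE)  /* 32768 */

int
main(void)
{
    printf("xids per clog page: %d\n", CLOG_XACTS_PER_PAGE);
    return 0;
}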


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Dilip Kumar
Date:
On Thu, Dec 14, 2023 at 1:53 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Dec 14, 2023 at 8:43 AM Amul Sul <sulamul@gmail.com> wrote:
> >
> > On Mon, Dec 11, 2023 at 10:42 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >>
> >> On Thu, Nov 30, 2023 at 3:30 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >> >
> >> > On Wed, Nov 29, 2023 at 4:58 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >>
> >> Here is the updated patch based on some comments by tender wang (those
> >> comments were sent only to me)
> >
> >
> > few nitpicks:
> >
> > +
> > +   /*
> > +    * Mask for slotno banks, considering 1GB SLRU buffer pool size and the
> > +    * SLRU_BANK_SIZE bits16 should be sufficient for the bank mask.
> > +    */
> > +   bits16      bank_mask;
> >  } SlruCtlData;
> >
> > ...
> > ...
> >
> > + int bankno = pageno & ctl->bank_mask;
> >
> > I am a bit uncomfortable seeing it as a mask, why can't it be simply a number
> > of banks (num_banks) and get the bank number through modulus op (pageno %
> > num_banks) instead of bitwise & operation (pageno & ctl->bank_mask) which is a
> > bit difficult to read compared to modulus op which is quite simple,
> > straightforward and much common practice in hashing.
> >
> > Are there any advantages of using &  over % ?
>
> I am not sure either but since this change in 0002 is by Andrey, I
> will let him comment on this before we change or take any decision.
>
> > Also, a few places in 0002 and 0003 patch, need the bank number, it is better
> > to have a macro for that.
> > ---
> >
> >  extern bool SlruScanDirCbDeleteAll(SlruCtl ctl, char *filename, int64 segpage,
> >                                    void *data);
> > -
> > +extern bool check_slru_buffers(const char *name, int *newval);
> >  #endif                         /* SLRU_H */
> >
> >
> > Add an empty line after the declaration, in 0002 patch.
> > ---
> >
> > -TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn, int slotno)
> > +TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn,
> > +                         int slotno)
> >
> > Unrelated change for 0003 patch.
>
>  Fixed
>
> Thanks for your review, PFA updated version.
>
> I have added @Amit Kapila to the list to view his opinion about
> whether anything can break in the clog group update with our changes
> of bank-wise SLRU lock.

Updated the comments about group commit safety based on the off-list
suggestion by Alvaro.


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Robert Haas
Date:
On Tue, Dec 12, 2023 at 8:29 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> The problem I see is that the group update mechanism is designed around
> contention of the global xact-SLRU control lock; it uses atomics to
> coordinate a single queue when the lock is contended.  So if we split up
> the global SLRU control lock using banks, then multiple processes using
> different bank locks might not contend.  OK, this is fine, but what
> happens if two separate groups of processes encounter contention on two
> different bank locks?  I think they will both try to update the same
> queue, and coordinate access to that *using different bank locks*.  I
> don't see how this can work correctly.
>
> I suspect the first part of that algorithm, where atomics are used to
> create the list without a lock, might work fine.  But will each "leader"
> process, each of which is potentially using a different bank lock,
> coordinate correctly?  Maybe this works correctly because only one
> process will find the queue head not empty?  If this is what happens,
> then there needs to be comments about it.  Without any explanation,
> this seems broken and potentially dangerous, as some transaction commit
> bits might become lost given high enough concurrency and bad luck.

I don't want to be dismissive of this concern, but I took a look at
this part of the patch set and I don't see a correctness problem. I
think the idea of the mechanism is that we have a single linked list
in shared memory that can accumulate those waiters. At some point a
process pops the entire list of waiters, which allows a new group of
waiters to begin accumulating. The process that pops the list must
perform the updates for every process in the just-popped list without
failing, else updates would be lost. In theory, there can be any
number of lists that some process has popped and is currently working
its way through at the same time, although in practice I bet it's
quite rare for there to be more than one. The only correctness problem
is if it's possible for a process that popped the list to error out
before it finishes doing the updates that it "promised" to do by
popping the list.

Having individual bank locks doesn't really change anything here.
That's just a matter of what lock has to be held in order to perform
the update that was promised, and the algorithm described in the
previous paragraph doesn't really care about that. Nor is there a
deadlock hazard here as long as processes only take one lock at a
time, which I believe is the case. The only potential issue that I see
here is with performance. I've heard some questions about whether this
machinery performs well even as it stands, but certainly if we divide
up the lock into a bunch of bankwise locks then that should tend in
the direction of making a mechanism like this less valuable, because
both mechanisms are trying to alleviate contention, and so in a
certain sense they are competing for the same job. However, they do
aim to alleviate different TYPES of contention: the group XID update
stuff should be most valuable when lots of processes are trying to
update the same page, and the banks should be most valuable when there
is simultaneous access to a bunch of different pages. So I'm not
convinced that this patch is a reason to remove the group XID update
mechanism, but someone might argue otherwise.

A related concern is that, if by chance we do end up with multiple
updaters from different pages in the same group, it will now be more
expensive to sort that out because we'll have to potentially keep
switching locks. So that could make the group XID update mechanism
less performant than it is currently. I think that's probably not an
issue because I think it should be a rare occurrence, as the comments
say. A bit more cost in a code path that is almost never taken won't
matter. But if that path is more commonly taken than I think, then
maybe making it more expensive could hurt. It might be worth adding
some debugging to see how often we actually go down that path in a
highly stressed system.

BTW:

+        * as well.  The main reason for now allowing requesters of
different pages

now -> not

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Robert Haas
Date:
On Mon, Dec 18, 2023 at 12:04 PM Robert Haas <robertmhaas@gmail.com> wrote:
> certain sense they are competing for the same job. However, they do
> aim to alleviate different TYPES of contention: the group XID update
> stuff should be most valuable when lots of processes are trying to
> update the same page, and the banks should be most valuable when there
> is simultaneous access to a bunch of different pages. So I'm not
> convinced that this patch is a reason to remove the group XID update
> mechanism, but someone might argue otherwise.

Hmm, but, on the other hand:

Currently all readers and writers are competing for the same LWLock.
But with this change, the readers will (mostly) no longer be competing
with the writers. So, in theory, that might reduce lock contention
enough to make the group update mechanism pointless.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
"Andrey M. Borodin"
Date:

> On 18 Dec 2023, at 22:30, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Dec 18, 2023 at 12:04 PM Robert Haas <robertmhaas@gmail.com> wrote:
>> certain sense they are competing for the same job. However, they do
>> aim to alleviate different TYPES of contention: the group XID update
>> stuff should be most valuable when lots of processes are trying to
>> update the same page, and the banks should be most valuable when there
>> is simultaneous access to a bunch of different pages. So I'm not
>> convinced that this patch is a reason to remove the group XID update
>> mechanism, but someone might argue otherwise.
>
> Hmm, but, on the other hand:
>
> Currently all readers and writers are competing for the same LWLock.
> But with this change, the readers will (mostly) no longer be competing
> with the writers. So, in theory, that might reduce lock contention
> enough to make the group update mechanism pointless.

One page still accommodates 32K transaction statuses under one lock. It
feels like a lot: about 1 second's worth of transactions on a typical
installation.

When the group commit was committed, did we have a benchmark to estimate
the efficiency of this technique? Can we repeat that test again?


Best regards, Andrey Borodin.


Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Robert Haas
Date:
On Mon, Dec 18, 2023 at 12:53 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
> One page still accommodates 32K transaction statuses under one lock. It feels like a lot: about 1 second's worth of
> transactions on a typical installation.
>
> When the group commit was committed, did we have a benchmark to estimate the efficiency of this technique? Can we
> repeat that test again?

I think we did, but it might take some research to find it in the
archives. If we can, I agree that repeating it feels like a good idea.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Dilip Kumar
Date:
On Mon, Dec 18, 2023 at 11:00 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Dec 18, 2023 at 12:04 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > certain sense they are competing for the same job. However, they do
> > aim to alleviate different TYPES of contention: the group XID update
> > stuff should be most valuable when lots of processes are trying to
> > update the same page, and the banks should be most valuable when there
> > is simultaneous access to a bunch of different pages. So I'm not
> > convinced that this patch is a reason to remove the group XID update
> > mechanism, but someone might argue otherwise.
>
> Hmm, but, on the other hand:
>
> Currently all readers and writers are competing for the same LWLock.
> But with this change, the readers will (mostly) no longer be competing
> with the writers. So, in theory, that might reduce lock contention
> enough to make the group update mechanism pointless.

Thanks for your feedback.  I agree that with bank-wise locks we might
not need the group update for some of the use cases you mention, where
readers and writers are contending on the centralized lock, because in
most cases the readers will be distributed across different banks.
OTOH, for use cases where the writers' commits are the bottleneck on
the SLRU lock, like pgbench simple-update or TPC-B, we will still
benefit from the group update.  During the group update testing we saw
benefits in such a scenario [1] with high client counts.  So, as per
my understanding, by distributing the SLRU locks there are scenarios
where we will not need the group update anymore, but OTOH there are
also scenarios where we will still benefit from it.


[1] https://www.postgresql.org/message-id/CAFiTN-u-XEzhd%3DhNGW586fmQwdTy6Qy6_SXe09tNB%3DgBcVzZ_A%40mail.gmail.com



--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
"Andrey M. Borodin"
Date:


On 19 Dec 2023, at 10:34, Dilip Kumar <dilipbalaut@gmail.com> wrote:

Just a side note.
It seems like the commit log is kind of an antipattern for data contention, even when we build a super-optimized SLRU: nearby **bits** are written by different CPUs.
I think that banks and locks are a good thing. But we could also reorganize the log so that
status of transaction 0 is on a page 0 at bit offset 0
status of transaction 1 is on a page 1 at bit offset 0
status of transaction 2 is on a page 2 at bit offset 0
status of transaction 3 is on a page 3 at bit offset 0
status of transaction 4 is on a page 0 at bit offset 2
status of transaction 5 is on a page 1 at bit offset 2
status of transaction 6 is on a page 2 at bit offset 2
status of transaction 7 is on a page 3 at bit offset 2
etc...

And it would be even better if the page for a transaction's status were somehow determined by the backend id, or at least the cache line. Can we allocate a range (sizeof(cacheline)) of xids\subxids\multixacts\whatever for each backend?

This does not matter much because
0. Patch set in current thread produces robust SLRU anyway
1. One day we are going to throw away SLRU anyway


Best regards, Andrey Borodin.
On Fri, Dec 22, 2023 at 8:14 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
> Just a side note.
> It seems like the commit log is kind of an antipattern for data contention, even when we build a super-optimized SLRU:
> nearby **bits** are written by different CPUs.
> I think that banks and locks are good thing. But also we could reorganize log so that
> status of transaction 0 is on a page 0 at bit offset 0
> status of transaction 1 is on a page 1 at bit offset 0
> status of transaction 2 is on a page 2 at bit offset 0
> status of transaction 3 is on a page 3 at bit offset 0
> status of transaction 4 is on a page 0 at bit offset 2
> status of transaction 5 is on a page 1 at bit offset 2
> status of transaction 6 is on a page 2 at bit offset 2
> status of transaction 7 is on a page 3 at bit offset 2
> etc...

This is an interesting idea. A variant would be to stripe across
cachelines within the same page rather than across pages. If we do
stripe across pages as proposed here, we'd probably want to rethink
the way the SLRU is extended -- page at a time wouldn't really make
sense, but preallocating an entire file might.
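
To make the quoted layout concrete, here is a small sketch contrasting
the current contiguous mapping with the striped one.  The stripe count
of 4 and the 2 status bits per xid are illustrative assumptions, not a
proposal for actual constants, and in practice the striped offset would
have to wrap to the next group of pages once a page fills up:

#include <stdint.h>
#include <stdio.h>

#define XACTS_PER_PAGE  32768
#define NSTRIPES        4
#define BITS_PER_XACT   2

/* Current layout: consecutive xids share a page (and a cache line). */
static void
contiguous_location(uint64_t xid, uint64_t *pageno, uint64_t *bitoff)
{
    *pageno = xid / XACTS_PER_PAGE;
    *bitoff = (xid % XACTS_PER_PAGE) * BITS_PER_XACT;
}

/* Striped layout (as quoted): consecutive xids land on different pages. */
static void
striped_location(uint64_t xid, uint64_t *pageno, uint64_t *bitoff)
{
    *pageno = xid % NSTRIPES;
    *bitoff = (xid / NSTRIPES) * BITS_PER_XACT;
}

int
main(void)
{
    for (uint64_t xid = 0; xid < 8; xid++)
    {
        uint64_t cpage, cbit, spage, sbit;

        contiguous_location(xid, &cpage, &cbit);
        striped_location(xid, &spage, &sbit);
        printf("xid %llu: contiguous page %llu bit %llu, striped page %llu bit %llu\n",
               (unsigned long long) xid, (unsigned long long) cpage,
               (unsigned long long) cbit, (unsigned long long) spage,
               (unsigned long long) sbit);
    }
    return 0;
}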

> And it would be even better if the page for a transaction's status were somehow determined by the backend id, or at
> least the cache line. Can we allocate a range (sizeof(cacheline)) of xids\subxids\multixacts\whatever for each backend?

I don't understand how this could work. We need to be able to look up
transaction status by XID, not backend ID.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
"Andrey M. Borodin"
Date:

> On 2 Jan 2024, at 19:23, Robert Haas <robertmhaas@gmail.com> wrote:
>
>>
>> And it would be even better if the page for a transaction's status were somehow determined by the backend id, or at
>> least the cache line. Can we allocate a range (sizeof(cacheline)) of xids\subxids\multixacts\whatever for each backend?
>
> I don't understand how this could work. We need to be able to look up
> transaction status by XID, not backend ID.

When GetNewTransactionId() is called we can reserve 256 xids in backend-local memory. These values will be reused by
transactions or subtransactions of this backend. Here 256 == sizeof(CacheLine).
This would ensure that different backends touch different cache lines.

But this approach would dramatically increase xid consumption speed with patterns where the client reconnects after
several transactions. So we can keep unused xids in the procarray for future reuse.

I doubt we can find significant performance improvement here, because false cache line sharing cannot be _that_ bad.


Best regards, Andrey Borodin.


On Tue, Jan 2, 2024 at 1:10 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
> > On 2 Jan 2024, at 19:23, Robert Haas <robertmhaas@gmail.com> wrote:
> >> And it would be even better if the page for a transaction's status were somehow determined by the backend id, or
> >> at least the cache line. Can we allocate a range (sizeof(cacheline)) of xids\subxids\multixacts\whatever for each backend?
> >
> > I don't understand how this could work. We need to be able to look up
> > transaction status by XID, not backend ID.
>
> When GetNewTransactionId() is called we can reserve 256 xids in backend-local memory. These values will be reused by
> transactions or subtransactions of this backend. Here 256 == sizeof(CacheLine).
> This would ensure that different backends touch different cache lines.
>
> But this approach would dramatically increase xid consumption speed with patterns where the client reconnects after
> several transactions. So we can keep unused xids in the procarray for future reuse.
>
> I doubt we can find significant performance improvement here, because false cache line sharing cannot be _that_ bad.

Yeah, this seems way too complicated for what we'd potentially gain
from it. An additional problem is that the xmin horizon computation
assumes that XIDs are assigned in monotonically increasing fashion;
breaking that would be messy. And even an occasional leak of XIDs
could precipitate enough additional vacuuming to completely outweigh
any gains we could hope to achieve here.

--
Robert Haas
EDB: http://www.enterprisedb.com



On Tue, Jan 2, 2024 at 7:53 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Dec 22, 2023 at 8:14 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
> > Just a side note.
> > It seems like the commit log is kind of an antipattern for data contention, even when we build a super-optimized
> > SLRU: nearby **bits** are written by different CPUs.
> > I think that banks and locks are good thing. But also we could reorganize log so that
> > status of transaction 0 is on a page 0 at bit offset 0
> > status of transaction 1 is on a page 1 at bit offset 0
> > status of transaction 2 is on a page 2 at bit offset 0
> > status of transaction 3 is on a page 3 at bit offset 0
> > status of transaction 4 is on a page 0 at bit offset 2
> > status of transaction 5 is on a page 1 at bit offset 2
> > status of transaction 6 is on a page 2 at bit offset 2
> > status of transaction 7 is on a page 3 at bit offset 2
> > etc...
>
> This is an interesting idea. A variant would be to stripe across
> cachelines within the same page rather than across pages. If we do
> stripe across pages as proposed here, we'd probably want to rethink
> the way the SLRU is extended -- page at a time wouldn't really make
> sense, but preallocating an entire file might.

Yeah, this is indeed an interesting idea.  So I think if we are
interested in working in this direction maybe this can be submitted in
a different thread, IMHO.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Wed, Jan 3, 2024 at 12:08 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Yeah, this is indeed an interesting idea.  So I think if we are
> interested in working in this direction maybe this can be submitted in
> a different thread, IMHO.

Yeah, that's something quite different from the patch before us.

--
Robert Haas
EDB: http://www.enterprisedb.com



The more I look at TransactionGroupUpdateXidStatus, the more I think
it's broken, and while we do have some tests, I don't have confidence
that they cover all possible cases.

Or, at least, if this code is good, then it hasn't been sufficiently
explained.

If we have multiple processes trying to write bits to clog, and they are
using different banks, then the LWLockConditionalAcquire will be able to
acquire the bank lock 

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"The saddest aspect of life right now is that science gathers knowledge faster
 than society gathers wisdom."  (Isaac Asimov)



On Mon, Jan 8, 2024 at 4:55 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> The more I look at TransactionGroupUpdateXidStatus, the more I think
> it's broken, and while we do have some tests, I don't have confidence
> that they cover all possible cases.
>
> Or, at least, if this code is good, then it hasn't been sufficiently
> explained.

Do you have any thoughts about a case in which it might be broken?  Even
a vague idea of where you think it might not work as expected would
help, so that I can think in that direction too.  It is possible that I
am missing some perspective that you have in mind, and that the
comments are lacking from that point of view.

> If we have multiple processes trying to write bits to clog, and they are
> using different banks, then the LWLockConditionalAcquire will be able to
> acquire the bank lock

Do you think there is a problem with multiple processes getting the
lock?  I mean, they are modifying different CLOG pages, so that can be
done concurrently, right?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On 2024-Jan-08, Dilip Kumar wrote:

> On Mon, Jan 8, 2024 at 4:55 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> >
> > The more I look at TransactionGroupUpdateXidStatus, the more I think
> > it's broken, and while we do have some tests, I don't have confidence
> > that they cover all possible cases.
> >
> > Or, at least, if this code is good, then it hasn't been sufficiently
> > explained.
> 
> Do you have any thoughts about a case in which it might be broken?  Even
> a vague idea of where you think it might not work as expected would
> help, so that I can think in that direction too.  It is possible that I
> am missing some perspective that you have in mind, and that the
> comments are lacking from that point of view.

Eh, apologies.  This email was an unfinished draft that I had lying
around before the holidays, which I intended to discard but somehow
kept around, and just now I happened to press the wrong key combination
and it ended up being sent instead.  We had some further discussion,
after which I no longer think that there is a problem here, so please
ignore this email.

I'll come back to this patch later this week.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"El que vive para el futuro es un iluso, y el que vive para el pasado,
un imbécil" (Luis Adler, "Los tripulantes de la noche")



On Mon, Jan 8, 2024 at 9:12 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> Eh, apologies.  This email was an unfinished draft that I had laying
> around before the holidays which I intended to discard it but somehow
> kept around, and just now I happened to press the wrong key combination
> and it ended up being sent instead.  We had some further discussion,
> after which I no longer think that there is a problem here, so please
> ignore this email.
>
> I'll come back to this patch later this week.

 No problem

The patch was facing some compilation issues after some recent
commits, so I have changed it. Reported by Julien Tachoires (offlist)



--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment
On Wed, Jan 10, 2024 at 6:50 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jan 8, 2024 at 9:12 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> >
> > Eh, apologies.  This email was an unfinished draft that I had laying
> > around before the holidays which I intended to discard it but somehow
> > kept around, and just now I happened to press the wrong key combination
> > and it ended up being sent instead.  We had some further discussion,
> > after which I no longer think that there is a problem here, so please
> > ignore this email.
> >
> > I'll come back to this patch later this week.
>
>  No problem
>
> The patch was facing some compilation issues after some recent
> commits, so I have changed it. Reported by Julien Tachoires (offlist)

The last patch conflicted with some of the recent commits, so here is
the updated version of the patch.  I also noticed that the SLRU bank
lock wait-event details were missing from the wait_event_names.txt
file, so I have added those as well.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment
Here's a touched-up version of this patch.

First, PROC_GLOBAL->clogGroupFirst and SlruCtl->latest_page_number
change from being protected by locks to being atomics, but there's no
mention of whether they need any memory barriers.  Looking
at the code, I think the former doesn't need any additional barriers,
but latest_page_number is missing some, which I have added.  This
deserves another careful look.

Second and most user visibly, the GUC names seem to have been chosen
based on the source-code variables, which have never meant to be user-
visible.  So I renamed a few:

xact_buffers -> transaction_buffers
subtrans_buffers -> subtransaction_buffers
serial_buffers -> serializable_buffers
commit_ts_buffers -> commit_timestamp_buffers

(unchanged: multixact_offsets_buffers, multixact_members_buffers,
notify_buffers)

I did this explicitly trying to avoid using the term SLRU in a
user-visible manner, because what do users care?  But immediately after
doing this I realized that we already have pg_stat_slru, so maybe the
cat is already out of the bag, and so perhaps we should name these GUCs
as, say, slru_transaction_buffers?  That may make the connection between
these things a little more explicit.  (I do think we need to add
cross-links in the documentation of those GUCs to the pg_stat_slru
docs and vice-versa.)


Another thing that bothered me a bit is that we have auto-tuning for
transaction_buffers and commit_timestamp_buffers, but not for
subtransaction_buffers.  (Autotuning means you set the GUC to 0 and it
scales with shared_buffers).  I don't quite understand the reason for
the omission, so I added it for subtrans too.  I think it may make
sense to do likewise for the multixact ones too, not sure.  It doesn't
seem worth having that for pg_serial and notify.

While messing about with these GUCs I realized that the usage of the
show_hook to print what the current number is, when autotuning is used,
was bogus: SHOW would print the number of blocks for (say)
transaction_buffers, but if you asked it to print (say)
multixact_offsets_buffers, it would give a size in MB or kB.  I'm sure
such an inconsistency would bite us.  So, digging around I found that a
good way to handle this is to remove the show_hook, and instead call
SetConfigOption() at the time when the ShmemInit function is called,
with the correct number of buffers determined.  This is pretty much what
is used for XLOGbuffers, and it works correctly as far as my testing
shows.

Still with these auto-tuning GUCs, I noticed that the auto-tuning code
would continue to grow the buffer sizes with shared_buffers to
arbitrarily large values.  I added an arbitrary maximum of 1024 (8 MB),
which is much higher than the current value of 128; but if you have
(say) 30 GB of shared_buffers (not uncommon these days), do you really
need 30MB of pg_clog cache?  It seems mostly unnecessary ... and you can
still set it manually that way if you need it.  So, largely I just
rewrote those small functions completely.
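
As a sketch of the rule being described here (the /512 scaling is
borrowed from the old CLOGShmemBuffers() heuristic and the 16/1024
bounds from this discussion; the exact numbers in the patch may
differ):

/*
 * Scale the SLRU cache with shared_buffers, but never below one bank
 * (16 buffers) and never above 1024 buffers (8 MB).
 */
int
autotuned_slru_buffers(int NBuffers)   /* shared_buffers, in 8 kB blocks */
{
    int     nbuffers = NBuffers / 512;

    if (nbuffers < 16)
        nbuffers = 16;
    if (nbuffers > 1024)
        nbuffers = 1024;
    return nbuffers;
}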

I also made the SGML documentation and postgresql.conf.sample all match
what the code actually does.  The whole thing wasn't particularly
consistent.

I rewrote a bunch of code comments and moved stuff around to appear in
alphabetical order, etc.

More comment rewriting still pending.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/

Attachment
On 2024-Jan-25, Alvaro Herrera wrote:

> Here's a touched-up version of this patch.

> diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
> index 98fa6035cc..4a5e05d5e4 100644
> --- a/src/backend/storage/lmgr/lwlock.c
> +++ b/src/backend/storage/lmgr/lwlock.c
> @@ -163,6 +163,13 @@ static const char *const BuiltinTrancheNames[] = {
>      [LWTRANCHE_LAUNCHER_HASH] = "LogicalRepLauncherHash",
>      [LWTRANCHE_DSM_REGISTRY_DSA] = "DSMRegistryDSA",
>      [LWTRANCHE_DSM_REGISTRY_HASH] = "DSMRegistryHash",
> +    [LWTRANCHE_COMMITTS_SLRU] = "CommitTSSLRU",
> +    [LWTRANCHE_MULTIXACTOFFSET_SLRU] = "MultixactOffsetSLRU",
> +    [LWTRANCHE_MULTIXACTMEMBER_SLRU] = "MultixactMemberSLRU",
> +    [LWTRANCHE_NOTIFY_SLRU] = "NotifySLRU",
> +    [LWTRANCHE_SERIAL_SLRU] = "SerialSLRU"
> +    [LWTRANCHE_SUBTRANS_SLRU] = "SubtransSLRU",
> +    [LWTRANCHE_XACT_SLRU] = "XactSLRU",
>  };

Eeek.  Last minute changes ...  Fixed here.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"La primera ley de las demostraciones en vivo es: no trate de usar el sistema.
Escriba un guión que no toque nada para no causar daños." (Jakob Nielsen)

Attachment
On Thu, Jan 25, 2024 at 11:22 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Still with these auto-tuning GUCs, I noticed that the auto-tuning code
> would continue to grow the buffer sizes with shared_buffers to
> arbitrarily large values.  I added an arbitrary maximum of 1024 (8 MB),
> which is much higher than the current value of 128; but if you have
> (say) 30 GB of shared_buffers (not uncommon these days), do you really
> need 30MB of pg_clog cache?  It seems mostly unnecessary ... and you can
> still set it manually that way if you need it.  So, largely I just
> rewrote those small functions completely.

Yeah, I think that if we're going to scale with shared_buffers, it
should be capped.

--
Robert Haas
EDB: http://www.enterprisedb.com



I've continued to review this and decided that I don't like the mess
this patch proposes in order to support pg_commit_ts's deletion of all
files.  (Yes, I know that I was the one that proposed this idea. It's
still a bad one).  I'd like to change that code by removing the limit
that we can only have 128 bank locks in a SLRU.  To recap, the reason we
did this is that commit_ts sometimes wants to delete all files while
running (DeactivateCommitTs), and for this it needs to acquire all bank
locks.  Since going above the MAX_SIMUL_LWLOCKS limit is disallowed, we
added the SLRU limit making multiple banks share lwlocks.

I propose two alternative solutions:

1. The easiest is to have DeactivateCommitTs continue to hold
CommitTsLock until the end, including while SlruScanDirectory does its
thing.  This sounds terrible, but considering that this code only runs
when the module is being deactivated, I don't think it's really all that
bad in practice.  I mean, if you deactivate the commit_ts module and
then try to read commit timestamp data, you deserve to wait for a few
seconds just as a punishment for your stupidity.  AFAICT the cases where
anything is done in the module mostly check without locking that
commitTsActive is set, so we're not slowing down any critical
operations.  Also, if you don't like to be delayed for a couple of
seconds, just don't deactivate the module.

2. If we want some refinement, the other idea is to change
SlruScanDirCbDeleteAll (the callback that SlruScanDirectory uses in this
case) so that it acquires the bank lock appropriate for all the slots
used by the file that's going to be deleted.  This is OK because in the
default compilation each file only has 32 pages, so that requires
only 32 lwlocks held at once while the file is being deleted.  I think
we don't need to bother with this, but it's an option if we see the
above as unworkable for whatever reason.


The only other user of SlruScanDirCbDeleteAll is async.c (the LISTEN/
NOTIFY code), and what that does is delete all the files at server
start, where nothing is running concurrently anyway.  So whatever we do
for commit_ts, it won't affect async.c.


So, if we do away with the SLRU_MAX_BANKLOCKS idea, then we're assured
one LWLock per bank (instead of sharing some lwlocks among multiple banks),
and that makes the code simpler to reason about.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"In fact, the basic problem with Perl 5's subroutines is that they're not
crufty enough, so the cruft leaks out into user-defined code instead, by
the Conservation of Cruft Principle."  (Larry Wall, Apocalypse 6)




> On 26 Jan 2024, at 22:38, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> This is OK because in the
> default compilation each segment file only has 32 pages, so that requires
> only 32 lwlocks held at once while the file is being deleted.

Do we somehow account for the fact that different subsystems do not accumulate MAX_SIMUL_LWLOCKS together?
E.g. GiST during a split can combine 75 locks, and somehow commit_ts will be deactivated by this backend at the same
moment and add 32 locks more :)
I understand that this sounds like a fantasy scenario, i.e. that these subsystems never interfere. But it remains a
fantasy only until something like that actually happens.
If possible, I'd prefer one lock at a time, and maybe sometimes two or three with some guarantees that this is safe.
So, from my POV the first solution that you proposed seems much better to me.

Thanks for working on this!


Best regards, Andrey Borodin.


On Thu, Jan 25, 2024 at 10:03 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2024-Jan-25, Alvaro Herrera wrote:
>
> > Here's a touched-up version of this patch.
>
> > diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
> > index 98fa6035cc..4a5e05d5e4 100644
> > --- a/src/backend/storage/lmgr/lwlock.c
> > +++ b/src/backend/storage/lmgr/lwlock.c
> > @@ -163,6 +163,13 @@ static const char *const BuiltinTrancheNames[] = {
> >       [LWTRANCHE_LAUNCHER_HASH] = "LogicalRepLauncherHash",
> >       [LWTRANCHE_DSM_REGISTRY_DSA] = "DSMRegistryDSA",
> >       [LWTRANCHE_DSM_REGISTRY_HASH] = "DSMRegistryHash",
> > +     [LWTRANCHE_COMMITTS_SLRU] = "CommitTSSLRU",
> > +     [LWTRANCHE_MULTIXACTOFFSET_SLRU] = "MultixactOffsetSLRU",
> > +     [LWTRANCHE_MULTIXACTMEMBER_SLRU] = "MultixactMemberSLRU",
> > +     [LWTRANCHE_NOTIFY_SLRU] = "NotifySLRU",
> > +     [LWTRANCHE_SERIAL_SLRU] = "SerialSLRU"
> > +     [LWTRANCHE_SUBTRANS_SLRU] = "SubtransSLRU",
> > +     [LWTRANCHE_XACT_SLRU] = "XactSLRU",
> >  };
>
> Eeek.  Last minute changes ...  Fixed here.

Thank you for working on this.  There is one thing that I feel is
problematic.  We have kept the allowed values for these GUCs as
multiples of SLRU_BANK_SIZE, i.e. 16, and that's the reason the min
values were changed to 16; but in this refactoring patch you have
changed that to 8 for some of the buffers, which I think is not good.

+ {
+ {"multixact_offsets_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Sets the size of the dedicated buffer pool used for
the MultiXact offset cache."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &multixact_offsets_buffers,
+ 16, 8, SLRU_MAX_ALLOWED_BUFFERS,
+ check_multixact_offsets_buffers, NULL, NULL
+ },

Other than this patch looks good to me.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Fri, Jan 26, 2024 at 11:08 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> I've continued to review this and decided that I don't like the mess
> this patch proposes in order to support pg_commit_ts's deletion of all
> files.  (Yes, I know that I was the one that proposed this idea. It's
> still a bad one).  I'd like to change that code by removing the limit
> that we can only have 128 bank locks in a SLRU.  To recap, the reason we
> did this is that commit_ts sometimes wants to delete all files while
> running (DeactivateCommitTs), and for this it needs to acquire all bank
> locks.  Since going above the MAX_SIMUL_LWLOCKS limit is disallowed, we
> added the SLRU limit making multiple banks share lwlocks.
>
> I propose two alternative solutions:
>
> 1. The easiest is to have DeactivateCommitTs continue to hold
> CommitTsLock until the end, including while SlruScanDirectory does its
> thing.  This sounds terrible, but considering that this code only runs
> when the module is being deactivated, I don't think it's really all that
> bad in practice.  I mean, if you deactivate the commit_ts module and
> then try to read commit timestamp data, you deserve to wait for a few
> seconds just as a punishment for your stupidity.

I think this idea looks reasonable.  I agree that if we are trying to
read commit_ts after deactivating it then it's fine to make it wait.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On 2024-Jan-29, Dilip Kumar wrote:

> Thank you for working on this.  There is one thing that I feel is
> problematic.  We have kept the allowed values for these GUCs to be in
> multiple of SLRU_BANK_SIZE i.e. 16 and that's the reason the min
> values were changed to 16 but in this refactoring patch for some of
> the buffers you have changed that to 8 so I think that's not good.

Oh, absolutely, you're right.  Restored the minimum to 16.

So, here's the patchset as two pieces.  0001 converts
SlruSharedData->latest_page_number to use atomics.  I don't see any
reason to mix this in with the rest of the patch, and though it likely
won't have any performance advantage by itself (since the lock
acquisition is pretty much the same), it seems better to get it in ahead
of the rest -- I think that simplifies matters for the second patch,
which is large enough.

So, 0002 introduces the rest of the feature.  I removed the use of a
different number of bank locks than banks, and I made commit_ts hold
its lwlock for longer at truncation time, rather than forcing it to
acquire all the bank lwlocks.

The more I look at 0002, the more I notice that some comments badly need
updating, so please don't read too much into it yet.  But I wanted to
post it anyway for archives and cfbot purposes.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/

Attachment
Hah:

postgres -c lc_messages=C -c shared_buffers=$((512*17))

2024-02-01 10:48:13.548 CET [1535379] FATAL:  invalid value for parameter "transaction_buffers": 17
2024-02-01 10:48:13.548 CET [1535379] DETAIL:  "transaction_buffers" must be a multiple of 16

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
<Schwern> It does it in a really, really complicated way
<crab> why does it need to be complicated?
<Schwern> Because it's MakeMaker.



On Thu, Feb 1, 2024 at 3:19 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> Hah:
>
> postgres -c lc_messages=C -c shared_buffers=$((512*17))
>
> 2024-02-01 10:48:13.548 CET [1535379] FATAL:  invalid value for parameter "transaction_buffers": 17
> 2024-02-01 10:48:13.548 CET [1535379] DETAIL:  "transaction_buffers" must be a multiple of 16

Maybe we should resize it to the next multiple of the SLRU_BANK_SIZE
instead of giving an error?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On 2024-Feb-01, Dilip Kumar wrote:

> On Thu, Feb 1, 2024 at 3:19 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> >
> > postgres -c lc_messages=C -c shared_buffers=$((512*17))
> >
> > 2024-02-01 10:48:13.548 CET [1535379] FATAL:  invalid value for parameter "transaction_buffers": 17
> > 2024-02-01 10:48:13.548 CET [1535379] DETAIL:  "transaction_buffers" must be a multiple of 16
> 
> Maybe we should resize it to the next multiple of the SLRU_BANK_SIZE
> instead of giving an error?

Since this is the auto-tuning feature, I think it should use the
previous multiple rather than the next, but yeah, something like that.
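Something along these lines, as a minimal sketch (the helper name is
hypothetical, not from the patch): round the computed value down to the
previous multiple of SLRU_BANK_SIZE (16) and clamp it to the range
discussed earlier, so the auto-tuned value never trips the error:

    /*
     * Hypothetical helper, not the patch's actual code: nbuffers is some
     * fraction of shared_buffers computed by the caller.
     */
    static int
    slru_autotune_buffers(int nbuffers)
    {
        nbuffers -= nbuffers % 16;           /* previous multiple of SLRU_BANK_SIZE */
        return Min(1024, Max(16, nbuffers)); /* at least one bank, capped at 8 MB */
    }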


While I have your attention -- if you could give a look to the 0001
patch I posted, I would appreciate it.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Los trabajadores menos efectivos son sistematicamente llevados al lugar
donde pueden hacer el menor daño posible: gerencia."  (El principio Dilbert)



On Thu, Feb 1, 2024 at 3:44 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2024-Feb-01, Dilip Kumar wrote:
>
> > On Thu, Feb 1, 2024 at 3:19 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > >
> > > postgres -c lc_messages=C -c shared_buffers=$((512*17))
> > >
> > > 2024-02-01 10:48:13.548 CET [1535379] FATAL:  invalid value for parameter "transaction_buffers": 17
> > > 2024-02-01 10:48:13.548 CET [1535379] DETAIL:  "transaction_buffers" must be a multiple of 16
> >
> > Maybe we should resize it to the next multiple of the SLRU_BANK_SIZE
> > instead of giving an error?
>
> Since this is the auto-tuning feature, I think it should use the
> previous multiple rather than the next, but yeah, something like that.

Okay.
>
> While I have your attention -- if you could give a look to the 0001
> patch I posted, I would appreciate it.
>

I will look into it.  Thanks.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Thu, Feb 1, 2024 at 4:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Feb 1, 2024 at 3:44 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

> Okay.
> >
> > While I have your attention -- if you could give a look to the 0001
> > patch I posted, I would appreciate it.
> >
>
> I will look into it.  Thanks.

Some quick observations,

Do we need the below two write barriers at the end of the function?
Because the next instruction is separated by the function boundary.

@@ -766,14 +766,11 @@ StartupCLOG(void)
  ..
- XactCtl->shared->latest_page_number = pageno;
-
- LWLockRelease(XactSLRULock);
+ pg_atomic_init_u64(&XactCtl->shared->latest_page_number, pageno);
+ pg_write_barrier();
 }

/*
  * Initialize member's idea of the latest page number.
  */
  pageno = MXOffsetToMemberPage(offset);
- MultiXactMemberCtl->shared->latest_page_number = pageno;
+ pg_atomic_init_u64(&MultiXactMemberCtl->shared->latest_page_number,
+    pageno);
+
+ pg_write_barrier();
 }
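For reference, the generic publish/consume pattern a trailing write
barrier would normally pair with looks like the sketch below (the
"ready" flag is a made-up placeholder for whatever later store the
barrier is supposed to order against); if nothing is stored after the
barrier in the same code path, the barrier has nothing to order:

    /* writer side */
    pg_atomic_write_u64(&shared->latest_page_number, pageno);
    pg_write_barrier();          /* order the store above before later stores */
    shared->ready = true;        /* hypothetical "publish" store */

    /* reader side */
    if (shared->ready)
    {
        pg_read_barrier();       /* order the load above before the load below */
        pageno = pg_atomic_read_u64(&shared->latest_page_number);
    }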

I am looking more into this from the concurrency point of view and
will update you soon.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Thu, Feb 1, 2024 at 4:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Feb 1, 2024 at 4:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, Feb 1, 2024 at 3:44 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> > Okay.
> > >
> > > While I have your attention -- if you could give a look to the 0001
> > > patch I posted, I would appreciate it.
> > >
> >
> > I will look into it.  Thanks.
>
> Some quick observations,
>
> Do we need below two write barriers at the end of the function?
> because the next instruction is separated by the function boundary
>
> @@ -766,14 +766,11 @@ StartupCLOG(void)
>   ..
> - XactCtl->shared->latest_page_number = pageno;
> -
> - LWLockRelease(XactSLRULock);
> + pg_atomic_init_u64(&XactCtl->shared->latest_page_number, pageno);
> + pg_write_barrier();
>  }
>
> /*
>   * Initialize member's idea of the latest page number.
>   */
>   pageno = MXOffsetToMemberPage(offset);
> - MultiXactMemberCtl->shared->latest_page_number = pageno;
> + pg_atomic_init_u64(&MultiXactMemberCtl->shared->latest_page_number,
> +    pageno);
> +
> + pg_write_barrier();
>  }
>

I have checked the patch and it looks fine to me, other than the above
question related to memory barrier usage.  One more question about the
same: the two instances 1 and 2 below look similar, but in 1 you are
not using the write_barrier whereas in 2 you are using it.  Why is that
so?  I mean, why can the reordering not happen in 1 while it may happen
in 2?

1.
+ pg_atomic_write_u64(&CommitTsCtl->shared->latest_page_number,
+ trunc->pageno);

  SimpleLruTruncate(CommitTsCtl, trunc->pageno);

vs
2.

  - shared->latest_page_number = pageno;
+ pg_atomic_write_u64(&shared->latest_page_number, pageno);
+ pg_write_barrier();

  /* update the stats counter of zeroed pages */
  pgstat_count_slru_page_zeroed(shared->slru_stats_idx);



--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On 2024-Feb-02, Dilip Kumar wrote:

> I have checked the patch and it looks fine to me other than the above
> question related to memory barrier usage one more question about the
> same, basically below to instances 1 and 2 look similar but in 1 you
> are not using the memory write_barrier whereas in 2 you are using the
> write_barrier, why is it so?  I mean why the reordering can not happen
> in 1 and it may happen in 2?

What I was thinking is that there's a lwlock operation just below, which
acts as a barrier.  But I realized something more important: there are
only two places that matter, which are SlruSelectLRUPage and
SimpleLruZeroPage.  The others are all initialization code that runs at a
point where there's not going to be any concurrency in SLRU access, so we
don't need barriers anyway.  In SlruSelectLRUPage we definitely don't
want to evict the page that SimpleLruZeroPage has initialized, starting
from the point where it returns that new page to its caller.

But if you consider the code of those two routines, you realize that the
only time an equality between latest_page_number and "this_page_number"
is going to occur, is when both pages are in the same bank ... and both
routines are required to be holding the bank lock while they run, so in
practice this is never a problem.

We need the atomic write and atomic read so that multiple processes
processing pages in different banks can update latest_page_number
simultaneously.  But the equality condition that we're looking for?
it can never happen concurrently.

In other words, these barriers are fully useless.

(We also have SimpleLruTruncate, but I think it's not as critical to
have a barrier there anyhow: accessing a slightly outdated page number
could only be a problem if a bug elsewhere causes us to try to truncate
in the current page.  I think we only have this code there because we
did have such bugs in the past, but IIUC this shouldn't happen anymore.)
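For context, the check in question is essentially the following (a
sketch of the relevant lines as I read them, not a verbatim quote), and
since both this loop and SimpleLruZeroPage() run holding the bank lock
of the page they look at, the equality case is already serialized:

    /* inside the victim-selection loop of SlruSelectLRUPage(); bank lock held */
    int64   this_page_number = shared->page_number[slotno];

    if (this_page_number ==
        pg_atomic_read_u64(&shared->latest_page_number))
        continue;   /* never evict the page SimpleLruZeroPage just handed out */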

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/



In short, I propose the attached.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/

Attachment
Sorry, brown paper bag bug there.  Here's the correct one.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"I can't go to a restaurant and order food because I keep looking at the
fonts on the menu.  Five minutes later I realize that it's also talking
about food" (Donald Knuth)

Attachment

Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
"Andrey M. Borodin"
Date:

> On 4 Feb 2024, at 18:38, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> In other words, these barriers are fully useless.

+1. I've tried to understand the ideas behind the barriers, but latest_page_number is a heuristic that does not need any
guarantees at all. It's also used in a safety check which can fire only when everything is already broken beyond any
repair. (Though using atomic access seems a good idea anyway.)

This patch uses the wording "banks" in comments before banks start to exist. But as far as I understand, it is expected
to be committed before the "banks" patch.

Besides that, this patch looks good to me.


Best regards, Andrey Borodin.


On Sun, Feb 4, 2024 at 7:10 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2024-Feb-02, Dilip Kumar wrote:
>
> > I have checked the patch and it looks fine to me other than the above
> > question related to memory barrier usage one more question about the
> > same, basically below to instances 1 and 2 look similar but in 1 you
> > are not using the memory write_barrier whereas in 2 you are using the
> > write_barrier, why is it so?  I mean why the reordering can not happen
> > in 1 and it may happen in 2?
>
> What I was thinking is that there's a lwlock operation just below, which
> acts as a barrier.  But I realized something more important: there are
> only two places that matter, which are SlruSelectLRUPage and
> SimpleLruZeroPage.  The others are all initialization code that run at a
> point where there's no going to be any concurrency in SLRU access, so we
> don't need barriers anyway.  In SlruSelectLRUPage we definitely don't
> want to evict the page that SimpleLruZeroPage has initialized, starting
> from the point where it returns that new page to its caller.
> But if you consider the code of those two routines, you realize that the
> only time an equality between latest_page_number and "this_page_number"
> is going to occur, is when both pages are in the same bank ... and both
> routines are required to be holding the bank lock while they run, so in
> practice this is never a problem.

Right, in fact when I first converted 'latest_page_number' to an
atomic, the thinking was to protect it from concurrent setting of the
value in SimpleLruZeroPage(), and also so that a concurrent read in
SlruSelectLRUPage() would not read a corrupted value.  All other
usages were during the initialization phase where we do not need any
protection.

>
> We need the atomic write and atomic read so that multiple processes
> processing pages in different banks can update latest_page_number
> simultaneously.  But the equality condition that we're looking for?
> it can never happen concurrently.

Yeah, that's right; after you told me, I also realized that the case is
protected by the bank lock.  Earlier I didn't think about this case.

> In other words, these barriers are fully useless.
>
> (We also have SimpleLruTruncate, but I think it's not as critical to
> have a barrier there anyhow: accessing a slightly outdated page number
> could only be a problem if a bug elsewhere causes us to try to truncate
> in the current page.  I think we only have this code there because we
> did have such bugs in the past, but IIUC this shouldn't happen anymore.)

+1, I agree with this theory in general.  But the below comment in
SimpleLruTruncate in your v3 patch doesn't seem correct: here we are
checking whether the latest_page_number is smaller than the cutoff, and
if so we log it as wraparound and skip the whole thing.  That is fine
even though we are reading an atomic variable and a slightly outdated
value should not be a problem; but the comment claims this is safe
because we hold the same bank lock as SimpleLruZeroPage(), and that's
not true here: we will be acquiring different bank locks one by one
based on which slotno we are checking.  Am I missing something?


+ * An important safety check: the current endpoint page must not be
+ * eligible for removal.  Like SlruSelectLRUPage, we don't need a
+ * memory barrier here because for the affected page to be relevant,
+ * we'd have to have the same bank lock as SimpleLruZeroPage.
  */
- if (ctl->PagePrecedes(shared->latest_page_number, cutoffPage))
+ if (ctl->PagePrecedes(pg_atomic_read_u64(&shared->latest_page_number),
+   cutoffPage))


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On 2024-Feb-04, Andrey M. Borodin wrote:

> This patch uses wording "banks" in comments before banks start to
> exist. But as far as I understand, it is expected to be committed
> before "banks" patch.

True -- changed that to use ControlLock.

> Besides this patch looks good to me.

Many thanks for reviewing.

On 2024-Feb-05, Dilip Kumar wrote:

> > (We also have SimpleLruTruncate, but I think it's not as critical to
> > have a barrier there anyhow: accessing a slightly outdated page number
> > could only be a problem if a bug elsewhere causes us to try to truncate
> > in the current page.  I think we only have this code there because we
> > did have such bugs in the past, but IIUC this shouldn't happen anymore.)
> 
> +1, I agree with this theory in general.  But the below comment in
> SimpleLruTruncate in your v3 patch doesn't seem correct, because here
> we are checking if the latest_page_number is smaller than the cutoff
> if so we log it as wraparound and skip the whole thing and that is
> fine even if we are reading with atomic variable and slightly outdated
> value should not be a problem but the comment claim that this safe
> because we have the same bank lock as SimpleLruZeroPage(), but that's
> not true here we will be acquiring different bank locks one by one
> based on which slotno we are checking.  Am I missing something?

I think you're correct.  I reworded this comment, so now it says this:

    /*
     * An important safety check: the current endpoint page must not be
     * eligible for removal.  This check is just a backstop against wraparound
     * bugs elsewhere in SLRU handling, so we don't care if we read a slightly
     * outdated value; therefore we don't add a memory barrier.
     */

Pushed with those changes.  Thank you!

Now I'll go rebase the rest of the patch on top.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Having your biases confirmed independently is how scientific progress is
made, and hence made our great society what it is today" (Mary Gardiner)



On Tue, Feb 6, 2024 at 4:23 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> > > (We also have SimpleLruTruncate, but I think it's not as critical to
> > > have a barrier there anyhow: accessing a slightly outdated page number
> > > could only be a problem if a bug elsewhere causes us to try to truncate
> > > in the current page.  I think we only have this code there because we
> > > did have such bugs in the past, but IIUC this shouldn't happen anymore.)
> >
> > +1, I agree with this theory in general.  But the below comment in
> > SimpleLruTruncate in your v3 patch doesn't seem correct, because here
> > we are checking if the latest_page_number is smaller than the cutoff
> > if so we log it as wraparound and skip the whole thing and that is
> > fine even if we are reading with atomic variable and slightly outdated
> > value should not be a problem but the comment claim that this safe
> > because we have the same bank lock as SimpleLruZeroPage(), but that's
> > not true here we will be acquiring different bank locks one by one
> > based on which slotno we are checking.  Am I missing something?
>
> I think you're correct.  I reworded this comment, so now it says this:
>
>     /*
>      * An important safety check: the current endpoint page must not be
>      * eligible for removal.  This check is just a backstop against wraparound
>      * bugs elsewhere in SLRU handling, so we don't care if we read a slightly
>      * outdated value; therefore we don't add a memory barrier.
>      */
>
> Pushed with those changes.  Thank you!

Yeah, this looks perfect, thanks.

> Now I'll go rebase the rest of the patch on top.

Okay, I will review and test after that.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Here's the rest of it rebased on top of current master.  I think it
makes sense to have this as one individual commit.

I made CLOGShmemBuffers, CommitTsShmemBuffers and SUBTRANSShmemBuffers
compute a number that's a multiple of SLRU_BANK_SIZE.  But it's a crock,
because we don't have that macro at that point, so I just used the
constant 16.  Obviously we need a better solution for this.

I also changed the location of bank_mask in SlruCtlData for better
packing, as advised by pahole; and renamed SLRU_SLOTNO_GET_BANKLOCKNO()
to SlotGetBankNumber().

Some very critical comments still need to be updated to the new design,
particularly anything that mentions "control lock"; but also the overall
model needs to be explained in some central location, rather than
incongruently some pieces here and other pieces there.  I'll see about
this later.  But at least this is code you should be able to play with.
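As a quick orientation for anyone playing with it, the slot-to-bank and
page-to-lock mapping ends up looking roughly like this (a sketch from my
reading of the patch; the bank_locks field name and exact shapes are
assumptions, not necessarily the final code):

    #define SLRU_BANK_SIZE  16

    static inline int
    SlotGetBankNumber(int slotno)
    {
        return slotno / SLRU_BANK_SIZE;
    }

    static inline LWLock *
    SimpleLruGetBankLock(SlruCtl ctl, int64 pageno)
    {
        int     bankno = pageno & ctl->bank_mask;   /* bank_mask = nbanks - 1 */

        return &(ctl->shared->bank_locks[bankno].lock);
    }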


I've been wondering whether we should add a "slru" to the name of the
GUCs:

commit_timestamp_slru_buffers
transaction_slru_buffers
etc

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Aprender sin pensar es inútil; pensar sin aprender, peligroso" (Confucio)

Attachment
On Tue, Feb 6, 2024 at 8:55 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> Here's the rest of it rebased on top of current master.  I think it
> makes sense to have this as one individual commit.
>
> I made CLOGShmemBuffers, CommitTsShmemBuffers and SUBTRANSShmemBuffers
> compute a number that's multiple of SLRU_BANK_SIZE.  But it's a crock,
> because we don't have that macro at that point, so I just used constant
> 16.  Obviously need a better solution for this.

If we define SLRU_BANK_SIZE in slru.h then we can use it here, right?
Because these files include slru.h anyway.

>
> I also changed the location of bank_mask in SlruCtlData for better
> packing, as advised by pahole; and renamed SLRU_SLOTNO_GET_BANKLOCKNO()
> to SlotGetBankNumber().

Okay.

> Some very critical comments still need to be updated to the new design,
> particularly anything that mentions "control lock"; but also the overall
> model needs to be explained in some central location, rather than
> incongruently some pieces here and other pieces there.  I'll see about
> this later.  But at least this is code you should be able to play with.

Okay, I will review and test this

> I've been wondering whether we should add a "slru" to the name of the
> GUCs:
>
> commit_timestamp_slru_buffers
> transaction_slru_buffers
> etc

I am not sure we are exposing anything related to SLRU to the user.  I
mean, transaction_buffers should make sense to the user in that it stores
transaction-related data in some buffer pool, but whether that buffer
pool is called SLRU or not doesn't matter much to the user IMHO.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
"Andrey M. Borodin"
Date:

> On 7 Feb 2024, at 10:58, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
>> commit_timestamp_slru_buffers
>> transaction_slru_buffers
>> etc
>
> I am not sure we are exposing anything related to SLRU to the user,

I think we already tell the user something about SLRU. I'd rather consider whether "transaction_slru_buffers" is easier
to understand than "transaction_buffers"...
IMO transaction_buffers is clearer. But I do not have a strong opinion.

> I
> mean transaction_buffers should make sense for the user that it stores
> transaction-related data in some buffers pool but whether that buffer
> pool is called SLRU or not doesn't matter much to the user IMHO.
+1


Best regards, Andrey Borodin.


On 2024-Feb-07, Dilip Kumar wrote:

> On Tue, Feb 6, 2024 at 8:55 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> >
> > I made CLOGShmemBuffers, CommitTsShmemBuffers and SUBTRANSShmemBuffers
> > compute a number that's multiple of SLRU_BANK_SIZE.  But it's a crock,
> > because we don't have that macro at that point, so I just used constant
> > 16.  Obviously need a better solution for this.
> 
> If we define SLRU_BANK_SIZE in slru.h then we can use it here, right?
> Because these files include slru.h anyway.

Sure, but is that really what we want?

> > I've been wondering whether we should add a "slru" to the name of the
> > GUCs:
> >
> > commit_timestamp_slru_buffers
> > transaction_slru_buffers
> > etc
> 
> I am not sure we are exposing anything related to SLRU to the user,

We do -- we have pg_stat_slru already.

> I mean transaction_buffers should make sense for the user that it
> stores transaction-related data in some buffers pool but whether that
> buffer pool is called SLRU or not doesn't matter much to the user
> IMHO.

Yeah, that's exactly what my initial argument was for naming these this
way.  But since the term slru already escaped into the wild via the
pg_stat_slru view, perhaps it helps users make the connection between
these things.  Alternatively, we can cross-reference each term from the
other's documentation and call it a day.

Another painful point is that pg_stat_slru uses internal names in the
data it outputs, which obviously do not match the new GUCs.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Uno puede defenderse de los ataques; contra los elogios se esta indefenso"



On Wed, Feb 7, 2024 at 3:49 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2024-Feb-07, Dilip Kumar wrote:
>
> > On Tue, Feb 6, 2024 at 8:55 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > >
> > > I made CLOGShmemBuffers, CommitTsShmemBuffers and SUBTRANSShmemBuffers
> > > compute a number that's multiple of SLRU_BANK_SIZE.  But it's a crock,
> > > because we don't have that macro at that point, so I just used constant
> > > 16.  Obviously need a better solution for this.
> >
> > If we define SLRU_BANK_SIZE in slru.h then we can use it here, right?
> > Because these files include slru.h anyway.
>
> Sure, but is that really what we want?

So your question is: do we want these buffers to be a multiple of
SLRU_BANK_SIZE?  Maybe we can have the last bank be partial; I don't
think it should create any problem logically.  I mean, we can look
again at the patch to see if we have made any such assumptions, but
that should be fairly easy to fix.  Then, if we are going this way,
maybe we should get rid of the check_slru_buffers() function as well.

> > > I've been wondering whether we should add a "slru" to the name of the
> > > GUCs:
> > >
> > > commit_timestamp_slru_buffers
> > > transaction_slru_buffers
> > > etc
> >
> > I am not sure we are exposing anything related to SLRU to the user,
>
> We do -- we have pg_stat_slru already.
>
> > I mean transaction_buffers should make sense for the user that it
> > stores transaction-related data in some buffers pool but whether that
> > buffer pool is called SLRU or not doesn't matter much to the user
> > IMHO.
>
> Yeah, that's exactly what my initial argument was for naming these this
> way.  But since the term slru already escaped into the wild via the
> pg_stat_slru view, perhaps it helps users make the connection between
> these things.  Alternatively, we can cross-reference each term from the
> other's documentation and call it a day.

Yeah, that's true, I forgot this point about pg_stat_slru.  From this
POV, if the configuration names contain "slru", users would be able to
make a better connection between the configured value and the stats in
this view, and based on that they can decide whether they need to
increase the size of these SLRU buffers.

> Another painful point is that pg_stat_slru uses internal names in the
> data it outputs, which obviously do not match the new GUCs.

Yeah, that's true, but I think this could be explained somewhere; I'm
not sure what the right place for that is.


FYI, I have also repeated all the performance tests I performed in my
first email[1], and I am seeing a similar gain.

Some comments on v18 in my first pass of the review.

1.
@@ -665,7 +765,7 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
  lsnindex = GetLSNIndex(slotno, xid);
  *lsn = XactCtl->shared->group_lsn[lsnindex];

- LWLockRelease(XactSLRULock);
+ LWLockRelease(SimpleLruGetBankLock(XactCtl, pageno));

Maybe here we can add an assert before releasing the lock for a safety check

Assert(LWLockHeldByMe(SimpleLruGetBankLock(XactCtl, pageno)));

2.
+ *
+ * XXX could we make the LSNs to be bank-based?
  */
  XLogRecPtr *group_lsn;

IMHO, the flush still happens at the page level, so the LSN up to which
we must flush before writing out a particular clog page should also be
maintained at the page level.

[1] https://www.postgresql.org/message-id/CAFiTN-vzDvNz%3DExGXz6gdyjtzGixKSqs0mKHMmaQ8sOSEFZ33A%40mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On 2024-Feb-07, Dilip Kumar wrote:

> On Wed, Feb 7, 2024 at 3:49 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

> > Sure, but is that really what we want?
> 
> So your question is do we want these buffers to be in multiple of
> SLRU_BANK_SIZE?  Maybe we can have the last bank to be partial, I
> don't think it should create any problem logically.  I mean we can
> look again in the patch to see if we have made any such assumptions
> but that should be fairly easy to fix, then maybe if we are going in
> this way we should get rid of the check_slru_buffers() function as
> well.

Not really, I just don't think the macro should be in slru.h.

Another thing I've been thinking is that perhaps it would be useful to
make the banks smaller, when the total number of buffers is small.  For
example, if you have 16 or 32 buffers, it's not really clear to me that
it makes sense to have just 1 bank or 2 banks.  It might be more
sensible to have 4 banks with 4 or 8 buffers instead.  That should make
the algorithm scale down as well as up ...

I haven't done either of those things in the attached v19 version.  I
did go over the comments once again and rewrote the parts I was unhappy
with, including some existing ones.  I think it's OK now from that point
of view ... at some point I thought about creating a separate README,
but in the end I thought it not necessary.

I did add a bunch of Assert()s to make sure the locks that are supposed
to be held are actually held.  This led me to testing the page status to
be not EMPTY during SimpleLruWriteAll() before calling
SlruInternalWritePage(), because the assert was firing.  The previous
code is not really *buggy*, but to me it's weird to call WritePage() on
a slot with no contents.

Another change was in TransactionGroupUpdateXidStatus: the original code
had the leader doing pg_atomic_read_u32(&procglobal->clogGroupFirst) to
know which bank to lock.  I changed it to simply be the page used by the
leader process; this doesn't need an atomic read, and should be the same
page anyway.  (If it isn't, it's no big deal).  But what's more: even if
we do read ->clogGroupFirst at that point, there's no guarantee that
this is going to be exactly for the same process that ends up being the
first in the list, because since we have not set it to INVALID by the
time we grab the bank lock, it is quite possible for more processes to
add themselves to the list.
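In other words, the leader path now looks roughly like this (a sketch
only, not the committed code; TransactionIdToPage is the clog.c macro
and SimpleLruGetBankLock is the patch's bank-lock lookup):

    /* sketch of the leader path in TransactionGroupUpdateXidStatus() */
    LWLock *banklock = SimpleLruGetBankLock(XactCtl, pageno);  /* leader's own page */

    LWLockAcquire(banklock, LW_EXCLUSIVE);

    /*
     * Only now close out the group by resetting ProcGlobal->clogGroupFirst
     * (so later arrivals start a new group), then walk the member list and
     * update each member's xid status while holding this one bank lock.
     */

    LWLockRelease(banklock);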

I realized all this while rewriting the comments in a way that would let
me understand what was going on ... so IMO the effort was worthwhile.

Anyway, what I send now should be pretty much final, modulo the
change to the check_slru_buffers routines and documentation additions to
match pg_stat_slru to the new GUC names.

> > Another painful point is that pg_stat_slru uses internal names in the
> > data it outputs, which obviously do not match the new GUCs.
> 
> Yeah, that's true, but I think this could be explained somewhere not
> sure what is the right place for this.

Or we can change those names in the view.

> FYI, I have also repeated all the performance tests I performed in my
> first email[1], and I am seeing a similar gain.

Excellent, thanks for doing that.

> Some comments on v18 in my first pass of the review.
> 
> 1.
> @@ -665,7 +765,7 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
>   lsnindex = GetLSNIndex(slotno, xid);
>   *lsn = XactCtl->shared->group_lsn[lsnindex];
> 
> - LWLockRelease(XactSLRULock);
> + LWLockRelease(SimpleLruGetBankLock(XactCtl, pageno));
> 
> Maybe here we can add an assert before releasing the lock for a safety check
> 
> Assert(LWLockHeldByMe(SimpleLruGetBankLock(XactCtl, pageno)));

Hmm, I think this would just throw a warning or error along the lines of
"you don't hold such an lwlock" anyway, so it doesn't seem necessary.

> 2.
> + *
> + * XXX could we make the LSNs to be bank-based?
>   */
>   XLogRecPtr *group_lsn;
> 
> IMHO, the flush still happens at the page level so up to which LSN
> should be flush before flushing the particular clog page should also
> be maintained at the page level.

Yeah, this was a misguided thought, nevermind.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
<Schwern> It does it in a really, really complicated way
<crab> why does it need to be complicated?
<Schwern> Because it's MakeMaker.

Attachment
On Fri, Feb 23, 2024 at 1:48 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2024-Feb-07, Dilip Kumar wrote:
>
> > On Wed, Feb 7, 2024 at 3:49 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> > > Sure, but is that really what we want?
> >
> > So your question is do we want these buffers to be in multiple of
> > SLRU_BANK_SIZE?  Maybe we can have the last bank to be partial, I
> > don't think it should create any problem logically.  I mean we can
> > look again in the patch to see if we have made any such assumptions
> > but that should be fairly easy to fix, then maybe if we are going in
> > this way we should get rid of the check_slru_buffers() function as
> > well.
>
> Not really, I just don't think the macro should be in slru.h.

Okay

> Another thing I've been thinking is that perhaps it would be useful to
> make the banks smaller, when the total number of buffers is small.  For
> example, if you have 16 or 32 buffers, it's not really clear to me that
> it makes sense to have just 1 bank or 2 banks.  It might be more
> sensible to have 4 banks with 4 or 8 buffers instead.  That should make
> the algorithm scale down as well as up ...

It might be helpful to have small-size banks when the SLRU buffers are
set to a very low value and we are only accessing a couple of pages at a
time (i.e. no buffer replacement), because in such cases most of the
contention will be on the SLRU bank lock.  Although I am not sure how
practical such a use case would be: if someone is using multixact very
heavily or creating frequent subtransaction overflow, then wouldn't they
set this buffer limit to some big enough value?  By doing this we would
also lose some simplicity of the patch, I mean instead of using the
simple macro i.e. SLRU_BANK_SIZE we would need to compute this and store
it in SlruShared.  Maybe that's not that bad.

>
> I haven't done either of those things in the attached v19 version.  I
> did go over the comments once again and rewrote the parts I was unhappy
> with, including some existing ones.  I think it's OK now from that point
> of view ... at some point I thought about creating a separate README,
> but in the end I thought it not necessary.

Thanks, I will review those changes.

> I did add a bunch of Assert()s to make sure the locks that are supposed
> to be held are actually held.  This led me to testing the page status to
> be not EMPTY during SimpleLruWriteAll() before calling
> SlruInternalWritePage(), because the assert was firing.  The previous
> code is not really *buggy*, but to me it's weird to call WritePage() on
> a slot with no contents.

Okay.  I mean, internally SlruInternalWritePage() will flush only if
the status is SLRU_PAGE_VALID, but it is better the way you have done it.

> Another change was in TransactionGroupUpdateXidStatus: the original code
> had the leader doing pg_atomic_read_u32(&procglobal->clogGroupFirst) to
> know which bank to lock.  I changed it to simply be the page used by the
> leader process; this doesn't need an atomic read, and should be the same
> page anyway.  (If it isn't, it's no big deal).  But what's more: even if
> we do read ->clogGroupFirst at that point, there's no guarantee that
> this is going to be exactly for the same process that ends up being the
> first in the list, because since we have not set it to INVALID by the
> time we grab the bank lock, it is quite possible for more processes to
> add themselves to the list.

Yeah, this looks better

> I realized all this while rewriting the comments in a way that would let
> me understand what was going on ... so IMO the effort was worthwhile.

+1

I will review and do some more testing early next week and share my feedback.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
"Andrey M. Borodin"
Date:

> On 23 Feb 2024, at 12:36, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
>> Another thing I've been thinking is that perhaps it would be useful to
>> make the banks smaller, when the total number of buffers is small.  For
>> example, if you have 16 or 32 buffers, it's not really clear to me that
>> it makes sense to have just 1 bank or 2 banks.  It might be more
>> sensible to have 4 banks with 4 or 8 buffers instead.  That should make
>> the algorithm scale down as well as up ...
>
> It might be helpful to have small-size banks when SLRU buffers are set
> to a very low value and we are only accessing a couple of pages at a
> time (i.e. no buffer replacement) because in such cases most of the
> contention will be on SLRU Bank lock. Although I am not sure how
> practical such a use case would be, I mean if someone is using
> multi-xact very heavily or creating frequent subtransaction overflow
> then wouldn't they should set this buffer limit to some big enough
> value?  By doing this we would lose some simplicity of the patch I
> mean instead of using the simple macro i.e. SLRU_BANK_SIZE we would
> need to compute this and store it in SlruShared. Maybe that's not that
> bad.

I'm sure anyone with multiple CPUs should increase, not decrease, the previous default of 128 buffers (with 512MB shared
buffers). Having more CPUs (the only way to benefit from more locks) implies bigger transaction buffers.
IMO making the bank size variable adds unneeded computation overhead; bank search loops should be unrollable by the
compiler, etc.
Originally there was a patch set step that packed a bank's page addresses together in one array. It was done to make
bank search a SIMD instruction.


Best regards, Andrey Borodin.


On Fri, Feb 23, 2024 at 1:06 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Feb 23, 2024 at 1:48 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> >
> > On 2024-Feb-07, Dilip Kumar wrote:
> >
> > > On Wed, Feb 7, 2024 at 3:49 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> >
> > > > Sure, but is that really what we want?
> > >
> > > So your question is do we want these buffers to be in multiple of
> > > SLRU_BANK_SIZE?  Maybe we can have the last bank to be partial, I
> > > don't think it should create any problem logically.  I mean we can
> > > look again in the patch to see if we have made any such assumptions
> > > but that should be fairly easy to fix, then maybe if we are going in
> > > this way we should get rid of the check_slru_buffers() function as
> > > well.
> >
> > Not really, I just don't think the macro should be in slru.h.
>
> Okay
>
> > Another thing I've been thinking is that perhaps it would be useful to
> > make the banks smaller, when the total number of buffers is small.  For
> > example, if you have 16 or 32 buffers, it's not really clear to me that
> > it makes sense to have just 1 bank or 2 banks.  It might be more
> > sensible to have 4 banks with 4 or 8 buffers instead.  That should make
> > the algorithm scale down as well as up ...
>
> It might be helpful to have small-size banks when SLRU buffers are set
> to a very low value and we are only accessing a couple of pages at a
> time (i.e. no buffer replacement) because in such cases most of the
> contention will be on SLRU Bank lock. Although I am not sure how
> practical such a use case would be, I mean if someone is using
> multi-xact very heavily or creating frequent subtransaction overflow
> then wouldn't they should set this buffer limit to some big enough
> value?  By doing this we would lose some simplicity of the patch I
> mean instead of using the simple macro i.e. SLRU_BANK_SIZE we would
> need to compute this and store it in SlruShared. Maybe that's not that
> bad.
>
> >
> > I haven't done either of those things in the attached v19 version.  I
> > did go over the comments once again and rewrote the parts I was unhappy
> > with, including some existing ones.  I think it's OK now from that point
> > of view ... at some point I thought about creating a separate README,
> > but in the end I thought it not necessary.
>
> Thanks, I will review those changes.

Few other things I noticed while reading through the patch, I haven't
read it completely yet but this is what I got for now.

1.
+ * If no process is already in the list, we're the leader; our first step
+ * is to "close out the group" by resetting the list pointer from
+ * ProcGlobal->clogGroupFirst (this lets other processes set up other
+ * groups later); then we lock the SLRU bank corresponding to our group's
+ * page, do the SLRU updates, release the SLRU bank lock, and wake up the
+ * sleeping processes.

I think here we are saying that we "close out the group" before
acquiring the SLRU lock, but that's not true.  We keep the group open
until we get the lock, so that we can get the maximum number of members
in while we are waiting for the lock anyway.

2.
 static void
 TransactionIdSetCommitTs(TransactionId xid, TimestampTz ts,
  RepOriginId nodeid, int slotno)
 {
- Assert(TransactionIdIsNormal(xid));
+ if (!TransactionIdIsNormal(xid))
+ return;
+
+ entryno = TransactionIdToCTsEntry(xid);

I do not understand why we need this change.



--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On 2024-Feb-23, Andrey M. Borodin wrote:

> I'm sure anyone with multiple CPUs should increase, not decrease
> previous default of 128 buffers (with 512MB shared buffers). Having
> more CPUs (the only way to benefit from more locks) implies bigger
> transaction buffers.

Sure.

> IMO making bank size variable adds unneeded computation overhead, bank
> search loops should be unrollable by compiler etc.

Makes sense.

> Originally there was a patch set step, that packed bank's page
> addresses together in one array. It was done to make bank search a
> SIMD instruction.

Ants Aasma had proposed a rework of the LRU code for better performance.
He told me it depended on bank size being 16, so you're right that it's
probably not a good idea to make it variable.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/



On 2024-Feb-23, Dilip Kumar wrote:

> 1.
> + * If no process is already in the list, we're the leader; our first step
> + * is to "close out the group" by resetting the list pointer from
> + * ProcGlobal->clogGroupFirst (this lets other processes set up other
> + * groups later); then we lock the SLRU bank corresponding to our group's
> + * page, do the SLRU updates, release the SLRU bank lock, and wake up the
> + * sleeping processes.
> 
> I think here we are saying that we "close out the group" before
> acquiring the SLRU lock but that's not true.  We keep the group open
> until we gets the lock so that we can get maximum members in while we
> are anyway waiting for the lock.

Absolutely right.  Reworded that.

> 2.
>  static void
>  TransactionIdSetCommitTs(TransactionId xid, TimestampTz ts,
>   RepOriginId nodeid, int slotno)
>  {
> - Assert(TransactionIdIsNormal(xid));
> + if (!TransactionIdIsNormal(xid))
> + return;
> +
> + entryno = TransactionIdToCTsEntry(xid);
> 
> I do not understand why we need this change.

Ah yeah, I was bothered by the fact that if you pass Xid values earlier
than NormalXid to this function, we'd reply with some nonsensical values
instead of throwing an error.  But you're right that it doesn't belong
in this patch, so I removed that.

Here's a version with these fixes, where I also added some text to the
pg_stat_slru documentation:

+  <para>
+   For each <literal>SLRU</literal> area that's part of the core server,
+   there is a configuration parameter that controls its size, with the suffix
+   <literal>_buffers</literal> appended.  For historical
+   reasons, the names are not exact matches, but <literal>Xact</literal>
+   corresponds to <literal>transaction_buffers</literal> and the rest should
+   be obvious.
+   <!-- Should we edit pgstat_internal.h::slru_names so that the "name" matches
+        the GUC name?? -->
+  </para>

I think I would like to suggest renaming the GUCs to have the _slru_ bit
in the middle:

+# - SLRU Buffers (change requires restart) -
+
+#commit_timestamp_slru_buffers = 0          # memory for pg_commit_ts (0 = auto)
+#multixact_offsets_slru_buffers = 16            # memory for pg_multixact/offsets
+#multixact_members_slru_buffers = 32            # memory for pg_multixact/members
+#notify_slru_buffers = 16                   # memory for pg_notify
+#serializable_slru_buffers = 32             # memory for pg_serial
+#subtransaction_slru_buffers = 0            # memory for pg_subtrans (0 = auto)
+#transaction_slru_buffers = 0               # memory for pg_xact (0 = auto)

and the pgstat_internal.h table:
 
static const char *const slru_names[] = {
    "commit_timestamp",
    "multixact_members",
    "multixact_offsets",
    "notify",
    "serializable",
    "subtransaction",
    "transaction",
    "other"                        /* has to be last */
};

This way they match perfectly.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"All rings of power are equal,
But some rings of power are more equal than others."
                                 (George Orwell's The Lord of the Rings)

Attachment
On Mon, Feb 26, 2024 at 9:46 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2024-Feb-23, Dilip Kumar wrote:

+  <para>
+   For each <literal>SLRU</literal> area that's part of the core server,
+   there is a configuration parameter that controls its size, with the suffix
+   <literal>_buffers</literal> appended.  For historical
+   reasons, the names are not exact matches, but <literal>Xact</literal>
+   corresponds to <literal>transaction_buffers</literal> and the rest should
+   be obvious.
+   <!-- Should we edit pgstat_internal.h::slru_names so that the "name" matches
+        the GUC name?? -->
+  </para>

I think I would like to suggest renaming the GUCs to have the _slru_ bit
in the middle:

+# - SLRU Buffers (change requires restart) -
+
+#commit_timestamp_slru_buffers = 0          # memory for pg_commit_ts (0 = auto)
+#multixact_offsets_slru_buffers = 16            # memory for pg_multixact/offsets
+#multixact_members_slru_buffers = 32            # memory for pg_multixact/members
+#notify_slru_buffers = 16                   # memory for pg_notify
+#serializable_slru_buffers = 32             # memory for pg_serial
+#subtransaction_slru_buffers = 0            # memory for pg_subtrans (0 = auto)
+#transaction_slru_buffers = 0               # memory for pg_xact (0 = auto)

and the pgstat_internal.h table:

static const char *const slru_names[] = {
        "commit_timestamp",
        "multixact_members",
        "multixact_offsets",
        "notify",
        "serializable",
        "subtransaction",
        "transaction",
        "other"                                         /* has to be last */
};

This way they match perfectly.

Yeah, I think this looks fine to me.  

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On 2024-Feb-27, Dilip Kumar wrote:

> > static const char *const slru_names[] = {
> >         "commit_timestamp",
> >         "multixact_members",
> >         "multixact_offsets",
> >         "notify",
> >         "serializable",
> >         "subtransaction",
> >         "transaction",
> >         "other"                                         /* has to be last
> > */
> > };

Here's a patch for the renaming part.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"No nos atrevemos a muchas cosas porque son difíciles,
pero son difíciles porque no nos atrevemos a hacerlas" (Séneca)

Attachment

Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
"Andrey M. Borodin"
Date:

> On 27 Feb 2024, at 21:03, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2024-Feb-27, Dilip Kumar wrote:
>
>>> static const char *const slru_names[] = {
>>>        "commit_timestamp",
>>>        "multixact_members",
>>>        "multixact_offsets",
>>>        "notify",
>>>        "serializable",
>>>        "subtransaction",
>>>        "transaction",
>>>        "other"                                         /* has to be last
>>> */
>>> };
>
> Here's a patch for the renaming part.

Sorry for the late reply, I have one nit. Are you sure that multixact_members and multixact_offsets should be plural,
while transaction and commit_timestamp are singular?
Maybe multixact_members and multixact_offset? Because there are many members and one offset for a given multixact?
Users certainly do not care, though...


Best regards, Andrey Borodin.


On 2024-Feb-27, Andrey M. Borodin wrote:

> Sorry for the late reply, I have one nit. Are you sure that
> multixact_members and multixact_offsets are plural, while transaction
> and commit_timestamp are singular?
> Maybe multixact_members and multixact_offset? Because there are many
> members and one offset for a given multixact? Users certainly do not
> care, though...

I asked myself the same question actually, and thought about putting them
both in the singular.  I only backed off because I noticed that the
directories themselves are in plural (an old mistake of mine, evidently).
Maybe we should follow that instinct and use the singular for these.

If we do that, we can rename the directories to also appear in singular
when/if the patch to add standard page headers to the SLRUs lands --
which is going to need code to rewrite the files during pg_upgrade
anyway, so the rename is not going to be a big deal.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Crear es tan difícil como ser libre" (Elsa Triolet)



Here's the complete set, with these two names using the singular.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Uno puede defenderse de los ataques; contra los elogios se esta indefenso"

Attachment

Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
"Andrey M. Borodin"
Date:

> On 27 Feb 2024, at 22:33, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> <v21-0001-Rename-SLRU-elements-in-pg_stat_slru.patch><v21-0002-Make-SLRU-buffer-sizes-configurable.patch>

These patches look amazing!


Best regards, Andrey Borodin.


On 2024-Feb-27, Alvaro Herrera wrote:

> Here's the complete set, with these two names using the singular.

BTW one thing I had not noticed is that before this patch we had a
minimum shmem size that's lower than the lowest you can go with the new
code.

This means Postgres may no longer start under extremely tight memory
restrictions (and of course it uses more memory even when idle or with
small databases).  I wonder to what extent we should make an effort to
relax that.  For small, largely inactive servers, this is just memory we
use for no good reason.  However, anything we do here will impact
performance on the high end, because as Andrey says this will add
calculations and jumps where there are none today.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"We’ve narrowed the problem down to the customer’s pants being in a situation
 of vigorous combustion" (Robert Haas, Postgres expert extraordinaire)



On Tue, Feb 27, 2024 at 11:41 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2024-Feb-27, Alvaro Herrera wrote:

> Here's the complete set, with these two names using the singular.

BTW one thing I had not noticed is that before this patch we had a
minimum shmem size that's lower than the lowest you can go with the new
code.

This means Postgres may no longer start under extremely tight memory
restrictions (and of course it uses more memory even when idle or with
small databases).  I wonder to what extent we should make an effort to
relax that.  For small, largely inactive servers, this is just memory we
use for no good reason.  However, anything we do here will impact
performance on the high end, because as Andrey says this will add
calculations and jumps where there are none today.


I was just comparing the minimum memory required for SLRU when the system is minimally configured, correct me if I am wrong.

SLRU                          unpatched   patched
commit_timestamp_buffers              4        16
subtransaction_buffers               32        16
transaction_buffers                   4        16
multixact_offset_buffers              8        16
multixact_member_buffers             16        16
notify_buffers                        8        16
serializable_buffers                 16        16
-------------------------------------------------
total buffers                        88       112

so that is < 200 kB of extra memory on a minimally configured system; IMHO this should not matter.
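(For the record: that's 112 - 88 = 24 extra buffers, and assuming the default 8 kB block size that's 24 * 8 kB = 192 kB, which is where the figure above comes from.)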

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

From
Kyotaro Horiguchi
Date:
At Tue, 27 Feb 2024 18:33:18 +0100, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in 
> Here's the complete set, with these two names using the singular.

The commit by the second patch added several GUC descriptions:

> Sets the size of the dedicated buffer pool used for the commit timestamp cache.

Some of them (commit_timestamp_buffers, transaction_buffers, and
subtransaction_buffers) use 0 to mean auto-tuning based on the
shared_buffers size.  I think it's worth adding an extra_desc such as "0
to automatically determine this value based on the shared buffer
size".

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



On 2024-Feb-29, Kyotaro Horiguchi wrote:

> At Tue, 27 Feb 2024 18:33:18 +0100, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in 
> > Here's the complete set, with these two names using the singular.
> 
> The commit by the second patch added several GUC descriptions:
> 
> > Sets the size of the dedicated buffer pool used for the commit timestamp cache.
> 
> Some of them (commit_timestamp_buffers, transaction_buffers, and
> subtransaction_buffers) use 0 to mean auto-tuning based on the
> shared_buffers size.  I think it's worth adding an extra_desc such as "0
> to automatically determine this value based on the shared buffer
> size".

How about this?

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"La victoria es para quien se atreve a estar solo"

Attachment
On 2024-Feb-29, Alvaro Herrera wrote:

> On 2024-Feb-29, Kyotaro Horiguchi wrote:

> > Some of them (commit_timestamp_buffers, transaction_buffers, and
> > subtransaction_buffers) use 0 to mean auto-tuning based on the
> > shared_buffers size.  I think it's worth adding an extra_desc such as "0
> > to automatically determine this value based on the shared buffer
> > size".
> 
> How about this?

Pushed that way, but we can discuss further wording improvements/changes
if someone wants to propose any.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"La rebeldía es la virtud original del hombre" (Arthur Schopenhauer)



Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Pushed that way, but we can discuss further wording improvements/changes
> if someone wants to propose any.

I just noticed that drongo is complaining about two lines added
by commit 53c2a97a9:

 drongo        | 2024-03-04 14:34:52 | ../pgsql/src/backend/access/transam/slru.c(436): warning C4047: '!=': 'SlruPageStatus*' differs in levels of indirection from 'int'
 drongo        | 2024-03-04 14:34:52 | ../pgsql/src/backend/access/transam/slru.c(717): warning C4047: '!=': 'SlruPageStatus*' differs in levels of indirection from 'int'

These lines are

    Assert(&shared->page_status[slotno] != SLRU_PAGE_EMPTY);

    Assert(&ctl->shared->page_status[slotno] != SLRU_PAGE_EMPTY);

These are comparing the address of something with an enum value,
which surely cannot be sane.  Is the "&" operator incorrect?

It looks like SLRU_PAGE_EMPTY has (by chance, or deliberately)
the numeric value of zero, so I guess the majority of our BF
animals are understanding this as "address != NULL".  But that
doesn't look like a useful test to be making.
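
If the "&" is indeed the slip, presumably the intended checks are on the
slot's status value itself, i.e.

    Assert(shared->page_status[slotno] != SLRU_PAGE_EMPTY);

    Assert(ctl->shared->page_status[slotno] != SLRU_PAGE_EMPTY);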

            regards, tom lane



I wrote:
> It looks like SLRU_PAGE_EMPTY has (by chance, or deliberately)
> the numeric value of zero, so I guess the majority of our BF
> animals are understanding this as "address != NULL".  But that
> doesn't look like a useful test to be making.

In hopes of noticing whether there are other similar thinkos,
I permuted the order of the SlruPageStatus enum values, and
now I get the expected warnings from gcc:

In file included from ../../../../src/include/postgres.h:45,
                 from slru.c:59:
slru.c: In function ‘SimpleLruWaitIO’:
slru.c:436:38: warning: comparison between pointer and integer
  Assert(&shared->page_status[slotno] != SLRU_PAGE_EMPTY);
                                      ^~
../../../../src/include/c.h:862:9: note: in definition of macro ‘Assert’
   if (!(condition)) \
         ^~~~~~~~~
slru.c: In function ‘SimpleLruWritePage’:
slru.c:717:43: warning: comparison between pointer and integer
  Assert(&ctl->shared->page_status[slotno] != SLRU_PAGE_EMPTY);
                                           ^~
../../../../src/include/c.h:862:9: note: in definition of macro ‘Assert’
   if (!(condition)) \
         ^~~~~~~~~

So it looks like it's just these two places.

            regards, tom lane



On 2024-Mar-04, Tom Lane wrote:

> In hopes of noticing whether there are other similar thinkos,
> I permuted the order of the SlruPageStatus enum values, and
> now I get the expected warnings from gcc:

Thanks for checking!  I pushed the fixes.

Maybe we should assign a nonzero value (= 1) to the first element of
enums, to avoid this kind of mistake.
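
For SlruPageStatus that would look something like the following; only
SLRU_PAGE_EMPTY appears upthread, the other member names are recalled from
slru.h and shown here for illustration:

    typedef enum
    {
        SLRU_PAGE_EMPTY = 1,            /* buffer is not in use */
        SLRU_PAGE_READ_IN_PROGRESS,     /* page is being read in */
        SLRU_PAGE_VALID,                /* page is valid and not being written */
        SLRU_PAGE_WRITE_IN_PROGRESS     /* page is being written out */
    } SlruPageStatus;

With the first value nonzero, a pointer-vs-enum comparison like the one
above is no longer a comparison against a null pointer constant, so every
compiler reports it instead of only complaining when the enum happens to
start at some other value.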

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/



Hello Alvaro,

27.02.2024 20:33, Alvaro Herrera wrote:
> Here's the complete set, with these two names using the singular.
>

I've managed to trigger an assert added by 53c2a97a9.
Please try the following script against a server compiled with
-DTEST_SUMMARIZE_SERIAL (initially I observed this failure without the
define, it just simplifies reproducing...):
# initdb & start ...

createdb test
echo "
SELECT pg_current_xact_id() AS tx
\gset
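-- each CREATE TABLE below consumes one xid, so this advances the last used xid to roughly 1022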

SELECT format('CREATE TABLE t%s(i int)', g)
   FROM generate_series(1, 1022 - :tx) g
\gexec

BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT pg_current_xact_id();
SELECT pg_sleep(5);
" | psql test &

echo "
SELECT pg_sleep(1);
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT 1 INTO a;
COMMIT;

BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT 2 INTO b;
" | psql test

It fails for me with the following stack trace:
TRAP: failed Assert("LWLockHeldByMeInMode(SimpleLruGetBankLock(ctl, pageno), LW_EXCLUSIVE)"), File: "slru.c", Line: 366, PID: 21711
ExceptionalCondition at assert.c:52:13
SimpleLruZeroPage at slru.c:369:11
SerialAdd at predicate.c:921:20
SummarizeOldestCommittedSxact at predicate.c:1521:2
GetSerializableTransactionSnapshotInt at predicate.c:1787:16
GetSerializableTransactionSnapshot at predicate.c:1691:1
GetTransactionSnapshot at snapmgr.c:264:21
exec_simple_query at postgres.c:1162:4
...

Best regards,
Alexander



Hello,

On 2024-Apr-03, Alexander Lakhin wrote:

> I've managed to trigger an assert added by 53c2a97a9.
> Please try the following script against a server compiled with
> -DTEST_SUMMARIZE_SERIAL (initially I observed this failure without the
> define, it just simplifies reproducing...):

Ah yes, absolutely, we're neglecting to switch to the correct SLRU bank
lock there.  This rewrite of that small piece should fix it.  Thanks for
reporting this.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Pido que me den el Nobel por razones humanitarias" (Nicanor Parra)

Attachment
On Wed, Apr 3, 2024 at 7:40 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> Hello,
>
> On 2024-Apr-03, Alexander Lakhin wrote:
>
> > I've managed to trigger an assert added by 53c2a97a9.
> > Please try the following script against a server compiled with
> > -DTEST_SUMMARIZE_SERIAL (initially I observed this failure without the
> > define, it just simplifies reproducing...):
>
> Ah yes, absolutely, we're neglecting to switch to the correct SLRU bank
> lock there.  This rewrite of that small piece should fix it.  Thanks for
> reporting this.
>

Yeah, we missed acquiring the bank lock for the intervening pages;
thanks for reporting.  Your fix looks correct to me.
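
Without opening the attachment, the shape of the change is presumably a
per-page lock dance along these lines; firstZeroPage, targetPage and
SerialNextPage() are illustrative names for the page range being zeroed and
the page-advance step, not necessarily what the patch uses:

    LWLock     *prevlock = NULL;
    LWLock     *lock;
    int64       page;

    for (page = firstZeroPage; ; page = SerialNextPage(page))
    {
        lock = SimpleLruGetBankLock(SerialSlruCtl, page);

        /* each page can live in a different bank, so swap locks when we cross banks */
        if (lock != prevlock)
        {
            if (prevlock != NULL)
                LWLockRelease(prevlock);
            LWLockAcquire(lock, LW_EXCLUSIVE);
            prevlock = lock;
        }

        SimpleLruZeroPage(SerialSlruCtl, page);

        if (page == targetPage)
            break;
    }
    /* keep the last bank lock until the summary has been written to targetPage */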

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On 2024-Apr-03, Dilip Kumar wrote:

> Yeah, we missed acquiring the bank lock for the intervening pages;
> thanks for reporting.  Your fix looks correct to me.

Thanks for the quick review!  And thanks to Alexander for the report.
Pushed the fix.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"No hay hombre que no aspire a la plenitud, es decir,
la suma de experiencias de que un hombre es capaz"