Thread: MultiXact\SLRU buffers configuration

MultiXact\SLRU buffers configuration

From
"Andrey M. Borodin"
Date:
Hi, hackers!

*** The problem ***
I'm investigating some cases of reduced database performance due to MultiXactOffsetLock contention (80% MultiXactOffsetLock, 20% IO DataFileRead).
The problem manifested itself during index repack and constraint validation, both of which are effectively full table scans.
The database workload contains a lot of select for share\select for update queries. I've tried to construct a synthetic workload generator and could not achieve a similar lock configuration: I see a lot of different locks in wait events, in particular a lot more MultiXactMemberLocks. But from my experiments with the synthetic workload, contention on MultiXactOffsetLock can be reduced by increasing NUM_MXACTOFFSET_BUFFERS=8 to bigger numbers.
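For reference, these are compile-time constants in src/backend/access/transam/multixact.c, so the experiment is simply bumping them and rebuilding, roughly like this (32/64 are just the values I tried, not a recommendation):

    #define NUM_MXACTOFFSET_BUFFERS    32    /* stock value is 8 */
    #define NUM_MXACTMEMBER_BUFFERS    64    /* stock value is 16 */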

*** Question 1 ***
Is it safe to increase number of buffers of MultiXact\All SLRUs, recompile and run database as usual?
I cannot experiment much with production. But I'm mostly sure that bigger buffers will solve the problem.

*** Question 2 ***
Probably, we could add GUCs for SLRU sizes? Are there any reasons not to make them configurable? I think multis, clog, subtransactions and others would benefit from bigger buffers. But, probably, too many knobs can be confusing.
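If we went that way, I imagine each size would be an ordinary integer entry in guc.c's ConfigureNamesInt[]; a rough sketch of what I have in mind (the name, bounds and wording here are only illustrative, there is no actual patch yet):

    {
        {"multixact_offsets_slru_buffers", PGC_POSTMASTER, RESOURCES_MEM,
            gettext_noop("Sets the number of shared buffers used for the MultiXact offsets SLRU."),
            NULL
        },
        &multixact_offsets_slru_buffers,    /* new int variable, would live in multixact.c */
        8, 2, 1024,                         /* boot value matches today's hardcoded 8 */
        NULL, NULL, NULL
    },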

*** Question 3 ***
MultiXact offset lock is always taken as an exclusive lock. This effectively makes the MultiXact offset subsystem single-threaded. If someone has a good idea how to make it more concurrency-friendly, I'm willing to put some effort into this.
Probably, I could just add LWLocks for each offset buffer page. Is it something worth doing? Or are there any hidden caveats and difficulties?
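To make the idea a bit more concrete, here is a purely illustrative sketch (made-up names, not existing PostgreSQL code): give each offsets buffer slot its own LWLock, take it shared for reads, and keep the control lock only for slot lookup and eviction:

    typedef struct OffsetsSlot
    {
        int         pageno;     /* SLRU page currently loaded into this slot */
        char       *page;       /* BLCKSZ bytes of page content */
        LWLock      lock;       /* set up with LWLockInitialize(); protects this slot only */
    } OffsetsSlot;

    static MultiXactOffset
    read_offset_entry(OffsetsSlot *slot, int entryno)
    {
        MultiXactOffset result;

        /* a per-page shared lock lets readers of different pages run in parallel */
        LWLockAcquire(&slot->lock, LW_SHARED);
        result = ((MultiXactOffset *) slot->page)[entryno];
        LWLockRelease(&slot->lock);

        return result;
    }

I don't know yet how this would interact with slru.c's victim page selection, hence the question.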

Thanks!

Best regards, Andrey Borodin.


Re: MultiXact\SLRU buffers configuration

From
"Andrey M. Borodin"
Date:

> On 8 May 2020, at 21:36, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>
> *** The problem ***
> I'm investigating some cases of reduced database performance due to MultiXactOffsetLock contention (80% MultiXactOffsetLock, 20% IO DataFileRead).
> The problem manifested itself during index repack and constraint validation, both of which are effectively full table scans.
> The database workload contains a lot of select for share\select for update queries. I've tried to construct a synthetic workload generator and could not achieve a similar lock configuration: I see a lot of different locks in wait events, in particular a lot more MultiXactMemberLocks. But from my experiments with the synthetic workload, contention on MultiXactOffsetLock can be reduced by increasing NUM_MXACTOFFSET_BUFFERS=8 to bigger numbers.
>
> *** Question 1 ***
> Is it safe to increase number of buffers of MultiXact\All SLRUs, recompile and run database as usual?
> I cannot experiment much with production. But I'm mostly sure that bigger buffers will solve the problem.
>
> *** Question 2 ***
> Probably, we could add GUCs for SLRU sizes? Are there any reasons not to make them configurable? I think multis, clog, subtransactions and others would benefit from bigger buffers. But, probably, too many knobs can be confusing.
>
> *** Question 3 ***
> MultiXact offset lock is always taken as an exclusive lock. This effectively makes the MultiXact offset subsystem single-threaded. If someone has a good idea how to make it more concurrency-friendly, I'm willing to put some effort into this.
> Probably, I could just add LWLocks for each offset buffer page. Is it something worth doing? Or are there any hidden caveats and difficulties?

I've created a benchmark[0] imitating MultiXact pressure on my laptop: 7 clients concurrently run "select * from table where primary_key = ANY ($1) for share", where $1 is an array of identifiers, so that each tuple in the table is locked by a different set of XIDs. During this benchmark I observe contention on MultiXactOffsetControlLock in pg_stat_activity:

                                    Friday, 8 May 2020 15:08:37 (every 1s)

  pid  |         wait_event         | wait_event_type | state  |                       query
-------+----------------------------+-----------------+--------+----------------------------------------------------
 41344 | ClientRead                 | Client          | idle   | insert into t1 select generate_series(1,1000000,1)
 41375 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
 41377 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
 41378 |                            |                 | active | select * from t1 where i = ANY ($1) for share
 41379 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
 41381 |                            |                 | active | select * from t1 where i = ANY ($1) for share
 41383 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
 41385 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
(8 rows)

Finally, the benchmark is measuring time to execute select for update 42 times.

I've gone ahead and created 3 patches:
1. Configurable SLRU buffer sizes for MultiXactOffsets and MultiXactMembers
2. Reduce locking level to shared on read of MultiXactId members
3. Configurable cache size

I've found out that:
1. When the MultiXact working set does not fit into the buffers, benchmark times grow very high. Yet very big buffers slow the benchmark down too. For this benchmark the optimal SLRU size is 32 pages for offsets and 64 pages for members (defaults are 8 and 16 respectively).
2. The lock optimisation increases performance by 5% on default SLRU sizes. Actually, the benchmark does not explicitly read MultiXactId members, but when it replaces one MultiXact with another it has to read the previous set. I understand that we can construct a benchmark to demonstrate dominance of any algorithm, and 5% on a synthetic workload is not a very big number. But it just makes sense to try to take a shared lock for reading.
3. Manipulating the cache size does not affect the benchmark at all. That is somewhat expected: the benchmark is designed to defeat the cache, and either way OffsetControlLock would not be stressed.

For our workload, I think we will just increase the SLRU buffer sizes. But the patchset may be useful for tuning and as a performance optimisation of MultiXact.

Also, MultiXacts do not seem to be a very good fit for the SLRU design. I think it would be better to use a B-tree as a container, or at least make MultiXact members extendable in place (reserve some size when a multixact is created).
When we want to extend the number of locks for a tuple, currently we will:
1. Iterate through all SLRU buffers for offsets to read the current offset (with exclusive lock for offsets)
2. Iterate through all buffers for members to find the current members (with exclusive lock for members)
3. Create a new members array with +1 xid
4. Iterate through all cache members to check whether there already is a cache item like the one we are about to create
5. Iterate over 1 again for write
6. Iterate over 2 again for write

Obviously this does not scale well - we cannot increase SLRU sizes for too long.

Thanks! I'd be happy to hear any feedback.

Best regards, Andrey Borodin.

[0] https://github.com/x4m/multixact_stress

Attachment

Re: MultiXact\SLRU buffers configuration

From
"Andrey M. Borodin"
Date:

> On 11 May 2020, at 16:17, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>
> I've gone ahead and created 3 patches:
> 1. Configurable SLRU buffer sizes for MultiXactOffsets and MultiXactMembers
> 2. Reduce locking level to shared on read of MultiXactId members
> 3. Configurable cache size

I'm looking more at MultiXact and it seems to me that we have a race condition there.

When we create a new MultiXact we do:
1. Generate new MultiXactId under MultiXactGenLock
2. Record new mxid with members and offset to WAL
3. Write offset to SLRU under MultiXactOffsetControlLock
4. Write members to SLRU under MultiXactMemberControlLock

When we read MultiXact we do:
1. Retrieve offset by mxid from SLRU under MultiXactOffsetControlLock
2. If offset is 0 - it's not filled in at step 4 of previous algorithm, we sleep and goto 1
3. Retrieve members from SLRU under MultiXactMemberControlLock
4. ..... what we do if there are just zeroes because step 4 is not executed yet? Nothing, return empty members list.
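For context, the reader side currently handles the not-yet-filled case with a sleep-and-retry loop; simplified from GetMultiXactIdMembers() in multixact.c, the relevant fragment is roughly:

    if (nextMXOffset == 0)
    {
        /* Corner case 2: next multixact is still being filled in */
        LWLockRelease(MultiXactOffsetControlLock);
        CHECK_FOR_INTERRUPTS();
        pg_usleep(1000L);       /* sleep 1ms, then retry from the top */
        goto retry;
    }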

What am I missing?

Best regards, Andrey Borodin.


Re: MultiXact\SLRU buffers configuration

From
Kyotaro Horiguchi
Date:
At Wed, 13 May 2020 23:08:37 +0500, "Andrey M. Borodin" <x4mmm@yandex-team.ru> wrote in
>
>
> > On 11 May 2020, at 16:17, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
> >
> > I've gone ahead and created 3 patches:
> > 1. Configurable SLRU buffer sizes for MultiXactOffsets and MultiXactMembers
> > 2. Reduce locking level to shared on read of MultiXactId members
> > 3. Configurable cache size
>
> I'm looking more at MultiXact and it seems to me that we have a race condition there.
>
> When we create a new MultiXact we do:
> 1. Generate new MultiXactId under MultiXactGenLock
> 2. Record new mxid with members and offset to WAL
> 3. Write offset to SLRU under MultiXactOffsetControlLock
> 4. Write members to SLRU under MultiXactMemberControlLock

But, don't we hold exclusive lock on the buffer through all the steps
above?

> When we read MultiXact we do:
> 1. Retrieve offset by mxid from SLRU under MultiXactOffsetControlLock
> 2. If offset is 0 - it's not filled in at step 4 of previous algorithm, we sleep and goto 1
> 3. Retrieve members from SLRU under MultiXactMemberControlLock
> 4. ..... what we do if there are just zeroes because step 4 is not executed yet? Nothing, return empty members list.

So transactions never see such incomplete mxids, I believe.

> What am I missing?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center



Re: MultiXact\SLRU buffers configuration

From
"Andrey M. Borodin"
Date:

> On 14 May 2020, at 06:25, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> At Wed, 13 May 2020 23:08:37 +0500, "Andrey M. Borodin" <x4mmm@yandex-team.ru> wrote in
>>
>>
>>> On 11 May 2020, at 16:17, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>>>
>>> I've gone ahead and created 3 patches:
>>> 1. Configurable SLRU buffer sizes for MultiXactOffsets and MultiXactMembers
>>> 2. Reduce locking level to shared on read of MultiXactId members
>>> 3. Configurable cache size
>>
>> I'm looking more at MultiXact and it seems to me that we have a race condition there.
>>
>> When we create a new MultiXact we do:
>> 1. Generate new MultiXactId under MultiXactGenLock
>> 2. Record new mxid with members and offset to WAL
>> 3. Write offset to SLRU under MultiXactOffsetControlLock
>> 4. Write members to SLRU under MultiXactMemberControlLock
>
> But, don't we hold exclusive lock on the buffer through all the steps
> above?
Yes... unless the MultiXact is observed on a standby. This could lead to observing an inconsistent snapshot: one of the lockers committed a tuple delete, but the standby sees the tuple as alive.

>> When we read MultiXact we do:
>> 1. Retrieve offset by mxid from SLRU under MultiXactOffsetControlLock
>> 2. If offset is 0 - it's not filled in at step 4 of previous algorithm, we sleep and goto 1
>> 3. Retrieve members from SLRU under MultiXactMemberControlLock
>> 4. ..... what we do if there are just zeroes because step 4 is not executed yet? Nothing, return empty members list.
>
> So transactions never see such incomplete mxids, I believe.
I've observed the sleep in step 2. I believe it's possible to observe the special effects of step 4 too.
Maybe we could add a lock on the standby to dismiss this 1000us wait? Sometimes it hits hard on standbys: if someone is locking a whole table on the primary, all seq scans on standbys follow it with MultiXactOffsetControlLock contention.

It looks like this:
#0  0x00007fcd56896ff7 in __GI___select (nfds=nfds@entry=0, readfds=readfds@entry=0x0, writefds=writefds@entry=0x0,
exceptfds=exceptfds@entry=0x0,timeout=timeout@entry=0x7ffd83376fe0) at ../sysdeps/unix/sysv/linux/select.c:41 
#1  0x000056186e0d54bd in pg_usleep (microsec=microsec@entry=1000) at ./build/../src/port/pgsleep.c:56
#2  0x000056186dd5edf2 in GetMultiXactIdMembers (from_pgupgrade=0 '\000', onlyLock=<optimized out>,
members=0x7ffd83377080,multi=3106214809) at ./build/../src/backend/access/transam/multixact.c:1370 
#3  GetMultiXactIdMembers () at ./build/../src/backend/access/transam/multixact.c:1202
#4  0x000056186dd2d2d9 in MultiXactIdGetUpdateXid (xmax=<optimized out>, t_infomask=<optimized out>) at
./build/../src/backend/access/heap/heapam.c:7039
#5  0x000056186dd35098 in HeapTupleGetUpdateXid (tuple=tuple@entry=0x7fcba3b63d58) at
./build/../src/backend/access/heap/heapam.c:7080
#6  0x000056186e0cd0f8 in HeapTupleSatisfiesMVCC (htup=<optimized out>, snapshot=0x56186f44a058, buffer=230684) at
./build/../src/backend/utils/time/tqual.c:1091
#7  0x000056186dd2d922 in heapgetpage (scan=scan@entry=0x56186f4c8e78, page=page@entry=3620) at
./build/../src/backend/access/heap/heapam.c:439
#8  0x000056186dd2ea7c in heapgettup_pagemode (key=0x0, nkeys=0, dir=ForwardScanDirection, scan=0x56186f4c8e78) at
./build/../src/backend/access/heap/heapam.c:1034
#9  heap_getnext (scan=scan@entry=0x56186f4c8e78, direction=direction@entry=ForwardScanDirection) at
./build/../src/backend/access/heap/heapam.c:1801
#10 0x000056186de84f51 in SeqNext (node=node@entry=0x56186f4a4f78) at ./build/../src/backend/executor/nodeSeqscan.c:81
#11 0x000056186de6a3f1 in ExecScanFetch (recheckMtd=0x56186de84ef0 <SeqRecheck>, accessMtd=0x56186de84f20 <SeqNext>,
node=0x56186f4a4f78)at ./build/../src/backend/executor/execScan.c:97 
#12 ExecScan (node=0x56186f4a4f78, accessMtd=0x56186de84f20 <SeqNext>, recheckMtd=0x56186de84ef0 <SeqRecheck>) at
./build/../src/backend/executor/execScan.c:164


Best regards, Andrey Borodin.


Re: MultiXact\SLRU buffers configuration

From
Kyotaro Horiguchi
Date:
At Thu, 14 May 2020 10:19:42 +0500, "Andrey M. Borodin" <x4mmm@yandex-team.ru> wrote in 
> >> I'm looking more at MultiXact and it seems to me that we have a race condition there.
> >> 
> >> When we create a new MultiXact we do:
> >> 1. Generate new MultiXactId under MultiXactGenLock
> >> 2. Record new mxid with members and offset to WAL
> >> 3. Write offset to SLRU under MultiXactOffsetControlLock
> >> 4. Write members to SLRU under MultiXactMemberControlLock
> > 
> > But, don't we hold exclusive lock on the buffer through all the steps
> > above?
> Yes... unless the MultiXact is observed on a standby. This could lead to observing an inconsistent snapshot: one of the lockers committed a tuple delete, but the standby sees the tuple as alive.
 

Ah, right. I looked from GetNewMultiXactId. Actually
XLOG_MULTIXACT_CREATE_ID is not protected from concurrent reference to
the creating mxact id. And GetMultiXactIdMembers is considering that
case.

> >> When we read MultiXact we do:
> >> 1. Retrieve offset by mxid from SLRU under MultiXactOffsetControlLock
> >> 2. If offset is 0 - it's not filled in at step 4 of previous algorithm, we sleep and goto 1
> >> 3. Retrieve members from SLRU under MultiXactMemberControlLock
> >> 4. ..... what we do if there are just zeroes because step 4 is not executed yet? Nothing, return empty members list.
> > 
> > So transactions never see such incomplete mxids, I believe.
> I've observed the sleep in step 2. I believe it's possible to observe the special effects of step 4 too.
> Maybe we could add a lock on the standby to dismiss this 1000us wait? Sometimes it hits hard on standbys: if someone is locking a whole table on the primary, all seq scans on standbys follow it with MultiXactOffsetControlLock contention.
 

GetMultiXactIdMembers believes that 4 is successfully done if 2
returned valid offset, but actually that is not obvious.

If we add a single giant lock just to isolate, say,
GetMultiXactIdMembers and RecordNewMultiXact, it reduces concurrency
unnecessarily.  Perhaps we need a finer-grained locking key for standby
that works similarly to the buffer lock on primary, and that doesn't
cause conflicts between irrelevant mxids.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: MultiXact\SLRU buffers configuration

From
"Andrey M. Borodin"
Date:

> On 14 May 2020, at 11:16, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> At Thu, 14 May 2020 10:19:42 +0500, "Andrey M. Borodin" <x4mmm@yandex-team.ru> wrote in
>>>> I'm looking more at MultiXact and it seems to me that we have a race condition there.
>>>>
>>>> When we create a new MultiXact we do:
>>>> 1. Generate new MultiXactId under MultiXactGenLock
>>>> 2. Record new mxid with members and offset to WAL
>>>> 3. Write offset to SLRU under MultiXactOffsetControlLock
>>>> 4. Write members to SLRU under MultiXactMemberControlLock
>>>
>>> But, don't we hold exclusive lock on the buffer through all the steps
>>> above?
>> Yes... unless the MultiXact is observed on a standby. This could lead to observing an inconsistent snapshot: one of the lockers committed a tuple delete, but the standby sees the tuple as alive.
>
> Ah, right. I looked from GetNewMultiXactId. Actually
> XLOG_MULTIXACT_CREATE_ID is not protected from concurrent reference to
> the creating mxact id. And GetMultiXactIdMembers is considering that
> case.
>
>>>> When we read MultiXact we do:
>>>> 1. Retrieve offset by mxid from SLRU under MultiXactOffsetControlLock
>>>> 2. If offset is 0 - it's not filled in at step 4 of previous algorithm, we sleep and goto 1
>>>> 3. Retrieve members from SLRU under MultiXactMemberControlLock
>>>> 4. ..... what we do if there are just zeroes because step 4 is not executed yet? Nothing, return empty members list.
>>>
>>> So transactions never see such incomplete mxids, I believe.
>> I've observed the sleep in step 2. I believe it's possible to observe the special effects of step 4 too.
>> Maybe we could add a lock on the standby to dismiss this 1000us wait? Sometimes it hits hard on standbys: if someone is locking a whole table on the primary, all seq scans on standbys follow it with MultiXactOffsetControlLock contention.
>
> GetMultiXactIdMembers believes that 4 is successfully done if 2
> returned valid offset, but actually that is not obvious.
>
> If we add a single giant lock just to isolate, say,
> GetMultiXactIdMembers and RecordNewMultiXact, it reduces concurrency
> unnecessarily.  Perhaps we need a finer-grained locking key for standby
> that works similarly to the buffer lock on primary, and that doesn't
> cause conflicts between irrelevant mxids.
>
We can just replay members before offsets. If the offset is already there, the members are there too.
But I'd be happy if we could mitigate those 1000us too - with a hint about the last mxid state in a shared MX state, for example.

Actually, if we read an empty mxid array instead of something that is just being replayed, it's not a problem of inconsistency, because the transaction in this mxid could not have committed before we started. ISTM.
So instead of a fix, we can probably just add a comment. If this reasoning is correct.

Best regards, Andrey Borodin.


Re: MultiXact\SLRU buffers configuration

From
Kyotaro Horiguchi
Date:
At Thu, 14 May 2020 11:44:01 +0500, "Andrey M. Borodin" <x4mmm@yandex-team.ru> wrote in 
> > GetMultiXactIdMembers believes that 4 is successfully done if 2
> > returned valid offset, but actually that is not obvious.
> > 
> > If we add a single giant lock just to isolate, say,
> > GetMultiXactIdMembers and RecordNewMultiXact, it reduces concurrency
> > unnecessarily.  Perhaps we need a finer-grained locking key for standby
> > that works similarly to the buffer lock on primary, and that doesn't
> > cause conflicts between irrelevant mxids.
> > 
> We can just replay members before offsets. If the offset is already there, the members are there too.
> But I'd be happy if we could mitigate those 1000us too - with a hint about the last mxid state in a shared MX state, for example.

Generally in such cases, condition variables would work.  In the
attached PoC, the reader side gets no penalty in the "likely" code
path.  The writer side always calls ConditionVariableBroadcast but the
waiter list is empty in almost all cases.  But I couldn't cause the
situation where the sleep 1000u is reached.

> Actually, if we read an empty mxid array instead of something that is just being replayed, it's not a problem of inconsistency, because the transaction in this mxid could not have committed before we started. ISTM.
> So instead of a fix, we can probably just add a comment. If this reasoning is correct.

The step 4 of the reader side reads the members of the target mxid. It
is already written if the offset of the *next* mxid is filled-in.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index e2aa5c9ce4..9db8f6cddd 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -82,6 +82,7 @@
 #include "lib/ilist.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
+#include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "storage/lmgr.h"
 #include "storage/pmsignal.h"
@@ -233,6 +234,7 @@ typedef struct MultiXactStateData
     /* support for members anti-wraparound measures */
     MultiXactOffset offsetStopLimit;    /* known if oldestOffsetKnown */
 
+    ConditionVariable nextoff_cv;
     /*
      * Per-backend data starts here.  We have two arrays stored in the area
      * immediately following the MultiXactStateData struct. Each is indexed by
@@ -873,6 +875,14 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
     /* Exchange our lock */
     LWLockRelease(MultiXactOffsetControlLock);
 
+    /*
+     *  Let everybody know the offset of this mxid is recorded now. The waiters
+     *  are waiting for the offset of the mxid next of the target to know the
+     *  number of members of the target mxid, so we don't need to wait for
+     *  members of this mxid are recorded.
+     */
+    ConditionVariableBroadcast(&MultiXactState->nextoff_cv);
+
     LWLockAcquire(MultiXactMemberControlLock, LW_EXCLUSIVE);
 
     prev_pageno = -1;
@@ -1367,9 +1377,19 @@ retry:
         if (nextMXOffset == 0)
         {
             /* Corner case 2: next multixact is still being filled in */
+
+            /*
+             * The recorder of the next mxid is just before writing the offset.
+             * Wait for the offset to be written.
+             */
+            ConditionVariablePrepareToSleep(&MultiXactState->nextoff_cv);
+
             LWLockRelease(MultiXactOffsetControlLock);
             CHECK_FOR_INTERRUPTS();
-            pg_usleep(1000L);
+
+            ConditionVariableSleep(&MultiXactState->nextoff_cv,
+                                   WAIT_EVENT_WAIT_NEXT_MXMEMBERS);
+            ConditionVariableCancelSleep();
             goto retry;
         }
 
@@ -1847,6 +1867,7 @@ MultiXactShmemInit(void)
 
         /* Make sure we zero out the per-backend state */
         MemSet(MultiXactState, 0, SHARED_MULTIXACT_STATE_SIZE);
+        ConditionVariableInit(&MultiXactState->nextoff_cv);
     }
     else
         Assert(found);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 0ecd29a1d9..1ac6b37188 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3879,6 +3879,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
         case WAIT_EVENT_SYNC_REP:
             event_name = "SyncRep";
             break;
+        case WAIT_EVENT_WAIT_NEXT_MXMEMBERS:
+            event_name = "Mact/WaitNextXactMembers";
+            break;
             /* no default case, so that compiler will warn */
     }
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index ae9a39573c..e79bba0bef 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -886,7 +886,8 @@ typedef enum
     WAIT_EVENT_REPLICATION_ORIGIN_DROP,
     WAIT_EVENT_REPLICATION_SLOT_DROP,
     WAIT_EVENT_SAFE_SNAPSHOT,
-    WAIT_EVENT_SYNC_REP
+    WAIT_EVENT_SYNC_REP,
+    WAIT_EVENT_WAIT_NEXT_MXMEMBERS
 } WaitEventIPC;
 
 /* ----------

Re: MultiXact\SLRU buffers configuration

From
"Andrey M. Borodin"
Date:

> On 15 May 2020, at 05:03, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> At Thu, 14 May 2020 11:44:01 +0500, "Andrey M. Borodin" <x4mmm@yandex-team.ru> wrote in
>>> GetMultiXactIdMembers believes that 4 is successfully done if 2
>>> returned valid offset, but actually that is not obvious.
>>>
>>> If we add a single giant lock just to isolate, say,
>>> GetMultiXactIdMembers and RecordNewMultiXact, it reduces concurrency
>>> unnecessarily.  Perhaps we need a finer-grained locking key for standby
>>> that works similarly to the buffer lock on primary, and that doesn't
>>> cause conflicts between irrelevant mxids.
>>>
>> We can just replay members before offsets. If the offset is already there, the members are there too.
>> But I'd be happy if we could mitigate those 1000us too - with a hint about the last mxid state in a shared MX state, for example.
>
> Generally in such cases, condition variables would work.  In the
> attached PoC, the reader side gets no penalty in the "likely" code
> path.  The writer side always calls ConditionVariableBroadcast but the
> waiter list is empty in almost all cases.  But I couldn't cause the
> situation where the sleep 1000u is reached.
Thanks! That really looks like a good solution without magic timeouts. Beautiful!
I think I can create a temporary extension which calls the MultiXact API and tests edge cases like this 1000us wait.
This extension will also be useful for me to assess the impact of bigger buffers, reduced read locking (as in my 2nd patch) and other tweaks.

>> Actually, if we read an empty mxid array instead of something that is just being replayed, it's not a problem of inconsistency, because the transaction in this mxid could not have committed before we started. ISTM.
>> So instead of a fix, we can probably just add a comment. If this reasoning is correct.
>
> The step 4 of the reader side reads the members of the target mxid. It
> is already written if the offset of the *next* mxid is filled-in.
Most often - yes, but members are not guaranteed to be filled in order. Those who win MXMemberControlLock will write
first.
But nobody can read members of MXID before it is returned. And its members will be written before returning MXID.

Best regards, Andrey Borodin.


Re: MultiXact\SLRU buffers configuration

From
Kyotaro Horiguchi
Date:
At Fri, 15 May 2020 14:01:46 +0500, "Andrey M. Borodin" <x4mmm@yandex-team.ru> wrote in
>
>
> > On 15 May 2020, at 05:03, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> >
> > At Thu, 14 May 2020 11:44:01 +0500, "Andrey M. Borodin" <x4mmm@yandex-team.ru> wrote in
> >>> GetMultiXactIdMembers believes that 4 is successfully done if 2
> >>> returned valid offset, but actually that is not obvious.
> >>>
> >>> If we add a single giant lock just to isolate, say,
> >>> GetMultiXactIdMembers and RecordNewMultiXact, it reduces concurrency
> >>> unnecessarily.  Perhaps we need a finer-grained locking key for standby
> >>> that works similarly to the buffer lock on primary, and that doesn't
> >>> cause conflicts between irrelevant mxids.
> >>>
> >> We can just replay members before offsets. If the offset is already there, the members are there too.
> >> But I'd be happy if we could mitigate those 1000us too - with a hint about the last mxid state in a shared MX state, for example.
> >
> > Generally in such cases, condition variables would work.  In the
> > attached PoC, the reader side gets no penalty in the "likely" code
> > path.  The writer side always calls ConditionVariableBroadcast but the
> > waiter list is empty in almost all cases.  But I couldn't cause the
> > situation where the sleep 1000u is reached.
> Thanks! That really looks like a good solution without magic timeouts. Beautiful!
> I think I can create a temporary extension which calls the MultiXact API and tests edge cases like this 1000us wait.
> This extension will also be useful for me to assess the impact of bigger buffers, reduced read locking (as in my 2nd patch) and other tweaks.

Happy to hear that. It would need to use a timeout just in case, though.

> >> Actually, if we read an empty mxid array instead of something that is just being replayed, it's not a problem of inconsistency, because the transaction in this mxid could not have committed before we started. ISTM.
> >> So instead of a fix, we can probably just add a comment. If this reasoning is correct.
> >
> > The step 4 of the reader side reads the members of the target mxid. It
> > is already written if the offset of the *next* mxid is filled-in.
> Most often - yes, but members are not guaranteed to be filled in order. Those who win MXMemberControlLock will write first.
> But nobody can read members of MXID before it is returned. And its members will be written before returning MXID.

Yeah, right.  Otherwise assertion failure happens.


regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center



Re: MultiXact\SLRU buffers configuration

From
Daniel Gustafsson
Date:
> On 15 May 2020, at 02:03, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

> Generally in such cases, condition variables would work.  In the
> attached PoC, the reader side gets no penalty in the "likely" code
> path.  The writer side always calls ConditionVariableBroadcast but the
> waiter list is empty in almost all cases.  But I couldn't cause the
> situation where the sleep 1000u is reached.

The submitted patch no longer applies, can you please submit an updated
version?  I'm marking the patch Waiting on Author in the meantime.

cheers ./daniel


Re: MultiXact\SLRU buffers configuration

From
"Andrey M. Borodin"
Date:

> On 2 July 2020, at 17:02, Daniel Gustafsson <daniel@yesql.se> wrote:
>
>> On 15 May 2020, at 02:03, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
>> Generally in such cases, condition variables would work.  In the
>> attached PoC, the reader side gets no penalty in the "likely" code
>> path.  The writer side always calls ConditionVariableBroadcast but the
>> waiter list is empty in almost all cases.  But I couldn't cause the
>> situation where the sleep 1000u is reached.
>
> The submitted patch no longer applies, can you please submit an updated
> version?  I'm marking the patch Waiting on Author in the meantime.
Thanks, Daniel! PFA V2

Best regards, Andrey Borodin.



Attachment

Re: MultiXact\SLRU buffers configuration

From
Daniel Gustafsson
Date:

> On 8 Jul 2020, at 09:03, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>
>
>
>> On 2 July 2020, at 17:02, Daniel Gustafsson <daniel@yesql.se> wrote:
>>
>>> On 15 May 2020, at 02:03, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>>
>>> Generally in such cases, condition variables would work.  In the
>>> attached PoC, the reader side gets no penalty in the "likely" code
>>> path.  The writer side always calls ConditionVariableBroadcast but the
>>> waiter list is empty in almost all cases.  But I couldn't cause the
>>> situation where the sleep 1000u is reached.
>>
>> The submitted patch no longer applies, can you please submit an updated
>> version?  I'm marking the patch Waiting on Author in the meantime.
> Thanks, Daniel! PFA V2

This version too has stopped applying according to the CFbot.  I've moved it to
the next commitfest as we're out of time in this one and it was only pointed
out now, but kept it Waiting on Author.

cheers ./daniel


Re: MultiXact\SLRU buffers configuration

From
Anastasia Lubennikova
Date:
On 08.07.2020 10:03, Andrey M. Borodin wrote:
>
>> On 2 July 2020, at 17:02, Daniel Gustafsson <daniel@yesql.se> wrote:
>>
>>> On 15 May 2020, at 02:03, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>>> Generally in such cases, condition variables would work.  In the
>>> attached PoC, the reader side gets no penalty in the "likely" code
>>> path.  The writer side always calls ConditionVariableBroadcast but the
>>> waiter list is empty in almost all cases.  But I couldn't cause the
>>> situation where the sleep 1000u is reached.
>> The submitted patch no longer applies, can you please submit an updated
>> version?  I'm marking the patch Waiting on Author in the meantime.
> Thanks, Daniel! PFA V2
>
> Best regards, Andrey Borodin.

1) The first patch is sensible and harmless, so I think it is ready for 
committer. I haven't tested the performance impact, though.

2) I like the initial proposal to make various SLRU buffers 
configurable, however, I doubt if it really solves the problem, or just 
moves it to another place?

The previous patch you sent was based on some version that contained 
changes for other SLRU buffer numbers: 'multixact_offsets_slru_buffers' 
and 'multixact_members_slru_buffers'. Have you just forgotten to attach 
them? The patch message "[PATCH v2 2/4]" hints that you had 4 patches)
Meanwhile, I attach the rebased patch to calm down the CFbot. The 
changes are trivial.

2.1) I think that both min and max values for this parameter are too 
extreme. Have you tested them?

+               &multixact_local_cache_entries,
+               256, 2, INT_MAX / 2,

2.2) MAX_CACHE_ENTRIES is not used anymore, so it can be deleted.

3) No changes for third patch. I just renamed it for consistency.

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment

Re: MultiXact\SLRU buffers configuration

From
"Andrey M. Borodin"
Date:
Hi, Anastasia!

> On 28 August 2020, at 23:08, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
>
> 1) The first patch is sensible and harmless, so I think it is ready for committer. I haven't tested the performance impact, though.
>
> 2) I like the initial proposal to make various SLRU buffers configurable, however, I doubt if it really solves the problem, or just moves it to another place?
>
> The previous patch you sent was based on some version that contained changes for other SLRU buffer numbers: 'multixact_offsets_slru_buffers' and 'multixact_members_slru_buffers'. Have you just forgotten to attach them? The patch message "[PATCH v2 2/4]" hints that you had 4 patches)
> Meanwhile, I attach the rebased patch to calm down the CFbot. The changes are trivial.
>
> 2.1) I think that both min and max values for this parameter are too extreme. Have you tested them?
>
> +               &multixact_local_cache_entries,
> +               256, 2, INT_MAX / 2,
>
> 2.2) MAX_CACHE_ENTRIES is not used anymore, so it can be deleted.
>
> 3) No changes for third patch. I just renamed it for consistency.

Thank you for your review.

Indeed, I had a 4th patch with tests, but these tests didn't work well: I still did not manage to stress SLRUs to reproduce the problem from production...

You are absolutely correct in point 2: I only tested with sane values, and observed extreme performance degradation with values of ~64 megabytes. I do not know which maximum value we should pick. 1GB? Or the highest value that still functions?

I greatly appreciate your review, sorry for so long delay. Thanks!

Best regards, Andrey Borodin.


Re: MultiXact\SLRU buffers configuration

From
Anastasia Lubennikova
Date:
On 28.09.2020 17:41, Andrey M. Borodin wrote:
> Hi, Anastasia!
>
>> On 28 August 2020, at 23:08, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
>>
>> 1) The first patch is sensible and harmless, so I think it is ready for committer. I haven't tested the performance impact, though.
 
>>
>> 2) I like the initial proposal to make various SLRU buffers configurable, however, I doubt if it really solves the problem, or just moves it to another place?
 
>>
>> The previous patch you sent was based on some version that contained changes for other SLRU buffer numbers: 'multixact_offsets_slru_buffers' and 'multixact_members_slru_buffers'. Have you just forgotten to attach them? The patch message "[PATCH v2 2/4]" hints that you had 4 patches)
 
>> Meanwhile, I attach the rebased patch to calm down the CFbot. The changes are trivial.
>>
>> 2.1) I think that both min and max values for this parameter are too extreme. Have you tested them?
>>
>> +               &multixact_local_cache_entries,
>> +               256, 2, INT_MAX / 2,
>>
>> 2.2) MAX_CACHE_ENTRIES is not used anymore, so it can be deleted.
>>
>> 3) No changes for third patch. I just renamed it for consistency.
> Thank you for your review.
>
> Indeed, I had a 4th patch with tests, but these tests didn't work well: I still did not manage to stress SLRUs to reproduce the problem from production...
 
>
> You are absolutely correct in point 2: I only tested with sane values, and observed extreme performance degradation with values of ~64 megabytes. I do not know which maximum value we should pick. 1GB? Or the highest value that still functions?
 

I would go with the values that we consider adequate for this setting. 
As I see, there is no strict rule about it in guc.c and many variables 
have large border values. Anyway, we need to explain it at least in the 
documentation and code comments.

It seems that the default was conservative enough, so it can also serve 
as the minimal value. As for the maximum, can you provide any benchmark 
results? If we have a peak and a noticeable performance degradation 
after that, we can use it to calculate the preferable max value.

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: MultiXact\SLRU buffers configuration

From
Alexander Korotkov
Date:
Hi!

On Mon, Sep 28, 2020 at 5:41 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
> > On 28 August 2020, at 23:08, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
> >
> > 1) The first patch is sensible and harmless, so I think it is ready for committer. I haven't tested the performance impact, though.
> >
> > 2) I like the initial proposal to make various SLRU buffers configurable, however, I doubt if it really solves the problem, or just moves it to another place?
> >
> > The previous patch you sent was based on some version that contained changes for other SLRU buffer numbers: 'multixact_offsets_slru_buffers' and 'multixact_members_slru_buffers'. Have you just forgotten to attach them? The patch message "[PATCH v2 2/4]" hints that you had 4 patches)
> > Meanwhile, I attach the rebased patch to calm down the CFbot. The changes are trivial.
> >
> > 2.1) I think that both min and max values for this parameter are too extreme. Have you tested them?
> >
> > +               &multixact_local_cache_entries,
> > +               256, 2, INT_MAX / 2,
> >
> > 2.2) MAX_CACHE_ENTRIES is not used anymore, so it can be deleted.
> >
> > 3) No changes for third patch. I just renamed it for consistency.
>
> Thank you for your review.
>
> Indeed, I had a 4th patch with tests, but these tests didn't work well: I still did not manage to stress SLRUs to reproduce the problem from production...
>
> You are absolutely correct in point 2: I only tested with sane values, and observed extreme performance degradation with values of ~64 megabytes. I do not know which maximum value we should pick. 1GB? Or the highest value that still functions?
>
> I greatly appreciate your review, sorry for so long delay. Thanks!

I took a look at this patchset.

The 1st and 3rd patches look good to me.  I made just minor improvements.
1) There is still a case when SimpleLruReadPage_ReadOnly() relocks the
SLRU lock, which is already taken in exclusive mode.  I've evaded this
by passing the lock mode as a parameter to
SimpleLruReadPage_ReadOnly().
3) CHECK_FOR_INTERRUPTS() is not needed anymore, because it's called
inside ConditionVariableSleep() if needed.  Also, no current wait
events use slashes, and I don't think we should introduce slashes
here.  Even if we should, then we should also rename existing wait
events to be consistent with a new one.  So, I've renamed the new wait
event to remove the slash.
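To illustrate point 1: the stock declaration in slru.h is

    extern int SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid);

and the change is, roughly, to let the function report which control-lock mode the caller ends up holding, along these lines (just an illustration, see the attached patch for the real signature):

    extern int SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid,
                                          LWLockMode *lockmode_held);

so a caller that already holds, or is going to need, LW_EXCLUSIVE is not relocked needlessly.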

Regarding the patch 2.  I see the current documentation in the patch
doesn't explain to the user how to set the new parameter.  I think we
should give users an idea what workloads need high values of
multixact_local_cache_entries parameter and what doesn't.  Also, we
should explain the negative aspects of high values
multixact_local_cache_entries.  Ideally, we should get the advantage
without overloading users with new nontrivial parameters, but I don't
have a particular idea here.

I'd like to propose committing 1 and 3, but leave 2 for further review.

------
Regards,
Alexander Korotkov

Attachment

Re: MultiXact\SLRU buffers configuration

From
Andrey Borodin
Date:

> On 26 October 2020, at 06:05, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> Hi!
>
> On Mon, Sep 28, 2020 at 5:41 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>>> On 28 August 2020, at 23:08, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
>>>
>>> 1) The first patch is sensible and harmless, so I think it is ready for committer. I haven't tested the performance impact, though.
>>>
>>> 2) I like the initial proposal to make various SLRU buffers configurable, however, I doubt if it really solves the problem, or just moves it to another place?
>>>
>>> The previous patch you sent was based on some version that contained changes for other SLRU buffer numbers: 'multixact_offsets_slru_buffers' and 'multixact_members_slru_buffers'. Have you just forgotten to attach them? The patch message "[PATCH v2 2/4]" hints that you had 4 patches)
>>> Meanwhile, I attach the rebased patch to calm down the CFbot. The changes are trivial.
>>>
>>> 2.1) I think that both min and max values for this parameter are too extreme. Have you tested them?
>>>
>>> +               &multixact_local_cache_entries,
>>> +               256, 2, INT_MAX / 2,
>>>
>>> 2.2) MAX_CACHE_ENTRIES is not used anymore, so it can be deleted.
>>>
>>> 3) No changes for third patch. I just renamed it for consistency.
>>
>> Thank you for your review.
>>
>> Indeed, I had a 4th patch with tests, but these tests didn't work well: I still did not manage to stress SLRUs to reproduce the problem from production...
>>
>> You are absolutely correct in point 2: I only tested with sane values, and observed extreme performance degradation with values of ~64 megabytes. I do not know which maximum value we should pick. 1GB? Or the highest value that still functions?
>>
>> I greatly appreciate your review, sorry for so long delay. Thanks!
>
> I took a look at this patchset.
>
> The 1st and 3rd patches look good to me.  I made just minor improvements.
> 1) There is still a case when SimpleLruReadPage_ReadOnly() relocks the
> SLRU lock, which is already taken in exclusive mode.  I've evaded this
> by passing the lock mode as a parameter to
> SimpleLruReadPage_ReadOnly().
> 3) CHECK_FOR_INTERRUPTS() is not needed anymore, because it's called
> inside ConditionVariableSleep() if needed.  Also, no current wait
> events use slashes, and I don't think we should introduce slashes
> here.  Even if we should, then we should also rename existing wait
> events to be consistent with a new one.  So, I've renamed the new wait
> event to remove the slash.
>
> Regarding the patch 2.  I see the current documentation in the patch
> doesn't explain to the user how to set the new parameter.  I think we
> should give users an idea what workloads need high values of
> multixact_local_cache_entries parameter and what doesn't.  Also, we
> should explain the negative aspects of high values
> multixact_local_cache_entries.  Ideally, we should get the advantage
> without overloading users with new nontrivial parameters, but I don't
> have a particular idea here.
>
> I'd like to propose committing 1 and 3, but leave 2 for further review.

Thanks for your review, Alexander!
+1 for avoiding double locking in SimpleLruReadPage_ReadOnly().
Other changes seem correct to me too.


I've tried to find the optimal value for the cache size, and it seems to me that it affects multixact scalability much less than the sizes of the offsets\members buffers. I concur that patch 2 of the patchset does not seem documented enough.

Best regards, Andrey Borodin.


Re: MultiXact\SLRU buffers configuration

From
Alexander Korotkov
Date:
On Mon, Oct 26, 2020 at 6:45 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> Thanks for your review, Alexander!
> +1 for avoiding double locking in SimpleLruReadPage_ReadOnly().
> Other changes seem correct to me too.
>
>
> I've tried to find the optimal value for the cache size, and it seems to me that it affects multixact scalability much less than the sizes of the offsets\members buffers. I concur that patch 2 of the patchset does not seem documented enough.
 

Thank you.  I've made a few more minor adjustments to the patchset.
I'm going to push 0001 and 0003 if there are no objections.

------
Regards,
Alexander Korotkov

Attachment

Re: MultiXact\SLRU buffers configuration

From
Alexander Korotkov
Date:
On Tue, Oct 27, 2020 at 8:02 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Mon, Oct 26, 2020 at 6:45 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > Thanks for your review, Alexander!
> > +1 for avoiding double locking in SimpleLruReadPage_ReadOnly().
> > Other changes seem correct to me too.
> >
> >
> > I've tried to find the optimal value for the cache size, and it seems to me that it affects multixact scalability much less than the sizes of the offsets\members buffers. I concur that patch 2 of the patchset does not seem documented enough.
 
>
> Thank you.  I've made a few more minor adjustments to the patchset.
> I'm going to push 0001 and 0003 if there are no objections.

I see that patchset v5 doesn't pass the tests due to a typo in an assert.
The fixed version is attached.

------
Regards,
Alexander Korotkov

Attachment

Re: MultiXact\SLRU buffers configuration

From
Tomas Vondra
Date:
On Tue, Oct 27, 2020 at 08:23:26PM +0300, Alexander Korotkov wrote:
>On Tue, Oct 27, 2020 at 8:02 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>> On Mon, Oct 26, 2020 at 6:45 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>> > Thanks for your review, Alexander!
>> > +1 for avoiding double locking in SimpleLruReadPage_ReadOnly().
>> > Other changes seem correct to me too.
>> >
>> >
>> > I've tried to find the optimal value for the cache size, and it seems to me that it affects multixact scalability much less than the sizes of the offsets\members buffers. I concur that patch 2 of the patchset does not seem documented enough.
 
>>
>> Thank you.  I've made a few more minor adjustments to the patchset.
>> I'm going to push 0001 and 0003 if there are no objections.
>
>I see that patchset v5 doesn't pass the tests due to a typo in an assert.
>The fixed version is attached.
>

I did a quick review on this patch series. A couple comments:


0001
----

This looks quite suspicious to me - SimpleLruReadPage_ReadOnly is
changed to return information about what lock was used, merely to allow
the callers to do an Assert() that the value is not LW_NONE.

IMO we could achieve exactly the same thing by passing a simple flag
that would say 'make sure we got a lock' or something like that. In
fact, aren't all callers doing the assert? That'd mean we can just do
the check always, without the flag. (I see GetMultiXactIdMembers does
two calls and only checks the second result, but I wonder if that's
intended or omission.)

In any case, it'd make the lwlock.c changes unnecessary, I think.


0002
----

Specifies the number cached MultiXact by backend. Any SLRU lookup ...

should be 'number of cached ...'


0003
----

     * Conditional variable for waiting till the filling of the next multixact
     * will be finished.  See GetMultiXactIdMembers() and RecordNewMultiXact()
     * for details.

Perhaps 'till the next multixact is filled' or 'gets full' would be
better. Not sure.


This thread started with a discussion about making the SLRU sizes
configurable, but this patch version only adds a local cache. Does this
achieve the same goal, or would we still gain something by having GUCs
for the SLRUs?

If we're claiming this improves performance, it'd be good to have some
workload demonstrating that and measurements. I don't see anything like
that in this thread, so it's a bit hand-wavy. Can someone share details
of such workload (even synthetic one) and some basic measurements?


regards


-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: MultiXact\SLRU buffers configuration

From
Andrey Borodin
Date:
Tomas, thanks for looking into this!

> On 28 October 2020, at 06:36, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
>
> This thread started with a discussion about making the SLRU sizes
> configurable, but this patch version only adds a local cache. Does this
> achieve the same goal, or would we still gain something by having GUCs
> for the SLRUs?
>
> If we're claiming this improves performance, it'd be good to have some
> workload demonstrating that and measurements. I don't see anything like
> that in this thread, so it's a bit hand-wavy. Can someone share details
> of such workload (even synthetic one) and some basic measurements?

All patches in this thread aim at the same goal: improve performance in the presence of MultiXact lock contention.
I could not build a synthetic reproduction of the problem, however I did some MultiXact stressing here [0]. It's a clumsy test program, because it is still not clear to me which workload parameters trigger MultiXact lock contention. In the generic case I was encountering other locks like *GenLock: XidGenLock, MultixactGenLock etc. Yet our production system has encountered this problem approximately once a month throughout this year.

The test program locks for share different sets of tuples in the presence of concurrent full scans.
To produce a set of locks we choose one of 14 bits. If a row number has this bit set to 0, we lock that row.
I've been measuring the time to lock all rows, 3 times for each of the 14 bits, observing the total time to set all locks.
During the test I was observing locks in pg_stat_activity; if they did not contain enough MultiXact locks I was tuning parameters further (number of concurrent clients, number of bits, select queries etc).

Why is it so complicated? It seems that other reproductions of a problem were encountering other locks.

Lets describe patches in this thread from the POV of these test.

*** Configurable SLRU buffers for MultiXact members and offsets.
From the tests it is clear that both high and low values for these buffers affect the test time. Here are timings for one test run with different offsets and members sizes [1].
Our production currently runs with (numbers are pages of buffers)
+#define NUM_MXACTOFFSET_BUFFERS 32
+#define NUM_MXACTMEMBER_BUFFERS 64
And, looking back at the incidents in summer and fall 2020, it seems like it mostly helped.

But it's hard to give tuning advice based on the test results. Values (32,64) produce a 10% better result than the current hardcoded values (8,16). In the generic case this is not what someone should tune first.

*** Configurable caches of MultiXacts.
The tests were specifically designed to defeat caches. So, according to the test, the bigger the cache is, the more time it takes to accomplish the test [2].
Anyway, the cache is local to a backend and its purpose is deduplication of written MultiXacts, not enhancing reads.

*** Using advantage of SimpleLruReadPage_ReadOnly() in MultiXacts.
This simply aligns MultiXact with Subtransactions and CLOG. Other SLRUs already take advantage of reading the SLRU under a shared lock.
On synthetic tests without background selects this patch adds another ~4.7% of performance [3] against [4]. This improvement seems consistent across different parameter values, yet within measurement deviation (see the difference between the warmup run [5] and the closing run [6]).
All in all, these attempts to measure the impact are hand-wavy too. But it makes sense to use a consistent approach among similar subsystems (MultiXacts, Subtrans, CLOG etc).

*** Reduce sleep in GetMultiXactIdMembers() on standby.
The problem with pg_usleep(1000L) within GetMultiXactIdMembers() manifests on standbys during contention on MultiXactOffsetControlLock. It's even harder to reproduce.
Yet it seems obvious that reducing the sleep to a shorter time frame will reduce the number of sleeping backends.

For consistency I've returned the patch with SLRU buffer configs to the patchset (other patches are intact). But I'm mostly concerned about patches 1 and 3.

Thanks!

Best regards, Andrey Borodin.

[0] https://github.com/x4m/multixact_stress
[1] https://github.com/x4m/multixact_stress/blob/master/testresults.txt#L22-L39
[2] https://github.com/x4m/multixact_stress/blob/master/testresults.txt#L83-L99
[3] https://github.com/x4m/multixact_stress/blob/master/testresults.txt#L9
[4] https://github.com/x4m/multixact_stress/blob/master/testresults.txt#L29
[5] https://github.com/x4m/multixact_stress/blob/master/testresults.txt#L3
[6] https://github.com/x4m/multixact_stress/blob/master/testresults.txt#L19


Attachment

Re: MultiXact\SLRU buffers configuration

From
Alexander Korotkov
Date:
Hi, Tomas!

Thank you for your review.

On Wed, Oct 28, 2020 at 4:36 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> I did a quick review on this patch series. A couple comments:
>
>
> 0001
> ----
>
> This looks quite suspicious to me - SimpleLruReadPage_ReadOnly is
> changed to return information about what lock was used, merely to allow
> the callers to do an Assert() that the value is not LW_NONE.

Yes, but this is not merely to allow callers to do an Assert().
Sometimes in multixacts it could save us some relocks.  So, we can
skip relocking lock to exclusive mode if it's in exclusive already.
Adding Assert() to every caller is probably overkill.
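To be concrete, the pattern this avoids looks roughly like the following (illustrative only; mode_held is a made-up local variable rather than the patch's actual code):

    /*
     * Not knowing which mode the read left us in, a caller that needs
     * exclusive access must always release and reacquire:
     */
    LWLockRelease(MultiXactOffsetControlLock);
    LWLockAcquire(MultiXactOffsetControlLock, LW_EXCLUSIVE);

    /*
     * Knowing the mode actually held, the relock can be skipped when the
     * lock is already exclusive:
     */
    if (mode_held != LW_EXCLUSIVE)
    {
        LWLockRelease(MultiXactOffsetControlLock);
        LWLockAcquire(MultiXactOffsetControlLock, LW_EXCLUSIVE);
        mode_held = LW_EXCLUSIVE;
    }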

> IMO we could achieve exactly the same thing by passing a simple flag
> that would say 'make sure we got a lock' or something like that. In
> fact, aren't all callers doing the assert? That'd mean we can just do
> the check always, without the flag. (I see GetMultiXactIdMembers does
> two calls and only checks the second result, but I wonder if that's
> intended or omission.)

Having just the flag is exactly what the original version by Andrey
did.  But if we have to read two multixact offsets pages or multiple
members pages in one GetMultiXactIdMembers() call, then it relocks
from exclusive mode to exclusive mode.  I decided that once we set out
to optimize these locks, this situation is worth avoiding.

> In any case, it'd make the lwlock.c changes unnecessary, I think.

I agree that it would be better to not touch lwlock.c.  But I didn't
find a way to evade relocking exclusive mode to exclusive mode without
touching lwlock.c or making code cumbersome in other places.

> 0002
> ----
>
> Specifies the number cached MultiXact by backend. Any SLRU lookup ...
>
> should be 'number of cached ...'

Sounds reasonable.

> 0003
> ----
>
>      * Conditional variable for waiting till the filling of the next multixact
>      * will be finished.  See GetMultiXactIdMembers() and RecordNewMultiXact()
>      * for details.
>
> Perhaps 'till the next multixact is filled' or 'gets full' would be
> better. Not sure.

Sounds reasonable as well.

------
Regards,
Alexander Korotkov



Re: MultiXact\SLRU buffers configuration

From
Tomas Vondra
Date:
On Wed, Oct 28, 2020 at 10:36:39PM +0300, Alexander Korotkov wrote:
>Hi, Tomas!
>
>Thank you for your review.
>
>On Wed, Oct 28, 2020 at 4:36 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>> I did a quick review on this patch series. A couple comments:
>>
>>
>> 0001
>> ----
>>
>> This looks quite suspicious to me - SimpleLruReadPage_ReadOnly is
>> changed to return information about what lock was used, merely to allow
>> the callers to do an Assert() that the value is not LW_NONE.
>
>Yes, but this is not merely to allow callers to do an Assert().
>Sometimes in multixacts it could save us some relocks.  So, we can
>skip relocking lock to exclusive mode if it's in exclusive already.
>Adding Assert() to every caller is probably overkill.
>

Hmm, OK. That can only happen in GetMultiXactIdMembers, which is the
only place where we do retry, right? Do we actually know this makes any
measurable difference? It seems we're mostly imagining that it might
help, but we don't have any actual proof of that (e.g. a workload which
we might benchmark). Or am I missing something?

For me, the extra conditions make it way harder to reason about the
behavior of the code, and I can't convince myself it's worth it.


>> IMO we could achieve exactly the same thing by passing a simple flag
>> that would say 'make sure we got a lock' or something like that. In
>> fact, aren't all callers doing the assert? That'd mean we can just do
>> the check always, without the flag. (I see GetMultiXactIdMembers does
>> two calls and only checks the second result, but I wonder if that's
>> intended or omission.)
>
>Having just the flag is exactly what the original version by Andrey
>did.  But if we have to read two multixact offsets pages or multiple
>members pages in one GetMultiXactIdMembers() call, then it relocks
>from exclusive mode to exclusive mode.  I decided that once we set out
>to optimize these locks, this situation is worth avoiding.
>
>> In any case, it'd make the lwlock.c changes unnecessary, I think.
>
>I agree that it would be better to not touch lwlock.c.  But I didn't
>find a way to avoid relocking from exclusive mode to exclusive mode
>without touching lwlock.c or making the code cumbersome in other places.
>

Hmm. OK.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: MultiXact\SLRU buffers configuration

From
Tomas Vondra
Date:
Hi,

On Wed, Oct 28, 2020 at 12:34:58PM +0500, Andrey Borodin wrote:
>Tomas, thanks for looking into this!
>
>> 28 окт. 2020 г., в 06:36, Tomas Vondra <tomas.vondra@2ndquadrant.com> написал(а):
>>
>>
>> This thread started with a discussion about making the SLRU sizes
>> configurable, but this patch version only adds a local cache. Does this
>> achieve the same goal, or would we still gain something by having GUCs
>> for the SLRUs?
>>
>> If we're claiming this improves performance, it'd be good to have some
>> workload demonstrating that and measurements. I don't see anything like
>> that in this thread, so it's a bit hand-wavy. Can someone share details
>> of such workload (even synthetic one) and some basic measurements?
>
>All patches in this thread aim at the same goal: improve performance in the presence of MultiXact lock contention.
>I could not build a synthetic reproduction of the problem, however I did some MultiXact stressing here [0]. It's a clumsy test program, because it is still not clear to me which workload parameters trigger MultiXact lock contention. In the generic case I was encountering other locks like *GenLock: XidGenLock, MultixactGenLock etc. Yet our production system encounters this problem approximately once a month throughout this year.
>
>The test program locks for share different sets of tuples in the presence of concurrent full scans.
>To produce a set of locks we choose one of 14 bits. If a row number has this bit set to 0 we lock it.
>I've been measuring the time to lock all rows 3 times for each of the 14 bits, observing the total time to set all locks.
>During the test I was observing locks in pg_stat_activity; if they did not contain enough MultiXact locks I was tuning parameters further (number of concurrent clients, number of bits, select queries etc).
>
>Why is it so complicated? It seems that other reproductions of the problem were encountering other locks.
>

It's not my intention to be mean or anything like that, but to me this
means we don't really understand the problem we're trying to solve. Had
we understood it, we should be able to construct a workload reproducing
the issue ...

I understand what the individual patches are doing, and maybe those
changes are desirable in general. But without any benchmarks from a
plausible workload I find it hard to convince myself that:

(a) it actually will help with the issue you're observing on production

and 

(b) it's actually worth the extra complexity (e.g. the lwlock changes)


I'm willing to invest some of my time into reviewing/testing this, but I
think we badly need better insight into the issue, so that we can build
a workload reproducing it. Perhaps collecting some perf profiles and a
sample of the queries might help, but I assume you already tried that.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: MultiXact\SLRU buffers configuration

From
Andrey Borodin
Date:


29 окт. 2020 г., в 04:32, Tomas Vondra <tomas.vondra@2ndquadrant.com> написал(а):

It's not my intention to be mean or anything like that, but to me this
means we don't really understand the problem we're trying to solve. Had
we understood it, we should be able to construct a workload reproducing
the issue ...

I understand what the individual patches are doing, and maybe those
changes are desirable in general. But without any benchmarks from a
plausible workload I find it hard to convince myself that:

(a) it actually will help with the issue you're observing on production

and 
(b) it's actually worth the extra complexity (e.g. the lwlock changes)


I'm willing to invest some of my time into reviewing/testing this, but I
think we badly need better insight into the issue, so that we can build
a workload reproducing it. Perhaps collecting some perf profiles and a
sample of the queries might help, but I assume you already tried that.

Thanks, Tomas! This totally makes sense.

Indeed, collecting queries did not help yet. We have a load-test environment equivalent to production (but with 10x fewer shards) and a copy of the production workload queries. But the problem does not manifest there.
Why do I think the problem is in MultiXacts?
Here is a chart with number of wait events on each host

During the problem MultiXactOffsetControlLock and SLRURead dominate all other lock types. After a primary switchover to another node SLRURead continued for a bit there, then disappeared.
Backtraces on standbys during the problem show that most of the backends are sleeping in pg_sleep(1000L) and are not included in the wait stats on these charts.

Currently I'm considering writing a test that directly calls MultiXactIdExpand(), MultiXactIdCreate(), and GetMultiXactIdMembers() from an extension. Do you think benchmarks in such tests would be meaningful?


Thanks!

Best regards, Andrey Borodin.
Attachment

Re: MultiXact\SLRU buffers configuration

From
Tomas Vondra
Date:
On Thu, Oct 29, 2020 at 12:08:21PM +0500, Andrey Borodin wrote:
>
>
>> 29 окт. 2020 г., в 04:32, Tomas Vondra <tomas.vondra@2ndquadrant.com> написал(а):
>>
>> It's not my intention to be mean or anything like that, but to me this
>> means we don't really understand the problem we're trying to solve. Had
>> we understood it, we should be able to construct a workload reproducing
>> the issue ...
>>
>> I understand what the individual patches are doing, and maybe those
>> changes are desirable in general. But without any benchmarks from a
>> plausible workload I find it hard to convince myself that:
>>
>> (a) it actually will help with the issue you're observing on production
>>
>> and
>> (b) it's actually worth the extra complexity (e.g. the lwlock changes)
>>
>>
>> I'm willing to invest some of my time into reviewing/testing this, but I
>> think we badly need better insight into the issue, so that we can build
>> a workload reproducing it. Perhaps collecting some perf profiles and a
>> sample of the queries might help, but I assume you already tried that.
>
>Thanks, Tomas! This totally makes sense.
>
>Indeed, collecting queries did not help yet. We have a load-test environment equivalent to production (but with 10x fewer shards) and a copy of the production workload queries. But the problem does not manifest there.
>Why do I think the problem is in MultiXacts?
>Here is a chart with number of wait events on each host
>
>
>During the problem MultiXactOffsetControlLock and SLRURead dominate all other lock types. After a primary switchover to another node SLRURead continued for a bit there, then disappeared.

OK, so most of this seems to be due to SLRURead and
MultiXactOffsetControlLock. Could it be that there were too many
multixact members, triggering autovacuum to prevent multixact
wraparound? That might generate a lot of IO on the SLRU. Are you
monitoring the size of the pg_multixact directory?
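If not, one low-tech way to watch it from SQL is something like this (just a sketch; it needs superuser or an explicit GRANT on pg_ls_dir/pg_stat_file, and you'd sample it periodically):

  -- Sketch: on-disk size of the two multixact SLRUs.
  SELECT 'pg_multixact/offsets' AS dir,
         sum((pg_stat_file('pg_multixact/offsets/' || name)).size) AS bytes
    FROM pg_ls_dir('pg_multixact/offsets') AS d(name)
  UNION ALL
  SELECT 'pg_multixact/members',
         sum((pg_stat_file('pg_multixact/members/' || name)).size)
    FROM pg_ls_dir('pg_multixact/members') AS d(name);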

>Backtraces on standbys during the problem show that most of the backends are sleeping in pg_sleep(1000L) and are not included in the wait stats on these charts.
>
>Currently I'm considering writing a test that directly calls MultiXactIdExpand(), MultiXactIdCreate(), and GetMultiXactIdMembers() from an extension. Do you think benchmarks in such tests would be meaningful?
>

I don't know. I'd much rather have a SQL-level benchmark than an
extension doing this kind of stuff.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: MultiXact\SLRU buffers configuration

From
Andrey Borodin
Date:

> 29 окт. 2020 г., в 18:49, Tomas Vondra <tomas.vondra@2ndquadrant.com> написал(а):
>
> On Thu, Oct 29, 2020 at 12:08:21PM +0500, Andrey Borodin wrote:
>>
>>
>>> 29 окт. 2020 г., в 04:32, Tomas Vondra <tomas.vondra@2ndquadrant.com> написал(а):
>>>
>>> It's not my intention to be mean or anything like that, but to me this
>>> means we don't really understand the problem we're trying to solve. Had
>>> we understood it, we should be able to construct a workload reproducing
>>> the issue ...
>>>
>>> I understand what the individual patches are doing, and maybe those
>>> changes are desirable in general. But without any benchmarks from a
>>> plausible workload I find it hard to convince myself that:
>>>
>>> (a) it actually will help with the issue you're observing on production
>>>
>>> and
>>> (b) it's actually worth the extra complexity (e.g. the lwlock changes)
>>>
>>>
>>> I'm willing to invest some of my time into reviewing/testing this, but I
>>> think we badly need better insight into the issue, so that we can build
>>> a workload reproducing it. Perhaps collecting some perf profiles and a
>>> sample of the queries might help, but I assume you already tried that.
>>
>> Thanks, Tomas! This totally makes sense.
>>
>> Indeed, collecting queries did not help yet. We have a load-test environment equivalent to production (but with 10x fewer shards) and a copy of the production workload queries. But the problem does not manifest there.
>> Why do I think the problem is in MultiXacts?
>> Here is a chart with number of wait events on each host
>>
>>
>> During the problem MultiXactOffsetControlLock and SLRURead dominate all other lock types. After a primary switchover to another node SLRURead continued for a bit there, then disappeared.
>
> OK, so most of this seems to be due to SLRURead and
> MultiXactOffsetControlLock. Could it be that there were too many
> multixact members, triggering autovacuum to prevent multixact
> wraparound? That might generate a lot of IO on the SLRU. Are you
> monitoring the size of the pg_multixact directory?

Yes, we had some problems with 'multixact "members" limit exceeded' a long time ago.
We tuned autovacuum_multixact_freeze_max_age = 200000000 and vacuum_multixact_freeze_table_age = 75000000 (half of the defaults) and since then have not encountered this problem again (~5 months).
But the MultiXactOffsetControlLock problem persists. The problem was partially solved by adding more shards. But when one of the shards runs into trouble it's either MultiXacts or vacuum causing relation truncation (unrelated to this thread).

Thanks!

Best regards, Andrey Borodin.


Re: MultiXact\SLRU buffers configuration

From
Tomas Vondra
Date:
Hi,

After the issue reported in [1] got fixed, I've restarted the multi-xact
stress test, hoping to reproduce the issue. But so far no luck :-(

I've started slightly different tests on two machines - on one machine
I've done this:

  a) init.sql

  create table t (a int);
  insert into t select i from generate_series(1,100000000) s(i);
  alter table t add primary key (a);

  b) select.sql

  SELECT * FROM t
   WHERE a = (1+mod(abs(hashint4(extract(epoch from now())::int)),
                    100000000)) FOR KEY SHARE;

  c) pgbench -n -c 32 -j 8 -f select.sql -T $((24*3600)) test

The idea is to have large table and many clients hitting a small random
subset of the rows. A sample of wait events from ~24h run looks like this:

      e_type  |        e_name        |   sum
    ----------+----------------------+----------
     LWLock   | BufferContent        | 13913863
              |                      |  7194679
     LWLock   | WALWrite             |  1710507
     Activity | LogicalLauncherMain  |   726599
     Activity | AutoVacuumMain       |   726127
     Activity | WalWriterMain        |   725183
     Activity | CheckpointerMain     |   604694
     Client   | ClientRead           |   599900
     IO       | WALSync              |   502904
     Activity | BgWriterMain         |   378110
     Activity | BgWriterHibernate    |   348464
     IO       | WALWrite             |   129412
     LWLock   | ProcArray            |     6633
     LWLock   | WALInsert            |     5714
     IO       | SLRUWrite            |     2580
     IPC      | ProcArrayGroupUpdate |     2216
     LWLock   | XactSLRU             |     2196
     Timeout  | VacuumDelay          |     1078
     IPC      | XactGroupUpdate      |      737
     LWLock   | LockManager          |      503
     LWLock   | WALBufMapping        |      295
     LWLock   | MultiXactMemberSLRU  |      267
     IO       | DataFileWrite        |       68
     LWLock   | BufferIO             |       59
     IO       | DataFileRead         |       27
     IO       | DataFileFlush        |       14
     LWLock   | MultiXactGen         |        7
     LWLock   | BufferMapping        |        1

So, nothing particularly interesting - there certainly are not many wait
events related to SLRU.

On the other machine I did this:

  a) init.sql
  create table t (a int primary key);
  insert into t select i from generate_series(1,1000) s(i);

  b) select.sql
  select * from t for key share;

  c) pgbench -n -c 32 -j 8 -f select.sql -T $((24*3600)) test

and the wait events (24h run too) look like this:

      e_type   |        e_name         |   sum
    -----------+-----------------------+----------
     LWLock    | BufferContent         | 20804925
               |                       |  2575369
     Activity  | LogicalLauncherMain   |   745780
     Activity  | AutoVacuumMain        |   745292
     Activity  | WalWriterMain         |   740507
     Activity  | CheckpointerMain      |   737691
     Activity  | BgWriterHibernate     |   731123
     LWLock    | WALWrite              |   570107
     IO        | WALSync               |   452603
     Client    | ClientRead            |   151438
     BufferPin | BufferPin             |    23466
     LWLock    | WALInsert             |    21631
     IO        | WALWrite              |    19050
     LWLock    | ProcArray             |    15082
     Activity  | BgWriterMain          |    14655
     IPC       | ProcArrayGroupUpdate  |     7772
     LWLock    | WALBufMapping         |     3555
     IO        | SLRUWrite             |     1951
     LWLock    | MultiXactGen          |     1661
     LWLock    | MultiXactMemberSLRU   |      359
     LWLock    | MultiXactOffsetSLRU   |      242
     LWLock    | XactSLRU              |      141
     IPC       | XactGroupUpdate       |      104
     LWLock    | LockManager           |       28
     IO        | DataFileRead          |        4
     IO        | ControlFileSyncUpdate |        1
     Timeout   | VacuumDelay           |        1
     IO        | WALInitWrite          |        1

Also nothing particularly interesting - few SLRU wait events.

So unfortunately this does not really reproduce the SLRU locking issues
you're observing - clearly, there has to be something else triggering
it. Perhaps this workload is too simplistic, or maybe we need to run
different queries. Or maybe the hw needs to be somewhat different (more
CPUs? different storage?)


[1]
https://www.postgresql.org/message-id/20201104013205.icogbi773przyny5@development

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: MultiXact\SLRU buffers configuration

From
Andrey Borodin
Date:

> 10 нояб. 2020 г., в 05:13, Tomas Vondra <tomas.vondra@enterprisedb.com> написал(а):
> After the issue reported in [1] got fixed, I've restarted the multi-xact
> stress test, hoping to reproduce the issue. But so far no luck :-(


Tomas, many thanks for looking into this. I figured out that to make multixact sets bigger, transactions must hang for a while and lock a large set of tuples, but not a continuous range, to avoid contention on buffer_content.
I did not manage to implement this via pgbench, which is why I was trying to hack on a separate go program. But, essentially, no luck either.
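For reference, each client in the go program essentially runs a query of this shape (a rough sketch; the table, the row count and the particular bit are placeholders):

  -- One client iteration: lock the non-contiguous subset of rows whose id has
  -- a chosen bit clear, FOR SHARE, while other clients do the same for other
  -- bits and another session repopulates the table.
  BEGIN;
  SELECT *
    FROM t1
   WHERE i = ANY (ARRAY(SELECT g FROM generate_series(1, 1000000) AS g WHERE g & 8 = 0))
     FOR SHARE;
  COMMIT;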
I was observing something resembling it, though

                                    пятница,  8 мая 2020 г. 15:08:37 (every 1s)

  pid  |         wait_event         | wait_event_type | state  |                       query
-------+----------------------------+-----------------+--------+----------------------------------------------------
 41344 | ClientRead                 | Client          | idle   | insert into t1 select generate_series(1,1000000,1)
 41375 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
 41377 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
 41378 |                            |                 | active | select * from t1 where i = ANY ($1) for share
 41379 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
 41381 |                            |                 | active | select * from t1 where i = ANY ($1) for share
 41383 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
 41385 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
(8 rows)

but this picture was not stable.
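The snapshot above is just pg_stat_activity watched every second, roughly like this (a sketch; the exact filter is from memory):

  -- Run repeatedly, e.g. with \watch 1, to get snapshots like the one above.
  SELECT pid, wait_event, wait_event_type, state, query
    FROM pg_stat_activity
   WHERE backend_type = 'client backend';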

How do you collect wait events for aggregation? just insert into some table with cron?

Thanks!

Best regards, Andrey Borodin.


Re: MultiXact\SLRU buffers configuration

From
Tomas Vondra
Date:


On 11/10/20 7:16 AM, Andrey Borodin wrote:
> 
> 
>> 10 нояб. 2020 г., в 05:13, Tomas Vondra <tomas.vondra@enterprisedb.com> написал(а):
>> After the issue reported in [1] got fixed, I've restarted the multi-xact
>> stress test, hoping to reproduce the issue. But so far no luck :-(
> 
> 
> Tomas, many thanks for looking into this. I figured out that to make multixact sets bigger, transactions must hang for a while and lock a large set of tuples, but not a continuous range, to avoid contention on buffer_content.
> I did not manage to implement this via pgbench, which is why I was trying to hack on a separate go program. But, essentially, no luck either.
> I was observing something resembling it, though
> 
>                                     пятница,  8 мая 2020 г. 15:08:37 (every 1s)
> 
>   pid  |         wait_event         | wait_event_type | state  |                       query                        
> -------+----------------------------+-----------------+--------+----------------------------------------------------
>  41344 | ClientRead                 | Client          | idle   | insert into t1 select generate_series(1,1000000,1)
>  41375 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
>  41377 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
>  41378 |                            |                 | active | select * from t1 where i = ANY ($1) for share
>  41379 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
>  41381 |                            |                 | active | select * from t1 where i = ANY ($1) for share
>  41383 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
>  41385 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
> (8 rows)
> 
> but this picture was not stable.
> 

Seems we haven't made much progress in reproducing the issue :-( I guess
we'll need to know more about the machine where this happens. Is there
anything special about the hardware/config? Are you monitoring size of
the pg_multixact directory?

> How do you collect wait events for aggregation? just insert into some table with cron?
> 

No, I have a simple shell script (attached) sampling data from
pg_stat_activity regularly. Then I load it into a table and aggregate to
get a summary.
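In case it's useful, the core of it is roughly this (a sketch; the "samples" table is made up here, the actual script just dumps pg_stat_activity to files first and loads them later):

  -- One sample: what every backend is waiting on right now.
  CREATE TABLE IF NOT EXISTS samples (ts timestamptz, e_type text, e_name text);
  INSERT INTO samples
  SELECT now(), wait_event_type, wait_event
    FROM pg_stat_activity
   WHERE pid <> pg_backend_pid();

  -- Aggregation producing the e_type / e_name / sum summaries shown earlier.
  SELECT e_type, e_name, count(*) AS sum
    FROM samples
   GROUP BY 1, 2
   ORDER BY 3 DESC;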


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: MultiXact\SLRU buffers configuration

From
Thomas Munro
Date:
On Wed, Nov 11, 2020 at 7:07 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> Seems we haven't made much progress in reproducing the issue :-( I guess
> we'll need to know more about the machine where this happens. Is there
> anything special about the hardware/config? Are you monitoring size of
> the pg_multixact directory?

Which release was the original problem seen on?



Re: MultiXact\SLRU buffers configuration

From
Andrey Borodin
Date:

> 10 нояб. 2020 г., в 23:07, Tomas Vondra <tomas.vondra@enterprisedb.com> написал(а):
>
> On 11/10/20 7:16 AM, Andrey Borodin wrote:
>>
>>
>> but this picture was not stable.
>>
>
> Seems we haven't made much progress in reproducing the issue :-( I guess
> we'll need to know more about the machine where this happens. Is there
> anything special about the hardware/config? Are you monitoring size of
> the pg_multixact directory?

It's Ubuntu 18.04.4 LTS, Intel Xeon E5-2660 v4, 56 CPU cores with 256Gb of RAM.
PostgreSQL 10.14, compiled by gcc 7.5.0, 64-bit

No, unfortunately we do not have signals for SLRU sizes.
3.5Tb mdadm raid10 over 28 SSD drives, 82% full.

The first incident triggering the investigation was on 2020-04-19; at that time the cluster was running PG 10.11. But I think it was happening before.

I'd say nothing special...

>
>> How do you collect wait events for aggregation? just insert into some table with cron?
>>
>
> No, I have a simple shell script (attached) sampling data from
> pg_stat_activity regularly. Then I load it into a table and aggregate to
> get a summary.

Thanks!

Best regards, Andrey Borodin.




Re: MultiXact\SLRU buffers configuration

From
Gilles Darold
Date:
Le 13/11/2020 à 12:49, Andrey Borodin a écrit :
>
>> 10 нояб. 2020 г., в 23:07, Tomas Vondra <tomas.vondra@enterprisedb.com> написал(а):
>>
>> On 11/10/20 7:16 AM, Andrey Borodin wrote:
>>>
>>> but this picture was not stable.
>>>
>> Seems we haven't made much progress in reproducing the issue :-( I guess
>> we'll need to know more about the machine where this happens. Is there
>> anything special about the hardware/config? Are you monitoring size of
>> the pg_multixact directory?
> It's Ubuntu 18.04.4 LTS, Intel Xeon E5-2660 v4, 56 CPU cores with 256Gb of RAM.
> PostgreSQL 10.14, compiled by gcc 7.5.0, 64-bit
>
> No, unfortunately we do not have signals for SLRU sizes.
> 3.5Tb mdadm raid10 over 28 SSD drives, 82% full.
>
> The first incident triggering the investigation was on 2020-04-19; at that time the cluster was running PG 10.11. But I think it was happening before.
>
> I'd say nothing special...
>
>>> How do you collect wait events for aggregation? just insert into some table with cron?
>>>
>> No, I have a simple shell script (attached) sampling data from
>> pg_stat_activity regularly. Then I load it into a table and aggregate to
>> get a summary.
> Thanks!
>
> Best regards, Andrey Borodin.


Hi,


Some time ago I encountered contention on
MultiXactOffsetControlLock with a performance benchmark. Here are the
wait event monitoring results, with polling every 10 seconds and a 30
minute run for the benchmark:


  event_type |           event            |   sum
------------+----------------------------+----------
  Client     | ClientRead                 | 44722952
  LWLock     | MultiXactOffsetControlLock | 30343060
  LWLock     | multixact_offset           | 16735250
  LWLock     | MultiXactMemberControlLock |  1601470
  LWLock     | buffer_content             |   991344
  LWLock     | multixact_member           |   805624
  Lock       | transactionid              |   204997
  Activity   | LogicalLauncherMain        |   198834
  Activity   | CheckpointerMain           |   198834
  Activity   | AutoVacuumMain             |   198469
  Activity   | BgWriterMain               |   184066
  Activity   | WalWriterMain              |   171571
  LWLock     | WALWriteLock               |    72428
  IO         | DataFileRead               |    35708
  Activity   | BgWriterHibernate          |    12741
  IO         | SLRURead                   |     9121
  Lock       | relation                   |     8858
  LWLock     | ProcArrayLock              |     7309
  LWLock     | lock_manager               |     6677
  LWLock     | pg_stat_statements         |     4194
  LWLock     | buffer_mapping             |     3222


After reading this thread I changed the buffer sizes to 32
and 64 and obtained the following results:


  event_type |           event            |    sum
------------+----------------------------+-----------
  Client     | ClientRead                 | 268297572
  LWLock     | MultiXactMemberControlLock |  65162906
  LWLock     | multixact_member           |  33397714
  LWLock     | buffer_content             |   4737065
  Lock       | transactionid              |   2143750
  LWLock     | SubtransControlLock        |   1318230
  LWLock     | WALWriteLock               |   1038999
  Activity   | LogicalLauncherMain        |    940598
  Activity   | AutoVacuumMain             |    938566
  Activity   | CheckpointerMain           |    799052
  Activity   | WalWriterMain              |    749069
  LWLock     | subtrans                   |    710163
  Activity   | BgWriterHibernate          |    536763
  Lock       | object                     |    514225
  Activity   | BgWriterMain               |    394206
  LWLock     | lock_manager               |    295616
  IO         | DataFileRead               |    274236
  LWLock     | ProcArrayLock              |     77099
  Lock       | tuple                      |     59043
  IO         | CopyFileWrite              |     45611
  Lock       | relation                   |     42714

There was still contention on multixact but less than in the first run. I
increased the buffers to 128 and 512 and obtained the best results
for this benchmark:

  event_type |           event            |    sum
------------+----------------------------+-----------
  Client     | ClientRead                 | 160463037
  LWLock     | MultiXactMemberControlLock |   5334188
  LWLock     | buffer_content             |   5228256
  LWLock     | buffer_mapping             |   2368505
  LWLock     | SubtransControlLock        |   2289977
  IPC        | ProcArrayGroupUpdate       |   1560875
  LWLock     | ProcArrayLock              |   1437750
  Lock       | transactionid              |    825561
  LWLock     | subtrans                   |    772701
  LWLock     | WALWriteLock               |    666138
  Activity   | LogicalLauncherMain        |    492585
  Activity   | CheckpointerMain           |    492458
  Activity   | AutoVacuumMain             |    491548
  LWLock     | lock_manager               |    426531
  Lock       | object                     |    403581
  Activity   | WalWriterMain              |    394668
  Activity   | BgWriterHibernate          |    293112
  Activity   | BgWriterMain               |    195312
  LWLock     | MultiXactGenLock           |    177820
  LWLock     | pg_stat_statements         |    173864
  IO         | DataFileRead               |    173009


I hope these metrics are of some interest in showing the utility of this
patch, but unfortunately I cannot be more precise or provide reports
for the entire patch. The problem is that this benchmark runs on an
application that uses PostgreSQL 11 and I cannot back-port the full
patch; there have been too many changes since PG11. I have just increased
the sizes of NUM_MXACTOFFSET_BUFFERS and NUM_MXACTMEMBER_BUFFERS. This
allowed us to triple the number of simultaneous connections between the
first and the last test.


I know that this report is not really helpful, but at least I can give
more information on the benchmark that was used. This is the proprietary
zRef benchmark, which compares the same Cobol programs (transactional and
batch) executed both on mainframes and on x86 servers. Instead of a DB2
z/OS database we use PostgreSQL v11. This test makes extensive use of
cursors (each select, even read-only, is executed through a cursor), and
the contention was observed with updates on tables with some foreign
keys. There is no explicit FOR SHARE in the queries, only some FOR
UPDATE clauses. I guess that the multixact contention is the result of
the key-share locks taken for the FK checks.
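To illustrate that guess with a minimal, hypothetical example (not the zRef schema): the RI trigger behind a foreign key locks the referenced row FOR KEY SHARE, so as soon as two sessions insert children of the same parent row concurrently, a multixact is created to record both lockers, with no explicit row locking anywhere in the application SQL.

  -- Hypothetical two-table schema; the FK alone is enough to create multixacts.
  CREATE TABLE parent (id int PRIMARY KEY);
  CREATE TABLE child  (id bigserial PRIMARY KEY,
                       parent_id int NOT NULL REFERENCES parent (id));
  INSERT INTO parent VALUES (1);

  -- Run in two (or more) concurrent sessions: each INSERT fires the FK check,
  -- which key-share locks parent id = 1; overlapping lockers on that single
  -- tuple are recorded as a multixact.
  BEGIN;
  INSERT INTO child (parent_id) VALUES (1);
  -- ... the rest of the transaction keeps the lock alive ...
  COMMIT;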


So in our case being able to tune the multixact buffers could help a lot
to improve performance.


--
Gilles Darold




Re: MultiXact\SLRU buffers configuration

From
Andrey Borodin
Date:
Hi Gilles!

Many thanks for your message!

> 8 дек. 2020 г., в 21:05, Gilles Darold <gilles@darold.net> написал(а):
>
> I know that this report is not really helpful

Quite the contrary - this benchmark proves that a controllable reproduction exists. I've rebased the patches for PG11. Can you please benchmark them (without extending the SLRUs)?

Best regards, Andrey Borodin.


Attachment

Re: MultiXact\SLRU buffers configuration

From
Gilles Darold
Date:
Hi Andrey,

Thanks for the backport. I have an issue with the first patch "Use shared lock in GetMultiXactIdMembers for offsets and members" (v1106-0001-Use-shared-lock-in-GetMultiXactIdMembers-for-o.patch): the applications no longer work when I apply it. Also, the PG regression tests are failing on several parts.

test insert_conflict              ... ok
test create_function_1            ... FAILED
test create_type                  ... FAILED
test create_table                 ... FAILED
test create_function_2            ... FAILED
test copy                         ... FAILED
test copyselect                   ... ok
test copydml                      ... ok
test create_misc                  ... FAILED
test create_operator              ... FAILED
test create_procedure             ... ok
test create_index                 ... FAILED
test index_including              ... ok
test create_view                  ... FAILED
test create_aggregate             ... ok
test create_function_3            ... ok
test create_cast                  ... ok
test constraints                  ... FAILED
test triggers                     ... FAILED
test inherit                      ...
^C


This is also where my last attempt to back-port to PG11 stalled. I will try to fix it again, but it could take some time to get it working.

Best regards,
-- 
Gilles Darold

Le 08/12/2020 à 18:52, Andrey Borodin a écrit :
Hi Gilles!

Many thanks for your message!

8 дек. 2020 г., в 21:05, Gilles Darold <gilles@darold.net> написал(а):

I know that this report is not really helpful 
Quite the contrary - this benchmark proves that a controllable reproduction exists. I've rebased the patches for PG11. Can you please benchmark them (without extending the SLRUs)?

Best regards, Andrey Borodin.

Re: MultiXact\SLRU buffers configuration

From
Gilles Darold
Date:
Le 09/12/2020 à 11:51, Gilles Darold a écrit :
> Also, the PG regression tests are failing on several parts.

Forget this, I had not run the regression tests in the right repository:

...

=======================
  All 189 tests passed.
=======================


I'm looking into why the application is failing.


-- 
Gilles Darold
http://www.darold.net/




Re: MultiXact\SLRU buffers configuration

From
Gilles Darold
Date:
Le 08/12/2020 à 18:52, Andrey Borodin a écrit :
Hi Gilles!

Many thanks for your message!

8 дек. 2020 г., в 21:05, Gilles Darold <gilles@darold.net> написал(а):

I know that this report is not really helpful 
Quite the contrary - this benchmark proves that a controllable reproduction exists. I've rebased the patches for PG11. Can you please benchmark them (without extending the SLRUs)?

Best regards, Andrey Borodin.

Hi,


Running tests yesterday with the patches produced a log full of failures, with errors on INSERT and UPDATE statements:


ERROR:  lock MultiXactOffsetControlLock is not held


After a patch review this morning I think I have found what's going wrong. In patch v6-0001-Use-shared-lock-in-GetMultiXactIdMembers-for-offs.patch I think there is a missing reinitialisation of the lockmode variable to LW_NONE inside the retry loop, after the call to LWLockRelease() in src/backend/access/transam/multixact.c:1392:GetMultiXactIdMembers(). I've attached a new version of the patch for master that includes the fix I'm now using with PG11, and with it everything works very well.


I'm running more tests, playing with multixact_offsets_slru_buffers, multixact_members_slru_buffers and multixact_local_cache_entries, to see the impact on performance. I will report the results later today.

-- 
Gilles Darold
http://www.darold.net/
Attachment

Re: MultiXact\SLRU buffers configuration

From
Gilles Darold
Date:
Le 10/12/2020 à 15:45, Gilles Darold a écrit :
Le 08/12/2020 à 18:52, Andrey Borodin a écrit :
Hi Gilles!

Many thanks for your message!

8 дек. 2020 г., в 21:05, Gilles Darold <gilles@darold.net> написал(а):

I know that this report is not really helpful 
Quite the contrary - this benchmark proves that a controllable reproduction exists. I've rebased the patches for PG11. Can you please benchmark them (without extending the SLRUs)?

Best regards, Andrey Borodin.

Hi,


Running tests yesterday with the patches produced a log full of failures, with errors on INSERT and UPDATE statements:


ERROR:  lock MultiXactOffsetControlLock is not held


After a patch review this morning I think I have found what's going wrong. In patch v6-0001-Use-shared-lock-in-GetMultiXactIdMembers-for-offs.patch I think there is a missing reinitialisation of the lockmode variable to LW_NONE inside the retry loop, after the call to LWLockRelease() in src/backend/access/transam/multixact.c:1392:GetMultiXactIdMembers(). I've attached a new version of the patch for master that includes the fix I'm now using with PG11, and with it everything works very well.


I'm running more tests, playing with multixact_offsets_slru_buffers, multixact_members_slru_buffers and multixact_local_cache_entries, to see the impact on performance. I will report the results later today.


Hi,

Sorry for the delay, I have done some further tests to try to reach the limit without bottlenecks on multixact or shared buffers. The tests were done on a Microsoft Azure machine with 2TB of RAM and 4 sockets of Intel Xeon Platinum 8280M (128 CPUs). PG configuration:

    max_connections = 4096
    shared_buffers = 64GB
    max_prepared_transactions = 2048
    work_mem = 256MB
    maintenance_work_mem = 2GB
    wal_level = minimal
    synchronous_commit = off
    commit_delay = 1000
    commit_siblings = 10
    checkpoint_timeout = 1h
    max_wal_size = 32GB
    checkpoint_completion_target = 0.9

I have tested several values for the different buffer variables, starting from:

    multixact_offsets_slru_buffers = 64
    multixact_members_slru_buffers = 128
    multixact_local_cache_entries = 256

up to the values that gave the best performance in this test, avoiding MultiXactOffsetControlLock and MultiXactMemberControlLock contention:

    multixact_offsets_slru_buffers = 128
    multixact_members_slru_buffers = 512
    multixact_local_cache_entries = 1024

Also, shared_buffers was increased to 256GB to avoid buffer_mapping contention.

Our last best test reports the following wait events:

     event_type |           event            |    sum
    ------------+----------------------------+-----------
     Client     | ClientRead                 | 321690211
     LWLock     | buffer_content             |   2970016
     IPC        | ProcArrayGroupUpdate       |   2317388
     LWLock     | ProcArrayLock              |   1445828
     LWLock     | WALWriteLock               |   1187606
     LWLock     | SubtransControlLock        |    972889
     Lock       | transactionid              |    840560
     Lock       | relation                   |    587600
     Activity   | LogicalLauncherMain        |    529599
     Activity   | AutoVacuumMain             |    528097

At this stage I don't think we can get better performance by tuning these buffers, at least with PG11.

As for the performance gain from the shared-lock patch in GetMultiXactIdMembers, unfortunately I cannot see a difference with or without it; this could be related to our particular benchmark. But clearly the patch on the multixact buffers should be committed, as being able to tune PG when multixact bottlenecks are found is really helpful.


Best regards,
-- 
Gilles Darold
LzLabs GmbH
https://www.lzlabs.com/


			
		

Re: MultiXact\SLRU buffers configuration

From
Gilles Darold
Date:
Le 11/12/2020 à 18:50, Gilles Darold a écrit :
Le 10/12/2020 à 15:45, Gilles Darold a écrit :
Le 08/12/2020 à 18:52, Andrey Borodin a écrit :
Hi Gilles!

Many thanks for your message!

8 дек. 2020 г., в 21:05, Gilles Darold <gilles@darold.net> написал(а):

I know that this report is not really helpful 
Quite the contrary - this benchmark proves that a controllable reproduction exists. I've rebased the patches for PG11. Can you please benchmark them (without extending the SLRUs)?

Best regards, Andrey Borodin.

Hi,


Running tests yesterday with the patches produced a log full of failures, with errors on INSERT and UPDATE statements:


ERROR:  lock MultiXactOffsetControlLock is not held


After a patch review this morning I think I have found what's going wrong. In patch v6-0001-Use-shared-lock-in-GetMultiXactIdMembers-for-offs.patch I think there is a missing reinitialisation of the lockmode variable to LW_NONE inside the retry loop, after the call to LWLockRelease() in src/backend/access/transam/multixact.c:1392:GetMultiXactIdMembers(). I've attached a new version of the patch for master that includes the fix I'm now using with PG11, and with it everything works very well.


I'm running more tests, playing with multixact_offsets_slru_buffers, multixact_members_slru_buffers and multixact_local_cache_entries, to see the impact on performance. I will report the results later today.


Hi,

Sorry for the delay, I have done some further tests to try to reach the limit without bottlenecks on multixact or shared buffers. The tests were done on a Microsoft Azure machine with 2TB of RAM and 4 sockets of Intel Xeon Platinum 8280M (128 CPUs). PG configuration:

    max_connections = 4096
    shared_buffers = 64GB
    max_prepared_transactions = 2048
    work_mem = 256MB
    maintenance_work_mem = 2GB
    wal_level = minimal
    synchronous_commit = off
    commit_delay = 1000
    commit_siblings = 10
    checkpoint_timeout = 1h
    max_wal_size = 32GB
    checkpoint_completion_target = 0.9

I have tested several values for the different buffer variables, starting from:

    multixact_offsets_slru_buffers = 64
    multixact_members_slru_buffers = 128
    multixact_local_cache_entries = 256

up to the values that gave the best performance in this test, avoiding MultiXactOffsetControlLock and MultiXactMemberControlLock contention:

    multixact_offsets_slru_buffers = 128
    multixact_members_slru_buffers = 512
    multixact_local_cache_entries = 1024

Also, shared_buffers was increased to 256GB to avoid buffer_mapping contention.

Our last best test reports the following wait events:

     event_type |           event            |    sum
    ------------+----------------------------+-----------
     Client     | ClientRead                 | 321690211
     LWLock     | buffer_content             |   2970016
     IPC        | ProcArrayGroupUpdate       |   2317388
     LWLock     | ProcArrayLock              |   1445828
     LWLock     | WALWriteLock               |   1187606
     LWLock     | SubtransControlLock        |    972889
     Lock       | transactionid              |    840560
     Lock       | relation                   |    587600
     Activity   | LogicalLauncherMain        |    529599
     Activity   | AutoVacuumMain             |    528097

At this stage I don't think we can get better performance by tuning these buffers, at least with PG11.

As for the performance gain from the shared-lock patch in GetMultiXactIdMembers, unfortunately I cannot see a difference with or without it; this could be related to our particular benchmark. But clearly the patch on the multixact buffers should be committed, as being able to tune PG when multixact bottlenecks are found is really helpful.


I've done more review on these patches.


1) As reported in my previous message, patch 0001 looks useless as it doesn't provide a measurable performance gain.


2) In patch 0004 there are two typos; s/informaion/information/ will fix them.


3) The GUCs are missing from the postgresql.conf.sample file; see the attached patch for a proposal.


Best regards,

-- 
Gilles Darold
LzLabs GmbH
https://www.lzlabs.com/
Attachment

Re: MultiXact\SLRU buffers configuration

From
Andrey Borodin
Date:

> 13 дек. 2020 г., в 14:17, Gilles Darold <gilles@darold.net> написал(а):
>
> I've done more review on these patches.

Thanks, Gilles! I'll incorporate all your fixes to patchset.
Can you also benchmark conditional variable sleep? The patch "Add conditional variable to wait for next MultXact offset in corner case"?
The problem manifests on Standby when Primary is heavily loaded with MultiXactOffsetControlLock.

Thanks!

Best regards, Andrey Borodin.


Re: MultiXact\SLRU buffers configuration

From
Andrey Borodin
Date:

> 13 дек. 2020 г., в 22:24, Andrey Borodin <x4mmm@yandex-team.ru> написал(а):
>
>
>
>> 13 дек. 2020 г., в 14:17, Gilles Darold <gilles@darold.net> написал(а):
>>
>> I've done more review on these patches.
>
> Thanks, Gilles! I'll incorporate all your fixes to patchset.

PFA patches.
Also, I've noted that the patch "Add conditional variable to wait for next MultXact offset in corner case" removes a CHECK_FOR_INTERRUPTS(); I'm not sure that's correct.

Thanks!

Best regards, Andrey Borodin.




Attachment

Re: MultiXact\SLRU buffers configuration

From
Gilles Darold
Date:
Le 13/12/2020 à 18:24, Andrey Borodin a écrit :

13 дек. 2020 г., в 14:17, Gilles Darold <gilles@darold.net> написал(а):

I've done more review on these patches.
Thanks, Gilles! I'll incorporate all your fixes to patchset.
Can you also benchmark conditional variable sleep? The patch "Add conditional variable to wait for next MultXact offset in corner case"?
The problem manifests on Standby when Primary is heavily loaded with MultiXactOffsetControlLock.

Thanks!

Best regards, Andrey Borodin.

Hi Andrey,


Sorry for the delayed response. We have run several other tests trying to figure out the performance gain per patch, but unfortunately we get very erratic results. With the same parameters and patches the test doesn't return the same results depending on the day or the hour of the day. This is very frustrating, and I suppose it is related to the Azure architecture. The only thing I am sure of is that we had the best performance results with all patches and

multixact_offsets_slru_buffers = 256
multixact_members_slru_buffers = 512
multixact_local_cache_entries = 4096

but I cannot say whether all or only some of the patches are improving performance. My feeling is that the performance gain from patches 1 (shared lock) and 3 (conditional variable) does not add much on top of the gain from just tuning the multixact buffers once multixact contention is observed, though perhaps they delay the contention. It's all the more frustrating that we had a test case to reproduce the contention, but apparently not the architecture.


Can't do much more at this point.


Best regards,

-- 
Gilles Darold
LzLabs GmbH
http://www.lzlabs.com/

Re: MultiXact\SLRU buffers configuration

From
Andrey Borodin
Date:

> 23 дек. 2020 г., в 21:31, Gilles Darold <gilles@darold.net> написал(а):
>
> Sorry for the delayed response. We have run several other tests trying to figure out the performance gain per patch, but unfortunately we get very erratic results. With the same parameters and patches the test doesn't return the same results depending on the day or the hour of the day. This is very frustrating, and I suppose it is related to the Azure architecture. The only thing I am sure of is that we had the best performance results with all patches and
>
> multixact_offsets_slru_buffers = 256
> multixact_members_slru_buffers = 512
> multixact_local_cache_entries = 4096
>
>
> but I cannot say whether all or only some of the patches are improving performance. My feeling is that the performance gain from patches 1 (shared lock) and 3 (conditional variable) does not add much on top of the gain from just tuning the multixact buffers once multixact contention is observed, though perhaps they delay the contention. It's all the more frustrating that we had a test case to reproduce the contention, but apparently not the architecture.

Hi! Thanks for the input.
I think we have a consensus here that configuring SLRU size is beneficial for MultiXacts.
There is a proposal in a nearby thread [0] to change the default value of the commit_ts SLRU buffers.
In my experience, from time to time there are problems with subtransactions that are cured by extending the subtrans SLRU.

Let's make all SLRUs configurable?
PFA patch with draft of these changes.

Best regards, Andrey Borodin.


[0] https://www.postgresql.org/message-id/flat/20210115220744.GA24457%40alvherre.pgsql


Attachment

Re: MultiXact\SLRU buffers configuration

From
Thomas Munro
Date:
Hi Andrey,

On Sat, Mar 13, 2021 at 1:44 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> [v10]

+int            multixact_offsets_slru_buffers = 8;
+int            multixact_members_slru_buffers = 16;
+int            subtrans_slru_buffers = 32;
+int            notify_slru_buffers = 8;
+int            serial_slru_buffers = 16;
+int            clog_slru_buffers = 0;
+int            commit_ts_slru_buffers = 0;

I don't think we should put "slru" (the name of the buffer replacement
algorithm, implementation detail) in the GUC names.

+        It defaults to 0, in this case CLOG size is taken as
<varname>shared_buffers</varname> / 512.

We already know that increasing the number of CLOG buffers above the
current number hurts as the linear search begins to dominate
(according to the commit message for 5364b357), and it doesn't seem
great to ship a new feature that melts your CPU when you turn it up.
Perhaps, to ship this, we need to introduce a buffer mapping table?  I
have attached a "one coffee" attempt at that, on top of your v10 patch
(unmodified), for discussion.  It survives basic testing but I don't
know how it performs.

Attachment

Re: MultiXact\SLRU buffers configuration

From
Thomas Munro
Date:
On Thu, Mar 25, 2021 at 10:31 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> We already know that increasing the number of CLOG buffers above the
> current number hurts as the linear search begins to dominate
> (according to the commit message for 5364b357), and it doesn't seem
> great to ship a new feature that melts your CPU when you turn it up.
> Perhaps, to ship this, we need to introduce a buffer mapping table?  I
> have attached a "one coffee" attempt at that, on top of your v10 patch
> (unmodified), for discussion.  It survives basic testing but I don't
> know how it performs.

Hrrr... Cfbot showed an assertion failure.  Here's the two coffee
version with a couple of silly mistakes fixed.

Attachment

Re: MultiXact\SLRU buffers configuration

From
Thomas Munro
Date:
Hi Andrey, all,

I propose some changes, and I'm attaching a new version:

I renamed the GUCs as clog_buffers etc (no "_slru_").  I fixed some
copy/paste mistakes where the different GUCs were mixed up.  I made
some changes to the .conf.sample.  I rewrote the documentation so that
it states the correct unit and defaults, and refers to the
subdirectories that are cached by these buffers instead of trying to
give a new definition of each of the SLRUs.

Do you like those changes?

Some things I thought about but didn't change:

I'm not entirely sure if we should use the internal and historical
names well known to hackers (CLOG), or the visible directory names (I
mean, we could use pg_xact_buffers instead of clog_buffers).  I am not
sure why these GUCs need to be PGDLLIMPORT, but I see that NBuffers is
like that.

I wanted to do some very simple smoke testing of CLOG sizes on my
local development machine:

  pgbench -i -s1000 postgres
  pgbench -t4000000 -c8 -j8 -Mprepared postgres

I disabled autovacuum after running that just to be sure it wouldn't
interfere with my experiment:

  alter table pgbench_accounts set (autovacuum_enabled = off);

Then I shut the cluster down and made a copy, so I could do some
repeated experiments from the same initial conditions each time.  At
this point I had 30 files 0000-001E under pg_xact, holding 256kB = ~1
million transactions each.  It'd take ~960 buffers to cache it all.
So how long does VACUUM FREEZE pgbench_accounts take?
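For concreteness, each measurement below is just the elapsed time of the freeze against a fresh copy of that data directory, e.g. from psql:

  -- pgbench_accounts references ~32M xids, so freezing it has to look them
  -- all up in pg_xact.
  \timing on
  VACUUM (FREEZE) pgbench_accounts;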

I tested with just the 0001 patch, and also with the 0002 patch
(improved version, attached):

clog_buffers=128:  0001=2:28.499, 0002=2:17.891
clog_buffers=1024: 0001=1:38.485, 0002=1:29.701

I'm sure the speedup of the 0002 patch can be amplified by increasing
the number of transactions referenced in the table OR number of
clog_buffers, considering that the linear search produces
O(transactions * clog_buffers) work.  That was 32M transactions and
8MB of CLOG, but I bet if you double both of those numbers once or
twice things start to get hot.  I don't see why you shouldn't be able
to opt to cache literally all of CLOG if you want (something like 50MB
assuming default autovacuum_freeze_max_age, scale to taste, up to
512MB for the theoretical maximum useful value).

I'm not saying the 0002 patch is bug-free yet though, it's a bit finickity.

Attachment

Re: MultiXact\SLRU buffers configuration

From
Andrey Borodin
Date:
Hi Thomas, Gilles, all!

Thanks for reviewing this!

> 25 марта 2021 г., в 02:31, Thomas Munro <thomas.munro@gmail.com> написал(а):
>
> I don't think we should put "slru" (the name of the buffer replacement
> algorithm, implementation detail) in the GUC names.
+1


> +        It defaults to 0, in this case CLOG size is taken as
> <varname>shared_buffers</varname> / 512.
>
> We already know that increasing the number of CLOG buffers above the
> current number hurts as the linear search begins to dominate
Uh, my intent was to copy the original approach to CLOG SLRU sizing; I just missed the Min() part of the shared_buffers logic.


> 26 марта 2021 г., в 08:46, Thomas Munro <thomas.munro@gmail.com> написал(а):
>
> Hi Andrey, all,
>
> I propose some changes, and I'm attaching a new version:
>
> I renamed the GUCs as clog_buffers etc (no "_slru_").  I fixed some
> copy/paste mistakes where the different GUCs were mixed up.  I made
> some changes to the .conf.sample.  I rewrote the documentation so that
> it states the correct unit and defaults, and refers to the
> subdirectories that are cached by these buffers instead of trying to
> give a new definition of each of the SLRUs.
>
> Do you like those changes?
Yes!

> Some things I thought about but didn't change:
>
> I'm not entirely sure if we should use the internal and historical
> names well known to hackers (CLOG), or the visible directory names (I
> mean, we could use pg_xact_buffers instead of clog_buffers).
While it is a good idea to make notes about the directory names, I think the real naming criterion is to help users find this GUC when they encounter wait events in pg_stat_activity. I think there are no mentions of CLOG in the docs [0], only XactBuffer, XactSLRU etc.

> I'm not saying the 0002 patch is bug-free yet though, it's a bit finickity.
I think the idea of speeding up the linear search is really, really good for scaling SLRUs. It's not even about improving the normal performance of the cluster; it's important for preventing pathological degradation under certain circumstances. A bigger cache really saves SLAs :) I'll look into the patch more closely this weekend. Thank you!

Best regards, Andrey Borodin.

[0] https://www.postgresql.org/docs/current/monitoring-stats.html#MONITORING-PG-STAT-ACTIVITY-VIEW


Re: MultiXact\SLRU buffers configuration

From
Andrey Borodin
Date:

> 26 марта 2021 г., в 11:00, Andrey Borodin <x4mmm@yandex-team.ru> написал(а):
>
>> I'm not saying the 0002 patch is bug-free yet though, it's a bit finickity.
> I think the idea of speeding up the linear search is really, really good for scaling SLRUs. It's not even about improving the normal performance of the cluster; it's important for preventing pathological degradation under certain circumstances. A bigger cache really saves SLAs :) I'll look into the patch more closely this weekend. Thank you!


Some thoughts on HashTable patch:
1. Can we allocate a bigger hashtable to reduce the probability of collisions?
2. Can we use a specialised hashtable for this case? I'm afraid hash_search() takes a comparable number of CPU cycles to a simple loop from 0 to 128. We could inline everything and avoid the hashp->hash(keyPtr, hashp->keysize) call. I'm not insisting on a special hash though, just an idea.
3. pageno in SlruMappingTableEntry seems to be unused.

Thanks!

Best regards, Andrey Borodin.


Re: MultiXact\SLRU buffers configuration

From
Thomas Munro
Date:
On Sat, Mar 27, 2021 at 4:52 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> Some thoughts on HashTable patch:
> 1. Can we allocate a bigger hashtable to reduce the probability of collisions?

Yeah, good idea, might require some study.

> 2. Can we use a specialised hashtable for this case? I'm afraid hash_search() takes a comparable number of CPU cycles to a simple loop from 0 to 128. We could inline everything and avoid the hashp->hash(keyPtr, hashp->keysize) call. I'm not insisting on a special hash though, just an idea.

I tried really hard to not fall into this rabbit h.... [hack hack
hack], OK, here's a first attempt to use simplehash, Andres's
steampunk macro-based robinhood template that we're already using for
several other things, and murmurhash which is inlineable and
branch-free.  I had to tweak it to support "in-place" creation and
fixed size (in other words, no allocators, for use in shared memory).
Then I was annoyed that I had to add a "status" member to our struct,
so I tried to fix that.  Definitely needs more work to think about
failure modes when running out of memory, how much spare space you
need, etc.

I have not experimented with this much beyond hacking until the tests
pass, but it *should* be more efficient...

> 3. pageno in SlruMappingTableEntry seems to be unused.

It's the key (dynahash uses the first N bytes of your struct as the
key, but in this new simplehash version it's more explicit).

Attachment

Re: MultiXact\SLRU buffers configuration

From
Andrey Borodin
Date:

> 27 марта 2021 г., в 01:26, Thomas Munro <thomas.munro@gmail.com> написал(а):
>
> On Sat, Mar 27, 2021 at 4:52 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>> Some thoughts on HashTable patch:
>> 1. Can we allocate a bigger hashtable to reduce the probability of collisions?
>
> Yeah, good idea, might require some study.
In the long run we always have this table filled with nslots entries. But the keys will usually be consecutive numbers (the current working set of CLOG\Multis\etc). So in a happy hashing scenario collisions will only appear for some random backward jumps. I think just size = nslots * 2 will produce results which cannot be improved significantly.
And this reflects the original growth strategy, SH_GROW(tb, tb->size * 2).

>> 2. Can we use a specialised hashtable for this case? I'm afraid hash_search() takes a comparable number of CPU cycles to a simple loop from 0 to 128. We could inline everything and avoid the hashp->hash(keyPtr, hashp->keysize) call. I'm not insisting on a special hash though, just an idea.
>
> I tried really hard to not fall into this rabbit h.... [hack hack
> hack], OK, here's a first attempt to use simplehash,

> Andres's
> steampunk macro-based robinhood template
Sounds magnificent.

> that we're already using for
> several other things
I could not find many tests to be sure that we do not break something...

> , and murmurhash which is inlineable and
> branch-free.
I think pageno is a hash already. Why hash any further? And pages accessed together will have smaller access times due to colocation.

>  I had to tweak it to support "in-place" creation and
> fixed size (in other words, no allocators, for use in shared memory).
We really need to have a test to know what happens when this structure goes out of memory, as you mentioned below. What would be an appropriate place for simplehash tests?

> Then I was annoyed that I had to add a "status" member to our struct,
> so I tried to fix that.
Indeed, sizeof(SlruMappingTableEntry) == 9 seems strange. Will simplehash align it well?


Thanks!

Best regards, Andrey Borodin.


Re: MultiXact\SLRU buffers configuration

From
Thomas Munro
Date:
On Sat, Mar 27, 2021 at 6:31 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > 27 марта 2021 г., в 01:26, Thomas Munro <thomas.munro@gmail.com> написал(а):
> > , and murmurhash which is inlineable and
> > branch-free.

> I think pageno is a hash already. Why hash any further? And pages accessed together will have smaller access time due to colocation.

Yeah, if clog_buffers is large enough then it's already a "perfect
hash", but if it's not then you might get some weird "harmonic"
effects (not sure if that's the right word), basically higher or lower
collision rate depending on coincidences in the data.  If you apply a
hash, the collisions should be evenly spread out so at least it'll be
somewhat consistent.  Does that make sense?
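
For concreteness, the kind of inlineable, branch-free hash I mean is just the standard 32-bit murmur3 finalizer (PostgreSQL's murmurhash32() in src/include/common/hashfn.h has the same shape); a standalone illustration, not a quote from the patch:

    #include <stdint.h>

    /* 32-bit murmur3 finalizer: branch-free avalanche of the page number. */
    static inline uint32_t
    hash_pageno(uint32_t pageno)
    {
        uint32_t h = pageno;

        h ^= h >> 16;
        h *= 0x85ebca6b;
        h ^= h >> 13;
        h *= 0xc2b2ae35;
        h ^= h >> 16;
        return h;
    }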

(At some point I figured out that the syscaches have lower collision
rates and perform better if you use oids directly instead of hashing
them... but then it's easy to create a pathological pattern of DDL
that turns your hash table into a linked list.  Not sure what to think
about that.)

> >  I had to tweak it to support "in-place" creation and
> > fixed size (in other words, no allocators, for use in shared memory).

> We really need to have a test to know what happens when this structure goes out of memory, as you mentioned below. What would be an appropriate place for simplehash tests?

Good questions.  This has to be based on being guaranteed to have
enough space for all of the entries, so the question is really just
"how bad can performance get with different load factors".  FWIW there
were some interesting cases with clustering when simplehash was first
used in the executor (see commits ab9f2c42 and parent) which required
some work on hashing quality to fix.

> > Then I was annoyed that I had to add a "status" member to our struct,
> > so I tried to fix that.

> Indeed, sizeof(SlruMappingTableEntry) == 9 seems strange. Will simplehash align it well?

With that "intrusive status" patch, the size is back to 8.  But I
think I made a mistake: I made it steal some key space to indicate
presence, but I think the presence test should really get access to
the whole entry so that you can encode it in more ways.  For example,
with slotno == -1.

Alright, considering the date, if we want to get this into PostgreSQL
14 it's time to make some decisions.

1.  Do we want customisable SLRU sizes in PG14?

+1 from me, we have multiple reports of performance gains from
increasing various different SLRUs, and it's easy to find workloads
that go faster.

One thought: it'd be nice if the user could *see* the current size,
when using the default.  SHOW clog_buffers -> 0 isn't very helpful if
you want to increase it, but don't know what it's currently set to.
Not sure off the top of my head how best to do that.

2.  What names do we want the GUCs to have?  Here's what we have:

Proposed GUC              Directory            System views
clog_buffers              pg_xact              Xact
multixact_offsets_buffers pg_multixact/offsets MultiXactOffset
multixact_members_buffers pg_multixact/members MultiXactMember
notify_buffers            pg_notify            Notify
serial_buffers            pg_serial            Serial
subtrans_buffers          pg_subtrans          Subtrans
commit_ts_buffers         pg_commit_ts         CommitTs

By system views, I mean pg_stat_slru, pg_shmem_allocations and
pg_stat_activity (lock names add "SLRU" on the end).

Observations:

It seems obvious that "clog_buffers" should be renamed to "xact_buffers".
It's not clear whether the multixact GUCs should have the extra "s"
like the directories, or not, like the system views.
I see that we have "Shared Buffer Lookup Table" in
pg_shmem_allocations, so where I generated names like "Subtrans
Mapping Table" I should change that to "Lookup" to match.

3.  What recommendations should we make about how to set it?

I think the answer depends partially on the next questions!  I think
we should probably at least say something short about the pg_stat_slru
view (cache miss rate) and pg_stat_activity view (waits on locks), and
how to tell if you might need to increase it.  I think this probably
needs a new paragraph, separate from the docs for the individual GUC.

4.  Do we want to ship the dynahash patch?

+0.9.  The slight hesitation is that it's new code written very late
in the cycle, so it may still have bugs or unintended consequences,
and as you said, at small sizes the linear search must be faster than
the hash computation.  Could you help test it, and try to break it?
Can we quantify the scaling effect for some interesting workloads, to
see at what size the dynahash beats the linear search, so that we can
make an informed decision?  Of course, without a hash table, large
sizes will surely work badly, so it'd be tempting to restrict the size
you can set the GUC to.

If we do include the dynahash patch, then I think it would also be
reasonable to change the formula for the default, to make it higher on
large systems.  The restriction to 128 buffers (= 1MB) doesn't make
much sense on a high frequency OLTP system with 128GB of shared
buffers or even 4GB.  I think "unleashing better defaults" would
actually be bigger news than the GUC for typical users, because
they'll just see PG14 use a few extra MB and go faster without having
to learn about these obscure new settings.

5.  Do we want to ship the simplehash patch?

-0.5.  It's a bit too exciting for the last minute, so I'd be inclined
to wait until the next cycle to do some more research and testing.  I
know it's a better idea in the long run.



Re: MultiXact\SLRU buffers configuration

From
Andrey Borodin
Date:

> 29 марта 2021 г., в 02:15, Thomas Munro <thomas.munro@gmail.com> написал(а):
>
> On Sat, Mar 27, 2021 at 6:31 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>>> 27 марта 2021 г., в 01:26, Thomas Munro <thomas.munro@gmail.com> написал(а):
>>> , and murmurhash which is inlineable and
>>> branch-free.
>
>> I think pageno is a hash already. Why hash any further? And pages accessed together will have smaller access time due to colocation.
>
> Yeah, if clog_buffers is large enough then it's already a "perfect
> hash", but if it's not then you might get some weird "harmonic"
> effects (not sure if that's the right word), basically higher or lower
> collision rate depending on coincidences in the data.  If you apply a
> hash, the collisions should be evenly spread out so at least it'll be
> somewhat consistent.  Does that make sense?
As far as I understand, "harmonic" effects only make sense if the distribution is unknown. A hash protects from "periodic" data when the period equals the hash table size. I don't think we need to protect against this case; SLRU data is expected to be localised...
The cost of this protection is the need to calculate a murmur hash on each SLRU lookup. Probably 10-100ns. Seems like not a big deal.

> (At some point I figured out that the syscaches have lower collision
> rates and perform better if you use oids directly instead of hashing
> them... but then it's easy to create a pathological pattern of DDL
> that turns your hash table into a linked list.  Not sure what to think
> about that.)
>
>>> I had to tweak it to support "in-place" creation and
>>> fixed size (in other words, no allocators, for use in shared memory).
>
>> We really need to have a test to know what happens when this structure goes out of memory, as you mentioned below. What would be an appropriate place for simplehash tests?
>
> Good questions.  This has to be based on being guaranteed to have
> enough space for all of the entries, so the question is really just
> "how bad can performance get with different load factors".  FWIW there
> were some interesting cases with clustering when simplehash was first
> used in the executor (see commits ab9f2c42 and parent) which required
> some work on hashing quality to fix.
Interesting read. I didn't know much about simplehash, but it seems there are still many cases where it can be put to good use. I always wondered why Postgres uses only Larson's linear hash.

>
>>> Then I was annoyed that I had to add a "status" member to our struct,
>>> so I tried to fix that.
>
>> Indeed, sizeof(SlruMappingTableEntry) == 9 seems strange. Will simplehash align it well?
>
> With that "intrusive status" patch, the size is back to 8.  But I
> think I made a mistake: I made it steal some key space to indicate
> presence, but I think the presence test should really get access to
> the whole entry so that you can encode it in more ways.  For example,
> with slotno == -1.
>
> Alright, considering the date, if we want to get this into PostgreSQL
> 14 it's time to make some decisions.
>
> 1.  Do we want customisable SLRU sizes in PG14?
>
> +1 from me, we have multiple reports of performance gains from
> increasing various different SLRUs, and it's easy to find workloads
> that go faster.
Yes, this is the main point of this discussion. So +1 from me too.

>
> One thought: it'd be nice if the user could *see* the current size,
> when using the default.  SHOW clog_buffers -> 0 isn't very helpful if
> you want to increase it, but don't know what it's currently set to.
> Not sure off the top of my head how best to do that.
Don't we expect that the SHOW command indicates exactly the same value as in the config or SET command? If this convention does not exist - probably showing the effective value is a good idea.


> 2.  What names do we want the GUCs to have?  Here's what we have:
>
> Proposed GUC              Directory            System views
> clog_buffers              pg_xact              Xact
> multixact_offsets_buffers pg_multixact/offsets MultiXactOffset
> multixact_members_buffers pg_multixact/members MultiXactMember
> notify_buffers            pg_notify            Notify
> serial_buffers            pg_serial            Serial
> subtrans_buffers          pg_subtrans          Subtrans
> commit_ts_buffers         pg_commit_ts         CommitTs
>
> By system views, I mean pg_stat_slru, pg_shmem_allocations and
> pg_stat_activity (lock names add "SLRU" on the end).
>
> Observations:
>
> It seems obvious that "clog_buffers" should be renamed to "xact_buffers".
+1
> It's not clear whether the multixact GUCs should have the extra "s"
> like the directories, or not, like the system views.
I think we should break the tie by a native English speaker's ear or typing habits. I'm not a native speaker.

> I see that we have "Shared Buffer Lookup Table" in
> pg_shmem_allocations, so where I generated names like "Subtrans
> Mapping Table" I should change that to "Lookup" to match.
>
> 3.  What recommendations should we make about how to set it?
>
> I think the answer depends partially on the next questions!  I think
> we should probably at least say something short about the pg_stat_slru
> view (cache miss rate) and pg_stat_activity view (waits on locks), and
> how to tell if you might need to increase it.  I think this probably
> needs a new paragraph, separate from the docs for the individual GUC.
I can only suggest an incident-driven approach:
1. Observe a ridiculous number of backends waiting on a particular SLRU.
2. Double the buffers for that SLRU.
3. Goto 1.
I don't think we should mention this approach in the docs.
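If it helps, step 1 usually shows up as SLRU waits in pg_stat_activity together with a poor hit rate in pg_stat_slru. A rough query for eyeballing the latter (just an illustration over the pg_stat_slru view available since v13, not part of the patch set):

    -- per-SLRU cache miss rate; a high miss percentage combined with SLRU
    -- lock waits suggests that cache is too small
    SELECT name,
           blks_hit,
           blks_read,
           round(100.0 * blks_read / nullif(blks_hit + blks_read, 0), 2) AS miss_pct
    FROM pg_stat_slru
    ORDER BY blks_read DESC;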

> 4.  Do we want to ship the dynahash patch?

This patch allows throwing an effectively unlimited amount of memory at the problem of SLRU waiting for IO, so the scale of the improvement is much higher. Do I want us to ship this patch? Definitely. Does this change much? I don't know.

>
> +0.9.  The slight hesitation is that it's new code written very late
> in the cycle, so it may still have bugs or unintended consequences,
> and as you said, at small sizes the linear search must be faster than
> the hash computation.  Could you help test it, and try to break it?
I'll test it and try to break it.

> Can we quantify the scaling effect for some interesting workloads, to
> see at what size the dynahash beats the linear search, so that we can
> make an informed decision?
I think we cannot statistically distinguish linear search from hash search by means of the SLRU alone. But we can create some synthetic benchmarks.

> Of course, without a hash table, large
> sizes will surely work badly, so it'd be tempting to restrict the size
> you can set the GUC to.
>
> If we do include the dynahash patch, then I think it would also be
> reasonable to change the formula for the default, to make it higher on
> large systems.  The restriction to 128 buffers (= 1MB) doesn't make
> much sense on a high frequency OLTP system with 128GB of shared
> buffers or even 4GB.  I think "unleashing better defaults" would
> actually be bigger news than the GUC for typical users, because
> they'll just see PG14 use a few extra MB and go faster without having
> to learn about these obscure new settings.
I agree. I don't see why we would need to limit buffers to 128 in presence of hash search.

> 5.  Do we want to ship the simplehash patch?
>
> -0.5.  It's a bit too exciting for the last minute, so I'd be inclined
> to wait until the next cycle to do some more research and testing.  I
> know it's a better idea in the long run.
OK, obviously, it's safer decision.


My TODO list:
1. Try to break patch set v13-[0001-0004]
2. Think how to measure performance of linear search versus hash search in SLRU buffer mapping.

Thanks!

Best regards, Andrey Borodin.


Re: MultiXact\SLRU buffers configuration

From
Andrey Borodin
Date:
> 29 марта 2021 г., в 11:26, Andrey Borodin <x4mmm@yandex-team.ru> написал(а):
>
> My TODO list:
> 1. Try to break patch set v13-[0001-0004]
> 2. Think how to measure performance of linear search versus hash search in SLRU buffer mapping.

Hi Thomas!
I'm still doing my homework. So far all I've caught is that the "utils/dynahash.h" include is not necessary.

I'm thinking about hashtables, and measuring performance near the optimum of linear search does not seem like a good idea now. It's impossible to prove that the difference is statistically significant on all platforms. Even on one platform the measurements are just too noisy.

The shared buffers lookup table is indeed very similar to this SLRU lookup table. And it does not try to use more memory than needed. I could not find a pgbench-visible impact of growing the shared buffer lookup table. Obviously, because it's not a bottleneck on a regular workload. And it's hard to guess a representative pathological workload.

In fact, this thread started with a proposal to use a reader-writer lock for multis (instead of an exclusive lock), and this proposal encountered the same problem. It's very hard to create a stable reproduction of a pathological workload where this lock is heavily contended. Many people observed the problem, but still there is no open repro.

I bet it will be hard to prove that simplehash is any better than HTAB. But if it is really better, shared buffers could benefit from the same technique.

I think it's just fine to use HTAB of the normal size, as long as shared buffers do so. But there we allocate slightly more space: InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS). Don't we need to allocate nslots + 1? It seems that we always do SlruMappingRemove() before SlruMappingAdd(), so it is not necessary.

Thanks!

Best regards, Andrey Borodin.


Re: MultiXact\SLRU buffers configuration

From
Thomas Munro
Date:
On Thu, Apr 1, 2021 at 10:09 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > 29 марта 2021 г., в 11:26, Andrey Borodin <x4mmm@yandex-team.ru> написал(а):
> > My TODO list:
> > 1. Try to break patch set v13-[0001-0004]
> > 2. Think how to measure performance of linear search versus hash search in SLRU buffer mapping.
>
> Hi Thomas!
> I'm still doing my homework. So far all I've caught is that the "utils/dynahash.h" include is not necessary.

Thanks.  Here's a new patch with that fixed, and also:

1.  New names ("... Mapping Table" -> "... Lookup Table" in
pg_shmem_allocations view, "clog_buffers" -> "xact_buffers") and a
couple of typos fixed here and there.

2.  Remove the cap of 128 buffers for xact_buffers as agreed.  We
still need a cap though, to avoid a couple of kinds of overflow inside
slru.c, both when computing the default value and accepting a
user-provided number.  I introduced SLRU_MAX_ALLOWED_BUFFERS to keep
it <= 1GB, and tested this on a 32 bit build with extreme block sizes.

Likewise, I removed the cap of 16 buffers for commit_ts_buffers, but
only if you have track_commit_timestamp enabled.  It seems silly to
waste 1MB per 1GB of shared_buffers on a feature you're not using.  So
the default is capped at 16 in that case to preserve existing
behaviour, but otherwise can be set very large if you want.

I think it's plausible that we'll want to make the multixact sizes
adaptive too, but I think that might have to be a job for later.  Likewise,
I am sure that subtransaction-heavy workloads might be slower than
they need to be due to the current default, but I have not done the
research.  With these new GUCs, people are free to experiment and
develop theories about what the defaults should be in v15.

3.  In the special case of xact_buffers, there is a lower cap of
512MB, because there's no point trying to cache more xids than there
can be in existence, and that is computed by working backwards from
CLOG_XACTS_PER_PAGE etc.  It's not possible to do the same sort of
thing for the other SLRUs without overflow problems, and it doesn't
seem worth trying to fix that right now (1GB of cached commit
timestamps ought to be enough for anyone™, while the theoretical
maximum is 10 bytes for 2b xids = 20GB).

To make this more explicit for people not following our discussion in
detail: with shared_buffers=0...512MB, the behaviour doesn't change.
But for shared_buffers=1GB you'll get twice as many xact_buffers as
today (2MB instead of being capped at 1MB), and it keeps scaling
linearly from there at 0.2%.  In other words, all real world databases
will get a boost in this department.

4.  Change the default for commit_ts_buffers back to shared_buffers /
1024 (with a minimum of 4), because I think you might have changed it
by a copy and paste error -- or did you intend to make the default
higher?

5.  Improve descriptions for the GUCs, visible in pg_settings view, to
match the documentation for related wait events.  So "for commit log
SLRU" -> "for the transaction status SLRU cache", and similar
corrections elsewhere.  (I am tempted to try to find a better word
than "SLRU", which doesn't seem relevant to users, but for now
consistency is good.)

6.  Added a callback so that SHOW xact_buffers and SHOW
commit_ts_buffers display the real size in effect (instead of "0" for
default).
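
The callback is an ordinary GUC show hook, roughly along these lines (a sketch only; the helper used to compute the effective size is an assumption, not a quote from the patch):

    /*
     * Sketch of a show hook so that SHOW xact_buffers reports the size
     * actually in effect rather than the default placeholder of 0.
     * Assumes clog.h's CLOGShmemBuffers() as the source of the value.
     */
    static char xact_buffers_shown[32];

    static const char *
    show_xact_buffers(void)
    {
        snprintf(xact_buffers_shown, sizeof(xact_buffers_shown),
                 "%d", (int) CLOGShmemBuffers());   /* effective buffer count */
        return xact_buffers_shown;
    }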

I tried running with xact_buffers=1 and soon saw why you changed it to
interpret 1 the same as 0; with 1 you hit buffer starvation and get
stuck.  I wish there were some way to say "the default for this GUC is
0, but if it's not zero then it's got to be at least 4".  I didn't
study the theoretical basis for the previous minimum value of 4, so I
think we should keep it that way, so that if you say 3 you get 4.  I
thought it was better to express that like so:

    /* Use configured value if provided. */
    if (xact_buffers > 0)
        return Max(4, xact_buffers);
    return Min(CLOG_MAX_ALLOWED_BUFFERS, Max(4, NBuffers / 512));

> I'm thinking about hashtables, and measuring performance near the optimum of linear search does not seem like a good idea now.
> It's impossible to prove that the difference is statistically significant on all platforms. Even on one platform the measurements are just too noisy.
>
> The shared buffers lookup table is indeed very similar to this SLRU lookup table. And it does not try to use more memory than needed. I could not find a pgbench-visible impact of growing the shared buffer lookup table. Obviously, because it's not a bottleneck on a regular workload. And it's hard to guess a representative pathological workload.

Thanks for testing.  I agree that it's a good idea to follow the main
buffer pool's approach for our first version of this.  One small
difference is that the SLRU patch performs the hash computation while
it holds the lock.  If we computed the hash first and used
hash_search_with_hash_value(), we could compute it before we obtain
the lock, like the main buffer pool.

If we computed the hash value first, we could also ignore the rule in
the documentation for hash_search_with_hash_value() that says that you
must calculate it with get_hash_value(), and just call the hash
function ourselves, so that it's fully inlinable.  The same
opportunity exists for the main buffer pool.  That'd get you one of
the micro-optimisations that simplehash.h offers.  Whether it's worth
bothering with, I don't know.
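
Concretely, the lookup could look something like the sketch below; the entry struct and function name are assumptions for illustration, while get_hash_value() and hash_search_with_hash_value() are the existing dynahash API:

    #include "postgres.h"
    #include "storage/lwlock.h"
    #include "utils/hsearch.h"

    typedef struct SlruMappingTableEntry
    {
        int     pageno;         /* key */
        int     slotno;
    } SlruMappingTableEntry;

    /* Compute the hash before taking the lock, like the main buffer pool. */
    static int
    SlruMappingLookup(HTAB *mapping_table, LWLock *control_lock, int pageno)
    {
        SlruMappingTableEntry *entry;
        uint32  hashcode;
        int     slotno = -1;

        hashcode = get_hash_value(mapping_table, &pageno);  /* no lock held */

        LWLockAcquire(control_lock, LW_EXCLUSIVE);
        entry = hash_search_with_hash_value(mapping_table, &pageno, hashcode,
                                            HASH_FIND, NULL);
        if (entry != NULL)
            slotno = entry->slotno;
        LWLockRelease(control_lock);

        return slotno;
    }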

> In fact, this thread started with a proposal to use a reader-writer lock for multis (instead of an exclusive lock), and this proposal encountered the same problem. It's very hard to create a stable reproduction of a pathological workload where this lock is heavily contended. Many people observed the problem, but still there is no open repro.
>
> I bet it will be hard to prove that simplehash is any better than HTAB. But if it is really better, shared buffers could benefit from the same technique.

Agreed, there are a lot of interesting future projects in this area,
when you compare the main buffer pool, these special buffer pools, and
maybe also the "shared relfilenode" pool patch I have proposed for v15
(CF entry 2933).  All have mapping tables and buffer replacement
algorithms (why should they be different?), one has partitions, some
have atomic-based header locks, they interlock with WAL differently
(on page LSN, FPIs and checksum support), ... etc etc.

> I think it's just fine to use HTAB of the normal size, as long as shared buffers do so. But there we allocate slightly more space: InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS). Don't we need to allocate nslots + 1? It seems that we always do SlruMappingRemove() before SlruMappingAdd(), so it is not necessary.

Yeah, we never try to add more elements than allowed, because we have
a big lock around the mapping.  The main buffer mapping table has a
more concurrent design and might temporarily have one extra entry per
partition.

Attachment

Re: MultiXact\SLRU buffers configuration

From
Andrey Borodin
Date:

> 1 апр. 2021 г., в 06:40, Thomas Munro <thomas.munro@gmail.com> написал(а):
>
> 2.  Remove the cap of 128 buffers for xact_buffers as agreed.  We
> still need a cap though, to avoid a couple of kinds of overflow inside
> slru.c, both when computing the default value and accepting a
> user-provided number.  I introduced SLRU_MAX_ALLOWED_BUFFERS to keep
> it <= 1GB, and tested this on a 32 bit build with extreme block sizes.
BTW we do not document maximum values right now.
I was toying around with big values. For example, if we set xact_buffers to various big values we can get something like
FATAL:  not enough shared memory for data structure "Notify" (72768 bytes requested)
FATAL:  not enough shared memory for data structure "Async Queue Control" (2492 bytes requested)
FATAL:  not enough shared memory for data structure "Checkpointer Data" (393280 bytes requested)

But never anything about xact_buffers. I don't think it's important, though.

>
> Likewise, I removed the cap of 16 buffers for commit_ts_buffers, but
> only if you have track_commit_timestamp enabled.
Is there a reason to leave 16 pages if commit_ts is disabled? They might be useful for some artefacts of previously enabled commit_ts?

> 4.  Change the default for commit_ts_buffers back to shared_buffers /
> 1024 (with a minimum of 4), because I think you might have changed it
> by a copy and paste error -- or did you intend to make the default
> higher?
I changed the default due to some experiments with
https://www.postgresql.org/message-id/flat/20210115220744.GA24457%40alvherre.pgsql
In fact the most important part of that thread was removing the cap, which is done by the patchset now.

Thanks!

Best regards, Andrey Borodin.






Re: MultiXact\SLRU buffers configuration

From
Thomas Munro
Date:
On Sun, Apr 4, 2021 at 7:57 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> I was toying around with big values. For example if we set different big xact_buffers we can get something like
> FATAL:  not enough shared memory for data structure "Notify" (72768 bytes requested)
> FATAL:  not enough shared memory for data structure "Async Queue Control" (2492 bytes requested)
> FATAL:  not enough shared memory for data structure "Checkpointer Data" (393280 bytes requested)
>
> But never anything about xact_buffers. I don't think it's important, though.

I had added the hash table size in SimpleLruShmemSize(), but then SimpleLruInit() passed that same value in when allocating the struct, so the struct was oversized.  Oops.  Fixed.

> > Likewise, I removed the cap of 16 buffers for commit_ts_buffers, but
> > only if you have track_commit_timestamp enabled.
 
> Is there a reason to leave 16 pages if commit_ts is disabled? They might be useful for some artefacts of previously enabled commit_ts?

Alvaro, do you have an opinion on that?

The remaining thing that bothers me about this patch set is that there is still a linear search in the replacement algorithm, and it runs with an exclusive lock.  That creates a serious problem for large caches that still aren't large enough.  I wonder if we can do something to improve that situation in the time we have.  I considered a bunch of ideas but could only find one that fits with slru.c's simplistic locking while tracking recency.  What do you think about a hybrid of SLRU and random replacement, that retains some characteristics of both?  You could think of it as being a bit like the tournament selection of the genetic algorithm, with a tournament size of (say) 8 or 16.  Any ideas on how to evaluate this and choose the number?  See attached.
Attachment

Re: MultiXact\SLRU buffers configuration

From
Andrey Borodin
Date:

> 7 апр. 2021 г., в 08:59, Thomas Munro <thomas.munro@gmail.com> написал(а):
>
> The remaining thing that bothers me about this patch set is that there is still a linear search in the replacement algorithm, and it runs with an exclusive lock.  That creates a serious problem for large caches that still aren't large enough.  I wonder if we can do something to improve that situation in the time we have.  I considered a bunch of ideas but could only find one that fits with slru.c's simplistic locking while tracking recency.  What do you think about a hybrid of SLRU and random replacement, that retains some characteristics of both?  You could think of it as being a bit like the tournament selection of the genetic algorithm, with a tournament size of (say) 8 or 16.  Any ideas on how to evaluate this and choose the number?  See attached.
>
<v15-0001-Add-a-buffer-mapping-table-for-SLRUs.patch><v15-0002-Make-all-SLRU-buffer-sizes-configurable.patch><v15-0003-Use-hybrid-random-SLRU-replacement-for-SLRUs.patch>

Maybe instead of a fully associative cache with random replacement we could use a 1-associative cache?
I.e. each page can reside only in one specific buffer slot. If there's something else there - evict it.
I think this would be as efficient as an RR cache. And it's soooo fast.

Best regards, Andrey Borodin.


Re: MultiXact\SLRU buffers configuration

From
Andrey Borodin
Date:

> 7 апр. 2021 г., в 14:44, Andrey Borodin <x4mmm@yandex-team.ru> написал(а):
>
> Maybe instead of a fully associative cache with random replacement we could use a 1-associative cache?
> I.e. each page can reside only in one specific buffer slot. If there's something else there - evict it.
> I think this would be as efficient as an RR cache. And it's soooo fast.

I thought a bit more and understood that RR is protected from two competing pages in working set, while 1-associative
cacheis not. So, discard that idea. 

Best regards, Andrey Borodin.





Re: MultiXact\SLRU buffers configuration

From
Thomas Munro
Date:
On Thu, Apr 8, 2021 at 12:13 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > 7 апр. 2021 г., в 14:44, Andrey Borodin <x4mmm@yandex-team.ru> написал(а):
> > Maybe instead of a fully associative cache with random replacement we could use a 1-associative cache?
> > I.e. each page can reside only in one specific buffer slot. If there's something else there - evict it.
> > I think this would be as efficient as an RR cache. And it's soooo fast.
>
> I thought a bit more and understood that RR is protected from two competing pages in the working set, while a 1-associative cache is not. So, discard that idea.

It's an interesting idea.  I know that at least one proprietary fork
just puts the whole CLOG in memory for direct indexing, which is what
we'd have here if we said "oh, your xact_buffers setting is so large
I'm just going to use slotno = pageno & mask".

Here's another approach that is a little less exciting than
"tournament RR" (or whatever that should be called; I couldn't find an
established name for it).  This version is just our traditional linear
search, except that it stops at 128, and remembers where to start from
next time (like a sort of Fisher-Price GCLOCK hand).  This feels more
committable to me.  You can argue that all buffers above 128 are bonus
buffers that PostgreSQL 13 didn't have, so the fact that we can no
longer find the globally least recently used page when you set
xact_buffers > 128 doesn't seem too bad to me, as an incremental step
(but to be clear, of course we can do better than this with more work
in later releases).
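
Stripped of the status checks (empty slots, pages with I/O in progress) that the real code needs, the victim search amounts to the sketch below; search_slotno is the field the patch adds, the other fields already exist in SlruSharedData:

    #define MAX_REPLACEMENT_SEARCH  128     /* cap assumed from the patch */

    /*
     * Scan at most MAX_REPLACEMENT_SEARCH slots starting from a remembered
     * position, pick the least recently used one seen, and remember where
     * to resume next time.
     */
    static int
    SlruSelectVictimSlot(SlruShared shared)
    {
        int     max_search = Min(shared->num_slots, MAX_REPLACEMENT_SEARCH);
        int     slotno = shared->search_slotno;
        int     best_slot = -1;
        int     best_age = -1;

        for (int i = 0; i < max_search; i++)
        {
            int     age = shared->cur_lru_count - shared->page_lru_count[slotno];

            if (age > best_age)
            {
                best_age = age;
                best_slot = slotno;
            }
            slotno = (slotno + 1) % shared->num_slots;
        }
        shared->search_slotno = slotno;     /* resume here on the next eviction */
        return best_slot;
    }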

Attachment

Re: MultiXact\SLRU buffers configuration

From
Andrey Borodin
Date:

> 8 апр. 2021 г., в 03:30, Thomas Munro <thomas.munro@gmail.com> написал(а):
>
> Here's another approach that is a little less exciting than
> "tournament RR" (or whatever that should be called; I couldn't find an
> established name for it).  This version is just our traditional linear
> search, except that it stops at 128, and remembers where to start from
> next time (like a sort of Fisher-Price GCLOCK hand).  This feels more
> committable to me.  You can argue that all buffers above 128 are bonus
> buffers that PostgreSQL 13 didn't have, so the fact that we can no
> longer find the globally least recently used page when you set
> xact_buffers > 128 doesn't seem too bad to me, as an incremental step
> (but to be clear, of course we can do better than this with more work
> in later releases).
I agree that this version of eviction seems much more effective and less intrusive than RR. And it's still LRU, which is important for a subsystem that is called SLRU.
shared->search_slotno is initialized implicitly with memset(). But this seems like a common practice.
Also the comment above "max_search = Min(shared->num_slots, MAX_REPLACEMENT_SEARCH);" does not reflect the changes.

Besides this patch looks good to me.

Thanks!

Best regards, Andrey Borodin.


Re: MultiXact\SLRU buffers configuration

From
Thomas Munro
Date:
On Thu, Apr 8, 2021 at 7:24 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> I agree that this version of eviction seems much more effective and less intrusive than RR. And it's still LRU, which is important for a subsystem that is called SLRU.
> shared->search_slotno is initialized implicitly with memset(). But this seems like a common practice.
> Also the comment above "max_search = Min(shared->num_slots, MAX_REPLACEMENT_SEARCH);" does not reflect the changes.
>
> Besides this patch looks good to me.

Thanks!  I chickened out of committing a buffer replacement algorithm
patch written 11 hours before the feature freeze, but I also didn't
really want to commit the GUC patch without that.  Ahh, if only we'd
latched onto the real problems here just a little sooner, but there is
always PostgreSQL 15, I heard it's going to be amazing.  Moved to next
CF.



Re: MultiXact\SLRU buffers configuration

From
Andrey Borodin
Date:

> 8 апр. 2021 г., в 15:22, Thomas Munro <thomas.munro@gmail.com> написал(а):
>
> On Thu, Apr 8, 2021 at 7:24 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>> I agree that this version of eviction seems much more effective and less intrusive than RR. And it's still LRU, which is important for a subsystem that is called SLRU.
>> shared->search_slotno is initialized implicitly with memset(). But this seems like a common practice.
>> Also the comment above "max_search = Min(shared->num_slots, MAX_REPLACEMENT_SEARCH);" does not reflect the changes.
>>
>> Besides this patch looks good to me.
>
> Thanks!  I chickened out of committing a buffer replacement algorithm
> patch written 11 hours before the feature freeze, but I also didn't
> really want to commit the GUC patch without that.  Ahh, if only we'd
> latched onto the real problems here just a little sooner, but there is
> always PostgreSQL 15, I heard it's going to be amazing.  Moved to next
> CF.

I have one more idea inspired by CPU caches.
Let's make the SLRU n-associative, where n ~ 8.
We can divide the buffers into "banks"; the number of banks must be a power of 2.
All banks are of equal size. We choose the bank size to approximately satisfy the user's configured buffer size.
Each page can live only within one bank. We use the same search and eviction algorithms as before, but we only need to search\evict over 8 elements.
All SLRU data of a single bank will be colocated within at most 2 cache lines.

I did not come up with an idea for how to avoid the multiplication bank_number * bank_size when the user configures 31337 buffers (any number that is radically not a power of 2).
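
Roughly, the in-bank lookup becomes the sketch below (bank_mask and bank_size being the fields the patch adds to SlruSharedData; the rest already exists, and the real code also has to handle eviction within the bank):

    /*
     * Look up a page within its bank only: the bank number is derived from
     * the page number, so at most bank_size slots are ever examined.
     */
    static int
    SlruLookupInBank(SlruShared shared, int pageno)
    {
        int     bankstart = (pageno & shared->bank_mask) * shared->bank_size;
        int     bankend = bankstart + shared->bank_size;

        for (int slotno = bankstart; slotno < bankend; slotno++)
        {
            if (shared->page_number[slotno] == pageno &&
                shared->page_status[slotno] != SLRU_PAGE_EMPTY)
                return slotno;
        }
        return -1;              /* not cached; evict within this same bank */
    }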

PFA patch implementing this idea.

Best regards, Andrey Borodin.



Attachment

Re: MultiXact\SLRU buffers configuration

From
Васильев Дмитрий
Date:
пн, 14 июн. 2021 г. в 15:07, Andrey Borodin <x4mmm@yandex-team.ru>:
PFA patch implementing this idea.

I benchmarked the v17 patches.
Testing was done on a 96-core machine, with PGDATA completely placed in tmpfs.
PostgreSQL was built with CFLAGS -O2.

for-update PgBench script:
\set aid random_zipfian(1, 100, 2)
begin;
select :aid from pgbench_accounts where aid = :aid for update;
update pgbench_accounts set abalance = abalance + 1 where aid = :aid;
update pgbench_accounts set abalance = abalance * 2 where aid = :aid;
update pgbench_accounts set abalance = abalance - 2 where aid = :aid;
end;

Before each test the sample data was loaded with "pgbench -i -s 100"; each test was run 3 times for 1 hour.
The benchmark results are presented for varying multixact_members_buffers and multixact_offsets_buffers (1:2 respectively):
settings                          tps
multixact_members_buffers_64Kb   693.2
multixact_members_buffers_128Kb  691.4
multixact_members_buffers_192Kb  696.3
multixact_members_buffers_256Kb  694.4
multixact_members_buffers_320Kb  692.3
multixact_members_buffers_448Kb  693.7
multixact_members_buffers_512Kb  693.3
vanilla                          676.1

Best regards, Dmitry Vasiliev.

Re: MultiXact\SLRU buffers configuration

From
Andrey Borodin
Date:

>> 8 апр. 2021 г., в 15:22, Thomas Munro <thomas.munro@gmail.com> написал(а):
>>
> I have one more idea inspired by CPU caches.
> Let's make the SLRU n-associative, where n ~ 8.
> We can divide the buffers into "banks"; the number of banks must be a power of 2.
> All banks are of equal size. We choose the bank size to approximately satisfy the user's configured buffer size.
> Each page can live only within one bank. We use the same search and eviction algorithms as before, but we only need to search\evict over 8 elements.
> All SLRU data of a single bank will be colocated within at most 2 cache lines.
>
> I did not come up with an idea for how to avoid the multiplication bank_number * bank_size when the user configures 31337 buffers (any number that is radically not a power of 2).
 

We can avoid this multiplication by using gapped memory under SLRU page_statuses, but from my POV the complexity here is not worth the possible performance gain.
 

PFA rebase of the patchset. Also I've added a patch to combine page_number, page_status, and page_dirty together to touch less cachelines.
 

Best regards, Andrey Borodin.


Attachment

Re: MultiXact\SLRU buffers configuration

From
Julien Rouhaud
Date:
Hi,
On Sun, Dec 26, 2021 at 03:09:59PM +0500, Andrey Borodin wrote:
> 
> 
> PFA rebase of the patchset. Also I've added a patch to combine page_number, page_status, and page_dirty together to touch less cachelines.
 

The cfbot reports some errors on the latest version of the patch:

https://cirrus-ci.com/task/6121317215764480
[04:56:38.432] su postgres -c "make -s -j${BUILD_JOBS} world-bin"
[04:56:48.270] In file included from async.c:134:
[04:56:48.270] ../../../src/include/access/slru.h:47:28: error: expected identifier or ‘(’ before ‘:’ token
[04:56:48.270]    47 | typedef enum SlruPageStatus:int16_t
[04:56:48.270]       |                            ^
[04:56:48.270] ../../../src/include/access/slru.h:53:3: warning: data definition has no type or storage class
[04:56:48.270]    53 | } SlruPageStatus;
[04:56:48.270]       |   ^~~~~~~~~~~~~~
[04:56:48.270] ../../../src/include/access/slru.h:53:3: warning: type defaults to ‘int’ in declaration of
‘SlruPageStatus’[-Wimplicit-int]
 
[04:56:48.270] ../../../src/include/access/slru.h:58:2: error: expected specifier-qualifier-list before
‘SlruPageStatus’
[04:56:48.270]    58 |  SlruPageStatus page_status;
[04:56:48.270]       |  ^~~~~~~~~~~~~~
[04:56:48.270] async.c: In function ‘asyncQueueAddEntries’:
[04:56:48.270] async.c:1448:41: error: ‘SlruPageEntry’ has no member named ‘page_dirty’
[04:56:48.270]  1448 |  NotifyCtl->shared->page_entries[slotno].page_dirty = true;
[04:56:48.270]       |                                         ^
[04:56:48.271] make[3]: *** [<builtin>: async.o] Error 1
[04:56:48.271] make[3]: *** Waiting for unfinished jobs....
[04:56:48.297] make[2]: *** [common.mk:39: commands-recursive] Error 2
[04:56:48.297] make[2]: *** Waiting for unfinished jobs....
[04:56:54.554] In file included from clog.c:36:
[04:56:54.554] ../../../../src/include/access/slru.h:47:28: error: expected identifier or ‘(’ before ‘:’ token
[04:56:54.554]    47 | typedef enum SlruPageStatus:int16_t
[04:56:54.554]       |                            ^
[04:56:54.554] ../../../../src/include/access/slru.h:53:3: warning: data definition has no type or storage class
[04:56:54.554]    53 | } SlruPageStatus;
[04:56:54.554]       |   ^~~~~~~~~~~~~~
[04:56:54.554] ../../../../src/include/access/slru.h:53:3: warning: type defaults to ‘int’ in declaration of
‘SlruPageStatus’[-Wimplicit-int]
 
[04:56:54.554] ../../../../src/include/access/slru.h:58:2: error: expected specifier-qualifier-list before
‘SlruPageStatus’
[04:56:54.554]    58 |  SlruPageStatus page_status;
[04:56:54.554]       |  ^~~~~~~~~~~~~~
[04:56:54.554] clog.c: In function ‘TransactionIdSetPageStatusInternal’:
[04:56:54.554] clog.c:396:39: error: ‘SlruPageEntry’ has no member named ‘page_dirty’
[04:56:54.554]   396 |  XactCtl->shared->page_entries[slotno].page_dirty = true;
[04:56:54.554]       |                                       ^
[04:56:54.554] In file included from ../../../../src/include/postgres.h:46,
[04:56:54.554]                  from clog.c:33:
[04:56:54.554] clog.c: In function ‘BootStrapCLOG’:
[04:56:54.554] clog.c:716:47: error: ‘SlruPageEntry’ has no member named ‘page_dirty’
[04:56:54.554]   716 |  Assert(!XactCtl->shared->page_entries[slotno].page_dirty);
[04:56:54.554]       |                                               ^
[04:56:54.554] ../../../../src/include/c.h:848:9: note: in definition of macro ‘Assert’
[04:56:54.554]   848 |   if (!(condition)) \
[04:56:54.554]       |         ^~~~~~~~~
[04:56:54.554] clog.c: In function ‘TrimCLOG’:
[04:56:54.554] clog.c:801:40: error: ‘SlruPageEntry’ has no member named ‘page_dirty’
[04:56:54.554]   801 |   XactCtl->shared->page_entries[slotno].page_dirty = true;
[04:56:54.554]       |                                        ^
[04:56:54.554] In file included from ../../../../src/include/postgres.h:46,
[04:56:54.554]                  from clog.c:33:
[04:56:54.554] clog.c: In function ‘clog_redo’:
[04:56:54.554] clog.c:997:48: error: ‘SlruPageEntry’ has no member named ‘page_dirty’
[04:56:54.554]   997 |   Assert(!XactCtl->shared->page_entries[slotno].page_dirty);
[04:56:54.554]       |                                                ^
[04:56:54.554] ../../../../src/include/c.h:848:9: note: in definition of macro ‘Assert’
[04:56:54.554]   848 |   if (!(condition)) \
[04:56:54.554]       |         ^~~~~~~~~
[04:56:54.555] make[4]: *** [<builtin>: clog.o] Error 1
[04:56:54.555] make[3]: *** [../../../src/backend/common.mk:39: transam-recursive] Error 2
[04:56:54.555] make[3]: *** Waiting for unfinished jobs....
[04:56:56.405] make[2]: *** [common.mk:39: access-recursive] Error 2
[04:56:56.405] make[1]: *** [Makefile:42: all-backend-recurse] Error 2
[04:56:56.405] make: *** [GNUmakefile:21: world-bin-src-recurse] Error 2
[04:56:56.407]
[04:56:56.407] Exit status: 2

Could you send a new version?  In the meantime I will switch the patch status
to Waiting on Author.



Re: MultiXact\SLRU buffers configuration

From
Shawn Debnath
Date:
On Fri, Jan 14, 2022 at 05:28:38PM +0800, Julien Rouhaud wrote:
> > PFA rebase of the patchset. Also I've added a patch to combine 
> > page_number, page_status, and page_dirty together to touch less 
> > cachelines.
> 
> The cfbot reports some errors on the latest version of the patch:
> 
> https://cirrus-ci.com/task/6121317215764480
> [...]
> Could you send a new version?  In the meantime I will switch the patch 
> status
> to Waiting on Author.
> 

I was planning on running a set of stress tests on these patches. Could 
we confirm which ones we plan to include in the commitfest?

-- 
Shawn Debnath
Amazon Web Services (AWS)



Re: MultiXact\SLRU buffers configuration

From
Andrey Borodin
Date:

> 15 янв. 2022 г., в 03:20, Shawn Debnath <sdn@amazon.com> написал(а):
>
> On Fri, Jan 14, 2022 at 05:28:38PM +0800, Julien Rouhaud wrote:
>>> PFA rebase of the patchset. Also I've added a patch to combine
>>> page_number, page_status, and page_dirty together to touch less
>>> cachelines.
>>
>> The cfbot reports some errors on the latest version of the patch:
>>
>> https://cirrus-ci.com/task/6121317215764480
>> [...]
>> Could you send a new version?  In the meantime I will switch the patch
>> status
>> to Waiting on Author.
>>
>
> I was planning on running a set of stress tests on these patches. Could
> we confirm which ones we plan to include in the commitfest?

Many thanks for your interest. Here's the  latest version.

Best regards, Andrey Borodin.

Attachment

Re: MultiXact/SLRU buffers configuration

From
Justin Pryzby
Date:
On Sat, Jan 15, 2022 at 12:16:59PM +0500, Andrey Borodin wrote:
> > 15 янв. 2022 г., в 03:20, Shawn Debnath <sdn@amazon.com> написал(а):
> > On Fri, Jan 14, 2022 at 05:28:38PM +0800, Julien Rouhaud wrote:
> >>> PFA rebase of the patchset. Also I've added a patch to combine 
> >>> page_number, page_status, and page_dirty together to touch less 
> >>> cachelines.
> >> 
> >> The cfbot reports some errors on the latest version of the patch:
> >> 
> >> https://cirrus-ci.com/task/6121317215764480
> >> [...]
> >> Could you send a new version?  In the meantime I will switch the patch 
> >> status to Waiting on Author.
> > 
> > I was planning on running a set of stress tests on these patches. Could 
> > we confirm which ones we plan to include in the commitfest?
> 
> Many thanks for your interest. Here's the  latest version.

This is failing to compile under linux and windows due to bitfield syntax.
http://cfbot.cputube.org/andrey-borodin.html

And compile warnings:

slru.c: In function ‘SlruAdjustNSlots’:
slru.c:161:2: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
  161 |  int nbanks = 1;
      |  ^~~
slru.c: In function ‘SimpleLruReadPage_ReadOnly’:
slru.c:530:2: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
  530 |  int bankstart = (pageno & shared->bank_mask) * shared->bank_size;
      |  ^~~

Note that you can test the CI result using any github account without waiting
for the cfbot.  See ./src/tools/ci/README.

-- 
Justin



Re: MultiXact/SLRU buffers configuration

From
Andrey Borodin
Date:

> 15 янв. 2022 г., в 20:46, Justin Pryzby <pryzby@telsasoft.com> написал(а):
>>>
>>> I was planning on running a set of stress tests on these patches. Could
>>> we confirm which ones we plan to include in the commitfest?
>>
>> Many thanks for your interest. Here's the latest version.
>
> This is failing to compile under linux and windows due to bitfield syntax.
> http://cfbot.cputube.org/andrey-borodin.html

Uh, sorry, I formatted a patch from wrong branch.

Just tested Cirrus. It's wonderful, thanks! Really faster than doing stuff on my machines...

Best regards, Andrey Borodin.

Attachment

Re: MultiXact\SLRU buffers configuration

From
Shawn Debnath
Date:
On Sat, Jan 15, 2022 at 12:16:59PM +0500, Andrey Borodin wrote:

> > I was planning on running a set of stress tests on these patches. Could
> > we confirm which ones we plan to include in the commitfest?
> 
> Many thanks for your interest. Here's the  latest version.

Here are the results of the multixact perf test I ran on the patch that splits
the linear SLRU caches into banks.  With my test setup, the binaries
with the patch applied performed marginally slower across the test
matrix than the unpatched binaries. Here are the results:

+-------------------------------+---------------------+-----------------------+------------+
|           workload            | patched average tps | unpatched average tps | difference |
+-------------------------------+---------------------+-----------------------+------------+
| create only                   |         10250.54396 |           10349.67487 | -1.0%      |
| create and select             |         9677.711286 |           9991.065037 | -3.2%      |
| large cache create only       |         10310.96646 |           10337.16455 | -0.3%      |
| large cache create and select |          9654.24077 |           9924.270242 | -2.8%      |
+-------------------------------+---------------------+-----------------------+------------+

The test was configured in the following manner:
- AWS EC2 c5d.24xlarge instances, located in the same AZ, were used as
  the database host and the test driver. These systems have 96 vcpus and
  184 GB memory. NVMe drives were configured as RAID5.
- GUCs were changed from defaults to be the following:
    max_connections = 5000
    shared_buffers = 96GB
    max_wal_size = 2GB
    min_wal_size = 192MB
- pgbench runs were done with -c 1000 -j 1000 and a scale of 10,000
- Two multixact workloads were tested, first [0] was a create only
  script which selected 100 pgbench_account rows for share. Second
  workload [1] added a select statement to visit rows touched in the
  past which had multixacts generated for them. pgbench test script [2]
  wraps the call to the functions inside an explicit transaction.
- Large cache tests are multixact offsets cache size hard coded to 128
  and members cache size hard coded to 256.
- Row selection is based on time based approach that lets all client
  connections coordinate which rows to work with based on the
  millisecond they start executing. To allow for more multixacts to be
  generated and reduce contention, the workload uses offsets ahead of
  the start id based on a random number.
- The one bummer about these runs was that they only ran for 600
  seconds for insert only and 400 seconds for insert and select. I
  consistently ran into checkpointer getting oom-killed on this instance
  after that timeframe. Will dig into this separately. But the TPS was 
  consistent.
- Each test was repeated at least 3 times and the average of those runs
  were used.
- I am using the master branch and changes were applied on commit
  f47ed79cc8a0cfa154dc7f01faaf59822552363f


I think patch 1 is a must-have. Regarding patch 2, I would propose we 
avoid introducing more complexity into SimpleLRU cache and instead focus 
on making the SLRU to buffer cache effort [3] a reality. I would also 
add that we have a few customers in our fleet who have been successfully 
running the large cache configuration on the regular SLRU without any 
issues. With cache sizes this small, the linear searches are still quite 
efficient.

If my test workload can be made better, please let me know. Happy to 
re-run tests as needed.


[0] https://gist.github.com/sdebnath/e015561811adf721dd40dd6638969c69
[1] https://gist.github.com/sdebnath/2f3802e1fe288594b6661a7a59a7ca07
[2] https://gist.github.com/sdebnath/6bbfd5f87945a7d819e30a9a1701bc97
[3] https://www.postgresql.org/message-id/CA%2BhUKGKAYze99B-jk9NoMp-2BDqAgiRC4oJv%2BbFxghNgdieq8Q%40mail.gmail.com



--
Shawn Debnath
Amazon Web Services (AWS)



Re: MultiXact\SLRU buffers configuration

From
Andrey Borodin
Date:

> 20 янв. 2022 г., в 20:44, Shawn Debnath <sdn@amazon.com> написал(а):
>
> If my test workload can be made better, please let me know. Happy to
> re-run tests as needed.

Shawn, thanks for the benchmarking!

Can you please also test the 2nd patch against large multixact SLRUs?
The 2nd patch is not intended to make things better at default buffer sizes. It must preserve performance in the case of really huge SLRU buffers.

Thanks!

Best regards, Andrey Borodin.


Re: MultiXact\SLRU buffers configuration

From
Shawn Debnath
Date:
On Thu, Jan 20, 2022 at 09:21:24PM +0500, Andrey Borodin wrote:
>
> > 20 янв. 2022 г., в 20:44, Shawn Debnath <sdn@amazon.com> написал(а):
> Can you please also test the 2nd patch against large multixact SLRUs?
> The 2nd patch is not intended to make things better at default buffer sizes. It must preserve performance in the case of really huge SLRU buffers.
 

Test was performed on 128/256 for multixact offset/members cache as 
stated in my previous email.  Sure I can test it for higher values - but 
what's a real world value that would make sense? We have been using this 
configuration successfully for a few of our customers that ran into 
MultiXact contention.

-- 
Shawn Debnath
Amazon Web Services (AWS)



Re: MultiXact\SLRU buffers configuration

From
Andrey Borodin
Date:

> 21 янв. 2022 г., в 05:19, Shawn Debnath <sdn@amazon.com> написал(а):
>
> On Thu, Jan 20, 2022 at 09:21:24PM +0500, Andrey Borodin wrote:
>>
>>> 20 янв. 2022 г., в 20:44, Shawn Debnath <sdn@amazon.com> написал(а):
>> Can you please also test the 2nd patch against large multixact SLRUs?
>> The 2nd patch is not intended to make things better at default buffer sizes. It must preserve performance in the case of really huge SLRU buffers.
>
> Test was performed on 128/256 for multixact offset/members cache as
> stated in my previous email.  Sure I can test it for higher values - but
> what's a real world value that would make sense? We have been using this
> configuration successfully for a few of our customers that ran into
> MultiXact contention.

Sorry, it seems I misinterpreted the results yesterday.
I had one concern about the 1st patch step: it makes the CLOG buffer size dependent on shared_buffers. But in your tests you seem to have already exercised xact_buffers = 24576 without noticeable degradation. Is that correct? I have some doubt that a linear search among 24K elements on each CLOG access does not incur a performance impact, but your tests seem to prove it.

IMV splitting the SLRU buffer into banks would make sense for values greater than 1<<10. But you are right that 256 seems enough to cope with most multixact problems so far. I just thought about stressing the SLRU buffers with multixacts to be sure that the CLOG buffers will not suffer degradation. But yes, it's too indirect a test.

Maybe, just to be sure, let's repeat tests with autovacuum turned off to stress xact_buffers?

Thanks!

Best regards, Andrey Borodin.


Re: MultiXact\SLRU buffers configuration

From
Andrey Borodin
Date:

> 8 апр. 2021 г., в 17:22, Thomas Munro <thomas.munro@gmail.com> написал(а):
>
> Thanks!  I chickened out of committing a buffer replacement algorithm
> patch written 11 hours before the feature freeze, but I also didn't
> really want to commit the GUC patch without that.  Ahh, if only we'd
> latched onto the real problems here just a little sooner, but there is
> always PostgreSQL 15, I heard it's going to be amazing.  Moved to next
> CF.

Hi Thomas!

The feature freeze is approaching again. I see that you are working on moving SLRUs to the buffer pool, but it is not clear which PG version that will land in. And there is 100% consensus that the first patch is useful and helps to prevent big issues. Maybe let's commit the 1st step without lifting the default xact_buffers limit? Or the 1st patch as-is with any simple technique that prevents the linear search over SLRU buffers.

Best regards, Andrey Borodin.


Re: MultiXact\SLRU buffers configuration

From
Thomas Munro
Date:
On Sat, Feb 19, 2022 at 6:34 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> The feature freeze is approaching again. I see that you are working on moving SLRUs to the buffer pool, but it is not clear which PG version that will land in. And there is 100% consensus that the first patch is useful and helps to prevent big issues. Maybe let's commit the 1st step without lifting the default xact_buffers limit? Or the 1st patch as-is with any simple technique that prevents the linear search over SLRU buffers.

Hi Andrey,

Yeah, the SLRU/buffer pool thing would be complete enough to propose
for 16 at the earliest.  I posted the early prototype to see what sort
of reaction the concept would get before doing more work; I know
others have investigated this topic too... maybe it can encourage more
patches, experimental results, ideas to be shared... but this is not
relevant for 15.

Back to this patch: assuming we can settle on a good-enough-for-now
replacement algorithm, do we want to add this set of 7 GUCs?  Does
anyone else want to weigh in on that?  Concretely, this patch adds:

multixact_offsets_buffers
multixact_members_buffers
subtrans_buffers
notify_buffers
serial_buffers
xact_buffers
commit_ts_buffers

I guess the people at
https://ottertune.com/blog/postgresql-knobs-list/ would be happy if we
did.  Hopefully we'd drop the settings in a future release once we
figure out the main buffer pool thing (or some other scheme to
automate sizing).



Re: MultiXact\SLRU buffers configuration

From
Andres Freund
Date:
On 2022-02-20 10:38:53 +1300, Thomas Munro wrote:
> Back to this patch: assuming we can settle on a good-enough-for-now
> replacement algorithm, do we want to add this set of 7 GUCs?  Does
> anyone else want to weigh in on that?

I'm -0.2 on it, given that we have a better path forward.



Re: MultiXact\SLRU buffers configuration

From
Andrey Borodin
Date:

> On 20 Feb 2022, at 02:42, Andres Freund <andres@anarazel.de> wrote:
>
> On 2022-02-20 10:38:53 +1300, Thomas Munro wrote:
>> Back to this patch: assuming we can settle on a good-enough-for-now
>> replacement algorithm, do we want to add this set of 7 GUCs?  Does
>> anyone else want to weigh in on that?
>
> I'm -0.2 on it, given that we have a better path forward.
That's a really good path forward, but it has been discussed for at least 3.5 years [0], and is guaranteed not to be there until 2023. Gilles, Shawn and Dmitry expressed opinions along the lines that the patch "is a must-have", referring to real pathological performance degradation inflicted by SLRU cache starvation. And I can remember a dozen other incidents that would not have happened if the patch had been applied; e.g. this post refers to the patch as a cure [1].

Best regards, Andrey Borodin.

[0] https://www.postgresql.org/message-id/flat/20180814213500.GA74618%4060f81dc409fc.ant.amazon.com
[1]
https://about.gitlab.com/blog/2021/09/29/why-we-spent-the-last-month-eliminating-postgresql-subtransactions/#what-can-we-do-about-getting-rid-of-nessie
   


Re: MultiXact\SLRU buffers configuration

From
Andrey Borodin
Date:

> On 20 Feb 2022, at 02:38, Thomas Munro <thomas.munro@gmail.com> wrote:
> 
> Back to this patch: assuming we can settle on a good-enough-for-now
> replacement algorithm, do we want to add this set of 7 GUCs?  Does
> anyone else want to weigh in on that?

Hi Thomas!

It seems we don't have any other input besides reviews and Andres's -0.2.
Is there a chance to proceed?

Best regards, Andrey Borodin.



Re: MultiXact\SLRU buffers configuration

From
Yura Sokolov
Date:
Good day, all.

I benchmarked the patch on a 2-socket Xeon 5220 CPU @ 2.20GHz.
I used the "benchmark" we use to reproduce SLRU problems on our customers' setups.
In contrast to Shawn's tests, I concentrated on the bad case: a lot of contention.

slru-funcs.sql - function definitions
  - the functions create a lot of subtransactions to stress subtrans
  - and select random rows for share to stress multixacts

slru-call.sql - function call for benchmark

slru-ballast.sql - randomly select 1000 consecutive rows
    "for update skip locked" to stress multixacts

patch1 - make SLRU buffers configurable
patch2 - make "8-associative banks"

Benchmark done with pgbench.
Initialized with scale 1 to induce contention.
    pgbench -i -s 1 testdb

Benchmark 1:
- low number of connections (50), 60% slru-call, 40% slru-ballast
    pgbench -f slru-call.sql@60 -f slru-ballast.sql@40 -c 50 -j 75 -P 1 -T 30 testdb

version | subtrans | multixact | tps
        | buffers  | offs/memb | func+ballast
--------+----------+-----------+------
master  | 32       | 8/16      | 184+119
patch1  | 32       | 8/16      | 184+119
patch1  | 1024     | 8/16      | 121+77
patch1  | 1024     | 512/1024  | 118+75
patch2  | 32       | 8/16      | 190+122
patch2  | 1024     | 8/16      | 190+125
patch2  | 1024     | 512/1024  | 190+127

As you can see, this test case degrades with a blind increase of SLRU buffers, but using a "hash table" in the form of "associative buckets" keeps performance stable.

Benchmark 2:
- high connection number (600), 98% slru-call, 2% slru-ballast
    pgbench -f slru-call.sql@98 -f slru-ballast.sql@2 -c 600 -j 75 -P 1 -T 30 testdb

I don't paste the "ballast" tps here since 2% makes them too small, and they're very noisy.

version | subtrans | multixact | tps
        | buffers  | offs/memb | func
--------+----------+-----------+------
master  | 32       | 8/16      | 13
patch1  | 32       | 8/16      | 13
patch1  | 1024     | 8/16      | 31
patch1  | 1024     | 512/1024  | 53
patch2  | 32       | 8/16      | 12
patch2  | 1024     | 8/16      | 34
patch2  | 1024     | 512/1024  | 67

In this case a simple buffer increase does help, but the "buckets" increase the gain further.

I didn't paste here the results of the third part of the patch ("Pack SLRU...") because I didn't see any major performance gain from it, and it consumes a large part of the patch diff.

Rebased versions of first two patch parts are attached.

regards,

Yura Sokolov

Attachment

Re: MultiXact\SLRU buffers configuration

From
Andrey Borodin
Date:

> On 21 Jul 2022, at 18:00, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
>
> In this case simple buffer increase does help. But "buckets"
> increase performance gain.
Yura, thank you for your benchmarks!
We already knew the patch can save the day on pathological workloads; now we have proof of this.
Also, there's evidence that users can blindly increase the size of SLRUs if they want (with the second patch), so there's no need for hard explanations on how to tune the buffer sizes.

Thomas, do you still have any doubts? Or is it certain that SLRU will be replaced by any better subsystem in 16?


Best regards, Andrey Borodin.


Re: MultiXact\SLRU buffers configuration

From
Thomas Munro
Date:
On Sat, Jul 23, 2022 at 8:41 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> Thomas, do you still have any doubts? Or is it certain that SLRU will be replaced by any better subsystem in 16?

Hi Andrey,

Sorry for my lack of replies on this and the other SLRU thread -- I'm
thinking and experimenting.  More soon.



Re: MultiXact\SLRU buffers configuration

From
i.lazarev@postgrespro.ru
Date:
> Andrey Borodin wrote 2022-07-23 11:39:
> 
> Yura, thank you for your benchmarks!
> We already knew that patch can save the day on pathological workloads,
> now we have a proof of this.
> Also there's the evidence that user can blindly increase size of SLRU
> if they want (with the second patch). So there's no need for hard
> explanations on how to tune the buffers size.

Hi @Andrey.Borodin, with some considerations and performance checks from @Yura.Sokolov we simplified your approach as follows:

1. Preamble. We feel free to increase any SLRU, since there's no performance degradation with a large buffer count when using your SLRU bucket solution.
2. `slru_buffers_size_scale` is the only config param introduced, shared by all SLRUs. It scales each SLRU's upper cap by powers of 2.
3. Every SLRU's buffer count is capped by both `MBuffers (shared_buffers)` and `slru_buffers_size_scale`. see
4. Magic initial constants `NUM_*_BUFFERS << slru_buffers_size_scale` are applied for every SLRU.
5. All SLRU buffers are always sized as a power of 2, and their hash bucket size is always 8 (a rough sketch of this sizing rule follows below).
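To make the sizing rule concrete, here is a rough sketch, not the patch itself: only the `NUM_*_BUFFERS << slru_buffers_size_scale` shape, the shared_buffers cap and the power-of-2 requirement come from the list above; the function name, the cap fraction and the lower bound are illustrative assumptions.

/*
 * Sketch of the single-knob sizing rule described above; not the patch.
 * The NBuffers/128 cap and the lower bound are illustrative assumptions.
 */
static int
slru_scaled_nbuffers(int base_nbuffers, int slru_buffers_size_scale,
                     int shared_nbuffers)
{
    int     n = base_nbuffers << slru_buffers_size_scale;  /* NUM_*_BUFFERS << scale */
    int     cap = shared_nbuffers / 128;    /* cap derived from shared_buffers */

    if (cap > 0 && n > cap)
        n = cap;

    /* keep a power of 2 so the buffers divide evenly into 8-slot banks */
    while (n & (n - 1))
        n &= n - 1;                         /* drop lowest bit until a power of 2 remains */

    return (n < base_nbuffers) ? base_nbuffers : n;
}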

The attached patch is for your consideration. It gathers and simplifies both `v21-0001-Make-all-SLRU-buffer-sizes-configurable.patch` and `v21-0002-Divide-SLRU-buffers-into-8-associative-banks.patch` into a much simpler approach.

Thank you, Yours,
- Ivan

Attachment

Re: MultiXact\SLRU buffers configuration

From
Andrey Borodin
Date:

> On 17 Aug 2022, at 00:36, i.lazarev@postgrespro.ru wrote:
>
>> Andrey Borodin wrote 2022-07-23 11:39:
>> Yura, thank you for your benchmarks!
>> We already knew that patch can save the day on pathological workloads,
>> now we have a proof of this.
>> Also there's the evidence that user can blindly increase size of SLRU
>> if they want (with the second patch). So there's no need for hard
>> explanations on how to tune the buffers size.
>
> Hi @Andrey.Borodin, With some considerations and performance checks from @Yura.Sokolov we simplified your approach by the following:
>
> 1. Preamble. We feel free to increase any SLRU's, since there's no performance degradation on large Buffers count using your SLRU buckets solution.
> 2. `slru_buffers_size_scale` is only one config param introduced for all SLRUs. It scales SLRUs upper cap by power 2.
> 3. All SLRU buffers count are capped by both `MBuffers (shared_buffers)` and `slru_buffers_size_scale`. see
> 4. Magic initial constants `NUM_*_BUFFERS << slru_buffers_size_scale` are applied for every SLRU.
> 5. All SLRU buffers are always sized as power of 2, their hash bucket size is always 8.
>
> There's attached patch for your consideration. It does gather and simplify both `v21-0001-Make-all-SLRU-buffer-sizes-configurable.patch` and `v21-0002-Divide-SLRU-buffers-into-8-associative-banks.patch` to much simpler approach.

I like the idea of one knob instead of one per SLRU. Maybe we could even deduce a sane value from NBuffers? That would effectively lead to 0 knobs :)

Your patch has a prefix "v22-0006"; does it mean there are 5 previous steps in the patchset?

Thank you!


Best regards, Andrey Borodin.


Re: MultiXact\SLRU buffers configuration

From
i.lazarev@postgrespro.ru
Date:
Andrey Borodin wrote 2022-08-18 06:35:
> 
> I like the idea of one knob instead of one per each SLRU. Maybe we
> even could deduce sane value from NBuffers? That would effectively
> lead to 0 knobs :)
> 
> Your patch have a prefix "v22-0006", does it mean there are 5 previous
> steps of the patchset?
> 
> Thank you!
> 
> 
> Best regards, Andrey Borodin.

Not sure it's possible to deduce it from NBuffers only. slru_buffers_scale_shift looks like a relief valve for systems with ultra-scaled NBuffers.

Regarding v22-0006, I just tried to choose an index unique to this thread; it's now fixed to 0001 indexing.
Attachment

Re: MultiXact\SLRU buffers configuration

From
Andrey Borodin
Date:
On Sat, Jul 23, 2022 at 1:48 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Sat, Jul 23, 2022 at 8:41 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > Thomas, do you still have any doubts? Or is it certain that SLRU will be replaced by any better subsystem in 16?
>
> Hi Andrey,
>
> Sorry for my lack of replies on this and the other SLRU thread -- I'm
> thinking and experimenting.  More soon.
>

Hi Thomas,

PostgreSQL 16 feature freeze is approaching again. Let's choose
something from possible solutions, even if the chosen one is
temporary.
Thank you!

Best regards, Andrey Borodin.



Re: MultiXact\SLRU buffers configuration

From
vignesh C
Date:
On Fri, 19 Aug 2022 at 21:18, <i.lazarev@postgrespro.ru> wrote:
>
> Andrey Borodin wrote 2022-08-18 06:35:
> >
> > I like the idea of one knob instead of one per each SLRU. Maybe we
> > even could deduce sane value from NBuffers? That would effectively
> > lead to 0 knobs :)
> >
> > Your patch have a prefix "v22-0006", does it mean there are 5 previous
> > steps of the patchset?
> >
> > Thank you!
> >
> >
> > Best regards, Andrey Borodin.
>
> Not sure it's possible to deduce from NBuffers only.
> slru_buffers_scale_shift looks like relief valve for systems with ultra
> scaled NBuffers.
>
> Regarding v22-0006 I just tried to choose index unique for this thread
> so now it's fixed to 0001 indexing.

The patch does not apply on top of HEAD as in [1], please post a rebased patch:

=== Applying patches on top of PostgreSQL commit ID
325bc54eed4ea0836a0bb715bb18342f0c1c668a ===
=== applying patch ./v23-0001-bucketed-SLRUs-simplified.patch
patching file src/include/miscadmin.h
...
patching file src/backend/utils/misc/guc.c
Hunk #1 FAILED at 32.
Hunk #2 FAILED at 2375.
2 out of 2 hunks FAILED -- saving rejects to file
src/backend/utils/misc/guc.c.rej

[1] - http://cfbot.cputube.org/patch_41_2627.log

Regards,
Vignesh



Re: MultiXact\SLRU buffers configuration

From
Andrey Borodin
Date:
On Tue, Jan 3, 2023 at 5:02 AM vignesh C <vignesh21@gmail.com> wrote:
> does not apply on top of HEAD as in [1], please post a rebased patch:
>
Thanks! Here's the rebase.

Best regards, Andrey Borodin.

Attachment

Re: MultiXact\SLRU buffers configuration

From
Dilip Kumar
Date:
On Mon, Jan 9, 2023 at 9:49 AM Andrey Borodin <amborodin86@gmail.com> wrote:
>
> On Tue, Jan 3, 2023 at 5:02 AM vignesh C <vignesh21@gmail.com> wrote:
> > does not apply on top of HEAD as in [1], please post a rebased patch:
> >
> Thanks! Here's the rebase.

I was looking into this patch; it seems like three different optimizations are squeezed into a single patch: 1) dividing the buffer space into banks to reduce the sequential search cost, 2) a GUC parameter for the buffer size scale, 3) some of the default buffer sizes are modified compared to what they are on HEAD.  I think these are 3 patches which should be independently committable.

While looking into the first idea of dividing the buffer space into banks, I see that it will speed up finding the buffers, but OTOH while searching for a victim buffer it can actually hurt performance if the SLRU pages which are frequently accessed are not evenly distributed across the banks.  So imagine cases where we have some banks with a lot of empty slots and other banks from which we frequently have to evict pages in order to get new pages in.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: MultiXact\SLRU buffers configuration

From
Andrey Borodin
Date:
Hi Dilip! Thank you for the review!

On Tue, Jan 10, 2023 at 9:58 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jan 9, 2023 at 9:49 AM Andrey Borodin <amborodin86@gmail.com> wrote:
> >
> > On Tue, Jan 3, 2023 at 5:02 AM vignesh C <vignesh21@gmail.com> wrote:
> > > does not apply on top of HEAD as in [1], please post a rebased patch:
> > >
> > Thanks! Here's the rebase.
>
> I was looking into this patch, it seems like three different
> optimizations are squeezed in a single patch
> 1) dividing buffer space in banks to reduce the seq search cost 2) guc
> parameter for buffer size scale 3) some of the buffer size values are
> modified compared to what it is on the head.  I think these are 3
> patches which should be independently committable.
There's no point in dividing SLRU buffers into parts unless the buffer size is configurable.
And it's only possible to enlarge the default buffer size if SLRU buffers are divided into banks.
So the features can be viewed as independent commits, but they make no sense separately.

But it's probably a good idea to split the patch back anyway, for easier review.

>
> While looking into the first idea of dividing the buffer space in
> banks, I see that it will speed up finding the buffers but OTOH while
> searching the victim buffer it will actually can hurt the performance
> the slru pages which are frequently accessed are not evenly
> distributed across the banks.  So imagine the cases where we have some
> banks with a lot of empty slots and other banks from which we
> frequently have to evict out the pages in order to get the new pages
> in.
>

Yes. Despite the extremely low probability of such a case, the pattern where a user accesses pages assigned to only one bank may happen.
This case is equivalent to having just one bank, which means small buffers, just as we have now.

Thank you!

Best regards, Andrey Borodin.



Re: MultiXact\SLRU buffers configuration

From
vignesh C
Date:
On Mon, 9 Jan 2023 at 09:49, Andrey Borodin <amborodin86@gmail.com> wrote:
>
> On Tue, Jan 3, 2023 at 5:02 AM vignesh C <vignesh21@gmail.com> wrote:
> > does not apply on top of HEAD as in [1], please post a rebased patch:
> >
> Thanks! Here's the rebase.

I'm seeing that there has been no activity in this thread for more than 1 year now; I'm planning to close this in the current commitfest unless someone is planning to take it forward.

Regards,
Vignesh



Re: MultiXact\SLRU buffers configuration

From
Andrey Borodin
Date:

> On 20 Jan 2024, at 08:31, vignesh C <vignesh21@gmail.com> wrote:
>
> On Mon, 9 Jan 2023 at 09:49, Andrey Borodin <amborodin86@gmail.com> wrote:
>>
>> On Tue, Jan 3, 2023 at 5:02 AM vignesh C <vignesh21@gmail.com> wrote:
>>> does not apply on top of HEAD as in [1], please post a rebased patch:
>>>
>> Thanks! Here's the rebase.
>
> I'm seeing that there has been no activity in this thread for more
> than 1 year now, I'm planning to close this in the current commitfest
> unless someone is planning to take it forward.

Hi Vignesh,

thanks for the ping! The most important parts of this patch set are discussed in [0]. If that patchset is committed, I'll withdraw the entry for this thread from the commitfest.
There's a version of MultiXact-specific optimizations [1], but I hope they will not be necessary with the effective caches developed in [0]. It seems to me that the most important part of those optimizations is removing sleeps under the SLRU lock on standby [2] by Kyotaro Horiguchi. But given that the cache optimizations took 4 years to get closer to commit, I'm not sure we will get this optimization any time soon...

Thanks!

Best regards, Andrey Borodin.


[0] https://www.postgresql.org/message-id/flat/CAFiTN-vzDvNz%3DExGXz6gdyjtzGixKSqs0mKHMmaQ8sOSEFZ33A%40mail.gmail.com
[1]
https://www.postgresql.org/message-id/flat/2ECE132B-C042-4489-930E-DBC5D0DAB84A%40yandex-team.ru#5f7e7022647be9eeecfc2ae75d765500
[2]
https://www.postgresql.org/message-id/flat/20200515.090333.24867479329066911.horikyota.ntt%40gmail.com#855f8bb7205890579a363d2344b4484d


Re: MultiXact\SLRU buffers configuration

From
Alvaro Herrera
Date:
On 2024-Jan-27, Andrey Borodin wrote:

> thanks for the ping! Most important parts of this patch set are discussed in [0]. If that patchset will be committed, I'll withdraw entry for this thread from commitfest.
>
> There's a version of Multixact-specific optimizations [1], but I hope they will not be necessary with effective caches developed in [0]. It seems to me that most important part of those optimization is removing sleeps under SLRU lock on standby [2] by Kyotaro Horiguchi. But given that cache optimizations took 4 years to get closer to commit, I'm not sure we will get this optimization any time soon...
 

I'd appreciate it if you or Horiguchi-san can update his patch to remove
use of usleep in favor of a CV in multixact, and keep this CF entry to
cover it.

Perhaps a test to make the code reach the usleep(1000) can be written
using injection points (49cd2b93d7db)?

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"La experiencia nos dice que el hombre peló millones de veces las patatas,
pero era forzoso admitir la posibilidad de que en un caso entre millones,
las patatas pelarían al hombre" (Ijon Tichy)



Re: MultiXact\SLRU buffers configuration

From
"Andrey M. Borodin"
Date:

> On 28 Jan 2024, at 17:49, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> I'd appreciate it if you or Horiguchi-san can update his patch to remove
> use of usleep in favor of a CV in multixact, and keep this CF entry to
> cover it.

Sure! Sounds great!

> Perhaps a test to make the code reach the usleep(1000) can be written
> using injection points (49cd2b93d7db)?

I've tried to prototype something like that. But the interesting point between GetNewMultiXactId() and RecordNewMultiXact() is a critical section, and we cannot have injection points in critical sections...
Also, to implement such a test we need a "wait" type of injection point; see step 2 in the attachment. With this type of injection point I can stop a backend in the middle of entering information about a new MultiXact.


Best regards, Andrey Borodin.


Attachment

Re: MultiXact\SLRU buffers configuration

From
"Andrey M. Borodin"
Date:

> On 28 Jan 2024, at 23:17, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>
>
>> Perhaps a test to make the code reach the usleep(1000) can be written
>> using injection points (49cd2b93d7db)?
>
> I've tried to prototype something like that. But interesting point between GetNewMultiXactId() and RecordNewMultiXact() is a critical section, and we cannot have injection points in critical sections...
> Also, to implement such a test we need "wait" type of injection points, see step 2 in attachment. With this type of injection points I can stop a backend amidst entering information about new MultiXact.

Here's the test draft. This test reliably reproduces the sleep on the CV when waiting for the next multixact to be filled into the "members" SLRU.
Cost of having this test:
1. We need a new injection point type "wait" (in addition to "error" and "notice"). It cannot be avoided, because we need to sync at least 3 processes to observe the condition we want.
2. We need a new way to declare an injection point that can happen inside a critical section. I've called it a "prepared injection point".

The complexity of having this test is higher than the complexity of the CV-sleep patch itself. Do we want it? If so, I can produce a cleaner version; currently all multixact tests are in the injection_points test module.
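Roughly, the "prepared injection point" idea boils down to the sketch below. The INJECTION_POINT_LOAD()/INJECTION_POINT_CACHED() macro names are assumed from the injection_points facility (exact signatures differ between versions), and the point name and surrounding code are illustrative only:

    /*
     * Sketch only: resolve the injection point while it is still safe to
     * allocate, then fire the already-cached callback from inside the
     * critical section where no allocation may happen.
     */
    /* outside the critical section: look up and cache the callback */
    INJECTION_POINT_LOAD("multixact-get-new-id-filled");

    START_CRIT_SECTION();
    /* ... GetNewMultiXactId() has handed out the new ID here ... */

    /* inside the critical section: run the callback cached above */
    INJECTION_POINT_CACHED("multixact-get-new-id-filled");

    /* ... RecordNewMultiXact() then fills offsets and members ... */
    END_CRIT_SECTION();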


Best regards, Andrey Borodin.

Attachment

Re: MultiXact\SLRU buffers configuration

From
Kyotaro Horiguchi
Date:
At Sat, 3 Feb 2024 22:32:45 +0500, "Andrey M. Borodin" <x4mmm@yandex-team.ru> wrote in 
> Here's the test draft. This test reliably reproduces sleep on CV when waiting next multixact to be filled into "members" SLRU.
 

By the way, I raised a question about using multiple CVs
simultaneously [1]. That is, I suspect that the current CV
implementation doesn't allow us to use multiple condition variables at
the same time, because all CVs use the same PGPROC member cvWaitLink
to accommodate different waiter sets. If this assumption is correct,
we should resolve the issue before spreading more uses of CVs.

[1] https://www.postgresql.org/message-id/20240227.150709.1766217736683815840.horikyota.ntt%40gmail.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: MultiXact\SLRU buffers configuration

From
"Andrey M. Borodin"
Date:

> On 29 Feb 2024, at 06:59, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> At Sat, 3 Feb 2024 22:32:45 +0500, "Andrey M. Borodin" <x4mmm@yandex-team.ru> wrote in
>> Here's the test draft. This test reliably reproduces sleep on CV when waiting next multixact to be filled into "members" SLRU.
>
> By the way, I raised a question about using multiple CVs
> simultaneously [1]. That is, I suspect that the current CV
> implementation doesn't allow us to use multiple condition variables at
> the same time, because all CVs use the same PCPROC member cvWaitLink
> to accommodate different waiter sets. If this assumption is correct,
> we should resolve the issue before spreading more uses of CVs.

Alvaro, Kyotaro, what's our plan for this?
It seems too late to deal with this pg_usleep(1000L) for PG17.
I propose the following course of action:
1. Close this long-standing CF item
2. Start a new thread with the CV-sleep patch aimed at PG18
3. Create a new entry in the July CF

What do you think?


Best regards, Andrey Borodin.


Re: MultiXact\SLRU buffers configuration

From
"Andrey M. Borodin"
Date:

> On 6 Apr 2024, at 14:24, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>
> What do you think?

OK, I'll follow this plan.
Since most parts of this thread were committed, I'll mark the CF item as "committed".
Thanks to everyone involved!
See you in a followup thread about sleeping on CV.


Best regards, Andrey Borodin.


Re: MultiXact\SLRU buffers configuration

From
Alvaro Herrera
Date:
On 2024-Feb-03, Andrey M. Borodin wrote:

> Here's the test draft. This test reliably reproduces sleep on CV when waiting next multixact to be filled into "members" SLRU.
> Cost of having this test:
> 1. We need a new injection point type "wait" (in addition to "error" and "notice"). It cannot be avoided, because we need to sync at least 3 processes to observe condition we want.
> 2. We need new way to declare injection point that can happen inside critical section. I've called it "prepared injection point".
> 
> Complexity of having this test is higher than complexity of CV-sleep patch itself. Do we want it? If so I can produce cleaner version, currently all multixact tests are in the injection_points test module.
 

Well, it would be nice to have *some* test, but as you say it is way
more complex than the thing being tested, and it zooms in on the
functioning of the multixact creation in insane quantum-physics ways ...
to the point that you can no longer trust that multixact works the same
way with the test as without it.  So what I did is manually run other
tests (pgbench) to verify that the corner case in multixact creation is
being handled correctly, and pushed the patch after a few corrections,
in particular so that it would follow the CV sleeping protocol a little
more closely to what the CV documentation suggests, which is not at all
what the patch did.  I also fixed a few comments that had been neglected
and changed the name and description of the CV in the docs.
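For reference, the sleep protocol described in condition_variable.h looks roughly like this when applied to the retry loop discussed earlier; the function, the wait event and the ReadNextMXOffset() reader below are placeholders, not the identifiers used by the committed patch:

/*
 * Sketch of the documented ConditionVariable sleep protocol; the names below
 * marked as placeholders are not the committed identifiers.
 */
static MultiXactOffset
WaitForNextMXOffset(ConditionVariable *cv, MultiXactId multi)
{
    MultiXactOffset offset;

    ConditionVariablePrepareToSleep(cv);
    while ((offset = ReadNextMXOffset(multi)) == 0)     /* placeholder reader */
        ConditionVariableSleep(cv, WAIT_EVENT_MULTIXACT_CREATION); /* placeholder event */
    ConditionVariableCancelSleep();

    return offset;
}

/* The writer calls ConditionVariableBroadcast(cv) once RecordNewMultiXact()
 * has filled in the offsets and members data, waking any reader stuck above. */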

Now, maybe we can still add the test later, but it needs a rebase.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
“Cuando no hay humildad las personas se degradan” (A. Christie)



Re: MultiXact\SLRU buffers configuration

From
"Andrey M. Borodin"
Date:

> On 7 Apr 2024, at 21:41, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> Well, it would be nice to have *some* test, but as you say it is way
> more complex than the thing being tested, and it zooms in on the
> functioning of the multixact creation in insane quantum-physics ways ...
> to the point that you can no longer trust that multixact works the same
> way with the test than without it.  So what I did is manually run other
> tests (pgbench) to verify that the corner case in multixact creation is
> being handled correctly, and pushed the patch after a few corrections,
> in particular so that it would follow the CV sleeping protocol a little
> more closely to what the CV documentation suggests, which is not at all
> what the patch did.  I also fixed a few comments that had been neglected
> and changed the name and description of the CV in the docs.

That's excellent! Thank you!

> Now, maybe we can still add the test later, but it needs a rebase.

Sure. 'wait' injection points are there already, so I'll produce a patch with "prepared" injection points and re-implement the test on top of that. I'll put it in the July CF.


Best regards, Andrey Borodin.


Re: MultiXact\SLRU buffers configuration

From
"Andrey M. Borodin"
Date:

> On 5 Jul 2024, at 14:16, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Mon, Jun 10, 2024 at 03:10:33PM +0900, Michael Paquier wrote:
>> OK, cool.  I'll try to get that into the tree once v18 opens up.
>
> And I've spent more time on this one, and applied it to v18 after some
> slight tweaks.  Please feel free to re-post your tests with
> multixacts, Andrey.

Thanks Michael!


> On 7 Apr 2024, at 23:41, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> Now, maybe we can still add the test later, but it needs a rebase.

Alvaro, please find attached the test.
I’ve addressed some of Michael’s comments in a nearby thread: removed the extra load, made injection point names lowercase, fixed some grammar issues.


Best regards, Andrey Borodin.

Attachment

Re: MultiXact\SLRU buffers configuration

From
"Andrey M. Borodin"
Date:

> On 5 Jul 2024, at 23:18, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>
> Alvaro, please find attached the test.
> I’ve addressed some Michael’s comments in a nearby thread: removed extra load, made injection point names lowercase, fixed some grammar issues.

I’ve made several runs on Github to test stability [0, 1, 2, 3]. CI seems to be stable.

Thanks!


Best regards, Andrey Borodin.

[0] https://github.com/x4m/postgres_g/commit/c9c362679f244
[1] https://github.com/x4m/postgres_g/commit/9d7e43cc1
[2] https://github.com/x4m/postgres_g/commit/18cf186617
[3] https://github.com/x4m/postgres_g/commit/4fbce73997


Re: MultiXact\SLRU buffers configuration

From
Alvaro Herrera
Date:
On 2024-Aug-19, Andrey M. Borodin wrote:

> > On 5 Jul 2024, at 23:18, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
> > 
> > Alvaro, please find attached the test.
> > I’ve addressed some Michael’s comments in a nearby thread: removed
> > extra load, made injection point names lowercase, fixed some grammar
> > issues.
> 
> I’ve made several runs on Github to test stability [0, 1, 2, 3]. CI seems to be stable.

OK, I've made some minor adjustments and pushed.  CI seemed OK for me; let's see what the BF has to say.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/



Re: MultiXact\SLRU buffers configuration

From
Alvaro Herrera
Date:
On 2024-Aug-21, Michael Paquier wrote:

> I see that you've gone the way with the SQL function doing a load().
> Would it be worth switching the test to rely on the two macros for
> load and caching instead?  I've mentioned that previously but never
> got down to present a patch for the sake of this test.

Hmm, I have no opinion on which way is best.  You probably have a better
sense of what's better for the injections point interface, so I'm happy
to defer to you on this.

> +    /* reset in case this is a restart within the postmaster */
> +    inj_state = NULL;

I'm not sure that this assignment actually accomplishes anything ...

I don't understand what the inj_stats_enabled stuff has to do with this patch.  I suspect it's a git operation error, i.e., you seem to have squashed two different things together.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Industry suffers from the managerial dogma that for the sake of stability
and continuity, the company should be independent of the competence of
individual employees."                                      (E. Dijkstra)



Re: MultiXact\SLRU buffers configuration

From
Alvaro Herrera
Date:
On 2024-Aug-21, Michael Paquier wrote:

> From fd8ab7b6845a2c56aa2c8d9c60f404f6b3407338 Mon Sep 17 00:00:00 2001
> From: Michael Paquier <michael@paquier.xyz>
> Date: Wed, 21 Aug 2024 15:16:06 +0900
> Subject: [PATCH 2/3] injection_point: Add injection_points.stats

> This GUC controls if statistics should be used or not in the module.
> Custom statistics require the module to be loaded with
> shared_preload_libraries, hence this GUC is made PGC_POSTMASTER.  By
> default, stats are disabled.
> 
> This will be used by an upcoming change in a test where stats should not
> be used, as the test has a dependency on a critical section.

I find it strange that the information that stats cannot be used with injection points that have a dependency on critical sections (?) is only in the commit message and not in the code.

Also, maybe it'd make sense for stats to be globally enabled, and that
only the tests that require it would disable them?  (It's probably easy
enough to have a value "injection_points.stats=auto" which means, if the
module is loaded in shared_preload_libraries then set stats on,
otherwise turn them off.)  TBH I don't understand why the issue that
stats require shared_preload_libraries only comes up now ...  Maybe
another approach is to say that if an injection point is loaded via
_LOAD() rather than the normal way, then stats are disabled for that one
rather than globally?  Or give the _LOAD() macro a boolean argument to
indicate whether to collect stats for that injection point or not.

Lastly, it's not clear to me what it means that the test has a
"dependency" on a critical section.  Do you mean to say that the
injection point runs inside a critical section?

> From e5329d080b9d8436af8f65aac118745cf1f81ca2 Mon Sep 17 00:00:00 2001
> From: Michael Paquier <michael@paquier.xyz>
> Date: Wed, 21 Aug 2024 15:09:06 +0900
> Subject: [PATCH 3/3] Rework new SLRU test with injection points
> 
> Rather than the SQL injection_points_load, this commit makes the test
> rely on the two macros to load and run an injection point from the
> cache, acting as an example of how to use them.

No issues with this, feel free to go ahead.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Cada quien es cada cual y baja las escaleras como quiere" (JMSerrat)



Re: MultiXact\SLRU buffers configuration

From
Alvaro Herrera
Date:
On 2024-Aug-22, Michael Paquier wrote:

> On Wed, Aug 21, 2024 at 01:55:06PM -0400, Alvaro Herrera wrote:

> > Also, maybe it'd make sense for stats to be globally enabled, and that
> > only the tests that require it would disable them?  (It's probably easy
> > enough to have a value "injection_points.stats=auto" which means, if the
> > module is loaded in shared_preload_libraries them set stats on,
> > otherwise turn them off.)
> 
> I'm not sure that we need to get down to that until somebody has a
> case where they want to rely on stats of injection points for their
> stuff.  At this stage, I only want the stats to be enabled to provide
> automated checks for the custom pgstats APIs, so disabling it by
> default and enabling it only in the stats test of the module
> injection_points sounds kind of enough to me for now.

Oh!  I thought the stats were useful by themselves.  That not being the
case, I agree with simplifying; and the other ways to enhance this point
might not be necessary for now.

> > Or give the _LOAD() macro a boolean argument to
> > indicate whether to collect stats for that injection point or not.
> 
> Sticking some knowledge about the stats in the backend part of
> injection points does not sound like a good idea to me.

You could flip this around: have the bool be for "this injection point
is going to be invoked inside a critical section".  Then core code just
needs to tell the injection points module what core code does, and it's
injection_points that decides what to do with that information.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Crear es tan difícil como ser libre" (Elsa Triolet)



Re: MultiXact\SLRU buffers configuration

From
Thom Brown
Date:
On Fri, 23 Aug 2024 at 01:29, Michael Paquier <michael@paquier.xyz> wrote:
On Thu, Aug 22, 2024 at 10:36:38AM -0400, Alvaro Herrera wrote:
> On 2024-Aug-22, Michael Paquier wrote:
>> I'm not sure that we need to get down to that until somebody has a
>> case where they want to rely on stats of injection points for their
>> stuff.  At this stage, I only want the stats to be enabled to provide
>> automated checks for the custom pgstats APIs, so disabling it by
>> default and enabling it only in the stats test of the module
>> injection_points sounds kind of enough to me for now.
>
> Oh!  I thought the stats were useful by themselves.

Yep, currently they're not, but I don't want to rule out that they ever
will be, either.  Perhaps there would be a case where somebody would
like to run a callback N times and trigger a condition?  That's
something where the stats could be useful, but I don't have a specific
case for that now.  I'm just imagining possibilities.

I believe I am seeing the problem being discussed occurring on a production system running 15.6, causing ever-increasing replay lag on the standby, until I cancel the offending process on the standby and force it to process its interrupts.

Here's the backtrace before I do that:

#0  0x00007f4503b81876 in select () from /lib64/libc.so.6
#1  0x0000558b0956891a in pg_usleep (microsec=microsec@entry=1000) at pgsleep.c:56
#2  0x0000558b0917e01a in GetMultiXactIdMembers (from_pgupgrade=false, onlyLock=<optimized out>, members=0x7ffcd2a9f1e0, multi=109187502) at multixact.c:1392
#3  GetMultiXactIdMembers (multi=109187502, members=members@entry=0x7ffcd2a9f1e0, from_pgupgrade=from_pgupgrade@entry=false, onlyLock=onlyLock@entry=false) at multixact.c:1224
#4  0x0000558b0913de15 in MultiXactIdGetUpdateXid (xmax=<optimized out>, t_infomask=<optimized out>) at heapam.c:6924
#5  0x0000558b09146028 in HeapTupleGetUpdateXid (tuple=tuple@entry=0x7f440d428308) at heapam.c:6965
#6  0x0000558b0914c02f in HeapTupleSatisfiesMVCC (htup=0x558b0b7cbf20, htup=0x558b0b7cbf20, buffer=8053429, snapshot=0x558b0b63a2d8) at heapam_visibility.c:1089
#7  HeapTupleSatisfiesVisibility (tup=tup@entry=0x7ffcd2a9f2b0, snapshot=snapshot@entry=0x558b0b63a2d8, buffer=buffer@entry=8053429) at heapam_visibility.c:1771
#8  0x0000558b0913e819 in heapgetpage (sscan=sscan@entry=0x558b0b7ccfa0, page=page@entry=115) at heapam.c:468
#9  0x0000558b0913eb7e in heapgettup_pagemode (scan=scan@entry=0x558b0b7ccfa0, dir=ForwardScanDirection, nkeys=0, key=0x0) at heapam.c:1120
#10 0x0000558b0913fb5e in heap_getnextslot (sscan=0x558b0b7ccfa0, direction=<optimized out>, slot=0x558b0b7cc000) at heapam.c:1352
#11 0x0000558b092c0e7a in table_scan_getnextslot (slot=0x558b0b7cc000, direction=ForwardScanDirection, sscan=<optimized out>) at ../../../src/include/access/tableam.h:1046
#12 SeqNext (node=0x558b0b7cbe10) at nodeSeqscan.c:80
#13 0x0000558b0929b9bf in ExecScan (node=0x558b0b7cbe10, accessMtd=0x558b092c0df0 <SeqNext>, recheckMtd=0x558b092c0dc0 <SeqRecheck>) at execScan.c:198
#14 0x0000558b09292cb2 in ExecProcNode (node=0x558b0b7cbe10) at ../../../src/include/executor/executor.h:262
#15 ExecutePlan (execute_once=<optimized out>, dest=0x558b0bca1350, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x558b0b7cbe10, estate=0x558b0b7cbbe8)
    at execMain.c:1636
#16 standard_ExecutorRun (queryDesc=0x558b0b8c9798, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:363
#17 0x00007f44f64d43c5 in pgss_ExecutorRun (queryDesc=0x558b0b8c9798, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at pg_stat_statements.c:1010
#18 0x0000558b093fda0f in PortalRunSelect (portal=portal@entry=0x558b0b6ba458, forward=forward@entry=true, count=0, count@entry=9223372036854775807, dest=dest@entry=0x558b0bca1350) at pquery.c:924
#19 0x0000558b093fedb8 in PortalRun (portal=portal@entry=0x558b0b6ba458, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x558b0bca1350,
    altdest=altdest@entry=0x558b0bca1350, qc=0x7ffcd2a9f7a0) at pquery.c:768
#20 0x0000558b093fb243 in exec_simple_query (
    query_string=0x558b0b6170c8 "<redacted>")
    at postgres.c:1250
#21 0x0000558b093fd412 in PostgresMain (dbname=<optimized out>, username=<optimized out>) at postgres.c:4598
#22 0x0000558b0937e170 in BackendRun (port=<optimized out>, port=<optimized out>) at postmaster.c:4514
#23 BackendStartup (port=<optimized out>) at postmaster.c:4242
#24 ServerLoop () at postmaster.c:1809
#25 0x0000558b0937f147 in PostmasterMain (argc=argc@entry=5, argv=argv@entry=0x558b0b5d0a30) at postmaster.c:1481
#26 0x0000558b09100a2c in main (argc=5, argv=0x558b0b5d0a30) at main.c:202
 
This occurred twice, meaning 2 processes needed terminating.

Thom

Re: MultiXact\SLRU buffers configuration

From
Thom Brown
Date:
On Tue, 29 Oct 2024 at 16:38, Thom Brown <thom@linux.com> wrote:
>
> On Fri, 23 Aug 2024 at 01:29, Michael Paquier <michael@paquier.xyz> wrote:
>>
>> On Thu, Aug 22, 2024 at 10:36:38AM -0400, Alvaro Herrera wrote:
>> > On 2024-Aug-22, Michael Paquier wrote:
>> >> I'm not sure that we need to get down to that until somebody has a
>> >> case where they want to rely on stats of injection points for their
>> >> stuff.  At this stage, I only want the stats to be enabled to provide
>> >> automated checks for the custom pgstats APIs, so disabling it by
>> >> default and enabling it only in the stats test of the module
>> >> injection_points sounds kind of enough to me for now.
>> >
>> > Oh!  I thought the stats were useful by themselves.
>>
>> Yep, currently they're not, but I don't want to discard that they'll
>> never be, either.  Perhaps there would be a case where somebody would
>> like to run a callback N times and trigger a condition?  That's
>> something where the stats could be useful, but I don't have a specific
>> case for that now.  I'm just imagining possibilities.
>
>
> I believe I am seeing the problem being discussed occurring on a production system running 15.6, causing ever-increasing replay lag on the standby, until I cancel the offending process on the standby and force it to process its interrupts.
 
>
> Here's the backtrace before I do that:
>
> #0  0x00007f4503b81876 in select () from /lib64/libc.so.6
> #1  0x0000558b0956891a in pg_usleep (microsec=microsec@entry=1000) at pgsleep.c:56
> #2  0x0000558b0917e01a in GetMultiXactIdMembers (from_pgupgrade=false, onlyLock=<optimized out>, members=0x7ffcd2a9f1e0, multi=109187502) at multixact.c:1392
> #3  GetMultiXactIdMembers (multi=109187502, members=members@entry=0x7ffcd2a9f1e0, from_pgupgrade=from_pgupgrade@entry=false, onlyLock=onlyLock@entry=false) at multixact.c:1224
> #4  0x0000558b0913de15 in MultiXactIdGetUpdateXid (xmax=<optimized out>, t_infomask=<optimized out>) at heapam.c:6924
> #5  0x0000558b09146028 in HeapTupleGetUpdateXid (tuple=tuple@entry=0x7f440d428308) at heapam.c:6965
> #6  0x0000558b0914c02f in HeapTupleSatisfiesMVCC (htup=0x558b0b7cbf20, htup=0x558b0b7cbf20, buffer=8053429, snapshot=0x558b0b63a2d8) at heapam_visibility.c:1089
> #7  HeapTupleSatisfiesVisibility (tup=tup@entry=0x7ffcd2a9f2b0, snapshot=snapshot@entry=0x558b0b63a2d8, buffer=buffer@entry=8053429) at heapam_visibility.c:1771
> #8  0x0000558b0913e819 in heapgetpage (sscan=sscan@entry=0x558b0b7ccfa0, page=page@entry=115) at heapam.c:468
> #9  0x0000558b0913eb7e in heapgettup_pagemode (scan=scan@entry=0x558b0b7ccfa0, dir=ForwardScanDirection, nkeys=0, key=0x0) at heapam.c:1120
> #10 0x0000558b0913fb5e in heap_getnextslot (sscan=0x558b0b7ccfa0, direction=<optimized out>, slot=0x558b0b7cc000) at heapam.c:1352
> #11 0x0000558b092c0e7a in table_scan_getnextslot (slot=0x558b0b7cc000, direction=ForwardScanDirection, sscan=<optimized out>) at ../../../src/include/access/tableam.h:1046
> #12 SeqNext (node=0x558b0b7cbe10) at nodeSeqscan.c:80
> #13 0x0000558b0929b9bf in ExecScan (node=0x558b0b7cbe10, accessMtd=0x558b092c0df0 <SeqNext>, recheckMtd=0x558b092c0dc0 <SeqRecheck>) at execScan.c:198
> #14 0x0000558b09292cb2 in ExecProcNode (node=0x558b0b7cbe10) at ../../../src/include/executor/executor.h:262
> #15 ExecutePlan (execute_once=<optimized out>, dest=0x558b0bca1350, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x558b0b7cbe10, estate=0x558b0b7cbbe8)
>     at execMain.c:1636
> #16 standard_ExecutorRun (queryDesc=0x558b0b8c9798, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:363
> #17 0x00007f44f64d43c5 in pgss_ExecutorRun (queryDesc=0x558b0b8c9798, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at pg_stat_statements.c:1010
> #18 0x0000558b093fda0f in PortalRunSelect (portal=portal@entry=0x558b0b6ba458, forward=forward@entry=true, count=0, count@entry=9223372036854775807, dest=dest@entry=0x558b0bca1350) at pquery.c:924
> #19 0x0000558b093fedb8 in PortalRun (portal=portal@entry=0x558b0b6ba458, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x558b0bca1350,
>     altdest=altdest@entry=0x558b0bca1350, qc=0x7ffcd2a9f7a0) at pquery.c:768
> #20 0x0000558b093fb243 in exec_simple_query (
>     query_string=0x558b0b6170c8 "<redacted>")
>     at postgres.c:1250
> #21 0x0000558b093fd412 in PostgresMain (dbname=<optimized out>, username=<optimized out>) at postgres.c:4598
> #22 0x0000558b0937e170 in BackendRun (port=<optimized out>, port=<optimized out>) at postmaster.c:4514
> #23 BackendStartup (port=<optimized out>) at postmaster.c:4242
> #24 ServerLoop () at postmaster.c:1809
> #25 0x0000558b0937f147 in PostmasterMain (argc=argc@entry=5, argv=argv@entry=0x558b0b5d0a30) at postmaster.c:1481
> #26 0x0000558b09100a2c in main (argc=5, argv=0x558b0b5d0a30) at main.c:202
>
> This occurred twice, meaning 2 processes needed terminating.

Taking a look at what's happening under the hood, it seems to be
getting stuck here:

                if (nextMXOffset == 0)
                {
                        /* Corner case 2: next multixact is still being filled in */
                        LWLockRelease(MultiXactOffsetSLRULock);
                        CHECK_FOR_INTERRUPTS();
                        pg_usleep(1000L);
                        goto retry;
                }

It clearly checks for interrupts, but when I saw this issue happen, it
wasn't interruptible.

I looked around for similar reports, and I found a fork of Postgres
reporting the same issue, but with more info than I thought to get at
the time. They determined that the root cause involved a deadlock
where the replay thread is stuck due to an out-of-order multixact
playback sequence. The standby process waits indefinitely for a page
pin to release before proceeding, but the required multixact ID hasn't
yet been replayed due to log ordering. What they did to "fix" the
issue in their version was to add an error when the reading thread
cannot retrieve the next multixact ID, allowing it to exit instead of
becoming indefinitely stuck. This took the form of only allowing it to
loop 100 times before exiting.
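Applied to the loop quoted above, that workaround would look roughly like this (a sketch of the fork's approach, not upstream code; the limit of 100, the message wording and the local `retries` counter are illustrative):

                if (nextMXOffset == 0)
                {
                        /* Corner case 2: next multixact is still being filled in */
                        LWLockRelease(MultiXactOffsetSLRULock);
                        CHECK_FOR_INTERRUPTS();

                        /* sketch of the fork's cap: give up instead of spinning forever */
                        if (++retries >= 100)
                                ereport(ERROR,
                                        (errmsg("MultiXact %u is still being filled in after %d attempts",
                                                multi, retries)));

                        pg_usleep(1000L);
                        goto retry;
                }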

Thom



Re: MultiXact\SLRU buffers configuration

From
"Andrey M. Borodin"
Date:

> On 29 Oct 2024, at 21:45, Thom Brown <thom@linux.com> wrote:
>
> It clearly checks for interrupts, but when I saw this issue happen, it
> wasn't interruptible.

Perhaps those were different multixacts?
When I was observing this pathology on a standby, it was a stream of different reads encountering different multis.

Either way, startup can cancel the locking process on its own. Or is it the case that cancel was not enough and you actually needed termination, not cancel?

Thanks!


Best regards, Andrey Borodin.


Re: MultiXact\SLRU buffers configuration

From
Thom Brown
Date:
On Thu, 31 Oct 2024 at 10:47, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>
>
>
> > On 29 Oct 2024, at 21:45, Thom Brown <thom@linux.com> wrote:
> >
> > It clearly checks for interrupts, but when I saw this issue happen, it
> > wasn't interruptible.
>
> Perhaps, that was different multixacts?
> When I was observing this pathology on Standby, it was a stream of different reads encountering different multis.
>
> Either way startup can cancel locking process on it's own. Or is it the case that cancel was not enough, did you actually need termination, not cancel?
 

Termination didn't work on either of the processes.

Thom



Re: MultiXact\SLRU buffers configuration

From
"Andrey M. Borodin"
Date:

> On 31 Oct 2024, at 17:29, Thom Brown <thom@linux.com> wrote:
>
> On Thu, 31 Oct 2024 at 10:47, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>>
>>
>>
>>> On 29 Oct 2024, at 21:45, Thom Brown <thom@linux.com> wrote:
>>>
>>> It clearly checks for interrupts, but when I saw this issue happen, it
>>> wasn't interruptible.
>>
>> Perhaps, that was different multixacts?
>> When I was observing this pathology on Standby, it was a stream of different reads encountering different multis.
>>
>> Either way startup can cancel locking process on it's own. Or is it the case that cancel was not enough, did you actually need termination, not cancel?
>
> Termination didn't work on either of the processes.

How did you force the process to actually terminate?
Did you observe repeated read of the same multixact?
Was offending process holding any locks while waiting?


Best regards, Andrey Borodin.


Re: MultiXact\SLRU buffers configuration

From
Thom Brown
Date:
On Thu, 31 Oct 2024, 17:33 Andrey M. Borodin, <x4mmm@yandex-team.ru> wrote:


> On 31 Oct 2024, at 17:29, Thom Brown <thom@linux.com> wrote:
>
> On Thu, 31 Oct 2024 at 10:47, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>>
>>
>>
>>> On 29 Oct 2024, at 21:45, Thom Brown <thom@linux.com> wrote:
>>>
>>> It clearly checks for interrupts, but when I saw this issue happen, it
>>> wasn't interruptible.
>>
>> Perhaps, that was different multixacts?
>> When I was observing this pathology on Standby, it was a stream of different reads encountering different multis.
>>
>> Either way startup can cancel locking process on it's own. Or is it the case that cancel was not enough, did you actually need termination, not cancel?
>
> Termination didn't work on either of the processes.

How did you force the process to actually terminate?
Did you observe repeated read of the same multixact?
Was offending process holding any locks while waiting?

Unfortunately I didn't gather much information when it was occurring, and prioritised getting rid of the process blocking replay. I just attached gdb to it, got a backtrace and then "print ProcessInterrupts()".

Thom

Re: MultiXact\SLRU buffers configuration

From
"Andrey M. Borodin"
Date:

> On 1 Nov 2024, at 00:25, Thom Brown <thom@linux.com> wrote:
>
> Unfortunately I didn't gather much information when it was occuring, and prioritised getting rid of the process blocking replay. I just attached gdb to it, got a backtrace and then "print ProcessInterrupts()".
>

Currently I do not see how this wait can result in a deadlock.
But I did observe a standby in a pathological sequential scan encountering a recent multixact again and again (a new one each time).
I hope this situation will be alleviated by recent changes: now the wait is not a full millisecond, but hopefully a smaller amount of time.

In v17 we also added an injection point test which reproduces reading a very recent multixact. If there is a deadlock, I hope the buildfarm will show it to us.


Best regards, Andrey Borodin.


Re: MultiXact\SLRU buffers configuration

From
Robert Haas
Date:
On Tue, Oct 29, 2024 at 1:45 PM Thom Brown <thom@linux.com> wrote:
> Taking a look at what's happening under the hood, it seems to be
> getting stuck here:
>
>                 if (nextMXOffset == 0)
>                 {
>                         /* Corner case 2: next multixact is still
> being filled in */
>                         LWLockRelease(MultiXactOffsetSLRULock);
>                         CHECK_FOR_INTERRUPTS();
>                         pg_usleep(1000L);
>                         goto retry;
>                 }
>
> It clearly checks for interrupts, but when I saw this issue happen, it
> wasn't interruptible.

I don't understand the underlying issue here; however, if a process
holds an lwlock, it's not interruptible, even if it checks for
interrupts. So it could be that at this point in the code we're
holding some other LWLock, such as a buffer content lock, and that's
why this code fails to achieve its objective.
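To spell out the mechanism (paraphrasing miscadmin.h and postgres.c, not quoting them verbatim): LWLockAcquire() bumps the interrupt holdoff counter, and ProcessInterrupts() refuses to act until that counter is back to zero, so a CHECK_FOR_INTERRUPTS() executed while any LWLock is still held does nothing:

/* Paraphrase of the relevant pieces, not verbatim source. */

/* LWLockAcquire() does this before it takes the lock: */
#define HOLD_INTERRUPTS()  (InterruptHoldoffCount++)

/* ...and the ProcessInterrupts() behind CHECK_FOR_INTERRUPTS() bails out early: */
void
ProcessInterrupts(void)
{
    if (InterruptHoldoffCount != 0 || CritSectionCount != 0)
        return;         /* cancel/die cannot be accepted while an LWLock is held */

    /* the actual cancel/terminate handling in postgres.c only runs past this point */
}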

--
Robert Haas
EDB: http://www.enterprisedb.com