Thread: MultiXact\SLRU buffers configuration
Hi, hackers!

*** The problem ***
I'm investigating some cases of reduced database performance due to MultiXactOffsetLock contention (80% MultiXactOffsetLock, 20% IO DataFileRead).
The problem manifested itself during index repack and constraint validation. Both are effectively full table scans.
The database workload contains a lot of select for share \ select for update queries. I've tried to construct a synthetic workload generator and could not achieve a similar lock configuration: I see a lot of different locks in wait events, in particular a lot more MultiXactMemberLocks. But from my experiments with the synthetic workload, contention on MultiXactOffsetLock can be reduced by increasing NUM_MXACTOFFSET_BUFFERS=8 to bigger numbers.

*** Question 1 ***
Is it safe to increase the number of buffers of MultiXact \ all SLRUs, recompile and run the database as usual?
I cannot experiment much with production. But I'm mostly sure that bigger buffers will solve the problem.

*** Question 2 ***
Probably we could add GUCs for SLRU sizes? Are there any reasons not to make them configurable? I think multis, clog, subtransactions and others will benefit from bigger buffers. But, probably, too many knobs can be confusing.

*** Question 3 ***
The MultiXact offset lock is always taken as an exclusive lock. It makes the MultiXact offset subsystem effectively single-threaded. If someone has a good idea how to make it more concurrency-friendly, I'm willing to put some effort into this.
Probably I could just add LWLocks for each offset buffer page. Is it something worth doing? Or are there any hidden caveats and difficulties?

Thanks!

Best regards, Andrey Borodin.
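For readers less familiar with the mechanism behind these wait events: a row locked FOR SHARE / FOR KEY SHARE by more than one transaction gets a MultiXactId stored in its xmax, and every later reader of that tuple has to resolve the MultiXact through the offsets and members SLRUs. A minimal illustration (table and column names are hypothetical; the xmax system column and the locking behaviour are standard PostgreSQL):

    -- session 1
    BEGIN;
    SELECT * FROM accounts WHERE id = 1 FOR SHARE;

    -- session 2, concurrently
    BEGIN;
    SELECT * FROM accounts WHERE id = 1 FOR SHARE;
    -- the tuple's xmax is now a MultiXactId covering both lockers

    -- session 3: the raw xmax can be inspected directly
    SELECT xmax, * FROM accounts WHERE id = 1;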
> On 8 May 2020, at 21:36, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>
> *** The problem ***
> I'm investigating some cases of reduced database performance due to MultiXactOffsetLock contention (80% MultiXactOffsetLock, 20% IO DataFileRead).
> The problem manifested itself during index repack and constraint validation. Both are effectively full table scans.
> The database workload contains a lot of select for share \ select for update queries. I've tried to construct a synthetic workload generator and could not achieve a similar lock configuration: I see a lot of different locks in wait events, in particular a lot more MultiXactMemberLocks. But from my experiments with the synthetic workload, contention on MultiXactOffsetLock can be reduced by increasing NUM_MXACTOFFSET_BUFFERS=8 to bigger numbers.
>
> *** Question 1 ***
> Is it safe to increase the number of buffers of MultiXact \ all SLRUs, recompile and run the database as usual?
> I cannot experiment much with production. But I'm mostly sure that bigger buffers will solve the problem.
>
> *** Question 2 ***
> Probably we could add GUCs for SLRU sizes? Are there any reasons not to make them configurable? I think multis, clog, subtransactions and others will benefit from bigger buffers. But, probably, too many knobs can be confusing.
>
> *** Question 3 ***
> The MultiXact offset lock is always taken as an exclusive lock. It makes the MultiXact offset subsystem effectively single-threaded. If someone has a good idea how to make it more concurrency-friendly, I'm willing to put some effort into this.
> Probably I could just add LWLocks for each offset buffer page. Is it something worth doing? Or are there any hidden caveats and difficulties?

I've created a benchmark[0] imitating MultiXact pressure on my laptop: 7 clients are concurrently running "select * from table where primary_key = ANY ($1) for share", where $1 is an array of identifiers chosen so that each tuple in the table is locked by a different set of XIDs. During this benchmark I observe contention on MultiXactControlLock in pg_stat_activity:

Friday, 8 May 2020, 15:08:37 (every 1s)

  pid  |         wait_event         | wait_event_type | state  |                       query
-------+----------------------------+-----------------+--------+----------------------------------------------------
 41344 | ClientRead                 | Client          | idle   | insert into t1 select generate_series(1,1000000,1)
 41375 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
 41377 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
 41378 |                            |                 | active | select * from t1 where i = ANY ($1) for share
 41379 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
 41381 |                            |                 | active | select * from t1 where i = ANY ($1) for share
 41383 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
 41385 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
(8 rows)

Finally, the benchmark measures the time to execute the select for update 42 times.

I've gone ahead and created 3 patches:
1. Configurable SLRU buffer sizes for MultiXacOffsets and MultiXactMembers
2. Reduce locking level to shared on read of MultiXactId members
3. Configurable cache size

I've found out that:
1. When the MultiXact working set does not fit into the buffers, benchmark times grow very high. Yet very big buffers slow down the benchmark too. For this benchmark the optimal SLRU size is 32 pages for offsets and 64 pages for members (defaults are 8 and 16 respectively).
2. The lock optimisation increases performance by 5% on default SLRU sizes. Actually, the benchmark does not explicitly read MultiXactId members, but when it replaces one MultiXact with another it has to read the previous set. I understand that we can construct a benchmark to demonstrate dominance of any algorithm, and 5% on a synthetic workload is not a very big number. But it just makes sense to try to take a shared lock for reading.
3. Manipulations with the cache size do not affect the benchmark at all. It's somewhat expected: the benchmark is designed to defeat the cache; either way OffsetControlLock would not be stressed.

For our workload, I think we will just increase the SLRU sizes. But the patchset may be useful for tuning and as a performance optimisation of MultiXact.

Also MultiXacts seem to be not a very good fit for the SLRU design. I think it would be better to use a B-tree as a container. Or at least make MultiXact members extendable in place (reserve some size when a multixact is created). When we want to extend the number of locks for a tuple, currently we will:
1. Iterate through all SLRU buffers for offsets to read the current offset (with exclusive lock for offsets)
2. Iterate through all buffers for members to find the current members (with exclusive lock for members)
3. Create a new members array with +1 xid
4. Iterate through all cache members to find out whether there is already a cache item like the one we are going to create
5. Iterate over 1 again for the write
6. Iterate over 2 again for the write
Obviously this does not scale well; we cannot keep increasing SLRU sizes forever.

Thanks! I'd be happy to hear any feedback.

Best regards, Andrey Borodin.

[0] https://github.com/x4m/multixact_stress
Attachment
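For reference, a rough, simplified sketch of the kind of workload the linked benchmark [0] runs. The real test program differs in details; table name t1 and column i are taken from the pg_stat_activity output above, and the bit-masking scheme follows the description given later in the thread:

    CREATE TABLE t1 (i int PRIMARY KEY);
    INSERT INTO t1 SELECT generate_series(1, 1000000, 1);

    -- run concurrently in ~7 sessions; vary the bit position per iteration so that
    -- every tuple ends up share-locked by a different combination of XIDs
    BEGIN;
    SELECT * FROM t1
    WHERE i = ANY (ARRAY(SELECT g FROM generate_series(1, 1000000) g WHERE g & (1 << 4) = 0))
    FOR SHARE;
    COMMIT;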
> On 11 May 2020, at 16:17, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>
> I've gone ahead and created 3 patches:
> 1. Configurable SLRU buffer sizes for MultiXacOffsets and MultiXactMembers
> 2. Reduce locking level to shared on read of MultiXactId members
> 3. Configurable cache size

I'm looking more at MultiXact and it seems to me that we have a race condition there.

When we create a new MultiXact we do:
1. Generate new MultiXactId under MultiXactGenLock
2. Record new mxid with members and offset to WAL
3. Write offset to SLRU under MultiXactOffsetControlLock
4. Write members to SLRU under MultiXactMemberControlLock

When we read a MultiXact we do:
1. Retrieve offset by mxid from SLRU under MultiXactOffsetControlLock
2. If offset is 0 - it's not filled in at step 4 of the previous algorithm - we sleep and goto 1
3. Retrieve members from SLRU under MultiXactMemberControlLock
4. ..... what do we do if there are just zeroes because step 4 is not executed yet? Nothing, return an empty members list.

What am I missing?

Best regards, Andrey Borodin.
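As an aside, the offset-to-members lookup described above can be observed from SQL through pg_get_multixact_members(), an internal, undocumented function in pg_catalog (name and signature as found in recent releases; verify against your version before relying on it). The mxid below is just an example value:

    -- suppose a share-locked tuple shows xmax = 3106214809 and that xmax is a MultiXactId:
    SELECT * FROM pg_get_multixact_members('3106214809');
    -- returns one row per member: (xid, mode), i.e. exactly what step 3 above retrieves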
At Wed, 13 May 2020 23:08:37 +0500, "Andrey M. Borodin" <x4mmm@yandex-team.ru> wrote in > > > > 11 мая 2020 г., в 16:17, Andrey M. Borodin <x4mmm@yandex-team.ru> написал(а): > > > > I've went ahead and created 3 patches: > > 1. Configurable SLRU buffer sizes for MultiXacOffsets and MultiXactMembers > > 2. Reduce locking level to shared on read of MultiXactId members > > 3. Configurable cache size > > I'm looking more at MultiXact and it seems to me that we have a race condition there. > > When we create a new MultiXact we do: > 1. Generate new MultiXactId under MultiXactGenLock > 2. Record new mxid with members and offset to WAL > 3. Write offset to SLRU under MultiXactOffsetControlLock > 4. Write members to SLRU under MultiXactMemberControlLock But, don't we hold exclusive lock on the buffer through all the steps above? > When we read MultiXact we do: > 1. Retrieve offset by mxid from SLRU under MultiXactOffsetControlLock > 2. If offset is 0 - it's not filled in at step 4 of previous algorithm, we sleep and goto 1 > 3. Retrieve members from SLRU under MultiXactMemberControlLock > 4. ..... what we do if there are just zeroes because step 4 is not executed yet? Nothing, return empty members list. So transactions never see such incomplete mxids, I believe. > What am I missing? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
> On 14 May 2020, at 06:25, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> At Wed, 13 May 2020 23:08:37 +0500, "Andrey M. Borodin" <x4mmm@yandex-team.ru> wrote in
>>
>>> On 11 May 2020, at 16:17, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>>>
>>> I've gone ahead and created 3 patches:
>>> 1. Configurable SLRU buffer sizes for MultiXacOffsets and MultiXactMembers
>>> 2. Reduce locking level to shared on read of MultiXactId members
>>> 3. Configurable cache size
>>
>> I'm looking more at MultiXact and it seems to me that we have a race condition there.
>>
>> When we create a new MultiXact we do:
>> 1. Generate new MultiXactId under MultiXactGenLock
>> 2. Record new mxid with members and offset to WAL
>> 3. Write offset to SLRU under MultiXactOffsetControlLock
>> 4. Write members to SLRU under MultiXactMemberControlLock
>
> But, don't we hold exclusive lock on the buffer through all the steps
> above?

Yes... unless the MultiXact is observed on a standby. This could lead to observing an inconsistent snapshot: one of the lockers committed a tuple delete, but the standby sees the tuple as alive.

>> When we read MultiXact we do:
>> 1. Retrieve offset by mxid from SLRU under MultiXactOffsetControlLock
>> 2. If offset is 0 - it's not filled in at step 4 of previous algorithm, we sleep and goto 1
>> 3. Retrieve members from SLRU under MultiXactMemberControlLock
>> 4. ..... what we do if there are just zeroes because step 4 is not executed yet? Nothing, return empty members list.
>
> So transactions never see such incomplete mxids, I believe.

I've observed the sleep in step 2. I believe it's possible to observe the special effects of step 4 too.
Maybe we could add a lock on the standby to dismiss this 1000us wait? Sometimes it hits hard on standbys: if someone is locking a whole table on the primary, all seq scans on standbys follow it with MultiXactOffsetControlLock contention.
It looks like this:

0x00007fcd56896ff7 in __GI___select (nfds=nfds@entry=0, readfds=readfds@entry=0x0, writefds=writefds@entry=0x0, exceptfds=exceptfds@entry=0x0, timeout=timeout@entry=0x7ffd83376fe0) at ../sysdeps/unix/sysv/linux/select.c:41
#0  0x00007fcd56896ff7 in __GI___select (nfds=nfds@entry=0, readfds=readfds@entry=0x0, writefds=writefds@entry=0x0, exceptfds=exceptfds@entry=0x0, timeout=timeout@entry=0x7ffd83376fe0) at ../sysdeps/unix/sysv/linux/select.c:41
#1  0x000056186e0d54bd in pg_usleep (microsec=microsec@entry=1000) at ./build/../src/port/pgsleep.c:56
#2  0x000056186dd5edf2 in GetMultiXactIdMembers (from_pgupgrade=0 '\000', onlyLock=<optimized out>, members=0x7ffd83377080, multi=3106214809) at ./build/../src/backend/access/transam/multixact.c:1370
#3  GetMultiXactIdMembers () at ./build/../src/backend/access/transam/multixact.c:1202
#4  0x000056186dd2d2d9 in MultiXactIdGetUpdateXid (xmax=<optimized out>, t_infomask=<optimized out>) at ./build/../src/backend/access/heap/heapam.c:7039
#5  0x000056186dd35098 in HeapTupleGetUpdateXid (tuple=tuple@entry=0x7fcba3b63d58) at ./build/../src/backend/access/heap/heapam.c:7080
#6  0x000056186e0cd0f8 in HeapTupleSatisfiesMVCC (htup=<optimized out>, snapshot=0x56186f44a058, buffer=230684) at ./build/../src/backend/utils/time/tqual.c:1091
#7  0x000056186dd2d922 in heapgetpage (scan=scan@entry=0x56186f4c8e78, page=page@entry=3620) at ./build/../src/backend/access/heap/heapam.c:439
#8  0x000056186dd2ea7c in heapgettup_pagemode (key=0x0, nkeys=0, dir=ForwardScanDirection, scan=0x56186f4c8e78) at ./build/../src/backend/access/heap/heapam.c:1034
#9  heap_getnext (scan=scan@entry=0x56186f4c8e78, direction=direction@entry=ForwardScanDirection) at ./build/../src/backend/access/heap/heapam.c:1801
#10 0x000056186de84f51 in SeqNext (node=node@entry=0x56186f4a4f78) at ./build/../src/backend/executor/nodeSeqscan.c:81
#11 0x000056186de6a3f1 in ExecScanFetch (recheckMtd=0x56186de84ef0 <SeqRecheck>, accessMtd=0x56186de84f20 <SeqNext>, node=0x56186f4a4f78) at ./build/../src/backend/executor/execScan.c:97
#12 ExecScan (node=0x56186f4a4f78, accessMtd=0x56186de84f20 <SeqNext>, recheckMtd=0x56186de84ef0 <SeqRecheck>) at ./build/../src/backend/executor/execScan.c:164

Best regards, Andrey Borodin.
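The scenario behind this backtrace can be sketched roughly as follows (t1 reused from the benchmark above; this illustrates the described behaviour, it is not a guaranteed reproduction):

    -- on the primary, while the regular workload already holds FOR SHARE locks on
    -- individual rows, another session share-locks the whole table; each doubly-locked
    -- tuple now carries a MultiXactId in xmax
    BEGIN;
    SELECT * FROM t1 FOR SHARE;

    -- on a hot standby, a plain sequential scan has to resolve every such xmax via
    -- GetMultiXactIdMembers(), i.e. through the MultiXact offsets/members SLRUs
    SELECT count(*) FROM t1;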
At Thu, 14 May 2020 10:19:42 +0500, "Andrey M. Borodin" <x4mmm@yandex-team.ru> wrote in > >> I'm looking more at MultiXact and it seems to me that we have a race condition there. > >> > >> When we create a new MultiXact we do: > >> 1. Generate new MultiXactId under MultiXactGenLock > >> 2. Record new mxid with members and offset to WAL > >> 3. Write offset to SLRU under MultiXactOffsetControlLock > >> 4. Write members to SLRU under MultiXactMemberControlLock > > > > But, don't we hold exclusive lock on the buffer through all the steps > > above? > Yes...Unless MultiXact is observed on StandBy. This could lead to observing inconsistent snapshot: one of lockers committedtuple delete, but standby sees it as alive. Ah, right. I looked from GetNewMultiXactId. Actually XLOG_MULTIXACT_CREATE_ID is not protected from concurrent reference to the creating mxact id. And GetMultiXactIdMembers is considering that case. > >> When we read MultiXact we do: > >> 1. Retrieve offset by mxid from SLRU under MultiXactOffsetControlLock > >> 2. If offset is 0 - it's not filled in at step 4 of previous algorithm, we sleep and goto 1 > >> 3. Retrieve members from SLRU under MultiXactMemberControlLock > >> 4. ..... what we do if there are just zeroes because step 4 is not executed yet? Nothing, return empty members list. > > > > So transactions never see such incomplete mxids, I believe. > I've observed sleep in step 2. I believe it's possible to observe special effects of step 4 too. > Maybe we could add lock on standby to dismiss this 1000us wait? Sometimes it hits hard on Standbys: if someone is lockingwhole table on primary - all seq scans on standbys follow him with MultiXactOffsetControlLock contention. GetMultiXactIdMembers believes that 4 is successfully done if 2 returned valid offset, but actually that is not obvious. If we add a single giant lock just to isolate ,say, GetMultiXactIdMember and RecordNewMultiXact, it reduces concurrency unnecessarily. Perhaps we need finer-grained locking-key for standby that works similary to buffer lock on primary, that doesn't cause confilicts between irrelevant mxids. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
> On 14 May 2020, at 11:16, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> At Thu, 14 May 2020 10:19:42 +0500, "Andrey M. Borodin" <x4mmm@yandex-team.ru> wrote in
>>>> I'm looking more at MultiXact and it seems to me that we have a race condition there.
>>>>
>>>> When we create a new MultiXact we do:
>>>> 1. Generate new MultiXactId under MultiXactGenLock
>>>> 2. Record new mxid with members and offset to WAL
>>>> 3. Write offset to SLRU under MultiXactOffsetControlLock
>>>> 4. Write members to SLRU under MultiXactMemberControlLock
>>>
>>> But, don't we hold exclusive lock on the buffer through all the steps
>>> above?
>> Yes... unless the MultiXact is observed on a standby. This could lead to observing an inconsistent snapshot: one of the lockers committed a tuple delete, but the standby sees the tuple as alive.
>
> Ah, right. I looked from GetNewMultiXactId. Actually
> XLOG_MULTIXACT_CREATE_ID is not protected from concurrent reference to
> the creating mxact id. And GetMultiXactIdMembers is considering that
> case.
>
>>>> When we read MultiXact we do:
>>>> 1. Retrieve offset by mxid from SLRU under MultiXactOffsetControlLock
>>>> 2. If offset is 0 - it's not filled in at step 4 of previous algorithm, we sleep and goto 1
>>>> 3. Retrieve members from SLRU under MultiXactMemberControlLock
>>>> 4. ..... what we do if there are just zeroes because step 4 is not executed yet? Nothing, return empty members list.
>>>
>>> So transactions never see such incomplete mxids, I believe.
>> I've observed the sleep in step 2. I believe it's possible to observe the special effects of step 4 too.
>> Maybe we could add a lock on the standby to dismiss this 1000us wait? Sometimes it hits hard on standbys: if someone is locking a whole table on the primary, all seq scans on standbys follow it with MultiXactOffsetControlLock contention.
>
> GetMultiXactIdMembers believes that 4 is successfully done if 2
> returned valid offset, but actually that is not obvious.
>
> If we add a single giant lock just to isolate, say,
> GetMultiXactIdMembers and RecordNewMultiXact, it reduces concurrency
> unnecessarily. Perhaps we need a finer-grained locking key for standby
> that works similarly to the buffer lock on primary, that doesn't cause
> conflicts between irrelevant mxids.

We can just replay members before offsets. If the offset is already there, the members are there too.
But I'd be happy if we could mitigate those 1000us too, with a hint about the last mxid state in a shared MX state, for example.

Actually, if we read an empty mxid array instead of something that is being replayed just now, it's not a problem of inconsistency, because a transaction in this mxid could not commit before we started. ISTM.
So instead of a fix we, probably, can just add a comment. If this reasoning is correct.

Best regards, Andrey Borodin.
At Thu, 14 May 2020 11:44:01 +0500, "Andrey M. Borodin" <x4mmm@yandex-team.ru> wrote in
>> GetMultiXactIdMembers believes that 4 is successfully done if 2
>> returned valid offset, but actually that is not obvious.
>>
>> If we add a single giant lock just to isolate, say,
>> GetMultiXactIdMembers and RecordNewMultiXact, it reduces concurrency
>> unnecessarily. Perhaps we need a finer-grained locking key for standby
>> that works similarly to the buffer lock on primary, that doesn't cause
>> conflicts between irrelevant mxids.
>>
> We can just replay members before offsets. If the offset is already there, the members are there too.
> But I'd be happy if we could mitigate those 1000us too, with a hint about the last mxid state in a shared MX state, for example.

Generally in such cases, condition variables would work. In the
attached PoC, the reader side gets no penalty in the "likely" code
path. The writer side always calls ConditionVariableBroadcast but the
waiter list is empty in almost all cases. But I couldn't cause the
situation where the sleep 1000u is reached.

> Actually, if we read an empty mxid array instead of something that is being replayed just now, it's not a problem of inconsistency, because a transaction in this mxid could not commit before we started. ISTM.
> So instead of a fix we, probably, can just add a comment. If this reasoning is correct.

Step 4 of the reader side reads the members of the target mxid. They are already written if the offset of the *next* mxid is filled in.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index e2aa5c9ce4..9db8f6cddd 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -82,6 +82,7 @@
 #include "lib/ilist.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
+#include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "storage/lmgr.h"
 #include "storage/pmsignal.h"
@@ -233,6 +234,7 @@ typedef struct MultiXactStateData
 	/* support for members anti-wraparound measures */
 	MultiXactOffset offsetStopLimit;	/* known if oldestOffsetKnown */
 
+	ConditionVariable nextoff_cv;
 	/*
 	 * Per-backend data starts here.  We have two arrays stored in the area
 	 * immediately following the MultiXactStateData struct. Each is indexed by
@@ -873,6 +875,14 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
 	/* Exchange our lock */
 	LWLockRelease(MultiXactOffsetControlLock);
 
+	/*
+	 * Let everybody know the offset of this mxid is recorded now. The waiters
+	 * are waiting for the offset of the mxid next of the target to know the
+	 * number of members of the target mxid, so we don't need to wait for
+	 * members of this mxid are recorded.
+	 */
+	ConditionVariableBroadcast(&MultiXactState->nextoff_cv);
+
 	LWLockAcquire(MultiXactMemberControlLock, LW_EXCLUSIVE);
 
 	prev_pageno = -1;
@@ -1367,9 +1377,19 @@ retry:
 		if (nextMXOffset == 0)
 		{
 			/* Corner case 2: next multixact is still being filled in */
+
+			/*
+			 * The recorder of the next mxid is just before writing the offset.
+			 * Wait for the offset to be written.
+			 */
+			ConditionVariablePrepareToSleep(&MultiXactState->nextoff_cv);
+
 			LWLockRelease(MultiXactOffsetControlLock);
 			CHECK_FOR_INTERRUPTS();
-			pg_usleep(1000L);
+
+			ConditionVariableSleep(&MultiXactState->nextoff_cv,
+								   WAIT_EVENT_WAIT_NEXT_MXMEMBERS);
+			ConditionVariableCancelSleep();
 			goto retry;
 		}
 
@@ -1847,6 +1867,7 @@ MultiXactShmemInit(void)
 
 		/* Make sure we zero out the per-backend state */
 		MemSet(MultiXactState, 0, SHARED_MULTIXACT_STATE_SIZE);
+		ConditionVariableInit(&MultiXactState->nextoff_cv);
 	}
 	else
 		Assert(found);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 0ecd29a1d9..1ac6b37188 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3879,6 +3879,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_SYNC_REP:
 			event_name = "SyncRep";
 			break;
+		case WAIT_EVENT_WAIT_NEXT_MXMEMBERS:
+			event_name = "Mact/WaitNextXactMembers";
+			break;
 
 		/* no default case, so that compiler will warn */
 	}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index ae9a39573c..e79bba0bef 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -886,7 +886,8 @@ typedef enum
 	WAIT_EVENT_REPLICATION_ORIGIN_DROP,
 	WAIT_EVENT_REPLICATION_SLOT_DROP,
 	WAIT_EVENT_SAFE_SNAPSHOT,
-	WAIT_EVENT_SYNC_REP
+	WAIT_EVENT_SYNC_REP,
+	WAIT_EVENT_WAIT_NEXT_MXMEMBERS
 } WaitEventIPC;
 
 /* ----------
> 15 мая 2020 г., в 05:03, Kyotaro Horiguchi <horikyota.ntt@gmail.com> написал(а): > > At Thu, 14 May 2020 11:44:01 +0500, "Andrey M. Borodin" <x4mmm@yandex-team.ru> wrote in >>> GetMultiXactIdMembers believes that 4 is successfully done if 2 >>> returned valid offset, but actually that is not obvious. >>> >>> If we add a single giant lock just to isolate ,say, >>> GetMultiXactIdMember and RecordNewMultiXact, it reduces concurrency >>> unnecessarily. Perhaps we need finer-grained locking-key for standby >>> that works similary to buffer lock on primary, that doesn't cause >>> confilicts between irrelevant mxids. >>> >> We can just replay members before offsets. If offset is already there - members are there too. >> But I'd be happy if we could mitigate those 1000us too - with a hint about last maixd state in a shared MX state, forexample. > > Generally in such cases, condition variables would work. In the > attached PoC, the reader side gets no penalty in the "likely" code > path. The writer side always calls ConditionVariableBroadcast but the > waiter list is empty in almost all cases. But I couldn't cause the > situation where the sleep 1000u is reached. Thanks! That really looks like a good solution without magic timeouts. Beautiful! I think I can create temporary extension which calls MultiXact API and tests edge-cases like this 1000us wait. This extension will also be also useful for me to assess impact of bigger buffers, reduced read locking (as in my 2nd patch)and other tweaks. >> Actually, if we read empty mxid array instead of something that is replayed just yet - it's not a problem of inconsistency,because transaction in this mxid could not commit before we started. ISTM. >> So instead of fix, we, probably, can just add a comment. If this reasoning is correct. > > The step 4 of the reader side reads the members of the target mxid. It > is already written if the offset of the *next* mxid is filled-in. Most often - yes, but members are not guaranteed to be filled in order. Those who win MXMemberControlLock will write first. But nobody can read members of MXID before it is returned. And its members will be written before returning MXID. Best regards, Andrey Borodin.
At Fri, 15 May 2020 14:01:46 +0500, "Andrey M. Borodin" <x4mmm@yandex-team.ru> wrote in > > > > 15 мая 2020 г., в 05:03, Kyotaro Horiguchi <horikyota.ntt@gmail.com> написал(а): > > > > At Thu, 14 May 2020 11:44:01 +0500, "Andrey M. Borodin" <x4mmm@yandex-team.ru> wrote in > >>> GetMultiXactIdMembers believes that 4 is successfully done if 2 > >>> returned valid offset, but actually that is not obvious. > >>> > >>> If we add a single giant lock just to isolate ,say, > >>> GetMultiXactIdMember and RecordNewMultiXact, it reduces concurrency > >>> unnecessarily. Perhaps we need finer-grained locking-key for standby > >>> that works similary to buffer lock on primary, that doesn't cause > >>> confilicts between irrelevant mxids. > >>> > >> We can just replay members before offsets. If offset is already there - members are there too. > >> But I'd be happy if we could mitigate those 1000us too - with a hint about last maixd state in a shared MX state, forexample. > > > > Generally in such cases, condition variables would work. In the > > attached PoC, the reader side gets no penalty in the "likely" code > > path. The writer side always calls ConditionVariableBroadcast but the > > waiter list is empty in almost all cases. But I couldn't cause the > > situation where the sleep 1000u is reached. > Thanks! That really looks like a good solution without magic timeouts. Beautiful! > I think I can create temporary extension which calls MultiXact API and tests edge-cases like this 1000us wait. > This extension will also be also useful for me to assess impact of bigger buffers, reduced read locking (as in my 2nd patch)and other tweaks. Happy to hear that, It would need to use timeout just in case, though. > >> Actually, if we read empty mxid array instead of something that is replayed just yet - it's not a problem of inconsistency,because transaction in this mxid could not commit before we started. ISTM. > >> So instead of fix, we, probably, can just add a comment. If this reasoning is correct. > > > > The step 4 of the reader side reads the members of the target mxid. It > > is already written if the offset of the *next* mxid is filled-in. > Most often - yes, but members are not guaranteed to be filled in order. Those who win MXMemberControlLock will write first. > But nobody can read members of MXID before it is returned. And its members will be written before returning MXID. Yeah, right. Otherwise assertion failure happens. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
> On 15 May 2020, at 02:03, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > Generally in such cases, condition variables would work. In the > attached PoC, the reader side gets no penalty in the "likely" code > path. The writer side always calls ConditionVariableBroadcast but the > waiter list is empty in almost all cases. But I couldn't cause the > situation where the sleep 1000u is reached. The submitted patch no longer applies, can you please submit an updated version? I'm marking the patch Waiting on Author in the meantime. cheers ./daniel
> On 2 July 2020, at 17:02, Daniel Gustafsson <daniel@yesql.se> wrote:
>
>> On 15 May 2020, at 02:03, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
>> Generally in such cases, condition variables would work. In the
>> attached PoC, the reader side gets no penalty in the "likely" code
>> path. The writer side always calls ConditionVariableBroadcast but the
>> waiter list is empty in almost all cases. But I couldn't cause the
>> situation where the sleep 1000u is reached.
>
> The submitted patch no longer applies, can you please submit an updated
> version? I'm marking the patch Waiting on Author in the meantime.

Thanks, Daniel! PFA V2

Best regards, Andrey Borodin.
Attachment
> On 8 Jul 2020, at 09:03, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: > > > >> 2 июля 2020 г., в 17:02, Daniel Gustafsson <daniel@yesql.se> написал(а): >> >>> On 15 May 2020, at 02:03, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: >> >>> Generally in such cases, condition variables would work. In the >>> attached PoC, the reader side gets no penalty in the "likely" code >>> path. The writer side always calls ConditionVariableBroadcast but the >>> waiter list is empty in almost all cases. But I couldn't cause the >>> situation where the sleep 1000u is reached. >> >> The submitted patch no longer applies, can you please submit an updated >> version? I'm marking the patch Waiting on Author in the meantime. > Thanks, Daniel! PFA V2 This version too has stopped applying according to the CFbot. I've moved it to the next commitfest as we're out of time in this one and it was only pointed out now, but kept it Waiting on Author. cheers ./daniel
On 08.07.2020 10:03, Andrey M. Borodin wrote:
>
>> On 2 July 2020, at 17:02, Daniel Gustafsson <daniel@yesql.se> wrote:
>>
>>> On 15 May 2020, at 02:03, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>>> Generally in such cases, condition variables would work. In the
>>> attached PoC, the reader side gets no penalty in the "likely" code
>>> path. The writer side always calls ConditionVariableBroadcast but the
>>> waiter list is empty in almost all cases. But I couldn't cause the
>>> situation where the sleep 1000u is reached.
>> The submitted patch no longer applies, can you please submit an updated
>> version? I'm marking the patch Waiting on Author in the meantime.
> Thanks, Daniel! PFA V2
>
> Best regards, Andrey Borodin.

1) The first patch is sensible and harmless, so I think it is ready for committer. I haven't tested the performance impact, though.

2) I like the initial proposal to make various SLRU buffers configurable, however, I doubt whether it really solves the problem, or just moves it to another place?

The previous patch you sent was based on some version that contained changes for other SLRU buffer numbers: 'multixact_offsets_slru_buffers' and 'multixact_members_slru_buffers'. Have you just forgotten to attach them? The patch message "[PATCH v2 2/4]" hints that you had 4 patches)
Meanwhile, I attach the rebased patch to calm down the CFbot. The changes are trivial.

2.1) I think that both min and max values for this parameter are too extreme. Have you tested them?

+		&multixact_local_cache_entries,
+		256, 2, INT_MAX / 2,

2.2) MAX_CACHE_ENTRIES is not used anymore, so it can be deleted.

3) No changes for the third patch. I just renamed it for consistency.

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
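For anyone trying the patchset out, the settings under discussion would be used roughly like this. Note that none of these GUCs exist in stock PostgreSQL: the names follow the patch versions mentioned above, and the values are the ones reported as optimal for the benchmark earlier in the thread.

    -- only meaningful with the patches from this thread applied
    ALTER SYSTEM SET multixact_offsets_slru_buffers = 32;   -- SLRU pages; hardcoded default is 8
    ALTER SYSTEM SET multixact_members_slru_buffers = 64;   -- SLRU pages; hardcoded default is 16
    ALTER SYSTEM SET multixact_local_cache_entries = 256;   -- default in the patch, shown only to name the knob
    -- SLRU buffers live in shared memory, so a restart is most likely required
    -- for the first two settings to take effect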
Hi, Anastasia!

> On 28 August 2020, at 23:08, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
>
> 1) The first patch is sensible and harmless, so I think it is ready for committer. I haven't tested the performance impact, though.
>
> 2) I like the initial proposal to make various SLRU buffers configurable, however, I doubt whether it really solves the problem, or just moves it to another place?
>
> The previous patch you sent was based on some version that contained changes for other SLRU buffer numbers: 'multixact_offsets_slru_buffers' and 'multixact_members_slru_buffers'. Have you just forgotten to attach them? The patch message "[PATCH v2 2/4]" hints that you had 4 patches)
> Meanwhile, I attach the rebased patch to calm down the CFbot. The changes are trivial.
>
> 2.1) I think that both min and max values for this parameter are too extreme. Have you tested them?
>
> +		&multixact_local_cache_entries,
> +		256, 2, INT_MAX / 2,
>
> 2.2) MAX_CACHE_ENTRIES is not used anymore, so it can be deleted.
>
> 3) No changes for the third patch. I just renamed it for consistency.

Thank you for your review.

Indeed, I had a 4th patch with tests, but these tests didn't work well: I still did not manage to stress SLRUs enough to reproduce the problem from production...

You are absolutely correct in point 2: I only did tests with sane values, and I observed extreme performance degradation with values of ~64 megabytes. I do not know which highest value we should pick. 1Gb? Or the highest possible functioning value?

I greatly appreciate your review, sorry for the long delay. Thanks!

Best regards, Andrey Borodin.
On 28.09.2020 17:41, Andrey M. Borodin wrote: > Hi, Anastasia! > >> 28 авг. 2020 г., в 23:08, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> написал(а): >> >> 1) The first patch is sensible and harmless, so I think it is ready for committer. I haven't tested the performance impact,though. >> >> 2) I like the initial proposal to make various SLRU buffers configurable, however, I doubt if it really solves the problem,or just moves it to another place? >> >> The previous patch you sent was based on some version that contained changes for other slru buffers numbers: 'multixact_offsets_slru_buffers'and 'multixact_members_slru_buffers'. Have you just forgot to attach them? The patch message"[PATCH v2 2/4]" hints that you had 4 patches) >> Meanwhile, I attach the rebased patch to calm down the CFbot. The changes are trivial. >> >> 2.1) I think that both min and max values for this parameter are too extreme. Have you tested them? >> >> + &multixact_local_cache_entries, >> + 256, 2, INT_MAX / 2, >> >> 2.2) MAX_CACHE_ENTRIES is not used anymore, so it can be deleted. >> >> 3) No changes for third patch. I just renamed it for consistency. > Thank you for your review. > > Indeed, I had 4th patch with tests, but these tests didn't work well: I still did not manage to stress SLRUs to reproduceproblem from production... > > You are absolutely correct in point 2: I did only tests with sane values. And observed extreme performance degradationwith values ~ 64 megabytes. I do not know which highest values should we pick? 1Gb? Or highest possible functioningvalue? I would go with the values that we consider adequate for this setting. As I see, there is no strict rule about it in guc.c and many variables have large border values. Anyway, we need to explain it at least in the documentation and code comments. It seems that the default was conservative enough, so it can be also a minimal value too. As for maximum, can you provide any benchmark results? If we have a peak and a noticeable performance degradation after that, we can use it to calculate the preferable maxvalue. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Hi! On Mon, Sep 28, 2020 at 5:41 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: > > 28 авг. 2020 г., в 23:08, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> написал(а): > > > > 1) The first patch is sensible and harmless, so I think it is ready for committer. I haven't tested the performance impact,though. > > > > 2) I like the initial proposal to make various SLRU buffers configurable, however, I doubt if it really solves the problem,or just moves it to another place? > > > > The previous patch you sent was based on some version that contained changes for other slru buffers numbers: 'multixact_offsets_slru_buffers'and 'multixact_members_slru_buffers'. Have you just forgot to attach them? The patch message"[PATCH v2 2/4]" hints that you had 4 patches) > > Meanwhile, I attach the rebased patch to calm down the CFbot. The changes are trivial. > > > > 2.1) I think that both min and max values for this parameter are too extreme. Have you tested them? > > > > + &multixact_local_cache_entries, > > + 256, 2, INT_MAX / 2, > > > > 2.2) MAX_CACHE_ENTRIES is not used anymore, so it can be deleted. > > > > 3) No changes for third patch. I just renamed it for consistency. > > Thank you for your review. > > Indeed, I had 4th patch with tests, but these tests didn't work well: I still did not manage to stress SLRUs to reproduceproblem from production... > > You are absolutely correct in point 2: I did only tests with sane values. And observed extreme performance degradationwith values ~ 64 megabytes. I do not know which highest values should we pick? 1Gb? Or highest possible functioningvalue? > > I greatly appreciate your review, sorry for so long delay. Thanks! I took a look at this patchset. The 1st and 3rd patches look good to me. I made just minor improvements. 1) There is still a case when SimpleLruReadPage_ReadOnly() relocks the SLRU lock, which is already taken in exclusive mode. I've evaded this by passing the lock mode as a parameter to SimpleLruReadPage_ReadOnly(). 3) CHECK_FOR_INTERRUPTS() is not needed anymore, because it's called inside ConditionVariableSleep() if needed. Also, no current wait events use slashes, and I don't think we should introduce slashes here. Even if we should, then we should also rename existing wait events to be consistent with a new one. So, I've renamed the new wait event to remove the slash. Regarding the patch 2. I see the current documentation in the patch doesn't explain to the user how to set the new parameter. I think we should give users an idea what workloads need high values of multixact_local_cache_entries parameter and what doesn't. Also, we should explain the negative aspects of high values multixact_local_cache_entries. Ideally, we should get the advantage without overloading users with new nontrivial parameters, but I don't have a particular idea here. I'd like to propose committing 1 and 3, but leave 2 for further review. ------ Regards, Alexander Korotkov
Attachment
> 26 окт. 2020 г., в 06:05, Alexander Korotkov <aekorotkov@gmail.com> написал(а): > > Hi! > > On Mon, Sep 28, 2020 at 5:41 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: >>> 28 авг. 2020 г., в 23:08, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> написал(а): >>> >>> 1) The first patch is sensible and harmless, so I think it is ready for committer. I haven't tested the performance impact,though. >>> >>> 2) I like the initial proposal to make various SLRU buffers configurable, however, I doubt if it really solves the problem,or just moves it to another place? >>> >>> The previous patch you sent was based on some version that contained changes for other slru buffers numbers: 'multixact_offsets_slru_buffers'and 'multixact_members_slru_buffers'. Have you just forgot to attach them? The patch message"[PATCH v2 2/4]" hints that you had 4 patches) >>> Meanwhile, I attach the rebased patch to calm down the CFbot. The changes are trivial. >>> >>> 2.1) I think that both min and max values for this parameter are too extreme. Have you tested them? >>> >>> + &multixact_local_cache_entries, >>> + 256, 2, INT_MAX / 2, >>> >>> 2.2) MAX_CACHE_ENTRIES is not used anymore, so it can be deleted. >>> >>> 3) No changes for third patch. I just renamed it for consistency. >> >> Thank you for your review. >> >> Indeed, I had 4th patch with tests, but these tests didn't work well: I still did not manage to stress SLRUs to reproduceproblem from production... >> >> You are absolutely correct in point 2: I did only tests with sane values. And observed extreme performance degradationwith values ~ 64 megabytes. I do not know which highest values should we pick? 1Gb? Or highest possible functioningvalue? >> >> I greatly appreciate your review, sorry for so long delay. Thanks! > > I took a look at this patchset. > > The 1st and 3rd patches look good to me. I made just minor improvements. > 1) There is still a case when SimpleLruReadPage_ReadOnly() relocks the > SLRU lock, which is already taken in exclusive mode. I've evaded this > by passing the lock mode as a parameter to > SimpleLruReadPage_ReadOnly(). > 3) CHECK_FOR_INTERRUPTS() is not needed anymore, because it's called > inside ConditionVariableSleep() if needed. Also, no current wait > events use slashes, and I don't think we should introduce slashes > here. Even if we should, then we should also rename existing wait > events to be consistent with a new one. So, I've renamed the new wait > event to remove the slash. > > Regarding the patch 2. I see the current documentation in the patch > doesn't explain to the user how to set the new parameter. I think we > should give users an idea what workloads need high values of > multixact_local_cache_entries parameter and what doesn't. Also, we > should explain the negative aspects of high values > multixact_local_cache_entries. Ideally, we should get the advantage > without overloading users with new nontrivial parameters, but I don't > have a particular idea here. > > I'd like to propose committing 1 and 3, but leave 2 for further review. Thanks for your review, Alexander! +1 for avoiding double locking in SimpleLruReadPage_ReadOnly(). Other changes seem correct to me too. I've tried to find optimal value for cache size and it seems to me that it affects multixact scalability much less than sizesof offsets\members buffers. I concur that patch 2 of the patchset does not seem documented enough. Best regards, Andrey Borodin.
On Mon, Oct 26, 2020 at 6:45 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote: > Thanks for your review, Alexander! > +1 for avoiding double locking in SimpleLruReadPage_ReadOnly(). > Other changes seem correct to me too. > > > I've tried to find optimal value for cache size and it seems to me that it affects multixact scalability much less thansizes of offsets\members buffers. I concur that patch 2 of the patchset does not seem documented enough. Thank you. I've made a few more minor adjustments to the patchset. I'm going to push 0001 and 0003 if there are no objections. ------ Regards, Alexander Korotkov
Attachment
On Tue, Oct 27, 2020 at 8:02 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Mon, Oct 26, 2020 at 6:45 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > Thanks for your review, Alexander!
> > +1 for avoiding double locking in SimpleLruReadPage_ReadOnly().
> > Other changes seem correct to me too.
> >
> > I've tried to find the optimal value for the cache size and it seems to me that it affects multixact scalability much less than the sizes of offsets\members buffers. I concur that patch 2 of the patchset does not seem documented enough.
>
> Thank you. I've made a few more minor adjustments to the patchset.
> I'm going to push 0001 and 0003 if there are no objections.

I see that patchset v5 doesn't pass the tests due to a typo in an assert. The fixed version is attached.

------
Regards,
Alexander Korotkov
Attachment
On Tue, Oct 27, 2020 at 08:23:26PM +0300, Alexander Korotkov wrote: >On Tue, Oct 27, 2020 at 8:02 PM Alexander Korotkov <aekorotkov@gmail.com> wrote: >> On Mon, Oct 26, 2020 at 6:45 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote: >> > Thanks for your review, Alexander! >> > +1 for avoiding double locking in SimpleLruReadPage_ReadOnly(). >> > Other changes seem correct to me too. >> > >> > >> > I've tried to find optimal value for cache size and it seems to me that it affects multixact scalability much less thansizes of offsets\members buffers. I concur that patch 2 of the patchset does not seem documented enough. >> >> Thank you. I've made a few more minor adjustments to the patchset. >> I'm going to push 0001 and 0003 if there are no objections. > >I get that patchset v5 doesn't pass the tests due to typo in assert. >The fixes version is attached. > I did a quick review on this patch series. A couple comments: 0001 ---- This looks quite suspicious to me - SimpleLruReadPage_ReadOnly is changed to return information about what lock was used, merely to allow the callers to do an Assert() that the value is not LW_NONE. IMO we could achieve exactly the same thing by passing a simple flag that would say 'make sure we got a lock' or something like that. In fact, aren't all callers doing the assert? That'd mean we can just do the check always, without the flag. (I see GetMultiXactIdMembers does two calls and only checks the second result, but I wonder if that's intended or omission.) In any case, it'd make the lwlock.c changes unnecessary, I think. 0002 ---- Specifies the number cached MultiXact by backend. Any SLRU lookup ... should be 'number of cached ...' 0003 ---- * Conditional variable for waiting till the filling of the next multixact * will be finished. See GetMultiXactIdMembers() and RecordNewMultiXact() * for details. Perhaps 'till the next multixact is filled' or 'gets full' would be better. Not sure. This thread started with a discussion about making the SLRU sizes configurable, but this patch version only adds a local cache. Does this achieve the same goal, or would we still gain something by having GUCs for the SLRUs? If we're claiming this improves performance, it'd be good to have some workload demonstrating that and measurements. I don't see anything like that in this thread, so it's a bit hand-wavy. Can someone share details of such workload (even synthetic one) and some basic measurements? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas, thanks for looking into this!

> On 28 October 2020, at 06:36, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> This thread started with a discussion about making the SLRU sizes
> configurable, but this patch version only adds a local cache. Does this
> achieve the same goal, or would we still gain something by having GUCs
> for the SLRUs?
>
> If we're claiming this improves performance, it'd be good to have some
> workload demonstrating that and measurements. I don't see anything like
> that in this thread, so it's a bit hand-wavy. Can someone share details
> of such workload (even synthetic one) and some basic measurements?

All patches in this thread aim at the same goal: improve performance in the presence of MultiXact lock contention.
I could not build a synthetic reproduction of the problem, however I did some MultiXact stressing here [0]. It's a clumsy test program, because it still is not clear to me which parameters of the workload trigger MultiXact lock contention. In the generic case I was encountering other locks like *GenLock: XidGenLock, MultixactGenLock etc. Yet our production system has encountered this problem approximately once a month throughout this year.

The test program locks for share different sets of tuples in the presence of concurrent full scans.
To produce a set of locks we choose one of 14 bits. If a row number has this bit set to 0 we lock it.
I've been measuring the time to lock all rows 3 times for each of the 14 bits, observing the total time to set all locks.
During the test I was observing locks in pg_stat_activity; if they did not contain enough MultiXact locks I was tuning parameters further (number of concurrent clients, number of bits, select queries etc).
Why is it so complicated? It seems that other reproductions of the problem were encountering other locks.

Let's describe the patches in this thread from the POV of these tests.

*** Configurable SLRU buffers for MultiXact members and offsets.
From the tests it is clear that high and low values for these buffers affect the test time. Here are times for one test run with different offsets and members sizes [1].
Our production currently runs with (numbers are pages of buffers)
+#define NUM_MXACTOFFSET_BUFFERS 32
+#define NUM_MXACTMEMBER_BUFFERS 64
And, looking back to the incidents in summer and fall 2020, it seems like it mostly helped.
But it's hard to give tuning advice based on the test results. Values (32,64) produce a 10% better result than the current hardcoded values (8,16). In the generic case this is not what someone should tune first.

*** Configurable caches of MultiXacts.
The tests were specifically designed to beat the cache. So, according to the tests, the bigger the cache is, the more time it takes to accomplish the test [2].
Anyway the cache is local to the backend and its purpose is deduplication of written MultiXacts, not enhancing reads.

*** Using the advantage of SimpleLruReadPage_ReadOnly() in MultiXacts.
This simply aligns MultiXact with subtransactions and CLOG. Other SLRUs already take advantage of reading SLRU pages with a shared lock.
On synthetic tests without background selects this patch adds another ~4.7% of performance [3] against [4]. This improvement seems consistent between different parameter values, yet within measurement deviation (see the difference between the warmup run [5] and the closing run [6]).
All in all, these attempts to measure the impact are hand-wavy too. But it makes sense to use a consistent approach among similar subsystems (MultiXacts, Subtrans, CLOG etc).

*** Reduce sleep in GetMultiXactIdMembers() on standby.
The problem with pg_usleep(1000L) within GetMultiXactIdMembers() manifests on standbys during contention on MultiXactOffsetControlLock. It's even harder to reproduce.
Yet it seems obvious that reducing the sleep to a shorter time frame will make the count of sleeping backends smaller.

For consistency I've returned the patch with SLRU buffer configs to the patchset (other patches are intact). But I'm mostly concerned about patches 1 and 3.

Thanks!

Best regards, Andrey Borodin.

[0] https://github.com/x4m/multixact_stress
[1] https://github.com/x4m/multixact_stress/blob/master/testresults.txt#L22-L39
[2] https://github.com/x4m/multixact_stress/blob/master/testresults.txt#L83-L99
[3] https://github.com/x4m/multixact_stress/blob/master/testresults.txt#L9
[4] https://github.com/x4m/multixact_stress/blob/master/testresults.txt#L29
[5] https://github.com/x4m/multixact_stress/blob/master/testresults.txt#L3
[6] https://github.com/x4m/multixact_stress/blob/master/testresults.txt#L19
Attachment
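A simple way to take the kind of wait-event samples referred to above, using only stock pg_stat_activity:

    SELECT wait_event_type, wait_event, count(*)
    FROM pg_stat_activity
    WHERE state = 'active' AND pid <> pg_backend_pid()
    GROUP BY wait_event_type, wait_event
    ORDER BY count(*) DESC;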
Hi, Tomas! Thank you for your review. On Wed, Oct 28, 2020 at 4:36 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > I did a quick review on this patch series. A couple comments: > > > 0001 > ---- > > This looks quite suspicious to me - SimpleLruReadPage_ReadOnly is > changed to return information about what lock was used, merely to allow > the callers to do an Assert() that the value is not LW_NONE. Yes, but this is not merely to allow callers to do an Assert(). Sometimes in multixacts it could save us some relocks. So, we can skip relocking lock to exclusive mode if it's in exclusive already. Adding Assert() to every caller is probably overkill. > IMO we could achieve exactly the same thing by passing a simple flag > that would say 'make sure we got a lock' or something like that. In > fact, aren't all callers doing the assert? That'd mean we can just do > the check always, without the flag. (I see GetMultiXactIdMembers does > two calls and only checks the second result, but I wonder if that's > intended or omission.) Having just the flag is exactly what the original version by Andrey did. But if we have to read two multixact offsets pages or multiple members page in one GetMultiXactIdMembers()), then it does relocks from exclusive mode to exclusive mode. I decide that once we decide to optimize this locks, this situation is nice to evade. > In any case, it'd make the lwlock.c changes unnecessary, I think. I agree that it would be better to not touch lwlock.c. But I didn't find a way to evade relocking exclusive mode to exclusive mode without touching lwlock.c or making code cumbersome in other places. > 0002 > ---- > > Specifies the number cached MultiXact by backend. Any SLRU lookup ... > > should be 'number of cached ...' Sounds reasonable. > 0003 > ---- > > * Conditional variable for waiting till the filling of the next multixact > * will be finished. See GetMultiXactIdMembers() and RecordNewMultiXact() > * for details. > > Perhaps 'till the next multixact is filled' or 'gets full' would be > better. Not sure. Sounds reasonable as well. ------ Regards, Alexander Korotkov
On Wed, Oct 28, 2020 at 10:36:39PM +0300, Alexander Korotkov wrote: >Hi, Tomas! > >Thank you for your review. > >On Wed, Oct 28, 2020 at 4:36 AM Tomas Vondra ><tomas.vondra@2ndquadrant.com> wrote: >> I did a quick review on this patch series. A couple comments: >> >> >> 0001 >> ---- >> >> This looks quite suspicious to me - SimpleLruReadPage_ReadOnly is >> changed to return information about what lock was used, merely to allow >> the callers to do an Assert() that the value is not LW_NONE. > >Yes, but this is not merely to allow callers to do an Assert(). >Sometimes in multixacts it could save us some relocks. So, we can >skip relocking lock to exclusive mode if it's in exclusive already. >Adding Assert() to every caller is probably overkill. > Hmm, OK. That can only happen in GetMultiXactIdMembers, which is the only place where we do retry, right? Do we actually know this makes any measurable difference? It seems we're mostly imagining that it might help, but we don't have any actual proof of that (e.g. a workload which we might benchmark). Or am I missing something? For me, the extra conditions make it way harder to reason about the behavior of the code, and I can't convince myself it's worth it. >> IMO we could achieve exactly the same thing by passing a simple flag >> that would say 'make sure we got a lock' or something like that. In >> fact, aren't all callers doing the assert? That'd mean we can just do >> the check always, without the flag. (I see GetMultiXactIdMembers does >> two calls and only checks the second result, but I wonder if that's >> intended or omission.) > >Having just the flag is exactly what the original version by Andrey >did. But if we have to read two multixact offsets pages or multiple >members page in one GetMultiXactIdMembers()), then it does relocks >from exclusive mode to exclusive mode. I decide that once we decide >to optimize this locks, this situation is nice to evade. > >> In any case, it'd make the lwlock.c changes unnecessary, I think. > >I agree that it would be better to not touch lwlock.c. But I didn't >find a way to evade relocking exclusive mode to exclusive mode without >touching lwlock.c or making code cumbersome in other places. > Hmm. OK. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, On Wed, Oct 28, 2020 at 12:34:58PM +0500, Andrey Borodin wrote: >Tomas, thanks for looking into this! > >> 28 окт. 2020 г., в 06:36, Tomas Vondra <tomas.vondra@2ndquadrant.com> написал(а): >> >> >> This thread started with a discussion about making the SLRU sizes >> configurable, but this patch version only adds a local cache. Does this >> achieve the same goal, or would we still gain something by having GUCs >> for the SLRUs? >> >> If we're claiming this improves performance, it'd be good to have some >> workload demonstrating that and measurements. I don't see anything like >> that in this thread, so it's a bit hand-wavy. Can someone share details >> of such workload (even synthetic one) and some basic measurements? > >All patches in this thread aim at the same goal: improve performance in presence of MultiXact locks contention. >I could not build synthetical reproduction of the problem, however I did some MultiXact stressing here [0]. It's a clumsytest program, because it still is not clear to me which parameters of workload trigger MultiXact locks contention.In generic case I was encountering other locks like *GenLock: XidGenLock, MultixactGenLock etc. Yet our productionsystem encounters this problem approximately once in a month through this year. > >Test program locks for share different set of tuples in presence of concurrent full scans. >To produce a set of locks we choose one of 14 bits. If a row number has this bit set to 0 we add lock it. >I've been measuring time to lock all rows 3 time for each of 14 bits, observing total time to set all locks. >During the test I was observing locks in pg_stat_activity, if they did not contain enough MultiXact locks I was tuning parametersfurther (number of concurrent clients, number of bits, select queries etc). > >Why is it so complicated? It seems that other reproductions of a problem were encountering other locks. > It's not my intention to be mean or anything like that, but to me this means we don't really understand the problem we're trying to solve. Had we understood it, we should be able to construct a workload reproducing the issue ... I understand what the individual patches are doing, and maybe those changes are desirable in general. But without any benchmarks from a plausible workload I find it hard to convince myself that: (a) it actually will help with the issue you're observing on production and (b) it's actually worth the extra complexity (e.g. the lwlock changes) I'm willing to invest some of my time into reviewing/testing this, but I think we badly need better insight into the issue, so that we can build a workload reproducing it. Perhaps collecting some perf profiles and a sample of the queries might help, but I assume you already tried that. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Oct 29, 2020 at 12:08:21PM +0500, Andrey Borodin wrote: > > >> 29 окт. 2020 г., в 04:32, Tomas Vondra <tomas.vondra@2ndquadrant.com> написал(а): >> >> It's not my intention to be mean or anything like that, but to me this >> means we don't really understand the problem we're trying to solve. Had >> we understood it, we should be able to construct a workload reproducing >> the issue ... >> >> I understand what the individual patches are doing, and maybe those >> changes are desirable in general. But without any benchmarks from a >> plausible workload I find it hard to convince myself that: >> >> (a) it actually will help with the issue you're observing on production >> >> and >> (b) it's actually worth the extra complexity (e.g. the lwlock changes) >> >> >> I'm willing to invest some of my time into reviewing/testing this, but I >> think we badly need better insight into the issue, so that we can build >> a workload reproducing it. Perhaps collecting some perf profiles and a >> sample of the queries might help, but I assume you already tried that. > >Thanks, Tomas! This totally makes sense. > >Indeed, collecting queries did not help yet. We have loadtest environment equivalent to production (but with 10x less shards),copy of production workload queries. But the problem does not manifest there. >Why do I think the problem is in MultiXacts? >Here is a chart with number of wait events on each host > > >During the problem MultiXactOffsetControlLock and SLRURead dominate all other lock types. After primary switchover to anothernode SLRURead continued for a bit there, then disappeared. OK, so most of this seems to be due to SLRURead and MultiXactOffsetControlLock. Could it be that there were too many multixact members, triggering autovacuum to prevent multixact wraparound? That might generate a lot of IO on the SLRU. Are you monitoring the size of the pg_multixact directory? >Backtraces on standbys during the problem show that most of backends are sleeping in pg_sleep(1000L) and are not includedinto wait stats on these charts. > >Currently I'm considering writing test that directly calls MultiXactIdExpand(), MultiXactIdCreate(), and GetMultiXactIdMembers()from an extension. How do you think, would benchmarks in such tests be meaningful? > I don't know. I'd much rather have a SQL-level benchmark than an extension doing this kind of stuff. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> 29 окт. 2020 г., в 18:49, Tomas Vondra <tomas.vondra@2ndquadrant.com> написал(а): > > On Thu, Oct 29, 2020 at 12:08:21PM +0500, Andrey Borodin wrote: >> >> >>> 29 окт. 2020 г., в 04:32, Tomas Vondra <tomas.vondra@2ndquadrant.com> написал(а): >>> >>> It's not my intention to be mean or anything like that, but to me this >>> means we don't really understand the problem we're trying to solve. Had >>> we understood it, we should be able to construct a workload reproducing >>> the issue ... >>> >>> I understand what the individual patches are doing, and maybe those >>> changes are desirable in general. But without any benchmarks from a >>> plausible workload I find it hard to convince myself that: >>> >>> (a) it actually will help with the issue you're observing on production >>> >>> and >>> (b) it's actually worth the extra complexity (e.g. the lwlock changes) >>> >>> >>> I'm willing to invest some of my time into reviewing/testing this, but I >>> think we badly need better insight into the issue, so that we can build >>> a workload reproducing it. Perhaps collecting some perf profiles and a >>> sample of the queries might help, but I assume you already tried that. >> >> Thanks, Tomas! This totally makes sense. >> >> Indeed, collecting queries did not help yet. We have loadtest environment equivalent to production (but with 10x lessshards), copy of production workload queries. But the problem does not manifest there. >> Why do I think the problem is in MultiXacts? >> Here is a chart with number of wait events on each host >> >> >> During the problem MultiXactOffsetControlLock and SLRURead dominate all other lock types. After primary switchover toanother node SLRURead continued for a bit there, then disappeared. > > OK, so most of this seems to be due to SLRURead and > MultiXactOffsetControlLock. Could it be that there were too many > multixact members, triggering autovacuum to prevent multixact > wraparound? That might generate a lot of IO on the SLRU. Are you > monitoring the size of the pg_multixact directory? Yes, we had some problems with 'multixact "members" limit exceeded' long time ago. We tuned autovacuum_multixact_freeze_max_age = 200000000 and vacuum_multixact_freeze_table_age = 75000000 (half of defaults)and since then did not ever encounter this problem (~5 months). But the MultiXactOffsetControlLock problem persists. Partially the problem was solved by adding more shards. But when oneof shards encounters a problem it's either MultiXacts or vacuum causing relation truncation (unrelated to this thread). Thanks! Best regards, Andrey Borodin.
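For reference, multixact consumption can be watched from SQL with queries along these lines (pg_ls_dir() and pg_stat_file() require superuser or an explicit GRANT; this is only a sketch):

-- distance of each database from autovacuum_multixact_freeze_max_age
SELECT datname, mxid_age(datminmxid) AS multixact_age
FROM pg_database ORDER BY 2 DESC;

-- on-disk size of the multixact SLRUs
SELECT 'pg_multixact/offsets' AS dir,
       pg_size_pretty(sum((pg_stat_file('pg_multixact/offsets/' || f)).size)) AS total
FROM pg_ls_dir('pg_multixact/offsets') AS f
UNION ALL
SELECT 'pg_multixact/members',
       pg_size_pretty(sum((pg_stat_file('pg_multixact/members/' || f)).size))
FROM pg_ls_dir('pg_multixact/members') AS f;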
Hi, After the issue reported in [1] got fixed, I've restarted the multi-xact stress test, hoping to reproduce the issue. But so far no luck :-( I've started slightly different tests on two machines - on one machine I've done this: a) init.sql create table t (a int); insert into t select i from generate_series(1,100000000) s(i); alter table t add primary key (a); b) select.sql SELECT * FROM t WHERE a = (1+mod(abs(hashint4(extract(epoch from now())::int)), 100000000)) FOR KEY SHARE; c) pgbench -n -c 32 -j 8 -f select.sql -T $((24*3600)) test The idea is to have large table and many clients hitting a small random subset of the rows. A sample of wait events from ~24h run looks like this: e_type | e_name | sum ----------+----------------------+---------- LWLock | BufferContent | 13913863 | | 7194679 LWLock | WALWrite | 1710507 Activity | LogicalLauncherMain | 726599 Activity | AutoVacuumMain | 726127 Activity | WalWriterMain | 725183 Activity | CheckpointerMain | 604694 Client | ClientRead | 599900 IO | WALSync | 502904 Activity | BgWriterMain | 378110 Activity | BgWriterHibernate | 348464 IO | WALWrite | 129412 LWLock | ProcArray | 6633 LWLock | WALInsert | 5714 IO | SLRUWrite | 2580 IPC | ProcArrayGroupUpdate | 2216 LWLock | XactSLRU | 2196 Timeout | VacuumDelay | 1078 IPC | XactGroupUpdate | 737 LWLock | LockManager | 503 LWLock | WALBufMapping | 295 LWLock | MultiXactMemberSLRU | 267 IO | DataFileWrite | 68 LWLock | BufferIO | 59 IO | DataFileRead | 27 IO | DataFileFlush | 14 LWLock | MultiXactGen | 7 LWLock | BufferMapping | 1 So, nothing particularly interesting - there certainly are not many wait events related to SLRU. On the other machine I did this: a) init.sql create table t (a int primary key); insert into t select i from generate_series(1,1000) s(i); b) select.sql select * from t for key share; c) pgbench -n -c 32 -j 8 -f select.sql -T $((24*3600)) test and the wait events (24h run too) look like this: e_type | e_name | sum -----------+-----------------------+---------- LWLock | BufferContent | 20804925 | | 2575369 Activity | LogicalLauncherMain | 745780 Activity | AutoVacuumMain | 745292 Activity | WalWriterMain | 740507 Activity | CheckpointerMain | 737691 Activity | BgWriterHibernate | 731123 LWLock | WALWrite | 570107 IO | WALSync | 452603 Client | ClientRead | 151438 BufferPin | BufferPin | 23466 LWLock | WALInsert | 21631 IO | WALWrite | 19050 LWLock | ProcArray | 15082 Activity | BgWriterMain | 14655 IPC | ProcArrayGroupUpdate | 7772 LWLock | WALBufMapping | 3555 IO | SLRUWrite | 1951 LWLock | MultiXactGen | 1661 LWLock | MultiXactMemberSLRU | 359 LWLock | MultiXactOffsetSLRU | 242 LWLock | XactSLRU | 141 IPC | XactGroupUpdate | 104 LWLock | LockManager | 28 IO | DataFileRead | 4 IO | ControlFileSyncUpdate | 1 Timeout | VacuumDelay | 1 IO | WALInitWrite | 1 Also nothing particularly interesting - few SLRU wait events. So unfortunately this does not really reproduce the SLRU locking issues you're observing - clearly, there has to be something else triggering it. Perhaps this workload is too simplistic, or maybe we need to run different queries. Or maybe the hw needs to be somewhat different (more CPUs? different storage?) [1] https://www.postgresql.org/message-id/20201104013205.icogbi773przyny5@development regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> 10 нояб. 2020 г., в 05:13, Tomas Vondra <tomas.vondra@enterprisedb.com> написал(а): > After the issue reported in [1] got fixed, I've restarted the multi-xact > stress test, hoping to reproduce the issue. But so far no luck :-( Tomas, many thanks for looking into this. I figured out that to make multixact sets bigger transactions must hang for a whileand lock large set of tuples. But not continuous range to avoid locking on buffer_content. I did not manage to implement this via pgbench, that's why I was trying to hack on separate go program. But, essentially,no luck either. I was observing something resemblant though пятница, 8 мая 2020 г. 15:08:37 (every 1s) pid | wait_event | wait_event_type | state | query -------+----------------------------+-----------------+--------+---------------------------------------------------- 41344 | ClientRead | Client | idle | insert into t1 select generate_series(1,1000000,1) 41375 | MultiXactOffsetControlLock | LWLock | active | select * from t1 where i = ANY ($1) for share 41377 | MultiXactOffsetControlLock | LWLock | active | select * from t1 where i = ANY ($1) for share 41378 | | | active | select * from t1 where i = ANY ($1) for share 41379 | MultiXactOffsetControlLock | LWLock | active | select * from t1 where i = ANY ($1) for share 41381 | | | active | select * from t1 where i = ANY ($1) for share 41383 | MultiXactOffsetControlLock | LWLock | active | select * from t1 where i = ANY ($1) for share 41385 | MultiXactOffsetControlLock | LWLock | active | select * from t1 where i = ANY ($1) for share (8 rows) but this picture was not stable. How do you collect wait events for aggregation? just insert into some table with cron? Thanks! Best regards, Andrey Borodin.
On 11/10/20 7:16 AM, Andrey Borodin wrote: > > >> 10 нояб. 2020 г., в 05:13, Tomas Vondra <tomas.vondra@enterprisedb.com> написал(а): >> After the issue reported in [1] got fixed, I've restarted the multi-xact >> stress test, hoping to reproduce the issue. But so far no luck :-( > > > Tomas, many thanks for looking into this. I figured out that to make multixact sets bigger transactions must hang for awhile and lock large set of tuples. But not continuous range to avoid locking on buffer_content. > I did not manage to implement this via pgbench, that's why I was trying to hack on separate go program. But, essentially,no luck either. > I was observing something resemblant though > > пятница, 8 мая 2020 г. 15:08:37 (every 1s) > > pid | wait_event | wait_event_type | state | query > -------+----------------------------+-----------------+--------+---------------------------------------------------- > 41344 | ClientRead | Client | idle | insert into t1 select generate_series(1,1000000,1) > 41375 | MultiXactOffsetControlLock | LWLock | active | select * from t1 where i = ANY ($1) for share > 41377 | MultiXactOffsetControlLock | LWLock | active | select * from t1 where i = ANY ($1) for share > 41378 | | | active | select * from t1 where i = ANY ($1) for share > 41379 | MultiXactOffsetControlLock | LWLock | active | select * from t1 where i = ANY ($1) for share > 41381 | | | active | select * from t1 where i = ANY ($1) for share > 41383 | MultiXactOffsetControlLock | LWLock | active | select * from t1 where i = ANY ($1) for share > 41385 | MultiXactOffsetControlLock | LWLock | active | select * from t1 where i = ANY ($1) for share > (8 rows) > > but this picture was not stable. > Seems we haven't made much progress in reproducing the issue :-( I guess we'll need to know more about the machine where this happens. Is there anything special about the hardware/config? Are you monitoring size of the pg_multixact directory? > How do you collect wait events for aggregation? just insert into some table with cron? > No, I have a simple shell script (attached) sampling data from pg_stat_activity regularly. Then I load it into a table and aggregate to get a summary. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
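For anyone without the script at hand, the sampling and aggregation it performs can be approximated in plain SQL along these lines (the wait_samples table name is made up):

CREATE TABLE IF NOT EXISTS wait_samples (
    ts              timestamptz NOT NULL DEFAULT now(),
    wait_event_type text,
    wait_event      text
);

-- run this periodically (cron, \watch, ...) while the workload is running
INSERT INTO wait_samples (wait_event_type, wait_event)
SELECT wait_event_type, wait_event
FROM pg_stat_activity
WHERE state = 'active';

-- then aggregate, giving summaries like the e_type / e_name / sum tables above
SELECT wait_event_type AS e_type, wait_event AS e_name, count(*) AS sum
FROM wait_samples
GROUP BY 1, 2
ORDER BY 3 DESC;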
On Wed, Nov 11, 2020 at 7:07 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > Seems we haven't made much progress in reproducing the issue :-( I guess > we'll need to know more about the machine where this happens. Is there > anything special about the hardware/config? Are you monitoring size of > the pg_multixact directory? Which release was the original problem seen on?
> 10 нояб. 2020 г., в 23:07, Tomas Vondra <tomas.vondra@enterprisedb.com> написал(а): > > On 11/10/20 7:16 AM, Andrey Borodin wrote: >> >> >> but this picture was not stable. >> > > Seems we haven't made much progress in reproducing the issue :-( I guess > we'll need to know more about the machine where this happens. Is there > anything special about the hardware/config? Are you monitoring size of > the pg_multixact directory? It's Ubuntu 18.04.4 LTS, Intel Xeon E5-2660 v4, 56 CPU cores with 256Gb of RAM. PostgreSQL 10.14, compiled by gcc 7.5.0, 64-bit No, unfortunately we do not have signals for SLRU sizes. 3.5Tb mdadm raid10 over 28 SSD drives, 82% full. First incident triggering investigation was on 2020-04-19, at that time cluster was running on PG 10.11. But I think it washappening before. I'd say nothing special... > >> How do you collect wait events for aggregation? just insert into some table with cron? >> > > No, I have a simple shell script (attached) sampling data from > pg_stat_activity regularly. Then I load it into a table and aggregate to > get a summary. Thanks! Best regards, Andrey Borodin.
Le 13/11/2020 à 12:49, Andrey Borodin a écrit : > >> 10 нояб. 2020 г., в 23:07, Tomas Vondra <tomas.vondra@enterprisedb.com> написал(а): >> >> On 11/10/20 7:16 AM, Andrey Borodin wrote: >>> >>> but this picture was not stable. >>> >> Seems we haven't made much progress in reproducing the issue :-( I guess >> we'll need to know more about the machine where this happens. Is there >> anything special about the hardware/config? Are you monitoring size of >> the pg_multixact directory? > It's Ubuntu 18.04.4 LTS, Intel Xeon E5-2660 v4, 56 CPU cores with 256Gb of RAM. > PostgreSQL 10.14, compiled by gcc 7.5.0, 64-bit > > No, unfortunately we do not have signals for SLRU sizes. > 3.5Tb mdadm raid10 over 28 SSD drives, 82% full. > > First incident triggering investigation was on 2020-04-19, at that time cluster was running on PG 10.11. But I think itwas happening before. > > I'd say nothing special... > >>> How do you collect wait events for aggregation? just insert into some table with cron? >>> >> No, I have a simple shell script (attached) sampling data from >> pg_stat_activity regularly. Then I load it into a table and aggregate to >> get a summary. > Thanks! > > Best regards, Andrey Borodin. Hi, Some time ago I have encountered a contention on MultiXactOffsetControlLock with a performances benchmark. Here are the wait event monitoring result with a pooling each 10 seconds and a 30 minutes run for the benchmarl: event_type | event | sum ------------+----------------------------+---------- Client | ClientRead | 44722952 LWLock | MultiXactOffsetControlLock | 30343060 LWLock | multixact_offset | 16735250 LWLock | MultiXactMemberControlLock | 1601470 LWLock | buffer_content | 991344 LWLock | multixact_member | 805624 Lock | transactionid | 204997 Activity | LogicalLauncherMain | 198834 Activity | CheckpointerMain | 198834 Activity | AutoVacuumMain | 198469 Activity | BgWriterMain | 184066 Activity | WalWriterMain | 171571 LWLock | WALWriteLock | 72428 IO | DataFileRead | 35708 Activity | BgWriterHibernate | 12741 IO | SLRURead | 9121 Lock | relation | 8858 LWLock | ProcArrayLock | 7309 LWLock | lock_manager | 6677 LWLock | pg_stat_statements | 4194 LWLock | buffer_mapping | 3222 After reading this thread I change the value of the buffer size to 32 and 64 and obtain the following results: event_type | event | sum ------------+----------------------------+----------- Client | ClientRead | 268297572 LWLock | MultiXactMemberControlLock | 65162906 LWLock | multixact_member | 33397714 LWLock | buffer_content | 4737065 Lock | transactionid | 2143750 LWLock | SubtransControlLock | 1318230 LWLock | WALWriteLock | 1038999 Activity | LogicalLauncherMain | 940598 Activity | AutoVacuumMain | 938566 Activity | CheckpointerMain | 799052 Activity | WalWriterMain | 749069 LWLock | subtrans | 710163 Activity | BgWriterHibernate | 536763 Lock | object | 514225 Activity | BgWriterMain | 394206 LWLock | lock_manager | 295616 IO | DataFileRead | 274236 LWLock | ProcArrayLock | 77099 Lock | tuple | 59043 IO | CopyFileWrite | 45611 Lock | relation | 42714 There was still contention on multixact but less than the first run. 
I have increased the buffers to 128 and 512 and obtain the best results for this bench: event_type | event | sum ------------+----------------------------+----------- Client | ClientRead | 160463037 LWLock | MultiXactMemberControlLock | 5334188 LWLock | buffer_content | 5228256 LWLock | buffer_mapping | 2368505 LWLock | SubtransControlLock | 2289977 IPC | ProcArrayGroupUpdate | 1560875 LWLock | ProcArrayLock | 1437750 Lock | transactionid | 825561 LWLock | subtrans | 772701 LWLock | WALWriteLock | 666138 Activity | LogicalLauncherMain | 492585 Activity | CheckpointerMain | 492458 Activity | AutoVacuumMain | 491548 LWLock | lock_manager | 426531 Lock | object | 403581 Activity | WalWriterMain | 394668 Activity | BgWriterHibernate | 293112 Activity | BgWriterMain | 195312 LWLock | MultiXactGenLock | 177820 LWLock | pg_stat_statements | 173864 IO | DataFileRead | 173009 I hope these metrics can have some interest to show the utility of this patch but unfortunately I can not be more precise and provide reports for the entire patch. The problem is that this benchmark is run on an application that use PostgreSQL 11 and I can not back port the full patch, there was too much changes since PG11. I have just increase the size of NUM_MXACTOFFSET_BUFFERS and NUM_MXACTMEMBER_BUFFERS. This allow us to triple the number of simultaneous connections between the first and the last test. I know that this report is not really helpful but at least I can give more information on the benchmark that was used. This is the proprietary zRef benchmark which compares the same Cobol programs (transactional and batch) executed both on mainframes and on x86 servers. Instead of a DB2 z/os database we use PostgreSQL v11. This test has extensive use of cursors (each select, even read only, is executed through a cursor) and the contention was observed with update on tables with some foreign keys. There is no explicit FOR SHARE on the queries, only some FOR UPDATE clauses. I guess that the multixact contention is the result of the for share locks produced for FK. So in our case being able to tune the multixact buffers could help a lot to improve the performances. -- Gilles Darold
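For anyone who wants to see the FK-driven KEY SHARE locks turn into multixacts without the full benchmark, a toy reproduction could look like this (a hypothetical two-table schema, not the zRef one):

CREATE TABLE parent (id int PRIMARY KEY);
CREATE TABLE child  (id int PRIMARY KEY, parent_id int REFERENCES parent(id));
INSERT INTO parent VALUES (1);

-- session 1: the FK check takes a KEY SHARE lock on the referenced parent row
BEGIN;
INSERT INTO child VALUES (1, 1);

-- session 2, concurrently: a second FK check locks the same parent row,
-- so its xmax is converted into a MultiXactId
BEGIN;
INSERT INTO child VALUES (2, 1);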
Hi Gilles! Many thanks for your message! > On 8 Dec 2020, at 21:05, Gilles Darold <gilles@darold.net> wrote: > > I know that this report is not really helpful Quite the contrary - this benchmark proves that a controllable reproduction exists. I've rebased the patches for PG11. Can you please benchmark them (without extending SLRU)? Best regards, Andrey Borodin.
Attachment
Also PG regression tests are failing too on several part:
test insert_conflict ... ok
test create_function_1 ... FAILED
test create_type ... FAILED
test create_table ... FAILED
test create_function_2 ... FAILED
test copy ... FAILED
test copyselect ... ok
test copydml ... ok
test create_misc ... FAILED
test create_operator ... FAILED
test create_procedure ... ok
test create_index ... FAILED
test index_including ... ok
test create_view ... FAILED
test create_aggregate ... ok
test create_function_3 ... ok
test create_cast ... ok
test constraints ... FAILED
test triggers ... FAILED
test inherit ...
^C
-- Gilles Darold
On 09/12/2020 at 11:51, Gilles Darold wrote: > Also PG regression tests are failing too on several part. Forget this, I have not run the regression tests in the right repository: ... ======================= All 189 tests passed. ======================= I'm looking into why the application is failing. -- Gilles Darold http://www.darold.net/
Hi,
Running the tests yesterday with the patches reported a lot of failures, with errors on INSERT and UPDATE statements:
ERROR: lock MultiXactOffsetControlLock is not held
After a patch review this morning I think I have found what's going wrong. In patch v6-0001-Use-shared-lock-in-GetMultiXactIdMembers-for-offs.patch I think there is a missing reinitialisation of the lockmode variable to LW_NONE inside the retry loop, after the call to LWLockRelease() in src/backend/access/transam/multixact.c:1392:GetMultiXactIdMembers(). I've attached a new version of the patch for master that includes the fix I'm using now with PG11, with which everything works very well.
I'm running more tests to see the performance impact of playing with multixact_offsets_slru_buffers, multixact_members_slru_buffers and multixact_local_cache_entries. I will report the results later today.
-- Gilles Darold http://www.darold.net/
Attachment
Hi,
Sorry for the delay, I have done some further tests to try to reach the limit without bottlenecks on multixact or shared buffers. The tests were done on a Microsoft Azure machine with 2TB of RAM and 4 sockets of Intel Xeon Platinum 8280M (128 CPUs). PG configuration:
max_connections = 4096
shared_buffers = 64GB
max_prepared_transactions = 2048
work_mem = 256MB
maintenance_work_mem = 2GB
wal_level = minimal
synchronous_commit = off
commit_delay = 1000
commit_siblings = 10
checkpoint_timeout = 1h
max_wal_size = 32GB
checkpoint_completion_target = 0.9
I have tested several values for the different buffer variables, starting from:
multixact_offsets_slru_buffers = 64
multixact_members_slru_buffers = 128
multixact_local_cache_entries = 256
up to the values that gave the best performance in this test, avoiding MultiXactOffsetControlLock or MultiXactMemberControlLock contention:
multixact_offsets_slru_buffers = 128
multixact_members_slru_buffers = 512
multixact_local_cache_entries = 1024
Also, shared_buffers was increased up to 256GB to avoid buffer_mapping contention.
Our last best test reports the following wait events:
event_type | event | sum
------------+----------------------------+-----------
Client | ClientRead | 321690211
LWLock | buffer_content | 2970016
IPC | ProcArrayGroupUpdate | 2317388
LWLock | ProcArrayLock | 1445828
LWLock | WALWriteLock | 1187606
LWLock | SubtransControlLock | 972889
Lock | transactionid | 840560
Lock | relation | 587600
Activity | LogicalLauncherMain | 529599
Activity | AutoVacuumMain | 528097
At this stage I don't think we can get better performance by tuning these buffers, at least with PG11.
Regarding the performance gain from the patch for the shared lock in GetMultiXactIdMembers, unfortunately I cannot see a difference with or without this patch; it could be related to our particular benchmark. But clearly the patch on multixact buffers should be committed, as it is really helpful to be able to tune PG when multixact bottlenecks are found.
Best regards,
-- Gilles Darold LzLabs GmbH https://www.lzlabs.com/
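While iterating on those buffer settings, a quick live check of whether the multixact control locks still dominate can be done with a query like the one below (wait event names differ between releases, e.g. MultiXactOffsetControlLock on v11 versus MultiXactOffsetSLRU on v13 and later):

SELECT wait_event, count(*)
FROM pg_stat_activity
WHERE wait_event_type = 'LWLock'
  AND wait_event LIKE 'MultiXact%'
GROUP BY wait_event
ORDER BY count(*) DESC;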
Hi,
I've done more review on these patches.
1) As reported in my previous message, patch 0001 looks useless as it doesn't give a measurable performance gain.
2) In patch 0004 there are two typos: s/informaion/information/ will fix them.
3) The GUCs are missing in the postgresql.conf.sample file; see the patch in attachment for a proposal.
Best regards,
-- Gilles Darold LzLabs GmbH https://www.lzlabs.com/
Attachment
> On 13 Dec 2020, at 14:17, Gilles Darold <gilles@darold.net> wrote: > > I've done more review on these patches. Thanks, Gilles! I'll incorporate all your fixes into the patch set. Can you also benchmark the conditional variable sleep? The patch "Add conditional variable to wait for next MultXact offset in corner case"? The problem manifests on Standby when Primary is heavily loaded with MultiXactOffsetControlLock. Thanks! Best regards, Andrey Borodin.
> On 13 Dec 2020, at 22:24, Andrey Borodin <x4mmm@yandex-team.ru> wrote: > > > >> On 13 Dec 2020, at 14:17, Gilles Darold <gilles@darold.net> wrote: >> >> I've done more review on these patches. > > Thanks, Gilles! I'll incorporate all your fixes into the patch set. PFA patches. Also, I've noted that the patch "Add conditional variable to wait for next MultXact offset in corner case" removes CHECK_FOR_INTERRUPTS(); I'm not sure that's correct. Thanks! Best regards, Andrey Borodin.
Attachment
Hi Andrey,
Sorry for the response delay, we have run several other tests trying to figure out the performance gain per patch, but unfortunately we have very erratic results. With the same parameters and patches the test doesn't return the same results depending on the day or the hour of the day. This is very frustrating, and I suppose that this is related to the Azure architecture. The only thing I am sure of is that we had the best performance results with all patches and
multixact_offsets_slru_buffers = 256
multixact_members_slru_buffers = 512
multixact_local_cache_entries = 4096
but I cannot say if all or only part of the patches are improving the performance. My feeling is that the performance gain related to patches 1 (shared lock) and 3 (conditional variable) does not have much to do with the gain obtained by just tuning the multixact buffers. That is when the multixact contention is observed, but perhaps they are delaying the contention. It's all the more frustrating that we had a test case to reproduce the contention, but apparently not the architecture.
Can't do much more at this point.
Best regards,
-- Gilles Darold LzLabs GmbH http://www.lzlabs.com/
> 23 дек. 2020 г., в 21:31, Gilles Darold <gilles@darold.net> написал(а): > > Sorry for the response delay, we have run several others tests trying to figure out the performances gain per patch butunfortunately we have very heratic results. With the same parameters and patches the test doesn't returns the same resultsfollowing the day or the hour of the day. This is very frustrating and I suppose that this is related to the Azurearchitecture. The only thing that I am sure is that we had the best performances results with all patches and > > multixact_offsets_slru_buffers = 256 > multixact_members_slru_buffers = 512 > multixact_local_cache_entries = 4096 > > > but I can not say if all or part of the patches are improving the performances. My feeling is that performances gain relatedto patches 1 (shared lock) and 3 (conditional variable) do not have much to do with the performances gain comparedto just tuning the multixact buffers. This is when the multixact contention is observed but perhaps they are delayingthe contention. It's all the more frustrating that we had a test case to reproduce the contention but not the architectureapparently. Hi! Thanks for the input. I think we have a consensus here that configuring SLRU size is beneficial for MultiXacts. There is proposal in nearby thread [0] on changing default value of commit_ts SLRU buffers. In my experience from time to time there can be problems with subtransactions cured by extending subtrans SLRU. Let's make all SLRUs configurable? PFA patch with draft of these changes. Best regards, Andrey Borodin. [0] https://www.postgresql.org/message-id/flat/20210115220744.GA24457%40alvherre.pgsql
Attachment
Hi Andrey, On Sat, Mar 13, 2021 at 1:44 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote: > [v10] +int multixact_offsets_slru_buffers = 8; +int multixact_members_slru_buffers = 16; +int subtrans_slru_buffers = 32; +int notify_slru_buffers = 8; +int serial_slru_buffers = 16; +int clog_slru_buffers = 0; +int commit_ts_slru_buffers = 0; I don't think we should put "slru" (the name of the buffer replacement algorithm, implementation detail) in the GUC names. + It defaults to 0, in this case CLOG size is taken as <varname>shared_buffers</varname> / 512. We already know that increasing the number of CLOG buffers above the current number hurts as the linear search begins to dominate (according to the commit message for 5364b357), and it doesn't seem great to ship a new feature that melts your CPU when you turn it up. Perhaps, to ship this, we need to introduce a buffer mapping table? I have attached a "one coffee" attempt at that, on top of your v10 patch (unmodified), for discussion. It survives basic testing but I don't know how it performs.
Attachment
On Thu, Mar 25, 2021 at 10:31 AM Thomas Munro <thomas.munro@gmail.com> wrote: > We already know that increasing the number of CLOG buffers above the > current number hurts as the linear search begins to dominate > (according to the commit message for 5364b357), and it doesn't seem > great to ship a new feature that melts your CPU when you turn it up. > Perhaps, to ship this, we need to introduce a buffer mapping table? I > have attached a "one coffee" attempt at that, on top of your v10 patch > (unmodified), for discussion. It survives basic testing but I don't > know how it performs. Hrrr... Cfbot showed an assertion failure. Here's the two coffee version with a couple of silly mistakes fixed.
Attachment
Hi Andrey, all, I propose some changes, and I'm attaching a new version: I renamed the GUCs as clog_buffers etc (no "_slru_"). I fixed some copy/paste mistakes where the different GUCs were mixed up. I made some changes to the .conf.sample. I rewrote the documentation so that it states the correct unit and defaults, and refers to the subdirectories that are cached by these buffers instead of trying to give a new definition of each of the SLRUs. Do you like those changes? Some things I thought about but didn't change: I'm not entirely sure if we should use the internal and historical names well known to hackers (CLOG), or the visible directory names (I mean, we could use pg_xact_buffers instead of clog_buffers). I am not sure why these GUCs need to be PGDLLIMPORT, but I see that NBuffers is like that. I wanted to do some very simple smoke testing of CLOG sizes on my local development machine: pgbench -i -s1000 postgres pgbench -t4000000 -c8 -j8 -Mprepared postgres I disabled autovacuum after running that just to be sure it wouldn't interfere with my experiment: alter table pgbench_accounts set (autovacuum_enabled = off); Then I shut the cluster down and made a copy, so I could do some repeated experiments from the same initial conditions each time. At this point I had 30 files 0000-001E under pg_xact, holding 256kB = ~1 million transactions each. It'd take ~960 buffers to cache it all. So how long does VACUUM FREEZE pgbench_accounts take? I tested with just the 0001 patch, and also with the 0002 patch (improved version, attached): clog_buffers=128: 0001=2:28.499, 0002=2:17:891 clog_buffers=1024: 0001=1:38.485, 0002=1:29.701 I'm sure the speedup of the 0002 patch can be amplified by increasing the number of transactions referenced in the table OR number of clog_buffers, considering that the linear search produces O(transactions * clog_buffers) work. That was 32M transactions and 8MB of CLOG, but I bet if you double both of those numbers once or twice things start to get hot. I don't see why you shouldn't be able to opt to cache literally all of CLOG if you want (something like 50MB assuming default autovacuum_freeze_max_age, scale to taste, up to 512MB for the theoretical maximum useful value). I'm not saying the 0002 patch is bug-free yet though, it's a bit finickity.
Attachment
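To put the "960 buffers to cache it all" arithmetic above into a reusable form: with the default BLCKSZ, one xact buffer covers 32768 transaction statuses (CLOG_XACTS_PER_PAGE), so a rough upper bound on useful xact_buffers can be estimated with a back-of-the-envelope query like this (not part of the patch):

-- pages needed to cover the whole unfrozen xid range (32768 xids per 8kB page)
SELECT max(age(datfrozenxid))                 AS unfrozen_xids,
       ceil(max(age(datfrozenxid)) / 32768.0) AS clog_pages_needed
FROM pg_database;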
Hi Thomas, Gilles, all! Thanks for reviewing this! > 25 марта 2021 г., в 02:31, Thomas Munro <thomas.munro@gmail.com> написал(а): > > I don't think we should put "slru" (the name of the buffer replacement > algorithm, implementation detail) in the GUC names. +1 > + It defaults to 0, in this case CLOG size is taken as > <varname>shared_buffers</varname> / 512. > > We already know that increasing the number of CLOG buffers above the > current number hurts as the linear search begins to dominate Uh, my intent was to copy original approach of CLOG SLRU size, I just missed that Min(,) thing in shared_buffers logic. > 26 марта 2021 г., в 08:46, Thomas Munro <thomas.munro@gmail.com> написал(а): > > Hi Andrey, all, > > I propose some changes, and I'm attaching a new version: > > I renamed the GUCs as clog_buffers etc (no "_slru_"). I fixed some > copy/paste mistakes where the different GUCs were mixed up. I made > some changes to the .conf.sample. I rewrote the documentation so that > it states the correct unit and defaults, and refers to the > subdirectories that are cached by these buffers instead of trying to > give a new definition of each of the SLRUs. > > Do you like those changes? Yes! > Some things I thought about but didn't change: > > I'm not entirely sure if we should use the internal and historical > names well known to hackers (CLOG), or the visible directory names (I > mean, we could use pg_xact_buffers instead of clog_buffers). While it is good idea to make notes about directory name, I think the real naming criteria is to help find this GUC whenuser encounters wait events in pg_stat_activity. I think there is no CLOG mentions in docs [0], only XactBuffer, XactSLRUet c. > I'm not saying the 0002 patch is bug-free yet though, it's a bit finickity. I think the idea of speeding up linear search is really really good for scaling SLRUs. It's not even about improving normalperformance of the cluster, but it's important from preventing pathological degradation under certain circumstances.Bigger cache really saves SLAs :) I'll look into the patch more closely this weekend. Thank you! Best regards, Andrey Borodin. [0] https://www.postgresql.org/docs/current/monitoring-stats.html#MONITORING-PG-STAT-ACTIVITY-VIEW
> On 26 Mar 2021, at 11:00, Andrey Borodin <x4mmm@yandex-team.ru> wrote: > >> I'm not saying the 0002 patch is bug-free yet though, it's a bit finickity. > I think the idea of speeding up linear search is really really good for scaling SLRUs. It's not even about improving normal performance of the cluster, but it's important for preventing pathological degradation under certain circumstances. Bigger cache really saves SLAs :) I'll look into the patch more closely this weekend. Thank you! Some thoughts on the HashTable patch: 1. Can we allocate a bigger hashtable to reduce the probability of collisions? 2. Can we use a specialised hashtable for this case? I'm afraid hash_search() takes a comparable number of CPU cycles to a simple loop from 0 to 128. We could inline everything and avoid the hashp->hash(keyPtr, hashp->keysize) call. I'm not insisting on a special hash though, just an idea. 3. pageno in SlruMappingTableEntry seems to be unused. Thanks! Best regards, Andrey Borodin.
On Sat, Mar 27, 2021 at 4:52 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote: > Some thoughts on HashTable patch: > 1. Can we allocate bigger hashtable to reduce probability of collisions? Yeah, good idea, might require some study. > 2. Can we use specialised hashtable for this case? I'm afraid hash_search() does comparable number of CPU cycles as simplecycle from 0 to 128. We could inline everything and avoid hashp->hash(keyPtr, hashp->keysize) call. I'm not insistingon special hash though, just an idea. I tried really hard to not fall into this rabbit h.... [hack hack hack], OK, here's a first attempt to use simplehash, Andres's steampunk macro-based robinhood template that we're already using for several other things, and murmurhash which is inlineable and branch-free. I had to tweak it to support "in-place" creation and fixed size (in other words, no allocators, for use in shared memory). Then I was annoyed that I had to add a "status" member to our struct, so I tried to fix that. Definitely needs more work to think about failure modes when running out of memory, how much spare space you need, etc. I have not experimented with this much beyond hacking until the tests pass, but it *should* be more efficient... > 3. pageno in SlruMappingTableEntry seems to be unused. It's the key (dynahash uses the first N bytes of your struct as the key, but in this new simplehash version it's more explicit).
Attachment
> 27 марта 2021 г., в 01:26, Thomas Munro <thomas.munro@gmail.com> написал(а): > > On Sat, Mar 27, 2021 at 4:52 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote: >> Some thoughts on HashTable patch: >> 1. Can we allocate bigger hashtable to reduce probability of collisions? > > Yeah, good idea, might require some study. In a long run we always have this table filled with nslots. But the keys will be usually consecutive numbers (current workingset of CLOG\Multis\etc). So in a happy hashing scenario collisions will only appear for some random backward jumps.I think just size = nslots * 2 will produce results which cannot be improved significantly. And this reflects original growth strategy SH_GROW(tb, tb->size * 2). >> 2. Can we use specialised hashtable for this case? I'm afraid hash_search() does comparable number of CPU cycles as simplecycle from 0 to 128. We could inline everything and avoid hashp->hash(keyPtr, hashp->keysize) call. I'm not insistingon special hash though, just an idea. > > I tried really hard to not fall into this rabbit h.... [hack hack > hack], OK, here's a first attempt to use simplehash, > Andres's > steampunk macro-based robinhood template Sounds magnificent. > that we're already using for > several other things I could not find much tests to be sure that we do not break something... > , and murmurhash which is inlineable and > branch-free. I think pageno is a hash already. Why hash any further? And pages accessed together will have smaller access time due tocolocation. > I had to tweak it to support "in-place" creation and > fixed size (in other words, no allocators, for use in shared memory). We really need to have a test to know what happens when this structure goes out of memory, as you mentioned below. What wouldbe apropriate place for simplehash tests? > Then I was annoyed that I had to add a "status" member to our struct, > so I tried to fix that. Indeed, sizeof(SlruMappingTableEntry) == 9 seems strange. Will simplehash align it well? Thanks! Best regards, Andrey Borodin.
On Sat, Mar 27, 2021 at 6:31 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote: > > 27 марта 2021 г., в 01:26, Thomas Munro <thomas.munro@gmail.com> написал(а): > > , and murmurhash which is inlineable and > > branch-free. > I think pageno is a hash already. Why hash any further? And pages accessed together will have smaller access time due tocolocation. Yeah, if clog_buffers is large enough then it's already a "perfect hash", but if it's not then you might get some weird "harmonic" effects (not sure if that's the right word), basically higher or lower collision rate depending on coincidences in the data. If you apply a hash, the collisions should be evenly spread out so at least it'll be somewhat consistent. Does that make sense? (At some point I figured out that the syscaches have lower collision rates and perform better if you use oids directly instead of hashing them... but then it's easy to create a pathological pattern of DDL that turns your hash table into a linked list. Not sure what to think about that.) > > I had to tweak it to support "in-place" creation and > > fixed size (in other words, no allocators, for use in shared memory). > We really need to have a test to know what happens when this structure goes out of memory, as you mentioned below. Whatwould be apropriate place for simplehash tests? Good questions. This has to be based on being guaranteed to have enough space for all of the entries, so the question is really just "how bad can performance get with different load factors". FWIW there were some interesting cases with clustering when simplehash was first used in the executor (see commits ab9f2c42 and parent) which required some work on hashing quality to fix. > > Then I was annoyed that I had to add a "status" member to our struct, > > so I tried to fix that. > Indeed, sizeof(SlruMappingTableEntry) == 9 seems strange. Will simplehash align it well? With that "intrusive status" patch, the size is back to 8. But I think I made a mistake: I made it steal some key space to indicate presence, but I think the presence test should really get access to the whole entry so that you can encode it in more ways. For example, with slotno == -1. Alright, considering the date, if we want to get this into PostgreSQL 14 it's time to make some decisions. 1. Do we want customisable SLRU sizes in PG14? +1 from me, we have multiple reports of performance gains from increasing various different SLRUs, and it's easy to find workloads that go faster. One thought: it'd be nice if the user could *see* the current size, when using the default. SHOW clog_buffers -> 0 isn't very helpful if you want to increase it, but don't know what it's currently set to. Not sure off the top of my head how best to do that. 2. What names do we want the GUCs to have? Here's what we have: Proposed GUC Directory System views clog_buffers pg_xact Xact multixact_offsets_buffers pg_multixact/offsets MultiXactOffset multixact_members_buffers pg_multixact/members MultiXactMember notify_buffers pg_notify Notify serial_buffers pg_serial Serial subtrans_buffers pg_subtrans Subtrans commit_ts_buffers pg_commit_ts CommitTs By system views, I mean pg_stat_slru, pg_shmem_allocations and pg_stat_activity (lock names add "SLRU" on the end). Observations: It seems obvious that "clog_buffers" should be renamed to "xact_buffers". It's not clear whether the multixact GUCs should have the extra "s" like the directories, or not, like the system views. 
It see that we have "Shared Buffer Lookup Table" in pg_shmem_allocations, so where I generated names like "Subtrans Mapping Table" I should change that to "Lookup" to match. 3. What recommendations should we make about how to set it? I think the answer depends partially on the next questions! I think we should probably at least say something short about the pg_stat_slru view (cache miss rate) and pg_stat_actitity view (waits on locks), and how to tell if you might need to increase it. I think this probably needs a new paragraph, separate from the docs for the individual GUC. 4. Do we want to ship the dynahash patch? +0.9. The slight hesitation is that it's new code written very late in the cycle, so it may still have bugs or unintended consequences, and as you said, at small sizes the linear search must be faster than the hash computation. Could you help test it, and try to break it? Can we quantify the scaling effect for some interesting workloads, to see at what size the dynahash beats the linear search, so that we can make an informed decision? Of course, without a hash table, large sizes will surely work badly, so it'd be tempting to restrict the size you can set the GUC to. If we do include the dynahash patch, then I think it would also be reasonable to change the formula for the default, to make it higher on large systems. The restriction to 128 buffers (= 1MB) doesn't make much sense on a high frequency OLTP system with 128GB of shared buffers or even 4GB. I think "unleashing better defaults" would actually be bigger news than the GUC for typical users, because they'll just see PG14 use a few extra MB and go faster without having to learn about these obscure new settings. 5. Do we want to ship the simplehash patch? -0.5. It's a bit too exciting for the last minute, so I'd be inclined to wait until the next cycle to do some more research and testing. I know it's a better idea in the long run.
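(Returning to point 3: the kind of query readers would run against pg_stat_slru to judge the miss rate might look like the sketch below; the view exists since v13, and this is not proposed documentation wording.)

-- per-SLRU hit ratio; a low ratio with a high blks_read count suggests the
-- corresponding *_buffers setting is too small
SELECT name, blks_hit, blks_read,
       round(blks_hit::numeric / nullif(blks_hit + blks_read, 0), 4) AS hit_ratio
FROM pg_stat_slru
ORDER BY blks_read DESC;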
> 29 марта 2021 г., в 02:15, Thomas Munro <thomas.munro@gmail.com> написал(а): > > On Sat, Mar 27, 2021 at 6:31 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote: >>> 27 марта 2021 г., в 01:26, Thomas Munro <thomas.munro@gmail.com> написал(а): >>> , and murmurhash which is inlineable and >>> branch-free. > >> I think pageno is a hash already. Why hash any further? And pages accessed together will have smaller access time dueto colocation. > > Yeah, if clog_buffers is large enough then it's already a "perfect > hash", but if it's not then you might get some weird "harmonic" > effects (not sure if that's the right word), basically higher or lower > collision rate depending on coincidences in the data. If you apply a > hash, the collisions should be evenly spread out so at least it'll be > somewhat consistent. Does that make sense? As far as I understand "Harmonic" effects only make sense if the distribution is unknown. Hash protects from "periodic" datawhen periods are equal to hash table size. I don't think we need to protect from this case, SLRU data is expected tobe localised... Cost of this protection is necessity to calculate murmur hash on each SLRU lookup. Probably, 10-100ns. Seems like not a bigdeal. > (At some point I figured out that the syscaches have lower collision > rates and perform better if you use oids directly instead of hashing > them... but then it's easy to create a pathological pattern of DDL > that turns your hash table into a linked list. Not sure what to think > about that.) > >>> I had to tweak it to support "in-place" creation and >>> fixed size (in other words, no allocators, for use in shared memory). > >> We really need to have a test to know what happens when this structure goes out of memory, as you mentioned below. Whatwould be apropriate place for simplehash tests? > > Good questions. This has to be based on being guaranteed to have > enough space for all of the entries, so the question is really just > "how bad can performance get with different load factors". FWIW there > were some interesting cases with clustering when simplehash was first > used in the executor (see commits ab9f2c42 and parent) which required > some work on hashing quality to fix. Interesting read, I didn't know much about simple hash, but seems like there is still many cases where it can be used forgood. I always wondered why Postgres uses only Larson's linear hash. > >>> Then I was annoyed that I had to add a "status" member to our struct, >>> so I tried to fix that. > >> Indeed, sizeof(SlruMappingTableEntry) == 9 seems strange. Will simplehash align it well? > > With that "intrusive status" patch, the size is back to 8. But I > think I made a mistake: I made it steal some key space to indicate > presence, but I think the presence test should really get access to > the whole entry so that you can encode it in more ways. For example, > with slotno == -1. > > Alright, considering the date, if we want to get this into PostgreSQL > 14 it's time to make some decisions. > > 1. Do we want customisable SLRU sizes in PG14? > > +1 from me, we have multiple reports of performance gains from > increasing various different SLRUs, and it's easy to find workloads > that go faster. Yes, this is main point of this discussion. So +1 from me too. > > One thought: it'd be nice if the user could *see* the current size, > when using the default. SHOW clog_buffers -> 0 isn't very helpful if > you want to increase it, but don't know what it's currently set to. 
> Not sure off the top of my head how best to do that. Don't we expect that SHOW command indicate exactly same value as in config or SET command? If this convention does not exist- probably showing effective value is a good idea. > 2. What names do we want the GUCs to have? Here's what we have: > > Proposed GUC Directory System views > clog_buffers pg_xact Xact > multixact_offsets_buffers pg_multixact/offsets MultiXactOffset > multixact_members_buffers pg_multixact/members MultiXactMember > notify_buffers pg_notify Notify > serial_buffers pg_serial Serial > subtrans_buffers pg_subtrans Subtrans > commit_ts_buffers pg_commit_ts CommitTs > > By system views, I mean pg_stat_slru, pg_shmem_allocations and > pg_stat_activity (lock names add "SLRU" on the end). > > Observations: > > It seems obvious that "clog_buffers" should be renamed to "xact_buffers". +1 > It's not clear whether the multixact GUCs should have the extra "s" > like the directories, or not, like the system views. I think we show break the ties by native English speaker's ears or typing habits. I'm not a native speaker. > It see that we have "Shared Buffer Lookup Table" in > pg_shmem_allocations, so where I generated names like "Subtrans > Mapping Table" I should change that to "Lookup" to match. > > 3. What recommendations should we make about how to set it? > > I think the answer depends partially on the next questions! I think > we should probably at least say something short about the pg_stat_slru > view (cache miss rate) and pg_stat_actitity view (waits on locks), and > how to tell if you might need to increase it. I think this probably > needs a new paragraph, separate from the docs for the individual GUC. I can only suggest incident-driven approach. 1. Observe ridiculous amount of backends waiting on particular SLRU. 2. Double SLRU buffers for that SLRU. 3. Goto 1. I don't think we should mention this approach in docs. > 4. Do we want to ship the dynahash patch? This patch allows to throw infinite amount of memory on a problem of SLRU waiting for IO. So the scale of improvement ismuch higher. Do I want that we ship this patch? Definitely. Does this change much? I don't know. > > +0.9. The slight hesitation is that it's new code written very late > in the cycle, so it may still have bugs or unintended consequences, > and as you said, at small sizes the linear search must be faster than > the hash computation. Could you help test it, and try to break it? I'll test it and try to break. > Can we quantify the scaling effect for some interesting workloads, to > see at what size the dynahash beats the linear search, so that we can > make an informed decision? I think we cannot statistically distinguish linear search from hash search by means of SLRU. But we can create some syntheticbenchmarks. > Of course, without a hash table, large > sizes will surely work badly, so it'd be tempting to restrict the size > you can set the GUC to. > > If we do include the dynahash patch, then I think it would also be > reasonable to change the formula for the default, to make it higher on > large systems. The restriction to 128 buffers (= 1MB) doesn't make > much sense on a high frequency OLTP system with 128GB of shared > buffers or even 4GB. I think "unleashing better defaults" would > actually be bigger news than the GUC for typical users, because > they'll just see PG14 use a few extra MB and go faster without having > to learn about these obscure new settings. I agree. 
I don't see why we would need to limit buffers to 128 in the presence of hash search.

> 5. Do we want to ship the simplehash patch?
>
> -0.5. It's a bit too exciting for the last minute, so I'd be inclined to wait until the next cycle to do some more research and testing. I know it's a better idea in the long run.

OK, obviously, it's the safer decision.

My TODO list:
1. Try to break patch set v13-[0001-0004]
2. Think how to measure performance of linear search versus hash search in SLRU buffer mapping.

Thanks!

Best regards, Andrey Borodin.
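For illustration of the hashing trade-off discussed above, here is a minimal, hypothetical sketch (not code from any of the posted patches) of mapping a page number to a lookup-table slot either directly or through a murmur-style finalizer; the finalizer is just a few shifts and multiplies, which is where the 10-100ns estimate comes from:

    #include <stdbool.h>
    #include <stdint.h>

    /* 32-bit murmur3 finalizer: cheap, inlineable, branch-free mixing. */
    static inline uint32_t
    hash_pageno(uint32_t pageno)
    {
        uint32_t h = pageno;

        h ^= h >> 16;
        h *= 0x85ebca6b;
        h ^= h >> 13;
        h *= 0xc2b2ae35;
        h ^= h >> 16;
        return h;
    }

    /* Map a page to a slot in a power-of-two sized lookup table. */
    static inline uint32_t
    slot_for_page(uint32_t pageno, uint32_t table_mask, bool use_hash)
    {
        /* Without the hash, pageno itself acts as a "perfect hash". */
        return (use_hash ? hash_pageno(pageno) : pageno) & table_mask;
    }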
> On 29 Mar 2021, at 11:26, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> My TODO list:
> 1. Try to break patch set v13-[0001-0004]
> 2. Think how to measure performance of linear search versus hash search in SLRU buffer mapping.

Hi Thomas!

I'm still doing my homework. And so far all I have caught is that "utils/dynahash.h" is not necessary.

I'm thinking about hash tables, and measuring performance near the optimum of linear search does not seem like a good idea now. It's impossible to prove that the difference is statistically significant on all platforms. But even on one platform measurements are just too noisy.

The shared buffers lookup table is indeed very similar to this SLRU lookup table. And it does not try to use more memory than needed. I could not find a pgbench-visible impact of growing the shared buffer lookup table. Obviously, because it's not a bottleneck on a regular workload. And it's hard to guess a representative pathological workload.

In fact, this thread started with a proposal to use a reader-writer lock for multis (instead of an exclusive lock), and this proposal encountered the same problem. It's very hard to create a stable reproduction of a pathological workload when this lock is heavily contended. Many people observed the problem, but still there is no open repro.

I bet it will be hard to prove that simplehash is any better than HTAB. But if it is really better, shared buffers could benefit from the same technique.

I think it's just fine to use HTAB with a normal size, as long as shared buffers do so. But there we allocate slightly more space: InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS). Don't we need to allocate nslots + 1? It seems that we always do SlruMappingRemove() before SlruMappingAdd() and it is not necessary.

Thanks!

Best regards, Andrey Borodin.
On Thu, Apr 1, 2021 at 10:09 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>> On 29 Mar 2021, at 11:26, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>> My TODO list:
>> 1. Try to break patch set v13-[0001-0004]
>> 2. Think how to measure performance of linear search versus hash search in SLRU buffer mapping.
>
> Hi Thomas!
> I'm still doing my homework. And so far all I have caught is that "utils/dynahash.h" is not necessary.

Thanks. Here's a new patch with that fixed, and also:

1. New names ("... Mapping Table" -> "... Lookup Table" in pg_shmem_allocations view, "clog_buffers" -> "xact_buffers") and a couple of typos fixed here and there.

2. Remove the cap of 128 buffers for xact_buffers as agreed. We still need a cap though, to avoid a couple of kinds of overflow inside slru.c, both when computing the default value and accepting a user-provided number. I introduced SLRU_MAX_ALLOWED_BUFFERS to keep it <= 1GB, and tested this on a 32 bit build with extreme block sizes. Likewise, I removed the cap of 16 buffers for commit_ts_buffers, but only if you have track_commit_timestamp enabled. It seems silly to waste 1MB per 1GB of shared_buffers on a feature you're not using. So the default is capped at 16 in that case to preserve existing behaviour, but otherwise can be set very large if you want. I think it's plausible that we'll want to make the multixact sizes adaptive too, but that might have to be a job for later. Likewise, I am sure that subtransaction-heavy workloads might be slower than they need to be due to the current default, but I have not done the research. With these new GUCs, people are free to experiment and develop theories about what the defaults should be in v15.

3. In the special case of xact_buffers, there is a lower cap of 512MB, because there's no point trying to cache more xids than there can be in existence, and that is computed by working backwards from CLOG_XACTS_PER_PAGE etc. It's not possible to do the same sort of thing for the other SLRUs without overflow problems, and it doesn't seem worth trying to fix that right now (1GB of cached commit timestamps ought to be enough for anyone™, while the theoretical maximum is 10 bytes for 2b xids = 20GB). To make this more explicit for people not following our discussion in detail: with shared_buffers=0...512MB, the behaviour doesn't change. But for shared_buffers=1GB you'll get twice as many xact_buffers as today (2MB instead of being capped at 1MB), and it keeps scaling linearly from there at 0.2%. In other words, all real world databases will get a boost in this department.

4. Change the default for commit_ts_buffers back to shared_buffers / 1024 (with a minimum of 4), because I think you might have changed it by a copy and paste error -- or did you intend to make the default higher?

5. Improve descriptions for the GUCs, visible in pg_settings view, to match the documentation for related wait events. So "for commit log SLRU" -> "for the transaction status SLRU cache", and similar corrections elsewhere. (I am tempted to try to find a better word than "SLRU", which doesn't seem relevant to users, but for now consistency is good.)

6. Added a callback so that SHOW xact_buffers and SHOW commit_ts_buffers display the real size in effect (instead of "0" for default). I tried running with xact_buffers=1 and soon saw why you changed it to interpret 1 the same as 0; with 1 you hit buffer starvation and get stuck.
I wish there were some way to say "the default for this GUC is 0, but if it's not zero then it's got to be at least 4". I didn't study the theoretical basis for the previous minimum value of 4, so I think we should keep it that way, so that if you say 3 you get 4. I thought it was better to express that like so:

    /* Use configured value if provided. */
    if (xact_buffers > 0)
        return Max(4, xact_buffers);
    return Min(CLOG_MAX_ALLOWED_BUFFERS, Max(4, NBuffers / 512));

> I'm thinking about hash tables, and measuring performance near the optimum of linear search does not seem like a good idea now. It's impossible to prove that the difference is statistically significant on all platforms. But even on one platform measurements are just too noisy.
>
> The shared buffers lookup table is indeed very similar to this SLRU lookup table. And it does not try to use more memory than needed. I could not find a pgbench-visible impact of growing the shared buffer lookup table. Obviously, because it's not a bottleneck on a regular workload. And it's hard to guess a representative pathological workload.

Thanks for testing. I agree that it's a good idea to follow the main buffer pool's approach for our first version of this. One small difference is that the SLRU patch performs the hash computation while it holds the lock. If we computed the hash first and used hash_search_with_hash_value(), we could compute it before we obtain the lock, like the main buffer pool. If we computed the hash value first, we could also ignore the rule in the documentation for hash_search_with_hash_value() that says that you must calculate it with get_hash_value(), and just call the hash function ourselves, so that it's fully inlinable. The same opportunity exists for the main buffer pool. That'd get you one of the micro-optimisations that simplehash.h offers. Whether it's worth bothering with, I don't know.

> In fact, this thread started with a proposal to use a reader-writer lock for multis (instead of an exclusive lock), and this proposal encountered the same problem. It's very hard to create a stable reproduction of a pathological workload when this lock is heavily contended. Many people observed the problem, but still there is no open repro.
>
> I bet it will be hard to prove that simplehash is any better than HTAB. But if it is really better, shared buffers could benefit from the same technique.

Agreed, there are a lot of interesting future projects in this area, when you compare the main buffer pool, these special buffer pools, and maybe also the "shared relfilenode" pool patch I have proposed for v15 (CF entry 2933). All have mapping tables and buffer replacement algorithms (why should they be different?), one has partitions, some have atomic-based header locks, they interlock with WAL differently (on page LSN, FPIs and checksum support), ... etc etc.

> I think it's just fine to use HTAB with a normal size, as long as shared buffers do so. But there we allocate slightly more space: InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS). Don't we need to allocate nslots + 1? It seems that we always do SlruMappingRemove() before SlruMappingAdd() and it is not necessary.

Yeah, we never try to add more elements than allowed, because we have a big lock around the mapping. The main buffer mapping table has a more concurrent design and might temporarily have one extra entry per partition.
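To make the locking order concrete, here is a rough sketch of the pattern described above (hash computed before the lock is taken, dynahash probed under the lock), with the table, entry, and lock names invented for illustration rather than taken from the patch:

    #include "postgres.h"
    #include "storage/lwlock.h"
    #include "utils/hsearch.h"

    /* Hypothetical SLRU mapping entry: pageno is the hash key. */
    typedef struct SlruMappingEntry
    {
        int     pageno;
        int     slotno;
    } SlruMappingEntry;

    static int
    slru_lookup_slot(HTAB *mapping, LWLock *control_lock, int pageno)
    {
        uint32  hashcode;
        bool    found;
        int     slotno = -1;
        SlruMappingEntry *entry;

        /* The hash computation can happen before we take the lock... */
        hashcode = get_hash_value(mapping, &pageno);

        /* ...so the critical section covers only the table probe. */
        LWLockAcquire(control_lock, LW_EXCLUSIVE);
        entry = hash_search_with_hash_value(mapping, &pageno, hashcode,
                                            HASH_FIND, &found);
        if (found)
            slotno = entry->slotno;
        LWLockRelease(control_lock);

        return slotno;
    }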
Attachment
> On 1 Apr 2021, at 06:40, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> 2. Remove the cap of 128 buffers for xact_buffers as agreed. We still need a cap though, to avoid a couple of kinds of overflow inside slru.c, both when computing the default value and accepting a user-provided number. I introduced SLRU_MAX_ALLOWED_BUFFERS to keep it <= 1GB, and tested this on a 32 bit build with extreme block sizes.

BTW we do not document maximum values right now.

I was toying around with big values. For example if we set different big xact_buffers we can get something like

FATAL: not enough shared memory for data structure "Notify" (72768 bytes requested)
FATAL: not enough shared memory for data structure "Async Queue Control" (2492 bytes requested)
FATAL: not enough shared memory for data structure "Checkpointer Data" (393280 bytes requested)

But never anything about xact_buffers. I don't think it's important, though.

> Likewise, I removed the cap of 16 buffers for commit_ts_buffers, but only if you have track_commit_timestamp enabled.

Is there a reason to leave 16 pages if commit_ts is disabled? They might be useful for some artefacts of previously enabled commit_ts?

> 4. Change the default for commit_ts_buffers back to shared_buffers / 1024 (with a minimum of 4), because I think you might have changed it by a copy and paste error -- or did you intend to make the default higher?

I changed the default due to some experiments with https://www.postgresql.org/message-id/flat/20210115220744.GA24457%40alvherre.pgsql
In fact, the most important part of that thread was removing the cap, which is done by the patchset now.

Thanks!

Best regards, Andrey Borodin.
> I was toying around with big values. For example if we set different big xact_buffers we can get something like
> FATAL: not enough shared memory for data structure "Notify" (72768 bytes requested)
> FATAL: not enough shared memory for data structure "Async Queue Control" (2492 bytes requested)
> FATAL: not enough shared memory for data structure "Checkpointer Data" (393280 bytes requested)
>
> But never anything about xact_buffers. I don't think it's important, though.
I had added the hash table size in SimpleLruShmemSize(), but then SimpleLruInit() passed that same value in when allocating the struct, so the struct was oversized. Oops. Fixed.
> > Likewise, I removed the cap of 16 buffers for commit_ts_buffers, but
> > only if you have track_commit_timestamp enabled.
> Is there a reason to leave 16 pages if commit_ts is disabled? They might be useful for some artefacts of previously enabled commit_ts?
Alvaro, do you have an opinion on that?
The remaining thing that bothers me about this patch set is that there is still a linear search in the replacement algorithm, and it runs with an exclusive lock. That creates a serious problem for large caches that still aren't large enough. I wonder if we can do something to improve that situation in the time we have. I considered a bunch of ideas but could only find one that fits with slru.c's simplistic locking while tracking recency. What do you think about a hybrid of SLRU and random replacement, that retains some characteristics of both? You could think of it as being a bit like the tournament selection of the genetic algorithm, with a tournament size of (say) 8 or 16. Any ideas on how to evaluate this and choose the number? See attached.
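As a standalone illustration of the tournament idea (this is not the attached patch, and the slot bookkeeping here is simplified), the victim search samples a handful of random slots and evicts the least recently used page among the sample, so the cost stays bounded no matter how large the cache is:

    #include <stdlib.h>

    #define TOURNAMENT_SIZE 8       /* or 16; the number being asked about */

    typedef struct SlotSketch
    {
        int             pageno;
        unsigned int    age;        /* higher = less recently used */
    } SlotSketch;

    /* Pick TOURNAMENT_SIZE random slots and return the oldest of them. */
    static int
    choose_victim(const SlotSketch *slots, int nslots)
    {
        int best = rand() % nslots;     /* real code would use a better PRNG */

        for (int i = 1; i < TOURNAMENT_SIZE; i++)
        {
            int cand = rand() % nslots;

            if (slots[cand].age > slots[best].age)
                best = cand;
        }
        return best;
    }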
Attachment
> On 7 Apr 2021, at 08:59, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> The remaining thing that bothers me about this patch set is that there is still a linear search in the replacement algorithm, and it runs with an exclusive lock. That creates a serious problem for large caches that still aren't large enough. I wonder if we can do something to improve that situation in the time we have. I considered a bunch of ideas but could only find one that fits with slru.c's simplistic locking while tracking recency. What do you think about a hybrid of SLRU and random replacement, that retains some characteristics of both? You could think of it as being a bit like the tournament selection of the genetic algorithm, with a tournament size of (say) 8 or 16. Any ideas on how to evaluate this and choose the number? See attached.
> <v15-0001-Add-a-buffer-mapping-table-for-SLRUs.patch><v15-0002-Make-all-SLRU-buffer-sizes-configurable.patch><v15-0003-Use-hybrid-random-SLRU-replacement-for-SLRUs.patch>

Maybe instead of a fully associative cache with random replacement we could use a 1-associative cache?
i.e. each page can reside only in one specific buffer slot. If there's something else - evict it.
I think this would be as efficient as an RR cache. And it's soooo fast.

Best regards, Andrey Borodin.
> On 7 Apr 2021, at 14:44, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> Maybe instead of a fully associative cache with random replacement we could use a 1-associative cache?
> i.e. each page can reside only in one specific buffer slot. If there's something else - evict it.
> I think this would be as efficient as an RR cache. And it's soooo fast.

I thought a bit more and understood that RR is protected from two competing pages in the working set, while a 1-associative cache is not. So, discard that idea.

Best regards, Andrey Borodin.
On Thu, Apr 8, 2021 at 12:13 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>> On 7 Apr 2021, at 14:44, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>> Maybe instead of a fully associative cache with random replacement we could use a 1-associative cache?
>> i.e. each page can reside only in one specific buffer slot. If there's something else - evict it.
>> I think this would be as efficient as an RR cache. And it's soooo fast.
>
> I thought a bit more and understood that RR is protected from two competing pages in the working set, while a 1-associative cache is not. So, discard that idea.

It's an interesting idea. I know that at least one proprietary fork just puts the whole CLOG in memory for direct indexing, which is what we'd have here if we said "oh, your xact_buffers setting is so large I'm just going to use slotno = pageno & mask".

Here's another approach that is a little less exciting than "tournament RR" (or whatever that should be called; I couldn't find an established name for it). This version is just our traditional linear search, except that it stops at 128, and remembers where to start from next time (like a sort of Fisher-Price GCLOCK hand). This feels more committable to me. You can argue that all buffers above 128 are bonus buffers that PostgreSQL 13 didn't have, so the fact that we can no longer find the globally least recently used page when you set xact_buffers > 128 doesn't seem too bad to me, as an incremental step (but to be clear, of course we can do better than this with more work in later releases).
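A sketch of that bounded search (illustrative only; it borrows the search_slotno and MAX_REPLACEMENT_SEARCH names from the patch discussion but omits page status checks and locking):

    #define MAX_REPLACEMENT_SEARCH 128

    typedef struct SlruSharedSketch
    {
        int             num_slots;
        int             search_slotno;      /* where the next search starts */
        unsigned int    cur_lru_count;
        unsigned int   *page_lru_count;     /* per-slot recency snapshots */
    } SlruSharedSketch;

    static int
    select_victim_slot(SlruSharedSketch *shared)
    {
        int     max_search = shared->num_slots < MAX_REPLACEMENT_SEARCH ?
                             shared->num_slots : MAX_REPLACEMENT_SEARCH;
        int     slotno = shared->search_slotno;
        int     best_slot = slotno;
        unsigned int best_age = 0;

        for (int i = 0; i < max_search; i++)
        {
            unsigned int age = shared->cur_lru_count - shared->page_lru_count[slotno];

            if (age > best_age)
            {
                best_age = age;
                best_slot = slotno;
            }
            if (++slotno >= shared->num_slots)
                slotno = 0;             /* wrap around like a clock hand */
        }
        shared->search_slotno = slotno; /* resume here next time */
        return best_slot;
    }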
Attachment
> On 8 Apr 2021, at 03:30, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> Here's another approach that is a little less exciting than "tournament RR" (or whatever that should be called; I couldn't find an established name for it). This version is just our traditional linear search, except that it stops at 128, and remembers where to start from next time (like a sort of Fisher-Price GCLOCK hand). This feels more committable to me. You can argue that all buffers above 128 are bonus buffers that PostgreSQL 13 didn't have, so the fact that we can no longer find the globally least recently used page when you set xact_buffers > 128 doesn't seem too bad to me, as an incremental step (but to be clear, of course we can do better than this with more work in later releases).

I agree that this version of eviction seems much more effective and less intrusive than RR. And it's still LRU, which is important for a subsystem that is called SLRU.

shared->search_slotno is initialized implicitly with memset(). But this seems like a common practice.

Also the comment above "max_search = Min(shared->num_slots, MAX_REPLACEMENT_SEARCH);" does not reflect the changes.

Besides this, the patch looks good to me.

Thanks!

Best regards, Andrey Borodin.
On Thu, Apr 8, 2021 at 7:24 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> I agree that this version of eviction seems much more effective and less intrusive than RR. And it's still LRU, which is important for a subsystem that is called SLRU.
> shared->search_slotno is initialized implicitly with memset(). But this seems like a common practice.
> Also the comment above "max_search = Min(shared->num_slots, MAX_REPLACEMENT_SEARCH);" does not reflect the changes.
>
> Besides this, the patch looks good to me.

Thanks! I chickened out of committing a buffer replacement algorithm patch written 11 hours before the feature freeze, but I also didn't really want to commit the GUC patch without that. Ahh, if only we'd latched onto the real problems here just a little sooner, but there is always PostgreSQL 15, I heard it's going to be amazing. Moved to next CF.
> On 8 Apr 2021, at 15:22, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Thu, Apr 8, 2021 at 7:24 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>> I agree that this version of eviction seems much more effective and less intrusive than RR. And it's still LRU, which is important for a subsystem that is called SLRU.
>> shared->search_slotno is initialized implicitly with memset(). But this seems like a common practice.
>> Also the comment above "max_search = Min(shared->num_slots, MAX_REPLACEMENT_SEARCH);" does not reflect the changes.
>>
>> Besides this, the patch looks good to me.
>
> Thanks! I chickened out of committing a buffer replacement algorithm patch written 11 hours before the feature freeze, but I also didn't really want to commit the GUC patch without that. Ahh, if only we'd latched onto the real problems here just a little sooner, but there is always PostgreSQL 15, I heard it's going to be amazing. Moved to next CF.

I have one more idea inspired by CPU caches.
Let's make SLRU n-associative, where n ~ 8.
We can divide buffers into "banks"; the number of banks must be a power of 2.
All banks are of equal size. We choose the bank size to approximately satisfy the user's configured buffer size.
Each page can live only within one bank. We use the same search and eviction algorithms as we used in SLRU, but we only need to search\evict over 8 elements.
All SLRU data of a single bank will be colocated within at most 2 cache lines.

I did not come up with an idea how to avoid the multiplication of bank_number * bank_size in case the user configured 31337 buffers (any number that is radically not a power of 2).

PFA patch implementing this idea.

Best regards, Andrey Borodin.
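A small sketch of the bank layout just described (illustrative, not the attached patch): with a power-of-two number of banks, the bank is selected by masking the page number, and lookup or eviction only ever touches the 8 slots of that bank.

    #define BANK_SIZE 8

    typedef struct BankedSlruSketch
    {
        int     bank_mask;      /* number of banks - 1; banks are a power of 2 */
        int    *page_numbers;   /* (bank_mask + 1) * BANK_SIZE entries */
    } BankedSlruSketch;

    static int
    find_page_in_bank(const BankedSlruSketch *slru, int pageno)
    {
        int     bankno = pageno & slru->bank_mask;
        int     bankstart = bankno * BANK_SIZE;     /* the multiplication in question */

        for (int slotno = bankstart; slotno < bankstart + BANK_SIZE; slotno++)
        {
            if (slru->page_numbers[slotno] == pageno)
                return slotno;
        }
        return -1;      /* miss: the caller evicts within the same bank */
    }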
Attachment
> PFA patch implementing this idea.
begin;
select :aid from pgbench_accounts where aid = :aid for update;
update pgbench_accounts set abalance = abalance + 1 where aid = :aid;
update pgbench_accounts set abalance = abalance * 2 where aid = :aid;
update pgbench_accounts set abalance = abalance - 2 where aid = :aid;
end;
multixact_members_buffers_64Kb 693.2
multixact_members_buffers_128Kb 691.4
multixact_members_buffers_192Kb 696.3
multixact_members_buffers_256Kb 694.4
multixact_members_buffers_320Kb 692.3
multixact_members_buffers_448Kb 693.7
multixact_members_buffers_512Kb 693.3
vanilla 676.1
>> On 8 Apr 2021, at 15:22, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> I have one more idea inspired by CPU caches.
> Let's make SLRU n-associative, where n ~ 8.
> We can divide buffers into "banks"; the number of banks must be a power of 2.
> All banks are of equal size. We choose the bank size to approximately satisfy the user's configured buffer size.
> Each page can live only within one bank. We use the same search and eviction algorithms as we used in SLRU, but we only need to search\evict over 8 elements.
> All SLRU data of a single bank will be colocated within at most 2 cache lines.
>
> I did not come up with an idea how to avoid the multiplication of bank_number * bank_size in case the user configured 31337 buffers (any number that is radically not a power of 2).

We can avoid this multiplication by using gapped memory under SLRU page_statuses, but from my POV here the complexity is not worth the possible performance gain.

PFA rebase of the patchset. Also I've added a patch to combine page_number, page_status, and page_dirty together to touch less cachelines.

Best regards, Andrey Borodin.
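For illustration, one way the packed per-page entry could look in plain C is sketched below. The field names follow the patch (they show up in the cfbot compile errors later in the thread), but the exact types are assumptions of this sketch; notably, standard C has no sized enums, which is what the "enum SlruPageStatus:int16_t" syntax in the posted patch trips over.

    #include <stdbool.h>
    #include <stdint.h>

    typedef enum SlruPageStatus
    {
        SLRU_PAGE_EMPTY,
        SLRU_PAGE_READ_IN_PROGRESS,
        SLRU_PAGE_VALID,
        SLRU_PAGE_WRITE_IN_PROGRESS
    } SlruPageStatus;

    /* Number, status and dirty flag packed into one small entry. */
    typedef struct SlruPageEntry
    {
        int32_t pageno;
        int16_t page_status;    /* holds a SlruPageStatus value */
        bool    page_dirty;
    } SlruPageEntry;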
Attachment
Hi,

On Sun, Dec 26, 2021 at 03:09:59PM +0500, Andrey Borodin wrote:
> PFA rebase of the patchset. Also I've added a patch to combine page_number, page_status, and page_dirty together to touch less cachelines.

The cfbot reports some errors on the latest version of the patch:

https://cirrus-ci.com/task/6121317215764480

[04:56:38.432] su postgres -c "make -s -j${BUILD_JOBS} world-bin"
[04:56:48.270] In file included from async.c:134:
[04:56:48.270] ../../../src/include/access/slru.h:47:28: error: expected identifier or ‘(’ before ‘:’ token
[04:56:48.270] 47 | typedef enum SlruPageStatus:int16_t
[04:56:48.270] | ^
[04:56:48.270] ../../../src/include/access/slru.h:53:3: warning: data definition has no type or storage class
[04:56:48.270] 53 | } SlruPageStatus;
[04:56:48.270] | ^~~~~~~~~~~~~~
[04:56:48.270] ../../../src/include/access/slru.h:53:3: warning: type defaults to ‘int’ in declaration of ‘SlruPageStatus’ [-Wimplicit-int]
[04:56:48.270] ../../../src/include/access/slru.h:58:2: error: expected specifier-qualifier-list before ‘SlruPageStatus’
[04:56:48.270] 58 | SlruPageStatus page_status;
[04:56:48.270] | ^~~~~~~~~~~~~~
[04:56:48.270] async.c: In function ‘asyncQueueAddEntries’:
[04:56:48.270] async.c:1448:41: error: ‘SlruPageEntry’ has no member named ‘page_dirty’
[04:56:48.270] 1448 | NotifyCtl->shared->page_entries[slotno].page_dirty = true;
[04:56:48.270] | ^
[04:56:48.271] make[3]: *** [<builtin>: async.o] Error 1
[04:56:48.271] make[3]: *** Waiting for unfinished jobs....
[04:56:48.297] make[2]: *** [common.mk:39: commands-recursive] Error 2
[04:56:48.297] make[2]: *** Waiting for unfinished jobs....
[04:56:54.554] In file included from clog.c:36:
[04:56:54.554] ../../../../src/include/access/slru.h:47:28: error: expected identifier or ‘(’ before ‘:’ token
[04:56:54.554] 47 | typedef enum SlruPageStatus:int16_t
[04:56:54.554] | ^
[04:56:54.554] ../../../../src/include/access/slru.h:53:3: warning: data definition has no type or storage class
[04:56:54.554] 53 | } SlruPageStatus;
[04:56:54.554] | ^~~~~~~~~~~~~~
[04:56:54.554] ../../../../src/include/access/slru.h:53:3: warning: type defaults to ‘int’ in declaration of ‘SlruPageStatus’ [-Wimplicit-int]
[04:56:54.554] ../../../../src/include/access/slru.h:58:2: error: expected specifier-qualifier-list before ‘SlruPageStatus’
[04:56:54.554] 58 | SlruPageStatus page_status;
[04:56:54.554] | ^~~~~~~~~~~~~~
[04:56:54.554] clog.c: In function ‘TransactionIdSetPageStatusInternal’:
[04:56:54.554] clog.c:396:39: error: ‘SlruPageEntry’ has no member named ‘page_dirty’
[04:56:54.554] 396 | XactCtl->shared->page_entries[slotno].page_dirty = true;
[04:56:54.554] | ^
[04:56:54.554] In file included from ../../../../src/include/postgres.h:46,
[04:56:54.554] from clog.c:33:
[04:56:54.554] clog.c: In function ‘BootStrapCLOG’:
[04:56:54.554] clog.c:716:47: error: ‘SlruPageEntry’ has no member named ‘page_dirty’
[04:56:54.554] 716 | Assert(!XactCtl->shared->page_entries[slotno].page_dirty);
[04:56:54.554] | ^
[04:56:54.554] ../../../../src/include/c.h:848:9: note: in definition of macro ‘Assert’
[04:56:54.554] 848 | if (!(condition)) \
[04:56:54.554] | ^~~~~~~~~
[04:56:54.554] clog.c: In function ‘TrimCLOG’:
[04:56:54.554] clog.c:801:40: error: ‘SlruPageEntry’ has no member named ‘page_dirty’
[04:56:54.554] 801 | XactCtl->shared->page_entries[slotno].page_dirty = true;
[04:56:54.554] | ^
[04:56:54.554] In file included from ../../../../src/include/postgres.h:46,
[04:56:54.554] from clog.c:33:
[04:56:54.554] clog.c: In function ‘clog_redo’:
[04:56:54.554] clog.c:997:48: error: ‘SlruPageEntry’ has no member named ‘page_dirty’
[04:56:54.554] 997 | Assert(!XactCtl->shared->page_entries[slotno].page_dirty);
[04:56:54.554] | ^
[04:56:54.554] ../../../../src/include/c.h:848:9: note: in definition of macro ‘Assert’
[04:56:54.554] 848 | if (!(condition)) \
[04:56:54.554] | ^~~~~~~~~
[04:56:54.555] make[4]: *** [<builtin>: clog.o] Error 1
[04:56:54.555] make[3]: *** [../../../src/backend/common.mk:39: transam-recursive] Error 2
[04:56:54.555] make[3]: *** Waiting for unfinished jobs....
[04:56:56.405] make[2]: *** [common.mk:39: access-recursive] Error 2
[04:56:56.405] make[1]: *** [Makefile:42: all-backend-recurse] Error 2
[04:56:56.405] make: *** [GNUmakefile:21: world-bin-src-recurse] Error 2
[04:56:56.407]
[04:56:56.407] Exit status: 2

Could you send a new version? In the meantime I will switch the patch status to Waiting on Author.
On Fri, Jan 14, 2022 at 05:28:38PM +0800, Julien Rouhaud wrote: > > PFA rebase of the patchset. Also I've added a patch to combine > > page_number, page_status, and page_dirty together to touch less > > cachelines. > > The cfbot reports some errors on the latest version of the patch: > > https://cirrus-ci.com/task/6121317215764480 > [...] > Could you send a new version? In the meantime I will switch the patch > status > to Waiting on Author. > I was planning on running a set of stress tests on these patches. Could we confirm which ones we plan to include in the commitfest? -- Shawn Debnath Amazon Web Services (AWS)
> On 15 Jan 2022, at 03:20, Shawn Debnath <sdn@amazon.com> wrote:
>
> On Fri, Jan 14, 2022 at 05:28:38PM +0800, Julien Rouhaud wrote:
>>> PFA rebase of the patchset. Also I've added a patch to combine page_number, page_status, and page_dirty together to touch less cachelines.
>>
>> The cfbot reports some errors on the latest version of the patch:
>>
>> https://cirrus-ci.com/task/6121317215764480
>> [...]
>> Could you send a new version? In the meantime I will switch the patch status to Waiting on Author.
>
> I was planning on running a set of stress tests on these patches. Could we confirm which ones we plan to include in the commitfest?

Many thanks for your interest. Here's the latest version.

Best regards, Andrey Borodin.
Attachment
On Sat, Jan 15, 2022 at 12:16:59PM +0500, Andrey Borodin wrote:
>> On 15 Jan 2022, at 03:20, Shawn Debnath <sdn@amazon.com> wrote:
>> On Fri, Jan 14, 2022 at 05:28:38PM +0800, Julien Rouhaud wrote:
>>>> PFA rebase of the patchset. Also I've added a patch to combine page_number, page_status, and page_dirty together to touch less cachelines.
>>>
>>> The cfbot reports some errors on the latest version of the patch:
>>>
>>> https://cirrus-ci.com/task/6121317215764480
>>> [...]
>>> Could you send a new version? In the meantime I will switch the patch status to Waiting on Author.
>>
>> I was planning on running a set of stress tests on these patches. Could we confirm which ones we plan to include in the commitfest?
>
> Many thanks for your interest. Here's the latest version.

This is failing to compile under linux and windows due to bitfield syntax.
http://cfbot.cputube.org/andrey-borodin.html

And compile warnings:

slru.c: In function ‘SlruAdjustNSlots’:
slru.c:161:2: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
161 | int nbanks = 1;
| ^~~
slru.c: In function ‘SimpleLruReadPage_ReadOnly’:
slru.c:530:2: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
530 | int bankstart = (pageno & shared->bank_mask) * shared->bank_size;
| ^~~

Note that you can test the CI result using any github account without waiting for the cfbot. See ./src/tools/ci/README.

-- Justin
> On 15 Jan 2022, at 20:46, Justin Pryzby <pryzby@telsasoft.com> wrote:
>>>
>>> I was planning on running a set of stress tests on these patches. Could we confirm which ones we plan to include in the commitfest?
>>
>> Many thanks for your interest. Here's the latest version.
>
> This is failing to compile under linux and windows due to bitfield syntax.
> http://cfbot.cputube.org/andrey-borodin.html

Uh, sorry, I formatted a patch from the wrong branch.

Just tested Cirrus. It's wonderful, thanks! Really faster than doing stuff on my machines...

Best regards, Andrey Borodin.
Attachment
On Sat, Jan 15, 2022 at 12:16:59PM +0500, Andrey Borodin wrote:
>> I was planning on running a set of stress tests on these patches. Could we confirm which ones we plan to include in the commitfest?
>
> Many thanks for your interest. Here's the latest version.

Here are the results of the multixact perf test I ran on the patch that splits the linear SLRU caches into banks. With my test setup, the binaries with the patch applied performed marginally slower across the test matrix against unpatched binaries. Here are the results:

+-------------------------------+---------------------+-----------------------+------------+
| workload                      | patched average tps | unpatched average tps | difference |
+-------------------------------+---------------------+-----------------------+------------+
| create only                   | 10250.54396         | 10349.67487           | -1.0%      |
| create and select             | 9677.711286         | 9991.065037           | -3.2%      |
| large cache create only       | 10310.96646         | 10337.16455           | -0.3%      |
| large cache create and select | 9654.24077          | 9924.270242           | -2.8%      |
+-------------------------------+---------------------+-----------------------+------------+

The test was configured in the following manner:

- AWS EC2 c5d.24xlarge instances, located in the same AZ, were used as the database host and the test driver. These systems have 96 vcpus and 184 GB memory. NVMe drives were configured as RAID5.
- GUCs were changed from defaults to be the following:
  max_connections = 5000
  shared_buffers = 96GB
  max_wal_size = 2GB
  min_wal_size = 192MB
- pgbench runs were done with -c 1000 -j 1000 and a scale of 10,000
- Two multixact workloads were tested: the first [0] was a create-only script which selected 100 pgbench_account rows for share. The second workload [1] added a select statement to visit rows touched in the past which had multixacts generated for them. The pgbench test script [2] wraps the call to the functions inside an explicit transaction.
- Large cache tests are with the multixact offsets cache size hard coded to 128 and the members cache size hard coded to 256.
- Row selection is based on a time-based approach that lets all client connections coordinate which rows to work with based on the millisecond they start executing. To allow for more multixacts to be generated and reduce contention, the workload uses offsets ahead of the start id based on a random number.
- The one bummer about these runs was that they only ran for 600 seconds for insert only and 400 seconds for insert and select. I consistently ran into checkpointer getting oom-killed on this instance after that timeframe. Will dig into this separately. But the TPS was consistent.
- Each test was repeated at least 3 times and the average of those runs was used.
- I am using the master branch and changes were applied on commit f47ed79cc8a0cfa154dc7f01faaf59822552363f

I think patch 1 is a must-have. Regarding patch 2, I would propose we avoid introducing more complexity into the SimpleLRU cache and instead focus on making the SLRU to buffer cache effort [3] a reality. I would also add that we have a few customers in our fleet who have been successfully running the large cache configuration on the regular SLRU without any issues. With cache sizes this small, the linear searches are still quite efficient.

If my test workload can be made better, please let me know. Happy to re-run tests as needed.
[0] https://gist.github.com/sdebnath/e015561811adf721dd40dd6638969c69 [1] https://gist.github.com/sdebnath/2f3802e1fe288594b6661a7a59a7ca07 [2] https://gist.github.com/sdebnath/6bbfd5f87945a7d819e30a9a1701bc97 [3] https://www.postgresql.org/message-id/CA%2BhUKGKAYze99B-jk9NoMp-2BDqAgiRC4oJv%2BbFxghNgdieq8Q%40mail.gmail.com -- Shawn Debnath Amazon Web Services (AWS)
> On 20 Jan 2022, at 20:44, Shawn Debnath <sdn@amazon.com> wrote:
>
> If my test workload can be made better, please let me know. Happy to re-run tests as needed.

Shawn, thanks for the benchmarking!

Can you please also test the 2nd patch against large multixact SLRUs?
The 2nd patch is not intended to make things better at default buffer sizes. It must preserve performance in the case of really huge SLRU buffers.

Thanks!

Best regards, Andrey Borodin.
On Thu, Jan 20, 2022 at 09:21:24PM +0500, Andrey Borodin wrote:
>> On 20 Jan 2022, at 20:44, Shawn Debnath <sdn@amazon.com> wrote:
> Can you please also test the 2nd patch against large multixact SLRUs?
> The 2nd patch is not intended to make things better at default buffer sizes. It must preserve performance in the case of really huge SLRU buffers.

The test was performed on 128/256 for the multixact offset/members cache as stated in my previous email. Sure, I can test it for higher values - but what's a real world value that would make sense? We have been using this configuration successfully for a few of our customers that ran into MultiXact contention.

-- Shawn Debnath Amazon Web Services (AWS)
> On 21 Jan 2022, at 05:19, Shawn Debnath <sdn@amazon.com> wrote:
>
> On Thu, Jan 20, 2022 at 09:21:24PM +0500, Andrey Borodin wrote:
>>> On 20 Jan 2022, at 20:44, Shawn Debnath <sdn@amazon.com> wrote:
>> Can you please also test the 2nd patch against large multixact SLRUs?
>> The 2nd patch is not intended to make things better at default buffer sizes. It must preserve performance in the case of really huge SLRU buffers.
>
> The test was performed on 128/256 for the multixact offset/members cache as stated in my previous email. Sure, I can test it for higher values - but what's a real world value that would make sense? We have been using this configuration successfully for a few of our customers that ran into MultiXact contention.

Sorry, seems like I misinterpreted the results yesterday.

I had one concern about the 1st patch step: it makes the CLOG buffer size dependent on shared_buffers. But in your tests you seem to have already exercised xact_buffers = 24576 without noticeable degradation. Is that correct? I doubt a little bit that a linear search among 24K elements on each CLOG access does not incur a performance impact, but your tests seem to prove it.

IMV splitting the SLRU buffer into banks would make sense for values greater than 1<<10. But you are right that 256 seems enough to cope with most of the problems with multixacts so far. I just thought about stressing SLRU buffers with multixacts to be sure that CLOG buffers will not suffer degradation. But yes, it's too indirect a test.

Maybe, just to be sure, let's repeat the tests with autovacuum turned off to stress xact_buffers?

Thanks!

Best regards, Andrey Borodin.
> On 8 Apr 2021, at 17:22, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> Thanks! I chickened out of committing a buffer replacement algorithm patch written 11 hours before the feature freeze, but I also didn't really want to commit the GUC patch without that. Ahh, if only we'd latched onto the real problems here just a little sooner, but there is always PostgreSQL 15, I heard it's going to be amazing. Moved to next CF.

Hi Thomas!

There's a feature freeze approaching again. I see that you are working on moving SLRUs to buffer pools, but it is not clear in which PG version that will land. And there is 100% consensus that the first patch is useful and helps to prevent big issues. Maybe let's commit the 1st step without lifting the default xact_buffers limit? Or the 1st patch as-is with any simple technique that prevents linear search in SLRU buffers.

Best regards, Andrey Borodin.
On Sat, Feb 19, 2022 at 6:34 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> There's a feature freeze approaching again. I see that you are working on moving SLRUs to buffer pools, but it is not clear in which PG version that will land. And there is 100% consensus that the first patch is useful and helps to prevent big issues. Maybe let's commit the 1st step without lifting the default xact_buffers limit? Or the 1st patch as-is with any simple technique that prevents linear search in SLRU buffers.

Hi Andrey,

Yeah, the SLRU/buffer pool thing would be complete enough to propose for 16 at the earliest. I posted the early prototype to see what sort of reaction the concept would get before doing more work; I know others have investigated this topic too... maybe it can encourage more patches, experimental results, ideas to be shared... but this is not relevant for 15.

Back to this patch: assuming we can settle on a good-enough-for-now replacement algorithm, do we want to add this set of 7 GUCs? Does anyone else want to weigh in on that? Concretely, this patch adds:

multixact_offsets_buffers
multixact_members_buffers
subtrans_buffers
notify_buffers
serial_buffers
xact_buffers
commit_ts_buffers

I guess the people at https://ottertune.com/blog/postgresql-knobs-list/ would be happy if we did. Hopefully we'd drop the settings in a future release once we figure out the main buffer pool thing (or some other scheme to automate sizing).
On 2022-02-20 10:38:53 +1300, Thomas Munro wrote: > Back to this patch: assuming we can settle on a good-enough-for-now > replacement algorithm, do we want to add this set of 7 GUCs? Does > anyone else want to weigh in on that? I'm -0.2 on it, given that we have a better path forward.
> On 20 Feb 2022, at 02:42, Andres Freund <andres@anarazel.de> wrote:
>
> On 2022-02-20 10:38:53 +1300, Thomas Munro wrote:
>> Back to this patch: assuming we can settle on a good-enough-for-now replacement algorithm, do we want to add this set of 7 GUCs? Does anyone else want to weigh in on that?
>
> I'm -0.2 on it, given that we have a better path forward.

That's a really good path forward, but it has been discussed for at least 3.5 years [0], and is guaranteed not to be there until 2023. Gilles, Shawn, and Dmitry expressed opinions in line with the patch being "a must-have", referring to real pathological performance degradation inflicted by SLRU cache starvation. And I can remember a dozen other incidents that would not have happened if the patch had been applied, e.g. this post refers to the patch as a cure [1].

Best regards, Andrey Borodin.

[0] https://www.postgresql.org/message-id/flat/20180814213500.GA74618%4060f81dc409fc.ant.amazon.com
[1] https://about.gitlab.com/blog/2021/09/29/why-we-spent-the-last-month-eliminating-postgresql-subtransactions/#what-can-we-do-about-getting-rid-of-nessie
> On 20 Feb 2022, at 02:38, Thomas Munro <thomas.munro@gmail.com> wrote: > > Back to this patch: assuming we can settle on a good-enough-for-now > replacement algorithm, do we want to add this set of 7 GUCs? Does > anyone else want to weigh in on that? Hi Thomas! It seems we don't have any other input besides reviews and Andres's -0.2. Is there a chance to proceed? Best regards, Andrey Borodin.
Good day, all.

I did a benchmark of the patch on a 2-socket Xeon 5220 CPU @ 2.20GHz. I used the "benchmark" we use to reproduce problems with SLRU on our customers' setups. In contrast to Shawn's tests I concentrated on the bad case: a lot of contention.

slru-funcs.sql - function definitions
- functions create a lot of subtransactions to stress subtrans
- and select random rows for share to stress multixacts
slru-call.sql - function call for benchmark
slru-ballast.sql - randomly select 1000 consequent rows "for update skip locked" to stress multixacts

patch1 - make SLRU buffers configurable
patch2 - make "8-associative banks"

Benchmark done by pgbench. Inited with scale 1 to induce contention:
pgbench -i -s 1 testdb

Benchmark 1:
- low number of connections (50), 60% slru-call, 40% slru-ballast
pgbench -f slru-call.sql@60 -f slru-ballast.sql@40 -c 50 -j 75 -P 1 -T 30 testdb

version | subtrans | multixact | tps
        | buffers  | offs/memb | func+ballast
--------+----------+-----------+--------------
master  | 32       | 8/16      | 184+119
patch1  | 32       | 8/16      | 184+119
patch1  | 1024     | 8/16      | 121+77
patch1  | 1024     | 512/1024  | 118+75
patch2  | 32       | 8/16      | 190+122
patch2  | 1024     | 8/16      | 190+125
patch2  | 1024     | 512/1024  | 190+127

As you see, this test case degrades with a dumb increase of SLRU buffers. But the use of a "hash table" in the form of "associative buckets" keeps performance stable.

Benchmark 2:
- high connection number (600), 98% slru-call, 2% slru-ballast
pgbench -f slru-call.sql@98 -f slru-ballast.sql@2 -c 600 -j 75 -P 1 -T 30 testdb

I don't paste "ballast" tps here since 2% makes them too small, and they're very noisy.

version | subtrans | multixact | tps
        | buffers  | offs/memb | func
--------+----------+-----------+------
master  | 32       | 8/16      | 13
patch1  | 32       | 8/16      | 13
patch1  | 1024     | 8/16      | 31
patch1  | 1024     | 512/1024  | 53
patch2  | 32       | 8/16      | 12
patch2  | 1024     | 8/16      | 34
patch2  | 1024     | 512/1024  | 67

In this case a simple buffer increase does help. But "buckets" increase the performance gain.

I didn't paste here results of the third part of the patch ("Pack SLRU...") because I didn't see any major performance gain from it, and it consumes a large part of the patch diff. Rebased versions of the first two patch parts are attached.

regards, Yura Sokolov
Attachment
> On 21 Jul 2022, at 18:00, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
>
> In this case a simple buffer increase does help. But "buckets" increase the performance gain.

Yura, thank you for your benchmarks!

We already knew that the patch can save the day on pathological workloads; now we have proof of this.

Also there's evidence that a user can blindly increase the size of an SLRU if they want (with the second patch). So there's no need for hard explanations on how to tune the buffer size.

Thomas, do you still have any doubts? Or is it certain that SLRU will be replaced by any better subsystem in 16?

Best regards, Andrey Borodin.
On Sat, Jul 23, 2022 at 8:41 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote: > Thomas, do you still have any doubts? Or is it certain that SLRU will be replaced by any better subsystem in 16? Hi Andrey, Sorry for my lack of replies on this and the other SLRU thread -- I'm thinking and experimenting. More soon.
> Andrey Borodin wrote 2022-07-23 11:39:
>
> Yura, thank you for your benchmarks!
> We already knew that the patch can save the day on pathological workloads; now we have proof of this.
> Also there's evidence that a user can blindly increase the size of an SLRU if they want (with the second patch). So there's no need for hard explanations on how to tune the buffer size.

Hi @Andrey.Borodin, with some considerations and performance checks from @Yura.Sokolov we simplified your approach by the following:

1. Preamble. We feel free to increase any SLRUs, since there's no performance degradation with large buffer counts using your SLRU buckets solution.
2. `slru_buffers_size_scale` is the only config param introduced, for all SLRUs. It scales the SLRUs' upper cap by a power of 2.
3. All SLRU buffer counts are capped by both `MBuffers (shared_buffers)` and `slru_buffers_size_scale`.
4. Magic initial constants `NUM_*_BUFFERS << slru_buffers_size_scale` are applied for every SLRU.
5. All SLRU buffers are always sized as a power of 2, and their hash bucket size is always 8.

There's an attached patch for your consideration. It gathers and simplifies both `v21-0001-Make-all-SLRU-buffer-sizes-configurable.patch` and `v21-0002-Divide-SLRU-buffers-into-8-associative-banks.patch` into a much simpler approach.

Thank you, Yours, - Ivan
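To make the sizing rule concrete, here is an illustrative sketch; the GUC name comes from the patch, but the helper function, the shared_buffers cap divisor and the floor of one 8-slot bank are assumptions of this sketch, not the patch's exact arithmetic:

    static int
    scaled_slru_buffers(int base_nbuffers,              /* e.g. NUM_MULTIXACTOFFSET_BUFFERS */
                        int slru_buffers_size_scale,    /* the proposed GUC */
                        int NBuffers)                   /* shared_buffers, in pages */
    {
        int     nbuffers = base_nbuffers << slru_buffers_size_scale;

        /* Cap by a fraction of shared_buffers (divisor chosen for illustration). */
        if (nbuffers > NBuffers / 256)
            nbuffers = NBuffers / 256;

        /* Round down to a power of two so 8-slot banks divide it evenly. */
        while (nbuffers & (nbuffers - 1))
            nbuffers &= nbuffers - 1;

        return nbuffers < 8 ? 8 : nbuffers;
    }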
Attachment
> On 17 Aug 2022, at 00:36, i.lazarev@postgrespro.ru wrote:
>
>> Andrey Borodin wrote 2022-07-23 11:39:
>> Yura, thank you for your benchmarks!
>> We already knew that the patch can save the day on pathological workloads; now we have proof of this.
>> Also there's evidence that a user can blindly increase the size of an SLRU if they want (with the second patch). So there's no need for hard explanations on how to tune the buffer size.
>
> Hi @Andrey.Borodin, with some considerations and performance checks from @Yura.Sokolov we simplified your approach by the following:
>
> 1. Preamble. We feel free to increase any SLRUs, since there's no performance degradation with large buffer counts using your SLRU buckets solution.
> 2. `slru_buffers_size_scale` is the only config param introduced, for all SLRUs. It scales the SLRUs' upper cap by a power of 2.
> 3. All SLRU buffer counts are capped by both `MBuffers (shared_buffers)` and `slru_buffers_size_scale`.
> 4. Magic initial constants `NUM_*_BUFFERS << slru_buffers_size_scale` are applied for every SLRU.
> 5. All SLRU buffers are always sized as a power of 2, and their hash bucket size is always 8.
>
> There's an attached patch for your consideration. It gathers and simplifies both `v21-0001-Make-all-SLRU-buffer-sizes-configurable.patch` and `v21-0002-Divide-SLRU-buffers-into-8-associative-banks.patch` into a much simpler approach.

I like the idea of one knob instead of one per each SLRU. Maybe we could even deduce a sane value from NBuffers? That would effectively lead to 0 knobs :)

Your patch has a prefix "v22-0006"; does it mean there are 5 previous steps of the patchset?

Thank you!

Best regards, Andrey Borodin.
Andrey Borodin wrote 2022-08-18 06:35:
>
> I like the idea of one knob instead of one per each SLRU. Maybe we could even deduce a sane value from NBuffers? That would effectively lead to 0 knobs :)
>
> Your patch has a prefix "v22-0006"; does it mean there are 5 previous steps of the patchset?
>
> Thank you!
>
> Best regards, Andrey Borodin.

Not sure it's possible to deduce it from NBuffers only. slru_buffers_scale_shift looks like a relief valve for systems with ultra-scaled NBuffers.

Regarding v22-0006, I just tried to choose an index unique to this thread, so now it's fixed to 0001 indexing.
Attachment
On Sat, Jul 23, 2022 at 1:48 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Sat, Jul 23, 2022 at 8:41 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote: > > Thomas, do you still have any doubts? Or is it certain that SLRU will be replaced by any better subsystem in 16? > > Hi Andrey, > > Sorry for my lack of replies on this and the other SLRU thread -- I'm > thinking and experimenting. More soon. > Hi Thomas, PostgreSQL 16 feature freeze is approaching again. Let's choose something from possible solutions, even if the chosen one is temporary. Thank you! Best regards, Andrey Borodin.
On Fri, 19 Aug 2022 at 21:18, <i.lazarev@postgrespro.ru> wrote:
>
> Andrey Borodin wrote 2022-08-18 06:35:
>>
>> I like the idea of one knob instead of one per each SLRU. Maybe we could even deduce a sane value from NBuffers? That would effectively lead to 0 knobs :)
>>
>> Your patch has a prefix "v22-0006", does it mean there are 5 previous steps of the patchset?
>>
>> Thank you!
>>
>> Best regards, Andrey Borodin.
>
> Not sure it's possible to deduce from NBuffers only. slru_buffers_scale_shift looks like a relief valve for systems with ultra-scaled NBuffers.
>
> Regarding v22-0006 I just tried to choose an index unique to this thread so now it's fixed to 0001 indexing.

The patch does not apply on top of HEAD as in [1], please post a rebased patch:

=== Applying patches on top of PostgreSQL commit ID 325bc54eed4ea0836a0bb715bb18342f0c1c668a ===
=== applying patch ./v23-0001-bucketed-SLRUs-simplified.patch
patching file src/include/miscadmin.h
...
patching file src/backend/utils/misc/guc.c
Hunk #1 FAILED at 32.
Hunk #2 FAILED at 2375.
2 out of 2 hunks FAILED -- saving rejects to file src/backend/utils/misc/guc.c.rej

[1] - http://cfbot.cputube.org/patch_41_2627.log

Regards, Vignesh
On Tue, Jan 3, 2023 at 5:02 AM vignesh C <vignesh21@gmail.com> wrote: > does not apply on top of HEAD as in [1], please post a rebased patch: > Thanks! Here's the rebase. Best regards, Andrey Borodin.
Attachment
On Mon, Jan 9, 2023 at 9:49 AM Andrey Borodin <amborodin86@gmail.com> wrote:
>
> On Tue, Jan 3, 2023 at 5:02 AM vignesh C <vignesh21@gmail.com> wrote:
>> does not apply on top of HEAD as in [1], please post a rebased patch:
>
> Thanks! Here's the rebase.

I was looking into this patch; it seems like three different optimizations are squeezed into a single patch: 1) dividing the buffer space into banks to reduce the sequential search cost, 2) a GUC parameter for the buffer size scale, and 3) some of the buffer size values are modified compared to what they are on head. I think these are 3 patches which should be independently committable.

While looking into the first idea of dividing the buffer space into banks, I see that it will speed up finding the buffers, but OTOH while searching for the victim buffer it can actually hurt performance if the SLRU pages which are frequently accessed are not evenly distributed across the banks. So imagine cases where we have some banks with a lot of empty slots and other banks from which we frequently have to evict pages in order to get the new pages in.

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Hi Dilip! Thank you for the review!

On Tue, Jan 10, 2023 at 9:58 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jan 9, 2023 at 9:49 AM Andrey Borodin <amborodin86@gmail.com> wrote:
>>
>> On Tue, Jan 3, 2023 at 5:02 AM vignesh C <vignesh21@gmail.com> wrote:
>>> does not apply on top of HEAD as in [1], please post a rebased patch:
>>
>> Thanks! Here's the rebase.
>
> I was looking into this patch; it seems like three different optimizations are squeezed into a single patch: 1) dividing the buffer space into banks to reduce the sequential search cost, 2) a GUC parameter for the buffer size scale, and 3) some of the buffer size values are modified compared to what they are on head. I think these are 3 patches which should be independently committable.

There's no point in dividing SLRU buffers into parts unless the buffer size is configurable. And it's only possible to enlarge the default buffer size if SLRU buffers are divided into banks. So the features can be viewed as independent commits, but make no sense separately. But, probably, it's a good idea to split the patch back anyway, for easier review.

> While looking into the first idea of dividing the buffer space into banks, I see that it will speed up finding the buffers, but OTOH while searching for the victim buffer it can actually hurt performance if the SLRU pages which are frequently accessed are not evenly distributed across the banks. So imagine cases where we have some banks with a lot of empty slots and other banks from which we frequently have to evict pages in order to get the new pages in.

Yes. Despite the extremely low probability of such a case, this pattern, when a user accesses pages assigned to only one bank, may happen. This case is equivalent to having just one bank, which means small buffers. Just as we have now.

Thank you!

Best regards, Andrey Borodin.
On Mon, 9 Jan 2023 at 09:49, Andrey Borodin <amborodin86@gmail.com> wrote: > > On Tue, Jan 3, 2023 at 5:02 AM vignesh C <vignesh21@gmail.com> wrote: > > does not apply on top of HEAD as in [1], please post a rebased patch: > > > Thanks! Here's the rebase. I'm seeing that there has been no activity in this thread for more than 1 year now, I'm planning to close this in the current commitfest unless someone is planning to take it forward. Regards, Vignesh
> On 20 Jan 2024, at 08:31, vignesh C <vignesh21@gmail.com> wrote:
>
> On Mon, 9 Jan 2023 at 09:49, Andrey Borodin <amborodin86@gmail.com> wrote:
>>
>> On Tue, Jan 3, 2023 at 5:02 AM vignesh C <vignesh21@gmail.com> wrote:
>>> does not apply on top of HEAD as in [1], please post a rebased patch:
>>
>> Thanks! Here's the rebase.
>
> I'm seeing that there has been no activity in this thread for more than 1 year now, I'm planning to close this in the current commitfest unless someone is planning to take it forward.

Hi Vignesh, thanks for the ping! The most important parts of this patch set are discussed in [0]. If that patchset gets committed, I'll withdraw the entry for this thread from the commitfest.

There's a version of Multixact-specific optimizations [1], but I hope they will not be necessary with the effective caches developed in [0]. It seems to me that the most important part of those optimizations is removing sleeps under the SLRU lock on standby [2] by Kyotaro Horiguchi. But given that the cache optimizations took 4 years to get closer to commit, I'm not sure we will get this optimization any time soon...

Thanks!

Best regards, Andrey Borodin.

[0] https://www.postgresql.org/message-id/flat/CAFiTN-vzDvNz%3DExGXz6gdyjtzGixKSqs0mKHMmaQ8sOSEFZ33A%40mail.gmail.com
[1] https://www.postgresql.org/message-id/flat/2ECE132B-C042-4489-930E-DBC5D0DAB84A%40yandex-team.ru#5f7e7022647be9eeecfc2ae75d765500
[2] https://www.postgresql.org/message-id/flat/20200515.090333.24867479329066911.horikyota.ntt%40gmail.com#855f8bb7205890579a363d2344b4484d
On 2024-Jan-27, Andrey Borodin wrote:
> thanks for the ping! The most important parts of this patch set are discussed in [0]. If that patchset gets committed, I'll withdraw the entry for this thread from the commitfest.
> There's a version of Multixact-specific optimizations [1], but I hope they will not be necessary with the effective caches developed in [0]. It seems to me that the most important part of those optimizations is removing sleeps under the SLRU lock on standby [2] by Kyotaro Horiguchi. But given that the cache optimizations took 4 years to get closer to commit, I'm not sure we will get this optimization any time soon...

I'd appreciate it if you or Horiguchi-san can update his patch to remove use of usleep in favor of a CV in multixact, and keep this CF entry to cover it.

Perhaps a test to make the code reach the usleep(1000) can be written using injection points (49cd2b93d7db)?

-- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"Experience tells us that man has peeled potatoes millions of times, but we were forced to admit the possibility that in one case in a million the potatoes would peel the man" (Ijon Tichy)
> On 28 Jan 2024, at 17:49, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> I'd appreciate it if you or Horiguchi-san can update his patch to remove use of usleep in favor of a CV in multixact, and keep this CF entry to cover it.

Sure! Sounds great!

> Perhaps a test to make the code reach the usleep(1000) can be written using injection points (49cd2b93d7db)?

I've tried to prototype something like that. But the interesting point between GetNewMultiXactId() and RecordNewMultiXact() is a critical section, and we cannot have injection points in critical sections...
Also, to implement such a test we need a "wait" type of injection point, see step 2 in the attachment. With this type of injection point I can stop a backend amidst entering information about a new MultiXact.

Best regards, Andrey Borodin.
Attachment
> On 28 Jan 2024, at 23:17, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>
>> Perhaps a test to make the code reach the usleep(1000) can be written using injection points (49cd2b93d7db)?
>
> I've tried to prototype something like that. But the interesting point between GetNewMultiXactId() and RecordNewMultiXact() is a critical section, and we cannot have injection points in critical sections...
> Also, to implement such a test we need a "wait" type of injection point, see step 2 in the attachment. With this type of injection point I can stop a backend amidst entering information about a new MultiXact.

Here's the test draft. This test reliably reproduces the sleep on the CV when waiting for the next multixact to be filled into the "members" SLRU.

Cost of having this test:
1. We need a new injection point type "wait" (in addition to "error" and "notice"). It cannot be avoided, because we need to sync at least 3 processes to observe the condition we want.
2. We need a new way to declare an injection point that can happen inside a critical section. I've called it a "prepared injection point".

The complexity of having this test is higher than the complexity of the CV-sleep patch itself. Do we want it? If so I can produce a cleaner version; currently all multixact tests are in the injection_points test module.

Best regards, Andrey Borodin.
Attachment
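A minimal, hypothetical sketch of the "prepared injection point" pattern described in the message above: load (and allocate) everything while it is still safe, then run only the cached copy once inside the critical section. The INJECTION_POINT_LOAD()/INJECTION_POINT_CACHED() macro names and the point name are assumptions (matching the "load and run from the cache" macros discussed later in this thread), not necessarily the committed API:

    /* Hypothetical sketch, not the committed multixact code. */
    #include "postgres.h"
    #include "miscadmin.h"
    #include "utils/injection_point.h"

    static void
    fill_in_new_multixact(void)
    {
        /*
         * Load the injection point while it is still safe to allocate
         * memory; this caches the callback in backend-local memory.
         */
        INJECTION_POINT_LOAD("multixact-fill-members");     /* assumed name */

        START_CRIT_SECTION();

        /*
         * Inside the critical section only the cached copy is run: no
         * palloc, no ereport(ERROR).  A "wait"-type point here lets a test
         * freeze the backend between GetNewMultiXactId() and
         * RecordNewMultiXact().
         */
        INJECTION_POINT_CACHED("multixact-fill-members");

        /* ... write the offsets/members SLRU entries here ... */

        END_CRIT_SECTION();
    }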
At Sat, 3 Feb 2024 22:32:45 +0500, "Andrey M. Borodin" <x4mmm@yandex-team.ru> wrote in

> Here's the test draft. This test reliably reproduces the sleep on the CV when waiting for the next multixact to be filled into the "members" SLRU.

By the way, I raised a question about using multiple CVs
simultaneously [1]. That is, I suspect that the current CV
implementation doesn't allow us to use multiple condition variables at
the same time, because all CVs use the same PGPROC member cvWaitLink
to accommodate different waiter sets. If this assumption is correct,
we should resolve the issue before spreading more uses of CVs.

[1] https://www.postgresql.org/message-id/20240227.150709.1766217736683815840.horikyota.ntt%40gmail.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
> On 29 Feb 2024, at 06:59, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> At Sat, 3 Feb 2024 22:32:45 +0500, "Andrey M. Borodin" <x4mmm@yandex-team.ru> wrote in
>> Here's the test draft. This test reliably reproduces the sleep on the CV when waiting for the next multixact to be filled into the "members" SLRU.
>
> By the way, I raised a question about using multiple CVs
> simultaneously [1]. That is, I suspect that the current CV
> implementation doesn't allow us to use multiple condition variables at
> the same time, because all CVs use the same PGPROC member cvWaitLink
> to accommodate different waiter sets. If this assumption is correct,
> we should resolve the issue before spreading more uses of CVs.

Alvaro, Kyotaro, what's our plan for this? It seems too late to deal with this pg_usleep(1000L) for PG17.

I propose the following course of action:
1. Close this long-standing CF item.
2. Start a new thread with the CV-sleep patch aimed at PG18.
3. Create a new entry in the July CF.

What do you think?

Best regards, Andrey Borodin.
> On 6 Apr 2024, at 14:24, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>
> What do you think?

OK, I'll follow this plan. Since most parts of this thread were committed, I'll mark the CF item as "committed".

Thanks to everyone involved! See you in a follow-up thread about sleeping on a CV.

Best regards, Andrey Borodin.
On 2024-Feb-03, Andrey M. Borodin wrote:

> Here's the test draft. This test reliably reproduces the sleep on the CV when waiting for the next multixact to be filled into the "members" SLRU.
> Cost of having this test:
> 1. We need a new injection point type "wait" (in addition to "error" and "notice"). It cannot be avoided, because we need to sync at least 3 processes to observe the condition we want.
> 2. We need a new way to declare an injection point that can happen inside a critical section. I've called it a "prepared injection point".
>
> The complexity of having this test is higher than the complexity of the CV-sleep patch itself. Do we want it? If so, I can produce a cleaner version; currently all multixact tests are in the injection_points test module.

Well, it would be nice to have *some* test, but as you say it is way
more complex than the thing being tested, and it zooms in on the
functioning of the multixact creation in insane quantum-physics ways ...
to the point that you can no longer trust that multixact works the same
way with the test as without it. So what I did is manually run other
tests (pgbench) to verify that the corner case in multixact creation is
being handled correctly, and pushed the patch after a few corrections,
in particular so that it would follow the CV sleeping protocol a little
more closely to what the CV documentation suggests, which is not at all
what the patch did. I also fixed a few comments that had been neglected
and changed the name and description of the CV in the docs.

Now, maybe we can still add the test later, but it needs a rebase.

-- 
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
“Cuando no hay humildad las personas se degradan” (A. Christie)
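For reference, a minimal sketch of the CV sleeping protocol being referred to (prepare, re-test the condition in a loop, sleep, cancel), applied to the "next multixact is still being filled in" wait. The variable, helper, and wait-event names here are placeholders, not the names that were committed:

    /* Illustrative sketch only; names are placeholders, not committed code. */
    #include "postgres.h"
    #include "storage/condition_variable.h"

    /*
     * Assumed to live in MultiXact shared state and to be initialized with
     * ConditionVariableInit() at startup.
     */
    static ConditionVariable *nextoff_cv;

    /* Hypothetical check of the condition we are waiting for. */
    static bool next_offset_filled(void);

    static void
    wait_for_next_offset(uint32 wait_event_info)
    {
        ConditionVariablePrepareToSleep(nextoff_cv);
        while (!next_offset_filled())
            ConditionVariableSleep(nextoff_cv, wait_event_info);
        ConditionVariableCancelSleep();
    }

    /*
     * The writer side, after recording the next offset, wakes every waiter
     * with ConditionVariableBroadcast(nextoff_cv).
     */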
> On 7 Apr 2024, at 21:41, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> Well, it would be nice to have *some* test, but as you say it is way
> more complex than the thing being tested, and it zooms in on the
> functioning of the multixact creation in insane quantum-physics ways ...
> to the point that you can no longer trust that multixact works the same
> way with the test as without it. So what I did is manually run other
> tests (pgbench) to verify that the corner case in multixact creation is
> being handled correctly, and pushed the patch after a few corrections,
> in particular so that it would follow the CV sleeping protocol a little
> more closely to what the CV documentation suggests, which is not at all
> what the patch did. I also fixed a few comments that had been neglected
> and changed the name and description of the CV in the docs.

That's excellent! Thank you!

> Now, maybe we can still add the test later, but it needs a rebase.

Sure. 'wait' injection points are there already, so I'll produce a patch with "prepared" injection points and re-implement the test on top of that. I'll put it in the July CF.

Best regards, Andrey Borodin.
> On 5 Jul 2024, at 14:16, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Mon, Jun 10, 2024 at 03:10:33PM +0900, Michael Paquier wrote:
>> OK, cool. I'll try to get that into the tree once v18 opens up.
>
> And I've spent more time on this one, and applied it to v18 after some
> slight tweaks. Please feel free to re-post your tests with
> multixacts, Andrey.

Thanks Michael!

> On 7 Apr 2024, at 23:41, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> Now, maybe we can still add the test later, but it needs a rebase.

Alvaro, please find attached the test.
I've addressed some of Michael's comments in a nearby thread: removed the extra load, made injection point names lowercase, fixed some grammar issues.

Best regards, Andrey Borodin.
Attachment
> On 5 Jul 2024, at 23:18, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>
> Alvaro, please find attached the test.
> I've addressed some of Michael's comments in a nearby thread: removed the extra load, made injection point names lowercase, fixed some grammar issues.

I've made several runs on GitHub to test stability [0, 1, 2, 3]. CI seems to be stable.

Thanks!

Best regards, Andrey Borodin.

[0] https://github.com/x4m/postgres_g/commit/c9c362679f244
[1] https://github.com/x4m/postgres_g/commit/9d7e43cc1
[2] https://github.com/x4m/postgres_g/commit/18cf186617
[3] https://github.com/x4m/postgres_g/commit/4fbce73997
On 2024-Aug-19, Andrey M. Borodin wrote:

> > On 5 Jul 2024, at 23:18, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
> >
> > Alvaro, please find attached the test.
> > I've addressed some of Michael's comments in a nearby thread: removed
> > the extra load, made injection point names lowercase, fixed some grammar
> > issues.
>
> I've made several runs on GitHub to test stability [0, 1, 2, 3]. CI seems to be stable.

OK, I've made some minor adjustments and pushed. CI seemed OK for me,
let's see what the BF has to say.

-- 
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
On 2024-Aug-21, Michael Paquier wrote:

> I see that you've gone the way with the SQL function doing a load().
> Would it be worth switching the test to rely on the two macros for
> load and caching instead? I've mentioned that previously but never
> got down to present a patch for the sake of this test.

Hmm, I have no opinion on which way is best. You probably have a better
sense of what's better for the injection points interface, so I'm happy
to defer to you on this.

> +     /* reset in case this is a restart within the postmaster */
> +     inj_state = NULL;

I'm not sure that this assignment actually accomplishes anything ...

I don't understand what the inj_stats_enabled stuff has to do with this
patch. I suspect it's a git operation error, i.e., you seem to have
squashed two different things together.

-- 
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Industry suffers from the managerial dogma that for the sake of stability
and continuity, the company should be independent of the competence of
individual employees." (E. Dijkstra)
On 2024-Aug-21, Michael Paquier wrote:

> From fd8ab7b6845a2c56aa2c8d9c60f404f6b3407338 Mon Sep 17 00:00:00 2001
> From: Michael Paquier <michael@paquier.xyz>
> Date: Wed, 21 Aug 2024 15:16:06 +0900
> Subject: [PATCH 2/3] injection_point: Add injection_points.stats
> This GUC controls if statistics should be used or not in the module.
> Custom statistics require the module to be loaded with
> shared_preload_libraries, hence this GUC is made PGC_POSTMASTER. By
> default, stats are disabled.
>
> This will be used by an upcoming change in a test where stats should not
> be used, as the test has a dependency on a critical section.

I find it strange that the information that stats cannot be used with
injection points that have a dependency on critical sections (?) is only
in the commit message and not in the code.

Also, maybe it'd make sense for stats to be globally enabled, and that
only the tests that require it would disable them? (It's probably easy
enough to have a value "injection_points.stats=auto" which means, if the
module is loaded in shared_preload_libraries then set stats on,
otherwise turn them off.) TBH I don't understand why the issue that
stats require shared_preload_libraries only comes up now ...

Maybe another approach is to say that if an injection point is loaded
via _LOAD() rather than the normal way, then stats are disabled for that
one rather than globally? Or give the _LOAD() macro a boolean argument
to indicate whether to collect stats for that injection point or not.

Lastly, it's not clear to me what it means that the test has a
"dependency" on a critical section. Do you mean to say that the
injection point runs inside a critical section?

> From e5329d080b9d8436af8f65aac118745cf1f81ca2 Mon Sep 17 00:00:00 2001
> From: Michael Paquier <michael@paquier.xyz>
> Date: Wed, 21 Aug 2024 15:09:06 +0900
> Subject: [PATCH 3/3] Rework new SLRU test with injection points
>
> Rather than the SQL injection_points_load, this commit makes the test
> rely on the two macros to load and run an injection point from the
> cache, acting as an example of how to use them.

No issues with this, feel free to go ahead.

-- 
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Cada quien es cada cual y baja las escaleras como quiere" (JMSerrat)
On 2024-Aug-22, Michael Paquier wrote:

> On Wed, Aug 21, 2024 at 01:55:06PM -0400, Alvaro Herrera wrote:
> > Also, maybe it'd make sense for stats to be globally enabled, and that
> > only the tests that require it would disable them? (It's probably easy
> > enough to have a value "injection_points.stats=auto" which means, if the
> > module is loaded in shared_preload_libraries then set stats on,
> > otherwise turn them off.)
>
> I'm not sure that we need to get down to that until somebody has a
> case where they want to rely on stats of injection points for their
> stuff. At this stage, I only want the stats to be enabled to provide
> automated checks for the custom pgstats APIs, so disabling it by
> default and enabling it only in the stats test of the module
> injection_points sounds kind of enough to me for now.

Oh! I thought the stats were useful by themselves. That not being the
case, I agree with simplifying; and the other ways to enhance this point
might not be necessary for now.

> > Or give the _LOAD() macro a boolean argument to
> > indicate whether to collect stats for that injection point or not.
>
> Sticking some knowledge about the stats in the backend part of
> injection points does not sound like a good idea to me.

You could flip this around: have the bool be for "this injection point
is going to be invoked inside a critical section". Then core code just
needs to tell the injection points module what core code does, and it's
injection_points that decides what to do with that information.

-- 
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"Crear es tan difícil como ser libre" (Elsa Triolet)
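A rough sketch of how the "injection_points.stats=auto" idea floated above could look, assuming an enum GUC plus the process_shared_preload_libraries_in_progress flag; the enum values, variable names, and the guard structure are illustrative assumptions, not the code that was eventually committed:

    /* Illustrative sketch of an "auto" stats mode; not the committed module code. */
    #include "postgres.h"
    #include "fmgr.h"
    #include "miscadmin.h"
    #include "utils/guc.h"

    PG_MODULE_MAGIC;

    typedef enum
    {
        INJ_STATS_OFF,
        INJ_STATS_ON,
        INJ_STATS_AUTO,
    } InjStatsMode;

    static const struct config_enum_entry inj_stats_options[] = {
        {"off", INJ_STATS_OFF, false},
        {"on", INJ_STATS_ON, false},
        {"auto", INJ_STATS_AUTO, false},
        {NULL, 0, false}
    };

    static int  inj_stats_mode = INJ_STATS_AUTO;
    static bool inj_stats_enabled = false;

    void
    _PG_init(void)
    {
        /*
         * Custom statistics require the module to be in
         * shared_preload_libraries, so only set up the GUC and the stats
         * machinery while the postmaster is processing preload libraries;
         * otherwise "auto" (and everything else) resolves to "off".
         */
        if (!process_shared_preload_libraries_in_progress)
            return;

        DefineCustomEnumVariable("injection_points.stats",
                                 "Whether to collect statistics about injection points.",
                                 NULL,
                                 &inj_stats_mode,
                                 INJ_STATS_AUTO,
                                 inj_stats_options,
                                 PGC_POSTMASTER,
                                 0,
                                 NULL, NULL, NULL);

        /* Preloaded: "auto" behaves like "on". */
        inj_stats_enabled = (inj_stats_mode != INJ_STATS_OFF);
    }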
On Thu, Aug 22, 2024 at 10:36:38AM -0400, Alvaro Herrera wrote:
> On 2024-Aug-22, Michael Paquier wrote:
>> I'm not sure that we need to get down to that until somebody has a
>> case where they want to rely on stats of injection points for their
>> stuff. At this stage, I only want the stats to be enabled to provide
>> automated checks for the custom pgstats APIs, so disabling it by
>> default and enabling it only in the stats test of the module
>> injection_points sounds kind of enough to me for now.
>
> Oh! I thought the stats were useful by themselves.
Yep, currently they're not, but I don't want to discard that they'll
never be, either. Perhaps there would be a case where somebody would
like to run a callback N times and trigger a condition? That's
something where the stats could be useful, but I don't have a specific
case for that now. I'm just imagining possibilities.
I believe I am seeing the problem being discussed occurring on a production system running 15.6, causing ever-increasing replay lag on the standby, until I cancel the offending process on the standby and force it to process its interrupts.

Here's the backtrace before I do that:

#0 0x00007f4503b81876 in select () from /lib64/libc.so.6
#1 0x0000558b0956891a in pg_usleep (microsec=microsec@entry=1000) at pgsleep.c:56
#2 0x0000558b0917e01a in GetMultiXactIdMembers (from_pgupgrade=false, onlyLock=<optimized out>, members=0x7ffcd2a9f1e0, multi=109187502) at multixact.c:1392
#3 GetMultiXactIdMembers (multi=109187502, members=members@entry=0x7ffcd2a9f1e0, from_pgupgrade=from_pgupgrade@entry=false, onlyLock=onlyLock@entry=false) at multixact.c:1224
#4 0x0000558b0913de15 in MultiXactIdGetUpdateXid (xmax=<optimized out>, t_infomask=<optimized out>) at heapam.c:6924
#5 0x0000558b09146028 in HeapTupleGetUpdateXid (tuple=tuple@entry=0x7f440d428308) at heapam.c:6965
#6 0x0000558b0914c02f in HeapTupleSatisfiesMVCC (htup=0x558b0b7cbf20, htup=0x558b0b7cbf20, buffer=8053429, snapshot=0x558b0b63a2d8) at heapam_visibility.c:1089
#7 HeapTupleSatisfiesVisibility (tup=tup@entry=0x7ffcd2a9f2b0, snapshot=snapshot@entry=0x558b0b63a2d8, buffer=buffer@entry=8053429) at heapam_visibility.c:1771
#8 0x0000558b0913e819 in heapgetpage (sscan=sscan@entry=0x558b0b7ccfa0, page=page@entry=115) at heapam.c:468
#9 0x0000558b0913eb7e in heapgettup_pagemode (scan=scan@entry=0x558b0b7ccfa0, dir=ForwardScanDirection, nkeys=0, key=0x0) at heapam.c:1120
#10 0x0000558b0913fb5e in heap_getnextslot (sscan=0x558b0b7ccfa0, direction=<optimized out>, slot=0x558b0b7cc000) at heapam.c:1352
#11 0x0000558b092c0e7a in table_scan_getnextslot (slot=0x558b0b7cc000, direction=ForwardScanDirection, sscan=<optimized out>) at ../../../src/include/access/tableam.h:1046
#12 SeqNext (node=0x558b0b7cbe10) at nodeSeqscan.c:80
#13 0x0000558b0929b9bf in ExecScan (node=0x558b0b7cbe10, accessMtd=0x558b092c0df0 <SeqNext>, recheckMtd=0x558b092c0dc0 <SeqRecheck>) at execScan.c:198
#14 0x0000558b09292cb2 in ExecProcNode (node=0x558b0b7cbe10) at ../../../src/include/executor/executor.h:262
#15 ExecutePlan (execute_once=<optimized out>, dest=0x558b0bca1350, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x558b0b7cbe10, estate=0x558b0b7cbbe8)
at execMain.c:1636
#16 standard_ExecutorRun (queryDesc=0x558b0b8c9798, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:363
#17 0x00007f44f64d43c5 in pgss_ExecutorRun (queryDesc=0x558b0b8c9798, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at pg_stat_statements.c:1010
#18 0x0000558b093fda0f in PortalRunSelect (portal=portal@entry=0x558b0b6ba458, forward=forward@entry=true, count=0, count@entry=9223372036854775807, dest=dest@entry=0x558b0bca1350) at pquery.c:924
#19 0x0000558b093fedb8 in PortalRun (portal=portal@entry=0x558b0b6ba458, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x558b0bca1350,
altdest=altdest@entry=0x558b0bca1350, qc=0x7ffcd2a9f7a0) at pquery.c:768
#20 0x0000558b093fb243 in exec_simple_query (
query_string=0x558b0b6170c8 "<redacted>")
at postgres.c:1250
#21 0x0000558b093fd412 in PostgresMain (dbname=<optimized out>, username=<optimized out>) at postgres.c:4598
#22 0x0000558b0937e170 in BackendRun (port=<optimized out>, port=<optimized out>) at postmaster.c:4514
#23 BackendStartup (port=<optimized out>) at postmaster.c:4242
#24 ServerLoop () at postmaster.c:1809
#25 0x0000558b0937f147 in PostmasterMain (argc=argc@entry=5, argv=argv@entry=0x558b0b5d0a30) at postmaster.c:1481
#26 0x0000558b09100a2c in main (argc=5, argv=0x558b0b5d0a30) at main.c:202
On Tue, 29 Oct 2024 at 16:38, Thom Brown <thom@linux.com> wrote:
>
> On Fri, 23 Aug 2024 at 01:29, Michael Paquier <michael@paquier.xyz> wrote:
>>
>> On Thu, Aug 22, 2024 at 10:36:38AM -0400, Alvaro Herrera wrote:
>> > On 2024-Aug-22, Michael Paquier wrote:
>> >> I'm not sure that we need to get down to that until somebody has a
>> >> case where they want to rely on stats of injection points for their
>> >> stuff. At this stage, I only want the stats to be enabled to provide
>> >> automated checks for the custom pgstats APIs, so disabling it by
>> >> default and enabling it only in the stats test of the module
>> >> injection_points sounds kind of enough to me for now.
>> >
>> > Oh! I thought the stats were useful by themselves.
>>
>> Yep, currently they're not, but I don't want to discard that they'll
>> never be, either. Perhaps there would be a case where somebody would
>> like to run a callback N times and trigger a condition? That's
>> something where the stats could be useful, but I don't have a specific
>> case for that now. I'm just imagining possibilities.
>
> I believe I am seeing the problem being discussed occurring on a production system running 15.6, causing ever-increasing replay lag on the standby, until I cancel the offending process on the standby and force it to process its interrupts.
>
> Here's the backtrace before I do that:
>
> #0 0x00007f4503b81876 in select () from /lib64/libc.so.6
> #1 0x0000558b0956891a in pg_usleep (microsec=microsec@entry=1000) at pgsleep.c:56
> #2 0x0000558b0917e01a in GetMultiXactIdMembers (from_pgupgrade=false, onlyLock=<optimized out>, members=0x7ffcd2a9f1e0, multi=109187502) at multixact.c:1392
> #3 GetMultiXactIdMembers (multi=109187502, members=members@entry=0x7ffcd2a9f1e0, from_pgupgrade=from_pgupgrade@entry=false, onlyLock=onlyLock@entry=false) at multixact.c:1224
> #4 0x0000558b0913de15 in MultiXactIdGetUpdateXid (xmax=<optimized out>, t_infomask=<optimized out>) at heapam.c:6924
> #5 0x0000558b09146028 in HeapTupleGetUpdateXid (tuple=tuple@entry=0x7f440d428308) at heapam.c:6965
> #6 0x0000558b0914c02f in HeapTupleSatisfiesMVCC (htup=0x558b0b7cbf20, htup=0x558b0b7cbf20, buffer=8053429, snapshot=0x558b0b63a2d8) at heapam_visibility.c:1089
> #7 HeapTupleSatisfiesVisibility (tup=tup@entry=0x7ffcd2a9f2b0, snapshot=snapshot@entry=0x558b0b63a2d8, buffer=buffer@entry=8053429) at heapam_visibility.c:1771
> #8 0x0000558b0913e819 in heapgetpage (sscan=sscan@entry=0x558b0b7ccfa0, page=page@entry=115) at heapam.c:468
> #9 0x0000558b0913eb7e in heapgettup_pagemode (scan=scan@entry=0x558b0b7ccfa0, dir=ForwardScanDirection, nkeys=0, key=0x0) at heapam.c:1120
> #10 0x0000558b0913fb5e in heap_getnextslot (sscan=0x558b0b7ccfa0, direction=<optimized out>, slot=0x558b0b7cc000) at heapam.c:1352
> #11 0x0000558b092c0e7a in table_scan_getnextslot (slot=0x558b0b7cc000, direction=ForwardScanDirection, sscan=<optimized out>) at ../../../src/include/access/tableam.h:1046
> #12 SeqNext (node=0x558b0b7cbe10) at nodeSeqscan.c:80
> #13 0x0000558b0929b9bf in ExecScan (node=0x558b0b7cbe10, accessMtd=0x558b092c0df0 <SeqNext>, recheckMtd=0x558b092c0dc0 <SeqRecheck>) at execScan.c:198
> #14 0x0000558b09292cb2 in ExecProcNode (node=0x558b0b7cbe10) at ../../../src/include/executor/executor.h:262
> #15 ExecutePlan (execute_once=<optimized out>, dest=0x558b0bca1350, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x558b0b7cbe10, estate=0x558b0b7cbbe8)
> at execMain.c:1636
> #16 standard_ExecutorRun (queryDesc=0x558b0b8c9798, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:363
> #17 0x00007f44f64d43c5 in pgss_ExecutorRun (queryDesc=0x558b0b8c9798, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at pg_stat_statements.c:1010
> #18 0x0000558b093fda0f in PortalRunSelect (portal=portal@entry=0x558b0b6ba458, forward=forward@entry=true, count=0, count@entry=9223372036854775807, dest=dest@entry=0x558b0bca1350) at pquery.c:924
> #19 0x0000558b093fedb8 in PortalRun (portal=portal@entry=0x558b0b6ba458, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x558b0bca1350,
> altdest=altdest@entry=0x558b0bca1350, qc=0x7ffcd2a9f7a0) at pquery.c:768
> #20 0x0000558b093fb243 in exec_simple_query (
> query_string=0x558b0b6170c8 "<redacted>")
> at postgres.c:1250
> #21 0x0000558b093fd412 in PostgresMain (dbname=<optimized out>, username=<optimized out>) at postgres.c:4598
> #22 0x0000558b0937e170 in BackendRun (port=<optimized out>, port=<optimized out>) at postmaster.c:4514
> #23 BackendStartup (port=<optimized out>) at postmaster.c:4242
> #24 ServerLoop () at postmaster.c:1809
> #25 0x0000558b0937f147 in PostmasterMain (argc=argc@entry=5, argv=argv@entry=0x558b0b5d0a30) at postmaster.c:1481
> #26 0x0000558b09100a2c in main (argc=5, argv=0x558b0b5d0a30) at main.c:202

This occurred twice, meaning 2 processes needed terminating.

Taking a look at what's happening under the hood, it seems to be getting stuck here:

            if (nextMXOffset == 0)
            {
                /* Corner case 2: next multixact is still being filled in */
                LWLockRelease(MultiXactOffsetSLRULock);
                CHECK_FOR_INTERRUPTS();
                pg_usleep(1000L);
                goto retry;
            }

It clearly checks for interrupts, but when I saw this issue happen, it wasn't interruptible.

I looked around for similar reports, and I found a fork of Postgres reporting the same issue, but with more info than I thought to get at the time. They determined that the root cause involved a deadlock where the replay thread is stuck due to an out-of-order multixact playback sequence. The standby process waits indefinitely for a page pin to release before proceeding, but the required multixact ID hasn't yet been replayed due to log ordering.

What they did to "fix" the issue in their version was to add an error when the reading thread cannot retrieve the next multixact ID, allowing it to exit instead of becoming indefinitely stuck. This took the form of only allowing it to loop 100 times before exiting.

Thom
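As a rough illustration of the capped-retry approach described above, the loop quoted earlier could be bounded roughly like this; it is a reconstruction of what the fork reportedly did, not their actual patch and not PostgreSQL code:

    /*
     * Hypothetical cap on the retry loop, mirroring the block quoted above;
     * "retries" is assumed to be a local int initialized to 0 before the
     * retry: label.  Illustrative only.
     */
    if (nextMXOffset == 0)
    {
        /* Corner case 2: next multixact is still being filled in */
        LWLockRelease(MultiXactOffsetSLRULock);
        CHECK_FOR_INTERRUPTS();
        if (++retries >= 100)
            ereport(ERROR,
                    (errcode(ERRCODE_INTERNAL_ERROR),
                     errmsg("could not find next offset for multixact %u after %d retries",
                            multi, retries)));
        pg_usleep(1000L);
        goto retry;
    }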
> On 29 Oct 2024, at 21:45, Thom Brown <thom@linux.com> wrote:
>
> It clearly checks for interrupts, but when I saw this issue happen, it
> wasn't interruptible.

Perhaps those were different multixacts?
When I was observing this pathology on a standby, it was a stream of different reads encountering different multis.

Either way, the startup process can cancel the locking process on its own. Or is it the case that cancel was not enough, and you actually needed termination, not cancel?

Thanks!

Best regards, Andrey Borodin.
On Thu, 31 Oct 2024 at 10:47, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>
>> On 29 Oct 2024, at 21:45, Thom Brown <thom@linux.com> wrote:
>>
>> It clearly checks for interrupts, but when I saw this issue happen, it
>> wasn't interruptible.
>
> Perhaps those were different multixacts?
> When I was observing this pathology on a standby, it was a stream of different reads encountering different multis.
>
> Either way, the startup process can cancel the locking process on its own. Or is it the case that cancel was not enough, and you actually needed termination, not cancel?

Termination didn't work on either of the processes.

Thom
> On 31 Oct 2024, at 17:29, Thom Brown <thom@linux.com> wrote:
>
> On Thu, 31 Oct 2024 at 10:47, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>>
>>> On 29 Oct 2024, at 21:45, Thom Brown <thom@linux.com> wrote:
>>>
>>> It clearly checks for interrupts, but when I saw this issue happen, it
>>> wasn't interruptible.
>>
>> Perhaps those were different multixacts?
>> When I was observing this pathology on a standby, it was a stream of different reads encountering different multis.
>>
>> Either way, the startup process can cancel the locking process on its own. Or is it the case that cancel was not enough, and you actually needed termination, not cancel?
>
> Termination didn't work on either of the processes.

How did you force the process to actually terminate?
Did you observe repeated read of the same multixact?
Was the offending process holding any locks while waiting?

Best regards, Andrey Borodin.
> On 1 Nov 2024, at 00:25, Thom Brown <thom@linux.com> wrote:
>
> Unfortunately I didn't gather much information when it was occurring, and prioritised getting rid of the process blocking replay. I just attached gdb to it, got a backtrace and then "print ProcessInterrupts()".

Currently I do not see how this wait can result in a deadlock. But I did observe a standby in a pathological sequential scan encountering a recent multixact again and again (a new one each time).

I hope this situation will be alleviated by recent changes - now there is no fixed millisecond sleep, but hopefully a much shorter wait. In v17 we also added an injection point test which reproduces reading a very recent multixact. If there is a deadlock, I hope the buildfarm will show it to us.

Best regards, Andrey Borodin.
On Tue, Oct 29, 2024 at 1:45 PM Thom Brown <thom@linux.com> wrote:
> Taking a look at what's happening under the hood, it seems to be
> getting stuck here:
>
>             if (nextMXOffset == 0)
>             {
>                 /* Corner case 2: next multixact is still being filled in */
>                 LWLockRelease(MultiXactOffsetSLRULock);
>                 CHECK_FOR_INTERRUPTS();
>                 pg_usleep(1000L);
>                 goto retry;
>             }
>
> It clearly checks for interrupts, but when I saw this issue happen, it
> wasn't interruptible.

I don't understand the underlying issue here; however, if a process holds an lwlock, it's not interruptible, even if it checks for interrupts. So it could be that at this point in the code we're holding some other LWLock, such as a buffer content lock, and that's why this code fails to achieve its objective.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
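To make the interrupt-holdoff point concrete, here is a simplified paraphrase (not the actual source) of why CHECK_FOR_INTERRUPTS() is effectively a no-op while any LWLock is held: LWLockAcquire() bumps the holdoff counter, and ProcessInterrupts() defers cancel/terminate while that counter is nonzero, so the sleep loop above can spin without ever servicing the pending interrupt if an outer lock, such as a buffer content lock, is still held.

    /* Simplified paraphrase, not the actual source. */

    /* Roughly, in LWLockAcquire() (lwlock.c), before taking the lock: */
    HOLD_INTERRUPTS();          /* InterruptHoldoffCount++; undone by LWLockRelease() */

    /* Roughly, at the top of ProcessInterrupts() (postgres.c): */
    if (InterruptHoldoffCount != 0 || CritSectionCount != 0)
        return;                 /* query cancel / termination stay pending */

    /* Hence, in the loop quoted above: */
    LWLockRelease(MultiXactOffsetSLRULock);     /* drops one holdoff ... */
    CHECK_FOR_INTERRUPTS();                     /* ... but an outer LWLock keeps the count nonzero */
    pg_usleep(1000L);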