Re: CSN snapshots in hot standby - Mailing list pgsql-hackers

From Mingwei Jia (贾明伟)
Subject Re: CSN snapshots in hot standby
Date
Msg-id 355C0849-820C-4373-9CC9-9A257C661C3A@163.com
In response to Re: CSN snapshots in hot standby  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses Re: CSN snapshots in hot standby
List pgsql-hackers
Hi,

Thanks for the proposal, it's an interesting approach.

I have a question regarding XID visibility during standby startup.
If the checkpoint's `oldestActiveXid` is smaller than `nextXid`, then there may be already-committed transactions
in that range whose commit records will not be replayed on the standby. In that case, I believe the clog needs to be
consulted for visibility checks within that XID range — is that correct?

On top of your previous discussion, I wrote a test case and attempted to fix the issue.
Patches 0001–0012 are your original commits, unchanged. Patch 0013 contains my own ideas.

Since this is my first time replying to the mailing list, I was worried about breaking the thread, so I’ve included everything as attachments instead.

Looking forward to your thoughts.
Best regards,  
Mingwei Jia


On 1 Apr 2025, at 05:31, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Here's a new patchset version. Not much has changed in the actual CSN patches. But I spent a lot of time refactoring the snapshot management code, so that there is a simple place to add the "inprogress XID cache" for the CSN snapshots, in a way that avoids duplicating the cache if a snapshot is copied around.

Patches 0001-0002 are the patches I posted on a separate thread earlier. See https://www.postgresql.org/message-id/ec10d398-c9b3-4542-8095-5fc6408b17d1%40iki.fi.

Patches 0003-0006 contain more snapshot manager changes. The end state is that an MVCC snapshot consists of two structs: a shared "inner" struct that contains xmin, xmax and the XID lists, and an "outer" struct that contains a pointer to the shared struct and the current command ID. As a snapshot is copied around, all the copies share the same shared, reference-counted struct.

The rest of the patches are the same CSN patches I posted before, rebased over the snapshot manager changes.


There's one thing that hasn't been discussed yet: The ProcArrayRecoveryEndTransaction() function, which replaces ExpireTreeKnownAssignedTransactionIds() and is called on replay of every commit/abort record, does this:

	/*
	 * If this was the oldest XID that was still running, advance it.  This
	 * is important for advancing the global xmin, which avoids unnecessary
	 * recovery conflicts.
	 *
	 * No locking required because this runs in the startup process.
	 *
	 * XXX: the caller actually has a list of XIDs that just committed.  We
	 * could save some clog lookups by taking advantage of that list.
	 */
	oldest_running_primary_xid = procArray->oldest_running_primary_xid;
	while (oldest_running_primary_xid < max_xid)
	{
		if (!TransactionIdDidCommit(oldest_running_primary_xid) &&
			!TransactionIdDidAbort(oldest_running_primary_xid))
		{
			break;
		}
		TransactionIdAdvance(oldest_running_primary_xid);
	}
	if (max_xid == oldest_running_primary_xid)
		TransactionIdAdvance(oldest_running_primary_xid);

The point is to maintain an "oldest xmin" value based on the WAL records that are being replayed. Whenever the currently oldest running XID finishes, we scan the CLOG to find the next oldest XID that hasn't completed yet.

That adds approximately one or two CLOG lookups to every commit record replay on average. I haven't tried measuring that, but it seems like it could slow down recovery. There are ways that could be improved. For example, doing the lookups in larger batches.


A bunch of other small XXX comments remain, but they're just markers for comments that need to be adjusted, or for further cleanups that are now possible.


There are also several ways the inprogress cache could be made more efficient, which I haven't explored:

- For each XID in the cache radix tree, we store one bit to indicate whether the lookup has been performed, i.e. if the cache is valid for the XID, and another bit to indicate if the XID is visible or not. With 64-bit cache words stored in the radix tree, each cache word can store the status of 32 transactions. It would probably be better to work in bigger chunks. For example, when doing a lookup in the cache, check the status of 64 transactions at once. Assuming they're all stored on the same CSN page, it would not be much more expensive than a single XID lookup. That would make the cache 2x more compact, and save on future lookups of XIDs falling on the same cache word.

- Initializing the radix tree cache is fairly expensive, with several memory allocations. Many of those allocations could be done lazily with some effort in radixtree.h.

- Or start the cache as a small array of XIDs, and switch to the radix tree only after it fills up.

--
Heikki Linnakangas
Neon (https://neon.tech)
<v6-0001-Split-SnapshotData-into-separate-structs-for-each.patch>
<v6-0002-Simplify-historic-snapshot-refcounting.patch>
<v6-0003-Add-an-explicit-valid-flag-to-MVCCSnapshotData.patch>
<v6-0004-Replace-static-snapshot-pointers-with-the-valid-f.patch>
<v6-0005-Make-RestoreSnapshot-register-the-snapshot-with-c.patch>
<v6-0006-Replace-the-RegisteredSnapshot-pairing-heap-with-.patch>
<v6-0007-Split-MVCCSnapshot-into-inner-and-outer-parts.patch>
<v6-0008-XXX-add-perf-test.patch>
<v6-0009-Use-CSN-snapshots-during-Hot-Standby.patch>
<v6-0010-Make-SnapBuildWaitSnapshot-work-without-xl_runnin.patch>
<v6-0011-Remove-the-now-unused-xids-array-from-xl_running_.patch>
<v6-0012-Add-a-cache-to-Snapshot-to-avoid-repeated-CSN-loo.patch>

