Re: CSN snapshots in hot standby - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: CSN snapshots in hot standby
Msg-id 80f254d3-8ee9-4cde-a7e3-ee99998154da@iki.fi
In response to [MASSMAIL]CSN snapshots in hot standby  (Heikki Linnakangas <hlinnaka@iki.fi>)
List pgsql-hackers
Here's a new patchset version. Not much has changed in the actual CSN 
patches. But I spent a lot of time refactoring the snapshot management 
code, so that there is a simple place to add the "inprogress XID cache" 
for the CSN snapshots, in a way that avoids duplicating the cache if a 
snapshot is copied around.

Patches 0001-0002 are the patches I posted on a separate thread earlier. 
See 
https://www.postgresql.org/message-id/ec10d398-c9b3-4542-8095-5fc6408b17d1%40iki.fi.

Patches 0003-0006 contain more snapshot manager changes. The end state 
is that an MVCC snapshot consists of two structs: a shared "inner" 
struct that contains xmin, xmax and the XID lists, and an "outer" struct 
that contains a pointer to the shared struct and the current command ID. 
As a snapshot is copied around, all the copies share the same shared, 
reference-counted struct.
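
To illustrate the two-struct layout, here is a minimal standalone
sketch. The struct and function names (SharedSnapshotData,
OuterSnapshot, snapshot_copy, and so on) are made up for this example
and are not the names used in the patches; memory contexts, locking and
resource owners are all left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;
typedef uint32_t CommandId;

/* Hypothetical shared "inner" part: built once, then reference-counted. */
typedef struct SharedSnapshotData
{
    TransactionId xmin;
    TransactionId xmax;
    TransactionId *xip;         /* in-progress XID list (unused here) */
    uint32_t    xcnt;
    uint32_t    refcount;
} SharedSnapshotData;

/* Hypothetical "outer" part: cheap to copy, holds the per-copy state. */
typedef struct OuterSnapshot
{
    SharedSnapshotData *shared;
    CommandId   curcid;
} OuterSnapshot;

SharedSnapshotData *
shared_snapshot_create(TransactionId xmin, TransactionId xmax)
{
    SharedSnapshotData *s = calloc(1, sizeof(*s));

    assert(s != NULL);
    s->xmin = xmin;
    s->xmax = xmax;
    return s;                   /* refcount starts at 0 */
}

/* Wrap a shared struct in a new outer struct, taking one reference. */
OuterSnapshot
snapshot_make(SharedSnapshotData *shared, CommandId curcid)
{
    OuterSnapshot snap = {shared, curcid};

    shared->refcount++;
    return snap;
}

/* Copying only copies the outer struct and bumps the refcount. */
OuterSnapshot
snapshot_copy(const OuterSnapshot *src)
{
    src->shared->refcount++;
    return *src;
}

/* Drop one reference; returns 1 if the shared struct was freed. */
int
snapshot_release(OuterSnapshot *snap)
{
    int         freed = 0;

    assert(snap->shared != NULL && snap->shared->refcount > 0);
    if (--snap->shared->refcount == 0)
    {
        free(snap->shared->xip);
        free(snap->shared);
        freed = 1;
    }
    snap->shared = NULL;
    return freed;
}
```

In this sketch, anything attached to the shared struct (such as a
per-snapshot cache) exists once no matter how many outer copies are
made, while each copy keeps its own command ID.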

The rest of the patches are the same CSN patches I posted before, 
rebased over the snapshot manager changes.


There's one thing that hasn't been discussed yet: The 
ProcArrayRecoveryEndTransaction() function, which replaces 
ExpireTreeKnownAssignedTransactionIds() and is called on replay of every 
commit/abort record, does this:

>     /*
>      * If this was the oldest XID that was still running, advance it. This is
>      * important for advancing the global xmin, which avoids unnecessary
>      * recovery conflicts
>      *
>      * No locking required because this runs in the startup process.
>      *
>      * XXX: the caller actually has a list of XIDs that just committed. We
>      * could save some clog lookups by taking advantage of that list.
>      */
>     oldest_running_primary_xid = procArray->oldest_running_primary_xid;
>     while (oldest_running_primary_xid < max_xid)
>     {
>         if (!TransactionIdDidCommit(oldest_running_primary_xid) &&
>             !TransactionIdDidAbort(oldest_running_primary_xid))
>         {
>             break;
>         }
>         TransactionIdAdvance(oldest_running_primary_xid);
>     }
>     if (max_xid == oldest_running_primary_xid)
>         TransactionIdAdvance(oldest_running_primary_xid);

The point is to maintain an "oldest xmin" value based on the WAL records 
that are being replayed. Whenever the currently oldest running XID 
finishes, we scan the CLOG to find the next oldest XID that hasn't 
completed yet.

That adds approximately one or two CLOG lookups to every commit record 
replay on average. I haven't measured it, but it seems like it could 
slow down recovery. There are ways to improve that, for example by 
doing the scan in larger batches.
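
As a rough illustration of the idea in the XXX comment above (using the
caller's list of just-finished XIDs to skip some CLOG lookups), here is
a toy standalone sketch. It is not code from the patch: MockClog,
mock_clog_lookup(), xid_in_list() and advance_oldest_running() are
hypothetical names, the "CLOG" is a plain array that counts lookups, and
XIDs are compared and incremented as plain integers, so wraparound and
special XIDs are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t TransactionId;

typedef enum
{
    XID_IN_PROGRESS = 0,
    XID_COMMITTED,
    XID_ABORTED
} XidStatus;

#define MOCK_CLOG_SIZE 1024

/* Toy stand-in for the CLOG: status indexed directly by XID. */
typedef struct MockClog
{
    XidStatus   status[MOCK_CLOG_SIZE];
    unsigned    lookups;        /* number of lookups performed so far */
} MockClog;

void
mock_clog_init(MockClog *clog)
{
    memset(clog, 0, sizeof(*clog));     /* everything in progress */
}

XidStatus
mock_clog_lookup(MockClog *clog, TransactionId xid)
{
    assert(xid < MOCK_CLOG_SIZE);
    clog->lookups++;
    return clog->status[xid];
}

bool
xid_in_list(TransactionId xid, const TransactionId *xids, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        if (xids[i] == xid)
            return true;
    }
    return false;
}

/*
 * Same shape as the quoted loop, except that XIDs found in the
 * 'finished' list (the XIDs of the record being replayed) are skipped
 * without consulting the CLOG. Assumes no XID wraparound.
 */
TransactionId
advance_oldest_running(MockClog *clog, TransactionId oldest,
                       TransactionId max_xid,
                       const TransactionId *finished, size_t nfinished)
{
    while (oldest < max_xid)
    {
        if (!xid_in_list(oldest, finished, nfinished) &&
            mock_clog_lookup(clog, oldest) == XID_IN_PROGRESS)
            break;
        oldest++;
    }
    if (oldest == max_xid)
        oldest++;
    return oldest;
}
```

The linear search through the list is only reasonable for short lists;
whether this saves anything measurable compared to the plain CLOG scan
is exactly the kind of thing that would need benchmarking.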


A bunch of other small XXX comments remain, but they're just markers for 
comments that need to be adjusted, or for further cleanups that are now 
possible.


There are also several ways the inprogress cache could be made more 
efficient, which I haven't explored:

- For each XID in the cache radix tree, we store one bit to indicate 
whether the lookup has been performed, i.e. if the cache is valid for 
the XID, and another bit to indicate if the XID is visible or not. With 
64-bit cache words stored in the radix tree, each cache word can store 
the status of 32 transactions. It would probably be better to work in 
bigger chunks. For example, when doing a lookup in the cache, check the 
status of 64 transactions at once. Assuming they're all stored on the 
same CSN page, it would not be much more expensive than a single XID 
lookup. That would make the cache 2x more compact, and save on future 
lookups of XIDs falling on the same cache word.

- Initializing the radix tree cache is fairly expensive, with several 
memory allocations. Many of those allocations could be done lazily with 
some effort in radixtree.h.

- Or start the cache as a small array of XIDs, and switch to the radix 
tree only after it fills up.
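
For reference, the current two-bits-per-XID scheme from the first item
above can be sketched like this. The exact bit positions, the macro
names and the helper functions are assumptions made for the example and
may not match the patch; the radix tree itself is left out, and only the
key computation and the packing within one 64-bit word are shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Two bits per XID, so one 64-bit word covers 32 consecutive XIDs. */
#define XIDS_PER_CACHE_WORD 32
#define CACHE_BIT_VALID     UINT64_C(1)     /* lookup has been performed */
#define CACHE_BIT_VISIBLE   UINT64_C(2)     /* XID is visible */

/* Key under which the word for this XID would be stored in the tree. */
uint64_t
cache_key(TransactionId xid)
{
    return xid / XIDS_PER_CACHE_WORD;
}

/* Position of this XID's two bits within its cache word. */
unsigned
cache_shift(TransactionId xid)
{
    return (xid % XIDS_PER_CACHE_WORD) * 2;
}

/* Record a lookup result for 'xid' and return the updated word. */
uint64_t
cache_word_set(uint64_t word, TransactionId xid, bool visible)
{
    unsigned    shift = cache_shift(xid);
    uint64_t    bits = CACHE_BIT_VALID |
        (visible ? CACHE_BIT_VISIBLE : UINT64_C(0));

    word &= ~(UINT64_C(3) << shift);
    word |= bits << shift;
    return word;
}

/* Returns false on a cache miss; otherwise sets *visible. */
bool
cache_word_get(uint64_t word, TransactionId xid, bool *visible)
{
    uint64_t    bits = (word >> cache_shift(xid)) & UINT64_C(3);

    if ((bits & CACHE_BIT_VALID) == 0)
        return false;
    *visible = (bits & CACHE_BIT_VISIBLE) != 0;
    return true;
}
```

With this layout XIDs 64 through 95 share one word (key 2), so a second
lookup for a neighbouring XID finds the same word but may still miss if
that XID's valid bit has not been set yet.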

-- 
Heikki Linnakangas
Neon (https://neon.tech)
