From: Heikki Linnakangas
Subject: Re: POC: make mxidoff 64 bits
Msg-id: ca451a6a-5b0f-433c-8ab9-b28b7f480bd7@iki.fi
In response to: POC: make mxidoff 64 bits (Maxim Orlov <orlovmg@gmail.com>)
List: pgsql-hackers
On 13/11/2024 17:44, Maxim Orlov wrote:
> On Tue, 12 Nov 2024 at 02:31, Heikki Linnakangas <hlinnaka@iki.fi 
> <mailto:hlinnaka@iki.fi>> wrote:
>     On a different note, I'm surprised you're rewriting member segments
>     from
>     scratch, parsing all the individual member groups and writing them out
>     again. There's no change to the members file format, except for the
>     numbering of the files, so you could just copy the files under the new
>     names without paying attention to the contents. It's not wrong to parse
>     them in detail, but I'd assume that it would be simpler not to.
> 
> Yes, at the beginning I also thought it would be possible to get by
> with simple copying. But in the case of wraparound, we must "bypass"
> the invalid zero offset value. The old 32-bit offsets wrapped at 2^32,
> so zero values appear in multixact.c and must be handled, i.e.
> bypassed. When we switch to 64-bit offsets, we have two options:
> 1) Bypass every ((uint32) offset == 0) value in multixact.c;
> 2) Convert the members and bypass the invalid value once.
> 
> The first option seems too weird to me. So, we have to repack the
> members and bypass the invalid value.

Hmm, so if I understand correctly, this is related to how we determine 
the length of the members array, by looking at the next multixid's 
offset. This is explained in GetMultiXactIdMembers:

>     /*
>      * Find out the offset at which we need to start reading MultiXactMembers
>      * and the number of members in the multixact.  We determine the latter as
>      * the difference between this multixact's starting offset and the next
>      * one's.  However, there are some corner cases to worry about:
>      *
>      * 1. This multixact may be the latest one created, in which case there is
>      * no next one to look at.  In this case the nextOffset value we just
>      * saved is the correct endpoint.
>      *
>      * 2. The next multixact may still be in process of being filled in: that
>      * is, another process may have done GetNewMultiXactId but not yet written
>      * the offset entry for that ID.  In that scenario, it is guaranteed that
>      * the offset entry for that multixact exists (because GetNewMultiXactId
>      * won't release MultiXactGenLock until it does) but contains zero
>      * (because we are careful to pre-zero offset pages). Because
>      * GetNewMultiXactId will never return zero as the starting offset for a
>      * multixact, when we read zero as the next multixact's offset, we know we
>      * have this case.  We handle this by sleeping on the condition variable
>      * we have just for this; the process in charge will signal the CV as soon
>      * as it has finished writing the multixact offset.
>      *
>      * 3. Because GetNewMultiXactId increments offset zero to offset one to
>      * handle case #2, there is an ambiguity near the point of offset
>      * wraparound.  If we see next multixact's offset is one, is that our
>      * multixact's actual endpoint, or did it end at zero with a subsequent
>      * increment?  We handle this using the knowledge that if the zero'th
>      * member slot wasn't filled, it'll contain zero, and zero isn't a valid
>      * transaction ID so it can't be a multixact member.  Therefore, if we
>      * read a zero from the members array, just ignore it.
>      *
>      * This is all pretty messy, but the mess occurs only in infrequent corner
>      * cases, so it seems better than holding the MultiXactGenLock for a long
>      * time on every multixact creation.
>      */
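
In code terms, the relevant part boils down to something like this (my 
paraphrase; ReadOffsetsEntry() and WaitForMultiXactOffset() are invented 
stand-ins for the SLRU lookup and the condition-variable wait, and case 
1 is ignored):

	MultiXactOffset offset = ReadOffsetsEntry(multi);
	MultiXactOffset nextOffset = ReadOffsetsEntry(multi + 1);

	while (nextOffset == 0)
	{
		/* case 2: the creator hasn't written its offset yet */
		WaitForMultiXactOffset(multi + 1);
		nextOffset = ReadOffsetsEntry(multi + 1);
	}
	nmembers = nextOffset - offset;		/* case 3 ambiguity lives here */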

With 64-bit offsets, can we assume that it never wraps around? We often 
treat 2^64 as "large enough that we'll never run out", e.g. LSNs are 
also assumed to never wrap around. I think that would be a safe 
assumption here too.
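For scale: 2^64 is about 1.8e19, so even at a million member offsets 
consumed per second, it would take on the order of half a million years 
to wrap around.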

If we accept that, we don't need to worry about case 3 anymore. But if 
we upgrade wrapped-around members files by just renaming them, there 
could still be a members array where offset 0 was skipped, and a reader 
could get confused by it after the upgrade. We could continue to ignore 
a 0 XID in the members array, as the comment says; I think that would 
be enough. But yeah, maybe it's better to bite the bullet in pg_upgrade 
and squeeze those out.
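
The reader-side tolerance would be a one-liner in the loop over the 
members array, something like:

	if (!TransactionIdIsValid(members[i].xid))
		continue;	/* pre-zeroed slot left by a pre-upgrade wraparound */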

Does your upgrade test suite include case 3, where the next multixact's 
offset is 1?


Can we remove MaybeExtendOffsetSlru() now? There are a bunch of other 
comments and checks about binary-upgraded values that we can hopefully 
clean up now, too.


If we are to parse the member segments in detail in upgrade anyway, I'd 
be tempted to make some further changes / optimizations:

- You could leave out all locking XID members in upgrade, because 
they're not relevant after upgrade any more (all the XIDs will be 
committed or aborted and have released the locks; we require prepared 
transactions to be completed before upgrading too). It'd be enough to 
include actual UPDATE/DELETE XIDs.

- The way we determine the length of the members array by looking at the 
next multixid's offset is a bit complicated. We could have one extra 
flag per XID in the members to indicate "this is the last member of this 
multixid". That could either replace the current mechanism of looking 
at the next offset, or be just an additional cross-check (sketched 
below, after the list).

- Do we still like the "group" representation, with 4 bytes of flags 
followed by 4 XIDs? I wonder if it'd be better to just store 5 bytes per 
XID unaligned.

- A more radical idea: There can be only one updating XID in one 
multixid. We could store that directly in the offsets SLRU, and keep 
only the locking XIDs in members. That way, the members SLRU would 
become less critical; it could be safely reset on crash for example 
(except for prepared transactions, which could still be holding locks, 
but it'd still be less serious). Separating correctness-critical data 
from more ephemeral state is generally a good idea.
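
To make the last few of these concrete, here is roughly what I have in 
mind; every name and layout below is invented, just to illustrate:

	/* Per-member flag byte: status bits plus a terminator bit that could
	 * replace, or cross-check, the look-at-next-offset computation.
	 * MultiXactStatus fits in 3 bits today. */
	#define MXMEMBER_STATUS_MASK	0x07
	#define MXMEMBER_LAST_OF_MULTI	0x08	/* last member of this multi */

	/* 5 bytes per member, unaligned, instead of the current groups of
	 * 4 flag bytes followed by 4 XIDs. */
	typedef struct MemberEntryData
	{
		uint8	flags;		/* status + MXMEMBER_LAST_OF_MULTI */
		uint8	xid[4];		/* TransactionId copied bytewise, no padding */
	} MemberEntryData;

	/* Offsets SLRU entry that carries the (at most one) updating XID
	 * directly, so that only locker XIDs remain in members. */
	typedef struct MultiXactOffsetEntry
	{
		uint64			memberOffset;	/* start of this multi's lockers */
		TransactionId	updateXid;		/* InvalidTransactionId if none */
	} MultiXactOffsetEntry;

Leaving out the lockers during upgrade (the first point) would then just 
mean keeping the members that pass the existing ISUPDATE_from_mxstatus() 
test.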

I'm not insisting on any of these changes, just some things that might 
be worth considering if we're rewriting the SLRUs on upgrade anyway.

-- 
Heikki Linnakangas
Neon (https://neon.tech)


