Thread: POC: make mxidoff 64 bits
Hi!
I've been trying to introduce 64-bit transaction identifiers to Postgres for quite a while [0]. This, of course, implies an enormous amount of change across various modules. As a consequence, the patch set has become too big to be committed “at once”.
The obvious solution was to try to split the patch set into smaller pieces. But this brings a new challenge: not every one of these parts makes Postgres better on its own. Actually, even switching to a FullTransactionId in PGPROC has the disadvantage of increasing WAL size [1].
In fact, I believe we're dealing with a chicken-and-egg problem here: the full patch set cannot be committed because it is too big to handle, and its parts cannot be committed separately because they only make sense together and do not improve Postgres on their own.
But it's not that bad. Since commit 4ed8f0913bfdb5f, added in [3], we are able to use 64 bits for indexing SLRUs.
PROPOSAL
Make multixact offsets 64 bit.
RATIONALE
It is not very difficult to overflow a 32-bit mxidoff, since member space is consumed for every unique combination of transactions locking a tuple, including their XIDs and the respective flags. And when a transaction is added to a specific multixact, the whole member set is rewritten at a new offset. In other words, it is possible to overflow the multixact offsets without wrapping around the multixact IDs themselves and/or ordinary transaction IDs. I believe there was some discussion about this on the hackers mailing list, but I could not find the links now.
PFA the patch. This is a WIP version; the upgrade machinery will be added later.
As always, any opinions on the subject are very welcome!
--
Best regards,
Maxim Orlov.
> On 23 Apr 2024, at 11:23, Maxim Orlov <orlovmg@gmail.com> wrote:
>
> Make multixact offsets 64 bit.

- ereport(ERROR,
- (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
- errmsg("multixact \"members\" limit exceeded"),

Personally, I'd be happy with this! We had some incidents where the only mitigation was tweaking vacuum settings.

BTW, as a side note... I see lots of casts to (unsigned long long); can't we just cast to MultiXactOffset?

Best regards, Andrey Borodin.
Hi Maxim Orlov,
Thank you so much for your tireless work on this. Increasing the WAL size by a few bytes should have very little impact with today's disk performance. (Logical replication also increases WAL volume considerably, and logical replication is a milestone feature that the community keeps improving.) I believe removing the troubled PostgreSQL transaction ID wraparound would be a milestone as well; adding a few bytes is worth it!
Best regards
On Tue, 23 Apr 2024 at 17:37, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 23/04/2024 11:23, Maxim Orlov wrote:
> PROPOSAL
> Make multixact offsets 64 bit.
+1, this is a good next step and useful regardless of 64-bit XIDs.
> @@ -156,7 +148,7 @@
> ((uint32) ((0xFFFFFFFF % MULTIXACT_MEMBERS_PER_PAGE) + 1))
>
> /* page in which a member is to be found */
> -#define MXOffsetToMemberPage(xid) ((xid) / (TransactionId) MULTIXACT_MEMBERS_PER_PAGE)
> +#define MXOffsetToMemberPage(xid) ((xid) / (MultiXactOffset) MULTIXACT_MEMBERS_PER_PAGE)
> #define MXOffsetToMemberSegment(xid) (MXOffsetToMemberPage(xid) / SLRU_PAGES_PER_SEGMENT)
>
> /* Location (byte offset within page) of flag word for a given member */
This is really a bug fix. It didn't matter when TransactionId and
MultiXactOffset were both typedefs of uint32, but it was always wrong.
The argument name 'xid' is also misleading.
I think there are some more like that, MXOffsetToFlagsBitShift for example.
--
Heikki Linnakangas
Neon (https://neon.tech)
On Tue, 23 Apr 2024 at 12:37, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
This is really a bug fix. It didn't matter when TransactionId and
MultiXactOffset were both typedefs of uint32, but it was always wrong.
The argument name 'xid' is also misleading.
I think there are some more like that, MXOffsetToFlagsBitShift for example.
Yeah, I always thought so too. I believe this is just copy-paste. Do you mean it is worth creating a separate CF entry for these fixes?
On Tue, 23 Apr 2024 at 16:03, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
BTW, as a side note... I see lots of casts to (unsigned long long); can't we just cast to MultiXactOffset?
Actually, the first versions of the 64-bit XID patch set had such casts to the types TransactionId, MultiXactId and so on. But after some discussion, we switched to the unsigned long long cast. Unfortunately, I could not find an exact link to that discussion. On the other hand, such casting is already used throughout the code. So, for the sake of consistency, I would like to stay with these casts.
On Tue, 23 Apr 2024 at 16:03, wenhui qiu <qiuwenhuifx@gmail.com> wrote:
Thank you so much for your tireless work on this. Increasing the WAL size by a few bytes should have very little impact with today's disk performance. I believe removing the troubled PostgreSQL transaction ID wraparound would be a milestone as well; adding a few bytes is worth it!
I agree 100%. Maybe I should return to this approach and find some benefits to having FXIDs in WAL.
On Tue, Sep 3, 2024 at 4:30 PM Maxim Orlov <orlovmg@gmail.com> wrote:
> Here is rebase. Apparently I'll have to do it often, since the CATALOG_VERSION_NO changed in the patch.

I don't think you need to maintain CATALOG_VERSION_NO change in your patch for the exact reason you have mentioned: patch will get conflict each time CATALOG_VERSION_NO is advanced. It's responsibility of committer to advance CATALOG_VERSION_NO when needed.

------
Regards,
Alexander Korotkov
Supabase
On Tue, 3 Sept 2024 at 16:32, Alexander Korotkov <aekorotkov@gmail.com> wrote:
I don't think you need to maintain CATALOG_VERSION_NO change in your
patch for the exact reason you have mentioned: patch will get conflict
each time CATALOG_VERSION_NO is advanced. It's responsibility of
committer to advance CATALOG_VERSION_NO when needed.
OK, I got it. My intention here was to make the patch easier to test: anyone who wants to have a look at the patch won't need to make changes in the code. In the next iteration, I'll remove the CATALOG_VERSION_NO change.
Best regards,
Maxim Orlov.
Hi, Maxim!
Previously we accessed offsets in shared MultiXactState without locks, as a 32-bit read is always atomic. But I'm not sure that still holds when the offset becomes 64-bit.
E.g. GetNewMultiXactId():
nextOffset = MultiXactState->nextOffset;
is outside lock.
There might be other places we do the same as well.
Regards,
Pavel Borisov
Supabase
On Thu, 12 Sept 2024 at 16:09, Pavel Borisov <pashkin.elfe@gmail.com> wrote:
Hi, Maxim!
Previously we accessed offsets in shared MultiXactState without locks as 32-bit read is always atomic. But I'm not sure it's so when offset become 64-bit.
E.g. GetNewMultiXactId():
nextOffset = MultiXactState->nextOffset;
is outside lock.
There might be other places we do the same as well.
I think replacing the plain assignments with pg_atomic_read_u64/pg_atomic_write_u64 would be sufficient.
(The same I think is needed for the patchset [1])
Regards,
Pavel Borisov
On 2024-Sep-12, Pavel Borisov wrote:
> Hi, Maxim!
>
> Previously we accessed offsets in shared MultiXactState without locks as
> 32-bit read is always atomic. But I'm not sure it's so when offset become
> 64-bit.
> E.g. GetNewMultiXactId():
>
> nextOffset = MultiXactState->nextOffset;
> is outside lock.

Good thought. But fortunately I think it's not a problem. The read you point at happens with MultiXactGenLock held in shared mode -- and that works OK, as the assignment (in line 1263 at the bottom of the same routine) is done with the exclusive lock held.

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
On 07/09/2024 07:36, Maxim Orlov wrote:
> Here is v3.

MultiXactMemberFreezeThreshold looks quite bogus now. Now that MaxMultiXactOffset == 2^64-1, you cannot get anywhere near the MULTIXACT_MEMBER_SAFE_THRESHOLD and MULTIXACT_MEMBER_DANGER_THRESHOLD values anymore. Can we just get rid of MultiXactMemberFreezeThreshold? I guess it would still be useful to trigger autovacuum if the multixact members area grows large, though, to release the disk space, even if you can't run out of members as such anymore. What should the logic for that look like?

I'd love to see some tests for the pg_upgrade code. Something like a little perl script to generate test clusters with different wraparound scenarios etc. using the old version, and a TAP test to run pg_upgrade on them and verify that queries on the upgraded cluster work correctly. We don't have tests like that in the repository today, and I don't know if we'd want to commit these permanently either, but it would be highly useful now as a one-off thing, to show that the code works.

On upgrade, are there really no changes required to pg_multixact/members? I imagined that the segment files would need to be renamed around wraparound, so that if you previously had files like this:

pg_multixact/members/FFFE
pg_multixact/members/FFFF
pg_multixact/members/0000
pg_multixact/members/0001

after upgrade you would need to have:

pg_multixact/members/00000000FFFE
pg_multixact/members/00000000FFFF
pg_multixact/members/000000010000
pg_multixact/members/000000010001

Thanks for working on this!

--
Heikki Linnakangas
Neon (https://neon.tech)
Hi Maxim Orlov,
> After a bit of thought, I've realized that to be conservative here is the way to go.
> We can reuse a maximum of existing logic. I mean, we can remove offset wraparound "error logic" and reuse "warning logic". But set the threshold for "warning logic" to a much higher value. For now, I choose 2^32-1. In other words, legit logic, in my view, here would be to trigger autovacuum if the number of offsets (i.e. difference nextOffset - oldestOffset) exceeds 2^32-1. PFA patch set.
Good point; couldn't agree with you more. 64-bit XIDs are the solution to the wraparound problem, so the previous error log is no longer meaningful. But we might want to refine the warning output a little (for example, reporting the underlying reasons why the age has been increasing), even though we no longer have to worry about XID wraparound.
+/*
+ * Multixact members warning threshold.
+ *
+ * If the difference between nextOffset and oldestOffset exceeds this value, we
+ * trigger autovacuum in order to release the disk space if possible.
+ */
+#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD UINT64CONST(0xFFFFFFFF)
Can we refine this comment a bit? For example:
+/*
+ * Multixact members warning threshold.
+ *
+ * If the difference between nextOffset and oldestOffset exceeds this value, we
+ * trigger autovacuum in order to release the disk space and reduce table bloat
+ * if possible.
+ */
+#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD UINT64CONST(0xFFFFFFFF)
Thanks
Maxim Orlov <orlovmg@gmail.com> 于2024年10月23日周三 23:55写道:
After a bit of thought, I've realized that to be conservative here is the way to go.
We can reuse a maximum of the existing logic. I mean, we can remove the offset wraparound "error logic" and reuse the "warning logic", but set the threshold for the "warning logic" to a much higher value. For now, I chose 2^32-1. In other words, the legitimate logic here, in my view, would be to trigger autovacuum if the number of offsets (i.e. the difference nextOffset - oldestOffset) exceeds 2^32-1. PFA patch set.
--
Best regards,
Maxim Orlov.
On 13/11/2024 17:44, Maxim Orlov wrote:
> On Tue, 12 Nov 2024 at 02:31, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On a different note, I'm surprised you're rewriting member segments from
> scratch, parsing all the individual member groups and writing them out
> again. There's no change to the members file format, except for the
> numbering of the files, so you could just copy the files under the new
> names without paying attention to the contents. It's not wrong to parse
> them in detail, but I'd assume that it would be simpler not to.
>
> Yes, at the beginning I also thought that it would be possible to get by
> with simple copying. But in case of wraparound, we must "bypass" the invalid
> zero offset value. See, old 32-bit offsets wrapped at 2^32, thus 0
> values appear in multixact.c, so they must be handled. Bypassed, in fact.
> When we switch to 64-bit offsets, we have two options:
> 1) bypass every ((uint32) offset == 0) value in multixact.c;
> 2) convert members and bypass the invalid value once.
>
> The first option seems too weird to me. So, we have to repack members
> and bypass the invalid value.

Hmm, so if I understand correctly, this is related to how we determine the length of the members array, by looking at the next multixid's offset. This is explained in GetMultiXactIdMembers:

> /*
>  * Find out the offset at which we need to start reading MultiXactMembers
>  * and the number of members in the multixact. We determine the latter as
>  * the difference between this multixact's starting offset and the next
>  * one's. However, there are some corner cases to worry about:
>  *
>  * 1. This multixact may be the latest one created, in which case there is
>  * no next one to look at. In this case the nextOffset value we just
>  * saved is the correct endpoint.
>  *
>  * 2. The next multixact may still be in process of being filled in: that
>  * is, another process may have done GetNewMultiXactId but not yet written
>  * the offset entry for that ID. In that scenario, it is guaranteed that
>  * the offset entry for that multixact exists (because GetNewMultiXactId
>  * won't release MultiXactGenLock until it does) but contains zero
>  * (because we are careful to pre-zero offset pages). Because
>  * GetNewMultiXactId will never return zero as the starting offset for a
>  * multixact, when we read zero as the next multixact's offset, we know we
>  * have this case. We handle this by sleeping on the condition variable
>  * we have just for this; the process in charge will signal the CV as soon
>  * as it has finished writing the multixact offset.
>  *
>  * 3. Because GetNewMultiXactId increments offset zero to offset one to
>  * handle case #2, there is an ambiguity near the point of offset
>  * wraparound. If we see next multixact's offset is one, is that our
>  * multixact's actual endpoint, or did it end at zero with a subsequent
>  * increment? We handle this using the knowledge that if the zero'th
>  * member slot wasn't filled, it'll contain zero, and zero isn't a valid
>  * transaction ID so it can't be a multixact member. Therefore, if we
>  * read a zero from the members array, just ignore it.
>  *
>  * This is all pretty messy, but the mess occurs only in infrequent corner
>  * cases, so it seems better than holding the MultiXactGenLock for a long
>  * time on every multixact creation.
>  */

With 64-bit offsets, can we assume that it never wraps around? We often treat 2^64 as "large enough that we'll never run out", e.g. LSNs are also assumed to never wrap around. I think that would be a safe assumption here too. If we accept that, we don't need to worry about case 3 anymore.

But if we upgrade wrapped-around members files by just renaming them, there could still be a members array where we had skipped offset 0, and reading that after the upgrade might get confused. We could continue to ignore a 0 XID in the members array like the comment says; I think that would be enough. But yeah, maybe it's better to bite the bullet in pg_upgrade and squeeze those out. Does your upgrade test suite include case 3, where the next multixact's offset is 1?

Can we remove MaybeExtendOffsetSlru() now? There are a bunch of other comments and checks that talk about binary-upgraded values too that we can hopefully clean up now.

If we are to parse the member segments in detail in upgrade anyway, I'd be tempted to make some further changes / optimizations:

- You could leave out all locking XID members in upgrade, because they're not relevant after upgrade any more (all the XIDs will be committed or aborted and have released the locks; we require prepared transactions to be completed before upgrading too). It'd be enough to include actual UPDATE/DELETE XIDs.

- The way we determine the length of the members array by looking at the next multixid's offset is a bit complicated. We could have one extra flag per XID in the members to indicate "this is the last member of this multixid". That could either replace the current mechanism of looking at the next offset, or be just an additional cross-check.

- Do we still like the "group" representation, with 4 bytes of flags followed by 4 XIDs? I wonder if it'd be better to just store 5 bytes per XID unaligned.

- A more radical idea: There can be only one updating XID in one multixid. We could store that directly in the offsets SLRU, and keep only the locking XIDs in members. That way, the members SLRU would become less critical; it could be safely reset on crash for example (except for prepared transactions, which could still be holding locks, but it'd still be less serious). Separating correctness-critical data from more ephemeral state is generally a good idea.

I'm not insisting on any of these changes, just some things that might be worth considering if we're rewriting the SLRUs on upgrade anyway.

--
Heikki Linnakangas
Neon (https://neon.tech)
On Fri, 15 Nov 2024 at 14:06, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Hmm, so if I understand correctly, this is related to how we determine
the length of the members array, by looking at the next multixid's
offset. This is explained in GetMultiXactIdMembers:
Correct.
If we accept that, we don't need to worry about case 3 anymore. But if
we upgrade wrapped-around members files by just renaming them, there
could still be a members array where we had skipped offset 0, and
reading that after the upgrade might get confused. We could continue to
ignore a 0 XID in the members array like the comment says; I think that
would be enough. But yeah, maybe it's better to bite the bullet in
pg_upgrade and squeeze those out.
Correct. I couldn't explain this better. I'm more for squeezing those out. Otherwise, we end up adding yet another hack in multixact, while one of the benefits of switching to 64 bits should be making the XID logic more straightforward. After all, the mxact juggling in pg_upgrade is a one-time inconvenience.
Does your upgrade test suite include case 3, where the next multixact's
offset is 1?
Not exactly.
simple
Latest checkpoint's NextMultiXactId: 119441
Latest checkpoint's NextMultiOffset: 5927049
offset-wrap
Latest checkpoint's NextMultiXactId: 119441
Latest checkpoint's NextMultiOffset: 5591183
multi-wrap
Latest checkpoint's NextMultiXactId: 82006
Latest checkpoint's NextMultiOffset: 7408811
offset-multi-wrap
Latest checkpoint's NextMultiXactId: 52146
Latest checkpoint's NextMultiOffset: 5591183
Do you want a test case where NextMultiOffset is 1?
Can we remove MaybeExtendOffsetSlru() now? There are a bunch of other
comments and checks that talk about binary-upgraded values too that we
can hopefully clean up now.
Yes, technically we can. But this is somewhat unrelated to the offsets and would make the patch set significantly more complicated, thus harder to review and less likely to be committed. Again, I'm not opposing the idea; I'm just not sure it is worth doing right now.
If we are to parse the member segments in detail in upgrade anyway, I'd
be tempted to make some further changes / optimizations:
- You could leave out all locking XID members in upgrade, because
they're not relevant after upgrade any more (all the XIDs will be
committed or aborted and have released the locks; we require prepared
transactions to be completed before upgrading too). It'd be enough to
include actual UPDATE/DELETE XIDs.
- The way we determine the length of the members array by looking at the
next multixid's offset is a bit complicated. We could have one extra
flag per XID in the members to indicate "this is the last member of this
multixid". That could either replace the current mechanism of looking
at the next offset, or be just an additional cross-check.
- Do we still like the "group" representation, with 4 bytes of flags
followed by 4 XIDs? I wonder if it'd be better to just store 5 bytes per
XID unaligned.
Not really. But I would leave that for the next iteration: switching multixact IDs to 64 bit. I already have some drafts for this. In any case, we'll have to make adjustments in pg_upgrade again. My goal is to move towards 64-bit XIDs in small steps, and I plan changes to the "group" representation in combination with switching multixact IDs to 64 bit. This seems a bit more appropriate in my view.
As for your optimization suggestions, I like them. I'm not against them, but I'm afraid of disrupting the clarity of the logic, especially since the algorithm is not the simplest.
--
Best regards,
Maxim Orlov.
On 27/12/2024 19:09, Maxim Orlov wrote:
> On Wed, 18 Dec 2024 at 13:21, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Does the pg_upgrade code work though, if you have that buggy situation
> where oldestOffsetKnown == false ? ...
>
> > if (!TransactionIdIsValid(*xactptr))
> > {
> >     /* Corner case 3: we must be looking at unused slot zero */
> >     Assert(offset == 0);
> >     continue;
> > }
>
> After upgrade, this corner case 3 would *not* happen on offset == 0. So
> looks like we're still missing test coverage for this upgrade corner
> case.
>
> Am I understanding correctly that you want to have a test corresponding
> to the buggy 9.3 and 9.4 era versions?

No, those were two different things. I think there might be two things wrong here:

1. I suspect pg_upgrade might not correctly handle the situation where oldestOffsetKnown == false, and
2. The above assertion in "corner case 3" would not hold. It seems that we don't have a test case for it, or it would've hit the assertion.

Now that I think about it, yes, a test case for 1 would be good too. But I was talking about 2.

> Do you think we could imitate this scenario on a current master branch
> like that:
> 1) generate a couple of offsets segments for the first table;
> 2) generate more segments for a second table;
> 3) drop first table;
> 4) stop pg cluster;
> 5) remove pg_multixact/offsets/0000
> 6) upgrade?

I don't remember off the top of my head. It might be best to just refuse the upgrade if oldestOffsetKnown == false. It's a very ancient corner case. It seems reasonable to require you to upgrade to a newer minor version and run VACUUM before upgrading. IIRC that sets oldestOffsetKnown.

--
Heikki Linnakangas
Neon (https://neon.tech)
On Thu, 2 Jan 2025 at 01:12, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
It might be best to just refuse the upgrade if oldestOffsetKnown==false.
It's a very ancient corner case. It seems reasonable to require you to
upgrade to a newer minor version and run VACUUM before upgrading. IIRC
that sets oldestOffsetKnown.
I agree. After all, we already have a ready-made solution in the form of VACUUM, don't we?
If I understand all this multixact_old.c machinery correctly, in the case of oldestOffsetKnown == false we should either fail with "could not open file", or the offset will be 0 in GetOldMultiXactIdSingleMember. So, I suppose we can put an analogue of the SimpleLruDoesPhysicalPageExist call at the beginning of GetOldMultiXactIdSingleMember, and if either SimpleLruDoesPhysicalPageExist returns false or the corresponding offset is 0, we bail out with "oldest offset does not exist, consider running VACUUM before pg_upgrade" or something similar. Please correct me if I'm wrong.
Best regards,
Maxim Orlov.