Re: Snapshot related assert failure on skink - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: Snapshot related assert failure on skink |
Date | |
Msg-id | 76bcbe27-4408-4f01-8623-4a6eff782312@iki.fi Whole thread Raw |
In response to | Re: Snapshot related assert failure on skink (Tomas Vondra <tomas@vondra.me>) |
Responses |
Re: Snapshot related assert failure on skink
|
List | pgsql-hackers |
On 24/03/2025 16:56, Tomas Vondra wrote: > > > On 3/23/25 17:43, Heikki Linnakangas wrote: >> On 21/03/2025 17:16, Andres Freund wrote: >>> Am I right in understanding that the only scenario (when in >>> STANDBY_SNAPSHOT_READY), where ExpireOldKnownAssignedTransactionIds() >>> would >>> "legally" remove a transaction, rather than the commit / abort records >>> doing >>> so, is if the primary crash-restarted while transactions were ongoing? >>> >>> Those transactions won't have a commit/abort records and thus won't >>> trigger >>> ExpireTreeKnownAssignedTransactionIds(), which otherwise would have >>> updated >>> ->xactCompletionCount? >> >> Correct. >> >>> When writing the snapshot caching patch, I tried to make sure that all >>> the >>> places that maintain ->latestCompletedXid also update >>> ->xactCompletionCount. Afaict that's still the case. Which, I think, >>> means >>> that we're also missing calls to MaintainLatestCompletedXidRecovery()? >> >> Yep, I was just looking at that too. >> >>> If latestCompletedXid is incorrect visibility determinations end up >>> wrong... >> >> I think it happens to work, because the transactions are effectively >> aborted. latestCompletedXid is used to initialize xmax in >> GetSnapshotData. If, for example, latestCompletedXid is incorrectly set >> to 1000 even though XID 1001 already aborted, a snapshot with xmax=1000 >> still correctly considers XID 1001 as "not visible". As soon as a >> transaction commits, latestCompletedXid is fixed. >> >> AFAICS we could skip updating latestCompletedXid on aborts altogether >> and rename it to latestCommittedXid. But it's hardly worth optimizing >> for aborts. >> >> For the same reason, I believe the assertion failure we're discussing >> here is also harmless. Even though the reused snapshot has a smaller >> xmin than expected, all those transactions aborted and are thus not >> visible anyway. >> >> In any case, let's be tidy and fix both latestCompletedXid and >> xactCompletionCount. >> > > Thanks for looking into this and pushing the fix. > > Would it make sense to add a comment documenting this reasoning about > not handling aborts? Otherwise someone will get to rediscover this in > the future ... Well, we're currently _not_ relying on the reasoning about aborts not changing visibility. So it's not like someone will forget the reasoning and have a bug as a result. I guess someone could rediscover that reasoning and think that they can skip advancing those counters on aborts as an optimization, re-introducing the assertion failure. But I believe that was not why we had this bug, it was just a simple omission. So the comment would look like this: "In principle, we would not need to advance xactCompletionCount and latestCompletedXid because aborting transactions don't change visibility. But that could make xmin within a transaction go backwards, if a snapshot with an older xmin but same xactCompletionCount is reused, triggering the assertion in GetSnapshotDataReuse()." If we add that, I guess it should go into ProcArrayEndTransaction() which currently just notes that it's used for commits and aborts interchangeably. Do you think it's worth it? Want to propose a patch? -- Heikki Linnakangas Neon (https://neon.tech)
pgsql-hackers by date: