Re: Snapshot related assert failure on skink - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Snapshot related assert failure on skink
Date
Msg-id 76bcbe27-4408-4f01-8623-4a6eff782312@iki.fi
Whole thread Raw
In response to Re: Snapshot related assert failure on skink  (Tomas Vondra <tomas@vondra.me>)
Responses Re: Snapshot related assert failure on skink
List pgsql-hackers
On 24/03/2025 16:56, Tomas Vondra wrote:
> 
> 
> On 3/23/25 17:43, Heikki Linnakangas wrote:
>> On 21/03/2025 17:16, Andres Freund wrote:
>>> Am I right in understanding that the only scenario (when in
>>> STANDBY_SNAPSHOT_READY), where ExpireOldKnownAssignedTransactionIds()
>>> would
>>> "legally" remove a transaction, rather than the commit / abort records
>>> doing
>>> so, is if the primary crash-restarted while transactions were ongoing?
>>>
>>> Those transactions won't have a commit/abort records and thus won't
>>> trigger
>>> ExpireTreeKnownAssignedTransactionIds(), which otherwise would have
>>> updated
>>> ->xactCompletionCount?
>>
>> Correct.
>>
>>> When writing the snapshot caching patch, I tried to make sure that all
>>> the
>>> places that maintain ->latestCompletedXid also update
>>> ->xactCompletionCount. Afaict that's still the case. Which, I think,
>>> means
>>> that we're also missing calls to MaintainLatestCompletedXidRecovery()?
>>
>> Yep, I was just looking at that too.
>>
>>> If latestCompletedXid is incorrect visibility determinations end up
>>> wrong...
>>
>> I think it happens to work, because the transactions are effectively
>> aborted. latestCompletedXid is used to initialize xmax in
>> GetSnapshotData. If, for example, latestCompletedXid is incorrectly set
>> to 1000 even though XID 1001 already aborted, a snapshot with xmax=1000
>> still correctly considers XID 1001 as "not visible". As soon as a
>> transaction commits, latestCompletedXid is fixed.
>>
>> AFAICS we could skip updating latestCompletedXid on aborts altogether
>> and rename it to latestCommittedXid. But it's hardly worth optimizing
>> for aborts.
>>
>> For the same reason, I believe the assertion failure we're discussing
>> here is also harmless. Even though the reused snapshot has a smaller
>> xmin than expected, all those transactions aborted and are thus not
>> visible anyway.
>>
>> In any case, let's be tidy and fix both latestCompletedXid and
>> xactCompletionCount.
>>
> 
> Thanks for looking into this and pushing the fix.
> 
> Would it make sense to add a comment documenting this reasoning about
> not handling aborts? Otherwise someone will get to rediscover this in
> the future ...

Well, we're currently _not_ relying on the reasoning about aborts not 
changing visibility. So it's not like someone will forget the reasoning 
and have a bug as a result. I guess someone could rediscover that 
reasoning and think that they can skip advancing those counters on 
aborts as an optimization, re-introducing the assertion failure. But I 
believe that was not why we had this bug, it was just a simple omission.

So the comment would look like this:

"In principle, we would not need to advance xactCompletionCount and 
latestCompletedXid because aborting transactions don't change 
visibility. But that could make xmin within a transaction go backwards, 
if a snapshot with an older xmin but same xactCompletionCount is reused, 
triggering the assertion in GetSnapshotDataReuse()."

If we add that, I guess it should go into ProcArrayEndTransaction() 
which currently just notes that it's used for commits and aborts 
interchangeably. Do you think it's worth it? Want to propose a patch?

-- 
Heikki Linnakangas
Neon (https://neon.tech)




pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: vacuum_truncate configuration parameter and isset_offset
Next
From: "David G. Johnston"
Date:
Subject: Re: vacuum_truncate configuration parameter and isset_offset