Re: Snapshot related assert failure on skink - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Snapshot related assert failure on skink
Date
Msg-id 33c9a3ed-b850-4b21-ac6c-9d9b3d7fe155@vondra.me
Whole thread Raw
In response to Re: Snapshot related assert failure on skink  (Heikki Linnakangas <hlinnaka@iki.fi>)
List pgsql-hackers

On 3/24/25 16:25, Heikki Linnakangas wrote:
> On 24/03/2025 16:56, Tomas Vondra wrote:
>>
>>
>> On 3/23/25 17:43, Heikki Linnakangas wrote:
>>> On 21/03/2025 17:16, Andres Freund wrote:
>>>> Am I right in understanding that the only scenario (when in
>>>> STANDBY_SNAPSHOT_READY), where ExpireOldKnownAssignedTransactionIds()
>>>> would
>>>> "legally" remove a transaction, rather than the commit / abort records
>>>> doing
>>>> so, is if the primary crash-restarted while transactions were ongoing?
>>>>
>>>> Those transactions won't have a commit/abort records and thus won't
>>>> trigger
>>>> ExpireTreeKnownAssignedTransactionIds(), which otherwise would have
>>>> updated
>>>> ->xactCompletionCount?
>>>
>>> Correct.
>>>
>>>> When writing the snapshot caching patch, I tried to make sure that all
>>>> the
>>>> places that maintain ->latestCompletedXid also update
>>>> ->xactCompletionCount. Afaict that's still the case. Which, I think,
>>>> means
>>>> that we're also missing calls to MaintainLatestCompletedXidRecovery()?
>>>
>>> Yep, I was just looking at that too.
>>>
>>>> If latestCompletedXid is incorrect visibility determinations end up
>>>> wrong...
>>>
>>> I think it happens to work, because the transactions are effectively
>>> aborted. latestCompletedXid is used to initialize xmax in
>>> GetSnapshotData. If, for example, latestCompletedXid is incorrectly set
>>> to 1000 even though XID 1001 already aborted, a snapshot with xmax=1000
>>> still correctly considers XID 1001 as "not visible". As soon as a
>>> transaction commits, latestCompletedXid is fixed.
>>>
>>> AFAICS we could skip updating latestCompletedXid on aborts altogether
>>> and rename it to latestCommittedXid. But it's hardly worth optimizing
>>> for aborts.
>>>
>>> For the same reason, I believe the assertion failure we're discussing
>>> here is also harmless. Even though the reused snapshot has a smaller
>>> xmin than expected, all those transactions aborted and are thus not
>>> visible anyway.
>>>
>>> In any case, let's be tidy and fix both latestCompletedXid and
>>> xactCompletionCount.
>>>
>>
>> Thanks for looking into this and pushing the fix.
>>
>> Would it make sense to add a comment documenting this reasoning about
>> not handling aborts? Otherwise someone will get to rediscover this in
>> the future ...
> 
> Well, we're currently _not_ relying on the reasoning about aborts not
> changing visibility. So it's not like someone will forget the reasoning
> and have a bug as a result. I guess someone could rediscover that
> reasoning and think that they can skip advancing those counters on
> aborts as an optimization, re-introducing the assertion failure. But I
> believe that was not why we had this bug, it was just a simple omission.
> 
> So the comment would look like this:
> 
> "In principle, we would not need to advance xactCompletionCount and
> latestCompletedXid because aborting transactions don't change
> visibility. But that could make xmin within a transaction go backwards,
> if a snapshot with an older xmin but same xactCompletionCount is reused,
> triggering the assertion in GetSnapshotDataReuse()."
> 
> If we add that, I guess it should go into ProcArrayEndTransaction()
> which currently just notes that it's used for commits and aborts
> interchangeably. Do you think it's worth it? Want to propose a patch?
> 

I think the wording you suggested above is reasonable - I certainly
don't have a better one. I don't know what's the likelihood of someone
breaking this, but I think it makes sense to mention this because the
assert is non-trivial to hit.

regards

-- 
Tomas Vondra




pgsql-hackers by date:

Previous
From: Andrei Lepikhov
Date:
Subject: Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment
Next
From: David Rowley
Date:
Subject: Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment