Re: Snapshot related assert failure on skink - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Snapshot related assert failure on skink
Date
Msg-id 46e1ad87-9547-444c-838d-808834cf185d@vondra.me
Whole thread Raw
In response to Re: Snapshot related assert failure on skink  (Tomas Vondra <tomas@vondra.me>)
Responses Re: Snapshot related assert failure on skink
List pgsql-hackers
On 3/19/25 13:27, Tomas Vondra wrote:
> On 3/19/25 08:17, Heikki Linnakangas wrote:
>> On 19/03/2025 04:22, Tomas Vondra wrote:
>>> I kept stress-testing this, and while the frequency massively increased
>>> on PG18, I managed to reproduce this all the way back to PG14. I see
>>> ~100x more corefiles on PG18.
>>>
>>> That is not a proof the issue was introduced in PG14, maybe it's just
>>> the assert that was added there or something. Or maybe there's another
>>> bug in PG18, making the impact worse.
>>>
>>> But I'd suspect this is a bug in
>>>
>>> commit 623a9ba79bbdd11c5eccb30b8bd5c446130e521c
>>> Author: Andres Freund <andres@anarazel.de>
>>> Date:   Mon Aug 17 21:07:10 2020 -0700
>>>
>>>      snapshot scalability: cache snapshots using a xact completion
>>> counter.
>>>
>>>      Previous commits made it faster/more scalable to compute snapshots.
>>> But not
>>>      building a snapshot is still faster. Now that GetSnapshotData()
>>> does not
>>>      maintain RecentGlobal* anymore, that is actually not too hard:
>>>
>>>      ...
>>
>> Looking at the code, shouldn't ExpireAllKnownAssignedTransactionIds()
>> and ExpireOldKnownAssignedTransactionIds() update xactCompletionCount?
>> This can happen during hot standby:
>>
>> 1. Backend acquires snapshot A with xmin 1000
>> 2. Startup process calls ExpireOldKnownAssignedTransactionIds(),
>> 3. Backend acquires snapshot B with xmin 1050
>> 4. Backend releases snapshot A, updating TransactionXmin to 1050
>> 5. Backend acquires new snapshot, calls GetSnapshotDataReuse(), reusing
>> snapshot A's data.
>>
>> Because xactCompletionCount is not updated in step 2, the
>> GetSnapshotDataReuse() call will reuse the snapshot A. But snapshot A
>> has a lower xmin.
>>
> 
> Could be. As an experiment I added xactCompletionCount advance to the
> two functions you mentioned, and I ran the stress test again. I haven't
> seen any failures so far, after ~1000 runs. Without the patch this
> produced ~200 failures/core files.
> 

I kept stress-testing this (without the fix), trying to figure out why
the frequency of failures got so much higher on PG18. While discussing
this off-list with Andres, he was wondering if maybe there's a second
independent bug on PG18, with the same symptoms.

It took quite a bit of time, but I managed to narrow this down to ~mid
December 2024, commit 952365cded6. The following table shows the number
of assert failures on a single run of the stress test, on 4 different
machines.

                   xeon      ryzen     rpi5-32    rpi5-64
  -------------------------------------------------------
  1585ff7387d       247         52          69         28
  952365cded6       199         30           -          -
  7ec4b9ff80d         1          7           -          -
  7f97b4734f9         3         17           -          -
  578a7fe7b6f         -          -           -          -
  db448ce5ad3         -          -           -          -
  1f81b48a9d5         -          -           -          -
  d5a7bd5670c         0         11           1          0

There's a fair amount of randomness - partially due to the bug being a
race condition, and thus time-sensitive. And then also the test suite is
randomized.

But I think it pretty clearly shows a clear change in these commits. The
rpi5 machines take much longer, so I only have the two results for now.

But it seems it changed in 952365cded6, which is:

    commit 952365cded635e54c4177399c0280cb7a5e34c11
    Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
    Date:   Mon Dec 23 12:42:39 2024 +0200

    Remove unnecessary GetTransactionSnapshot() calls

    In get_database_list() and get_subscription_list(), the
    GetTransactionSnapshot() call is not required because the catalog
    table scans use the catalog snapshot, which is held until the end of
    the scan. See table_beginscan_catalog(), which calls
    RegisterSnapshot(GetCatalogSnapshot(relid)).

    ...

I'm not claiming the commit is buggy - it might be, but I think it's
more likely it just made the preexisting bug easier to hit. I base this
on the observation that incrementing the xactCompletionCount made the
assert failures go away.


regards

-- 
Tomas Vondra




pgsql-hackers by date:

Previous
From: Álvaro Herrera
Date:
Subject: Re: support virtual generated column not null constraint
Next
From: Ni Ku
Date:
Subject: Re: Changing shared_buffers without restart