Re: Snapshot related assert failure on skink - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: Snapshot related assert failure on skink |
Date | |
Msg-id | 46e1ad87-9547-444c-838d-808834cf185d@vondra.me Whole thread Raw |
In response to | Re: Snapshot related assert failure on skink (Tomas Vondra <tomas@vondra.me>) |
Responses |
Re: Snapshot related assert failure on skink
|
List | pgsql-hackers |
On 3/19/25 13:27, Tomas Vondra wrote: > On 3/19/25 08:17, Heikki Linnakangas wrote: >> On 19/03/2025 04:22, Tomas Vondra wrote: >>> I kept stress-testing this, and while the frequency massively increased >>> on PG18, I managed to reproduce this all the way back to PG14. I see >>> ~100x more corefiles on PG18. >>> >>> That is not a proof the issue was introduced in PG14, maybe it's just >>> the assert that was added there or something. Or maybe there's another >>> bug in PG18, making the impact worse. >>> >>> But I'd suspect this is a bug in >>> >>> commit 623a9ba79bbdd11c5eccb30b8bd5c446130e521c >>> Author: Andres Freund <andres@anarazel.de> >>> Date: Mon Aug 17 21:07:10 2020 -0700 >>> >>> snapshot scalability: cache snapshots using a xact completion >>> counter. >>> >>> Previous commits made it faster/more scalable to compute snapshots. >>> But not >>> building a snapshot is still faster. Now that GetSnapshotData() >>> does not >>> maintain RecentGlobal* anymore, that is actually not too hard: >>> >>> ... >> >> Looking at the code, shouldn't ExpireAllKnownAssignedTransactionIds() >> and ExpireOldKnownAssignedTransactionIds() update xactCompletionCount? >> This can happen during hot standby: >> >> 1. Backend acquires snapshot A with xmin 1000 >> 2. Startup process calls ExpireOldKnownAssignedTransactionIds(), >> 3. Backend acquires snapshot B with xmin 1050 >> 4. Backend releases snapshot A, updating TransactionXmin to 1050 >> 5. Backend acquires new snapshot, calls GetSnapshotDataReuse(), reusing >> snapshot A's data. >> >> Because xactCompletionCount is not updated in step 2, the >> GetSnapshotDataReuse() call will reuse the snapshot A. But snapshot A >> has a lower xmin. >> > > Could be. As an experiment I added xactCompletionCount advance to the > two functions you mentioned, and I ran the stress test again. I haven't > seen any failures so far, after ~1000 runs. Without the patch this > produced ~200 failures/core files. > I kept stress-testing this (without the fix), trying to figure out why the frequency of failures got so much higher on PG18. While discussing this off-list with Andres, he was wondering if maybe there's a second independent bug on PG18, with the same symptoms. It took quite a bit of time, but I managed to narrow this down to ~mid December 2024, commit 952365cded6. The following table shows the number of assert failures on a single run of the stress test, on 4 different machines. xeon ryzen rpi5-32 rpi5-64 ------------------------------------------------------- 1585ff7387d 247 52 69 28 952365cded6 199 30 - - 7ec4b9ff80d 1 7 - - 7f97b4734f9 3 17 - - 578a7fe7b6f - - - - db448ce5ad3 - - - - 1f81b48a9d5 - - - - d5a7bd5670c 0 11 1 0 There's a fair amount of randomness - partially due to the bug being a race condition, and thus time-sensitive. And then also the test suite is randomized. But I think it pretty clearly shows a clear change in these commits. The rpi5 machines take much longer, so I only have the two results for now. But it seems it changed in 952365cded6, which is: commit 952365cded635e54c4177399c0280cb7a5e34c11 Author: Heikki Linnakangas <heikki.linnakangas@iki.fi> Date: Mon Dec 23 12:42:39 2024 +0200 Remove unnecessary GetTransactionSnapshot() calls In get_database_list() and get_subscription_list(), the GetTransactionSnapshot() call is not required because the catalog table scans use the catalog snapshot, which is held until the end of the scan. See table_beginscan_catalog(), which calls RegisterSnapshot(GetCatalogSnapshot(relid)). ... I'm not claiming the commit is buggy - it might be, but I think it's more likely it just made the preexisting bug easier to hit. I base this on the observation that incrementing the xactCompletionCount made the assert failures go away. regards -- Tomas Vondra
pgsql-hackers by date: