Re: Snapshot related assert failure on skink - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Snapshot related assert failure on skink
Date
Msg-id 27a69a5e-7672-4ee0-ae06-62cd520697a8@iki.fi
Whole thread Raw
In response to Re: Snapshot related assert failure on skink  (Tomas Vondra <tomas@vondra.me>)
List pgsql-hackers
On 21/03/2025 12:28, Tomas Vondra wrote:
> But it seems it changed in 952365cded6, which is:
> 
>      commit 952365cded635e54c4177399c0280cb7a5e34c11
>      Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
>      Date:   Mon Dec 23 12:42:39 2024 +0200
> 
>      Remove unnecessary GetTransactionSnapshot() calls
> 
>      In get_database_list() and get_subscription_list(), the
>      GetTransactionSnapshot() call is not required because the catalog
>      table scans use the catalog snapshot, which is held until the end of
>      the scan. See table_beginscan_catalog(), which calls
>      RegisterSnapshot(GetCatalogSnapshot(relid)).
> 
>      ...
> 
> I'm not claiming the commit is buggy - it might be, but I think it's
> more likely it just made the preexisting bug easier to hit. I base this
> on the observation that incrementing the xactCompletionCount made the
> assert failures go away.

That commit removed three GetTransactionSnapshot() calls: one in 
autovacuum, one in logical replication launcher, and one in 
InitPostgres(). It must be the one in InitPostgres() that makes the 
difference, because the other two are not executed in a standby.

I was able to reproduce this and to catch it in action with 'rr'. After 
commit 952365cded6, the sequence of events looks like this:

1. At backend startup, GetCatalogSnapshot() sets 
CatalogSnapshotData.xmin=1000, xactCompletionCount=10
2. On first query, GetTransactionSnapshot() sets 
CurrentSnapshotData.xmin=1002, xactCompletionCount=10
3. GetCatalogSnapshot() is called again. It tries to reuse the snapshot 
with xmin=1000 and hits the assertion.

Before commit 952365cded6, which removed the GetTransactionSnapshot() 
call from InitPostgres, this usually happens instead:

0. At backend startup, GetTransactionSnapshot() sets 
CurrentSnapshotData.xmin=1000, xactCompletionCount=10
1. At backend startup, GetCatalogSnapshot() sets 
CatalogSnapshotData.xmin=1000, xactCompletionCount=10
2. On first query, GetTransactionSnapshot() reuses the snapshot with 
CurrentSnapshotData.xmin=1002, xactCompletionCount=10
3. GetCatalogSnapshot() is called again. It successfully reuses the 
snapshot  with xmin=1000

In other words, the GetTransactionSnapshot() call initializes the 
CurrentSnapshotData with the snapshot with an earlier xmin. The first 
GetCatalogSnapshot() call happens almost immediately after that, so 
they're very likely - but not guaranteed - to get the same snapshot. 
When I removed the GetTransactionSnapshot() call, the gap between the 
first GetTransactionSnapshot() and GetCatalogSnapshot() calls in the 
backend became much longer, making this more likely to happen.

In any case, the missing "xactCompletionCount++" in 
ExpireAllKnownAssignedTransactionIds() and 
ExpireOldKnownAssignedTransactionIds() is a clear bug and explains this. 
I will prepare and commit patches to fix that.

Thanks for the testing!

-- 
Heikki Linnakangas
Neon (https://neon.tech)




pgsql-hackers by date:

Previous
From: Shubham Khanna
Date:
Subject: Re: Enhance 'pg_createsubscriber' to retrieve databases automatically when no database is provided.
Next
From: Richard Guo
Date:
Subject: Re: Reduce "Var IS [NOT] NULL" quals during constant folding