Re: Snapshot related assert failure on skink - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: Snapshot related assert failure on skink
Date:
Msg-id: f357793e-d5b0-4e56-9676-c61ddb3640b7@vondra.me
In response to: Snapshot related assert failure on skink (Andres Freund <andres@anarazel.de>)
Responses: Re: Snapshot related assert failure on skink
List: pgsql-hackers
On 3/4/25 23:25, Andres Freund wrote:
> Hi,
>
> I just saw a BF failure on skink (valgrind) that asserts out.
>
> Check the 002_compare_backups failure in:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2025-03-04%2017%3A35%3A01
>
> TRAP: failed Assert("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)"), File: "../pgsql/src/backend/storage/ipc/procarray.c", Line: 2132, PID: 3115649
> postgres: pitr2: bf postgres [local] SELECT(ExceptionalCondition+0x5f) [0x6e45e0]
> postgres: pitr2: bf postgres [local] SELECT(+0x467c3d) [0x56fc3d]
> postgres: pitr2: bf postgres [local] SELECT(GetSnapshotData+0x60) [0x570fbd]
> postgres: pitr2: bf postgres [local] SELECT(GetNonHistoricCatalogSnapshot+0x53) [0x7285c5]
> postgres: pitr2: bf postgres [local] SELECT(GetCatalogSnapshot+0x28) [0x72969b]
> postgres: pitr2: bf postgres [local] SELECT(systable_beginscan+0xc9) [0x252602]
> postgres: pitr2: bf postgres [local] SELECT(+0x5c1e56) [0x6c9e56]
> postgres: pitr2: bf postgres [local] SELECT(+0x5c20e9) [0x6ca0e9]
> postgres: pitr2: bf postgres [local] SELECT(SearchCatCache+0x18) [0x6ca71c]
> postgres: pitr2: bf postgres [local] SELECT(SearchSysCache+0x21) [0x6deeae]
> postgres: pitr2: bf postgres [local] SELECT(GetSysCacheOid+0x23) [0x6df219]
> postgres: pitr2: bf postgres [local] SELECT(get_namespace_oid+0x2f) [0x2dbae0]
> postgres: pitr2: bf postgres [local] SELECT(+0x1d3c13) [0x2dbc13]
> postgres: pitr2: bf postgres [local] SELECT(+0x1d3e1b) [0x2dbe1b]
> postgres: pitr2: bf postgres [local] SELECT(+0x1d3e72) [0x2dbe72]
> postgres: pitr2: bf postgres [local] SELECT(FuncnameGetCandidates+0xaf) [0x2ddef1]
> postgres: pitr2: bf postgres [local] SELECT(func_get_detail+0x96) [0x4bf521]
> postgres: pitr2: bf postgres [local] SELECT(ParseFuncOrColumn+0x670) [0x4c13b9]
> postgres: pitr2: bf postgres [local] SELECT(+0x3b1123) [0x4b9123]
> postgres: pitr2: bf postgres [local] SELECT(+0x3b12ec) [0x4b92ec]
> postgres: pitr2: bf postgres [local] SELECT(+0x3b2c60) [0x4bac60]
> postgres: pitr2: bf postgres [local] SELECT(+0x3b12dc) [0x4b92dc]
> postgres: pitr2: bf postgres [local] SELECT(transformExpr+0x20) [0x4b8faa]
> postgres: pitr2: bf postgres [local] SELECT(transformTargetEntry+0x7c) [0x4cf4f9]
> postgres: pitr2: bf postgres [local] SELECT(transformTargetList+0x96) [0x4cf5a6]
> postgres: pitr2: bf postgres [local] SELECT(+0x39a5b9) [0x4a25b9]
> postgres: pitr2: bf postgres [local] SELECT(transformStmt+0x141) [0x4a304b]
> postgres: pitr2: bf postgres [local] SELECT(+0x39c390) [0x4a4390]
> postgres: pitr2: bf postgres [local] SELECT(transformTopLevelStmt+0x19) [0x4a43cc]
> postgres: pitr2: bf postgres [local] SELECT(parse_analyze_fixedparams+0x56) [0x4a4424]
> postgres: pitr2: bf postgres [local] SELECT(pg_analyze_and_rewrite_fixedparams+0x40) [0x59b021]
> postgres: pitr2: bf postgres [local] SELECT(+0x4937f6) [0x59b7f6]
> postgres: pitr2: bf postgres [local] SELECT(PostgresMain+0x940) [0x59db54]
> postgres: pitr2: bf postgres [local] SELECT(BackendMain+0x59) [0x596b38]
> postgres: pitr2: bf postgres [local] SELECT(postmaster_child_launch+0xeb) [0x4ee282]
> postgres: pitr2: bf postgres [local] SELECT(+0x3e9271) [0x4f1271]
> postgres: pitr2: bf postgres [local] SELECT(+0x3eadb6) [0x4f2db6]
> postgres: pitr2: bf postgres [local] SELECT(PostmasterMain+0x118b) [0x4f43b2]
> postgres: pitr2: bf postgres [local] SELECT(main+0x1e2) [0x421da2]
> /lib/x86_64-linux-gnu/libc.so.6(+0x29ca8) [0x5a27ca8]
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x5a27d65]
> postgres: pitr2: bf postgres [local] SELECT(_start+0x21) [0x1eb421]
> 2025-03-04 19:11:37.941 UTC [3104796][postmaster][:0] LOG: client backend (PID 3115649) was terminated by signal 6: Aborted
> 2025-03-04 19:11:37.941 UTC [3104796][postmaster][:0] DETAIL: Failed process was running: SELECT NOT pg_is_in_recovery();
> 2025-03-04 19:11:37.943 UTC [3104796][postmaster][:0] LOG: terminating any other active server processes
>
>
> The code directly in question hasn't changed in a few years. I haven't seen
> this crash before. The only recent-ish changes in the wider area are:
>
> commit 952365cded6
> Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
> Date:   2024-12-23 12:42:39 +0200
>
>     Remove unnecessary GetTransactionSnapshot() calls
>
> commit 578a7fe7b6f8484f6d7caa2fda288abb3fe87aa0
> Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
> Date:   2024-12-21 23:42:39 +0200
>
>     Update TransactionXmin when MyProc->xmin is updated
>
> but neither is immediately yelling "here" to me.
>
>
> I don't really see how pg_combinebackup could lead to this issue either.
>
>
> One potentially interesting bit is that the statement is running *during* a
> shutdown checkpoint:
>
> 2025-03-04 19:11:37.377 UTC [3106505][checkpointer][:0] LOG: checkpoint starting: end-of-recovery immediate wait
> 2025-03-04 19:11:37.454 UTC [3115649][client backend][2/2:0] LOG: statement: SELECT NOT pg_is_in_recovery();
> TRAP: failed Assert("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)"), File: "../pgsql/src/backend/storage/ipc/procarray.c", Line: 2132, PID: 3115649
> ...
> 2025-03-04 19:11:37.585 UTC [3106505][checkpointer][:0] LOG: checkpoint complete: wrote 0 buffers (0.0%), wrote 3 SLRU buffers; 0 WAL file(s) added, 0 removed, 0 recycled; write=0.039 s, sync=0.010 s, total=0.198 s; sync files=2, longest=0.006 s, average=0.004 s; distance=16384 kB, estimate=16384 kB; lsn=0/5000060, redo lsn=0/5000060
>
>
> I wonder if the problem is related to a transaction first using a snapshot
> started in recovery and then a "normal" snapshot?
>

Hi,

Alexander Lakhin pointed me to this thread after I posted about hitting
the same assert [1].

[1] https://www.postgresql.org/message-id/3aee3535-184d-423e-a5eb-05161e8b5617%40vondra.me

It very much seems like the same issue - the backtrace seems to match
exactly, except for the query executed, of course. The exact mechanism
may be a bit different - in my stress test (restarting the primary/standby
in different ways, while running pgbench on both instances, etc.) there's
no end-of-recovery checkpoint, for example.

In my tests, 100% of the failures hit the assert immediately after these
messages in the server log (all failures are on the standby):

LOG: database system is ready to accept read-only connections
LOG: started streaming WAL from primary at 2/86000000 on timeline 1

And by "immediately" I mean it's the very next thing in the log. It seems
something happens (to the RecentXmin?) when opening the standby for
read-only queries that breaks an existing snapshot ... I'm not sure if
the end-of-recovery checkpoint does something similar.

I'm still fiddling with the script, trying to increase the probability of
the (apparent) race condition. On one machine (an old Xeon) I can hit it
very easily/reliably, while on a different machine (a new Ryzen) it's
very rare. I don't know if that's due to the difference in CPU speed, or
fewer cores, ... I guess it changes the timing just enough.
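For reference, my mental model of the invariant behind the assert is
roughly the following - this is a simplified standalone sketch, not the
actual procarray.c code (the names match, the logic is paraphrased and
ignores permanent/special XIDs): TransactionXmin is the xmin the backend
established with its first snapshot in the transaction, RecentXmin is the
xmin of the most recently computed snapshot, and GetSnapshotData() insists
the latter never moves backwards past the former.

/*
 * Simplified, standalone illustration of the invariant behind the
 * failing assert. NOT the actual PostgreSQL sources, just a paraphrase
 * to make the failure mode concrete. XIDs are 32-bit counters compared
 * modulo 2^32, hence the signed-difference check.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define InvalidTransactionId ((TransactionId) 0)

static bool
TransactionIdPrecedesOrEquals(TransactionId id1, TransactionId id2)
{
	/* modulo-2^32 comparison (ignoring permanent XIDs) */
	return (int32_t) (id1 - id2) <= 0;
}

/* xmin the backend advertised when it took its first snapshot */
static TransactionId TransactionXmin = InvalidTransactionId;
/* xmin of the most recently computed snapshot */
static TransactionId RecentXmin = InvalidTransactionId;

/* stand-in for the tail end of GetSnapshotData() */
static void
get_snapshot(TransactionId computed_xmin)
{
	if (TransactionXmin == InvalidTransactionId)
		TransactionXmin = computed_xmin;	/* first snapshot in the xact */
	RecentXmin = computed_xmin;

	/*
	 * Later snapshots in the same transaction must not compute an xmin
	 * older than the one already advertised - this is the check that
	 * fires in the skink report.
	 */
	assert(TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin));
}

int
main(void)
{
	get_snapshot(1000);	/* e.g. a snapshot taken while still in recovery */
	get_snapshot(990);	/* later snapshot with an older xmin => aborts */
	return 0;
}

So if a snapshot computed right after the standby opens for read-only
connections can end up with a lower xmin than one computed during
recovery, that would trip exactly this check.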
I've also tried running the stress test on PG17, and I have yet to see a
single failure there - not even on the Xeon machine that hits it reliably
on 18. So this seems to be a PG18-only issue.

If needed, I can try adding more logging, or test a patch.

regards

-- 
Tomas Vondra