Thread: Snapshot related assert failure on skink
Hi,

I just saw a BF failure on skink (valgrind) that asserts out.

Check the 002_compare_backups failure in:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2025-03-04%2017%3A35%3A01

TRAP: failed Assert("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)"), File: "../pgsql/src/backend/storage/ipc/procarray.c", Line: 2132, PID: 3115649
postgres: pitr2: bf postgres [local] SELECT(ExceptionalCondition+0x5f) [0x6e45e0]
postgres: pitr2: bf postgres [local] SELECT(+0x467c3d) [0x56fc3d]
postgres: pitr2: bf postgres [local] SELECT(GetSnapshotData+0x60) [0x570fbd]
postgres: pitr2: bf postgres [local] SELECT(GetNonHistoricCatalogSnapshot+0x53) [0x7285c5]
postgres: pitr2: bf postgres [local] SELECT(GetCatalogSnapshot+0x28) [0x72969b]
postgres: pitr2: bf postgres [local] SELECT(systable_beginscan+0xc9) [0x252602]
postgres: pitr2: bf postgres [local] SELECT(+0x5c1e56) [0x6c9e56]
postgres: pitr2: bf postgres [local] SELECT(+0x5c20e9) [0x6ca0e9]
postgres: pitr2: bf postgres [local] SELECT(SearchCatCache+0x18) [0x6ca71c]
postgres: pitr2: bf postgres [local] SELECT(SearchSysCache+0x21) [0x6deeae]
postgres: pitr2: bf postgres [local] SELECT(GetSysCacheOid+0x23) [0x6df219]
postgres: pitr2: bf postgres [local] SELECT(get_namespace_oid+0x2f) [0x2dbae0]
postgres: pitr2: bf postgres [local] SELECT(+0x1d3c13) [0x2dbc13]
postgres: pitr2: bf postgres [local] SELECT(+0x1d3e1b) [0x2dbe1b]
postgres: pitr2: bf postgres [local] SELECT(+0x1d3e72) [0x2dbe72]
postgres: pitr2: bf postgres [local] SELECT(FuncnameGetCandidates+0xaf) [0x2ddef1]
postgres: pitr2: bf postgres [local] SELECT(func_get_detail+0x96) [0x4bf521]
postgres: pitr2: bf postgres [local] SELECT(ParseFuncOrColumn+0x670) [0x4c13b9]
postgres: pitr2: bf postgres [local] SELECT(+0x3b1123) [0x4b9123]
postgres: pitr2: bf postgres [local] SELECT(+0x3b12ec) [0x4b92ec]
postgres: pitr2: bf postgres [local] SELECT(+0x3b2c60) [0x4bac60]
postgres: pitr2: bf postgres [local] SELECT(+0x3b12dc) [0x4b92dc]
postgres: pitr2: bf postgres [local] SELECT(transformExpr+0x20) [0x4b8faa]
postgres: pitr2: bf postgres [local] SELECT(transformTargetEntry+0x7c) [0x4cf4f9]
postgres: pitr2: bf postgres [local] SELECT(transformTargetList+0x96) [0x4cf5a6]
postgres: pitr2: bf postgres [local] SELECT(+0x39a5b9) [0x4a25b9]
postgres: pitr2: bf postgres [local] SELECT(transformStmt+0x141) [0x4a304b]
postgres: pitr2: bf postgres [local] SELECT(+0x39c390) [0x4a4390]
postgres: pitr2: bf postgres [local] SELECT(transformTopLevelStmt+0x19) [0x4a43cc]
postgres: pitr2: bf postgres [local] SELECT(parse_analyze_fixedparams+0x56) [0x4a4424]
postgres: pitr2: bf postgres [local] SELECT(pg_analyze_and_rewrite_fixedparams+0x40) [0x59b021]
postgres: pitr2: bf postgres [local] SELECT(+0x4937f6) [0x59b7f6]
postgres: pitr2: bf postgres [local] SELECT(PostgresMain+0x940) [0x59db54]
postgres: pitr2: bf postgres [local] SELECT(BackendMain+0x59) [0x596b38]
postgres: pitr2: bf postgres [local] SELECT(postmaster_child_launch+0xeb) [0x4ee282]
postgres: pitr2: bf postgres [local] SELECT(+0x3e9271) [0x4f1271]
postgres: pitr2: bf postgres [local] SELECT(+0x3eadb6) [0x4f2db6]
postgres: pitr2: bf postgres [local] SELECT(PostmasterMain+0x118b) [0x4f43b2]
postgres: pitr2: bf postgres [local] SELECT(main+0x1e2) [0x421da2]
/lib/x86_64-linux-gnu/libc.so.6(+0x29ca8) [0x5a27ca8]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x5a27d65]
postgres: pitr2: bf postgres [local] SELECT(_start+0x21) [0x1eb421]
2025-03-04 19:11:37.941 UTC [3104796][postmaster][:0] LOG: client backend (PID 3115649) was terminated by signal 6: Aborted
2025-03-04 19:11:37.941 UTC [3104796][postmaster][:0] DETAIL: Failed process was running: SELECT NOT pg_is_in_recovery();
2025-03-04 19:11:37.943 UTC [3104796][postmaster][:0] LOG: terminating any other active server processes

The code directly in question hasn't changed in a few years. I haven't seen
this crash before. The only recent-ish changes in the wider area are:

commit 952365cded6
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date:   2024-12-23 12:42:39 +0200

    Remove unnecessary GetTransactionSnapshot() calls

commit 578a7fe7b6f8484f6d7caa2fda288abb3fe87aa0
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date:   2024-12-21 23:42:39 +0200

    Update TransactionXmin when MyProc->xmin is updated

but neither is immediately yelling "here" to me.

I don't really see how pg_combinebackup could lead to this issue either.

One potentially interesting bit is that the statement is running *during* a
shutdown checkpoint:

2025-03-04 19:11:37.377 UTC [3106505][checkpointer][:0] LOG: checkpoint starting: end-of-recovery immediate wait
2025-03-04 19:11:37.454 UTC [3115649][client backend][2/2:0] LOG: statement: SELECT NOT pg_is_in_recovery();
TRAP: failed Assert("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)"), File: "../pgsql/src/backend/storage/ipc/procarray.c", Line: 2132, PID: 3115649
...
2025-03-04 19:11:37.585 UTC [3106505][checkpointer][:0] LOG: checkpoint complete: wrote 0 buffers (0.0%), wrote 3 SLRU buffers; 0 WAL file(s) added, 0 removed, 0 recycled; write=0.039 s, sync=0.010 s, total=0.198 s; sync files=2, longest=0.006 s, average=0.004 s; distance=16384 kB, estimate=16384 kB; lsn=0/5000060, redo lsn=0/5000060

I wonder if the problem is related to a transaction first using a snapshot
started in recovery and then a "normal" snapshot?

Greetings,

Andres Freund
On 3/4/25 23:25, Andres Freund wrote:
> Hi,
>
> I just saw a BF failure on skink (valgrind) that asserts out.
>
> Check the 002_compare_backups failure in:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2025-03-04%2017%3A35%3A01
>
> TRAP: failed Assert("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)"), File: "../pgsql/src/backend/storage/ipc/procarray.c", Line: 2132, PID: 3115649
> ...
>
> The code directly in question hasn't changed in a few years. I haven't seen
> this crash before. The only recent-ish changes in the wider area are:
>
> commit 952365cded6
> Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
> Date:   2024-12-23 12:42:39 +0200
>
>     Remove unnecessary GetTransactionSnapshot() calls
>
> commit 578a7fe7b6f8484f6d7caa2fda288abb3fe87aa0
> Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
> Date:   2024-12-21 23:42:39 +0200
>
>     Update TransactionXmin when MyProc->xmin is updated
>
> but neither is immediately yelling "here" to me.
>
> I don't really see how pg_combinebackup could lead to this issue either.
>
> One potentially interesting bit is that the statement is running *during* a
> shutdown checkpoint:
>
> 2025-03-04 19:11:37.377 UTC [3106505][checkpointer][:0] LOG: checkpoint starting: end-of-recovery immediate wait
> 2025-03-04 19:11:37.454 UTC [3115649][client backend][2/2:0] LOG: statement: SELECT NOT pg_is_in_recovery();
> TRAP: failed Assert("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)"), File: "../pgsql/src/backend/storage/ipc/procarray.c", Line: 2132, PID: 3115649
> ...
> 2025-03-04 19:11:37.585 UTC [3106505][checkpointer][:0] LOG: checkpoint complete: wrote 0 buffers (0.0%), wrote 3 SLRU buffers; 0 WAL file(s) added, 0 removed, 0 recycled; write=0.039 s, sync=0.010 s, total=0.198 s; sync files=2, longest=0.006 s, average=0.004 s; distance=16384 kB, estimate=16384 kB; lsn=0/5000060, redo lsn=0/5000060
>
> I wonder if the problem is related to a transaction first using a snapshot
> started in recovery and then a "normal" snapshot?

Hi,

Alexander Lakhin pointed me to this thread after I posted about hitting the
same assert [1].

[1] https://www.postgresql.org/message-id/3aee3535-184d-423e-a5eb-05161e8b5617%40vondra.me

It very much seems like the same issue - the backtrace seems to match
exactly, except for the query executed, of course. The exact mechanism may be
a bit different - in my stress test (restarting the primary/standby in
different ways, while running pgbench on both instances, etc.) there's no
end-of-recovery checkpoint, for example.

In my tests, 100% of the failures hit the assert immediately after these
messages in the server log (all failures are on the standby):

LOG: database system is ready to accept read-only connections
LOG: started streaming WAL from primary at 2/86000000 on timeline 1

And by "immediately" I mean it's the very next thing in the log. It seems
something happens (to the RecentXmin?) when opening the standby for read-only
queries that breaks an existing snapshot ... I'm not sure if the
end-of-recovery checkpoint does something similar.

I'm still fiddling with the script, trying to increase the probability
of the (apparent) race condition. On one machine (old Xeon) I can hit it
very easily/reliably, while on a different machine (new Ryzen) it's very
rare. I don't know if that's due to difference in speed of the CPU, or
fewer cores, ... I guess it changes the timing just enough.

I've also tried running the stress test on PG17, and I'm yet to see a
single failure there. Not even on the xeon machine, that hits it
reliably on 18. So this seems to be a PG18-only issue.

If needed, I can try adding more logging, or test a patch.

regards

--
Tomas Vondra
On 3/17/25 12:36, Tomas Vondra wrote:
> ...
>
> I'm still fiddling with the script, trying to increase the probability
> of the (apparent) race condition. On one machine (old Xeon) I can hit it
> very easily/reliably, while on a different machine (new Ryzen) it's very
> rare. I don't know if that's due to difference in speed of the CPU, or
> fewer cores, ... I guess it changes the timing just enough.
>
> I've also tried running the stress test on PG17, and I'm yet to see a
> single failure there. Not even on the xeon machine, that hits it
> reliably on 18. So this seems to be a PG18-only issue.
>

And of course, the moment I sent this, I got a failure on 17 too. But it
seems much harder to hit (compared to 18).

regards

--
Tomas Vondra
On Tue, Mar 18, 2025 at 12:59 AM Tomas Vondra <tomas@vondra.me> wrote:
> On 3/17/25 12:36, Tomas Vondra wrote:
> > I'm still fiddling with the script, trying to increase the probability
> > of the (apparent) race condition. On one machine (old Xeon) I can hit it
> > very easily/reliably, while on a different machine (new Ryzen) it's very
> > rare. I don't know if that's due to difference in speed of the CPU, or
> > fewer cores, ... I guess it changes the timing just enough.
> >
> > I've also tried running the stress test on PG17, and I'm yet to see a
> > single failure there. Not even on the xeon machine, that hits it
> > reliably on 18. So this seems to be a PG18-only issue.
> >
>
> And of course, the moment I sent this, I got a failure on 17 too. But it
> seems much harder to hit (compared to 18).

Could there be a connection to this commit?

commit 119c23eb9819213551cbe7e7665c8b493c59ceee
Author: Nathan Bossart <nathan@postgresql.org>
Date:   Tue Sep 5 13:59:06 2023 -0700

    Replace known_assigned_xids_lck with memory barriers.
On 3/17/25 13:18, Thomas Munro wrote:
> On Tue, Mar 18, 2025 at 12:59 AM Tomas Vondra <tomas@vondra.me> wrote:
>> On 3/17/25 12:36, Tomas Vondra wrote:
>>> I'm still fiddling with the script, trying to increase the probability
>>> of the (apparent) race condition. On one machine (old Xeon) I can hit it
>>> very easily/reliably, while on a different machine (new Ryzen) it's very
>>> rare. I don't know if that's due to difference in speed of the CPU, or
>>> fewer cores, ... I guess it changes the timing just enough.
>>>
>>> I've also tried running the stress test on PG17, and I'm yet to see a
>>> single failure there. Not even on the xeon machine, that hits it
>>> reliably on 18. So this seems to be a PG18-only issue.
>>>
>>
>> And of course, the moment I sent this, I got a failure on 17 too. But it
>> seems much harder to hit (compared to 18).
>
> Could there be a connection to this commit?
>
> commit 119c23eb9819213551cbe7e7665c8b493c59ceee
> Author: Nathan Bossart <nathan@postgresql.org>
> Date:   Tue Sep 5 13:59:06 2023 -0700
>
>     Replace known_assigned_xids_lck with memory barriers.

Doesn't seem to be the case. I reverted this (on master), and I still get the
assert failures (roughly the same number per loop).

regards

--
Tomas Vondra
I kept stress-testing this, and while the frequency massively increased
on PG18, I managed to reproduce this all the way back to PG14. I see
~100x more corefiles on PG18.

That is not a proof the issue was introduced in PG14, maybe it's just
the assert that was added there or something. Or maybe there's another
bug in PG18, making the impact worse.

But I'd suspect this is a bug in

commit 623a9ba79bbdd11c5eccb30b8bd5c446130e521c
Author: Andres Freund <andres@anarazel.de>
Date:   Mon Aug 17 21:07:10 2020 -0700

    snapshot scalability: cache snapshots using a xact completion counter.

    Previous commits made it faster/more scalable to compute snapshots. But
    not building a snapshot is still faster. Now that GetSnapshotData() does
    not maintain RecentGlobal* anymore, that is actually not too hard:

    ...

which is the main snapshot-related change in PG14.

I added some logging to all places changing TransactionXmin and RecentXmin in
procarray.c, and all the failures look like this:

------------------------
1742348688.896 2025-03-19 02:44:48.896 CET [2001770] [client backend:[unknown]] [67da2190.1e8b6a:1] [2025-03-19 02:44:48 CET] [13/1/0] LOG: GetSnapshotData TransactionXmin = 209485
1742348688.896 2025-03-19 02:44:48.896 CET [2001770] [client backend:[unknown]] [67da2190.1e8b6a:2] [2025-03-19 02:44:48 CET] [13/1/0] LOG: GetSnapshotData RecentXmin = 209485
1742348688.902 2025-03-19 02:44:48.902 CET [2001770] [client backend:[unknown]] [67da2190.1e8b6a:3] [2025-03-19 02:44:48 CET] [13/1/0] LOG: GetSnapshotDataReuse TransactionXmin = 209485
1742348688.902 2025-03-19 02:44:48.902 CET [2001770] [client backend:[unknown]] [67da2190.1e8b6a:4] [2025-03-19 02:44:48 CET] [13/1/0] LOG: GetSnapshotDataReuse RecentXmin = 209485
1742348688.922 2025-03-19 02:44:48.922 CET [2001770] [client backend:pgbench] [67da2190.1e8b6a:5] [2025-03-19 02:44:48 CET] [13/2/0] LOG: GetSnapshotData TransactionXmin = 209508
1742348688.922 2025-03-19 02:44:48.922 CET [2001770] [client backend:pgbench] [67da2190.1e8b6a:6] [2025-03-19 02:44:48 CET] [13/2/0] STATEMENT: SELECT abalance FROM pgbench_accounts WHERE aid = 987279;
1742348688.922 2025-03-19 02:44:48.922 CET [2001770] [client backend:pgbench] [67da2190.1e8b6a:7] [2025-03-19 02:44:48 CET] [13/2/0] LOG: GetSnapshotData RecentXmin = 209508
1742348688.922 2025-03-19 02:44:48.922 CET [2001770] [client backend:pgbench] [67da2190.1e8b6a:8] [2025-03-19 02:44:48 CET] [13/2/0] STATEMENT: SELECT abalance FROM pgbench_accounts WHERE aid = 987279;
1742348688.922 2025-03-19 02:44:48.922 CET [2001770] [client backend:pgbench] [67da2190.1e8b6a:9] [2025-03-19 02:44:48 CET] [13/2/0] LOG: GetSnapshotDataReuse RecentXmin = 209485 at character 22
1742348688.922 2025-03-19 02:44:48.922 CET [2001770] [client backend:pgbench] [67da2190.1e8b6a:10] [2025-03-19 02:44:48 CET] [13/2/0] STATEMENT: SELECT abalance FROM pgbench_accounts WHERE aid = 987279;
TRAP: failed Assert("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)"), File: "procarray.c", Line: 2136, PID: 2001770
------------------------

These are all logs for pgbench backend PID 2001770. First, there's a call to
GetSnapshotData and then GetSnapshotDataReuse, which sets both variables to
209485:

LOG: GetSnapshotData TransactionXmin = 209485
LOG: GetSnapshotData RecentXmin = 209485
LOG: GetSnapshotDataReuse TransactionXmin = 209485
LOG: GetSnapshotDataReuse RecentXmin = 209485

I believe this is some very early setup, because both log lines have
[%b:%a] = [client backend:[unknown]].

And then we get another pair of these calls, where GetSnapshotData advances
the XIDs to 209508. But the following GetSnapshotDataReuse call updates only
RecentXmin, to 209485, and that is before TransactionXmin.

Clearly, GetSnapshotDataReuse skips setting TransactionXmin because
MyProc->xmin still has a valid XID:

    if (!TransactionIdIsValid(MyProc->xmin))
        MyProc->xmin = TransactionXmin = snapshot->xmin;

I don't know if that's correct, and it seems a bit weird that RecentXmin
moves backwards. But as I mentioned before, it's a bit weird this happens
right after the replica gets ready to handle read-only queries:

------------------------
1742348688.825 2025-03-19 02:44:48.825 CET [1981188] [startup:] [67da2188.1e3b04:9] [2025-03-19 02:44:40 CET] [138/0/0] LOG: consistent recovery state reached at 1/5FA884E8
1742348688.825 2025-03-19 02:44:48.825 CET [1981188] [startup:] [67da2188.1e3b04:10] [2025-03-19 02:44:40 CET] [138/0/0] LOG: record with incorrect prev-link 0/810 at 1/5FA884E8
1742348688.825 2025-03-19 02:44:48.825 CET [1981023] [postmaster:] [67da2188.1e3a5f:5] [2025-03-19 02:44:40 CET] [/0] LOG: database system is ready to accept read-only connections
1742348688.837 2025-03-19 02:44:48.837 CET [2001731] [walreceiver:] [67da2190.1e8b43:1] [2025-03-19 02:44:48 CET] [/0] LOG: started streaming WAL from primary at 1/5F000000 on timeline 1
------------------------

But there are multiple other concurrent pgbench backends, and those don't
have the same issue. None of them seems to advance the XID to 209508.

regards

--
Tomas Vondra
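To make the failure mode concrete, the following is a small self-contained
model of the snapshot-caching logic described above. It is not PostgreSQL
code - ToySnapshot, shared_completion_count and friends are made-up names -
but it reproduces the "RecentXmin moves backwards" pattern whenever the
shared completion counter fails to advance:

    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* A cached snapshot remembers the xmin it was built with and the value
     * of a shared "transactions completed" counter at build time. */
    typedef struct ToySnapshot
    {
        uint32_t xmin;
        uint64_t completion_count;
    } ToySnapshot;

    static uint64_t shared_completion_count = 10;  /* shared counter */
    static uint32_t transaction_xmin = 0;          /* like TransactionXmin */
    static uint32_t recent_xmin = 0;               /* like RecentXmin */

    /* Reuse the cached snapshot only if the counter has not moved.  Note the
     * asymmetry quoted above: transaction_xmin is only set when it was unset,
     * while recent_xmin always follows the (possibly stale) cached xmin. */
    static bool
    reuse_snapshot(const ToySnapshot *snap)
    {
        if (snap->completion_count != shared_completion_count)
            return false;
        if (transaction_xmin == 0)
            transaction_xmin = snap->xmin;
        recent_xmin = snap->xmin;
        return true;
    }

    int
    main(void)
    {
        /* 1. Early in the session: build and cache a snapshot with xmin 1000. */
        ToySnapshot cached = {1000, shared_completion_count};

        /* 2. The startup process expires old known-assigned XIDs but does not
         *    bump the shared counter (the suspected omission). */
        /* shared_completion_count++;   <-- missing */

        /* 3. A freshly built snapshot moves the backend's xmin to 1050. */
        transaction_xmin = 1050;
        recent_xmin = 1050;

        /* 4. The stale cached snapshot still looks current, so it is reused
         *    and drags recent_xmin back below transaction_xmin. */
        reuse_snapshot(&cached);
        printf("TransactionXmin=%u RecentXmin=%u\n",
               (unsigned) transaction_xmin, (unsigned) recent_xmin);
        assert(transaction_xmin <= recent_xmin);   /* fails, like the real assert */
        return 0;
    }

Uncommenting the counter bump in step 2 makes reuse_snapshot() refuse the
stale snapshot, which is in essence what the fix discussed below does.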
On 19/03/2025 04:22, Tomas Vondra wrote:
> I kept stress-testing this, and while the frequency massively increased
> on PG18, I managed to reproduce this all the way back to PG14. I see
> ~100x more corefiles on PG18.
>
> That is not a proof the issue was introduced in PG14, maybe it's just
> the assert that was added there or something. Or maybe there's another
> bug in PG18, making the impact worse.
>
> But I'd suspect this is a bug in
>
> commit 623a9ba79bbdd11c5eccb30b8bd5c446130e521c
> Author: Andres Freund <andres@anarazel.de>
> Date:   Mon Aug 17 21:07:10 2020 -0700
>
>     snapshot scalability: cache snapshots using a xact completion counter.
>
>     Previous commits made it faster/more scalable to compute snapshots.
>     But not building a snapshot is still faster. Now that
>     GetSnapshotData() does not maintain RecentGlobal* anymore, that is
>     actually not too hard:
>
>     ...

Looking at the code, shouldn't ExpireAllKnownAssignedTransactionIds() and
ExpireOldKnownAssignedTransactionIds() update xactCompletionCount? This can
happen during hot standby:

1. Backend acquires snapshot A with xmin 1000
2. Startup process calls ExpireOldKnownAssignedTransactionIds(),
3. Backend acquires snapshot B with xmin 1050
4. Backend releases snapshot A, updating TransactionXmin to 1050
5. Backend acquires new snapshot, calls GetSnapshotDataReuse(), reusing
   snapshot A's data.

Because xactCompletionCount is not updated in step 2, the
GetSnapshotDataReuse() call will reuse the snapshot A. But snapshot A has a
lower xmin.

--
Heikki Linnakangas
Neon (https://neon.tech)
On 3/19/25 08:17, Heikki Linnakangas wrote:
> On 19/03/2025 04:22, Tomas Vondra wrote:
>> I kept stress-testing this, and while the frequency massively increased
>> on PG18, I managed to reproduce this all the way back to PG14. I see
>> ~100x more corefiles on PG18.
>>
>> ...
>
> Looking at the code, shouldn't ExpireAllKnownAssignedTransactionIds()
> and ExpireOldKnownAssignedTransactionIds() update xactCompletionCount?
> This can happen during hot standby:
>
> 1. Backend acquires snapshot A with xmin 1000
> 2. Startup process calls ExpireOldKnownAssignedTransactionIds(),
> 3. Backend acquires snapshot B with xmin 1050
> 4. Backend releases snapshot A, updating TransactionXmin to 1050
> 5. Backend acquires new snapshot, calls GetSnapshotDataReuse(), reusing
>    snapshot A's data.
>
> Because xactCompletionCount is not updated in step 2, the
> GetSnapshotDataReuse() call will reuse the snapshot A. But snapshot A
> has a lower xmin.
>

Could be. As an experiment I added xactCompletionCount advance to the two
functions you mentioned, and I ran the stress test again. I haven't seen any
failures so far, after ~1000 runs. Without the patch this produced ~200
failures/core files.

regards

--
Tomas Vondra
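For concreteness, such an experiment amounts to treating the expiry of
known-assigned XIDs as a "transaction completed" event. A hypothetical,
heavily simplified sketch of one of the two functions is below; the helper
and shared-memory names (KnownAssignedXidsRemovePreceding, ShmemVariableCache)
are recalled from the PG 14-16 sources and may not match every branch
exactly, and the committed fix (see later in the thread) also maintains
latestCompletedXid:

    /* Hypothetical sketch only - not the committed fix. */
    void
    ExpireOldKnownAssignedTransactionIds(TransactionId xid)
    {
        LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);

        /* existing behaviour: forget all known-assigned XIDs preceding xid */
        KnownAssignedXidsRemovePreceding(xid);

        /*
         * The removed XIDs will never get commit/abort records replayed on
         * this standby, so nothing else advances the counter for them.  Bump
         * it here so cached snapshots taken before this point are no longer
         * handed out by GetSnapshotDataReuse().
         */
        ShmemVariableCache->xactCompletionCount++;

        LWLockRelease(ProcArrayLock);
    }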
On 3/19/25 13:27, Tomas Vondra wrote:
> On 3/19/25 08:17, Heikki Linnakangas wrote:
>> On 19/03/2025 04:22, Tomas Vondra wrote:
>>> I kept stress-testing this, and while the frequency massively increased
>>> on PG18, I managed to reproduce this all the way back to PG14. I see
>>> ~100x more corefiles on PG18.
>>>
>>> ...
>>
>> Looking at the code, shouldn't ExpireAllKnownAssignedTransactionIds()
>> and ExpireOldKnownAssignedTransactionIds() update xactCompletionCount?
>> This can happen during hot standby:
>>
>> ...
>>
>> Because xactCompletionCount is not updated in step 2, the
>> GetSnapshotDataReuse() call will reuse the snapshot A. But snapshot A
>> has a lower xmin.
>>
>
> Could be. As an experiment I added xactCompletionCount advance to the
> two functions you mentioned, and I ran the stress test again. I haven't
> seen any failures so far, after ~1000 runs. Without the patch this
> produced ~200 failures/core files.
>

I kept stress-testing this (without the fix), trying to figure out why the
frequency of failures got so much higher on PG18. While discussing this
off-list with Andres, he was wondering if maybe there's a second independent
bug on PG18, with the same symptoms.

It took quite a bit of time, but I managed to narrow this down to ~mid
December 2024, commit 952365cded6. The following table shows the number of
assert failures on a single run of the stress test, on 4 different machines.

                 xeon    ryzen    rpi5-32    rpi5-64
-------------------------------------------------------
1585ff7387d       247       52         69         28
952365cded6       199       30          -          -
7ec4b9ff80d         1        7          -          -
7f97b4734f9         3       17          -          -
578a7fe7b6f         -        -          -          -
db448ce5ad3         -        -          -          -
1f81b48a9d5         -        -          -          -
d5a7bd5670c         0       11          1          0

There's a fair amount of randomness - partially due to the bug being a race
condition, and thus time-sensitive. And then also the test suite is
randomized. But I think it pretty clearly shows a change in these commits.
The rpi5 machines take much longer, so I only have the two results for now.

But it seems it changed in 952365cded6, which is:

commit 952365cded635e54c4177399c0280cb7a5e34c11
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date:   Mon Dec 23 12:42:39 2024 +0200

    Remove unnecessary GetTransactionSnapshot() calls

    In get_database_list() and get_subscription_list(), the
    GetTransactionSnapshot() call is not required because the catalog
    table scans use the catalog snapshot, which is held until the end of
    the scan. See table_beginscan_catalog(), which calls
    RegisterSnapshot(GetCatalogSnapshot(relid)).

    ...

I'm not claiming the commit is buggy - it might be, but I think it's more
likely it just made the preexisting bug easier to hit. I base this on the
observation that incrementing the xactCompletionCount made the assert
failures go away.

regards

--
Tomas Vondra
On 21/03/2025 12:28, Tomas Vondra wrote:
> But it seems it changed in 952365cded6, which is:
>
> commit 952365cded635e54c4177399c0280cb7a5e34c11
> Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
> Date:   Mon Dec 23 12:42:39 2024 +0200
>
>     Remove unnecessary GetTransactionSnapshot() calls
>
>     In get_database_list() and get_subscription_list(), the
>     GetTransactionSnapshot() call is not required because the catalog
>     table scans use the catalog snapshot, which is held until the end of
>     the scan. See table_beginscan_catalog(), which calls
>     RegisterSnapshot(GetCatalogSnapshot(relid)).
>
>     ...
>
> I'm not claiming the commit is buggy - it might be, but I think it's
> more likely it just made the preexisting bug easier to hit. I base this
> on the observation that incrementing the xactCompletionCount made the
> assert failures go away.

That commit removed three GetTransactionSnapshot() calls: one in autovacuum,
one in the logical replication launcher, and one in InitPostgres(). It must
be the one in InitPostgres() that makes the difference, because the other two
are not executed in a standby.

I was able to reproduce this and to catch it in action with 'rr'. After
commit 952365cded6, the sequence of events looks like this:

1. At backend startup, GetCatalogSnapshot() sets
   CatalogSnapshotData.xmin=1000, xactCompletionCount=10
2. On first query, GetTransactionSnapshot() sets
   CurrentSnapshotData.xmin=1002, xactCompletionCount=10
3. GetCatalogSnapshot() is called again. It tries to reuse the snapshot with
   xmin=1000 and hits the assertion.

Before commit 952365cded6, which removed the GetTransactionSnapshot() call
from InitPostgres, this usually happens instead:

0. At backend startup, GetTransactionSnapshot() sets
   CurrentSnapshotData.xmin=1000, xactCompletionCount=10
1. At backend startup, GetCatalogSnapshot() sets
   CatalogSnapshotData.xmin=1000, xactCompletionCount=10
2. On first query, GetTransactionSnapshot() reuses the snapshot with
   CurrentSnapshotData.xmin=1002, xactCompletionCount=10
3. GetCatalogSnapshot() is called again. It successfully reuses the snapshot
   with xmin=1000

In other words, the GetTransactionSnapshot() call initializes
CurrentSnapshotData with the snapshot with an earlier xmin. The first
GetCatalogSnapshot() call happens almost immediately after that, so they're
very likely - but not guaranteed - to get the same snapshot. When I removed
the GetTransactionSnapshot() call, the gap between the first
GetTransactionSnapshot() and GetCatalogSnapshot() calls in the backend became
much longer, making this more likely to happen.

In any case, the missing "xactCompletionCount++" in
ExpireAllKnownAssignedTransactionIds() and
ExpireOldKnownAssignedTransactionIds() is a clear bug and explains this. I
will prepare and commit patches to fix that.

Thanks for the testing!

--
Heikki Linnakangas
Neon (https://neon.tech)
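As background for the two sequences above: the transaction snapshot and the
catalog snapshot are cached independently in each backend, and each cached
snapshot carries the completion-counter value it was built under.
Schematically (simplified; the field name snapXactCompletionCount is recalled
from snapshot.h, and the real structs have many more members):

    typedef struct SnapshotData
    {
        TransactionId xmin;                     /* all XIDs < xmin are finished */
        uint64        snapXactCompletionCount;  /* counter value at build time */
        /* ... many more fields ... */
    } SnapshotData;

    /* Cached separately per backend (snapmgr.c): */
    static SnapshotData CurrentSnapshotData;    /* GetTransactionSnapshot() */
    static SnapshotData CatalogSnapshotData;    /* GetCatalogSnapshot() */

With the InitPostgres() call gone, CatalogSnapshotData is populated long
before CurrentSnapshotData, so the two caches can end up with different xmins
under the same counter value - which is what step 3 of the first sequence
trips over.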
Hi,

On 2025-03-19 09:17:23 +0200, Heikki Linnakangas wrote:
> On 19/03/2025 04:22, Tomas Vondra wrote:
> > I kept stress-testing this, and while the frequency massively increased
> > on PG18, I managed to reproduce this all the way back to PG14. I see
> > ~100x more corefiles on PG18.
> >
> > ...

Thanks for debugging and analyzing this!

> Looking at the code, shouldn't ExpireAllKnownAssignedTransactionIds() and
> ExpireOldKnownAssignedTransactionIds() update xactCompletionCount? This can
> happen during hot standby:
>
> 1. Backend acquires snapshot A with xmin 1000
> 2. Startup process calls ExpireOldKnownAssignedTransactionIds(),
> 3. Backend acquires snapshot B with xmin 1050
> 4. Backend releases snapshot A, updating TransactionXmin to 1050
> 5. Backend acquires new snapshot, calls GetSnapshotDataReuse(), reusing
>    snapshot A's data.
>
> Because xactCompletionCount is not updated in step 2, the
> GetSnapshotDataReuse() call will reuse the snapshot A. But snapshot A has a
> lower xmin.

I've swapped a lot of the KnownAssigned* code out of my head:

Am I right in understanding that the only scenario (when in
STANDBY_SNAPSHOT_READY) where ExpireOldKnownAssignedTransactionIds() would
"legally" remove a transaction, rather than the commit / abort records doing
so, is if the primary crash-restarted while transactions were ongoing?

Those transactions won't have commit/abort records and thus won't trigger
ExpireTreeKnownAssignedTransactionIds(), which otherwise would have updated
->xactCompletionCount?

When writing the snapshot caching patch, I tried to make sure that all the
places that maintain ->latestCompletedXid also update ->xactCompletionCount.
Afaict that's still the case. Which, I think, means that we're also missing
calls to MaintainLatestCompletedXidRecovery()?

If latestCompletedXid is incorrect, visibility determinations end up wrong...

Greetings,

Andres Freund
On 21/03/2025 17:16, Andres Freund wrote:
> Am I right in understanding that the only scenario (when in
> STANDBY_SNAPSHOT_READY) where ExpireOldKnownAssignedTransactionIds() would
> "legally" remove a transaction, rather than the commit / abort records doing
> so, is if the primary crash-restarted while transactions were ongoing?
>
> Those transactions won't have commit/abort records and thus won't trigger
> ExpireTreeKnownAssignedTransactionIds(), which otherwise would have updated
> ->xactCompletionCount?

Correct.

> When writing the snapshot caching patch, I tried to make sure that all the
> places that maintain ->latestCompletedXid also update
> ->xactCompletionCount. Afaict that's still the case. Which, I think, means
> that we're also missing calls to MaintainLatestCompletedXidRecovery()?

Yep, I was just looking at that too.

> If latestCompletedXid is incorrect, visibility determinations end up wrong...

I think it happens to work, because the transactions are effectively aborted.
latestCompletedXid is used to initialize xmax in GetSnapshotData. If, for
example, latestCompletedXid is incorrectly set to 1000 even though XID 1001
already aborted, a snapshot with xmax=1000 still correctly considers XID 1001
as "not visible". As soon as a transaction commits, latestCompletedXid is
fixed.

AFAICS we could skip updating latestCompletedXid on aborts altogether and
rename it to latestCommittedXid. But it's hardly worth optimizing for aborts.

For the same reason, I believe the assertion failure we're discussing here is
also harmless. Even though the reused snapshot has a smaller xmin than
expected, all those transactions aborted and are thus not visible anyway.

In any case, let's be tidy and fix both latestCompletedXid and
xactCompletionCount.

--
Heikki Linnakangas
Neon (https://neon.tech)
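The "harmless" argument can be read off the snapshot boundary check alone.
Below is a toy, self-contained version of that check (the real test,
XidInMVCCSnapshot(), also consults the snapshot's xip array, subtransactions,
and wraparound-aware comparisons):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;

    /* Toy boundary check: anything at or beyond xmax is treated as still in
     * progress, i.e. not visible (XID wraparound ignored for the toy). */
    static bool
    invisible_by_xmax(TransactionId xid, TransactionId xmax)
    {
        return xid >= xmax;
    }

    int
    main(void)
    {
        /*
         * The example above: latestCompletedXid stuck at 1000 even though XID
         * 1001 already aborted, so the snapshot's xmax stays at 1000.  The
         * aborted XID 1001 is still filtered out - the same answer an aborted
         * transaction must get - so query results stay correct.
         */
        return invisible_by_xmax(1001, 1000) ? 0 : 1;
    }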
On 3/23/25 17:43, Heikki Linnakangas wrote:
> On 21/03/2025 17:16, Andres Freund wrote:
>> Am I right in understanding that the only scenario (when in
>> STANDBY_SNAPSHOT_READY) where ExpireOldKnownAssignedTransactionIds()
>> would "legally" remove a transaction, rather than the commit / abort
>> records doing so, is if the primary crash-restarted while transactions
>> were ongoing?
>>
>> Those transactions won't have commit/abort records and thus won't
>> trigger ExpireTreeKnownAssignedTransactionIds(), which otherwise would
>> have updated ->xactCompletionCount?
>
> Correct.
>
>> When writing the snapshot caching patch, I tried to make sure that all
>> the places that maintain ->latestCompletedXid also update
>> ->xactCompletionCount. Afaict that's still the case. Which, I think,
>> means that we're also missing calls to
>> MaintainLatestCompletedXidRecovery()?
>
> Yep, I was just looking at that too.
>
>> If latestCompletedXid is incorrect, visibility determinations end up
>> wrong...
>
> I think it happens to work, because the transactions are effectively
> aborted. latestCompletedXid is used to initialize xmax in
> GetSnapshotData. If, for example, latestCompletedXid is incorrectly set
> to 1000 even though XID 1001 already aborted, a snapshot with xmax=1000
> still correctly considers XID 1001 as "not visible". As soon as a
> transaction commits, latestCompletedXid is fixed.
>
> AFAICS we could skip updating latestCompletedXid on aborts altogether
> and rename it to latestCommittedXid. But it's hardly worth optimizing
> for aborts.
>
> For the same reason, I believe the assertion failure we're discussing
> here is also harmless. Even though the reused snapshot has a smaller
> xmin than expected, all those transactions aborted and are thus not
> visible anyway.
>
> In any case, let's be tidy and fix both latestCompletedXid and
> xactCompletionCount.

Thanks for looking into this and pushing the fix.

Would it make sense to add a comment documenting this reasoning about not
handling aborts? Otherwise someone will get to rediscover this in the
future ...

regards

--
Tomas Vondra
On 24/03/2025 16:56, Tomas Vondra wrote:
> On 3/23/25 17:43, Heikki Linnakangas wrote:
>> ...
>>
>> For the same reason, I believe the assertion failure we're discussing
>> here is also harmless. Even though the reused snapshot has a smaller
>> xmin than expected, all those transactions aborted and are thus not
>> visible anyway.
>>
>> In any case, let's be tidy and fix both latestCompletedXid and
>> xactCompletionCount.
>
> Thanks for looking into this and pushing the fix.
>
> Would it make sense to add a comment documenting this reasoning about
> not handling aborts? Otherwise someone will get to rediscover this in
> the future ...

Well, we're currently _not_ relying on the reasoning about aborts not
changing visibility. So it's not like someone will forget the reasoning and
have a bug as a result. I guess someone could rediscover that reasoning and
think that they can skip advancing those counters on aborts as an
optimization, re-introducing the assertion failure. But I believe that was
not why we had this bug, it was just a simple omission.

So the comment would look like this:

"In principle, we would not need to advance xactCompletionCount and
latestCompletedXid because aborting transactions don't change visibility.
But that could make xmin within a transaction go backwards, if a snapshot
with an older xmin but same xactCompletionCount is reused, triggering the
assertion in GetSnapshotDataReuse()."

If we add that, I guess it should go into ProcArrayEndTransaction() which
currently just notes that it's used for commits and aborts interchangeably.
Do you think it's worth it? Want to propose a patch?

--
Heikki Linnakangas
Neon (https://neon.tech)
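Rendered as an actual code comment, the proposal might sit in the header
comment of ProcArrayEndTransaction() roughly as follows (placement and the
surrounding header wording are illustrative, not a patch):

    /*
     * ProcArrayEndTransaction -- mark a transaction as no longer running
     *
     * This is used interchangeably for commits and aborts.
     *
     * In principle, we would not need to advance xactCompletionCount and
     * latestCompletedXid for aborts, because aborting transactions don't
     * change visibility.  But that could make xmin within a transaction go
     * backwards, if a snapshot with an older xmin but the same
     * xactCompletionCount is reused, triggering the assertion in
     * GetSnapshotDataReuse().
     */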
On 3/24/25 16:25, Heikki Linnakangas wrote:
> On 24/03/2025 16:56, Tomas Vondra wrote:
>> ...
>>
>> Would it make sense to add a comment documenting this reasoning about
>> not handling aborts? Otherwise someone will get to rediscover this in
>> the future ...
>
> Well, we're currently _not_ relying on the reasoning about aborts not
> changing visibility. So it's not like someone will forget the reasoning
> and have a bug as a result. I guess someone could rediscover that
> reasoning and think that they can skip advancing those counters on
> aborts as an optimization, re-introducing the assertion failure. But I
> believe that was not why we had this bug, it was just a simple omission.
>
> So the comment would look like this:
>
> "In principle, we would not need to advance xactCompletionCount and
> latestCompletedXid because aborting transactions don't change
> visibility. But that could make xmin within a transaction go backwards,
> if a snapshot with an older xmin but same xactCompletionCount is reused,
> triggering the assertion in GetSnapshotDataReuse()."
>
> If we add that, I guess it should go into ProcArrayEndTransaction()
> which currently just notes that it's used for commits and aborts
> interchangeably. Do you think it's worth it? Want to propose a patch?
>

I think the wording you suggested above is reasonable - I certainly don't
have a better one.

I don't know how likely it is that someone breaks this, but I think it makes
sense to mention this because the assert is non-trivial to hit.

regards

--
Tomas Vondra