Re: BUG #18509: Logical decoding behaves badly when processing a change record for a table with altered column - Mailing list pgsql-bugs
From | Alexander Lakhin |
---|---|
Subject | Re: BUG #18509: Logical decoding behaves badly when processing a change record for a table with altered column |
Date | |
Msg-id | be69a11d-b7b5-9ee8-221a-fdf34898bb6d@gmail.com Whole thread Raw |
In response to | BUG #18509: Logical decoding behaves badly when processing a change record for a table with altered column (PG Bug reporting form <noreply@postgresql.org>) |
List | pgsql-bugs |
14.06.2024 21:00, PG Bug reporting form wrote: > The following script: > ... > > produces errors like: > 2024-06-14 17:10:58.349 UTC|law|regression|666c79a2.3fe3a4|ERROR: could not > map filenumber "base/16384/16453" to relation OID I've simplified that script to the following: psql -c "CREATE TABLE t(t1 text, i2 int);" for ((i=1;i<=100;i++)); do echo "iteration $i"; cat << EOF | psql >/dev/null & BEGIN; SELECT txid_current(); SELECT pg_sleep(0.01); EOF cat << EOF | psql >/dev/null & ALTER TABLE t RENAME COLUMN t1 TO i1; ALTER TABLE t ALTER COLUMN i1 TYPE int4 USING (i1::int4); SELECT pg_sleep(0.001); INSERT INTO t VALUES (1); SELECT pg_sleep(0.001); INSERT INTO t VALUES (1); SELECT pg_sleep(0.001); INSERT INTO t VALUES (1); SELECT pg_sleep(0.001); INSERT INTO t VALUES (1); SELECT pg_sleep(0.001); INSERT INTO t VALUES (1); EOF cat << EOF | psql >/dev/null & SELECT 'init' FROM pg_create_logical_replication_slot( 'regression_slot_$i', 'test_decoding'); SELECT data FROM pg_logical_slot_get_changes( 'regression_slot_$i', NULL, NULL); EOF wait grep 'ERROR' server.log && { break; } cat << EOF | psql >/dev/null SELECT pg_drop_replication_slot('regression_slot_$i'); ALTER TABLE t ALTER COLUMN i1 TYPE text USING (i1::text); ALTER TABLE t RENAME COLUMN i1 TO t1; EOF done It fails deterministically (though not on each run) for me as below: ... iteration 93 ERROR: could not map filenumber "base/16384/17310" to relation OID an excerpt from server.log (generated with log_min_messages = DEBUG3): 2024-07-07 09:15:36.167 UTC [324743:33] psql LOG: starting logical decoding for slot "regression_slot_93" 2024-07-07 09:15:36.167 UTC [324743:34] psql DETAIL: Streaming transactions committing after 0/2006458, reading WAL from 0/1FFB820. 2024-07-07 09:15:36.167 UTC [324743:35] psql STATEMENT: SELECT data FROM pg_logical_slot_get_changes( 'regression_slot_93', NULL, NULL); ... 2024-07-07 09:15:36.169 UTC [324743:47] psql DEBUG: snapshot of 0+0 running transaction ids (lsn 0/20065A8 oldest xid 1666 latest complete 1665 next xid 1666) 2024-07-07 09:15:36.169 UTC [324743:48] psql DEBUG: purged committed transactions from 0 to 0, xmin: 1662, xmax: 1662 2024-07-07 09:15:36.169 UTC [324743:49] psql DEBUG: xmin: 1662, xmax: 1662, oldest running: 1662, oldest xmin: 1662 2024-07-07 09:15:36.169 UTC [324743:50] psql DEBUG: updated xmin: 1 restart: 0 2024-07-07 09:15:36.169 UTC [324743:51] psql DEBUG: xmin required by slots: data 0, catalog 1662 2024-07-07 09:15:36.169 UTC [324743:52] psql LOG: logical decoding found consistent point at 0/2006420 2024-07-07 09:15:36.169 UTC [324743:53] psql DETAIL: There are no running transactions. 2024-07-07 09:15:36.169 UTC [324743:54] psql STATEMENT: SELECT data FROM pg_logical_slot_get_changes( 'regression_slot_93', NULL, NULL); 2024-07-07 09:15:36.169 UTC [324743:55] psql DEBUG: aborting old transaction 1662 2024-07-07 09:15:36.170 UTC [324743:56] psql ERROR: could not map filenumber "base/16384/17310" to relation OID 2024-07-07 09:15:36.170 UTC [324743:57] psql STATEMENT: SELECT data FROM pg_logical_slot_get_changes( 'regression_slot_93', NULL, NULL); The corresponding fragment of WAL: rmgr: Transaction len (rec/tot): 353/ 353, tx: 1662, lsn: 0/02005B38, prev 0/02005B00, desc: INVALIDATION ; inval msgs: catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 relcache 17310 snapshot 2608 rmgr: Transaction len (rec/tot): 1605/ 1605, tx: 1662, lsn: 0/02005CA0, prev 0/02005B38, desc: COMMIT 2024-07-07 09:15:36.150647 UTC; rels: base/16384/17305 base/16384/17308 base/16384/17309; dropped stats: 2/16384/17309 2/16384/17308 2/16384/17310; inval msgs: catcache 7 ... relcache 17310 ... relcache 17310 snapshot 2608 rmgr: Transaction len (rec/tot): 34/ 34, tx: 1661, lsn: 0/02006300, prev 0/02005CA0, desc: ABORT 2024-07-07 09:15:36.157644 UTC rmgr: Standby len (rec/tot): 78/ 78, tx: 0, lsn: 0/02006328, prev 0/02006300, desc: LOCK xid 1662 db 16384 rel 17308 xid 1662 db 16384 rel 17309 xid 1662 db 16384 rel 16385 xid 1662 db 16384 rel 17310 rmgr: Standby len (rec/tot): 54/ 54, tx: 0, lsn: 0/02006378, prev 0/02006328, desc: RUNNING_XACTS nextXid 1663 latestCompletedXid 1661 oldestRunningXid 1662; 1 xacts: 1662 rmgr: Heap len (rec/tot): 59/ 59, tx: 1663, lsn: 0/020063B0, prev 0/02006378, desc: INSERT+INIT off: 1, flags: 0x08, blkref #0: rel 1663/16384/17310 blk 3 rmgr: Transaction len (rec/tot): 46/ 46, tx: 1663, lsn: 0/020063F0, prev 0/020063B0, desc: COMMIT 2024-07-07 09:15:36.162938 UTC (The transaction 1662 performs table rewrite.) What I'm observing with some debug logging added is: ReorderBufferProcessTXN(), called during processing the commit record 0/02006328, gets reloid = 0 from RelidByRelfilenumber() because systable_beginscan() chooses (via GetCatalogSnapshot()) HistoricSnapshot with xmin = 1662, xmax = 1662. That snapshot was set inside SnapBuildProcessChange(), via ReorderBufferSetBaseSnapshot(), from builder->snapshot, which is built inside SnapBuildProcessChange(), via SnapBuildBuildSnapshot(), from builder->xmin (builder->xmax), which is set by SnapBuildProcessRunningXacts from running->oldestRunningXid = 1662. Really, WAL contains RUNNING_XACTS oldestRunningXid 1662 before the record being decoded. So I wonder if the RUNNING_XACT record in WAL is correct or it should be decoded differently? Best regards, Alexander
pgsql-bugs by date: