Re: BUG #14820: Standby crash with "could not access status of transaction" (for track_commit_timestamp) - Mailing list pgsql-bugs
From | Jeremy Schneider
---|---
Subject | Re: BUG #14820: Standby crash with "could not access status of transaction" (for track_commit_timestamp)
Date |
Msg-id | d96ae60a-7d40-c9b4-719d-fbbb61622606@amazon.com
In response to | Re: BUG #14820: Standby crash with "could not access status of transaction" (for track_commit_timestamp) (Jeremy Schneider <schnjere@amazon.com>)
List | pgsql-bugs
On 5/23/18 14:20, Jeremy Schneider wrote:
> On 9/20/17 02:26, matioli.matheus@gmail.com wrote:
>> I have checked and the file `pg_commit_ts/E7FB` is on the data directory
>> with 229376, so looks like it is trying to read the file past its end, hence
>> the above error.
>
> I recently saw a problem that looked very similar to this, seems worth
> reporting on the same thread.
>
> PostgreSQL version: 10.3
> Operating system: Linux
>
> On startup, while attempting recovery, PostgreSQL failed to start up with
> the following error:
>
> FATAL: could not access status of transaction 68544569
> DETAIL: Could not read from file "pg_commit_ts/0A37" at offset 106496:
> Success.
> CONTEXT: WAL redo at 70E/E5FE2030 for Transaction/COMMIT: 2018-xx-xx
> xx:xx:xx.xxxxxx-xx; subxacts: 68544570
> LOG: startup process (PID 100292) exited with exit code 1
> LOG: aborting startup due to startup process failure
> LOG: database system is shut down
>
> This is on a primary, not a standby. track_commit_timestamp is on,
> wal_level is replica and there are no shared_preload_libraries.
>
> The pg_commit_ts directory was actually empty when recovery started and
> in the process of recovery PostgreSQL recreated the 0A37 file. However
> it consistently created a 106496 byte file (entirely filled with
> 0/nulls) and then threw the error above when trying to read beyond the
> end. If we removed the 0A37 file and restarted then it would again
> recreate the 106496 byte file and error out in the same way.
>
> Here's the WAL it was trying to apply at failure, with preceding and
> following entries:
>
> rmgr: Heap          len (rec/tot): 71/ 71, tx: 68544569, lsn:
> 70E/E5FE1FD0, prev 70E/E5FE1F98, desc: HOT_UPDATE off 2 xmax 68544569 ;
> new off 10 xmax 0, blkref #0: rel 1663/3850018/16485 blk 295474
> rmgr: Transaction   len (rec/tot): 46/ 46, tx: 68544569, lsn:
> 70E/E5FE2030, prev 70E/E5FE1FD0, desc: COMMIT 2018-xx-xx xx:xx:xx.xxxxxx
> UTC; subxacts: 68544570
> rmgr: Heap          len (rec/tot): 54/ 54, tx: 68539960, lsn:
> 70E/E5FE2060, prev 70E/E5FE2030, desc: LOCK off 12: xid 68539960: flags
> 0 LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/11038913/16866 blk 6135
>
> Unfortunately the WAL files from the time of the issue are no longer
> available, so I'm unable to go back and do any additional analysis of
> the original crash. (Like looking more at those transactions.) But I
> wanted to at least get a record over here of what happened.

I was able to get a little more information about this. We can see that:

* a vacuum was running on TABLE1 at the time of the problem
* there were many transactions containing sub-transactions (inserting into TABLE1) at the time of the problem
* vacuum activity was seen on TABLE1 concurrently for quite some time before the problem
* nearly all DB activity, including a stream of inserts into TABLE2, stopped abruptly at TIME1
* at TIME1 there were 9 xacts in flight; 3 of these aborted around this time
* after TIME1, one new transaction started, inserting into TABLE3 (with a subxact in TABLE1); that xact committed at 06:04:04.015169 UTC
* the vacuum on TABLE1 continued just past the new transaction commit; it was the last observed activity in the DB before the crash
* the underlying I/O system looked healthy during this entire period

Not enough information to get a root cause, but I wanted to record it in case someone hits something similar in the future. (A sketch of the arithmetic mapping xid 68544569 to segment 0A37 at offset 106496 follows below.)

-Jeremy

--
Jeremy Schneider
Database Engineer
Amazon Web Services
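For reference, here is how the failing xid maps to segment 0A37 and offset 106496. This is a minimal sketch, not the server code itself, using the commit-timestamp SLRU constants from src/backend/access/transam/commit_ts.c and src/include/access/slru.h as of PostgreSQL 10, and assuming the default 8 kB BLCKSZ:

/* Sketch: map an xid to its pg_commit_ts segment file and byte offset. */
#include <stdio.h>

#define BLCKSZ                   8192 /* assumes the default block size */
#define SIZEOF_COMMIT_TS_ENTRY   10   /* 8-byte TimestampTz + 2-byte RepOriginId */
#define COMMIT_TS_XACTS_PER_PAGE (BLCKSZ / SIZEOF_COMMIT_TS_ENTRY)  /* 819 */
#define SLRU_PAGES_PER_SEGMENT   32

int
main(void)
{
    unsigned int xid    = 68544569;                                 /* xid from the FATAL message */
    unsigned int pageno = xid / COMMIT_TS_XACTS_PER_PAGE;           /* 83693 */
    unsigned int segno  = pageno / SLRU_PAGES_PER_SEGMENT;          /* 2615 = 0xA37 */
    unsigned int offset = (pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ; /* 106496 */

    /* prints: xid 68544569 -> pg_commit_ts segment 0A37, page offset 106496 */
    printf("xid %u -> pg_commit_ts segment %04X, page offset %u\n", xid, segno, offset);
    return 0;
}

A 106496-byte segment holds pages 0 through 12, so the page that should contain xid 68544569 begins exactly at end of file. A read() there returns 0 bytes without setting errno, which would also explain the otherwise odd "Success" in the DETAIL line.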