Re: BUG #14820: Standby crash with "could not access status of transaction" (for track_commit_timestamp) - Mailing list pgsql-bugs

From Jeremy Schneider
Subject Re: BUG #14820: Standby crash with "could not access status of transaction" (for track_commit_timestamp)
Date
Msg-id d96ae60a-7d40-c9b4-719d-fbbb61622606@amazon.com
In response to Re: BUG #14820: Standby crash with "could not access status of transaction" (for track_commit_timestamp)  (Jeremy Schneider <schnjere@amazon.com>)
List pgsql-bugs
On 5/23/18 14:20, Jeremy Schneider wrote:
> On 9/20/17 02:26, matioli.matheus@gmail.com wrote:
>> I have checked, and the file `pg_commit_ts/E7FB` is in the data directory
>> with size 229376 bytes, so it looks like it is trying to read the file
>> past its end, hence the above error.
> 
> I recently saw a problem that looked very similar to this, seems worth
> reporting on the same thread.
> 
> PostgreSQL version: 10.3
> Operating system: Linux
> 
> On startup, while attempting recovery, PostgreSQL failed to start up with
> the following error:
> 
> FATAL: could not access status of transaction 68544569
> DETAIL: Could not read from file "pg_commit_ts/0A37" at offset 106496:
> Success.
> CONTEXT: WAL redo at 70E/E5FE2030 for Transaction/COMMIT: 2018-xx-xx
> xx:xx:xx.xxxxxx-xx; subxacts: 68544570
> LOG: startup process (PID 100292) exited with exit code 1
> LOG: aborting startup due to startup process failure
> LOG: database system is shut down
> 
> This is on a primary, not a standby.  track_commit_timestamp is on,
> wal_level is replica and there are no shared_preload_libraries.
> 
> The pg_commit_ts directory was actually empty when recovery started, and
> during recovery PostgreSQL recreated the 0A37 file.  However, it
> consistently created a 106496-byte file (filled entirely with zeros) and
> then threw the error above when trying to read beyond its end.  If we
> removed the 0A37 file and restarted, it would again recreate the
> 106496-byte file and fail in the same way.
> 
> Here's the WAL it was trying to apply at failure, with preceding and
> following entries:
> 
> rmgr: Heap        len (rec/tot):     71/    71, tx:   68544569, lsn:
> 70E/E5FE1FD0, prev 70E/E5FE1F98, desc: HOT_UPDATE off 2 xmax 68544569 ;
> new off 10 xmax 0, blkref #0: rel 1663/3850018/16485 blk 295474
> rmgr: Transaction len (rec/tot):     46/    46, tx:   68544569, lsn:
> 70E/E5FE2030, prev 70E/E5FE1FD0, desc: COMMIT 2018-xx-xx xx:xx:xx.xxxxxx
> UTC; subxacts: 68544570
> rmgr: Heap        len (rec/tot):     54/    54, tx:   68539960, lsn:
> 70E/E5FE2060, prev 70E/E5FE2030, desc: LOCK off 12: xid 68539960: flags
> 0 LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/11038913/16866 blk 6135
> 
> Unfortunately the WAL files from the time of the issue are no longer
> available, so I'm unable to go back and do any additional analysis of
> the original crash.  (Like looking more at those transactions.)  But I
> wanted to at least get a record over here of what happened.

I was able to get a little more information about this.

We can see that:
* a vacuum was running on TABLE1 at the time of the problem
* there were many transactions containing subtransactions (inserting
into TABLE1) at the time of the problem
* vacuum activity had been seen on TABLE1 concurrently for quite some time
before the problem
* nearly all DB activity including a stream of inserts into TABLE2
stopped abruptly at TIME1
* at TIME1 there were 9 xacts in flight. 3 of these aborted around this
time.
* after TIME1, one new transaction started, inserting into TABLE3 (with a
subxact touching TABLE1); that xact committed at 06:04:04.015169 UTC
* the vacuum on TABLE1 continued just past the new transaction's commit;
it was the last observed activity in the DB before the crash
* underlying I/O system looks healthy during this entire period.

Not enough information to get a root cause, but I wanted to record it in
case someone hits something similar in the future.
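For what it's worth, the reported segment name and offset are internally consistent with the commit_ts SLRU arithmetic. Here's a minimal sketch of that mapping (assuming default BLCKSZ of 8192 and the PG 10 entry size of 10 bytes, i.e. an 8-byte timestamp plus a 2-byte replication origin; the constant names mirror those in commit_ts.c but the function itself is just illustrative):

```python
# Sketch: map a transaction ID to its pg_commit_ts SLRU segment file and
# byte offset, assuming default BLCKSZ (8192) and the PG 10 commit_ts
# entry layout (10 bytes per xact).

BLCKSZ = 8192
ENTRY_SIZE = 10                        # sizeof(CommitTimestampEntry)
XACTS_PER_PAGE = BLCKSZ // ENTRY_SIZE  # 819 entries per SLRU page
PAGES_PER_SEGMENT = 32                 # SLRU_PAGES_PER_SEGMENT

def commit_ts_location(xid):
    page = xid // XACTS_PER_PAGE
    segment = page // PAGES_PER_SEGMENT
    offset = (page % PAGES_PER_SEGMENT) * BLCKSZ
    return "%04X" % segment, offset    # segment file name, offset of page

seg, off = commit_ts_location(68544569)
print(seg, off)  # 0A37 106496
```

That matches the error exactly: xid 68544569 lives on the page starting at offset 106496 of segment 0A37, so reading it requires the file to be at least 114688 bytes long. A file that is exactly 106496 bytes is one page short, which lines up with the "read past its end" behavior observed both here and in the original E7FB report.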

-Jeremy

-- 
Jeremy Schneider
Database Engineer
Amazon Web Services

