Re: ERROR could not access transaction/Could not open file pg_commit_ts - Mailing list pgsql-general

From Jeremy Finzel
Subject Re: ERROR could not access transaction/Could not open file pg_commit_ts
Date
Msg-id CAMa1XUiZZ7vHM8qupUELa0QnazJ5aUjLm55AF9a1+ZjkGupznA@mail.gmail.com
Whole thread Raw
In response to ERROR could not access transaction/Could not open file pg_commit_ts  (Jeremy Finzel <finzelj@gmail.com>)
Responses Re: ERROR could not access transaction/Could not open filepg_commit_ts  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
List pgsql-general

On Fri, Mar 9, 2018 at 10:43 AM, Jeremy Finzel <finzelj@gmail.com> wrote:
Hello -

Here is our cluster setup:

cluster_a 9.5.11 Ubuntu 16.04.4 LTS
 --> cluster_b (streamer) 9.5.11 Ubuntu 16.04.4 LTS
 --> cluster_c (streamer) 9.5.11 Ubuntu 16.04.4 LTS

Very recently, we started seeing these errors when running a query on a specific table on the streamer:

2018-03-09 08:28:16.280 CST,"uname","foo",18692,"0.0.0.0:0",5aa29292.4904,4,"SELECT",2018-03-09 07:56:34 CST,18/15992,0,ERROR,58P01,"could not access status of transaction 1035047007","Could not open file ""pg_commit_ts/9A45"": No such file or directory."

A little history on the cluster:
  • The most recent change we made was a point release upgrade from 9.5.5 to 9.5.11 on the master, and 9.5.9 to 9.5.11 for the 2 streamers
  • It is a very high WAL traffic reporting system.
  • We actually have synchronous_commit set to off.  It's possible this could have bitten us and we are just now seeing issues, however there have been no crashes since the table in question was created.
  • We have run pg_repack on many tables on this cluster, but that also has not happened since over a month
  • We had a similar error of missing pg_commit_ts file over a year ago after an actual crash.  We had serious issues getting the cluster to start, and had to resort to recreating the missing pg_commit_ts with null bytes (IIRC, we had a snapshot of the system which still showed the file), which worked but left us questioning what really caused the issue.

The table that is causing the error has been in production and used fine since 2/15/2018 when it was created. It is fed by pglogical replication (v. 2.1.1 on subscriber) from a system running 9.6.1 and pglogical v. 1.2.1.  The point release upgrade from earlier 9.5 did take place after this.

However, we only just started seeing errors in the past 12 hours.  The table was autovacuumed on master at 2018-03-08 18:18:15.532137-06, which was about 3 hours before the first user query errored, however, I saw that 2 hours after the autovac, there was another user query that worked successfully on the table.  Not sure if related?

Any insight/ideas would be much appreciated!

Thanks,
Jeremy

UPDATE: what is actually failing is a call to pg_xact_commit_timestamp(xmin) on a given table under the view.  We still think we must have some corruption though with pg_commit_ts.

pgsql-general by date:

Previous
From: Tom Lane
Date:
Subject: Re: Feature request: min/max for macaddr type
Next
From: Alvaro Herrera
Date:
Subject: Re: ERROR could not access transaction/Could not open filepg_commit_ts