Thread: ERROR: could not access status of transaction / Could not open file pg_commit_ts
Hello -
Here is our cluster setup:
cluster_a 9.5.11 Ubuntu 16.04.4 LTS
--> cluster_b (streamer) 9.5.11 Ubuntu 16.04.4 LTS
--> cluster_c (streamer) 9.5.11 Ubuntu 16.04.4 LTS
Very recently, we started seeing these errors when running a query on a specific table on the streamer:
2018-03-09 08:28:16.280 CST,"uname","foo",18692,"0.0.0.0:0",5aa29292.4904,4,"SELECT",2018-03-09 07:56:34 CST,18/15992,0,ERROR,58P01,"could not access status of transaction 1035047007","Could not open file ""pg_commit_ts/9A45"": No such file or directory."
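
For context, the missing file name lines up with the xid in the error. Assuming 9.5's commit_ts layout (10-byte entries, 8192-byte pages, 32 pages per SLRU segment, so 26208 xids per segment), xid 1035047007 falls in segment 0x9A45:

-- Sketch, assuming 9.5's commit_ts layout: segment = xid / xids-per-segment.
-- 8192 / 10 = 819 entries per page; 819 * 32 = 26208 xids per segment.
SELECT to_hex(1035047007 / (8192 / 10 * 32));  -- '9a45'
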
A little history on the cluster:
- The most recent change we made was a point release upgrade, from 9.5.5 to 9.5.11 on the master and from 9.5.9 to 9.5.11 on the two streamers.
- It is a reporting system with very high WAL traffic.
- We actually have synchronous_commit set to off. It's possible this could have bitten us and we are only now seeing the effects; however, there have been no crashes since the table in question was created.
- We have run pg_repack on many tables in this cluster, but not in over a month.
- We had a similar error about a missing pg_commit_ts file over a year ago, after an actual crash. We had serious issues getting the cluster to start, and had to resort to recreating the missing pg_commit_ts file with null bytes (IIRC, we had a snapshot of the system which still showed the file). That worked, but left us questioning what really caused the issue.
The table that is causing the error has been in production and working fine since it was created on 2/15/2018. It is fed by pglogical replication (v2.1.1 on the subscriber) from a system running 9.6.1 and pglogical v1.2.1. The point release upgrade from the earlier 9.5 versions took place after this.
However, we only started seeing errors in the past 12 hours. The table was autovacuumed on the master at 2018-03-08 18:18:15.532137-06, about 3 hours before the first user query errored. However, 2 hours after the autovacuum, another user query ran successfully against the table. Not sure if that is related?
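
For reference, a minimal sketch of reading that autovacuum timestamp from the stats views (the table name here is invented):

-- Hypothetical table name, for illustration only.
SELECT relname, last_autovacuum, last_vacuum
FROM pg_stat_user_tables
WHERE relname = 'some_replicated_table';
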
Any insight/ideas would be much appreciated!
Thanks,
Jeremy
UPDATE: what is actually failing is a call to pg_xact_commit_timestamp(xmin) on a table underlying the view. We still think we must have some corruption in pg_commit_ts, though.
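
To illustrate, the failing pattern is roughly the following (a minimal sketch; the table and column names are invented):

-- Sketch of the failing pattern: exposing each row's commit timestamp
-- via its xmin. Names are hypothetical.
SELECT t.id, pg_xact_commit_timestamp(t.xmin) AS committed_at
FROM some_replicated_table t;
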
Jeremy Finzel wrote:

> UPDATE: what is actually failing is a call to
> pg_xact_commit_timestamp(xmin) on a given table under the view. We still
> think we must have some corruption though with pg_commit_ts.

This is not a valid query, because the xmin may belong to a frozen
tuple. pg_commit_ts does not keep data forever; old files are removed.
Anything older than datfrozenxid could return that error.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
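
For illustration, a minimal sketch of checking the freeze horizon and guarding the call accordingly (the table name is invented; age() sidesteps xid wraparound when comparing against datfrozenxid):

-- How far back commit timestamp data can be expected to reach:
SELECT datname, datfrozenxid, age(datfrozenxid) AS xids_behind
FROM pg_database
WHERE datname = current_database();

-- Guarded sketch: return NULL instead of erroring for xmins at or
-- beyond the freeze horizon, whose pg_commit_ts files may be gone.
SELECT t.id,
       CASE WHEN age(t.xmin) < (SELECT age(datfrozenxid)
                                FROM pg_database
                                WHERE datname = current_database())
            THEN pg_xact_commit_timestamp(t.xmin)
       END AS committed_at
FROM some_replicated_table t;
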
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

> Jeremy Finzel wrote:
>
>> UPDATE: what is actually failing is a call to
>> pg_xact_commit_timestamp(xmin) on a given table under the view. We still
>> think we must have some corruption though with pg_commit_ts.
>
> This is not a valid query, because the xmin may belong to a frozen
> tuple. pg_commit_ts does not keep data forever; old files are removed.
> Anything older than datfrozenxid could return that error.

OK, but what does it mean if the error is raised only on the two streaming
standbys and not on the master? We took a quick look to verify, comparing
just the pg_commit_ts file names between the master and the standbys, and
found no differences.

Thanks

--
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consulting@comcast.net
p: 312.241.7800
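
For reference, that file-name comparison can also be done from SQL (a sketch; pg_ls_dir requires superuser):

-- Run on the master and on each standby, then diff the output.
SELECT filename
FROM pg_ls_dir('pg_commit_ts') AS filename
ORDER BY 1;
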