Thread: locate DB corruption
Hello, I'm running into the following error running a large query on a database restored from WAL replay:
could not access status of transaction 330569126
DETAIL: Could not open file "pg_clog/0C68": No such file or directory
Searches of this mailing list seem to indicate this means a database row has been corrupted. How would I go about locating the point of corruption in order to fix?
thanks,
dave
On 08/31/2018 08:02 AM, Dave Peticolas wrote: > Hello, I'm running into the following error running a large query on a > database restored from WAL replay: > > could not access status of transaction 330569126 > DETAIL: Could not open file "pg_clog/0C68": No such file or directory Postgres version? Where is the replay coming from? > > Searches of this mailing list seem to indicate this means a database row > has been corrupted. How would I go about locating the point of > corruption in order to fix? To me it looks like what it says, the transaction file could not be found. From Postgres 9.6 --> 10 pg_clog became pg_xact. Are you sure you are not working across versions? If not do pg_clog/ and 0C68 actually exist? > > thanks, > dave > -- Adrian Klaver adrian.klaver@aklaver.com
On Fri, Aug 31, 2018 at 8:14 AM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 08/31/2018 08:02 AM, Dave Peticolas wrote:
> Hello, I'm running into the following error running a large query on a
> database restored from WAL replay:
>
> could not access status of transaction 330569126
> DETAIL: Could not open file "pg_clog/0C68": No such file or directory
Postgres version?
Right! Sorry, that original email didn't have a lot of info. This is 9.6.9 restoring a backup from 9.6.8.
Where is the replay coming from?
From a snapshot and WAL files stored in Amazon S3.
>
> Searches of this mailing list seem to indicate this means a database row
> has been corrupted. How would I go about locating the point of
> corruption in order to fix?
To me it looks like what it says, the transaction file could not be
found. From Postgres 9.6 --> 10 pg_clog became pg_xact.
Are you sure you are not working across versions?
I am sure, they are all 9.6.
If not do pg_clog/ and 0C68 actually exist?
pg_clog definitely exists, but 0C68 does not. I think I have subsequently found the precise row in the specific table that seems to be the problem. Specifically I can select * from TABLE where id = BADID - 1 or id = BADID + 1 and the query returns. I get the error if I select the row with the bad ID.
Now what I'm not sure of is how to fix.
On 08/31/2018 08:51 AM, Dave Peticolas wrote: > On Fri, Aug 31, 2018 at 8:14 AM Adrian Klaver <adrian.klaver@aklaver.com > <mailto:adrian.klaver@aklaver.com>> wrote: > > On 08/31/2018 08:02 AM, Dave Peticolas wrote: > > Hello, I'm running into the following error running a large query > on a > > database restored from WAL replay: > > > > could not access status of transaction 330569126 > > DETAIL: Could not open file "pg_clog/0C68": No such file or directory > > > Postgres version? > > > Right! Sorry, that original email didn't have a lot of info. This is > 9.6.9 restoring a backup from 9.6.8. > > Where is the replay coming from? > > > From a snapshot and WAL files stored in Amazon S3. Seems the process is not creating a consistent backup. How are they being generated? > Are you sure you are not working across versions? > > > I am sure, they are all 9.6. > > If not do pg_clog/ and 0C68 actually exist? > > > pg_clog definitely exists, but 0C68 does not. I think I have > subsequently found the precise row in the specific table that seems to > be the problem. Specifically I can select * from TABLE where id = BADID > - 1 or id = BADID + 1 and the query returns. I get the error if I select > the row with the bad ID. > > Now what I'm not sure of is how to fix. One thing I can think of is to rebuild from a later version of your S3 data and see if it has all the necessary files. There is also pg_resetxlog: https://www.postgresql.org/docs/9.6/static/app-pgresetxlog.html I have not used it, so I can not offer much in the way of tips. Just from reading the docs I would suggest stopping the server and then creating a backup of $PG_DATA(if possible) before using pg_resetxlog. -- Adrian Klaver adrian.klaver@aklaver.com
On Fri, Aug 31, 2018 at 5:19 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 08/31/2018 08:51 AM, Dave Peticolas wrote:
> On Fri, Aug 31, 2018 at 8:14 AM Adrian Klaver <adrian.klaver@aklaver.com
> <mailto:adrian.klaver@aklaver.com>> wrote:
>
> On 08/31/2018 08:02 AM, Dave Peticolas wrote:
> > Hello, I'm running into the following error running a large query
> on a
> > database restored from WAL replay:
> >
> > could not access status of transaction 330569126
> > DETAIL: Could not open file "pg_clog/0C68": No such file or directory
>
>
> Postgres version?
>
>
> Right! Sorry, that original email didn't have a lot of info. This is
> 9.6.9 restoring a backup from 9.6.8.
>
> Where is the replay coming from?
>
>
> From a snapshot and WAL files stored in Amazon S3.
Seems the process is not creating a consistent backup.
This time, yes. This setup has been working for almost two years with probably hundreds of restores in that time. But nothing's perfect I guess :)
How are they being generated?
The snapshots are sent to S3 via a tar process after calling the start backup function. I am following the postgres docs here. The WAL files are just copied to S3.
> Are you sure you are not working across versions?
>
>
> I am sure, they are all 9.6.
>
> If not do pg_clog/ and 0C68 actually exist?
>
>
> pg_clog definitely exists, but 0C68 does not. I think I have
> subsequently found the precise row in the specific table that seems to
> be the problem. Specifically I can select * from TABLE where id = BADID
> - 1 or id = BADID + 1 and the query returns. I get the error if I select
> the row with the bad ID.
>
> Now what I'm not sure of is how to fix.
One thing I can think of is to rebuild from a later version of your S3
data and see if it has all the necessary files.
Yes, I think that's a good idea, I'm trying that.
There is also pg_resetxlog:
https://www.postgresql.org/docs/9.6/static/app-pgresetxlog.html
I have not used it, so I can not offer much in the way of tips. Just
from reading the docs I would suggest stopping the server and then
creating a backup of $PG_DATA(if possible) before using pg_resetxlog.
Thanks, I didn't know about that. The primary DB seems OK so hopefully it won't be needed.
On Fri, Aug 31, 2018 at 8:48 PM Dave Peticolas <dave@krondo.com> wrote:
On Fri, Aug 31, 2018 at 5:19 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:On 08/31/2018 08:51 AM, Dave Peticolas wrote:
> On Fri, Aug 31, 2018 at 8:14 AM Adrian Klaver <adrian.klaver@aklaver.com
> <mailto:adrian.klaver@aklaver.com>> wrote:
>
> On 08/31/2018 08:02 AM, Dave Peticolas wrote:
> > Hello, I'm running into the following error running a large query
> on a
> > database restored from WAL replay:
> >
> > could not access status of transaction 330569126
> > DETAIL: Could not open file "pg_clog/0C68": No such file or directory
>
>
> Postgres version?
>
>
> Right! Sorry, that original email didn't have a lot of info. This is
> 9.6.9 restoring a backup from 9.6.8.
>
> Where is the replay coming from?
>
>
> From a snapshot and WAL files stored in Amazon S3.
Seems the process is not creating a consistent backup.This time, yes. This setup has been working for almost two years with probably hundreds of restores in that time. But nothing's perfect I guess :)How are they being generated?The snapshots are sent to S3 via a tar process after calling the start backup function. I am following the postgres docs here. The WAL files are just copied to S3.
> Are you sure you are not working across versions?
>
>
> I am sure, they are all 9.6.
>
> If not do pg_clog/ and 0C68 actually exist?
>
>
> pg_clog definitely exists, but 0C68 does not. I think I have
> subsequently found the precise row in the specific table that seems to
> be the problem. Specifically I can select * from TABLE where id = BADID
> - 1 or id = BADID + 1 and the query returns. I get the error if I select
> the row with the bad ID.
>
> Now what I'm not sure of is how to fix.
One thing I can think of is to rebuild from a later version of your S3
data and see if it has all the necessary files.Yes, I think that's a good idea, I'm trying that.There is also pg_resetxlog:
https://www.postgresql.org/docs/9.6/static/app-pgresetxlog.html
I have not used it, so I can not offer much in the way of tips. Just
from reading the docs I would suggest stopping the server and then
creating a backup of $PG_DATA(if possible) before using pg_resetxlog.Thanks, I didn't know about that. The primary DB seems OK so hopefully it won't be needed.
Well restoring from a backup of the primary does seem to have fixed the issue with the corrupt table.
On 09/01/2018 04:45 PM, Dave Peticolas wrote: > Well restoring from a backup of the primary does seem to have fixed the > issue with the corrupt table. Pretty sure it was not that the table was corrupt but that transaction information was missing from pg_clog. In a previous post you mentioned you ran tar to do the snapshot of $PG_DATA. Was there any error when tar ran the backup that caused you problems? -- Adrian Klaver adrian.klaver@aklaver.com
On Sat, Sep 1, 2018 at 5:09 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 09/01/2018 04:45 PM, Dave Peticolas wrote:
> Well restoring from a backup of the primary does seem to have fixed the
> issue with the corrupt table.
Pretty sure it was not that the table was corrupt but that transaction
information was missing from pg_clog.
In a previous post you mentioned you ran tar to do the snapshot of
$PG_DATA.
Was there any error when tar ran the backup that caused you problems?
Well the interesting thing about that is that although the bad table was originally discovered in a DB restored from a snapshot, I subsequently discovered it in the real-time clone of the primary from which the backups are made. So somehow the clone's table became corrupted. The same table was not corrupt on the primary, but I have discovered an error on the primary -- it's in the thread I posted today. These events seem correlated in time, I'll have to mine the logs some more.
Greetings, * Dave Peticolas (dave@krondo.com) wrote: > On Sat, Sep 1, 2018 at 5:09 PM Adrian Klaver <adrian.klaver@aklaver.com> > wrote: > > > On 09/01/2018 04:45 PM, Dave Peticolas wrote: > > > > > Well restoring from a backup of the primary does seem to have fixed the > > > issue with the corrupt table. > > > > Pretty sure it was not that the table was corrupt but that transaction > > information was missing from pg_clog. > > > > In a previous post you mentioned you ran tar to do the snapshot of > > $PG_DATA. > > > > Was there any error when tar ran the backup that caused you problems? > > Well the interesting thing about that is that although the bad table was > originally discovered in a DB restored from a snapshot, I subsequently > discovered it in the real-time clone of the primary from which the backups > are made. So somehow the clone's table became corrupted. The same table was > not corrupt on the primary, but I have discovered an error on the primary > -- it's in the thread I posted today. These events seem correlated in time, > I'll have to mine the logs some more. Has this primary been the primary since inception, or was it promoted to be one at some point after first being built as a replica..? Thanks! Stephen
Attachment
On Sun, Sep 2, 2018 at 4:51 AM Stephen Frost <sfrost@snowman.net> wrote:
Greetings,
* Dave Peticolas (dave@krondo.com) wrote:
> On Sat, Sep 1, 2018 at 5:09 PM Adrian Klaver <adrian.klaver@aklaver.com>
> wrote:
>
> > On 09/01/2018 04:45 PM, Dave Peticolas wrote:
> >
> > > Well restoring from a backup of the primary does seem to have fixed the
> > > issue with the corrupt table.
> >
> > Pretty sure it was not that the table was corrupt but that transaction
> > information was missing from pg_clog.
> >
> > In a previous post you mentioned you ran tar to do the snapshot of
> > $PG_DATA.
> >
> > Was there any error when tar ran the backup that caused you problems?
>
> Well the interesting thing about that is that although the bad table was
> originally discovered in a DB restored from a snapshot, I subsequently
> discovered it in the real-time clone of the primary from which the backups
> are made. So somehow the clone's table became corrupted. The same table was
> not corrupt on the primary, but I have discovered an error on the primary
> -- it's in the thread I posted today. These events seem correlated in time,
> I'll have to mine the logs some more.
Has this primary been the primary since inception, or was it promoted to
be one at some point after first being built as a replica..?
It was the primary since inception. All the problems now appear to have stemmed from the primary due to a bug in 9.6.8 (see other thread). I've since upgraded to 9.6.10.