Thread: locate DB corruption

locate DB corruption

From
Dave Peticolas
Date:
Hello, I'm running into the following error running a large query on a database restored from WAL replay:

could not access status of transaction 330569126
DETAIL: Could not open file "pg_clog/0C68": No such file or directory

Searches of this mailing list seem to indicate this means a database row has been corrupted. How would I go about locating the point of corruption in order to fix?

thanks,
dave

Re: locate DB corruption

From
Adrian Klaver
Date:
On 08/31/2018 08:02 AM, Dave Peticolas wrote:
> Hello, I'm running into the following error running a large query on a 
> database restored from WAL replay:
> 
> could not access status of transaction 330569126
> DETAIL: Could not open file "pg_clog/0C68": No such file or directory


Postgres version?

Where is the replay coming from?

> 
> Searches of this mailing list seem to indicate this means a database row 
> has been corrupted. How would I go about locating the point of 
> corruption in order to fix?

To me it looks like what it says, the transaction file could not be 
found. From Postgres 9.6 --> 10 pg_clog became pg_xact.

Are you sure you are not working across versions?

If not do pg_clog/ and 0C68 actually exist?

> 
> thanks,
> dave
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com


Re: locate DB corruption

From
Dave Peticolas
Date:
On Fri, Aug 31, 2018 at 8:14 AM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 08/31/2018 08:02 AM, Dave Peticolas wrote:
> Hello, I'm running into the following error running a large query on a
> database restored from WAL replay:
>
> could not access status of transaction 330569126
> DETAIL: Could not open file "pg_clog/0C68": No such file or directory


Postgres version?

Right! Sorry, that original email didn't have a lot of info. This is 9.6.9 restoring a backup from 9.6.8.
 
Where is the replay coming from?

From a snapshot and WAL files stored in Amazon S3.

 
>
> Searches of this mailing list seem to indicate this means a database row
> has been corrupted. How would I go about locating the point of
> corruption in order to fix?

To me it looks like what it says, the transaction file could not be
found. From Postgres 9.6 --> 10 pg_clog became pg_xact.

Are you sure you are not working across versions?

I am sure, they are all 9.6.
 
If not do pg_clog/ and 0C68 actually exist?

pg_clog definitely exists, but 0C68 does not. I think I have subsequently found the precise row in the specific table that seems to be the problem. Specifically I can select * from TABLE where id = BADID - 1 or id = BADID + 1 and the query returns. I get the error if I select the row with the bad ID.

Now what I'm not sure of is how to fix.

Re: locate DB corruption

From
Adrian Klaver
Date:
On 08/31/2018 08:51 AM, Dave Peticolas wrote:
> On Fri, Aug 31, 2018 at 8:14 AM Adrian Klaver <adrian.klaver@aklaver.com 
> <mailto:adrian.klaver@aklaver.com>> wrote:
> 
>     On 08/31/2018 08:02 AM, Dave Peticolas wrote:
>      > Hello, I'm running into the following error running a large query
>     on a
>      > database restored from WAL replay:
>      >
>      > could not access status of transaction 330569126
>      > DETAIL: Could not open file "pg_clog/0C68": No such file or directory
> 
> 
>     Postgres version?
> 
> 
> Right! Sorry, that original email didn't have a lot of info. This is 
> 9.6.9 restoring a backup from 9.6.8.
> 
>     Where is the replay coming from?
> 
> 
>  From a snapshot and WAL files stored in Amazon S3.

Seems the process is not creating a consistent backup.

How are they being generated?


>     Are you sure you are not working across versions?
> 
> 
> I am sure, they are all 9.6.
> 
>     If not do pg_clog/ and 0C68 actually exist?
> 
> 
> pg_clog definitely exists, but 0C68 does not. I think I have 
> subsequently found the precise row in the specific table that seems to 
> be the problem. Specifically I can select * from TABLE where id = BADID 
> - 1 or id = BADID + 1 and the query returns. I get the error if I select 
> the row with the bad ID.
> 
> Now what I'm not sure of is how to fix.

One thing I can think of is to rebuild from a later version of your S3 
data and see if it has all the necessary files.

There is also pg_resetxlog:

https://www.postgresql.org/docs/9.6/static/app-pgresetxlog.html

I have not used it, so I can not offer much in the way of tips. Just 
from reading the docs I would suggest stopping the server and then 
creating a backup of $PG_DATA(if possible) before using pg_resetxlog.


-- 
Adrian Klaver
adrian.klaver@aklaver.com


Re: locate DB corruption

From
Dave Peticolas
Date:
On Fri, Aug 31, 2018 at 5:19 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 08/31/2018 08:51 AM, Dave Peticolas wrote:
> On Fri, Aug 31, 2018 at 8:14 AM Adrian Klaver <adrian.klaver@aklaver.com
> <mailto:adrian.klaver@aklaver.com>> wrote:
>
>     On 08/31/2018 08:02 AM, Dave Peticolas wrote:
>      > Hello, I'm running into the following error running a large query
>     on a
>      > database restored from WAL replay:
>      >
>      > could not access status of transaction 330569126
>      > DETAIL: Could not open file "pg_clog/0C68": No such file or directory
>
>
>     Postgres version?
>
>
> Right! Sorry, that original email didn't have a lot of info. This is
> 9.6.9 restoring a backup from 9.6.8.
>
>     Where is the replay coming from?
>
>
>  From a snapshot and WAL files stored in Amazon S3.

Seems the process is not creating a consistent backup.

This time, yes. This setup has been working for almost two years with probably hundreds of restores in that time. But nothing's perfect I guess :)
 
How are they being generated?

The snapshots are sent to S3 via a tar process after calling the start backup function. I am following the postgres docs here. The WAL files are just copied to S3.
 

>     Are you sure you are not working across versions?
>
>
> I am sure, they are all 9.6.
>
>     If not do pg_clog/ and 0C68 actually exist?
>
>
> pg_clog definitely exists, but 0C68 does not. I think I have
> subsequently found the precise row in the specific table that seems to
> be the problem. Specifically I can select * from TABLE where id = BADID
> - 1 or id = BADID + 1 and the query returns. I get the error if I select
> the row with the bad ID.
>
> Now what I'm not sure of is how to fix.

One thing I can think of is to rebuild from a later version of your S3
data and see if it has all the necessary files.

Yes, I think that's a good idea, I'm trying that.
 
There is also pg_resetxlog:

https://www.postgresql.org/docs/9.6/static/app-pgresetxlog.html

I have not used it, so I can not offer much in the way of tips. Just
from reading the docs I would suggest stopping the server and then
creating a backup of $PG_DATA(if possible) before using pg_resetxlog.

Thanks, I didn't know about that. The primary DB seems OK so hopefully it won't be needed.
 

Re: locate DB corruption

From
Dave Peticolas
Date:
On Fri, Aug 31, 2018 at 8:48 PM Dave Peticolas <dave@krondo.com> wrote:
On Fri, Aug 31, 2018 at 5:19 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 08/31/2018 08:51 AM, Dave Peticolas wrote:
> On Fri, Aug 31, 2018 at 8:14 AM Adrian Klaver <adrian.klaver@aklaver.com
> <mailto:adrian.klaver@aklaver.com>> wrote:
>
>     On 08/31/2018 08:02 AM, Dave Peticolas wrote:
>      > Hello, I'm running into the following error running a large query
>     on a
>      > database restored from WAL replay:
>      >
>      > could not access status of transaction 330569126
>      > DETAIL: Could not open file "pg_clog/0C68": No such file or directory
>
>
>     Postgres version?
>
>
> Right! Sorry, that original email didn't have a lot of info. This is
> 9.6.9 restoring a backup from 9.6.8.
>
>     Where is the replay coming from?
>
>
>  From a snapshot and WAL files stored in Amazon S3.

Seems the process is not creating a consistent backup.

This time, yes. This setup has been working for almost two years with probably hundreds of restores in that time. But nothing's perfect I guess :)
 
How are they being generated?

The snapshots are sent to S3 via a tar process after calling the start backup function. I am following the postgres docs here. The WAL files are just copied to S3.
 

>     Are you sure you are not working across versions?
>
>
> I am sure, they are all 9.6.
>
>     If not do pg_clog/ and 0C68 actually exist?
>
>
> pg_clog definitely exists, but 0C68 does not. I think I have
> subsequently found the precise row in the specific table that seems to
> be the problem. Specifically I can select * from TABLE where id = BADID
> - 1 or id = BADID + 1 and the query returns. I get the error if I select
> the row with the bad ID.
>
> Now what I'm not sure of is how to fix.

One thing I can think of is to rebuild from a later version of your S3
data and see if it has all the necessary files.

Yes, I think that's a good idea, I'm trying that.
 
There is also pg_resetxlog:

https://www.postgresql.org/docs/9.6/static/app-pgresetxlog.html

I have not used it, so I can not offer much in the way of tips. Just
from reading the docs I would suggest stopping the server and then
creating a backup of $PG_DATA(if possible) before using pg_resetxlog.

Thanks, I didn't know about that. The primary DB seems OK so hopefully it won't be needed.

Well restoring from a backup of the primary does seem to have fixed the issue with the corrupt table. 

Re: locate DB corruption

From
Adrian Klaver
Date:
On 09/01/2018 04:45 PM, Dave Peticolas wrote:

> Well restoring from a backup of the primary does seem to have fixed the 
> issue with the corrupt table.

Pretty sure it was not that the table was corrupt but that transaction 
information was missing from pg_clog.

In a previous post you mentioned you ran tar to do the snapshot of 
$PG_DATA.

Was there any error when tar ran the backup that caused you problems?


-- 
Adrian Klaver
adrian.klaver@aklaver.com


Re: locate DB corruption

From
Dave Peticolas
Date:
On Sat, Sep 1, 2018 at 5:09 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 09/01/2018 04:45 PM, Dave Peticolas wrote:

> Well restoring from a backup of the primary does seem to have fixed the
> issue with the corrupt table.

Pretty sure it was not that the table was corrupt but that transaction
information was missing from pg_clog.

In a previous post you mentioned you ran tar to do the snapshot of
$PG_DATA.

Was there any error when tar ran the backup that caused you problems?

Well the interesting thing about that is that although the bad table was originally discovered in a DB restored from a snapshot, I subsequently discovered it in the real-time clone of the primary from which the backups are made. So somehow the clone's table became corrupted. The same table was not corrupt on the primary, but I have discovered an error on the primary -- it's in the thread I posted today. These events seem correlated in time, I'll have to mine the logs some more. 

Re: locate DB corruption

From
Stephen Frost
Date:
Greetings,

* Dave Peticolas (dave@krondo.com) wrote:
> On Sat, Sep 1, 2018 at 5:09 PM Adrian Klaver <adrian.klaver@aklaver.com>
> wrote:
>
> > On 09/01/2018 04:45 PM, Dave Peticolas wrote:
> >
> > > Well restoring from a backup of the primary does seem to have fixed the
> > > issue with the corrupt table.
> >
> > Pretty sure it was not that the table was corrupt but that transaction
> > information was missing from pg_clog.
> >
> > In a previous post you mentioned you ran tar to do the snapshot of
> > $PG_DATA.
> >
> > Was there any error when tar ran the backup that caused you problems?
>
> Well the interesting thing about that is that although the bad table was
> originally discovered in a DB restored from a snapshot, I subsequently
> discovered it in the real-time clone of the primary from which the backups
> are made. So somehow the clone's table became corrupted. The same table was
> not corrupt on the primary, but I have discovered an error on the primary
> -- it's in the thread I posted today. These events seem correlated in time,
> I'll have to mine the logs some more.

Has this primary been the primary since inception, or was it promoted to
be one at some point after first being built as a replica..?

Thanks!

Stephen

Attachment

Re: locate DB corruption

From
Dave Peticolas
Date:
On Sun, Sep 2, 2018 at 4:51 AM Stephen Frost <sfrost@snowman.net> wrote:
Greetings,

* Dave Peticolas (dave@krondo.com) wrote:
> On Sat, Sep 1, 2018 at 5:09 PM Adrian Klaver <adrian.klaver@aklaver.com>
> wrote:
>
> > On 09/01/2018 04:45 PM, Dave Peticolas wrote:
> >
> > > Well restoring from a backup of the primary does seem to have fixed the
> > > issue with the corrupt table.
> >
> > Pretty sure it was not that the table was corrupt but that transaction
> > information was missing from pg_clog.
> >
> > In a previous post you mentioned you ran tar to do the snapshot of
> > $PG_DATA.
> >
> > Was there any error when tar ran the backup that caused you problems?
>
> Well the interesting thing about that is that although the bad table was
> originally discovered in a DB restored from a snapshot, I subsequently
> discovered it in the real-time clone of the primary from which the backups
> are made. So somehow the clone's table became corrupted. The same table was
> not corrupt on the primary, but I have discovered an error on the primary
> -- it's in the thread I posted today. These events seem correlated in time,
> I'll have to mine the logs some more.

Has this primary been the primary since inception, or was it promoted to
be one at some point after first being built as a replica..?
 
It was the primary since inception. All the problems now appear to have stemmed from the primary due to a bug in 9.6.8 (see other thread). I've since upgraded to 9.6.10.