Thread: Broken primary key after backup restore.

Broken primary key after backup restore.

From
Michael Chau
Date:
Hi,

In Production, I have a DB2 which is replicated partially using Londiste from DB1. I make file-system backups nightly on both DBs.

Last Monday, when I restored the backup made from DB2 to a test server, Postgres(9.3.5) started up fine. But, I found out that the primary key of one of the tables is broken

select * from <mytable> order by id desc;
ERROR:  could not find left sibling of block 17392 in index "mytable_pkey"

I am able to select without using the id index. On Prod DB1 , DB2 and on another test server restored from backup made from DB1, there is no problem, as I am able to select the table with and without index.

The table has 5 million rows. And I run Vacuum Analyze once a week.

1) To fix the above error, I tried to run vacuum full on the table and run 'reindex table <mytable>;. But it didn't help as the reindexing has taken very very long time and not sure if it has finished or just timed out.

There is also a suggestion to recreate the primary key constraint concurrently which I will look into later.

2) However, my main concern right now is whether there is any corruption in the Prods table as it does look fine. Is there any way to check? And also should we trust a file-system backup in this case?

Thanks






Re: Broken primary key after backup restore.

From
Adrian Klaver
Date:
On 09/17/2015 04:31 PM, Michael Chau wrote:
> Hi,
>
> In Production, I have a DB2 which is replicated partially using Londiste
> from DB1.

Well I think the above needs more explanation to help understand how the
DB2 backup got into this state and possibly prevent it in the future.


I make file-system backups nightly on both DBs.

How is that done exactly?

>
> Last Monday, when I restored the backup made from DB2 to a test server,
> Postgres(9.3.5) started up fine. But, I found out that the primary key
> of one of the tables is broken
>
> # select * from <mytable> order by id desc;
> ERROR:  could not find left sibling of block 17392 in index "mytable_pkey"
>
> I am able to select without using the id index. On Prod DB1 , DB2 and on
> another test server restored from backup made from DB1, there is no
> problem, as I am able to select the table with and without index.

Did you restore to the DB2 derived test server in the same way as you
did the other servers?

>
> The table has 5 million rows. And I run Vacuum Analyze once a week.
>
> 1) To fix the above error, I tried to run vacuum full on the table and
> run 'reindex table <mytable>;. But it didn't help as the reindexing has
> taken very very long time and not sure if it has finished or just timed out.
>
> There is also a suggestion to recreate the primary key constraint
> concurrently which I will look into later.
>
> 2) However, my main concern right now is whether there is any corruption
> in the Prods table as it does look fine. Is there any way to check? And
> also should we trust a file-system backup in this case?
>
> Thanks
>
>
>
>
>
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: Broken primary key after backup restore.

From
Michael Chau
Date:
1) 
In Production, I have a DB2 which is replicated partially using Londiste
from DB1.

Well I think the above needs more explanation to help understand how the DB2 backup got into this state and possibly prevent it in the future.

A: So, the DB1 has several schemas in the database. We use Londiste to replicate just one of the schemas to DB2.  The table in question is in that schema. Backup is done on both DB1 and DB2.

2) 
I make file-system backups nightly on both DBs.

How is that done exactly?

A: To backup:

1) pg_start_backup()
2) tar up the files under the data directory
3) pg_stop_backup()

To restore on test server:

1) Just untar the tar ball, then start up Postgres. Of course the data directory is empty beforehand.

This has been working for almost 2 years without any problem until last Monday. I remember that I just ran vacuum analyze that table on both DB1 and DB2  that morning. But, I don't think that it harms anything.

Thanks


On Thu, Sep 17, 2015 at 4:53 PM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 09/17/2015 04:31 PM, Michael Chau wrote:
Hi,

In Production, I have a DB2 which is replicated partially using Londiste
from DB1.

Well I think the above needs more explanation to help understand how the DB2 backup got into this state and possibly prevent it in the future.


I make file-system backups nightly on both DBs.

How is that done exactly?


Last Monday, when I restored the backup made from DB2 to a test server,
Postgres(9.3.5) started up fine. But, I found out that the primary key
of one of the tables is broken

# select * from <mytable> order by id desc;
ERROR:  could not find left sibling of block 17392 in index "mytable_pkey"

I am able to select without using the id index. On Prod DB1 , DB2 and on
another test server restored from backup made from DB1, there is no
problem, as I am able to select the table with and without index.

Did you restore to the DB2 derived test server in the same way as you did the other servers?



The table has 5 million rows. And I run Vacuum Analyze once a week.

1) To fix the above error, I tried to run vacuum full on the table and
run 'reindex table <mytable>;. But it didn't help as the reindexing has
taken very very long time and not sure if it has finished or just timed out.

There is also a suggestion to recreate the primary key constraint
concurrently which I will look into later.

2) However, my main concern right now is whether there is any corruption
in the Prods table as it does look fine. Is there any way to check? And
also should we trust a file-system backup in this case?

Thanks








--
Adrian Klaver
adrian.klaver@aklaver.com



--
Michael Chau
Database Administrator
GAME GOLF
77 Geary St, 5th floor
San Francisco, CA 94108
e) michael.chau@gameyourgame.com
t) @GAMEGOLF

Re: Broken primary key after backup restore.

From
Adrian Klaver
Date:
On 09/17/2015 05:37 PM, Michael Chau wrote:
> 1)
>
>     In Production, I have a DB2 which is replicated partially using Londiste
>     from DB1.
>
>
> Well I think the above needs more explanation to help understand how the
> DB2 backup got into this state and possibly prevent it in the future.
>
> A: So, the DB1 has several schemas in the database. We use Londiste to
> replicate just one of the schemas to DB2.  The table in question is in
> that schema. Backup is done on both DB1 and DB2.
>
> 2)
> I make file-system backups nightly on both DBs.
>
> How is that done exactly?
>
> A: To backup:
>
> 1) pg_start_backup()
> 2) tar up the files under the data directory
> 3) pg_stop_backup()
>
> To restore on test server:
>
> 1) Just untar the tar ball, then start up Postgres. Of course the data
> directory is empty beforehand.
>
> This has been working for almost 2 years without any problem until last
> Monday. I remember that I just ran vacuum analyze that table on both DB1
> and DB2  that morning. But, I don't think that it harms anything.

Well it looks fairly straight forward, to me at least.


Anything in the Postgres logs when you started up the test server based
off the DB2 file system backup?

Anything in the original DB2 logs around the time you where taking the
backup?

>
> Thanks
>
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: Broken primary key after backup restore.

From
Guillaume Lelarge
Date:

Le 18 sept. 2015 5:23 AM, "Adrian Klaver" <adrian.klaver@aklaver.com> a écrit :
>
> On 09/17/2015 05:37 PM, Michael Chau wrote:
>>
>> 1)
>>
>>     In Production, I have a DB2 which is replicated partially using Londiste
>>     from DB1.
>>
>>
>> Well I think the above needs more explanation to help understand how the
>> DB2 backup got into this state and possibly prevent it in the future.
>>
>> A: So, the DB1 has several schemas in the database. We use Londiste to
>> replicate just one of the schemas to DB2.  The table in question is in
>> that schema. Backup is done on both DB1 and DB2.
>>
>> 2)
>> I make file-system backups nightly on both DBs.
>>
>> How is that done exactly?
>>
>> A: To backup:
>>
>> 1) pg_start_backup()
>> 2) tar up the files under the data directory
>> 3) pg_stop_backup()
>>
>> To restore on test server:
>>
>> 1) Just untar the tar ball, then start up Postgres. Of course the data
>> directory is empty beforehand.
>>
>> This has been working for almost 2 years without any problem until last
>> Monday. I remember that I just ran vacuum analyze that table on both DB1
>> and DB2  that morning. But, I don't think that it harms anything.
>
>
> Well it looks fairly straight forward, to me at least.
>

Do I miss something obvious? Because this is to me the wrong way to do the restore. You need to apply WAL files archived between pg_start_backup and pg_stop_backup to get consistent data files.

Re: Broken primary key after backup restore.

From
Adrian Klaver
Date:
On 09/17/2015 11:15 PM, Guillaume Lelarge wrote:
> Le 18 sept. 2015 5:23 AM, "Adrian Klaver" <adrian.klaver@aklaver.com
> <mailto:adrian.klaver@aklaver.com>> a écrit :
>  >
>  > On 09/17/2015 05:37 PM, Michael Chau wrote:
>  >>
>  >> 1)
>  >>
>  >>     In Production, I have a DB2 which is replicated partially using
> Londiste
>  >>     from DB1.
>  >>
>  >>
>  >> Well I think the above needs more explanation to help understand how the
>  >> DB2 backup got into this state and possibly prevent it in the future.
>  >>
>  >> A: So, the DB1 has several schemas in the database. We use Londiste to
>  >> replicate just one of the schemas to DB2.  The table in question is in
>  >> that schema. Backup is done on both DB1 and DB2.
>  >>
>  >> 2)
>  >> I make file-system backups nightly on both DBs.
>  >>
>  >> How is that done exactly?
>  >>
>  >> A: To backup:
>  >>
>  >> 1) pg_start_backup()
>  >> 2) tar up the files under the data directory
>  >> 3) pg_stop_backup()
>  >>
>  >> To restore on test server:
>  >>
>  >> 1) Just untar the tar ball, then start up Postgres. Of course the data
>  >> directory is empty beforehand.
>  >>
>  >> This has been working for almost 2 years without any problem until last
>  >> Monday. I remember that I just ran vacuum analyze that table on both DB1
>  >> and DB2  that morning. But, I don't think that it harms anything.
>  >
>  >
>  > Well it looks fairly straight forward, to me at least.
>  >
>
> Do I miss something obvious? Because this is to me the wrong way to do
> the restore. You need to apply WAL files archived between
> pg_start_backup and pg_stop_backup to get consistent data files.

Would that not be taken care of by the tar data directory/ untar data
directory?

I would think if it was a WAL issue the OP could never get the server to
start and get to the point the query failed on a single table and
column. All that being said, I think the OP would be better served by
pg_basebackup:

http://www.postgresql.org/docs/9.4/static/app-pgbasebackup.html


>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: Broken primary key after backup restore.

From
Jeff Janes
Date:
On Fri, Sep 18, 2015 at 6:16 AM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 09/17/2015 11:15 PM, Guillaume Lelarge wrote:
Le 18 sept. 2015 5:23 AM, "Adrian Klaver" <adrian.klaver@aklaver.com
<mailto:adrian.klaver@aklaver.com>> a écrit :
 >
 > On 09/17/2015 05:37 PM, Michael Chau wrote:
 >>
 >> To restore on test server:
 >>
 >> 1) Just untar the tar ball, then start up Postgres. Of course the data
 >> directory is empty beforehand.
 >>
 >> This has been working for almost 2 years without any problem until last
 >> Monday. I remember that I just ran vacuum analyze that table on both DB1
 >> and DB2  that morning. But, I don't think that it harms anything.
 >
 >
 > Well it looks fairly straight forward, to me at least.
 >

Do I miss something obvious? Because this is to me the wrong way to do
the restore. You need to apply WAL files archived between
pg_start_backup and pg_stop_backup to get consistent data files.

Would that not be taken care of by the tar data directory/ untar data directory?

Only if you are very lucky.  If your tar command tars up the pg_xlog directory as the last thing it does, then you are probably going to be OK.  Otherwise, it is a crap shoot.

 
I would think if it was a WAL issue the OP could never get the server to start and get to the point the query failed on a single table and column.

With pg_basebackup, that is probably the case, as it either doesn't copy xlog at all, or if it does it makes sure it is complete.  But with tar, you have no such protection.

 
All that being said, I think the OP would be better served by pg_basebackup:

http://www.postgresql.org/docs/9.4/static/app-pgbasebackup.html


Yes, indeed.

Cheers,

Jeff

Re: Broken primary key after backup restore.

From
Michael Chau
Date:
Hi Jeff,

>Only if you are very lucky.  If your tar command tars up the pg_xlog directory as the last thing it does, then you are probably going to be OK.  Otherwise, it is a crap shoot.

May be that's it. I have another similar set up, but the pg_xlog is a soft link to another directory, and I use 'tar -chvzf'. It tar up the pg_xlog at the very last. And the restore is fine.

For this one, DB1 and DB2, the pg_xlog is the directory itself, and I use 'tar -cvzf'. And it tar up pg_xlog at the beginning. I always have doubt about it. But I though pg_stop_backup() and pg_start_backup() like freezing would prevent the inconsistency.

Indeed, I will look inot pgbasebackup.

Thanks,



On Fri, Sep 18, 2015 at 11:20 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Fri, Sep 18, 2015 at 6:16 AM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 09/17/2015 11:15 PM, Guillaume Lelarge wrote:
Le 18 sept. 2015 5:23 AM, "Adrian Klaver" <adrian.klaver@aklaver.com
<mailto:adrian.klaver@aklaver.com>> a écrit :
 >
 > On 09/17/2015 05:37 PM, Michael Chau wrote:
 >>
 >> To restore on test server:
 >>
 >> 1) Just untar the tar ball, then start up Postgres. Of course the data
 >> directory is empty beforehand.
 >>
 >> This has been working for almost 2 years without any problem until last
 >> Monday. I remember that I just ran vacuum analyze that table on both DB1
 >> and DB2  that morning. But, I don't think that it harms anything.
 >
 >
 > Well it looks fairly straight forward, to me at least.
 >

Do I miss something obvious? Because this is to me the wrong way to do
the restore. You need to apply WAL files archived between
pg_start_backup and pg_stop_backup to get consistent data files.

Would that not be taken care of by the tar data directory/ untar data directory?

Only if you are very lucky.  If your tar command tars up the pg_xlog directory as the last thing it does, then you are probably going to be OK.  Otherwise, it is a crap shoot.

 
I would think if it was a WAL issue the OP could never get the server to start and get to the point the query failed on a single table and column.

With pg_basebackup, that is probably the case, as it either doesn't copy xlog at all, or if it does it makes sure it is complete.  But with tar, you have no such protection.

 
All that being said, I think the OP would be better served by pg_basebackup:

http://www.postgresql.org/docs/9.4/static/app-pgbasebackup.html


Yes, indeed.

Cheers,

Jeff




--
Michael Chau
Database Administrator
GAME GOLF
77 Geary St, 5th floor
San Francisco, CA 94108
c) 510-366-3800
e) michael.chau@gameyourgame.com
t) @GAMEGOLF

Re: Broken primary key after backup restore.

From
David Steele
Date:
On 9/18/15 3:44 PM, Michael Chau wrote:
> Hi Jeff,
>
>>Only if you are very lucky.  If your tar command tars up the pg_xlog directory as the last thing it does, then you
areprobably going to be OK.  Otherwise, it is a crap shoot. 
>
> May be that's it. I have another similar set up, but the pg_xlog is a
> soft link to another directory, and I use 'tar -chvzf'. It tar up the
> pg_xlog at the very last. And the restore is fine.

This is still not always safe.  It depends on your wal_keep_segments
settings and some luck.  WAL segments can be recycled during the backup.

> For this one, DB1 and DB2, the pg_xlog is the directory itself, and I
> use 'tar -cvzf'. And it tar up pg_xlog at the beginning. I always have
> doubt about it. But I though pg_stop_backup() and pg_start_backup() like
> freezing would prevent the inconsistency.

This is definitely not a good idea.

> Indeed, I will look inot pgbasebackup.

pg_basebackup is good for creating replicas but for real backup you
might want to consider purpose-built backup software like pgBackRest or
barman.

--
-David
david@pgmasters.net


Re: Broken primary key after backup restore.

From
Michael Chau
Date:
Hi,

Just want to let everybody knows

So, I found out this morning what went wrong.

For some reason, there were some bad wal log files in pg_xlog. I believe that they got generated during the backup last Monday, but I don't know why. I speculate that may be the wal_keep_segments was not set high enough as I have changed it recently.

Luckily, I have archived the wal log files. And by comparing between the two directories, I did see those bad wal log files in pg_xlog directory only.

Yes, we were looking at Barman before. But at that time Barman didn't support our version of PostgreSQL. I think that I will pursue this.

Thanks again.

On Fri, Sep 18, 2015 at 2:40 PM, David Steele <david@pgmasters.net> wrote:
On 9/18/15 3:44 PM, Michael Chau wrote:
Hi Jeff,

Only if you are very lucky.  If your tar command tars up the pg_xlog directory as the last thing it does, then you are probably going to be OK.  Otherwise, it is a crap shoot.

May be that's it. I have another similar set up, but the pg_xlog is a
soft link to another directory, and I use 'tar -chvzf'. It tar up the
pg_xlog at the very last. And the restore is fine.

This is still not always safe.  It depends on your wal_keep_segments settings and some luck.  WAL segments can be recycled during the backup.

For this one, DB1 and DB2, the pg_xlog is the directory itself, and I
use 'tar -cvzf'. And it tar up pg_xlog at the beginning. I always have
doubt about it. But I though pg_stop_backup() and pg_start_backup() like
freezing would prevent the inconsistency.

This is definitely not a good idea.

Indeed, I will look inot pgbasebackup.

pg_basebackup is good for creating replicas but for real backup you might want to consider purpose-built backup software like pgBackRest or barman.

--
-David
david@pgmasters.net



--
Michael Chau
Database Administrator
GAME GOLF
77 Geary St, 5th floor
San Francisco, CA 94108
c) 510-366-3800
e) michael.chau@gameyourgame.com
t) @GAMEGOLF

Re: Broken primary key after backup restore.

From
Kevin Grittner
Date:
Michael Chau <michael.chau@gameyourgame.com> wrote:

> For some reason, there were some bad wal log files in pg_xlog. I
> believe that they got generated during the backup last Monday,
> but I don't know why. I speculate that may be the
> wal_keep_segments was not set high enough as I have changed it
> recently.
>
> Luckily, I have archived the wal log files. And by comparing
> between the two directories, I did see those bad wal log files in
> pg_xlog directory only.

The instructions for making a backup explicitly say to exclude or
delete the files in the pg_xlog directory and use those from the
archive.  That's because tar (or whatever you are using to copy the
files) may copy a WAL file before some change made during the
backup is written to it.  What you were doing is not supported and
likely to appear to work sometimes (possibly even without hidden
corruption), and fail to create a backup that will even start on
other attempts.

You might find this blog post helpful:

http://tbeitr.blogspot.com/2015/07/deleting-backuplabel-on-restore-will.html

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company