Thread: : PostgreSQL Online Backup

: PostgreSQL Online Backup

From
Venkat Balaji
Date:
Hello Everyone,

We have had situations where-in "rsync" was executed without executing "pg_start_backup()" on the production data directory and on the next runs, "pg_start_backup()" has been executed with "rsync". This was to avoid high IO load on production. We ended up getting unmatched files (especially in pg_clog) and not sure about "base" directory.

Postgres is asking for WAL Archive files dated sometime around 15 days ago. We are absolutely not sure whats going on.

Is this dangerous for production (like corruption) ? or just the backup will be invalid ? Please help us know if we have to perform any precautionary checks on the production cluster.

Apart from firing a checkpoint, does "pg_start_backup()" updates any dictionary tables or views ? or it updates anything in "pg_xlog"

Looking forward for your help !

Thanks
VB


Re: : PostgreSQL Online Backup

From
Venkat Balaji
Date:
I tried restoring the backup, after taking the full backup.

Below is what i see in the "archive destination".

Postgres was asking for "00000001000001930000006F" and i tried to find the same and below is what i find...

-rw------- 1 postgres postgres 3.3M Sep 26 02:06 00000001000001930000006F.gz
-rw------- 1 postgres postgres  219 Sep 26 02:53 00000001000001930000006F.00328508.backup.gz

Why is PG (9.0) putting an extension for the WAL Archive file as "backup.gz" ??

Please help !

Thanks
VB

On Mon, Sep 26, 2011 at 5:11 PM, Venkat Balaji <venkat.balaji@verse.in> wrote:
Hello Everyone,

We have had situations where-in "rsync" was executed without executing "pg_start_backup()" on the production data directory and on the next runs, "pg_start_backup()" has been executed with "rsync". This was to avoid high IO load on production. We ended up getting unmatched files (especially in pg_clog) and not sure about "base" directory.

Postgres is asking for WAL Archive files dated sometime around 15 days ago. We are absolutely not sure whats going on.

Is this dangerous for production (like corruption) ? or just the backup will be invalid ? Please help us know if we have to perform any precautionary checks on the production cluster.

Apart from firing a checkpoint, does "pg_start_backup()" updates any dictionary tables or views ? or it updates anything in "pg_xlog"

Looking forward for your help !

Thanks
VB



Re: : PostgreSQL Online Backup

From
"Albe Laurenz"
Date:
Venkat Balaji wrote:
> We have had situations where-in "rsync" was executed without executing
"pg_start_backup()" on the
> production data directory and on the next runs, "pg_start_backup()"
has been executed with "rsync".
> This was to avoid high IO load on production. We ended up getting
unmatched files (especially in
> pg_clog) and not sure about "base" directory.
>
> Postgres is asking for WAL Archive files dated sometime around 15 days
ago. We are absolutely not sure
> whats going on.
>
> Is this dangerous for production (like corruption) ? or just the
backup will be invalid ? Please help
> us know if we have to perform any precautionary checks on the
production cluster.
>
> Apart from firing a checkpoint, does "pg_start_backup()" updates any
dictionary tables or views ? or
> it updates anything in "pg_xlog"
>
> Looking forward for your help !

I am not sure what the problem is.

Do you have problems starting the original PostgreSQL cluster,
or do you have problems restoring a backup?

Running pg_start_backup() will not harm the cluster.
End online backup mode by running pg_stop_backup() or removing
the backup_label file in the cluster directory.

Yours,
Laurenz Albe

Re: : PostgreSQL Online Backup

From
Alan Hodgson
Date:
On September 26, 2011 05:49:50 AM Venkat Balaji wrote:
> I tried restoring the backup, after taking the full backup.
>
> Below is what i see in the "archive destination".
>
> Postgres was asking for "00000001000001930000006F" and i tried to find the
> same and below is what i find...
>
> -rw------- 1 postgres postgres 3.3M Sep 26 02:06
> 00000001000001930000006F.gz -rw------- 1 postgres postgres  219 Sep 26
> 02:53
> 00000001000001930000006F.00328508.backup.gz
>
> Why is PG (9.0) putting an extension for the WAL Archive file as
> > "backup.gz" ??
>

The archive files are created by your archive_command, as specified in
postgresql.conf. My guess would be that your archive command runs the files
through gzip as part of archiving (which is fine).

However, the restore_command you specify in recovery.conf  must undo this
compression. So instead of (for example) 'cp -f "%f" "%p"', it might instead
need to look like 'zcat "%f" > "%p"'.

Hope this helps.



Re: : PostgreSQL Online Backup

From
Venkat Balaji
Date:
Thanks for all your inputs !

Our problem is -

We had mistakenly executed  "rsync" on the running PostgreSQL data directory (production) and we did not run "pg_start_backup()".

Will this harm production ? can this lead to corruption ?

Thanks -

On Mon, Sep 26, 2011 at 10:29 PM, Alan Hodgson <ahodgson@simkin.ca> wrote:
On September 26, 2011 05:49:50 AM Venkat Balaji wrote:
> I tried restoring the backup, after taking the full backup.
>
> Below is what i see in the "archive destination".
>
> Postgres was asking for "00000001000001930000006F" and i tried to find the
> same and below is what i find...
>
> -rw------- 1 postgres postgres 3.3M Sep 26 02:06
> 00000001000001930000006F.gz -rw------- 1 postgres postgres  219 Sep 26
> 02:53
> 00000001000001930000006F.00328508.backup.gz
>
> Why is PG (9.0) putting an extension for the WAL Archive file as
> > "backup.gz" ??
>

The archive files are created by your archive_command, as specified in
postgresql.conf. My guess would be that your archive command runs the files
through gzip as part of archiving (which is fine).

However, the restore_command you specify in recovery.conf  must undo this
compression. So instead of (for example) 'cp -f "%f" "%p"', it might instead
need to look like 'zcat "%f" > "%p"'.

Hope this helps.



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: : PostgreSQL Online Backup

From
"Albe Laurenz"
Date:
Venkat Balaji wrote:
> Our problem is -
>
> We had mistakenly executed  "rsync" on the running PostgreSQL data
directory (production) and we did
> not run "pg_start_backup()".
>
> Will this harm production ? can this lead to corruption ?

I assume that you used rsync to copy *from* the data directory.

This cannot lead to data corruption.
Only performance might suffer temporarily due to the additional I/O.

The backup made with rsync will be unusable without pg_start_backup().

Yours,
Laurenz Albe

Re: : PostgreSQL Online Backup

From
Venkat Balaji
Date:
Another problem in recovery (probably because of "rsync") -

As said earlier, we are taking a production backup everyday incrementally using "rsync".

But, Postgres some how misses to sync few files in between and keeps on asking the back dated archive files (more than 1 week ago). 

I restored October 2nd backup and PG is asking for September 26th archive file with the last known time as 26th Sep, 2011. 

2011-10-03 07:17:12 CDT [12705]: [1-1] LOG:  database system was interrupted; last known up at 2011-09-26 09:01:36 CDT
2011-10-03 07:17:12 CDT [12705]: [2-1] LOG:  starting archive recovery
cp: cannot stat `/usr/local/pgsql9.0.1/obtdata/data/pg_xlog/000000010000053900000076': No such file or directory
2011-10-03 07:17:12 CDT [12705]: [3-1] LOG:  could not open file "pg_xlog/000000010000053900000076" (log file 1337, segment 118): No such file or directory
2011-10-03 07:17:12 CDT [12705]: [4-1] LOG:  invalid checkpoint record
2011-10-03 07:17:12 CDT [12705]: [5-1] PANIC:  could not locate required checkpoint record
2011-10-03 07:17:12 CDT [12705]: [6-1] HINT:  If you are not restoring from a backup, try removing the file "/usr/local/pgsql9.0.1/obtdata/data/backup_label".
2011-10-03 07:17:12 CDT [12702]: [1-1] LOG:  startup process (PID 12705) was terminated by signal 6: Aborted
2011-10-03 07:17:12 CDT [12702]: [2-1] LOG:  aborting startup due to startup process failure


I always see pg_clog files and some base files not getting synced.

Below is what we are doing -

pg_start_backup()
rsync the data directory
pg_stop_backup()

The first time "rsync" is fine, but, the subsequent runs are generating in-consistency.

We do the same every day to backup the data directory incrementally.

What i observed is PG records the TXN id when ever backup starts and stops + backup label. The next day when PG records the start backup time and TXN id, i think some of the TXN ids and pg_clog files generated between last stop time and the next start time are missed.

Did anyone observe this behavior ?? Please help !

This is critical for us. I want to recommend not to use "rsync" (use cp or scp instead) for production backup.

Thanks
VB

On Tue, Sep 27, 2011 at 2:36 PM, Albe Laurenz <laurenz.albe@wien.gv.at> wrote:
Venkat Balaji wrote:
> Our problem is -
>
> We had mistakenly executed  "rsync" on the running PostgreSQL data
directory (production) and we did
> not run "pg_start_backup()".
>
> Will this harm production ? can this lead to corruption ?

I assume that you used rsync to copy *from* the data directory.

This cannot lead to data corruption.
Only performance might suffer temporarily due to the additional I/O.

The backup made with rsync will be unusable without pg_start_backup().

Yours,
Laurenz Albe

Re: : PostgreSQL Online Backup

From
Alan Hodgson
Date:
On October 3, 2011 05:33:35 AM Venkat Balaji wrote:
> Did anyone observe this behavior ?? Please help !
>
> This is critical for us. I want to recommend not to use "rsync" (use cp or
> scp instead) for production backup.
>

rsync works fine. Why exactly can't the recovery find the backed up copy of
000000010000053900000076? Please post your archive_command settings, the
contents of any script(s) called by that, and the recovery.conf file you're
using that's having problems, as well as the complete process you followed to
initiate recovery. I strongly suspect you're missing part of the process of
actually saving the WAL files needed for recovery.

Re: : PostgreSQL Online Backup

From
Venkat Balaji
Date:

The recovery is unable to find the WAL archive because, it was generated on 26th September.

Whereas the backup is as taken on Oct 2nd, 2011. We deleted all the files.

I do not have that WAL archive copy.

The problem area -

I found that a pg_clog file dated 26th Sep, 2011 is not synced (its not 256K).

Thanks
VB

2011/10/3 Alan Hodgson <ahodgson@simkin.ca>
On October 3, 2011 05:33:35 AM Venkat Balaji wrote:
> Did anyone observe this behavior ?? Please help !
>
> This is critical for us. I want to recommend not to use "rsync" (use cp or
> scp instead) for production backup.
>

rsync works fine. Why exactly can't the recovery find the backed up copy of
000000010000053900000076? Please post your archive_command settings, the
contents of any script(s) called by that, and the recovery.conf file you're
using that's having problems, as well as the complete process you followed to
initiate recovery. I strongly suspect you're missing part of the process of
actually saving the WAL files needed for recovery.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: : PostgreSQL Online Backup

From
Alan Hodgson
Date:
> > rsync works fine. Why exactly can't the recovery find the backed up copy
> > of 000000010000053900000076? Please post your archive_command settings,
> > the contents of any script(s) called by that, and the recovery.conf file
> > you're using that's having problems, as well as the complete process you
> > followed to
> > initiate recovery. I strongly suspect you're missing part of the process
> > of actually saving the WAL files needed for recovery.

> The recovery is unable to find the WAL archive because, it was generated on
> 26th September.
>
> Whereas the backup is as taken on Oct 2nd, 2011. We deleted all the files.
>
> I do not have that WAL archive copy.
>
> The problem area -
>
> I found that a pg_clog file dated 26th Sep, 2011 is not synced (its not
> 256K).
>

I'm going to need the rest of what I asked for to offer any further suggestions
- especially the full and exact steps you took to initiate recovery and the
contents of recovery.conf. Also, please don't top-post.

Re: : PostgreSQL Online Backup

From
Venkat Balaji
Date:
Sorry for not responding to this email for so long.

Alan, 

We had mentioned the following line in recovery.conf file (we had given pg_xlog location since we did not have WAL archives) -

restore_command = 'cp <data dir>/pg_xlog/%f %p'

We found where the problem was -

Here is what i did -

1. We had taken a full backup using pg_start_backup() and pg_stop_backup() on Day 1
2. Rest of the days (from Day 2 - Day 15), we had incrementally backed-up ( this is also using pg_start_backup() and pg_stop_backup())
3. On Day-16th, when i started the recovery, PG was asking Day 1's WAL archive file, which we did not have.

A fresh complete backup with change in our backup strategy resolved the issue.

Thanks a lot for all your inputs and help on this !!

Regards,
VB

2011/10/4 Alan Hodgson <ahodgson@simkin.ca>
> > rsync works fine. Why exactly can't the recovery find the backed up copy
> > of 000000010000053900000076? Please post your archive_command settings,
> > the contents of any script(s) called by that, and the recovery.conf file
> > you're using that's having problems, as well as the complete process you
> > followed to
> > initiate recovery. I strongly suspect you're missing part of the process
> > of actually saving the WAL files needed for recovery.

> The recovery is unable to find the WAL archive because, it was generated on
> 26th September.
>
> Whereas the backup is as taken on Oct 2nd, 2011. We deleted all the files.
>
> I do not have that WAL archive copy.
>
> The problem area -
>
> I found that a pg_clog file dated 26th Sep, 2011 is not synced (its not
> 256K).
>

I'm going to need the rest of what I asked for to offer any further suggestions
- especially the full and exact steps you took to initiate recovery and the
contents of recovery.conf. Also, please don't top-post.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general