Thread: Issue with rsync based incremental backup : postgres: startup process waiting for 0000000100000001000000D2

Hi,

I'm trying to make a periodic (daily) incremental backup using rsync but noticing a message that postgres startup process is waiting on so and so XLOG file indefinitely.

I intend to schedule a cron job to periodically execute this rsync backup process by : a) stop postgres server on backup server b) rsync the data directory of original server to backup server c) start postgres on secondary d) ensure that postgres starts up with out any issue e) execute some test cases so as to ensure that there no data integrity issues f) execute more tests to ensure that data is valid from business logic perspective.

One may ask why am i using rsync backup instead of the log shipping bases streaming replication; well, my answer would be that our system would be having a lot of updates and fewer additions and since the transactions would be piling up for each update, the WAL files accumulate so much that we would be needing to archive much more data than the actual changes. Further, transferring such huge amount of logs is going to consume lot of time and bandwidth. Instead, I suppose that with rsync based backup we just need to sync the actual data files and need to worry about archiving the older WAL files, thus saving on the time of transfer, bandwidth costs, and WAL archive management.

I would like to know if syncing the $PGDATA directory with rsync would be good enough for incremental backups or that would create data integrity issues.

Everytime i run the rsync command, I could successfully start the backup server and am able to see the latest data from the original server, but am a bit worried about this message which may be hinting that the backup is not complete and is missing some essential WAL files.

Is there any additional steps or rsync configuration options that i need to specify in order to make this sync complete?

Please let me know what can be done to make the rsync process complete so that the backup server does not complain about waiting for missing WAL files.

Here are the steps i executed on master and slave:

On Backup Server
root@backup ~]# 
root@backup ~]# service postgresql stop
Stopping postgresql service:                               [  OK  ]
root@backup ~]# 

On Original Server
root@original ~]# 
root@original ~]# su -l postgres -s /bin/bash -c "psql -d test"
test=# 
test=# SELECT pg_start_backup('rsync_backup');
test=#\q
root@original ~]# 
root@original ~]# rsync --archive --checksum --compress --progress  --exclude postgresql.conf --exclude pg_hba.conf --exclude server.crt --exclude server.key /var/lib/pgsql/data/* root@backup.server:/var/lib/pgsql/data/
root@original ~]# su -l postgres -s /bin/bash -c "psql -d test"
test=# 
test=# SELECT pg_stop_backup();
test=#\q
root@original~]# 

On Backup Server
root@backup ~]# 
root@backup ~]# service postgresql start
Starting postgresql service:                               [  OK  ]
root@backup ~]# 
root@backup  ~]#  ps aux  | grep postgres
postgres 18210  5.3  1.5 1811408 88540 ?       S    20:43   0:00 /usr/pgsql-9.1/bin/postmaster -p 5432 -D /var/lib/pgsql/data
postgres 18220  0.0  0.0 115496  1144 ?        Ss   20:43   0:00 postgres: logger process                             
postgres 18221  0.0  0.0 1812252 2328 ?        Ss   20:43   0:00 postgres: startup process   waiting for 0000000100000001000000D2
postgres 18222  0.0  0.0 1812180 2280 ?        Ss   20:43   0:00 postgres: writer process                             
postgres 18223  0.0  0.0 117744  1388 ?        Ss   20:43   0:00 postgres: stats collector process                    
postgres 18337  0.6  0.1 1814308 8092 ?        Ss   20:43   0:00 postgres: avaya_system_app_user avmgmt 127.0.0.1(34207) idle
postgres 18406  0.4  0.1 1815792 9272 ?        Ss   20:43   0:00 postgres: avaya_system_app_user avmgmt 127.0.0.1(34217) idle
postgres 18477  0.0  0.1 1813684 6680 ?        Ss   20:43   0:00 postgres: avaya_sseuser avmgmt 127.0.0.1(34231) idle 
root     18479  0.0  0.0  61160   728 pts/1    R+   20:43   0:00 grep postgres
root@backup ~]# 

Please advise me if what i'm doing is makes sense and is an accepted mechanism for taking backups or if there is any other procedure that i can emplpoy to avoid unnecessarily archiving gigabytes of WAL logs which may be growing many times the size of the actual data directory.

Thanks and Regards,
Samba 


On Thu, May 3, 2012 at 11:49 AM, Samba <saasira@gmail.com> wrote:
Hi,

Please advise me if what i'm doing is makes sense and is an accepted mechanism for taking backups or if there is any other procedure that i can emplpoy to avoid unnecessarily archiving gigabytes of WAL logs which may be growing many times the size of the actual data directory.

Thanks and Regards,
Samba


The problem is that rsync isn't copying all the xlog files created during the time the rsync is taking place, which is why it is complaining that there are files missing.

There may be other logical flaws with your process as well.

Something similar to the steps given in "Starting Replication with only a Quick Master Restart" as laid out in the wiki tutorial on binary replication might give you a way to make this work.  (You probably won't need the restart of the master, since you're not actually setting up replication, so you won't be changing the postgresql.conf file on your master.)

This uses a two-step process.  First you copy all the files EXCEPT the ones on pg_xlog, then you copy those files, so you have a complete set.

See http://wiki.postgresql.org/wiki/Binary_Replication_Tutorial
--
Mike Nolan
Thanks Mike,

It worked! 

I had even setup streaming replication after doing incremental replication, without needing to stop postgres on the primary server.

Here is the script i came up with :

#!/bin/bash
if [ $# -ne 1 ]; then
echo "you must specify the hostname of the backup server";
exit 0;
fi;
BACKUP_SERVER=$1
PGDATA=/var/lib/pgsql/data
PGXLOG=$PGDATA/pg_xlog
PGEXEC="sudo su -l postgres -s /bin/bash -c"
RSYNC="rsync"
OPTIONS="--archive --checksum --compress --progress"
EXCLUDES="--exclude postmaster.pid --exclude postgresql.conf --exclude pg_hba.conf --exclude server.crt --exclude server.key"
ROLLOVER=32
SSH="ssh -q -o StrictHostKeyChecking=no -o BatchMode=yes $BACKUP_SERVER"
REPLICATION_CHECK="$SSH ps aux | grep postgres | grep wal | grep receiver"
#On BACKUP_SERVER
if [ -n "$(service postgresql status | grep "pid[:blank:]*[0-9]*")" ]; then 
$SSH "service postgresql stop"
fi;
#On PRIMARY
echo "Running VACUUM"
$PGEXEC "psql -c \"VACUUM FULL;\""
echo "VACUUM completed"

for f in $(ls -tr $PGXLOG | head -n ${ROLLOVER}); do 
$RSYNC $OPTIONS $PGXLOG/$f $BACKUP_SERVER:$PGXLOG/
done;
$PGEXEC "psql -c \"SELECT pg_start_backup('incremental_backup',true);\""
$RSYNC $OPTIONS $EXCLUDES --exclude pg_xlog $PGDATA $BACKUP_SERVER:$PGDATA
$PGEXEC "psql -c \"SELECT pg_stop_backup();\""
$RSYNC $OPTIONS $PGXLOG $BACKUP_SERVER:$PGXLOG
$RSYNC $OPTIONS $PGXLOG $BACKUP_SERVER:$PGXLOG
#On BACKUP_SERVER
$SSH "service postgresql start"
if [ -z "$(service postgresql status | grep "pid[:blank:]*[0-9]*")" ]; then 
echo "Failed to start database on backup server"
        echo "Look into the postgres logs for more details"
        echo "exiting..."
        exit 1;
fi;
#need to improve this delay-check to wait until the backup server has finished recovery and started into streaming mode
sleep 30
if [ -n "$(${REPLICATION_CHECK})" ] ; then
echo "SUCCESS in synching BACKUP_SERVER with the latest data from Primary"; 
#On BACKUP_SERVER
$SSH "service postgresql stop"
echo "Stopped the backup server in good state; it will get updated in the next scheduled incremental backup"
else 
echo "FAILED to sync backup server with Primary";
        echo "Leaving the backup server running in the failed state for further debugging"
        exit 1; 
fi;
exit 0;

I hope this would help others in need...

Thanks and Regards,
Samba

----------------------------------------------------------------------------------------------------------------------------------------------------------------
On Thu, May 3, 2012 at 11:55 PM, Michael Nolan <htfoot@gmail.com> wrote:


On Thu, May 3, 2012 at 11:49 AM, Samba <saasira@gmail.com> wrote:
Hi,

Please advise me if what i'm doing is makes sense and is an accepted mechanism for taking backups or if there is any other procedure that i can emplpoy to avoid unnecessarily archiving gigabytes of WAL logs which may be growing many times the size of the actual data directory.

Thanks and Regards,
Samba


The problem is that rsync isn't copying all the xlog files created during the time the rsync is taking place, which is why it is complaining that there are files missing.

There may be other logical flaws with your process as well.

Something similar to the steps given in "Starting Replication with only a Quick Master Restart" as laid out in the wiki tutorial on binary replication might give you a way to make this work.  (You probably won't need the restart of the master, since you're not actually setting up replication, so you won't be changing the postgresql.conf file on your master.)

This uses a two-step process.  First you copy all the files EXCEPT the ones on pg_xlog, then you copy those files, so you have a complete set.

See http://wiki.postgresql.org/wiki/Binary_Replication_Tutorial
--
Mike Nolan