8.0beta1 hot spare how to - Mailing list pgsql-hackers

From Gaetano Mendola
Subject 8.0beta1 hot spare how to
Date
Msg-id 411DEF94.6020706@bigfoot.com
Whole thread Raw
List pgsql-hackers
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi all,
I seen that Eric Kerin did the work suggested by Tom about
how to use the PITR in order to have an hot spare postgres,
writing a C program.

I did the same writing 2 shell scripts, one of them perform
the restore the other one deliver the partial filled wal and
check if the postmaster is alive ( check if the pid process
still exist ).

With these two scripts I'm able to have an hot spare installation,
and the spare one go alive when the first postmaster dies.

How test it:

1) Master node:
    modify postgresql.conf using:

~        archive_command = 'cp %p /mnt/server/archivedir/%f'

~        launch postgres and perform a backup as doc

~        http://developer.postgresql.org/docs/postgres/backup-online.html

    suggest to do

    launch the script:

    partial_wal_deliver.sh <PID> /mnt/server/partialdir <pg_xlog path>

~        this script will delivery each 10 seconds the "current" wal file,
~        and touch the "alive" file in order to notify the spare node that
~        the master node is up and running


2) Spare node:
    create a recovery.conf with the line:

~        restore_command = 'restore.sh /mnt/server/archivedir/%f %p /mnt/server/partialdir'

~        replace the content of data directory with the backup performed at point 1,
~        remove any file present in the pg_xlog directory ( leaving there the archive_status
~        directory ) and remove the postmaster.pid file ( this is necessary if you are running
~        the spare postgres on the same hw ).

~        launch the postmaster, the restore will continue till the "alive" file present in the
~        /mnt/server/partialdir directory is not updated for 60 seconds ( you can modify this
~        values inside the restore.sh script ).

Be sure that restore.sh and all directories involved are accessible

Let me know.


This is a first step, of course, as Eric Kerin did, is better port these script
in C and make it more robust.

Postgres can help this process, as suggested by Tom creating a pg_current_wal()
or even better having two new GUC parameters: archive_current_wal_command and
archive_current_wal_delay.

I problem I discover during the tests is that if you shut down the spare node
and the restore_command is still waiting for a file then the postmaster will never
exit :-(





Regards
Gaetano Mendola










-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (MingW32)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFBHe+S7UpzwH2SGd4RAss1AJ0ZfnhSZWWvbRjvbgAlpLu0dtxKIwCg22B+
6bFDyutTVow6+mcTNfD3Cuc=
=Ot+H
-----END PGP SIGNATURE-----
#!/bin/bash

PID=$1
PARTIAL=$2
PGXLOG=$3

function copy_last_wal
{
   FILE=$( ls -t1p $PGXLOG | grep -v / | head -1 )

   echo "Last Wal> " $FILE

   cp ${PGXLOG}/${FILE} ${PARTIAL}/${FILE}.tmp
   mv ${PARTIAL}/${FILE}.tmp ${PARTIAL}/${FILE}.partial
   find ${PARTIAL} -name *.partial | grep -v ${FILE} | xargs -i rm -fr {}
}


while [ 1 ]
do
   ps --pid $PID > /dev/null 2>&1
   ALIVE=$?

   if [ "${ALIVE}" == "1" ]
   then
       #The process is dead
       echo "Process dead"
       copy_last_wal
       exit 1
   fi

   #The process still exist
   touch ${PARTIAL}/alive
   copy_last_wal

   sleep 10
done
#!/bin/bash


SOURCE=$1
TARGET=$2
PARTIAL=$3

SIZE_EXPECTED=16777216        #bytes     16 MB
DIED_TIME=60                    #seconds

function test_existence
{
    if [ -f ${SOURCE}   ]
    then
       COUNTER=0

       #I have to check if the file is begin copied
       #I assume that it will reach the right
       #size in a few seconds

       while [ $(stat -c '%s' ${SOURCE} ) -lt $SIZE_EXPECTED ]
       do
          sleep 1
          let COUNTER+=1
          if [ 20 -lt $COUNTER ]
          then
             exit 1    # BAILING OUT
          fi
       done

       cp $SOURCE $TARGET
       exit 0
    fi
    echo ${SOURCE}"> not found"

    #if is looking for a history file and not exist
    #I have suddenly exit
    echo $SOURCE | grep history > /dev/null 2>&1 && exit 1
}


while [ 1 ]
do

   test_existence

   #CHECK IF THE MASTER IS ALIVE
   DELTA_TIME=$(( $( date +'%s' ) - $( stat -c '%Z' ${PARTIAL}/alive ) ))
   if [ $DIED_TIME -lt $DELTA_TIME ]
   then
       echo "Master is dead..."
       # Master is dead
       CURRENT_WAL=$( basename $SOURCE )
       echo "Partial: " ${PARTIAL}
       echo "Current wal: " ${CURRENT_WAL}
       echo "Target: " ${TARGET}
       cp ${PARTIAL}/${CURRENT_WAL}.partial ${TARGET}  > /dev/null 2>&1 && exit 0
       exit 1
   fi

   sleep 1

done

pgsql-hackers by date:

Previous
From: "Simon@2ndquadrant.com"
Date:
Subject: Re: NOT LOGGED options (was Point in Time Recovery )
Next
From: Peter Eisentraut
Date:
Subject: Re: [PERFORM] Reiser4