Thread: will PITR in 8.0 be usable for "hot spare"/"log shipping" type of replication

will PITR in 8.0 be usable for "hot spare"/"log shipping" type of replication

From
Hannu Krosing
Date:
will PITR in 8.0 be usable for "hot spare"/"log shipping" type of
replication or is it just for Point In Time RECOVERY ?

-----------
Hannu



Hannu Krosing <hannu@tm.ee> writes:
> will PITR in 8.0 be usable for "hot spare"/"log shipping" type of
> replication or is it just for Point In Time RECOVERY ?

It should work; dunno if anyone has tried it yet.
        regards, tom lane


Re: will PITR in 8.0 be usable for "hot spare"/"log shipping" type

From
Gaetano Mendola
Date:
Tom Lane wrote:

> Hannu Krosing <hannu@tm.ee> writes:
> 
>>will PITR in 8.0 be usable for "hot spare"/"log shipping" type of
>>replication or is it just for Point In Time RECOVERY ?
> 
> 
> It should work; dunno if anyone has tried it yet.

I was thinking about it, but I soon realized that it's actually
impossible to do: postgres replays the log only if the file
recovery.conf is present in the $PGDATA directory at startup :-(

Am I missing the point ?


Regards
Gaetano Mendola






Gaetano Mendola <mendola@bigfoot.com> writes:
> Tom Lane wrote:
>> It should work; dunno if anyone has tried it yet.

> I was thinking about it, but I soon realized that it's actually
> impossible to do: postgres replays the log only if the file
> recovery.conf is present in the $PGDATA directory at startup :-(

So you put one in ... what's the problem?

The way I'd envision this working is that

1. You set up WAL archiving on the master, and arrange to ship copies of
completed segment files to the slave.

2. You take an on-line backup (ie, tar dump) on the master, and restore
it on the slave.

3. You set up a recovery.conf file with the restore_command being some
kind of shell script that knows where to look for the shipped-over
segment files, and also has a provision for being signaled to stop
tracking the shipped-over segments and come alive.

4. You start the postmaster on the slave.  It will try to recover.  Each
time it asks the restore_command script for another segment file, the
script will sleep until that segment file is available, then return it.

5. When the master dies, you signal the restore_command script that it's
time to come alive.  It now returns "no such file" to the patiently
waiting postmaster, and within seconds you have a live database on the
slave.

Now, as sketched you only get recovery up to the last WAL segment
boundary, which might not be good enough on a low-traffic system.
But you could combine this with some hack to ship over the latest
partial WAL segment periodically (once a minute maybe).  The
restore_command script shouldn't use a partial segment --- until
the alarm comes, and then it should.
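
For illustration only, a minimal restore_command script along those lines
might look something like this (the archive directory, script name, and
trigger-file convention are all made up for the example):

#!/bin/sh
# restore.sh <archive_dir> <wanted_segment> <target_path>
# hypothetical sketch of the waiting restore_command described above
ARCHIVE=$1      # where the master ships completed WAL segments
FILE=$2         # segment file the recovering postmaster asked for (%f)
TARGET=$3       # where postgres wants it copied to (%p)
TRIGGER=$ARCHIVE/failover.trigger   # touched by the admin when the master dies

while [ ! -f "$ARCHIVE/$FILE" ]; do
    if [ -f "$TRIGGER" ]; then
        exit 1              # "no such file": recovery ends, slave comes alive
    fi
    sleep 5                 # wait for the next shipped segment
done
cp "$ARCHIVE/$FILE" "$TARGET"

In recovery.conf that would be wired up as something like
restore_command = '/somewhere/restore.sh /archive %f %p'.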

Somebody should hack this together and try it during beta.  I don't
have time myself.
        regards, tom lane


Re: will PITR in 8.0 be usable for "hot spare"/"log shipping" type

From
Gaetano Mendola
Date:
Tom Lane wrote:


> Somebody should hack this together and try it during beta.  I don't
> have time myself.

We'll see; if I have spare time I will try.


Regards
Gaetano Mendola




Re: will PITR in 8.0 be usable for "hot spare"/"log shipping" type

From
Brian Hirt
Date:
I wonder if there will be assumptions in the startup code concerning
time.  If the startup takes 18 months, do you think there would be some
sort of problem with this approach?

On Aug 11, 2004, at 6:14 PM, Gaetano Mendola wrote:

> Tom Lane wrote:
>
>
>> Somebody should hack this together and try it during beta.  I don't
>> have time myself.
>
> We'll see; if I have spare time I will try.
>
>
> Regards
> Gaetano Mendola
>
>



Re: will PITR in 8.0 be usable for "hot spare"/"log shipping" type

From
Tom Lane
Date:
Brian Hirt <bhirt@mobygames.com> writes:
> I wonder if there will be assumptions in the startup code concerning
> time.  If the startup takes 18 months, do you think there would be some
> sort of problem with this approach?

I don't know of any such assumptions, but this sort of question is why
someone should prototype it while we're still in beta ...

One point that occurred to me is that you aren't really going to want to
just leave the slave sitting there updating an original backup forever.
Anytime the slave machine itself crashes (power outage, say) it will
have to replay the log again from the time of the original backup ---
so you have to keep all the copied log segments in place.  I would guess
that you'll want to refresh the slave's starting backup about as often
as you take a new base backup for the master, and for about the same
reason: you want to limit how many archived WAL segments you have to
keep.
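
For reference, refreshing the slave's starting point is just the usual
on-line base backup procedure done again; roughly (paths are placeholders):

# on the master
psql -c "SELECT pg_start_backup('refresh slave')"
tar czf /tmp/base.tar.gz -C $PGDATA .     # copy the data directory
psql -c "SELECT pg_stop_backup()"
# ship base.tar.gz to the slave, unpack it over the slave's data
# directory, put recovery.conf back in place, and restart the slave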

So in theory it should work, but there are a lot of procedural details
to be resolved to translate that handwavy sketch into a reliable
process --- and maybe we'll find that we need to adjust some details
of how the basic recovery process works in order to make the idea
really practical.  So like I said, I'd love for somebody to prototype
it while we can still rejigger details.
        regards, tom lane


Re: will PITR in 8.0 be usable for "hot spare"/"log

From
Eric Kerin
Date:
On Wed, 2004-08-11 at 16:43, Tom Lane wrote:
> Gaetano Mendola <mendola@bigfoot.com> writes:
> > Tom Lane wrote:
> >> It should work; dunno if anyone has tried it yet.
> 
> > I was thinking about it, but I soon realized that it's actually
> > impossible to do: postgres replays the log only if the file
> > recovery.conf is present in the $PGDATA directory at startup :-(
>
> <SNIP>
>
> Somebody should hack this together and try it during beta.  I don't
> have time myself.
> 
>             regards, tom lane


I've written up a very quick and insanely dirty hack to do log shipping.
Actually, it's so poorly written I kinda feel ashamed to post the code.

But so far the process looks very promising, with a few caveats. 

The issues I've seen are:
1. Knowing when the master has finished the file transfer to
the backup.
2. Handling the meta-files, (.history, .backup) (eg: not sleeping if
they don't exist)
3. Keeping the backup from coming online before the replay has fully
finished in the event of a failure to copy a file, or other strange
errors (out of memory, etc).

I've got a solution for 1.  I use a control file that contains the name
of the last file that was successfully copied over.  After the program
copies the file, it updates the control file with the new file's name.
The restore program checks that file to find the last safe file to
replay, and sleeps if the one it's been asked for isn't safe yet.

Two is pretty easy: just special-case files ending in .history or
.backup.

Three is a problem I see happening once in a while; it will force you
to recreate the backup database from a backup of the master, which
could spell trouble, or at the very least a mad DBA.  A possible fix is
to check the error code returned from restore_command to see whether
it's ENOENT before bringing the db online, instead of bringing the
database online on any error.  This might be better as an option though.
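
Until the server can make that distinction itself, the script can at
least paper over transient failures by retrying the copy, so that a
non-zero exit really does mean the file isn't there (just a sketch of
the idea, with placeholder variable names; not something log_ship does
today):

# retry a possibly-transient copy failure before letting the error
# escape to the postmaster and end recovery
tries=0
until cp "$ARCHIVE/$FILE" "$TARGET"; do
    tries=`expr $tries + 1`
    if [ "$tries" -ge 5 ]; then
        exit 1          # persistent failure: give up
    fi
    sleep 5             # e.g. temporary I/O or memory problem
done
exit 0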



Still lots of bugs in my implementation, and my next step is to re-write
it from scratch.  I'm going to keep playing with this and see if I can
get something a little more solid working.

Here's a URL to the code as it is right now; it works on Linux, no
promises about anything else.  http://www.bootseg.com/log_ship.c

For the archive command use:
/path_to_binary/log_ship -a /archive_directory/ %p %f

For the restore_command use:
/path_to_binary/log_ship -r /archive_directory/ %f %p
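
Wired into the server configuration that comes out roughly as follows
(8.0-style settings; the binary and archive paths above are of course
placeholders):

# postgresql.conf on the master
archive_command = '/path_to_binary/log_ship -a /archive_directory/ %p %f'

# recovery.conf on the slave
restore_command = '/path_to_binary/log_ship -r /archive_directory/ %f %p'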

Any comments are very much appreciated.

Thanks, 
Eric Kerin



Eric Kerin <eric@bootseg.com> writes:
> The issues I've seen are:
> 1. Knowing when the master has finished the file transfer to
> the backup.

The "standard" solution to this is you write to a temporary file name
(generated off your process PID, or some other convenient reasonably-
unique random name) and rename() into place only after you've finished
the transfer.  If you are paranoid you can try to fsync the file before
renaming, too.  File rename is a reasonably atomic process on all modern
OSes.
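
In script form that trick is just a copy to a unique temporary name
followed by a rename, e.g. (a sketch; script and directory names are
placeholders, and plain "sync" stands in for a real per-file fsync):

#!/bin/sh
# archive.sh <archive_dir> <segment_path> <segment_name>
ARCHIVE=$1
SRC=$2                  # %p: the just-completed WAL segment
NAME=$3                 # %f: its file name
TMP=$ARCHIVE/$NAME.$$   # temporary name based on our PID

cp "$SRC" "$TMP" || exit 1
sync                        # paranoia: push the copy to disk first
mv "$TMP" "$ARCHIVE/$NAME"  # atomic rename into the final name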

> 2. Handling the meta-files, (.history, .backup) (eg: not sleeping if
> they don't exist)

Yeah, this is an area that needs more thought.  At the moment I believe
both of these will only be asked for during the initial microseconds of
slave-postmaster start.  If they are not there I don't think you need to
wait for them.  It's only plain ol' WAL segments that you want to wait
for.  (Anyone see a hole in that analysis?)
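
So the restore script can simply special-case those names before it
enters its wait loop, along these lines (a sketch, reusing the made-up
variable names from the earlier example):

case "$FILE" in
    *.history|*.backup)
        # metadata files: hand them over if present, otherwise fail
        # right away instead of sleeping
        if [ -f "$ARCHIVE/$FILE" ]; then
            cp "$ARCHIVE/$FILE" "$TARGET"
            exit $?
        fi
        exit 1
        ;;
esac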

> 3. Keeping the backup from coming online before the replay has fully
> finished in the event of a failure to copy a file, or other strange
> errors (out of memory, etc).

Right, also an area that needs thought.  Some other people opined that
they want the switchover to occur only on manual command.  I'd go with
that too if you have anything close to 24x7 availability of admins.
If you *must* have automatic switchover, what's the safest criterion?
Dunno, but let's think ...
        regards, tom lane


Re: will PITR in 8.0 be usable for "hot spare"/"log

From
Gaetano Mendola
Date:
Eric Kerin wrote:

> On Wed, 2004-08-11 at 16:43, Tom Lane wrote:
> 
>>Gaetano Mendola <mendola@bigfoot.com> writes:
>>
>>>Tom Lane wrote:
>>>
>>>>It should work; dunno if anyone has tried it yet.
>>
>>>I was thinking about it, but I soon realized that it's actually
>>>impossible to do: postgres replays the log only if the file
>>>recovery.conf is present in the $PGDATA directory at startup :-(
>>
>><SNIP>
>>
>>Somebody should hack this together and try it during beta.  I don't
>>have time myself.
>>
>>            regards, tom lane
> 
> 
> 
> I've written up a very quick and insanely dirty hack to do log shipping.
> Actually, it's so poorly written I kinda feel ashamed to post the code.
> 
> But so far the process looks very promising, with a few caveats. 
> 
> The issues I've seen are:
> 1. Knowing when the master has finished the file transfer to
> the backup.
> 2. Handling the meta-files, (.history, .backup) (eg: not sleeping if
> they don't exist)
> 3. Keeping the backup from coming online before the replay has fully
> finished in the event of a failure to copy a file, or other strange
> errors (out of memory, etc).

I did the same work and solved the same problems in exactly the way you
did; however, my version was shell based ( wasted time )   :-(

I guess it's better to maintain the C version; I will take a look at it
and modify it if something doesn't work.

Good work.


Regards
Gaetano Mendola






Re: will PITR in 8.0 be usable for "hot spare"/"log shipping" type of replication

From
"Simon@2ndquadrant.com"
Date:
> Tom Lane
> Eric Kerin <eric@bootseg.com> writes:
> > The issues I've seen are:
> > 1. Knowing when the master has finished the file transfer to
> > the backup.
>
> The "standard" solution to this is you write to a temporary file name
> (generated off your process PID, or some other convenient reasonably-
> unique random name) and rename() into place only after you've finished
> the transfer.  If you are paranoid you can try to fsync the file before
> renaming, too.  File rename is a reasonably atomic process on all modern
> OSes.
>
> > 2. Handling the meta-files, (.history, .backup) (eg: not sleeping if
> > they don't exist)
>
> Yeah, this is an area that needs more thought.  At the moment I believe
> both of these will only be asked for during the initial microseconds of
> slave-postmaster start.  If they are not there I don't think you need to
> wait for them.  It's only plain ol' WAL segments that you want to wait
> for.  (Anyone see a hole in that analysis?)

Agreed.

> > 3. Keeping the backup from coming online before the replay has fully
> > finished in the event of a failure to copy a file, or other strange
> > errors (out of memory, etc).
>
> Right, also an area that needs thought.  Some other people opined that
> they want the switchover to occur only on manual command.  I'd go with
> that too if you have anything close to 24x7 availability of admins.
> If you *must* have automatic switchover, what's the safest criterion?
> Dunno, but let's think ...
>

That's fairly straightforward.

You use a restore_command that sleeps when it discovers a full log file
isn't available - i.e. it has requested the "last" or master-current WAL
file. The program wakes when the decision/operator command to switch
over is taken.

That way, when switchover occurs, you're straight up. No code changes...

This is important because it will allow us to test recovery for many systems
by creating a continuously rolling copy. Implementing this will be the best
way to stress-test the recovery code.

I'm not hugely in favour of copying partially filled log files, but if
that's what people want...as long as we don't change the basic code to
implement it, because then we'll have just created another code path that
will leave PITR untested for most people.

[I discussed all of this before as Automatic Standby Database functionality]

Best Regards, Simon Riggs



"Simon@2ndquadrant.com" <simon@2ndquadrant.com> writes:
>> Tom Lane wrote:
>> Right, also an area that needs thought.  Some other people opined that
>> they want the switchover to occur only on manual command.  I'd go with
>> that too if you have anything close to 24x7 availability of admins.
>> If you *must* have automatic switchover, what's the safest criterion?
>> Dunno, but let's think ...

> That's fairly straightforward.

> You use a restore_command that sleeps when it discovers a full log file
> isn't available - i.e. it has requested the "last" or master-current WAL
> file. The program wakes when the decision/operator command to switch
> over is taken.

But you're glossing over the hard part, which is how to take that
decision (assuming that for some reason you can't afford to wait for
a human to make it).
        regards, tom lane


Re: will PITR in 8.0 be usable for "hot spare"/"log

From
Eric Kerin
Date:
On Sat, 2004-08-14 at 01:11, Tom Lane wrote:
> Eric Kerin <eric@bootseg.com> writes:
> > The issues I've seen are:
> > 1. Knowing when the master has finished the file transfer to
> > the backup.
> 
> The "standard" solution to this is you write to a temporary file name
> (generated off your process PID, or some other convenient reasonably-
> unique random name) and rename() into place only after you've finished
> the transfer.  
Yup, much easier this way.  Done.

> > 2. Handling the meta-files, (.history, .backup) (eg: not sleeping if
> > they don't exist)
> 
> Yeah, this is an area that needs more thought.  At the moment I believe
> both of these will only be asked for during the initial microseconds of
> slave-postmaster start.  If they are not there I don't think you need to
> wait for them.  It's only plain ol' WAL segments that you want to wait
> for.  (Anyone see a hole in that analysis?)
> 
Seems to be working fine this way, I'm now just returning ENOENT if they
don't exist.  

> > 3. Keeping the backup from coming online before the replay has fully
> > finished in the event of a failure to copy a file, or other strange
> > errors (out of memory, etc).
> 
> Right, also an area that needs thought.  Some other people opined that
> they want the switchover to occur only on manual command.  I'd go with
> that too if you have anything close to 24x7 availability of admins.
> If you *must* have automatic switchover, what's the safest criterion?
> Dunno, but let's think ...

I'm not even really talking about automatic startup on fail over.  Right
now, if the restore_command returns anything but 0, the database will
finish recovery, and come online.  This would cause you to have to
re-build your backup system from a copy of the master unnecessarily.
Sounds kinda messy to me, especially if it's a false trigger (temporary
I/O error, out of memory).


What I think might be a better long-term approach (but probably more of
an 8.1 thing): have the database go into a read-only/replay mode,
accepting only read-only commands from users.  A replay program opens a
connection to the backup system's postmaster and tells it to replay a
given file when it becomes available.  Once you want the system to come
online, the DBA calls a different function that instructs the system to
come fully online and start accepting updates from users.

This could be quite complex, but it provides two things: proper log
shipping with status (without the false fail->db online possibility),
and a read-only replicated backup system(s), which would also be good
for a reporting database.

Thoughts?


Anyway, here's a rewritten version of my log shipping program:
http://www.bootseg.com/log_ship.c  It operates mostly the same, but most
of the stupid bugs are fixed.  The old one was renamed to
http://www.bootseg.com/log_ship.c.ver1 if you really want it.

Thanks, 
Eric




Re: will PITR in 8.0 be usable for "hot spare"/"log

From
Gaetano Mendola
Date:
Eric Kerin wrote:
> On Sat, 2004-08-14 at 01:11, Tom Lane wrote:
> 
>>Eric Kerin <eric@bootseg.com> writes:
>>
>>>The issues I've seen are:
>>>1. Knowing when the master has finished the file transfer to
>>>the backup.
>>
>>The "standard" solution to this is you write to a temporary file name
>>(generated off your process PID, or some other convenient reasonably-
>>unique random name) and rename() into place only after you've finished
>>the transfer.  
> 
> Yup, much easier this way.  Done.
> 
> 
>>>2. Handling the meta-files, (.history, .backup) (eg: not sleeping if
>>>they don't exist)
>>
>>Yeah, this is an area that needs more thought.  At the moment I believe
>>both of these will only be asked for during the initial microseconds of
>>slave-postmaster start.  If they are not there I don't think you need to
>>wait for them.  It's only plain ol' WAL segments that you want to wait
>>for.  (Anyone see a hole in that analysis?)
>>
> 
> Seems to be working fine this way, I'm now just returning ENOENT if they
> don't exist.  
> 
> 
>>>3. Keeping the backup from coming online before the replay has fully
>>>finished in the event of a failure to copy a file, or other strange
>>>errors (out of memory, etc).
>>
>>Right, also an area that needs thought.  Some other people opined that
>>they want the switchover to occur only on manual command.  I'd go with
>>that too if you have anything close to 24x7 availability of admins.
>>If you *must* have automatic switchover, what's the safest criterion?
>>Dunno, but let's think ...
> 
> 
> I'm not even really talking about automatic startup on fail over.  Right
> now, if the restore_command returns anything but 0, the database will
> finish recovery, and come online.  This would cause you to have to
> re-build your backup system from a copy of the master unnecessarily.
> Sounds kinda messy to me, especially if it's a false trigger (temporary
> I/O error, out of memory).

Well, this is the way most HA cluster solutions work; in my experience
the RH cluster solution relies on a common partition between the two
nodes and on a serial connection between them.
For sure, for a 24x7 service an automatic procedure that handles
failures without human intervention is a compulsory requirement.


Regards
Gaetano Mendola



Re: will PITR in 8.0 be usable for "hot spare"/"log

From
Eric Kerin
Date:
On Sun, 2004-08-15 at 16:22, Gaetano Mendola wrote:
> Eric Kerin wrote:
> > On Sat, 2004-08-14 at 01:11, Tom Lane wrote:
> > 
> >>Eric Kerin <eric@bootseg.com> writes:
> >>
> >>>The issues I've seen are:
> >>>1. Knowing when the master has finished the file transfer to
> >>>the backup.
> >>
> >>The "standard" solution to this is you write to a temporary file name
> >>(generated off your process PID, or some other convenient reasonably-
> >>unique random name) and rename() into place only after you've finished
> >>the transfer.  
> > 
> > Yup, much easier this way.  Done.
> > 
> > 
> >>>2. Handling the meta-files, (.history, .backup) (eg: not sleeping if
> >>>they don't exist)
> >>
> >>Yeah, this is an area that needs more thought.  At the moment I believe
> >>both of these will only be asked for during the initial microseconds of
> >>slave-postmaster start.  If they are not there I don't think you need to
> >>wait for them.  It's only plain ol' WAL segments that you want to wait
> >>for.  (Anyone see a hole in that analysis?)
> >>
> > 
> > Seems to be working fine this way, I'm now just returning ENOENT if they
> > don't exist.  
> > 
> > 
> >>>3. Keeping the backup from coming online before the replay has fully
> >>>finished in the event of a failure to copy a file, or other strange
> >>>errors (out of memory, etc).
> >>
> >>Right, also an area that needs thought.  Some other people opined that
> >>they want the switchover to occur only on manual command.  I'd go with
> >>that too if you have anything close to 24x7 availability of admins.
> >>If you *must* have automatic switchover, what's the safest criterion?
> >>Dunno, but let's think ...
> > 
> > 
> > I'm not even really talking about automatic startup on fail over.  Right
> > now, if the restore_command returns anything but 0, the database will
> > finish recovery, and come online.  This would cause you to have to
> > re-build your backup system from a copy of the master unnecessarily.
> > Sounds kinda messy to me, especially if it's a false trigger (temporary
> > I/O error, out of memory).
> 
> Well, this is the way most HA cluster solutions work; in my experience
> the RH cluster solution relies on a common partition between the two
> nodes and on a serial connection between them.
> For sure, for a 24x7 service an automatic procedure that handles
> failures without human intervention is a compulsory requirement.
> 
> 
> Regards
> Gaetano Mendola
> 

Already sent this to Gaetano, didn't realize the mail was on list too:

Red Hat's HA stuff is a failover cluster, not a log shipping cluster.

For a failover cluster, log shipping isn't involved, just the normal
WAL replay, same as if the database came back online on the same node.
It also has several methods of communication to check whether the
master is online (serial, network, hard-disk quorum device).  Once the
backup detects a failure of the master, it powers the master off and
takes over all devices and network names/IP addresses.

In log shipping, you can't even be sure that both nodes will be close
enough together to have multiple communication methods.  At work, we
have an Oracle log shipping setup where the backup cluster is a
thousand or so miles away from the master cluster, separated by a T3
link.

For a 24x7 zero-downtime type of system, you would have two failover
clusters, separated by a few miles (or a few thousand), and then set up
log shipping from the master to the backup.  That keeps the system
online in case of a single-node hardware failure, without having to
transfer to the backup log shipping system.  The backup is there in
case the master is completely destroyed (by fire, hardware corruption,
etc.), hence the remote location.

Thanks, 
Eric





Re: will PITR in 8.0 be usable for "hot spare"/"log

From
Gaetano Mendola
Date:
Eric Kerin wrote:

> On Sun, 2004-08-15 at 16:22, Gaetano Mendola wrote:
> 
>>Eric Kerin wrote:
>>
>>>On Sat, 2004-08-14 at 01:11, Tom Lane wrote:
>>>
>>>
>>>>Eric Kerin <eric@bootseg.com> writes:
>>>>
>>>>
>>>>>The issues I've seen are:
>>>>>1. Knowing when the master has finished the file transfer to
>>>>>the backup.
>>>>
>>>>The "standard" solution to this is you write to a temporary file name
>>>>(generated off your process PID, or some other convenient reasonably-
>>>>unique random name) and rename() into place only after you've finished
>>>>the transfer.  
>>>
>>>Yup, much easier this way.  Done.
>>>
>>>
>>>
>>>>>2. Handling the meta-files, (.history, .backup) (eg: not sleeping if
>>>>>they don't exist)
>>>>
>>>>Yeah, this is an area that needs more thought.  At the moment I believe
>>>>both of these will only be asked for during the initial microseconds of
>>>>slave-postmaster start.  If they are not there I don't think you need to
>>>>wait for them.  It's only plain ol' WAL segments that you want to wait
>>>>for.  (Anyone see a hole in that analysis?)
>>>>
>>>
>>>Seems to be working fine this way, I'm now just returning ENOENT if they
>>>don't exist.  
>>>
>>>
>>>
>>>>>3. Keeping the backup from coming online before the replay has fully
>>>>>finished in the event of a failure to copy a file, or other strange
>>>>>errors (out of memory, etc).
>>>>
>>>>Right, also an area that needs thought.  Some other people opined that
>>>>they want the switchover to occur only on manual command.  I'd go with
>>>>that too if you have anything close to 24x7 availability of admins.
>>>>If you *must* have automatic switchover, what's the safest criterion?
>>>>Dunno, but let's think ...
>>>
>>>
>>>I'm not even really talking about automatic startup on fail over.  Right
>>>now, if the restore_command returns anything but 0, the database will
>>>finish recovery, and come online.  This would cause you to have to
>>>re-build your backup system from a copy of the master unnecessarily.
>>>Sounds kinda messy to me, especially if it's a false trigger (temporary
>>>I/O error, out of memory).
>>
>>Well, this is the way most HA cluster solutions work; in my experience
>>the RH cluster solution relies on a common partition between the two
>>nodes and on a serial connection between them.
>>For sure, for a 24x7 service an automatic procedure that handles
>>failures without human intervention is a compulsory requirement.
>>
>>
>>Regards
>>Gaetano Mendola
>>
> 
> 
> Already sent this to Gaetano, didn't realize the mail was on list too:
> 
> Red Hat's HA stuff is a failover cluster, not a log shipping cluster.
> Once the backup detects a failure of the master, it powers the master
> off and takes over all devices and network names/IP addresses.

We have been using the RH HA stuff for a long time, and it is not
necessary to have the master powered off (our setup doesn't).


> In log shipping, you can't even be sure that both nodes will be close
> enough together to have multiple communication methods.  At work, we
> have an Oracle log shipping setup where the backup cluster is a
> thousand or so miles away from the master cluster, separated by a T3
> link.
>
> For a 24x7 zero-downtime type of system, you would have two failover
> clusters, separated by a few miles (or a few thousand), and then set up
> log shipping from the master to the backup.  That keeps the system
> online in case of a single-node hardware failure, without having to
> transfer to the backup log shipping system.  The backup is there in
> case the master is completely destroyed (by fire, hardware corruption,
> etc.), hence the remote location.

I totally agree with you, but not everyone can set up an RH HA cluster
or an equivalent solution (a very expensive dual-ported SAN is needed),
and this software version could help in a low-cost setup.  The scripts
that I posted do the failover between master and slave automatically,
delivering also the partial WAL (I could increase the robustness by
also checking a serial connection), without needing expensive HW.

For sure this way of proceeding (the log shipping activity) will
increase availability in case of total disaster (at the moment I
transfer a plain dump to another location every 3 hours :-( ).


Regards
Gaetano Mendola