Re: will PITR in 8.0 be usable for "hot spare"/"log - Mailing list pgsql-hackers

From Eric Kerin
Subject Re: will PITR in 8.0 be usable for "hot spare"/"log
Msg-id 1092607126.11196.15.camel@auh5-0478
In response to Re: will PITR in 8.0 be usable for "hot spare"/"log  (Gaetano Mendola <mendola@bigfoot.com>)
List pgsql-hackers
On Sun, 2004-08-15 at 16:22, Gaetano Mendola wrote:
> Eric Kerin wrote:
> > On Sat, 2004-08-14 at 01:11, Tom Lane wrote:
> > 
> >>Eric Kerin <eric@bootseg.com> writes:
> >>
> >>>The issues I've seen are:
> >>>1. Knowing when the master has finished the file transfer to
> >>>the backup.
> >>
> >>The "standard" solution to this is you write to a temporary file name
> >>(generated off your process PID, or some other convenient reasonably-
> >>unique random name) and rename() into place only after you've finished
> >>the transfer.  
> > 
> > Yup, much easier this way.  Done.
> > 
> > 
> >>>2. Handling the meta-files, (.history, .backup) (eg: not sleeping if
> >>>they don't exist)
> >>
> >>Yeah, this is an area that needs more thought.  At the moment I believe
> >>both of these will only be asked for during the initial microseconds of
> >>slave-postmaster start.  If they are not there I don't think you need to
> >>wait for them.  It's only plain ol' WAL segments that you want to wait
> >>for.  (Anyone see a hole in that analysis?)
> >>
> > 
> > Seems to be working fine this way, I'm now just returning ENOENT if they
> > don't exist.  
> > 
> > 
> >>>3. Keeping the backup from coming online before the replay has fully
> >>>finished in the event of a failure to copy a file, or other strange
> >>>errors (out of memory, etc).
> >>
> >>Right, also an area that needs thought.  Some other people opined that
> >>they want the switchover to occur only on manual command.  I'd go with
> >>that too if you have anything close to 24x7 availability of admins.
> >>If you *must* have automatic switchover, what's the safest criterion?
> >>Dunno, but let's think ...
> > 
> > 
> > I'm not even really talking about automatic startup on fail over.  Right
> > now, if the recovery_command returns anything but 0, the database will
> > finish recovery and come online.  This would force you to rebuild your
> > backup system from a copy of the master unnecessarily.  Sounds kinda
> > messy to me, especially if it's a false trigger (temporary I/O error,
> > out of memory).
> 
> Well, this is the way most HA cluster solutions work.  In my experience,
> the RH cluster solution relies on a common partition between the two nodes
> and on a serial connection between them.
> For a 24x7 service, it is certainly a compulsory requirement to have an
> automatic procedure that handles failures without human intervention.
> 
> 
> Regards
> Gaetano Mendola
> 

Already sent this to Gaetano, didn't realize the mail was on list too:

Red Hat's HA offering is a failover cluster, not a log shipping cluster.

For a failover cluster, log shipping isn't involved: it's just the normal
WAL replay, the same as if the database came back up on the same node.
It also has multiple methods of checking whether the master is
online (serial, network, hard disk quorum device).  Once the backup
detects a failure of the master, it powers the master off and takes
over all devices and network names/IP addresses.

In log shipping, you can't even be sure that the two nodes will be close
enough together to have multiple communication methods.  At work, we
have an Oracle log shipping setup where the backup cluster is a
thousand or so miles away from the master cluster, connected by a T3
link.

For a 24x7 zero-downtime type of system, you would have two failover
clusters, separated by a few miles (or a few thousand), and then set up
log shipping from the master to the backup.  That keeps the system online
in case of a single-node hardware failure, without having to switch to
the backup log shipping system.  The backup is there in case the master
is completely destroyed (by fire, hardware corruption, etc.), hence the
reason for the remote location.

Thanks, 
Eric
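
[Editorial addendum: the scheme hashed out up-thread can be sketched as a
pair of shell wrappers.  This is a minimal, illustrative sketch only: the
function names (ship_segment, fetch_segment) and the archive path are
invented for this example and are not from the thread or from PostgreSQL
itself; error handling, locking, and remote copying are omitted.]

```shell
#!/bin/sh
# Sketch of the archive/recovery wrappers discussed above (illustrative
# names and paths; not an actual PostgreSQL interface).
ARCHIVE_DIR=/var/lib/pgsql/wal_archive   # assumed shipping directory

# Archive side: copy under a temporary name, then rename into place, so the
# slave never sees a half-written segment (the "standard solution" above).
ship_segment() {
    src=$1; name=$2
    tmp="$ARCHIVE_DIR/.tmp.$$"           # PID gives a reasonably unique name
    cp "$src" "$tmp" || exit 1
    mv "$tmp" "$ARCHIVE_DIR/$name"       # rename(2): atomic within one filesystem
}

# Recovery side: fail immediately for the metadata files (.history, .backup),
# which are only asked for at slave-postmaster start, but poll for plain WAL
# segments, which may simply not have been shipped yet.
fetch_segment() {
    name=$1; dest=$2
    case $name in
        *.history|*.backup)
            [ -f "$ARCHIVE_DIR/$name" ] || exit 1   # don't wait for these
            ;;
        *)
            while [ ! -f "$ARCHIVE_DIR/$name" ]; do # wait for the master
                sleep 5
            done
            ;;
    esac
    cp "$ARCHIVE_DIR/$name" "$dest"
}
```

Because the mv happens only after the cp has completed, the recovery-side
wrapper can never pick up a partially transferred segment.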




