Eric Kerin wrote:
> On Sat, 2004-08-14 at 01:11, Tom Lane wrote:
>
>>Eric Kerin <eric@bootseg.com> writes:
>>
>>>The issues I've seen are:
>>>1. Knowing when the master has finished the file transfer to
>>>the backup.
>>
>>The "standard" solution to this is you write to a temporary file name
>>(generated off your process PID, or some other convenient reasonably-
>>unique random name) and rename() into place only after you've finished
>>the transfer.
>
> Yup, much easier this way. Done.
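
For anyone following along, the write-then-rename approach Tom describes
could look roughly like the sketch below. It is only an illustration, not
Eric's actual script: the function name copy_then_rename is made up, and
it uses mkstemp() for the unique temporary name instead of the PID.

```python
import os
import shutil
import tempfile


def copy_then_rename(src, dest_dir):
    """Copy src into dest_dir; the final name appears only once the copy is complete."""
    final_path = os.path.join(dest_dir, os.path.basename(src))
    # Create the temporary file in the destination directory so the
    # rename never crosses a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(prefix=".tmp-", dir=dest_dir)
    try:
        with os.fdopen(fd, "wb") as out, open(src, "rb") as inp:
            shutil.copyfileobj(inp, out)
            out.flush()
            os.fsync(out.fileno())  # push the data to disk before renaming
        # rename() into place only after the transfer has finished.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Do not leave a half-written temporary file behind.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return final_path
```

A reader on the backup side that only looks for the final segment name
should then never see a partially copied file.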
>
>
>>>2. Handling the meta-files, (.history, .backup) (eg: not sleeping if
>>>they don't exist)
>>
>>Yeah, this is an area that needs more thought. At the moment I believe
>>both of these will only be asked for during the initial microseconds of
>>slave-postmaster start. If they are not there I don't think you need to
>>wait for them. It's only plain ol' WAL segments that you want to wait
>>for. (Anyone see a hole in that analysis?)
>>
>
> Seems to be working fine this way, I'm now just returning ENOENT if they
> don't exist.
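
A rough sketch of that rule (fail at once for missing .history/.backup
files, wait only for plain WAL segments) is below. The names
fetch_from_archive, poll_seconds and max_polls are hypothetical, and the
directory layout is an assumption, not what Eric's script really does.

```python
import errno
import os
import shutil
import time

# Meta-files are only requested at startup, so a missing one is not waited for.
META_SUFFIXES = (".history", ".backup")


def fetch_from_archive(archive_dir, name, dest, poll_seconds=1.0, max_polls=None):
    """Copy name from archive_dir to dest; return 0 on success, ENOENT if absent."""
    src = os.path.join(archive_dir, name)
    polls = 0
    while not os.path.exists(src):
        if name.endswith(META_SUFFIXES):
            return errno.ENOENT  # meta-file: report "not there" immediately
        # max_polls is a hypothetical knob (None means wait forever).
        # Per this thread, a non-zero result for a WAL segment ends
        # recovery, so giving up here brings the backup online.
        if max_polls is not None and polls >= max_polls:
            return errno.ENOENT
        polls += 1
        time.sleep(poll_seconds)
    shutil.copyfile(src, dest)
    return 0
```

A wrapper script would pass the return value to sys.exit() so the server
sees it as the command's exit status.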
>
>
>>>3. Keeping the backup from coming online before the replay has fully
>>>finished in the event of a failure to copy a file, or other strange
>>>errors (out of memory, etc).
>>
>>Right, also an area that needs thought. Some other people opined that
>>they want the switchover to occur only on manual command. I'd go with
>>that too if you have anything close to 24x7 availability of admins.
>>If you *must* have automatic switchover, what's the safest criterion?
>>Dunno, but let's think ...
>
>
> I'm not even really talking about automatic startup on fail over. Right
> now, if the recovery_command returns anything but 0, the database will
> finish recovery, and come online. This would cause you to have to
> re-build your backup system from a copy of master unnecessarily. Sounds
> kinda messy to me, especially if it's a false trigger (temporary io
> error, out of memory)

Well, this is how most HA cluster solutions work. In my experience, the
RH cluster solution relies on a partition shared between the two nodes
and on a serial connection between them.
For a 24x7 service, an automatic procedure that handles failures without
human intervention is certainly a compulsory requirement.

Regards
Gaetano Mendola