Thread: Re: [HACKERS] PITR Dead horse?

Re: [HACKERS] PITR Dead horse?

From: "Simon Riggs"
>Bruce Momjian
> Marc G. Fournier wrote:
> >
> > pgsql-hackers-pitr@postgresql.org
> >
> > I set myself as owner, since I didn't figure it was something you really
> > needed added to your plate? :)  Just means you don't have to go through
> > and do the Approvals for postings when they need it, I'll just do it as my
> > normal stuff ...
>
> OK, I have added the mailing list to the web page:
>
>     http://momjian.postgresql.org/main/writings/pgsql/project
>
> and have subscribed myself.

Sorry, I've joined also - I thought Marc's post was a suggestion rather
than a reality. I checked http://www.postgresql.org/lists.html but it
wasn't listed, so I thought it didn't exist yet.

Is there a digest of all previous postings? Or another way to access
them?

Regards, Simon


Re: [HACKERS] PITR Dead horse?

From: Bruce Momjian
Simon Riggs wrote:
> >Bruce Momjian
> > Marc G. Fournier wrote:
> > >
> > > pgsql-hackers-pitr@postgresql.org
> > >
> > > I set myself as owner, since I didn't figure it was something you really
> > > needed added to your plate? :)  Just means you don't have to go through
> > > and do the Approvals for postings when they need it, I'll just do it as my
> > > normal stuff ...
> >
> > OK, I have added the mailing list to the web page:
> >
> >     http://momjian.postgresql.org/main/writings/pgsql/project
> >
> > and have subscribed myself.
>
> Sorry, I've joined also - I thought Marc's post was a suggestion rather
> than a reality. I checked http://www.postgresql.org/lists.html but it
> wasn't listed, so I thought it didn't exist yet.
>
> Is there a digest of all previous postings? Or another way to access
> them?

No one has said anything on the list yet.  :-)

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: [HACKERS] PITR Dead right

From: "Simon Riggs"
>Bruce Momjian
> No one has said anything on the list yet.  :-)

Great. Well, I'm planning to be on-line tomorrow night to begin thinking
this through and making some proposals.

If anybody's out there and intending to help, please shout about it on
this list now - Marc's set it up for us, so we should use it, or shut it
down!

Looking forward to working with y'all.

Very best regards, Simon Riggs


Re: [HACKERS] PITR Dead right

From: Fred Moyer
On Wed, 2004-02-11 at 21:10, Simon Riggs wrote:
> >Bruce Momjian
> > No one has said anything on the list yet.  :-)
>
> Great. Well, I'm planning to be on-line tomorrow night to begin thinking
> this through and making some proposals.
>
> If anybody's out there and intending to help, please shout about it on
> this list now - Marc's set it up for us, so we should use it, or shut it
> down!

I haven't been involved on the development side of PostgreSQL yet, but
I have managed to make my way to this list with an interest in working
on PITR.  Let me know what I can do to help; I don't mind doing grunt
work if it helps push this along.

> Looking forward to working with y'all.

I am also looking forward to working on this.

Hope I can contribute to this great project,

Regards,
Fred


Re: [HACKERS] PITR Dead horse?

From: Mark Kirkwood
Tom recently posted the PITR patch to PATCHES - maybe we should send it
to this list too. That might kick-start the discussion...

regards

Mark

Bruce Momjian wrote:

> No one has said anything on the list yet.  :-)


Re: [HACKERS] PITR Dead horse?

From: Tom Lane
Mark Kirkwood <markir@paradise.net.nz> writes:
> Tom recently posted the PITR patch to PATCHES - maybe we should send it
> to this list too. That might kick-start the discussion...

Actually, I tried to send it here first.  This list bounced it because
it was over 40K (which I think is the standard limit for all the lists
except PATCHES).  So, if you want to look at it, go look in the PATCHES
archives:
http://archives.postgresql.org/pgsql-patches/2004-02/msg00134.php

(BTW, Marc, I still don't see an entry for the -pitr list at
http://archives.postgresql.org/ ...)

            regards, tom lane

Re: [HACKERS] PITR Dead horse?

From: "Marc G. Fournier"
On Thu, 12 Feb 2004, Tom Lane wrote:

> (BTW, Marc, I still don't see an entry for the -pitr list at
> http://archives.postgresql.org/ ...)

fixed ...

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664

Proposals for PITR

From: "Simon Riggs"
PITR for PostgreSQL 7.5+
===========================

I'm posting a few thoughts and plans on how to progress PITR. Initially,
I'd suggest chewing through some ideas to get some direction and then
split out some tasks to pursue individually or in teams.

I'm not claiming leadership, infallibility or anything really - I'm more
than happy to work together on these or other ideas for PITR. Any and
all comments or completely alternate views are welcomed.

It seems very likely that this design/plan comes much later than other
designs/thoughts from members of the PGDG community or elsewhere. All
ideas, corrections or offerings of partial code will be gratefully
accepted. There's no competition here - if you do it first and it works,
great. I'll buy you a beer.

I'll update this as information arrives and the picture changes:

OVERVIEW
PITR means Point-in-Time Recovery and is an extension to the current
functionality of PostgreSQL. Achieving PITR will be a mixture of OS
features and integrated dbms features. My suggested first priority is to
sort out the latter and not reinvent the wheel too much on OS-provided
features.

OVERALL PITR REQUIREMENTS
- A full backup of the database
- A backup of a set of time-sequenceable changes that can be used to
"roll forward" from the point at which the full backup was taken to the
desired point in time.
- Best practice, possibly offered by backup management software, to
implement both types of backup using a planned & structured approach

SUGGESTED IMPLEMENTATION GOALS
1. Exactly accurate PITR for databases
2. Robustness of code
3. Close to zero performance impact on normal (backed up) system
4. Good overall performance of total recovery procedures to minimise outage
5. Other normal PostgreSQL code givens: Berkeley licence, portable ANSI C

GENERAL PITR RECOVERY SCENARIO
The general PITR recovery scenario requires:
A - Take a full database backup
B - Take regular log file backups
C - Restore a full backup of a database.
If you cannot do this, hide your face in shame.
If it fails, retry it until it works. If it never does, you're out of
luck. If it succeeds but portions of the database are corrupt, it may
still be possible to continue, provided that only non-critical sections
are affected and the critical parts of the dbms, such as the system
tables, are still intact.
D - Restore a full set of log records.
If you cannot do this, you will have to accept that you have lost data.
If it fails, retry it until it works. If it never does, at least you
have the full backup point, even if that is out of date (slightly).
E - Locate the point in the set of log files that matches the state at
which the full backup was taken.
If this step fails or errors out, it may be because you have a set of log
files that don't match your full backup. In that case, you must accept
that the full backup you have is the best you've got.
F - Decide somehow what point in time you would like to recover to
G - Issue commands to define the Point-in-Time to which you would like
to recover.
H - Rollforward from the point of the backup to the defined PIT.
If this fails because of corrupted log files, then you may choose either
to try to carry on restoring past the corruption point, or to give up
and be happy with however far you've got.

SUGGESTED FEATURES TO BE IMPLEMENTED
My suggestion is that the PostgreSQL PITR team should try to provide
facilities to allow B, E, F, G and H.
Partial facilities already exist to achieve:
A - either using pg_dump or via a full OS file copy of a shutdown
database
G - roll forward on WAL logs is already possible, but we need to
implement "stop at point in time" logic.

It shouldn't be our role to implement file handling: backup and restore
at a later time can be done many ways, so let the user pick their
preferred method, according to their preferences, data volume, available
funds etc.

We should assume that the system onto which the restore takes place is
completely different from the system on which the backup was taken. We
may need to alter config files prior to starting the restored database.
Any other pre-requisites?

COMPARISON WITH EXISTING TODO ENTRIES
- Add entries to WAL files so it contains enough information for
recovery after a full backup
This is assumed to be complete. [I'm not sure where the role of pg_clog
is in all of this, so this may be a broken assumption.] Comments?

- Write application to archive WAL files to tape, disk, or network
Probably need to do first part, but I'm arguing not to do the copy to
tape..

- Determine how the archive writer will interact with the checkpoint
process and WAL file recycling
Required for above

- Write an application that will recover from a pg_dump or tar backup
and then recover WAL files to a specific time
Seems possible, but PITR is achievable without this initially, so do
other stuff first.

SUGGESTED PHASING OF IMPLEMENTATION
Here's a suggested phasing of work, to move towards full PITR. Clearly
other ways are also possible...

Phase 0: Planning & Discussion, including writing detailed test plan

Thread 1
Phase 1.1: Implement backup & full restore from OS image (B, E)
Phase 1.2: Implement PITR (add G and modify for H)
Phase 1.3: Implement WAL log inspection facilities (to improve step F)

Thread 2
Phase 2.1: Implement step A/C using pg_dump

Overall, doing this is going to require lots of investigation and
discussion. Most importantly, it's going to involve a fair amount of
detailed and fairly intrusive testing, i.e. complete database wipe and
restore.
Please, nobody reading this should think I'm going to cut acres of
code... we'll need lots of help from all sorts of people, especially
veteran PostgreSQL developers.

No time plans. We have time to get it right.

FEATURE OVERVIEWS & INVESTIGATIONS/IMPLEMENTATIONS REQUIRED
B - Backing up WAL log files
-Ordinarily, when old log segment files are no longer needed, they are
recycled (renamed to become the next segments in the numbered sequence).
This means that the data within them must be copied from there to
another location
    AFTER postgres has closed that file
    BEFORE it is renamed and recycled
Spotting that window of opportunity fairly accurately is important. We
will need some mechanism to recognise that a file is available for
copying, then lock the file and copy it away.
Think about
-what will happen if postgres tries to reuse the file while we are still
copying. Does postgres hang, waiting for the copy to complete (which may
be a long time if the copy is hanging because of a disk-full condition)?
This may already be catered for in the code, since the recycle logic can
handle not-ready-to-recycle conditions (not sure).
-Is it possible to allow read-only queries to continue in this case?
-what will happen if copy fails?
-who will do the copying? Should we use another sub-process of
postmaster to do the copying, or should we try to use a separate program
entirely?
-what will happen if that process slows down? What will we do if it
fails?
-how will we notice that a WAL log is now ready for copying? How will we
communicate that to the archiver process?
-Manual indicates that the current WAL format is bulky and would require
some compressed format to be implemented. Initially, I suggest ignoring
this and simply relying on OS or hardware/tape compression methods.

E - Tee up log files to backup state
Should be fairly straightforward when using a full OS file backup of a
correctly shutdown database.
-Investigate use of WAL file names, to see if any problems exist there.

G - Issue command for Recovery point in time
With full OS file backup, if the database is shutdown correctly, then we
will need a way to tell the database "you think you're up to date, but
you're not - I've added some more WAL files into the directories, so
roll forward on those now please".
-Decide when and how to issue the command.
-Decide syntax for the command? For example, RECOVER DATABASE TO TIME
<TIME> if it is an SQL statement.
Could also implement a new switch on postmaster, to tell it to come up
in recovery mode even though it thinks it doesn't need to.

H - Modify WAL rollforward
Rollforward until a point in time should be fairly straightforward. In
the current implementation, once started, it will run until it runs out
of files to apply, so simply change the termination test. Where to store
the point-in-time variable? A global? Pass it as a parameter when the
WAL rollforward logic is invoked as a command?
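
To make that concrete, here is a minimal C sketch of a replay loop with a
point-in-time termination test. Everything in it (recoveryTargetTime, the
stub record type, the reader and redo calls) is a placeholder for whatever
the real WAL machinery provides, not an actual backend symbol:

#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical sketch: none of these names are real backend symbols. */
typedef struct
{
    time_t  xtime;          /* timestamp carried by a commit/checkpoint record */
} WalRecordStub;

static time_t recoveryTargetTime = 0;   /* 0 = recover to end of available logs */

/* Placeholder reader/redo routines standing in for the real WAL machinery. */
static WalRecordStub *ReadNextWalRecord(void) { return NULL; }
static void ApplyWalRecord(WalRecordStub *rec) { (void) rec; }

/* The only real change to the loop: stop once a record passes the target. */
static void
ReplayUntilTarget(void)
{
    WalRecordStub *rec;

    while ((rec = ReadNextWalRecord()) != NULL)
    {
        if (recoveryTargetTime != 0 && rec->xtime > recoveryTargetTime)
            break;          /* new termination test for PITR */
        ApplyWalRecord(rec);
    }
}

Whether the target lives in a global, as in this sketch, or is passed down
as a parameter is exactly the open question above.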

A/C - using pg_dump
Following restore from pg_dump, pg_control will not be set at the same
point it was at when the backup was taken. As a result, step E would
incorrectly identify the starting position for the rollforward, which
would either result in a "cannot find correct log" message, or
attempting to rollforward with completely inappropriate WAL data. Some
means of getting round this issue needs to be investigated. Idea: modify
pg_dump to record the transactionid at the start of the pg_dump and have
it generate as the LAST SQL statement in the dump a statement to modify
the system tables (gasp!) to reflect the equivalent state at backup.
Maybe; investigation required. This route may well be complicated by
National Language issues etc, whereas the full backup method will at
least ignore all of that.

What else do we need to do?

What complications are there?

= = = = = = =


Re: Proposals for PITR

From: Tom Lane
"Simon Riggs" <simon@2ndquadrant.com> writes:
> The general PITR recovery scenario requires:
> A - Take a full database backup
> B - Take regular log file backups
> C - Restore a full backup of a database.

Note that pg_dump has absolutely zero to do with this.  The base for a
PITR recovery scenario has to be a physical dump, not a logical dump;
else WAL replay won't work.

The conclusion that we came to in previous discussions was that the
physical dump can be taken simply by doing "tar" (or your favorite
alternative tool) on the $PGDATA directory tree --- you could optionally
exclude the pg_xlog directory but must include all else.  It is okay to
do this while the database is live, so long as the series of log files
you will use to recover starts from a checkpoint that occurred before
you began the tar run.  When you do this the contents of the tar archive
will be an inconsistent snapshot of the database and therefore not
directly useful.  However, if you load the tar archive and then replay
WAL from a prior checkpoint to any time after completion of the tar run,
you will have brought the database back to consistency.

Part A of the problem therefore only requires tar plus enough management
software to keep track of the start and stop times of the tar run and
correlate the start time to the WAL log series.  It is worth pointing
out though that you need to start up your Part B solution (archiving off
WAL files as they come free) *before* you start taking your base dump.

(When we get around to implementing tablespaces, things might get more
complicated, since you'd need to remember to archive tablespace trees
that might not live under $PGDATA.  But we can ignore that refinement
for now, I think.)

> We should assume that the system onto which the restore takes place is
> completely different from the system on which the backup was taken.

It can't be too different, since you will need a binary-compatible
server; you won't be able to move across machine architectures this way.

> - Add entries to WAL files so it contains enough information for
> recovery after a full backup
> This is assumed to be complete. [I'm not sure where the role of pg_clog
> is in all of this, so this may be a broken assumption.] Comments?

I believe that the major problems (lack of recording of file creation/
deletion) were solved by the J.R. Nield patches I applied last week.
There may be some minor issues left to fix but I can't think of any
showstoppers.

> - Write application to archive WAL files to tape, disk, or network
> Probably need to do first part, but I'm arguing not to do the copy to
> tape..

I'd like to somehow see this handled by a user-supplied program or
script.  What we mainly need is to define a good API that lets the
archiver program understand which WAL segment files to archive when.

> B - Backing up WAL log files
> -Ordinarily, when old log segment files are no longer needed, they are
> recycled (renamed to become the next segments in the numbered sequence).
> This means that the data within them must be copied from there to
> another location
>     AFTER postgres has closed that file
>     BEFORE it is renamed and recycled

My inclination would be to change the backend code so that as soon as a
WAL segment is completed, it is flagged as being ready to dump to tape
(or wherever).  Possibly the easiest way to do this is to rename the
segment file somehow, perhaps "nnn" becomes "nnn.full".  Then, after the
archiver process has properly dumped the file, reflag it as being dumped
(perhaps rename to "nnn.done").  Obviously there are any number of ways
we could do this flagging, and depending on an OS rename facility might
not be the best.

A segment then can be recycled when it is both (a) older than the latest
checkpoint and (b) flagged as dumped.  Note that this approach allows
dumping of a file to start before the first time at which it could be
recycled.  In the event of a crash and restart, WAL replay has to be
able to find the flagged segments, so the flagging mechanism can't be
one that would make this impossible.
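
To pin down the flagging and recycling rule, here is a minimal C sketch of
the rename-based scheme; the paths, segment numbering and function names
are illustrative only, not existing backend code:

#include <stdbool.h>
#include <stdio.h>

/* Backend flags a just-completed segment as ready for the archiver. */
static int
FlagSegmentFull(const char *segpath)
{
    char    flagged[1024];

    snprintf(flagged, sizeof(flagged), "%s.full", segpath);
    return rename(segpath, flagged);            /* "nnn" -> "nnn.full" */
}

/* The archiver re-flags the segment once it has been safely dumped. */
static int
FlagSegmentDone(const char *segpath)
{
    char    full[1024];
    char    done[1024];

    snprintf(full, sizeof(full), "%s.full", segpath);
    snprintf(done, sizeof(done), "%s.done", segpath);
    return rename(full, done);                  /* "nnn.full" -> "nnn.done" */
}

/* A segment may be recycled only when it is both pre-checkpoint and dumped. */
static bool
SegmentRecyclable(unsigned int segno, unsigned int lastCheckpointSegno,
                  bool flaggedDone)
{
    return segno < lastCheckpointSegno && flaggedDone;
}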

> Think about
> -what will happen if postgres tries to reuse file while we are still
> copying.

This is a non-problem; segments that haven't been recycled can't become
reuse targets.

> -what will happen if copy fails?

This is the archiver's problem to deal with.  It only gets to be a
serious problem when you run out of disk space for WAL segment files,
so in most scenarios there is time for manual intervention to fix any
such problem and restart the archiver.

> -Manual indicates that current WAL format is bulky and would require
> some compressed format to be implemented. Initially, I suggest ignoring
> this and simply relying on OS or hardware/tape compression methods.

This is definitely something we could leave for later.

> With full OS file backup, if the database is shutdown correctly, then we
> will need a way to tell the database "you think you're up to date, but
> you're not - I've added some more WAL files into the directories, so
> roll forward on those now please".

I do not think this is an issue either, because my vision of this does
not include tar backups of shutdown databases.  What will be backed up
is a live database, therefore the postmaster will definitely know that
it needs to perform WAL replay.  What we will need is hooks to make sure
that the full set of required log files is available.  It's entirely
possible that that set of log files exceeds available disk space, so it
needs to be possible to run WAL replay incrementally, loading and then
replaying additional log segments after deleting old ones.

Possibly we could do this with some postmaster command-line switches.
J. R. Nield's patch embodied an "interactive recovery" backend mode,
which I didn't like in detail but the general idea is not necessarily
wrong.
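
For concreteness, one way the incremental replay described above could be
shaped is sketched below in C; every name here is invented purely for
illustration, nothing is existing backend code:

#include <stdbool.h>

/* Stub: fetch the next needed segment from the archive (an operator or
 * script copies it into place); returns false when none remain. */
static bool
LoadSegmentFromArchive(unsigned int segno)
{
    (void) segno;
    return false;
}

static void ReplaySegment(unsigned int segno) { (void) segno; }  /* apply its WAL records */
static void RemoveSegment(unsigned int segno) { (void) segno; }  /* free the disk space */

static void
IncrementalReplaySketch(unsigned int startSegno)
{
    unsigned int segno = startSegno;

    /* Load one segment at a time, replay it, then delete it before
     * fetching the next, so only a little disk space is needed at once. */
    while (LoadSegmentFromArchive(segno))
    {
        ReplaySegment(segno);
        RemoveSegment(segno);
        segno++;
    }
}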

            regards, tom lane

Re: Proposals for PITR

From: "Simon Riggs"
Tom,

Thanks for your many comments and practical suggestions - for most of
them I think I should be able to bash something out once I've got my new
dev env sorted.

I'll update the proposal into a design document with some of my earlier
blah taken out and all of your clarifications put in.

There are a few comments on stuff below:

Best Regards, Simon Riggs

>Tom Lane [mailto:tgl@sss.pgh.pa.us] writes
> > - Write application to archive WAL files to tape, disk, or network
> > Probably need to do first part, but I'm arguing not to do the copy to
> > tape..
>
> I'd like to somehow see this handled by a user-supplied program or
> script.  What we mainly need is to define a good API that lets the
> archiver program understand which WAL segment files to archive when.
>
> > B - Backing up WAL log files
> > -Ordinarily, when old log segment files are no longer needed, they are
> > recycled (renamed to become the next segments in the numbered sequence).
> > This means that the data within them must be copied from there to
> > another location
> >     AFTER postgres has closed that file
> >     BEFORE it is renamed and recycled
>
> My inclination would be to change the backend code so that as soon as a
> WAL segment is completed, it is flagged as being ready to dump to tape
> (or wherever).  Possibly the easiest way to do this is to rename the
> segment file somehow, perhaps "nnn" becomes "nnn.full".  Then, after the
> archiver process has properly dumped the file, reflag it as being dumped
> (perhaps rename to "nnn.done").  Obviously there are any number of ways
> we could do this flagging, and depending on an OS rename facility might
> not be the best.
>
> A segment then can be recycled when it is both (a) older than the latest
> checkpoint and (b) flagged as dumped.  Note that this approach allows
> dumping of a file to start before the first time at which it could be
> recycled.  In the event of a crash and restart, WAL replay has to be
> able to find the flagged segments, so the flagging mechanism can't be
> one that would make this impossible.

That sort of API doesn't do much for my sense of truth-and-beauty, but
it will work and allow us to get to the testing stage and beyond, where
we will, I'm sure, discover many things. When that knowledge is gained
*we* can refactor.

Spawning new post to think through the API in more detail.

> > With full OS file backup, if the database is shutdown correctly, then we
> > will need a way to tell the database "you think you're up to date, but
> > you're not - I've added some more WAL files into the directories, so
> > roll forward on those now please".
>
> I do not think this is an issue either, because my vision of this does
> not include tar backups of shutdown databases.  What will be backed up
> is a live database, therefore the postmaster will definitely know that
> it needs to perform WAL replay.  What we will need is hooks to make sure
> that the full set of required log files is available.

OK, again let's go for it on that assumption.

Longer term, I would feel more comfortable with a specific "backup
state". Relying on a side-effect of crash recovery for disaster recovery
doesn't give me a warm feeling. BUT, that feeling is for later, not now.

> It's entirely
> possible that that set of log files exceeds available disk space, so it
> needs to be possible to run WAL replay incrementally, loading and then
> replaying additional log segments after deleting old ones.
> Possibly we could do this with some postmaster command-line switches.
> J. R. Nield's patch embodied an "interactive recovery" backend mode,
> which I didn't like in detail but the general idea is not necessarily
> wrong.

Again, yes, though I will for now aim at the assumption that recovery
can be completed within available disk space, with this as an immediate
add-on when we have something that works.

That is also the basis for a "warm standby" solution: copy the tar to a
new system (similar, as you say), then repeatedly move new WAL logs
across to it, then start up in recover-only mode.

"Recover-only" mode would be initiated by a command-line switch, as you
say. This would recover all of the WAL logs, then immediately shut down
again.

The extension to that is what Oli Sennhauser has suggested, which is to
allow the second system to come up in read-only mode.

Best Regards, Simon Riggs


From: "Simon Riggs"
>Tom Lane [mailto:tgl@sss.pgh.pa.us]
>Re: [pgsql-hackers-pitr] Proposals for PITR
>>"Simon Riggs" <simon@2ndquadrant.com> writes:
>> The general PITR recovery scenario requires:
>> A - Take a full database backup
>> B - Take regular log file backups
>> C - Restore a full backup of a database.
> Note that pg_dump has absolutely zero to do with this.  The base for a
> PITR recovery scenario has to be a physical dump, not a logical dump;
> else WAL replay won't work.

Yes, I agree - I only included it because it was on the TODO. I'll cut it
out of the further thinking on PITR, in case it confuses the issue in
design/usage.

Regards, Simon


Re:

From: Bruce Momjian
Simon Riggs wrote:
> >Tom Lane [mailto:tgl@sss.pgh.pa.us]
> >Re: [pgsql-hackers-pitr] Proposals for PITR
> >>"Simon Riggs" <simon@2ndquadrant.com> writes:
> >> The general PITR recovery scenario requires:
> >> A - Take a full database backup
> >> B - Take regular log file backups
> >> C - Restore a full backup of a database.
> > Note that pg_dump has absolutely zero to do with this.  The base for a
> > PITR recovery scenario has to be a physical dump, not a logical dump;
> > else WAL replay won't work.
>
> Yes, I agree, I only included it because it was on the TODO. I'll cut it
> out of the further thinking on PITR, in case it confuses the issue in
> design/usage.

I removed the pg_dump mention on the web page.  Thanks.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Archival API

From: "Simon Riggs"
>Tom Lane [mailto:tgl@sss.pgh.pa.us]
> > Simon Riggs wrote
> > - Write application to archive WAL files to tape, disk, or network
> > Probably need to do first part, but I'm arguing not to do the copy to
> > tape..
>
> I'd like to somehow see this handled by a user-supplied program or
> script.  What we mainly need is to define a good API that lets the
> archiver program understand which WAL segment files to archive when.
>
> > B - Backing up WAL log files
> > -Ordinarily, when old log segment files are no longer needed, they are
> > recycled (renamed to become the next segments in the numbered sequence).
> > This means that the data within them must be copied from there to
> > another location
> >     AFTER postgres has closed that file
> >     BEFORE it is renamed and recycled
>
> My inclination would be to change the backend code so that as soon as a
> WAL segment is completed, it is flagged as being ready to dump to tape
> (or wherever).  Possibly the easiest way to do this is to rename the
> segment file somehow, perhaps "nnn" becomes "nnn.full".  Then, after the
> archiver process has properly dumped the file, reflag it as being dumped
> (perhaps rename to "nnn.done").  Obviously there are any number of ways
> we could do this flagging, and depending on an OS rename facility might
> not be the best.

Yes, that would be the correct time to begin archiving.

The way the code is currently written, there is a slot in MoveOfflineLogs
which looks to see if XLOG_archive_dir is set before entering a section
which is empty apart from a message. That routine doesn't get called
until we're about to recycle the files, which means we've lost our
window of opportunity to archive them. Making the number of files larger
doesn't affect its being called last... I'm going to ignore that
"hint", and any patch will include deletion of that code to avoid later
confusion.

The log switch and close occur during XLogWrite, when it is established
that there is no more room in the current log file for the current
record.

The file-flagging mechanism only allows a single archiver program to
operate, so I'll structure it as a new function XLogArchiveNotify() so
we can add in extra stuff later to improve/change things. That way we
have a home for the API.

> A segment then can be recycled when it is both (a) older than the latest
> checkpoint and (b) flagged as dumped.  Note that this approach allows
> dumping of a file to start before the first time at which it could be
> recycled.  In the event of a crash and restart, WAL replay has to be
> able to find the flagged segments, so the flagging mechanism can't be
> one that would make this impossible.

The number of WAL logs is effectively tunable anyway because it depends
on the number of checkpoint segments, so we can increase that if there
are issues with archival speed vs. transaction rate.

The rename is always safe because the log file names never wrap.

However, I'm loath to touch the files, in case something crashes
somewhere and we are left with recovery failing because of an
unlocatable file. (To paraphrase one of the existing code comments, only
the truly paranoid survive.) A similar way is to have a "buddy" file,
which indicates whether the segment is full and ready for archival; i.e.
when we close file "nnn" we also write a nearly-empty file called
"nnn.full". That file can then be deleted later BY THE archiver once
archival has finished, allowing the segment to be recycled by
InstallXLogFileSegment(). (This would require at least 6 more file
descriptors, but I'm not sure if that's an issue.)

InstallXLogFileSegment() can call XLogArchiveBusy() to see whether it is
allowed to reuse a segment or must allocate a new one. In the initial
implementation this would just test to see whether "nnn.full" still
exists. This will allow a range of behaviour to be catered for, such as
long waits while manual tape mounts are requested by the archiver, etc.
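
To pin the idea down, here is a rough C sketch of what the two calls could
look like with the buddy-file approach; the names XLogArchiveNotify() and
XLogArchiveBusy() come from the proposal above, but the bodies are only a
guess at one possible implementation, not backend code:

#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Called after a WAL segment file is closed: drop the "nnn.full" marker. */
static bool
XLogArchiveNotify(const char *segpath)
{
    char    marker[1024];
    FILE   *fp;

    snprintf(marker, sizeof(marker), "%s.full", segpath);
    fp = fopen(marker, "w");            /* nearly-empty notification file */
    if (fp == NULL)
        return false;
    fclose(fp);
    return true;
}

/*
 * Called from segment recycling (InstallXLogFileSegment above): the
 * segment is still busy while its marker survives, i.e. the archiver has
 * not yet copied it away and deleted the marker.
 */
static bool
XLogArchiveBusy(const char *segpath)
{
    char    marker[1024];

    snprintf(marker, sizeof(marker), "%s.full", segpath);
    return access(marker, F_OK) == 0;
}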

So in summary, the API is:

Archiver initialises and waits on notify
PostgreSQL initialises
...then
PostgreSQL fills a log, switches and closes it, then calls
XLogArchiveNotify()
Archiver moves the log somewhere safe, then sets state such that...
...sometime later
PostgreSQL checks XLogArchiveBusy() to see if it's safe to recycle the
file, and discovers the state set by the archiver.

The API is completely unintrusive on the current tried and tested
operation, and leaves the archiver(s) free to act as they choose, outside
of the address space of PostgreSQL. That way we don't have to update the
regression tests with destructive non-manual crash tests to show that it
works.

Clearly, we wouldn't want WAL logs to hang around too long, so we need
an initiation method for the archival process. Otherwise, we'll be
writing "nnn.full" notifications without anybody ever deleting them.
Either this could be set at startup with an archive_log_mode parameter
(OK, the name's been used before, but if the cap fits, wear it), or we
could set a maximum limit on the number of archive logs; there are a few
other ideas too, none of which I like.

Hmmmm...any listeners got any ideas here? How do we want this to work?

Anybody want to write a more complex archiver process to act as more
than just a test harness?

Best regards,

Simon Riggs