Re: Minor changes to Recovery related code - Mailing list pgsql-hackers

From Bruce Momjian
Subject Re: Minor changes to Recovery related code
Msg-id 200709270523.l8R5N7621573@momjian.us
In response to Re: Minor changes to Recovery related code  ("Simon Riggs" <simon@2ndquadrant.com>)
List pgsql-hackers
This has been saved for the 8.4 release:
http://momjian.postgresql.org/cgi-bin/pgpatches_hold

---------------------------------------------------------------------------

Simon Riggs wrote:
> On Sat, 2007-03-31 at 00:51 +0200, Florian G. Pflug wrote:
> > Simon Riggs wrote:
> > > On Fri, 2007-03-30 at 16:34 -0400, Tom Lane wrote:
> > >> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > >>> 2. pg_stop_backup() should wait until all archive files are safely
> > >>> archived before returning
> > >> Not sure I agree with that one.  If it fails, you can't tell whether the
> > >> action is done and it failed while waiting for the archiver, or if you
> > >> need to redo it.
> > > 
> > > There's a slight delay between pg_stop_backup() completing and the
> > > archiver doing its stuff. Currently if somebody does a -m fast straight
> > > after the pg_stop_backup() the backup may be unusable.
> > > 
> > > We need a way to plug that small hole.
> > > 
> > > I suggest that pg_stop_backup() polls once per second until
> > > pg_xlog/archive_status/LOG.ready disappears, in which case it ends
> > > successfully. If it does this for more than 60 seconds it ends
> > > successfully but produces a WARNING.
> > 
> > I fear that ending successfully despite having not archived all wals
> > will make this feature less worthwhile. If a dba knows what he is
> > doing, he can code a perfectly safe backup script using 8.2 too.
> > He'll just have to check the current wal position after pg_stop_backup(),
> > (There is a function for that, right?), and wait until the corresponding
> > wal was archived.
> > 
> > In reality, however, I fear that most people will just create a script
> > that does 'echo "select pg_stop_backup()" | psql' or something similar.
> > If they're a bit more careful, they will enable ON_ERROR_STOP, and check
> > the return value of psql. I believe that those are the people who would
> > really benefit from a pg_stop_backup() that waits for archiving to complete.
> > But they probably won't check for WARNINGs.
> > 
> > Maybe doing it the other way round would be an option?
> > pg_stop_backup() could wait for the archiver to complete forever, but
> > spit out a warning every 60 seconds or so "WARNING: Still waiting
> > for wal archiving of wal ??? to complete". If someone really wants
> > a 60-second timeout, he can just use statement_timeout.
> 
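The "wait forever, but warn every 60 seconds" idea above can be sketched outside the server as a polling loop. This is only an illustration, not the proposed backend code: the function name wait_for_archive and all of its parameters are hypothetical, and it assumes the caller already knows the path of the .ready file under pg_xlog/archive_status for the WAL file that ends the backup. The timeout parameter stands in for what statement_timeout would do inside the server.

```python
import os
import time
import warnings


def wait_for_archive(ready_path, poll_interval=1.0, warn_every=60.0,
                     timeout=None, sleep=time.sleep, clock=time.monotonic):
    """Block until ready_path disappears, warning periodically while waiting.

    Hypothetical helper: returns True once the .ready file is gone, or
    False if `timeout` seconds pass first. timeout=None waits forever.
    """
    start = clock()
    next_warning = start + warn_every
    while os.path.exists(ready_path):
        now = clock()
        if timeout is not None and now - start >= timeout:
            return False
        if now >= next_warning:
            # Keep waiting, but tell the caller that archiving is slow.
            warnings.warn(f"still waiting for WAL archiving of {ready_path}")
            next_warning += warn_every
        sleep(poll_interval)
    return True
```

The sleep and clock arguments exist only so the loop can be exercised without real delays. A backup script would call it with the defaults and treat a False return as a failed backup.
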
> I've just come up against this problem again, so I think it is a must
> fix for this release. Other problems exist also, mentioned on separate
> threads.
> 
> We have a number of problems surrounding pg_stop_backup/shutdown:
> 
> 1. pg_stop_backup() currently returns before the WAL file containing the
> last change is correctly archived. That is a small hole, but one that is
> exposed when people write test scripts that immediately shut down the
> database after issuing pg_stop_backup(). It doesn't make much sense to
> shut down immediately after a hot backup, but it should still work
> sensibly.
> 
> 2. We've also had problems caused by making the archiver wait until all
> WAL files are archived. If there is a backlog for some reason and the
> DBA issues a restart (i.e. stop and immediate restart) then making the
> archiver loop while it tries (and possibly fails) to archive all files
> would cause an outage. Avoiding this is why we do the current
> get-out-fast approach.
> There are some sub-scenarios:
> a) there is a backlog of WAL files, but no error has occurred on the
> *last* file (we might have just fixed a problem).
> b) there is a backlog of WAL files, but an error is causing a retry of
> the last file.
> 
> My proposal is for us to record somewhere other than the logs that a
> failure to archive has occurred and is being retried. Failure to archive
> will be recorded in the archive_status directory as an additional file
> called archive_error, which will be deleted in the case of archive
> success and created in the case of archive error. This maintains
> archiver's lack of attachment to shared memory and general simplicity of
> design.
> 
> - pg_stop_backup() will wait until the WAL file that ends the backup is
> safely archived, even if a failure to archive occurs. This is a change
> to current behaviour, but since it implements the originally *expected*
> behaviour IMHO it should be the default.
> 
> - new function: pg_stop_backup_nowait() returns immediately without
> waiting for archive, the same as the current pg_stop_backup()
> 
> - new function: pg_stop_backup_wait(int seconds) waits until either an
> archival fails or the ending WAL file is archived, with a max wait as
> specified. wait=0 means wait until archive errors are resolved.
> 
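One possible reading of the pg_stop_backup_wait(int seconds) semantics above, sketched as a standalone loop over the archive_status directory. Everything here is an assumption for illustration: the function name stop_backup_wait, its parameters, and the string return values are hypothetical, and the archive_error marker file exists only in the proposal. In particular, "wait=0" is interpreted as "keep waiting through archive errors, with no deadline".

```python
import os
import time


def stop_backup_wait(status_dir, wal_name, max_wait,
                     poll_interval=1.0, sleep=time.sleep, clock=time.monotonic):
    """Wait for the WAL file that ends the backup to be archived.

    Hypothetical sketch. Returns "archived" once <wal_name>.ready is gone,
    "failed" if the proposed archive_error marker is present, or "timeout"
    after max_wait seconds. max_wait == 0 ignores both errors and deadline.
    """
    ready = os.path.join(status_dir, wal_name + ".ready")
    error = os.path.join(status_dir, "archive_error")
    start = clock()
    while os.path.exists(ready):
        if max_wait > 0:
            if os.path.exists(error):
                return "failed"      # an archival attempt has failed
            if clock() - start >= max_wait:
                return "timeout"     # gave up after the maximum wait
        sleep(poll_interval)
    return "archived"
```

Under this reading, the new default pg_stop_backup() would behave like max_wait=0, and pg_stop_backup_nowait() would skip the loop entirely.
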
> Alternatives?
> 
> -- 
>   Simon Riggs             
>   EnterpriseDB   http://www.enterprisedb.com
> 

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +

