Re: Minor changes to Recovery related code - Mailing list pgsql-hackers
From | Bruce Momjian |
---|---|
Subject | Re: Minor changes to Recovery related code |
Date | |
Msg-id | 200709270523.l8R5N7621573@momjian.us Whole thread Raw |
In response to | Re: Minor changes to Recovery related code ("Simon Riggs" <simon@2ndquadrant.com>) |
List | pgsql-hackers |
This has been saved for the 8.4 release: http://momjian.postgresql.org/cgi-bin/pgpatches_hold --------------------------------------------------------------------------- Simon Riggs wrote: > On Sat, 2007-03-31 at 00:51 +0200, Florian G. Pflug wrote: > > Simon Riggs wrote: > > > On Fri, 2007-03-30 at 16:34 -0400, Tom Lane wrote: > > >> "Simon Riggs" <simon@2ndquadrant.com> writes: > > >>> 2. pg_stop_backup() should wait until all archive files are safely > > >>> archived before returning > > >> Not sure I agree with that one. If it fails, you can't tell whether the > > >> action is done and it failed while waiting for the archiver, or if you > > >> need to redo it. > > > > > > There's a slight delay between pg_stop_backup() completing and the > > > archiver doing its stuff. Currently if somebody does a -m fast straight > > > after the pg_stop_backup() the backup may be unusable. > > > > > > We need a way to plug that small hole. > > > > > > I suggest that pg_stop_backup() polls once per second until > > > pg_xlog/archive_status/LOG.ready disappears, in which case it ends > > > successfully. If it does this for more than 60 seconds it ends > > > successfully but produces a WARNING. > > > > I fear that ending sucessfully despite having not archived all wals > > will make this feature less worthwile. If a dba knows what he is > > doing, he can code a perfectly safe backup script using 8.2 too. > > He'll just have to check the current wal position after pg_stop_backup(), > > (There is a function for that, right?), and wait until the corresponding > > wal was archived. > > > > In realitly, however, I feare that most people will just create a script > > that does 'echo "select pg_stop_backup | psql"' or something similar. > > If they're a bit more carefull, they will enable ON_ERROR_STOP, and check > > the return value of pgsql. I believe that those are the people who would > > really benefit from a pg_stop_backup() that waits for archiving to complete. > > But they probably won't check for WARNINGs. > > > > Maybe doing it the other way round would be an option? > > pg_stop_backup() could wait for the archiver to complete forever, but > > spit out a warning every 60 seconds or so "WARNING: Still waiting > > for wal archiving of wal ??? to complete". If someone really wants > > a 60-second timeout, he can just use statement_timeout. > > I've just come up against this problem again, so I think it is a must > fix for this release. Other problems exist also, mentioned on separate > threads. > > We have a number of problems surrounding pg_stop_backup/shutdown: > > 1. pg_stop_backup() currently returns before the WAL file containing the > last change is correctly archived. That is a small hole, but one that is > exposed when people write test scripts that immediately shutdown the > database after issuing pg_stop_backup(). It doesn't make much sense to > shutdown immediately after a hot backup, but it should still work > sensibly. > > 2. We've also had problems caused by making the archiver wait until all > WAL files are archived. If there is a backlog for some reason and the > DBA issues a restart (i.e. stop and immediate restart) then making the > archiver loop while it tries (and possibly fails) to archive all files > would cause an outage. Avoiding this is why we do the current > get-out-fast approach. > There are some sub scenarios: > a) there is a backlog of WAL files, but no error has occurred on the > *last* file (we might have just fixed a problem). > b) there is a backlog of WAL files, but an error is causing a retry of > the last file. > > My proposal is for us to record somewhere other than the logs that a > failure to archive has occurred and is being retried. Failure to archive > will be recorded in the archive_status directory as an additional file > called archive_error, which will be deleted in the case of archive > success and created in the case of archive error. This maintains > archiver's lack of attachment to shared memory and general simplicity of > design. > > - pg_stop_backup() will wait until the WAL file that ends the backup is > safely archived, even if a failure to archive occurs. This is a change > to current behaviour, but since it implements the originally *expected* > behaviour IMHO it should be the default. > > - new function: pg_stop_backup_nowait() return immediately without > waiting for archive, the same as the current pg_stop_backup() > > - new function: pg_stop_backup_wait(int seconds) wait until either an > archival fails or the ending WAL file is archived, with a max wait as > specified. wait=0 means wait until archive errors are resolved. > > Alternatives? > > -- > Simon Riggs > EnterpriseDB http://www.enterprisedb.com > > > > ---------------------------(end of broadcast)--------------------------- > TIP 3: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faq -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://postgres.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
pgsql-hackers by date: