Re: Online base backup from the hot-standby - Mailing list pgsql-hackers
From | Steve Singer |
---|---|
Subject | Re: Online base backup from the hot-standby |
Date | |
Msg-id | BLU0-SMTP598D7314D4D1468AA0C4238EF30@phx.gbl |
In response to | Re: Online base backup from the hot-standby (Fujii Masao <masao.fujii@gmail.com>) |
Responses | Re: Online base backup from the hot-standby |
List | pgsql-hackers |
On 11-09-22 09:24 AM, Fujii Masao wrote:
> On Wed, Sep 21, 2011 at 11:50 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> 2011/9/13 Jun Ishiduka <ishizuka.jun@po.ntts.co.jp>:
>>> Update patch.
>>>
>>> Changes:
>>> * set 'on' full_page_writes by user (in document)
>>> * read "FROM: XX" in backup_label (in xlog.c)
>>> * check status when pg_stop_backup is executed (in xlog.c)
>>
>> Thanks for updating the patch.
>>
>> Before reviewing the patch, to encourage people to comment and
>> review the patch, I explain what this patch provides:
>
> Attached is the updated version of the patch. I refactored the code, fixed
> some bugs, added lots of source code comments, improved the document,
> but didn't change the basic design. Please check this patch, and let's use
> this patch as the base if you agree with that.

I have looked at both Jun's patch from Sept 13 and Fujii's updates to the patch. I agree that Fujii's updated version should be used as the basis for changes going forward. My comments below refer to that version (unless otherwise noted).

In backup.sgml, for the new section titled "Making a Base Backup during Recovery", I would prefer to see some mention in the title that this procedure is for standby servers, i.e. "Making a Base Backup from a Standby Database".
Users who have set up a hot-standby database should be familiar with the 'standby' terminology. I agree that the "during recovery" description is technically correct, but I'm not sure someone who is looking through the manual for instructions on making a base backup from their standby will realize this is the section they should read.

Around line 969, where you give an example of copying the control file, I would be a bit clearer that this is an example command, i.e. (Copy the pg_control file from the cluster directory to the global sub-directory of the backup. For example "cp $PGDATA/global/pg_control /mnt/server/backupdir/global")


Testing Notes
-----------------------------

I created a standby server from a base backup of another standby server. On this new standby server I then

1. Ran pg_start_backup('3'); and left the psql connection open
2. touch /tmp/3 -- my trigger_file

ssinger@ssinger-laptop:/usr/local/pgsql92git/bin$ LOG: trigger file found: /tmp/3
FATAL: terminating walreceiver process due to administrator command
LOG: restored log file "000000010000000000000006" from archive
LOG: record with zero length at 0/60002F0
LOG: restored log file "000000010000000000000006" from archive
LOG: redo done at 0/6000298
LOG: restored log file "000000010000000000000006" from archive
PANIC: record with zero length at 0/6000298
LOG: startup process (PID 19011) was terminated by signal 6: Aborted
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.

The new postmaster (the one trying to be promoted) dies. This is somewhat repeatable.

----

If a base backup is in progress on a recovery database and that recovery database is promoted to master, then following the promotion (if you don't restart the postmaster) I see

select pg_stop_backup();
ERROR: database system status mismatches between pg_start_backup() and pg_stop_backup()

If you restart the postmaster this goes away. When the postmaster leaves recovery mode I think it should abort an existing base backup so pg_stop_backup() will say no backup in progress, or give an error message on pg_stop_backup() saying that the base backup won't be usable. The above error doesn't really tell the user why there is a mismatch.

---------

In my testing, a few times I got into a situation where a standby server coming from a recovery target took a while to finish recovery (this is on a database with no activity). Then when I tried promoting that server to master I got

LOG: trigger file found: /tmp/3
FATAL: terminating walreceiver process due to administrator command
LOG: restored log file "000000010000000000000009" from archive
LOG: restored log file "000000010000000000000009" from archive
LOG: redo done at 0/90000E8
LOG: restored log file "000000010000000000000009" from archive
PANIC: unexpected pageaddr 0/6000000 in log file 0, segment 9, offset 0
LOG: startup process (PID 1804) was terminated by signal 6: Aborted
LOG: terminating any other active server processes

It is *possible* I mixed up the order of a step somewhere since my testing isn't script based. A standby server that 'looks' okay but can't actually be promoted is dangerous.

This version of the patch (I was testing the Sept 22nd version) seems less stable than how I remember the version from the July CF.
Maybe I'm just testing it harder or maybe something has been broken.

> In the current patch, there is no safeguard for preventing users from taking backup during
> recovery when FPW is disabled. This is unsafe. Are you planning to implement such a safeguard?

I agree with Fujii that we need a way (on the recovery machine) to detect if the master doesn't have FPW on. The ideas up-thread on how to do this sound good.

> Regards,
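
The failed promotions in the testing notes can be spotted mechanically in the server log, which may help when the test steps are repeated by hand. The following is only a minimal sketch in Python, assuming the log lines look like the excerpts in this message (a severity word followed by a colon, possibly preceded by a shell prompt). The function name `promotion_failed` is hypothetical and is not part of PostgreSQL or of the patch.

```python
import re

# Matches the severity prefix of a server log line such as "PANIC: ...".
# A shell prompt may precede it, as in the first log excerpt.
SEVERITY = re.compile(r"\b(LOG|WARNING|FATAL|PANIC|DETAIL|HINT):\s*(.*)$")


def promotion_failed(log_lines):
    """Return the first PANIC message seen after a 'trigger file found' line.

    Returns None if no trigger file was found or no PANIC followed it.
    """
    triggered = False
    for line in log_lines:
        match = SEVERITY.search(line)
        if match is None:
            continue
        severity, message = match.groups()
        if severity == "LOG" and message.startswith("trigger file found"):
            triggered = True
        elif triggered and severity == "PANIC":
            return message
    return None


# Sample taken from the first log excerpt (assumed single-space format).
sample = [
    "LOG: trigger file found: /tmp/3",
    "FATAL: terminating walreceiver process due to administrator command",
    'LOG: restored log file "000000010000000000000006" from archive',
    "LOG: redo done at 0/6000298",
    "PANIC: record with zero length at 0/6000298",
]
print(promotion_failed(sample))
```

A check like this only reports that the startup process panicked after the trigger file was seen; it says nothing about why, so the full log would still be needed for a bug report.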