Thread: Unarchived WALs deleted after crash
Hi,

I am facing an unexpected behavior on a 9.2.2 cluster that I can reproduce on current HEAD.

On a cluster with archive enabled but failing, after a crash of postmaster, the checkpoint occurring before leaving the recovery mode deletes any additional WALs, even those waiting to be archived.

Because of this, after recovering from the crash, the previous PITR backup cannot be used to restore the instance to a time where archiving was failing. Any slaves fed by WAL or lagging in SR need to be recreated.

AFAICT, this is not documented and I would expect the WALs to be archived by the archiver process when the cluster exits the recovery step.

Here is a simple scenario to reproduce this.

Configuration:

  wal_level = archive
  archive_mode = on
  archive_command = '/bin/false'
  log_checkpoints = on

Scenario:

  createdb test
  psql -c 'create table test as select i, md5(i::text) from generate_series(1,3000000) as i;' test
  kill -9 $(head -1 $PGDATA/postmaster.pid)
  pg_ctl start

Using this scenario, the log file shows:

  LOG: archive command failed with exit code 1
  DETAIL: The failed archive command was: /bin/false
  WARNING: transaction log file "000000010000000000000001" could not be archived: too many failures
  LOG: database system was interrupted; last known up at 2013-02-14 16:12:58 CET
  LOG: database system was not properly shut down; automatic recovery in progress
  LOG: crash recovery starts in timeline 1 and has target timeline 1
  LOG: redo starts at 0/11400078
  LOG: record with zero length at 0/13397190
  LOG: redo done at 0/13397160
  LOG: last completed transaction was at log time 2013-02-14 16:12:58.49303+01
  LOG: checkpoint starting: end-of-recovery immediate
  LOG: checkpoint complete: wrote 2869 buffers (17.5%); 0 transaction log file(s) added, 9 removed, 7 recycled; write=0.023 s, sync=0.468 s, total=0.739 s; sync files=2, longest=0.426 s, average=0.234 s
  LOG: autovacuum launcher started
  LOG: database system is ready to accept connections
  LOG: archive command failed with exit code 1
  DETAIL: The failed archive command was: /bin/false
  LOG: archive command failed with exit code 1
  DETAIL: The failed archive command was: /bin/false
  LOG: archive command failed with exit code 1
  DETAIL: The failed archive command was: /bin/false
  WARNING: transaction log file "000000010000000000000011" could not be archived: too many failures

Before the kill, "000000010000000000000001" was the WAL to archive. After the kill, the checkpoint deleted 9 files before exiting recovery mode and "000000010000000000000011" became the first WAL to archive. "000000010000000000000001" through "000000010000000000000010" were removed or recycled.

Is it expected ?

--
Jehan-Guillaume de Rorthais
http://www.dalibo.com
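[Editorial note] The segment counts in the report above can be checked by hand: WAL segment file names are 24 hexadecimal digits made of three 8-digit fields, so "...01" through "...10" means 0x01 through 0x10, that is 16 segments, which matches the 9 removed plus 7 recycled in the checkpoint line. The following is only a minimal sketch in C of that arithmetic. The helper names `parse_wal_name` and `segments_in_range` and the struct are hypothetical (they are not PostgreSQL functions), and the field labels assume the 9.2-era timeline/log/segment naming.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical decomposition of a WAL segment file name (illustration only). */
typedef struct {
    unsigned int timeline;  /* first 8 hex digits  */
    unsigned int log;       /* middle 8 hex digits */
    unsigned int seg;       /* last 8 hex digits   */
} WalSegmentName;

/* Parse a 24-character upper-case hex name; returns false if malformed. */
bool parse_wal_name(const char *name, WalSegmentName *out)
{
    if (strlen(name) != 24 || strspn(name, "0123456789ABCDEF") != 24)
        return false;
    return sscanf(name, "%8X%8X%8X",
                  &out->timeline, &out->log, &out->seg) == 3;
}

/*
 * Count segments from first to last, inclusive. This sketch only handles
 * the simple case of the report: same timeline and same log field.
 * Returns 0 when that assumption does not hold.
 */
unsigned int segments_in_range(const WalSegmentName *first,
                               const WalSegmentName *last)
{
    if (first->timeline != last->timeline || first->log != last->log ||
        last->seg < first->seg)
        return 0;
    return last->seg - first->seg + 1;
}
```

Parsing "000000010000000000000001" and "000000010000000000000010" with these helpers gives segment numbers 1 and 16, so the range covers 16 files, all of which were still waiting to be archived when the end-of-recovery checkpoint ran.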
On Thu, Feb 14, 2013 at 7:45 AM, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote: > Hi, > > I am facing an unexpected behavior on a 9.2.2 cluster that I can > reproduce on current HEAD. > > On a cluster with archive enabled but failing, after a crash of > postmaster, the checkpoint occurring before leaving the recovery mode > deletes any additional WALs, even those waiting to be archived. I believe I have encountered this recently, but didn't get enough chance to work with it to correspond. For me, the cause was out-of-disk on the file system that exclusively contained WAL, backlogged because archiving fell behind writing. This causes the cluster to crash -- par for the course -- but also an archive gap was created. At the time I thought there was some kind of bug in dealing with out of space issues in the archiver (the .ready bookkeeping), but the symptoms I saw seem like they might be explained by your report, too. -- fdr
On 14.02.2013 17:45, Jehan-Guillaume de Rorthais wrote:
> I am facing an unexpected behavior on a 9.2.2 cluster that I can
> reproduce on current HEAD.
>
> On a cluster with archive enabled but failing, after a crash of
> postmaster, the checkpoint occurring before leaving the recovery mode
> deletes any additional WALs, even those waiting to be archived.
> ...
> Is it expected ?

No, it's a bug. Ouch. It was introduced in 9.2, by commit 5286105800c7d5902f98f32e11b209c471c0c69c:

> - /*
> -  * Normally we don't delete old XLOG files during recovery to
> -  * avoid accidentally deleting a file that looks stale due to a
> -  * bug or hardware issue, but in fact contains important data.
> -  * During streaming recovery, however, we will eventually fill the
> -  * disk if we never clean up, so we have to. That's not an issue
> -  * with file-based archive recovery because in that case we
> -  * restore one XLOG file at a time, on-demand, and with a
> -  * different filename that can't be confused with regular XLOG
> -  * files.
> -  */
> - if (WalRcvInProgress() || XLogArchiveCheckDone(xlde->d_name))
> + if (RecoveryInProgress() || XLogArchiveCheckDone(xlde->d_name))
> [ delete the file ]

With that commit, we started to keep WAL segments restored from the archive in pg_xlog, so we needed to start deleting old segments during archive recovery, even when streaming replication was not active. But the above change was too broad; we started to delete old segments also during crash recovery.

The above should check InArchiveRecovery, i.e. only delete old files when in archive recovery, not when in crash recovery. But there's one little complication: InArchiveRecovery is currently only valid in the startup process, so we'll need to also share it in shared memory, so that the checkpointer process can access it.

I propose the attached patch to fix it.

- Heikki
Attachment
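[Editorial note] The three removal conditions discussed in the message above (the pre-9.2 test, the 9.2 test from the quoted diff, and the proposed InArchiveRecovery test) can be compared side by side. This is a minimal sketch only, not the PostgreSQL source: the `RecoveryState` struct and the `may_remove_*` functions are hypothetical stand-ins, and the boolean `archive_done` stands for whatever `XLogArchiveCheckDone()` would return for the segment.

```c
#include <stdbool.h>

/* Hypothetical snapshot of server state when the checkpoint runs. */
typedef struct {
    bool recovery_in_progress;  /* any recovery: crash or archive       */
    bool in_archive_recovery;   /* archive recovery / standby only      */
    bool walreceiver_running;   /* streaming replication currently live */
} RecoveryState;

/* Pre-9.2 condition from the removed lines of the quoted diff. */
bool may_remove_pre92(const RecoveryState *s, bool archive_done)
{
    return s->walreceiver_running || archive_done;
}

/* 9.2 condition from the added line: any recovery skips the archive check. */
bool may_remove_92(const RecoveryState *s, bool archive_done)
{
    return s->recovery_in_progress || archive_done;
}

/* Condition proposed in the message: skip the check only in archive recovery. */
bool may_remove_proposed(const RecoveryState *s, bool archive_done)
{
    return s->in_archive_recovery || archive_done;
}
```

For the reported scenario (crash recovery, no walreceiver, segment not yet archived) the 9.2 condition evaluates to true, so the file is removed, while the other two evaluate to false and the file is kept for the archiver.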
On 15 February 2013 14:31, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > On 14.02.2013 17:45, Jehan-Guillaume de Rorthais wrote: >> >> I am facing an unexpected behavior on a 9.2.2 cluster that I can >> reproduce on current HEAD. >> >> On a cluster with archive enabled but failing, after a crash of >> postmaster, the checkpoint occurring before leaving the recovery mode >> deletes any additional WALs, even those waiting to be archived. > >> ... >> Is it expected ? > > No, it's a bug. Ouch. It was introduced in 9.2, by commit > 5286105800c7d5902f98f32e11b209c471c0c69c: Thanks for tracking that down. >> - /* >> - * Normally we don't delete old XLOG files during recovery to >> - * avoid accidentally deleting a file that looks stale due to a >> - * bug or hardware issue, but in fact contains important data. >> - * During streaming recovery, however, we will eventually fill the >> - * disk if we never clean up, so we have to. That's not an issue >> - * with file-based archive recovery because in that case we >> - * restore one XLOG file at a time, on-demand, and with a >> - * different filename that can't be confused with regular XLOG >> - * files. >> - */ >> - if (WalRcvInProgress() || XLogArchiveCheckDone(xlde->d_name)) >> + if (RecoveryInProgress() || XLogArchiveCheckDone(xlde->d_name)) >> [ delete the file ] > > > With that commit, we started to keep WAL segments restored from the archive > in pg_xlog, so we needed to start deleting old segments during archive > recovery, even when streaming replication was not active. But the above > change was to broad; we started to delete old segments also during crash > recovery. > > The above should check InArchiveRecovery, ie. only delete old files when in > archive recovery, not when in crash recovery. But there's one little > complication: InArchiveRecovery is currently only valid in the startup > process, so we'll need to also share it in shared memory, so that the > checkpointer process can access it. 
> > I propose the attached patch to fix it. Agree with your diagnosis and fix. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 15.02.2013 17:12, Simon Riggs wrote: > On 15 February 2013 14:31, Heikki Linnakangas<hlinnakangas@vmware.com> wrote: >>> - /* >>> - * Normally we don't delete old XLOG files during recovery to >>> - * avoid accidentally deleting a file that looks stale due to a >>> - * bug or hardware issue, but in fact contains important data. >>> - * During streaming recovery, however, we will eventually fill the >>> - * disk if we never clean up, so we have to. That's not an issue >>> - * with file-based archive recovery because in that case we >>> - * restore one XLOG file at a time, on-demand, and with a >>> - * different filename that can't be confused with regular XLOG >>> - * files. >>> - */ >>> - if (WalRcvInProgress() || XLogArchiveCheckDone(xlde->d_name)) >>> + if (RecoveryInProgress() || XLogArchiveCheckDone(xlde->d_name)) >>> [ delete the file ] >> >> With that commit, we started to keep WAL segments restored from the archive >> in pg_xlog, so we needed to start deleting old segments during archive >> recovery, even when streaming replication was not active. But the above >> change was to broad; we started to delete old segments also during crash >> recovery. >> >> The above should check InArchiveRecovery, ie. only delete old files when in >> archive recovery, not when in crash recovery. But there's one little >> complication: InArchiveRecovery is currently only valid in the startup >> process, so we'll need to also share it in shared memory, so that the >> checkpointer process can access it. >> >> I propose the attached patch to fix it. > > Agree with your diagnosis and fix. Ok, committed. For the sake of the archives, attached is a script based on Jehan-Guillaume's description that I used for testing (incidentally based on Kyotaro's script to reproduce an unrelated problem in another thread). Thanks for the report! - Heikki
Attachment
On Fri, Feb 15, 2013 at 11:31 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > On 14.02.2013 17:45, Jehan-Guillaume de Rorthais wrote: >> >> I am facing an unexpected behavior on a 9.2.2 cluster that I can >> reproduce on current HEAD. >> >> On a cluster with archive enabled but failing, after a crash of >> postmaster, the checkpoint occurring before leaving the recovery mode >> deletes any additional WALs, even those waiting to be archived. > >> ... >> Is it expected ? > > No, it's a bug. Ouch. It was introduced in 9.2, by commit > 5286105800c7d5902f98f32e11b209c471c0c69c: Oh, sorry for my mistake. > >> - /* >> - * Normally we don't delete old XLOG files during recovery to >> - * avoid accidentally deleting a file that looks stale due to a >> - * bug or hardware issue, but in fact contains important data. >> - * During streaming recovery, however, we will eventually fill the >> - * disk if we never clean up, so we have to. That's not an issue >> - * with file-based archive recovery because in that case we >> - * restore one XLOG file at a time, on-demand, and with a >> - * different filename that can't be confused with regular XLOG >> - * files. >> - */ >> - if (WalRcvInProgress() || XLogArchiveCheckDone(xlde->d_name)) >> + if (RecoveryInProgress() || XLogArchiveCheckDone(xlde->d_name)) >> [ delete the file ] > > > With that commit, we started to keep WAL segments restored from the archive > in pg_xlog, so we needed to start deleting old segments during archive > recovery, even when streaming replication was not active. But the above > change was to broad; we started to delete old segments also during crash > recovery. > > The above should check InArchiveRecovery, ie. only delete old files when in > archive recovery, not when in crash recovery. But there's one little > complication: InArchiveRecovery is currently only valid in the startup > process, so we'll need to also share it in shared memory, so that the > checkpointer process can access it. 
> > I propose the attached patch to fix it. At least in 9.2, when the archived file is restored into pg_xlog, its xxx.done archive status file is created. So we don't need to check InArchiveRecovery when deleting old WAL files. Checking whether xxx.done exists is enough. Unfortunately in HEAD, xxx.done file is not created when restoring archived file because of absence of the patch. We need to implement that first. Regards, -- Fujii Masao
On 15.02.2013 18:10, Fujii Masao wrote: > On Fri, Feb 15, 2013 at 11:31 PM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: >>> - /* >>> - * Normally we don't delete old XLOG files during recovery to >>> - * avoid accidentally deleting a file that looks stale due to a >>> - * bug or hardware issue, but in fact contains important data. >>> - * During streaming recovery, however, we will eventually fill the >>> - * disk if we never clean up, so we have to. That's not an issue >>> - * with file-based archive recovery because in that case we >>> - * restore one XLOG file at a time, on-demand, and with a >>> - * different filename that can't be confused with regular XLOG >>> - * files. >>> - */ >>> - if (WalRcvInProgress() || XLogArchiveCheckDone(xlde->d_name)) >>> + if (RecoveryInProgress() || XLogArchiveCheckDone(xlde->d_name)) >>> [ delete the file ] >> >> With that commit, we started to keep WAL segments restored from the archive >> in pg_xlog, so we needed to start deleting old segments during archive >> recovery, even when streaming replication was not active. But the above >> change was to broad; we started to delete old segments also during crash >> recovery. >> >> The above should check InArchiveRecovery, ie. only delete old files when in >> archive recovery, not when in crash recovery. But there's one little >> complication: InArchiveRecovery is currently only valid in the startup >> process, so we'll need to also share it in shared memory, so that the >> checkpointer process can access it. >> >> I propose the attached patch to fix it. > > At least in 9.2, when the archived file is restored into pg_xlog, its xxx.done > archive status file is created. So we don't need to check InArchiveRecovery > when deleting old WAL files. Checking whether xxx.done exists is enough. Hmm, what about streamed WAL files? I guess we could go back to the pre-9.2 coding, and check WalRcvInProgress(). 
But I didn't actually like that too much; it seems rather random that old streamed files are recycled when wal receiver is running at the time of restartpoint, and otherwise not, because whether wal receiver is running at the time the restartpoint happens has little to do with which files were created by streaming replication. With the right pattern of streaming files from the master, but always being temporarily disconnected when the restartpoint runs, you could still accumulate WAL files infinitely.

> Unfortunately in HEAD, xxx.done file is not created when restoring archived
> file because of absence of the patch. We need to implement that first.

Ah yeah, that thing again.. (http://www.postgresql.org/message-id/50DF5BA7.6070200@vmware.com) I'm going to forward-port that patch now, before it's forgotten again. It's not clear to me what the holdup was on this, but whatever the bigger patch we've been waiting for is, it can just as well be done on top of the forward-port.

- Heikki
On Sat, Feb 16, 2013 at 2:07 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > On 15.02.2013 18:10, Fujii Masao wrote: >> >> On Fri, Feb 15, 2013 at 11:31 PM, Heikki Linnakangas >> <hlinnakangas@vmware.com> wrote: >>>> >>>> - /* >>>> >>>> - * Normally we don't delete old XLOG files during recovery to >>>> - * avoid accidentally deleting a file that looks stale due to a >>>> - * bug or hardware issue, but in fact contains important data. >>>> - * During streaming recovery, however, we will eventually fill the >>>> - * disk if we never clean up, so we have to. That's not an issue >>>> - * with file-based archive recovery because in that case we >>>> - * restore one XLOG file at a time, on-demand, and with a >>>> - * different filename that can't be confused with regular XLOG >>>> - * files. >>>> - */ >>>> - if (WalRcvInProgress() || XLogArchiveCheckDone(xlde->d_name)) >>>> + if (RecoveryInProgress() || XLogArchiveCheckDone(xlde->d_name)) >>>> [ delete the file ] >>> >>> >>> With that commit, we started to keep WAL segments restored from the >>> archive >>> in pg_xlog, so we needed to start deleting old segments during archive >>> recovery, even when streaming replication was not active. But the above >>> change was to broad; we started to delete old segments also during crash >>> recovery. >>> >>> The above should check InArchiveRecovery, ie. only delete old files when >>> in >>> archive recovery, not when in crash recovery. But there's one little >>> complication: InArchiveRecovery is currently only valid in the startup >>> process, so we'll need to also share it in shared memory, so that the >>> checkpointer process can access it. >>> >>> I propose the attached patch to fix it. >> >> >> At least in 9.2, when the archived file is restored into pg_xlog, its >> xxx.done >> archive status file is created. So we don't need to check >> InArchiveRecovery >> when deleting old WAL files. Checking whether xxx.done exists is enough. 
> > > Hmm, what about streamed WAL files? I guess we could go back to the pre-9.2 > coding, and check WalRcvInProgress(). But I didn't actually like that too > much, it seems rather random that old streamed files are recycled when wal > receiver is running at the time of restartpoint, and otherwise not. Because > whether wal receiver is running at the time the restartpoint happens has > little to do with which files were created by streaming replication. With > the right pattern of streaming files from the master, but always being > teporarily disconnected when the restartpoint runs, you could still > accumulate WAL files infinitely. Walreceiver always creates .done file when it closes the already-flushed WAL file and switches WAL file to next. So we also don't need to check WalRcvInProgress(). >> Unfortunately in HEAD, xxx.done file is not created when restoring >> archived >> file because of absence of the patch. We need to implement that first. > > > Ah yeah, that thing again.. > (http://www.postgresql.org/message-id/50DF5BA7.6070200@vmware.com) I'm going > to forward-port that patch now, before it's forgotten again. It's not clear > to me what the holdup was on this, but whatever the bigger patch we've been > waiting for is, it can just as well be done on top of the forward-port. I posted the patch to that thread. Regards, -- Fujii Masao
On 15 February 2013 17:07, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: >> Unfortunately in HEAD, xxx.done file is not created when restoring >> archived >> file because of absence of the patch. We need to implement that first. > > > Ah yeah, that thing again.. > (http://www.postgresql.org/message-id/50DF5BA7.6070200@vmware.com) I'm going > to forward-port that patch now, before it's forgotten again. It's not clear > to me what the holdup was on this, but whatever the bigger patch we've been > waiting for is, it can just as well be done on top of the forward-port. Agreed. I wouldn't wait for a better version now. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 15.02.2013 19:16, Fujii Masao wrote: > On Sat, Feb 16, 2013 at 2:07 AM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: >> On 15.02.2013 18:10, Fujii Masao wrote: >>> >>> At least in 9.2, when the archived file is restored into pg_xlog, its >>> xxx.done >>> archive status file is created. So we don't need to check >>> InArchiveRecovery >>> when deleting old WAL files. Checking whether xxx.done exists is enough. >> >> Hmm, what about streamed WAL files? I guess we could go back to the pre-9.2 >> coding, and check WalRcvInProgress(). But I didn't actually like that too >> much, it seems rather random that old streamed files are recycled when wal >> receiver is running at the time of restartpoint, and otherwise not. Because >> whether wal receiver is running at the time the restartpoint happens has >> little to do with which files were created by streaming replication. With >> the right pattern of streaming files from the master, but always being >> teporarily disconnected when the restartpoint runs, you could still >> accumulate WAL files infinitely. > > Walreceiver always creates .done file when it closes the > already-flushed WAL file > and switches WAL file to next. So we also don't need to check > WalRcvInProgress(). Ah, I missed that part of the patch. Okay, agreed, that's a better fix. I committed your forward-port of the 9.2 patch to master, reverted my earlier fix for this bug, and simply removed the InArchiveRecovery/ArchiveRecoveryInProgress()/RecoveryInProgress() condition from RemoveOldXlogFiles(). - Heikki
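[Editorial note] After the change described in the message above, the recovery-state condition is gone and the archive status of each segment is the only gate. A minimal sketch of that rule follows, under stated assumptions: `ArchiveStatus` and `segment_removable` are hypothetical names, the enum models the presence of the `.done` and `.ready` status files mentioned in the thread, and the real `XLogArchiveCheckDone()` does more work than this pure function.

```c
#include <stdbool.h>

/* Hypothetical model of a segment's archive status file. */
typedef enum {
    ARCHIVE_STATUS_NONE,   /* neither status file exists           */
    ARCHIVE_STATUS_READY,  /* .ready: still waiting to be archived */
    ARCHIVE_STATUS_DONE    /* .done: archived, restored from the
                            * archive, or received by walreceiver  */
} ArchiveStatus;

/*
 * Decide whether an old segment may be removed or recycled.
 * No recovery-state input at all: crash recovery, archive recovery and
 * normal operation are all treated the same way in this sketch.
 */
bool segment_removable(bool archiving_enabled, ArchiveStatus status)
{
    if (!archiving_enabled)
        return true;  /* nothing is waiting for the archiver */
    /* A missing status file is treated as "not archived yet" here. */
    return status == ARCHIVE_STATUS_DONE;
}
```

Under this rule the segments from the original report, which were all still marked as waiting for the failing archive_command, would survive the end-of-recovery checkpoint.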
On 15 February 2013 16:10, Fujii Masao <masao.fujii@gmail.com> wrote: >> I propose the attached patch to fix it. > > At least in 9.2, when the archived file is restored into pg_xlog, its xxx.done > archive status file is created. So we don't need to check InArchiveRecovery > when deleting old WAL files. Checking whether xxx.done exists is enough. I don't agree. The extra test Heikki put in was useful and helps avoid issues when we get the .done creation wrong, or when people delete them. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Just a quick top-post to thank you all for this fix guys !

Cheers,

On 15/02/2013 18:43, Heikki Linnakangas wrote:
> On 15.02.2013 19:16, Fujii Masao wrote:
>> On Sat, Feb 16, 2013 at 2:07 AM, Heikki Linnakangas
>> <hlinnakangas@vmware.com> wrote:
>>> On 15.02.2013 18:10, Fujii Masao wrote:
>>>>
>>>> At least in 9.2, when the archived file is restored into
>>>> pg_xlog, its xxx.done archive status file is created. So we
>>>> don't need to check InArchiveRecovery when deleting old WAL
>>>> files. Checking whether xxx.done exists is enough.
>>>
>>> Hmm, what about streamed WAL files? I guess we could go back to
>>> the pre-9.2 coding, and check WalRcvInProgress(). But I didn't
>>> actually like that too much, it seems rather random that old
>>> streamed files are recycled when wal receiver is running at the
>>> time of restartpoint, and otherwise not. Because whether wal
>>> receiver is running at the time the restartpoint happens has
>>> little to do with which files were created by streaming
>>> replication. With the right pattern of streaming files from the
>>> master, but always being teporarily disconnected when the
>>> restartpoint runs, you could still accumulate WAL files
>>> infinitely.
>>
>> Walreceiver always creates .done file when it closes the
>> already-flushed WAL file and switches WAL file to next. So we
>> also don't need to check WalRcvInProgress().
>
> Ah, I missed that part of the patch.
>
> Okay, agreed, that's a better fix. I committed your forward-port of
> the 9.2 patch to master, reverted my earlier fix for this bug, and
> simply removed the
> InArchiveRecovery/ArchiveRecoveryInProgress()/RecoveryInProgress()
> condition from RemoveOldXlogFiles().
>
> - Heikki

--
Jehan-Guillaume de Rorthais
http://www.dalibo.com
On Fri, Feb 15, 2013 at 9:29 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 15 February 2013 17:07, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > >>> Unfortunately in HEAD, xxx.done file is not created when restoring >>> archived >>> file because of absence of the patch. We need to implement that first. >> >> >> Ah yeah, that thing again.. >> (http://www.postgresql.org/message-id/50DF5BA7.6070200@vmware.com) I'm going >> to forward-port that patch now, before it's forgotten again. It's not clear >> to me what the holdup was on this, but whatever the bigger patch we've been >> waiting for is, it can just as well be done on top of the forward-port. > > Agreed. I wouldn't wait for a better version now. Related to this, how is this going to affect point releases, and are there any lingering doubts about the mechanism of the fix? This is quite serious given my reliance on archiving, so unless the thinking for point releases is 'real soon' I must backpatch and release it on my own accord until then. Thanks for the attention paid to the bug report, as always. -- fdr
On 21.02.2013 02:59, Daniel Farina wrote:
> On Fri, Feb 15, 2013 at 9:29 AM, Simon Riggs<simon@2ndquadrant.com> wrote:
>> On 15 February 2013 17:07, Heikki Linnakangas<hlinnakangas@vmware.com> wrote:
>>
>>>> Unfortunately in HEAD, xxx.done file is not created when restoring
>>>> archived
>>>> file because of absence of the patch. We need to implement that first.
>>>
>>>
>>> Ah yeah, that thing again..
>>> (http://www.postgresql.org/message-id/50DF5BA7.6070200@vmware.com) I'm going
>>> to forward-port that patch now, before it's forgotten again. It's not clear
>>> to me what the holdup was on this, but whatever the bigger patch we've been
>>> waiting for is, it can just as well be done on top of the forward-port.
>>
>> Agreed. I wouldn't wait for a better version now.
>
> Related to this, how is this going to affect point releases, and are
> there any lingering doubts about the mechanism of the fix?

Are you talking about the patch to prevent restored WAL segments from being re-archived (commit 6f4b8a4f4f7a2d683ff79ab59d3693714b965e3d), or the bug that unarchived WALs were deleted after crash (commit b5ec56f664fa20d80fe752de494ec96362eff520)? The former was included in 9.2.0 already, and the latter will be included in the next point release.

I have no lingering doubts about this. There were some plans to do bigger changes for the re-archiving issue (6f4b8a4f4f7a2d683ff79ab59d3693714b965e3d), which is why it was initially left out from master. But that didn't happen, and I believe everyone is happy with the current state of things.

> This is
> quite serious given my reliance on archiving, so unless the thinking
> for point releases is 'real soon' I must backpatch and release it on
> my own accord until then.

I don't know what the release schedule is. I take that to be a request to put out a new minor release ASAP.

- Heikki
On Thu, Feb 21, 2013 at 12:39 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > On 21.02.2013 02:59, Daniel Farina wrote: >> >> On Fri, Feb 15, 2013 at 9:29 AM, Simon Riggs<simon@2ndquadrant.com> >> wrote: >>> >>> On 15 February 2013 17:07, Heikki Linnakangas<hlinnakangas@vmware.com> >>> wrote: >>> >>>>> Unfortunately in HEAD, xxx.done file is not created when restoring >>>>> archived >>>>> file because of absence of the patch. We need to implement that first. >>>> >>>> >>>> >>>> Ah yeah, that thing again.. >>>> (http://www.postgresql.org/message-id/50DF5BA7.6070200@vmware.com) I'm >>>> going >>>> to forward-port that patch now, before it's forgotten again. It's not >>>> clear >>>> to me what the holdup was on this, but whatever the bigger patch we've >>>> been >>>> waiting for is, it can just as well be done on top of the forward-port. >>> >>> >>> Agreed. I wouldn't wait for a better version now. >> >> >> Related to this, how is this going to affect point releases, and are >> there any lingering doubts about the mechanism of the fix? > > > Are you talking about the patch to avoid restored WAL segments from being > re-archived (commit 6f4b8a4f4f7a2d683ff79ab59d3693714b965e3d), or the bug > that that unarchived WALs were deleted after crash (commit > b5ec56f664fa20d80fe752de494ec96362eff520)? The former was included in 9.2.0 > already, and the latter will be included in the next point release. Unarchived WALs being deleted after a crash is the one that worries me. I actually presume re-archivals will happen anyway because I may lose connection to archive storage after the WAL has already been committed, hence b5ec56f664fa20d80fe752de494ec96362eff520. >> This is >> quite serious given my reliance on archiving, so unless the thinking >> for point releases is 'real soon' I must backpatch and release it on >> my own accord until then. > > > I don't know what the release schedule is. I take that to be a request to > put out a new minor release ASAP. 
Perhaps, but it's more of a concrete evaluation of how important archiving is to me and my affiliated operation. An acceptable answer might be "yeah, backpatch if you feel it's that much of a rush." Clearly, my opinion is that a gap in the archives is pretty cringe-inducing. I hit it from an out of disk case, and you'd be surprised (or perhaps not?) how many people like to kill -9 processes on a whim. I already maintain other backpatches (not related to fixes), and this one is only temporary, so it's not too much trouble for me. -- fdr
On Thu, Feb 21, 2013 at 12:39 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > > Are you talking about the patch to avoid restored WAL segments from being > re-archived (commit 6f4b8a4f4f7a2d683ff79ab59d3693714b965e3d), or the bug > that that unarchived WALs were deleted after crash (commit > b5ec56f664fa20d80fe752de494ec96362eff520)? The former was included in 9.2.0 > already, and the latter will be included in the next point release. > ... > > I don't know what the release schedule is. I take that to be a request to > put out a new minor release ASAP. +1 from me. I'm rather uncomfortable running a system with this unarchived deletion bug. Cheers, Jeff