Thread: Unarchived WALs deleted after crash

Unarchived WALs deleted after crash

From
Jehan-Guillaume de Rorthais
Date:
Hi,

I am facing an unexpected behavior on a 9.2.2 cluster that I can
reproduce on current HEAD.

On a cluster with archive enabled but failing, after a crash of
postmaster, the checkpoint occurring before leaving the recovery mode
deletes any additional WALs, even those waiting to be archived.

Because of this, after recovering from the crash, a previous PITR backup
cannot be used to restore the instance to a point in time where archiving
was failing. Any slaves fed by WAL or lagging in SR need to be recreated.

AFAICT, this is not documented and I would expect the WALs to be
archived by the archiver process when the cluster exits the recovery step.

Here is a simple scenario to reproduce this.

Configuration:
 wal_level = archive
 archive_mode = on
 archive_command = '/bin/false'
 log_checkpoints = on

Scenario:
 createdb test
 psql -c 'create table test as select i, md5(i::text) from generate_series(1,3000000) as i;' test
 kill -9 $(head -1 $PGDATA/postmaster.pid)
 pg_ctl start

Using this scenario, the log file shows:
 LOG:  archive command failed with exit code 1
 DETAIL:  The failed archive command was: /bin/false
 WARNING:  transaction log file "000000010000000000000001" could not be archived: too many failures
 LOG:  database system was interrupted; last known up at 2013-02-14 16:12:58 CET
 LOG:  database system was not properly shut down; automatic recovery in progress
 LOG:  crash recovery starts in timeline 1 and has target timeline 1
 LOG:  redo starts at 0/11400078
 LOG:  record with zero length at 0/13397190
 LOG:  redo done at 0/13397160
 LOG:  last completed transaction was at log time 2013-02-14 16:12:58.49303+01
 LOG:  checkpoint starting: end-of-recovery immediate
 LOG:  checkpoint complete: wrote 2869 buffers (17.5%); 0 transaction log file(s) added, 9 removed, 7 recycled; write=0.023 s, sync=0.468 s, total=0.739 s; sync files=2, longest=0.426 s, average=0.234 s
 LOG:  autovacuum launcher started
 LOG:  database system is ready to accept connections
 LOG:  archive command failed with exit code 1
 DETAIL:  The failed archive command was: /bin/false
 LOG:  archive command failed with exit code 1
 DETAIL:  The failed archive command was: /bin/false
 LOG:  archive command failed with exit code 1
 DETAIL:  The failed archive command was: /bin/false
 WARNING:  transaction log file "000000010000000000000011" could not be archived: too many failures

Before the kill, "000000010000000000000001" was the WAL to archive.
After the kill, the checkpoint deleted 9 files before exiting recovery
mode and "000000010000000000000011" became the first WAL to archive.
"000000010000000000000001" through "000000010000000000000010" were
removed or recycled.
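
As a quick sanity check on those numbers, here is a minimal C sketch that parses the two segment names above and counts the segments between them. It assumes the usual 24-hex-digit file name layout (8 digits each for timeline, log id and segment) and that both names share the same timeline and log id, which holds for this example. `parse_wal_name` and `segments_between` are hypothetical helpers written for illustration, not PostgreSQL functions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper type: the three parts of a 24-character WAL file name. */
typedef struct WalName
{
    unsigned int tli;   /* timeline id */
    unsigned int log;   /* log id (high part of the position) */
    unsigned int seg;   /* segment number within the log id */
} WalName;

/* Returns 1 on success, 0 if the name is not 24 uppercase hex digits. */
static int
parse_wal_name(const char *name, WalName *out)
{
    assert(name != NULL && out != NULL);
    if (strlen(name) != 24 || strspn(name, "0123456789ABCDEF") != 24)
        return 0;
    return sscanf(name, "%8X%8X%8X", &out->tli, &out->log, &out->seg) == 3;
}

/* Count segments from first to last inclusive; -1 if they are not comparable. */
static int
segments_between(const char *first, const char *last)
{
    WalName a, b;

    if (!parse_wal_name(first, &a) || !parse_wal_name(last, &b))
        return -1;
    if (a.tli != b.tli || a.log != b.log || b.seg < a.seg)
        return -1;
    return (int) (b.seg - a.seg) + 1;
}
```

Called with "000000010000000000000001" and "000000010000000000000010", `segments_between` returns 16 (the names are hexadecimal), which matches the 9 removed plus 7 recycled files reported by the end-of-recovery checkpoint.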

Is it expected ?
-- 
Jehan-Guillaume de Rorthais
http://www.dalibo.com



Re: Unarchived WALs deleted after crash

From
Daniel Farina
Date:
On Thu, Feb 14, 2013 at 7:45 AM, Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote:
> Hi,
>
> I am facing an unexpected behavior on a 9.2.2 cluster that I can
> reproduce on current HEAD.
>
> On a cluster with archive enabled but failing, after a crash of
> postmaster, the checkpoint occurring before leaving the recovery mode
> deletes any additional WALs, even those waiting to be archived.

I believe I have encountered this recently, but didn't get enough
chance to work with it to correspond.  For me, the cause was
out-of-disk on the file system that exclusively contained WAL,
backlogged because archiving fell behind writing.  This causes the
cluster to crash -- par for the course -- but also an archive gap was
created.  At the time I thought there was some kind of bug in dealing
with out of space issues in the archiver (the .ready bookkeeping), but
the symptoms I saw seem like they might be explained by your report,
too.

--
fdr



Re: Unarchived WALs deleted after crash

From
Heikki Linnakangas
Date:
On 14.02.2013 17:45, Jehan-Guillaume de Rorthais wrote:
> I am facing an unexpected behavior on a 9.2.2 cluster that I can
> reproduce on current HEAD.
>
> On a cluster with archive enabled but failing, after a crash of
> postmaster, the checkpoint occurring before leaving the recovery mode
> deletes any additional WALs, even those waiting to be archived.
 > ...
 > Is it expected ?

No, it's a bug. Ouch. It was introduced in 9.2, by commit
5286105800c7d5902f98f32e11b209c471c0c69c:

> -  /*
> -   * Normally we don't delete old XLOG files during recovery to
> -   * avoid accidentally deleting a file that looks stale due to a
> -   * bug or hardware issue, but in fact contains important data.
> -   * During streaming recovery, however, we will eventually fill the
> -   * disk if we never clean up, so we have to. That's not an issue
> -   * with file-based archive recovery because in that case we
> -   * restore one XLOG file at a time, on-demand, and with a
> -   * different filename that can't be confused with regular XLOG
> -   * files.
> -   */
> -   if (WalRcvInProgress() || XLogArchiveCheckDone(xlde->d_name))
> +   if (RecoveryInProgress() || XLogArchiveCheckDone(xlde->d_name))
>          [ delete the file ]

With that commit, we started to keep WAL segments restored from the
archive in pg_xlog, so we needed to start deleting old segments during
archive recovery, even when streaming replication was not active. But
the above change was too broad; we started to delete old segments also
during crash recovery.

The above should check InArchiveRecovery, i.e. only delete old files when
in archive recovery, not when in crash recovery. But there's one little
complication: InArchiveRecovery is currently only valid in the startup
process, so we'll need to also share it in shared memory, so that the
checkpointer process can access it.

I propose the attached patch to fix it.

- Heikki

Attachment

Re: Unarchived WALs deleted after crash

From
Simon Riggs
Date:
On 15 February 2013 14:31, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> On 14.02.2013 17:45, Jehan-Guillaume de Rorthais wrote:
>>
>> I am facing an unexpected behavior on a 9.2.2 cluster that I can
>> reproduce on current HEAD.
>>
>> On a cluster with archive enabled but failing, after a crash of
>> postmaster, the checkpoint occurring before leaving the recovery mode
>> deletes any additional WALs, even those waiting to be archived.
>
>> ...
>> Is it expected ?
>
> No, it's a bug. Ouch. It was introduced in 9.2, by commit
> 5286105800c7d5902f98f32e11b209c471c0c69c:

Thanks for tracking that down.

>> -  /*
>> -   * Normally we don't delete old XLOG files during recovery to
>> -   * avoid accidentally deleting a file that looks stale due to a
>> -   * bug or hardware issue, but in fact contains important data.
>> -   * During streaming recovery, however, we will eventually fill the
>> -   * disk if we never clean up, so we have to. That's not an issue
>> -   * with file-based archive recovery because in that case we
>> -   * restore one XLOG file at a time, on-demand, and with a
>> -   * different filename that can't be confused with regular XLOG
>> -   * files.
>> -   */
>> -   if (WalRcvInProgress() || XLogArchiveCheckDone(xlde->d_name))
>> +   if (RecoveryInProgress() || XLogArchiveCheckDone(xlde->d_name))
>>          [ delete the file ]
>
>
> With that commit, we started to keep WAL segments restored from the archive
> in pg_xlog, so we needed to start deleting old segments during archive
> recovery, even when streaming replication was not active. But the above
> change was too broad; we started to delete old segments also during crash
> recovery.
>
> The above should check InArchiveRecovery, i.e. only delete old files when in
> archive recovery, not when in crash recovery. But there's one little
> complication: InArchiveRecovery is currently only valid in the startup
> process, so we'll need to also share it in shared memory, so that the
> checkpointer process can access it.
>
> I propose the attached patch to fix it.

Agree with your diagnosis and fix.

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Unarchived WALs deleted after crash

From
Heikki Linnakangas
Date:
On 15.02.2013 17:12, Simon Riggs wrote:
> On 15 February 2013 14:31, Heikki Linnakangas<hlinnakangas@vmware.com>  wrote:
>>> -  /*
>>> -   * Normally we don't delete old XLOG files during recovery to
>>> -   * avoid accidentally deleting a file that looks stale due to a
>>> -   * bug or hardware issue, but in fact contains important data.
>>> -   * During streaming recovery, however, we will eventually fill the
>>> -   * disk if we never clean up, so we have to. That's not an issue
>>> -   * with file-based archive recovery because in that case we
>>> -   * restore one XLOG file at a time, on-demand, and with a
>>> -   * different filename that can't be confused with regular XLOG
>>> -   * files.
>>> -   */
>>> -   if (WalRcvInProgress() || XLogArchiveCheckDone(xlde->d_name))
>>> +   if (RecoveryInProgress() || XLogArchiveCheckDone(xlde->d_name))
>>>           [ delete the file ]
>>
>> With that commit, we started to keep WAL segments restored from the archive
>> in pg_xlog, so we needed to start deleting old segments during archive
>> recovery, even when streaming replication was not active. But the above
>> change was too broad; we started to delete old segments also during crash
>> recovery.
>>
>> The above should check InArchiveRecovery, i.e. only delete old files when in
>> archive recovery, not when in crash recovery. But there's one little
>> complication: InArchiveRecovery is currently only valid in the startup
>> process, so we'll need to also share it in shared memory, so that the
>> checkpointer process can access it.
>>
>> I propose the attached patch to fix it.
>
> Agree with your diagnosis and fix.

Ok, committed. For the sake of the archives, attached is a script based
on Jehan-Guillaume's description that I used for testing (incidentally
based on Kyotaro's script to reproduce an unrelated problem in another
thread).

Thanks for the report!

- Heikki

Attachment

Re: Unarchived WALs deleted after crash

From
Fujii Masao
Date:
On Fri, Feb 15, 2013 at 11:31 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 14.02.2013 17:45, Jehan-Guillaume de Rorthais wrote:
>>
>> I am facing an unexpected behavior on a 9.2.2 cluster that I can
>> reproduce on current HEAD.
>>
>> On a cluster with archive enabled but failing, after a crash of
>> postmaster, the checkpoint occurring before leaving the recovery mode
>> deletes any additional WALs, even those waiting to be archived.
>
>> ...
>> Is it expected ?
>
> No, it's a bug. Ouch. It was introduced in 9.2, by commit
> 5286105800c7d5902f98f32e11b209c471c0c69c:

Oh, sorry for my mistake.

>
>> -  /*
>> -   * Normally we don't delete old XLOG files during recovery to
>> -   * avoid accidentally deleting a file that looks stale due to a
>> -   * bug or hardware issue, but in fact contains important data.
>> -   * During streaming recovery, however, we will eventually fill the
>> -   * disk if we never clean up, so we have to. That's not an issue
>> -   * with file-based archive recovery because in that case we
>> -   * restore one XLOG file at a time, on-demand, and with a
>> -   * different filename that can't be confused with regular XLOG
>> -   * files.
>> -   */
>> -   if (WalRcvInProgress() || XLogArchiveCheckDone(xlde->d_name))
>> +   if (RecoveryInProgress() || XLogArchiveCheckDone(xlde->d_name))
>>          [ delete the file ]
>
>
> With that commit, we started to keep WAL segments restored from the archive
> in pg_xlog, so we needed to start deleting old segments during archive
> recovery, even when streaming replication was not active. But the above
> change was too broad; we started to delete old segments also during crash
> recovery.
>
> The above should check InArchiveRecovery, i.e. only delete old files when in
> archive recovery, not when in crash recovery. But there's one little
> complication: InArchiveRecovery is currently only valid in the startup
> process, so we'll need to also share it in shared memory, so that the
> checkpointer process can access it.
>
> I propose the attached patch to fix it.

At least in 9.2, when the archived file is restored into pg_xlog, its xxx.done
archive status file is created. So we don't need to check InArchiveRecovery
when deleting old WAL files. Checking whether xxx.done exists is enough.
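
As an illustration of the idea, here is a small C sketch of a marker-file check. It assumes the status files live in a directory such as pg_xlog/archive_status and are named after the segment with a .done suffix. The helpers `done_path`, `mark_done` and `segment_is_done` are hypothetical; the real check is XLogArchiveCheckDone(), which handles more cases than this model does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>

/* Build "<status_dir>/<segname>.done"; returns false if it does not fit. */
static bool
done_path(char *buf, size_t len, const char *status_dir, const char *segname)
{
    int n;

    assert(buf != NULL && status_dir != NULL && segname != NULL);
    n = snprintf(buf, len, "%s/%s.done", status_dir, segname);
    return n > 0 && (size_t) n < len;
}

/* Create the marker, as would happen after a restore from the archive. */
static bool
mark_done(const char *status_dir, const char *segname)
{
    char path[1024];
    FILE *f;

    if (!done_path(path, sizeof(path), status_dir, segname))
        return false;
    f = fopen(path, "w");
    if (f == NULL)
        return false;
    return fclose(f) == 0;
}

/* In this model a segment may be removed only if its .done marker exists. */
static bool
segment_is_done(const char *status_dir, const char *segname)
{
    char path[1024];
    struct stat st;

    if (!done_path(path, sizeof(path), status_dir, segname))
        return false;
    return stat(path, &st) == 0;
}
```

With a check of this shape, a segment that was never archived and never restored has no .done marker, so it is kept regardless of whether the server is in recovery.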

Unfortunately in HEAD, xxx.done file is not created when restoring archived
file because of absence of the patch. We need to implement that first.

Regards,

-- 
Fujii Masao



Re: Unarchived WALs deleted after crash

From
Heikki Linnakangas
Date:
On 15.02.2013 18:10, Fujii Masao wrote:
> On Fri, Feb 15, 2013 at 11:31 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com>  wrote:
>>> -  /*
>>> -   * Normally we don't delete old XLOG files during recovery to
>>> -   * avoid accidentally deleting a file that looks stale due to a
>>> -   * bug or hardware issue, but in fact contains important data.
>>> -   * During streaming recovery, however, we will eventually fill the
>>> -   * disk if we never clean up, so we have to. That's not an issue
>>> -   * with file-based archive recovery because in that case we
>>> -   * restore one XLOG file at a time, on-demand, and with a
>>> -   * different filename that can't be confused with regular XLOG
>>> -   * files.
>>> -   */
>>> -   if (WalRcvInProgress() || XLogArchiveCheckDone(xlde->d_name))
>>> +   if (RecoveryInProgress() || XLogArchiveCheckDone(xlde->d_name))
>>>           [ delete the file ]
>>
>> With that commit, we started to keep WAL segments restored from the archive
>> in pg_xlog, so we needed to start deleting old segments during archive
>> recovery, even when streaming replication was not active. But the above
>> change was too broad; we started to delete old segments also during crash
>> recovery.
>>
>> The above should check InArchiveRecovery, i.e. only delete old files when in
>> archive recovery, not when in crash recovery. But there's one little
>> complication: InArchiveRecovery is currently only valid in the startup
>> process, so we'll need to also share it in shared memory, so that the
>> checkpointer process can access it.
>>
>> I propose the attached patch to fix it.
>
> At least in 9.2, when the archived file is restored into pg_xlog, its xxx.done
> archive status file is created. So we don't need to check InArchiveRecovery
> when deleting old WAL files. Checking whether xxx.done exists is enough.

Hmm, what about streamed WAL files? I guess we could go back to the 
pre-9.2 coding, and check WalRcvInProgress(). But I didn't actually like 
that too much, it seems rather random that old streamed files are 
recycled when wal receiver is running at the time of restartpoint, and 
otherwise not. Because whether wal receiver is running at the time the 
restartpoint happens has little to do with which files were created by 
streaming replication. With the right pattern of streaming files from 
the master, but always being temporarily disconnected when the 
restartpoint runs, you could still accumulate WAL files infinitely.
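
The accumulation argument can be sketched with a toy model in C. It assumes a fixed number of segments is streamed between restartpoints and that a restartpoint cleans up every old segment when the WAL receiver is running, and none otherwise. Both are simplifications, and `retained_after` is a hypothetical function written only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model of the pre-9.2 rule for streamed segments: at each restartpoint,
 * old segments are cleaned up only if the WAL receiver happens to be running.
 * Returns the number of old segments left on disk after the last restartpoint.
 */
static unsigned long
retained_after(const bool *receiver_running_at_restartpoint,
               size_t restartpoints,
               unsigned long segments_streamed_between)
{
    unsigned long retained = 0;
    size_t i;

    assert(receiver_running_at_restartpoint != NULL || restartpoints == 0);
    for (i = 0; i < restartpoints; i++)
    {
        /* Segments arrive while the standby is connected ... */
        retained += segments_streamed_between;
        /* ... but are only cleaned up if the receiver is up right now. */
        if (receiver_running_at_restartpoint[i])
            retained = 0;
    }
    return retained;
}
```

If the receiver is never running at any of four restartpoints and three segments arrive in between each, the model retains 12 segments, and the count keeps growing with every further restartpoint.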

> Unfortunately in HEAD, xxx.done file is not created when restoring archived
> file because of absence of the patch. We need to implement that first.

Ah yeah, that thing again.. 
(http://www.postgresql.org/message-id/50DF5BA7.6070200@vmware.com) I'm 
going to forward-port that patch now, before it's forgotten again. It's 
not clear to me what the holdup was on this, but whatever the bigger 
patch we've been waiting for is, it can just as well be done on top of 
the forward-port.

- Heikki



Re: Unarchived WALs deleted after crash

From
Fujii Masao
Date:
On Sat, Feb 16, 2013 at 2:07 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 15.02.2013 18:10, Fujii Masao wrote:
>>
>> On Fri, Feb 15, 2013 at 11:31 PM, Heikki Linnakangas
>> <hlinnakangas@vmware.com>  wrote:
>>>>
>>>> -  /*
>>>>
>>>> -   * Normally we don't delete old XLOG files during recovery to
>>>> -   * avoid accidentally deleting a file that looks stale due to a
>>>> -   * bug or hardware issue, but in fact contains important data.
>>>> -   * During streaming recovery, however, we will eventually fill the
>>>> -   * disk if we never clean up, so we have to. That's not an issue
>>>> -   * with file-based archive recovery because in that case we
>>>> -   * restore one XLOG file at a time, on-demand, and with a
>>>> -   * different filename that can't be confused with regular XLOG
>>>> -   * files.
>>>> -   */
>>>> -   if (WalRcvInProgress() || XLogArchiveCheckDone(xlde->d_name))
>>>> +   if (RecoveryInProgress() || XLogArchiveCheckDone(xlde->d_name))
>>>>           [ delete the file ]
>>>
>>>
>>> With that commit, we started to keep WAL segments restored from the
>>> archive
>>> in pg_xlog, so we needed to start deleting old segments during archive
>>> recovery, even when streaming replication was not active. But the above
>>> change was too broad; we started to delete old segments also during crash
>>> recovery.
>>>
>>> The above should check InArchiveRecovery, i.e. only delete old files when
>>> in
>>> archive recovery, not when in crash recovery. But there's one little
>>> complication: InArchiveRecovery is currently only valid in the startup
>>> process, so we'll need to also share it in shared memory, so that the
>>> checkpointer process can access it.
>>>
>>> I propose the attached patch to fix it.
>>
>>
>> At least in 9.2, when the archived file is restored into pg_xlog, its
>> xxx.done
>> archive status file is created. So we don't need to check
>> InArchiveRecovery
>> when deleting old WAL files. Checking whether xxx.done exists is enough.
>
>
> Hmm, what about streamed WAL files? I guess we could go back to the pre-9.2
> coding, and check WalRcvInProgress(). But I didn't actually like that too
> much, it seems rather random that old streamed files are recycled when wal
> receiver is running at the time of restartpoint, and otherwise not. Because
> whether wal receiver is running at the time the restartpoint happens has
> little to do with which files were created by streaming replication. With
> the right pattern of streaming files from the master, but always being
> temporarily disconnected when the restartpoint runs, you could still
> accumulate WAL files infinitely.

Walreceiver always creates a .done file when it closes the already-flushed
WAL file and switches to the next one. So we also don't need to check
WalRcvInProgress().

>> Unfortunately in HEAD, xxx.done file is not created when restoring
>> archived
>> file because of absence of the patch. We need to implement that first.
>
>
> Ah yeah, that thing again..
> (http://www.postgresql.org/message-id/50DF5BA7.6070200@vmware.com) I'm going
> to forward-port that patch now, before it's forgotten again. It's not clear
> to me what the holdup was on this, but whatever the bigger patch we've been
> waiting for is, it can just as well be done on top of the forward-port.

I posted the patch to that thread.

Regards,

-- 
Fujii Masao



Re: Unarchived WALs deleted after crash

From
Simon Riggs
Date:
On 15 February 2013 17:07, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

>> Unfortunately in HEAD, xxx.done file is not created when restoring
>> archived
>> file because of absence of the patch. We need to implement that first.
>
>
> Ah yeah, that thing again..
> (http://www.postgresql.org/message-id/50DF5BA7.6070200@vmware.com) I'm going
> to forward-port that patch now, before it's forgotten again. It's not clear
> to me what the holdup was on this, but whatever the bigger patch we've been
> waiting for is, it can just as well be done on top of the forward-port.

Agreed. I wouldn't wait for a better version now.

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Unarchived WALs deleted after crash

From
Heikki Linnakangas
Date:
On 15.02.2013 19:16, Fujii Masao wrote:
> On Sat, Feb 16, 2013 at 2:07 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com>  wrote:
>> On 15.02.2013 18:10, Fujii Masao wrote:
>>>
>>> At least in 9.2, when the archived file is restored into pg_xlog, its
>>> xxx.done
>>> archive status file is created. So we don't need to check
>>> InArchiveRecovery
>>> when deleting old WAL files. Checking whether xxx.done exists is enough.
>>
>> Hmm, what about streamed WAL files? I guess we could go back to the pre-9.2
>> coding, and check WalRcvInProgress(). But I didn't actually like that too
>> much, it seems rather random that old streamed files are recycled when wal
>> receiver is running at the time of restartpoint, and otherwise not. Because
>> whether wal receiver is running at the time the restartpoint happens has
>> little to do with which files were created by streaming replication. With
>> the right pattern of streaming files from the master, but always being
>> temporarily disconnected when the restartpoint runs, you could still
>> accumulate WAL files infinitely.
>
> Walreceiver always creates .done file when it closes the
> already-flushed WAL file
> and switches WAL file to next. So we also don't  need to check
> WalRcvInProgress().

Ah, I missed that part of the patch.

Okay, agreed, that's a better fix. I committed your forward-port of the 
9.2 patch to master, reverted my earlier fix for this bug, and simply 
removed the 
InArchiveRecovery/ArchiveRecoveryInProgress()/RecoveryInProgress() 
condition from RemoveOldXlogFiles().
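
A rough model of the resulting rule, as a self-contained C sketch: a segment older than the cutoff is removable only when its archive status is done, with no recovery-related shortcut. The `Segment` struct and `count_removable` are hypothetical names for illustration, and the sketch assumes every name is a 24-character WAL file name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory view of one file in pg_xlog. */
typedef struct Segment
{
    const char *name;   /* 24-character WAL file name */
    bool        done;   /* stands in for XLogArchiveCheckDone(name) */
} Segment;

/*
 * Count the segments a checkpoint or restartpoint could remove or recycle:
 * those sorting at or before last_removable whose archive status is done.
 * Whether recovery is in progress plays no role in this model.
 */
static size_t
count_removable(const Segment *segs, size_t nsegs, const char *last_removable)
{
    size_t n = 0;
    size_t i;

    assert(last_removable != NULL);
    assert(segs != NULL || nsegs == 0);
    for (i = 0; i < nsegs; i++)
    {
        /* Compare the log/segment part only, skipping the 8-digit timeline. */
        if (strcmp(segs[i].name + 8, last_removable + 8) <= 0 && segs[i].done)
            n++;
    }
    return n;
}
```

Applied to the original report, none of the segments waiting on the failing archive_command would have done status, so the count is zero and they all survive the end-of-recovery checkpoint.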

- Heikki



Re: Unarchived WALs deleted after crash

From
Simon Riggs
Date:
On 15 February 2013 16:10, Fujii Masao <masao.fujii@gmail.com> wrote:

>> I propose the attached patch to fix it.
>
> At least in 9.2, when the archived file is restored into pg_xlog, its xxx.done
> archive status file is created. So we don't need to check InArchiveRecovery
> when deleting old WAL files. Checking whether xxx.done exists is enough.

I don't agree. The extra test Heikki put in was useful and helps avoid
issues when we get the .done creation wrong, or when people delete
them.

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Unarchived WALs deleted after crash

From
Jehan-Guillaume de Rorthais
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Just a quick top-post to thank you all for this fix guys !

Cheers,

On 15/02/2013 18:43, Heikki Linnakangas wrote:
> On 15.02.2013 19:16, Fujii Masao wrote:
>> On Sat, Feb 16, 2013 at 2:07 AM, Heikki Linnakangas 
>> <hlinnakangas@vmware.com>  wrote:
>>> On 15.02.2013 18:10, Fujii Masao wrote:
>>>> 
>>>> At least in 9.2, when the archived file is restored into
>>>> pg_xlog, its xxx.done archive status file is created. So we
>>>> don't need to check InArchiveRecovery when deleting old WAL
>>>> files. Checking whether xxx.done exists is enough.
>>> 
>>> Hmm, what about streamed WAL files? I guess we could go back to
>>> the pre-9.2 coding, and check WalRcvInProgress(). But I didn't
>>> actually like that too much, it seems rather random that old
>>> streamed files are recycled when wal receiver is running at the
>>> time of restartpoint, and otherwise not. Because whether wal
>>> receiver is running at the time the restartpoint happens has 
>>> little to do with which files were created by streaming
>>> replication. With the right pattern of streaming files from the
>>> master, but always being teporarily disconnected when the
>>> restartpoint runs, you could still accumulate WAL files
>>> infinitely.
>> 
>> Walreceiver always creates .done file when it closes the 
>> already-flushed WAL file and switches WAL file to next. So we
>> also don't  need to check WalRcvInProgress().
> 
> Ah, I missed that part of the patch.
> 
> Okay, agreed, that's a better fix. I committed your forward-port of
> the 9.2 patch to master, reverted my earlier fix for this bug, and
> simply removed the 
> InArchiveRecovery/ArchiveRecoveryInProgress()/RecoveryInProgress() 
> condition from RemoveOldXlogFiles().
> 
> - Heikki

- -- 
Jehan-Guillaume de Rorthais
http://www.dalibo.com
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAlEk5hwACgkQXu9L1HbaT6JZ3wCg4h7QT+wRMT8KZAA/PjOjZcCV
CS4AnRFeGdXIgklo1/RD2hi+e98pNBEe
=voW3
-----END PGP SIGNATURE-----



Re: Unarchived WALs deleted after crash

From
Daniel Farina
Date:
On Fri, Feb 15, 2013 at 9:29 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 15 February 2013 17:07, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>
>>> Unfortunately in HEAD, xxx.done file is not created when restoring
>>> archived
>>> file because of absence of the patch. We need to implement that first.
>>
>>
>> Ah yeah, that thing again..
>> (http://www.postgresql.org/message-id/50DF5BA7.6070200@vmware.com) I'm going
>> to forward-port that patch now, before it's forgotten again. It's not clear
>> to me what the holdup was on this, but whatever the bigger patch we've been
>> waiting for is, it can just as well be done on top of the forward-port.
>
> Agreed. I wouldn't wait for a better version now.

Related to this, how is this going to affect point releases, and are
there any lingering doubts about the mechanism of the fix?  This is
quite serious given my reliance on archiving, so unless the thinking
for point releases is 'real soon' I must backpatch and release it on
my own accord until then.

Thanks for the attention paid to the bug report, as always.

--
fdr



Re: Unarchived WALs deleted after crash

From
Heikki Linnakangas
Date:
On 21.02.2013 02:59, Daniel Farina wrote:
> On Fri, Feb 15, 2013 at 9:29 AM, Simon Riggs<simon@2ndquadrant.com>  wrote:
>> On 15 February 2013 17:07, Heikki Linnakangas<hlinnakangas@vmware.com>  wrote:
>>
>>>> Unfortunately in HEAD, xxx.done file is not created when restoring
>>>> archived
>>>> file because of absence of the patch. We need to implement that first.
>>>
>>>
>>> Ah yeah, that thing again..
>>> (http://www.postgresql.org/message-id/50DF5BA7.6070200@vmware.com) I'm going
>>> to forward-port that patch now, before it's forgotten again. It's not clear
>>> to me what the holdup was on this, but whatever the bigger patch we've been
>>> waiting for is, it can just as well be done on top of the forward-port.
>>
>> Agreed. I wouldn't wait for a better version now.
>
> Related to this, how is this going to affect point releases, and are
> there any lingering doubts about the mechanism of the fix?

Are you talking about the patch to avoid restored WAL segments from 
being re-archived (commit 6f4b8a4f4f7a2d683ff79ab59d3693714b965e3d), or 
the bug that unarchived WALs were deleted after a crash (commit 
b5ec56f664fa20d80fe752de494ec96362eff520)? The former was included in 
9.2.0 already, and the latter will be included in the next point release.

I have no lingering doubts about this. There were some plans to do bigger 
changes for the re-archiving issue 
(6f4b8a4f4f7a2d683ff79ab59d3693714b965e3d), which is why it was 
initially left out from master. But that didn't happen, and I believe 
everyone is happy with the current state of things.

> This is
> quite serious given my reliance on archiving, so unless the thinking
> for point releases is 'real soon' I must backpatch and release it on
> my own accord until then.

I don't know what the release schedule is. I take that to be a request 
to put out a new minor release ASAP.

- Heikki



Re: Unarchived WALs deleted after crash

From
Daniel Farina
Date:
On Thu, Feb 21, 2013 at 12:39 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 21.02.2013 02:59, Daniel Farina wrote:
>>
>> On Fri, Feb 15, 2013 at 9:29 AM, Simon Riggs<simon@2ndquadrant.com>
>> wrote:
>>>
>>> On 15 February 2013 17:07, Heikki Linnakangas<hlinnakangas@vmware.com>
>>> wrote:
>>>
>>>>> Unfortunately in HEAD, xxx.done file is not created when restoring
>>>>> archived
>>>>> file because of absence of the patch. We need to implement that first.
>>>>
>>>>
>>>>
>>>> Ah yeah, that thing again..
>>>> (http://www.postgresql.org/message-id/50DF5BA7.6070200@vmware.com) I'm
>>>> going
>>>> to forward-port that patch now, before it's forgotten again. It's not
>>>> clear
>>>> to me what the holdup was on this, but whatever the bigger patch we've
>>>> been
>>>> waiting for is, it can just as well be done on top of the forward-port.
>>>
>>>
>>> Agreed. I wouldn't wait for a better version now.
>>
>>
>> Related to this, how is this going to affect point releases, and are
>> there any lingering doubts about the mechanism of the fix?
>
>
> Are you talking about the patch to avoid restored WAL segments from being
> re-archived (commit 6f4b8a4f4f7a2d683ff79ab59d3693714b965e3d), or the bug
> that unarchived WALs were deleted after crash (commit
> b5ec56f664fa20d80fe752de494ec96362eff520)? The former was included in 9.2.0
> already, and the latter will be included in the next point release.

Unarchived WALs being deleted after a crash is the one that worries
me.  I actually presume re-archivals will happen anyway because I may
lose connection to archive storage after the WAL has already been
committed, hence b5ec56f664fa20d80fe752de494ec96362eff520.

>> This is
>> quite serious given my reliance on archiving, so unless the thinking
>> for point releases is 'real soon' I must backpatch and release it on
>> my own accord until then.
>
>
> I don't know what the release schedule is. I take that to be a request to
> put out a new minor release ASAP.

Perhaps, but it's more of a concrete evaluation of how important
archiving is to me and my affiliated operation.  An acceptable answer
might be "yeah, backpatch if you feel it's that much of a rush."
Clearly, my opinion is that a gap in the archives is pretty
cringe-inducing.  I hit it from an out of disk case, and you'd be
surprised (or perhaps not?) how many people like to kill -9 processes
on a whim.

I already maintain other backpatches (not related to fixes), and this
one is only temporary, so it's not too much trouble for me.

--
fdr



Re: Unarchived WALs deleted after crash

From
Jeff Janes
Date:
On Thu, Feb 21, 2013 at 12:39 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
>
> Are you talking about the patch to avoid restored WAL segments from being
> re-archived (commit 6f4b8a4f4f7a2d683ff79ab59d3693714b965e3d), or the bug
> that unarchived WALs were deleted after crash (commit
> b5ec56f664fa20d80fe752de494ec96362eff520)? The former was included in 9.2.0
> already, and the latter will be included in the next point release.
>
...
>
> I don't know what the release schedule is. I take that to be a request to
> put out a new minor release ASAP.

+1 from me.  I'm rather uncomfortable running a system with this
unarchived deletion bug.

Cheers,

Jeff