Re: 9.2.3 crashes during archive recovery - Mailing list pgsql-hackers

From KONDO Mitsumasa
Subject Re: 9.2.3 crashes during archive recovery
Date
Msg-id 51398B8E.5060803@lab.ntt.co.jp
In response to Re: 9.2.3 crashes during archive recovery  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses Re: 9.2.3 crashes during archive recovery  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
List pgsql-hackers
(2013/03/07 19:41), Heikki Linnakangas wrote:
> On 07.03.2013 10:05, KONDO Mitsumasa wrote:
>> (2013/03/06 16:50), Heikki Linnakangas wrote:
>>> Yeah. That fix isn't right, though; XLogPageRead() is supposed to
>>> return true on success, and false on error, and the patch makes it
>>> return 'true' on error, if archive recovery was requested but we're
>>> still in crash recovery. The real issue here is that I missed the two
>>> "return NULL;"s in ReadRecord(), so the code that I put in the
>>> next_record_is_invalid codepath isn't run if XLogPageRead() doesn't
>>> find the file at all. Attached patch is the proper fix for this.
>>>
>> Thanks for creating the patch! I tested your patch on 9.2_STABLE, but it does
>> not work with the promote command...
>> When XLogPageRead() returns false, it means the end of the standby loop,
>> the crash recovery loop, and the archive recovery loop.
>> Your patch does not work for promoting a standby to master: it never leaves
>> the standby loop.
>>
>> So I made a new patch based on Heikki's and Horiguchi's patches.
>
> Ah, I see. I committed a slightly modified version of this.
I find your modified version easy to read. Thank you for revising and committing the patch!
>>>> I also found a bug in the latest 9.2_stable. It does not fetch the latest
>>>> timeline and recovery history file during archive recovery when the master
>>>> and standby timelines differ.
>>>
>>> Works for me.. Can you create a test script for that? Remember to set
>>> "recovery_target_timeline='latest'".
>> ...
>> It can be reproduced in my test script, too.
>
> I see the problem now, with that script. So what happens is that the startup
> process first scans the timeline history files to choose the recovery target
> timeline. For that scan, I temporarily set InArchiveRecovery=true, in
> readRecoveryCommandFile. However, after readRecoveryCommandFile returns, we
> then try to read the timeline history file corresponding to the chosen
> recovery target timeline, but InArchiveRecovery is no longer set, so we don't
> fetch the file from archive, and return a "dummy" history, with just the
> target timeline in it. That doesn't contain the older timeline, so you get an
> error at recovery.
>
> Fixed per your patch to check for ArchiveRecoveryRequested instead of
> InArchiveRecovery, when reading timeline history files. This also makes it
> unnecessary to temporarily set InArchiveRecovery=true, so removed that.
>
> Committed both fixes. Please confirm that this fixed the problem in your test
> environment. Many thanks for the testing and the patches!
 
I understand the problem now. Thank you for the fix and the detailed explanation! I tested the latest REL9_2_STABLE on
my system and confirmed that it runs without problems. If I find another problem, I will report it and send you a patch
and a test script!
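
To restate the timeline history problem in a compact form, here is a minimal toy model in C. It is not PostgreSQL source code: only the two flag names, ArchiveRecoveryRequested and InArchiveRecovery, come from the discussion above. The TimelineHistory struct, fetch_history_from_archive(), read_history(), history_contains(), and the pretend archive contents (timeline 2 with parent timeline 1) are assumptions made up for illustration. The sketch only shows why gating the archive fetch on the "currently in archive recovery" flag yields a dummy one-entry history, while gating it on the "archive recovery was requested" flag finds the parent timeline.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag names taken from the thread; everything else here is hypothetical. */
bool ArchiveRecoveryRequested = false; /* recovery was configured to use the archive */
bool InArchiveRecovery = false;        /* startup process is in the archive phase now */

#define MAX_TLI 8

typedef struct {
    unsigned tli[MAX_TLI]; /* newest timeline first */
    size_t n;              /* number of valid entries */
} TimelineHistory;

/* Toy archive: assume it holds only the history file for timeline 2,
 * which lists timeline 2 and its parent, timeline 1. */
bool fetch_history_from_archive(unsigned target, TimelineHistory *out)
{
    if (target != 2)
        return false;
    out->tli[0] = 2;
    out->tli[1] = 1;
    out->n = 2;
    return true;
}

/* Read the history for `target`. The archive is consulted only when
 * `may_use_archive` is true; otherwise a dummy history containing just
 * the target timeline is returned. */
TimelineHistory read_history(unsigned target, bool may_use_archive)
{
    TimelineHistory h = {{0}, 0};

    if (may_use_archive && fetch_history_from_archive(target, &h))
        return h;

    h.tli[0] = target;
    h.n = 1;
    return h;
}

/* True if `tli` appears anywhere in the history. */
bool history_contains(const TimelineHistory *h, unsigned tli)
{
    assert(h != NULL);
    for (size_t i = 0; i < h->n; i++)
        if (h->tli[i] == tli)
            return true;
    return false;
}
```

With ArchiveRecoveryRequested set to true and InArchiveRecovery still false (the state after the recovery command file has been read), read_history(2, InArchiveRecovery) models the old behavior and returns a history without timeline 1, while read_history(2, ArchiveRecoveryRequested) models the committed check and returns both timelines.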
 


Best regards,
-- 
Mitsumasa KONDO
NTT OSS Center


