Re: [BUG] Archive recovery failure on 9.3+. - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: [BUG] Archive recovery failure on 9.3+.
Date
Msg-id 52FCBDB3.30206@vmware.com
Whole thread Raw
In response to Re: [BUG] Archive recovery failure on 9.3+.  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Responses Re: [BUG] Archive recovery failure on 9.3+.  (Heikki Linnakangas <hlinnakangas@vmware.com>)
List pgsql-hackers
On 02/13/2014 01:37 PM, Kyotaro HORIGUCHI wrote:
> At Thu, 13 Feb 2014 10:11:22 +0200, Heikki Linnakangas <hlinnakangas@vmware.com> wrote in
<52FC7E2A.9060703@vmware.com>
>> On 02/13/2014 08:44 AM, Kyotaro HORIGUCHI wrote:
>>>>>>> Wouldn't it be better to not archive the old segment, and instead
>>>>>>> switch
>>>>>>> to a new segment after writing the end-of-recovery checkpoint, so that
>>>>>>> the segment on the new timeline is archived sooner?
>>>>>>
>>>>>> It would be better to zero-fill and switch segments, yes. We should
>>>>>> NEVER be in a position of archiving two different versions of the same
>>>>>> segment.
>>>>>
>>>>> Ok, I think we're in agreement that that's the way to go for master.
>
> Does this mean that no mechanical solution for this situation
> will be given and an operator should remove the older segment for
> the same segment id before starting recovoery process?

I don't understand. With the proposed change to never archive a partial 
segment, there is no segment with the same filename in the archive. 
There is nothing to remove. Or did you mean that you have to do that in 
9.0-9.2 back-branches, if we only back-patch this change to 9.3?

>> If it's already in the archive, it's not going to be removed from the
>> archive.
>
> I have understood the conclusion so far is not archiving the
> older segment when promotion but it seems a bit odd story as you
> suggested. If the conclusion here is no aid as my new
> understanding, would you let me hear the reason why recovery have
> changed to prefer archive to pg_xlog?
>
> The commit abf5c5c9a4 seems to change the behavior but I don't
> find the reason for the change.

Well, it didn't work too well before abf5c5c9a4 either. With your test 
script, and commit abf5c5c9a4 reverted [1], you got this:

### STAGE 3/3: 2/2 START SERVER IN ARCHIVE RECOVERY MODE 2ND RUN ###
waiting for server to start....LOG:  database system was interrupted; 
last known up at 2014-02-13 14:17:53 EET
LOG:  starting archive recovery
LOG:  restored log file "00000002.history" from archive
/bin/cp: cannot stat 
‘/home/heikki/pgsql.93stable/test/arc/000000020000000000000002’: No such 
file or directory
LOG:  restored log file "000000010000000000000002" from archive
LOG:  record with zero length at 0/2000090
LOG:  invalid primary checkpoint record
/bin/cp: cannot stat 
‘/home/heikki/pgsql.93stable/test/arc/000000020000000000000002’: No such 
file or directory
LOG:  restored log file "000000010000000000000002" from archive
LOG:  using previous checkpoint record at 0/2000028
LOG:  consistent recovery state reached at 0/2000028
LOG:  record with zero length at 0/2000090
LOG:  redo is not required
/bin/cp: cannot stat 
‘/home/heikki/pgsql.93stable/test/arc/000000020000000000000002’: No such 
file or directory
LOG:  restored log file "000000010000000000000002" from archive
/bin/cp: cannot stat 
‘/home/heikki/pgsql.93stable/test/arc/00000003.history’: No such file or 
directory
LOG:  selected new timeline ID: 3
LOG:  restored log file "00000002.history" from archive
LOG:  archive recovery complete
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started

Note the "invalid primary checkpoint record" message. It restored the 
partial 000000010000000000000002 file from the archive, and tried to 
recover using that instead of the 000000020000000000000002 file from 
pg_xlog. The primary checkpoint record is not present in 
000000010000000000000002, so it fell back to recover from the previous 
checkpoint.

It was the exact same issue, even though it appeared to work if you 
don't look carefully. The second recovery never replayed the WAL 
generated on timeline 2, it branched off directly from timeline 1, at 
the same point where timeline 2 did. You can see that by looking at the 
history files. Also, if you create a table and insert a row to it after 
the first recovery, just before killing the server, you'll see that the 
row is not there after the second recovery. That's wrong; the data files 
might already contain some changes from timeline 2, and you end up with 
a corrupt database. Thanks to commit abf5c5c9a4, the situation is at 
least detected, and the system refuses to start up.

The behavior where we prefer a segment from archive with lower TLI over 
a file with higher TLI in pg_xlog actually changed in commit 
a068c391ab0. Arguably changing it wasn't a good idea, but the problem 
your test script demonstrates can be fixed by not archiving the partial 
segment, with no change to the preference of archive/pg_xlog. As 
discussed, archiving a partial segment seems like a bad idea anyway, so 
let's just stop doing that.


[1] just applying commit abf5c5c9a4 in reverse doesn't apply cleanly due 
to further changes to the code, but I did this to simulate the old behavior:

diff --git a/src/backend/access/transam/xlog.c 
b/src/backend/access/transam/xlog.c
index 6a4a504..352d3f7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4366,6 +4366,7 @@ readRecoveryCommandFile(void)
     /* Enable fetching from archive recovery area */     ArchiveRecoveryRequested = true;
+    InArchiveRecovery = true;
     /*      * If user specified recovery_target_timeline, validate it or compute the

- Heikki



pgsql-hackers by date:

Previous
From: Kohei KaiGai
Date:
Subject: Re: New hook after raw parsing, before analyze
Next
From: Christoph Berg
Date:
Subject: Re: [BUG] Archive recovery failure on 9.3+.