exitArchiveRecovery woes - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject exitArchiveRecovery woes
Msg-id 549187D0.8050609@vmware.com
Responses Re: exitArchiveRecovery woes  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
At the end of archive recovery, we copy the last segment from the old
timeline, to initialize the first segment on the new timeline. For
example, if the timeline switch happens in the middle of WAL segment
000000010000000000000005, the whole 000000010000000000000005 segment is
copied to become 000000020000000000000005. The copying is necessary, so
that the new segment contains valid data up to the switch point.
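
For reference, the file names in the example can be derived from the
timeline ID and the segment number. This is only a minimal sketch,
assuming the default 16 MB segment size; wal_file_name is a name made up
for this illustration, not a PostgreSQL function.

```python
SEG_SIZE = 16 * 1024 * 1024                # assumption: default WAL segment size
SEGS_PER_XLOGID = 0x100000000 // SEG_SIZE  # 256 segments per 4 GB of WAL


def wal_file_name(tli: int, segno: int) -> str:
    """Hypothetical helper: build the 24-hex-digit WAL segment file name."""
    # Three 8-digit hex fields: timeline ID, high part, low part of the segment number.
    return "%08X%08X%08X" % (tli, segno // SEGS_PER_XLOGID, segno % SEGS_PER_XLOGID)


# Same segment number on both timelines; only the timeline field differs.
print(wal_file_name(1, 5), "->", wal_file_name(2, 5))
```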

However, we don't really need to copy the whole segment; copying up to
the switch point would be enough. In fact, copying the whole segment
is a bad idea, because the copied WAL looks valid on the new timeline
too. When we read the WAL at crash recovery, we rely on a number of
things to determine if the next WAL record is valid, most importantly
the checksum and the prev-pointer. The checksum prevents random data
from appearing valid, and the prev-pointer makes sure that a WAL record
copied from another location in the WAL is not mistaken for a valid one.
The prev-pointer is particularly important when we recycle old WAL
segments as new, because the old segment contains valid WAL records with
checksums and all. When we copy a WAL segment with the same segment
number, the prev-pointer doesn't protect us, as there can be WAL records
at the exact same locations in both segments. There is a timeline ID in
the page header, but we could still be mistaken within the page. Also,
we are lenient with the TLI at the start of WAL recovery, when we read
the first WAL record after the checkpoint. There are further safeguards,
like the fact that when writing WAL, we always write full blocks. But
the write could still be torn at the OS or disk level, if you crash
after writing the WAL, but before fsyncing it.
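
To make the argument concrete, here is a toy model of the two checks.
It is only a sketch: the record layout, make_record and looks_valid are
invented for illustration and do not match the real WAL record format.
It shows that a corrupted record and a record recycled from another
location are rejected, while a record copied to the same location
passes both checks.

```python
import struct
import zlib


def make_record(prev_lsn: int, payload: bytes) -> bytes:
    """Toy record layout: 4-byte CRC, then 8-byte prev-pointer, then payload."""
    body = struct.pack("<Q", prev_lsn) + payload
    return struct.pack("<I", zlib.crc32(body)) + body


def looks_valid(record: bytes, expected_prev: int) -> bool:
    """Accept a record only if its CRC matches and it points at the previous record."""
    (crc,) = struct.unpack_from("<I", record, 0)
    body = record[4:]
    (prev_lsn,) = struct.unpack_from("<Q", body, 0)
    return zlib.crc32(body) == crc and prev_lsn == expected_prev


# Recovery has just replayed a record at 0x5000 and examines the next one.
expected_prev = 0x5000

# A record with one damaged payload byte fails the checksum.
good = make_record(0x5000, b"insert")
damaged = good[:-1] + bytes([good[-1] ^ 0xFF])

# A leftover record in a recycled segment was written after a different
# predecessor, so its prev-pointer does not match.
recycled = make_record(0x1000, b"insert")

# A record copied from the old timeline's segment with the same segment
# number sits at the same location and has the same predecessor.
copied = make_record(0x5000, b"old timeline insert")

print(looks_valid(damaged, expected_prev))   # checksum mismatch
print(looks_valid(recycled, expected_prev))  # prev-pointer mismatch
print(looks_valid(copied, expected_prev))    # passes both checks
```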

This is largely academic, but I was able to craft a test case where WAL
recovery mistakenly starts to replay the WAL copied from the old
timeline, as if it were on the new timeline. Attached is a shell script
I used. It's very sensitive to the lengths of the WAL records, so it
probably only works on a platform similar to mine (x86_64 Linux).
Running pitr-test.sh ends with this:

S LOG:  database system was interrupted; last known up at 2014-12-17 15:15:42 EET
S LOG:  database system was not properly shut down; automatic recovery in progress
S LOG:  redo starts at 0/50A2018
S PANIC:  heap_insert_redo: invalid max offset number
S CONTEXT:  xlog redo Heap/INSERT: off 28
S LOG:  startup process (PID 10640) was terminated by signal 6: Aborted
S LOG:  aborting startup due to startup process failure

That PANIC happens because recovery tries to apply WAL from a different
timeline, and that fails because it missed an earlier change to the
same page the record modifies. (If you were unlucky, you could get
silent corruption instead, if the WAL record happens to apply without
an error.)

A simple way to avoid this is to copy the old WAL segment only up to the
point of the timeline switch, and zero the rest.
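
The idea can be sketched as follows. This is not the attached patch,
just an illustration; copy_segment_prefix and its parameters are
hypothetical names, and the switch point is given as a byte offset
within the segment.

```python
import os
import tempfile


def copy_segment_prefix(src_path, dst_path, switch_offset, seg_size):
    """Copy the first switch_offset bytes of src, then zero-fill dst up to seg_size."""
    if not 0 <= switch_offset <= seg_size:
        raise ValueError("switch offset must lie within the segment")
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        remaining = switch_offset
        while remaining > 0:
            chunk = src.read(min(remaining, 8192))
            if not chunk:
                raise IOError("source segment is shorter than the switch offset")
            dst.write(chunk)
            remaining -= len(chunk)
        # Everything past the switch point is zeros, so no stale record
        # from the old timeline can look valid on the new one.
        dst.write(bytes(seg_size - switch_offset))
        dst.flush()
        os.fsync(dst.fileno())
```

The new file still has the full segment size, so anything that expects
fixed-size segments keeps working; only the tail after the switch point
differs from a plain copy.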


Another thing I noticed is that we copy the last old WAL segment to the
new timeline, even if the timeline switch happens at a segment boundary.
In that case, the copied WAL segment is 100% identical to the old
segment; it contains no records belonging to the new timeline. I guess
that's not wrong per se, but it seems pointless and confusing.
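
Whether a switch point falls on a segment boundary is simple arithmetic
on the LSN. A minimal sketch, assuming 16 MB segments; parse_lsn,
segment_and_offset and needs_copy are hypothetical helpers, and the redo
LSN from the log above is used purely as a sample value.

```python
SEG_SIZE = 16 * 1024 * 1024  # assumption: default WAL segment size


def parse_lsn(text: str) -> int:
    """Turn the textual 'hi/lo' hex form of an LSN into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def segment_and_offset(lsn: int, seg_size: int = SEG_SIZE):
    """Return (segment number, byte offset within that segment)."""
    return divmod(lsn, seg_size)


def needs_copy(switch_lsn: int, seg_size: int = SEG_SIZE) -> bool:
    """At an exact segment boundary there is nothing to carry over."""
    return switch_lsn % seg_size != 0


segno, offset = segment_and_offset(parse_lsn("0/50A2018"))
print(segno, hex(offset), needs_copy(parse_lsn("0/50A2018")))
print(needs_copy(parse_lsn("0/5000000")))
```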

Attached is a patch that addresses both of those issues. This doesn't
seem worth the risk to back-patch, but let's fix these in master.


PS. The "if (endTLI != ThisTimeLineID)" test in exitArchiveRecovery was
always true, because we always switch to a new timeline after archive
recovery. I turned that into an Assert.

- Heikki
