Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
Date
Msg-id CAB7nPqTGnonaydRDx2KQoLAt+AM_nMFqeR6inYZZAo8EeHKwfw@mail.gmail.com
In response to Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave  (Michael Paquier <michael.paquier@gmail.com>)
List pgsql-hackers


On Tue, Jan 22, 2013 at 9:06 AM, Michael Paquier <michael.paquier@gmail.com> wrote:


On Fri, Jan 18, 2013 at 6:20 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
Hmm, so it's the same issue I thought I fixed yesterday. My patch only fixed it for the case that the timeline switch is in the first page of the segment. When it's not, you still get two calls for a WAL record, first one for the first page in the segment, to verify that, and then the page that actually contains the record. The first call leads XLogPageRead to think it needs to read from the old timeline.

We didn't have this problem before the xlogreader refactoring because XLogPageRead() was always called with the RecPtr of the record, even when we actually read the segment header from the file first. We'll have to somehow get that same information, the RecPtr of the record we're actually interested in, to XLogPageRead(). We could add a new argument to the callback for that, or we could keep xlogreader.c as it is and pass it through from ReadRecord to XLogPageRead() in the private struct.

An explicit argument to the callback is probably best. That's straightforward, and it might be useful for the callback to know the actual WAL position that xlogreader.c is interested in anyway. See attached.

Just to let you know that I am still getting the error even after commit 2ff6555.
With the same scenario:
1) Start a master with 2 slaves
2) Kill/Stop slave
3) Promote slave 1, it switches to timeline 2
Log on slave 1

LOG:  selected new timeline ID: 2
4) Reconnect slave 2 to slave 1; slave 2 remains stuck on timeline 1 even with recovery_target_timeline set to latest
Log on slave 1 at this moment:
DEBUG:  received replication command: IDENTIFY_SYSTEM
DEBUG:  received replication command: TIMELINE_HISTORY 2
DEBUG:  received replication command: START_REPLICATION 0/5000000 TIMELINE 1
Slave 1 receives a command to start replication on timeline 1, although it is already on timeline 2.
Log on slave 2 at this moment:
LOG:  restarted WAL streaming at 0/5000000 on timeline 1

LOG:  replication terminated by primary server
DETAIL:  End of WAL reached on timeline 1 at 0/5014200
DEBUG:  walreceiver ended streaming and awaits new instructions

The timeline history file is the same for both nodes:
$ cat 00000002.history
1    0/5014200    no recovery target specified

I might be wrong, but shouldn't there be also an entry for timeline 2 in this file?
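
For reference, as far as I understand the format, each line of a timeline history file describes a parent timeline and the WAL position where it ended, so 00000002.history would be expected to list timeline 1 only. Below is a minimal Python sketch that parses such a file, assuming whitespace-separated fields (parent timeline ID, switch point, reason). The names TimelineEntry, parse_lsn and parse_history are made up for illustration and are not part of PostgreSQL.

```python
from typing import List, NamedTuple


class TimelineEntry(NamedTuple):
    parent_tli: int   # timeline that ended at the switch point
    switch_lsn: int   # switch point as a 64-bit WAL position
    reason: str       # free-text reason recorded at promotion


def parse_lsn(text: str) -> int:
    """Convert an LSN written as 'hi/lo' (two hex numbers) to an integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def parse_history(content: str) -> List[TimelineEntry]:
    """Parse the text of a timeline history file (assumed field layout)."""
    entries = []
    for line in content.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split(None, 2)
        if len(fields) < 2:
            raise ValueError("malformed history line: %r" % line)
        reason = fields[2] if len(fields) > 2 else ""
        entries.append(TimelineEntry(int(fields[0]), parse_lsn(fields[1]), reason))
    return entries


# The file shown above, as a string.
history = parse_history("1    0/5014200    no recovery target specified\n")
```

With the file shown above, the result holds a single entry whose parent timeline is 1, which is consistent with timeline 2 having branched off timeline 1 at 0/5014200.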

Am I missing something?

Sorry, there are no problems...
I simply forgot to set recovery_target_timeline to 'latest' in recovery.conf...
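
For anyone reproducing this, a minimal sketch of the recovery.conf that slave 2 would need to follow slave 1 across the timeline switch. The host and port in primary_conninfo are placeholders, not the values from my test setup:

```
# recovery.conf on slave 2 (sketch; connection values are placeholders)
standby_mode = 'on'
primary_conninfo = 'host=slave1.example.com port=5432'
# Without this line the standby stays on the timeline it started on.
recovery_target_timeline = 'latest'
```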
--
Michael Paquier
http://michael.otacoo.com
