Re: Standby receiving part of missing WAL segment - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Standby receiving part of missing WAL segment
Date
Msg-id CA+Tgmob6+q_zwOeokf_FAAOAr3uvGME56ApF6EQZ5ncTcBAG-Q@mail.gmail.com
Whole thread Raw
In response to Re: Standby receiving part of missing WAL segment  (Thom Brown <thom@linux.com>)
List pgsql-hackers
On Thu, Feb 12, 2015 at 9:56 AM, Thom Brown <thom@linux.com> wrote:
> On 12 February 2015 at 13:56, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Wed, Feb 11, 2015 at 12:55 PM, Thom Brown <thom@linux.com> wrote:
>> > Today I witnessed a situation which appears to have gone down like this:
>> >
>> > - The primary server starting streaming WAL data from segment 00A8 to
>> > the
>> > standby
>> > - The standby server started receiving that data
>> > - Before 00A8 is finished, the wal sender process dies on the primary,
>> > but
>> > the archiver process continues, and 00A8 ends up being archived as usual
>> > - The primary continues to generate WAL and cleans up old WAL files from
>> > pg_xlog until 00A8 is gone.
>> > - The primary is restarted and the wal sender process is back up and
>> > running
>> > - The standby says "waiting for 00A8", which it can no longer get from
>> > the
>> > primary
>> > - 00A8 is in the standby's archive directory, but the standby is waiting
>> > for
>> > the rest of the segment from the primary via streaming replication, so
>> > doesn't check the archive
>> > - The standby is restarted
>> > - The standby goes back into recovery and eventually replays 00A8 and
>> > continues as normal.
>> >
>> > Should the standby be able to get feedback from the primary that the
>> > requested segment is no longer available, and therefore know to check
>> > its
>> > archive?
>>
>> Last time I played around with this, if the standby requested a
>> segment from the master that was no longer present there, the standby
>> would immediately get an ERROR, which it seems like would get you out
>> of trouble.  I wonder why that didn't happen in your case.
>
>
> Yeah, I've tried recreating this like so:
>
> - Primary streams to standby like usual
> - Kill -9 primary then change its port and bring it back up
> - Create traffic on primary until it no longer has the WAL file the standby
> wants, but has archived it
> - Change the port of the primary back to what the standby is trying to talk
> to
>
> But before it gets to that 4th point, the standby has gone to the archive
> for the rest of it:
>
> cp: cannot stat ‘/tmp/walarch/0000000100000006000000C8’: No such file or
> directory
> 2015-02-12 14:47:52 GMT [8280]: [1-1] user=,db=,client= FATAL:  could not
> connect to the primary server: could not connect to server: Connection
> refused
>                 Is the server running on host "127.0.0.1" and accepting
>                 TCP/IP connections on port 5488?
>
> cp: cannot stat ‘/tmp/walarch/0000000100000006000000C8’: No such file or
> directory
> 2015-02-12 14:47:57 GMT [8283]: [1-1] user=,db=,client= FATAL:  could not
> connect to the primary server: could not connect to server: Connection
> refused
>                 Is the server running on host "127.0.0.1" and accepting
>                 TCP/IP connections on port 5488?
>
> 2015-02-12 14:48:02 GMT [8202]: [6-1] user=,db=,client= LOG:  restored log
> file "0000000100000006000000C8" from archive
>
> I don't suppose this is something that was buggy in 9.3.1?

I don't know.  I don't remember a bug of that type, but I might have
missed it.  If you see the scenario again, it might help to pull a
stack trace from the master and the standby.  That might tell us
whether the master or the standby is to blame, and where the offender
is wedged.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Manipulating complex types as non-contiguous structures in-memory
Next
From: Thom Brown
Date:
Subject: Re: Index-only scans for GiST.