Re: Standby receiving part of missing WAL segment - Mailing list pgsql-hackers

From Thom Brown
Subject Re: Standby receiving part of missing WAL segment
Date
Msg-id CAA-aLv4LcGXy1KSV2JtN94-SPefzD4Dq_=J2zMOGNs3cmZRwfA@mail.gmail.com
In response to Re: Standby receiving part of missing WAL segment  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Standby receiving part of missing WAL segment  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On 12 February 2015 at 13:56, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Feb 11, 2015 at 12:55 PM, Thom Brown <thom@linux.com> wrote:
>> Today I witnessed a situation which appears to have gone down like this:
>>
>> - The primary server started streaming WAL data from segment 00A8 to the
>> standby
>> - The standby server started receiving that data
>> - Before 00A8 is finished, the wal sender process dies on the primary, but
>> the archiver process continues, and 00A8 ends up being archived as usual
>> - The primary continues to generate WAL and cleans up old WAL files from
>> pg_xlog until 00A8 is gone.
>> - The primary is restarted and the wal sender process is back up and running
>> - The standby says "waiting for 00A8", which it can no longer get from the
>> primary
>> - 00A8 is in the standby's archive directory, but the standby is waiting for
>> the rest of the segment from the primary via streaming replication, so
>> doesn't check the archive
>> - The standby is restarted
>> - The standby goes back into recovery and eventually replays 00A8 and
>> continues as normal.
>>
>> Should the standby be able to get feedback from the primary that the
>> requested segment is no longer available, and therefore know to check its
>> archive?
>
> Last time I played around with this, if the standby requested a
> segment from the master that was no longer present there, the standby
> would immediately get an ERROR, which it seems like would get you out
> of trouble.  I wonder why that didn't happen in your case.

Yeah, I've tried recreating this like so:

- Primary streams to standby like usual
- kill -9 the primary, then change its port and bring it back up
- Create traffic on primary until it no longer has the WAL file the standby wants, but has archived it
- Change the port of the primary back to what the standby is trying to talk to

But before it gets to that 4th point, the standby has gone to the archive for the rest of it:

cp: cannot stat ‘/tmp/walarch/0000000100000006000000C8’: No such file or directory
2015-02-12 14:47:52 GMT [8280]: [1-1] user=,db=,client= FATAL:  could not connect to the primary server: could not connect to server: Connection refused
                Is the server running on host "127.0.0.1" and accepting
                TCP/IP connections on port 5488?
       
cp: cannot stat ‘/tmp/walarch/0000000100000006000000C8’: No such file or directory
2015-02-12 14:47:57 GMT [8283]: [1-1] user=,db=,client= FATAL:  could not connect to the primary server: could not connect to server: Connection refused
                Is the server running on host "127.0.0.1" and accepting
                TCP/IP connections on port 5488?
       
2015-02-12 14:48:02 GMT [8202]: [6-1] user=,db=,client= LOG:  restored log file "0000000100000006000000C8" from archive
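
To spell out how I read those log lines: the standby seems to alternate between the archive and the primary, retrying roughly every 5 seconds, until one of them can supply the segment. Below is a rough Python model of that loop. It is only a sketch of the behaviour as I understand it, not the server's code, and the names wait_for_wal, in_archive and primary_reachable are made up for illustration.

```python
from typing import Callable, List


def wait_for_wal(segment: str,
                 in_archive: Callable[[int], bool],
                 primary_reachable: Callable[[int], bool],
                 max_attempts: int = 10) -> List[str]:
    """Hypothetical model of a standby looking for one WAL segment.

    in_archive(n) and primary_reachable(n) say whether each source can
    supply the segment on attempt number n (counted from 0).
    """
    events: List[str] = []
    for attempt in range(max_attempts):
        # Step 1: try the archive (the "cp" from /tmp/walarch in the log).
        if in_archive(attempt):
            events.append(f"restored {segment} from archive")
            return events
        events.append("archive miss")

        # Step 2: try to stream the segment from the primary.
        if primary_reachable(attempt):
            events.append(f"streaming {segment} from primary")
            return events
        events.append("could not connect to the primary server")

        # Step 3: a real standby would sleep here before retrying
        # (the attempts in the log above are about 5 seconds apart).
    return events


# The situation in the log: the primary is never reachable on port 5488,
# and the segment only shows up in the archive on the third attempt.
events = wait_for_wal("0000000100000006000000C8",
                      in_archive=lambda n: n >= 2,
                      primary_reachable=lambda n: False)
for line in events:
    print(line)
```

In this model the standby only stays stuck if it holds a live streaming connection and never falls back to step 1, which is the case I haven't managed to reproduce.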

I don't suppose this is something that was buggy in 9.3.1?

Thom
