Thread: Standby receiving part of missing WAL segment

Standby receiving part of missing WAL segment

From
Thom Brown
Date:
Hi,

Today I witnessed a situation which appears to have gone down like this:

- The primary server starting streaming WAL data from segment 00A8 to the standby
- The standby server started receiving that data
- Before 00A8 is finished, the wal sender process dies on the primary, but the archiver process continues, and 00A8 ends up being archived as usual
- The primary continues to generate WAL and cleans up old WAL files from pg_xlog until 00A8 is gone.
- The primary is restarted and the wal sender process is back up and running
- The standby says "waiting for 00A8", which it can no longer get from the primary
- 00A8 is in the standby's archive directory, but the standby is waiting for the rest of the segment from the primary via streaming replication, so doesn't check the archive
- The standby is restarted
- The standby goes back into recovery and eventually replays 00A8 and continues as normal.

Should the standby be able to get feedback from the primary that the requested segment is no longer available, and therefore know to check its archive?

Or should it check the archive anyway if it hasn't received any further WAL data via the streaming replication connection after a certain amount of time?

At the moment, the standby gets stuck forever in this situation, even though it has access to the WAL it needs.

Thom

Re: Standby receiving part of missing WAL segment

From
Robert Haas
Date:
On Wed, Feb 11, 2015 at 12:55 PM, Thom Brown <thom@linux.com> wrote:
> Today I witnessed a situation which appears to have gone down like this:
>
> - The primary server starting streaming WAL data from segment 00A8 to the
> standby
> - The standby server started receiving that data
> - Before 00A8 is finished, the wal sender process dies on the primary, but
> the archiver process continues, and 00A8 ends up being archived as usual
> - The primary continues to generate WAL and cleans up old WAL files from
> pg_xlog until 00A8 is gone.
> - The primary is restarted and the wal sender process is back up and running
> - The standby says "waiting for 00A8", which it can no longer get from the
> primary
> - 00A8 is in the standby's archive directory, but the standby is waiting for
> the rest of the segment from the primary via streaming replication, so
> doesn't check the archive
> - The standby is restarted
> - The standby goes back into recovery and eventually replays 00A8 and
> continues as normal.
>
> Should the standby be able to get feedback from the primary that the
> requested segment is no longer available, and therefore know to check its
> archive?

Last time I played around with this, if the standby requested a
segment from the master that was no longer present there, the standby
would immediately get an ERROR, which it seems like would get you out
of trouble.  I wonder why that didn't happen in your case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Standby receiving part of missing WAL segment

From
Thom Brown
Date:
On 12 February 2015 at 13:56, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Feb 11, 2015 at 12:55 PM, Thom Brown <thom@linux.com> wrote:
> Today I witnessed a situation which appears to have gone down like this:
>
> - The primary server starting streaming WAL data from segment 00A8 to the
> standby
> - The standby server started receiving that data
> - Before 00A8 is finished, the wal sender process dies on the primary, but
> the archiver process continues, and 00A8 ends up being archived as usual
> - The primary continues to generate WAL and cleans up old WAL files from
> pg_xlog until 00A8 is gone.
> - The primary is restarted and the wal sender process is back up and running
> - The standby says "waiting for 00A8", which it can no longer get from the
> primary
> - 00A8 is in the standby's archive directory, but the standby is waiting for
> the rest of the segment from the primary via streaming replication, so
> doesn't check the archive
> - The standby is restarted
> - The standby goes back into recovery and eventually replays 00A8 and
> continues as normal.
>
> Should the standby be able to get feedback from the primary that the
> requested segment is no longer available, and therefore know to check its
> archive?

Last time I played around with this, if the standby requested a
segment from the master that was no longer present there, the standby
would immediately get an ERROR, which it seems like would get you out
of trouble.  I wonder why that didn't happen in your case.

Yeah, I've tried recreating this like so:

- Primary streams to standby like usual
- Kill -9 primary then change its port and bring it back up
- Create traffic on primary until it no longer has the WAL file the standby wants, but has archived it
- Change the port of the primary back to what the standby is trying to talk to

But before it gets to that 4th point, the standby has gone to the archive for the rest of it:

cp: cannot stat ‘/tmp/walarch/0000000100000006000000C8’: No such file or directory
2015-02-12 14:47:52 GMT [8280]: [1-1] user=,db=,client= FATAL:  could not connect to the primary server: could not connect to server: Connection refused
                Is the server running on host "127.0.0.1" and accepting
                TCP/IP connections on port 5488?
       
cp: cannot stat ‘/tmp/walarch/0000000100000006000000C8’: No such file or directory
2015-02-12 14:47:57 GMT [8283]: [1-1] user=,db=,client= FATAL:  could not connect to the primary server: could not connect to server: Connection refused
                Is the server running on host "127.0.0.1" and accepting
                TCP/IP connections on port 5488?
       
2015-02-12 14:48:02 GMT [8202]: [6-1] user=,db=,client= LOG:  restored log file "0000000100000006000000C8" from archive

I don't suppose this is something that was buggy in 9.3.1?

Thom

Re: Standby receiving part of missing WAL segment

From
Robert Haas
Date:
On Thu, Feb 12, 2015 at 9:56 AM, Thom Brown <thom@linux.com> wrote:
> On 12 February 2015 at 13:56, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Wed, Feb 11, 2015 at 12:55 PM, Thom Brown <thom@linux.com> wrote:
>> > Today I witnessed a situation which appears to have gone down like this:
>> >
>> > - The primary server starting streaming WAL data from segment 00A8 to
>> > the
>> > standby
>> > - The standby server started receiving that data
>> > - Before 00A8 is finished, the wal sender process dies on the primary,
>> > but
>> > the archiver process continues, and 00A8 ends up being archived as usual
>> > - The primary continues to generate WAL and cleans up old WAL files from
>> > pg_xlog until 00A8 is gone.
>> > - The primary is restarted and the wal sender process is back up and
>> > running
>> > - The standby says "waiting for 00A8", which it can no longer get from
>> > the
>> > primary
>> > - 00A8 is in the standby's archive directory, but the standby is waiting
>> > for
>> > the rest of the segment from the primary via streaming replication, so
>> > doesn't check the archive
>> > - The standby is restarted
>> > - The standby goes back into recovery and eventually replays 00A8 and
>> > continues as normal.
>> >
>> > Should the standby be able to get feedback from the primary that the
>> > requested segment is no longer available, and therefore know to check
>> > its
>> > archive?
>>
>> Last time I played around with this, if the standby requested a
>> segment from the master that was no longer present there, the standby
>> would immediately get an ERROR, which it seems like would get you out
>> of trouble.  I wonder why that didn't happen in your case.
>
>
> Yeah, I've tried recreating this like so:
>
> - Primary streams to standby like usual
> - Kill -9 primary then change its port and bring it back up
> - Create traffic on primary until it no longer has the WAL file the standby
> wants, but has archived it
> - Change the port of the primary back to what the standby is trying to talk
> to
>
> But before it gets to that 4th point, the standby has gone to the archive
> for the rest of it:
>
> cp: cannot stat ‘/tmp/walarch/0000000100000006000000C8’: No such file or
> directory
> 2015-02-12 14:47:52 GMT [8280]: [1-1] user=,db=,client= FATAL:  could not
> connect to the primary server: could not connect to server: Connection
> refused
>                 Is the server running on host "127.0.0.1" and accepting
>                 TCP/IP connections on port 5488?
>
> cp: cannot stat ‘/tmp/walarch/0000000100000006000000C8’: No such file or
> directory
> 2015-02-12 14:47:57 GMT [8283]: [1-1] user=,db=,client= FATAL:  could not
> connect to the primary server: could not connect to server: Connection
> refused
>                 Is the server running on host "127.0.0.1" and accepting
>                 TCP/IP connections on port 5488?
>
> 2015-02-12 14:48:02 GMT [8202]: [6-1] user=,db=,client= LOG:  restored log
> file "0000000100000006000000C8" from archive
>
> I don't suppose this is something that was buggy in 9.3.1?

I don't know.  I don't remember a bug of that type, but I might have
missed it.  If you see the scenario again, it might help to pull a
stack trace from the master and the standby.  That might tell us
whether the master or the standby is to blame, and where the offender
is wedged.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company