Thread: Standby receiving part of missing WAL segment
Hi,
Today I witnessed a situation which appears to have gone down like this:
- The primary server starting streaming WAL data from segment 00A8 to the standbyToday I witnessed a situation which appears to have gone down like this:
At the moment, the standby gets stuck forever in this situation, even though it has access to the WAL it needs.
Thom
On Wed, Feb 11, 2015 at 12:55 PM, Thom Brown <thom@linux.com> wrote: > Today I witnessed a situation which appears to have gone down like this: > > - The primary server starting streaming WAL data from segment 00A8 to the > standby > - The standby server started receiving that data > - Before 00A8 is finished, the wal sender process dies on the primary, but > the archiver process continues, and 00A8 ends up being archived as usual > - The primary continues to generate WAL and cleans up old WAL files from > pg_xlog until 00A8 is gone. > - The primary is restarted and the wal sender process is back up and running > - The standby says "waiting for 00A8", which it can no longer get from the > primary > - 00A8 is in the standby's archive directory, but the standby is waiting for > the rest of the segment from the primary via streaming replication, so > doesn't check the archive > - The standby is restarted > - The standby goes back into recovery and eventually replays 00A8 and > continues as normal. > > Should the standby be able to get feedback from the primary that the > requested segment is no longer available, and therefore know to check its > archive? Last time I played around with this, if the standby requested a segment from the master that was no longer present there, the standby would immediately get an ERROR, which it seems like would get you out of trouble. I wonder why that didn't happen in your case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 12 February 2015 at 13:56, Robert Haas <robertmhaas@gmail.com> wrote:
I don't suppose this is something that was buggy in 9.3.1?On Wed, Feb 11, 2015 at 12:55 PM, Thom Brown <thom@linux.com> wrote:
> Today I witnessed a situation which appears to have gone down like this:
>
> - The primary server starting streaming WAL data from segment 00A8 to the
> standby
> - The standby server started receiving that data
> - Before 00A8 is finished, the wal sender process dies on the primary, but
> the archiver process continues, and 00A8 ends up being archived as usual
> - The primary continues to generate WAL and cleans up old WAL files from
> pg_xlog until 00A8 is gone.
> - The primary is restarted and the wal sender process is back up and running
> - The standby says "waiting for 00A8", which it can no longer get from the
> primary
> - 00A8 is in the standby's archive directory, but the standby is waiting for
> the rest of the segment from the primary via streaming replication, so
> doesn't check the archive
> - The standby is restarted
> - The standby goes back into recovery and eventually replays 00A8 and
> continues as normal.
>
> Should the standby be able to get feedback from the primary that the
> requested segment is no longer available, and therefore know to check its
> archive?
Last time I played around with this, if the standby requested a
segment from the master that was no longer present there, the standby
would immediately get an ERROR, which it seems like would get you out
of trouble. I wonder why that didn't happen in your case.
Yeah, I've tried recreating this like so:
- Primary streams to standby like usual
- Kill -9 primary then change its port and bring it back up
- Create traffic on primary until it no longer has the WAL file the standby wants, but has archived it
- Change the port of the primary back to what the standby is trying to talk to
But before it gets to that 4th point, the standby has gone to the archive for the rest of it:
cp: cannot stat ‘/tmp/walarch/0000000100000006000000C8’: No such file or directory
2015-02-12 14:47:52 GMT [8280]: [1-1] user=,db=,client= FATAL: could not connect to the primary server: could not connect to server: Connection refused
Is the server running on host "127.0.0.1" and accepting
TCP/IP connections on port 5488?
cp: cannot stat ‘/tmp/walarch/0000000100000006000000C8’: No such file or directory
2015-02-12 14:47:57 GMT [8283]: [1-1] user=,db=,client= FATAL: could not connect to the primary server: could not connect to server: Connection refused
Is the server running on host "127.0.0.1" and accepting
TCP/IP connections on port 5488?
2015-02-12 14:48:02 GMT [8202]: [6-1] user=,db=,client= LOG: restored log file "0000000100000006000000C8" from archive
cp: cannot stat ‘/tmp/walarch/0000000100000006000000C8’: No such file or directory
2015-02-12 14:47:52 GMT [8280]: [1-1] user=,db=,client= FATAL: could not connect to the primary server: could not connect to server: Connection refused
Is the server running on host "127.0.0.1" and accepting
TCP/IP connections on port 5488?
cp: cannot stat ‘/tmp/walarch/0000000100000006000000C8’: No such file or directory
2015-02-12 14:47:57 GMT [8283]: [1-1] user=,db=,client= FATAL: could not connect to the primary server: could not connect to server: Connection refused
Is the server running on host "127.0.0.1" and accepting
TCP/IP connections on port 5488?
2015-02-12 14:48:02 GMT [8202]: [6-1] user=,db=,client= LOG: restored log file "0000000100000006000000C8" from archive
Thom
On Thu, Feb 12, 2015 at 9:56 AM, Thom Brown <thom@linux.com> wrote: > On 12 February 2015 at 13:56, Robert Haas <robertmhaas@gmail.com> wrote: >> >> On Wed, Feb 11, 2015 at 12:55 PM, Thom Brown <thom@linux.com> wrote: >> > Today I witnessed a situation which appears to have gone down like this: >> > >> > - The primary server starting streaming WAL data from segment 00A8 to >> > the >> > standby >> > - The standby server started receiving that data >> > - Before 00A8 is finished, the wal sender process dies on the primary, >> > but >> > the archiver process continues, and 00A8 ends up being archived as usual >> > - The primary continues to generate WAL and cleans up old WAL files from >> > pg_xlog until 00A8 is gone. >> > - The primary is restarted and the wal sender process is back up and >> > running >> > - The standby says "waiting for 00A8", which it can no longer get from >> > the >> > primary >> > - 00A8 is in the standby's archive directory, but the standby is waiting >> > for >> > the rest of the segment from the primary via streaming replication, so >> > doesn't check the archive >> > - The standby is restarted >> > - The standby goes back into recovery and eventually replays 00A8 and >> > continues as normal. >> > >> > Should the standby be able to get feedback from the primary that the >> > requested segment is no longer available, and therefore know to check >> > its >> > archive? >> >> Last time I played around with this, if the standby requested a >> segment from the master that was no longer present there, the standby >> would immediately get an ERROR, which it seems like would get you out >> of trouble. I wonder why that didn't happen in your case. > > > Yeah, I've tried recreating this like so: > > - Primary streams to standby like usual > - Kill -9 primary then change its port and bring it back up > - Create traffic on primary until it no longer has the WAL file the standby > wants, but has archived it > - Change the port of the primary back to what the standby is trying to talk > to > > But before it gets to that 4th point, the standby has gone to the archive > for the rest of it: > > cp: cannot stat ‘/tmp/walarch/0000000100000006000000C8’: No such file or > directory > 2015-02-12 14:47:52 GMT [8280]: [1-1] user=,db=,client= FATAL: could not > connect to the primary server: could not connect to server: Connection > refused > Is the server running on host "127.0.0.1" and accepting > TCP/IP connections on port 5488? > > cp: cannot stat ‘/tmp/walarch/0000000100000006000000C8’: No such file or > directory > 2015-02-12 14:47:57 GMT [8283]: [1-1] user=,db=,client= FATAL: could not > connect to the primary server: could not connect to server: Connection > refused > Is the server running on host "127.0.0.1" and accepting > TCP/IP connections on port 5488? > > 2015-02-12 14:48:02 GMT [8202]: [6-1] user=,db=,client= LOG: restored log > file "0000000100000006000000C8" from archive > > I don't suppose this is something that was buggy in 9.3.1? I don't know. I don't remember a bug of that type, but I might have missed it. If you see the scenario again, it might help to pull a stack trace from the master and the standby. That might tell us whether the master or the standby is to blame, and where the offender is wedged. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company