Thread: postgres wal sender replication timeout during pg_basebackup

From

Peter Brunnengräber

Date:

07 April 2016, 18:14:14

Hello all,
   I had posted this to dba.stackexchange but haven't gotten any responses, so I thought the list here might be more
focused and a better place to post this.

   I'll start by noting that I am still somewhat green with Postgres...  One of our applications requires it, so I have
been learning as I go...

   Right now I am working on a postgres 9.2 Active/Standby cluster on Debian wheezy to make the application more fault
tolerant, based on the ClusterLabs pgsql cluster documentation
[http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster].

   In the lab, I am able to get this set up and working without a problem, but on the pre-production cluster I keep
running into a WAL sync error.

   I brought the database files over from the current single production postgres server. By this I mean I shut down
postgres, tar-ed up the data directory and copied it over to the cluster's Master node. I put the files in place,
set the permissions, and was able to start up postgres on the Master via corosync just fine.

   In preparing the slave, I used the pg_basebackup tool to bring the database over from the Master, and this is where I
keep having issues. As it is transferring, at about 57% I see the error:

>  $ pg_basebackup -h db-master -U u_repl -D /db/data/postgresql/9.2/main/ -X stream -P
>  pg_basebackup: could not receive data from WAL stream: SSL connection has been closed unexpectedly
>  176472/176472 kB (100%), 1/1 tablespace
>  pg_basebackup: child process exited with error 1

And on the server, I see:

>  2016-04-06 21:05:31 UTC LOG:  terminating walsender process due to replication timeout

  But the transfer doesn't stop and keeps going to completion.

  I found this [http://dba.stackexchange.com/questions/59916/streaming-replication-log-is-puzzling-me] question on
stackexchange about setting "ssl_renegotiation_limit" to 0, but this didn't make much difference.

  Anyone have any ideas? I didn't find any reference to this problem in the mailing list archives. I am completely
baffled as to why this would error but keep on going. Maybe this isn't a problem at all?  It is the same procedure I
used in the lab setup... the only difference is that the production database is much bigger in size.

Any thoughts??


-With kind regards, Peter.

Re: postgres wal sender replication timeout during pg_basebackup

From

Albe Laurenz

Date:

08 April 2016, 09:03:38

Peter Brunnengräber wrote:
>    I brought the database files over from the current single production postgres server. By this I
> mean I shutdown postgres and tar-ed up the data directory and copied it over to the cluster's Master
> node. I put the files in place, set the permissions, and was able to start-up postgres on the Master
> via corosync just fine.
> 
>    In preparing the slave, I used the pg_basebackup tool to bring the database over from the Master
> and this is where I keep having issues. As it is transferring, at about 57% I see the error:
> 
> >  $ pg_basebackup -h db-master -U u_repl -D /db/data/postgresql/9.2/main/ -X stream -P
> >  pg_basebackup: could not receive data from WAL stream: SSL connection has been closed unexpectedly
> >  176472/176472 kB (100%), 1/1 tablespace
> >  pg_basebackup: child process exited with error 1
> 
> And on the server, I see:
> 
> >  2016-04-06 21:05:31 UTC LOG:  terminating walsender process due to replication timeout
> 
>   But the transfer doesn't stop and keeps going to completion.
> 
>   I found this [http://dba.stackexchange.com/questions/59916/streaming-replication-log-is-puzzling-me]
> question on stackexchange about setting "ssl_renegotiation_limit" to 0, but this didn't make much
> difference.
> 
>   Anyone have any ideas? I didn't find any reference to this problem in the mailing list archives. I
> am completely baffled as to why this would error, but keep on going. Maybe this isn't a problem at
> all?  It is the same procedure I used in the lab setup... the only difference is that the production
> database is much bigger in size.

ssl_renegotiation_limit would also have been my first guess.
What PostgreSQL version are you running?

The server error message means that the client did not send a status update
within "wal_sender_timeout" milliseconds, see
http://www.postgresql.org/docs/current/static/runtime-config-replication.html#GUC-WAL-SENDER-TIMEOUT

Does pg_basebackup succeed if you set "wal_sender_timeout" to zero?

Is there a firewall between client and server that could swallow such messages?

Could you try without SSL (e.g. set the environment variable PGSSLMODE to "disable")
and see if that makes the problem go away?
Avoiding SSL will also greatly speed up pg_basebackup.
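
As a minimal sketch of the no-SSL test, the snippet below only prints the command to run; it does not contact any server. The host, user and data directory are the ones from the original post, and build_basebackup_cmd is a hypothetical helper name, not a PostgreSQL tool.

```shell
#!/bin/sh
# Dry-run sketch: print the pg_basebackup command from the thread with SSL
# disabled for that one invocation. Nothing is executed against a server.

# Host, user and data directory as quoted in the original post.
MASTER_HOST="db-master"
REPL_USER="u_repl"
DATA_DIR="/db/data/postgresql/9.2/main/"

# build_basebackup_cmd is a hypothetical helper, not part of PostgreSQL.
# $1 is the value to use for the PGSSLMODE environment variable.
build_basebackup_cmd() {
    printf 'PGSSLMODE=%s pg_basebackup -h %s -U %s -D %s -X stream -P\n' \
        "$1" "$MASTER_HOST" "$REPL_USER" "$DATA_DIR"
}

# Print the command to run by hand on the standby.
build_basebackup_cmd disable
```

Setting PGSSLMODE on the command line like this affects only that one run. If pg_hba.conf on the master only has "hostssl" entries for the replication user, the non-SSL connection will be rejected, so that is worth checking first.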

Yours,
Laurenz Albe

Re: postgres wal sender replication timeout during pg_basebackup

From

Peter Brunnengräber

Date:

11 April 2016, 21:18:14

Hello Mr. Albe,

> What PostgreSQL version are you running?
  9.2

> The server error message means that the client did not send a status
> update within "wal_sender_timeout" milliseconds

  So if I understand this correctly, the wal sender must receive a message back from the receiver within this preset time
or else it assumes that the transmission failed...
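
That understanding can be sketched as a deliberately simplified model. This is not PostgreSQL source code; sender_gives_up is a hypothetical name, and the real walsender logic is more involved. The only points taken from the thread are that the sender gives up when no status update arrives within the timeout, and that a timeout of 0 disables the check.

```shell
#!/bin/sh
# Simplified model of the sender-side timeout check (illustration only).
# $1: seconds since the last status message from the receiver
# $2: configured timeout in seconds (0 disables the check)
sender_gives_up() {
    elapsed=$1
    timeout=$2
    if [ "$timeout" -eq 0 ]; then
        return 1    # timeout disabled: never give up
    fi
    [ "$elapsed" -ge "$timeout" ]
}

# Try a few elapsed/timeout pairs, including the 5s timeout from the thread.
for pair in "2 5" "6 5" "600 0"; do
    set -- $pair
    if sender_gives_up "$1" "$2"; then
        echo "elapsed=${1}s timeout=${2}s -> terminate walsender"
    else
        echo "elapsed=${1}s timeout=${2}s -> keep streaming"
    fi
done
```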

  9.2 doesn't seem to have the "wal_sender_timeout" parameter, and it appears that "replication_timeout" may be the
name of the parameter prior to v9.3, so this is what I am tweaking.  I originally had "replication_timeout = 5s", and I
verified that "wal_receiver_status_interval = 2s" per the documentation.

  You were correct that setting this value to 0 did allow the pg_basebackup to complete without an error.  I plan to
also try setting this value to 15s to see if the pg_basebackup completes in that time frame.
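
For reference, a rough sketch of switching between those values. It works on a scratch file rather than the real postgresql.conf, whose location depends on the installation; set_replication_timeout is a hypothetical helper, and the starting values are the ones mentioned above.

```shell
#!/bin/sh
# Sketch: change replication_timeout in a scratch copy of postgresql.conf.
# The real configuration file is NOT touched here.

# set_replication_timeout is a hypothetical helper.
# $1: path to a config file, $2: new value (for example 0 or 15s)
set_replication_timeout() {
    conf=$1
    value=$2
    tmp="$conf.new"
    # Rewrite an existing (possibly commented-out) setting, keep other lines.
    sed "s/^[#[:space:]]*replication_timeout[[:space:]]*=.*/replication_timeout = $value/" \
        "$conf" > "$tmp" && mv "$tmp" "$conf"
}

# Scratch file holding the two settings mentioned in the thread.
scratch=$(mktemp)
printf '%s\n' 'replication_timeout = 5s' 'wal_receiver_status_interval = 2s' > "$scratch"

set_replication_timeout "$scratch" 0      # the value that let the backup finish
grep '^replication_timeout' "$scratch"

set_replication_timeout "$scratch" 15s    # the next value to be tried
grep '^replication_timeout' "$scratch"

rm -f "$scratch"
```

On a real server the edited file would still need to be picked up by postgres before the new value takes effect.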

> Is there a firewall between client and server that could swallow such messages?
  None that I am aware of, but I will check with the Xen Hypervisor admin to make sure there isn't something set up here
which could also cause trouble down the road.

> Avoiding SSL will also greatly speed up pg_basebackup.
  Ok.  I will give this a try as well.

Thank you ever so much for your reply and solution, it was greatly appreciated!

With kind regards. -Peter

--
Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)