Thread: streaming replication timeout error

streaming replication timeout error

From
高健
Date:
Hello:

My customer encountered some connection timeout, while using one primary-one standby streaming replication.

The original log is japanese, because there are no error-code like oracle's ora-xxx,
I tried to translate the japanese information into English, But that might be not correct English for PG.


The most important part is:

2013-09-22 09:52:47 JST[28297][51d1fbcb.6e89-2][0][XX000]FATAL:  Could not receive data from WAL stream: could not receive data from server: connection timeout
scp: /opt/PostgresPlus/9.2AS/data/arch/000000AC000001F10000004A: No such file or directory

I was asked about:
In what occasion will the above fatal error occur?

I looked into the postgresql.conf file for the primary and standby server.
And made some experiences.

I found: 
Senario I:

If the wal file wanted is removed manually:

Both in primary and standby, log will be like this:
FATAL:  could not receive data from WAL stream: FATAL:  requested WAL segment 000000010000000000000011 has already been removed


Senario II:

If replication_timeout=60, wal_receiver_status_interval=0,
log will be like this:

On primary side:
LOG:  terminating walsender process due to replication timeout

On standby side:
FATAL: could not receive data from WAL stream: 
LOG:  streaming replication successfully connected to primary


But they are not the same to my customer's case.
And I also noticed that my customer's recovery.conf:

In it it has primary_conninfo as this:
host=DB1 port=5432 application_name=testpg user=postgres connect_timeout=10 keepalives_idle=5 keepalives_interval=1 

That is to say:
      keepalives_idle=5 
      keepalives_interval=1
      keepalives_count=3

With that setting, I think that For every 5 seconds,
In order to see whether the connection is alive, 3 keepalives package will be sent,
And if there is no repsonse in 1 seconds, then he connection will be invalid.

If my unserstanding is right, 
when master is busy at network,
the three keepalives_  parameter will also cause the connection time out error.

But I haven't found a  good explanation fitting the logs' FATAL error.
Can anybody give me some info?

2013-09-22 09:40:42 JST[27987][412d33c5.2d23-95024][0][00000]LOG:  Recovery replay at point 1F1/45EF8E20
2013-09-22 09:40:42 JST[27987][412d33c5.2d23-95025][0][00000]DETAIL:  Last transaction log time 2013-09-22 09:40:42.673062+09
2013-09-22 09:44:00 JST[27987][412d33c5.2d23-95026][0][00000]LOG:  restartpoint starting: time
2013-09-22 09:45:45 JST[27987][412d33c5.2d23-95027][0][00000]LOG:  restartpoint complete: wrote 1051 buffers (0.3%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=104.773 s, sync=0.003 s, total=104.781 s; sync files=90, longest=0.001 s, average=0.000 s
2013-09-22 09:45:45 JST[27987][412d33c5.2d23-95028][0][00000]LOG:  Recovery replay at point 1F1/45EF8E20
2013-09-22 09:45:45 JST[27987][412d33c5.2d23-95029][0][00000]DETAIL:  Last transaction log time 2013-09-22 09:45:44.77683+09
2013-09-22 09:49:00 JST[27987][412d33c5.2d23-95030][0][00000]LOG:  restartpoint starting: time
2013-09-22 09:50:36 JST[27987][412d33c5.2d23-95031][0][00000]LOG:  restartpoint complete: wrote 963 buffers (0.3%); 0 transaction log file(s) added, 0 removed, 2 recycled; write=96.096 s, sync=0.006 s, total=96.106 s; sync files=98, longest=0.002 s, average=0.000 s
2013-09-22 09:50:36 JST[27987][412d33c5.2d23-95032][0][00000]LOG:  Recovery replay at point 1F1/46CE8EA0
2013-09-22 09:50:36 JST[27987][412d33c5.2d23-95033][0][00000]DETAIL:  Last transaction log time 2013-09-22 09:50:36.564966+09
2013-09-22 09:52:47 JST[28297][51d1fbcb.6e89-2][0][XX000]FATAL:  Could not receive data from WAL stream: could not receive data from server: connection timeout
scp: /opt/PostgresPlus/9.2AS/data/arch/000000AC000001F10000004A: No such file or directory
2013-09-22 09:52:50 JST[27806][51d1fbc5.6c9e-12][0][00000]LOG:  Log file 497、Segment 74、Offset 12263424 Page address 1F0/8DBB2000 is not expected
scp: /opt/PostgresPlus/9.2AS/data/arch/000000AC000001F10000004A: No such file or directory
scp: /opt/PostgresPlus/9.2AS/data/arch/000000AD.history: No such file or directory
2013-09-22 09:52:51 JST[2878][523e3f63.b3e-1][0][00000]LOG:  Streaming replication connected to primary server successfully
2013-09-22 09:54:00 JST[27987][412d33c5.2d23-95034][0][00000]LOG:  restartpoint starting: time
2013-09-22 09:56:30 JST[27987][412d33c5.2d23-95035][0][00000]LOG:  restartpoint complete: wrote 9929 buffers (3.2%); 0 transaction log file(s) added, 0 removed, 10 recycled; write=149.436 s, sync=0.032 s, total=149.472 s; sync files=90, longest=0.012 s, average=0.000 s
2013-09-22 09:56:30 JST[27987][412d33c5.2d23-95036][0][00000]LOG:  Recovery replay at point 1F1/48E9D238
2013-09-22 09:56:30 JST[27987][412d33c5.2d23-95037][0][00000]DETAIL:  Last transaction log time 2013-09-22 09:56:29.015514+09

Thanks in advance

jian gao

Re: streaming replication timeout error

From
Adrian Klaver
Date:
On 10/08/2013 07:58 PM, 高健 wrote:
> Hello:
>
> My customer encountered some connection timeout, while using one
> primary-one standby streaming replication.
>
> The original log is japanese, because there are no error-code like
> oracle's ora-xxx,
> I tried to translate the japanese information into English, But that
> might be not correct English for PG.
>
>
> The most important part is:
>
> 2013-09-22 09:52:47 JST[28297][51d1fbcb.6e89-2][0][XX000]FATAL:  Could
> not receive data from WAL stream: could not receive data from server:
> connection timeout
> scp: /opt/PostgresPlus/9.2AS/data/arch/000000AC000001F10000004A: No such
> file or directory
>
> I was asked about:
> In what occasion will the above fatal error occur?
>
> I looked into the postgresql.conf file for the primary and standby server.
> And made some experiences.
>
> I found:
> Senario I:
>
> If the wal file wanted is removed manually:
>
> Both in primary and standby, log will be like this:
> FATAL:  could not receive data from WAL stream: FATAL:  requested WAL
> segment 000000010000000000000011 has already been removed
>
>

>
> But I haven't found a  good explanation fitting the logs' FATAL error.
> Can anybody give me some info?

Would seem to me the interesting part is:

  scp: /opt/PostgresPlus/9.2AS/data/arch/000000AC000001F10000004A: No
such  file or directory

Are using scp to move WAL files to an archive directory?

If so, it seems scp is having issues, either network interruption or the
file is disappearing under it.

>
> Thanks in advance
>
> jian gao
>


--
Adrian Klaver
adrian.klaver@gmail.com


Re: streaming replication timeout error

From
高健
Date:
Hello:

Thanks for replying.

The recovery.conf file on standby(DB2) is like that:

standby_mode             = 'on'
primary_conninfo         = 'host=DB1 port=5432 application_name=testpg user=postgres connect_timeout=10 keepalives_idle=5 keepalives_interval=1'
recovery_target_timeline = 'latest'
restore_command          = 'scp -o "ConnectTimeout 5" -i /opt/PostgresPlus/9.2AS/.ssh/id_edb DB1:/opt/PostgresPlus/9.2AS/data/arch/%f %p'


I  am not familiar with the scp command,  I think that here scp is used to copy archive wal log files from primary  to standby...

Maybe the ConnectionTimeout is too small, And sometimes when network is not very well, 
the restore_command will fail and return FATAL error?

In fact I am a little confused about restore_command, we are using streaming replication, but why restore_command is still needed to copy archive wal log, isn't it  the old warm standby (file shipping)?

Best Regards
jian gao 



2013/10/9 Adrian Klaver <adrian.klaver@gmail.com>
On 10/08/2013 07:58 PM, 高健 wrote:
Hello:

My customer encountered some connection timeout, while using one
primary-one standby streaming replication.

The original log is japanese, because there are no error-code like
oracle's ora-xxx,
I tried to translate the japanese information into English, But that
might be not correct English for PG.


The most important part is:

2013-09-22 09:52:47 JST[28297][51d1fbcb.6e89-2][0][XX000]FATAL:  Could
not receive data from WAL stream: could not receive data from server:
connection timeout
scp: /opt/PostgresPlus/9.2AS/data/arch/000000AC000001F10000004A: No such
file or directory

I was asked about:
In what occasion will the above fatal error occur?

I looked into the postgresql.conf file for the primary and standby server.
And made some experiences.

I found:
Senario I:

If the wal file wanted is removed manually:

Both in primary and standby, log will be like this:
FATAL:  could not receive data from WAL stream: FATAL:  requested WAL
segment 000000010000000000000011 has already been removed




But I haven't found a  good explanation fitting the logs' FATAL error.
Can anybody give me some info?

Would seem to me the interesting part is:


 scp: /opt/PostgresPlus/9.2AS/data/arch/000000AC000001F10000004A: No such  file or directory

Are using scp to move WAL files to an archive directory?

If so, it seems scp is having issues, either network interruption or the file is disappearing under it.



Thanks in advance

jian gao



--
Adrian Klaver
adrian.klaver@gmail.com

Re: streaming replication timeout error

From
Adrian Klaver
Date:
On 10/09/2013 05:51 PM, 高健 wrote:
> Hello:
>
> Thanks for replying.
>
> The recovery.conf file on standby(DB2) is like that:
>
> standby_mode             = 'on'
> primary_conninfo         = 'host=DB1 port=5432 application_name=testpg
> user=postgres connect_timeout=10 keepalives_idle=5 keepalives_interval=1'
> recovery_target_timeline = 'latest'
> restore_command          = 'scp -o "ConnectTimeout 5" -i
> /opt/PostgresPlus/9.2AS/.ssh/id_edb
> DB1:/opt/PostgresPlus/9.2AS/data/arch/%f %p'
>
>
> I  am not familiar with the scp command,  I think that here scp is used
> to copy archive wal log files from primary  to standby...
>
> Maybe the ConnectionTimeout is too small, And sometimes when network is
> not very well,
> the restore_command will fail and return FATAL error?
>
> In fact I am a little confused about restore_command, we are using
> streaming replication, but why restore_command is still needed to copy
> archive wal log, isn't it  the old warm standby (file shipping)?

Best explanation is in the docs:

http://www.postgresql.org/docs/9.3/static/warm-standby.html
"
At startup, the standby begins by restoring all WAL available in the
archive location, calling restore_command. Once it reaches the end of
WAL available there and restore_command fails, it tries to restore any
WAL available in the pg_xlog directory. If that fails, and streaming
replication has been configured, the standby tries to connect to the
primary server and start streaming WAL from the last valid record found
in archive or pg_xlog. If that fails or streaming replication is not
configured, or if the connection is later disconnected, the standby goes
back to step 1 and tries to restore the file from the archive again.
This loop of retries from the archive, pg_xlog, and via streaming
replication goes on until the server is stopped or failover is triggered
by a trigger file.
"

Basically by having a restore_command and primary_conninfo you are
telling the standby to do both, following the sequence described above.

FYI ConnectTimeout is a SSH option passed to scp.

man ssh_config will get you more information.

Would seem both your streaming and archiving are using the same network,
is that correct?

If so you have a single point of failure, the network.


>
> Best Regards
> jian gao
>
>

--
Adrian Klaver
adrian.klaver@gmail.com


Re: streaming replication timeout error

From
Stuart Bishop
Date:
On Wed, Oct 9, 2013 at 9:58 AM, 高健 <luckyjackgao@gmail.com> wrote:

> The most important part is:
>
> 2013-09-22 09:52:47 JST[28297][51d1fbcb.6e89-2][0][XX000]FATAL:  Could not
> receive data from WAL stream: could not receive data from server: connection
> timeout
> scp: /opt/PostgresPlus/9.2AS/data/arch/000000AC000001F10000004A: No such
> file or directory
>
> I was asked about:
> In what occasion will the above fatal error occur?

It is a network error. The TCP/IP socket died somehow, and the FATAL
error logged. PostgreSQL then fell back to using the restore_command
defined in your recovery.conf file, but the WAL file had not yet been
shipped and the output logged. I imagine streaming replication happily
reconnected soon after. This is all quite normal.



--
Stuart Bishop <stuart@stuartbishop.net>
http://www.stuartbishop.net/


Re: streaming replication timeout error

From
高健
Date:
Thank you all a lot!

I have got it.

Best regards


2013/10/10 Stuart Bishop <stuart@stuartbishop.net>
On Wed, Oct 9, 2013 at 9:58 AM, 高健 <luckyjackgao@gmail.com> wrote:

> The most important part is:
>
> 2013-09-22 09:52:47 JST[28297][51d1fbcb.6e89-2][0][XX000]FATAL:  Could not
> receive data from WAL stream: could not receive data from server: connection
> timeout
> scp: /opt/PostgresPlus/9.2AS/data/arch/000000AC000001F10000004A: No such
> file or directory
>
> I was asked about:
> In what occasion will the above fatal error occur?

It is a network error. The TCP/IP socket died somehow, and the FATAL
error logged. PostgreSQL then fell back to using the restore_command
defined in your recovery.conf file, but the WAL file had not yet been
shipped and the output logged. I imagine streaming replication happily
reconnected soon after. This is all quite normal.



--
Stuart Bishop <stuart@stuartbishop.net>
http://www.stuartbishop.net/