Thread: streaming replication timeout error
Hello:
My customer encountered some connection timeout, while using one primary-one standby streaming replication.
The original log is japanese, because there are no error-code like oracle's ora-xxx,
I tried to translate the japanese information into English, But that might be not correct English for PG.
The most important part is:
2013-09-22 09:52:47 JST[28297][51d1fbcb.6e89-2][0][XX000]FATAL: Could not receive data from WAL stream: could not receive data from server: connection timeout
scp: /opt/PostgresPlus/9.2AS/data/arch/000000AC000001F10000004A: No such file or directory
I was asked about:
In what occasion will the above fatal error occur?
I looked into the postgresql.conf file for the primary and standby server.
And made some experiences.
I found:
Senario I:
If the wal file wanted is removed manually:
Both in primary and standby, log will be like this:
FATAL: could not receive data from WAL stream: FATAL: requested WAL segment 000000010000000000000011 has already been removed
Senario II:
If replication_timeout=60, wal_receiver_status_interval=0,
log will be like this:
On primary side:
LOG: terminating walsender process due to replication timeout
On standby side:
FATAL: could not receive data from WAL stream:
LOG: streaming replication successfully connected to primary
But they are not the same to my customer's case.
And I also noticed that my customer's recovery.conf:
In it it has primary_conninfo as this:
host=DB1 port=5432 application_name=testpg user=postgres connect_timeout=10 keepalives_idle=5 keepalives_interval=1
That is to say:
keepalives_idle=5
keepalives_interval=1
keepalives_count=3
With that setting, I think that For every 5 seconds,
In order to see whether the connection is alive, 3 keepalives package will be sent,
And if there is no repsonse in 1 seconds, then he connection will be invalid.
If my unserstanding is right,
when master is busy at network,
the three keepalives_ parameter will also cause the connection time out error.
But I haven't found a good explanation fitting the logs' FATAL error.
Can anybody give me some info?
2013-09-22 09:40:42 JST[27987][412d33c5.2d23-95024][0][00000]LOG: Recovery replay at point 1F1/45EF8E20
2013-09-22 09:40:42 JST[27987][412d33c5.2d23-95025][0][00000]DETAIL: Last transaction log time 2013-09-22 09:40:42.673062+09
2013-09-22 09:44:00 JST[27987][412d33c5.2d23-95026][0][00000]LOG: restartpoint starting: time
2013-09-22 09:45:45 JST[27987][412d33c5.2d23-95027][0][00000]LOG: restartpoint complete: wrote 1051 buffers (0.3%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=104.773 s, sync=0.003 s, total=104.781 s; sync files=90, longest=0.001 s, average=0.000 s
2013-09-22 09:45:45 JST[27987][412d33c5.2d23-95028][0][00000]LOG: Recovery replay at point 1F1/45EF8E20
2013-09-22 09:45:45 JST[27987][412d33c5.2d23-95029][0][00000]DETAIL: Last transaction log time 2013-09-22 09:45:44.77683+09
2013-09-22 09:49:00 JST[27987][412d33c5.2d23-95030][0][00000]LOG: restartpoint starting: time
2013-09-22 09:50:36 JST[27987][412d33c5.2d23-95031][0][00000]LOG: restartpoint complete: wrote 963 buffers (0.3%); 0 transaction log file(s) added, 0 removed, 2 recycled; write=96.096 s, sync=0.006 s, total=96.106 s; sync files=98, longest=0.002 s, average=0.000 s
2013-09-22 09:50:36 JST[27987][412d33c5.2d23-95032][0][00000]LOG: Recovery replay at point 1F1/46CE8EA0
2013-09-22 09:50:36 JST[27987][412d33c5.2d23-95033][0][00000]DETAIL: Last transaction log time 2013-09-22 09:50:36.564966+09
2013-09-22 09:52:47 JST[28297][51d1fbcb.6e89-2][0][XX000]FATAL: Could not receive data from WAL stream: could not receive data from server: connection timeout
scp: /opt/PostgresPlus/9.2AS/data/arch/000000AC000001F10000004A: No such file or directory
2013-09-22 09:52:50 JST[27806][51d1fbc5.6c9e-12][0][00000]LOG: Log file 497、Segment 74、Offset 12263424 Page address 1F0/8DBB2000 is not expected
scp: /opt/PostgresPlus/9.2AS/data/arch/000000AC000001F10000004A: No such file or directory
scp: /opt/PostgresPlus/9.2AS/data/arch/000000AD.history: No such file or directory
2013-09-22 09:52:51 JST[2878][523e3f63.b3e-1][0][00000]LOG: Streaming replication connected to primary server successfully
2013-09-22 09:54:00 JST[27987][412d33c5.2d23-95034][0][00000]LOG: restartpoint starting: time
2013-09-22 09:56:30 JST[27987][412d33c5.2d23-95035][0][00000]LOG: restartpoint complete: wrote 9929 buffers (3.2%); 0 transaction log file(s) added, 0 removed, 10 recycled; write=149.436 s, sync=0.032 s, total=149.472 s; sync files=90, longest=0.012 s, average=0.000 s
2013-09-22 09:56:30 JST[27987][412d33c5.2d23-95036][0][00000]LOG: Recovery replay at point 1F1/48E9D238
2013-09-22 09:56:30 JST[27987][412d33c5.2d23-95037][0][00000]DETAIL: Last transaction log time 2013-09-22 09:56:29.015514+09
Thanks in advance
jian gao
On 10/08/2013 07:58 PM, 高健 wrote: > Hello: > > My customer encountered some connection timeout, while using one > primary-one standby streaming replication. > > The original log is japanese, because there are no error-code like > oracle's ora-xxx, > I tried to translate the japanese information into English, But that > might be not correct English for PG. > > > The most important part is: > > 2013-09-22 09:52:47 JST[28297][51d1fbcb.6e89-2][0][XX000]FATAL: Could > not receive data from WAL stream: could not receive data from server: > connection timeout > scp: /opt/PostgresPlus/9.2AS/data/arch/000000AC000001F10000004A: No such > file or directory > > I was asked about: > In what occasion will the above fatal error occur? > > I looked into the postgresql.conf file for the primary and standby server. > And made some experiences. > > I found: > Senario I: > > If the wal file wanted is removed manually: > > Both in primary and standby, log will be like this: > FATAL: could not receive data from WAL stream: FATAL: requested WAL > segment 000000010000000000000011 has already been removed > > > > But I haven't found a good explanation fitting the logs' FATAL error. > Can anybody give me some info? Would seem to me the interesting part is: scp: /opt/PostgresPlus/9.2AS/data/arch/000000AC000001F10000004A: No such file or directory Are using scp to move WAL files to an archive directory? If so, it seems scp is having issues, either network interruption or the file is disappearing under it. > > Thanks in advance > > jian gao > -- Adrian Klaver adrian.klaver@gmail.com
Hello:
Thanks for replying.
The recovery.conf file on standby(DB2) is like that:
standby_mode = 'on'
primary_conninfo = 'host=DB1 port=5432 application_name=testpg user=postgres connect_timeout=10 keepalives_idle=5 keepalives_interval=1'
recovery_target_timeline = 'latest'
restore_command = 'scp -o "ConnectTimeout 5" -i /opt/PostgresPlus/9.2AS/.ssh/id_edb DB1:/opt/PostgresPlus/9.2AS/data/arch/%f %p'
I am not familiar with the scp command, I think that here scp is used to copy archive wal log files from primary to standby...
Maybe the ConnectionTimeout is too small, And sometimes when network is not very well,
the restore_command will fail and return FATAL error?
In fact I am a little confused about restore_command, we are using streaming replication, but why restore_command is still needed to copy archive wal log, isn't it the old warm standby (file shipping)?
Best Regards
jian gao
2013/10/9 Adrian Klaver <adrian.klaver@gmail.com>
On 10/08/2013 07:58 PM, 高健 wrote:Hello:
My customer encountered some connection timeout, while using one
primary-one standby streaming replication.
The original log is japanese, because there are no error-code like
oracle's ora-xxx,
I tried to translate the japanese information into English, But that
might be not correct English for PG.
The most important part is:
2013-09-22 09:52:47 JST[28297][51d1fbcb.6e89-2][0][XX000]FATAL: Could
not receive data from WAL stream: could not receive data from server:
connection timeout
scp: /opt/PostgresPlus/9.2AS/data/arch/000000AC000001F10000004A: No such
file or directory
I was asked about:
In what occasion will the above fatal error occur?
I looked into the postgresql.conf file for the primary and standby server.
And made some experiences.
I found:
Senario I:
If the wal file wanted is removed manually:
Both in primary and standby, log will be like this:
FATAL: could not receive data from WAL stream: FATAL: requested WAL
segment 000000010000000000000011 has already been removedBut I haven't found a good explanation fitting the logs' FATAL error.
Can anybody give me some info?
Would seem to me the interesting part is:Are using scp to move WAL files to an archive directory?
scp: /opt/PostgresPlus/9.2AS/data/arch/000000AC000001F10000004A: No such file or directory
If so, it seems scp is having issues, either network interruption or the file is disappearing under it.--
Thanks in advance
jian gao
Adrian Klaver
adrian.klaver@gmail.com
On 10/09/2013 05:51 PM, 高健 wrote: > Hello: > > Thanks for replying. > > The recovery.conf file on standby(DB2) is like that: > > standby_mode = 'on' > primary_conninfo = 'host=DB1 port=5432 application_name=testpg > user=postgres connect_timeout=10 keepalives_idle=5 keepalives_interval=1' > recovery_target_timeline = 'latest' > restore_command = 'scp -o "ConnectTimeout 5" -i > /opt/PostgresPlus/9.2AS/.ssh/id_edb > DB1:/opt/PostgresPlus/9.2AS/data/arch/%f %p' > > > I am not familiar with the scp command, I think that here scp is used > to copy archive wal log files from primary to standby... > > Maybe the ConnectionTimeout is too small, And sometimes when network is > not very well, > the restore_command will fail and return FATAL error? > > In fact I am a little confused about restore_command, we are using > streaming replication, but why restore_command is still needed to copy > archive wal log, isn't it the old warm standby (file shipping)? Best explanation is in the docs: http://www.postgresql.org/docs/9.3/static/warm-standby.html " At startup, the standby begins by restoring all WAL available in the archive location, calling restore_command. Once it reaches the end of WAL available there and restore_command fails, it tries to restore any WAL available in the pg_xlog directory. If that fails, and streaming replication has been configured, the standby tries to connect to the primary server and start streaming WAL from the last valid record found in archive or pg_xlog. If that fails or streaming replication is not configured, or if the connection is later disconnected, the standby goes back to step 1 and tries to restore the file from the archive again. This loop of retries from the archive, pg_xlog, and via streaming replication goes on until the server is stopped or failover is triggered by a trigger file. " Basically by having a restore_command and primary_conninfo you are telling the standby to do both, following the sequence described above. FYI ConnectTimeout is a SSH option passed to scp. man ssh_config will get you more information. Would seem both your streaming and archiving are using the same network, is that correct? If so you have a single point of failure, the network. > > Best Regards > jian gao > > -- Adrian Klaver adrian.klaver@gmail.com
On Wed, Oct 9, 2013 at 9:58 AM, 高健 <luckyjackgao@gmail.com> wrote: > The most important part is: > > 2013-09-22 09:52:47 JST[28297][51d1fbcb.6e89-2][0][XX000]FATAL: Could not > receive data from WAL stream: could not receive data from server: connection > timeout > scp: /opt/PostgresPlus/9.2AS/data/arch/000000AC000001F10000004A: No such > file or directory > > I was asked about: > In what occasion will the above fatal error occur? It is a network error. The TCP/IP socket died somehow, and the FATAL error logged. PostgreSQL then fell back to using the restore_command defined in your recovery.conf file, but the WAL file had not yet been shipped and the output logged. I imagine streaming replication happily reconnected soon after. This is all quite normal. -- Stuart Bishop <stuart@stuartbishop.net> http://www.stuartbishop.net/
Thank you all a lot!
I have got it.
Best regards
2013/10/10 Stuart Bishop <stuart@stuartbishop.net>
On Wed, Oct 9, 2013 at 9:58 AM, 高健 <luckyjackgao@gmail.com> wrote:It is a network error. The TCP/IP socket died somehow, and the FATAL
> The most important part is:
>
> 2013-09-22 09:52:47 JST[28297][51d1fbcb.6e89-2][0][XX000]FATAL: Could not
> receive data from WAL stream: could not receive data from server: connection
> timeout
> scp: /opt/PostgresPlus/9.2AS/data/arch/000000AC000001F10000004A: No such
> file or directory
>
> I was asked about:
> In what occasion will the above fatal error occur?
error logged. PostgreSQL then fell back to using the restore_command
defined in your recovery.conf file, but the WAL file had not yet been
shipped and the output logged. I imagine streaming replication happily
reconnected soon after. This is all quite normal.
--
Stuart Bishop <stuart@stuartbishop.net>
http://www.stuartbishop.net/