slave wal is ahead of master - Mailing list pgsql-performance

From Evgeniy Shishkin
Subject slave wal is ahead of master
Date
Msg-id 561E3FF0-C1B4-448A-AB33-176540E72D16@gmail.com
Whole thread Raw
Responses Re: slave wal is ahead of master  (Sethu Prasad <sethuprasad.in@gmail.com>)
List pgsql-performance
Hello,

we have 3 servers with postgresql 9.3.3. One is master and two slaves.
We run synchronous_replication and fsync, synchronous_commit and full_page_writes are on.

Suddenly master hang up with hardware failure, it is a strange bug in iLo which we investigate with HP.

Before master was rebooted, i ran ps aux on slave
postgres: wal receiver process   streaming 12/F1031DF8

Last messages in slaves logs was
2014-03-19 02:41:29.005 GMT,,,7389,,53108c69.1cdd,16029,,2014-02-28 13:17:29 GMT,,0,LOG,00000,"recovery restart point
at12/DFFBB3E8","last completed transaction was at log time 2014-03-19 02:41:28.886869+00",,,,,,,,"" 

and then there was silence, because master hang.

Then master was rebooted and slave wrote in log
2014-03-19 15:36:39.176 GMT,,,7392,,53108c69.1ce0,2,,2014-02-28 13:17:29 GMT,,0,FATAL,XX000,"terminating walreceiver
dueto timeout",,,,,,,,,"" 
2014-03-19 15:36:39.177 GMT,,,7388,,53108c69.1cdc,6,,2014-02-28 13:17:29 GMT,1/0,0,LOG,00000,"record with zero length
at12/F1031DF8",,,,,,,,,"" 
2014-03-19 15:36:57.181 GMT,,,12100,,5329b996.2f44,1,,2014-03-19 15:36:54 GMT,,0,FATAL,XX000,"could not connect to the
primaryserver: could not connect to server: No route to host 
        Is the server running on host ""10.162.2.50"" and accepting
        TCP/IP connections on port 5432?
",,,,,,,,,""

Then master finally came back, slave wrote
2014-03-19 15:40:09.389 GMT,,,13121,,5329ba59.3341,1,,2014-03-19 15:40:09 GMT,,0,FATAL,XX000,"could not connect to the
primaryserver: FATAL:  the database system is starting up 
",,,,,,,,,""
2014-03-19 15:40:16.468 GMT,,,13136,,5329ba5e.3350,1,,2014-03-19 15:40:14 GMT,,0,LOG,00000,"started streaming WAL from
primaryat 12/F1000000 on timeline 1",,,,,,,,,"" 
2014-03-19 15:40:16.468 GMT,,,13136,,5329ba5e.3350,2,,2014-03-19 15:40:14 GMT,,0,FATAL,XX000,"could not receive data
fromWAL stream: ERROR:  requested starting point 12/F1000000 is ahead of the WAL flush position of this server
12/F0FFFCE8
",,,,,,,,,""

last message was repeated several times
and then this happened

2014-03-19 15:42:04.623 GMT,,,13722,,5329bacc.359a,1,,2014-03-19 15:42:04 GMT,,0,LOG,00000,"started streaming WAL from
primaryat 12/F1000000 on timeline 1",,,,,,,,,"" 
2014-03-19 15:42:04.628 GMT,,,7388,,53108c69.1cdc,7,,2014-02-28 13:17:29 GMT,1/0,0,LOG,00000,"invalid record length at
12/F1031DF8",,,,,,,,,""
2014-03-19 15:42:04.628 GMT,,,13722,,5329bacc.359a,2,,2014-03-19 15:42:04 GMT,,0,FATAL,57P01,"terminating walreceiver
processdue to administrator command",,,,,,,,,"" 
2014-03-19 15:42:09.628 GMT,,,7388,,53108c69.1cdc,8,,2014-02-28 13:17:29 GMT,1/0,0,LOG,00000,"invalid record length at
12/F1031DF8",,,,,,,,,""
2014-03-19 15:42:14.628 GMT,,,7388,,53108c69.1cdc,9,,2014-02-28 13:17:29 GMT,1/0,0,LOG,00000,"invalid record length at
12/F1031DF8",,,,,,,,,""
2014-03-19 15:42:19.628 GMT,,,7388,,53108c69.1cdc,10,,2014-02-28 13:17:29 GMT,1/0,0,LOG,00000,"invalid record length at
12/F1031DF8",,,,,,,,,”" 
and it just repeats forever.


Meanwhile on master
2014-03-19 15:39:30.957 GMT,,,7115,,5329ba32.1bcb,2,,2014-03-19 15:39:30 GMT,,0,LOG,00000,"database system was not
properlyshut down; automatic recovery in progress",,,,,,,,,"" 
2014-03-19 15:39:30.989 GMT,,,7115,,5329ba32.1bcb,3,,2014-03-19 15:39:30 GMT,,0,LOG,00000,"redo starts at
12/DFFBB3E8",,,,,,,,,""
2014-03-19 15:39:47.114 GMT,,,7115,,5329ba32.1bcb,4,,2014-03-19 15:39:30 GMT,,0,LOG,00000,"redo done at
12/F0FFFC38",,,,,,,,,""
2014-03-19 15:39:47.114 GMT,,,7115,,5329ba32.1bcb,5,,2014-03-19 15:39:30 GMT,,0,LOG,00000,"last completed transaction
wasat log time 2014-03-19 05:02:29.273138+00",,,,,,,,,"" 
2014-03-19 15:39:47.115 GMT,,,7115,,5329ba32.1bcb,6,,2014-03-19 15:39:30 GMT,,0,LOG,00000,"checkpoint starting:
end-of-recoveryimmediate",,,,,,,,,"" 
2014-03-19 15:40:16.466 GMT,"replicator","",7986,"10.162.2.52:44336",5329ba5e.1f32,1,"idle",2014-03-19 15:40:14
GMT,2/0,0,ERROR,XX000,"requestedstarting point 12/F1000000 is ahead of the WAL flush position of this server
12/F0FFFCE8",,,,,,,,,"walreceiver"

So, all two slaves are disconnected from master, which somehow is past his slaves.

I decided to promote one of the slaves, so we can have some snapshot of the data.
relevant logs from this are
2014-03-19 16:50:43.118 GMT,,,4444,,5329cae3.115c,3,,2014-03-19 16:50:43 GMT,,0,LOG,00000,"redo starts at
12/DFFBB3E8",,,,,,,,,""
2014-03-19 16:50:50.028 GMT,"dboperator","postgres",4452,"[local]",5329caea.1164,1,"",2014-03-19 16:50:50
GMT,,0,FATAL,57P03,"thedatabase system is starting up",,,,,,,,,"" 
2014-03-19 16:50:51.128 GMT,,,4444,,5329cae3.115c,4,,2014-03-19 16:50:43 GMT,,0,LOG,00000,"invalid contrecord length
5736at 12/F0FFFC80",,,,,,,,,"" 
2014-03-19 16:50:51.128 GMT,,,4444,,5329cae3.115c,5,,2014-03-19 16:50:43 GMT,,0,LOG,00000,"redo done at
12/F0FFFC38",,,,,,,,,””

It is interesting that redo done at 12/F0FFFC38 both on master and promoted slave.

The main question is there is actual latest data, and how is it possible that master is behind his slave in synchronous
replication.

Thanks for the help.

pgsql-performance by date:

Previous
From: "acanada"
Date:
Subject: Re: Query taking long time
Next
From: Sethu Prasad
Date:
Subject: Re: slave wal is ahead of master