Thread: Strange replication problem - segment restored from archive but still requested from master
Strange replication problem - segment restored from archive but still requested from master
From
Piotr Gasidło
Date:
Got a strange problem. Unable to repeat it, but I got logs.

Simple master-slave using streaming replication.
Master is running. Slave is down.
Segment 0000000400004C4D00000090 was successfully archived and sent
from master to slave.

Now I've started the slave, and:

May 21 21:23:37 d8 postgres[50645]: [3-1] 2015-05-21 21:23:37.033 CEST @ 50645 LOG: database system was shut down in recovery at 2015-05-21 21:22:03 CEST
May 21 21:23:37 d8 postgres[50645]: [4-1] 2015-05-21 21:23:37.034 CEST @ 50645 LOG: entering standby mode
May 21 21:23:37 d8 postgres[50645]: [5-1] 2015-05-21 21:23:37.058 CEST @ 50645 LOG: restored log file "0000000400004C4D00000088" from archive
May 21 21:23:37 d8 postgres[50645]: [6-1] 2015-05-21 21:23:37.120 CEST @ 50645 LOG: redo starts at 4C4D/88493B50
May 21 21:23:37 d8 postgres[50645]: [7-1] 2015-05-21 21:23:37.226 CEST @ 50645 LOG: restored log file "0000000400004C4D00000089" from archive
May 21 21:23:37 d8 postgres[50645]: [8-1] 2015-05-21 21:23:37.426 CEST @ 50645 LOG: restored log file "0000000400004C4D0000008A" from archive
May 21 21:23:37 d8 postgres[50645]: [9-1] 2015-05-21 21:23:37.750 CEST @ 50645 LOG: restored log file "0000000400004C4D0000008B" from archive
May 21 21:23:38 d8 postgres[50645]: [10-1] 2015-05-21 21:23:38.376 CEST @ 50645 LOG: restored log file "0000000400004C4D0000008C" from archive
May 21 21:23:38 d8 postgres[50645]: [11-1] 2015-05-21 21:23:38.690 CEST @ 50645 LOG: restored log file "0000000400004C4D0000008D" from archive
May 21 21:23:38 d8 postgres[50645]: [12-1] 2015-05-21 21:23:38.855 CEST @ 50645 LOG: restored log file "0000000400004C4D0000008E" from archive
May 21 21:23:39 d8 postgres[50645]: [13-1] 2015-05-21 21:23:39.275 CEST @ 50645 LOG: restored log file "0000000400004C4D0000008F" from archive
May 21 21:23:39 d8 postgres[50645]: [14-1] 2015-05-21 21:23:39.654 CEST @ 50645 LOG: restored log file "0000000400004C4D00000090" from archive
May 21 21:23:40 d8 postgres[50645]: [15-1] 2015-05-21 21:23:40.222 CEST @ 50645 LOG: consistent recovery state reached at 4C4D/90FFF9C8
May 21 21:23:40 d8 postgres[50644]: [3-1] 2015-05-21 21:23:40.222 CEST @ 50644 LOG: database system is ready to accept read only connections
May 21 21:23:40 d8 postgres[50645]: [16-1] 2015-05-21 21:23:40.223 CEST @ 50645 LOG: unexpected pageaddr 4C46/E000000 in log segment 0000000400004C4D00000091, offset 0
May 21 21:23:40 d8 postgres[50699]: [4-1] 2015-05-21 21:23:40.232 CEST @ 50699 LOG: started streaming WAL from primary at 4C4D/90000000 on timeline 4
May 21 21:23:40 d8 postgres[50699]: [5-1] 2015-05-21 21:23:40.232 CEST @ 50699 FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000400004C4D00000090 has already been removed
May 21 21:23:40 d8 postgres[50645]: [17-1] 2015-05-21 21:23:40.255 CEST @ 50645 LOG: restored log file "0000000400004C4D00000090" from archive
May 21 21:23:40 d8 postgres[50703]: [4-1] 2015-05-21 21:23:40.268 CEST @ 50703 LOG: started streaming WAL from primary at 4C4D/90000000 on timeline 4
May 21 21:23:40 d8 postgres[50703]: [5-1] 2015-05-21 21:23:40.268 CEST @ 50703 FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000400004C4D00000090 has already been removed
May 21 21:23:40 d8 postgres[50703]: [5-2]
...
(and so on)

So, as I understand:
- slave was started and entered restore,
- slave restored 0000000400004C4D00000090 from archive, reached consistent recovery state,
- now, it connected to the master and noticed that its 0000000400004C4D00000091 segment is incomplete,
- and then, started yelling about the segment missing on the master (the master had already deleted it after archiving).

Why?

I fixed it by stopping the slave, doing some operations on the master, copying 0000000400004C4D00000091 and 0000000400004C4D00000092 from the master, and starting the slave:

May 21 21:24:56 d8 postgres[50644]: [4-1] 2015-05-21 21:24:56.160 CEST @ 50644 LOG: received fast shutdown request
May 21 21:24:56 d8 postgres[50644]: [5-1] 2015-05-21 21:24:56.160 CEST @ 50644 LOG: aborting any active transactions
May 21 21:24:56 d8 postgres[50657]: [3-1] 2015-05-21 21:24:56.162 CEST @ 50657 LOG: shutting down
May 21 21:24:56 d8 postgres[50657]: [4-1] 2015-05-21 21:24:56.174 CEST @ 50657 LOG: database system is shut down
May 21 21:28:54 d8 postgres[53574]: [1-1] 2015-05-21 21:28:54.781 CEST @ 53574 LOG: loaded library "pg_stat_statements"
May 21 21:28:55 d8 postgres[53574]: [2-1] 2015-05-21 21:28:55.123 CEST @ 53574 LOG: ending log output to stderr
May 21 21:28:55 d8 postgres[53574]: [2-2] 2015-05-21 21:28:55.123 CEST @ 53574 HINT: Future log output will go to log destination "syslog".
May 21 21:28:55 d8 postgres[53575]: [3-1] 2015-05-21 21:28:55.124 CEST @ 53575 LOG: database system was shut down in recovery at 2015-05-21 21:24:56 CEST
May 21 21:28:55 d8 postgres[53575]: [4-1] 2015-05-21 21:28:55.126 CEST @ 53575 LOG: entering standby mode
May 21 21:28:55 d8 postgres[53575]: [5-1] 2015-05-21 21:28:55.150 CEST @ 53575 LOG: restored log file "0000000400004C4D00000088" from archive
May 21 21:28:55 d8 postgres[53575]: [6-1] 2015-05-21 21:28:55.209 CEST @ 53575 LOG: redo starts at 4C4D/88493B50
May 21 21:28:55 d8 postgres[53575]: [7-1] 2015-05-21 21:28:55.314 CEST @ 53575 LOG: restored log file "0000000400004C4D00000089" from archive
May 21 21:28:55 d8 postgres[53575]: [8-1] 2015-05-21 21:28:55.512 CEST @ 53575 LOG: restored log file "0000000400004C4D0000008A" from archive
May 21 21:28:55 d8 postgres[53575]: [9-1] 2015-05-21 21:28:55.831 CEST @ 53575 LOG: restored log file "0000000400004C4D0000008B" from archive
May 21 21:28:56 d8 postgres[53575]: [10-1] 2015-05-21 21:28:56.508 CEST @ 53575 LOG: restored log file "0000000400004C4D0000008C" from archive
May 21 21:28:56 d8 postgres[53575]: [11-1] 2015-05-21 21:28:56.818 CEST @ 53575 LOG: restored log file "0000000400004C4D0000008D" from archive
May 21 21:28:56 d8 postgres[53575]: [12-1] 2015-05-21 21:28:56.982 CEST @ 53575 LOG: restored log file "0000000400004C4D0000008E" from archive
May 21 21:28:57 d8 postgres[53575]: [13-1] 2015-05-21 21:28:57.415 CEST @ 53575 LOG: restored log file "0000000400004C4D0000008F" from archive
May 21 21:28:57 d8 postgres[53575]: [14-1] 2015-05-21 21:28:57.790 CEST @ 53575 LOG: restored log file "0000000400004C4D00000090" from archive
May 21 21:28:58 d8 postgres[53575]: [15-1] 2015-05-21 21:28:58.328 CEST @ 53575 LOG: consistent recovery state reached at 4C4D/90FFF9C8
May 21 21:28:58 d8 postgres[53574]: [3-1] 2015-05-21 21:28:58.328 CEST @ 53574 LOG: database system is ready to accept read only connections
May 21 21:28:58 d8 postgres[53575]: [16-1] 2015-05-21 21:28:58.349 CEST @ 53575 LOG: restored log file "0000000400004C4D00000091" from archive
May 21 21:28:59 d8 postgres[53575]: [17-1] 2015-05-21 21:28:59.962 CEST @ 53575 LOG: restored log file "0000000400004C4D00000092" from archive
May 21 21:29:00 d8 postgres[53575]: [18-1] 2015-05-21 21:29:00.037 CEST @ 53575 LOG: unexpected pageaddr 4C45/DC000000 in log segment 0000000400004C4D00000093, offset 0
May 21 21:29:00 d8 postgres[53640]: [4-1] 2015-05-21 21:29:00.047 CEST @ 53640 LOG: started streaming WAL from primary at 4C4D/92000000 on timeline 4

What was wrong?

PostgreSQL 9.3.6 on the slave and 9.3.4 on the master. I am unable to retest whether it would also happen with 9.3.6 on the master.

--
Piotr Gasidło
Re: Strange replication problem - segment restored from archive but still requested from master
From
Venkata Balaji N
Date:
On Fri, May 22, 2015 at 6:52 AM, Piotr Gasidło <quaker@barbara.eu.org> wrote:
It might be yelling about the WAL segment due to the delay in shipping it from master to slave.
Got a strange problem. Unable to repeat it, but I got logs.
Simple master-slave using streaming replication.
Master is running. Slave is down.
Segment 0000000400004C4D00000090 was successfully archived and sent
from master to slave.
Now I've started the slave, and:
May 21 21:23:37 d8 postgres[50645]: [3-1] 2015-05-21 21:23:37.033 CEST
@ 50645 LOG: database system was shut down in recovery at
2015-05-21 21:22:03 CEST
May 21 21:23:37 d8 postgres[50645]: [4-1] 2015-05-21 21:23:37.034 CEST
@ 50645 LOG: entering standby mode
May 21 21:23:37 d8 postgres[50645]: [5-1] 2015-05-21 21:23:37.058 CEST
@ 50645 LOG: restored log file "0000000400004C4D00000088" from
archive
May 21 21:23:37 d8 postgres[50645]: [6-1] 2015-05-21 21:23:37.120 CEST
@ 50645 LOG: redo starts at 4C4D/88493B50
May 21 21:23:37 d8 postgres[50645]: [7-1] 2015-05-21 21:23:37.226 CEST
@ 50645 LOG: restored log file "0000000400004C4D00000089" from
archive
May 21 21:23:37 d8 postgres[50645]: [8-1] 2015-05-21 21:23:37.426 CEST
@ 50645 LOG: restored log file "0000000400004C4D0000008A" from
archive
May 21 21:23:37 d8 postgres[50645]: [9-1] 2015-05-21 21:23:37.750 CEST
@ 50645 LOG: restored log file "0000000400004C4D0000008B" from
archive
May 21 21:23:38 d8 postgres[50645]: [10-1] 2015-05-21 21:23:38.376
CEST @ 50645 LOG: restored log file "0000000400004C4D0000008C" from
archive
May 21 21:23:38 d8 postgres[50645]: [11-1] 2015-05-21 21:23:38.690
CEST @ 50645 LOG: restored log file "0000000400004C4D0000008D" from
archive
May 21 21:23:38 d8 postgres[50645]: [12-1] 2015-05-21 21:23:38.855
CEST @ 50645 LOG: restored log file "0000000400004C4D0000008E" from
archive
May 21 21:23:39 d8 postgres[50645]: [13-1] 2015-05-21 21:23:39.275
CEST @ 50645 LOG: restored log file "0000000400004C4D0000008F" from
archive
May 21 21:23:39 d8 postgres[50645]: [14-1] 2015-05-21 21:23:39.654
CEST @ 50645 LOG: restored log file "0000000400004C4D00000090" from
archive
May 21 21:23:40 d8 postgres[50645]: [15-1] 2015-05-21 21:23:40.222
CEST @ 50645 LOG: consistent recovery state reached at
4C4D/90FFF9C8
May 21 21:23:40 d8 postgres[50644]: [3-1] 2015-05-21 21:23:40.222 CEST
@ 50644 LOG: database system is ready to accept read only
connections
May 21 21:23:40 d8 postgres[50645]: [16-1] 2015-05-21 21:23:40.223
CEST @ 50645 LOG: unexpected pageaddr 4C46/E000000 in log segment
0000000400004C4D00000091, offset 0
May 21 21:23:40 d8 postgres[50699]: [4-1] 2015-05-21 21:23:40.232 CEST
@ 50699 LOG: started streaming WAL from primary at 4C4D/90000000 on
timeline 4
May 21 21:23:40 d8 postgres[50699]: [5-1] 2015-05-21 21:23:40.232 CEST
@ 50699 FATAL: could not receive data from WAL stream: ERROR:
requested WAL segment 0000000400004C4D00000090 has already been
removed
May 21 21:23:40 d8 postgres[50645]: [17-1] 2015-05-21 21:23:40.255
CEST @ 50645 LOG: restored log file "0000000400004C4D00000090" from
archive
May 21 21:23:40 d8 postgres[50703]: [4-1] 2015-05-21 21:23:40.268 CEST
@ 50703 LOG: started streaming WAL from primary at 4C4D/90000000 on
timeline 4
May 21 21:23:40 d8 postgres[50703]: [5-1] 2015-05-21 21:23:40.268 CEST
@ 50703 FATAL: could not receive data from WAL stream: ERROR:
requested WAL segment 0000000400004C4D00000090 has already been
removed
May 21 21:23:40 d8 postgres[50703]: [5-2]
...
(and so on)
So, as I understand:
- slave was started and entered restore,
- slave restored 0000000400004C4D00000090 from archive, reached
consistent recovery state
- now, it connected to the master and noticed that its
0000000400004C4D00000091 segment is incomplete
- and then, started yelling about the segment missing on the master
(the master had already deleted it after archiving)
Why?
Do you have the restore_command set up in the recovery.conf file? Do you have any automated job shipping WAL archives from master to slave?
On the slave: if it does not find the WAL segment in the master's pg_xlog location, it looks for it in the location given in the restore_command. It yells if it is unable to find it there.
I cannot give any concrete comments or suggestions until the recovery.conf entries are known.
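For reference, a minimal recovery.conf for a 9.3 standby that combines streaming replication with archive recovery usually looks something like the sketch below; the host name, user and archive path are placeholders, not values from this thread:

standby_mode = 'on'
primary_conninfo = 'host=master.example.com port=5432 user=replicator'
restore_command = 'cp /path/to/wal_archive/%f "%p"'

With both primary_conninfo and restore_command set, the standby alternates between reading segments from the archive and streaming them from the primary.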
I fixed it by stopping the slave, doing some operations on the master, copying
0000000400004C4D00000091 and 0000000400004C4D00000092 from the master, and
starting the slave:
May 21 21:24:56 d8 postgres[50644]: [4-1] 2015-05-21 21:24:56.160 CEST
@ 50644 LOG: received fast shutdown request
May 21 21:24:56 d8 postgres[50644]: [5-1] 2015-05-21 21:24:56.160 CEST
@ 50644 LOG: aborting any active transactions
May 21 21:24:56 d8 postgres[50657]: [3-1] 2015-05-21 21:24:56.162 CEST
@ 50657 LOG: shutting down
May 21 21:24:56 d8 postgres[50657]: [4-1] 2015-05-21 21:24:56.174 CEST
@ 50657 LOG: database system is shut down
May 21 21:28:54 d8 postgres[53574]: [1-1] 2015-05-21 21:28:54.781 CEST
@ 53574 LOG: loaded library "pg_stat_statements"
May 21 21:28:55 d8 postgres[53574]: [2-1] 2015-05-21 21:28:55.123 CEST
@ 53574 LOG: ending log output to stderr
May 21 21:28:55 d8 postgres[53574]: [2-2] 2015-05-21 21:28:55.123 CEST
@ 53574 HINT: Future log output will go to log destination
"syslog".
May 21 21:28:55 d8 postgres[53575]: [3-1] 2015-05-21 21:28:55.124 CEST
@ 53575 LOG: database system was shut down in recovery at
2015-05-21 21:24:56 CEST
May 21 21:28:55 d8 postgres[53575]: [4-1] 2015-05-21 21:28:55.126 CEST
@ 53575 LOG: entering standby mode
May 21 21:28:55 d8 postgres[53575]: [5-1] 2015-05-21 21:28:55.150 CEST
@ 53575 LOG: restored log file "0000000400004C4D00000088" from
archive
May 21 21:28:55 d8 postgres[53575]: [6-1] 2015-05-21 21:28:55.209 CEST
@ 53575 LOG: redo starts at 4C4D/88493B50
May 21 21:28:55 d8 postgres[53575]: [7-1] 2015-05-21 21:28:55.314 CEST
@ 53575 LOG: restored log file "0000000400004C4D00000089" from
archive
May 21 21:28:55 d8 postgres[53575]: [8-1] 2015-05-21 21:28:55.512 CEST
@ 53575 LOG: restored log file "0000000400004C4D0000008A" from
archive
May 21 21:28:55 d8 postgres[53575]: [9-1] 2015-05-21 21:28:55.831 CEST
@ 53575 LOG: restored log file "0000000400004C4D0000008B" from
archive
May 21 21:28:56 d8 postgres[53575]: [10-1] 2015-05-21 21:28:56.508
CEST @ 53575 LOG: restored log file "0000000400004C4D0000008C" from
archive
May 21 21:28:56 d8 postgres[53575]: [11-1] 2015-05-21 21:28:56.818
CEST @ 53575 LOG: restored log file "0000000400004C4D0000008D" from
archive
May 21 21:28:56 d8 postgres[53575]: [12-1] 2015-05-21 21:28:56.982
CEST @ 53575 LOG: restored log file "0000000400004C4D0000008E" from
archive
May 21 21:28:57 d8 postgres[53575]: [13-1] 2015-05-21 21:28:57.415
CEST @ 53575 LOG: restored log file "0000000400004C4D0000008F" from
archive
May 21 21:28:57 d8 postgres[53575]: [14-1] 2015-05-21 21:28:57.790
CEST @ 53575 LOG: restored log file "0000000400004C4D00000090" from
archive
May 21 21:28:58 d8 postgres[53575]: [15-1] 2015-05-21 21:28:58.328
CEST @ 53575 LOG: consistent recovery state reached at
4C4D/90FFF9C8
May 21 21:28:58 d8 postgres[53574]: [3-1] 2015-05-21 21:28:58.328 CEST
@ 53574 LOG: database system is ready to accept read only
connections
May 21 21:28:58 d8 postgres[53575]: [16-1] 2015-05-21 21:28:58.349
CEST @ 53575 LOG: restored log file "0000000400004C4D00000091" from
archive
May 21 21:28:59 d8 postgres[53575]: [17-1] 2015-05-21 21:28:59.962
CEST @ 53575 LOG: restored log file "0000000400004C4D00000092" from
archive
May 21 21:29:00 d8 postgres[53575]: [18-1] 2015-05-21 21:29:00.037
CEST @ 53575 LOG: unexpected pageaddr 4C45/DC000000 in log segment
0000000400004C4D00000093, offset 0
May 21 21:29:00 d8 postgres[53640]: [4-1] 2015-05-21 21:29:00.047 CEST
@ 53640 LOG: started streaming WAL from primary at 4C4D/92000000 on
timeline 4
What was wrong?
This error message can be ignored. Did you try shipping 0000000400004C4D00000093 to the slave from the master's archive location to see whether the slave picked it up and applied it?
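For illustration, manually shipping a segment from the master's archive to the directory the slave's restore_command reads from could look like the sketch below; the host and paths are placeholders rather than the ones used in this setup:

scp /path/to/master_archive/0000000400004C4D00000093 \
    postgres@standby.example.com:/path/to/standby_wal_archive/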
Regards,
Venkata B N
Fujitsu Australia
Re: Strange replication problem - segment restored from archive but still requested from master
From
Fujii Masao
Date:
On Fri, May 22, 2015 at 5:52 AM, Piotr Gasidło <quaker@barbara.eu.org> wrote:
> Got a strange problem. Unable to repeat it, but I got logs.
>
> Simple master-slave using streaming replication.
> Master is running. Slave is down.
> Segment 0000000400004C4D00000090 was successfully archived and sent
> from master to slave.
>
> Now I've started the slave, and:
>
> [...]
>
> So, as I understand:
> - slave was started and entered restore,
> - slave restored 0000000400004C4D00000090 from archive, reached
> consistent recovery state,
> - now, it connected to the master and noticed that its
> 0000000400004C4D00000091 segment is incomplete,
> - and then, started yelling about the segment missing on the master
> (the master had already deleted it after archiving).
>
> Why?

Thanks for the report! This seems to be a bug.

This problem happens when WAL record is stored in separate two WAL files and
there is no valid latter WAL file in the standby. In your case, the former file
is 0000000400004C4D00000090 and the latter is 0000000400004C4D00000091.

In this case, the first half of WAL record can be read from the former WAL file,
but the remaining half not because no valid latter file exists in the standby.
Then the standby tries to retrieve the latter WAL file via replication.
The problem here is that the standby tries to start the replication from the
starting point of WAL record, i.e., that's the location of the former WAL file.
So the already-read WAL file is requested via replication.

To address this problem, maybe we can just use the location where the remaining
half of the WAL record starts as the replication starting location. We would
need to look carefully at the side effects of that change, though.

Regards,

--
Fujii Masao
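As an illustration of the segment boundary involved here (not part of the original mail): pg_xlogfile_name() maps an LSN to the segment that contains it, so run on the master (timeline 4) it shows that the record starting at the "consistent recovery state" LSN sits at the very end of segment ...90, with its remainder continuing in ...91. The function was renamed to pg_walfile_name() in PostgreSQL 10.

-- run on the master; pg_xlogfile_name() cannot be executed on a standby in recovery
SELECT pg_xlogfile_name('4C4D/90FFF9C8');
--  0000000400004C4D00000090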
Re: Strange replication problem - segment restored from archive but still requested from master
From
Piotr Gasidło
Date:
2015-05-22 2:16 GMT+02:00 Venkata Balaji N <nag1010@gmail.com>:
>
> It might be yelling about the WAL segment due to the delay in shipping it
> from master to slave.
> Do you have the restore_command set up in the recovery.conf file? Do you
> have any automated job shipping WAL archives from master to slave?

Both, yes. I've used OmniPITR, heavily logged.

So this is on the master:

postgresql.conf:

archive_mode = on
archive_command = '/usr/local/pgsql/scripts/omnipitr/bin/omnipitr-archive --config-file /usr/local/pgsql/core/scripts/omnipitr/config/archive "%p"'

/usr/local/pgsql/core/scripts/omnipitr/config/archive:

--verbose
--data-dir=/usr/local/pgsql/core/data/
--dst-remote=pgsql@d0.XXX:/usr/local/pgsql/core/wal/
--dst-remote=pgsql@d8.XXX:/usr/local/pgsql/core/wal/
--state-dir=/usr/local/pgsql/core/scripts/omnipitr/state
--log=/usr/local/pgsql/core/scripts/omnipitr/log/omnipitr.log

/usr/local/pgsql/core/scripts/omnipitr/log/omnipitr.log, part:

2015-05-21 20:35:32.404395 +0200 : 98206 : omnipitr-archive : LOG : Segment /usr/local/pgsql/core/data/pg_xlog/0000000400004C4D0000008E successfully sent to all destinations.
2015-05-21 20:35:59.993614 +0200 : 98214 : omnipitr-archive : LOG : Called with parameters: --config-file /usr/local/pgsql/core/scripts/omnipitr/config/archive pg_xlog/0000000400004C4D0000008F
2015-05-21 20:36:00.495839 +0200 : 98214 : omnipitr-archive : LOG : Sending /usr/local/pgsql/core/data/pg_xlog/0000000400004C4D0000008F to pgsql@d0.XXX:/usr/local/pgsql/core/wal/0000000400004C4D0000008F ended in 0.500619s
2015-05-21 20:36:00.896906 +0200 : 98214 : omnipitr-archive : LOG : Sending /usr/local/pgsql/core/data/pg_xlog/0000000400004C4D0000008F to pgsql@d8.XXX:/usr/local/pgsql/core/wal/0000000400004C4D0000008F ended in 0.396684s
2015-05-21 20:36:00.897467 +0200 : 98214 : omnipitr-archive : LOG : Timer [Segment delivery] took: 0.903s
2015-05-21 20:36:00.901777 +0200 : 98214 : omnipitr-archive : LOG : Segment /usr/local/pgsql/core/data/pg_xlog/0000000400004C4D0000008F successfully sent to all destinations.
2015-05-21 20:36:33.916879 +0200 : 98222 : omnipitr-archive : LOG : Called with parameters: --config-file /usr/local/pgsql/core/scripts/omnipitr/config/archive pg_xlog/0000000400004C4D00000090
2015-05-21 20:36:34.379965 +0200 : 98222 : omnipitr-archive : LOG : Sending /usr/local/pgsql/core/data/pg_xlog/0000000400004C4D00000090 to pgsql@d0.XXX:/usr/local/pgsql/core/wal/0000000400004C4D00000090 ended in 0.458563s
2015-05-21 20:36:34.931918 +0200 : 98222 : omnipitr-archive : LOG : Sending /usr/local/pgsql/core/data/pg_xlog/0000000400004C4D00000090 to pgsql@d8.XXX:/usr/local/pgsql/core/wal/0000000400004C4D00000090 ended in 0.550890s
2015-05-21 20:36:34.935241 +0200 : 98222 : omnipitr-archive : LOG : Timer [Segment delivery] took: 1.015s
2015-05-21 20:36:34.935776 +0200 : 98222 : omnipitr-archive : LOG : Segment /usr/local/pgsql/core/data/pg_xlog/0000000400004C4D00000090 successfully sent to all destinations.
2015-05-21 21:12:38.242687 +0200 : 561 : omnipitr-archive : LOG : Called with parameters: --config-file /usr/local/pgsql/core/scripts/omnipitr/config/archive pg_xlog/0000000400004C4D00000090
2015-05-21 21:12:38.425904 +0200 : 561 : omnipitr-archive : LOG : Sending /usr/local/pgsql/core/data/pg_xlog/0000000400004C4D00000090 to pgsql@d0.XXX:/usr/local/pgsql/core/wal/0000000400004C4D00000090 ended in 0.181528s
2015-05-21 21:12:38.969018 +0200 : 561 : omnipitr-archive : LOG : Sending /usr/local/pgsql/core/data/pg_xlog/0000000400004C4D00000090 to pgsql@d8.XXX:/usr/local/pgsql/core/wal/0000000400004C4D00000090 ended in 0.539258s
2015-05-21 21:12:38.971513 +0200 : 561 : omnipitr-archive : LOG : Timer [Segment delivery] took: 0.728s
2015-05-21 21:12:38.973229 +0200 : 561 : omnipitr-archive : LOG : Segment /usr/local/pgsql/core/data/pg_xlog/0000000400004C4D00000090 successfully sent to all destinations.
2015-05-21 21:28:51.389622 +0200 : 1634 : omnipitr-archive : LOG : Called with parameters: --config-file /usr/local/pgsql/core/scripts/omnipitr/config/archive pg_xlog/0000000400004C4D00000091
2015-05-21 21:28:52.457545 +0200 : 1634 : omnipitr-archive : LOG : Sending /usr/local/pgsql/core/data/pg_xlog/0000000400004C4D00000091 to pgsql@d0.XXX:/usr/local/pgsql/core/wal/0000000400004C4D00000091 ended in 1.065172s
2015-05-21 21:28:52.908541 +0200 : 1634 : omnipitr-archive : LOG : Sending /usr/local/pgsql/core/data/pg_xlog/0000000400004C4D00000091 to pgsql@d8.XXX:/usr/local/pgsql/core/wal/0000000400004C4D00000091 ended in 0.449934s
2015-05-21 21:28:52.910337 +0200 : 1634 : omnipitr-archive : LOG : Timer [Segment delivery] took: 1.519s
2015-05-21 21:28:52.910902 +0200 : 1634 : omnipitr-archive : LOG : Segment /usr/local/pgsql/core/data/pg_xlog/0000000400004C4D00000091 successfully sent to all destinations.
2015-05-21 21:28:53.028408 +0200 : 1639 : omnipitr-archive : LOG : Called with parameters: --config-file /usr/local/pgsql/core/scripts/omnipitr/config/archive pg_xlog/0000000400004C4D00000092
2015-05-21 21:28:53.783568 +0200 : 1639 : omnipitr-archive : LOG : Sending /usr/local/pgsql/core/data/pg_xlog/0000000400004C4D00000092 to pgsql@d0.XXX:/usr/local/pgsql/core/wal/0000000400004C4D00000092 ended in 0.753511s
2015-05-21 21:28:54.260492 +0200 : 1639 : omnipitr-archive : LOG : Sending /usr/local/pgsql/core/data/pg_xlog/0000000400004C4D00000092 to pgsql@d8.XXX:/usr/local/pgsql/core/wal/0000000400004C4D00000092 ended in 0.473037s
2015-05-21 21:28:54.261054 +0200 : 1639 : omnipitr-archive : LOG : Timer [Segment delivery] took: 1.232s
2015-05-21 21:28:54.262764 +0200 : 1639 : omnipitr-archive : LOG : Segment /usr/local/pgsql/core/data/pg_xlog/0000000400004C4D00000092 successfully sent to all destinations.
2015-05-21 21:30:06.034584 +0200 : 1870 : omnipitr-archive : LOG : Called with parameters: --config-file /usr/local/pgsql/core/scripts/omnipitr/config/archive pg_xlog/0000000400004C4D00000093
2015-05-21 21:30:06.920507 +0200 : 1870 : omnipitr-archive : LOG : Sending /usr/local/pgsql/core/data/pg_xlog/0000000400004C4D00000093 to pgsql@d0.XXX:/usr/local/pgsql/core/wal/0000000400004C4D00000093 ended in 0.883097s
2015-05-21 21:30:07.391485 +0200 : 1870 : omnipitr-archive : LOG : Sending /usr/local/pgsql/core/data/pg_xlog/0000000400004C4D00000093 to pgsql@d8.XXX:/usr/local/pgsql/core/wal/0000000400004C4D00000093 ended in 0.469941s
2015-05-21 21:30:07.394826 +0200 : 1870 : omnipitr-archive : LOG : Timer [Segment delivery] took: 1.358s

But now I see an unexpected problem.
The shipping of 0000000400004C4D00000090 was run twice. But that was probably caused by me - I tried to put the file back into the master's pg_xlog from d0.XXX and ran CHECKPOINT on the master. That didn't work out.

I have these WAL files:

d0:/usr/local/pgsql/core/wal # ls -laT 0000000400004C4D00000090 0000000400004C4D00000091 0000000400004C4D00000092 0000000400004C4D00000093
-rw------- 1 pgsql pgsql 16777216 May 21 20:36:33 2015 0000000400004C4D00000090
-rw------- 1 pgsql pgsql 16777216 May 21 21:28:51 2015 0000000400004C4D00000091
-rw------- 1 pgsql pgsql 16777216 May 21 21:28:52 2015 0000000400004C4D00000092
-rw------- 1 pgsql pgsql 16777216 May 21 21:30:05 2015 0000000400004C4D00000093

> On the slave: if it does not find the WAL segment in the master's pg_xlog
> location, it looks for it in the location given in the restore_command. It
> yells if it is unable to find it there.
> I cannot give any concrete comments or suggestions until the recovery.conf
> entries are known.

On the slave, recovery.conf: at the time of writing it, I switched from OmniPITR:

restore_command = '/usr/local/pgsql/scripts/omnipitr/bin/omnipitr-restore --config-file /usr/local/pgsql/core/scripts/omnipitr/config/restore -sr "%f" "%p"'

to:

restore_command = 'cp /usr/local/pgsql/core/wal/%f %p'

and manually rsynced the files from d0.XXX to the /usr/local/pgsql/core/wal/ directory.

>> What was wrong?
>
> This error message can be ignored. Did you try shipping
> 0000000400004C4D00000093 to the slave from the master's archive location
> to see whether the slave picked it up and applied it?

As I wrote, I had only shipped up to 0000000400004C4D00000092. File 0000000400004C4D00000093 was archived after the slave managed to start and successfully began streaming WAL.

But I did do something like this. When I noticed that the slave was OK, I stopped it, changed back to the OmniPITR restore_command, and started it again. So here is the full log:

-- here is where the slave successfully starts for the first time:

May 21 21:28:55 d8 postgres[53574]: [2-1] 2015-05-21 21:28:55.123 CEST @ 53574 LOG: ending log output to stderr
May 21 21:28:55 d8 postgres[53574]: [2-2] 2015-05-21 21:28:55.123 CEST @ 53574 HINT: Future log output will go to log destination "syslog".
May 21 21:28:55 d8 postgres[53575]: [3-1] 2015-05-21 21:28:55.124 CEST @ 53575 LOG: database system was shut down in recovery at 2015-05-21 21:24:56 CEST
May 21 21:28:55 d8 postgres[53575]: [4-1] 2015-05-21 21:28:55.126 CEST @ 53575 LOG: entering standby mode
May 21 21:28:55 d8 postgres[53575]: [5-1] 2015-05-21 21:28:55.150 CEST @ 53575 LOG: restored log file "0000000400004C4D00000088" from archive
May 21 21:28:55 d8 postgres[53575]: [6-1] 2015-05-21 21:28:55.209 CEST @ 53575 LOG: redo starts at 4C4D/88493B50
May 21 21:28:55 d8 postgres[53575]: [7-1] 2015-05-21 21:28:55.314 CEST @ 53575 LOG: restored log file "0000000400004C4D00000089" from archive
May 21 21:28:55 d8 postgres[53575]: [8-1] 2015-05-21 21:28:55.512 CEST @ 53575 LOG: restored log file "0000000400004C4D0000008A" from archive
May 21 21:28:55 d8 postgres[53585]: [3-1] 2015-05-21 21:28:55.781 CEST pgsql@postgres 53585 [local] [unknown] FATAL: the database system is starting up
May 21 21:28:55 d8 postgres[53575]: [9-1] 2015-05-21 21:28:55.831 CEST @ 53575 LOG: restored log file "0000000400004C4D0000008B" from archive
May 21 21:28:56 d8 postgres[53575]: [10-1] 2015-05-21 21:28:56.508 CEST @ 53575 LOG: restored log file "0000000400004C4D0000008C" from archive
May 21 21:28:56 d8 postgres[53594]: [3-1] 2015-05-21 21:28:56.782 CEST pgsql@postgres 53594 [local] [unknown] FATAL: the database system is starting up
May 21 21:28:56 d8 postgres[53575]: [11-1] 2015-05-21 21:28:56.818 CEST @ 53575 LOG: restored log file "0000000400004C4D0000008D" from archive
May 21 21:28:56 d8 postgres[53575]: [12-1] 2015-05-21 21:28:56.982 CEST @ 53575 LOG: restored log file "0000000400004C4D0000008E" from archive
May 21 21:28:57 d8 postgres[53575]: [13-1] 2015-05-21 21:28:57.415 CEST @ 53575 LOG: restored log file "0000000400004C4D0000008F" from archive
May 21 21:28:57 d8 postgres[53607]: [3-1] 2015-05-21 21:28:57.784 CEST pgsql@postgres 53607 [local] [unknown] FATAL: the database system is starting up
May 21 21:28:57 d8 postgres[53575]: [14-1] 2015-05-21 21:28:57.790 CEST @ 53575 LOG: restored log file "0000000400004C4D00000090" from archive
May 21 21:28:58 d8 postgres[53575]: [15-1] 2015-05-21 21:28:58.328 CEST @ 53575 LOG: consistent recovery state reached at 4C4D/90FFF9C8
May 21 21:28:58 d8 postgres[53574]: [3-1] 2015-05-21 21:28:58.328 CEST @ 53574 LOG: database system is ready to accept read only connections
May 21 21:28:58 d8 postgres[53575]: [16-1] 2015-05-21 21:28:58.349 CEST @ 53575 LOG: restored log file "0000000400004C4D00000091" from archive
May 21 21:28:58 d8 postgres[53618]: [4-1] 2015-05-21 21:28:58.786 CEST pgsql@postgres 53618 [local] [unknown] FATAL: role "pgsql" does not exist
May 21 21:28:59 d8 postgres[53575]: [17-1] 2015-05-21 21:28:59.962 CEST @ 53575 LOG: restored log file "0000000400004C4D00000092" from archive
May 21 21:29:00 d8 postgres[53575]: [18-1] 2015-05-21 21:29:00.037 CEST @ 53575 LOG: unexpected pageaddr 4C45/DC000000 in log segment 0000000400004C4D00000093, offset 0
May 21 21:29:00 d8 postgres[53640]: [4-1] 2015-05-21 21:29:00.047 CEST @ 53640 LOG: started streaming WAL from primary at 4C4D/92000000 on timeline 4

-- I stopped it.
May 21 21:30:13 d8 postgres[53574]: [4-1] 2015-05-21 21:30:13.417 CEST @ 53574 LOG: received fast shutdown request
May 21 21:30:13 d8 postgres[53574]: [5-1] 2015-05-21 21:30:13.418 CEST @ 53574 LOG: aborting any active transactions
May 21 21:30:13 d8 postgres[53640]: [5-1] 2015-05-21 21:30:13.441 CEST @ 53640 FATAL: terminating walreceiver process due to administrator command
May 21 21:30:13 d8 postgres[53579]: [3-1] 2015-05-21 21:30:13.443 CEST @ 53579 LOG: shutting down
May 21 21:30:13 d8 postgres[53579]: [4-1] 2015-05-21 21:30:13.443 CEST @ 53579 LOG: restartpoint starting: shutdown immediate
May 21 21:30:15 d8 postgres[53579]: [5-1] 2015-05-21 21:30:15.094 CEST @ 53579 LOG: restartpoint complete: wrote 25212 buffers (0.8%); 0 transaction log file(s) added, 0 removed, 4 recycled; write=0.946 s, sync=0.629 s, total=1.651 s; sync files=66, longest=0.245 s, aver
May 21 21:30:15 d8 postgres[53579]: [6-1] 2015-05-21 21:30:15.094 CEST @ 53579 LOG: recovery restart point at 4C4D/91A59348
May 21 21:30:15 d8 postgres[53579]: [6-2] 2015-05-21 21:30:15.094 CEST @ 53579 DETAIL: last completed transaction was at log time 2015-05-21 21:30:13.377052+02
May 21 21:30:15 d8 postgres[53579]: [7-1] 2015-05-21 21:30:15.390 CEST @ 53579 LOG: database system is shut down

-- I changed restore_command; in the background the master had managed to ship 0000000400004C4D00000093.
-- Starting up.

May 21 21:30:25 d8 postgres[54546]: [1-1] 2015-05-21 21:30:25.983 CEST @ 54546 LOG: loaded library "pg_stat_statements"
May 21 21:30:26 d8 postgres[54546]: [2-1] 2015-05-21 21:30:26.329 CEST @ 54546 LOG: ending log output to stderr
May 21 21:30:26 d8 postgres[54546]: [2-2] 2015-05-21 21:30:26.329 CEST @ 54546 HINT: Future log output will go to log destination "syslog".
May 21 21:30:26 d8 postgres[54547]: [3-1] 2015-05-21 21:30:26.331 CEST @ 54547 LOG: database system was shut down in recovery at 2015-05-21 21:30:15 CEST
May 21 21:30:26 d8 postgres[54547]: [4-1] 2015-05-21 21:30:26.626 CEST @ 54547 LOG: entering standby mode
May 21 21:30:27 d8 postgres[54563]: [3-1] 2015-05-21 21:30:27.026 CEST pgsql@postgres 54563 [local] [unknown] FATAL: the database system is starting up
May 21 21:30:27 d8 postgres[54547]: [5-1] 2015-05-21 21:30:27.068 CEST @ 54547 LOG: restored log file "0000000400004C4D00000091" from archive
May 21 21:30:27 d8 postgres[54547]: [6-1] 2015-05-21 21:30:27.129 CEST @ 54547 LOG: redo starts at 4C4D/91A59348
May 21 21:30:27 d8 postgres[54547]: [7-1] 2015-05-21 21:30:27.301 CEST @ 54547 LOG: restored log file "0000000400004C4D00000092" from archive
May 21 21:30:27 d8 postgres[54547]: [8-1] 2015-05-21 21:30:27.532 CEST @ 54547 LOG: restored log file "0000000400004C4D00000093" from archive
May 21 21:30:27 d8 postgres[54547]: [9-1] 2015-05-21 21:30:27.957 CEST @ 54547 LOG: consistent recovery state reached at 4C4D/946CE620
May 21 21:30:27 d8 postgres[54547]: [10-1] 2015-05-21 21:30:27.957 CEST @ 54547 LOG: unexpected pageaddr 4C45/CD6D4000 in log segment 0000000400004C4D00000094, offset 7159808
May 21 21:30:27 d8 postgres[54546]: [3-1] 2015-05-21 21:30:27.957 CEST @ 54546 LOG: database system is ready to accept read only connections
May 21 21:30:27 d8 postgres[54580]: [4-1] 2015-05-21 21:30:27.969 CEST @ 54580 LOG: started streaming WAL from primary at 4C4D/94000000 on timeline 4
May 21 21:30:28 d8 postgres[54581]: [4-1] 2015-05-21 21:30:28.039 CEST pgsql@postgres 54581 [local] [unknown] FATAL: role "pgsql" does not exist
May 21 21:36:58 d8 postgres[54564]: [3-1] 2015-05-21 21:36:58.080 CEST @ 54564 LOG: restartpoint starting: time

-- And everything goes OK.

--
Piotr Gasidło
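As an aside, and not something used in this thread: for a setup without OmniPITR, a bare-bones archive_command that ships each completed segment to a single archive host might look roughly like the sketch below (host and path are placeholders; the command must return non-zero whenever the copy fails, so that PostgreSQL retries the segment later):

# postgresql.conf on the master (sketch only)
archive_mode = on
archive_command = 'rsync -a %p pgsql@archive.example.com:/path/to/wal_archive/%f'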
Re: Strange replication problem - segment restored from archive but still requested from master
From
Piotr Gasidło
Date:
2015-05-22 6:55 GMT+02:00 Fujii Masao <masao.fujii@gmail.com>:
> Thanks for the report! This seems to be a bug.
>
> This problem happens when WAL record is stored in separate two WAL files and
> there is no valid latter WAL file in the standby. In your case, the former file
> is 0000000400004C4D00000090 and the latter is 0000000400004C4D00000091.

I've got both files. Here is what I got when I pg_xlogdump-ed them:

$ cat dump1

0000000400004C4D00000090.dump

...
rmgr: Heap2 len (rec/tot): 24/ 7772, tx: 0, lsn: 4C4D/90FF6DB8, prev 4C4D/90FF51E8, bkp: 1000, desc: clean: rel 1663/131438/10079072; blk 13843 remxid 472083193
rmgr: Heap2 len (rec/tot): 504/ 536, tx: 0, lsn: 4C4D/90FF8C30, prev 4C4D/90FF6DB8, bkp: 0000, desc: freeze_page: rel 1663/131438/10079072; blk 13843; cutoff xid 422325322 ntuples 40
rmgr: Heap2 len (rec/tot): 24/ 5632, tx: 0, lsn: 4C4D/90FF8E48, prev 4C4D/90FF8C30, bkp: 1000, desc: freeze_page: rel 1663/131438/10079072; blk 13844; cutoff xid 422325322 ntuples 30
rmgr: Heap2 len (rec/tot): 24/ 7528, tx: 0, lsn: 4C4D/90FFA460, prev 4C4D/90FF8E48, bkp: 1000, desc: clean: rel 1663/131438/10079072; blk 13845 remxid 472083193
rmgr: Heap2 len (rec/tot): 564/ 596, tx: 0, lsn: 4C4D/90FFC1E0, prev 4C4D/90FFA460, bkp: 0000, desc: freeze_page: rel 1663/131438/10079072; blk 13845; cutoff xid 422325322 ntuples 45
rmgr: Heap2 len (rec/tot): 24/ 5316, tx: 0, lsn: 4C4D/90FFC438, prev 4C4D/90FFC1E0, bkp: 1000, desc: clean: rel 1663/131438/10079072; blk 13846 remxid 472083193
rmgr: Heap2 len (rec/tot): 420/ 452, tx: 0, lsn: 4C4D/90FFD900, prev 4C4D/90FFC438, bkp: 0000, desc: freeze_page: rel 1663/131438/10079072; blk 13846; cutoff xid 422325322 ntuples 33
rmgr: Heap2 len (rec/tot): 24/ 7856, tx: 0, lsn: 4C4D/90FFDAC8, prev 4C4D/90FFD900, bkp: 1000, desc: freeze_page: rel 1663/131438/10079072; blk 13847; cutoff xid 422325322 ntuples 53
rmgr: Heap2 len (rec/tot): 20/ 52, tx: 0, lsn: 4C4D/90FFF990, prev 4C4D/90FFDAC8, bkp: 0000, desc: visible: rel 1663/131438/10079072; blk 13847

0000000400004C4D00000091.dump

rmgr: Heap2 len (rec/tot): 600/ 632, tx: 0, lsn: 4C4D/910014A0, prev 4C4D/90FFF9C8, bkp: 0000, desc: freeze_page: rel 1663/131438/10079072; blk 13848; cutoff xid 422325322 ntuples 48
rmgr: Heap2 len (rec/tot): 24/ 6640, tx: 0, lsn: 4C4D/91001718, prev 4C4D/910014A0, bkp: 1000, desc: freeze_page: rel 1663/131438/10079072; blk 13849; cutoff xid 422325322 ntuples 42
rmgr: Heap2 len (rec/tot): 24/ 7848, tx: 0, lsn: 4C4D/91003120, prev 4C4D/91001718, bkp: 1000, desc: freeze_page: rel 1663/131438/10079072; blk 13850; cutoff xid 422325322 ntuples 36
rmgr: Heap2 len (rec/tot): 20/ 52, tx: 0, lsn: 4C4D/91004FE0, prev 4C4D/91003120, bkp: 0000, desc: visible: rel 1663/131438/10079072; blk 13850
rmgr: Heap2 len (rec/tot): 24/ 6208, tx: 0, lsn: 4C4D/91005018, prev 4C4D/91004FE0, bkp: 1000, desc: clean: rel 1663/131438/10079072; blk 13851 remxid 472083193
rmgr: Heap2 len (rec/tot): 564/ 596, tx: 0, lsn: 4C4D/91006870, prev 4C4D/91005018, bkp: 0000, desc: freeze_page: rel 1663/131438/10079072; blk 13851; cutoff xid 422325322 ntuples 45
rmgr: Heap2 len (rec/tot): 24/ 7340, tx: 0, lsn: 4C4D/91006AC8, prev 4C4D/91006870, bkp: 1000, desc: freeze_page: rel 1663/131438/10079072; blk 13852; cutoff xid 422325322 ntuples 39
rmgr: Heap2 len (rec/tot): 24/ 7744, tx: 0, lsn: 4C4D/91008790, prev 4C4D/91006AC8, bkp: 1000, desc: clean: rel 1663/131438/10079072; blk 13853 remxid 472083193
rmgr: Heap2 len (rec/tot): 492/ 524, tx: 0, lsn: 4C4D/9100A5E8, prev 4C4D/91008790, bkp: 0000, desc: freeze_page: rel 1663/131438/10079072; blk 13853; cutoff xid 422325322 ntuples 39
rmgr: Heap2 len (rec/tot): 24/ 7152, tx: 0, lsn: 4C4D/9100A7F8, prev 4C4D/9100A5E8, bkp: 1000, desc: freeze_page: rel 1663/131438/10079072; blk 13854; cutoff xid 422325322 ntuples 32
rmgr: Heap2 len (rec/tot): 24/ 7380, tx: 0, lsn: 4C4D/9100C400, prev 4C4D/9100A7F8, bkp: 1000, desc: freeze_page: rel 1663/131438/10079072; blk 13855; cutoff xid 422325322 ntuples 42
rmgr: Heap2 len (rec/tot): 24/ 6784, tx: 0, lsn: 4C4D/9100E0F0, prev 4C4D/9100C400, bkp: 1000, desc: clean: rel 1663/131438/10079072; blk 13856 remxid 472083193
rmgr: Heap2 len (rec/tot): 408/ 440, tx: 0, lsn: 4C4D/9100FB70, prev 4C4D/9100E0F0, bkp: 0000, desc: freeze_page: rel 1663/131438/10079072; blk 13856; cutoff xid 422325322 ntuples 32
rmgr: Heap2 len (rec/tot): 24/ 7312, tx: 0, lsn: 4C4D/9100FD28, prev 4C4D/9100FB70, bkp: 1000, desc: clean: rel 1663/131438/10079072; blk 13857 remxid 472083193
rmgr: Heap2 len (rec/tot): 432/ 464, tx: 0, lsn: 4C4D/910119D0, prev 4C4D/9100FD28, bkp: 0000, desc: freeze_page: rel 1663/131438/10079072; blk 13857; cutoff xid 422325322 ntuples 34
rmgr: Heap2 len (rec/tot): 24/ 7920, tx: 0, lsn: 4C4D/91011BA0, prev 4C4D/910119D0, bkp: 1000, desc: freeze_page: rel 1663/131438/10079072; blk 13858; cutoff xid 422325322 ntuples 60
...

If this is not enough, I can attach both files.

--
Piotr Gasidło
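For readers who want to reproduce this kind of output: a dump like the one above can be produced with the pg_xlogdump tool shipped with 9.3 (renamed pg_waldump in PostgreSQL 10). A rough sketch, assuming both segment files have been copied into the current directory:

pg_xlogdump 0000000400004C4D00000090 0000000400004C4D00000091 > dump1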
Re: Strange replication problem - segment restored from archive but still requested from master
From
Piotr Gasidło
Date:
2015-05-22 6:55 GMT+02:00 Fujii Masao <masao.fujii@gmail.com>:
>
> This problem happens when WAL record is stored in separate two WAL files and
> there is no valid latter WAL file in the standby. In your case, the former file
> is 0000000400004C4D00000090 and the latter is 0000000400004C4D00000091.
>
> In this case, the first half of WAL record can be read from the former WAL file,
> but the remaining half not because no valid latter file exists in the standby.
> Then the standby tries to retrieve the latter WAL file via replication.
> The problem here is that the standby tries to start the replication from the
> starting point of WAL record, i.e., that's the location of the former WAL file.
> So the already-read WAL file is requested via replication.
> (..)

I currently have wal_keep_segments set to 0.

Will setting this to a higher value help? As I understand it, the master won't delete the segment and can stream it to the slave on request - so it will help.

Does this setting delay WAL archiving?

--
Piotr Gasidło
Re: Strange replication problem - segment restored from archive but still requested from master
From
Guillaume Lelarge
Date:
2015-05-22 18:36 GMT+02:00 Piotr Gasidło <quaker@barbara.eu.org>:
2015-05-22 6:55 GMT+02:00 Fujii Masao <masao.fujii@gmail.com>:
>
> This problem happens when WAL record is stored in separate two WAL files and
> there is no valid latter WAL file in the standby. In your case, the former file
> is 0000000400004C4D00000090 and the latter is 0000000400004C4D00000091.
>
> In this case, the first half of WAL record can be read from the former WAL file,
> but the remaining half not because no valid latter file exists in the standby.
> Then the standby tries to retrieve the latter WAL file via replication.
> The problem here is that the standby tries to start the replication from the
> starting point of WAL record, i.e., that's the location of the former WAL file.
> So the already-read WAL file is requested via replication.
> (..)
I currently have wal_keep_segments set to 0.
Will setting this to a higher value help? As I understand it, the master
won't delete the segment and can stream it to the slave on request - so
it will help.
It definitely helps, but the issue could still happen.
Does this setting delay WAL archiving?
Nope. It delays recycling.
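For concreteness, wal_keep_segments is set on the master; the value below is only an illustration and should be sized to cover the WAL volume the master can produce while a standby is down or lagging:

# postgresql.conf on the master
wal_keep_segments = 256   # keep at least 256 * 16 MB of completed WAL in pg_xlog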
--
Re: Strange replication problem - segment restored from archive but still requested from master
From
Piotr Gasidło
Date:
2015-05-25 11:30 GMT+02:00 Guillaume Lelarge <guillaume@lelarge.info>:
>> I currently have wal_keep_segments set to 0.
>> Will setting this to a higher value help? As I understand it, the master
>> won't delete the segment and can stream it to the slave on request - so
>> it will help.
>
>
> It definitely helps, but the issue could still happen.
>

What conditions must be met for the issue to happen?

Both archive_command on the master and restore_command on the slave are set and working. Also, wal_keep_segments is set. I see no point of failure - only a delay in the case of high WAL traffic on the master:
- the slave starts by restoring WALs from the archive,
- it then connects to the master and notices that, for the master's last WAL, it needs the previous one ("the issue"),
- the slave asks the master for the previous WAL and gets it - job done, streaming replication established, exit,
- if it is unable to get it (WAL traffic is high, and after restoring the last WAL from the archive and asking the master for the next one, more than wal_keep_segments have been recycled), it goes back to looking for WALs in the archive.

Do I get it right?

--
Piotr Gasidło
Re: Strange replication problem - segment restored from archive but still requested from master
From
Guillaume Lelarge
Date:
2015-05-25 15:15 GMT+02:00 Piotr Gasidło <quaker@barbara.eu.org>:
2015-05-25 11:30 GMT+02:00 Guillaume Lelarge <guillaume@lelarge.info>:
>> I currently have wal_keep_segments set to 0.
>> Will setting this to a higher value help? As I understand it, the master
>> won't delete the segment and can stream it to the slave on request - so
>> it will help.
>
>
> It definitely helps, but the issue could still happen.
>
What conditions must be met for the issue to happen?
Very high WAL traffic can make the slave lag so far behind that even wal_keep_segments isn't enough.
Both archive_command on the master and restore_command on the slave are
set and working. Also, wal_keep_segments is set.
I see no point of failure - only a delay in the case of high WAL traffic
on the master:
- the slave starts by restoring WALs from the archive,
- it then connects to the master and notices that, for the master's last
WAL, it needs the previous one ("the issue"),
- the slave asks the master for the previous WAL and gets it - job done,
streaming replication established, exit,
- if it is unable to get it (WAL traffic is high, and after restoring the
last WAL from the archive and asking the master for the next one, more
than wal_keep_segments have been recycled), it goes back to looking for
WALs in the archive.
Do I get it right?
Yes. If you set archive_command (on the master) and restore_command (on the slave) correctly, there's no point of failure. You might still get the "WAL not available" error message, but the slave can synchronize itself with the archived WALs.
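A side note not taken from this thread: one way to judge whether wal_keep_segments and the archive are keeping up is to watch the streaming lag from the master, for example:

-- on the master (9.2+): replay lag of each connected standby, in bytes
SELECT application_name, client_addr,
       pg_xlog_location_diff(pg_current_xlog_location(), replay_location) AS replay_lag_bytes
FROM pg_stat_replication;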
--