Thread: Strange replication problem - segment restored from archive but still requested from master
Strange replication problem - segment restored from archive but still requested from master
From
Piotr Gasidło
Date:
Got a strange problem. Unable to repeat it, but I got logs.

Simple master-slave using streaming replication.
Master is running. Slave is down.
Segment 0000000400004C4D00000090 was successfully archived and sent
from master to slave.

Now I've started the slave, and:

May 21 21:23:37 d8 postgres[50645]: [3-1] 2015-05-21 21:23:37.033 CEST @ 50645 LOG: database system was shut down in recovery at 2015-05-21 21:22:03 CEST
May 21 21:23:37 d8 postgres[50645]: [4-1] 2015-05-21 21:23:37.034 CEST @ 50645 LOG: entering standby mode
May 21 21:23:37 d8 postgres[50645]: [5-1] 2015-05-21 21:23:37.058 CEST @ 50645 LOG: restored log file "0000000400004C4D00000088" from archive
May 21 21:23:37 d8 postgres[50645]: [6-1] 2015-05-21 21:23:37.120 CEST @ 50645 LOG: redo starts at 4C4D/88493B50
May 21 21:23:37 d8 postgres[50645]: [7-1] 2015-05-21 21:23:37.226 CEST @ 50645 LOG: restored log file "0000000400004C4D00000089" from archive
May 21 21:23:37 d8 postgres[50645]: [8-1] 2015-05-21 21:23:37.426 CEST @ 50645 LOG: restored log file "0000000400004C4D0000008A" from archive
May 21 21:23:37 d8 postgres[50645]: [9-1] 2015-05-21 21:23:37.750 CEST @ 50645 LOG: restored log file "0000000400004C4D0000008B" from archive
May 21 21:23:38 d8 postgres[50645]: [10-1] 2015-05-21 21:23:38.376 CEST @ 50645 LOG: restored log file "0000000400004C4D0000008C" from archive
May 21 21:23:38 d8 postgres[50645]: [11-1] 2015-05-21 21:23:38.690 CEST @ 50645 LOG: restored log file "0000000400004C4D0000008D" from archive
May 21 21:23:38 d8 postgres[50645]: [12-1] 2015-05-21 21:23:38.855 CEST @ 50645 LOG: restored log file "0000000400004C4D0000008E" from archive
May 21 21:23:39 d8 postgres[50645]: [13-1] 2015-05-21 21:23:39.275 CEST @ 50645 LOG: restored log file "0000000400004C4D0000008F" from archive
May 21 21:23:39 d8 postgres[50645]: [14-1] 2015-05-21 21:23:39.654 CEST @ 50645 LOG: restored log file "0000000400004C4D00000090" from archive
May 21 21:23:40 d8 postgres[50645]: [15-1] 2015-05-21 21:23:40.222 CEST @ 50645 LOG: consistent recovery state reached at 4C4D/90FFF9C8
May 21 21:23:40 d8 postgres[50644]: [3-1] 2015-05-21 21:23:40.222 CEST @ 50644 LOG: database system is ready to accept read only connections
May 21 21:23:40 d8 postgres[50645]: [16-1] 2015-05-21 21:23:40.223 CEST @ 50645 LOG: unexpected pageaddr 4C46/E000000 in log segment 0000000400004C4D00000091, offset 0
May 21 21:23:40 d8 postgres[50699]: [4-1] 2015-05-21 21:23:40.232 CEST @ 50699 LOG: started streaming WAL from primary at 4C4D/90000000 on timeline 4
May 21 21:23:40 d8 postgres[50699]: [5-1] 2015-05-21 21:23:40.232 CEST @ 50699 FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000400004C4D00000090 has already been removed
May 21 21:23:40 d8 postgres[50645]: [17-1] 2015-05-21 21:23:40.255 CEST @ 50645 LOG: restored log file "0000000400004C4D00000090" from archive
May 21 21:23:40 d8 postgres[50703]: [4-1] 2015-05-21 21:23:40.268 CEST @ 50703 LOG: started streaming WAL from primary at 4C4D/90000000 on timeline 4
May 21 21:23:40 d8 postgres[50703]: [5-1] 2015-05-21 21:23:40.268 CEST @ 50703 FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000400004C4D00000090 has already been removed
May 21 21:23:40 d8 postgres[50703]: [5-2]
...
(and so on)

So, as I understand:
- slave was started and entered restore,
- slave restored 0000000400004C4D00000090 from archive, reached consistent recovery state,
- now, it connected to the master and noticed that its 0000000400004C4D00000091 segment is incomplete,
- and then, started yelling about the segment missing on the master (the master had already deleted it after archiving).

Why?

I fixed it by stopping the slave, doing some operations on the master, copying 0000000400004C4D00000091 and 0000000400004C4D00000092 from the master, and starting the slave:

May 21 21:24:56 d8 postgres[50644]: [4-1] 2015-05-21 21:24:56.160 CEST @ 50644 LOG: received fast shutdown request
May 21 21:24:56 d8 postgres[50644]: [5-1] 2015-05-21 21:24:56.160 CEST @ 50644 LOG: aborting any active transactions
May 21 21:24:56 d8 postgres[50657]: [3-1] 2015-05-21 21:24:56.162 CEST @ 50657 LOG: shutting down
May 21 21:24:56 d8 postgres[50657]: [4-1] 2015-05-21 21:24:56.174 CEST @ 50657 LOG: database system is shut down
May 21 21:28:54 d8 postgres[53574]: [1-1] 2015-05-21 21:28:54.781 CEST @ 53574 LOG: loaded library "pg_stat_statements"
May 21 21:28:55 d8 postgres[53574]: [2-1] 2015-05-21 21:28:55.123 CEST @ 53574 LOG: ending log output to stderr
May 21 21:28:55 d8 postgres[53574]: [2-2] 2015-05-21 21:28:55.123 CEST @ 53574 HINT: Future log output will go to log destination "syslog".
May 21 21:28:55 d8 postgres[53575]: [3-1] 2015-05-21 21:28:55.124 CEST @ 53575 LOG: database system was shut down in recovery at 2015-05-21 21:24:56 CEST
May 21 21:28:55 d8 postgres[53575]: [4-1] 2015-05-21 21:28:55.126 CEST @ 53575 LOG: entering standby mode
May 21 21:28:55 d8 postgres[53575]: [5-1] 2015-05-21 21:28:55.150 CEST @ 53575 LOG: restored log file "0000000400004C4D00000088" from archive
May 21 21:28:55 d8 postgres[53575]: [6-1] 2015-05-21 21:28:55.209 CEST @ 53575 LOG: redo starts at 4C4D/88493B50
May 21 21:28:55 d8 postgres[53575]: [7-1] 2015-05-21 21:28:55.314 CEST @ 53575 LOG: restored log file "0000000400004C4D00000089" from archive
May 21 21:28:55 d8 postgres[53575]: [8-1] 2015-05-21 21:28:55.512 CEST @ 53575 LOG: restored log file "0000000400004C4D0000008A" from archive
May 21 21:28:55 d8 postgres[53575]: [9-1] 2015-05-21 21:28:55.831 CEST @ 53575 LOG: restored log file "0000000400004C4D0000008B" from archive
May 21 21:28:56 d8 postgres[53575]: [10-1] 2015-05-21 21:28:56.508 CEST @ 53575 LOG: restored log file "0000000400004C4D0000008C" from archive
May 21 21:28:56 d8 postgres[53575]: [11-1] 2015-05-21 21:28:56.818 CEST @ 53575 LOG: restored log file "0000000400004C4D0000008D" from archive
May 21 21:28:56 d8 postgres[53575]: [12-1] 2015-05-21 21:28:56.982 CEST @ 53575 LOG: restored log file "0000000400004C4D0000008E" from archive
May 21 21:28:57 d8 postgres[53575]: [13-1] 2015-05-21 21:28:57.415 CEST @ 53575 LOG: restored log file "0000000400004C4D0000008F" from archive
May 21 21:28:57 d8 postgres[53575]: [14-1] 2015-05-21 21:28:57.790 CEST @ 53575 LOG: restored log file "0000000400004C4D00000090" from archive
May 21 21:28:58 d8 postgres[53575]: [15-1] 2015-05-21 21:28:58.328 CEST @ 53575 LOG: consistent recovery state reached at 4C4D/90FFF9C8
May 21 21:28:58 d8 postgres[53574]: [3-1] 2015-05-21 21:28:58.328 CEST @ 53574 LOG: database system is ready to accept read only connections
May 21 21:28:58 d8 postgres[53575]: [16-1] 2015-05-21 21:28:58.349 CEST @ 53575 LOG: restored log file "0000000400004C4D00000091" from archive
May 21 21:28:59 d8 postgres[53575]: [17-1] 2015-05-21 21:28:59.962 CEST @ 53575 LOG: restored log file "0000000400004C4D00000092" from archive
May 21 21:29:00 d8 postgres[53575]: [18-1] 2015-05-21 21:29:00.037 CEST @ 53575 LOG: unexpected pageaddr 4C45/DC000000 in log segment 0000000400004C4D00000093, offset 0
May 21 21:29:00 d8 postgres[53640]: [4-1] 2015-05-21 21:29:00.047 CEST @ 53640 LOG: started streaming WAL from primary at 4C4D/92000000 on timeline 4

What was wrong?

PostgreSQL 9.3.6 on the slave and 9.3.4 on the master. I am unable to retest whether it would also happen with 9.3.6 on the master.

--
Piotr Gasidło
Re: Strange replication problem - segment restored from archive but still requested from master
From
Venkata Balaji N
Date:
On Fri, May 22, 2015 at 6:52 AM, Piotr Gasidło <quaker@barbara.eu.org> wrote:
It might be yelling about the WAL segment due to the delay in shipping it from master to slave.
Got a strange problem. Unable to repeat it, but I got logs.
Simple master-slave using streaming replication.
Master is running. Slave is down.
Segment 0000000400004C4D00000090 was successfully archived and sent
from master to slave.
Now I've started the slave, and:
May 21 21:23:37 d8 postgres[50645]: [3-1] 2015-05-21 21:23:37.033 CEST
@ 50645 LOG: database system was shut down in recovery at
2015-05-21 21:22:03 CEST
May 21 21:23:37 d8 postgres[50645]: [4-1] 2015-05-21 21:23:37.034 CEST
@ 50645 LOG: entering standby mode
May 21 21:23:37 d8 postgres[50645]: [5-1] 2015-05-21 21:23:37.058 CEST
@ 50645 LOG: restored log file "0000000400004C4D00000088" from
archive
May 21 21:23:37 d8 postgres[50645]: [6-1] 2015-05-21 21:23:37.120 CEST
@ 50645 LOG: redo starts at 4C4D/88493B50
May 21 21:23:37 d8 postgres[50645]: [7-1] 2015-05-21 21:23:37.226 CEST
@ 50645 LOG: restored log file "0000000400004C4D00000089" from
archive
May 21 21:23:37 d8 postgres[50645]: [8-1] 2015-05-21 21:23:37.426 CEST
@ 50645 LOG: restored log file "0000000400004C4D0000008A" from
archive
May 21 21:23:37 d8 postgres[50645]: [9-1] 2015-05-21 21:23:37.750 CEST
@ 50645 LOG: restored log file "0000000400004C4D0000008B" from
archive
May 21 21:23:38 d8 postgres[50645]: [10-1] 2015-05-21 21:23:38.376
CEST @ 50645 LOG: restored log file "0000000400004C4D0000008C" from
archive
May 21 21:23:38 d8 postgres[50645]: [11-1] 2015-05-21 21:23:38.690
CEST @ 50645 LOG: restored log file "0000000400004C4D0000008D" from
archive
May 21 21:23:38 d8 postgres[50645]: [12-1] 2015-05-21 21:23:38.855
CEST @ 50645 LOG: restored log file "0000000400004C4D0000008E" from
archive
May 21 21:23:39 d8 postgres[50645]: [13-1] 2015-05-21 21:23:39.275
CEST @ 50645 LOG: restored log file "0000000400004C4D0000008F" from
archive
May 21 21:23:39 d8 postgres[50645]: [14-1] 2015-05-21 21:23:39.654
CEST @ 50645 LOG: restored log file "0000000400004C4D00000090" from
archive
May 21 21:23:40 d8 postgres[50645]: [15-1] 2015-05-21 21:23:40.222
CEST @ 50645 LOG: consistent recovery state reached at
4C4D/90FFF9C8
May 21 21:23:40 d8 postgres[50644]: [3-1] 2015-05-21 21:23:40.222 CEST
@ 50644 LOG: database system is ready to accept read only
connections
May 21 21:23:40 d8 postgres[50645]: [16-1] 2015-05-21 21:23:40.223
CEST @ 50645 LOG: unexpected pageaddr 4C46/E000000 in log segment
0000000400004C4D00000091, offset 0
May 21 21:23:40 d8 postgres[50699]: [4-1] 2015-05-21 21:23:40.232 CEST
@ 50699 LOG: started streaming WAL from primary at 4C4D/90000000 on
timeline 4
May 21 21:23:40 d8 postgres[50699]: [5-1] 2015-05-21 21:23:40.232 CEST
@ 50699 FATAL: could not receive data from WAL stream: ERROR:
requested WAL segment 0000000400004C4D00000090 has already been
removed
May 21 21:23:40 d8 postgres[50645]: [17-1] 2015-05-21 21:23:40.255
CEST @ 50645 LOG: restored log file "0000000400004C4D00000090" from
archive
May 21 21:23:40 d8 postgres[50703]: [4-1] 2015-05-21 21:23:40.268 CEST
@ 50703 LOG: started streaming WAL from primary at 4C4D/90000000 on
timeline 4
May 21 21:23:40 d8 postgres[50703]: [5-1] 2015-05-21 21:23:40.268 CEST
@ 50703 FATAL: could not receive data from WAL stream: ERROR:
requested WAL segment 0000000400004C4D00000090 has already been
removed
May 21 21:23:40 d8 postgres[50703]: [5-2]
...
(and so on)
So, as I understand:
- slave was started and entered restore,
- slave restored 0000000400004C4D00000090 from archive, reached
consistent recovery state
- now, it connected to the master and noticed that its
0000000400004C4D00000091 segment is incomplete
- and then, started yelling about the segment missing on the master
(the master had already deleted it after archiving)
Why?
Do you have the restore_command set up in the recovery.conf file? Do you have any automated job shipping WAL archives from master to slave?
On the slave: if it does not find the WAL segment in the master's pg_xlog location, it looks for it in the location given in the restore_command. It yells if it is unable to find it there.
I cannot give any concrete comments or suggestions until the recovery.conf entries are known.
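For reference, a minimal recovery.conf for a 9.3 standby that combines streaming replication with archive recovery usually looks something like the sketch below; the host name, user and archive path are placeholders, not values from this thread:

standby_mode = 'on'
primary_conninfo = 'host=master.example.com port=5432 user=replicator'
restore_command = 'cp /path/to/wal_archive/%f "%p"'

With both primary_conninfo and restore_command set, the standby alternates between reading segments from the archive and streaming them from the primary.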
I fixed it by stopping the slave, doing some operations on the master, copying
0000000400004C4D00000091 and 0000000400004C4D00000092 from the master, and
starting the slave:
May 21 21:24:56 d8 postgres[50644]: [4-1] 2015-05-21 21:24:56.160 CEST
@ 50644 LOG: received fast shutdown request
May 21 21:24:56 d8 postgres[50644]: [5-1] 2015-05-21 21:24:56.160 CEST
@ 50644 LOG: aborting any active transactions
May 21 21:24:56 d8 postgres[50657]: [3-1] 2015-05-21 21:24:56.162 CEST
@ 50657 LOG: shutting down
May 21 21:24:56 d8 postgres[50657]: [4-1] 2015-05-21 21:24:56.174 CEST
@ 50657 LOG: database system is shut down
May 21 21:28:54 d8 postgres[53574]: [1-1] 2015-05-21 21:28:54.781 CEST
@ 53574 LOG: loaded library "pg_stat_statements"
May 21 21:28:55 d8 postgres[53574]: [2-1] 2015-05-21 21:28:55.123 CEST
@ 53574 LOG: ending log output to stderr
May 21 21:28:55 d8 postgres[53574]: [2-2] 2015-05-21 21:28:55.123 CEST
@ 53574 HINT: Future log output will go to log destination
"syslog".
May 21 21:28:55 d8 postgres[53575]: [3-1] 2015-05-21 21:28:55.124 CEST
@ 53575 LOG: database system was shut down in recovery at
2015-05-21 21:24:56 CEST
May 21 21:28:55 d8 postgres[53575]: [4-1] 2015-05-21 21:28:55.126 CEST
@ 53575 LOG: entering standby mode
May 21 21:28:55 d8 postgres[53575]: [5-1] 2015-05-21 21:28:55.150 CEST
@ 53575 LOG: restored log file "0000000400004C4D00000088" from
archive
May 21 21:28:55 d8 postgres[53575]: [6-1] 2015-05-21 21:28:55.209 CEST
@ 53575 LOG: redo starts at 4C4D/88493B50
May 21 21:28:55 d8 postgres[53575]: [7-1] 2015-05-21 21:28:55.314 CEST
@ 53575 LOG: restored log file "0000000400004C4D00000089" from
archive
May 21 21:28:55 d8 postgres[53575]: [8-1] 2015-05-21 21:28:55.512 CEST
@ 53575 LOG: restored log file "0000000400004C4D0000008A" from
archive
May 21 21:28:55 d8 postgres[53575]: [9-1] 2015-05-21 21:28:55.831 CEST
@ 53575 LOG: restored log file "0000000400004C4D0000008B" from
archive
May 21 21:28:56 d8 postgres[53575]: [10-1] 2015-05-21 21:28:56.508
CEST @ 53575 LOG: restored log file "0000000400004C4D0000008C" from
archive
May 21 21:28:56 d8 postgres[53575]: [11-1] 2015-05-21 21:28:56.818
CEST @ 53575 LOG: restored log file "0000000400004C4D0000008D" from
archive
May 21 21:28:56 d8 postgres[53575]: [12-1] 2015-05-21 21:28:56.982
CEST @ 53575 LOG: restored log file "0000000400004C4D0000008E" from
archive
May 21 21:28:57 d8 postgres[53575]: [13-1] 2015-05-21 21:28:57.415
CEST @ 53575 LOG: restored log file "0000000400004C4D0000008F" from
archive
May 21 21:28:57 d8 postgres[53575]: [14-1] 2015-05-21 21:28:57.790
CEST @ 53575 LOG: restored log file "0000000400004C4D00000090" from
archive
May 21 21:28:58 d8 postgres[53575]: [15-1] 2015-05-21 21:28:58.328
CEST @ 53575 LOG: consistent recovery state reached at
4C4D/90FFF9C8
May 21 21:28:58 d8 postgres[53574]: [3-1] 2015-05-21 21:28:58.328 CEST
@ 53574 LOG: database system is ready to accept read only
connections
May 21 21:28:58 d8 postgres[53575]: [16-1] 2015-05-21 21:28:58.349
CEST @ 53575 LOG: restored log file "0000000400004C4D00000091" from
archive
May 21 21:28:59 d8 postgres[53575]: [17-1] 2015-05-21 21:28:59.962
CEST @ 53575 LOG: restored log file "0000000400004C4D00000092" from
archive
May 21 21:29:00 d8 postgres[53575]: [18-1] 2015-05-21 21:29:00.037
CEST @ 53575 LOG: unexpected pageaddr 4C45/DC000000 in log segment
0000000400004C4D00000093, offset 0
May 21 21:29:00 d8 postgres[53640]: [4-1] 2015-05-21 21:29:00.047 CEST
@ 53640 LOG: started streaming WAL from primary at 4C4D/92000000 on
timeline 4
What was wrong?
This error message can be ignored. Did you try shipping 0000000400004C4D00000093 to the slave from the master's archive location to see whether the slave picked it up and applied it?
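For illustration, manually shipping a segment from the master's archive to the directory the slave's restore_command reads from could look like the sketch below; the host and paths are placeholders rather than the ones used in this setup:

scp /path/to/master_archive/0000000400004C4D00000093 \
    postgres@standby.example.com:/path/to/standby_wal_archive/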
Regards,
Venkata B N
Fujitsu Australia
Re: Strange replication problem - segment restored from archive but still requested from master
From
Fujii Masao
Date:
On Fri, May 22, 2015 at 5:52 AM, Piotr Gasidło <quaker@barbara.eu.org> wrote:
> Got a strange problem. Unable to repeat it, but I got logs.
>
> Simple master-slave using streaming replication.
> Master is running. Slave is down.
> Segment 0000000400004C4D00000090 was successfully archived and sent
> from master to slave.
>
> Now I've started the slave, and:
>
> [...]
>
> So, as I understand:
> - slave was started and entered restore,
> - slave restored 0000000400004C4D00000090 from archive, reached
> consistent recovery state,
> - now, it connected to the master and noticed that its
> 0000000400004C4D00000091 segment is incomplete,
> - and then, started yelling about the segment missing on the master
> (the master had already deleted it after archiving).
>
> Why?

Thanks for the report! This seems to be a bug.

This problem happens when WAL record is stored in separate two WAL files and
there is no valid latter WAL file in the standby. In your case, the former file
is 0000000400004C4D00000090 and the latter is 0000000400004C4D00000091.

In this case, the first half of WAL record can be read from the former WAL file,
but the remaining half not because no valid latter file exists in the standby.
Then the standby tries to retrieve the latter WAL file via replication.
The problem here is that the standby tries to start the replication from the
starting point of WAL record, i.e., that's the location of the former WAL file.
So the already-read WAL file is requested via replication.

To address this problem, maybe we can just use the location where the remaining
half of the WAL record starts as the replication starting location. We would
need to look carefully at the side effects of that change, though.

Regards,

--
Fujii Masao
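As an illustration of the segment boundary involved here (not part of the original mail): pg_xlogfile_name() maps an LSN to the segment that contains it, so run on the master (timeline 4) it shows that the record starting at the "consistent recovery state" LSN sits at the very end of segment ...90, with its remainder continuing in ...91. The function was renamed to pg_walfile_name() in PostgreSQL 10.

-- run on the master; pg_xlogfile_name() cannot be executed on a standby in recovery
SELECT pg_xlogfile_name('4C4D/90FFF9C8');
--  0000000400004C4D00000090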
Re: Strange replication problem - segment restored from archive but still requested from master
From
Piotr Gasidło
Date:
2015-05-22 2:16 GMT+02:00 Venkata Balaji N <nag1010@gmail.com>:
>
> It might be yelling about the WAL segment due to the delay in shipping it
> from master to slave.
> Do you have the restore_command set up in the recovery.conf file? Do you
> have any automated job shipping WAL archives from master to slave?

Both, yes. I've used OmniPITR, heavily logged.

So this is on the master:

postgresql.conf:

archive_mode = on
archive_command = '/usr/local/pgsql/scripts/omnipitr/bin/omnipitr-archive --config-file /usr/local/pgsql/core/scripts/omnipitr/config/archive "%p"'

/usr/local/pgsql/core/scripts/omnipitr/config/archive:

--verbose
--data-dir=/usr/local/pgsql/core/data/
--dst-remote=pgsql@d0.XXX:/usr/local/pgsql/core/wal/
--dst-remote=pgsql@d8.XXX:/usr/local/pgsql/core/wal/
--state-dir=/usr/local/pgsql/core/scripts/omnipitr/state
--log=/usr/local/pgsql/core/scripts/omnipitr/log/omnipitr.log

/usr/local/pgsql/core/scripts/omnipitr/log/omnipitr.log, part:

2015-05-21 20:35:32.404395 +0200 : 98206 : omnipitr-archive : LOG : Segment /usr/local/pgsql/core/data/pg_xlog/0000000400004C4D0000008E successfully sent to all destinations.
2015-05-21 20:35:59.993614 +0200 : 98214 : omnipitr-archive : LOG : Called with parameters: --config-file /usr/local/pgsql/core/scripts/omnipitr/config/archive pg_xlog/0000000400004C4D0000008F
2015-05-21 20:36:00.495839 +0200 : 98214 : omnipitr-archive : LOG : Sending /usr/local/pgsql/core/data/pg_xlog/0000000400004C4D0000008F to pgsql@d0.XXX:/usr/local/pgsql/core/wal/0000000400004C4D0000008F ended in 0.500619s
2015-05-21 20:36:00.896906 +0200 : 98214 : omnipitr-archive : LOG : Sending /usr/local/pgsql/core/data/pg_xlog/0000000400004C4D0000008F to pgsql@d8.XXX:/usr/local/pgsql/core/wal/0000000400004C4D0000008F ended in 0.396684s
2015-05-21 20:36:00.897467 +0200 : 98214 : omnipitr-archive : LOG : Timer [Segment delivery] took: 0.903s
2015-05-21 20:36:00.901777 +0200 : 98214 : omnipitr-archive : LOG : Segment /usr/local/pgsql/core/data/pg_xlog/0000000400004C4D0000008F successfully sent to all destinations.
2015-05-21 20:36:33.916879 +0200 : 98222 : omnipitr-archive : LOG : Called with parameters: --config-file /usr/local/pgsql/core/scripts/omnipitr/config/archive pg_xlog/0000000400004C4D00000090
2015-05-21 20:36:34.379965 +0200 : 98222 : omnipitr-archive : LOG : Sending /usr/local/pgsql/core/data/pg_xlog/0000000400004C4D00000090 to pgsql@d0.XXX:/usr/local/pgsql/core/wal/0000000400004C4D00000090 ended in 0.458563s
2015-05-21 20:36:34.931918 +0200 : 98222 : omnipitr-archive : LOG : Sending /usr/local/pgsql/core/data/pg_xlog/0000000400004C4D00000090 to pgsql@d8.XXX:/usr/local/pgsql/core/wal/0000000400004C4D00000090 ended in 0.550890s
2015-05-21 20:36:34.935241 +0200 : 98222 : omnipitr-archive : LOG : Timer [Segment delivery] took: 1.015s
2015-05-21 20:36:34.935776 +0200 : 98222 : omnipitr-archive : LOG : Segment /usr/local/pgsql/core/data/pg_xlog/0000000400004C4D00000090 successfully sent to all destinations.
2015-05-21 21:12:38.242687 +0200 : 561 : omnipitr-archive : LOG : Called with parameters: --config-file /usr/local/pgsql/core/scripts/omnipitr/config/archive pg_xlog/0000000400004C4D00000090
2015-05-21 21:12:38.425904 +0200 : 561 : omnipitr-archive : LOG : Sending /usr/local/pgsql/core/data/pg_xlog/0000000400004C4D00000090 to pgsql@d0.XXX:/usr/local/pgsql/core/wal/0000000400004C4D00000090 ended in 0.181528s
2015-05-21 21:12:38.969018 +0200 : 561 : omnipitr-archive : LOG : Sending /usr/local/pgsql/core/data/pg_xlog/0000000400004C4D00000090 to pgsql@d8.XXX:/usr/local/pgsql/core/wal/0000000400004C4D00000090 ended in 0.539258s
2015-05-21 21:12:38.971513 +0200 : 561 : omnipitr-archive : LOG : Timer [Segment delivery] took: 0.728s
2015-05-21 21:12:38.973229 +0200 : 561 : omnipitr-archive : LOG : Segment /usr/local/pgsql/core/data/pg_xlog/0000000400004C4D00000090 successfully sent to all destinations.
2015-05-21 21:28:51.389622 +0200 : 1634 : omnipitr-archive : LOG : Called with parameters: --config-file /usr/local/pgsql/core/scripts/omnipitr/config/archive pg_xlog/0000000400004C4D00000091
2015-05-21 21:28:52.457545 +0200 : 1634 : omnipitr-archive : LOG : Sending /usr/local/pgsql/core/data/pg_xlog/0000000400004C4D00000091 to pgsql@d0.XXX:/usr/local/pgsql/core/wal/0000000400004C4D00000091 ended in 1.065172s
2015-05-21 21:28:52.908541 +0200 : 1634 : omnipitr-archive : LOG : Sending /usr/local/pgsql/core/data/pg_xlog/0000000400004C4D00000091 to pgsql@d8.XXX:/usr/local/pgsql/core/wal/0000000400004C4D00000091 ended in 0.449934s
2015-05-21 21:28:52.910337 +0200 : 1634 : omnipitr-archive : LOG : Timer [Segment delivery] took: 1.519s
2015-05-21 21:28:52.910902 +0200 : 1634 : omnipitr-archive : LOG : Segment /usr/local/pgsql/core/data/pg_xlog/0000000400004C4D00000091 successfully sent to all destinations.
2015-05-21 21:28:53.028408 +0200 : 1639 : omnipitr-archive : LOG : Called with parameters: --config-file /usr/local/pgsql/core/scripts/omnipitr/config/archive pg_xlog/0000000400004C4D00000092
2015-05-21 21:28:53.783568 +0200 : 1639 : omnipitr-archive : LOG : Sending /usr/local/pgsql/core/data/pg_xlog/0000000400004C4D00000092 to pgsql@d0.XXX:/usr/local/pgsql/core/wal/0000000400004C4D00000092 ended in 0.753511s
2015-05-21 21:28:54.260492 +0200 : 1639 : omnipitr-archive : LOG : Sending /usr/local/pgsql/core/data/pg_xlog/0000000400004C4D00000092 to pgsql@d8.XXX:/usr/local/pgsql/core/wal/0000000400004C4D00000092 ended in 0.473037s
2015-05-21 21:28:54.261054 +0200 : 1639 : omnipitr-archive : LOG : Timer [Segment delivery] took: 1.232s
2015-05-21 21:28:54.262764 +0200 : 1639 : omnipitr-archive : LOG : Segment /usr/local/pgsql/core/data/pg_xlog/0000000400004C4D00000092 successfully sent to all destinations.
2015-05-21 21:30:06.034584 +0200 : 1870 : omnipitr-archive : LOG : Called with parameters: --config-file /usr/local/pgsql/core/scripts/omnipitr/config/archive pg_xlog/0000000400004C4D00000093
2015-05-21 21:30:06.920507 +0200 : 1870 : omnipitr-archive : LOG : Sending /usr/local/pgsql/core/data/pg_xlog/0000000400004C4D00000093 to pgsql@d0.XXX:/usr/local/pgsql/core/wal/0000000400004C4D00000093 ended in 0.883097s
2015-05-21 21:30:07.391485 +0200 : 1870 : omnipitr-archive : LOG : Sending /usr/local/pgsql/core/data/pg_xlog/0000000400004C4D00000093 to pgsql@d8.XXX:/usr/local/pgsql/core/wal/0000000400004C4D00000093 ended in 0.469941s
2015-05-21 21:30:07.394826 +0200 : 1870 : omnipitr-archive : LOG : Timer [Segment delivery] took: 1.358s

But now I see an unexpected problem.
The shipping of 0000000400004C4D00000090 was run twice. But that was probably caused by me - I tried to put the file back into the master's pg_xlog from d0.XXX and ran CHECKPOINT on the master. That didn't work out.

I have these WAL files:

d0:/usr/local/pgsql/core/wal # ls -laT 0000000400004C4D00000090 0000000400004C4D00000091 0000000400004C4D00000092 0000000400004C4D00000093
-rw------- 1 pgsql pgsql 16777216 May 21 20:36:33 2015 0000000400004C4D00000090
-rw------- 1 pgsql pgsql 16777216 May 21 21:28:51 2015 0000000400004C4D00000091
-rw------- 1 pgsql pgsql 16777216 May 21 21:28:52 2015 0000000400004C4D00000092
-rw------- 1 pgsql pgsql 16777216 May 21 21:30:05 2015 0000000400004C4D00000093

> On the slave: if it does not find the WAL segment in the master's pg_xlog
> location, it looks for it in the location given in the restore_command. It
> yells if it is unable to find it there.
> I cannot give any concrete comments or suggestions until the recovery.conf
> entries are known.

On the slave, recovery.conf: at the time of writing it, I switched from OmniPITR:

restore_command = '/usr/local/pgsql/scripts/omnipitr/bin/omnipitr-restore --config-file /usr/local/pgsql/core/scripts/omnipitr/config/restore -sr "%f" "%p"'

to:

restore_command = 'cp /usr/local/pgsql/core/wal/%f %p'

and manually rsynced the files from d0.XXX to the /usr/local/pgsql/core/wal/ directory.

>> What was wrong?
>
> This error message can be ignored. Did you try shipping
> 0000000400004C4D00000093 to the slave from the master's archive location
> to see whether the slave picked it up and applied it?

As I wrote, I had only shipped up to 0000000400004C4D00000092. File 0000000400004C4D00000093 was archived after the slave managed to start and successfully began streaming WAL.

But I did do something like this. When I noticed that the slave was OK, I stopped it, changed back to the OmniPITR restore_command, and started it again. So here is the full log:

-- here is where the slave successfully starts for the first time:

May 21 21:28:55 d8 postgres[53574]: [2-1] 2015-05-21 21:28:55.123 CEST @ 53574 LOG: ending log output to stderr
May 21 21:28:55 d8 postgres[53574]: [2-2] 2015-05-21 21:28:55.123 CEST @ 53574 HINT: Future log output will go to log destination "syslog".
May 21 21:28:55 d8 postgres[53575]: [3-1] 2015-05-21 21:28:55.124 CEST @ 53575 LOG: database system was shut down in recovery at 2015-05-21 21:24:56 CEST
May 21 21:28:55 d8 postgres[53575]: [4-1] 2015-05-21 21:28:55.126 CEST @ 53575 LOG: entering standby mode
May 21 21:28:55 d8 postgres[53575]: [5-1] 2015-05-21 21:28:55.150 CEST @ 53575 LOG: restored log file "0000000400004C4D00000088" from archive
May 21 21:28:55 d8 postgres[53575]: [6-1] 2015-05-21 21:28:55.209 CEST @ 53575 LOG: redo starts at 4C4D/88493B50
May 21 21:28:55 d8 postgres[53575]: [7-1] 2015-05-21 21:28:55.314 CEST @ 53575 LOG: restored log file "0000000400004C4D00000089" from archive
May 21 21:28:55 d8 postgres[53575]: [8-1] 2015-05-21 21:28:55.512 CEST @ 53575 LOG: restored log file "0000000400004C4D0000008A" from archive
May 21 21:28:55 d8 postgres[53585]: [3-1] 2015-05-21 21:28:55.781 CEST pgsql@postgres 53585 [local] [unknown] FATAL: the database system is starting up
May 21 21:28:55 d8 postgres[53575]: [9-1] 2015-05-21 21:28:55.831 CEST @ 53575 LOG: restored log file "0000000400004C4D0000008B" from archive
May 21 21:28:56 d8 postgres[53575]: [10-1] 2015-05-21 21:28:56.508 CEST @ 53575 LOG: restored log file "0000000400004C4D0000008C" from archive
May 21 21:28:56 d8 postgres[53594]: [3-1] 2015-05-21 21:28:56.782 CEST pgsql@postgres 53594 [local] [unknown] FATAL: the database system is starting up
May 21 21:28:56 d8 postgres[53575]: [11-1] 2015-05-21 21:28:56.818 CEST @ 53575 LOG: restored log file "0000000400004C4D0000008D" from archive
May 21 21:28:56 d8 postgres[53575]: [12-1] 2015-05-21 21:28:56.982 CEST @ 53575 LOG: restored log file "0000000400004C4D0000008E" from archive
May 21 21:28:57 d8 postgres[53575]: [13-1] 2015-05-21 21:28:57.415 CEST @ 53575 LOG: restored log file "0000000400004C4D0000008F" from archive
May 21 21:28:57 d8 postgres[53607]: [3-1] 2015-05-21 21:28:57.784 CEST pgsql@postgres 53607 [local] [unknown] FATAL: the database system is starting up
May 21 21:28:57 d8 postgres[53575]: [14-1] 2015-05-21 21:28:57.790 CEST @ 53575 LOG: restored log file "0000000400004C4D00000090" from archive
May 21 21:28:58 d8 postgres[53575]: [15-1] 2015-05-21 21:28:58.328 CEST @ 53575 LOG: consistent recovery state reached at 4C4D/90FFF9C8
May 21 21:28:58 d8 postgres[53574]: [3-1] 2015-05-21 21:28:58.328 CEST @ 53574 LOG: database system is ready to accept read only connections
May 21 21:28:58 d8 postgres[53575]: [16-1] 2015-05-21 21:28:58.349 CEST @ 53575 LOG: restored log file "0000000400004C4D00000091" from archive
May 21 21:28:58 d8 postgres[53618]: [4-1] 2015-05-21 21:28:58.786 CEST pgsql@postgres 53618 [local] [unknown] FATAL: role "pgsql" does not exist
May 21 21:28:59 d8 postgres[53575]: [17-1] 2015-05-21 21:28:59.962 CEST @ 53575 LOG: restored log file "0000000400004C4D00000092" from archive
May 21 21:29:00 d8 postgres[53575]: [18-1] 2015-05-21 21:29:00.037 CEST @ 53575 LOG: unexpected pageaddr 4C45/DC000000 in log segment 0000000400004C4D00000093, offset 0
May 21 21:29:00 d8 postgres[53640]: [4-1] 2015-05-21 21:29:00.047 CEST @ 53640 LOG: started streaming WAL from primary at 4C4D/92000000 on timeline 4

-- I stopped it.
May 21 21:30:13 d8 postgres[53574]: [4-1] 2015-05-21 21:30:13.417 CEST @ 53574 LOG: received fast shutdown request
May 21 21:30:13 d8 postgres[53574]: [5-1] 2015-05-21 21:30:13.418 CEST @ 53574 LOG: aborting any active transactions
May 21 21:30:13 d8 postgres[53640]: [5-1] 2015-05-21 21:30:13.441 CEST @ 53640 FATAL: terminating walreceiver process due to administrator command
May 21 21:30:13 d8 postgres[53579]: [3-1] 2015-05-21 21:30:13.443 CEST @ 53579 LOG: shutting down
May 21 21:30:13 d8 postgres[53579]: [4-1] 2015-05-21 21:30:13.443 CEST @ 53579 LOG: restartpoint starting: shutdown immediate
May 21 21:30:15 d8 postgres[53579]: [5-1] 2015-05-21 21:30:15.094 CEST @ 53579 LOG: restartpoint complete: wrote 25212 buffers (0.8%); 0 transaction log file(s) added, 0 removed, 4 recycled; write=0.946 s, sync=0.629 s, total=1.651 s; sync files=66, longest=0.245 s, aver
May 21 21:30:15 d8 postgres[53579]: [6-1] 2015-05-21 21:30:15.094 CEST @ 53579 LOG: recovery restart point at 4C4D/91A59348
May 21 21:30:15 d8 postgres[53579]: [6-2] 2015-05-21 21:30:15.094 CEST @ 53579 DETAIL: last completed transaction was at log time 2015-05-21 21:30:13.377052+02
May 21 21:30:15 d8 postgres[53579]: [7-1] 2015-05-21 21:30:15.390 CEST @ 53579 LOG: database system is shut down

-- I changed restore_command; in the background the master had managed to ship 0000000400004C4D00000093.
-- Starting up.

May 21 21:30:25 d8 postgres[54546]: [1-1] 2015-05-21 21:30:25.983 CEST @ 54546 LOG: loaded library "pg_stat_statements"
May 21 21:30:26 d8 postgres[54546]: [2-1] 2015-05-21 21:30:26.329 CEST @ 54546 LOG: ending log output to stderr
May 21 21:30:26 d8 postgres[54546]: [2-2] 2015-05-21 21:30:26.329 CEST @ 54546 HINT: Future log output will go to log destination "syslog".
May 21 21:30:26 d8 postgres[54547]: [3-1] 2015-05-21 21:30:26.331 CEST @ 54547 LOG: database system was shut down in recovery at 2015-05-21 21:30:15 CEST
May 21 21:30:26 d8 postgres[54547]: [4-1] 2015-05-21 21:30:26.626 CEST @ 54547 LOG: entering standby mode
May 21 21:30:27 d8 postgres[54563]: [3-1] 2015-05-21 21:30:27.026 CEST pgsql@postgres 54563 [local] [unknown] FATAL: the database system is starting up
May 21 21:30:27 d8 postgres[54547]: [5-1] 2015-05-21 21:30:27.068 CEST @ 54547 LOG: restored log file "0000000400004C4D00000091" from archive
May 21 21:30:27 d8 postgres[54547]: [6-1] 2015-05-21 21:30:27.129 CEST @ 54547 LOG: redo starts at 4C4D/91A59348
May 21 21:30:27 d8 postgres[54547]: [7-1] 2015-05-21 21:30:27.301 CEST @ 54547 LOG: restored log file "0000000400004C4D00000092" from archive
May 21 21:30:27 d8 postgres[54547]: [8-1] 2015-05-21 21:30:27.532 CEST @ 54547 LOG: restored log file "0000000400004C4D00000093" from archive
May 21 21:30:27 d8 postgres[54547]: [9-1] 2015-05-21 21:30:27.957 CEST @ 54547 LOG: consistent recovery state reached at 4C4D/946CE620
May 21 21:30:27 d8 postgres[54547]: [10-1] 2015-05-21 21:30:27.957 CEST @ 54547 LOG: unexpected pageaddr 4C45/CD6D4000 in log segment 0000000400004C4D00000094, offset 7159808
May 21 21:30:27 d8 postgres[54546]: [3-1] 2015-05-21 21:30:27.957 CEST @ 54546 LOG: database system is ready to accept read only connections
May 21 21:30:27 d8 postgres[54580]: [4-1] 2015-05-21 21:30:27.969 CEST @ 54580 LOG: started streaming WAL from primary at 4C4D/94000000 on timeline 4
May 21 21:30:28 d8 postgres[54581]: [4-1] 2015-05-21 21:30:28.039 CEST pgsql@postgres 54581 [local] [unknown] FATAL: role "pgsql" does not exist
May 21 21:36:58 d8 postgres[54564]: [3-1] 2015-05-21 21:36:58.080 CEST @ 54564 LOG: restartpoint starting: time

-- And everything goes OK.

--
Piotr Gasidło
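As an aside, and not something used in this thread: for a setup without OmniPITR, a bare-bones archive_command that ships each completed segment to a single archive host might look roughly like the sketch below (host and path are placeholders; the command must return non-zero whenever the copy fails, so that PostgreSQL retries the segment later):

# postgresql.conf on the master (sketch only)
archive_mode = on
archive_command = 'rsync -a %p pgsql@archive.example.com:/path/to/wal_archive/%f'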
Re: Strange replication problem - segment restored from archive but still requested from master
From
Piotr Gasidło
Date:
2015-05-22 6:55 GMT+02:00 Fujii Masao <masao.fujii@gmail.com>:
> Thanks for the report! This seems to be a bug.
>
> This problem happens when WAL record is stored in separate two WAL files and
> there is no valid latter WAL file in the standby. In your case, the former file
> is 0000000400004C4D00000090 and the latter is 0000000400004C4D00000091.

I've got both files. Here is what I got when I pg_xlogdump-ed them:

$ cat dump1

0000000400004C4D00000090.dump

...
rmgr: Heap2 len (rec/tot): 24/ 7772, tx: 0, lsn: 4C4D/90FF6DB8, prev 4C4D/90FF51E8, bkp: 1000, desc: clean: rel 1663/131438/10079072; blk 13843 remxid 472083193
rmgr: Heap2 len (rec/tot): 504/ 536, tx: 0, lsn: 4C4D/90FF8C30, prev 4C4D/90FF6DB8, bkp: 0000, desc: freeze_page: rel 1663/131438/10079072; blk 13843; cutoff xid 422325322 ntuples 40
rmgr: Heap2 len (rec/tot): 24/ 5632, tx: 0, lsn: 4C4D/90FF8E48, prev 4C4D/90FF8C30, bkp: 1000, desc: freeze_page: rel 1663/131438/10079072; blk 13844; cutoff xid 422325322 ntuples 30
rmgr: Heap2 len (rec/tot): 24/ 7528, tx: 0, lsn: 4C4D/90FFA460, prev 4C4D/90FF8E48, bkp: 1000, desc: clean: rel 1663/131438/10079072; blk 13845 remxid 472083193
rmgr: Heap2 len (rec/tot): 564/ 596, tx: 0, lsn: 4C4D/90FFC1E0, prev 4C4D/90FFA460, bkp: 0000, desc: freeze_page: rel 1663/131438/10079072; blk 13845; cutoff xid 422325322 ntuples 45
rmgr: Heap2 len (rec/tot): 24/ 5316, tx: 0, lsn: 4C4D/90FFC438, prev 4C4D/90FFC1E0, bkp: 1000, desc: clean: rel 1663/131438/10079072; blk 13846 remxid 472083193
rmgr: Heap2 len (rec/tot): 420/ 452, tx: 0, lsn: 4C4D/90FFD900, prev 4C4D/90FFC438, bkp: 0000, desc: freeze_page: rel 1663/131438/10079072; blk 13846; cutoff xid 422325322 ntuples 33
rmgr: Heap2 len (rec/tot): 24/ 7856, tx: 0, lsn: 4C4D/90FFDAC8, prev 4C4D/90FFD900, bkp: 1000, desc: freeze_page: rel 1663/131438/10079072; blk 13847; cutoff xid 422325322 ntuples 53
rmgr: Heap2 len (rec/tot): 20/ 52, tx: 0, lsn: 4C4D/90FFF990, prev 4C4D/90FFDAC8, bkp: 0000, desc: visible: rel 1663/131438/10079072; blk 13847

0000000400004C4D00000091.dump

rmgr: Heap2 len (rec/tot): 600/ 632, tx: 0, lsn: 4C4D/910014A0, prev 4C4D/90FFF9C8, bkp: 0000, desc: freeze_page: rel 1663/131438/10079072; blk 13848; cutoff xid 422325322 ntuples 48
rmgr: Heap2 len (rec/tot): 24/ 6640, tx: 0, lsn: 4C4D/91001718, prev 4C4D/910014A0, bkp: 1000, desc: freeze_page: rel 1663/131438/10079072; blk 13849; cutoff xid 422325322 ntuples 42
rmgr: Heap2 len (rec/tot): 24/ 7848, tx: 0, lsn: 4C4D/91003120, prev 4C4D/91001718, bkp: 1000, desc: freeze_page: rel 1663/131438/10079072; blk 13850; cutoff xid 422325322 ntuples 36
rmgr: Heap2 len (rec/tot): 20/ 52, tx: 0, lsn: 4C4D/91004FE0, prev 4C4D/91003120, bkp: 0000, desc: visible: rel 1663/131438/10079072; blk 13850
rmgr: Heap2 len (rec/tot): 24/ 6208, tx: 0, lsn: 4C4D/91005018, prev 4C4D/91004FE0, bkp: 1000, desc: clean: rel 1663/131438/10079072; blk 13851 remxid 472083193
rmgr: Heap2 len (rec/tot): 564/ 596, tx: 0, lsn: 4C4D/91006870, prev 4C4D/91005018, bkp: 0000, desc: freeze_page: rel 1663/131438/10079072; blk 13851; cutoff xid 422325322 ntuples 45
rmgr: Heap2 len (rec/tot): 24/ 7340, tx: 0, lsn: 4C4D/91006AC8, prev 4C4D/91006870, bkp: 1000, desc: freeze_page: rel 1663/131438/10079072; blk 13852; cutoff xid 422325322 ntuples 39
rmgr: Heap2 len (rec/tot): 24/ 7744, tx: 0, lsn: 4C4D/91008790, prev 4C4D/91006AC8, bkp: 1000, desc: clean: rel 1663/131438/10079072; blk 13853 remxid 472083193
rmgr: Heap2 len (rec/tot): 492/ 524, tx: 0, lsn: 4C4D/9100A5E8, prev 4C4D/91008790, bkp: 0000, desc: freeze_page: rel 1663/131438/10079072; blk 13853; cutoff xid 422325322 ntuples 39
rmgr: Heap2 len (rec/tot): 24/ 7152, tx: 0, lsn: 4C4D/9100A7F8, prev 4C4D/9100A5E8, bkp: 1000, desc: freeze_page: rel 1663/131438/10079072; blk 13854; cutoff xid 422325322 ntuples 32
rmgr: Heap2 len (rec/tot): 24/ 7380, tx: 0, lsn: 4C4D/9100C400, prev 4C4D/9100A7F8, bkp: 1000, desc: freeze_page: rel 1663/131438/10079072; blk 13855; cutoff xid 422325322 ntuples 42
rmgr: Heap2 len (rec/tot): 24/ 6784, tx: 0, lsn: 4C4D/9100E0F0, prev 4C4D/9100C400, bkp: 1000, desc: clean: rel 1663/131438/10079072; blk 13856 remxid 472083193
rmgr: Heap2 len (rec/tot): 408/ 440, tx: 0, lsn: 4C4D/9100FB70, prev 4C4D/9100E0F0, bkp: 0000, desc: freeze_page: rel 1663/131438/10079072; blk 13856; cutoff xid 422325322 ntuples 32
rmgr: Heap2 len (rec/tot): 24/ 7312, tx: 0, lsn: 4C4D/9100FD28, prev 4C4D/9100FB70, bkp: 1000, desc: clean: rel 1663/131438/10079072; blk 13857 remxid 472083193
rmgr: Heap2 len (rec/tot): 432/ 464, tx: 0, lsn: 4C4D/910119D0, prev 4C4D/9100FD28, bkp: 0000, desc: freeze_page: rel 1663/131438/10079072; blk 13857; cutoff xid 422325322 ntuples 34
rmgr: Heap2 len (rec/tot): 24/ 7920, tx: 0, lsn: 4C4D/91011BA0, prev 4C4D/910119D0, bkp: 1000, desc: freeze_page: rel 1663/131438/10079072; blk 13858; cutoff xid 422325322 ntuples 60
...

If this is not enough, I can attach both files.

--
Piotr Gasidło
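For readers who want to reproduce this kind of output: a dump like the one above can be produced with the pg_xlogdump tool shipped with 9.3 (renamed pg_waldump in PostgreSQL 10). A rough sketch, assuming both segment files have been copied into the current directory:

pg_xlogdump 0000000400004C4D00000090 0000000400004C4D00000091 > dump1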
Re: Strange replication problem - segment restored from archive but still requested from master
From
Piotr Gasidło
Date:
2015-05-22 6:55 GMT+02:00 Fujii Masao <masao.fujii@gmail.com>:
>
> This problem happens when WAL record is stored in separate two WAL files and
> there is no valid latter WAL file in the standby. In your case, the former file
> is 0000000400004C4D00000090 and the latter is 0000000400004C4D00000091.
>
> In this case, the first half of WAL record can be read from the former WAL file,
> but the remaining half not because no valid latter file exists in the standby.
> Then the standby tries to retrieve the latter WAL file via replication.
> The problem here is that the standby tries to start the replication from the
> starting point of WAL record, i.e., that's the location of the former WAL file.
> So the already-read WAL file is requested via replication.
> (..)

I currently have wal_keep_segments set to 0.

Will setting this to a higher value help? As I understand it, the master won't delete the segment and can stream it to the slave on request - so it will help.

Does this setting delay WAL archiving?

--
Piotr Gasidło
Re: Strange replication problem - segment restored from archive but still requested from master
From
Guillaume Lelarge
Date:
2015-05-22 18:36 GMT+02:00 Piotr Gasidło <quaker@barbara.eu.org>:
2015-05-22 6:55 GMT+02:00 Fujii Masao <masao.fujii@gmail.com>:
>
> This problem happens when WAL record is stored in separate two WAL files and
> there is no valid latter WAL file in the standby. In your case, the former file
> is 0000000400004C4D00000090 and the latter is 0000000400004C4D00000091.
>
> In this case, the first half of WAL record can be read from the former WAL file,
> but the remaining half not because no valid latter file exists in the standby.
> Then the standby tries to retrieve the latter WAL file via replication.
> The problem here is that the standby tries to start the replication from the
> starting point of WAL record, i.e., that's the location of the former WAL file.
> So the already-read WAL file is requested via replication.
> (..)
I currently have wal_keep_segments set to 0.
Will setting this to a higher value help? As I understand it, the master
won't delete the segment and can stream it to the slave on request - so
it will help.
It definitely helps, but the issue could still happen.
Does this setting delay WAL archiving?
Nope. It delays recycling.
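For concreteness, wal_keep_segments is set on the master; the value below is only an illustration and should be sized to cover the WAL volume the master can produce while a standby is down or lagging:

# postgresql.conf on the master
wal_keep_segments = 256   # keep at least 256 * 16 MB of completed WAL in pg_xlog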
--
Re: Strange replication problem - segment restored from archive but still requested from master
From
Piotr Gasidło
Date:
2015-05-25 11:30 GMT+02:00 Guillaume Lelarge <guillaume@lelarge.info>:
>> I currently have wal_keep_segments set to 0.
>> Will setting this to a higher value help? As I understand it, the master
>> won't delete the segment and can stream it to the slave on request - so
>> it will help.
>
>
> It definitely helps, but the issue could still happen.
>

What conditions must be met for the issue to happen?

Both archive_command on the master and restore_command on the slave are set and working. Also, wal_keep_segments is set. I see no point of failure - only a delay in the case of high WAL traffic on the master:
- the slave starts by restoring WALs from the archive,
- it then connects to the master and notices that, for the master's last WAL, it needs the previous one ("the issue"),
- the slave asks the master for the previous WAL and gets it - job done, streaming replication established, exit,
- if it is unable to get it (WAL traffic is high, and after restoring the last WAL from the archive and asking the master for the next one, more than wal_keep_segments have been recycled), it goes back to looking for WALs in the archive.

Do I get it right?

--
Piotr Gasidło
Re: Strange replication problem - segment restored from archive but still requested from master
From
Guillaume Lelarge
Date:
2015-05-25 15:15 GMT+02:00 Piotr Gasidło <quaker@barbara.eu.org>:
2015-05-25 11:30 GMT+02:00 Guillaume Lelarge <guillaume@lelarge.info>:
>> I currently have wal_keep_segments set to 0.
>> Will setting this to a higher value help? As I understand it, the master
>> won't delete the segment and can stream it to the slave on request - so
>> it will help.
>
>
> It definitely helps, but the issue could still happen.
>
What conditions must be met for the issue to happen?
Very high WAL traffic can make the slave lag so far behind that even wal_keep_segments isn't enough.
Both archive_command on the master and restore_command on the slave are
set and working. Also, wal_keep_segments is set.
I see no point of failure - only a delay in the case of high WAL traffic
on the master:
- the slave starts by restoring WALs from the archive,
- it then connects to the master and notices that, for the master's last
WAL, it needs the previous one ("the issue"),
- the slave asks the master for the previous WAL and gets it - job done,
streaming replication established, exit,
- if it is unable to get it (WAL traffic is high, and after restoring the
last WAL from the archive and asking the master for the next one, more
than wal_keep_segments have been recycled), it goes back to looking for
WALs in the archive.
Do I get it right?
Yes. If you set archive_command (on the master) and restore_command (on the slave) correctly, there's no point of failure. You might still get the "WAL not available" error message, but the slave can synchronize itself with the archived WALs.
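A side note not taken from this thread: one way to judge whether wal_keep_segments and the archive are keeping up is to watch the streaming lag from the master, for example:

-- on the master (9.2+): replay lag of each connected standby, in bytes
SELECT application_name, client_addr,
       pg_xlog_location_diff(pg_current_xlog_location(), replay_location) AS replay_lag_bytes
FROM pg_stat_replication;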
--