Re: BUG #15151: Error with wal replay after planned manual switchover. - Mailing list pgsql-bugs
From: Michael Paquier
Subject: Re: BUG #15151: Error with wal replay after planned manual switchover.
Msg-id: 20180413011046.GB1552@paquier.xyz
In response to: BUG #15151: Error with wal replay after planned manual switchover. (PG Bug reporting form <noreply@postgresql.org>)
List: pgsql-bugs
On Thu, Apr 12, 2018 at 12:20:09PM +0000, PG Bug reporting form wrote:
> Today after planned manual switch-over (via remove all load from master,
> checkpoint master, wait replica in sync, stop master, stop replica, switch
> master and replica roles without promote), I got very strange error on new
> replica (log_min_messages=debug5):
>
> 2018-04-12 04:23:28 MSK 32059 @ from [vxid: txid:0] [] LOG: database
> system was shut down in recovery at 2018-04-12 04:23:26 MSK
> 2018-04-12 04:23:28 MSK 32059 @ from [vxid: txid:0] [] LOG: entering
> standby mode
> 2018-04-12 04:23:28 MSK 32059 @ from [vxid: txid:0] [] LOG: invalid
> resource manager ID 211 at C7BB/EDC42C68

C7BB/EDC42C68 is not a valid WAL record, but this message can still show up
when you reach the end of the WAL data, as the XLOG reader has not yet
received all the data it wants. The error messages which validate the format
of a record's header can be triggered at random positions: when a WAL segment
is recycled by a checkpoint, its data is not zeroed for performance reasons,
so the segment range contains leftover random data. All the sanity checks,
as well as the per-record CRCs, are there to guard against exactly that.
After seeing the end of the WAL stream, as this is a standby, it switches
into a mode where it waits for WAL data from different sources: the local
pg_xlog, a full segment fetched via restore_command, or WAL streaming.
And...
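The recycled-segment behavior described above can be sketched as a toy validity check: a reader trusts a record only if its resource-manager ID is in range and its CRC matches the value stored in the header, and leftover bytes in a recycled segment rarely survive both tests. This is a simplified Python illustration, not PostgreSQL's actual code: real WAL headers are C structs, the `RM_MAX_ID` value below is hypothetical, and PostgreSQL uses CRC-32C rather than zlib's plain CRC-32.

```python
import zlib

RM_MAX_ID = 24  # hypothetical cap; PostgreSQL defines its own rmgr table

def record_looks_valid(rmid: int, stored_crc: int, payload: bytes) -> bool:
    """Reject records whose header fields or CRC don't check out.

    Random leftovers in a recycled segment rarely pass both tests,
    which is how the reader detects the end of valid WAL.
    """
    if rmid > RM_MAX_ID:  # e.g. "invalid resource manager ID 211"
        return False
    return zlib.crc32(payload) & 0xFFFFFFFF == stored_crc

payload = b"some WAL payload"
good_crc = zlib.crc32(payload) & 0xFFFFFFFF

print(record_looks_valid(3, good_crc, payload))    # True: plausible record
print(record_looks_valid(211, good_crc, payload))  # False: rmid out of range
print(record_looks_valid(3, 0xDEADBEEF, payload))  # False: CRC mismatch
```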
> 2018-04-12 04:23:28 MSK 32060 rohh.app@hh from 192.168.1.2 [vxid: txid:0] []
> FATAL: the database system is starting up
> 2018-04-12 04:23:28 MSK 32061 rohh.app@hh from [local] [vxid: txid:0] []
> FATAL: the database system is starting up
> 2018-04-12 04:23:28 MSK 32063 [unknown]@[unknown] from [local] [vxid:
> txid:0] [] LOG: incomplete startup packet
> 2018-04-12 04:23:28 MSK 32064 postgres@postgres from [local] [vxid: txid:0]
> [] FATAL: the database system is starting up
> 2018-04-12 04:23:28 MSK 32062 @ from [vxid: txid:0] [] LOG: started
> streaming WAL from primary at C7BB/ED000000 on timeline 1

When streaming begins, it fetches data from the beginning of the previous
segment and tries to fetch its data completely, so once the next record is
fully present, replay is allowed to resume.

> 2018-04-12 04:23:28 MSK 32059 @ from [vxid: txid:0] [] LOG: invalid
> resource manager ID 211 at C7BB/EDC42C68
> 2018-04-12 04:23:30 MSK 32062 @ from [vxid: txid:0] [] FATAL: terminating
> walreceiver process due to administrator command

What is that, actually?

> Restart of new slave doesn't help.
> Stop new slave, manual clean xlog directory on slave and start slave -
> doesn't help (e.g. it doesn't look like the single wal file was damaged
> during network transfer).

Hm. What happens if you copy 000000010000C7BB000000ED to the standby's
pg_xlog? Have you checked whether replay on the standby is able to move past
record C7BB/EDC42C68? Perhaps it cannot move on because you are pointing at
an incorrect primary, so the stream cannot resume?
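The segment arithmetic behind those log lines can be checked by hand: with the default 16MB segment size, the failing LSN C7BB/EDC42C68 lives in segment file 000000010000C7BB000000ED on timeline 1, whose first byte is C7BB/ED000000, exactly where the walreceiver restarted streaming. A small Python sketch of the standard LSN-to-segment math (the helper names are mine, not PostgreSQL's):

```python
WAL_SEG_SIZE = 16 * 1024 * 1024  # default 16MB WAL segments

def parse_lsn(lsn: str) -> int:
    """Turn 'C7BB/EDC42C68' into a 64-bit byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def segment_file(tli: int, lsn: str) -> str:
    """Name of the WAL segment file containing the LSN."""
    segno = parse_lsn(lsn) // WAL_SEG_SIZE
    return "%08X%08X%08X" % (tli, segno >> 8, segno & 0xFF)

def segment_start(lsn: str) -> str:
    """LSN of the first byte of the containing segment."""
    pos = parse_lsn(lsn) - parse_lsn(lsn) % WAL_SEG_SIZE
    return "%X/%08X" % (pos >> 32, pos & 0xFFFFFFFF)

print(segment_file(1, "C7BB/EDC42C68"))  # 000000010000C7BB000000ED
print(segment_start("C7BB/EDC42C68"))    # C7BB/ED000000
```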
> pg_xlogdump show no errors in wal files:
>
> <snip>
>
> rmgr: XLOG len (rec/tot): 51/ 2520, tx: 0, lsn: C7BB/EDC41F38,
> prev C7BB/EDC41348, desc: FPI_FOR_HINT, blkref #0: rel 1663/16400/18105
> blk 326155 FPW
> rmgr: XLOG len (rec/tot): 51/ 2703, tx: 0, lsn: C7BB/EDC42928,
> prev C7BB/EDC41F38, desc: FPI_FOR_HINT, blkref #0: rel 1663/16400/18105
> blk 326262 FPW
>
> rmgr: XLOG len (rec/tot): 51/ 2556, tx: 0, lsn: C7BB/EDC433B8,
> prev C7BB/EDC42928, desc: FPI_FOR_HINT, blkref #0: rel 1663/16400/18105
> blk 326257 FPW
> rmgr: XLOG len (rec/tot): 51/ 3140, tx: 0, lsn: C7BB/EDC43DB8,
> prev C7BB/EDC433B8, desc: FPI_FOR_HINT, blkref #0: rel
> 1663/16400/150046445 blk 32094 FPW
> rmgr: XLOG len (rec/tot): 51/ 2595, tx: 0, lsn: C7BB/EDC44A18,
> prev C7BB/EDC43DB8, desc: FPI_FOR_HINT, blkref #0: rel 1663/16400/18105
> blk 326250 FPW
>
> pg_xlogdump show no visible to me anomalies in 000000010000C7BB000000ED or
> 000000010000C7BB000000EC files.

Indeed, that looks clean.

> Any additional information I'll try provide, but my experience with internal
> wal structure outside of using pg_xlogdump output very limited.

It looks like the stream is not able to resume because the replay cannot
move on. Perhaps you should check that the standby is fetching its data via
streaming from the correct source.
--
Michael
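One quick sanity check that can be scripted over pg_xlogdump output: each record's prev pointer should equal the LSN of the record before it, so a break in the backward chain stands out immediately. A hedged Python sketch using the (lsn, prev) pairs quoted in the report, which do form a continuous chain:

```python
# (lsn, prev) pairs copied from the pg_xlogdump output quoted above
records = [
    ("C7BB/EDC41F38", "C7BB/EDC41348"),
    ("C7BB/EDC42928", "C7BB/EDC41F38"),
    ("C7BB/EDC433B8", "C7BB/EDC42928"),
    ("C7BB/EDC43DB8", "C7BB/EDC433B8"),
    ("C7BB/EDC44A18", "C7BB/EDC43DB8"),
]

def chain_is_continuous(recs):
    """True if every record's prev pointer matches the preceding record's LSN."""
    return all(prev == recs[i - 1][0]
               for i, (_, prev) in enumerate(recs) if i > 0)

print(chain_is_continuous(records))  # True: the dumped chain has no gap
```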