Thread: BUG #15412: "invalid contrecord length" during WAL replica recovery
From: PG Bug reporting form
The following bug has been logged on the website:

Bug reference: 15412
Logged by: Timur Luchkin
Email address: timur.luchkin@gmail.com
PostgreSQL version: 10.4
Operating system: Ubuntu 16.04.4 LTS
Description:

Hello.

Sorry to post it again, but I really need help to recover broken replica.

LOG: invalid contrecord length 861 at 159E/A6FFFC40

We are getting this kind of error in WAL based replica's log after the master DB was down due to no space left issue. Disk space was already added and master up'n'running (including its synchronous streaming based replica), but offsite WAL replica can't start due to above mentioned error. We can't recreate it using pg_basebackup due to slow network and huge DB size.

Are there any fixes possible to continue apply WALs?

More details:

<2018-09-25 08:07:19 UTC--- [app:,pid:19517,00000]>LOG: restored log file "000000010000159E000000A6" from archive
<2018-09-25 08:07:23 UTC--- [app:,pid:19517,00000]>LOG: restored log file "000000010000159E000000A7" from archive
<2018-09-25 08:07:23 UTC--- [app:,pid:19517,00000]>LOG: invalid contrecord length 861 at 159E/A6FFFC40
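The LSN in the error message can be mapped to a WAL segment file and a byte offset with a little arithmetic. The following is a minimal sketch, assuming the default 16 MB WAL segment size and timeline 1 (the timeline is taken from the leading "00000001" of the file names in the log). The function name `locate_lsn` is hypothetical and not part of PostgreSQL.

```python
# Assumption: default WAL segment size of 16 MB (it is fixed when the
# cluster is created and may differ on a given installation).
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def locate_lsn(lsn, timeline=1, seg_size=WAL_SEGMENT_SIZE):
    """Map an LSN such as '159E/A6FFFC40' to its WAL segment.

    Returns (segment file name, byte offset inside the segment,
    bytes left between that offset and the end of the segment).
    """
    hi_hex, lo_hex = lsn.split("/")
    # An LSN is a 64-bit byte position printed as two 32-bit hex halves.
    pos = (int(hi_hex, 16) << 32) | int(lo_hex, 16)

    segno = pos // seg_size
    offset = pos % seg_size

    # Segment file names are three 8-digit hex fields:
    # timeline, high 32 bits of the position, segment number within them.
    segs_per_xlogid = 0x100000000 // seg_size
    name = "%08X%08X%08X" % (
        timeline,
        segno // segs_per_xlogid,
        segno % segs_per_xlogid,
    )
    return name, offset, seg_size - offset


name, offset, remaining = locate_lsn("159E/A6FFFC40")
print(name, hex(offset), remaining)
```

Under these assumptions the LSN falls in segment 000000010000159E000000A6, 960 bytes before its end, so a record starting there that is longer than the remaining space has to continue in 000000010000159E000000A7. That matches the order of the "restored log file" lines above.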
On Mon, Oct 01, 2018 at 08:38:23AM +0000, PG Bug reporting form wrote:
> Sorry to post it again, but I really need help to recover broken replica.
> LOG: invalid contrecord length 861 at 159E/A6FFFC40

Heikki, Horiguchi-san, couldn't this be a side effect of ca572db22? I am afraid that this is not the first report we have on the matter lately.
--
Michael
Hello.

At Mon, 1 Oct 2018 18:06:46 +0900, Michael Paquier <michael@paquier.xyz> wrote in <20181001090646.GM11712@paquier.xyz>
> On Mon, Oct 01, 2018 at 08:38:23AM +0000, PG Bug reporting form wrote:
> > Sorry to post it again, but I really need help to recover broken replica.
> > LOG: invalid contrecord length 861 at 159E/A6FFFC40
>
> Heikki, Horiguchi-san, couldn't this be a side effect of ca572db22?
> I am afraid that this is not the first report we have on the matter
> lately.

First, I'd say with confidence that this is not related to the patch. The patch allows fetching a contrecord in the next segment from wherever it is available *after finding it is missing*. In this case, the server in trouble fetches segments continuously from the WAL archive.

I suppose that the "offsite WAL replica" is "a server that is not a part of the main site cluster and is recovering from its own archive files, which are continuously fed from (maybe) the master in the main site".

> <2018-09-25 08:07:23 UTC--- [app:,pid:19517,00000]>LOG: invalid contrecord length 861 at 159E/A6FFFC40

The last page for the contrecord, which resides in A7, is found to disagree on the remaining bytes. I suspect that A7 was copied while halfway written (and the archive file should be overwritten after the master restart), even though I'm not sure how a halfway-written file leads to the failure.

I'd check the consistency of the offsite replica's A7 file against the source (the master or the replica in the main site), using md5 or something like that. If they don't match, re-copying A7 into the offsite archive directory will fix the problem.

Thoughts?

--
Kyotaro Horiguchi
NTT Open Source Software Center
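The consistency check suggested above could be sketched as follows. This is only an illustration, assuming both copies of the segment are reachable as local, uncompressed files (if the archive stores compressed segments, they would need to be decompressed first). The helper names and the paths in the usage comment are hypothetical.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so a 16 MB segment is never held
        # in memory all at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def segments_match(source_path, replica_path):
    """True if the two WAL segment copies are byte-for-byte identical
    as far as MD5 can tell."""
    return file_md5(source_path) == file_md5(replica_path)


# Hypothetical usage; both paths are placeholders:
# segments_match("/main/archive/000000010000159E000000A7",
#                "/offsite/archive/000000010000159E000000A7")
```

A mismatch would support the theory that the offsite copy of A7 is stale or partial, in which case re-copying it from the source is the next step. Running `md5sum` on each host and comparing the output by eye does the same job when the files are not on one machine.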