Re: BUG #15938: Corrupted WAL segment after crash recovery - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: BUG #15938: Corrupted WAL segment after crash recovery
Date
Msg-id 20190806.171140.265475235.horikyota.ntt@gmail.com
Whole thread Raw
List pgsql-hackers
Hello.

At Thu, 01 Aug 2019 13:52:52 +0000, PG Bug reporting form <noreply@postgresql.org> wrote in
<15938-8591df7e95064538@postgresql.org>
> The following bug has been logged on the website:
> 
> Bug reference:      15938
> Logged by:          Alexander Kukushkin
> Email address:      cyberdemn@gmail.com
> PostgreSQL version: 10.9
> Operating system:   Ubuntu 18.04.2 LTS
> Description:        
> 
> On one of our cluster one of the postgres backend processes was killed by
> kernel oom.

Although I don't think replication reconnection to
crash-recovered master is generally guaranteed, but this seems
different.

> 2019-08-01 12:41:06.376 UTC,,,22697,,5d42dddb.58a9,4,,2019-08-01 12:40:59
> UTC,,0,LOG,00000,"redo done at B8D/44FFE570",,,,,,,,,""
> 2019-08-01 12:41:06.376 UTC,,,22697,,5d42dddb.58a9,5,,2019-08-01 12:40:59
> UTC,,0,LOG,00000,"last completed transaction was at log time 2019-08-01
> 12:40:54.782747+00",,,,,,,,,""
..
> Unfortunately I can't compare files in the archive and on the primary,
> because it was recycled despite usage of replication slots.
> pg_replication_slots view reports restart_lsn as B8D/451EB540 (the next
> segment).

WAL records since 44ffe6d0 (maybe) till 451eb540 are somehow
ignored during crash recovery of the master, or lost despite it
should have been fsynced out. The only thing I came up for this
is the fsync problem but it is fixed in 10.7. But the following
wiki description:

https://wiki.postgresql.org/wiki/Fsync_Errors

> Linux 4.13 and 4.15: fsync() only reports writeback errors that
> occurred after you called open() so our schemes for closing and
> opening files LRU-style and handing fsync() work off to the
> checkpointer process can hide write-back errors; also buffers are
> marked clean after errors so even if you opened the file before
> the failure, retrying fsync() can falsely report success and the
> modified buffer can be thrown away at any time due to memory
> pressure.

If I read this correctly, after checkpointer failed to fsync
while creating of a preallocate file (this is an ERROR, not a
PANIC), other processes will never receive fsync error about the
file.  This scenario is consistent with the fact that it seems
that the data loss starts from new segment's beginning (assuming
that the original 44ffe6d0 continues to the next segment).

Thoughts?


> But I can run pg_waldump, and the last record in this file is
> CHECKPOINT_SHUTDOWN:
> rmgr: XLOG        len (rec/tot):    106/   106, tx:          0, lsn:
> B8D/44FFF6D0, prev B8D/44FFE570, desc: CHECKPOINT_SHUTDOWN redo
> B8D/44FFF6D0; tli 80; prev tli 80; fpw true; xid 0:42293547; oid 10741249;
> multi 1; offset 0; oldest xid 549 in DB 1; oldest multi 1 in DB 1;
> oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
> pg_waldump: FATAL:  error in WAL record at B8D/44FFF6D0: invalid record
> length at B8D/44FFF740: wanted 24, got 0

The shutdown record is written at the end of crash recovery of
the master, then replicated. As the result the subsequent bytes
are inconsistent with the record.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: POC: Cleaning up orphaned files using undo logs
Next
From: Kuntal Ghosh
Date:
Subject: Re: Remove HeapTuple and Buffer dependency for predicate locking functions