Incorrect handling of OOM in WAL replay leading to data loss - Mailing list pgsql-hackers

From Michael Paquier
Subject Incorrect handling of OOM in WAL replay leading to data loss
Msg-id ZMh/WV+CuknqePQQ@paquier.xyz
List pgsql-hackers
Hi all,

A colleague, Ethan Mertz (in CC), has discovered that we do not
correctly handle WAL records that fail because of an OOM when
allocating the space they require.  In Ethan's case, we bumped into
the failure after an allocation failure in XLogReadRecordAlloc():
"out of memory while trying to decode a record of length"

As far as I can see, PerformWalRecovery() uses LOG as elevel for its
private callback in the xlogreader when going through ReadRecord(),
which leads to the failure being reported, but recovery then treats
the failure as the end of WAL and abruptly ends, leading to some data
loss.
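
To make the sequence clearer, here is a heavily paraphrased sketch of
what happens in xlogrecovery.c as I read it (the real ReadRecord() has
many more branches for page reads, timeline switches, standby retries,
etc., all elided here):

/*
 * Paraphrased sketch, not the actual code: ReadRecord() reports the
 * xlogreader's error message at the emode given by its caller, and the
 * main redo loop in PerformWalRecovery() calls it with LOG, so an OOM
 * coming from XLogReadRecordAlloc() is only logged and the NULL result
 * is then taken as the end of WAL.
 */
static XLogRecord *
ReadRecord_sketch(XLogPrefetcher *xlogprefetcher, int emode,
                  bool fetching_ckpt, TimeLineID replayTLI)
{
    XLogRecord *record;
    char       *errormsg;

    record = XLogPrefetcherReadRecord(xlogprefetcher, &errormsg);
    if (record == NULL && errormsg != NULL)
        ereport(emode,          /* emode is LOG in the redo loop */
                (errmsg_internal("%s", errormsg)));

    /*
     * Returning NULL here makes PerformWalRecovery() believe that the
     * end of WAL has been reached, so redo stops, whether the failure
     * was a corrupted record or a transient OOM.
     */
    return record;
}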

In crash recovery, any records after the OOM would not be replayed.
At a quick glance, it seems to me that this can also impact standbys,
where recovery could stop earlier than it should once a consistent
point has been reached.

Attached is a patch that can be applied on HEAD to inject an error;
then just run the attached script xlogreader_oom.bash, or something
similar, to see the failure in the logs:
LOG:  redo starts at 0/1913CD0
LOG:  out of memory while trying to decode a record of length 57
LOG:  redo done at 0/1917358 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
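
For the archives, the injection is nothing fancy.  Conceptually (this
is not the attached patch, and FORCE_XLOG_DECODE_OOM is just a
placeholder trigger) it amounts to making XLogReadRecordAlloc() return
NULL, as if the allocation of the decode buffer had failed:

/*
 * Illustration only, not the attached patch: force XLogReadRecordAlloc()
 * in xlogreader.c to behave as if the allocation of the decode buffer
 * had failed.  A NULL result makes the caller report "out of memory
 * while trying to decode a record of length ..." and give up on the
 * record.  FORCE_XLOG_DECODE_OOM is a placeholder for whatever trigger
 * is convenient while testing.
 */
#ifdef FORCE_XLOG_DECODE_OOM
    return NULL;
#endif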

It also looks like recovery_prefetch may mitigate the issue a bit if
we do a read in non-blocking mode, but that's not a strong guarantee
either, especially if the host is under memory pressure.

A patch is registered in the commit fest to improve the error
detection handling, but as far as I can see it does not handle the OOM
case and changes ReadRecord() to use a WARNING in the redo loop:
https://www.postgresql.org/message-id/20200228.160100.2210969269596489579.horikyota.ntt%40gmail.com

Off the top of my head, any solution I can think of needs to add more
information to XLogReaderState, for instance by tracking the type of
error that happened next to errormsg_buf, which is where these errors
are stored, but none of that can be backpatched, unfortunately.
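
To be clearer about what I have in mind, a rough sketch could be
something like the following, where the names are only placeholders
and not a proposal for the final interface:

/*
 * Rough sketch with placeholder names: give the xlogreader a way to
 * tell its callers *why* a record could not be read, next to
 * errormsg_buf.
 */
typedef enum XLogReaderErrorCode
{
    XLOGREADER_NO_ERROR,
    XLOGREADER_OOM,             /* e.g. failure in XLogReadRecordAlloc() */
    XLOGREADER_INVALID_DATA     /* anything that may mean the end of WAL */
} XLogReaderErrorCode;

/* In XLogReaderState, next to errormsg_buf: */
    char       *errormsg_buf;
    XLogReaderErrorCode errorcode;  /* new, set together with errormsg_buf */

ReadRecord() and the redo loop could then check this field and, say,
FATAL on an OOM instead of assuming that the end of WAL has been
reached, while keeping the current behavior for records that are
actually invalid.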

Comments?
--
Michael

