Incorrect handling of OOM in WAL replay leading to data loss - Mailing list pgsql-hackers

From Michael Paquier
Subject Incorrect handling of OOM in WAL replay leading to data loss
Msg-id ZMh/WV+CuknqePQQ@paquier.xyz
List pgsql-hackers
Hi all,

A colleague, Ethan Mertz (in CC), has discovered that we do not
correctly handle WAL records that fail because of an OOM when
allocating the space they require.  In Ethan's case, we bumped into
the failure after an allocation failure in XLogReadRecordAlloc():
"out of memory while trying to decode a record of length"

As far as I can see, PerformWalRecovery() uses LOG as elevel for its
private callback in the xlogreader when going through ReadRecord(),
which leads to the failure being reported, but recovery then treats
the failure as the end of WAL and abruptly ends, leading to some data
loss.
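
To make the sequence clearer, here is a heavily paraphrased sketch of
what happens in xlogrecovery.c as I read it (the real ReadRecord() has
many more branches for page reads, timeline switches, standby retries,
etc., all elided here):

/*
 * Paraphrased sketch, not the actual code: ReadRecord() reports the
 * xlogreader's error message at the emode given by its caller, and the
 * main redo loop in PerformWalRecovery() calls it with LOG, so an OOM
 * coming from XLogReadRecordAlloc() is only logged and the NULL result
 * is then taken as the end of WAL.
 */
static XLogRecord *
ReadRecord_sketch(XLogPrefetcher *xlogprefetcher, int emode,
                  bool fetching_ckpt, TimeLineID replayTLI)
{
    XLogRecord *record;
    char       *errormsg;

    record = XLogPrefetcherReadRecord(xlogprefetcher, &errormsg);
    if (record == NULL && errormsg != NULL)
        ereport(emode,          /* emode is LOG in the redo loop */
                (errmsg_internal("%s", errormsg)));

    /*
     * Returning NULL here makes PerformWalRecovery() believe that the
     * end of WAL has been reached, so redo stops, whether the failure
     * was a corrupted record or a transient OOM.
     */
    return record;
}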

In crash recovery, any records after the OOM would not be replayed.
At a quick glance, it seems to me that this can also impact standbys,
where recovery could stop earlier than it should once a consistent
point has been reached.

Attached is a patch that can be applied on HEAD to inject an error;
then just run the attached script xlogreader_oom.bash, or something
similar, to see the failure in the logs:
LOG:  redo starts at 0/1913CD0
LOG:  out of memory while trying to decode a record of length 57
LOG:  redo done at 0/1917358 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
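
For the archives, the injection is nothing fancy.  Conceptually (this
is not the attached patch, and FORCE_XLOG_DECODE_OOM is just a
placeholder trigger) it amounts to making XLogReadRecordAlloc() return
NULL, as if the allocation of the decode buffer had failed:

/*
 * Illustration only, not the attached patch: force XLogReadRecordAlloc()
 * in xlogreader.c to behave as if the allocation of the decode buffer
 * had failed.  A NULL result makes the caller report "out of memory
 * while trying to decode a record of length ..." and give up on the
 * record.  FORCE_XLOG_DECODE_OOM is a placeholder for whatever trigger
 * is convenient while testing.
 */
#ifdef FORCE_XLOG_DECODE_OOM
    return NULL;
#endif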

It also looks like recovery_prefetch may mitigate the issue a bit if
we do a read in non-blocking mode, but that's not a strong guarantee
either, especially if the host is under memory pressure.

A patch is registered in the commit fest to improve the error
detection handling, but as far as I can see it does not handle the OOM
case and changes ReadRecord() to use a WARNING in the redo loop:
https://www.postgresql.org/message-id/20200228.160100.2210969269596489579.horikyota.ntt%40gmail.com

Off the top of my head, any solution I can think of needs to add more
information to XLogReaderState, for instance by tracking the type of
error that happened next to errormsg_buf, which is where these errors
are stored, but none of that can be backpatched, unfortunately.
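
To be clearer about what I have in mind, a rough sketch could be
something like the following, where the names are only placeholders
and not a proposal for the final interface:

/*
 * Rough sketch with placeholder names: give the xlogreader a way to
 * tell its callers *why* a record could not be read, next to
 * errormsg_buf.
 */
typedef enum XLogReaderErrorCode
{
    XLOGREADER_NO_ERROR,
    XLOGREADER_OOM,             /* e.g. failure in XLogReadRecordAlloc() */
    XLOGREADER_INVALID_DATA     /* anything that may mean the end of WAL */
} XLogReaderErrorCode;

/* In XLogReaderState, next to errormsg_buf: */
    char       *errormsg_buf;
    XLogReaderErrorCode errorcode;  /* new, set together with errormsg_buf */

ReadRecord() and the redo loop could then check this field and, say,
FATAL on an OOM instead of assuming that the end of WAL has been
reached, while keeping the current behavior for records that are
actually invalid.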

Comments?
--
Michael

