Thread: XLogReadBufferExtended() vs disconnected segments

XLogReadBufferExtended() vs disconnected segments

From
Andres Freund
Date:
Hi,

I was trying to implement ExtendRelationBufferedTo(), responding to a review
comment by Heikki, in
https://www.postgresql.org/message-id/20230222203152.rh4s75aedj65hyjn@awork3.anarazel.de

Which lead me to stare at the P_NEW do while loop in
XLogReadBufferExtended(). I first started to reply on that thread, but it
seems like a big enough issue that it seemed worth starting a separate thread.

The relevant logic was added in 6f2aead1ffec, the relevant discussion is at
https://www.postgresql.org/message-id/32313.1392231107%40sss.pgh.pa.us

My understanding of what happend there is that we tried to extend a relation,
sized one block below a segment boundary, and after that the relation was much
larger, because the next segment file existed, and had a non-zero size. And
because we extended blkno-lastblock times, we'd potentially blow up the
relation size much more than intended.

The actual cause of that in the reported case appears to have been a bug in
wal-e. But I suspect it's possible to hit something like that without such
problems, just due to crashes on the replica, or "skew" while taking a base
backup.


I find it pretty sketchy that we just leave the contents of the previously
"disconnected" segment contents around, without using log_invalid_page() for
the range, or warning, or ...

Most of the time this issue would be fixed due to later WAL replay
initializing the later segment. But I don't think there's a guarantee for
that (see below).

It'd be one thing if we accidentally used data in such a segment, if the
situation is only caused by a bug in base backup tooling, or filesystem
corruption, or ...

But I think we can encounter the issue without anything like that being
involved. Imagine this scenario:

1) filenode A gets extended to segment 3
2) basebackup starts, including performing a checkpoint
3) basebackup ends up copying A's segment 3 first, while in progress
4) filenode A is dropped
5) checkpoint happens, allowing smgrrel 10 to be used again
6) filenode 10 is created newly
7) basebackup ends

At that point A will have segment 0, segment 3. The WAL replay for 4) won't
drop segment 3, because an smgrnblocks() won't even see it, because segment 2
doesn't exist.

If a replica starts from this base backup, we'll be fine until A again grows
far enough to fill segment 2. At that point, we'll suddenly have completely
bogus contents in 3. Obviously accesses to those contents could trivially
crash at that point.


I suspect there's an easier to hit version of this: Consider this path in
ExecuteTruncateGuts():

        /*
         * Normally, we need a transaction-safe truncation here.  However, if
         * the table was either created in the current (sub)transaction or has
         * a new relfilenumber in the current (sub)transaction, then we can
         * just truncate it in-place, because a rollback would cause the whole
         * table or the current physical file to be thrown away anyway.
         */
        if (rel->rd_createSubid == mySubid ||
            rel->rd_newRelfilelocatorSubid == mySubid)
        {
            /* Immediate, non-rollbackable truncation is OK */
            heap_truncate_one_rel(rel);



Afaict that could easily lead to a version of the above that doesn't even
require relfilenodes getting recycled.


One way to to defend against this would be to make mdextend(), whenever it
extends into the last block of a segment, unlink the next segment - it can't
be a validly existing contents.  But it seems scary to just unlink entire
segments.


Greetings,

Andres Freund



Re: XLogReadBufferExtended() vs disconnected segments

From
Andres Freund
Date:
Hi,

On 2023-02-22 17:01:47 -0800, Andres Freund wrote:
> One way to to defend against this would be to make mdextend(), whenever it
> extends into the last block of a segment, unlink the next segment - it can't
> be a validly existing contents.  But it seems scary to just unlink entire
> segments.

Another way might be for XLOG_SMGR_TRUNCATE record, as well as smgr unlinks in
commit/abort records, to include not just the "target size", as we do today,
but to also include the current size.

I'm not sure that'd fix all potential issues, but it seems like it'd fix a lot
of the more obvious issues, because it'd prevent scenarios like a base backup
copying segment N, without copying N - 1, due to a concurrent truncate/drop,
from causing harm. Due to the range being included in the WAL record, replay
would know that N needs to be unlinked, even if smgrnblocks() thinks the
relation is much smaller.

Greetings,

Andres Freund