XLogReadBufferExtended() vs disconnected segments - Mailing list pgsql-hackers
From: Andres Freund
Subject: XLogReadBufferExtended() vs disconnected segments
Msg-id: 20230223010147.32oir7sb66slqnjk@awork3.anarazel.de
List: pgsql-hackers
Hi,

I was trying to implement ExtendRelationBufferedTo(), responding to a review comment by Heikki, in
https://www.postgresql.org/message-id/20230222203152.rh4s75aedj65hyjn@awork3.anarazel.de

which led me to stare at the P_NEW do-while loop in XLogReadBufferExtended(). I first started to reply on that thread, but this seems like a big enough issue to be worth a separate thread.

The relevant logic was added in 6f2aead1ffec; the relevant discussion is at
https://www.postgresql.org/message-id/32313.1392231107%40sss.pgh.pa.us

My understanding of what happened there is that we tried to extend a relation sized one block below a segment boundary, and after that the relation was much larger, because the next segment file existed and had a non-zero size. And because we extended blkno - lastblock times, we'd potentially blow up the relation size much more than intended.

The actual cause in the reported case appears to have been a bug in wal-e. But I suspect it's possible to hit something like this without such problems, just due to crashes on the replica, or "skew" while taking a base backup.

I find it pretty sketchy that we just leave the contents of the previously "disconnected" segment around, without using log_invalid_page() for the range, or warning, or ...

Most of the time this issue would be fixed by later WAL replay initializing the later segment. But I don't think there's any guarantee of that (see below).

It'd be one thing if we could only accidentally use data in such a segment when the situation is caused by a bug in base backup tooling, or filesystem corruption, or ... But I think we can encounter the issue without anything like that being involved.
Imagine this scenario:

1) filenode A gets extended to segment 3
2) basebackup starts, including performing a checkpoint
3) basebackup ends up copying A's segment 3 first, while in progress
4) filenode A is dropped
5) checkpoint happens, allowing filenode A's number to be used again
6) a new relation is created, reusing filenode A
7) basebackup ends

At that point A will have segment 0 and segment 3, with segments 1 and 2 missing. The WAL replay for 4) won't drop segment 3, because smgrnblocks() won't even see it, since segment 2 doesn't exist.

If a replica starts from this base backup, we'll be fine until A again grows far enough to fill segment 2. At that point, we'll suddenly have completely bogus contents in segment 3. Obviously, accesses to those contents could trivially crash at that point.

I suspect there's an easier-to-hit version of this. Consider this path in ExecuteTruncateGuts():

		/*
		 * Normally, we need a transaction-safe truncation here.  However, if
		 * the table was either created in the current (sub)transaction or has
		 * a new relfilenumber in the current (sub)transaction, then we can
		 * just truncate it in-place, because a rollback would cause the whole
		 * table or the current physical file to be thrown away anyway.
		 */
		if (rel->rd_createSubid == mySubid ||
			rel->rd_newRelfilelocatorSubid == mySubid)
		{
			/* Immediate, non-rollbackable truncation is OK */
			heap_truncate_one_rel(rel);

Afaict that could easily lead to a version of the above that doesn't even require relfilenodes getting recycled.

One way to defend against this would be to make mdextend(), whenever it extends into the last block of a segment, unlink the next segment - it can't contain valid contents. But it seems scary to just unlink entire segments.

Greetings,

Andres Freund