Re: base backup vs. concurrent truncation - Mailing list pgsql-hackers

From: Andres Freund
Subject: Re: base backup vs. concurrent truncation
Msg-id: 20230425000340.uq2c43diuyfpr4au@awork3.anarazel.de
In response to: base backup vs. concurrent truncation (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers

Hi,

On 2023-04-21 09:42:57 -0400, Robert Haas wrote:
> Apologies if this has already been discussed someplace, but I couldn't
> find a previous discussion. It seems to me that base backups are
> broken in the face of a concurrent truncation that reduces the number
> of segments in a relation.

I think
https://www.postgresql.org/message-id/20230223010147.32oir7sb66slqnjk@awork3.anarazel.de
and the commits and discussions referenced from there are relevant to this topic.


> Suppose we have a relation that is 1.5GB in size, so that we have two
> files: 23456, which is 1GB, and 23456.1, which is 0.5GB. We'll back
> those files up in whichever order the directory scan finds them.
> Suppose we back up 23456.1 first. Then the relation is truncated to
> 0.5GB, so 23456.1 is removed and 23456 gets a lot shorter. Next we
> back up the file 23456. Now our backup contains files 23456 and
> 23456.1, each 0.5GB. But this breaks the invariant in md.c:


>  *      On disk, a relation must consist of consecutively numbered segment
>  *      files in the pattern
>  *              -- Zero or more full segments of exactly RELSEG_SIZE blocks each
> *              -- Exactly one partial segment of size 0 <= size < RELSEG_SIZE blocks
>  *              -- Optionally, any number of inactive segments of size 0 blocks.
> 
> basebackup.c's theory about relation truncation is that it doesn't
> really matter because WAL replay will fix things up. But in this case,
> I don't think it will, because WAL replay relies on the above
> invariant holding. As mdnblocks says:
> 
>                 /*
>                  * If segment is exactly RELSEG_SIZE, advance to next one.
>                  */
>                 segno++;
> 
> So I think what's going to happen is we're not going to notice 23456.1
> when we recover the backup. It will just sit there as an orphaned file
> forever, unless we extend 23456 back to a full 1GB, at which point we
> might abruptly start considering that file part of the relation again.

One important point is that I think 23456.1 can still contain data at that
point, which can lead to deleted rows reappearing, to vacuum failing because
rows from before relfrozenxid still exist, etc.
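
To make the failure mode concrete, here's the shape of that scan (a
simplified sketch, not the actual md.c code; blocks_in_segment() is a
hypothetical stand-in for the real segment open/size logic, and error
handling is omitted):

    /*
     * Simplified sketch of the segment scan in mdnblocks().
     */
    BlockNumber
    mdnblocks_sketch(SMgrRelation reln, ForkNumber forknum)
    {
        BlockNumber segno = 0;

        for (;;)
        {
            BlockNumber nblocks = blocks_in_segment(reln, forknum, segno);

            /*
             * Any segment shorter than RELSEG_SIZE is taken to be the last
             * segment of the relation, so the scan stops here.  With the
             * inconsistent backup above, 23456 is only 0.5GB, so we return
             * without ever looking at 23456.1 -- until 23456 grows back to
             * a full RELSEG_SIZE blocks.
             */
            if (nblocks < RELSEG_SIZE)
                return segno * RELSEG_SIZE + nblocks;

            /* If segment is exactly RELSEG_SIZE, advance to next one. */
            segno++;
        }
    }

And since recovery goes through the same smgr paths, WAL replay inherits
exactly that blindness to the orphaned segment.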


> Assuming I'm not wrong about all of this, the question arises: whose
> fault is this, and what to do about it? It seems to me that it's a bit
> hard to blame basebackup.c, because if you used pg_backup_start() and
> pg_backup_stop() and copied the directory yourself, you'd have exactly
> the same situation, and while we could (and perhaps should) teach
> basebackup.c to do something smarter, it doesn't seem realistic to
> impose complex constraints on the user's choice of file copy tool.

Agreed.


> So I think the problem is with md.c assuming that its invariant must
> hold on a cluster that's not guaranteed to be in a consistent state.
> But mdnblocks() clearly can't try to open every segment up to whatever
> the maximum theoretical possible segment number is every time it's
> invoked, because that would be wicked expensive. An idea that occurs
> to me is to remove all segment files following the first partial
> segment during startup, before we begin WAL replay. If that state
> occurs at startup, then either we have a scenario involving
> truncation, like those above, or a scenario involving relation
> extension, where we added a new segment and that made it to disk but
> the prior extension of the previous last segment file to maximum
> length did not. But in that case, WAL replay should, I think, fix
> things up. However, I'm not completely sure that there isn't some hole
> in this theory, and this way forward also doesn't sound particularly
> cheap. Nonetheless I don't have another idea right now.

What we've discussed somewhere in the past is to always truncate segment N+1
when creating the first page in segment N. I.e. when we extend into 23456.1,
we truncate 23456.2 to 0 blocks.  As far as I can tell, that'd solve this
issue?
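
Roughly like this, in the path that extends a relation into a new segment
(just a sketch of the idea; extend_into_new_segment() and
truncate_segment_if_exists() are hypothetical names, and locking/fsync
considerations are hand-waved):

    static void
    extend_into_new_segment(SMgrRelation reln, ForkNumber forknum,
                            BlockNumber new_segno)
    {
        /*
         * Before writing the first page of segment new_segno, make sure any
         * stale file for segment new_segno + 1 -- e.g. one left behind by an
         * inconsistent base backup -- is truncated to 0 blocks.  By the time
         * new_segno is full enough for mdnblocks() to look past it, the next
         * segment is guaranteed to be empty, so old contents can't resurface.
         */
        truncate_segment_if_exists(reln, forknum, new_segno + 1);

        /* ... then extend into segment new_segno as we do today ... */
    }

The nice part is that the cost would be confined to the rare
segment-boundary crossing, rather than adding work to every mdnblocks()
call or to startup.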

Greetings,

Andres Freund


