Re: base backup vs. concurrent truncation - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: base backup vs. concurrent truncation
Msg-id: 20230425000340.uq2c43diuyfpr4au@awork3.anarazel.de
In response to: base backup vs. concurrent truncation (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
Hi,

On 2023-04-21 09:42:57 -0400, Robert Haas wrote:
> Apologies if this has already been discussed someplace, but I couldn't
> find a previous discussion. It seems to me that base backups are
> broken in the face of a concurrent truncation that reduces the number
> of segments in a relation.

I think
https://www.postgresql.org/message-id/20230223010147.32oir7sb66slqnjk@awork3.anarazel.de
and the commits + discussions referenced from there are relevant for the
topic.

> Suppose we have a relation that is 1.5GB in size, so that we have two
> files 23456, which is 1GB, and 23456.1, which is 0.5GB. We'll back
> those files up in whichever order the directory scan finds them.
> Suppose we back up 23456.1 first. Then the relation is truncated to
> 0.5GB, so 23456.1 is removed and 23456 gets a lot shorter. Next we
> back up the file 23456. Now our backup contains files 23456 and
> 23456.1, each 0.5GB. But this breaks the invariant in md.c:
>
> * On disk, a relation must consist of consecutively numbered segment
> * files in the pattern
> *	-- Zero or more full segments of exactly RELSEG_SIZE blocks each
> *	-- Exactly one partial segment of size 0 <= size < RELSEG_SIZE blocks
> *	-- Optionally, any number of inactive segments of size 0 blocks.
>
> basebackup.c's theory about relation truncation is that it doesn't
> really matter because WAL replay will fix things up. But in this case,
> I don't think it will, because WAL replay relies on the above
> invariant holding. As mdnblocks says:
>
> 	/*
> 	 * If segment is exactly RELSEG_SIZE, advance to next one.
> 	 */
> 	segno++;
>
> So I think what's going to happen is we're not going to notice 23456.1
> when we recover the backup. It will just sit there as an orphaned file
> forever, unless we extend 23456 back to a full 1GB, at which point we
> might abruptly start considering that file part of the relation again.

One important point is that I think 23456.1 at that point can contain data.
Which can lead to deleted rows reappearing, vacuum failing due to rows
from before relfrozenxid existing, etc.

> Assuming I'm not wrong about all of this, the question arises: whose
> fault is this, and what to do about it? It seems to me that it's a bit
> hard to blame basebackup.c, because if you used pg_backup_start() and
> pg_backup_stop() and copied the directory yourself, you'd have exactly
> the same situation, and while we could (and perhaps should) teach
> basebackup.c to do something smarter, it doesn't seem realistic to
> impose complex constraints on the user's choice of file copy tool.

Agreed.

> So I think the problem is with md.c assuming that its invariant must
> hold on a cluster that's not guaranteed to be in a consistent state.
> But mdnblocks() clearly can't try to open every segment up to whatever
> the maximum theoretical possible segment number is every time it's
> invoked, because that would be wicked expensive. An idea that occurs
> to me is to remove all segment files following the first partial
> segment during startup, before we begin WAL replay. If that state
> occurs at startup, then either we have a scenario involving
> truncation, like those above, or a scenario involving relation
> extension, where we added a new segment and that made it to disk but
> the prior extension of the previous last segment file to maximum
> length did not. But in that case, WAL replay should, I think, fix
> things up. However, I'm not completely sure that there isn't some hole
> in this theory, and this way forward also doesn't sound particularly
> cheap. Nonetheless I don't have another idea right now.

What we've discussed somewhere in the past is to always truncate N+1
when creating the first page in N. I.e. if we extend into 23456.1, we
truncate 23456.2 to 0 blocks. As far as I can tell, that'd solve this
issue?

Greetings,

Andres Freund