Re: base backup vs. concurrent truncation - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: base backup vs. concurrent truncation
Msg-id: 20230508204109.yxatpsdsx3k2dj5m@awork3.anarazel.de
In response to: Re: base backup vs. concurrent truncation (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
Hi,

On 2023-04-25 10:28:58 -0700, Andres Freund wrote:
> On 2023-04-25 11:42:43 -0400, Robert Haas wrote:
> > On Mon, Apr 24, 2023 at 8:03 PM Andres Freund <andres@anarazel.de> wrote:
> > > What we've discussed somewhere in the past is to always truncate N+1 when
> > > creating the first page in N. I.e. if we extend into 23456.1, we truncate
> > > 23456.2 to 0 blocks. As far as I can tell, that'd solve this issue?
> >
> > Yeah, although leaving 23456.2 forever unless and until that happens
> > doesn't sound amazing.
>
> It isn't - but the alternatives aren't great either. It's not that easy to hit
> this scenario, so I think something along these lines is more palatable than
> adding a pass through the entire data directory.
>
> I think eventually we'll have to make the WAL logging bulletproof enough
> against this to avoid the risk of it.

I think that is possible. I think we should extend my proposal above with
improved WAL logging.

Right now the replay of XLOG_SMGR_TRUNCATE records uses the normal
smgrtruncate() path - which essentially relies on smgrnblocks() to determine
the relation size, which in turn iterates over the segments until it finds one
< SEGSIZE. That's fine during normal running, where we are consistent. But
it's bogus when we're not consistent - in the case of a concurrent truncation,
the base backup might have missed intermediate segments while copying later
segments.

We should fix this by including not just the "target" length in the WAL
record, but also the "current" length. Then during WAL replay of such records
we'd not look at the files currently present, we'd just stupidly truncate all
the segments mentioned in the range.

I think we ought to do the same for mdunlinkfork().

> I suspect we would need to prevent checkpoints from happening in the wrong
> moment, if we were to go down that route.
>
> I guess that eventually we'll need to polish the infrastructure for
> determining restartpoints, so that delayChkptFlags doesn't actually prevent
> checkpoints, just moves the restart to a safe LSN. Although I guess that
> truncations aren't frequent enough (compared to transaction commits) for
> that to be required "now".

Unfortunately the current approach of smgrtruncate records is quite racy with
checkpoints. You can end up with a sequence of something like

1) SMGRTRUNCATE record
2) CHECKPOINT;
3) Truncating of segment files

If we crash anywhere in 3), we don't replay the SMGRTRUNCATE record. It's not
a great solution, but I suspect we'll just have to set delayChkptFlags during
truncations to prevent this.

I also think that we need to fix the order in which mdunlink() operates. It
seems *quite* bogus that we unlink in "forward" segment order, rather than in
reverse order (like mdtruncate()). If we crash after unlinking segment.N,
we'll not redo unlinking the later segments after the crash, nor in WAL
replay. While the improved WAL logging would address this as well, it still
seems pointlessly risky to iterate forward rather than backward.

Even with those changes, I think we might still need something like the
"always truncate the next segment" bit I described in my prior email, though.

Greetings,

Andres Freund
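
As an illustration of the "log both lengths" idea above, here is a minimal
standalone sketch; it is not code from this thread or from PostgreSQL. It
assumes the default RELSEG_SIZE/BLCKSZ segmentation and uses plain POSIX
truncate() where the real change would live in md.c. A replay routine that is
handed both the logged "current" and "target" lengths can visit every implied
segment, highest segment number first, without asking smgrnblocks() which
files an inconsistent base backup happened to copy.

/*
 * Sketch only: replay a truncate record carrying the relation's length at
 * logging time (cur_blocks) and the requested length (target_blocks).
 * Identifiers and the direct filesystem calls are illustrative.
 */
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define RELSEG_SIZE 131072u             /* blocks per segment file (default) */
#define BLCKSZ      8192u               /* bytes per block (default) */

static void
redo_truncate_sketch(const char *relpath, uint32_t cur_blocks,
                     uint32_t target_blocks)
{
    uint32_t    last_seg = cur_blocks ? (cur_blocks - 1) / RELSEG_SIZE : 0;
    uint32_t    keep_seg = target_blocks ? (target_blocks - 1) / RELSEG_SIZE : 0;
    uint64_t    keep_blocks = (uint64_t) target_blocks -
        (uint64_t) keep_seg * RELSEG_SIZE;
    char        path[1024];

    /* Work backwards, so a crash part-way never strands a later segment. */
    for (uint32_t seg = last_seg; seg > keep_seg; seg--)
    {
        snprintf(path, sizeof(path), "%s.%u", relpath, (unsigned) seg);

        /*
         * This segment lies entirely past the new end: shrink it to zero
         * length.  A missing file - a "hole" left by an inconsistent base
         * backup - is simply skipped rather than ending the loop early.
         */
        if (truncate(path, 0) != 0 && errno != ENOENT)
            perror(path);
    }

    /* The surviving last segment keeps only the remaining blocks. */
    if (keep_seg == 0)
        snprintf(path, sizeof(path), "%s", relpath);
    else
        snprintf(path, sizeof(path), "%s.%u", relpath, (unsigned) keep_seg);

    if (truncate(path, (off_t) (keep_blocks * BLCKSZ)) != 0 && errno != ENOENT)
        perror(path);
}

The backwards iteration is the same point raised about mdunlink() above:
shrinking or removing the highest-numbered segment first means a crash can
never leave segment N gone while segment N+1 still exists.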
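
For the checkpoint race, the pattern the message alludes to with
delayChkptFlags already exists for commit records: set the flag before writing
the WAL record, clear it only once the corresponding on-disk change has been
made. A sketch of applying it to truncation could look like the following;
which DELAY_CHKPT_* flag (or combination) the truncation path actually needs
is an assumption here, not something settled in the thread.

/*
 * Sketch only, not code from the thread or from storage.c: keep
 * delayChkptFlags set across the whole window between logging the truncate
 * record and shrinking the files.  The choice of DELAY_CHKPT_START is an
 * assumption for illustration.
 */
#include "postgres.h"
#include "storage/proc.h"

static void
truncate_with_checkpoint_guard_sketch(void)
{
    MyProc->delayChkptFlags |= DELAY_CHKPT_START;

    /* 1) insert the XLOG_SMGR_TRUNCATE record (carrying both lengths) */
    /* 2) flush it with XLogFlush() */
    /* 3) shrink the segment files, highest segment first */

    MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
}

The intent is that no checkpoint can complete with a redo pointer past the
SMGRTRUNCATE record while step 3) is still pending, so a crash either replays
the record or finds the files already truncated.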