Re: [Patch] Optimize dropping of relation buffers using dlist - Mailing list pgsql-hackers
From: Thomas Munro
Subject: Re: [Patch] Optimize dropping of relation buffers using dlist
Date:
Msg-id: CA+hUKGKHu6C2iShHavhNcX=Vy=0h_jmuoi8=Eo9CJtEA9AydhA@mail.gmail.com
In response to: Re: [Patch] Optimize dropping of relation buffers using dlist (Amit Kapila <amit.kapila16@gmail.com>)
Responses: Re: [Patch] Optimize dropping of relation buffers using dlist
List: pgsql-hackers
On Fri, Nov 6, 2020 at 5:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Nov 6, 2020 at 5:02 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Here's a devil's advocate position I thought about: It's OK to leave
> > stray buffers (clean or dirty) in the buffer pool if files are
> > truncated underneath us by gremlins, as long as your system
> > eventually crashes before completing a checkpoint. The OID can't be
> > recycled until after a successful checkpoint, so the stray blocks
> > can't be confused with the blocks of another relation, and weird
> > errors are expected on a system that is in serious trouble. It's
> > actually much worse that we can give incorrect answers to queries
> > when files are truncated by gremlins (in the window of time before
> > we presumably crash because of EIO), because we're violating basic
> > ACID principles in user-visible ways. In this thread, discussion has
> > focused on availability (ie avoiding failures when trying to write
> > back stray buffers to a file that has been unlinked), but really a
> > system that can't see arbitrary committed transactions *shouldn't be
> > available*. This argument applies whether you think SEEK_END can
> > only give weird answers in the specific scenario I demonstrated with
> > NFS, or whether you think it's arbitrarily b0rked and reports random
> > numbers: we fundamentally can't tolerate that, so why are we trying
> > to?
>
> It is not very clear to me how this argument applies to the patch
> in-discussion where we are relying on the cached value of blocks
> during recovery. I understand your point that we might skip scanning
> the pages and thus might not show some recently added data but that
> point is not linked with what we are trying to do with this patch.

It's an argument for giving up the hard-to-name cache trick completely and going back to using unmodified smgrnblocks(), both in recovery and online. If the only mechanism for unexpected file shrinkage is writeback failure, then your system will be panicking soon enough anyway -- so is it really that bad if there are potentially some other weird errors logged some time before that? Maybe those errors will even take the system down sooner, and maybe that's appropriate? If there are other mechanisms for random file shrinkage that don't imply a panic in your near future, then we have bigger problems that can't be solved by any number of bandaids, at least not without understanding the details of this hypothetical unknown failure mode.

The main argument I can think of against the idea of using plain old smgrnblocks() is that the current error messages on smgrwrite() failure for stray blocks would be indistinguishable from cases where an external actor unlinked the file. I don't mind getting an error that prevents checkpointing -- your system is in big trouble! -- but it'd be nice to be able to detect that *we* unlinked the file, implying the filesystem and buffer pool are out of sync, and spit out a special diagnostic message. I suppose if it's the checkpointer doing the writing, it could check whether the relfilenode is on the queued-up-for-delete-after-the-checkpoint list, and if so, it could produce a different error message just for this edge case. Unfortunately that's not a general solution, because any backend might try to write a buffer out, and backends aren't synchronised with checkpoints.

I'm not sure what the best approach is. It'd certainly be nice to be able to drop small tables quickly online too, as a benefit of this approach.
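For concreteness, "plain old smgrnblocks()" ultimately boils down to trusting the kernel's report of the file's EOF, which is exactly what the NFS scenario upthread calls into question. Here's a standalone toy (not PostgreSQL source; BLCKSZ assumed to be the default 8192) that performs essentially the same size probe md.c does on a segment file:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BLCKSZ 8192     /* assuming the default block size */

/*
 * The same kind of size probe md.c does for a single segment: ask the
 * filesystem where EOF is, then convert to blocks.  Returns -1 on error.
 */
static long
nblocks_via_seek_end(const char *path)
{
    int     fd = open(path, O_RDONLY);
    off_t   end;

    if (fd < 0)
        return -1;
    end = lseek(fd, 0, SEEK_END);   /* trust the kernel's idea of EOF */
    close(fd);
    if (end < 0)
        return -1;
    return (long) (end / BLCKSZ);
}

int
main(int argc, char **argv)
{
    if (argc != 2)
    {
        fprintf(stderr, "usage: %s <segment file>\n", argv[0]);
        return 1;
    }
    printf("%ld blocks\n", nblocks_via_seek_end(argv[1]));
    return 0;
}

Point it at a single-segment relation file under $PGDATA/base and you get the same number smgrnblocks() would compute; the whole debate is about whether that lseek() answer can ever lie to us.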
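And to make the checkpointer error-message idea concrete, here's a rough sketch of the kind of check I mean. IsRelFileNodePendingUnlink() is a hypothetical lookup into the queued-up-for-delete-after-the-checkpoint list -- no such function exists today -- and the message texts are placeholders:

/*
 * Sketch only: imagine the ENOENT path of the checkpointer's buffer
 * write, distinguishing "we already queued this file for unlink" from
 * "someone else removed it".  IsRelFileNodePendingUnlink() is a
 * hypothetical helper.
 */
if (errno == ENOENT)
{
    if (IsRelFileNodePendingUnlink(reln->smgr_rnode.node))
        ereport(ERROR,
                (errmsg("stray buffer found for relation whose file is queued for unlink"),
                 errdetail("The buffer pool and the filesystem are out of sync.")));
    else
        ereport(ERROR,
                (errmsg("could not write block: relation file has disappeared"),
                 errdetail("The file may have been removed by an external actor.")));
}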