From: Amit Kapila
Subject: Re: [Patch] Optimize dropping of relation buffers using dlist
Msg-id: CAA4eK1KRAsnfN1K+TiANtE6WvjaOHRbvL5KctvM4GKkmL4+zyw@mail.gmail.com
In response to: Re: [Patch] Optimize dropping of relation buffers using dlist (Thomas Munro <thomas.munro@gmail.com>)
Responses: Re: [Patch] Optimize dropping of relation buffers using dlist (Thomas Munro <thomas.munro@gmail.com>)
List: pgsql-hackers

On Fri, Nov 6, 2020 at 11:10 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Fri, Nov 6, 2020 at 5:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > It is not very clear to me how this argument applies to the patch
> > in-discussion where we are relying on the cached value of blocks
> > during recovery. I understand your point that we might skip scanning
> > the pages and thus might not show some recently added data but that
> > point is not linked with what we are trying to do with this patch.
>
> It's an argument for giving up the hard-to-name cache trick completely
> and going back to using unmodified smgrnblocks(), both in recovery and
> online.  If the only mechanism for unexpected file shrinkage is
> writeback failure, then your system will be panicking soon enough
> anyway
>

How else (except for the writeback failure behind the unexpected
shrinkage) will the system panic? Are you saying that if users don't
get some data because lseek lied to us, then that is equivalent to a
panic, or are you referring to the scenario where ReadBuffer_common
gives the error "unexpected data beyond EOF ...."?

> -- so is it really that bad if there are potentially some other
> weird errors logged some time before that?  Maybe those errors will
> even take the system down sooner, and maybe that's appropriate?
>

Yeah, it might be appropriate to panic in such situations, but
ReadBuffer_common gives an error and asks the user to update the system.
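
Just for context, the shape of that check is roughly the following
standalone sketch. This is a paraphrase, not the actual bufmgr.c code;
page_is_new() and check_extension_target() are made-up names for
illustration only.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define BLCKSZ 8192

/* Hypothetical helper: a block that is really new must be all zeroes. */
static bool
page_is_new(const char *page)
{
    for (int i = 0; i < BLCKSZ; i++)
        if (page[i] != 0)
            return false;
    return true;
}

/*
 * If a block we believed to be past EOF (per smgrnblocks()/lseek) turns
 * out to contain data, the size information lied to us; report an error
 * with the "update your system" hint rather than panicking.
 */
static void
check_extension_target(const char *page, unsigned blkno)
{
    if (!page_is_new(page))
    {
        fprintf(stderr,
                "ERROR:  unexpected data beyond EOF in block %u\n"
                "HINT:  This has been seen to occur with buggy kernels; "
                "consider updating your system.\n",
                blkno);
        exit(EXIT_FAILURE);
    }
}

int
main(void)
{
    char        page[BLCKSZ] = {0}; /* pretend we read back the "new" block */

    page[0] = 1;                    /* simulate stale data past believed EOF */
    check_extension_target(page, 42);
    return EXIT_SUCCESS;
}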


>  If
> there are other mechanisms for random file shrinkage that don't imply
> a panic in your near future, then we have bigger problems that can't
> be solved by any number of bandaids, at least not without
> understanding the details of this hypothetical unknown failure mode.
>

I think one of the problems is returning fewer rows, and that too
without any warning or error, so maybe that is the bigger problem; but
we seem to be okay with it, as it is already a known thing, though I
think it is not documented anywhere.
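
To make that failure mode concrete, here is a minimal standalone sketch
(the fake_* helpers are hypothetical stand-ins, not any real API) of how
a scan bounded by a lying size silently returns fewer rows:

#include <stdio.h>

typedef unsigned int BlockNumber;

/* Hypothetical stand-ins: a size function that under-reports, and a
 * per-page row counter. */
static BlockNumber
fake_relation_size(void)
{
    return 90;                  /* lies: the file really has 100 blocks */
}

static int
fake_rows_on_page(BlockNumber blkno)
{
    (void) blkno;
    return 10;
}

int
main(void)
{
    BlockNumber nblocks = fake_relation_size();     /* taken at scan start */
    long        rows = 0;

    /* Blocks 90..99 are never visited: no error, the rows are just missing. */
    for (BlockNumber blkno = 0; blkno < nblocks; blkno++)
        rows += fake_rows_on_page(blkno);

    printf("scan returned %ld rows (expected 1000)\n", rows);
    return 0;
}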

> The main argument I can think of against the idea of using plain old
> smgrnblocks() is that the current error messages on smgrwrite()
> failure for stray blocks would be indistinguishible from cases where
> an external actor unlinked the file.  I don't mind getting an error
> that prevents checkpointing -- your system is in big trouble! -- but
> it'd be nice to be able to detect that *we* unlinked the file,
> implying the filesystem and bufferpool are out of sync, and spit out a
> special diagnostic message.  I suppose if it's the checkpointer doing
> the writing, it could check if the relfilenode is on the
> queued-up-for-delete-after-the-checkpoint list, and if so, it could
> produce a different error message just for this edge case.
> Unfortunately that's not a general solution, because any backend might
> try to write a buffer out and they aren't synchronised with
> checkpoints.
>

Yeah, but I am not sure we can consider manual tinkering with the files
by an external actor the same as a failure caused by the database
server relying on the wrong information.
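
That said, if someone wanted to pursue the idea quoted above, I imagine
it would look roughly like the sketch below. All types and names here
are made up for illustration; this is not the real checkpointer/smgr
code.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Simplified file identifier for illustration (not the real RelFileNode). */
typedef struct
{
    unsigned    spcnode;
    unsigned    dbnode;
    unsigned    relnode;
} FileId;

/* Stand-in for the queued-up-for-delete-after-the-checkpoint list. */
static const FileId pending_unlinks[] = {{1663, 13442, 16384}};

static bool
is_queued_for_unlink(FileId f)
{
    for (size_t i = 0; i < sizeof(pending_unlinks) / sizeof(pending_unlinks[0]); i++)
        if (pending_unlinks[i].spcnode == f.spcnode &&
            pending_unlinks[i].dbnode == f.dbnode &&
            pending_unlinks[i].relnode == f.relnode)
            return true;
    return false;
}

/*
 * On a failed write of a stray buffer, distinguish "we unlinked this file
 * ourselves, so the buffer pool and filesystem are out of sync" from
 * "someone outside the server removed the file".
 */
static void
report_write_failure(FileId f, unsigned blkno)
{
    if (is_queued_for_unlink(f))
        fprintf(stderr,
                "ERROR: could not write block %u of %u/%u/%u: file already "
                "unlinked by this server; buffer pool is out of sync\n",
                blkno, f.spcnode, f.dbnode, f.relnode);
    else
        fprintf(stderr,
                "ERROR: could not write block %u of %u/%u/%u: file is missing\n",
                blkno, f.spcnode, f.dbnode, f.relnode);
}

int
main(void)
{
    FileId      ours = {1663, 13442, 16384};

    report_write_failure(ours, 7);
    return 0;
}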

> I'm not sure what the best approach is.  It'd certainly be nice to be
> able to drop small tables quickly online too, as a benefit of this
> approach.

Right, that is why I was thinking of doing it only during recovery,
where it is safe from the database server's perspective. OTOH, we could
broadly accept that any time the filesystem lies to us the behavior is
unpredictable: the system can return fewer rows than expected, or it
can cause a panic. I think there is an argument that it might be better
to error out (even with a panic) rather than silently return fewer
rows, but unfortunately detecting that in each and every case doesn't
seem feasible.
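
For reference, the approach the patch relies on boils down to something
like the following simplified, self-contained sketch. The toy buffer
pool, buffer_table_lookup(), and the threshold value are hypothetical
stand-ins for the real bufmgr internals, not the patch itself.

#include <stdbool.h>
#include <stdio.h>

typedef unsigned int BlockNumber;

/* Toy buffer pool: relnode 0 means the slot caches nothing. */
#define NBUFFERS 16
static struct
{
    int         relnode;
    BlockNumber blkno;
} buffers[NBUFFERS] = {{42, 0}, {42, 3}, {7, 1}};

#define NBLOCKS_THRESHOLD 32    /* only probe block-by-block for small relations */

/* Stand-in for the buffer mapping table: one O(1) hash probe in real life. */
static int
buffer_table_lookup(int relnode, BlockNumber blkno)
{
    for (int i = 0; i < NBUFFERS; i++)
        if (buffers[i].relnode == relnode && buffers[i].blkno == blkno)
            return i;
    return -1;
}

static void
invalidate_buffer(int buf_id)
{
    buffers[buf_id].relnode = 0;
    printf("invalidated buffer %d\n", buf_id);
}

/* The existing behavior: visit every buffer header, O(shared_buffers). */
static void
full_buffer_pool_scan(int relnode)
{
    for (int i = 0; i < NBUFFERS; i++)
        if (buffers[i].relnode == relnode)
            invalidate_buffer(i);
}

/*
 * Drop a relation's buffers.  When the cached relation size can be trusted
 * (e.g. during recovery, as discussed for the patch) and the relation is
 * small, probe only the blocks that can exist instead of scanning the
 * whole buffer pool.
 */
static void
drop_relation_buffers(int relnode, BlockNumber cached_nblocks,
                      bool size_is_trustworthy)
{
    if (!size_is_trustworthy || cached_nblocks > NBLOCKS_THRESHOLD)
    {
        full_buffer_pool_scan(relnode);
        return;
    }

    for (BlockNumber blkno = 0; blkno < cached_nblocks; blkno++)
    {
        int         buf_id = buffer_table_lookup(relnode, blkno);

        if (buf_id >= 0)
            invalidate_buffer(buf_id);
    }
}

int
main(void)
{
    drop_relation_buffers(42, 4, true); /* probes blocks 0..3 only */
    return 0;
}

The whole win depends on whether cached_nblocks can be trusted, which is
why I keep coming back to restricting this to recovery.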

One vague idea could be to develop a pg_test_seek tool that can detect
such problems, but I am not sure we can rely on such a tool to always
give us the right answer. Were you able to consistently reproduce the
lseek problem on the system where you tried?
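
The sort of check I have in mind for such a tool would be roughly the
standalone sketch below (the tool doesn't exist; the file name, block
count, and pass/fail logic are all just for illustration):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ   8192
#define NBLOCKS  128

int
main(void)
{
    char        block[BLCKSZ];
    off_t       expected = (off_t) BLCKSZ * NBLOCKS;
    int         fd = open("pg_test_seek.tmp", O_CREAT | O_TRUNC | O_RDWR, 0600);

    if (fd < 0)
    {
        perror("open");
        return EXIT_FAILURE;
    }

    /* Append a known number of blocks, then ask lseek how big the file is. */
    memset(block, 0x7f, sizeof(block));
    for (int i = 0; i < NBLOCKS; i++)
    {
        if (write(fd, block, BLCKSZ) != BLCKSZ)
        {
            perror("write");
            return EXIT_FAILURE;
        }
    }

    if (fsync(fd) != 0)
    {
        perror("fsync");
        return EXIT_FAILURE;
    }

    off_t       reported = lseek(fd, 0, SEEK_END);

    printf("wrote %lld bytes, lseek(SEEK_END) reports %lld\n",
           (long long) expected, (long long) reported);
    close(fd);
    unlink("pg_test_seek.tmp");
    return (reported == expected) ? EXIT_SUCCESS : EXIT_FAILURE;
}

Whether a single clean run of something like this proves anything is
exactly my doubt about relying on such a tool.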

-- 
With Regards,
Amit Kapila.


