On 2014-01-18 08:35:47 -0500, Robert Haas wrote:
> > I am not sure I understand that point. We can either update the
> > in-memory bit before performing the on-disk operations or
> > afterwards. Either way, there's a way to be inconsistent if the disk
> > operation fails somewhere inbetween (it might fail but still have
> > deleted the file/directory!). The normal way to handle that in other
> > places is PANICing when we don't know so we recover from the on-disk
> > state.
> > I really don't see the problem here? Code doesn't get more robust by
> > doing s/PANIC/ERROR/, rather the contrary. It takes extra smarts to only
> > ERROR, often that's not warranted.
>
> People get cranky when the database PANICs because of a filesystem
> failure. We should avoid that, especially when it's trivial to do so.
> The update to shared memory should be done second and should be set
> up to be no-fail.
I don't see how that would help. If we fail during unlink/rmdir, we
don't really know at which point we failed. Just keeping the slot in
memory, won't help us in any way - we'll continue to reserve resources
while the slot is half-gone.
I don't think trying to handle errors we don't understand and we don't
routinely expect actually improves robustness. It just leads to harder
to diagnose errors. It's not like the cases here are likely to be caused
by anthything but severe admin failure like removing the write
permissions of the postgres directory while the server is running. Or do
you see more valid causes?
Greetings,
Andres Freund
-- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services