Re: BUG #10533: 9.4 beta1 assertion failure in autovacuum process - Mailing list pgsql-bugs

From Andres Freund
Subject Re: BUG #10533: 9.4 beta1 assertion failure in autovacuum process
Date
Msg-id 20140606231149.GA24880@awork2.anarazel.de
In response to Re: BUG #10533: 9.4 beta1 assertion failure in autovacuum process  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
On 2014-06-06 18:21:45 -0400, Tom Lane wrote:
> Also, there are a bunch of fsync_fname() calls inside critical sections in
> replication/slot.c.  Seems at best pretty damn risky; what's more, the
> critical sections cover only the fsyncs and not anything else, which is
> flat out broken.  If it was okay to fail just before calling the fsync,
> why is it critical to not fail inside it?  Somebody was not thinking
> clearly there.

No, it actually makes sense. If:
* the open, write or fsync to the temp file fails: no permanent state
  has changed. We can gracefully error out.
* rename(tmpfile, realname) fails: we know (by POSIX) that the file
  hasn't been renamed. The old state is still valid.
* the fsync() of the new file fails (damn unlikely): we don't know
  which state is valid. So if we crashed at that moment we might lose
  our reservation on resources (e.g. catalog xmin), and might start to
  decode with the wrong catalog state. Bad. On startup we'll try to
  fsync the slot files again, so we won't start up until that's clear.
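
To make the three cases above concrete, here is a rough sketch of the
sequence in plain POSIX C. It is not the code in replication/slot.c: the
function name, the result enum and the paths are made up for
illustration, and a real implementation would also have to fsync the
containing directory and handle short writes.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical result codes for this sketch; not a PostgreSQL API. */
typedef enum
{
    SAVE_OK = 0,
    SAVE_RECOVERABLE,   /* old state is still valid, caller can just error out */
    SAVE_AMBIGUOUS      /* renamed, but durability unknown: must not carry on */
} save_result;

save_result
save_state_atomically(const char *tmppath, const char *path,
                      const void *buf, size_t len)
{
    int fd;

    /* Case 1: open/write/fsync of the temp file. Nothing permanent changed. */
    fd = open(tmppath, O_CREAT | O_TRUNC | O_WRONLY, 0600);
    if (fd < 0)
        return SAVE_RECOVERABLE;
    /* For brevity a short write is treated as a failure. */
    if (write(fd, buf, len) != (ssize_t) len || fsync(fd) != 0)
    {
        close(fd);
        unlink(tmppath);
        return SAVE_RECOVERABLE;
    }
    if (close(fd) != 0)
    {
        unlink(tmppath);
        return SAVE_RECOVERABLE;
    }

    /* Case 2: rename() failed, so the old file is still in place. */
    if (rename(tmppath, path) != 0)
    {
        unlink(tmppath);
        return SAVE_RECOVERABLE;
    }

    /*
     * Case 3: the rename happened, but we don't know yet whether it would
     * survive a crash. This is the only part that corresponds to the
     * critical section: a failure here cannot be reported as an ordinary
     * error, because we can't tell which state is on disk.
     */
    fd = open(path, O_RDWR);
    if (fd < 0)
        return SAVE_AMBIGUOUS;
    if (fsync(fd) != 0)
    {
        close(fd);
        return SAVE_AMBIGUOUS;
    }
    close(fd);
    return SAVE_OK;
}
```

A caller would treat SAVE_RECOVERABLE as an ordinary error and
SAVE_AMBIGUOUS as a reason to stop (in the backend, that is what failing
inside a critical section amounts to).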

Why is it that risky? We fdatasync() files while inside a critical
section all the time. And we've done the space allocation (the fsync on
the old filename) and the rename() outside the critical section.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
