Thread: Fsync-before-close thought experiment

Fsync-before-close thought experiment

From: Thomas Munro
Hello,

We corrected our previously held belief that it's safe to retry
fsync().  We still haven't dealt with the possibility that some
kernels can forget about write-back errors due to inode cache
pressure, if there is a time when no process has the file open.  To
prevent that we need to hold dirty files open, somehow, until fsync()
is called.

The leading idea so far is a scheme to keep the fd that performed the
oldest write around by passing it through a Unix domain socket to the
checkpointer.  Among the problems with that: (1) there's an as-yet
unresolved deadlock problem, but let's call that just a SMoP (a simple
matter of programming), (2) it
relies on atomic Unix socket messages without any guarantee from the
OS, (3) it relies on being able to keep fds in the Unix domain socket
(not counted by PostgreSQL), admittedly with socket buffer size as
back pressure, (4) it relies on the somewhat obscure and badly
documented fd-passing support to work the way we think it works and
without bugs on every OS, (5) it's too complicated to back-patch.  In
this thread I'd like to discuss a simpler alternative.

A more obvious approach that probably moves us closer to the way
kernel developers expect us to write programs is to call fsync()
before close() (due to vfd pressure) if you've written.  The obvious
problem with that is that you could finish up doing loads more
fsyncing than we're doing today if you're regularly dirtying more than
max_files_per_process in the same backend.  I wonder if we could make
it more acceptable like so:

* when performing writes, record the checkpointer's cycle counter in the File

* when closing due to vfd pressure, only call fsync() if the cycle
hasn't advanced (that is, you get to skip the fsync() if you haven't
written in this sync cycle, because the checkpointer must have taken
care of your writes from the previous cycle)

* if you find you're doing this too much (by default, dirtying more
than 1k relation segments per checkpoint cycle), maybe log a warning
that the user might want to consider increasing max_files_per_process
(and related OS limits)
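
To make the close path concrete, here's a rough standalone sketch of
the first two bullets (none of these names exist in fd.c;
current_sync_cycle stands in for whatever shared counter the
checkpointer would publish):

#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

/* Made-up stand-ins for fd.c's Vfd and the checkpointer's cycle counter. */
typedef struct SketchVfd
{
    int         fd;             /* kernel file descriptor */
    bool        dirty;          /* has this backend written through it? */
    uint64_t    write_cycle;    /* sync cycle observed at the last write */
} SketchVfd;

static uint64_t current_sync_cycle;     /* advanced by the checkpointer */

/* Call on every write through the vfd. */
static void
sketch_record_write(SketchVfd *vfd)
{
    vfd->dirty = true;
    vfd->write_cycle = current_sync_cycle;
}

/*
 * Call when evicting a vfd due to max_files_per_process pressure.  Only
 * fsync if the last write happened in the current cycle; otherwise the
 * checkpointer has already taken care of it.
 */
static int
sketch_evict_and_close(SketchVfd *vfd)
{
    if (vfd->dirty && vfd->write_cycle == current_sync_cycle)
    {
        if (fsync(vfd->fd) != 0)
            return -1;          /* caller would PANIC if data_sync_retry=off */
    }
    return close(vfd->fd);
}

The point being that an evicted vfd only costs a foreground fsync() if
it was written to since the checkpointer last emptied its pending-ops
table.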

A possible improvement, stolen from the fd-passing patch, is to
introduce a "sync cycle", separate from checkpoint cycle, so that the
checkpointer can process its table of pending operations more than
once per checkpoint if that would help minimise fsyncing in foreground
processes.  I haven't thought much about how exactly that would work.

There is a less obvious problem with this scheme though:

1.  Suppose the operating system has a single error flag for a file no
matter how many fds have it open, and it is cleared by the first
syscall to return EIO.  There is a broken schedule like this (B =
regular backend, C = checkpointer):

B: fsync() -> EIO # clears error flag
C: fsync() -> success, log checkpoint
B: PANIC!

2.  On an operating system that has an error counter + seen flag
(Linux 4.14+), in general you receive the error in all fds, which is a
very nice property, but we'd still have a broken schedule involving
the seen flag:

B: fsync() -> EIO # clears seen flag
C: open() # opened after error
C: fsync() -> success, log checkpoint
B: PANIC!

Here's one kind of interlocking that might work:  Hash pathnames and
map to an array of lwlocks + error flags.  Any process trying to sync
a file must hold the lock and check for a pre-existing error flag.
Now a checkpoint cannot succeed if any backend has recently decided to
panic.  You could skip that if data_sync_retry = on.
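
To sketch the shape of that interlock in standalone C (pthread mutexes
standing in for lwlocks, and every name here is invented):

#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

#define SYNC_SLOTS 128              /* arbitrary number of partitions */

/* One slot per hash bucket: a lock plus a sticky error flag. */
typedef struct SyncSlot
{
    pthread_mutex_t lock;
    bool            error_seen;     /* never cleared; a crash resets shmem */
} SyncSlot;

static SyncSlot sync_slots[SYNC_SLOTS];

static void
sync_slots_init(void)
{
    for (int i = 0; i < SYNC_SLOTS; i++)
    {
        pthread_mutex_init(&sync_slots[i].lock, NULL);
        sync_slots[i].error_seen = false;
    }
}

/* Trivial stand-in for hashing the relation segment's pathname. */
static unsigned
path_hash(const char *path)
{
    unsigned    h = 5381;

    while (*path)
        h = h * 33 + (unsigned char) *path++;
    return h % SYNC_SLOTS;
}

/*
 * fsync a file under the slot lock, remembering any error.  Returns false
 * if this fsync failed *or* an earlier fsync of a file in this bucket
 * failed, so the checkpointer can refuse to log the checkpoint.
 */
static bool
locked_fsync(const char *path, int fd)
{
    SyncSlot   *slot = &sync_slots[path_hash(path)];
    bool        ok;

    pthread_mutex_lock(&slot->lock);
    ok = !slot->error_seen && fsync(fd) == 0;
    if (!ok)
        slot->error_seen = true;    /* some backend is about to PANIC */
    pthread_mutex_unlock(&slot->lock);
    return ok;
}

Hash collisions only make this conservative: two unrelated files
sharing a bucket can produce a spurious checkpoint failure, never a
missed one.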

-- 
Thomas Munro
https://enterprisedb.com


Re: Fsync-before-close thought experiment

From: Robert Haas
On Sun, Mar 3, 2019 at 6:31 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> A more obvious approach that probably moves us closer to the way
> kernel developers expect us to write programs is to call fsync()
> before close() (due to vfd pressure) if you've written.

Interesting....

> The obvious
> problem with that is that you could finish up doing loads more
> fsyncing than we're doing today if you're regularly dirtying more than
> max_files_per_process in the same backend.

Check.

> I wonder if we could make
> it more acceptable like so:
>
> * when performing writes, record the checkpointer's cycle counter in the File
>
> * when closing due to vfd pressure, only call fsync() if the cycle
> hasn't advanced (that is, you get to skip the fsync() if you haven't
> written in this sync cycle, because the checkpointer must have taken
> care of your writes from the previous cycle)

Hmm, OK.

> * if you find you're doing this too much (by default, dirtying more
> than 1k relation segments per checkpoint cycle), maybe log a warning
> that the user might want to consider increasing max_files_per_process
> (and related OS limits)
>
> A possible improvement, stolen from the fd-passing patch, is to
> introduce a "sync cycle", separate from checkpoint cycle, so that the
> checkpointer can process its table of pending operations more than
> once per checkpoint if that would help minimise fsyncing in foreground
> processes.  I haven't thought much about how exactly that would work.

Yeah, that seems worth considering.  I suppose that a backend could
keep track of how many still-open Files it has stamped with the
current sync cycle -- this seems like it should be pretty simple
and cheap, provided we can find the right places to put the counter
adjustments.  If that number gets too big, like say greater than 80%
of the number of fds, it sends a ping to the checkpointer.  I'm not
sure if that would then immediately trigger a full sync cycle or if
there is something more granular we could do.
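
A rough standalone sketch of that bookkeeping, with invented names and
an arbitrary 80% threshold; the real counter adjustments would live in
fd.c's write and close paths:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Invented stand-ins; real counterparts would live in fd.c and shmem. */
static uint64_t current_sync_cycle;         /* published by the checkpointer */
static int      max_files_per_process = 1000;
static int      stamped_open_files;         /* open Files stamped this cycle */

typedef struct SketchFile
{
    int         fd;
    uint64_t    write_cycle;                /* last cycle we wrote in */
    bool        dirty;
} SketchFile;

/* Hypothetical hook: ask the checkpointer to start a sync cycle early. */
static void
ping_checkpointer(void)
{
    printf("requesting an early sync cycle\n");
}

/* Called whenever we notice the checkpointer has advanced the cycle. */
static void
observe_cycle(uint64_t new_cycle)
{
    if (new_cycle != current_sync_cycle)
    {
        current_sync_cycle = new_cycle;
        stamped_open_files = 0;     /* previous stamps no longer count */
    }
}

/* Called on each write; closing a stamped File would decrement the count. */
static void
record_write(SketchFile *f)
{
    if (!f->dirty || f->write_cycle != current_sync_cycle)
    {
        /* first write to this File in the current cycle */
        if (++stamped_open_files > max_files_per_process * 8 / 10)
            ping_checkpointer();
    }
    f->dirty = true;
    f->write_cycle = current_sync_cycle;
}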

> There is a less obvious problem with this scheme though:
>
> 1.  Suppose the operating system has a single error flag for a file no
> matter how many fds have it open, and it is cleared by the first
> syscall to return EIO.  There is a broken schedule like this (B =
> regular backend, C = checkpointer):
>
> B: fsync() -> EIO # clears error flag
> C: fsync() -> success, log checkpoint
> B: PANIC!
>
> 2.  On an operating system that has an error counter + seen flag
> (Linux 4.14+), in general you receive the error in all fds, which is a
> very nice property, but we'd still have a broken schedule involving
> the seen flag:
>
> B: fsync() -> EIO # clears seen flag
> C: open() # opened after error
> C: fsync() -> success, log checkpoint
> B: PANIC!
>
> Here's one kind of interlocking that might work:  Hash pathnames and
> map to an array of lwlocks + error flags.  Any process trying to sync
> a file must hold the lock and check for a pre-existing error flag.
> Now a checkpoint cannot succeed if any backend has recently decided to
> panic.  You could skip that if data_sync_retry = on.

That might be fine, but I think it might be possible to create
something more light-weight.  Suppose we just decide that foreground
processes always win and the checkpointer has to wait before logging
the checkpoint.  To that end, a foreground process advertises in
shared memory whether or not it is currently performing an fsync.  The
checkpointer must observe each process until it sees that process in
the not-fsyncing state at least once.

If a process starts fsync-ing after being observed not-fsyncing, it
just means that the backend started doing an fsync() after the
checkpointer had completed the fsyncs for that checkpoint.  And in
that case the checkpointer would have observed the EIO for any writes
prior to the checkpoint, so it's OK to write that checkpoint; it's
only the next one that has an issue, and the fact that we're now
advertising that we are fsync()-ing again will prevent that one from
completing before we emit any necessary PANIC.
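
A standalone sketch of that advertisement protocol, using C11 atomics
in place of PostgreSQL's shared memory and latches (all names
invented):

#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>

#define MAX_BACKENDS 64             /* arbitrary for the sketch */

/* Shared memory: one flag per backend, set while it is inside fsync(). */
static _Atomic bool backend_fsyncing[MAX_BACKENDS];

/* Backend side: wrap every foreground fsync() in the advertisement. */
static int
advertised_fsync(int my_backend_id, int fd)
{
    int         rc;

    atomic_store(&backend_fsyncing[my_backend_id], true);
    rc = fsync(fd);                 /* on EIO the backend will PANIC */
    atomic_store(&backend_fsyncing[my_backend_id], false);
    return rc;
}

/*
 * Checkpointer side: after finishing its own fsyncs and before logging
 * the checkpoint, observe each backend in the not-fsyncing state at
 * least once.  A backend that starts fsyncing after being observed can
 * only affect the *next* checkpoint, which will wait for it in turn.
 */
static void
wait_for_fsync_quiescence(void)
{
    for (int i = 0; i < MAX_BACKENDS; i++)
    {
        while (atomic_load(&backend_fsyncing[i]))
            usleep(1000);           /* a real patch would wait on a latch */
    }
}

The ordering that matters is that the checkpointer only starts
observing after its own fsyncs are done, so an EIO for pre-checkpoint
writes has either been seen by the checkpointer itself or is still
being advertised by a backend that is about to PANIC.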

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company