Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS - Mailing list pgsql-hackers

From Robert Haas
Subject Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Msg-id CA+TgmoYEY8QsUDArSLE8iDi0+mOSLV3rm62oEtbLjS0BbPyARQ@mail.gmail.com
In response to Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS  (Craig Ringer <craig@2ndquadrant.com>)
List pgsql-hackers
On Mon, Apr 9, 2018 at 8:16 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> In the mean time, I propose that we fsync() on close() before we age FDs out
> of the LRU on backends. Yes, that will hurt throughput and cause stalls, but
> we don't seem to have many better options. At least it'll only flush what we
> actually wrote to the OS buffers not what we may have in shared_buffers. If
> the bgwriter does the same thing, we should be 100% safe from this problem
> on 4.13+, and it'd be trivial to make it a GUC much like the fsync or
> full_page_writes options that people can turn off if they know the risks /
> know their storage is safe / don't care.

Ouch.  If a process exits -- say, because the user typed \q into psql
-- then you're talking about potentially calling fsync() on a really
large number of file descriptors, flushing many gigabytes of data to
disk.  And it may well be that you never actually wrote any data to
any of those file descriptors -- those writes could have come from
other backends.  Or you may have written a little bit of data through
those FDs, but there could be lots of other data that you end up
flushing incidentally.  Perfectly innocuous things like starting up a
backend, running a few short queries, and then having that backend
exit suddenly turn into something that could have a massive
system-wide performance impact.
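
To make the cost concrete, here is a minimal sketch of the "fsync() before an FD ages out of the LRU" idea. It is an illustration only, not PostgreSQL's fd.c: the names fd_cache_add, sync_and_close, MAX_CACHED_FDS and the fsyncs_on_evict counter are all hypothetical, and abort() stands in for a PANIC. Note that fsync() operates on the whole file, so the eviction path may flush dirty pages that some other process wrote.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define MAX_CACHED_FDS 4            /* hypothetical; tiny on purpose */

typedef struct CachedFd
{
    int           fd;               /* -1 when the slot is free */
    unsigned long last_used;        /* logical clock value of last use */
} CachedFd;

static CachedFd      fd_cache[MAX_CACHED_FDS];
static unsigned long use_clock = 0;
static int           fsyncs_on_evict = 0;   /* counts the extra flushes */

/* Mark every slot as free. */
void fd_cache_init(void)
{
    for (int i = 0; i < MAX_CACHED_FDS; i++)
        fd_cache[i].fd = -1;
    use_clock = 0;
    fsyncs_on_evict = 0;
}

/* Flush, then close.  Returns 0 on success, -1 if either call failed. */
int sync_and_close(int fd)
{
    int rc = 0;

    if (fsync(fd) != 0)             /* may flush other processes' writes too */
        rc = -1;
    if (close(fd) != 0)
        rc = -1;
    return rc;
}

/*
 * Cache a descriptor.  If the cache is full, evict the least recently
 * used one, paying for an fsync() first.  Returns 1 if an eviction
 * (and therefore an fsync) happened, 0 otherwise.
 */
int fd_cache_add(int fd)
{
    int slot = -1;
    int evicted = 0;

    assert(fd >= 0);

    for (int i = 0; i < MAX_CACHED_FDS; i++)
    {
        if (fd_cache[i].fd < 0)
        {
            slot = i;
            break;
        }
    }

    if (slot < 0)
    {
        slot = 0;
        for (int i = 1; i < MAX_CACHED_FDS; i++)
            if (fd_cache[i].last_used < fd_cache[slot].last_used)
                slot = i;

        if (sync_and_close(fd_cache[slot].fd) != 0)
        {
            perror("fsync/close on eviction");
            abort();                /* stand-in for PANIC */
        }
        fsyncs_on_evict++;
        evicted = 1;
    }

    fd_cache[slot].fd = fd;
    fd_cache[slot].last_used = ++use_clock;
    return evicted;
}
```

Every eviction in this sketch costs a synchronous flush, whether or not this process wrote anything through that descriptor. That is the stall being described.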

Also, if a backend ever manages to exit without running through this
code, or writes any dirty blocks afterward, then this still fails to
fix the problem completely.  I guess that's probably avoidable -- we
can put this late in the shutdown sequence and PANIC if it fails.
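
As a rough sketch of "put this late in the shutdown sequence and PANIC if it fails", assuming plain atexit() as a stand-in for PostgreSQL's own exit-callback machinery: track_fd, flush_tracked_fds, install_exit_flush and MAX_TRACKED_FDS are hypothetical names, and abort() again stands in for a PANIC. An atexit() handler does not run if the process calls _exit() or is killed by a signal, which is exactly the "exits without running through this code" hole.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define MAX_TRACKED_FDS 64          /* hypothetical limit */

static int tracked_fds[MAX_TRACKED_FDS];
static int n_tracked = 0;

/* Remember a descriptor that this process still holds open. */
void track_fd(int fd)
{
    assert(fd >= 0 && n_tracked < MAX_TRACKED_FDS);
    tracked_fds[n_tracked++] = fd;
}

/*
 * fsync() and close() every tracked descriptor.  Any failure aborts the
 * process (stand-in for PANIC) rather than letting the error be lost.
 * Returns the number of descriptors flushed.
 */
int flush_tracked_fds(void)
{
    int flushed = 0;

    for (int i = 0; i < n_tracked; i++)
    {
        if (fsync(tracked_fds[i]) != 0)
        {
            perror("fsync at exit");
            abort();
        }
        if (close(tracked_fds[i]) != 0)
        {
            perror("close at exit");
            abort();
        }
        flushed++;
    }
    n_tracked = 0;
    return flushed;
}

/* atexit() wants a void(void) callback. */
static void flush_at_exit(void)
{
    (void) flush_tracked_fds();
}

/* Call once at startup.  Returns 0 on success, like atexit(). */
int install_exit_flush(void)
{
    return atexit(flush_at_exit);
}
```

Even with the handler installed, any dirty block written after the handler runs is not covered, so the ordering within the shutdown sequence matters.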

I have a really tough time believing this is the right way to solve
the problem.  We suffered for years because of ext3's desire to flush
the entire page cache whenever any single file was fsync()'d, which
was terrible.  Eventually ext4 became the norm, and the problem went
away.  Now we're going to deliberately insert logic to do a very
similar kind of terrible thing because the kernel developers have
decided that fsync() doesn't have to do what it says on the tin?  I
grant that there doesn't seem to be a better option, but I bet we're
going to have a lot of really unhappy users if we do this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

