Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS - Mailing list pgsql-hackers

From Craig Ringer
Subject Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date
Msg-id CAMsr+YHtosoQKzHh-nAmyG75cAPTzTtwyk871d+1O-sNQRdeyg@mail.gmail.com
In response to Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS  (Thomas Munro <thomas.munro@enterprisedb.com>)
Responses Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
List pgsql-hackers
On 2 April 2018 at 02:24, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
 

> Maybe my drive-by assessment of those kernel routines is wrong and
> someone will correct me, but I'm starting to think you might be better
> to assume the worst on all systems.  Perhaps a GUC that defaults to
> panicking, so that users on those rare OSes could turn that off?  Even
> then I'm not sure if the failure mode will be that great anyway or if
> it's worth having two behaviours.  Thoughts?


I see little benefit to not just PANICing unconditionally on EIO, really. It shouldn't happen, and if it does, we want to be pretty conservative and adopt a data-protective approach.

I'm rather more worried about doing it on ENOSPC, which looks like it might be necessary, from what I recall finding in my test case and kernel code reading. I really don't want to respond to a possibly-transient ENOSPC by PANICing the whole server unnecessarily.

BTW, the support team at 2ndQ is presently working on two separate issues where ENOSPC resulted in DB corruption, though neither of them involves logs of lost page writes. I'm planning on taking some time tomorrow to write a torture tester for Pg's ENOSPC handling and to verify ENOSPC handling in the test case I linked to in my original StackOverflow post.

If this is just an EIO issue then I see no point doing anything other than PANICing unconditionally.

If it's a concern for ENOSPC too, we should try harder to fail more nicely whenever we possibly can.
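
To make the distinction concrete, here is a minimal sketch in C of the policy being discussed: treat EIO from fsync() as fatal, and try to fail more gently on ENOSPC. The enum, the function names, and the choice to treat every other errno as fatal are assumptions made for illustration; this is not PostgreSQL code. If ENOSPC turns out to drop dirty pages as well, that branch would have to panic too.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical action codes, not PostgreSQL's real error levels. */
typedef enum
{
    FSYNC_OK,               /* data reached stable storage */
    FSYNC_FAIL_GRACEFULLY,  /* report an error, keep the server up */
    FSYNC_PANIC             /* stop and recover from WAL */
} FsyncAction;

/* Map an errno value from a failed fsync() to an action. */
static FsyncAction
classify_fsync_errno(int err)
{
    switch (err)
    {
        case 0:
            return FSYNC_OK;
        case ENOSPC:
            /*
             * Possibly transient. Assumption: the kernel has not
             * discarded the dirty pages. That is exactly what still
             * needs to be verified.
             */
            return FSYNC_FAIL_GRACEFULLY;
        case EIO:
        default:
            /*
             * EIO should not happen; be conservative. Treating all
             * other errors the same way is an assumption of this sketch.
             */
            return FSYNC_PANIC;
    }
}

/* Call fsync() on fd and classify the outcome. */
static FsyncAction
fsync_with_policy(int fd)
{
    if (fsync(fd) == 0)
        return FSYNC_OK;
    return classify_fsync_errno(errno);
}
```

A caller would act on the returned value once, rather than retrying fsync() on the same file and trusting a later success.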

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
