Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS - Mailing list pgsql-hackers

From Bruce Momjian
Subject Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Msg-id 20180418115657.GC20040@momjian.us
In response to Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Tue, Apr 17, 2018 at 02:41:42PM -0700, Andres Freund wrote:
> On 2018-04-17 17:32:45 -0400, Bruce Momjian wrote:
> > On Mon, Apr  9, 2018 at 03:42:35PM +0200, Tomas Vondra wrote:
> > > That doesn't seem like a very practical way. It's better than nothing,
> > > of course, but I wonder how would that work with containers (where I
> > > think you may not have access to the kernel log at all). Also, I'm
> > > pretty sure the messages do change based on kernel version (and possibly
> > > filesystem) so parsing it reliably seems rather difficult. And we
> > > probably don't want to PANIC after I/O error on an unrelated device, so
> > > we'd need to understand which devices are related to PostgreSQL.
> 
> You can certainly have access to the kernel log in containers. I'd
> assume such a script wouldn't check various system logs but instead tail
> /dev/kmsg or such. Otherwise the variance between installations would be
> too big.

I was thinking 'dmesg', but the result is similar.
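
For reference, each record read from /dev/kmsg on Linux starts with a comma-separated header (priority, sequence number, timestamp in microseconds, flags), followed by a semicolon and the message text. Below is a minimal Python sketch of splitting one such record. The names `KmsgRecord`, `parse_kmsg_record`, and `is_error`, and the sample message, are made up for illustration.

```python
from typing import NamedTuple

KERN_ERR = 3  # syslog level "error"; lower numbers are more severe


class KmsgRecord(NamedTuple):
    facility: int
    level: int
    seq: int
    timestamp_us: int
    message: str


def parse_kmsg_record(line: str) -> KmsgRecord:
    """Split one /dev/kmsg record into its header fields and message text."""
    header, _, message = line.partition(";")
    fields = header.split(",")
    priority = int(fields[0])  # encoded as facility * 8 + level
    return KmsgRecord(
        facility=priority >> 3,
        level=priority & 7,
        seq=int(fields[1]),
        timestamp_us=int(fields[2]),
        message=message.rstrip("\n"),
    )


def is_error(record: KmsgRecord) -> bool:
    """True for records logged at KERN_ERR or a more severe level."""
    return record.level <= KERN_ERR


# Made-up sample record, not output captured from a real kernel.
sample = "3,1042,58231907,-;example: I/O error on a hypothetical device\n"
record = parse_kmsg_record(sample)
```

A checker built on this would still need per-filesystem patterns for the message text itself; the header only gives the severity and ordering.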

> There's not *that* many different types of error messages and they don't
> change that often. If we'd just detect errors for the most common FSs
> we'd probably be good. Detecting a few general storage layer messages
> wouldn't be that hard either, most things have been unified over the
> last ~8-10 years.

It is hard to know the exact message format for each operating system
because such errors are hard to generate on demand, and we would need
to filter the messages down to the devices Postgres uses.

The other issue is that once you see a message during a checkpoint and
exit, you don't want to see that message again after the problem has
been fixed and the server restarted.  The simplest solution is to save
the output of the last check and look for only new entries.  I am
attaching a script I run every 15 minutes from cron that emails me any
unexpected kernel messages.
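
A minimal Python sketch of the save-and-compare idea, separate from the attached script: keep the previous output in a state file and report only the lines that follow the last line seen. It assumes kernel timestamps make each line unique. The function names and the state-file handling are hypothetical.

```python
import os
import subprocess


def new_entries(previous, current):
    """Return the lines of current that come after the last line of previous."""
    if not previous:
        return list(current)
    last_seen = previous[-1]
    # Scan backwards so the most recent occurrence of the line is used.
    for i in range(len(current) - 1, -1, -1):
        if current[i] == last_seen:
            return list(current[i + 1:])
    # The last seen line is gone (buffer wrapped or reboot): all lines are new.
    return list(current)


def check_once(state_path, current):
    """Compare current against the saved output, then save current."""
    try:
        with open(state_path, encoding="utf-8") as f:
            previous = f.read().splitlines()
    except FileNotFoundError:
        previous = []
    fresh = new_entries(previous, current)
    # Write to a temporary file and rename, so a crash cannot leave a
    # half-written state file behind.
    tmp_path = state_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        f.write("\n".join(current) + ("\n" if current else ""))
    os.replace(tmp_path, state_path)
    return fresh


def read_dmesg():
    """Return the current kernel log as a list of lines (needs permission)."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    return result.stdout.splitlines()
```

A cron job would call `check_once(state_path, read_dmesg())` and mail or act on whatever comes back. After a restart the old error is already in the state file, so it is not reported again.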

I am thinking we would need a contrib module with sample scripts for
various operating systems.

> > Replying to your specific case, I am not sure how we would use a script
> > to check for I/O errors/space-exhaustion if the postgres user doesn't
> > have access to it.
> 
> Not sure what you mean?
> 
> Space exhaustion can be checked when allocating space, FWIW. We'd just
> need to use posix_fallocate et al.
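
The quoted posix_fallocate suggestion can be illustrated with a small Python sketch using `os.posix_fallocate`, which wraps the C call where the platform provides it. The point is that ENOSPC surfaces at allocation time rather than later at writeback. `reserve_space` is a hypothetical helper and not how Postgres allocates relation files.

```python
import errno
import os


def reserve_space(fd, offset, length):
    """Ask the filesystem to allocate length bytes at offset.

    Returns False if the filesystem reports it is out of space, and
    re-raises any other error.
    """
    try:
        os.posix_fallocate(fd, offset, length)
    except OSError as e:
        if e.errno == errno.ENOSPC:
            return False
        raise
    return True
```

A caller would treat a False return as an immediate out-of-space error for the operation that wanted to extend the file.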

I was asking about cases where permissions prevent viewing of kernel
messages.  I think you can view them in containers, but in virtual
machines you might not have access to the host operating system's kernel
messages, and that might be where the I/O errors are reported.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +
