PANIC caused by open_sync on Linux - Mailing list pgsql-hackers

From ITAGAKI Takahiro
Subject PANIC caused by open_sync on Linux
Date
Msg-id 20071026123608.29AD.ITAGAKI.TAKAHIRO@oss.ntt.co.jp
Responses Re: PANIC caused by open_sync on Linux  (Greg Smith <gsmith@gregsmith.com>)
List pgsql-hackers
I encountered PANICs on CentOS 5.0 while running a write-mostly workload.
They occur only when wal_sync_method is set to open_sync; there was
no problem with fdatasync. They occurred on both Postgres 8.2.5 and 8.3dev.
 PANIC:  could not write to log file 0, segment 212 at offset 3399680, length 737280: Input/output error
STATEMENT: COMMIT;
 

A Linux expert near me says that mixing buffered I/O and direct I/O on the
same file can cause errors (EIO) on many versions of the Linux kernel. If we
use buffered I/O before direct I/O, Linux can fail to discard the kernel buffer
cache for that region and report EIO -- yes, it's a bug in Linux.

We use buffered I/O on WAL segments even if wal_sync_method is open_sync.
We initialize segments with zeros using buffered I/O, and after that
we re-open them with the specified sync options.
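
The access pattern described above can be sketched in C as follows. This is
only an illustration, not PostgreSQL's actual code: the function name
zero_fill_then_reopen, the BLOCK_SIZE constant, and the use_direct switch are
hypothetical, and the assumption that the reopen uses O_SYNC together with
O_DIRECT is mine.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes O_DIRECT on Linux/glibc */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE 8192         /* assumed write unit, aligned for O_DIRECT */

/*
 * Hypothetical helper, not PostgreSQL code.
 * Zero-fills a file with buffered writes, then reopens it with sync flags
 * and overwrites the first block. Returns 0 on success, otherwise the errno
 * of the first failing call. The file is filled in whole BLOCK_SIZE units.
 */
int
zero_fill_then_reopen(const char *path, size_t seg_size, int use_direct)
{
    void   *buf = NULL;
    ssize_t n;
    size_t  done;
    int     fd;
    int     flags;
    int     err = 0;

    (void) use_direct;          /* unused where O_DIRECT is not defined */

    /* An aligned buffer is needed for the direct I/O write in step 2. */
    if (posix_memalign(&buf, BLOCK_SIZE, BLOCK_SIZE) != 0)
        return ENOMEM;
    memset(buf, 0, BLOCK_SIZE);

    /* Step 1: create the file and zero-fill it with buffered writes. */
    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
    {
        err = errno;
        free(buf);
        return err;
    }
    for (done = 0; done < seg_size && err == 0; done += BLOCK_SIZE)
    {
        n = write(fd, buf, BLOCK_SIZE);
        if (n != BLOCK_SIZE)
            err = (n < 0) ? errno : ENOSPC;     /* short write: assume disk full */
    }
    if (err == 0 && fsync(fd) != 0)
        err = errno;
    close(fd);

    /* Step 2: reopen with sync flags and overwrite the first block. */
    if (err == 0)
    {
        flags = O_RDWR | O_SYNC;
#ifdef O_DIRECT
        if (use_direct)
            flags |= O_DIRECT;
#endif
        fd = open(path, flags);
        if (fd < 0)
            err = errno;
        else
        {
            memset(buf, 'x', BLOCK_SIZE);
            n = pwrite(fd, buf, BLOCK_SIZE, 0);
            if (n != BLOCK_SIZE)
                err = (n < 0) ? errno : ENOSPC;
            close(fd);
        }
    }

    free(buf);
    return err;
}
```

If the kernel behaves as reported, the write in step 2 is where EIO would be
expected to appear when use_direct is nonzero. I have not verified that this
small sketch triggers the bug by itself; the PANIC above appeared under a
sustained write-mostly workload.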

The bug behaves differently on RHEL 4 and RHEL 5:
  RHEL 4 -> No error is reported even though the kernel cache is inconsistent.
  RHEL 5 -> write() fails with EIO (Input/output error).
 
The PANIC occurs only on RHEL 5, but RHEL 4 also has a problem. If the WAL
archiver reads the inconsistent cache of WAL segments, it could archive wrong
contents, and PITR might fail at the corrupted archived file.


I recommend that users on Linux not use open_sync until the bug is
fixed. However, does anyone have an idea for avoiding the bug while still
using direct I/O? Mixing buffered and direct I/O is legal, but it imposes
complexity on the kernel. If we simplify our usage, things should be easier.
For example, we could drop the zero-filling and use only direct I/O. Is that
possible?

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center


