Thread: PANIC caused by open_sync on Linux

PANIC caused by open_sync on Linux

From
ITAGAKI Takahiro
Date:
I encountered PANICs on CentOS 5.0 when I ran a write-mostly workload.
They occur only if wal_sync_method is set to open_sync; there was
no problem with fdatasync. They occurred on both Postgres 8.2.5 and 8.3dev.
 PANIC:  could not write to log file 0, segment 212 at offset 3399680, length 737280: Input/output error
STATEMENT: COMMIT;
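
The numbers in that message can be decoded with a little arithmetic. The following is a minimal sketch, assuming the default 8 kB WAL page size and 16 MB segment size, and assuming timeline 1 (the message does not state the timeline). The function name and the returned keys are made up for illustration. For the message above, the failed write covers 90 whole pages starting at page 415 of the segment, so the request itself was page-aligned.

```python
import re

XLOG_BLCKSZ = 8192                  # assumed: default WAL page size
XLOG_SEG_SIZE = 16 * 1024 * 1024    # assumed: default WAL segment size

# Matches the body of the error message; tolerant of extra whitespace.
WRITE_ERROR_RE = re.compile(
    r"could not write to log file (\d+), segment (\d+) "
    r"at offset (\d+),\s+length (\d+)"
)


def decode_wal_write_error(message, timeline=1):
    """Break a WAL write error message into page-level facts.

    timeline is an assumption: it is not part of the message.
    """
    m = WRITE_ERROR_RE.search(message)
    if m is None:
        raise ValueError("not a WAL write error message")
    log, seg, offset, length = (int(g) for g in m.groups())
    return {
        # Segment files are named timeline + log + segment, 8 hex digits each.
        "file_name": "%08X%08X%08X" % (timeline, log, seg),
        "first_page": offset // XLOG_BLCKSZ,
        "pages": length // XLOG_BLCKSZ,
        "page_aligned": offset % XLOG_BLCKSZ == 0
        and length % XLOG_BLCKSZ == 0,
        "fits_in_segment": offset + length <= XLOG_SEG_SIZE,
    }
```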
 

My nearby Linux guy says mixed usage of buffered I/O and direct I/O
could cause errors (EIO) on many versions of the Linux kernel. If we use
buffered I/O before direct I/O, Linux could fail to discard the kernel buffer
cache for the region and report EIO -- yes, it's a bug in Linux.

We use buffered I/O on WAL segments even if wal_sync_method is open_sync.
We initialize segments with zeros using buffered I/O, and after that,
we re-open them with the specified sync options.
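
The two-step pattern described above can be sketched as follows. This is only an illustration of the access pattern, not the server's code: the function names are hypothetical and the 8 kB block and 16 MB segment sizes are assumed defaults. O_DIRECT is deliberately left out of the default flags, since direct I/O needs aligned buffers and is not supported on every filesystem; passing os.O_SYNC | os.O_DIRECT on Linux would give the mixed buffered/direct usage discussed here.

```python
import os

ZERO_BLOCK = bytes(8192)  # one zeroed block; 8 kB is an assumed block size


def init_segment_buffered(path, seg_size=16 * 1024 * 1024):
    """Step 1: create the segment and zero-fill it with buffered writes."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        written = 0
        while written < seg_size:
            # The slice trims the final block if seg_size is not a multiple
            # of the block size; os.write may also write fewer bytes.
            written += os.write(fd, ZERO_BLOCK[: seg_size - written])
        os.fsync(fd)  # flush the zeros before the file is put to use
    finally:
        os.close(fd)


def reopen_for_wal(path, sync_flags=os.O_SYNC):
    """Step 2: re-open the same file with the sync flags for WAL writes.

    sync_flags is a hypothetical parameter; O_SYNC alone stands in for
    open_sync without direct I/O.
    """
    return os.open(path, os.O_RDWR | sync_flags)
```

The point of interest is that the pages written in step 1 may still be in the kernel's page cache when step 2 starts writing to the same region with different flags.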

The behavior of the bug differs between RHEL 4 and 5:
  RHEL 4 -> No error is reported even though the kernel cache is inconsistent.
  RHEL 5 -> write() fails with EIO (Input/output error).
 
The PANIC occurs only on RHEL 5, but RHEL 4 also has a problem. If the WAL
archiver reads the inconsistent cache of WAL segments, it could archive wrong
contents and PITR might fail at the corrupted archived file.


I'd recommend that users on Linux not use open_sync until the bug is
fixed. However, is there any way to avoid the bug and still use direct i/o?
Mixed usage of buffered and direct i/o is legal, but enforces complexity
to kernels. If we simplify it, things would be more relaxed. For example,
dropping zero-filling and only use direct i/o. Is it possible?

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center



Re: PANIC caused by open_sync on Linux

From
Greg Smith
Date:
On Fri, 26 Oct 2007, ITAGAKI Takahiro wrote:

> My nearby Linux guy says mixed usage of buffered I/O and direct I/O 
> could cause errors (EIO) on many versions of the Linux kernel.

I'd be curious to get some more information about this--specifically which 
versions have the problems.  I'd heard about some weird bugs in the sync 
write code in versions between RHEL 4 (2.6.9) and 5 (2.6.18), but I wasn't 
aware of anything wrong with those two stable ones in this area.  I have a 
RHEL 5 system here, will see if I can replicate this EIO error.

> Mixed usage of buffered and direct i/o is legal, but enforces complexity 
> to kernels. If we simplify it, things would be more relaxed. For 
> example, dropping zero-filling and only use direct i/o. Is it possible?

It's possible, but performance suffers considerably.  I played around with 
this at one point when looking into doing all database writes as sync 
writes.  Having to wait until the entire 16MB WAL segment made its way to 
disk before more WAL could be written can cause a nasty pause in activity, 
even with direct I/O sync writes.  Even the current buffered zero-filled 
write of that size can be a bit of a drag on performance for the clients 
that get caught behind it; making it any sort of sync write will be far 
worse.
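
One rough way to compare the two cases on a given machine is to time a zero-fill of one segment with and without synchronous writes. The helper below is a hypothetical sketch, not a benchmark used in this thread; it assumes an 8 kB block and a 16 MB segment, and uses O_SYNC without O_DIRECT so that it runs on any filesystem. Results depend heavily on the storage and kernel, so no figures are given here.

```python
import os
import time


def time_zero_fill(path, size=16 * 1024 * 1024, sync_writes=False,
                   block=8192):
    """Return the seconds taken to zero-fill `path` with `size` bytes.

    With sync_writes=True every write() waits for the data to reach
    disk (O_SYNC); otherwise writes are buffered and flushed once at
    the end with fsync().
    """
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    if sync_writes:
        flags |= os.O_SYNC
    buf = bytes(block)
    start = time.monotonic()
    fd = os.open(path, flags, 0o600)
    try:
        remaining = size
        while remaining > 0:
            # Slicing trims the last block; short writes are retried.
            remaining -= os.write(fd, buf[:remaining])
        os.fsync(fd)
    finally:
        os.close(fd)
    return time.monotonic() - start
```

Calling it twice on the same scratch path, once with sync_writes=False and once with sync_writes=True, gives a feel for how long a client could be stalled behind segment creation in each mode.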

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: PANIC caused by open_sync on Linux

From
Tom Lane
Date:
Greg Smith <gsmith@gregsmith.com> writes:
> On Fri, 26 Oct 2007, ITAGAKI Takahiro wrote:
>> Mixed usage of buffered and direct i/o is legal, but enforces complexity 
>> to kernels. If we simplify it, things would be more relaxed. For 
>> example, dropping zero-filling and only use direct i/o. Is it possible?

> It's possible, but performance suffers considerably.  I played around with 
> this at one point when looking into doing all database writes as sync 
> writes.  Having to wait until the entire 16MB WAL segment made its way to 
> disk before more WAL could be written can cause a nasty pause in activity, 
> even with direct I/O sync writes.  Even the current buffered zero-filled 
> write of that size can be a bit of a drag on performance for the clients 
> that get caught behind it; making it any sort of sync write will be far 
> worse.

This ties into a loose end we didn't get to yet: being more aggressive
about creating future WAL segments.  ISTM there is no good reason for
clients ever to have to wait for WAL segment creation --- the bgwriter,
or possibly the walwriter, ought to handle that in the background.  But
we only check for the case once per checkpoint and we don't create a
segment unless there's very little space left.
        regards, tom lane
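
The idea of creating future segments ahead of need can be sketched as a small planning function that a background process might call on each cycle. Everything here is hypothetical: the function name, the `lookahead` knob, and the use of plain integers for segment numbers (wraparound between log files is ignored). It is not how the bgwriter or walwriter is implemented.

```python
def segments_to_precreate(current_seg, existing_segs, lookahead=2):
    """Return the future segment numbers that still need to be created.

    current_seg   -- number of the segment currently being written
    existing_segs -- set of segment numbers already present on disk
    lookahead     -- hypothetical setting: how many segments beyond the
                     current one should always be ready
    """
    wanted = range(current_seg + 1, current_seg + 1 + lookahead)
    return [seg for seg in wanted if seg not in existing_segs]
```

For example, with segment 212 in use, segment 213 already on disk, and a lookahead of 2, the function returns [214]. A background process that creates and zero-fills whatever this returns would keep clients from waiting on segment creation as long as it stays ahead of WAL generation.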


Re: PANIC caused by open_sync on Linux

From
"Jonah H. Harris"
Date:
On 10/26/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> This ties into a loose end we didn't get to yet: being more aggressive
> about creating future WAL segments.  ISTM there is no good reason for
> clients ever to have to wait for WAL segment creation --- the bgwriter,
> or possibly the walwriter, ought to handle that in the background.

Agreed.

-- 
Jonah H. Harris, Sr. Software Architect | phone: 732.331.1324
EnterpriseDB Corporation                | fax: 732.331.1301
499 Thornall Street, 2nd Floor          | jonah.harris@enterprisedb.com
Edison, NJ 08837                        | http://www.enterprisedb.com/


Re: PANIC caused by open_sync on Linux

From
Andrew Sullivan
Date:
On Fri, Oct 26, 2007 at 08:34:49AM -0400, Tom Lane wrote:
> we only check for the case once per checkpoint and we don't create a
> segment unless there's very little space left.

Sort of a filthy hack, but what about always having an _extra_
segment around?  The bgwriter could do that, no?

A

-- 
Andrew Sullivan  | ajs@crankycanuck.ca


Re: PANIC caused by open_sync on Linux

From
Greg Smith
Date:
On Fri, 26 Oct 2007, Andrew Sullivan wrote:

> Sort of a filthy hack, but what about always having an _extra_
> segment around?  The bgwriter could do that, no?

Now it could.  The bgwriter in <=8.2 stops executing when there's a 
checkpoint going on, and needing more WAL segments because a checkpoint is 
taking too long is one of the major failure cases where proactively 
creating additional segments would be most helpful.

The 8.3 bgwriter keeps running even during checkpoints, so it's feasible 
to add such a feature now.  But that only became true well into the 8.3 
feature freeze, after some changes Heikki made just before the "load 
distributed checkpoint" patch was committed.  Before that, it was hard to 
implement this feature; afterwards, it was too late to fit the change into 
the 8.3 release.  Should be easy enough to add to 8.4 one day.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: PANIC caused by open_sync on Linux

From
Tom Lane
Date:
Greg Smith <gsmith@gregsmith.com> writes:
> The 8.3 bgwriter keeps running even during checkpoints, so it's feasible 
> to add such a feature now.

I wonder though whether the walwriter wouldn't be a better place for it.
        regards, tom lane


Re: PANIC caused by open_sync on Linux

From
Greg Smith
Date:
On Fri, 26 Oct 2007, Tom Lane wrote:

>> The 8.3 bgwriter keeps running even during checkpoints, so it's feasible
>> to add such a feature now.
> I wonder though whether the walwriter wouldn't be a better place for it.

I do, too, but that wasn't available until too late in the 8.3 cycle to 
consider adding this feature there either.

There's a couple of potential to-do list ideas that build on the changes 
in this area in 8.3:

-Aggressively pre-allocate WAL segments 
-Space out checkpoint fsync requests in addition to disk writes
-Consider re-inserting a smarter bgwriter all-scan that writes sorted by 
usage count during idle periods

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: PANIC caused by open_sync on Linux

From
ITAGAKI Takahiro
Date:
Greg Smith <gsmith@gregsmith.com> wrote:

> There's a couple of potential to-do list ideas that build on the changes 
> in this area in 8.3:
> 
> -Aggressively pre-allocate WAL segments 
> -Space out checkpoint fsync requests in addition to disk writes
> -Consider re-inserting a smarter bgwriter all-scan that writes sorted by 
> usage count during idle periods

I'd like to add:
- Remove "filling with zero" before we recycle WAL segments.

If it is not needed, we can avoid buffered i/o with open_sync except for
the first allocation of segments. I think we can do it if we have more
robust WAL records that can ignore garbage data written before.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center




Re: PANIC caused by open_sync on Linux

From
Tom Lane
Date:
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> I'd like to add:
> - Remove "filling with zero" before we recycle WAL segments.

Huh?  We have never done that.
        regards, tom lane


Re: PANIC caused by open_sync on Linux

From
ITAGAKI Takahiro
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> > I'd like to add:
> > - Remove "filling with zero" before we recycle WAL segments.
> 
> Huh?  We have never done that.

Oh, sorry. I misread the code.

I should be able to avoid the PANIC if I have enough segments at startup.
I'll test that configuration.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center




Re: PANIC caused by open_sync on Linux

From
Andrew Sullivan
Date:
On Fri, Oct 26, 2007 at 10:39:12PM -0400, Greg Smith wrote:
> There's a couple of potential to-do list ideas that build on the changes 
> in this area in 8.3:

I think that's the right way to go.  It's too bad that this may still
happen in 8.3, but we're way past the point that this is a bug fix,
IMO.

A

-- 
Andrew Sullivan  | ajs@crankycanuck.ca
The plural of anecdote is not data.    --Roger Brinner


Re: PANIC caused by open_sync on Linux

From
Bruce Momjian
Date:
Added to TODO:

* Be more aggressive about creating WAL files
 http://archives.postgresql.org/pgsql-hackers/2007-10/msg01325.php


---------------------------------------------------------------------------

Tom Lane wrote:
> Greg Smith <gsmith@gregsmith.com> writes:
> > On Fri, 26 Oct 2007, ITAGAKI Takahiro wrote:
> >> Mixed usage of buffered and direct i/o is legal, but enforces complexity 
> >> to kernels. If we simplify it, things would be more relaxed. For 
> >> example, dropping zero-filling and only use direct i/o. Is it possible?
> 
> > It's possible, but performance suffers considerably.  I played around with 
> > this at one point when looking into doing all database writes as sync 
> > writes.  Having to wait until the entire 16MB WAL segment made its way to 
> > disk before more WAL could be written can cause a nasty pause in activity, 
> > even with direct I/O sync writes.  Even the current buffered zero-filled 
> > write of that size can be a bit of a drag on performance for the clients 
> > that get caught behind it; making it any sort of sync write will be far 
> > worse.
> 
> This ties into a loose end we didn't get to yet: being more aggressive
> about creating future WAL segments.  ISTM there is no good reason for
> clients ever to have to wait for WAL segment creation --- the bgwriter,
> or possibly the walwriter, ought to handle that in the background.  But
> we only check for the case once per checkpoint and we don't create a
> segment unless there's very little space left.
> 
>             regards, tom lane
> 

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://postgres.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +