Re: Raid 10 chunksize - Mailing list pgsql-performance

From david@lang.hm
Subject Re: Raid 10 chunksize
Date
Msg-id alpine.DEB.1.10.0904011341500.28893@asgard.lang.hm
In response to Re: Raid 10 chunksize  (david@lang.hm)
List pgsql-performance
On Wed, 1 Apr 2009, david@lang.hm wrote:

> On Wed, 1 Apr 2009, Mark Kirkwood wrote:
>
>> Scott Carey wrote:
>>>
>>> A little extra info here: md, LVM, and some other tools do not allow the
>>> file system to use write barriers properly, so those are on the bad list
>>> for data integrity with SAS or SATA write caches without battery back-up.
>>> However, this is NOT an issue on the postgres data partition.  Data fsync
>>> still works fine; it's the file system journal that might have out-of-order
>>> writes.  For xlogs, write barriers are not important, only fsync() not
>>> lying.
>>>
>>> As an additional note, ext4 uses checksums per block in the journal, so it
>>> is resistant to out of order writes causing trouble.  The test compared to
>>> here was on ext4, and most likely the speed increase is partly due to
>>> that.
>>>
>>>
>>
>> [Looks at  Stef's  config - 2x 7200 rpm SATA RAID 0]  I'm still highly
>> suspicious of such a system being capable of outperforming one with the
>> same number of (effective) - much faster - disks *plus* a dedicated WAL
>> disk pair... unless it is being a little loose about fsync! I'm happy to
>> believe ext4 is better than ext3 - but not that much!
>
> given how _horrible_ ext3 is with fsync, I can believe it more easily with
> fsync turned on than with it off.

I realized after sending this that I needed to elaborate a little more.

over the last week there has been a _huge_ thread on the linux-kernel list
(>400 messages) that is summarized on lwn.net at
http://lwn.net/SubscriberLink/326471/b7f5fedf0f7c545f/

there is a lot of information in this thread, but one big thing is that in
data=ordered mode (the default for most distros) ext3 can end up having to
write all pending data on the filesystem when you do an fsync on one file.
In addition, reading from disk can take priority over writing the journal
entry (the IO scheduler assumes that there is someone waiting for a read,
but not for a write), so if you have one process trying to do an fsync and
another reading from the disk, the one doing the fsync needs to wait until
the disk is idle to get the fsync completed.
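
To see whether a given filesystem shows this effect, one rough check is to time an fsync on a small file with and without a pile of unsynced data sitting in another file on the same filesystem. The sketch below is only an illustration: the function name time_fsync, the file names, and the sizes are made up for this example, and the numbers it prints depend heavily on the kernel, mount options, and hardware.

```python
import os
import tempfile
import time


def time_fsync(directory, background_bytes=0, payload=b"x" * 8192):
    """Return the seconds spent in os.fsync() on a small file.

    If background_bytes > 0, that much data is first written to a second
    file in the same directory WITHOUT an fsync, so it is still dirty in
    the page cache when the small file is synced.
    """
    bulk_path = os.path.join(directory, "bulk.dat")    # hypothetical name
    small_path = os.path.join(directory, "small.dat")  # hypothetical name

    bulk = open(bulk_path, "wb")
    try:
        if background_bytes:
            bulk.write(b"\0" * background_bytes)
            bulk.flush()  # hand the data to the kernel, deliberately no fsync

        with open(small_path, "wb") as small:
            small.write(payload)
            small.flush()
            start = time.perf_counter()
            os.fsync(small.fileno())  # the call whose latency we care about
            elapsed = time.perf_counter() - start
    finally:
        bulk.close()
        for path in (bulk_path, small_path):
            if os.path.exists(path):
                os.remove(path)
    return elapsed


if __name__ == "__main__":
    # Point dir= at the filesystem under test; the default temp directory
    # may be tmpfs, where fsync costs nothing.
    with tempfile.TemporaryDirectory() as d:
        print("no pending data:   %.4f s" % time_fsync(d))
        print("64MB pending data: %.4f s" % time_fsync(d, 64 * 1024 * 1024))
```

If the second number is much larger than the first, the fsync on the small file is paying for the other file's pending writes. This does not exercise the read-vs-journal scheduling problem; that would need a second process reading from the same disk at the same time.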

ext4 does things differently enough that fsyncs are relatively cheap again
(like they are on XFS, ext2, and other filesystems). the tradeoff is that
if you _don't_ do an fsync there is an increased window where you will get
data corruption if you crash.
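
For application files that must survive a crash, the usual way to close that window is to fsync explicitly instead of relying on the filesystem. A minimal sketch of the common write-temp-file, fsync, rename pattern follows, assuming a POSIX system; the function name durable_write and the ".tmp" suffix are made up for this example. (postgres already does its own fsync calls for WAL and data files, so this is about other programs.)

```python
import os


def durable_write(path, data):
    """Replace the contents of path with data (bytes).

    After a crash the file should hold either the old or the new
    contents, not an empty or partial file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    tmp_path = path + ".tmp"  # hypothetical temp-file naming

    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # force the new contents to disk first

    os.replace(tmp_path, path)  # atomic rename over the old file

    # fsync the directory so the rename itself is on disk (POSIX only;
    # opening a directory like this does not work on Windows).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Skipping the first fsync is what leaves the window open: the rename can reach the disk before the data does.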

David Lang
