Re: Filesystem benchmarking for pg 8.3.3 server - Mailing list pgsql-performance

From Greg Smith
Subject Re: Filesystem benchmarking for pg 8.3.3 server
Msg-id Pine.GSO.4.64.0808140825330.26747@westnet.com
In response to Re: Filesystem benchmarking for pg 8.3.3 server  (Ron Mayer <rm_pg@cheapcomplexdevices.com>)
Responses Re: Filesystem benchmarking for pg 8.3.3 server
Re: Filesystem benchmarking for pg 8.3.3 server
List pgsql-performance

On Wed, 13 Aug 2008, Ron Mayer wrote:

> First off - some IDE drives don't even support the relatively recent ATA
> command that apparently lets the software know when a cache flush is
> complete.

Right, so this is one reason you can't assume barriers will be available.
And regardless of drive support, barriers don't work if you go through the
device mapper, as some LVM and software RAID configurations do; see
http://lwn.net/Articles/283161/

> Second of all - ext3 fsync() appears to me to be *extremely* stupid.
> It only seems to correctly do the correct flushing (and waiting) for a
> drive's cache to be flushed when a file's inode has changed.

This is bad, but the way PostgreSQL uses fsync seems to work fine--if it
didn't, we'd all see unnaturally high write rates all the time.
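
One rough way to check for "unnaturally high" rates is to time a loop of
small synchronous writes and compare the result with what the platter can
physically do.  The sketch below is only an illustration, not a PostgreSQL
tool: the function names, the 8 kB block size, and the loop count are my
own assumptions.  It relies on the fact that a disk spinning at a given RPM
completes RPM/60 rotations per second, so a single writer rewriting the
same block should not get many more honest flushes per second than that.

```python
import os
import tempfile
import time


def fsync_rate(directory, writes=200, block=b"x" * 8192):
    """Return synced writes per second for a scratch file in `directory`.

    Hypothetical helper: rewrites the same block and calls fsync() each time.
    """
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(writes):
            os.pwrite(fd, block, 0)  # overwrite the block at offset 0
            os.fsync(fd)             # ask the OS to push it to stable storage
        elapsed = max(time.perf_counter() - start, 1e-9)
    finally:
        os.close(fd)
        os.unlink(path)
    return writes / elapsed


def rotations_per_second(rpm):
    """Rotations per second for a drive spinning at `rpm`."""
    return rpm / 60.0
```

For example, compare fsync_rate("/path/to/pgdata/volume") against
rotations_per_second(7200), which is 120.  A measured rate far above the
rotation rate on a plain disk with no battery-backed cache suggests that
writes are being acknowledged from a volatile cache.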

> So I take back what I said about linux and write barriers
> being sane.   They're not.

Right.  Where Linux seems to be right now is that people occasionally run
into a problem where ext3 volumes can get corrupted if there are
out-of-order writes to the journal:
http://en.wikipedia.org/wiki/Ext3#No_checksumming_in_journal
http://archives.free.net.ph/message/20070518.134838.52e26369.en.html

(By the way:  I just fixed the ext3 Wikipedia article to reflect the
current state of things and dumped a bunch of reference links in there,
including some that are not listed here.  I prefer to keep my notes about
interesting topics in Wikipedia instead of having my own copies whenever
possible.)

There are two ways to get around this issue on ext3.  One is to disable write
caching, changing your default mount options to "data=journal".  In the
PostgreSQL case, the way the WAL is used seems to keep corruption at bay
even with the default "data=ordered" case, but after reading up on this
again I'm thinking I may want to switch to "journal" anyway in the future
(and retrofit some older installs with that change).  I also avoid using
Linux LVM whenever possible for databases just on general principle; one
less flakey thing in the way.
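
To see which journaling mode existing ext3 volumes are using before
retrofitting anything, something like the following could be used.  It is a
minimal sketch that assumes the usual /proc/mounts layout (device, mount
point, filesystem type, comma-separated options); the function name and the
sample text are made up for illustration.

```python
def ext3_data_modes(mounts_text):
    """Map each ext3 mount point to its data= mode.

    The value is None when no data= option is listed for that mount.
    """
    modes = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Expect: device, mount point, fs type, options, ...
        if len(fields) < 4 or fields[2] != "ext3":
            continue
        mode = None
        for opt in fields[3].split(","):
            if opt.startswith("data="):
                mode = opt[len("data="):]
        modes[fields[1]] = mode
    return modes


# Hypothetical sample in /proc/mounts format, for illustration only.
SAMPLE = """\
/dev/sda1 / ext3 rw,data=ordered 0 0
/dev/sdb1 /var/lib/pgsql ext3 rw,noatime,data=journal 0 0
proc /proc proc rw 0 0
"""
```

On a real system the input would come from open("/proc/mounts").read(); any
mount point reported with a mode other than "journal" is a candidate for
the change described above.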

The other way, barriers, is just plain scary unless you know your disk
hardware does the right thing and the planets align just right, and even
then it seems buggy.  I personally just ignore the fact that they exist on
ext3, and maybe one day ext4 will get this right.

By the way:  there is a great ext3 "torture test" program that came out a
few months ago.  It's useful for checking general filesystem corruption in
this context, and I keep meaning to try it.  If you've got some cycles to
spare working in this area, check it out:
http://uwsg.indiana.edu/hypermail/linux/kernel/0805.2/1470.html

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
