Re: New Linux xfs/reiser file systems - Mailing list pgsql-hackers

From teg@redhat.com (Trond Eivind Glomsrød)
Subject Re: New Linux xfs/reiser file systems
Msg-id xuyhez1p341.fsf@halden.devel.redhat.com
In response to New Linux xfs/reiser file systems  (Bruce Momjian <pgman@candle.pha.pa.us>)
Responses Re: New Linux xfs/reiser file systems  (Bruce Momjian <pgman@candle.pha.pa.us>)
List pgsql-hackers
I got some information from Stephen Tweedie on this. Please keep him in
the "Cc:" line, as he's not on this list.

************************************************************************
Bruce Momjian <pgman@candle.pha.pa.us> writes:

> I was talking to a Linux user yesterday, and he said that performance
> using the xfs file system is pretty bad.  He believes it has to do with
> the fact that fsync() on log-based file systems requires more writes.


Performance doing what?  XFS has known performance problems doing
unlinks and truncates, but not synchronous IO.  The user should be
using fdatasync() for databases, btw, not fsync().

First, XFS, ext3 and reiserfs are *NOT* log-based filesystems.  They
are journaling filesystems.  They have a log, but they are not
log-based because they do not store data permanently in a log
structure.  Berkeley LFS, Sprite and Spiralog are log-based
filesystems.

> With a standard BSD/ext2 file system, WAL writes can stay on the same
> cylinder to perform fsync.  Is that true of log-based file systems?

Not true on ext2 or BSD.  Write-aheads are _usually_ close to the
inode, but not always.  For true log-based filesystems, writes are
always completely sequential, so the issue just goes away.  For
journaling filesystems, depending on the setup there may be a seek to
the journal involved, but some journaling filesystems can use a
separate disk for the journal so no seek is required.

> I know xfs and reiser are both log based.  Do we need to be concerned
> about PostgreSQL performance on these file systems?  I use BSD FFS with
> soft updates here, so it doesn't affect me.

A database normally preallocates its data files and then performs most
of its writes using update-in-place.  In such cases, fsync() is almost
always the wrong thing to be doing --- the data writes have changed
nothing in the inode except for the timestamps, and there's no need to
flush the timestamps to disk for every write.  fdatasync() is
designed for this --- if the only inode change is timestamps,
fdatasync() will skip the seek to the inode and will only update the
data.  If any significant inode fields have been changed, then a full
flush is done.

Using fdatasync(), most filesystems will incur no seeks for a data flush,
regardless of whether the filesystem is journaling or not.
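
To make the pattern above concrete, here is a minimal sketch in Python of
"preallocate once, then update in place and flush with fdatasync()". The
helper names preallocate() and update_in_place() are hypothetical and are
not taken from PostgreSQL or any other database; the sketch assumes a
POSIX system and falls back to fsync() where os.fdatasync is unavailable.

```python
import os

# Assumption: os.fdatasync is missing on some platforms (for example macOS),
# so fall back to the stronger fsync() there.
_datasync = getattr(os, "fdatasync", os.fsync)


def preallocate(path, size, chunk=1 << 16):
    """Create `path` filled with `size` zero bytes and flush it fully.

    Writing real zeros (rather than just extending the file) allocates the
    blocks up front. The single fsync() here also flushes the inode, since
    the file size has just changed.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    remaining = size
    while remaining > 0:
        written = os.write(fd, b"\0" * min(chunk, remaining))
        remaining -= written
    os.fsync(fd)
    return fd


def update_in_place(fd, offset, data):
    """Overwrite already-allocated bytes, then flush only what is needed.

    The write stays inside the preallocated region, so the file size does
    not change; the expectation described above is that fdatasync() can
    then skip the timestamp-only inode update.
    """
    os.pwrite(fd, data, offset)
    _datasync(fd)
```

A caller would run preallocate() once per data file and then call
update_in_place() for each subsequent write, closing the descriptor with
os.close() when done. Whether the inode seek is actually avoided depends
on the kernel and filesystem, so this illustrates the call pattern only.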

Cheers, Stephen
************************************************************************

-- 
Trond Eivind Glomsrød
Red Hat, Inc.

