Thread: file system and raid performance

file system and raid performance

From: "Mark Wong"

Hi all,

We've thrown together some results from simple i/o tests on Linux
comparing various file systems, hardware and software raid with a
little bit of volume management:

http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide

What I'd like to ask of the folks on the list is how relevant is this
information in helping make decisions such as "What file system should
I use?"  "What performance can I expect from this RAID configuration?"
 I know these kind of tests won't help answer questions like "Which
file system is most reliable?" but we would like to be as helpful as
we can.

Any suggestions/comments/criticisms for what would be more relevant or
interesting also appreciated.  We've started with Linux but we'd also
like to hit some other OS's.  I'm assuming FreeBSD would be the other
popular choice for the DL-380 that we're using.

I hope this is helpful.

Regards,
Mark

Re: file system and raid performance

From: david@lang.hm

On Mon, 4 Aug 2008, Mark Wong wrote:

> Hi all,
>
> We've thrown together some results from simple i/o tests on Linux
> comparing various file systems, hardware and software raid with a
> little bit of volume management:
>
> http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide
>
> What I'd like to ask of the folks on the list is how relevant is this
> information in helping make decisions such as "What file system should
> I use?"  "What performance can I expect from this RAID configuration?"
> I know these kind of tests won't help answer questions like "Which
> file system is most reliable?" but we would like to be as helpful as
> we can.
>
> Any suggestions/comments/criticisms for what would be more relevant or
> interesting also appreciated.  We've started with Linux but we'd also
> like to hit some other OS's.  I'm assuming FreeBSD would be the other
> popular choice for the DL-380 that we're using.
>
> I hope this is helpful.

It's definitely timely for me (we were having a spirited 'discussion' on
this topic at work today ;-)

what happened with XFS?

You show it as not completing half the tests in the single-disk table, and
it's completely missing from the other ones.

what OS/kernel were you running?

If it was Linux, which software RAID did you try (md or dm)?  Did you use
LVM or raw partitions?

David Lang

Re: file system and raid performance

From: "Gregory S. Youngblood"

I recently ran some tests on Ubuntu Hardy Server (Linux) comparing JFS, XFS,
and ZFS+FUSE. It was all 32-bit and on old hardware, plus I only used
bonnie++, so the numbers are really only useful for my hardware.

What parameters were used to create the XFS partition in these tests? And,
what options were used to mount the file system? Was the kernel 32-bit or
64-bit? Given what I've seen with some of the XFS options (like lazy-count),
I am wondering about the options used in these tests.

Thanks,
Greg




Re: file system and raid performance

From: "Mark Wong"

On Mon, Aug 4, 2008 at 10:04 PM,  <david@lang.hm> wrote:
> On Mon, 4 Aug 2008, Mark Wong wrote:
>
>> Hi all,
>>
>> We've thrown together some results from simple i/o tests on Linux
>> comparing various file systems, hardware and software raid with a
>> little bit of volume management:
>>
>> http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide
>>
>> What I'd like to ask of the folks on the list is how relevant is this
>> information in helping make decisions such as "What file system should
>> I use?"  "What performance can I expect from this RAID configuration?"
>> I know these kind of tests won't help answer questions like "Which
>> file system is most reliable?" but we would like to be as helpful as
>> we can.
>>
>> Any suggestions/comments/criticisms for what would be more relevant or
>> interesting also appreciated.  We've started with Linux but we'd also
>> like to hit some other OS's.  I'm assuming FreeBSD would be the other
>> popular choice for the DL-380 that we're using.
>>
>> I hope this is helpful.
>
> it's definantly timely for me (we were having a spirited 'discussion' on
> this topic at work today ;-)
>
> what happened with XFS?

Not exactly sure; I didn't attempt to debug much.  I only looked into
it enough to see that the fio processes were waiting for something.
In one case I let the test run for 24 hours to see if it would stop,
even though I had told fio not to run longer than an hour.

> you show it as not completing half the tests in the single-disk table and
> it's completly missing from the other ones.
>
> what OS/kernel were you running?

This is a Gentoo system, running the 2.6.25-gentoo-r6 kernel.

> if it was linux, which software raid did you try (md or dm) did you use lvm
> or raw partitions?

We tried mdraid, not device-mapper.  So far we have only used raw
devices (whole disks without partition tables).
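
(A rough sketch of that kind of setup, for anyone wanting to reproduce it -
the device names, RAID level and disk count below are assumptions, not
necessarily what was used on the DL380:)

  # md array built directly on whole disks, no partition tables
  $ mdadm --create /dev/md0 --level=5 --raid-devices=4 \
          /dev/sdb /dev/sdc /dev/sdd /dev/sde
  # the file system goes straight onto the md device
  $ mkfs.xfs /dev/md0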

Regards,
Mark

Re: file system and raid performance

From: "Mark Wong"

On Mon, Aug 4, 2008 at 10:56 PM, Gregory S. Youngblood <greg@tcscs.com> wrote:
> I recently ran some tests on Ubuntu Hardy Server (Linux) comparing JFS, XFS,
> and ZFS+FUSE. It was all 32-bit and on old hardware, plus I only used
> bonnie++, so the numbers are really only useful for my hardware.
>
> What parameters were used to create the XFS partition in these tests? And,
> what options were used to mount the file system? Was the kernel 32-bit or
> 64-bit? Given what I've seen with some of the XFS options (like lazy-count),
> I am wondering about the options used in these tests.

The default (no arguments specified) parameters were used to create
the XFS partitions.  Mount options specified are described in the
table.  This was a 64-bit OS.
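
(For context, the difference being asked about looks roughly like this - a
sketch only; the device name and mount point are assumptions, and lazy-count
was not used in the published runs:)

  # "default parameters" means simply:
  $ mkfs.xfs /dev/md0
  # versus explicitly enabling the log option Greg mentions:
  $ mkfs.xfs -l lazy-count=1 /dev/md0
  # the mount options (e.g. noatime) are the ones listed per row in the wiki:
  $ mount -o noatime /dev/md0 /mnt/data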

Regards,
Mark

Re: file system and raid performance

From: "Fernando Ike"

On Tue, Aug 5, 2008 at 4:54 AM, Mark Wong <markwkm@gmail.com> wrote:
> Hi all,
 Hi

> We've thrown together some results from simple i/o tests on Linux
> comparing various file systems, hardware and software raid with a
> little bit of volume management:
>
> http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide
>
>
> Any suggestions/comments/criticisms for what would be more relevant or
> interesting also appreciated.  We've started with Linux but we'd also
> like to hit some other OS's.  I'm assuming FreeBSD would be the other
> popular choice for the DL-380 that we're using.
>

   It would also be interesting to test ext4.  Although it is not yet
considered stable in the Linux kernel, it should be possible here given
the kernel version, assuming a supported version of e2fsprogs is
installed.



Regards,
--
Fernando Ike
http://www.midstorm.org/~fike/weblog

Re: file system and raid performance

From: "Gregory S. Youngblood"

> From: Mark Kirkwood [mailto:markir@paradise.net.nz]
> Mark Wong wrote:
> > On Mon, Aug 4, 2008 at 10:56 PM, Gregory S. Youngblood
> <greg@tcscs.com> wrote:
> >
> >> I recently ran some tests on Ubuntu Hardy Server (Linux) comparing
> JFS, XFS,
> >> and ZFS+FUSE. It was all 32-bit and on old hardware, plus I only
> used
> >> bonnie++, so the numbers are really only useful for my hardware.
> >>
> >> What parameters were used to create the XFS partition in these
> tests? And,
> >> what options were used to mount the file system? Was the kernel 32-
> bit or
> >> 64-bit? Given what I've seen with some of the XFS options (like
> lazy-count),
> >> I am wondering about the options used in these tests.
> >>
> >
> > The default (no arguments specified) parameters were used to create
> > the XFS partitions.  Mount options specified are described in the
> > table.  This was a 64-bit OS.
> >
> I think it is a good idea to match the raid stripe size and give some
> indication of how many disks are in the array. E.g:
>
> For a 4 disk system with 256K stripe size I used:
>
>  $ mkfs.xfs -d su=256k,sw=2 /dev/mdx
>
> which performed about 2-3 times quicker than the default (I did try
> sw=4
> as well, but didn't notice any difference compared to sw=2).

[Greg says]
I thought that xfs picked up those details when using md and a soft-raid
configuration.





Re: file system and raid performance

From: Mark Kirkwood

Mark Wong wrote:
> On Mon, Aug 4, 2008 at 10:56 PM, Gregory S. Youngblood <greg@tcscs.com> wrote:
>
>> I recently ran some tests on Ubuntu Hardy Server (Linux) comparing JFS, XFS,
>> and ZFS+FUSE. It was all 32-bit and on old hardware, plus I only used
>> bonnie++, so the numbers are really only useful for my hardware.
>>
>> What parameters were used to create the XFS partition in these tests? And,
>> what options were used to mount the file system? Was the kernel 32-bit or
>> 64-bit? Given what I've seen with some of the XFS options (like lazy-count),
>> I am wondering about the options used in these tests.
>>
>
> The default (no arguments specified) parameters were used to create
> the XFS partitions.  Mount options specified are described in the
> table.  This was a 64-bit OS.
>
> Regards,
> Mark
>
>
I think it is a good idea to match the raid stripe size and give some
indication of how many disks are in the array. E.g:

For a 4 disk system with 256K stripe size I used:

 $ mkfs.xfs -d su=256k,sw=2 /dev/mdx

which performed about 2-3 times quicker than the default (I did try sw=4
as well, but didn't notice any difference compared to sw=2).
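
(A sketch of how su/sw relate to the md layout - the chunk size and disk
counts below are illustrative assumptions:)

  # find the array's chunk size
  $ mdadm --detail /dev/md0 | grep -i chunk
  # su = chunk size, sw = number of data disks per stripe, e.g. for 256K chunks:
  #   4-disk RAID 0 -> sw=4,  4-disk RAID 5 -> sw=3,  4-disk RAID 10 -> sw=2
  $ mkfs.xfs -d su=256k,sw=4 /dev/md0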

regards

Mark


Re: file system and raid performance

From: Mark Kirkwood

Gregory S. Youngblood wrote:
>> From: Mark Kirkwood [mailto:markir@paradise.net.nz]
>> Mark Wong wrote:
>>
>>> On Mon, Aug 4, 2008 at 10:56 PM, Gregory S. Youngblood
>>>
>> <greg@tcscs.com> wrote:
>>
>>>> I recently ran some tests on Ubuntu Hardy Server (Linux) comparing
>>>>
>> JFS, XFS,
>>
>>>> and ZFS+FUSE. It was all 32-bit and on old hardware, plus I only
>>>>
>> used
>>
>>>> bonnie++, so the numbers are really only useful for my hardware.
>>>>
>>>> What parameters were used to create the XFS partition in these
>>>>
>> tests? And,
>>
>>>> what options were used to mount the file system? Was the kernel 32-
>>>>
>> bit or
>>
>>>> 64-bit? Given what I've seen with some of the XFS options (like
>>>>
>> lazy-count),
>>
>>>> I am wondering about the options used in these tests.
>>>>
>>>>
>>> The default (no arguments specified) parameters were used to create
>>> the XFS partitions.  Mount options specified are described in the
>>> table.  This was a 64-bit OS.
>>>
>>>
>> I think it is a good idea to match the raid stripe size and give some
>> indication of how many disks are in the array. E.g:
>>
>> For a 4 disk system with 256K stripe size I used:
>>
>>  $ mkfs.xfs -d su=256k,sw=2 /dev/mdx
>>
>> which performed about 2-3 times quicker than the default (I did try
>> sw=4
>> as well, but didn't notice any difference compared to sw=2).
>>
>
> [Greg says]
> I thought that xfs picked up those details when using md and a soft-raid
> configuration.
>
>
>
>
>
>
You are right, it does (I may be recalling performance from my other
machine that has a 3Ware card - this was a couple of years ago...).
Anyway, I'm thinking that for the hardware RAID tests they may need to
be specified explicitly.
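
(One way to check what mkfs.xfs actually picked up, on md or on a hardware
controller - the mount point is an assumption:)

  # sunit/swidth are reported in file system blocks; 0 means "no striping detected"
  $ xfs_info /mnt/data | grep -E 'sunit|swidth'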

Cheers

Mark

Re: file system and raid performance

From: Mark Kirkwood

Mark Kirkwood wrote:
> You are right, it does (I may be recalling performance from my other
> machine that has a 3Ware card - this was a couple of years ago...)
> Anyway, I'm thinking for the Hardware raid tests they may need to be
> specified.
>
>

FWIW - of course this is somewhat academic given that the single-disk
xfs test failed!  I'm puzzled: on a Gentoo system of similar
configuration (2.6.25-gentoo-r6), running the fio tests slightly
modified for my setup (2-CPU PIII, 2G RAM, 4x ATA disks in RAID0, all
xfs filesystems - I changed the file sizes to 4G and the number of
processes to 4), all the tests that failed on Mark's HP work on my
Supermicro P2TDER + Promise TX4000.  In fact the performance is pretty
reasonable on the old girl as well (seq read is 142MB/s and the random
read/write is 12.7/12.0 MB/s).
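
(Roughly the kind of fio invocations being compared - the block sizes, paths
and other parameters here are assumptions, not the wiki's actual job files:)

  $ fio --name=seq-read --rw=read --bs=256k --size=4g --numjobs=4 \
        --directory=/mnt/xfs --runtime=3600 --group_reporting
  $ fio --name=rand-rw --rw=randrw --bs=8k --size=4g --numjobs=4 \
        --directory=/mnt/xfs --runtime=3600 --group_reporting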

I certainly would like to see some more info on why the xfs tests were
failing - as on most systems I've encountered xfs is a great performer.

regards

Mark

Re: file system and raid performance

From: Mario Weilguni

Mark Kirkwood schrieb:
> Mark Kirkwood wrote:
>> You are right, it does (I may be recalling performance from my other
>> machine that has a 3Ware card - this was a couple of years ago...)
>> Anyway, I'm thinking for the Hardware raid tests they may need to be
>> specified.
>>
>>
>
> FWIW - of course this somewhat academic given that the single disk xfs
> test failed! I'm puzzled - having a Gentoo system of similar
> configuration (2.6.25-gentoo-r6) and running the fio tests a little
> modified for my config (2 cpu PIII 2G RAM with 4x ATA disks RAID0 and
> all xfs filesystems - I changed sizes of files to 4G and no. processes
> to 4) all tests that failed on Marks HP work on my Supermicro P2TDER +
> Promise TX4000. In fact the performance is pretty reasonable on the
> old girl as well (seq read is 142Mb/s and the random read/write is
> 12.7/12.0 Mb/s).
>
> I certainly would like to see some more info on why the xfs tests were
> failing - as on most systems I've encountered xfs is a great performer.
>
> regards
>
> Mark
>
I can second this: we use XFS on nearly all our database servers and
have never encountered the problems mentioned.


Re: file system and raid performance

From: "Mark Wong"

On Thu, Aug 7, 2008 at 3:21 AM, Mario Weilguni <mweilguni@sime.com> wrote:
> Mark Kirkwood schrieb:
>>
>> Mark Kirkwood wrote:
>>>
>>> You are right, it does (I may be recalling performance from my other
>>> machine that has a 3Ware card - this was a couple of years ago...) Anyway,
>>> I'm thinking for the Hardware raid tests they may need to be specified.
>>>
>>>
>>
>> FWIW - of course this somewhat academic given that the single disk xfs
>> test failed! I'm puzzled - having a Gentoo system of similar configuration
>> (2.6.25-gentoo-r6) and running the fio tests a little modified for my config
>> (2 cpu PIII 2G RAM with 4x ATA disks RAID0 and all xfs filesystems - I
>> changed sizes of files to 4G and no. processes to 4) all tests that failed
>> on Marks HP work on my Supermicro P2TDER + Promise TX4000. In fact the
>> performance is pretty reasonable on the old girl as well (seq read is
>> 142Mb/s and the random read/write is 12.7/12.0 Mb/s).
>>
>> I certainly would like to see some more info on why the xfs tests were
>> failing - as on most systems I've encountered xfs is a great performer.
>>
>> regards
>>
>> Mark
>>
> I can second this, we use XFS on nearly all our database servers, and never
> encountered the problems mentioned.

I have heard of one or two situations where the combination of disk
controller and journaling file system caused bizarre behaviors, but
they seem few and far between.  I personally wasn't looking forward to
chasing Linux file system problems, but I can set up an account and
remote management access if anyone else would like to volunteer.

Regards,
Mark

Re: file system and raid performance

From: "Andrej Ricnik-Bay"

To me it still boggles the mind that noatime should actually slow down
activity on ANY file system ... has someone got an explanation for that
kind of behaviour?  As I understand it, keeping atime means that every
read also adds the overhead of a write - most likely to a disk location
slightly off from the position the data was read from ... so how could
that extra overhead speed the process up on average?



Cheers,
Andrej

Re: file system and raid performance

From: "Scott Marlowe"

On Thu, Aug 7, 2008 at 2:59 PM, Andrej Ricnik-Bay
<andrej.groups@gmail.com> wrote:
> To me it still boggles the mind that noatime should actually slow down
> activities on ANY file-system ... has someone got an explanation for
> that kind of behaviour?  As far as I'm concerned this means that even
> to any read I'll add the overhead of a write - most likely in a disk-location
> slightly off of the position that I read the data ... how would that speed
> the process up on average?

noatime turns off the atime write behaviour.  Or did you already know
that and I missed some weird post where noatime somehow managed to
slow down performance?

Re: file system and raid performance

From: "Andrej Ricnik-Bay"

2008/8/8 Scott Marlowe <scott.marlowe@gmail.com>:
> noatime turns off the atime write behaviour.  Or did you already know
> that and I missed some weird post where noatime somehow managed to
> slow down performance?

Scott, I'm quite aware of what noatime does ... you didn't miss a post, but
if you look at Mark's graphs on
http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide
they pretty much all indicate that (unless I completely misinterpret the
meaning and purpose of the labels), independent of the file-system,
using noatime slows read/writes down (on average).




--
Please don't top post, and don't use HTML e-Mail :} Make your quotes concise.

http://www.american.edu/econ/notes/htmlmail.htm

Re: file system and raid performance

From: "Gregory S. Youngblood"

> -----Original Message-----
> From: Mark Wong [mailto:markwkm@gmail.com]
> Sent: Thursday, August 07, 2008 12:37 PM
> To: Mario Weilguni
> Cc: Mark Kirkwood; greg@tcscs.com; david@lang.hm; pgsql-
> performance@postgresql.org; Gabrielle Roth
> Subject: Re: [PERFORM] file system and raid performance
>

> I have heard of one or two situations where the combination of the
> disk controller caused bizarre behaviors with different journaling
> file systems.  They seem so few and far between though.  I personally
> wasn't looking forwarding to chasing Linux file system problems, but I
> can set up an account and remote management access if anyone else
> would like to volunteer.

[Greg says]
Tempting... if no one else takes you up on it by then, I might have some
time in a week or two to experiment and test a couple of things.

One thing I've noticed with a Silicon Image 3124 SATA controller going
through a Silicon Image 3726 port multiplier, using the binary-only
drivers from Silicon Image (until PM support made it into the mainline
kernel - 2.6.24 I think, might have been .25), is that under some heavy
loads it might drop a SATA channel, and if that channel happens to have
a PM on it, it drops 5 drives.  I saw this with a card that had 4
channels, 2 connected to a PM with 5 drives and 2 direct.  It was pretty
random.

Not saying that's happening in this case, but odd things have been known to
happen under unusual usage patterns.




Re: file system and raid performance

From: "Scott Marlowe"

On Thu, Aug 7, 2008 at 3:57 PM, Andrej Ricnik-Bay
<andrej.groups@gmail.com> wrote:
> 2008/8/8 Scott Marlowe <scott.marlowe@gmail.com>:
>> noatime turns off the atime write behaviour.  Or did you already know
>> that and I missed some weird post where noatime somehow managed to
>> slow down performance?
>
> Scott, I'm quite aware of what noatime does ... you didn't miss a post, but
> if you look at Mark's graphs on
> http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide
> they pretty much all indicate that (unless I completely misinterpret the
> meaning and purpose of the labels), independent of the file-system,
> using noatime slows read/writes down (on average).

Interesting.  While a few of the benchmarks look noticeably slower
with noatime (reiserfs for instance), most seem faster in that listing.

I am just now setting up our big database server for work and noticed
MUCH lower performance without noatime.

Re: file system and raid performance

From: Mark Mielke

Andrej Ricnik-Bay wrote:
> 2008/8/8 Scott Marlowe <scott.marlowe@gmail.com>:
>> noatime turns off the atime write behaviour.  Or did you already know
>> that and I missed some weird post where noatime somehow managed to
>> slow down performance?
>
> Scott, I'm quite aware of what noatime does ... you didn't miss a post, but
> if you look at Mark's graphs on
> http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide
> they pretty much all indicate that (unless I completely misinterpret the
> meaning and purpose of the labels), independent of the file-system,
> using noatime slows read/writes down (on average)

That doesn't make sense - if noatime slows things down, then the
analysis is probably wrong.

Now, modern Linux distributions default to "relatime" - which will only
update access time if the access time is currently less than the update
time or something like this.  The effect is that modern Linux
distributions do not benefit from "noatime" as much as they have in the
past.  In this case, "noatime" vs default would probably be measuring %
noise.
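
(The three behaviours being compared, expressed as mount options - the mount
point is an assumption; relatime needs kernel 2.6.20 or later:)

  # strict atime: every read updates the inode's access time
  $ mount -o remount,atime /data
  # relatime: atime is only updated when it is older than mtime/ctime
  $ mount -o remount,relatime /data
  # noatime: never update atime on reads
  $ mount -o remount,noatime /data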

Cheers,
mark

-- 
Mark Mielke <mark@mielke.cc>

Re: file system and raid performance

From: "Mark Wong"

On Thu, Aug 7, 2008 at 1:24 PM, Gregory S. Youngblood <greg@tcscs.com> wrote:
>> -----Original Message-----
>> From: Mark Wong [mailto:markwkm@gmail.com]
>> Sent: Thursday, August 07, 2008 12:37 PM
>> To: Mario Weilguni
>> Cc: Mark Kirkwood; greg@tcscs.com; david@lang.hm; pgsql-
>> performance@postgresql.org; Gabrielle Roth
>> Subject: Re: [PERFORM] file system and raid performance
>>
>
>> I have heard of one or two situations where the combination of the
>> disk controller caused bizarre behaviors with different journaling
>> file systems.  They seem so few and far between though.  I personally
>> wasn't looking forwarding to chasing Linux file system problems, but I
>> can set up an account and remote management access if anyone else
>> would like to volunteer.
>
> [Greg says]
> Tempting... if no one else takes you up on it by then, I might have some
> time in a week or two to experiment and test a couple of things.

Ok, let me know and I'll set you up with access.

Regards,
Mark

Re: file system and raid performance

From: "Mark Wong"

On Thu, Aug 7, 2008 at 3:08 PM, Mark Mielke <mark@mark.mielke.cc> wrote:
> Andrej Ricnik-Bay wrote:
>
> 2008/8/8 Scott Marlowe <scott.marlowe@gmail.com>:
>
>
> noatime turns off the atime write behaviour.  Or did you already know
> that and I missed some weird post where noatime somehow managed to
> slow down performance?
>
>
> Scott, I'm quite aware of what noatime does ... you didn't miss a post, but
> if you look at Mark's graphs on
> http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide
> they pretty much all indicate that (unless I completely misinterpret the
> meaning and purpose of the labels), independent of the file-system,
> using noatime slows read/writes down (on average)
>
> That doesn't make sense - if noatime slows things down, then the analysis is
> probably wrong.
>
> Now, modern Linux distributions default to "relatime" - which will only
> update access time if the access time is currently less than the update time
> or something like this. The effect is that modern Linux distributions do not
> benefit from "noatime" as much as they have in the past. In this case,
> "noatime" vs default would probably be measuring % noise.

Anyone know what to look for in kernel profiles?  There is readprofile
(profile.text) and oprofile (oprofile.kernel and oprofile.user) data
available.  Just click on the results number, then the "raw data" link
for a directory listing of files.  For example, here is one of the
links:

http://osdldbt.sourceforge.net/dl380/3disk/sraid5/ext3-journal/seq-read/fio/profiling/oprofile.kernel
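
(For anyone unfamiliar with those files, a sketch of how such kernel profiles
are typically produced and read - not necessarily the exact procedure used
for these runs:)

  # readprofile requires booting the kernel with the profile=2 parameter
  $ readprofile -r                     # reset the counters
  $ fio some-job.fio                   # run the workload (job file name is made up)
  $ readprofile -m /boot/System.map | sort -nr | head -20   # top kernel symbols by ticks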

Regards,
Mark

Re: file system and raid performance

From: "Mark Wong"

On Thu, Aug 7, 2008 at 3:08 PM, Mark Mielke <mark@mark.mielke.cc> wrote:
> Andrej Ricnik-Bay wrote:
>
> 2008/8/8 Scott Marlowe <scott.marlowe@gmail.com>:
>
>
> noatime turns off the atime write behaviour.  Or did you already know
> that and I missed some weird post where noatime somehow managed to
> slow down performance?
>
>
> Scott, I'm quite aware of what noatime does ... you didn't miss a post, but
> if you look at Mark's graphs on
> http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide
> they pretty much all indicate that (unless I completely misinterpret the
> meaning and purpose of the labels), independent of the file-system,
> using noatime slows read/writes down (on average)
>
> That doesn't make sense - if noatime slows things down, then the analysis is
> probably wrong.
>
> Now, modern Linux distributions default to "relatime" - which will only
> update access time if the access time is currently less than the update time
> or something like this. The effect is that modern Linux distributions do not
> benefit from "noatime" as much as they have in the past. In this case,
> "noatime" vs default would probably be measuring % noise.

Interesting, now how would we see if it is defaulting to "relatime"?

Regards,
Mark

Re: file system and raid performance

From: "Mark Wong"

On Thu, Aug 7, 2008 at 3:08 PM, Mark Mielke <mark@mark.mielke.cc> wrote:
> Andrej Ricnik-Bay wrote:
>
> 2008/8/8 Scott Marlowe <scott.marlowe@gmail.com>:
>
>
> noatime turns off the atime write behaviour.  Or did you already know
> that and I missed some weird post where noatime somehow managed to
> slow down performance?
>
>
> Scott, I'm quite aware of what noatime does ... you didn't miss a post, but
> if you look at Mark's graphs on
> http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide
> they pretty much all indicate that (unless I completely misinterpret the
> meaning and purpose of the labels), independent of the file-system,
> using noatime slows read/writes down (on average)
>
> That doesn't make sense - if noatime slows things down, then the analysis is
> probably wrong.
>
> Now, modern Linux distributions default to "relatime" - which will only
> update access time if the access time is currently less than the update time
> or something like this. The effect is that modern Linux distributions do not
> benefit from "noatime" as much as they have in the past. In this case,
> "noatime" vs default would probably be measuring % noise.

It appears that the default mount behavior on this system is "atime".
Not specifying any options results in neither "relatime" nor "noatime"
being shown in /proc/mounts.  I'm assuming that if the default behavior
were "relatime", it would be shown in /proc/mounts.
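
(A sketch of the check being described - the mount point is an assumption:)

  # the fourth field of /proc/mounts lists the options actually in effect;
  # relatime would appear there explicitly (e.g. "rw,relatime") if it applied
  $ grep ' /mnt/data ' /proc/mounts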

Regards,
Mark

Re: file system and raid performance

From: Greg Smith

On Thu, 7 Aug 2008, Mark Mielke wrote:

> Now, modern Linux distributions default to "relatime"

Right, but Mark's HP test system is running Gentoo.

(ducks)

According to http://brainstorm.ubuntu.com/idea/2369/ relatime is the
default for Fedora 8, Mandriva 2008, Pardus, and Ubuntu 8.04.

Anyway, there aren't many actual files involved in this test, and I
suspect the atime writes are just being cached and only forced out to
disk periodically.  You need to run something that accesses more files
and/or forces a sync to disk regularly to get a more database-like
situation where the atime writes degrade performance.  Note how Joshua
Drake's ext2 vs. ext3 comparison, which does show a large difference
here, was run with iozone's -e parameter, which flushes the writes with
fsync.  I don't see anything like that in the DL380 G5 fio tests.
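
(Along those lines, fio can be told to force writes out periodically - a
sketch only; the interval and the other parameter values are assumptions:)

  # fsync after every 100 write I/Os, so the writes (and associated metadata)
  # actually hit the disk during the run instead of sitting in cache
  $ fio --name=write-sync --rw=write --bs=8k --size=4g --numjobs=4 \
        --fsync=100 --directory=/mnt/data --group_reporting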

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: file system and raid performance

From: Bruce Momjian

Mark Wong wrote:
> On Mon, Aug 4, 2008 at 10:04 PM,  <david@lang.hm> wrote:
> > On Mon, 4 Aug 2008, Mark Wong wrote:
> >
> >> Hi all,
> >>
> >> We've thrown together some results from simple i/o tests on Linux
> >> comparing various file systems, hardware and software raid with a
> >> little bit of volume management:
> >>
> >> http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide

Mark, very useful analysis.  I am curious why you didn't test
'data=writeback' on ext3;  'data=writeback' is the recommended mount
method for that file system, though I see that is not mentioned in our
official documentation.
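
(For reference, switching an ext3 file system to writeback journaling is just
a mount option - the device, mount point and fstab line are illustrative
assumptions:)

  $ mount -o data=writeback /dev/md0 /data
  # or persistently, in /etc/fstab:
  #   /dev/md0  /data  ext3  noatime,data=writeback  0  2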

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: file system and raid performance

From: "Mark Wong"

On Fri, Aug 15, 2008 at 12:22 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Mark Wong wrote:
>> On Mon, Aug 4, 2008 at 10:04 PM,  <david@lang.hm> wrote:
>> > On Mon, 4 Aug 2008, Mark Wong wrote:
>> >
>> >> Hi all,
>> >>
>> >> We've thrown together some results from simple i/o tests on Linux
>> >> comparing various file systems, hardware and software raid with a
>> >> little bit of volume management:
>> >>
>> >> http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide
>
> Mark, very useful analysis.  I am curious why you didn't test
> 'data=writeback' on ext3;  'data=writeback' is the recommended mount
> method for that file system, though I see that is not mentioned in our
> official documentation.

I think the short answer is that I neglected to. :)  I didn't realize
'data=writeback' is the recommended journal mode.  We'll get a result
or two and see how it looks.

Mark

Re: file system and raid performance

From: Greg Smith

On Fri, 15 Aug 2008, Bruce Momjian wrote:

> 'data=writeback' is the recommended mount method for that file system,
> though I see that is not mentioned in our official documentation.

While writeback has good performance characteristics, I don't know that
I'd go so far as to support making that an official recommendation.  The
integrity guarantees of that journaling mode are pretty weak.  Sure, the
database itself should be fine; it's got the WAL as a backup if the
filesystem loses some recently written bits.  But I'd hate to see
somebody switch to that mount option on this project's recommendation
only to find some other files got corrupted on a power loss because of
writeback's limited journalling.  ext3 has plenty of problems already
without picking its least safe mode, and recommending writeback would
need a carefully written warning to that effect.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: file system and raid performance

From: Mark Mielke

Greg Smith wrote:
> On Fri, 15 Aug 2008, Bruce Momjian wrote:
>> 'data=writeback' is the recommended mount method for that file
>> system, though I see that is not mentioned in our official
>> documentation.
> While writeback has good performance characteristics, I don't know
> that I'd go so far as to support making that an official
> recommendation.  The integrity guarantees of that journaling mode are
> pretty weak.  Sure the database itself should be fine; it's got the
> WAL as a backup if the filesytem loses some recently written bits.
> But I'd hate to see somebody switch to that mount option on this
> project's recommendation only to find some other files got corrupted
> on a power loss because of writeback's limited journalling.  ext3 has
> plenty of problem already without picking its least safe mode, and
> recommending writeback would need a carefully written warning to that
> effect.

To contrast - not recommending it means that most people, unaware, will
be running with a less effective mode, and they will base their
performance measurements on that less effective mode.

Perhaps the documentation should only state that "With ext3,
data=writeback is the recommended mode for PostgreSQL.  PostgreSQL
performs its own journalling of data and does not require the additional
guarantees provided by the more conservative ext3 modes.  However, if the
file system is used for any purpose other than PostgreSQL database
storage, the data integrity requirements of these other purposes must be
considered on their own."

Personally, I use data=writeback for most purposes, but use data=journal
for /mail and /home.  In these cases, I find that even the default ext3
mode provides fewer guarantees than I am comfortable with. :-)

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>


Re: file system and raid performance

From: "Mark Wong"

On Fri, Aug 15, 2008 at 12:22 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Mark Wong wrote:
>> On Mon, Aug 4, 2008 at 10:04 PM,  <david@lang.hm> wrote:
>> > On Mon, 4 Aug 2008, Mark Wong wrote:
>> >
>> >> Hi all,
>> >>
>> >> We've thrown together some results from simple i/o tests on Linux
>> >> comparing various file systems, hardware and software raid with a
>> >> little bit of volume management:
>> >>
>> >> http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide
>
> Mark, very useful analysis.  I am curious why you didn't test
> 'data=writeback' on ext3;  'data=writeback' is the recommended mount
> method for that file system, though I see that is not mentioned in our
> official documentation.

I have one set of results with ext3 data=writeback, and it appears that
some of the write tests have less throughput than with data=ordered.
For anyone who wants to look at the detailed results:

http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide

It's under the "Aggregate Bandwidth (MB/s) - RAID 5 (256KB stripe) -
No partition table" table.

Regards,
Mark

Re: file system and raid performance

From: Bruce Momjian

Mark Mielke wrote:
> Greg Smith wrote:
> > On Fri, 15 Aug 2008, Bruce Momjian wrote:
> >> 'data=writeback' is the recommended mount method for that file
> >> system, though I see that is not mentioned in our official
> >> documentation.
> > While writeback has good performance characteristics, I don't know
> > that I'd go so far as to support making that an official
> > recommendation.  The integrity guarantees of that journaling mode are
> > pretty weak.  Sure the database itself should be fine; it's got the
> > WAL as a backup if the filesytem loses some recently written bits.
> > But I'd hate to see somebody switch to that mount option on this
> > project's recommendation only to find some other files got corrupted
> > on a power loss because of writeback's limited journalling.  ext3 has
> > plenty of problem already without picking its least safe mode, and
> > recommending writeback would need a carefully written warning to that
> > effect.
>
> To contrast - not recommending it means that most people unaware will be
> running with a less effective mode, and they will base their performance
> measurements on this less effective mode.
>
> Perhaps the documentation should only state that "With ext3,
> data=writeback is the recommended mode for PostgreSQL. PostgreSQL
> performs its own journalling of data and does not require the additional
> guarantees provided by the more conservative ext3 modes. However, if the
> file system is used for any purpose other than PostregSQL database
> storage, the data integrity requirements of these other purposes must be
> considered on their own."
>
> Personally, I use data=writeback for most purposes, but use data=journal
> for /mail and /home. In these cases, I find even the default ext3 mode
> to be fewer guarantees than I am comfortable with. :-)

I have documented this in the WAL section of the manual, which seemed
like the most logical location.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Index: doc/src/sgml/wal.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/wal.sgml,v
retrieving revision 1.53
diff -c -c -r1.53 wal.sgml
*** doc/src/sgml/wal.sgml    2 May 2008 19:52:37 -0000    1.53
--- doc/src/sgml/wal.sgml    6 Dec 2008 21:32:59 -0000
***************
*** 135,140 ****
--- 135,155 ----
      roll-forward recovery, also known as REDO.)
     </para>

+    <tip>
+     <para>
+      Because <acronym>WAL</acronym> restores database file
+      contents after a crash, it is not necessary to use a
+      journaled filesystem;  in fact, journaling overhead can
+      reduce performance.  For best performance, turn off
+      <emphasis>data</emphasis> journaling as a filesystem mount
+      option, e.g. use <literal>data=writeback</> on Linux.
+      Meta-data journaling (e.g.  file creation and directory
+      modification) is still desirable for faster rebooting after
+      a crash.
+     </para>
+    </tip>
+
+
     <para>
      Using <acronym>WAL</acronym> results in a
      significantly reduced number of disk writes, because only the log

Re: file system and raid performance

From: "M. Edward (Ed) Borasky"

Bruce Momjian wrote:
> Mark Mielke wrote:
>> Greg Smith wrote:
>>> On Fri, 15 Aug 2008, Bruce Momjian wrote:
>>>> 'data=writeback' is the recommended mount method for that file
>>>> system, though I see that is not mentioned in our official
>>>> documentation.
>>> While writeback has good performance characteristics, I don't know
>>> that I'd go so far as to support making that an official
>>> recommendation.  The integrity guarantees of that journaling mode are
>>> pretty weak.  Sure the database itself should be fine; it's got the
>>> WAL as a backup if the filesytem loses some recently written bits.
>>> But I'd hate to see somebody switch to that mount option on this
>>> project's recommendation only to find some other files got corrupted
>>> on a power loss because of writeback's limited journalling.  ext3 has
>>> plenty of problem already without picking its least safe mode, and
>>> recommending writeback would need a carefully written warning to that
>>> effect.
>> To contrast - not recommending it means that most people unaware will be
>> running with a less effective mode, and they will base their performance
>> measurements on this less effective mode.
>>
>> Perhaps the documentation should only state that "With ext3,
>> data=writeback is the recommended mode for PostgreSQL. PostgreSQL
>> performs its own journalling of data and does not require the additional
>> guarantees provided by the more conservative ext3 modes. However, if the
>> file system is used for any purpose other than PostregSQL database
>> storage, the data integrity requirements of these other purposes must be
>> considered on their own."
>>
>> Personally, I use data=writeback for most purposes, but use data=journal
>> for /mail and /home. In these cases, I find even the default ext3 mode
>> to be fewer guarantees than I am comfortable with. :-)
>
> I have documented this in the WAL section of the manual, which seemed
> like the most logical location.
>
>
>
> ------------------------------------------------------------------------
>
>

Ah, but shouldn't a PostgreSQL (or any other database, for that matter)
have its own set of filesystems tuned to the application's I/O patterns?
Sure, there are some people who need to have all of their eggs in one
basket because they can't afford multiple baskets. For them, maybe the
OS defaults are the right choice. But if you're building a
database-specific server, you can optimize the I/O for that.

--
M. Edward (Ed) Borasky, FBG, AB, PTA, PGS, MS, MNLP, NST, ACMC(P), WOM

"A mathematician is a device for turning coffee into theorems." --
Alfréd Rényi via Paul Erdős


Re: file system and raid performance

From: Jean-David Beyer

M. Edward (Ed) Borasky wrote:

> Ah, but shouldn't a PostgreSQL (or any other database, for that matter)
> have its own set of filesystems tuned to the application's I/O patterns?
> Sure, there are some people who need to have all of their eggs in one
> basket because they can't afford multiple baskets. For them, maybe the
> OS defaults are the right choice. But if you're building a
> database-specific server, you can optimize the I/O for that.
>
I used to run IBM's DB2 database management system.  It can use a normal
Linux file system (e.g., ext2 or ext3), but it prefers to manage one or
more partitions itself in raw mode.  This eliminates core-to-core copies
of input and output, lets it organize the disk space as it prefers, and
allows multiple writer processes (typically one per disk drive) and
multiple reader processes (also typically one per drive), potentially
increasing the concurrency of reading, writing, and processing.

My dbms needs are extremely modest (only one database, usually only one
user, all on one machine), so I saw only a little benefit to using DB2,
and there were administrative problems.  I use Red Hat Enterprise Linux,
and the latest version of that (RHEL 5) does not offer raw file systems
anymore, but provides the same thing by other means.  The trouble is, I
would have to buy the latest version of DB2 to be compatible with my
version of Linux.  Instead, I just converted everything to PostgreSQL,
and it works very well.

When I started with this in 1986, I first tried Microsoft Access, but
could not get it to accept the database description I was using.  So I
switched to Linux (for many reasons -- that was just one of them) and
tried PostgreSQL.  At the time, it was useless: one version would not do
views (it accepted the construct, IIRC, but they did not work), and the
other version would do views but would not do something else (I forget
what), so I got Informix, which worked pretty well with Red Hat Linux
5.0.  When I upgraded to RHL 5.2 or 6.0 (I forget which), Informix would
not work (I could not even install it) and I could get no support from
them, so that is why I went to DB2.  When I got tired of trying to keep
DB2 working with RHEL 5, I switched to PostgreSQL, and the dbms itself
worked right out of the box.  I had to adjust my programs very slightly
(I used embedded SQL), but the changes were superficial.

The advantage of using one of the OS's file systems (I use ext2 for the
dbms and ext3 for everything else) is that the dbms developers have to
be ready for only about one file system.  That is a really big
advantage, I imagine.  I also have atime turned off.  The main database
is on 4 small hard drives (about 18 GBytes each), each of which has just
one partition taking the entire drive.  They are all on a single SCSI
controller that also has my backup tape drive on it.  The machine has
two other hard drives (around 80 GBytes each) on another SCSI controller
and nothing else on that controller.  One of those drives has a
partition where mainly the WAL is placed, and another with little stuff.
Those two drives also hold the main partitions for the Linux stuff,
/tmp, /var, and /home, but the one with the WAL on it is just about
unused (it contains /usr/source and stuff like that) when postgres is
running.  That is good enough for me.  If I were in a serious production
environment, I would take everything except the dbms off that machine
and run it on another one.

I cannot make any speed comparisons between PostgreSQL and DB2, because
the machine I ran DB2 on has two 550 MHz processors and 512 MBytes of
RAM running RHL 7.3, while the new machine for postgres has two 3.06 GHz
hyperthreaded Xeon processors and 8 GBytes of RAM running RHEL 5, so a
comparison would be kind of meaningless.

--
   .~.  Jean-David Beyer          Registered Linux User 85642.
   /V\  PGP-Key: 9A2FC99A         Registered Machine   241939.
  /( )\ Shrewsbury, New Jersey    http://counter.li.org
  ^^-^^ 06:50:01 up 4 days, 17:08, 4 users, load average: 4.18, 4.15, 4.07

Re: file system and raid performance

From: "Scott Marlowe"

On Sun, Dec 7, 2008 at 10:59 PM, M. Edward (Ed) Borasky
<znmeb@cesmail.net> wrote:
> Ah, but shouldn't a PostgreSQL (or any other database, for that matter)
> have its own set of filesystems tuned to the application's I/O patterns?
> Sure, there are some people who need to have all of their eggs in one
> basket because they can't afford multiple baskets. For them, maybe the
> OS defaults are the right choice. But if you're building a
> database-specific server, you can optimize the I/O for that.

It's really about a cost/benefit analysis.  20 years ago file systems
were slow and buggy and a database could, with little work, outperform
them.  Nowadays, not so much.  I'm guessing that the extra cost and
effort of maintaining a file system for pgsql outweighs any real gain
you're likely to see performance-wise.

But I'm sure that if you implemented one that outran XFS / ZFS / ext3
et. al. people would want to hear about it.

Re: file system and raid performance

From: david@lang.hm

On Mon, 8 Dec 2008, Scott Marlowe wrote:

> On Sun, Dec 7, 2008 at 10:59 PM, M. Edward (Ed) Borasky
> <znmeb@cesmail.net> wrote:
>> Ah, but shouldn't a PostgreSQL (or any other database, for that matter)
>> have its own set of filesystems tuned to the application's I/O patterns?
>> Sure, there are some people who need to have all of their eggs in one
>> basket because they can't afford multiple baskets. For them, maybe the
>> OS defaults are the right choice. But if you're building a
>> database-specific server, you can optimize the I/O for that.
>
> It's really about a cost / benefits analysis.  20 years ago file
> systems were slow and buggy and a database could, with little work,
> outperform them.  Nowadays, not so much.  I'm guessing that the extra
> cost and effort of maintaining a file system for pgsql outweighs any
> real gain you're likely to see performance wise.

especially with the need to support the new 'filesystem' on many different
OS types.

David Lang

> But I'm sure that if you implemented one that outran XFS / ZFS / ext3
> et. al. people would want to hear about it.
>
>

Re: file system and raid performance

From: "M. Edward (Ed) Borasky"

Scott Marlowe wrote:
> On Sun, Dec 7, 2008 at 10:59 PM, M. Edward (Ed) Borasky
> <znmeb@cesmail.net> wrote:
>> Ah, but shouldn't a PostgreSQL (or any other database, for that matter)
>> have its own set of filesystems tuned to the application's I/O patterns?
>> Sure, there are some people who need to have all of their eggs in one
>> basket because they can't afford multiple baskets. For them, maybe the
>> OS defaults are the right choice. But if you're building a
>> database-specific server, you can optimize the I/O for that.
>
> It's really about a cost / benefits analysis.  20 years ago file
> systems were slow and buggy and a database could, with little work,
> outperform them.  Nowadays, not so much.  I'm guessing that the extra
> cost and effort of maintaining a file system for pgsql outweighs any
> real gain you're likely to see performance wise.
>
> But I'm sure that if you implemented one that outran XFS / ZFS / ext3
> et. al. people would want to hear about it.
>
I guess I wasn't clear -- I didn't mean a PostgreSQL-specific filesystem
design, although BTRFS does have some things that are "RDBMS-friendly".
I meant that one should hand-tune existing filesystems / hardware for
optimum performance on specific workloads. The tablespaces in PostgreSQL
give you that kind of potential granularity, I think.
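
(In practice that granularity looks roughly like this - the paths and names
below are illustrative assumptions:)

  # a separately mounted and tuned file system dedicated to hot data
  $ mkdir /fast/pg_ts && chown postgres:postgres /fast/pg_ts
  $ psql -c "CREATE TABLESPACE fast LOCATION '/fast/pg_ts'"
  $ psql -c "CREATE INDEX foo_idx ON foo (bar) TABLESPACE fast"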

--
M. Edward (Ed) Borasky, FBG, AB, PTA, PGS, MS, MNLP, NST, ACMC(P), WOM

"A mathematician is a device for turning coffee into theorems." --
Alfréd Rényi via Paul Erdős