Thread: file system and raid performance
Hi all, We've thrown together some results from simple i/o tests on Linux comparing various file systems, hardware and software raid with a little bit of volume management: http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide What I'd like to ask of the folks on the list is how relevant is this information in helping make decisions such as "What file system should I use?" "What performance can I expect from this RAID configuration?" I know these kinds of tests won't help answer questions like "Which file system is most reliable?" but we would like to be as helpful as we can. Any suggestions/comments/criticisms for what would be more relevant or interesting are also appreciated. We've started with Linux but we'd also like to hit some other OS's. I'm assuming FreeBSD would be the other popular choice for the DL-380 that we're using. I hope this is helpful. Regards, Mark
On Mon, 4 Aug 2008, Mark Wong wrote: > Hi all, > > We've thrown together some results from simple i/o tests on Linux > comparing various file systems, hardware and software raid with a > little bit of volume management: > > http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide > > What I'd like to ask of the folks on the list is how relevant is this > information in helping make decisions such as "What file system should > I use?" "What performance can I expect from this RAID configuration?" > I know these kinds of tests won't help answer questions like "Which > file system is most reliable?" but we would like to be as helpful as > we can. > > Any suggestions/comments/criticisms for what would be more relevant or > interesting are also appreciated. We've started with Linux but we'd also > like to hit some other OS's. I'm assuming FreeBSD would be the other > popular choice for the DL-380 that we're using. > > I hope this is helpful. it's definitely timely for me (we were having a spirited 'discussion' on this topic at work today ;-) what happened with XFS? you show it as not completing half the tests in the single-disk table, and it's completely missing from the other ones. what OS/kernel were you running? if it was linux, which software raid did you try (md or dm), and did you use lvm or raw partitions? David Lang
I recently ran some tests on Ubuntu Hardy Server (Linux) comparing JFS, XFS, and ZFS+FUSE. It was all 32-bit and on old hardware, plus I only used bonnie++, so the numbers are really only useful for my hardware. What parameters were used to create the XFS partition in these tests? And, what options were used to mount the file system? Was the kernel 32-bit or 64-bit? Given what I've seen with some of the XFS options (like lazy-count), I am wondering about the options used in these tests. Thanks, Greg
On Mon, Aug 4, 2008 at 10:04 PM, <david@lang.hm> wrote: > On Mon, 4 Aug 2008, Mark Wong wrote: >> Hi all, >> >> We've thrown together some results from simple i/o tests on Linux >> comparing various file systems, hardware and software raid with a >> little bit of volume management: >> >> http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide >> >> What I'd like to ask of the folks on the list is how relevant is this >> information in helping make decisions such as "What file system should >> I use?" "What performance can I expect from this RAID configuration?" >> I know these kinds of tests won't help answer questions like "Which >> file system is most reliable?" but we would like to be as helpful as >> we can. >> >> Any suggestions/comments/criticisms for what would be more relevant or >> interesting are also appreciated. We've started with Linux but we'd also >> like to hit some other OS's. I'm assuming FreeBSD would be the other >> popular choice for the DL-380 that we're using. >> >> I hope this is helpful. > > it's definitely timely for me (we were having a spirited 'discussion' on > this topic at work today ;-) > > what happened with XFS? Not exactly sure, I didn't attempt to debug much. I only looked into it enough to see that the fio processes were waiting for something. In one case I let the test run for 24 hours to see if it would stop. Note that I specified to fio not to run longer than an hour. > you show it as not completing half the tests in the single-disk table, and > it's completely missing from the other ones. > > what OS/kernel were you running? This is a Gentoo system, running the 2.6.25-gentoo-r6 kernel. > if it was linux, which software raid did you try (md or dm), and did you use lvm > or raw partitions? We tried mdraid, not device-mapper. So far we have only used raw partitions (whole devices without partitions). Regards, Mark
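As a point of reference, here is a minimal sketch of the kind of mdraid setup described above - an array built on whole devices with no partition table. The device names and chunk size are assumptions for illustration, not the exact commands used for these tests:

# create a 3-disk software RAID 5 array directly on whole devices
$ mdadm --create /dev/md0 --level=5 --raid-devices=3 --chunk=256 /dev/sdb /dev/sdc /dev/sdd
# wait for the initial resync to finish before benchmarking
$ cat /proc/mdstat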
On Mon, Aug 4, 2008 at 10:56 PM, Gregory S. Youngblood <greg@tcscs.com> wrote: > I recently ran some tests on Ubuntu Hardy Server (Linux) comparing JFS, XFS, > and ZFS+FUSE. It was all 32-bit and on old hardware, plus I only used > bonnie++, so the numbers are really only useful for my hardware. > > What parameters were used to create the XFS partition in these tests? And, > what options were used to mount the file system? Was the kernel 32-bit or > 64-bit? Given what I've seen with some of the XFS options (like lazy-count), > I am wondering about the options used in these tests. The default (no arguments specified) parameters were used to create the XFS partitions. Mount options specified are described in the table. This was a 64-bit OS. Regards, Mark
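For anyone curious what geometry a default mkfs.xfs ends up choosing, a quick way to check is xfs_info on the mounted filesystem; the device and mount point below are placeholders:

$ mkfs.xfs /dev/md0          # all defaults, as in these tests
$ mount /dev/md0 /mnt/test
$ xfs_info /mnt/test         # reports the block size, sunit/swidth, log size, etc. that mkfs picked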
On Tue, Aug 5, 2008 at 4:54 AM, Mark Wong <markwkm@gmail.com> wrote: > Hi all, Hi > We've thrown together some results from simple i/o tests on Linux > comparing various file systems, hardware and software raid with a > little bit of volume management: > > http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide > > Any suggestions/comments/criticisms for what would be more relevant or > interesting are also appreciated. We've started with Linux but we'd also > like to hit some other OS's. I'm assuming FreeBSD would be the other > popular choice for the DL-380 that we're using. It would also be interesting to see tests with ext4. Although it is not yet considered stable in the Linux kernel, it should be possible in this case given the kernel version, assuming a version of e2fsprogs that supports it is available. Regards, -- Fernando Ike http://www.midstorm.org/~fike/weblog
> From: Mark Kirkwood [mailto:markir@paradise.net.nz] > Mark Wong wrote: > > On Mon, Aug 4, 2008 at 10:56 PM, Gregory S. Youngblood <greg@tcscs.com> wrote: > >> I recently ran some tests on Ubuntu Hardy Server (Linux) comparing JFS, XFS, and ZFS+FUSE. It was all 32-bit and on old hardware, plus I only used bonnie++, so the numbers are really only useful for my hardware. > >> What parameters were used to create the XFS partition in these tests? And, what options were used to mount the file system? Was the kernel 32-bit or 64-bit? Given what I've seen with some of the XFS options (like lazy-count), I am wondering about the options used in these tests. > > The default (no arguments specified) parameters were used to create the XFS partitions. Mount options specified are described in the table. This was a 64-bit OS. > I think it is a good idea to match the RAID stripe size and give some indication of how many disks are in the array. E.g.: > For a 4-disk system with 256K stripe size I used: > $ mkfs.xfs -d su=256k,sw=2 /dev/mdx > which performed about 2-3 times quicker than the default (I did try sw=4 as well, but didn't notice any difference compared to sw=2). [Greg says] I thought that xfs picked up those details when using md and a soft-raid configuration.
Mark Wong wrote: > On Mon, Aug 4, 2008 at 10:56 PM, Gregory S. Youngblood <greg@tcscs.com> wrote: > >> I recently ran some tests on Ubuntu Hardy Server (Linux) comparing JFS, XFS, >> and ZFS+FUSE. It was all 32-bit and on old hardware, plus I only used >> bonnie++, so the numbers are really only useful for my hardware. >> >> What parameters were used to create the XFS partition in these tests? And, >> what options were used to mount the file system? Was the kernel 32-bit or >> 64-bit? Given what I've seen with some of the XFS options (like lazy-count), >> I am wondering about the options used in these tests. >> > > The default (no arguments specified) parameters were used to create > the XFS partitions. Mount options specified are described in the > table. This was a 64-bit OS. > > Regards, > Mark > > I think it is a good idea to match the RAID stripe size and give some indication of how many disks are in the array. E.g.: For a 4-disk system with 256K stripe size I used: $ mkfs.xfs -d su=256k,sw=2 /dev/mdx which performed about 2-3 times quicker than the default (I did try sw=4 as well, but didn't notice any difference compared to sw=2). regards Mark
Gregory S. Youngblood wrote: >> From: Mark Kirkwood [mailto:markir@paradise.net.nz] >> Mark Wong wrote: >>> On Mon, Aug 4, 2008 at 10:56 PM, Gregory S. Youngblood <greg@tcscs.com> wrote: >>>> I recently ran some tests on Ubuntu Hardy Server (Linux) comparing JFS, XFS, and ZFS+FUSE. It was all 32-bit and on old hardware, plus I only used bonnie++, so the numbers are really only useful for my hardware. >>>> What parameters were used to create the XFS partition in these tests? And, what options were used to mount the file system? Was the kernel 32-bit or 64-bit? Given what I've seen with some of the XFS options (like lazy-count), I am wondering about the options used in these tests. >>> The default (no arguments specified) parameters were used to create the XFS partitions. Mount options specified are described in the table. This was a 64-bit OS. >> I think it is a good idea to match the RAID stripe size and give some indication of how many disks are in the array. E.g.: >> For a 4-disk system with 256K stripe size I used: >> $ mkfs.xfs -d su=256k,sw=2 /dev/mdx >> which performed about 2-3 times quicker than the default (I did try sw=4 as well, but didn't notice any difference compared to sw=2). > [Greg says] > I thought that xfs picked up those details when using md and a soft-raid > configuration. You are right, it does (I may be recalling performance from my other machine that has a 3Ware card - this was a couple of years ago...) Anyway, I'm thinking for the hardware RAID tests they may need to be specified. Cheers Mark
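For a hardware RAID volume, where mkfs.xfs cannot ask the md layer for the stripe geometry, the alignment would have to be spelled out by hand. A hypothetical example for a 3-disk RAID 5 with a 256KB stripe on an HP Smart Array device (the device name and geometry are assumptions, not the configuration used on the wiki):

# sw = number of data-bearing disks (3 disks minus 1 parity)
$ mkfs.xfs -d su=256k,sw=2 /dev/cciss/c0d1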
Mark Kirkwood wrote: > You are right, it does (I may be recalling performance from my other > machine that has a 3Ware card - this was a couple of years ago...) > Anyway, I'm thinking for the hardware RAID tests they may need to be > specified. FWIW - of course this is somewhat academic given that the single disk xfs test failed! I'm puzzled - having a Gentoo system of similar configuration (2.6.25-gentoo-r6) and running the fio tests a little modified for my config (2 cpu PIII 2G RAM with 4x ATA disks RAID0 and all xfs filesystems - I changed sizes of files to 4G and no. processes to 4), all tests that failed on Mark's HP work on my Supermicro P2TDER + Promise TX4000. In fact the performance is pretty reasonable on the old girl as well (seq read is 142MB/s and the random read/write is 12.7/12.0 MB/s). I certainly would like to see some more info on why the xfs tests were failing - as on most systems I've encountered xfs is a great performer. regards Mark
Mark Kirkwood schrieb: > Mark Kirkwood wrote: >> You are right, it does (I may be recalling performance from my other >> machine that has a 3Ware card - this was a couple of years ago...) >> Anyway, I'm thinking for the hardware RAID tests they may need to be >> specified. > FWIW - of course this is somewhat academic given that the single disk xfs > test failed! I'm puzzled - having a Gentoo system of similar > configuration (2.6.25-gentoo-r6) and running the fio tests a little > modified for my config (2 cpu PIII 2G RAM with 4x ATA disks RAID0 and > all xfs filesystems - I changed sizes of files to 4G and no. processes > to 4), all tests that failed on Mark's HP work on my Supermicro P2TDER + > Promise TX4000. In fact the performance is pretty reasonable on the > old girl as well (seq read is 142MB/s and the random read/write is > 12.7/12.0 MB/s). > I certainly would like to see some more info on why the xfs tests were > failing - as on most systems I've encountered xfs is a great performer. > regards > Mark I can second this: we use XFS on nearly all our database servers and have never encountered the problems mentioned.
On Thu, Aug 7, 2008 at 3:21 AM, Mario Weilguni <mweilguni@sime.com> wrote: > Mark Kirkwood schrieb: >> Mark Kirkwood wrote: >>> You are right, it does (I may be recalling performance from my other >>> machine that has a 3Ware card - this was a couple of years ago...) Anyway, >>> I'm thinking for the hardware RAID tests they may need to be specified. >> FWIW - of course this is somewhat academic given that the single disk xfs >> test failed! I'm puzzled - having a Gentoo system of similar configuration >> (2.6.25-gentoo-r6) and running the fio tests a little modified for my config >> (2 cpu PIII 2G RAM with 4x ATA disks RAID0 and all xfs filesystems - I >> changed sizes of files to 4G and no. processes to 4), all tests that failed >> on Mark's HP work on my Supermicro P2TDER + Promise TX4000. In fact the >> performance is pretty reasonable on the old girl as well (seq read is >> 142MB/s and the random read/write is 12.7/12.0 MB/s). >> I certainly would like to see some more info on why the xfs tests were >> failing - as on most systems I've encountered xfs is a great performer. >> regards >> Mark > I can second this: we use XFS on nearly all our database servers and have never > encountered the problems mentioned. I have heard of one or two situations where the combination of disk controller and journaling file system caused bizarre behaviors. They seem few and far between, though. I personally wasn't looking forward to chasing Linux file system problems, but I can set up an account and remote management access if anyone else would like to volunteer. Regards, Mark
To me it still boggles the mind that noatime should actually slow down activity on ANY file system ... has someone got an explanation for that kind of behaviour? As far as I'm concerned, atime means that every read also incurs the overhead of a write - most likely to a disk location slightly off from the position where I read the data ... how would that speed the process up on average? Cheers, Andrej
On Thu, Aug 7, 2008 at 2:59 PM, Andrej Ricnik-Bay <andrej.groups@gmail.com> wrote: > To me it still boggles the mind that noatime should actually slow down > activities on ANY file-system ... has someone got an explanation for > that kind of behaviour? As far as I'm concerned this means that even > to any read I'll add the overhead of a write - most likely in a disk-location > slightly off of the position that I read the data ... how would that speed > the process up on average? noatime turns off the atime write behaviour. Or did you already know that and I missed some weird post where noatime somehow managed to slow down performance?
2008/8/8 Scott Marlowe <scott.marlowe@gmail.com>: > noatime turns off the atime write behaviour. Or did you already know > that and I missed some weird post where noatime somehow managed to > slow down performance? Scott, I'm quite aware of what noatime does ... you didn't miss a post, but if you look at Mark's graphs on http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide they pretty much all indicate that (unless I completely misinterpret the meaning and purpose of the labels), independent of the file-system, using noatime slows read/writes down (on average). -- Please don't top post, and don't use HTML e-Mail :} Make your quotes concise. http://www.american.edu/econ/notes/htmlmail.htm
> -----Original Message----- > From: Mark Wong [mailto:markwkm@gmail.com] > Sent: Thursday, August 07, 2008 12:37 PM > To: Mario Weilguni > Cc: Mark Kirkwood; greg@tcscs.com; david@lang.hm; pgsql-performance@postgresql.org; Gabrielle Roth > Subject: Re: [PERFORM] file system and raid performance > > I have heard of one or two situations where the combination of disk > controller and journaling file system caused bizarre behaviors. They seem > few and far between, though. I personally wasn't looking forward to > chasing Linux file system problems, but I can set up an account and remote > management access if anyone else would like to volunteer. [Greg says] Tempting... if no one else takes you up on it by then, I might have some time in a week or two to experiment and test a couple of things. One thing I've noticed with a Silicon Image 3124 SATA controller going through a Silicon Image 3726 port multiplier with the binary-only drivers from Silicon Image (until the PM support made it into the mainline kernel - 2.6.24 I think, might have been .25) is that under some heavy loads it might drop a SATA channel, and if that channel happens to have a PM on it, it drops 5 drives. I saw this with a card that had 4 channels, 2 connected to a PM w/5 drives and 2 direct. It was pretty random. Not saying that's happening in this case, but odd things have been known to happen under unusual usage patterns.
On Thu, Aug 7, 2008 at 3:57 PM, Andrej Ricnik-Bay <andrej.groups@gmail.com> wrote: > 2008/8/8 Scott Marlowe <scott.marlowe@gmail.com>: >> noatime turns off the atime write behaviour. Or did you already know >> that and I missed some weird post where noatime somehow managed to >> slow down performance? > > Scott, I'm quite aware of what noatime does ... you didn't miss a post, but > if you look at Mark's graphs on > http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide > they pretty much all indicate that (unless I completely misinterpret the > meaning and purpose of the labels), independent of the file-system, > using noatime slows read/writes down (on average). Interesting. While a few of the benchmarks look noticeably slower with noatime (reiserfs for instance), most seem faster in that listing. I am just now setting up our big database server for work and noticed MUCH lower performance without noatime.
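For anyone following along, noatime is applied either at mount time or persistently in /etc/fstab; the device and mount point below are placeholders:

$ mount -o remount,noatime,nodiratime /dev/md0 /mnt/pgdata
and the equivalent /etc/fstab entry:
/dev/md0   /mnt/pgdata   xfs   noatime,nodiratime   0 0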
Andrej Ricnik-Bay wrote: > 2008/8/8 Scott Marlowe <scott.marlowe@gmail.com>: >> noatime turns off the atime write behaviour. Or did you already know >> that and I missed some weird post where noatime somehow managed to >> slow down performance? > Scott, I'm quite aware of what noatime does ... you didn't miss a post, but > if you look at Mark's graphs on > http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide > they pretty much all indicate that (unless I completely misinterpret the > meaning and purpose of the labels), independent of the file-system, > using noatime slows read/writes down (on average). That doesn't make sense - if noatime slows things down, then the analysis is probably wrong. Now, modern Linux distributions default to "relatime" - which will only update the access time if it is currently less than the update time, or something like this. The effect is that modern Linux distributions do not benefit from "noatime" as much as they have in the past. In this case, "noatime" vs default would probably be measuring % noise. Cheers, mark -- Mark Mielke <mark@mielke.cc>
On Thu, Aug 7, 2008 at 1:24 PM, Gregory S. Youngblood <greg@tcscs.com> wrote: >> -----Original Message----- >> From: Mark Wong [mailto:markwkm@gmail.com] >> Sent: Thursday, August 07, 2008 12:37 PM >> To: Mario Weilguni >> Cc: Mark Kirkwood; greg@tcscs.com; david@lang.hm; pgsql-performance@postgresql.org; Gabrielle Roth >> Subject: Re: [PERFORM] file system and raid performance >> I have heard of one or two situations where the combination of disk >> controller and journaling file system caused bizarre behaviors. They seem >> few and far between, though. I personally wasn't looking forward to >> chasing Linux file system problems, but I can set up an account and remote >> management access if anyone else would like to volunteer. > [Greg says] > Tempting... if no one else takes you up on it by then, I might have some > time in a week or two to experiment and test a couple of things. Ok, let me know and I'll set you up with access. Regards, Mark
On Thu, Aug 7, 2008 at 3:08 PM, Mark Mielke <mark@mark.mielke.cc> wrote: > Andrej Ricnik-Bay wrote: > > 2008/8/8 Scott Marlowe <scott.marlowe@gmail.com>: > > > noatime turns off the atime write behaviour. Or did you already know > that and I missed some weird post where noatime somehow managed to > slow down performance? > > > Scott, I'm quite aware of what noatime does ... you didn't miss a post, but > if you look at Mark's graphs on > http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide > they pretty much all indicate that (unless I completely misinterpret the > meaning and purpose of the labels), independent of the file-system, > using noatime slows read/writes down (on average) > > That doesn't make sense - if noatime slows things down, then the analysis is > probably wrong. > > Now, modern Linux distributions default to "relatime" - which will only > update access time if the access time is currently less than the update time > or something like this. The effect is that modern Linux distributions do not > benefit from "noatime" as much as they have in the past. In this case, > "noatime" vs default would probably be measuring % noise. Anyone know what to look for in kernel profiles? There is readprofile (profile.text) and oprofile (oprofile.kernel and oprofile.user) data available. Just click on the results number, then the "raw data" link for a directory listing of files. For example, here is one of the links: http://osdldbt.sourceforge.net/dl380/3disk/sraid5/ext3-journal/seq-read/fio/profiling/oprofile.kernel Regards, Mark
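For the readprofile data, one common way to find the heaviest kernel functions is simply to sort the output by tick count. This is a rough sketch, assuming the test kernel was booted with profiling enabled (profile=2) so that readprofile has something to read:

$ readprofile -r                       # reset the counters before a test run
$ readprofile -m /boot/System.map-$(uname -r) | sort -nr | head -20   # top kernel functions by ticks, after the run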
On Thu, Aug 7, 2008 at 3:08 PM, Mark Mielke <mark@mark.mielke.cc> wrote: > Andrej Ricnik-Bay wrote: > > 2008/8/8 Scott Marlowe <scott.marlowe@gmail.com>: > > > noatime turns off the atime write behaviour. Or did you already know > that and I missed some weird post where noatime somehow managed to > slow down performance? > > > Scott, I'm quite aware of what noatime does ... you didn't miss a post, but > if you look at Mark's graphs on > http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide > they pretty much all indicate that (unless I completely misinterpret the > meaning and purpose of the labels), independent of the file-system, > using noatime slows read/writes down (on average) > > That doesn't make sense - if noatime slows things down, then the analysis is > probably wrong. > > Now, modern Linux distributions default to "relatime" - which will only > update access time if the access time is currently less than the update time > or something like this. The effect is that modern Linux distributions do not > benefit from "noatime" as much as they have in the past. In this case, > "noatime" vs default would probably be measuring % noise. Interesting, now how would we see if it is defaulting to "relatime"? Regards, Mark
On Thu, Aug 7, 2008 at 3:08 PM, Mark Mielke <mark@mark.mielke.cc> wrote: > Andrej Ricnik-Bay wrote: > > 2008/8/8 Scott Marlowe <scott.marlowe@gmail.com>: > > > noatime turns off the atime write behaviour. Or did you already know > that and I missed some weird post where noatime somehow managed to > slow down performance? > > > Scott, I'm quite aware of what noatime does ... you didn't miss a post, but > if you look at Mark's graphs on > http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide > they pretty much all indicate that (unless I completely misinterpret the > meaning and purpose of the labels), independent of the file-system, > using noatime slows read/writes down (on average) > > That doesn't make sense - if noatime slows things down, then the analysis is > probably wrong. > > Now, modern Linux distributions default to "relatime" - which will only > update access time if the access time is currently less than the update time > or something like this. The effect is that modern Linux distributions do not > benefit from "noatime" as much as they have in the past. In this case, > "noatime" vs default would probably be measuring % noise. It appears that the default mount option on this system is "atime". Not specifying any options, "relatime" or "noatime", results in neither being shown in /proc/mounts. I'm assuming if the default behavior was to use "relatime" that it would be shown in /proc/mounts. Regards, Mark
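For anyone who wants to check their own system, the effective atime behaviour shows up in the mount option list; the mount point here is a placeholder:

$ grep ' /mnt/test ' /proc/mounts     # relatime or noatime appears in the options field if it is in effect
$ mount | grep /mnt/test              # same information from the mount table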
On Thu, 7 Aug 2008, Mark Mielke wrote: > Now, modern Linux distributions default to "relatime" Right, but Mark's HP test system is running Gentoo. (ducks) According to http://brainstorm.ubuntu.com/idea/2369/ relatime is the default for Fedora 8, Mandriva 2008, Pardus, and Ubuntu 8.04. Anyway, there aren't many actual files involved in this test, and I suspect the atime writes are just being cached and only forced out to disk periodically. You need to run something that accesses more files and/or regularly forces sync to disk to get a more database-like situation where the atime writes degrade performance. Note how Joshua Drake's ext2 vs. ext3 comparison, which does show a large difference here, was run with iozone's -e parameter that flushes the writes with fsync. I don't see anything like that in the DL380 G5 fio tests. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
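A minimal fio invocation along those lines, forcing data out with periodic fsync calls so journal and atime traffic cannot simply sit in the cache, might look like the following; the directory, sizes and job count are placeholders, not the job files used for the wiki results:

$ fio --name=seq-write-sync --directory=/mnt/test --rw=write --bs=8k --size=1g --numjobs=4 --fsync=100
# --fsync=100 issues an fsync() after every 100 writes, similar in spirit to iozone's -e flag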
Mark Wong wrote: > On Mon, Aug 4, 2008 at 10:04 PM, <david@lang.hm> wrote: > > On Mon, 4 Aug 2008, Mark Wong wrote: > > > >> Hi all, > >> > >> We've thrown together some results from simple i/o tests on Linux > >> comparing various file systems, hardware and software raid with a > >> little bit of volume management: > >> > >> http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide Mark, very useful analysis. I am curious why you didn't test 'data=writeback' on ext3; 'data=writeback' is the recommended mount method for that file system, though I see that is not mentioned in our official documentation. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Fri, Aug 15, 2008 at 12:22 PM, Bruce Momjian <bruce@momjian.us> wrote: > Mark Wong wrote: >> On Mon, Aug 4, 2008 at 10:04 PM, <david@lang.hm> wrote: >> > On Mon, 4 Aug 2008, Mark Wong wrote: >> > >> >> Hi all, >> >> >> >> We've thrown together some results from simple i/o tests on Linux >> >> comparing various file systems, hardware and software raid with a >> >> little bit of volume management: >> >> >> >> http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide > > Mark, very useful analysis. I am curious why you didn't test > 'data=writeback' on ext3; 'data=writeback' is the recommended mount > method for that file system, though I see that is not mentioned in our > official documentation. I think the short answer is that I neglected to. :) I didn't realize 'data=writeback' is the recommended journal mode. We'll get a result or two and see how it looks. Mark
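For a dedicated data partition this is just a mount option; a sketch with placeholder device and mount point (note that the data= mode of an already-mounted root filesystem cannot simply be changed with a remount):

$ umount /mnt/pgdata
$ mount -o noatime,data=writeback /dev/md0 /mnt/pgdata
and the matching /etc/fstab entry:
/dev/md0   /mnt/pgdata   ext3   noatime,data=writeback   0 0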
On Fri, 15 Aug 2008, Bruce Momjian wrote: > 'data=writeback' is the recommended mount method for that file system, > though I see that is not mentioned in our official documentation. While writeback has good performance characteristics, I don't know that I'd go so far as to support making that an official recommendation. The integrity guarantees of that journaling mode are pretty weak. Sure, the database itself should be fine; it's got the WAL as a backup if the filesystem loses some recently written bits. But I'd hate to see somebody switch to that mount option on this project's recommendation only to find some other files got corrupted on a power loss because of writeback's limited journalling. ext3 has plenty of problems already without picking its least safe mode, and recommending writeback would need a carefully written warning to that effect. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Greg Smith wrote: > On Fri, 15 Aug 2008, Bruce Momjian wrote: >> 'data=writeback' is the recommended mount method for that file >> system, though I see that is not mentioned in our official >> documentation. > While writeback has good performance characteristics, I don't know > that I'd go so far as to support making that an official > recommendation. The integrity guarantees of that journaling mode are > pretty weak. Sure, the database itself should be fine; it's got the > WAL as a backup if the filesystem loses some recently written bits. > But I'd hate to see somebody switch to that mount option on this > project's recommendation only to find some other files got corrupted > on a power loss because of writeback's limited journalling. ext3 has > plenty of problems already without picking its least safe mode, and > recommending writeback would need a carefully written warning to that > effect. To contrast - not recommending it means that most people who are unaware will be running with a less effective mode, and they will base their performance measurements on this less effective mode. Perhaps the documentation should only state that "With ext3, data=writeback is the recommended mode for PostgreSQL. PostgreSQL performs its own journalling of data and does not require the additional guarantees provided by the more conservative ext3 modes. However, if the file system is used for any purpose other than PostgreSQL database storage, the data integrity requirements of these other purposes must be considered on their own." Personally, I use data=writeback for most purposes, but use data=journal for /mail and /home. In these cases, I find even the default ext3 mode to offer fewer guarantees than I am comfortable with. :-) Cheers, mark -- Mark Mielke <mark@mielke.cc>
On Fri, Aug 15, 2008 at 12:22 PM, Bruce Momjian <bruce@momjian.us> wrote: > Mark Wong wrote: >> On Mon, Aug 4, 2008 at 10:04 PM, <david@lang.hm> wrote: >> > On Mon, 4 Aug 2008, Mark Wong wrote: >> > >> >> Hi all, >> >> >> >> We've thrown together some results from simple i/o tests on Linux >> >> comparing various file systems, hardware and software raid with a >> >> little bit of volume management: >> >> >> >> http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide > > Mark, very useful analysis. I am curious why you didn't test > 'data=writeback' on ext3; 'data=writeback' is the recommended mount > method for that file system, though I see that is not mentioned in our > official documentation. I have one set of results with ext3 data=writeback and it appears that some of the write tests have less throughput than data=ordered. For anyone who wants to look at the results details: http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide it's under the "Aggregate Bandwidth (MB/s) - RAID 5 (256KB stripe) - No partition table" table. Regards, Mark
Mark Mielke wrote: > Greg Smith wrote: > > On Fri, 15 Aug 2008, Bruce Momjian wrote: > >> 'data=writeback' is the recommended mount method for that file > >> system, though I see that is not mentioned in our official > >> documentation. > > While writeback has good performance characteristics, I don't know > > that I'd go so far as to support making that an official > > recommendation. The integrity guarantees of that journaling mode are > > pretty weak. Sure the database itself should be fine; it's got the > > WAL as a backup if the filesytem loses some recently written bits. > > But I'd hate to see somebody switch to that mount option on this > > project's recommendation only to find some other files got corrupted > > on a power loss because of writeback's limited journalling. ext3 has > > plenty of problem already without picking its least safe mode, and > > recommending writeback would need a carefully written warning to that > > effect. > > To contrast - not recommending it means that most people unaware will be > running with a less effective mode, and they will base their performance > measurements on this less effective mode. > > Perhaps the documentation should only state that "With ext3, > data=writeback is the recommended mode for PostgreSQL. PostgreSQL > performs its own journalling of data and does not require the additional > guarantees provided by the more conservative ext3 modes. However, if the > file system is used for any purpose other than PostregSQL database > storage, the data integrity requirements of these other purposes must be > considered on their own." > > Personally, I use data=writeback for most purposes, but use data=journal > for /mail and /home. In these cases, I find even the default ext3 mode > to be fewer guarantees than I am comfortable with. :-) I have documented this in the WAL section of the manual, which seemed like the most logical location. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. + Index: doc/src/sgml/wal.sgml =================================================================== RCS file: /cvsroot/pgsql/doc/src/sgml/wal.sgml,v retrieving revision 1.53 diff -c -c -r1.53 wal.sgml *** doc/src/sgml/wal.sgml 2 May 2008 19:52:37 -0000 1.53 --- doc/src/sgml/wal.sgml 6 Dec 2008 21:32:59 -0000 *************** *** 135,140 **** --- 135,155 ---- roll-forward recovery, also known as REDO.) </para> + <tip> + <para> + Because <acronym>WAL</acronym> restores database file + contents after a crash, it is not necessary to use a + journaled filesystem; in fact, journaling overhead can + reduce performance. For best performance, turn off + <emphasis>data</emphasis> journaling as a filesystem mount + option, e.g. use <literal>data=writeback</> on Linux. + Meta-data journaling (e.g. file creation and directory + modification) is still desirable for faster rebooting after + a crash. + </para> + </tip> + + <para> Using <acronym>WAL</acronym> results in a significantly reduced number of disk writes, because only the log
Bruce Momjian wrote: > Mark Mielke wrote: >> Greg Smith wrote: >>> On Fri, 15 Aug 2008, Bruce Momjian wrote: >>>> 'data=writeback' is the recommended mount method for that file >>>> system, though I see that is not mentioned in our official >>>> documentation. >>> While writeback has good performance characteristics, I don't know >>> that I'd go so far as to support making that an official >>> recommendation. The integrity guarantees of that journaling mode are >>> pretty weak. Sure the database itself should be fine; it's got the >>> WAL as a backup if the filesytem loses some recently written bits. >>> But I'd hate to see somebody switch to that mount option on this >>> project's recommendation only to find some other files got corrupted >>> on a power loss because of writeback's limited journalling. ext3 has >>> plenty of problem already without picking its least safe mode, and >>> recommending writeback would need a carefully written warning to that >>> effect. >> To contrast - not recommending it means that most people unaware will be >> running with a less effective mode, and they will base their performance >> measurements on this less effective mode. >> >> Perhaps the documentation should only state that "With ext3, >> data=writeback is the recommended mode for PostgreSQL. PostgreSQL >> performs its own journalling of data and does not require the additional >> guarantees provided by the more conservative ext3 modes. However, if the >> file system is used for any purpose other than PostregSQL database >> storage, the data integrity requirements of these other purposes must be >> considered on their own." >> >> Personally, I use data=writeback for most purposes, but use data=journal >> for /mail and /home. In these cases, I find even the default ext3 mode >> to be fewer guarantees than I am comfortable with. :-) > > I have documented this in the WAL section of the manual, which seemed > like the most logical location. > > > > ------------------------------------------------------------------------ > > Ah, but shouldn't a PostgreSQL (or any other database, for that matter) have its own set of filesystems tuned to the application's I/O patterns? Sure, there are some people who need to have all of their eggs in one basket because they can't afford multiple baskets. For them, maybe the OS defaults are the right choice. But if you're building a database-specific server, you can optimize the I/O for that. -- M. Edward (Ed) Borasky, FBG, AB, PTA, PGS, MS, MNLP, NST, ACMC(P), WOM "A mathematician is a device for turning coffee into theorems." -- Alfréd Rényi via Paul Erdős
M. Edward (Ed) Borasky wrote: > Ah, but shouldn't a PostgreSQL (or any other database, for that matter) > have its own set of filesystems tuned to the application's I/O patterns? > Sure, there are some people who need to have all of their eggs in one > basket because they can't afford multiple baskets. For them, maybe the > OS defaults are the right choice. But if you're building a > database-specific server, you can optimize the I/O for that. I used to run IBM's DB2 database management system. It can use a normal Linux file system (e.g., ext2 or ext3), but it prefers to run a partition (or more, preferably more than one) itself in raw mode. This eliminates core-to-core copies of input and output, organizes the disk space as it prefers, allows multiple writer processes (typically one per disk drive) and multiple reader processes (also typically one per drive), and potentially increases the concurrency of reading, writing, and processing. My dbms needs are extremely modest (only one database, usually only one user, all on one machine), so I saw only a little benefit to using DB2, and there were administrative problems. I use Red Hat Enterprise Linux, and the latest version of that (RHEL 5) does not offer raw file systems anymore, but provides the same thing by other means. Trouble is, I would have to buy the latest version of DB2 to be compatible with my version of Linux. Instead, I just converted everything to postgreSQL, and it works very well. When I started with this in 1986, I first tried Microsoft Access, but could not get it to accept the database description I was using. So I switched to Linux (for many reasons -- that was just one of them) and tried postgreSQL. At the time, it was useless. One version would not do views (it accepted the construct, IIRC, but they did not work), and the other version would do views, but would not do something else (I forget what), so I got Informix, which worked pretty well with Red Hat Linux 5.0. When I upgraded to RHL 5.2 or 6.0 (I forget which), Informix would not work (I could not even install it) and I could get no support from them, so that is why I went to DB2. When I got tired of trying to keep DB2 working with RHEL 5, I switched to postgreSQL, and the dbms itself worked right out of the box. I had to diddle my programs very slightly (I used embedded SQL), but the changes were superficial here and there. The advantage of using one of the OS's file systems (I use ext2 for the dbms and ext3 for everything else) is that the dbms developers have to be ready for only about one file system. That is a really big advantage, I imagine. I also have atime turned off. The main database is on 4 small hard drives (about 18 GBytes each), each of which has just one partition taking the entire drive. They are all on a single SCSI controller that also has my backup tape drive on it. The machine has two other hard drives (around 80 GBytes each) on another SCSI controller and nothing else on that controller. One of the drives has a partition on it where mainly the WAL is placed, and another with little stuff. Those two drives have other partitions for the Linux stuff, /tmp, /var, and /home as the main partitions on them, but the one with the WAL on it is just about unused (it contains /usr/source and stuff like that) when postgres is running. That is good enough for me. If I were in a serious production environment, I would take everything except the dbms off that machine and run it on another one.
I cannot make any speed comparisons between postgreSQL and DB2, because the machine I ran DB2 on has two 550 MHz processors and 512 Megabytes RAM running RHL 7.3, and the new machine for postgres has two 3.06 GHz hyperthreaded Xeon processors and 8 GBytes RAM running RHEL 5, so a comparison would be kind of meaningless. -- .~. Jean-David Beyer Registered Linux User 85642. /V\ PGP-Key: 9A2FC99A Registered Machine 241939. /( )\ Shrewsbury, New Jersey http://counter.li.org ^^-^^ 06:50:01 up 4 days, 17:08, 4 users, load average: 4.18, 4.15, 4.07
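For what it's worth, the usual low-tech way to put the WAL on its own spindle, as described above, is to move pg_xlog and leave a symlink behind. The paths and init script name below are placeholders for a typical RHEL-style install, not the exact layout described in the mail:

$ /etc/init.d/postgresql stop
$ mv /var/lib/pgsql/data/pg_xlog /mnt/waldisk/pg_xlog
$ ln -s /mnt/waldisk/pg_xlog /var/lib/pgsql/data/pg_xlog
$ /etc/init.d/postgresql start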
On Sun, Dec 7, 2008 at 10:59 PM, M. Edward (Ed) Borasky <znmeb@cesmail.net> wrote: > Ah, but shouldn't a PostgreSQL (or any other database, for that matter) > have its own set of filesystems tuned to the application's I/O patterns? > Sure, there are some people who need to have all of their eggs in one > basket because they can't afford multiple baskets. For them, maybe the > OS defaults are the right choice. But if you're building a > database-specific server, you can optimize the I/O for that. It's really about a cost / benefits analysis. 20 years ago file systems were slow and buggy and a database could, with little work, outperform them. Nowadays, not so much. I'm guessing that the extra cost and effort of maintaining a file system for pgsql outweighs any real gain you're likely to see performance wise. But I'm sure that if you implemented one that outran XFS / ZFS / ext3 et. al. people would want to hear about it.
On Mon, 8 Dec 2008, Scott Marlowe wrote: > On Sun, Dec 7, 2008 at 10:59 PM, M. Edward (Ed) Borasky > <znmeb@cesmail.net> wrote: >> Ah, but shouldn't a PostgreSQL (or any other database, for that matter) >> have its own set of filesystems tuned to the application's I/O patterns? >> Sure, there are some people who need to have all of their eggs in one >> basket because they can't afford multiple baskets. For them, maybe the >> OS defaults are the right choice. But if you're building a >> database-specific server, you can optimize the I/O for that. > > It's really about a cost / benefits analysis. 20 years ago file > systems were slow and buggy and a database could, with little work, > outperform them. Nowadays, not so much. I'm guessing that the extra > cost and effort of maintaining a file system for pgsql outweighs any > real gain you're likely to see performance wise. especially with the need to support the new 'filesystem' on many different OS types. David Lang > But I'm sure that if you implemented one that outran XFS / ZFS / ext3 > et. al. people would want to hear about it. > >
Scott Marlowe wrote: > On Sun, Dec 7, 2008 at 10:59 PM, M. Edward (Ed) Borasky > <znmeb@cesmail.net> wrote: >> Ah, but shouldn't a PostgreSQL (or any other database, for that matter) >> have its own set of filesystems tuned to the application's I/O patterns? >> Sure, there are some people who need to have all of their eggs in one >> basket because they can't afford multiple baskets. For them, maybe the >> OS defaults are the right choice. But if you're building a >> database-specific server, you can optimize the I/O for that. > > It's really about a cost / benefits analysis. 20 years ago file > systems were slow and buggy and a database could, with little work, > outperform them. Nowadays, not so much. I'm guessing that the extra > cost and effort of maintaining a file system for pgsql outweighs any > real gain you're likely to see performance wise. > > But I'm sure that if you implemented one that outran XFS / ZFS / ext3 > et. al. people would want to hear about it. > I guess I wasn't clear -- I didn't mean a PostgreSQL-specific filesystem design, although BTRFS does have some things that are "RDBMS-friendly". I meant that one should hand-tune existing filesystems / hardware for optimum performance on specific workloads. The tablespaces in PostgreSQL give you that kind of potential granularity, I think. -- M. Edward (Ed) Borasky, FBG, AB, PTA, PGS, MS, MNLP, NST, ACMC(P), WOM "A mathematician is a device for turning coffee into theorems." -- Alfréd Rényi via Paul Erdős
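Tablespaces do make that kind of per-filesystem tuning straightforward. A sketch, with directory, tablespace and table names purely hypothetical:

$ mkdir -p /mnt/raid10/pg_indexes
$ chown postgres:postgres /mnt/raid10/pg_indexes
$ psql -U postgres -c "CREATE TABLESPACE fast_space LOCATION '/mnt/raid10/pg_indexes';"
$ psql -U postgres -d mydb -c "ALTER TABLE orders SET TABLESPACE fast_space;"

New objects can also be placed there directly, e.g. CREATE INDEX ... TABLESPACE fast_space.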