Thread: File Systems Compared
All tests are with bonnie++ 1.03a.

Main components of system:
16 WD Raptor 150GB 10000 RPM drives all in a RAID 10
ARECA 1280 PCI-Express RAID adapter with 1GB BB Cache (Thanks for the recommendation, Ron!)
32 GB RAM
Dual Intel 5160 Xeon Woodcrest 3.0 GHz processors
OS: SUSE Linux 10.1

All runs are with the write cache disabled on the hard disks, except for one additional test for xfs where it was enabled. I tested with ordered and writeback journaling modes for ext3 to see if writeback journaling would help over the default of ordered. The 1GB of battery backed cache on the RAID card was enabled for all tests as well. Tests are in order of increasing random seek performance.

In my tests on this hardware, xfs is the decisive winner, beating all of the other file systems in performance on every single metric. 658 random seeks per second, 433 MB/sec sequential read, and 350 MB/sec sequential write seems decent enough, but not as high as numbers other people have suggested are attainable with a 16 disk RAID 10. 350 MB/sec sequential write with disk caches enabled versus 280 MB/sec sequential write with disk caches disabled sure makes enabling the disk write cache tempting. Anyone run their RAIDs with disk caches enabled, or is this akin to having fsync off?

ext3 (writeback data journaling mode):
/usr/local/sbin/bonnie++ -d bonnie -s 64368:8k
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
hulk4       64368M 78625  91 279921  51 112346  13 89463  96 417695  22 545.7   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  5903  99 +++++ +++ +++++ +++  6112  99 +++++ +++ 18620 100
hulk4,64368M,78625,91,279921,51,112346,13,89463,96,417695,22,545.7,0,16,5903,99,+++++,+++,+++++,+++,6112,99,+++++,+++,18620,100

ext3 (ordered data journaling mode):
/usr/local/sbin/bonnie++ -d bonnie -s 64368:8k
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
hulk4       64368M 74902  89 250274  52 123637  16 88992  96 417222  23 548.3   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  5941  97 +++++ +++ +++++ +++  6270  99 +++++ +++ 18670  99
hulk4,64368M,74902,89,250274,52,123637,16,88992,96,417222,23,548.3,0,16,5941,97,+++++,+++,+++++,+++,6270,99,+++++,+++,18670,99

reiserfs:
/usr/local/sbin/bonnie++ -d bonnie -s 64368:8k
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
hulk4       64368M 81004  99 269191  50 128322  16 87865  96 407035  28 550.3   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
hulk4,64368M,81004,99,269191,50,128322,16,87865,96,407035,28,550.3,0,16,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++

jfs:
/usr/local/sbin/bonnie++ -d bonnie/ -s 64368:8k
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
hulk4       64368M 73246  80 268886  28 110465   9 89516  96 413897  21 639.5   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  3756   5 +++++ +++ +++++ +++ 23763  90 +++++ +++ 22371  70
hulk4,64368M,73246,80,268886,28,110465,9,89516,96,413897,21,639.5,0,16,3756,5,+++++,+++,+++++,+++,23763,90,+++++,+++,22371,70

xfs (with write cache disabled on disks):
/usr/local/sbin/bonnie++ -d bonnie/ -s 64368:8k
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
hulk4       64368M 90621  99 283916  35 105871  11 88569  97 433890  23 644.5   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 28435  95 +++++ +++ 28895  82 28523  91 +++++ +++ 24369  86
hulk4,64368M,90621,99,283916,35,105871,11,88569,97,433890,23,644.5,0,16,28435,95,+++++,+++,28895,82,28523,91,+++++,+++,24369,86

xfs (with write cache enabled on disks):
/usr/local/sbin/bonnie++ -d bonnie -s 64368:8k
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
hulk4       64368M 90861  99 348401  43 131887  14 89412  97 432964  23 658.7   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 28871  90 +++++ +++ 28923  91 30879  93 +++++ +++ 28012  94
hulk4,64368M,90861,99,348401,43,131887,14,89412,97,432964,23,658.7,0,16,28871,90,+++++,+++,28923,91,30879,93,+++++,+++,28012,94
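For anyone who wants to run a comparable sweep on their own hardware, the per-filesystem setups above boil down to remaking and remounting the same volume with different options between bonnie++ runs. The following is only a sketch of that loop, not the exact procedure Brian used: the device and mount point names are invented, and the bonnie++ flags are simply the ones from the post.

    #!/bin/sh
    # Sketch only: /dev/sdb1 and /mnt/bench are hypothetical names.
    DEV=/dev/sdb1          # the logical volume exported by the RAID controller
    MNT=/mnt/bench

    run_bonnie() {
        # 64368 MB working set in 8 KB chunks, matching the posted runs.
        # bonnie++ won't run as root unless told which user to drop to.
        /usr/local/sbin/bonnie++ -d "$MNT/bonnie" -s 64368:8k -u nobody
    }

    # ext3 in writeback, then ordered (the ext3 default), journaling mode
    mkfs.ext3 "$DEV"
    for mode in writeback ordered; do
        mount -t ext3 -o data=$mode "$DEV" "$MNT"
        mkdir -p "$MNT/bonnie" && chmod 777 "$MNT/bonnie"
        run_bonnie
        umount "$MNT"
    done

    # xfs; reiserfs and jfs follow the same pattern with their own mkfs
    mkfs.xfs -f "$DEV"
    mount -t xfs "$DEV" "$MNT"
    mkdir -p "$MNT/bonnie" && chmod 777 "$MNT/bonnie"
    run_bonnie
    umount "$MNT"

Toggling the drives' own write caches (the disabled/enabled distinction in the xfs runs) is a separate step done at the drive or controller level, discussed further down the thread.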
Brian Wipf wrote:
> All tests are with bonnie++ 1.03a

Thanks for posting these tests. Now I have actual numbers to beat our storage server provider about the head and shoulders with. Also, I found them interesting in and of themselves.

These numbers are close enough to bus-saturation rates that I'd strongly advise new people setting up systems to go this route over spending money on some fancy storage area network solution, unless you need more HD space than fits nicely in one of these raids. If reliability is a concern, buy 2 servers and implement Slony for failover.

Brian
On Dec 6, 2006, at 16:40 , Brian Wipf wrote: > All tests are with bonnie++ 1.03a [snip] Care to post these numbers *without* word wrapping? Thanks. Alexander.
Brian, On 12/6/06 8:02 AM, "Brian Hurt" <bhurt@janestcapital.com> wrote: > These numbers are close enough to bus-saturation rates PCIX is 1GB/s + and the memory architecture is 20GB/s+, though each CPU is likely to obtain only 2-3GB/s. We routinely achieve 1GB/s I/O rate on two 3Ware adapters and 2GB/s on the Sun X4500 with ZFS. > advise new people setting up systems to go this route over spending > money on some fancy storage area network solution People buy SANs for interesting reasons, some of them having to do with the manageability features of high end SANs. I've heard it said in those cases that "performance doesn't matter much". As you suggest, database replication provides one of those features, and Solaris ZFS has many of the data management features found in high end SANs. Perhaps we can get the best of both? In the end, I think SAN vs. server storage is a religious battle. - Luke
Hi, Alexander Staubo wrote: > Care to post these numbers *without* word wrapping? Thanks. How is one supposed to do that? Care giving an example? Markus
> As you suggest, database replication provides one of those features, and
> Solaris ZFS has many of the data management features found in high end SANs.
> Perhaps we can get the best of both?
>
> In the end, I think SAN vs. server storage is a religious battle.
>
> - Luke

I agree. I have many people that want to purchase a SAN because someone told them that is what they need... Yet they can spend 20% of the cost on two external arrays and get incredible performance...

We are seeing great numbers from the following config:

(2) HP MSA30s (loaded), dual bus
(2) HP 6402, one connected to each MSA

The performance for the money is incredible.

Sincerely,

Joshua D. Drake

--
=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
Luke Lonergan wrote:
> Brian,
>
> On 12/6/06 8:02 AM, "Brian Hurt" <bhurt@janestcapital.com> wrote:
>> These numbers are close enough to bus-saturation rates
>
> PCIX is 1GB/s + and the memory architecture is 20GB/s+, though each CPU is
> likely to obtain only 2-3GB/s. We routinely achieve 1GB/s I/O rate on two
> 3Ware adapters and 2GB/s on the Sun X4500 with ZFS.

For some reason I'd got it stuck in my head that PCI-Express maxed out at a theoretical 533 MByte/sec - at which point, getting 480 MByte/sec across it is pretty dang good. But actually looking things up, I see that PCI-Express has a theoretical 8 Gbit/sec, or about 800 MByte/sec. It's PCI-X that's 533 MByte/sec. So there's still some headroom available there.

Brian
On Wed, Dec 06, 2006 at 05:31:01PM +0100, Markus Schiltknecht wrote:
>> Care to post these numbers *without* word wrapping? Thanks.
> How is one supposed to do that? Care giving an example?

This is a rather long sentence without any kind of word wrapping except what would be imposed on your own side -- how to set that up properly depends on the sending e-mail client, but in mine it's just a matter of turning off the word wrapping in your editor :-)

/* Steinar */
--
Homepage: http://www.sesse.net/
* Brian Wipf:

> Anyone run their RAIDs with disk caches enabled, or is this akin to
> having fsync off?

If your cache is backed by a battery, enabling the write cache shouldn't be a problem. You can check if the whole thing is working well by running this test script:

<http://brad.livejournal.com/2116715.html>

Without a battery, enabling the write cache leads to various degrees of data corruption in case of a power outage (possibly including file system corruption requiring manual recovery).

--
Florian Weimer <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99
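Brad's linked script does this properly (it acknowledges writes over the network and verifies them after the plug is pulled). Purely to illustrate the idea, and not as a substitute for that script, the core of such a test is something like the sketch below; the host name, port, and paths are invented, and a bare `sync` stands in for a real per-file fsync().

    # On the machine under test: write a counter, push it toward stable
    # storage, and only then tell another host it was acknowledged.
    # After pulling the plug, any value that reached the logger host but
    # is missing from /mnt/test/counter.log is a write that was claimed
    # durable yet was lost in a volatile cache.
    i=0
    while true; do
        i=$((i + 1))
        echo "$i" >> /mnt/test/counter.log
        sync                      # crude stand-in for fsync()
        echo "acked $i"
    done | nc logger-host 9000    # on logger-host: nc -l -p 9000 > acked.log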
> Anyone run their RAIDs with disk caches enabled, or is this akin to > having fsync off? Disk write caches are basically always akin to having fsync off. The only time a write-cache is (more or less) safe to enable is when it is backed by a battery or in some other way made non-volatile. So a RAID controller with a battery-backed write cache can enable its own write cache, but can't safely enable the write-caches on the disk drives it manages. -- Mark Lewis
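For anyone wanting to check how their own drives are set up, the drive-level write cache can usually be inspected and toggled from the shell. This is only a sketch: /dev/sda is a placeholder, and drives sitting behind a hardware RAID controller often have to be configured through the controller's own CLI or BIOS instead of hdparm/sdparm.

    # ATA/SATA drives: query the current setting first, then flip it.
    hdparm -W /dev/sda        # show whether the drive's write cache is on
    hdparm -W0 /dev/sda       # turn the drive's write cache off
    hdparm -W1 /dev/sda       # turn it back on (e.g. for a throwaway benchmark)

    # SCSI/SAS drives expose the same knob as the WCE bit in the caching mode page:
    sdparm --get=WCE /dev/sda
    sdparm --clear=WCE /dev/sda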
Brian, On 12/6/06 8:40 AM, "Brian Hurt" <bhurt@janestcapital.com> wrote: > But actually looking things up, I see that PCI-Express has a theoretical 8 > Gbit/sec, or about 800Mbyte/sec. It's PCI-X that's 533 MByte/sec. So there's > still some headroom available there. See here for the official specifications of both: http://www.pcisig.com/specifications/pcix_20/ Note that PCI-X version 1.0 at 133MHz runs at 1GB/s. It's a parallel bus, 64 bits wide (8 bytes) and runs at 133MHz, so 8 x 133 ~= 1 gigabyte/second. PCI Express with 16 lanes (PCIe x16) can transfer data at 4GB/s. The Arecas use (PCIe x8, see here: http://www.areca.com.tw/products/html/pcie-sata.htm), so they can do 2GB/s. - Luke
Hi,

Steinar H. Gunderson wrote:
> This is a rather long sentence without any kind of word wrapping except what would be imposed on your own side -- how to set that up properly depends on the sending e-mail client, but in mine it's just a matter of turning off the word wrapping in your editor :-)

Duh! Cool, thank you for the example :-) I thought the MTA or at least the mailing list would wrap mails at some limit. I've now set word-wrap to 9999 characters (it seems not possible to turn it off completely in Thunderbird). But when writing, I'm now getting one long line.

What's common practice? What's it on the pgsql mailing lists?

Regards

Markus
Markus Schiltknecht a écrit :
> What's common practice? What's it on the pgsql mailing lists?

Netiquette usually advises mailers to wrap after 72 characters on mailing lists. This does not apply for format=flowed, I guess (that's the format used in Steinar's message).
On Wed, Dec 06, 2006 at 06:45:56PM +0100, Markus Schiltknecht wrote: > Cool, thank you for the example :-) I thought the MTA or at least the the > mailing list would wrap mails at some limit. I've now set word-wrap to 9999 > characters (it seems not possible to turn it off completely in > thunderbird). But when writing, I'm now getting one long line. Thunderbird uses format=flowed, so it's wrapped nevertheless. Google to find out how to turn it off if you really need to. > What's common practice? Usually 72 or 76 characters, TTBOMK -- but when posting tables or big query plans, one should simply turn it off, as it kills readability. > What's it on the pgsql mailing lists? No idea. :-) /* Steinar */ -- Homepage: http://www.sesse.net/
On Wed, Dec 06, 2006 at 06:59:12PM +0100, Arnaud Lesauvage wrote: >Markus Schiltknecht a écrit : >>What's common practice? What's it on the pgsql mailing lists? > >The netiquette usually advise mailers to wrap after 72 characters >on mailing lists. >This does not apply for format=flowed I guess (that's the format >used in Steinar's message). It would apply to either; format=flowed can be wrapped at the receiver's end, but still be formatted to a particular column for readers that don't understand format=flowed. (Which is likely to be many, since that's a standard that never really took off.) No wrap netiquette applies to formatted text blocks which are unreadable if wrapped (such as bonnie or EXPLAIN output). Mike Stone
On 6-Dec-06, at 9:05 AM, Alexander Staubo wrote:
>> All tests are with bonnie++ 1.03a
> [snip]
> Care to post these numbers *without* word wrapping? Thanks.

That's what Bonnie++'s output looks like. If you have Bonnie++ installed, you can run the following:

bon_csv2html << EOF
hulk4,64368M,78625,91,279921,51,112346,13,89463,96,417695,22,545.7,0,16,5903,99,+++++,+++,+++++,+++,6112,99,+++++,+++,18620,100
EOF

Which will prettify the CSV results using HTML.
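If memory serves, the bonnie++ package also ships a bon_csv2txt filter alongside bon_csv2html (treat its presence as an assumption and check your install); it takes the same CSV on stdin and emits a plain-text table instead of HTML:

    bon_csv2txt << EOF
    hulk4,64368M,78625,91,279921,51,112346,13,89463,96,417695,22,545.7,0,16,5903,99,+++++,+++,+++++,+++,6112,99,+++++,+++,18620,100
    EOF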
On 12/6/06, Luke Lonergan <llonergan@greenplum.com> wrote:
> People buy SANs for interesting reasons, some of them having to do with the
> manageability features of high end SANs. I've heard it said in those cases
> that "performance doesn't matter much".

There is movement in the industry right now away from tape systems to managed disk storage for backups and data retention. In these cases performance requirements are not very high -- and a single server can manage a huge amount of storage. In theory, you can do the same thing attached via sas expanders, but fc networking is imo more flexible and scalable. The manageability features of SANs are a mixed bag and decidedly overrated, but they have their place, imo.

merlin
Luke Lonergan wrote:
> Brian,
>
> On 12/6/06 8:40 AM, "Brian Hurt" <bhurt@janestcapital.com> wrote:
>> But actually looking things up, I see that PCI-Express has a theoretical 8
>> Gbit/sec, or about 800Mbyte/sec. It's PCI-X that's 533 MByte/sec. So there's
>> still some headroom available there.
>
> See here for the official specifications of both:
>
> http://www.pcisig.com/specifications/pcix_20/
>
> Note that PCI-X version 1.0 at 133MHz runs at 1GB/s. It's a parallel bus, 64
> bits wide (8 bytes) and runs at 133MHz, so 8 x 133 ~= 1 gigabyte/second.
>
> PCI Express with 16 lanes (PCIe x16) can transfer data at 4GB/s. The Arecas
> use PCIe x8 (see here: http://www.areca.com.tw/products/html/pcie-sata.htm),
> so they can do 2GB/s.
>
> - Luke

Thanks. I stand corrected (again).

Brian
On Wed, Dec 06, 2006 at 18:45:56 +0100, Markus Schiltknecht <markus@bluegap.ch> wrote:
>
> Cool, thank you for the example :-) I thought the MTA or at least the
> mailing list would wrap mails at some limit. I've now set word-wrap to 9999
> characters (it seems not possible to turn it off completely in
> Thunderbird). But when writing, I'm now getting one long line.
>
> What's common practice? What's it on the pgsql mailing lists?

If you do this you should set format=flowed (see RFC 2646). If you do that, then clients can break the lines in an appropriate way. This is actually better than fixing the line width in the original message, since the recipient may not have the same number of characters (or pixels) of display as the sender.
At 10:40 AM 12/6/2006, Brian Wipf wrote:
>All tests are with bonnie++ 1.03a
>
>Main components of system:
>16 WD Raptor 150GB 10000 RPM drives all in a RAID 10
>ARECA 1280 PCI-Express RAID adapter with 1GB BB Cache (Thanks for the recommendation, Ron!)
>32 GB RAM
>Dual Intel 5160 Xeon Woodcrest 3.0 GHz processors
>OS: SUSE Linux 10.1
>
>xfs (with write cache disabled on disks):
>/usr/local/sbin/bonnie++ -d bonnie/ -s 64368:8k
>Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
>                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
>Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
>hulk4       64368M 90621  99 283916  35 105871  11 88569  97 433890  23 644.5   0
>                    ------Sequential Create------ --------Random Create--------
>                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                 16 28435  95 +++++ +++ 28895  82 28523  91 +++++ +++ 24369  86
>hulk4,64368M,90621,99,283916,35,105871,11,88569,97,433890,23,644.5,0,16,28435,95,+++++,+++,28895,82,28523,91,+++++,+++,24369,86
>
>xfs (with write cache enabled on disks):
>/usr/local/sbin/bonnie++ -d bonnie -s 64368:8k
>Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
>                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
>Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
>hulk4       64368M 90861  99 348401  43 131887  14 89412  97 432964  23 658.7   0
>                    ------Sequential Create------ --------Random Create--------
>                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                 16 28871  90 +++++ +++ 28923  91 30879  93 +++++ +++ 28012  94
>hulk4,64368M,90861,99,348401,43,131887,14,89412,97,432964,23,658.7,0,16,28871,90,+++++,+++,28923,91,30879,93,+++++,+++,28012,94

Hmmm. Something is not right. With a 16 HD RAID 10 based on 10K rpm HDs, you should be seeing higher absolute performance numbers.

Find out what HW the Areca guys and Tweakers guys used to test the 1280s.
At LW2006, Areca was demonstrating all-in-cache reads and writes of ~1600MBps and ~1300MBps respectively along with RAID 0 Sustained Rates of ~900MBps read, and ~850MBps write.

Luke, I know you've managed to get higher IO rates than this with this class of HW. Is there a OS or SW config issue Brian should closely investigate?

Ron Peacetree
> Hmmm. Something is not right. With a 16 HD RAID 10 based on 10K
> rpm HDs, you should be seeing higher absolute performance numbers.
>
> Find out what HW the Areca guys and Tweakers guys used to test the
> 1280s.
> At LW2006, Areca was demonstrating all-in-cache reads and writes of
> ~1600MBps and ~1300MBps respectively along with RAID 0 Sustained
> Rates of ~900MBps read, and ~850MBps write.
>
> Luke, I know you've managed to get higher IO rates than this with
> this class of HW. Is there a OS or SW config issue Brian should
> closely investigate?

I wrote 1280 by a mistake. It's actually a 1260. Sorry about that. The IOP341 class of cards weren't available when we ordered the parts for the box, so we had to go with the 1260. The box(es) we build next month will either have the 1261ML or 1280 depending on whether we go 16 or 24 disk.

I noticed Bucky got almost 800 random seeks per second on her 6 disk 10000 RPM SAS drive Dell PowerEdge 2950. The random seek performance of this box disappointed me the most. Even running 2 concurrent bonnies, the random seek performance only increased from 644 seeks/sec to 813 seeks/sec. Maybe there is some setting I'm missing? This card looked pretty impressive on tweakers.net.
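For reference, the "two concurrent bonnies" experiment Brian describes comes down to something like the sketch below (the paths are invented); the interesting number is whether the summed Random Seeks figure keeps growing as more readers are added or flattens out almost immediately, as it did here.

    # Run N bonnie++ instances against separate directories on the same
    # array, then compare the per-instance Random Seeks figures.
    N=2
    for i in $(seq 1 $N); do
        mkdir -p /mnt/bench/bonnie$i
        /usr/local/sbin/bonnie++ -d /mnt/bench/bonnie$i -s 64368:8k \
            > /tmp/bonnie$i.out 2>&1 &
    done
    wait
    grep '^hulk4,' /tmp/bonnie*.out     # the CSV summary line of each run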
On 6-Dec-06, at 2:47 PM, Brian Wipf wrote:
>> Hmmm. Something is not right. With a 16 HD RAID 10 based on 10K
>> rpm HDs, you should be seeing higher absolute performance numbers.
>>
>> Find out what HW the Areca guys and Tweakers guys used to test the
>> 1280s.
>> At LW2006, Areca was demonstrating all-in-cache reads and writes
>> of ~1600MBps and ~1300MBps respectively along with RAID 0
>> Sustained Rates of ~900MBps read, and ~850MBps write.
>>
>> Luke, I know you've managed to get higher IO rates than this with
>> this class of HW. Is there a OS or SW config issue Brian should
>> closely investigate?
>
> I wrote 1280 by a mistake. It's actually a 1260. Sorry about that.
> The IOP341 class of cards weren't available when we ordered the
> parts for the box, so we had to go with the 1260. The box(es) we
> build next month will either have the 1261ML or 1280 depending on
> whether we go 16 or 24 disk.
>
> I noticed Bucky got almost 800 random seeks per second on her 6
> disk 10000 RPM SAS drive Dell PowerEdge 2950. The random seek
> performance of this box disappointed me the most. Even running 2
> concurrent bonnies, the random seek performance only increased from
> 644 seeks/sec to 813 seeks/sec. Maybe there is some setting I'm
> missing? This card looked pretty impressive on tweakers.net.

Areca has some performance numbers in a downloadable PDF for the Areca ARC-1120, which is in the same class as the ARC-1260, except with 8 ports. With all 8 drives in a RAID 0 the card gets the following performance numbers:

Card      single thread write  20 thread write  single thread read  20 thread read
ARC-1120  321.26 MB/s          404.76 MB/s      412.55 MB/s         672.45 MB/s

My numbers for sequential i/o for the ARC-1260 in a 16 disk RAID 10 are slightly better than the ARC-1120 in an 8 disk RAID 0 for a single thread. I guess this means my numbers are reasonable.
The 1100 series is PCI-X based. The 1200 series is PCI-E x8 based. Apples and oranges.

I still think Luke Lonergan or Josh Berkus may have some interesting ideas regarding possible OS and SW optimizations.

WD1500ADFDs are each good for ~90MBps read and ~60MBps write ASTR.
That means your 16 HD RAID 10 should be sequentially transferring ~720MBps read and ~480MBps write.
Clearly more HDs will be required to allow a ARC-12xx to attain its peak performance.

One thing that occurs to me with your present HW is that your CPU utilization numbers are relatively high.
Since 5160s are clocked about as high as is available, that leaves trying CPUs with more cores and trying more CPUs.

You've basically got 4 HW threads at the moment. If you can, evaluate CPUs and mainboards that allow for 8 or 16 HW threads. Intel-wise, that's the new Kentsfields. AMD-wise, you have lots of 4S mainboard options, but the AMD 4C CPUs won't be available until sometime late in 2007.

I've got other ideas, but this list is not the appropriate venue for the level of detail required.

Ron Peacetree

At 05:30 PM 12/6/2006, Brian Wipf wrote:
>On 6-Dec-06, at 2:47 PM, Brian Wipf wrote:
>
>>>Hmmm. Something is not right. With a 16 HD RAID 10 based on 10K
>>>rpm HDs, you should be seeing higher absolute performance numbers.
>>>
>>>Find out what HW the Areca guys and Tweakers guys used to test the
>>>1280s.
>>>At LW2006, Areca was demonstrating all-in-cache reads and writes
>>>of ~1600MBps and ~1300MBps respectively along with RAID 0
>>>Sustained Rates of ~900MBps read, and ~850MBps write.
>>>
>>>Luke, I know you've managed to get higher IO rates than this with
>>>this class of HW. Is there a OS or SW config issue Brian should
>>>closely investigate?
>>
>>I wrote 1280 by a mistake. It's actually a 1260. Sorry about that.
>>The IOP341 class of cards weren't available when we ordered the
>>parts for the box, so we had to go with the 1260. The box(es) we
>>build next month will either have the 1261ML or 1280 depending on
>>whether we go 16 or 24 disk.
>>
>>I noticed Bucky got almost 800 random seeks per second on her 6
>>disk 10000 RPM SAS drive Dell PowerEdge 2950. The random seek
>>performance of this box disappointed me the most. Even running 2
>>concurrent bonnies, the random seek performance only increased from
>>644 seeks/sec to 813 seeks/sec. Maybe there is some setting I'm
>>missing? This card looked pretty impressive on tweakers.net.
>
>Areca has some performance numbers in a downloadable PDF for the
>Areca ARC-1120, which is in the same class as the ARC-1260, except
>with 8 ports. With all 8 drives in a RAID 0 the card gets the
>following performance numbers:
>
>Card      single thread write  20 thread write  single thread read  20 thread read
>ARC-1120  321.26 MB/s          404.76 MB/s      412.55 MB/s         672.45 MB/s
>
>My numbers for sequential i/o for the ARC-1260 in a 16 disk RAID 10
>are slightly better than the ARC-1120 in an 8 disk RAID 0 for a
>single thread. I guess this means my numbers are reasonable.
I appreciate your suggestions, Ron. And that helps answer my question on processor selection for our next box; I wasn't sure if the lower MHz speed of the Kentsfield compared to the Woodcrest but with double the cores would be better for us overall or not.

On 6-Dec-06, at 4:25 PM, Ron wrote:
> The 1100 series is PCI-X based. The 1200 series is PCI-E x8 based. Apples
> and oranges.
>
> I still think Luke Lonergan or Josh Berkus may have some interesting ideas
> regarding possible OS and SW optimizations.
>
> WD1500ADFDs are each good for ~90MBps read and ~60MBps write ASTR.
> That means your 16 HD RAID 10 should be sequentially transferring
> ~720MBps read and ~480MBps write.
> Clearly more HDs will be required to allow a ARC-12xx to attain its
> peak performance.
>
> One thing that occurs to me with your present HW is that your CPU
> utilization numbers are relatively high.
> Since 5160s are clocked about as high as is available, that leaves
> trying CPUs with more cores and trying more CPUs.
>
> You've basically got 4 HW threads at the moment. If you can, evaluate
> CPUs and mainboards that allow for 8 or 16 HW threads. Intel-wise,
> that's the new Kentsfields. AMD-wise, you have lots of 4S mainboard
> options, but the AMD 4C CPUs won't be available until sometime late in 2007.
>
> I've got other ideas, but this list is not the appropriate venue for
> the level of detail required.
>
> Ron Peacetree
>
> At 05:30 PM 12/6/2006, Brian Wipf wrote:
>> On 6-Dec-06, at 2:47 PM, Brian Wipf wrote:
>>
>>>> Hmmm. Something is not right. With a 16 HD RAID 10 based on 10K
>>>> rpm HDs, you should be seeing higher absolute performance numbers.
>>>>
>>>> Find out what HW the Areca guys and Tweakers guys used to test the
>>>> 1280s.
>>>> At LW2006, Areca was demonstrating all-in-cache reads and writes
>>>> of ~1600MBps and ~1300MBps respectively along with RAID 0
>>>> Sustained Rates of ~900MBps read, and ~850MBps write.
>>>>
>>>> Luke, I know you've managed to get higher IO rates than this with
>>>> this class of HW. Is there a OS or SW config issue Brian should
>>>> closely investigate?
>>>
>>> I wrote 1280 by a mistake. It's actually a 1260. Sorry about that.
>>> The IOP341 class of cards weren't available when we ordered the
>>> parts for the box, so we had to go with the 1260. The box(es) we
>>> build next month will either have the 1261ML or 1280 depending on
>>> whether we go 16 or 24 disk.
>>>
>>> I noticed Bucky got almost 800 random seeks per second on her 6
>>> disk 10000 RPM SAS drive Dell PowerEdge 2950. The random seek
>>> performance of this box disappointed me the most. Even running 2
>>> concurrent bonnies, the random seek performance only increased from
>>> 644 seeks/sec to 813 seeks/sec. Maybe there is some setting I'm
>>> missing? This card looked pretty impressive on tweakers.net.
>>
>> Areca has some performance numbers in a downloadable PDF for the
>> Areca ARC-1120, which is in the same class as the ARC-1260, except
>> with 8 ports. With all 8 drives in a RAID 0 the card gets the
>> following performance numbers:
>>
>> Card      single thread write  20 thread write  single thread read  20 thread read
>> ARC-1120  321.26 MB/s          404.76 MB/s      412.55 MB/s         672.45 MB/s
>>
>> My numbers for sequential i/o for the ARC-1260 in a 16 disk RAID 10
>> are slightly better than the ARC-1120 in an 8 disk RAID 0 for a
>> single thread. I guess this means my numbers are reasonable.
At 06:40 PM 12/6/2006, Brian Wipf wrote:
>I appreciate your suggestions, Ron. And that helps answer my question
>on processor selection for our next box; I wasn't sure if the lower
>MHz speed of the Kentsfield compared to the Woodcrest but with double
>the cores would be better for us overall or not.

Please do not misunderstand me. I am not endorsing the use of Kentsfield. I am recommending =evaluating= Kentsfield.

I am also recommending the evaluation of 2C 4S AMD solutions.

All this stuff is so leading edge that it is far from clear what the RW performance of DBMS based on these components will be without extensive testing of =your= app under =your= workload.

One thing that is clear from what you've posted thus far is that you are going to need more HDs if you want to have any chance of fully utilizing your Areca HW.

Out of curiosity, where are you geographically?

Hoping I'm being helpful,
Ron
On Wed, 6 Dec 2006, Alexander Staubo wrote: > Care to post these numbers *without* word wrapping? Brian's message was sent with format=flowed and therefore it's easy to re-assemble into original form if your software understands that. I just checked with two e-mail clients (Thunderbird and Pine) and all his bonnie++ results were perfectly readable on both as soon as I made the display wide enough. If you had trouble reading it, you might consider upgrading your mail client to one that understands that standard. Statistically, though, if you have this problem you're probably using Outlook and there may not be a useful upgrade path for you. I know it's been added to the latest Express version (which even defaults to sending messages flowed, driving many people crazy), but am not sure if any of the Office Outlooks know what to do with flowed messages yet. And those of you pointing people at the RFC's, that's a bit hardcore--the RFC documents themselves could sure use some better formatting. https://bugzilla.mozilla.org/attachment.cgi?id=134270&action=view has a readable introduction to the encoding of flowed messages, http://mailformat.dan.info/body/linelength.html gives some history to how we all got into this mess in the first place, and http://joeclark.org/ffaq.html also has some helpful (albeit out of date in spots) comments on this subject. Even if it is correct netiquette to disable word-wrapping for long lines like bonnie output (there are certainly two sides with valid points in that debate), to make them more compatible with flow-impaired clients, you can't expect that mail composition software is sophisticated enough to allow doing that for one section while still wrapping the rest of the text correctly. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On 6-Dec-06, at 5:26 PM, Ron wrote: > At 06:40 PM 12/6/2006, Brian Wipf wrote: >> I appreciate your suggestions, Ron. And that helps answer my question >> on processor selection for our next box; I wasn't sure if the lower >> MHz speed of the Kentsfield compared to the Woodcrest but with double >> the cores would be better for us overall or not. > Please do not misunderstand me. I am not endorsing the use of > Kentsfield. > I am recommending =evaluating= Kentsfield. > > I am also recommending the evaluation of 2C 4S AMD solutions. > > All this stuff is so leading edge that it is far from clear what > the RW performance of DBMS based on these components will be > without extensive testing of =your= app under =your= workload. I want the best performance for the dollar, so I can't rule anything out. Right now I'm leaning towards Kentsfield, but I will do some more research before I make a decision. We probably won't wait much past January though. > One thing that is clear from what you've posted thus far is that > you are going to needmore HDs if you want to have any chance of > fully utilizing your Areca HW. Do you know off hand where I might find a chassis that can fit 24[+] drives? The last chassis we ordered was through Supermicro, and the largest they carry fits 16 drives. > Hoping I'm being helpful I appreciate any help I can get. Brian Wipf <brian@clickspace.com>
At 03:37 AM 12/7/2006, Brian Wipf wrote: >On 6-Dec-06, at 5:26 PM, Ron wrote: >> >>All this stuff is so leading edge that it is far from clear what >>the RW performance of DBMS based on these components will be >>without extensive testing of =your= app under =your= workload. >I want the best performance for the dollar, so I can't rule anything >out. Right now I'm leaning towards Kentsfield, but I will do some >more research before I make a decision. We probably won't wait much >past January though. Kentsfield's outrageously high pricing and operating costs (power and cooling) are not likely to make it the cost/performance winner. OTOH, 1= ATM it is the way to throw the most cache per socket at a DBMS within the Core2 CPU line (Tulsa has even more at 16MB per CPU). 2= SSSE3 and other Core2 optimizations have led to some impressive performance numbers- unless raw clock rate is the thing that can help you the most. If what you need for highest performance is the absolute highest clock rate or most cache per core, then bench some Intel Tulsa's. Apps with memory footprints too large for on die or in socket caches or that require extreme memory subsystem performance are still best served by AMD CPUs. If you are getting the impression that it is presently complicated deciding which CPU is best for any specific pg app, then I am making the impression I intend to. >>One thing that is clear from what you've posted thus far is that >>you are going to needmore HDs if you want to have any chance of >>fully utilizing your Areca HW. >Do you know off hand where I might find a chassis that can fit 24[+] >drives? The last chassis we ordered was through Supermicro, and the >largest they carry fits 16 drives. www.pogolinux.com has 24 and 48 bay 3.5" HD chassis'; and a 64 bay 2.5" chassis. Tell them I sent you. www.impediment.com are folks I trust regarding all things storage (and RAM). Again, tell them I sent you. www.aberdeeninc.com is also a vendor I've had luck with, but try Pogo and Impediment first. Good luck and please post what happens, Ron Peacetree
>> One thing that is clear from what you've posted thus far is that you >> are going to needmore HDs if you want to have any chance of fully >> utilizing your Areca HW. > Do you know off hand where I might find a chassis that can fit 24[+] > drives? The last chassis we ordered was through Supermicro, and the > largest they carry fits 16 drives. Chenbro has a 24 drive case - the largest I have seen. It fits the big 4/8 cpu boards as well. http://www.chenbro.com/corporatesite/products_01features.php?serno=43 -- Shane Ambler pgSQL@007Marketing.com Get Sheeky @ http://Sheeky.Biz
I'm building a SuperServer 6035B server (16 scsi drives). My schema has basically two large tables (a million+ rows per day), each of which is partitioned daily and queried independently of the other. Would you recommend a raid1 system partition and 14 drives in a raid 10, or should I create separate partitions/tablespaces for the two large tables and indexes?
Thanks
Gene
--
Gene Hart
cell: 443-604-2679
On 12/7/06, Shane Ambler <pgsql@007marketing.com> wrote:
>> One thing that is clear from what you've posted thus far is that you
>> are going to need more HDs if you want to have any chance of fully
>> utilizing your Areca HW.
> Do you know off hand where I might find a chassis that can fit 24[+]
> drives? The last chassis we ordered was through Supermicro, and the
> largest they carry fits 16 drives.
Chenbro has a 24 drive case - the largest I have seen. It fits the big
4/8 cpu boards as well.
http://www.chenbro.com/corporatesite/products_01features.php?serno=43
--
Shane Ambler
pgSQL@007Marketing.com
Get Sheeky @ http://Sheeky.Biz
On 12/6/06, Brian Wipf <brian@clickspace.com> wrote: > > Hmmm. Something is not right. With a 16 HD RAID 10 based on 10K > > rpm HDs, you should be seeing higher absolute performance numbers. > > > > Find out what HW the Areca guys and Tweakers guys used to test the > > 1280s. > > At LW2006, Areca was demonstrating all-in-cache reads and writes of > > ~1600MBps and ~1300MBps respectively along with RAID 0 Sustained > > Rates of ~900MBps read, and ~850MBps write. > > > > Luke, I know you've managed to get higher IO rates than this with > > this class of HW. Is there a OS or SW config issue Brian should > > closely investigate? > > I wrote 1280 by a mistake. It's actually a 1260. Sorry about that. > The IOP341 class of cards weren't available when we ordered the parts > for the box, so we had to go with the 1260. The box(es) we build next > month will either have the 1261ML or 1280 depending on whether we go > 16 or 24 disk. > > I noticed Bucky got almost 800 random seeks per second on her 6 disk > 10000 RPM SAS drive Dell PowerEdge 2950. The random seek performance > of this box disappointed me the most. Even running 2 concurrent > bonnies, the random seek performance only increased from 644 seeks/ > sec to 813 seeks/sec. Maybe there is some setting I'm missing? This > card looked pretty impressive on tweakers.net. I've been looking a lot at the SAS enclosures lately and am starting to feel like that's the way to go. Performance is amazing and the flexibility of choosing low cost SATA or high speed SAS drives is great. not only that, but more and more SAS is coming out in 2.5" drives which seems to be a better fit for databases...more spindles. with a 2.5" drive enclosure they can stuff 10 hot swap drives into a 1u enclosure...that's pretty amazing. one downside of SAS is most of the HBAs are pci-express only, that can limit your options unless your server is very new. also you don't want to skimp on the hba, get the best available, which looks to be lsi logic at the moment (dell perc5/e is lsi logic controller as is the intel sas hba)...others? merlin
At 11:02 AM 12/7/2006, Gene wrote:
>I'm building a SuperServer 6035B server (16 scsi drives). My schema
>has basically two large tables (a million+ rows per day), each of which
>is partitioned daily and queried independently of the other. Would
>you recommend a raid1 system partition and 14 drives in a raid 10, or
>should I create separate partitions/tablespaces for the two large
>tables and indexes?

Not an easy question to answer w/o knowing more about your actual queries and workload.

To keep the math simple, let's assume each SCSI HD has an ASTR of 75MBps. A 14 HD RAID 10 therefore has an ASTR of 7 * 75 = 525MBps. If the rest of your system can handle this much or more bandwidth, then this is most probably the best config.

Dedicating spindles to specific tables is usually best done when there is HD bandwidth that can't be utilized if the HDs are in a larger set +and+ there is a significant hot spot that can use dedicated resources.

My first attempt would be to use other internal HDs for a RAID 1 system volume and use all 16 of your HBA HDs for a 16 HD RAID 10 array. Then I'd bench the config to see if it had acceptable performance. If yes, stop. Else start considering the more complicated alternatives.

Remember that adding HDs and RAM is far cheaper than even a few hours of skilled technical labor.

Ron Peacetree
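If the benchmarking does end up pointing at a hot spot worth dedicating spindles to, the PostgreSQL side of Gene's "separate tablespaces" option is only a couple of commands. A sketch, with invented paths, database, and table names (CREATE TABLESPACE needs 8.0 or later):

    # /raid10_hot is assumed to be the mount point of the dedicated array.
    mkdir -p /raid10_hot/pgdata
    chown postgres:postgres /raid10_hot/pgdata

    # Create the tablespace, then move one of the big tables onto it;
    # with daily partitioning, each partition (and index) is moved the same way.
    psql -U postgres -c "CREATE TABLESPACE hot_ts LOCATION '/raid10_hot/pgdata';"
    psql -U postgres -d mydb -c "ALTER TABLE big_table_a SET TABLESPACE hot_ts;"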
On Wed, Dec 06, 2006 at 08:55:14 -0800, Mark Lewis <mark.lewis@mir3.com> wrote: > > Anyone run their RAIDs with disk caches enabled, or is this akin to > > having fsync off? > > Disk write caches are basically always akin to having fsync off. The > only time a write-cache is (more or less) safe to enable is when it is > backed by a battery or in some other way made non-volatile. > > So a RAID controller with a battery-backed write cache can enable its > own write cache, but can't safely enable the write-caches on the disk > drives it manages. This appears to be changing under Linux. Recent kernels have write barriers implemented using cache flush commands (which some drives ignore, so you need to be careful). In very recent kernels, software raid using raid 1 will also handle write barriers. To get this feature, you are supposed to mount ext3 file systems with the barrier=1 option. For other file systems, the parameter may need to be different.
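To make the mount option Bruno mentions concrete, a barrier-enabled ext3 mount on top of md RAID 1 looks roughly like this; the device and mount point are invented, and whether the barrier actually reaches the platter still depends on the drives honoring the flush command.

    # One-off mount; the /etc/fstab equivalent would be:
    #   /dev/md0  /var/lib/pgsql  ext3  defaults,barrier=1  0  2
    mount -t ext3 -o barrier=1 /dev/md0 /var/lib/pgsql

    # If the stack can't do barriers, the kernel falls back quietly apart
    # from a log message, so it is worth checking:
    dmesg | grep -i barrier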
On Dec 11, 2006, at 12:54 PM, Bruno Wolff III wrote: > On Wed, Dec 06, 2006 at 08:55:14 -0800, > Mark Lewis <mark.lewis@mir3.com> wrote: >>> Anyone run their RAIDs with disk caches enabled, or is this akin to >>> having fsync off? >> >> Disk write caches are basically always akin to having fsync off. The >> only time a write-cache is (more or less) safe to enable is when >> it is >> backed by a battery or in some other way made non-volatile. >> >> So a RAID controller with a battery-backed write cache can enable its >> own write cache, but can't safely enable the write-caches on the disk >> drives it manages. > > This appears to be changing under Linux. Recent kernels have write > barriers > implemented using cache flush commands (which some drives ignore, > so you > need to be careful). In very recent kernels, software raid using > raid 1 > will also handle write barriers. To get this feature, you are > supposed to > mount ext3 file systems with the barrier=1 option. For other file > systems, > the parameter may need to be different. But would that actually provide a meaningful benefit? When you COMMIT, the WAL data must hit non-volatile storage of some kind, which without a BBU or something similar, means hitting the platter. So I don't see how enabling the disk cache will help, unless of course it's ignoring fsync. Now, I have heard something about drives using their stored rotational energy to flush out the cache... but I tend to suspect urban legend there... -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
On Thu, Dec 14, 2006 at 01:39:00 -0500, Jim Nasby <decibel@decibel.org> wrote:
> On Dec 11, 2006, at 12:54 PM, Bruno Wolff III wrote:
> >
> > This appears to be changing under Linux. Recent kernels have write
> > barriers implemented using cache flush commands (which some drives
> > ignore, so you need to be careful). In very recent kernels, software
> > raid using raid 1 will also handle write barriers. To get this
> > feature, you are supposed to mount ext3 file systems with the
> > barrier=1 option. For other file systems, the parameter may need to
> > be different.
>
> But would that actually provide a meaningful benefit? When you
> COMMIT, the WAL data must hit non-volatile storage of some kind,
> which without a BBU or something similar, means hitting the platter.
> So I don't see how enabling the disk cache will help, unless of
> course it's ignoring fsync.

When you do an fsync, the OS sends a cache flush command to the drive, which on most drives (though supposedly there are ones that ignore this command) doesn't complete until all of the cached pages have been written to the platter; the fsync doesn't return until that flush is complete. While this writes more sectors than you really need, it is safe. And it allows for caching to speed up some things (though not as much as having queued commands would).

I have done some tests on my systems and the speeds I am getting make it clear that write barriers slow things down to about the same range as having caches disabled. So I believe that it is likely working as advertised.

Note the use case for this is more for hobbyists or development boxes. You can only use it on software raid (md) 1, which rules out most "real" systems.
Bruno Wolff III wrote: > On Thu, Dec 14, 2006 at 01:39:00 -0500, > Jim Nasby <decibel@decibel.org> wrote: >> On Dec 11, 2006, at 12:54 PM, Bruno Wolff III wrote: >>> This appears to be changing under Linux. Recent kernels have write >>> barriers implemented using cache flush commands (which >>> some drives ignore, so you need to be careful). Is it true that some drives ignore this; or is it mostly an urban legend that was started by testers that didn't have kernels with write barrier support. I'd be especially interested in knowing if there are any currently available drives which ignore those commands. >>> In very recent kernels, software raid using raid 1 will also >>> handle write barriers. To get this feature, you are supposed to >>> mount ext3 file systems with the barrier=1 option. For other file >>> systems, the parameter may need to be different. With XFS the default is apparently to enable write barrier support unless you explicitly disable it with the nobarrier mount option. It also will warn you in the system log if the underlying device doesn't have write barrier support. SGI recommends that you use the "nobarrier" mount option if you do have a persistent (battery backed) write cache on your raid device. http://oss.sgi.com/projects/xfs/faq.html#wcache >> But would that actually provide a meaningful benefit? When you >> COMMIT, the WAL data must hit non-volatile storage of some kind, >> which without a BBU or something similar, means hitting the platter. >> So I don't see how enabling the disk cache will help, unless of >> course it's ignoring fsync. With write barriers, fsync() waits for the physical disk; but I believe the background writes from write() done by pdflush don't have to; so it's kinda like only disabling the cache for WAL files and the filesystem's journal, but having it enabled for the rest of your write activity (the tables except at checkpoints? the log file?). > Note the use case for this is more for hobbiests or development boxes. You can > only use it on software raid (md) 1, which rules out most "real" systems. > Ugh. Looking for where that's documented; and hoping it is or will soon work on software 1+0 as well.
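For XFS the knobs work the other way around, as Ron describes: barriers are on by default, and SGI's FAQ suggests nobarrier only when a persistent (battery-backed) controller cache is present. A sketch, with an invented mount point:

    # Battery-backed controller cache: turn XFS barriers off explicitly.
    mount -o remount,nobarrier /data

    # Without the BBU, leave the default alone and just confirm which
    # options are actually in effect:
    grep ' xfs ' /proc/mounts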
The reply wasn't (directly) copied to the performance list, but I will copy this one back.

On Thu, Dec 14, 2006 at 13:21:11 -0800, Ron Mayer <rm_pg@cheapcomplexdevices.com> wrote:
> [...]
On Thu, Dec 14, 2006 at 13:21:11 -0800, Ron Mayer <rm_pg@cheapcomplexdevices.com> wrote:
> Bruno Wolff III wrote:
> > On Thu, Dec 14, 2006 at 01:39:00 -0500,
> > Jim Nasby <decibel@decibel.org> wrote:
> >> On Dec 11, 2006, at 12:54 PM, Bruno Wolff III wrote:
> >>> This appears to be changing under Linux. Recent kernels have write
> >>> barriers implemented using cache flush commands (which
> >>> some drives ignore, so you need to be careful).
>
> Is it true that some drives ignore this; or is it mostly
> an urban legend that was started by testers that didn't
> have kernels with write barrier support. I'd be especially
> interested in knowing if there are any currently available
> drives which ignore those commands.

I saw posts claiming this, but no specific drives mentioned. I did see one post that claimed that the cache flush command was mandated (not optional) by the spec.

> >>> In very recent kernels, software raid using raid 1 will also
> >>> handle write barriers. To get this feature, you are supposed to
> >>> mount ext3 file systems with the barrier=1 option. For other file
> >>> systems, the parameter may need to be different.
>
> With XFS the default is apparently to enable write barrier
> support unless you explicitly disable it with the nobarrier mount option.
> It also will warn you in the system log if the underlying device
> doesn't have write barrier support.

I think there might be a similar patch for ext3 going into 2.6.19. I haven't checked a 2.6.19 kernel to make sure though.

> SGI recommends that you use the "nobarrier" mount option if you do
> have a persistent (battery backed) write cache on your raid device.
>
> http://oss.sgi.com/projects/xfs/faq.html#wcache
>
> >> But would that actually provide a meaningful benefit? When you
> >> COMMIT, the WAL data must hit non-volatile storage of some kind,
> >> which without a BBU or something similar, means hitting the platter.
> >> So I don't see how enabling the disk cache will help, unless of
> >> course it's ignoring fsync.
>
> With write barriers, fsync() waits for the physical disk; but I believe
> the background writes from write() done by pdflush don't have to; so
> it's kinda like only disabling the cache for WAL files and the filesystem's
> journal, but having it enabled for the rest of your write activity (the
> tables except at checkpoints? the log file?).

Not exactly. Whenever you commit the file system log or fsync the WAL file, all previously written blocks will be flushed to the disk platter before any new write requests are honored. So journalling semantics will work properly.

> > Note the use case for this is more for hobbyists or development boxes. You can
> > only use it on software raid (md) 1, which rules out most "real" systems.
> >
> Ugh. Looking for where that's documented; and hoping it is or will soon
> work on software 1+0 as well.

I saw a comment somewhere that raid 0 provided some problems and the suggestion was to handle the barrier at a different level (though I don't know how you could). So I don't believe 1+0 or 5 are currently supported or will be in the near term.

The other feature I would like is to be able to use write barriers with encrypted file systems. I haven't found anything on whether or not there are near term plans by anyone to support that.
On Fri, Dec 15, 2006 at 10:34:15 -0600, Bruno Wolff III <bruno@wolff.to> wrote: > The reply wasn't (directly copied to the performance list, but I will > copy this one back. Sorry about this one, I meant to intersperse my replies and hit the 'y' key at the wrong time. (And there ended up being a copy on performance anyway from the news gateway.)
On Fri, Dec 15, 2006 at 10:44:39 -0600, Bruno Wolff III <bruno@wolff.to> wrote: > > The other feature I would like is to be able to use write barriers with > encrypted file systems. I haven't found anythign on whether or not there > are near term plans by any one to support that. I asked about this on the dm-crypt list and was told that write barriers work pre 2.6.19. There was a change for 2.6.19 that might break things for SMP systems. But that will probably get fixed eventually.