Thread: File Systems Compared

File Systems Compared

From
Brian Wipf
Date:
All tests are with bonnie++ 1.03a

Main components of system:
16 WD Raptor 150GB 10000 RPM drives all in a RAID 10
ARECA 1280 PCI-Express RAID adapter with 1GB BB Cache (Thanks for the
recommendation, Ron!)
32 GB RAM
Dual Intel 5160 Xeon Woodcrest 3.0 GHz processors
OS: SUSE Linux 10.1

All runs are with the write cache disabled on the hard disks, except
for one additional test for xfs where it was enabled. I tested with
ordered and writeback journaling modes for ext3 to see if writeback
journaling would help over the default of ordered. The 1GB of battery
backed cache on the RAID card was enabled for all tests as well.
Tests are in order of increasing random seek performance. In my tests
on this hardware, xfs is the decisive winner, beating all of the
other file systems in performance on every single metric. 658 random
seeks per second, 433 MB/sec sequential read, and 350 MB/sec
sequential write seems decent enough, but not as high as numbers
other people have suggested are attainable with a 16 disk RAID 10.
350 MB/sec sequential write with disk caches enabled versus 280 MB/
sec sequential write with disk caches disabled sure makes enabling
the disk write cache tempting. Anyone run their RAIDs with disk
caches enabled, or is this akin to having fsync off?
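
For reference, on SATA drives the OS can see directly, the on-disk write
cache can be queried and toggled with hdparm; something like the following
(the device name is only an example, and drives sitting behind the Areca
controller generally have to be configured through the controller's own
firmware or CLI instead):

hdparm -W /dev/sda      # show the current write-caching setting
hdparm -W 0 /dev/sda    # disable the on-disk write cache
hdparm -W 1 /dev/sda    # re-enable it

The ext3 data journaling mode is likewise just a mount option, e.g.
mount -o data=writeback /dev/sda1 /mnt/test selects writeback instead of
the default data=ordered.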

ext3 (writeback data journaling mode):
/usr/local/sbin/bonnie++ -d bonnie -s 64368:8k
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
hulk4        64368M 78625  91 279921  51 112346  13 89463  96 417695  22 545.7   0
                     ------Sequential Create------ --------Random Create--------
                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
                files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                   16  5903  99 +++++ +++ +++++ +++  6112  99 +++++ +++ 18620 100
hulk4,64368M,78625,91,279921,51,112346,13,89463,96,417695,22,545.7,0,16,5903,99,+++++,+++,+++++,+++,6112,99,+++++,+++,18620,100

ext3 (ordered data journaling mode):
/usr/local/sbin/bonnie++ -d bonnie -s 64368:8k
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
hulk4        64368M 74902  89 250274  52 123637  16 88992  96 417222  23 548.3   0
                     ------Sequential Create------ --------Random Create--------
                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
                files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                   16  5941  97 +++++ +++ +++++ +++  6270  99 +++++ +++ 18670  99
hulk4,64368M,74902,89,250274,52,123637,16,88992,96,417222,23,548.3,0,16,5941,97,+++++,+++,+++++,+++,6270,99,+++++,+++,18670,99


reiserfs:
/usr/local/sbin/bonnie++ -d bonnie -s 64368:8k
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
hulk4        64368M 81004  99 269191  50 128322  16 87865  96 407035  28 550.3   0
                     ------Sequential Create------ --------Random Create--------
                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
                files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                   16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
hulk4,64368M,81004,99,269191,50,128322,16,87865,96,407035,28,550.3,0,16,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++

jfs:
/usr/local/sbin/bonnie++ -d bonnie/ -s 64368:8k
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
hulk4        64368M 73246  80 268886  28 110465   9 89516  96 413897  21 639.5   0
                     ------Sequential Create------ --------Random Create--------
                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
                files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                   16  3756   5 +++++ +++ +++++ +++ 23763  90 +++++ +++ 22371  70
hulk4,64368M,73246,80,268886,28,110465,9,89516,96,413897,21,639.5,0,16,3756,5,+++++,+++,+++++,+++,23763,90,+++++,+++,22371,70

xfs (with write cache disabled on disks):
/usr/local/sbin/bonnie++ -d bonnie/ -s 64368:8k
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
hulk4        64368M 90621  99 283916  35 105871  11 88569  97 433890  23 644.5   0
                     ------Sequential Create------ --------Random Create--------
                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
                files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                   16 28435  95 +++++ +++ 28895  82 28523  91 +++++ +++ 24369  86
hulk4,64368M,90621,99,283916,35,105871,11,88569,97,433890,23,644.5,0,16,28435,95,+++++,+++,28895,82,28523,91,+++++,+++,24369,86

xfs (with write cache enabled on disks):
/usr/local/sbin/bonnie++ -d bonnie -s 64368:8k
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
hulk4        64368M 90861  99 348401  43 131887  14 89412  97 432964  23 658.7   0
                     ------Sequential Create------ --------Random Create--------
                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
                files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                   16 28871  90 +++++ +++ 28923  91 30879  93 +++++ +++ 28012  94
hulk4,64368M,90861,99,348401,43,131887,14,89412,97,432964,23,658.7,0,16,28871,90,+++++,+++,28923,91,30879,93,+++++,+++,28012,94




Re: File Systems Compared

From
Brian Hurt
Date:
Brian Wipf wrote:

> All tests are with bonnie++ 1.03a

Thanks for posting these tests.  Now I have actual numbers to beat our
storage server provider about the head and shoulders with.  Also, I
found them interesting in and of themselves.

These numbers are close enough to bus-saturation rates that I'd strongly
advise new people setting up systems to go this route over spending
money on some fancy storage area network solution, unless you need more
HD space than fits nicely in one of these raids.  If reliability is a
concern, buy 2 servers and implement Slony for failover.

Brian


Re: File Systems Compared

From
Alexander Staubo
Date:
On Dec 6, 2006, at 16:40 , Brian Wipf wrote:

> All tests are with bonnie++ 1.03a
[snip]

Care to post these numbers *without* word wrapping? Thanks.

Alexander.

Re: File Systems Compared

From
"Luke Lonergan"
Date:
Brian,

On 12/6/06 8:02 AM, "Brian Hurt" <bhurt@janestcapital.com> wrote:

> These numbers are close enough to bus-saturation rates

PCI-X is 1GB/s+ and the memory architecture is 20GB/s+, though each CPU is
likely to obtain only 2-3GB/s.

We routinely achieve 1GB/s I/O rate on two 3Ware adapters and 2GB/s on the
Sun X4500 with ZFS.

> advise new people setting up systems to go this route over spending
> money on some fancy storage area network solution

People buy SANs for interesting reasons, some of them having to do with the
manageability features of high end SANs.  I've heard it said in those cases
that "performance doesn't matter much".

As you suggest, database replication provides one of those features, and
Solaris ZFS has many of the data management features found in high end SANs.
Perhaps we can get the best of both?

In the end, I think SAN vs. server storage is a religious battle.

- Luke



Re: File Systems Compared

From
Markus Schiltknecht
Date:
Hi,

Alexander Staubo wrote:
> Care to post these numbers *without* word wrapping? Thanks.

How is one supposed to do that? Care to give an example?

Markus


Re: File Systems Compared

From
"Joshua D. Drake"
Date:
> As you suggest, database replication provides one of those features, and
> Solaris ZFS has many of the data management features found in high end SANs.
> Perhaps we can get the best of both?
>
> In the end, I think SAN vs. server storage is a religious battle.

I agree. I have many people that want to purchase a SAN because someone
told them that is what they need... Yet they can spend 20% of the cost
on two external arrays and get incredible performance...

We are seeing great numbers from the following config:

(2) HP MSA30s (loaded), dual bus
(2) HP 6402s, one connected to each MSA.

The performance for the money is incredible.

Sincerely,

Joshua D. Drake



>
> - Luke
--

      === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
             http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate




Re: File Systems Compared

From
Brian Hurt
Date:
Luke Lonergan wrote:
> Brian,
>
> On 12/6/06 8:02 AM, "Brian Hurt" <bhurt@janestcapital.com> wrote:
>> These numbers are close enough to bus-saturation rates
>
> PCIX is 1GB/s + and the memory architecture is 20GB/s+, though each CPU is
> likely to obtain only 2-3GB/s.
>
> We routinely achieve 1GB/s I/O rate on two 3Ware adapters and 2GB/s on the
> Sun X4500 with ZFS.

For some reason I'd got it stuck in my head that PCI-Express maxed out at a theoretical 533 MByte/sec, at which point getting 480 MByte/sec across it is pretty dang good.  But actually looking things up, I see that PCI-Express has a theoretical 8 Gbit/sec, or about 800 MByte/sec.  It's PCI-X that's 533 MByte/sec.  So there's still some headroom available there.

Brian

Re: File Systems Compared

From
"Steinar H. Gunderson"
Date:
On Wed, Dec 06, 2006 at 05:31:01PM +0100, Markus Schiltknecht wrote:
>> Care to post these numbers *without* word wrapping? Thanks.
> How is one supposed to do that? Care giving an example?

This is a rather long sentence without any kind of word wrapping except what would be imposed on your own side -- how to set that up properly depends on the sending e-mail client, but in mine it's just a matter of turning off the word wrapping in your editor :-)

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: File Systems Compared

From
Florian Weimer
Date:
* Brian Wipf:

> Anyone run their RAIDs with disk caches enabled, or is this akin to
> having fsync off?

If your cache is backed by a battery, enabling write cache shouldn't
be a problem.  You can check if the whole thing is working well by
running this test script: <http://brad.livejournal.com/2116715.html>

Enabling the write cache without battery backing leads to various degrees
of data corruption in case of a power outage (possibly including file
system corruption requiring manual recovery).

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99

Re: File Systems Compared

From
Mark Lewis
Date:
> Anyone run their RAIDs with disk caches enabled, or is this akin to
> having fsync off?

Disk write caches are basically always akin to having fsync off.  The
only time a write-cache is (more or less) safe to enable is when it is
backed by a battery or in some other way made non-volatile.

So a RAID controller with a battery-backed write cache can enable its
own write cache, but can't safely enable the write-caches on the disk
drives it manages.

-- Mark Lewis

Re: File Systems Compared

From
"Luke Lonergan"
Date:
Brian,

On 12/6/06 8:40 AM, "Brian Hurt" <bhurt@janestcapital.com> wrote:

> But actually looking things up, I see that PCI-Express has a theoretical 8
> Gbit/sec, or about 800Mbyte/sec. It's PCI-X that's 533 MByte/sec.  So there's
> still some headroom available there.

See here for the official specifications of both:
  http://www.pcisig.com/specifications/pcix_20/

Note that PCI-X version 1.0 at 133MHz runs at 1GB/s.  It's a parallel bus,
64 bits wide (8 bytes) and runs at 133MHz, so 8 x 133 ~= 1 gigabyte/second.

PCI Express with 16 lanes (PCIe x16) can transfer data at 4GB/s.  The Arecas
use PCIe x8 (see here:
http://www.areca.com.tw/products/html/pcie-sata.htm), so they can do 2GB/s.
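
Spelling the arithmetic out (the PCIe figures are for the 1.x generation
used on these cards):

  PCI-X 1.0:  64 bits x 133 MHz = 8 bytes x 133 MHz ~= 1 GB/s, shared by
              everything on the bus
  PCIe 1.x:   2.5 Gbit/s per lane, 8b/10b encoded ~= 250 MB/s per lane
              per direction, so x8 ~= 2 GB/s and x16 ~= 4 GB/s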

- Luke



Re: File Systems Compared

From
Markus Schiltknecht
Date:
Hi,

Steinar H. Gunderson wrote:
> This is a rather long sentence without any kind of word wrapping except what would be imposed on your own side -- how to set that up properly depends on the sending e-mail client, but in mine it's just a matter of turning off the word wrapping in your editor :-)

Duh!

Cool, thank you for the example :-)  I thought the MTA or at least the mailing list would wrap mails at some limit.
I've now set word-wrap to 9999 characters (it seems not possible to turn it off completely in Thunderbird). But when
writing, I'm now getting one long line.

What's common practice? What's it on the pgsql mailing lists?

Regards

Markus


Re: File Systems Compared

From
Arnaud Lesauvage
Date:
Markus Schiltknecht wrote:
> What's common practice? What's it on the pgsql mailing lists?

Netiquette usually advises mailers to wrap after 72 characters
on mailing lists.
This does not apply to format=flowed, I guess (that's the format
used in Steinar's message).

[offtopic] Word wrapping

From
"Steinar H. Gunderson"
Date:
On Wed, Dec 06, 2006 at 06:45:56PM +0100, Markus Schiltknecht wrote:
> Cool, thank you for the example :-)  I thought the MTA or at least the the
> mailing list would wrap mails at some limit. I've now set word-wrap to 9999
> characters (it seems not possible to turn it off completely in
> thunderbird). But when writing, I'm now getting one long line.

Thunderbird uses format=flowed, so it's wrapped nevertheless. Google to find
out how to turn it off if you really need to.

> What's common practice?

Usually 72 or 76 characters, TTBOMK -- but when posting tables or big query
plans, one should simply turn it off, as it kills readability.

> What's it on the pgsql mailing lists?

No idea. :-)

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: File Systems Compared

From
Michael Stone
Date:
On Wed, Dec 06, 2006 at 06:59:12PM +0100, Arnaud Lesauvage wrote:
>Markus Schiltknecht wrote:
>>What's common practice? What's it on the pgsql mailing lists?
>
>The netiquette usually advise mailers to wrap after 72 characters
>on mailing lists.
>This does not apply for format=flowed I guess (that's the format
>used in Steinar's message).

It would apply to either; format=flowed can be wrapped at the receiver's
end, but still be formatted to a particular column for readers that
don't understand format=flowed. (Which is likely to be many, since
that's a standard that never really took off.) No wrap netiquette
applies to formatted text blocks which are unreadable if wrapped (such
as bonnie or EXPLAIN output).

Mike Stone

Re: [offtopic] File Systems Compared

From
Brian Wipf
Date:
On 6-Dec-06, at 9:05 AM, Alexander Staubo wrote:
>> All tests are with bonnie++ 1.03a
> [snip]
> Care to post these numbers *without* word wrapping? Thanks.

That's what Bonnie++'s output looks like. If you have Bonnie++
installed, you can run the following:

bon_csv2html << EOF
hulk4,64368M,78625,91,279921,51,112346,13,89463,96,417695,22,545.7,0,16,5903,99,+++++,+++,+++++,+++,6112,99,+++++,+++,18620,100
EOF

Which will prettify the CSV results using HTML.
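
If you want to skip the cut and paste, bonnie++ can also emit just the CSV
so it can be piped straight through; if I remember the flag right, -q sends
the human-readable report to stderr and leaves only the CSV on stdout:

/usr/local/sbin/bonnie++ -d bonnie -s 64368:8k -q | bon_csv2html > results.html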


Re: File Systems Compared

From
"Merlin Moncure"
Date:
On 12/6/06, Luke Lonergan <llonergan@greenplum.com> wrote:
> People buy SANs for interesting reasons, some of them having to do with the
> manageability features of high end SANs.  I've heard it said in those cases
> that "performance doesn't matter much".

There is movement in the industry right now away from tape systems to
managed disk storage for backups and data retention.  In these cases
performance requirements are not very high -- and a single server can
manage a huge amount of storage.  In theory, you can do the same thing
attached via SAS expanders, but FC networking is imo more flexible and
scalable.

The manageability features of SANs are a mixed bag and decidedly
overrated, but they have their place, imo.

merlin

Re: File Systems Compared

From
Brian Hurt
Date:
Luke Lonergan wrote:
> Brian,
>
> On 12/6/06 8:40 AM, "Brian Hurt" <bhurt@janestcapital.com> wrote:
>> But actually looking things up, I see that PCI-Express has a theoretical 8
>> Gbit/sec, or about 800Mbyte/sec. It's PCI-X that's 533 MByte/sec.  So there's
>> still some headroom available there.
>
> See here for the official specifications of both:
>   http://www.pcisig.com/specifications/pcix_20/
>
> Note that PCI-X version 1.0 at 133MHz runs at 1GB/s.  It's a parallel bus,
> 64 bits wide (8 bytes) and runs at 133MHz, so 8 x 133 ~= 1 gigabyte/second.
>
> PCI Express with 16 lanes (PCIe x16) can transfer data at 4GB/s.  The Arecas
> use (PCIe x8, see here:
> http://www.areca.com.tw/products/html/pcie-sata.htm), so they can do 2GB/s.
>
> - Luke

Thanks.  I stand corrected (again).

Brian

Re: File Systems Compared

From
Bruno Wolff III
Date:
On Wed, Dec 06, 2006 at 18:45:56 +0100,
  Markus Schiltknecht <markus@bluegap.ch> wrote:
>
> Cool, thank you for the example :-)  I thought the MTA or at least the the
> mailing list would wrap mails at some limit. I've now set word-wrap to 9999
> characters (it seems not possible to turn it off completely in
> thunderbird). But when writing, I'm now getting one long line.
>
> What's common practice? What's it on the pgsql mailing lists?

If you do this you should set format=flowed (see rfc 2646). If you do that,
then clients can break the lines in an appropiate way. This is actually
better than fixing the line width in the original message, since the
recipient may not have the same number of characters (or pixels) of display
as the sender.

Re: File Systems Compared

From
Ron
Date:
At 10:40 AM 12/6/2006, Brian Wipf wrote:

All tests are with bonnie++ 1.03a

Main components of system:
16 WD Raptor 150GB 10000 RPM drives all in a RAID 10
ARECA 1280 PCI-Express RAID adapter with 1GB BB Cache (Thanks for the
recommendation, Ron!)
32 GB RAM
Dual Intel 5160 Xeon Woodcrest 3.0 GHz processors
OS: SUSE Linux 10.1

>xfs (with write cache disabled on disks):
>/usr/local/sbin/bonnie++ -d bonnie/ -s 64368:8k
>Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
>                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
>Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
>hulk4        64368M 90621  99 283916  35 105871  11 88569  97 433890  23 644.5   0
>                     ------Sequential Create------ --------Random Create--------
>                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                  16 28435  95 +++++ +++ 28895  82 28523  91 +++++ +++ 24369  86
>hulk4,64368M,90621,99,283916,35,105871,11,88569,97,433890,23,644.5,0,16,28435,95,+++++,+++,28895,82,28523,91,+++++,+++,24369,86
>
>xfs (with write cache enabled on disks):
>/usr/local/sbin/bonnie++ -d bonnie -s 64368:8k
>Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
>                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
>Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
>hulk4        64368M 90861  99 348401  43 131887  14 89412  97 432964  23 658.7   0
>                     ------Sequential Create------ --------Random Create--------
>                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                  16 28871  90 +++++ +++ 28923  91 30879  93 +++++ +++ 28012  94
>hulk4,64368M,90861,99,348401,43,131887,14,89412,97,432964,23,658.7,0,16,28871,90,+++++,+++,28923,91,30879,93,+++++,+++,28012,94
Hmmm.   Something is not right.  With a 16 HD RAID 10 based on 10K
rpm HDs, you should be seeing higher absolute performance numbers.

Find out what HW the Areca guys and Tweakers guys used to test the 1280s.
At LW2006, Areca was demonstrating all-in-cache reads and writes of
~1600MBps and ~1300MBps respectively along with RAID 0 Sustained
Rates of ~900MBps read, and ~850MBps write.

Luke, I know you've managed to get higher IO rates than this with
this class of HW.  Is there an OS or SW config issue Brian should
closely investigate?

Ron Peacetree


Re: File Systems Compared

From
Brian Wipf
Date:
> Hmmm.   Something is not right.  With a 16 HD RAID 10 based on 10K
> rpm HDs, you should be seeing higher absolute performance numbers.
>
> Find out what HW the Areca guys and Tweakers guys used to test the
> 1280s.
> At LW2006, Areca was demonstrating all-in-cache reads and writes of
> ~1600MBps and ~1300MBps respectively along with RAID 0 Sustained
> Rates of ~900MBps read, and ~850MBps write.
>
> Luke, I know you've managed to get higher IO rates than this with
> this class of HW.  Is there a OS or SW config issue Brian should
> closely investigate?

I wrote 1280 by mistake. It's actually a 1260. Sorry about that.
The IOP341 class of cards weren't available when we ordered the parts
for the box, so we had to go with the 1260. The box(es) we build next
month will either have the 1261ML or 1280 depending on whether we go
16 or 24 disk.

I noticed Bucky got almost 800 random seeks per second on her Dell
PowerEdge 2950 with six 10,000 RPM SAS drives. The random seek performance
of this box disappointed me the most. Even running 2 concurrent
bonnies, the random seek performance only increased from 644 seeks/sec
to 813 seeks/sec. Maybe there is some setting I'm missing? This
card looked pretty impressive on tweakers.net.
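
For anyone wanting to reproduce the 2-bonnie run, it amounts to starting
two instances against separate scratch directories and waiting for both,
along these lines:

mkdir -p bonnie1 bonnie2
/usr/local/sbin/bonnie++ -d bonnie1 -s 64368:8k > run1.out 2>&1 &
/usr/local/sbin/bonnie++ -d bonnie2 -s 64368:8k > run2.out 2>&1 &
wait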


Re: Areca 1260 Performance (was: File Systems Compared)

From
Brian Wipf
Date:
On 6-Dec-06, at 2:47 PM, Brian Wipf wrote:

>> Hmmm.   Something is not right.  With a 16 HD RAID 10 based on 10K
>> rpm HDs, you should be seeing higher absolute performance numbers.
>>
>> Find out what HW the Areca guys and Tweakers guys used to test the
>> 1280s.
>> At LW2006, Areca was demonstrating all-in-cache reads and writes
>> of ~1600MBps and ~1300MBps respectively along with RAID 0
>> Sustained Rates of ~900MBps read, and ~850MBps write.
>>
>> Luke, I know you've managed to get higher IO rates than this with
>> this class of HW.  Is there a OS or SW config issue Brian should
>> closely investigate?
>
> I wrote 1280 by a mistake. It's actually a 1260. Sorry about that.
> The IOP341 class of cards weren't available when we ordered the
> parts for the box, so we had to go with the 1260. The box(es) we
> build next month will either have the 1261ML or 1280 depending on
> whether we go 16 or 24 disk.
>
> I noticed Bucky got almost 800 random seeks per second on her 6
> disk 10000 RPM SAS drive Dell PowerEdge 2950. The random seek
> performance of this box disappointed me the most. Even running 2
> concurrent bonnies, the random seek performance only increased from
> 644 seeks/sec to 813 seeks/sec. Maybe there is some setting I'm
> missing? This card looked pretty impressive on tweakers.net.

Areca has some performance numbers in a downloadable PDF for the
Areca ARC-1120, which is in the same class as the ARC-1260, except
with 8 ports. With all 8 drives in a RAID 0 the card gets the
following performance numbers:

Card         single thread write    20 thread write    single thread read    20 thread read
ARC-1120     321.26 MB/s            404.76 MB/s        412.55 MB/s           672.45 MB/s

My numbers for sequential i/o for the ARC-1260 in a 16 disk RAID 10
are slightly better than the ARC-1120 in an 8 disk RAID 0 for a
single thread. I guess this means my numbers are reasonable.


Re: Areca 1260 Performance (was: File Systems Compared)

From
Ron
Date:
The 1100 series is PCI-X based.  The 1200 series is PCI-E x8
based.  Apples and oranges.

I still think Luke Lonergan or Josh Berkus may have some interesting
ideas regarding possible OS and SW optimizations.

WD1500ADFDs are each good for ~90MBps read and ~60MBps write ASTR.
That means your 16 HD RAID 10 should be sequentially transferring
~720MBps read and ~480MBps write.
Clearly more HDs will be required to allow an ARC-12xx to attain its
peak performance.

One thing that occurs to me with your present HW is that your CPU
utilization numbers are relatively high.
Since 5160s are clocked about as high as is available, that leaves
trying CPUs with more cores and trying more CPUs.

You've basically got 4 HW threads at the moment.  If you can,
evaluate CPUs and mainboards that allow for 8 or 16 HW threads.
Intel-wise, that's the new Kentsfields.  AMD-wise, you have lots of
4S mainboard options, but the AMD 4C CPUs won't be available until
sometime late in 2007.

I've got other ideas, but this list is not the appropriate venue for
the level of detail required.

Ron Peacetree


At 05:30 PM 12/6/2006, Brian Wipf wrote:
>On 6-Dec-06, at 2:47 PM, Brian Wipf wrote:
>
>>>Hmmm.   Something is not right.  With a 16 HD RAID 10 based on 10K
>>>rpm HDs, you should be seeing higher absolute performance numbers.
>>>
>>>Find out what HW the Areca guys and Tweakers guys used to test the
>>>1280s.
>>>At LW2006, Areca was demonstrating all-in-cache reads and writes
>>>of ~1600MBps and ~1300MBps respectively along with RAID 0
>>>Sustained Rates of ~900MBps read, and ~850MBps write.
>>>
>>>Luke, I know you've managed to get higher IO rates than this with
>>>this class of HW.  Is there a OS or SW config issue Brian should
>>>closely investigate?
>>
>>I wrote 1280 by a mistake. It's actually a 1260. Sorry about that.
>>The IOP341 class of cards weren't available when we ordered the
>>parts for the box, so we had to go with the 1260. The box(es) we
>>build next month will either have the 1261ML or 1280 depending on
>>whether we go 16 or 24 disk.
>>
>>I noticed Bucky got almost 800 random seeks per second on her 6
>>disk 10000 RPM SAS drive Dell PowerEdge 2950. The random seek
>>performance of this box disappointed me the most. Even running 2
>>concurrent bonnies, the random seek performance only increased from
>>644 seeks/sec to 813 seeks/sec. Maybe there is some setting I'm
>>missing? This card looked pretty impressive on tweakers.net.
>
>Areca has some performance numbers in a downloadable PDF for the
>Areca ARC-1120, which is in the same class as the ARC-1260, except
>with 8 ports. With all 8 drives in a RAID 0 the card gets the
>following performance numbers:
>
>Card         single thread write    20 thread write    single thread read    20 thread read
>ARC-1120     321.26 MB/s            404.76 MB/s        412.55 MB/s           672.45 MB/s
>
>My numbers for sequential i/o for the ARC-1260 in a 16 disk RAID 10
>are slightly better than the ARC-1120 in an 8 disk RAID 0 for a
>single thread. I guess this means my numbers are reasonable.


Re: Areca 1260 Performance

From
Brian Wipf
Date:
I appreciate your suggestions, Ron. And that helps answer my question
on processor selection for our next box; I wasn't sure whether the
Kentsfield, with its lower clock speed than the Woodcrest but double
the cores, would be better for us overall or not.

On 6-Dec-06, at 4:25 PM, Ron wrote:

> The 1100 series is PCI-X based.  The 1200 series is PCI-E x8
> based.  Apples and oranges.
>
> I still think Luke Lonergan or Josh Berkus may have some
> interesting ideas regarding possible OS and SW optimizations.
>
> WD1500ADFDs are each good for ~90MBps read and ~60MBps write ASTR.
> That means your 16 HD RAID 10 should be sequentially transferring
> ~720MBps read and ~480MBps write.
> Clearly more HDs will be required to allow a ARC-12xx to attain its
> peak performance.
>
> One thing that occurs to me with your present HW is that your CPU
> utilization numbers are relatively high.
> Since 5160s are clocked about as high as is available, that leaves
> trying CPUs with more cores and trying more CPUs.
>
> You've got basically got 4 HW threads at the moment.  If you can,
> evaluate CPUs and mainboards that allow for 8 or 16 HW threads.
> Intel-wise, that's the new Kentfields.  AMD-wise, you have lot's of
> 4S mainboard options, but the AMD 4C CPUs won't be available until
> sometime late in 2007.
>
> I've got other ideas, but this list is not the appropriate venue
> for the level of detail required.
>
> Ron Peacetree
>
>
> At 05:30 PM 12/6/2006, Brian Wipf wrote:
>> On 6-Dec-06, at 2:47 PM, Brian Wipf wrote:
>>
>>>> Hmmm.   Something is not right.  With a 16 HD RAID 10 based on 10K
>>>> rpm HDs, you should be seeing higher absolute performance numbers.
>>>>
>>>> Find out what HW the Areca guys and Tweakers guys used to test the
>>>> 1280s.
>>>> At LW2006, Areca was demonstrating all-in-cache reads and writes
>>>> of ~1600MBps and ~1300MBps respectively along with RAID 0
>>>> Sustained Rates of ~900MBps read, and ~850MBps write.
>>>>
>>>> Luke, I know you've managed to get higher IO rates than this with
>>>> this class of HW.  Is there a OS or SW config issue Brian should
>>>> closely investigate?
>>>
>>> I wrote 1280 by a mistake. It's actually a 1260. Sorry about that.
>>> The IOP341 class of cards weren't available when we ordered the
>>> parts for the box, so we had to go with the 1260. The box(es) we
>>> build next month will either have the 1261ML or 1280 depending on
>>> whether we go 16 or 24 disk.
>>>
>>> I noticed Bucky got almost 800 random seeks per second on her 6
>>> disk 10000 RPM SAS drive Dell PowerEdge 2950. The random seek
>>> performance of this box disappointed me the most. Even running 2
>>> concurrent bonnies, the random seek performance only increased from
>>> 644 seeks/sec to 813 seeks/sec. Maybe there is some setting I'm
>>> missing? This card looked pretty impressive on tweakers.net.
>>
>> Areca has some performance numbers in a downloadable PDF for the
>> Areca ARC-1120, which is in the same class as the ARC-1260, except
>> with 8 ports. With all 8 drives in a RAID 0 the card gets the
>> following performance numbers:
>>
>> Card         single thread write    20 thread write    single thread read    20 thread read
>> ARC-1120     321.26 MB/s            404.76 MB/s        412.55 MB/s           672.45 MB/s
>>
>> My numbers for sequential i/o for the ARC-1260 in a 16 disk RAID 10
>> are slightly better than the ARC-1120 in an 8 disk RAID 0 for a
>> single thread. I guess this means my numbers are reasonable.
>
>



Re: Areca 1260 Performance

From
Ron
Date:
At 06:40 PM 12/6/2006, Brian Wipf wrote:
>I appreciate your suggestions, Ron. And that helps answer my question
>on processor selection for our next box; I wasn't sure if the lower
>MHz speed of the Kentsfield compared to the Woodcrest but with double
>the cores would be better for us overall or not.
Please do not misunderstand me.  I am not endorsing the use of Kentsfield.
I am recommending =evaluating= Kentsfield.

I am also recommending the evaluation of 2C 4S AMD solutions.

All this stuff is so leading edge that it is far from clear what the
RW performance of DBMS based on these components will be without
extensive testing of =your= app under =your= workload.

One thing that is clear from what you've posted thus far is that you
are going to need more HDs if you want to have any chance of fully
utilizing your Areca HW.

Out of curiosity, where are you geographically?

Hoping I'm being helpful,
Ron



Re: File Systems Compared

From
Greg Smith
Date:
On Wed, 6 Dec 2006, Alexander Staubo wrote:

> Care to post these numbers *without* word wrapping?

Brian's message was sent with format=flowed and therefore it's easy to
re-assemble into original form if your software understands that.  I just
checked with two e-mail clients (Thunderbird and Pine) and all his
bonnie++ results were perfectly readable on both as soon as I made the
display wide enough.  If you had trouble reading it, you might consider
upgrading your mail client to one that understands that standard.
Statistically, though, if you have this problem you're probably using
Outlook and there may not be a useful upgrade path for you.  I know it's
been added to the latest Express version (which even defaults to sending
messages flowed, driving many people crazy), but am not sure if any of the
Office Outlooks know what to do with flowed messages yet.

And those of you pointing people at the RFC's, that's a bit hardcore--the
RFC documents themselves could sure use some better formatting.
https://bugzilla.mozilla.org/attachment.cgi?id=134270&action=view has a
readable introduction to the encoding of flowed messages,
http://mailformat.dan.info/body/linelength.html gives some history to how
we all got into this mess in the first place, and
http://joeclark.org/ffaq.html also has some helpful (albeit out of date in
spots) comments on this subject.

Even if it is correct netiquette to disable word-wrapping for long lines
like bonnie output so they stay readable in flow-impaired clients (there
are certainly two sides with valid points in that debate), you can't expect
that mail composition software is sophisticated enough to allow doing that
for one section while still wrapping the rest of the text correctly.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Areca 1260 Performance

From
Brian Wipf
Date:
On 6-Dec-06, at 5:26 PM, Ron wrote:
> At 06:40 PM 12/6/2006, Brian Wipf wrote:
>> I appreciate your suggestions, Ron. And that helps answer my question
>> on processor selection for our next box; I wasn't sure if the lower
>> MHz speed of the Kentsfield compared to the Woodcrest but with double
>> the cores would be better for us overall or not.
> Please do not misunderstand me.  I am not endorsing the use of
> Kentsfield.
> I am recommending =evaluating= Kentsfield.
>
> I am also recommending the evaluation of 2C 4S AMD solutions.
>
> All this stuff is so leading edge that it is far from clear what
> the RW performance of DBMS based on these components will be
> without extensive testing of =your= app under =your= workload.
I want the best performance for the dollar, so I can't rule anything
out. Right now I'm leaning towards Kentsfield, but I will do some
more research before I make a decision. We probably won't wait much
past January though.

> One thing that is clear from what you've posted thus far is that
> you are going to needmore HDs if you want to have any chance of
> fully utilizing your Areca HW.
Do you know off hand where I might find a chassis that can fit 24[+]
drives? The last chassis we ordered was through Supermicro, and the
largest they carry fits 16 drives.

> Hoping I'm being helpful
I appreciate any help I can get.

Brian Wipf
<brian@clickspace.com>


Re: Areca 1260 Performance

From
Ron
Date:
At 03:37 AM 12/7/2006, Brian Wipf wrote:
>On 6-Dec-06, at 5:26 PM, Ron wrote:
>>
>>All this stuff is so leading edge that it is far from clear what
>>the RW performance of DBMS based on these components will be
>>without extensive testing of =your= app under =your= workload.
>I want the best performance for the dollar, so I can't rule anything
>out. Right now I'm leaning towards Kentsfield, but I will do some
>more research before I make a decision. We probably won't wait much
>past January though.
Kentsfield's outrageously high pricing and operating costs (power and
cooling) are not likely to make it the cost/performance winner.

OTOH,
1= ATM it is the way to throw the most cache per socket at a DBMS
within the Core2 CPU line (Tulsa has even more at 16MB per CPU).
2= SSSE3 and other Core2 optimizations have led to some impressive
performance numbers- unless raw clock rate is the thing that can help
you the most.

If what you need for highest performance is the absolute highest
clock rate or most cache per core, then bench some Intel Tulsa's.

Apps with memory footprints too large for on die or in socket caches
or that require extreme memory subsystem performance are still best
served by AMD CPUs.

If you are getting the impression that it is presently complicated
deciding which CPU is best for any specific pg app, then I am making
the impression I intend to.


>>One thing that is clear from what you've posted thus far is that
>>you are going to needmore HDs if you want to have any chance of
>>fully utilizing your Areca HW.
>Do you know off hand where I might find a chassis that can fit 24[+]
>drives? The last chassis we ordered was through Supermicro, and the
>largest they carry fits 16 drives.
www.pogolinux.com has 24 and 48 bay 3.5" HD chassis, and a 64 bay
2.5" chassis.  Tell them I sent you.

www.impediment.com are folks I trust regarding all things storage
(and RAM).  Again, tell them I sent you.

www.aberdeeninc.com is also a vendor I've had luck with, but try Pogo
and Impediment first.


Good luck and please post what happens,
Ron Peacetree



Re: Areca 1260 Performance

From
Shane Ambler
Date:
>> One thing that is clear from what you've posted thus far is that you
>> are going to needmore HDs if you want to have any chance of fully
>> utilizing your Areca HW.
> Do you know off hand where I might find a chassis that can fit 24[+]
> drives? The last chassis we ordered was through Supermicro, and the
> largest they carry fits 16 drives.

Chenbro has a 24 drive case - the largest I have seen. It fits the big
4/8 cpu boards as well.

http://www.chenbro.com/corporatesite/products_01features.php?serno=43


--

Shane Ambler
pgSQL@007Marketing.com

Get Sheeky @ http://Sheeky.Biz

Re: Areca 1260 Performance

From
Gene
Date:
I'm building a SuperServer 6035B server (16 SCSI drives). My schema has basically two large tables (a million+ rows per day each), which are partitioned daily and queried independently of each other. Would you recommend a RAID 1 system partition and 14 drives in a RAID 10, or should I create separate partitions/tablespaces for the two large tables and indexes?

Thanks
Gene

On 12/7/06, Shane Ambler <pgsql@007marketing.com> wrote:

>> One thing that is clear from what you've posted thus far is that you
>> are going to needmore HDs if you want to have any chance of fully
>> utilizing your Areca HW.
> Do you know off hand where I might find a chassis that can fit 24[+]
> drives? The last chassis we ordered was through Supermicro, and the
> largest they carry fits 16 drives.

Chenbro has a 24 drive case - the largest I have seen. It fits the big
4/8 cpu boards as well.

http://www.chenbro.com/corporatesite/products_01features.php?serno=43


--

Shane Ambler
pgSQL@007Marketing.com

Get Sheeky @ http://Sheeky.Biz




--
Gene Hart
cell: 443-604-2679

Re: File Systems Compared

From
"Merlin Moncure"
Date:
On 12/6/06, Brian Wipf <brian@clickspace.com> wrote:
> > Hmmm.   Something is not right.  With a 16 HD RAID 10 based on 10K
> > rpm HDs, you should be seeing higher absolute performance numbers.
> >
> > Find out what HW the Areca guys and Tweakers guys used to test the
> > 1280s.
> > At LW2006, Areca was demonstrating all-in-cache reads and writes of
> > ~1600MBps and ~1300MBps respectively along with RAID 0 Sustained
> > Rates of ~900MBps read, and ~850MBps write.
> >
> > Luke, I know you've managed to get higher IO rates than this with
> > this class of HW.  Is there a OS or SW config issue Brian should
> > closely investigate?
>
> I wrote 1280 by a mistake. It's actually a 1260. Sorry about that.
> The IOP341 class of cards weren't available when we ordered the parts
> for the box, so we had to go with the 1260. The box(es) we build next
> month will either have the 1261ML or 1280 depending on whether we go
> 16 or 24 disk.
>
> I noticed Bucky got almost 800 random seeks per second on her 6 disk
> 10000 RPM SAS drive Dell PowerEdge 2950. The random seek performance
> of this box disappointed me the most. Even running 2 concurrent
> bonnies, the random seek performance only increased from 644 seeks/
> sec to 813 seeks/sec. Maybe there is some setting I'm missing? This
> card looked pretty impressive on tweakers.net.

I've been looking a lot at the SAS enclosures lately and am starting
to feel like that's the way to go.  Performance is amazing and the
flexibility of choosing low cost SATA or high speed SAS drives is
great.  Not only that, but more and more SAS is coming out in 2.5"
drives, which seems to be a better fit for databases...more spindles.
With a 2.5" drive enclosure they can stuff 10 hot swap drives into a
1U enclosure...that's pretty amazing.

One downside of SAS is that most of the HBAs are PCI-Express only, which
can limit your options unless your server is very new.  Also, you don't
want to skimp on the HBA; get the best available, which looks to be
LSI Logic at the moment (the Dell PERC 5/E is an LSI Logic controller, as
is the Intel SAS HBA)...others?

merlin

Re: Areca 1260 Performance

From
Ron
Date:
At 11:02 AM 12/7/2006, Gene wrote:
>I'm building a SuperServer 6035B server (16 scsi drives). My schema
>has basically two large tables (million+ per day) each which are
>partitioned daily, and queried independently of each other. Would
>you recommend a raid1 system partition and 14 drives in a raid 10 or
>should i create separate partitions/tablespaces for the two large
>tables and indexes?
Not an easy question to answer w/o knowing more about your actual
queries and workload.

To keep the math simple, let's assume each SCSI HD has an ASTR of
75MBps.  A 14 HD RAID 10 therefore has an ASTR of 7 * 75 = 525MBps.  If
the rest of your system can handle this much or more bandwidth, then
this is most probably the best config.

Dedicating spindles to specific tables is usually best done when
there is HD bandwidth that can't be utilized if the HDs are in a
larger set +and+ there is a significant hot spot that can use
dedicated resources.

My first attempt would be to use other internal HDs for a RAID 1
system volume and use all 16 of your HBA HDs for a 16 HD RAID 10 array.
Then I'd bench the config to see if it had acceptable performance.

If yes, stop.  Else start considering the more complicated alternatives.

Remember that adding HDs and RAM is far cheaper than even a few hours
of skilled technical labor.

Ron Peacetree


Re: File Systems Compared

From
Bruno Wolff III
Date:
On Wed, Dec 06, 2006 at 08:55:14 -0800,
  Mark Lewis <mark.lewis@mir3.com> wrote:
> > Anyone run their RAIDs with disk caches enabled, or is this akin to
> > having fsync off?
>
> Disk write caches are basically always akin to having fsync off.  The
> only time a write-cache is (more or less) safe to enable is when it is
> backed by a battery or in some other way made non-volatile.
>
> So a RAID controller with a battery-backed write cache can enable its
> own write cache, but can't safely enable the write-caches on the disk
> drives it manages.

This appears to be changing under Linux. Recent kernels have write barriers
implemented using cache flush commands (which some drives ignore, so you
need to be careful). In very recent kernels, software raid using raid 1
will also handle write barriers. To get this feature, you are supposed to
mount ext3 file systems with the barrier=1 option. For other file systems,
the parameter may need to be different.
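
For ext3 that is just an ordinary mount option; the device and mount point
below are only illustrative:

mount -o remount,barrier=1 /var/lib/pgsql

or in /etc/fstab:

/dev/md0  /var/lib/pgsql  ext3  defaults,barrier=1  0  2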

Re: File Systems Compared

From
Jim Nasby
Date:
On Dec 11, 2006, at 12:54 PM, Bruno Wolff III wrote:
> On Wed, Dec 06, 2006 at 08:55:14 -0800,
>   Mark Lewis <mark.lewis@mir3.com> wrote:
>>> Anyone run their RAIDs with disk caches enabled, or is this akin to
>>> having fsync off?
>>
>> Disk write caches are basically always akin to having fsync off.  The
>> only time a write-cache is (more or less) safe to enable is when
>> it is
>> backed by a battery or in some other way made non-volatile.
>>
>> So a RAID controller with a battery-backed write cache can enable its
>> own write cache, but can't safely enable the write-caches on the disk
>> drives it manages.
>
> This appears to be changing under Linux. Recent kernels have write
> barriers
> implemented using cache flush commands (which some drives ignore,
> so you
> need to be careful). In very recent kernels, software raid using
> raid 1
> will also handle write barriers. To get this feature, you are
> supposed to
> mount ext3 file systems with the barrier=1 option. For other file
> systems,
> the parameter may need to be different.

But would that actually provide a meaningful benefit? When you
COMMIT, the WAL data must hit non-volatile storage of some kind,
which without a BBU or something similar, means hitting the platter.
So I don't see how enabling the disk cache will help, unless of
course it's ignoring fsync.

Now, I have heard something about drives using their stored
rotational energy to flush out the cache... but I tend to suspect
urban legend there...
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)



Re: File Systems Compared

From
Bruno Wolff III
Date:
On Thu, Dec 14, 2006 at 01:39:00 -0500,
  Jim Nasby <decibel@decibel.org> wrote:
> On Dec 11, 2006, at 12:54 PM, Bruno Wolff III wrote:
> >
> >This appears to be changing under Linux. Recent kernels have write
> >barriers
> >implemented using cache flush commands (which some drives ignore,
> >so you
> >need to be careful). In very recent kernels, software raid using
> >raid 1
> >will also handle write barriers. To get this feature, you are
> >supposed to
> >mount ext3 file systems with the barrier=1 option. For other file
> >systems,
> >the parameter may need to be different.
>
> But would that actually provide a meaningful benefit? When you
> COMMIT, the WAL data must hit non-volatile storage of some kind,
> which without a BBU or something similar, means hitting the platter.
> So I don't see how enabling the disk cache will help, unless of
> course it's ignoring fsync.

When you do an fsync, the OS sends a cache flush command to the drive;
on most drives (though supposedly there are ones that ignore this command)
the flush doesn't complete until all of the cached pages have been written
to the platter, and the fsync doesn't return until the flush is complete.
While this writes more sectors than you really need, it is safe. And it
allows for caching to speed up some things (though not as much as having
queued commands would).

I have done some tests on my systems and the speeds I am getting make it
clear that write barriers slow things down to about the same range as having
caches disabled. So I believe that it is likely working as advertised.
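
A crude way to eyeball the same effect is to time a batch of small
synchronous writes with barriers on and off; with GNU dd (assuming your
coreutils is new enough to have oflag) something like:

dd if=/dev/zero of=/mnt/test/syncfile bs=8k count=1000 oflag=dsync

tends to drop to something on the order of the disk's rotation rate when
the writes really hit the platter, and runs much faster when a volatile
write cache is absorbing them.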

Note the use case for this is more for hobbyists or development boxes. You can
only use it on software raid (md) 1, which rules out most "real" systems.

Re: File Systems Compared

From
Ron Mayer
Date:
Bruno Wolff III wrote:
> On Thu, Dec 14, 2006 at 01:39:00 -0500,
>   Jim Nasby <decibel@decibel.org> wrote:
>> On Dec 11, 2006, at 12:54 PM, Bruno Wolff III wrote:
>>> This appears to be changing under Linux. Recent kernels have write
>>> barriers implemented using cache flush commands (which
>>> some drives ignore,  so you need to be careful).

Is it true that some drives ignore this, or is it mostly
an urban legend that was started by testers that didn't
have kernels with write barrier support?   I'd be especially
interested in knowing if there are any currently available
drives which ignore those commands.

>>> In very recent kernels, software raid using raid 1 will also
>>> handle write barriers. To get this feature, you are supposed to
>>> mount ext3 file systems with the barrier=1 option. For other file
>>> systems, the parameter may need to be different.

With XFS the default is apparently to enable write barrier
support unless you explicitly disable it with the nobarrier mount option.
It also will warn you in the system log if the underlying device
doesn't have write barrier support.

SGI recommends that you use the "nobarrier" mount option if you do
have a persistent (battery backed) write cache on your raid device.

  http://oss.sgi.com/projects/xfs/faq.html#wcache
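
So on a box with a battery-backed controller cache the XFS mount line would
be something like (device and mount point only illustrative):

mount -o nobarrier /dev/sdb1 /var/lib/pgsql

while without a protected cache you would leave the default barrier
behaviour alone.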


>> But would that actually provide a meaningful benefit? When you
>> COMMIT, the WAL data must hit non-volatile storage of some kind,
>> which without a BBU or something similar, means hitting the platter.
>> So I don't see how enabling the disk cache will help, unless of
>> course it's ignoring fsync.

With write barriers, fsync() waits for the physical disk; but I believe
the background writes from write() done by pdflush don't have to; so
it's kinda like only disabling the cache for WAL files and the filesystem's
journal, but having it enabled for the rest of your write activity (the
tables except at checkpoints?  the log file?).

> Note the use case for this is more for hobbiests or development boxes. You can
> only use it on software raid (md) 1, which rules out most "real" systems.
>

Ugh.  Looking for where that's documented; and hoping it is or will soon
work on software 1+0 as well.

Re: File Systems Compared

From
Bruno Wolff III
Date:
The reply wasn't directly copied to the performance list, but I will
copy this one back.

On Thu, Dec 14, 2006 at 13:21:11 -0800,
  Ron Mayer <rm_pg@cheapcomplexdevices.com> wrote:
> Bruno Wolff III wrote:
> > On Thu, Dec 14, 2006 at 01:39:00 -0500,
> >   Jim Nasby <decibel@decibel.org> wrote:
> >> On Dec 11, 2006, at 12:54 PM, Bruno Wolff III wrote:
> >>> This appears to be changing under Linux. Recent kernels have write
> >>> barriers implemented using cache flush commands (which
> >>> some drives ignore,  so you need to be careful).
>
> Is it true that some drives ignore this; or is it mostly
> an urban legend that was started by testers that didn't
> have kernels with write barrier support.   I'd be especially
> interested in knowing if there are any currently available
> drives which ignore those commands.
>
> >>> In very recent kernels, software raid using raid 1 will also
> >>> handle write barriers. To get this feature, you are supposed to
> >>> mount ext3 file systems with the barrier=1 option. For other file
> >>> systems, the parameter may need to be different.
>
> With XFS the default is apparently to enable write barrier
> support unless you explicitly disable it with the nobarrier mount option.
> It also will warn you in the system log if the underlying device
> doesn't have write barrier support.
>
> SGI recommends that you use the "nobarrier" mount option if you do
> have a persistent (battery backed) write cache on your raid device.
>
>   http://oss.sgi.com/projects/xfs/faq.html#wcache
>
>
> >> But would that actually provide a meaningful benefit? When you
> >> COMMIT, the WAL data must hit non-volatile storage of some kind,
> >> which without a BBU or something similar, means hitting the platter.
> >> So I don't see how enabling the disk cache will help, unless of
> >> course it's ignoring fsync.
>
> With write barriers, fsync() waits for the physical disk; but I believe
> the background writes from write() done by pdflush don't have to; so
> it's kinda like only disabling the cache for WAL files and the filesystem's
> journal, but having it enabled for the rest of your write activity (the
> tables except at checkpoints?  the log file?).
>
> > Note the use case for this is more for hobbiests or development boxes. You can
> > only use it on software raid (md) 1, which rules out most "real" systems.
> >
>
> Ugh.  Looking for where that's documented; and hoping it is or will soon
> work on software 1+0 as well.

Re: File Systems Compared

From
Bruno Wolff III
Date:
On Thu, Dec 14, 2006 at 13:21:11 -0800,
  Ron Mayer <rm_pg@cheapcomplexdevices.com> wrote:
> Bruno Wolff III wrote:
> > On Thu, Dec 14, 2006 at 01:39:00 -0500,
> >   Jim Nasby <decibel@decibel.org> wrote:
> >> On Dec 11, 2006, at 12:54 PM, Bruno Wolff III wrote:
> >>> This appears to be changing under Linux. Recent kernels have write
> >>> barriers implemented using cache flush commands (which
> >>> some drives ignore,  so you need to be careful).
>
> Is it true that some drives ignore this; or is it mostly
> an urban legend that was started by testers that didn't
> have kernels with write barrier support.   I'd be especially
> interested in knowing if there are any currently available
> drives which ignore those commands.

I saw posts claiming this, but no specific drives mentioned. I did see one
post that claimed that the cache flush command was mandated (not optional)
by the spec.

> >>> In very recent kernels, software raid using raid 1 will also
> >>> handle write barriers. To get this feature, you are supposed to
> >>> mount ext3 file systems with the barrier=1 option. For other file
> >>> systems, the parameter may need to be different.
>
> With XFS the default is apparently to enable write barrier
> support unless you explicitly disable it with the nobarrier mount option.
> It also will warn you in the system log if the underlying device
> doesn't have write barrier support.

I think there might be a similar patch for ext3 going into 2.6.19. I haven't
checked a 2.6.19 kernel to make sure though.

>
> SGI recommends that you use the "nobarrier" mount option if you do
> have a persistent (battery backed) write cache on your raid device.
>
>   http://oss.sgi.com/projects/xfs/faq.html#wcache
>
>
> >> But would that actually provide a meaningful benefit? When you
> >> COMMIT, the WAL data must hit non-volatile storage of some kind,
> >> which without a BBU or something similar, means hitting the platter.
> >> So I don't see how enabling the disk cache will help, unless of
> >> course it's ignoring fsync.
>
> With write barriers, fsync() waits for the physical disk; but I believe
> the background writes from write() done by pdflush don't have to; so
> it's kinda like only disabling the cache for WAL files and the filesystem's
> journal, but having it enabled for the rest of your write activity (the
> tables except at checkpoints?  the log file?).

Not exactly. Whenever you commit the file system log or fsync the WAL file,
all previously written blocks will be flushed to the disk platter before
any new write requests are honored. So journalling semantics will work
properly.

> > Note the use case for this is more for hobbiests or development boxes. You can
> > only use it on software raid (md) 1, which rules out most "real" systems.
> >
>
> Ugh.  Looking for where that's documented; and hoping it is or will soon
> work on software 1+0 as well.

I saw a comment somewhere that raid 0 presented some problems and the suggestion
was to handle the barrier at a different level (though I don't know how you
could). So I don't believe 1+0 or 5 are currently supported or will be in the
near term.

The other feature I would like is to be able to use write barriers with
encrypted file systems. I haven't found anything on whether or not there
are near term plans by anyone to support that.

Re: File Systems Compared

From
Bruno Wolff III
Date:
On Fri, Dec 15, 2006 at 10:34:15 -0600,
  Bruno Wolff III <bruno@wolff.to> wrote:
> The reply wasn't (directly copied to the performance list, but I will
> copy this one back.

Sorry about this one, I meant to intersperse my replies and hit the 'y'
key at the wrong time. (And there ended up being a copy on performance
anyway from the news gateway.)

Re: File Systems Compared

From
Bruno Wolff III
Date:
On Fri, Dec 15, 2006 at 10:44:39 -0600,
  Bruno Wolff III <bruno@wolff.to> wrote:
>
> The other feature I would like is to be able to use write barriers with
> encrypted file systems. I haven't found anythign on whether or not there
> are near term plans by any one to support that.

I asked about this on the dm-crypt list and was told that write barriers
work pre 2.6.19. There was a change for 2.6.19 that might break things for
SMP systems. But that will probably get fixed eventually.