Thread: Filesystem benchmarking for pg 8.3.3 server

Filesystem benchmarking for pg 8.3.3 server

From: Henrik
Hello list,

I have a server with direct-attached storage containing 4 15k SAS
drives and 6 standard SATA drives.
The server is a quad-core Xeon with 16GB RAM.
Both server and DAS have dual PERC/6E RAID controllers with 512 MB BBU.

There are 2 RAID sets configured.
One RAID 10 containing 4 SAS disks
One RAID 5 containing 6 SATA disks

There is one partition per RAID set with ext2 filesystem.

I ran the following iozone test which I stole from Joshua Drake's test
at

http://www.commandprompt.com/blogs/joshua_drake/2008/04/is_that_performance_i_smell_ext2_vs_ext3_on_50_spindles_testing_for_postgresql/

I ran this test against the RAID 5 SATA partition

#iozone -e -i0 -i1 -i2 -i8 -t1 -s 1000m -r 8k -+u

With these random write results

    Children see throughput for 1 random writers    =  168647.33 KB/sec
    Parent sees throughput for 1 random writers     =  168413.61 KB/sec
    Min throughput per process                      =  168647.33 KB/sec
    Max throughput per process                      =  168647.33 KB/sec
    Avg throughput per process                      =  168647.33 KB/sec
    Min xfer                                        = 1024000.00 KB
    CPU utilization: Wall time    6.072    CPU time    0.540    CPU utilization   8.89 %

Almost 170 MB/sec. Not bad for 6 standard SATA drives.

Then I ran the same thing against the RAID 10 SAS partition

    Children see throughput for 1 random writers    =   68816.25 KB/sec
    Parent sees throughput for 1 random writers     =   68767.90 KB/sec
    Min throughput per process                      =   68816.25 KB/sec
    Max throughput per process                      =   68816.25 KB/sec
    Avg throughput per process                      =   68816.25 KB/sec
    Min xfer                                        = 1024000.00 KB
    CPU utilization: Wall time   14.880    CPU time    0.520    CPU utilization   3.49 %

What, only 70 MB/sec?

Is it possible that the 2 extra spindles in the SATA set make that
partition so much faster, even though the disks and the RAID
configuration should be slower?
It feels like there is something fishy going on. Maybe the RAID 10
implementation on the PERC/6e is crap?

Any pointers, suggestions, ideas?

I'm going to change the RAID 10 to a RAID 5 and test again and see
what happens.

Cheers,
Henke


Re: Filesystem benchmarking for pg 8.3.3 server

From: "Luke Lonergan"

Your expected write speed on a 4 drive RAID10 is two drives worth, probably 160 MB/s, depending on the generation of drives.

The expected write speed for a 6 drive RAID5 is 5 drives' worth, or about 400 MB/s, minus the RAID5 parity overhead.
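Luke's estimates reduce to simple arithmetic. A sketch (the 80 MB/s per-drive figure is an assumed value for drives of that era, not a number from this thread):

```shell
# Rough sequential-write ceilings; per_drive is a hypothetical 80 MB/s.
per_drive=80
echo "4-drive RAID10: $(( per_drive * 4 / 2 )) MB/s"    # mirroring halves the usable drives
echo "6-drive RAID5:  $(( per_drive * (6 - 1) )) MB/s"  # one drive's worth goes to parity
```

This prints 160 MB/s and 400 MB/s, matching the figures above.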

- Luke

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: Filesystem benchmarking for pg 8.3.3 server

From: Henrik
But random writes should be faster on a RAID10, as it doesn't need to calculate parity. That is why people suggest RAID 10 for databases, correct?

I can understand that RAID5 can be faster with sequential writes.

//Henke

On 8 Aug 2008, at 16.53, Luke Lonergan wrote:

Your expected write speed on a 4 drive RAID10 is two drives worth, probably 160 MB/s, depending on the generation of drives.

The expect write speed for a 6 drive RAID5 is 5 drives worth, or about 400 MB/s, sans the RAID5 parity overhead.

- Luke



Re: Filesystem benchmarking for pg 8.3.3 server

From: "Mark Wong"
On Fri, Aug 8, 2008 at 8:08 AM, Henrik <henke@mac.se> wrote:
> But random writes should be faster on a RAID10 as it doesn't need to
> calculate parity. That is why people suggest RAID 10 for datases, correct?
> I can understand that RAID5 can be faster with sequential writes.

There is some data here that does not support the claim that RAID5 can
be faster than RAID10 for sequential writes:

http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide

Regards,
Mark

Re: Filesystem benchmarking for pg 8.3.3 server

From: Henrik
On 8 Aug 2008, at 18.44, Mark Wong wrote:

> On Fri, Aug 8, 2008 at 8:08 AM, Henrik <henke@mac.se> wrote:
>> But random writes should be faster on a RAID10 as it doesn't need to
>> calculate parity. That is why people suggest RAID 10 for datases,
>> correct?
>> I can understand that RAID5 can be faster with sequential writes.
>
> There is some data here that does not support that RAID5 can be faster
> than RAID10 for sequential writes:
>
> http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide
I'm amazed by the big difference between hardware and software RAID.

I set up a new Dell(!) system against an MD1000 DAS with a single quad-core
2.33 GHz CPU, 16GB RAM and PERC/6E RAID controllers with 512MB BBU.

I set up a RAID 10 on 4 15K SAS disks.

I ran IOZone against this partition with ext2 filesystem and got the
following results.

safeuser@safecube04:/$ iozone -e -i0 -i1 -i2 -i8 -t1 -s 1000m -r 8k -+u -F /database/iotest
    Iozone: Performance Test of File I/O
            Version $Revision: 3.279 $
        Compiled for 64 bit mode.
        Build: linux

    Children see throughput for  1 initial writers  =  254561.23 KB/sec
    Parent sees throughput for  1 initial writers   =  253935.07 KB/sec
    Min throughput per process                      =  254561.23 KB/sec
    Max throughput per process                      =  254561.23 KB/sec
    Avg throughput per process                      =  254561.23 KB/sec
    Min xfer                                        = 1024000.00 KB
    CPU Utilization: Wall time    4.023    CPU time    0.740    CPU utilization  18.40 %


    Children see throughput for  1 rewriters        =  259640.61 KB/sec
    Parent sees throughput for  1 rewriters         =  259351.20 KB/sec
    Min throughput per process                      =  259640.61 KB/sec
    Max throughput per process                      =  259640.61 KB/sec
    Avg throughput per process                      =  259640.61 KB/sec
    Min xfer                                        = 1024000.00 KB
    CPU utilization: Wall time    3.944    CPU time    0.460    CPU utilization  11.66 %


    Children see throughput for  1 readers          = 2931030.50 KB/sec
    Parent sees throughput for  1 readers           = 2877172.20 KB/sec
    Min throughput per process                      = 2931030.50 KB/sec
    Max throughput per process                      = 2931030.50 KB/sec
    Avg throughput per process                      = 2931030.50 KB/sec
    Min xfer                                        = 1024000.00 KB
    CPU utilization: Wall time    0.349    CPU time    0.340    CPU utilization  97.32 %


    Children see throughput for 1 random readers    = 2534182.50 KB/sec
    Parent sees throughput for 1 random readers     = 2465408.13 KB/sec
    Min throughput per process                      = 2534182.50 KB/sec
    Max throughput per process                      = 2534182.50 KB/sec
    Avg throughput per process                      = 2534182.50 KB/sec
    Min xfer                                        = 1024000.00 KB
    CPU utilization: Wall time    0.404    CPU time    0.400    CPU utilization  98.99 %

    Children see throughput for 1 random writers    =   68816.25 KB/sec
    Parent sees throughput for 1 random writers     =   68767.90 KB/sec
    Min throughput per process                      =   68816.25 KB/sec
    Max throughput per process                      =   68816.25 KB/sec
    Avg throughput per process                      =   68816.25 KB/sec
    Min xfer                                        = 1024000.00 KB
    CPU utilization: Wall time   14.880    CPU time    0.520    CPU utilization   3.49 %


So compared to the HP 8000 benchmarks this setup is even better than
the software RAID.

But I'm skeptical of iozone's results, as when I ran the same test
against 6 standard SATA drives in RAID5 I got random writes of
170 MB/sec (!). Sure, 2 more spindles, but still.

Cheers,
Henke


Re: Filesystem benchmarking for pg 8.3.3 server

From: "Andrej Ricnik-Bay"
On 09/08/2008, Henrik <henke@mac.se> wrote:
> But random writes should be faster on a RAID10 as it doesn't need to
> calculate parity. That is why people suggest RAID 10 for datases, correct?

If it had 10 spindles as opposed to 4 ... with 4 drives, because you're
striping and mirroring, the "split" means you're effectively writing to two.


Cheers,
Andrej

Re: Filesystem benchmarking for pg 8.3.3 server

From: Greg Smith
On Fri, 8 Aug 2008, Henrik wrote:

> It feels like there is something fishy going on. Maybe the RAID 10
> implementation on the PERC/6e is crap?

Normally, when a SATA implementation is running significantly faster than
a SAS one, it's because there's some write cache in the SATA disks turned
on (which they usually are unless you go out of your way to disable them).
Since all non-battery backed caches need to get turned off for reliable
database use, you might want to double-check that on the controller that's
driving the SATA disks.
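One way to check is per disk, sketched here as a dry run that only prints the commands (device names are hypothetical; `hdparm` covers the SATA disks, while SAS disks usually need `sdparm` or the controller's own tool instead):

```shell
# Print, rather than execute, the per-disk write cache checks.
for dev in /dev/sda /dev/sdb; do
    echo "hdparm -W $dev      # query: shows whether the drive's own write cache is on"
    echo "hdparm -W 0 $dev    # would turn the drive's write cache off"
done
```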

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Filesystem benchmarking for pg 8.3.3 server

From: david@lang.hm
On Fri, 8 Aug 2008, Henrik wrote:

> But random writes should be faster on a RAID10 as it doesn't need to
> calculate parity. That is why people suggest RAID 10 for datases, correct?
>
> I can understand that RAID5 can be faster with sequential writes.

the key word here is "can" be faster; it depends on the exact
implementation, stripe size, OS caching, etc.

the ideal situation would be that the OS would flush exactly one stripe of
data at a time (aligned with the array) and no reads would need to be
done, merely calculate the parity info for the new data and write it all.

the worst case is when the write size is small in relation to the stripe
size and crosses the stripe boundary. In that case the system needs to read
data from multiple stripes to calculate the new parity and then write the
data and parity.

I don't know any systems (software or hardware) that meet the ideal
situation today.
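the cost difference can be counted directly. a sketch for the 6-disk RAID5 in this thread (the stripe geometry is assumed, not read from the controller):

```shell
n=6   # disks in the RAID5 set
# Worst case, small random write (read-modify-write): read the old data chunk
# and the old parity, then write the new data chunk and new parity = 4 I/Os
# to land a single chunk of data.
echo "partial-stripe write: 4 I/Os per data chunk"
# Best case, full-stripe write: parity is computed in memory, so n disk
# writes land n-1 chunks of real data and no reads are needed.
echo "full-stripe write: $n I/Os per $(( n - 1 )) data chunks"
```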

when comparing software and hardware raid, one other thing to remember is
that CPU and I/O bandwidth that's used for software raid is not available
to do other things.

so a system that benchmarks much faster with software raid could end up
being significantly slower in practice if it needs that CPU and I/O
bandwidth for other purposes.

examples could be needing the CPU/memory capacity to search through
amounts of RAM once the data is retrieved from disk, or finding that you
have enough network I/O that it combines with your disk I/O to saturate
your system busses.

David Lang



Re: Filesystem benchmarking for pg 8.3.3 server

From: Henrik
On 9 Aug 2008, at 00.47, Greg Smith wrote:

> On Fri, 8 Aug 2008, Henrik wrote:
>
>> It feels like there is something fishy going on. Maybe the RAID 10
>> implementation on the PERC/6e is crap?
>
> Normally, when a SATA implementation is running significantly faster
> than a SAS one, it's because there's some write cache in the SATA
> disks turned on (which they usually are unless you go out of your
> way to disable them). Since all non-battery backed caches need to
> get turned off for reliable database use, you might want to double-
> check that on the controller that's driving the SATA disks.

Lucky for me I have BBU on all my controller cards, and I'm also not
using the SATA drives for the database. That is why I bought the SAS
drives :) I just got confused when the SATA RAID 5 was so much faster
than the SAS RAID10, even for random writes. But I should have realized
that SAS is only faster if the number of drives is equal :)

Thanks for the input!

Cheers,
Henke

>
>
> --
> * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com
> Baltimore, MD
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org
> )
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance


Re: Filesystem benchmarking for pg 8.3.3 server

From: Henrik
OK, changed the SAS RAID 10 to RAID 5 and now my random writes are
hitting 112 MB/sec. So it is almost twice as fast as the RAID10 with
the same disks. Any ideas why?

Are the iozone tests faulty?

What are your suggestions? Trust the iozone tests and use RAID5 instead
of RAID10, or go for RAID10 as it should be faster and will be better
suited when we add more disks in the future?

I'm a little confused by the benchmarks.

This is from the RAID5 tests on 4 SAS 15K drives...

iozone -e -i0 -i1 -i2 -i8 -t1 -s 1000m -r 8k -+u -F /database/iotest

    Children see throughput for 1 random writers    =  112074.58 KB/sec
    Parent sees throughput for 1 random writers     =  111962.80 KB/sec
    Min throughput per process                      =  112074.58 KB/sec
    Max throughput per process                      =  112074.58 KB/sec
    Avg throughput per process                      =  112074.58 KB/sec
    Min xfer                                        = 1024000.00 KB
    CPU utilization: Wall time    9.137    CPU time    0.510    CPU utilization   5.58 %






Re: Filesystem benchmarking for pg 8.3.3 server

From: Glyn Astill
>> It feels like there is something fishy going on. Maybe the RAID 10
>> implementation on the PERC/6e is crap?

It's possible.  We had a bunch of PERC/5i SAS RAID cards in our servers that performed quite well in RAID 5 but were
shite in RAID 10.  I switched them out for Adaptec 5808s and saw a massive improvement in RAID 10.



Re: Filesystem benchmarking for pg 8.3.3 server

From: Henrik
On 11 Aug 2008, at 12.35, Glyn Astill wrote:

>> It feels like there is something fishy going on. Maybe the RAID 10
>> implementation on the PERC/6e is crap?
>
> It's possible.  We had a bunch of perc/5i SAS raid cards in our
> servers that performed quite well in Raid 5 but were shite in Raid
> 10.  I switched them out for Adaptec 5808s and saw a massive
> improvement in Raid 10.
I suspected that. Maybe I should just put the PERC/6 cards in JBOD
mode and then make a RAID10 with Linux software RAID (md)?




Re: Filesystem benchmarking for pg 8.3.3 server

From: "Scott Marlowe"
On Mon, Aug 11, 2008 at 6:08 AM, Henrik <henke@mac.se> wrote:
> 11 aug 2008 kl. 12.35 skrev Glyn Astill:
>
>>> It feels like there is something fishy going on. Maybe the RAID 10
>>> implementation on the PERC/6e is crap?
>>
>> It's possible.  We had a bunch of perc/5i SAS raid cards in our servers
>> that performed quite well in Raid 5 but were shite in Raid 10.  I switched
>> them out for Adaptec 5808s and saw a massive improvement in Raid 10.
>
> I suspected that. Maybe I should just put the PERC/6 cards in JBOD mode and
> then make a RAID10 with linux  software raid MD?

You can also try making mirror sets with the hardware RAID controller
and then doing SW RAID 0 on top of that.  Since RAID0 requires little
or no CPU overhead, this is a good compromise because the OS has the
least work to do, and the RAID controller is doing what it's probably
pretty good at, mirror sets.
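As a sketch, printed rather than executed (the device names are hypothetical; the controller would expose each hardware mirror pair as a single block device):

```shell
# Assume the controller presents two RAID1 pairs as /dev/sdb and /dev/sdc;
# md would then stripe across them. Printed only, since mdadm --create and
# mkfs are destructive.
echo "mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc"
echo "mkfs.ext2 /dev/md0"
```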

Re: Filesystem benchmarking for pg 8.3.3 server

From: Jeff
On Aug 11, 2008, at 5:17 AM, Henrik wrote:

> OK, changed the SAS RAID 10 to RAID 5 and now my random writes are
> handing 112 MB/ sek. So it is almsot twice as fast as the RAID10
> with the same disks. Any ideas why?
>
> Is the iozone tests faulty?


Does iozone disable the OS caches? If not, you need to use a size of
2x RAM for true results.

Regardless, the test only took 10 seconds of wall time, which isn't
very long at all. You'd probably want to run it longer anyway.
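For the 16GB machine in this thread, that means sizing the test file at twice RAM. A sketch that just builds the command line (the target path is the one used earlier in the thread):

```shell
ram_gb=16
size_gb=$(( ram_gb * 2 ))   # file must be at least 2x RAM to defeat the page cache
echo "iozone -e -i0 -i1 -i2 -i8 -t1 -s ${size_gb}g -r 8k -+u -F /database/iotest"
```

This prints an iozone invocation with `-s 32g` in place of the original `-s 1000m`.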


>
> iozone -e -i0 -i1 -i2 -i8 -t1 -s 1000m -r 8k -+u -F /database/iotest
>
>     Children see throughput for 1 random writers     =  112074.58 KB/sec
>     Parent sees throughput for 1 random writers     =  111962.80 KB/sec
>     Min throughput per process             =  112074.58 KB/sec
>     Max throughput per process             =  112074.58 KB/sec
>     Avg throughput per process             =  112074.58 KB/sec
>     Min xfer                     = 1024000.00 KB
>     CPU utilization: Wall time    9.137    CPU time    0.510    CPU
> utilization   5.58 %
>


--
Jeff Trout <jeff@jefftrout.com>
http://www.stuarthamm.net/
http://www.dellsmartexitin.com/




Re: Filesystem benchmarking for pg 8.3.3 server

From: Greg Smith
On Sun, 10 Aug 2008, Henrik wrote:

>> Normally, when a SATA implementation is running significantly faster than a
>> SAS one, it's because there's some write cache in the SATA disks turned on
>> (which they usually are unless you go out of your way to disable them).
> Lucky for my I have BBU on all my controllers cards and I'm also not using
> the SATA drives for database.

From how you responded I don't think I made myself clear.  In addition to
the cache on the controller itself, each of the disks has its own cache,
probably 8-32MB in size.  Your controllers may have an option to enable or
disable the caches on the individual disks, which would be a separate
configuration setting from turning the main controller cache on or off.
Your results look like what I'd expect if the individual disk caches on
the SATA drives were on, while those on the SAS drives were off (which
matches the defaults you'll find on some products in both categories).
Just something to double-check.

By the way:  getting useful results out of iozone is fairly difficult if
you're unfamiliar with it, there are lots of ways you can set that up to
run tests that aren't completely fair or that you don't run them for long
enough to give useful results.  I'd suggest doing a round of comparisons
with bonnie++, which isn't as flexible but will usually give fair results
without needing to specify any parameters.  The "seeks" number that comes
out of bonnie++ is a combined read/write one and would be good for
double-checking whether the unexpected results you're seeing are
independent of the benchmark used.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Filesystem benchmarking for pg 8.3.3 server

From
Henrik
Date:
Hi again all,

Just wanted to give you an update.

Talked to Dell tech support and they recommended using write-through(!)
caching in the RAID10 configuration. Well, it didn't help; I got even
worse performance.

Anyone have an estimate of what a RAID10 on 4 15k SAS disks should
generate in random writes?

I'm really keen on trying Scott's suggestion of using the PERC/6 with
mirror sets only and then making the stripe with Linux SW raid.

Thanks for all the input! Much appreciated.


Cheers,
Henke

On 11 Aug 2008, at 17:56, Greg Smith wrote:

> On Sun, 10 Aug 2008, Henrik wrote:
>
>>> Normally, when a SATA implementation is running significantly
>>> faster than a SAS one, it's because there's some write cache in
>>> the SATA disks turned on (which they usually are unless you go out
>>> of your way to disable them).
>> Lucky for me I have BBU on all my controller cards and I'm also
>> not using the SATA drives for the database.
>
> From how you responded I don't think I made myself clear.  In
> addition to
> the cache on the controller itself, each of the disks has its own
> cache, probably 8-32MB in size.  Your controllers may have an option
> to enable or disable the caches on the individual disks, which would
> be a separate configuration setting from turning the main controller
> cache on or off. Your results look like what I'd expect if the
> individual disks caches on the SATA drives were on, while those on
> the SAS controller were off (which matches the defaults you'll find
> on some products in both categories). Just something to double-check.
>
> By the way:  getting useful results out of iozone is fairly
> difficult if you're unfamiliar with it, there are lots of ways you
> can set that up to run tests that aren't completely fair or that you
> don't run them for long enough to give useful results.  I'd suggest
> doing a round of comparisons with bonnie++, which isn't as flexible
> but will usually give fair results without needing to specify any
> parameters.  The "seeks" number that comes out of bonnie++ is a
> combined read/write one and would be good for double-checking
> whether the unexpected results you're seeing are independent of the
> benchmark used.
>
> --
> * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com
> Baltimore, MD
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org
> )
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance


Re: Filesystem benchmarking for pg 8.3.3 server

From
"Scott Marlowe"
Date:
On Tue, Aug 12, 2008 at 1:40 PM, Henrik <henke@mac.se> wrote:
> Hi again all,
>
> Just wanted to give you an update.
>
> Talked to Dell tech support and they recommended using write-through(!)
> caching in RAID10 configuration. Well, it didn't work and got even worse
> performance.

Someone at Dell doesn't understand the difference between write back
and write through.

> Anyone have an estimate of what a RAID10 on 4 15k SAS disks should generate
> in random writes?

Using sw RAID or a non-caching RAID controller, you should be able to
get close to 2x the max write rate based on rpms.  Since each fsync'd
write waits roughly one platter rotation, a 7200 RPM drive tops out
near 7200/60 = 120, so that's 2*120 or ~240 small transactions per
second.  On 15k drives that's about 2*250 or around 500 tps.
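A back-of-the-envelope sketch of that rotation-bound estimate (a toy model of my own, not code from the thread, assuming one fsync'd commit per platter rotation, i.e. RPM/60 per uncached drive):

```python
# Rough ceiling on fsync'd transactions/sec for uncached spindles:
# each commit waits for about one platter rotation, so one drive tops
# out near RPM/60, and a 4-disk RAID10 (two mirrored pairs striped)
# can roughly double that.

def per_drive_commit_ceiling(rpm):
    """Upper bound on fsync'd commits/sec for one uncached drive."""
    return rpm / 60.0

def raid10_commit_ceiling(rpm, mirror_sets=2):
    """Ceiling for a RAID10 built from `mirror_sets` mirrored pairs."""
    return mirror_sets * per_drive_commit_ceiling(rpm)

print(raid10_commit_ceiling(7200))   # -> 240.0
print(raid10_commit_ceiling(15000))  # -> 500.0 (15k SAS, as in this thread)
```

Measured rates far above these ceilings on a non-battery-backed setup are the tell-tale sign discussed later in the thread that a cache is absorbing the fsyncs.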

The bigger the data you're writing, the fewer you're gonna be able to
write each second of course.

> I'm really keen on trying Scotts suggestion on using the PERC/6 with mirror
> sets only and then make the stripe with Linux SW raid.

Definitely worth the try.  Even full-on sw RAID may be faster.  It's
worth testing.

On our new servers at work, we have Areca controllers with 512MB of
battery-backed cache, and they were about 10% faster mixing sw and hw
raid, but honestly, it wasn't worth the extra trouble of the hw/sw combo.

Re: Filesystem benchmarking for pg 8.3.3 server

From
Ron Mayer
Date:
Greg Smith wrote:
> some write cache in the SATA disks...Since all non-battery backed caches
> need to get turned  off for reliable database use, you might want to
> double-check that on the controller that's driving the SATA disks.

Is this really true?

Doesn't the ATA "FLUSH CACHE" command (say, ATA command 0xE7)
guarantee that writes are on the media?

http://www.t13.org/Documents/UploadedDocuments/technical/e01126r0.pdf
"A non-error completion of the command indicates that all cached data
  since the last FLUSH CACHE command completion was successfully written
  to media, including any cached data that may have been
  written prior to receipt of FLUSH CACHE command."
(I still can't find any $0 SATA specs; but I imagine the final
wording for the command is similar to the wording in the proposal
for the command which can be found on the ATA Technical Committee's
web site at the link above.)

Really old software (notably 2.4 linux kernels) didn't send
cache-synchronizing commands for either SCSI or ATA; but
it seems well thought through in the 2.6 kernels, as described
in the Linux kernel documentation.
http://www.mjmwired.net/kernel/Documentation/block/barrier.txt

If you do have a disk where you need to disable write caches,
I'd love to know the name of the disk and see the output
of "hdparm -I /dev/sd***" to see if it claims to support such
cache flushes.


I'm almost tempted to say that if you find yourself having to disable
caches on modern (this century) hardware and software, you're probably
covering up a more serious issue with your system.


Re: Filesystem benchmarking for pg 8.3.3 server

From
"Scott Carey"
Date:
Some SATA drives were known to not flush their cache when told to.
Some file systems don't know about this (UFS, older linux kernels, etc).

So yes, if your OS / file system / controller card combo properly sends the write cache flush command, and the drive is not a flawed one, all is well.  Most should; not all do.  Any one of those links along the chain can potentially be write-cache unsafe.



Re: Filesystem benchmarking for pg 8.3.3 server

From
Ron Mayer
Date:
Scott Carey wrote:
> Some SATA drives were known to not flush their cache when told to.

Can you name one?  The ATA commands seem pretty clear on the matter,
and ISTM most of the reports of these issues came from before
Linux had write-barrier support.

I've yet to hear of a drive with the problem; though no doubt there
are some cheap RAID controllers somewhere that expect you to disable
the drive caches.

Re: Filesystem benchmarking for pg 8.3.3 server

From
"Scott Marlowe"
Date:
On Tue, Aug 12, 2008 at 6:23 PM, Scott Carey <scott@richrelevance.com> wrote:
> Some SATA drives were known to not flush their cache when told to.
> Some file systems don't know about this (UFS, older linux kernels, etc).
>
> So yes, if your OS / File System / Controller card combo properly sends the
> write cache flush command, and the drive is not a flawed one, all is well.
> Most should, not all do.  Any one of those bits along the chain can
> potentially be disk write cache unsafe.

I can attest to the 2.4 kernel not being able to guarantee fsync on
IDE drives.  And to the LSI megaraid SCSI controllers of the era
surviving numerous power off tests.

Re: Filesystem benchmarking for pg 8.3.3 server

From
david@lang.hm
Date:
On Tue, 12 Aug 2008, Ron Mayer wrote:

> Scott Carey wrote:
>> Some SATA drives were known to not flush their cache when told to.
>
> Can you name one?  The ATA commands seem pretty clear on the matter,
> and ISTM most of the reports of these issues came from before
> Linux had write-barrier support.

I can't name one, but I've seen it mentioned in the discussions on
linux-kernel several times by the folks who are writing the write-barrier
support.

David Lang


Re: Filesystem benchmarking for pg 8.3.3 server

From
"Scott Carey"
Date:
I'm not an expert on which and where -- it's been a while since I was exposed to the issue.  From what I've read in a few places over time (storagereview.com, Linux and Windows patches or knowledge base articles), it happens from time to time.  Drives usually get firmware updates quickly; drivers and controller cards often take longer to fix.  Anyway, my anecdotal recollection is of a few instances of this occurring about 4 years ago, manifesting itself in complaints on message boards and then going away.  In general, some searching around Google indicates this is a problem more often with drivers and controllers than with the drives themselves.

I recall some cheap raid cards and controller cards being an issue, like the below:
http://www.fixya.com/support/t163682-hard_drive_corrupt_every_reboot

And here is an example of an HP Fibre Channel disk firmware bug:

HS02969 28SEP07

Title:    OPN FIBRE CHANNEL DISK FIRMWARE
Platform: S-Series & NS-Series only with FCDMs
Summary:  HP recently discovered a firmware flaw in some versions of 72,
146, and 300 Gigabyte fibre channel disk devices that shipped in late 2006
and early 2007. The flaw enabled the affected disk devices to inadvertently
cache write data. In very rare instances, this caching operation presents an
opportunity for disk write operations to be lost.


Even ext3 doesn't default to using write barriers at this time due to performance concerns:
http://lwn.net/Articles/283161/


Re: Filesystem benchmarking for pg 8.3.3 server

From
Ron Mayer
Date:
Scott Marlowe wrote:
> I can attest to the 2.4 kernel not being able to guarantee fsync on
> IDE drives.

Sure.  But note that it won't for SCSI either; since AFAICT the write
barrier support was implemented at the same time for both.

Re: Filesystem benchmarking for pg 8.3.3 server

From
Greg Smith
Date:
On Tue, 12 Aug 2008, Ron Mayer wrote:

> Really old software (notably 2.4 linux kernels) didn't send
> cache-synchronizing commands for either SCSI or ATA; but
> it seems well thought through in the 2.6 kernels as described
> in the Linux kernel documentation.
> http://www.mjmwired.net/kernel/Documentation/block/barrier.txt

If you've drunk the kool-aid you might believe that.  When I see people
asking about this in early 2008 at
http://thread.gmane.org/gmane.linux.kernel/646040 and serious disk driver
hacker Jeff Garzik saying there "It's completely ridiculous that we default
to an unsafe fsync," I don't know about you, but that barrier documentation
doesn't make me feel warm and safe anymore.

> If you do have a disk where you need to disable write caches,
> I'd love to know the name of the disk and see the output of
> of "hdparm -I /dev/sd***" to see if it claims to support such
> cache flushes.

The below disk writes impossibly fast when I issue a sequence of fsync
writes to it under the CentOS 5 Linux I was running on it.  Should only be
possible to do at most 120/second since it's 7200 RPM, and if I poke it
with "hdparm -W0" first it behaves.  The drive is a known piece of junk
from circa 2004, and it's worth noting that it's an ext3 filesystem in a
md0 RAID-1 array (aren't there issues with md and the barriers?)

# hdparm -I /dev/hde

/dev/hde:

ATA device, with non-removable media
         Model Number:       Maxtor 6Y250P0
         Serial Number:      Y62K95PE
         Firmware Revision:  YAR41BW0
Standards:
         Used: ATA/ATAPI-7 T13 1532D revision 0
         Supported: 7 6 5 4
Configuration:
         Logical         max     current
         cylinders       16383   65535
         heads           16      1
         sectors/track   63      63
         --
         CHS current addressable sectors:    4128705
         LBA    user addressable sectors:  268435455
         LBA48  user addressable sectors:  490234752
         device size with M = 1024*1024:      239372 MBytes
         device size with M = 1000*1000:      251000 MBytes (251 GB)
Capabilities:
         LBA, IORDY(can be disabled)
         Standby timer values: spec'd by Standard, no device specific
minimum
         R/W multiple sector transfer: Max = 16  Current = 16
         Advanced power management level: unknown setting (0x0000)
         Recommended acoustic management value: 192, current value: 254
         DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
              Cycle time: min=120ns recommended=120ns
         PIO: pio0 pio1 pio2 pio3 pio4
              Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
         Enabled Supported:
            *    SMART feature set
                 Security Mode feature set
            *    Power Management feature set
            *    Write cache
            *    Look-ahead
            *    Host Protected Area feature set
            *    WRITE_VERIFY command
            *    WRITE_BUFFER command
            *    READ_BUFFER command
            *    NOP cmd
            *    DOWNLOAD_MICROCODE
                 Advanced Power Management feature set
                 SET_MAX security extension
            *    Automatic Acoustic Management feature set
            *    48-bit Address feature set
            *    Device Configuration Overlay feature set
            *    Mandatory FLUSH_CACHE
            *    FLUSH_CACHE_EXT
            *    SMART error logging
            *    SMART self-test

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Filesystem benchmarking for pg 8.3.3 server

From
"Scott Marlowe"
Date:
On Tue, Aug 12, 2008 at 10:28 PM, Ron Mayer
<rm_pg@cheapcomplexdevices.com> wrote:
> Scott Marlowe wrote:
>>
>> I can attest to the 2.4 kernel not being able to guarantee fsync on
>> IDE drives.
>
> Sure.  But note that it won't for SCSI either; since AFAICT the write
> barrier support was implemented at the same time for both.

Tested both by pulling the power plug.  The SCSI was pulled 10 times
while running 600 or so concurrent pgbench threads, and so was the
IDE.  The SCSI came up clean every single time, the IDE came up
corrupted every single time.

I find it hard to believe there was no difference in write barrier
behaviour with those two setups.

Re: Filesystem benchmarking for pg 8.3.3 server

From
Matthew Wakeling
Date:
On Tue, 12 Aug 2008, Ron Mayer wrote:
> Really old software (notably 2.4 linux kernels) didn't send
> cache-synchronizing commands for either SCSI or ATA;

Surely not true. Write cache flushing has been a known problem in the
computer science world for several tens of years. The difference is that
in the past we only had a "flush everything" command whereas now we have a
"flush everything before the barrier before everything after the barrier"
command.

Matthew

--
"To err is human; to really louse things up requires root
 privileges."                 -- Alexander Pope, slightly paraphrased

Re: Filesystem benchmarking for pg 8.3.3 server

From
Ron Mayer
Date:
Scott Marlowe wrote:
> On Tue, Aug 12, 2008 at 10:28 PM, Ron Mayer ...wrote:
>> Scott Marlowe wrote:
>>> I can attest to the 2.4 kernel ...
>> ...SCSI...AFAICT the write barrier support...
>
> Tested both by pulling the power plug.  The SCSI was pulled 10 times
> while running 600 or so concurrent pgbench threads, and so was the
> IDE.  The SCSI came up clean every single time, the IDE came up
> corrupted every single time.

Interesting.  With a pre-write-barrier 2.4 kernel I'd
expect corruption in both.
Perhaps all caches were disabled in the SCSI drives?

> I find it hard to believe there was no difference in write barrier
> behaviour with those two setups.

Skimming lkml it seems write barriers for SCSI were
behind (in terms of implementation) those for ATA
http://lkml.org/lkml/2005/1/27/94
"Jan 2005 ... scsi/sata write barrier support ...
  For the longest time, only the old PATA drivers
  supported barrier writes with journalled file systems.
  This patch adds support for the same type of cache
  flushing barriers that PATA uses for SCSI"

Re: Filesystem benchmarking for pg 8.3.3 server

From
Ron Mayer
Date:
Greg Smith wrote:
> The below disk writes impossibly fast when I issue a sequence of fsync

'k.  I've got some homework.  I'll be trying to reproduce something
similar with md raid, old IDE drives, etc. to see if I can.
I assume test_fsync in the postgres source distribution is
a decent way to see?

> driver hacker Jeff Garzik says "It's completely ridiculous that we
> default to an unsafe fsync."

Yipes indeed.  Still makes me want to understand why people
claim IDE suffers more than SCSI, tho.  Ext3 bugs seem likely
to affect both to me.

> writes to it under the CentOS 5 Linux I was running on it. ...
> junk from circa 2004, and it's worth noting that it's an ext3 filesystem
> in a md0 RAID-1 array (aren't there issues with md and the barriers?)

Apparently various distros vary a lot in how they're set
up (SuSE defaults to mounting ext3 with the barrier=1
option; other distros seem not to, etc.).

I'll do a number of experiments with md, a few different drives,
etc. today and see if I can find issues with any of the
drives (and/or filesystems) around here.

But I still am looking for any evidence that there were any
widely shipped SATA (or even IDE drives) that were at fault,
as opposed to filesystem bugs and poor settings of defaults.

Re: Filesystem benchmarking for pg 8.3.3 server

From
"Scott Marlowe"
Date:
On Wed, Aug 13, 2008 at 8:41 AM, Ron Mayer
<rm_pg@cheapcomplexdevices.com> wrote:
> Greg Smith wrote:

> But I still am looking for any evidence that there were any
> widely shipped SATA (or even IDE drives) that were at fault,
> as opposed to filesystem bugs and poor settings of defaults.

Well, if they're getting more transactions per second than their
rotation rate allows (roughly 120/166.6/250 per second for
7200/10k/15k RPM drives) without a battery-backed cache, then they're
likely lying about fsync.  And most SATA and IDE drives will give you
way over that for a small data set.

Re: Filesystem benchmarking for pg 8.3.3 server

From
Decibel!
Date:
On Aug 11, 2008, at 9:01 AM, Jeff wrote:
> On Aug 11, 2008, at 5:17 AM, Henrik wrote:
>
>> OK, changed the SAS RAID 10 to RAID 5 and now my random writes are
>> handling 112 MB/sec. So it is almost twice as fast as the RAID10
>> with the same disks. Any ideas why?
>>
>> Are the iozone tests faulty?
>
>
> does IOzone disable the os caches?
> If not you need to use a size of 2xRAM for true results.
>
> regardless - the test only took 10 seconds of wall time - which
> isn't very long at all. You'd probably want to run it longer anyway.


Additionally, you need to be careful of what size writes you're
using. If you're doing random writes that perfectly align with the
raid stripe size, you'll see virtually no RAID5 overhead, and you'll
get the performance of N-1 drives, as opposed to RAID10 giving you N/2.
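That alignment effect can be put in a toy model (my own illustration, not code from the thread): full-stripe RAID5 writes avoid the read-modify-write penalty entirely and see N-1 drives of bandwidth, versus N/2 for RAID10.

```python
def raid5_write_multiplier(write_size, stripe_size):
    """Toy model: physical I/Os per logical write on RAID5.

    A write covering whole stripes needs no read-modify-write; a
    partial-stripe write costs read old data + read old parity +
    write new data + write new parity (the classic 4-I/O penalty).
    """
    return 1 if write_size % stripe_size == 0 else 4

def effective_data_drives(n_drives, level):
    """Drives contributing bandwidth to large aligned writes."""
    if level == "raid5":
        return n_drives - 1   # one drive's worth lost to parity
    if level == "raid10":
        return n_drives // 2  # every block written to both mirrors
    raise ValueError(level)

print(effective_data_drives(6, "raid5"),
      effective_data_drives(6, "raid10"))  # -> 5 3
```

With 6 drives that's 5 spindles of bandwidth for stripe-aligned RAID5 writes versus 3 for RAID10, which is roughly the gap the iozone numbers showed.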
--
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828




Re: Filesystem benchmarking for pg 8.3.3 server

From
Greg Smith
Date:
On Wed, 13 Aug 2008, Ron Mayer wrote:

> I assume test_fsync in the postgres source distribution is
> a decent way to see?

Not really.  It takes too long (runs too many tests you don't care about)
and doesn't spit out the results the way you want them--TPS, not average
time.

You can do it with pgbench (scale here really doesn't matter):

$ cat insert.sql
\set nbranches :scale
\set ntellers 10 * :scale
\set naccounts 100000 * :scale
\setrandom aid 1 :naccounts
\setrandom bid 1 :nbranches
\setrandom tid 1 :ntellers
\setrandom delta -5000 5000
BEGIN;
INSERT INTO history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid,
:aid, :delta, CURRENT_TIMESTAMP);
END;
$ createdb pgbench
$ pgbench -i -s 20 pgbench
$ pgbench -f insert.sql -s 20 -c 1 -t 10000 pgbench

Don't really need to ever rebuild that just to run more tests if all you
care about is the fsync speed (no indexes in the history table to bloat or
anything).

Or you can measure with sysbench;
http://www.mysqlperformanceblog.com/2006/05/03/group-commit-and-real-fsync/
goes over that but they don't have the syntax exactly right.  Here's an
example that works:

:~/sysbench-0.4.8/bin/bin$ ./sysbench run --test=fileio
--file-fsync-freq=1 --file-num=1 --file-total-size=16384
--file-test-mode=rndwr

> But I still am looking for any evidence that there were any widely
> shipped SATA (or even IDE drives) that were at fault, as opposed to
> filesystem bugs and poor settings of defaults.

Alan Cox claims that until circa 2001, the ATA standard didn't require
implementing the cache flush call at all.  See
http://www.kerneltraffic.org/kernel-traffic/kt20011015_137.html Since
firmware is expensive to write and manufacturers are generally lazy here,
I'd bet a lot of disks from that era were missing support for the call.
Next time I'm digging through my disk graveyard I'll try and find such a
disk.  If he's correct that the standard changed around then, you wouldn't
expect any recent drive to lack support for the call.

I feel it's largely irrelevant that most drives handle things just fine
nowadays if you send them the correct flush commands, because there are so
many other things that can make that system as a whole not work right.
Even if the flush call works most of the time, disk firmware is turning
increasingly into buggy software, and attempts to reduce how much of that
firmware you're actually using can be viewed as helpful.

This is why I usually suggest just turning the individual drive caches
off; the caveats for when they might work fine in this context are just
too numerous.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Filesystem benchmarking for pg 8.3.3 server

From
Henrik
Date:
On 13 Aug 2008, at 17:13, Decibel! wrote:

> On Aug 11, 2008, at 9:01 AM, Jeff wrote:
>> On Aug 11, 2008, at 5:17 AM, Henrik wrote:
>>
>>> OK, changed the SAS RAID 10 to RAID 5 and now my random writes are
>>> handling 112 MB/sec. So it is almost twice as fast as the RAID10
>>> with the same disks. Any ideas why?
>>>
>>> Are the iozone tests faulty?
>>
>>
>> does IOzone disable the os caches?
>> If not you need to use a size of 2xRAM for true results.
>>
>> regardless - the test only took 10 seconds of wall time - which
>> isn't very long at all. You'd probably want to run it longer anyway.
>
>
> Additionally, you need to be careful of what size writes you're
> using. If you're doing random writes that perfectly align with the
> raid stripe size, you'll see virtually no RAID5 overhead, and you'll
> get the performance of N-1 drives, as opposed to RAID10 giving you N/
> 2.
But it still needs to do 2 reads and 2 writes for every write, correct?

I did some bonnie++ tests just to give some new more reasonable numbers.
This is with RAID10 on 4 SAS 15k drives with write-back cache.

Version 1.03b       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
safecube04   32136M 73245  95 213092  16 89456  11 64923  81 219341  16 839.9   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  6178  99 +++++ +++ +++++ +++  6452 100 +++++ +++ 20633  99
safecube04,32136M,73245,95,213092,16,89456,11,64923,81,219341,16,839.9,1,16,6178,99,+++++,+++,+++++,+++,6452,100,+++++,+++,20633,99







Re: Filesystem benchmarking for pg 8.3.3 server

From
Ron Mayer
Date:
Scott Marlowe wrote:
>IDE came up corrupted every single time.
Greg Smith wrote:
> you've drank the kool-aid ... completely
> ridiculous ...unsafe fsync ... md0 RAID-1
> array (aren't there issues with md and the barriers?)

Alright - I'll eat my words.  Or mostly.

I still haven't found IDE drives that lie; but
from the testing I've done today, I'm starting to
think that:

   1a) ext3 fsync() seems to lie badly.
   1b) but ext3 can be tricked not to lie (but not
       in the way you might think).
   2a) md raid1 fsync() sometimes doesn't actually
       sync
   2b) I can't trick it not to.
   3a) some IDE drives don't even pretend to support
       letting you know when their cache is flushed
   3b) but the kernel will happily tell you about
       any such devices; as well as including md
       raid ones.

In more detail.  I tested on a number of systems
and disks including new (this year) and old (1997)
IDE drives; and EXT3 with and without the "barrier=1"
mount option.


First off - some IDE drives don't even support the
relatively recent ATA command that apparently lets
the software know when a cache flush is complete.
Apparently on those you will get messages in your
system logs:
   %dmesg | grep 'disabling barriers'
   JBD: barrier-based sync failed on md1 - disabling barriers
   JBD: barrier-based sync failed on hda3 - disabling barriers
and
   %hdparm -I /dev/hdf | grep FLUSH_CACHE_EXT
will not show you anything on those devices.
IMHO that's cool; and doesn't count as a lying IDE drive
since it didn't claim to support this.

Second of all - ext3 fsync() appears to me to
be *extremely* stupid.   It only seems to do the correct
flushing (and waiting) for a drive's cache to be flushed
when a file's inode has changed.
For example, in the test program below, it will happily
do a real fsync (i.e. the program takes a couple seconds
to run) so long as the fchmod() statements are in
there.   It will *NOT* wait on my system if I comment those
fchmod()'s out. Sadly, I get the same behavior with and
without the ext3 barrier=1 mount option. :(
==========================================================
/*
** Based on http://article.gmane.org/gmane.linux.file-systems/21373
** and http://thread.gmane.org/gmane.linux.kernel/646040
**
** Writes one byte and fsync()s it 100 times.  With the fchmod()
** calls in place the inode is dirtied each pass, ext3 really waits
** for the drive, and the program takes a couple of seconds; comment
** them out and it returns almost instantly.
*/
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
   if (argc < 2) {
     printf("usage: fs <filename>\n");
     exit(1);
   }
   int fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0666);
   if (fd < 0) {
     perror("open");
     exit(1);
   }
   int i;
   for (i = 0; i < 100; i++) {
     char byte = 0;
     pwrite(fd, &byte, 1, 0);
     fchmod(fd, 0644); fchmod(fd, 0664);  /* dirty the inode */
     fsync(fd);
   }
   return 0;
}
==========================================================
Since it does indeed wait when the inode's touched, I think
it suggests that it's not the hard drive that's lying, but
rather ext3.

So I take back what I said about linux and write barriers
being sane.   They're not.

But AFAICT, all the (6 different) IDE drives I've seen work
as advertised, and the kernel happily spews boot
messages when it finds one that doesn't support knowing
when a cache flush finished.


Re: Filesystem benchmarking for pg 8.3.3 server

From
Greg Smith
Date:
On Wed, 13 Aug 2008, Ron Mayer wrote:

> First off - some IDE drives don't even support the relatively recent ATA
> command that apparently lets the software know when a cache flush is
> complete.

Right, so this is one reason you can't assume barriers will be available.
And barriers don't work regardless if you go through the device mapper,
like some LVM and software RAID configurations; see
http://lwn.net/Articles/283161/

> Second of all - ext3 fsync() appears to me to be *extremely* stupid.
> It only seems to correctly do the correct flushing (and waiting) for a
> drive's cache to be flushed when a file's inode has changed.

This is bad, but the way PostgreSQL uses fsync seems to work fine--if it
didn't, we'd all see unnaturally high write rates all the time.

> So I take back what I said about linux and write barriers
> being sane.   They're not.

Right.  Where Linux seems to be at right now is that there's this
occasional problem people run into where ext3 volumes can get corrupted if
there are out of order writes to its journal:
http://en.wikipedia.org/wiki/Ext3#No_checksumming_in_journal
http://archives.free.net.ph/message/20070518.134838.52e26369.en.html

(By the way:  I just fixed the ext3 Wikipedia article to reflect the
current state of things and dumped a bunch of reference links in to there,
including some that are not listed here.  I prefer to keep my notes about
interesting topics in Wikipedia instead of having my own copies whenever
possible).

There are two ways to get around this issue with ext3.  One is to change
your default mount options to "data=journal".  In the
PostgreSQL case, the way the WAL is used seems to keep corruption at bay
even with the default "data=ordered" case, but after reading up on this
again I'm thinking I may want to switch to "journal" anyway in the future
(and retrofit some older installs with that change).  I also avoid using
Linux LVM whenever possible for databases just on general principle; one
less flakey thing in the way.

The other way, barriers, is just plain scary unless you know your disk
hardware does the right thing and the planets align just right, and even
then it seems buggy.  I personally just ignore the fact that they exist on
ext3, and maybe one day ext4 will get this right.

By the way:  a great ext3 "torture test" program came out a few months
ago that's useful for checking general filesystem corruption in this
context; I keep meaning to try it.  If you've got some cycles to spare
working in this area, check it out:
http://uwsg.indiana.edu/hypermail/linux/kernel/0805.2/1470.html

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Filesystem benchmarking for pg 8.3.3 server

From
"Scott Marlowe"
Date:
I've seen it written a couple of times in this thread, and in the
wikipedia article, that SOME sw raid configs don't support write
barriers.  This implies that some do.  Which ones do and which ones
don't?  Does anybody have a list of them?

I was mainly wondering if sw RAID0 on top of hw RAID1 would be safe.

Re: Filesystem benchmarking for pg 8.3.3 server

From
Ron Mayer
Date:
Greg Smith wrote:
> On Wed, 13 Aug 2008, Ron Mayer wrote:
>
>> Second of all - ext3 fsync() appears to me to be *extremely* stupid.
>> It only seems to do the correct flushing (and waiting) for a
>> drive's cache when a file's inode has changed.
>
> This is bad, but the way PostgreSQL uses fsync seems to work fine--if it
> didn't, we'd all see unnaturally high write rates all the time.

But only if you turn off IDE drive caches.

What was new to me in these experiments is that if you touch the
inode as described here:
  http://article.gmane.org/gmane.linux.file-systems/21373
then fsync() works and you can leave the IDE cache enabled -- so
long as your drive supports the flush command.  Drives that don't
(or setups where barriers fail for other reasons) show up in the
output of:
   %dmesg | grep 'disabling barriers'
   JBD: barrier-based sync failed on md1 - disabling barriers
   JBD: barrier-based sync failed on hda3 - disabling barriers

>> So I take back what I said about linux and write barriers
>> being sane.   They're not.
>
> Right.  Where Linux seems to be at right now is that there's this

I almost fear I misphrased that.   Apparently IDE drives
don't lie (the ones that don't support barriers let the OS
know that they don't).   And apparently write barriers
do work.

It's just that ext3 only uses the write barriers correctly
on fsync() when an inode is touched, rather than any time
a file's data is touched.


> then it seems buggy.  I personally just ignore the fact that they exist
> on ext3, and maybe one day ext4 will get this right.

+1

Re: Filesystem benchmarking for pg 8.3.3 server

From
Decibel!
Date:
On Aug 13, 2008, at 2:54 PM, Henrik wrote:
>> Additionally, you need to be careful of what size writes you're
>> using. If you're doing random writes that perfectly align with the
>> raid stripe size, you'll see virtually no RAID5 overhead, and
>> you'll get the performance of N-1 drives, as opposed to RAID10
>> giving you N/2.
> But it still needs to do 2 reads and 2 writes for every write,
> correct?


If you are completely over-writing an entire stripe, there's no
reason to read the existing data; you would just calculate the parity
information from the new data. Any good controller should take that
approach.
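To make the arithmetic concrete, here's a toy sketch in Python (function
names are made up for illustration; this is not how a real controller is
coded).  RAID5 parity is just the bytewise XOR of a stripe's data blocks,
so a full-stripe write can compute parity from the new data alone, while
a partial write has to read the old data and old parity back first --
the 2-read/2-write path.

```python
from functools import reduce

def parity(blocks):
    """RAID5 parity for one stripe: the bytewise XOR of its data blocks.
    For a full-stripe write this is all the controller must compute --
    no reads of existing data or parity are needed."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def small_write_parity(old_block, new_block, old_parity):
    """The read-modify-write path for a partial-stripe write:
    new parity = old parity XOR old data XOR new data.  This is why a
    small RAID5 write costs 2 reads (old data, old parity) plus
    2 writes (new data, new parity)."""
    return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_block, new_block))
```

Both paths produce the same parity; the full-stripe path just gets there
without touching the existing blocks on disk.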
--
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828




Re: Filesystem benchmarking for pg 8.3.3 server

From
david@lang.hm
Date:
On Sat, 16 Aug 2008, Decibel! wrote:

> On Aug 13, 2008, at 2:54 PM, Henrik wrote:
>>> Additionally, you need to be careful of what size writes you're using. If
>>> you're doing random writes that perfectly align with the raid stripe size,
>>> you'll see virtually no RAID5 overhead, and you'll get the performance of
>>> N-1 drives, as opposed to RAID10 giving you N/2.
>> But it still needs to do 2 reads and 2 writes for every write, correct?
>
>
> If you are completely over-writing an entire stripe, there's no reason to
> read the existing data; you would just calculate the parity information from
> the new data. Any good controller should take that approach.

in theory yes, in practice the OS writes usually aren't that large and
aligned, and as a result most raid controllers (and software) don't have
the special-case code to deal with it.

there's discussion of these issues, but not much more than that.

David Lang

Re: Filesystem benchmarking for pg 8.3.3 server

From
Gregory Stark
Date:
<david@lang.hm> writes:

>> If you are completely over-writing an entire stripe, there's no reason to
>> read the existing data; you would just calculate the parity information from
>> the new data. Any good controller should take that approach.
>
> in theory yes, in practice the OS writes usually aren't that large and aligned,
> and as a result most raid controllers (and software) don't have the
> special-case code to deal with it.

I'm pretty sure all half-decent controllers and software do, actually.  This
is one major reason that large (hopefully battery-backed) caches help RAID-5
disproportionately.  The larger the cache, the more likely it'll be able to
wait until the entire raid stripe is replaced, avoiding having to read in
the old parity.


--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's 24x7 Postgres support!

Re: Filesystem benchmarking for pg 8.3.3 server

From
Gregory Stark
Date:
"Gregory Stark" <stark@enterprisedb.com> writes:

> <david@lang.hm> writes:
>
>>> If you are completely over-writing an entire stripe, there's no reason to
>>> read the existing data; you would just calculate the parity information from
>>> the new data. Any good controller should take that approach.
>>
>> in theory yes, in practice the OS writes usually aren't that large and aligned,
>> and as a result most raid controllers (and software) don't have the
>> special-case code to deal with it.
>
> I'm pretty sure all half-decent controllers and software do, actually. This is
> one major reason that large (hopefully battery-backed) caches help RAID-5
> disproportionately. The larger the cache, the more likely it'll be able to wait
> until the entire raid stripe is replaced, avoiding having to read in the old
> parity.

Or, now that I think about it, replace two or more blocks from the same set
of parity bits.  It only has to recalculate the parity once for all those
blocks instead of once for every single block write.


--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's PostGIS support!