Thread: Filesystem benchmarking for pg 8.3.3 server
Hello list,

I have a server with a direct attached storage containing 4 15k SAS drives and 6 standard SATA drives. The server is a quad core xeon with 16GB RAM. Both server and DAS have dual PERC/6E raid controllers with 512 MB BBU.

There are 2 raid sets configured:
One RAID 10 containing 4 SAS disks
One RAID 5 containing 6 SATA disks

There is one partition per RAID set with an ext2 filesystem.

I ran the following iozone test, which I stole from Joshua Drake's test at
http://www.commandprompt.com/blogs/joshua_drake/2008/04/is_that_performance_i_smell_ext2_vs_ext3_on_50_spindles_testing_for_postgresql/

I ran this test against the RAID 5 SATA partition:

#iozone -e -i0 -i1 -i2 -i8 -t1 -s 1000m -r 8k -+u

With these random write results:

Children see throughput for 1 random writers = 168647.33 KB/sec
Parent sees throughput for 1 random writers = 168413.61 KB/sec
Min throughput per process = 168647.33 KB/sec
Max throughput per process = 168647.33 KB/sec
Avg throughput per process = 168647.33 KB/sec
Min xfer = 1024000.00 KB
CPU utilization: Wall time 6.072  CPU time 0.540  CPU utilization 8.89 %

Almost 170 MB/sec. Not bad for 6 standard SATA drives.

Then I ran the same thing against the RAID 10 SAS partition:

Children see throughput for 1 random writers = 68816.25 KB/sec
Parent sees throughput for 1 random writers = 68767.90 KB/sec
Min throughput per process = 68816.25 KB/sec
Max throughput per process = 68816.25 KB/sec
Avg throughput per process = 68816.25 KB/sec
Min xfer = 1024000.00 KB
CPU utilization: Wall time 14.880  CPU time 0.520  CPU utilization 3.49 %

What, only 70 MB/sec?

Is it possible that the 2 extra spindles for the SATA drives make that partition so much faster, even though the disks and the RAID configuration should be slower? It feels like there is something fishy going on. Maybe the RAID 10 implementation on the PERC/6E is crap?

Any pointers, suggestions, ideas?

I'm going to change the RAID 10 to a RAID 5, test again, and see what happens.

Cheers,
Henke
Your expected write speed on a 4-drive RAID10 is two drives' worth, probably 160 MB/s, depending on the generation of drives.
The expected write speed for a 6-drive RAID5 is 5 drives' worth, or about 400 MB/s, minus the RAID5 parity overhead.
- Luke
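Luke's arithmetic above can be sketched out explicitly. The 80 MB/s sustained sequential write per spindle is an assumed figure for drives of that generation, not a number from this thread:

```shell
per_drive=80          # MB/s per spindle (assumption)

# RAID10 writes every block to both halves of each mirror pair, so a
# 4-drive array has only 2 effective write spindles.
raid10_mbps=$(( per_drive * 4 / 2 ))

# RAID5 spreads one drive's worth of parity across the set, leaving
# n-1 data spindles for a full-stripe write (before parity overhead).
raid5_mbps=$(( per_drive * (6 - 1) ))

echo "RAID10: ${raid10_mbps} MB/s, RAID5: ${raid5_mbps} MB/s"
```

With that per-drive assumption the numbers come out at 160 and 400 MB/s, matching Luke's estimates.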
----- Original Message -----
From: pgsql-performance-owner@postgresql.org <pgsql-performance-owner@postgresql.org>
To: pgsql-performance@postgresql.org <pgsql-performance@postgresql.org>
Sent: Fri Aug 08 10:23:55 2008
Subject: [PERFORM] Filesystem benchmarking for pg 8.3.3 server
Hello list,
I have a server with a direct attached storage containing 4 15k SAS
drives and 6 standard SATA drives.
The server is a quad core xeon with 16GB ram.
Both server and DAS have dual PERC/6E raid controllers with 512 MB BBU.
There are 2 raid sets configured:
One RAID 10 containing 4 SAS disks
One RAID 5 containing 6 SATA disks
There is one partition per RAID set with ext2 filesystem.
I ran the following iozone test which I stole from Joshua Drake's test
at
http://www.commandprompt.com/blogs/joshua_drake/2008/04/is_that_performance_i_smell_ext2_vs_ext3_on_50_spindles_testing_for_postgresql/
I ran this test against the RAID 5 SATA partition
#iozone -e -i0 -i1 -i2 -i8 -t1 -s 1000m -r 8k -+u
With these random write results
Children see throughput for 1 random writers = 168647.33 KB/sec
Parent sees throughput for 1 random writers = 168413.61 KB/sec
Min throughput per process = 168647.33 KB/sec
Max throughput per process = 168647.33 KB/sec
Avg throughput per process = 168647.33 KB/sec
Min xfer = 1024000.00 KB
CPU utilization: Wall time 6.072 CPU time 0.540 CPU
utilization 8.89 %
Almost 170 MB/sec. Not bad for 6 standard SATA drives.
Then I ran the same thing against the RAID 10 SAS partition
Children see throughput for 1 random writers = 68816.25 KB/sec
Parent sees throughput for 1 random writers = 68767.90 KB/sec
Min throughput per process = 68816.25 KB/sec
Max throughput per process = 68816.25 KB/sec
Avg throughput per process = 68816.25 KB/sec
Min xfer = 1024000.00 KB
CPU utilization: Wall time 14.880 CPU time 0.520 CPU
utilization 3.49 %
What, only 70 MB/sec?
Is it possible that the 2 extra spindles for the SATA drives make that
partition so much faster, even though the disks and the RAID
configuration should be slower?
It feels like there is something fishy going on. Maybe the RAID 10
implementation on the PERC/6e is crap?
Any pointers, suggestion, ideas?
I'm going to change the RAID 10 to a RAID 5 and test again and see
what happens.
Cheers,
Henke
--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
On Fri, Aug 8, 2008 at 8:08 AM, Henrik <henke@mac.se> wrote:
> But random writes should be faster on a RAID10 as it doesn't need to
> calculate parity. That is why people suggest RAID 10 for databases, correct?
> I can understand that RAID5 can be faster with sequential writes.

There is some data here that does not support that RAID5 can be faster
than RAID10 for sequential writes:

http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide

Regards,
Mark
8 aug 2008 kl. 18.44 skrev Mark Wong:
> On Fri, Aug 8, 2008 at 8:08 AM, Henrik <henke@mac.se> wrote:
>> But random writes should be faster on a RAID10 as it doesn't need to
>> calculate parity. That is why people suggest RAID 10 for databases,
>> correct?
>> I can understand that RAID5 can be faster with sequential writes.
>
> There is some data here that does not support that RAID5 can be faster
> than RAID10 for sequential writes:
>
> http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide

I'm amazed by the big difference between hardware and software raid.

I set up a new Dell(!) system against an MD1000 DAS with a single quad core 2.33 GHz, 16GB RAM and PERC/6E raid controllers with 512MB BBU. I set up a RAID 10 on 4 15K SAS disks.

I ran IOZone against this partition with an ext2 filesystem and got the following results:

safeuser@safecube04:/$ iozone -e -i0 -i1 -i2 -i8 -t1 -s 1000m -r 8k -+u -F /database/iotest
	Iozone: Performance Test of File I/O
	Version $Revision: 3.279 $
	Compiled for 64 bit mode.
	Build: linux

Children see throughput for 1 initial writers = 254561.23 KB/sec
Parent sees throughput for 1 initial writers = 253935.07 KB/sec
Min throughput per process = 254561.23 KB/sec
Max throughput per process = 254561.23 KB/sec
Avg throughput per process = 254561.23 KB/sec
Min xfer = 1024000.00 KB
CPU Utilization: Wall time 4.023  CPU time 0.740  CPU utilization 18.40 %

Children see throughput for 1 rewriters = 259640.61 KB/sec
Parent sees throughput for 1 rewriters = 259351.20 KB/sec
Min throughput per process = 259640.61 KB/sec
Max throughput per process = 259640.61 KB/sec
Avg throughput per process = 259640.61 KB/sec
Min xfer = 1024000.00 KB
CPU utilization: Wall time 3.944  CPU time 0.460  CPU utilization 11.66 %

Children see throughput for 1 readers = 2931030.50 KB/sec
Parent sees throughput for 1 readers = 2877172.20 KB/sec
Min throughput per process = 2931030.50 KB/sec
Max throughput per process = 2931030.50 KB/sec
Avg throughput per process = 2931030.50 KB/sec
Min xfer = 1024000.00 KB
CPU utilization: Wall time 0.349  CPU time 0.340  CPU utilization 97.32 %

Children see throughput for 1 random readers = 2534182.50 KB/sec
Parent sees throughput for 1 random readers = 2465408.13 KB/sec
Min throughput per process = 2534182.50 KB/sec
Max throughput per process = 2534182.50 KB/sec
Avg throughput per process = 2534182.50 KB/sec
Min xfer = 1024000.00 KB
CPU utilization: Wall time 0.404  CPU time 0.400  CPU utilization 98.99 %

Children see throughput for 1 random writers = 68816.25 KB/sec
Parent sees throughput for 1 random writers = 68767.90 KB/sec
Min throughput per process = 68816.25 KB/sec
Max throughput per process = 68816.25 KB/sec
Avg throughput per process = 68816.25 KB/sec
Min xfer = 1024000.00 KB
CPU utilization: Wall time 14.880  CPU time 0.520  CPU utilization 3.49 %

So compared to the HP 8000 benchmarks this setup is even better than the software raid.

But I'm skeptical of iozone's results, as when I ran the same test against 6 standard SATA drives in RAID5 I got random writes of 170 MB/sec(!). Sure, 2 more spindles, but still.

Cheers,
Henke
On 09/08/2008, Henrik <henke@mac.se> wrote:
> But random writes should be faster on a RAID10 as it doesn't need to
> calculate parity. That is why people suggest RAID 10 for databases, correct?

If it had 10 spindles as opposed to 4... With 4 drives the "split" is
(because you're striping and mirroring) like writing to two.

Cheers,
Andrej
On Fri, 8 Aug 2008, Henrik wrote:
> It feels like there is something fishy going on. Maybe the RAID 10
> implementation on the PERC/6e is crap?

Normally, when a SATA implementation is running significantly faster than a SAS one, it's because there's some write cache in the SATA disks turned on (which they usually are unless you go out of your way to disable them). Since all non-battery-backed caches need to get turned off for reliable database use, you might want to double-check that on the controller that's driving the SATA disks.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
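Greg's per-disk write caches can be inspected from the OS. A hedged sketch follows; the device names are examples, and on a PERC the disks usually sit behind the controller, so the controller's own management tool may be needed instead of talking to the drive directly:

```shell
# Check whether the on-disk write cache is enabled (SATA/libata path);
# /dev/sda is an example device name.
sudo hdparm -W /dev/sda

# Turn the disk's own cache off, leaving only the BBU-protected
# controller cache in the write path.
sudo hdparm -W0 /dev/sda

# For SAS/SCSI disks, the equivalent is the WCE mode-page bit,
# queryable with sdparm:
sudo sdparm --get=WCE /dev/sda
```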
On Fri, 8 Aug 2008, Henrik wrote:
> But random writes should be faster on a RAID10 as it doesn't need to
> calculate parity. That is why people suggest RAID 10 for databases, correct?
>
> I can understand that RAID5 can be faster with sequential writes.

The key word here is "can" be faster; it depends on the exact implementation, stripe size, OS caching, etc.

The ideal situation would be that the OS flushes exactly one stripe of data at a time (aligned with the array) so that no reads need to be done: the system merely calculates the parity info for the new data and writes it all.

The worst case is when the write size is small in relation to the stripe size and crosses a stripe boundary. In that case the system needs to read data from multiple stripes to calculate the new parity, then write the data and parity data.

I don't know any systems (software or hardware) that meet the ideal situation today.

When comparing software and hardware raid, one other thing to remember is that CPU and I/O bandwidth that's used for software raid is not available to do other things.

So a system that benchmarks much faster with software raid could end up being significantly slower in practice if it needs that CPU and I/O bandwidth for other purposes.

Examples could be needing the CPU/memory capacity to search through large amounts of RAM once the data is retrieved from disk, or finding that you have enough network I/O that it combines with your disk I/O to saturate your system busses.

David Lang

> //Henke
>
> 8 aug 2008 kl. 16.53 skrev Luke Lonergan:
>
>> Your expected write speed on a 4 drive RAID10 is two drives worth, probably
>> 160 MB/s, depending on the generation of drives.
>>
>> The expected write speed for a 6 drive RAID5 is 5 drives worth, or about 400
>> MB/s, sans the RAID5 parity overhead.
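David's full-stripe vs. read-modify-write distinction can be illustrated with toy XOR parity; the integers below are arbitrary stand-ins for data chunks, not anything measured in this thread:

```shell
# RAID5 parity is the XOR of the data chunks in a stripe. A full-stripe
# write computes parity from data already in hand -- no reads needed.
d1=10; d2=20; d3=30; d4=40
parity=$(( d1 ^ d2 ^ d3 ^ d4 ))

# A small write touching only d2 forces read-modify-write: read old d2
# and old parity, then new_parity = old_parity ^ old_d2 ^ new_d2
# (two extra reads and an extra write per small random write).
new_d2=25
rmw_parity=$(( parity ^ d2 ^ new_d2 ))

# Sanity check: recomputing the parity from scratch agrees.
full_parity=$(( d1 ^ new_d2 ^ d3 ^ d4 ))
echo "$rmw_parity $full_parity"
```

The extra read/write round trips in the read-modify-write path are exactly why small random writes are the case where RAID5 is expected to lose to RAID10.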
9 aug 2008 kl. 00.47 skrev Greg Smith:
> On Fri, 8 Aug 2008, Henrik wrote:
>
>> It feels like there is something fishy going on. Maybe the RAID 10
>> implementation on the PERC/6e is crap?
>
> Normally, when a SATA implementation is running significantly faster
> than a SAS one, it's because there's some write cache in the SATA
> disks turned on (which they usually are unless you go out of your
> way to disable them). Since all non-battery backed caches need to
> get turned off for reliable database use, you might want to
> double-check that on the controller that's driving the SATA disks.

Lucky for me, I have BBU on all my controller cards and I'm also not using the SATA drives for the database. That is why I bought the SAS drives :)

Just got confused when the SATA RAID 5 was so much faster than the SAS RAID 10, even for random writes. But I should have realized that SAS is only faster if the number of drives is equal :)

Thanks for the input!

Cheers,
Henke
OK, changed the SAS RAID 10 to RAID 5 and now my random writes are hitting 112 MB/sec. So it is almost twice as fast as the RAID10 with the same disks. Any ideas why?

Is the iozone test faulty? What are your suggestions? Trust the IOZone tests and use RAID5 instead of RAID10, or go for RAID10 as it should be faster and will be more suited when we add more disks in the future?

I'm a little confused by the benchmarks. This is from the RAID5 tests on 4 SAS 15K drives...

iozone -e -i0 -i1 -i2 -i8 -t1 -s 1000m -r 8k -+u -F /database/iotest

Children see throughput for 1 random writers = 112074.58 KB/sec
Parent sees throughput for 1 random writers = 111962.80 KB/sec
Min throughput per process = 112074.58 KB/sec
Max throughput per process = 112074.58 KB/sec
Avg throughput per process = 112074.58 KB/sec
Min xfer = 1024000.00 KB
CPU utilization: Wall time 9.137  CPU time 0.510  CPU utilization 5.58 %

9 aug 2008 kl. 04.24 skrev david@lang.hm:
> On Fri, 8 Aug 2008, Henrik wrote:
>
>> But random writes should be faster on a RAID10 as it doesn't need
>> to calculate parity. That is why people suggest RAID 10 for
>> databases, correct?
>>
>> I can understand that RAID5 can be faster with sequential writes.
>
> The key word here is "can" be faster; it depends on the exact
> implementation, stripe size, OS caching, etc.
>
> The ideal situation would be that the OS flushes exactly one
> stripe of data at a time (aligned with the array) so that no reads
> need to be done, merely calculating the parity info for the new data
> and writing it all.
>
> The worst case is when the write size is small in relation to the
> stripe size and crosses a stripe boundary. In that case the system
> needs to read data from multiple stripes to calculate the new parity
> and write the data and parity data.
>
> I don't know any systems (software or hardware) that meet the ideal
> situation today.
>
> When comparing software and hardware raid, one other thing to
> remember is that CPU and I/O bandwidth that's used for software raid
> is not available to do other things.
>
> So a system that benchmarks much faster with software raid could end
> up being significantly slower in practice if it needs that CPU and
> I/O bandwidth for other purposes.
>
> Examples could be needing the CPU/memory capacity to search through
> large amounts of RAM once the data is retrieved from disk, or finding
> that you have enough network I/O that it combines with your disk I/O
> to saturate your system busses.
>
> David Lang
>> It feels like there is something fishy going on. Maybe the RAID 10
>> implementation on the PERC/6e is crap?

It's possible. We had a bunch of PERC/5i SAS raid cards in our servers that performed quite well in RAID 5 but were shite in RAID 10. I switched them out for Adaptec 5808s and saw a massive improvement in RAID 10.
11 aug 2008 kl. 12.35 skrev Glyn Astill:
>>> It feels like there is something fishy going on. Maybe the RAID 10
>>> implementation on the PERC/6e is crap?
>
> It's possible. We had a bunch of PERC/5i SAS raid cards in our
> servers that performed quite well in RAID 5 but were shite in RAID
> 10. I switched them out for Adaptec 5808s and saw a massive
> improvement in RAID 10.

I suspected that. Maybe I should just put the PERC/6 cards in JBOD mode and then make a RAID10 with linux software raid MD?
On Mon, Aug 11, 2008 at 6:08 AM, Henrik <henke@mac.se> wrote:
> 11 aug 2008 kl. 12.35 skrev Glyn Astill:
>
>>>> It feels like there is something fishy going on. Maybe the RAID 10
>>>> implementation on the PERC/6e is crap?
>>
>> It's possible. We had a bunch of perc/5i SAS raid cards in our servers
>> that performed quite well in Raid 5 but were shite in Raid 10. I switched
>> them out for Adaptec 5808s and saw a massive improvement in Raid 10.
>
> I suspected that. Maybe I should just put the PERC/6 cards in JBOD mode and
> then make a RAID10 with linux software raid MD?

You can also try making mirror sets with the hardware RAID controller and then doing SW RAID 0 on top of that. Since RAID 0 requires little or no CPU overhead, this is a good compromise: the OS has the least work to do, and the RAID controller is doing what it's probably pretty good at, mirror sets.
On Aug 11, 2008, at 5:17 AM, Henrik wrote:
> OK, changed the SAS RAID 10 to RAID 5 and now my random writes are
> hitting 112 MB/sec. So it is almost twice as fast as the RAID10
> with the same disks. Any ideas why?
>
> Is the iozone test faulty?

Does IOzone disable the OS caches? If not, you need to use a size of 2x RAM for true results. Regardless, the test only took about 10 seconds of wall time, which isn't very long at all. You'd probably want to run it longer anyway.

> iozone -e -i0 -i1 -i2 -i8 -t1 -s 1000m -r 8k -+u -F /database/iotest
>
> Children see throughput for 1 random writers = 112074.58 KB/sec
> Parent sees throughput for 1 random writers = 111962.80 KB/sec
> Min throughput per process = 112074.58 KB/sec
> Max throughput per process = 112074.58 KB/sec
> Avg throughput per process = 112074.58 KB/sec
> Min xfer = 1024000.00 KB
> CPU utilization: Wall time 9.137  CPU time 0.510  CPU utilization 5.58 %

--
Jeff Trout <jeff@jefftrout.com>
http://www.stuarthamm.net/
http://www.dellsmartexitin.com/
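Jeff's 2x-RAM rule can be scripted rather than hard-coding -s 1000m. A sketch, where the /proc/meminfo lookup assumes Linux and the fallback is the 16GB this thread's server has:

```shell
# Size the iozone working set at twice physical RAM so the OS page
# cache can't absorb it; fall back to 16GB (in KB) if /proc/meminfo
# isn't readable.
ram_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo 2>/dev/null)
ram_kb=${ram_kb:-16777216}
size_m=$(( ram_kb * 2 / 1024 ))

# Print the resulting command line rather than running it here.
echo iozone -e -i0 -i1 -i2 -i8 -t1 -s "${size_m}m" -r 8k -+u -F /database/iotest
```

On the 16GB machine in this thread that would mean a 32GB test file, which also fixes the too-short wall time Jeff points out.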
On Sun, 10 Aug 2008, Henrik wrote:
>> Normally, when a SATA implementation is running significantly faster than a
>> SAS one, it's because there's some write cache in the SATA disks turned on
>> (which they usually are unless you go out of your way to disable them).
> Lucky for me, I have BBU on all my controller cards and I'm also not using
> the SATA drives for the database.

From how you responded I don't think I made myself clear. In addition to the cache on the controller itself, each of the disks has its own cache, probably 8-32MB in size. Your controllers may have an option to enable or disable the caches on the individual disks, which would be a separate configuration setting from turning the main controller cache on or off. Your results look like what I'd expect if the individual disk caches on the SATA drives were on, while those on the SAS controller were off (which matches the defaults you'll find on some products in both categories). Just something to double-check.

By the way: getting useful results out of iozone is fairly difficult if you're unfamiliar with it; there are lots of ways you can set it up to run tests that aren't completely fair, or to not run them for long enough to give useful results. I'd suggest doing a round of comparisons with bonnie++, which isn't as flexible but will usually give fair results without needing to specify any parameters. The "seeks" number that comes out of bonnie++ is a combined read/write one and would be good for double-checking whether the unexpected results you're seeing are independent of the benchmark used.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
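A minimal bonnie++ run along the lines Greg suggests might look like this; the scratch directory and user name are examples, and bonnie++ sizes its test file relative to RAM on its own:

```shell
# bonnie++ refuses to run as root unless told which user to run as;
# point -d at a directory on the array under test.
bonnie++ -d /database/bonnie -u safeuser
```

The per-second "seeks" figure in its output is the combined random read/write number Greg refers to.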
Hi again all,

Just wanted to give you an update.

Talked to Dell tech support and they recommended using write-through(!) caching in the RAID10 configuration. Well, it didn't work and I got even worse performance.

Anyone have an estimate of what a RAID10 on 4 15k SAS disks should generate in random writes?

I'm really keen on trying Scott's suggestion of using the PERC/6 with mirror sets only and then making the stripe with Linux SW raid.

Thanks for all the input! Much appreciated.

Cheers,
Henke

11 aug 2008 kl. 17.56 skrev Greg Smith:
> On Sun, 10 Aug 2008, Henrik wrote:
>
>>> Normally, when a SATA implementation is running significantly
>>> faster than a SAS one, it's because there's some write cache in
>>> the SATA disks turned on (which they usually are unless you go out
>>> of your way to disable them).
>> Lucky for me, I have BBU on all my controller cards and I'm also
>> not using the SATA drives for the database.
>
> From how you responded I don't think I made myself clear. In
> addition to the cache on the controller itself, each of the disks
> has its own cache, probably 8-32MB in size. Your controllers may
> have an option to enable or disable the caches on the individual
> disks, which would be a separate configuration setting from turning
> the main controller cache on or off. Your results look like what I'd
> expect if the individual disk caches on the SATA drives were on,
> while those on the SAS controller were off (which matches the
> defaults you'll find on some products in both categories). Just
> something to double-check.
On Tue, Aug 12, 2008 at 1:40 PM, Henrik <henke@mac.se> wrote:
> Hi again all,
>
> Just wanted to give you an update.
>
> Talked to Dell tech support and they recommended using write-through(!)
> caching in RAID10 configuration. Well, it didn't work and got even worse
> performance.

Someone at Dell doesn't understand the difference between write back and write through.

> Anyone have an estimate of what a RAID10 on 4 15k SAS disks should generate
> in random writes?

Using SW RAID or a non-caching RAID controller, you should be able to get close to 2x the max write rate based on RPMs. On 7200 RPM drives that's 2*150, or ~300 small transactions per second. On 15k drives that's about 2*250, or around 500 tps. The bigger the data you're writing, the fewer you're gonna be able to write each second, of course.

> I'm really keen on trying Scott's suggestion on using the PERC/6 with mirror
> sets only and then make the stripe with Linux SW raid.

Definitely worth the try. Even full-on SW RAID may be faster. It's worth testing. On our new servers at work, we have Areca controllers with 512M battery-backed cache and they were about 10% faster mixing SW and HW RAID, but honestly, it wasn't worth the extra trouble of the HW/SW combo to go with it.
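Scott's rule of thumb can be written out as arithmetic; these are his rough bounds, not measurements, and they assume one fsync'd commit per disk revolution:

```shell
# Commit rate for small transactions is bounded by rotation: roughly
# one flush per revolution, i.e. rpm/60 per second per spindle.
rpm=15000
per_spindle_tps=$(( rpm / 60 ))       # 250 for 15k drives

# A 4-drive RAID10 has 2 independent mirror pairs to spread the
# writes over, giving Scott's ~500 tps figure.
raid10_tps=$(( per_spindle_tps * 2 ))
echo "$raid10_tps"
```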
Greg Smith wrote:
> some write cache in the SATA disks...Since all non-battery backed caches
> need to get turned off for reliable database use, you might want to
> double-check that on the controller that's driving the SATA disks.

Is this really true? Doesn't the ATA "FLUSH CACHE" command (say, ATA command 0xE7) guarantee that writes are on the media?

http://www.t13.org/Documents/UploadedDocuments/technical/e01126r0.pdf
"A non-error completion of the command indicates that all cached data
since the last FLUSH CACHE command completion was successfully written
to media, including any cached data that may have been
written prior to receipt of FLUSH CACHE command."

(I still can't find any $0 SATA specs; but I imagine the final wording for the command is similar to the wording in the proposal for the command, which can be found on the ATA Technical Committee's web site at the link above.)

Really old software (notably 2.4 linux kernels) didn't send cache synchronizing commands for either SCSI or ATA; but it seems well thought through in the 2.6 kernels, as described in the Linux kernel documentation.

http://www.mjmwired.net/kernel/Documentation/block/barrier.txt

If you do have a disk where you need to disable write caches, I'd love to know the name of the disk and see the output of "hdparm -I /dev/sd***" to see if it claims to support such cache flushes.

I'm almost tempted to say that if you find yourself having to disable caches on modern (this century) hardware and software, you're probably covering up a more serious issue with your system.
Some file systems don't know about this (UFS, older linux kernels, etc).
So yes, if your OS / File System / Controller card combo properly sends the write cache flush command, and the drive is not a flawed one, all is well. Most should, not all do. Any one of those bits along the chain can potentially be disk write cache unsafe.
Scott Carey wrote:
> Some SATA drives were known to not flush their cache when told to.

Can you name one? The ATA commands seem pretty clear on the matter, and ISTM most of the reports of these issues came from before Linux had write-barrier support. I've yet to hear of a drive with the problem; though no doubt there are some cheap RAID controllers somewhere that expect you to disable the drive caches.
On Tue, Aug 12, 2008 at 6:23 PM, Scott Carey <scott@richrelevance.com> wrote: > Some SATA drives were known to not flush their cache when told to. > Some file systems don't know about this (UFS, older linux kernels, etc). > > So yes, if your OS / File System / Controller card combo properly sends the > write cache flush command, and the drive is not a flawed one, all is well. > Most should, not all do. Any one of those bits along the chain can > potentially be disk write cache unsafe. I can attest to the 2.4 kernel not being able to guarantee fsync on IDE drives. And to the LSI megaraid SCSI controllers of the era surviving numerous power off tests.
On Tue, 12 Aug 2008, Ron Mayer wrote: > Scott Carey wrote: >> Some SATA drives were known to not flush their cache when told to. > > Can you name one? The ATA commands seem pretty clear on the matter, > and ISTM most of the reports of these issues came from before > Linux had write-barrier support. I can't name one, but I've seen it mentioned in the discussions on linux-kernel several times by the folks who are writing the write-barrier support. David Lang
I recall some cheap raid cards and controller cards being an issue, like the below:
http://www.fixya.com/support/t163682-hard_drive_corrupt_every_reboot
And here is an example of an HP Fibre Channel disk firmware bug:

HS02969 28SEP07
• Title: OPN FIBRE CHANNEL DISK FIRMWARE
• Platform: S-Series & NS-Series only with FCDMs
• Summary: HP recently discovered a firmware flaw in some versions of 72, 146, and 300 Gigabyte fibre channel disk devices that shipped in late 2006 and early 2007. The flaw enabled the affected disk devices to inadvertently cache write data. In very rare instances, this caching operation presents an opportunity for disk write operations to be lost.
Even ext3 doesn't default to using write barriers at this time due to performance concerns:
http://lwn.net/Articles/283161/
Scott Marlowe wrote: > I can attest to the 2.4 kernel not being able to guarantee fsync on > IDE drives. Sure. But note that it won't for SCSI either; since AFAICT the write barrier support was implemented at the same time for both.
On Tue, 12 Aug 2008, Ron Mayer wrote:

> Really old software (notably 2.4 linux kernels) didn't send
> cache synchronizing commands for either SCSI or ATA; but
> it seems well thought through in the 2.6 kernels as described
> in the Linux kernel documentation.
> http://www.mjmwired.net/kernel/Documentation/block/barrier.txt

If you've drunk the kool-aid you might believe that. When I see people asking about this in early 2008 at http://thread.gmane.org/gmane.linux.kernel/646040 and serious disk driver hacker Jeff Garzik says "It's completely ridiculous that we default to an unsafe fsync," I don't know about you, but that barrier documentation doesn't make me feel warm and safe anymore.

> If you do have a disk where you need to disable write caches,
> I'd love to know the name of the disk and see the output of
> "hdparm -I /dev/sd***" to see if it claims to support such
> cache flushes.

The below disk writes impossibly fast when I issue a sequence of fsync writes to it under the CentOS 5 Linux I was running on it. It should only be possible to do at most 120/second since it's a 7200 RPM drive, and if I poke it with "hdparm -W0" first it behaves. The drive is a known piece of junk from circa 2004, and it's worth noting that it's an ext3 filesystem in a md0 RAID-1 array (aren't there issues with md and the barriers?)
# hdparm -I /dev/hde

/dev/hde:

ATA device, with non-removable media
	Model Number:       Maxtor 6Y250P0
	Serial Number:      Y62K95PE
	Firmware Revision:  YAR41BW0
Standards:
	Used: ATA/ATAPI-7 T13 1532D revision 0
	Supported: 7 6 5 4
Configuration:
	Logical		max	current
	cylinders	16383	65535
	heads		16	1
	sectors/track	63	63
	--
	CHS current addressable sectors:    4128705
	LBA    user addressable sectors:  268435455
	LBA48  user addressable sectors:  490234752
	device size with M = 1024*1024:      239372 MBytes
	device size with M = 1000*1000:      251000 MBytes (251 GB)
Capabilities:
	LBA, IORDY(can be disabled)
	Standby timer values: spec'd by Standard, no device specific minimum
	R/W multiple sector transfer: Max = 16	Current = 16
	Advanced power management level: unknown setting (0x0000)
	Recommended acoustic management value: 192, current value: 254
	DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
	     Cycle time: min=120ns recommended=120ns
	PIO: pio0 pio1 pio2 pio3 pio4
	     Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
	Enabled	Supported:
	   *	SMART feature set
	    	Security Mode feature set
	   *	Power Management feature set
	   *	Write cache
	   *	Look-ahead
	   *	Host Protected Area feature set
	   *	WRITE_VERIFY command
	   *	WRITE_BUFFER command
	   *	READ_BUFFER command
	   *	NOP cmd
	   *	DOWNLOAD_MICROCODE
	    	Advanced Power Management feature set
	    	SET_MAX security extension
	   *	Automatic Acoustic Management feature set
	   *	48-bit Address feature set
	   *	Device Configuration Overlay feature set
	   *	Mandatory FLUSH_CACHE
	   *	FLUSH_CACHE_EXT
	   *	SMART error logging
	   *	SMART self-test

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Tue, Aug 12, 2008 at 10:28 PM, Ron Mayer <rm_pg@cheapcomplexdevices.com> wrote: > Scott Marlowe wrote: >> >> I can attest to the 2.4 kernel not being able to guarantee fsync on >> IDE drives. > > Sure. But note that it won't for SCSI either; since AFAICT the write > barrier support was implemented at the same time for both. Tested both by pulling the power plug. The SCSI was pulled 10 times while running 600 or so concurrent pgbench threads, and so was the IDE. The SCSI came up clean every single time, the IDE came up corrupted every single time. I find it hard to believe there was no difference in write barrier behaviour with those two setups.
On Tue, 12 Aug 2008, Ron Mayer wrote:

> Really old software (notably 2.4 linux kernels) didn't send
> cache synchronizing commands for either SCSI or ATA;

Surely not true. Write cache flushing has been a known problem in the computer science world for several tens of years. The difference is that in the past we only had a "flush everything" command, whereas now we have a "flush everything before the barrier before everything after the barrier" command.

Matthew

--
"To err is human; to really louse things up requires root privileges."
-- Alexander Pope, slightly paraphrased
Scott Marlowe wrote: > On Tue, Aug 12, 2008 at 10:28 PM, Ron Mayer ...wrote: >> Scott Marlowe wrote: >>> I can attest to the 2.4 kernel ... >> ...SCSI...AFAICT the write barrier support... > > Tested both by pulling the power plug. The SCSI was pulled 10 times > while running 600 or so concurrent pgbench threads, and so was the > IDE. The SCSI came up clean every single time, the IDE came up > corrupted every single time. Interesting. With a pre-write-barrier 2.4 kernel I'd expect corruption in both. Perhaps all caches were disabled in the SCSI drives? > I find it hard to believe there was no difference in write barrier > behaviour with those two setups. Skimming lkml it seems write barriers for SCSI were behind (in terms of implementation) those for ATA http://lkml.org/lkml/2005/1/27/94 "Jan 2005 ... scsi/sata write barrier support ... For the longest time, only the old PATA drivers supported barrier writes with journalled file systems. This patch adds support for the same type of cache flushing barriers that PATA uses for SCSI"
Greg Smith wrote:
> The below disk writes impossibly fast when I issue a sequence of fsync

'k. I've got some homework. I'll be trying to reproduce this with md raid, old IDE drives, etc. I assume test_fsync in the postgres source distribution is a decent way to see?

> driver hacker Jeff Garzik says "It's completely ridiculous that we
> default to an unsafe fsync."

Yipes indeed. Still makes me want to understand why people claim IDE suffers more than SCSI, tho. Ext3 bugs seem likely to affect both to me.

> writes to it under the CentOS 5 Linux I was running on it. ...
> junk from circa 2004, and it's worth noting that it's an ext3 filesystem
> in a md0 RAID-1 array (aren't there issues with md and the barriers?)

Apparently various distros vary a lot in how they're set up (SuSE apparently defaults to mounting ext3 with the barrier=1 option; other distros seemed not to, etc). I'll do a number of experiments with md, a few different drives, etc. today and see if I can find issues with any of the drives (and/or filesystems) around here.

But I still am looking for any evidence that there were any widely shipped SATA (or even IDE) drives that were at fault, as opposed to filesystem bugs and poor settings of defaults.
On Wed, Aug 13, 2008 at 8:41 AM, Ron Mayer <rm_pg@cheapcomplexdevices.com> wrote: > Greg Smith wrote: > But I still am looking for any evidence that there were any > widely shipped SATA (or even IDE drives) that were at fault, > as opposed to filesystem bugs and poor settings of defaults. Well, if they're getting more than 150/166.6/250 transactions per second without a battery backed cache, then they're likely lying about fsync. And most SATA and IDE drives will give you way over that for a small data set.
On Aug 11, 2008, at 9:01 AM, Jeff wrote:
> On Aug 11, 2008, at 5:17 AM, Henrik wrote:
>
>> OK, changed the SAS RAID 10 to RAID 5 and now my random writes are
>> handling 112 MB/sec. So it is almost twice as fast as the RAID10
>> with the same disks. Any ideas why?
>>
>> Is the iozone test faulty?
>
> does IOzone disable the os caches?
> If not you need to use a size of 2xRAM for true results.
>
> regardless - the test only took 10 seconds of wall time - which
> isn't very long at all. You'd probably want to run it longer anyway.

Additionally, you need to be careful of what size writes you're using. If you're doing random writes that perfectly align with the raid stripe size, you'll see virtually no RAID5 overhead, and you'll get the performance of N-1 drives, as opposed to RAID10 giving you N/2.

--
Decibel!, aka Jim C. Nasby, Database Architect decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828
On Wed, 13 Aug 2008, Ron Mayer wrote:

> I assume test_fsync in the postgres source distribution is
> a decent way to see?

Not really. It takes too long (runs too many tests you don't care about) and doesn't spit out the results the way you want them--TPS, not average time.

You can do it with pgbench (scale here really doesn't matter):

$ cat insert.sql
\set nbranches :scale
\set ntellers 10 * :scale
\set naccounts 100000 * :scale
\setrandom aid 1 :naccounts
\setrandom bid 1 :nbranches
\setrandom tid 1 :ntellers
\setrandom delta -5000 5000
BEGIN;
INSERT INTO history (tid, bid, aid, delta, mtime)
  VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;

$ createdb pgbench
$ pgbench -i -s 20 pgbench
$ pgbench -f insert.sql -s 20 -c 1 -t 10000 pgbench

You don't really need to ever rebuild that just to run more tests if all you care about is the fsync speed (there are no indexes on the history table to bloat or anything).

Or you can measure with sysbench; http://www.mysqlperformanceblog.com/2006/05/03/group-commit-and-real-fsync/ goes over that, but they don't have the syntax exactly right. Here's an example that works:

:~/sysbench-0.4.8/bin/bin$ ./sysbench run --test=fileio --file-fsync-freq=1 --file-num=1 --file-total-size=16384 --file-test-mode=rndwr

> But I still am looking for any evidence that there were any widely
> shipped SATA (or even IDE drives) that were at fault, as opposed to
> filesystem bugs and poor settings of defaults.

Alan Cox claims that until circa 2001, the ATA standard didn't require implementing the cache flush call at all. See http://www.kerneltraffic.org/kernel-traffic/kt20011015_137.html Since firmware is expensive to write and manufacturers are generally lazy here, I'd bet a lot of disks from that era were missing support for the call. Next time I'm digging through my disk graveyard I'll try and find such a disk. If he's correct that the standard changed around then, you wouldn't expect any recent drive to not support the call.
I feel it's largely irrelevant that most drives handle things just fine nowadays if you send them the correct flush commands, because there are so many other things that can make the system as a whole not work right. Even if the flush call works most of the time, disk firmware is turning increasingly into buggy software, and attempts to reduce how much of that firmware you're actually using can be viewed as helpful. This is why I usually suggest just turning the individual drive caches off; the caveats for when they might work fine in this context are just too numerous.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On 13 Aug 2008, at 17:13, Decibel! wrote:
> On Aug 11, 2008, at 9:01 AM, Jeff wrote:
>> On Aug 11, 2008, at 5:17 AM, Henrik wrote:
>>
>>> OK, changed the SAS RAID 10 to RAID 5 and now my random writes are
>>> handling 112 MB/sec. So it is almost twice as fast as the RAID10
>>> with the same disks. Any ideas why?
>>>
>>> Is the iozone test faulty?
>>
>> does IOzone disable the os caches?
>> If not you need to use a size of 2xRAM for true results.
>>
>> regardless - the test only took 10 seconds of wall time - which
>> isn't very long at all. You'd probably want to run it longer anyway.
>
> Additionally, you need to be careful of what size writes you're
> using. If you're doing random writes that perfectly align with the
> raid stripe size, you'll see virtually no RAID5 overhead, and you'll
> get the performance of N-1 drives, as opposed to RAID10 giving you N/2.

But it still needs to do 2 reads and 2 writes for every write, correct?

I did some bonnie++ tests just to give some new, more reasonable numbers. This is with RAID10 on 4 SAS 15k drives with write-back cache.

Version  1.03b       ------Sequential Output------ --Sequential Input- --Random-
                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size  K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
safecube04   32136M  73245  95 213092 16  89456 11  64923  81 219341 16  839.9  1
                     ------Sequential Create------ --------Random Create--------
                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files   /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   6178  99 +++++ +++ +++++ +++  6452 100 +++++ +++ 20633  99
safecube04,32136M,73245,95,213092,16,89456,11,64923,81,219341,16,839.9,1,16,6178,99,+++++,+++,+++++,+++,6452,100,+++++,+++,20633,99

> --
> Decibel!, aka Jim C. Nasby, Database Architect decibel@decibel.org
> Give your computer some brain candy! www.distributed.net Team #1828
Scott Marlowe wrote:
> IDE came up corrupted every single time.

Greg Smith wrote:
> you've drunk the kool-aid ... completely
> ridiculous ... unsafe fsync ... md0 RAID-1
> array (aren't there issues with md and the barriers?)

Alright - I'll eat my words. Or mostly. I still haven't found IDE drives that lie; but from the testing I've done today, I'm starting to think that:

1a) ext3 fsync() seems to lie badly.
1b) but ext3 can be tricked not to lie (but not in the way you might think).
2a) md raid1 fsync() sometimes doesn't actually sync.
2b) I can't trick it not to.
3a) some IDE drives don't even pretend to support letting you know when their cache is flushed.
3b) but the kernel will happily tell you about any such devices, as well as md raid ones.

In more detail: I tested on a number of systems and disks, including new (this year) and old (1997) IDE drives, and ext3 with and without the "barrier=1" mount option.

First off - some IDE drives don't even support the relatively recent ATA command that apparently lets the software know when a cache flush is complete. Apparently on those you will get messages in your system logs:

%dmesg | grep 'disabling barriers'
JBD: barrier-based sync failed on md1 - disabling barriers
JBD: barrier-based sync failed on hda3 - disabling barriers

and

%hdparm -I /dev/hdf | grep FLUSH_CACHE_EXT

will not show you anything on those devices. IMHO that's cool, and doesn't count as a lying IDE drive since it didn't claim to support this.

Second of all - ext3 fsync() appears to me to be *extremely* stupid. It only seems to do the correct flushing (and waiting) for a drive's cache when a file's inode has changed. For example, in the test program below, it will happily do a real fsync (i.e. the program takes a couple seconds to run) so long as the "fchmod()" statements are in there. It will *NOT* wait on my system if I comment those fchmod()'s out.
Sadly, I get the same behavior with and without the ext3 barrier=1 mount option. :(

==========================================================
/*
** based on http://article.gmane.org/gmane.linux.file-systems/21373
** http://thread.gmane.org/gmane.linux.kernel/646040
*/
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    if (argc < 2) {
        printf("usage: fs <filename>\n");
        exit(1);
    }
    int fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0666);
    int i;
    for (i = 0; i < 100; i++) {
        char byte = 0;
        pwrite(fd, &byte, 1, 0);
        fchmod(fd, 0644);
        fchmod(fd, 0664);
        fsync(fd);
    }
    return 0;
}
==========================================================

Since it does indeed wait when the inode's touched, I think it suggests that it's not the hard drive that's lying, but rather ext3.

So I take back what I said about linux and write barriers being sane. They're not. But AFAICT, all the (6 different) IDE drives I've seen work as advertised, and the kernel happily spews boot messages when it finds one that doesn't support knowing when a cache flush finished.
On Wed, 13 Aug 2008, Ron Mayer wrote:

> First off - some IDE drives don't even support the relatively recent ATA
> command that apparently lets the software know when a cache flush is
> complete.

Right, so this is one reason you can't assume barriers will be available. And barriers don't work regardless if you go through the device mapper, like some LVM and software RAID configurations; see http://lwn.net/Articles/283161/

> Second of all - ext3 fsync() appears to me to be *extremely* stupid.
> It only seems to do the correct flushing (and waiting) for a
> drive's cache when a file's inode has changed.

This is bad, but the way PostgreSQL uses fsync seems to work fine--if it didn't, we'd all see unnaturally high write rates all the time.

> So I take back what I said about linux and write barriers
> being sane. They're not.

Right. Where Linux seems to be at right now is that there's this occasional problem people run into where ext3 volumes can get corrupted if there are out-of-order writes to its journal:

http://en.wikipedia.org/wiki/Ext3#No_checksumming_in_journal
http://archives.free.net.ph/message/20070518.134838.52e26369.en.html

(By the way: I just fixed the ext3 Wikipedia article to reflect the current state of things and dumped a bunch of reference links in there, including some that are not listed here. I prefer to keep my notes about interesting topics in Wikipedia instead of having my own copies whenever possible.)

There are two ways to get around this issue with ext3. You can disable write caching, changing your default mount options to "data=journal". In the PostgreSQL case, the way the WAL is used seems to keep corruption at bay even with the default "data=ordered" case, but after reading up on this again I'm thinking I may want to switch to "journal" anyway in the future (and retrofit some older installs with that change).
I also avoid using Linux LVM whenever possible for databases, just on general principle; one less flakey thing in the way.

The other way, barriers, is just plain scary unless you know your disk hardware does the right thing and the planets align just right, and even then it seems buggy. I personally just ignore the fact that they exist on ext3, and maybe one day ext4 will get this right.

By the way: there is a great ext3 "torture test" program that just came out a few months ago that's useful for checking general filesystem corruption in this context. I keep meaning to try it; if you've got some cycles to spare working in this area, check it out: http://uwsg.indiana.edu/hypermail/linux/kernel/0805.2/1470.html

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
I've seen it written a couple of times in this thread, and in the wikipedia article, that SOME sw raid configs don't support write barriers. This implies that some do. Which ones do and which ones don't? Does anybody have a list of them? I was mainly wondering if sw RAID0 on top of hw RAID1 would be safe.
Greg Smith wrote:
> On Wed, 13 Aug 2008, Ron Mayer wrote:
>
>> Second of all - ext3 fsync() appears to me to be *extremely* stupid.
>> It only seems to do the correct flushing (and waiting) for a
>> drive's cache when a file's inode has changed.
>
> This is bad, but the way PostgreSQL uses fsync seems to work fine--if it
> didn't, we'd all see unnaturally high write rates all the time.

But only if you turn off IDE drive caches. What was new to me in these experiments is that if you touch the inode as described here: http://article.gmane.org/gmane.linux.file-systems/21373 then fsync() works and you can leave the IDE cache enabled; so long as your drive supports the flush command -- which you can check by making sure the drive does not show up in the output of:

%dmesg | grep 'disabling barriers'
JBD: barrier-based sync failed on md1 - disabling barriers
JBD: barrier-based sync failed on hda3 - disabling barriers

>> So I take back what I said about linux and write barriers
>> being sane. They're not.
>
> Right. Where Linux seems to be at right now is that there's this

I almost fear I misphrased that. Apparently IDE drives don't lie (the ones that don't support barriers let the OS know that they don't). And apparently write barriers do work. It's just that ext3 only uses the write barriers correctly on fsync() when an inode is touched, rather than any time a file's data is touched.

> then it seems buggy. I personally just ignore the fact that they exist
> on ext3, and maybe one day ext4 will get this right.

+1
On Aug 13, 2008, at 2:54 PM, Henrik wrote: >> Additionally, you need to be careful of what size writes you're >> using. If you're doing random writes that perfectly align with the >> raid stripe size, you'll see virtually no RAID5 overhead, and >> you'll get the performance of N-1 drives, as opposed to RAID10 >> giving you N/2. > But it still needs to do 2 reads and 2 writes for every write, > correct? If you are completely over-writing an entire stripe, there's no reason to read the existing data; you would just calculate the parity information from the new data. Any good controller should take that approach. -- Decibel!, aka Jim C. Nasby, Database Architect decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828
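The parity bookkeeping behind this is just XOR; a toy sketch (block size, disk count, and function names are invented for illustration) of why a full-stripe write needs no reads, while updating a single block first needs the old data and old parity:

```c
#include <string.h>

#define BLOCK 8  /* toy block size in bytes */

/* Full-stripe write: parity is the XOR of all the new data blocks,
 * so nothing has to be read back from disk. */
void full_stripe_parity(const unsigned char data[][BLOCK], int ndisks,
                        unsigned char parity[BLOCK])
{
    memset(parity, 0, BLOCK);
    for (int i = 0; i < ndisks; i++)
        for (int j = 0; j < BLOCK; j++)
            parity[j] ^= data[i][j];
}

/* Partial write of one block: the controller must first READ the old
 * data block and the old parity block, then compute
 *   new_parity = old_parity ^ old_data ^ new_data
 * -- the classic RAID-5 read-modify-write penalty. */
void partial_write_parity(const unsigned char old_parity[BLOCK],
                          const unsigned char old_data[BLOCK],
                          const unsigned char new_data[BLOCK],
                          unsigned char new_parity[BLOCK])
{
    for (int j = 0; j < BLOCK; j++)
        new_parity[j] = old_parity[j] ^ old_data[j] ^ new_data[j];
}
```

Both paths produce the identical parity block; the difference is purely how many disk operations it takes to get there, which is why full-stripe writes sidestep the RAID-5 overhead.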
On Sat, 16 Aug 2008, Decibel! wrote:
> On Aug 13, 2008, at 2:54 PM, Henrik wrote:
>>> Additionally, you need to be careful of what size writes you're using. If
>>> you're doing random writes that perfectly align with the raid stripe size,
>>> you'll see virtually no RAID5 overhead, and you'll get the performance of
>>> N-1 drives, as opposed to RAID10 giving you N/2.
>> But it still needs to do 2 reads and 2 writes for every write, correct?
>
> If you are completely over-writing an entire stripe, there's no reason to
> read the existing data; you would just calculate the parity information from
> the new data. Any good controller should take that approach.

In theory yes; in practice the OS writes usually aren't that large and aligned, and as a result most raid controllers (and software) don't have the special-case code to deal with it. There's discussion of these issues, but not much more than that.

David Lang
<david@lang.hm> writes:

>> If you are completely over-writing an entire stripe, there's no reason to
>> read the existing data; you would just calculate the parity information from
>> the new data. Any good controller should take that approach.
>
> in theory yes, in practice the OS writes usually aren't that large and aligned,
> and as a result most raid controllers (and software) don't have the
> special-case code to deal with it.

I'm pretty sure all half-decent controllers and software do, actually. This is one major reason that large (hopefully battery backed) caches help RAID-5 disproportionately. The larger the cache, the more likely it'll be able to wait until the entire raid stripe is replaced and avoid having to read in the old parity.

--
Gregory Stark EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!
"Gregory Stark" <stark@enterprisedb.com> writes:

> <david@lang.hm> writes:
>
>>> If you are completely over-writing an entire stripe, there's no reason to
>>> read the existing data; you would just calculate the parity information from
>>> the new data. Any good controller should take that approach.
>>
>> in theory yes, in practice the OS writes usually aren't that large and aligned,
>> and as a result most raid controllers (and software) don't have the
>> special-case code to deal with it.
>
> I'm pretty sure all half-decent controllers and software do, actually. This is
> one major reason that large (hopefully battery backed) caches help RAID-5
> disproportionately. The larger the cache, the more likely it'll be able to wait
> until the entire raid stripe is replaced and avoid having to read in the old
> parity.

Or now that I think about it, replace two or more blocks from the same set of parity bits. It only has to recalculate the parity bits once for all those blocks instead of for every single block write.

--
Gregory Stark EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!