Thread: ATA disks and RAID controllers for database servers
Dear all,

Here is the first installment concerning ATA disks and RAID controller use in a
database server. I happened to have a Solaris system to myself this week, so took
the opportunity to use it as a "control". In this post the ATA RAID controller is
used merely to enable UDMA 133 on an oldish x86 machine; the effect of an actual
RAID level will (hopefully) be examined subsequently.

So what I was attempting to examine here was: is it feasible to build a reasonably
well performing database server using ATA disks? (In particular, would disabling
the ATA write cache spoil performance completely?)

The Systems
-----------

Dell 410
2x700MHz PIII, 512MB
Promise FastTrak TX2000 controller
2x40GB 7200RPM ATA-133 Maxtor DiamondMax Plus 8, configured as JBOD
FreeBSD 4.8 (options SMP APIC_IO i686)
PostgreSQL 7.4beta2 (-O2 -funroll-loops -fexpensive-optimizations -march=i686)
ATA write caching controlled via the loader.conf variable hw.ata.wc (1 = on)

Sun 280R
1x900MHz UltraSPARC III, 1024MB
1x36GB 10000RPM FCAL Sun (actually Seagate)
Solaris 8 (recommended patches)
PostgreSQL 7.4beta2 (-O2 -funroll-loops -fexpensive-optimizations)

The Tests
---------

1. Sequential and random writes and reads of a file twice the size of memory.

Files were written using the read(2) and write(2) functions, buffered at 8K. For
the random case 10% of the file was sampled using lseek(2), and read or written.
(A cut-down sketch of this logic is in the P.S. below; the full source is at
http://techdocs.postgresql.org/markir/download/iowork/iowork-1.0.tar.gz)

2. The PostgreSQL pgbench benchmark program.

This was run using the options:

-t 1000         [ 1000 transactions ]
-s 10           [ scale factor 10 ]
-c 1,2,4,8,16   [ 1-16 clients ]

Non-default postgresql.conf settings were:

shared_buffers      = 5000
wal_buffers         = 100
checkpoint_segments = 10

A checkpoint was forced after each run to prevent cross-run interference.

Results
-------

Test 1

System  IO Operation   Throughput (MB/s)  Options
--------------------------------------------------
Sun     seq write           21
        seq read            48
        random write         2.8
        random read          2.2

Dell    seq write           11            hw.ata.wc=0
        seq read            50            hw.ata.wc=0
        random write         1.27         hw.ata.wc=0
        random read          4.2          hw.ata.wc=0

Dell    seq write           20            hw.ata.wc=1
        seq read            53            hw.ata.wc=1
        random write         1.69         hw.ata.wc=1
        random read          4.1          hw.ata.wc=1

Test 2

System  Clients  Throughput (tps)  Options
--------------------------------------------------
Sun        1          18
           2          18
           4          22
           8          23
          16          28

Dell       1          27            hw.ata.wc=0
           2          38            hw.ata.wc=0
           4          55            hw.ata.wc=0
           8          58            hw.ata.wc=0
          16          66            hw.ata.wc=0

Dell       1          82            hw.ata.wc=1
           2         137            hw.ata.wc=1
           4         166            hw.ata.wc=1
           8         128            hw.ata.wc=1
          16         117            hw.ata.wc=1

Conclusions
-----------

Test 1

As far as sequential reading goes, there is little to choose between ATA and SCSI.
ATA with write caching off does only about half as well as SCSI for sequential
writes. It also fares poorly at random writes - even with write caching on.

The random read result was surprising - I was expecting SCSI to perform better on
all random operations (seek time on the SCSI drive is about half that of the ATA).
The "my program is measuring wrong" syndrome featured strongly, so I have run
similar tests with Bonnie - it finds the ATA drive can do 4 *times* more seeks/s -
hmmm (Bonnie gets the same sequential throughput numbers too).

A point to note for *both* systems is that all the disks were new, so had not yet
'burned in' - I don't know how significant this might be (anyone?).

Test 2

Hmmm, a 3-year-old Dell 410 hammers this year's Sun 280R (write caching on or
off). Now, it is well known that Solaris is not the fastest platform for Pg, so
maybe let's contain the excitement here.
I did experiment with using bsdmalloc to improve Solaris memory performance -
without a significant improvement (any other ideas?). But it seems safe to
conclude that it is possible to construct a reasonably well performing ATA based
system - even if write caching is off.

Criticisms
----------

Using "-s 10" only produces a database of 160MB - this is cacheable when you have
512/1024MB of real memory, so maybe "-s 100" would defeat the cache. I am
currently running some tests with this configuration.

Comparing a dual-processor Intel box to a single-processor Sun is not fair -
well, a 900MHz UltraSPARC III is *supposed* to correspond to a 1.4GHz Intel, so
2x700MHz PIIIs should be a fair match. However it does look like the two PIIIs
hammer it a bit...

regards

Mark
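P.S. For anyone who wants the flavour of Test 1 without downloading the tarball,
here is a cut-down sketch. To be clear, this is *not* the real iowork code (that
is in the tarball above) - the file name, file size and lack of options are my
own illustration - but the system calls and the 8K buffering match the
description:

/*
 * io-sketch.c - cut-down illustration of Test 1, NOT the real iowork
 * code. File name and size are illustrative only; set NBLKS so the
 * file is about twice your RAM, or the OS buffer cache hides the disk.
 *
 * Phase 1: write the file sequentially in 8K blocks.
 * Phase 2: randomly sample 10% of the blocks with lseek(2)/read(2).
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/time.h>

#define BLKSZ 8192              /* buffered at 8K, as in the post */
#define NBLKS (128 * 1024)      /* ~1GB - make this ~2x RAM       */

static double now(void)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    char buf[BLKSZ];
    double t0;
    long i, nsample = NBLKS / 10;       /* sample 10% of the file */
    int fd;

    memset(buf, 'x', sizeof(buf));
    fd = open("iotest.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    t0 = now();                         /* sequential write phase */
    for (i = 0; i < NBLKS; i++)
        if (write(fd, buf, BLKSZ) != BLKSZ) {
            perror("write");
            return 1;
        }
    fsync(fd);
    printf("seq write   : %.1f MB/s\n",
           NBLKS * (BLKSZ / 1e6) / (now() - t0));

    t0 = now();                         /* random read phase */
    for (i = 0; i < nsample; i++) {
        off_t pos = (off_t)(random() % NBLKS) * BLKSZ;

        if (lseek(fd, pos, SEEK_SET) == (off_t)-1 ||
            read(fd, buf, BLKSZ) != BLKSZ) {
            perror("read");
            return 1;
        }
    }
    printf("random read : %.2f MB/s\n",
           nsample * (BLKSZ / 1e6) / (now() - t0));

    close(fd);
    return 0;
}

Compile with something like "gcc -O2 -o io-sketch io-sketch.c". The two printed
figures correspond to the "seq write" and "random read" rows in the tables above;
the sequential read and random write cases are the obvious variations on the same
two loops.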
Dear all,

Here is the second installment concerning ATA disks and RAID controller use in a
database server. In this post a 2-disk RAID0 configuration is tested, and the
results compared with the JBOD configuration from the previous message.

So again, what I was attempting to examine here was: is it feasible to build a
reasonably well performing database server using ATA disks? (In particular, would
disabling the ATA write cache spoil performance completely?)

The System
----------

Dell 410
2x700MHz PIII, 512MB
Promise FastTrak TX2000 controller
2x40GB 7200RPM ATA-133 Maxtor DiamondMax Plus 8, configured as JBOD
  or
2x40GB 7200RPM ATA-133 Maxtor DiamondMax Plus 8, configured as RAID0
FreeBSD 4.8 (options SMP APIC_IO i686)
PostgreSQL 7.4beta2 (-O2 -funroll-loops -fexpensive-optimizations -march=i686)
ATA write caching controlled via the loader.conf variable hw.ata.wc (1 = on)

The Tests
---------

1. Sequential and random writes and reads of a file twice the size of memory.

Files were written using the read(2) and write(2) functions, buffered at 8K. For
the random case 10% of the file was sampled using lseek(2), and read or written.
(see http://techdocs.postgresql.org/markir/download/iowork/iowork-1.0.tar.gz)

The filesystem was built with the newfs options:

-U -b 32768 -f 4096   [ softupdates, 32K blocks, 4K fragments ]

The RAID0 stripe size was 128K. This gave the best performance of those tried
(32K and 64K were also tested - I got tired of rebuilding the system at this
point, so 256K and above may be better still).

2. The PostgreSQL pgbench benchmark program.

This was run using the options:

-t 1000         [ 1000 transactions ]
-s 10           [ scale factor 10 ]
-c 1,2,4,8,16   [ 1-16 clients ]

Non-default postgresql.conf settings were:

shared_buffers      = 5000
wal_buffers         = 100
checkpoint_segments = 10

A checkpoint was forced after each run to prevent cross-run interference. Three
runs were performed for each configuration and the results averaged. A new
database was created for each 1-16 client "set" of runs.

Results
-------

Test 1

System      IO Operation   Throughput (MB/s)  Options
------------------------------------------------------
Dell JBOD   seq write           11            hw.ata.wc=0
            seq read            50            hw.ata.wc=0
            random write         1.3          hw.ata.wc=0
            random read          4.2          hw.ata.wc=0

            seq write           20            hw.ata.wc=1
            seq read            53            hw.ata.wc=1
            random write         1.7          hw.ata.wc=1
            random read          4.1          hw.ata.wc=1

     RAID0  seq write           13            hw.ata.wc=0
            seq read           100            hw.ata.wc=0
            random write         1.7          hw.ata.wc=0
            random read          4.2          hw.ata.wc=0

            seq write           38            hw.ata.wc=1
            seq read           100            hw.ata.wc=1
            random write         2.5          hw.ata.wc=1
            random read          4.3          hw.ata.wc=1

Test 2

System      Clients  Throughput (tps)  Options
------------------------------------------------------
Dell JBOD      1          27            hw.ata.wc=0
               2          38            hw.ata.wc=0
               4          55            hw.ata.wc=0
               8          58            hw.ata.wc=0
              16          66            hw.ata.wc=0

               1          82            hw.ata.wc=1
               2         137            hw.ata.wc=1
               4         166            hw.ata.wc=1
               8         128            hw.ata.wc=1
              16         117            hw.ata.wc=1

     RAID0     1          33            hw.ata.wc=0
               2          39            hw.ata.wc=0
               4          61            hw.ata.wc=0
               8          73            hw.ata.wc=0
              16          80            hw.ata.wc=0

               1          95            hw.ata.wc=1
               2         156            hw.ata.wc=1
               4         194            hw.ata.wc=1
               8         179            hw.ata.wc=1
              16         144            hw.ata.wc=1

Conclusions
-----------

Test 1

It is clear that with write caching on, the RAID0 configuration greatly improves
sequential read and write performance - almost twice as fast as the JBOD case.
The random write performance is improved by a reasonable factor too.

With write caching disabled, the write rates are similar to the JBOD case. This
*may* indicate some design issue in the Promise controller.

Test 2

Whether write caching is on or off, the RAID0 configuration is faster - by about
18 percent.
General

Clearly it is possible to obtain very good performance with write caching on
using RAID0, and if you have a UPS together with good backup practice then this
could be the way to go.

With caching off there is a considerable decrease in performance; however, this
performance may be "good enough" if viewed in a cost-benefit-safety manner.

(See the P.S. for a small probe that shows the mechanism behind this gap
directly.)

Criticisms
----------

It would have been good to have two SCSI disks to test in the Dell machine (as
opposed to comparing with a Sun 280R); unfortunately I can't justify the cost of
them for this test :-(. However there are some examples of similar comparisons in
the PostgreSQL General thread "Recomended FS" (without an ATA RAID controller).

Mark
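P.S. Here is a minimal probe for where the wc=0/wc=1 gap in Test 2 comes from.
It is not part of the benchmark kit - the file name and loop count are invented
for illustration - but each iteration roughly mimics the WAL cost of a
transaction commit: an 8K write(2) followed by fsync(2):

/*
 * syncrate.c - how fast can we do "commit-like" writes? Each loop
 * iteration is an 8K write followed by an fsync, roughly the WAL
 * cost of one transaction commit. File name and count are made up.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/time.h>

int main(void)
{
    char buf[8192];
    struct timeval t0, t1;
    double secs;
    int fd, i, n = 500;

    memset(buf, 0, sizeof(buf));
    fd = open("synctest.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    gettimeofday(&t0, NULL);
    for (i = 0; i < n; i++) {
        if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
            perror("write");
            return 1;
        }
        if (fsync(fd) < 0) {            /* wait for the "disk" - or so we hope */
            perror("fsync");
            return 1;
        }
    }
    gettimeofday(&t1, NULL);

    secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("%d synced 8K writes in %.2fs = %.0f/s\n", n, secs, n / secs);
    close(fd);
    return 0;
}

Run it once with hw.ata.wc=0 and once with hw.ata.wc=1. With caching off every
fsync has to wait for the platter; with caching on the drive acknowledges from
its RAM and the reported rate jumps accordingly - the same effect that separates
the two pgbench result sets above.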
On Sat, 15 Nov 2003 14:07:40 +1300
Mark Kirkwood <markir@paradise.net.nz> wrote:

> Clearly it is possible to obtain very good performance with write
> caching on using RAID0, and if you have a UPS together with good backup
> practice then this could be the way to go.
>
> With caching off there is a considerable decrease in performance;
> however, this performance may be "good enough" if viewed in a
> cost-benefit-safety manner.

Unless the controller itself has a battery-backed cache it is dangerous - there
are many more failure modes than losing power, e.g. blowing out the power supply
or CPU. We've burnt up our fair share of CPUs over the years. Luckily on a Sun it
isn't that big a deal... but on x86, well... you get the idea.

--
Jeff Trout <jeff@jefftrout.com>
http://www.jefftrout.com/
http://www.stuarthamm.net/
Clinging to sanity, threshar@torgo.978.org (Jeff) mumbled into her beard:
> On Sat, 15 Nov 2003 14:07:40 +1300
> Mark Kirkwood <markir@paradise.net.nz> wrote:
>> Clearly it is possible to obtain very good performance with write
>> caching on using RAID0, and if you have a UPS together with good backup
>> practice then this could be the way to go.
>>
>> With caching off there is a considerable decrease in performance;
>> however, this performance may be "good enough" if viewed in a
>> cost-benefit-safety manner.
>
> Unless the controller itself has a battery-backed cache it is
> dangerous - there are many more failure modes than losing power, e.g.
> blowing out the power supply or CPU. We've burnt up our fair share of
> CPUs over the years. Luckily on a Sun it isn't that big a deal... but
> on x86, well... you get the idea.

Furthermore, if the disk drives are lying to the controller, it's anybody's
guess whether or not the data ever actually gets to the disk.

When is it safe to let blocks expire out of the controller cache?

If your computer can't know whether the data has been written (because the
drives lie), I can't imagine how the controller would know (since the drives are
lying to the controller, too).

--
If this was helpful, <http://svcs.affero.net/rm.php?r=cbbrowne> rate me
http://www3.sympatico.ca/cbbrowne/
"The primary difference between computer salesmen and used car salesmen
is that used car salesmen know when they're lying to you."
> Furthermore, if the disk drives are lying to the controller, it's
> anybody's guess whether or not the data ever actually gets to the disk.
>
> When is it safe to let blocks expire out of the controller cache?
>
> If your computer can't know whether the data has been written (because
> the drives lie), I can't imagine how the controller would know (since
> the drives are lying to the controller, too).

As I understand it, there is only one lie: the actual write to the disk. The
receipt of the data into the drive *cache* is not lied about - hence the
discussion on the linux-kernel mailing list about capacitors to provide enough
power for a cache flush in a power-off situation.

regards

Mark
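P.S. A back-of-envelope way to catch the lie from userland: a 7200 RPM drive
revolves 7200/60 = 120 times per second, so a drive that genuinely waits for the
platter before acknowledging can complete at most on the order of 120 synchronous
writes per second to the same spot. If timing a write(2) + fsync(2) loop (like
the probe posted earlier in the thread) reports many hundreds or thousands of
synced writes per second, those acknowledgements must be coming from the drive's
cache, not the platter.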
Jeff wrote:

> Unless the controller itself has a battery-backed cache it is
> dangerous - there are many more failure modes than losing power, e.g.
> blowing out the power supply or CPU. We've burnt up our fair share of
> CPUs over the years. Luckily on a Sun it isn't that big a deal... but
> on x86, well... you get the idea.

Agreed. Power supply failure seems to be an ever-present menace - had one last
month on a 3-year-old Sun E220 - it's saying "replace me" :-)

regards

Mark