Thread: Bad iostat numbers
While collecting performance data I discovered very bad numbers in the
I/O subsystem and I would like to know if I'm thinking correctly.

Here is a typical iostat -x:

avg-cpu:  %user  %nice  %system  %iowait  %idle
          50.40   0.00     0.50     1.10  48.00

Device:  rrqm/s  wrqm/s    r/s    w/s  rsec/s  wsec/s  rkB/s  wkB/s  avgrq-sz   avgqu-sz  await  svctm  %util
sda        0.00    7.80   0.40   6.40   41.60  113.60  20.80  56.80     22.82  570697.50  10.59 147.06 100.00
sdb        0.20    7.80   0.60   6.40   40.00  113.60  20.00  56.80     21.94  570697.50   9.83 142.86 100.00
md1        0.00    0.00   1.20  13.40   81.60  107.20  40.80  53.60     12.93       0.00   0.00   0.00   0.00
md0        0.00    0.00   0.00   0.00    0.00    0.00   0.00   0.00      0.00       0.00   0.00   0.00   0.00
Are they not saturated?
What parameters should I pay attention to when comparing SCSI controllers and disks? I would like to discover how much cache is present in the controller; how can I find this value from Linux?
Thank you in advance!
dmesg output:
...
SCSI subsystem initialized
ACPI: PCI Interrupt 0000:04:02.0[A] -> GSI 18 (level, low) -> IRQ 18
scsi0 : Adaptec AIC79XX PCI-X SCSI HBA DRIVER, Rev 3.0
<Adaptec (Dell OEM) 39320 Ultra320 SCSI adapter>
aic7902: Ultra320 Wide Channel A, SCSI Id=7, PCI 33 or 66Mhz, 512 SCBs
Vendor: SEAGATE Model: ST336607LW Rev: DS10
Type: Direct-Access ANSI SCSI revision: 03
target0:0:0: asynchronous
scsi0:A:0:0: Tagged Queuing enabled. Depth 4
target0:0:0: Beginning Domain Validation
target0:0:0: wide asynchronous
target0:0:0: FAST-160 WIDE SCSI 320.0 MB/s DT IU QAS RDSTRM RTI WRFLOW PCOMP (6.25 ns, offset 63)
target0:0:0: Ending Domain Validation
SCSI device sda: 71132959 512-byte hdwr sectors (36420 MB)
sda: Write Protect is off
sda: Mode Sense: ab 00 10 08
SCSI device sda: drive cache: write back w/ FUA
SCSI device sda: 71132959 512-byte hdwr sectors (36420 MB)
sda: Write Protect is off
sda: Mode Sense: ab 00 10 08
SCSI device sda: drive cache: write back w/ FUA
sda: sda1 sda2 sda3
sd 0:0:0:0: Attached scsi disk sda
Vendor: SEAGATE Model: ST336607LW Rev: DS10
Type: Direct-Access ANSI SCSI revision: 03
target0:0:1: asynchronous
scsi0:A:1:0: Tagged Queuing enabled. Depth 4
target0:0:1: Beginning Domain Validation
target0:0:1: wide asynchronous
target0:0:1: FAST-160 WIDE SCSI 320.0 MB/s DT IU QAS RDSTRM RTI WRFLOW PCOMP (6.25 ns, offset 63)
target0:0:1: Ending Domain Validation
SCSI device sdb: 71132959 512-byte hdwr sectors (36420 MB)
sdb: Write Protect is off
sdb: Mode Sense: ab 00 10 08
SCSI device sdb: drive cache: write back w/ FUA
SCSI device sdb: 71132959 512-byte hdwr sectors (36420 MB)
sdb: Write Protect is off
sdb: Mode Sense: ab 00 10 08
SCSI device sdb: drive cache: write back w/ FUA
sdb: sdb1 sdb2 sdb3
sd 0:0:1:0: Attached scsi disk sdb
ACPI: PCI Interrupt 0000:04:02.1[B] -> GSI 19 (level, low) -> IRQ 19
scsi1 : Adaptec AIC79XX PCI-X SCSI HBA DRIVER, Rev 3.0
<Adaptec (Dell OEM) 39320 Ultra320 SCSI adapter>
aic7902: Ultra320 Wide Channel B, SCSI Id=7, PCI 33 or 66Mhz, 512 SCBs
...
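(For reference, the cache question above can usually be answered from the
shell. This is only a sketch; it assumes the sdparm and pciutils packages
are installed, and /dev/sda is just the example device from the dmesg
output.)

  # Identify the controller; a plain HBA like the 39320 shows up as a
  # "SCSI storage controller", while cards with their own cache/RAID
  # logic are normally listed as a "RAID bus controller".
  lspci | grep -iE 'scsi|raid'

  # Ask the drive itself whether its write cache (WCE) is enabled.
  sdparm --get=WCE /dev/sda

  # The kernel also logs the drive cache policy at probe time, as seen
  # in the dmesg output above.
  dmesg | grep -i 'drive cache'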
Carlos H. Reimer wrote:
> While collecting performance data I discovered very bad numbers in the
> I/O subsystem and I would like to know if I'm thinking correctly.
>
> Here is a typical iostat -x:
>
> [iostat output snipped]
>
> Are they not saturated?

They look it (if I'm reading your typical numbers correctly) - %util 100
and svctime in the region of 100 ms!

On the face of it, looks like you need something better than a RAID1
setup - probably RAID10 (RAID5 is probably no good as you are writing
more than you are reading, it seems). However, read on...

If this is a sudden change in system behavior, then it is probably worth
trying to figure out what is causing it (i.e. which queries) - for
instance it might be that you have some new queries that are doing disk
based sorts (this would mean you really need more memory rather than
better disk...)

Cheers

Mark
Carlos H. Reimer wrote:
> [iostat output snipped]
>
> Are they not saturated?
>
> What parameters should I pay attention to when comparing SCSI
> controllers and disks? I would like to discover how much cache is
> present in the controller; how can I find this value from Linux?

These numbers look a bit strange. I am wondering if there is a hardware
problem on one of the drives or on the controller. Check in syslog for
messages about disk timeouts etc. 100% util but 6 writes/s is just wrong
(unless the drive is a 1980's vintage floppy).
David Boreham wrote:
> These numbers look a bit strange. I am wondering if there is a hardware
> problem on one of the drives or on the controller. Check in syslog for
> messages about disk timeouts etc. 100% util but 6 writes/s is just wrong
> (unless the drive is a 1980's vintage floppy).

Agreed - good call, I was misreading the wkB/s as wMB/s...
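(One way to follow up on the syslog suggestion; /var/log/messages is the
Red Hat default location, adjust for other distributions.)

  # Look for SCSI resets, timeouts and I/O errors from the aic79xx driver
  grep -iE 'aic79xx|timeout|i/o error|reset' /var/log/messages | tail -n 50

  # And watch the kernel ring buffer while the problem is happening
  dmesg | tail -n 100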
Hi,

I've taken a look in /var/log/messages and found some temperature
messages about the disk drives:

Nov 30 11:08:07 totall smartd[1620]: Device: /dev/sda, Temperature changed 2
Celsius to 51 Celsius since last report

Can this temperature influence the performance?

Reimer
Hi,

If you look at the iostat data, it shows that the system is doing many
more writes than reads. It is strange, because if you look at the
pg_stat tables we see a completely different scenario: many more reads
than writes.

I was monitoring the presence of temporary files in the data directory,
which could denote big sorts, but nothing there either. I think the
write volume is explained by the high number of indexes present in those
tables: one write in the base table, many others in the indexes.

Well, about the server behaviour: it has not changed suddenly, but the
performance is becoming worse day by day.

Reimer
Carlos H. Reimer wrote:
> I've taken a look in /var/log/messages and found some temperature
> messages about the disk drives:
>
> Nov 30 11:08:07 totall smartd[1620]: Device: /dev/sda, Temperature changed 2
> Celsius to 51 Celsius since last report
>
> Can this temperature influence the performance?

It can influence 'working-ness', which I guess in turn affects
performance ;)  But I'm not sure if 50C is too high for a disk drive; it
might be OK. If you are able to, I'd say just replace the drives and see
if that improves things.
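(If smartmontools is installed, the drives can be queried directly rather
than waiting for smartd to log something; older smartctl versions may
need the explicit -d scsi for these drives. Device names are examples.)

  # Overall health verdict and the full attribute/error output,
  # including the current temperature
  smartctl -H -d scsi /dev/sda
  smartctl -a -d scsi /dev/sda

  # Kick off a short self-test, then read the result a few minutes later
  smartctl -t short -d scsi /dev/sda
  smartctl -l selftest -d scsi /dev/sda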
On Thu, 30 Nov 2006, Carlos H. Reimer wrote:

> I would like to discover how much cache is present in the controller,
> how can I find this value from Linux?

As far as I know there is no cache on an Adaptec 39320.  The write-back
cache Linux was reporting on was the one in the drives, which is 8MB; see
http://www.seagate.com/cda/products/discsales/enterprise/tech/1,1593,541,00.html
Be warned that running your database with the combination of an uncached
controller plus disks with write caching is dangerous to your database
integrity.

There is a common problem with the Linux driver for this card (aic7902)
where it enters what they're calling an "Infinite Interrupt Loop".  That
seems to match your readings:

> Here is a typical iostat -x:
> Device:   rrqm/s  wrqm/s   r/s   w/s  rsec/s  wsec/s  rkB/s  wkB/s
> sda         0.00    7.80  0.40  6.40   41.60  113.60  20.80  56.80
> avgrq-sz   avgqu-sz  await  svctm  %util
>    22.82  570697.50  10.59 147.06 100.00

An avgqu-sz of 570697.50 is extremely large.  That explains why the
utilization is 100%: there's a massive number of I/O operations queued up
that aren't getting flushed out.  The read and write data says these
drives are barely doing anything, as 20kB/s and 57kB/s are practically
idle; they're not even remotely close to saturated.

See http://lkml.org/lkml/2005/10/1/47 for a suggested workaround that may
reduce the magnitude of this issue; lowering the card's speed to U160 in
the BIOS was also listed as a useful workaround.  You might get better
results by upgrading to a newer Linux kernel, and just rebooting to clear
out the garbage might help if you haven't tried that yet.

On the pessimistic side, other people reporting issues with this
controller are:

http://lkml.org/lkml/2005/12/17/55
http://www.ussg.iu.edu/hypermail/linux/kernel/0512.2/0390.html
http://www.linuxforums.org/forum/peripherals-hardware/59306-scsi-hangs-boot.html

and even under FreeBSD at
http://lists.freebsd.org/pipermail/aic7xxx/2003-August/003973.html

This Adaptec card just barely works under Linux, which happens regularly
with their controllers, and my guess is that you've run into one of the
ways it goes crazy sometimes.  I just chuckled when checking
http://linux.adaptec.com/ again and noticing they can't even be bothered
to keep that server up at all.  According to
http://www.adaptec.com/en-US/downloads/linux_source/linux_source_code?productId=ASC-39320-R&dn=Adaptec+SCSI+Card+39320-R
the driver for your card is "*minimally tested* for Linux Kernel v2.6 on
all platforms."  Adaptec doesn't care about Linux support on their
products; if you want a SCSI controller that actually works under Linux,
get an LSI MegaRAID.

If this were really a Postgres problem, I wouldn't expect %iowait=1.10.
Were the database engine waiting to read/write data, that number would be
dramatically higher.  Whatever is generating all these I/O requests, it's
not waiting for them to complete like the database would be.  Besides the
driver problems that I'm very suspicious of, I'd suspect a runaway
process writing garbage to the disks might also cause this behavior.

> I've taken a look in /var/log/messages and found some temperature
> messages about the disk drives:
>
> Nov 30 11:08:07 totall smartd[1620]: Device: /dev/sda, Temperature changed 2
> Celsius to 51 Celsius since last report
>
> Can this temperature influence the performance?

That's close to the upper tolerance for this drive (55 degrees), which
means the drive is being cooked and will likely wear out quickly.  But
that won't slow it down, and you'd get much scarier messages out of
smartd if the drives had a real problem.  You should improve cooling in
this case if you want the drives to have a healthy life; odds are low
this is relevant to your performance issue, though.

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
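(A rough way to check for the interrupt-loop behaviour described above is
to see how fast the controller's interrupt counters climb while the box
is otherwise idle; the exact name shown in /proc/interrupts varies by
kernel, so this is only a sketch. The dmesg above shows the two channels
on IRQ 18 and 19.)

  # Sample the aic79xx interrupt counters twice, ten seconds apart
  grep -i aic /proc/interrupts; sleep 10; grep -i aic /proc/interrupts

  # vmstat's "in" column shows the system-wide interrupt rate per second
  vmstat 5 5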
People recommend LSI MegaRAID controllers on here regularly, but I have
found that they do not work that well.  I have bonnie++ numbers that show
the controller is not performing anywhere near the disk's saturation
level in a simple RAID 1 on RedHat Linux EL4, on two separate machines
provided by two different hosting companies.  In one case I asked them to
replace the card, and the numbers got a bit better, but still not
optimal.

LSI MegaRAID has proved to be a bit of a disappointment.  I have seen
better numbers from the HP SmartArray 6i, and from 3ware cards with
7200RPM SATA drives.

For the output: http://www.infoconinc.com/test/bonnie++.html (the first
line is a six-drive RAID 10 on a 3ware 9500S; the next three are all
RAID 1s on LSI MegaRAID controllers, verified by lspci).

Alex
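(For anyone wanting to reproduce numbers like these, a typical bonnie++
invocation looks roughly like the following; the mount point is a
placeholder, and the file size should be well over the machine's RAM so
the OS cache doesn't hide the disks.)

  # -d test directory, -s file size in MB, -n 0 skips the small-file
  # tests, -u is the user to run as when started as root
  bonnie++ -d /mnt/raidtest -s 8192 -n 0 -u nobody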
On Mon, 2006-12-04 at 01:17, Alex Turner wrote:
> People recommend LSI MegaRAID controllers on here regularly, but I
> have found that they do not work that well. [...]
>
> for the output: http://www.infoconinc.com/test/bonnie++.html (the
> first line is a six drive RAID 10 on a 3ware 9500S, the next three are
> all RAID 1s on LSI MegaRAID controllers, verified by lspci).

Wait, you're comparing a MegaRAID running a RAID 1 against another
controller running a 6 disk RAID10?  That's hardly fair.

My experience with the LSI was that with the 1.18 series drivers, they
were slow but stable.

With the version 2.x drivers, I found that the performance was very good
with RAID-5 and fair with RAID-1, and that layered RAID was not any
better than unlayered (i.e. layering RAID0 over RAID1 resulted in basic
RAID-1 performance).

OTOH, with the choice at my last place of employment being LSI or
Adaptec, LSI was a much better choice. :)

I'd ask which LSI MegaRAID you've tested, and what driver was used.
Does RHEL4 have the megaraid 2 driver?
On Mon, 2006-12-04 at 10:25, Scott Marlowe wrote:
> OTOH, with the choice at my last place of employment being LSI or
> Adaptec, LSI was a much better choice. :)
>
> I'd ask which LSI MegaRAID you've tested, and what driver was used.
> Does RHEL4 have the megaraid 2 driver?

Just wanted to add that what we used our database for at my last company
was lots of mostly small writes / reads, i.e. sequential throughput
didn't really matter, but random write speed did.  For that application,
the LSI MegaRAID with battery-backed cache was great.

Last point: bonnie++ is a good benchmarking tool, but until you test your
app / PostgreSQL on top of the hardware, you can't really say how well it
will perform.  A controller that looks fast under a single bonnie++
thread might perform poorly when there are 100+ pending writes, and vice
versa: a controller that looks mediocre under bonnie++ might shine when
there's heavy parallel write load to handle.
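(A sketch of that kind of parallel test using pgbench from PostgreSQL's
contrib; the database name and the client/transaction counts are only
illustrative, the point being many concurrent small transactions rather
than one sequential stream.)

  # Initialise a pgbench database at scale factor 100 (roughly 1.5GB)
  pgbench -i -s 100 benchdb

  # Hammer it with 16 concurrent clients, 1000 transactions each
  pgbench -c 16 -t 1000 benchdb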
The RAID 10 was in there merely for filling in, not really as a compare;
indeed it would be ludicrous to compare a RAID 1 to a 6 drive RAID 10!!

How do I find out if it has version 2 of the driver?
This discussion I think is important, as I think it would be useful for
this list to have a list of RAID cards that _do_ work well under
Linux/BSD as recommended hardware for PostgreSQL.  So far, all I can
recommend is what I've found to be good, which is 3ware 9500-series cards
with 10k SATA drives.  Throughput was great until you reached higher
levels of RAID 10 (the bonnie++ mark I posted showed write speed is a bit
slow).  But that doesn't solve the problem for SCSI.  What cards in the
SCSI arena solve the problem optimally?  Why should we settle for
sub-optimal performance in SCSI when there are a number of almost
optimally performing cards in the SATA world (Areca, 3ware/AMCC, LSI)?
Thanks,
Alex
On Mon, Dec 04, 2006 at 12:37:29PM -0500, Alex Turner wrote:
> This discussion I think is important, as I think it would be useful for
> this list to have a list of RAID cards that _do_ work well under
> Linux/BSD as recommended hardware for PostgreSQL. [...] What cards in
> the SCSI arena solve the problem optimally?

Well, one factor is to be more precise about what you're looking for; a
HBA != RAID controller, and you may be comparing apples and oranges. (If
you have an external array with an onboard controller you probably want
a simple HBA rather than a RAID controller.)

Mike Stone
http://en.wikipedia.org/wiki/RAID_controller

Alex
On Mon, Dec 04, 2006 at 12:52:46PM -0500, Alex Turner wrote:
> http://en.wikipedia.org/wiki/RAID_controller

What is the wikipedia quote supposed to prove?  Pray tell, if you
consider RAID == HBA, what would you call a SCSI (e.g.) controller that
has no RAID functionality?  If you'd call it an HBA, then there is a
useful distinction to be made, no?

Mike Stone
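(For what it's worth, the distinction usually shows up directly in
lspci's device class: a plain HBA is reported under the SCSI storage
controller class, while a card doing its own RAID is reported as a RAID
bus controller.)

  # A plain HBA vs. a RAID controller, as lspci classifies them
  lspci | grep -i 'scsi storage controller'
  lspci | grep -i 'raid bus controller'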
On Mon, 2006-12-04 at 11:43, Michael Stone wrote:
> Well, one factor is to be more precise about what you're looking for; a
> HBA != RAID controller, and you may be comparing apples and oranges.
> (If you have an external array with an onboard controller you probably
> want a simple HBA rather than a RAID controller.)

I think he's been pretty clear.  He's just talking about SCSI-based RAID
controllers is all.
On Mon, 2006-12-04 at 11:37, Alex Turner wrote:
> The RAID 10 was in there merely for filling in, not really as a
> compare; indeed it would be ludicrous to compare a RAID 1 to a 6 drive
> RAID 10!!
>
> How do I find out if it has version 2 of the driver?

Go to the directory it lives in (on my Fedora Core 2 box, it's in
something like /lib/modules/2.6.10-1.9_FC2/kernel/drivers/scsi) and run
modinfo on the driver:

modinfo megaraid.ko
author:         LSI Logic Corporation
description:    LSI Logic MegaRAID driver
license:        GPL
version:        2.00.3

SNIPPED extra stuff

> This discussion I think is important, as I think it would be useful
> for this list to have a list of RAID cards that _do_ work well under
> Linux/BSD as recommended hardware for PostgreSQL. [...]

Well, I think the LSI works VERY well under Linux.  And I've always made
it quite clear in my posts that while I find it an acceptable performer,
my main recommendation is based on its stability, not speed, and that the
Areca and 3ware cards are generally regarded as faster.  And all three
beat the Adaptecs, which are observed as being rather unstable.

Does this LSI have battery-backed cache?  Are you testing it under heavy
parallel load versus single-threaded, to get an idea how it scales with
multiple processes hitting it at once?

Don't get me wrong, I'm a big fan of running tools like bonnie to get a
basic idea of how good the hardware is, but benchmarks that simulate real
production loads are the only ones worth putting your trust in.
On Mon, 4 Dec 2006, Alex Turner wrote:

> People recommend LSI MegaRAID controllers on here regularly, but I have
> found that they do not work that well.  I have bonnie++ numbers that
> show the controller is not performing anywhere near the disk's
> saturation level in a simple RAID 1 on RedHat Linux EL4 on two separate
> machines provided by two different hosting companies.
> http://www.infoconinc.com/test/bonnie++.html

I don't know what's going on with your www-september-06 machine, but the
other two are giving 32-40MB/s writes and 53-68MB/s reads.  For a RAID-1
volume, these aren't awful numbers, but I agree they're not great.

My results are no better.  For your comparison, here's a snippet of
bonnie++ results from one of my servers: RHEL 4, P4 3GHz, MegaRAID
firmware 1L37, write-thru cache setup, RAID 1; I think the drives are 10K
RPM Seagate Cheetahs.  This is from the end of the drive where
performance is the worst (I partitioned the important stuff at the
beginning where it's fastest and don't have enough free space to run
bonnie there):

------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
20708  50 21473   9  9603   3 34419  72 55799   7 467.1   1

21MB/s writes, 56MB/s reads.  Not too different from yours (especially if
your results were from the beginning of the disk), and certainly nothing
special.  I might be able to tune the write performance higher if I
cared; the battery-backed cache sits unused and everything is tuned for
paranoia rather than performance.  On this machine it doesn't matter.

The thing is, even though it's rarely the top performing card even when
set up perfectly, the LSI SCSI MegaRAID just works.  The driver is
stable, caching behavior is well defined, and it's a pleasure to
administer.  I'm never concerned that it's lying to me or doing anything
to put data at risk.  The command-line tools for Linux work perfectly,
let me look at or control whatever I want, and it was straightforward for
me to make my own customized monitoring script using them.

> LSI MegaRAID has proved to be a bit of a disappointment.  I have seen
> better numbers from the HP SmartArray 6i, and from 3ware cards with
> 7200RPM SATA drives.

Whereas although I use 7200RPM SATA drives, I always try to keep an eye
on them because I never really trust them.  The performance list archives
here also have plenty of comments about people having issues with the
SmartArray controllers; search the archives for "cciss" and you'll see
what I'm talking about.

The MegaRAID controller is very boring.  That's why I like it.  As a
Linux distribution, RedHat has similar characteristics.  If I were going
for a performance setup, I'd dump that, too, for something sexier with a
newish kernel.  It all depends on which side of the performance/stability
tradeoff you're aiming at.

On Mon, 4 Dec 2006, Scott Marlowe wrote:

> Does RHEL4 have the megaraid 2 driver?

This is from the moderately current RHEL4 installation I had results from
above.  RedHat has probably done a kernel rev since I last updated back
in September; haven't needed or wanted to reboot since then:

megaraid cmm: 2.20.2.6 (Release Date: Mon Mar 7 00:01:03 EST 2005)
megaraid: 2.20.4.6-rh2 (Release Date: Wed Jun 28 12:27:22 EST 2006)

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
My other and most important point is that I can't find any solid recommendations for a SCSI card that can perform optimally in Linux or *BSD. Off by a factor of 3x is pretty sad IMHO. (and yes, we know the Adaptec cards suck worse, that doesn't bring us to a _good_ card).
Alex.
On Tue, Dec 05, 2006 at 01:21:38AM -0500, Alex Turner wrote:
> My other and most important point is that I can't find any solid
> recommendations for a SCSI card that can perform optimally in Linux or
> *BSD.  Off by a factor of 3x is pretty sad IMHO.  (and yes, we know the
> Adaptec cards suck worse, that doesn't bring us to a _good_ card).

This gets back to my point about terminology.  As a SCSI HBA the Adaptec
is decent: I can sustain about 300MB/s off a single channel of the 39320A
using an external RAID controller.  As a RAID controller I can't even
imagine using the Adaptec; I'm fairly certain they put that
"functionality" on there just so they could charge more for the card.  It
may be that there's not much market for on-board SCSI RAID controllers;
between SATA on the low end and SAS & FC on the high end, there isn't a
whole lotta space left for SCSI.  I definitely don't think much R&D is
going into SCSI controllers any more, compared to other solutions like
SATA or SAS RAID (the 39320 hasn't changed in at least 3 years, IIRC).
Anyway, since the Adaptec part is a decent SCSI controller and a lousy
RAID controller, have you tried just using software RAID?

Mike Stone
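(A minimal sketch of that software RAID route with mdadm, assuming two
spare partitions; the device names and mount point are placeholders, and
mdadm --create destroys whatever is on the member partitions.)

  # Mirror two partitions into a new md device and put the data on it
  mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sda4 /dev/sdb4
  mkfs.ext3 /dev/md2
  mount /dev/md2 /var/lib/pgsql

  # Watch the initial resync progress
  cat /proc/mdstat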
The problem I see with software raid is the issue of a battery backed
unit: if the computer loses power, then the 'cache', which is held in
system memory, goes away, and fubars your RAID.

Alex
Alex Turner wrote:
> The problem I see with software raid is the issue of a battery backed
> unit: if the computer loses power, then the 'cache' which is held in
> system memory goes away, and fubars your RAID.

I'm not sure I see the difference.  If data are cached, they're not
written, whether it is software or hardware RAID.  I guess if you're
writing RAID 1, the N disks could be out of sync, but the system can
synchronize them once the array is restored, so that's no different than
a single disk or a hardware RAID.  If you're writing RAID 5, then the
blocks are inherently error detecting/correcting, so you're still OK if a
partial write occurs, right?

I'm not familiar with the inner details of software RAID, but the only
circumstance I can see where things would get corrupted is if the RAID
driver writes a LOT of blocks to one disk of the array before
synchronizing the others, but my guess (and it's just a guess) is that
the writes to the N disks are tightly coupled.

If I'm wrong about this, I'd like to know, because I'm using software
RAID 1 and 1+0, and I'm pretty happy with it.

Craig
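(The kernel md driver does track the state of each mirror half, and it is
easy to inspect; md1 here is just the example array from the iostat
output at the top of the thread.)

  # Array state, resync progress and any failed members
  cat /proc/mdstat

  # Per-array detail, including the event counters used to decide which
  # half of a degraded mirror is current
  mdadm --detail /dev/md1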
On Tue, Dec 05, 2006 at 07:57:43AM -0500, Alex Turner wrote:
> The problem I see with software raid is the issue of a battery backed
> unit: if the computer loses power, then the 'cache' which is held in
> system memory goes away, and fubars your RAID.

Since the Adaptec doesn't have a BBU, it's a lateral move.  Also, this is
less an issue of data integrity than performance; you can get exactly the
same level of integrity, you just have to wait for the data to sync to
disk.  If you're read-mostly that's irrelevant.

Mike Stone
On Tue, 5 Dec 2006, Craig A. James wrote:

> I'm not familiar with the inner details of software RAID, but the only
> circumstance I can see where things would get corrupted is if the RAID
> driver writes a LOT of blocks to one disk of the array before
> synchronizing the others...

You're talking about whether the discs in the RAID are kept consistent.
While it's helpful with that, too, that's not the main reason the
battery-backed cache is so helpful.  When PostgreSQL writes to the WAL,
it waits until that data has really been placed on the drive before it
enters that update into the database.  In a normal situation, that means
that you have to pause until the disk has physically written the blocks
out, and that puts a fairly low upper limit on write performance that's
based on how fast your drives rotate.  RAID 0, RAID 1, none of that will
speed up the time it takes to complete a single synchronized WAL write.

When your controller has a battery-backed cache, it can immediately tell
Postgres that the WAL write completed successfully, while actually
putting it on the disk later.  On my systems, this results in simple
writes going 2-4X as fast as they do without a cache.  Should there be a
PC failure, as long as power is restored before the battery runs out,
that transaction will be preserved.

What Alex is rightly pointing out is that a software RAID approach
doesn't have this feature.  In fact, in this area performance can be even
worse under SW RAID than what you get from a single disk, because you may
have to wait for multiple discs to spin to the correct position and write
data out before you can consider the transaction complete.

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
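(A back-of-the-envelope check of that rotation-bound limit: a 10,000 RPM
drive turns about 166 times per second, so without a write cache a single
WAL stream tops out somewhere in that neighbourhood of commits per
second. A crude way to measure a volume's synchronous write rate, assuming
a GNU dd with oflag support; the path is a placeholder and the test
writes a scratch file.)

  # 1000 synchronous 8kB writes; 1000 divided by the elapsed time gives
  # the fsync-style write rate the WAL is limited by on this volume
  dd if=/dev/zero of=/var/lib/pgsql/ddtest bs=8k count=1000 oflag=dsync
  rm /var/lib/pgsql/ddtest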
On Dec 5, 2006, at 8:54 PM, Greg Smith wrote:

> When your controller has a battery-backed cache, it can immediately
> tell Postgres that the WAL write completed successfully, while actually
> putting it on the disk later.  On my systems, this results in simple
> writes going 2-4X as fast as they do without a cache. [...]
>
> What Alex is rightly pointing out is that a software RAID approach
> doesn't have this feature.

So... the ideal might be a RAID 1 controller with BBU for the WAL, and
something else, such as software RAID, for the main data array?

Cheers,
Steve