Thread: Bad iostat numbers

Bad iostat numbers

From
"Carlos H. Reimer"
Date:
Hi,
 
I was called to find out why one of our PostgreSQL servers does not have a satisfactory response time. The server has only two SCSI disks, configured as a software RAID1.

While collecting performance data I discovered very bad numbers in the I/O subsystem, and I would like to know if I'm thinking correctly.
 
Here is a typical iostat -x:
 

avg-cpu:  %user   %nice %system %iowait   %idle
          50.40    0.00    0.50    1.10   48.00

Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda          0.00   7.80  0.40  6.40   41.60  113.60    20.80    56.80    22.82 570697.50   10.59 147.06 100.00
sdb          0.20   7.80  0.60  6.40   40.00  113.60    20.00    56.80    21.94 570697.50    9.83 142.86 100.00
md1          0.00   0.00  1.20 13.40   81.60  107.20    40.80    53.60    12.93     0.00    0.00   0.00   0.00
md0          0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

 

Are they not saturated?

 

What kind of parameters should I pay attention to when comparing SCSI controllers and disks? I would also like to discover how much cache is present in the controller; how can I find this value from Linux?

 

Thank you in advance!

 

dmesg output:

...
SCSI subsystem initialized
ACPI: PCI Interrupt 0000:04:02.0[A] -> GSI 18 (level, low) -> IRQ 18
scsi0 : Adaptec AIC79XX PCI-X SCSI HBA DRIVER, Rev 3.0
        <Adaptec (Dell OEM) 39320 Ultra320 SCSI adapter>
        aic7902: Ultra320 Wide Channel A, SCSI Id=7, PCI 33 or 66Mhz, 512 SCBs

  Vendor: SEAGATE   Model: ST336607LW        Rev: DS10
  Type:   Direct-Access                      ANSI SCSI revision: 03
 target0:0:0: asynchronous
scsi0:A:0:0: Tagged Queuing enabled.  Depth 4
 target0:0:0: Beginning Domain Validation
 target0:0:0: wide asynchronous
 target0:0:0: FAST-160 WIDE SCSI 320.0 MB/s DT IU QAS RDSTRM RTI WRFLOW PCOMP (6.25 ns, offset 63)
 target0:0:0: Ending Domain Validation
SCSI device sda: 71132959 512-byte hdwr sectors (36420 MB)
sda: Write Protect is off
sda: Mode Sense: ab 00 10 08
SCSI device sda: drive cache: write back w/ FUA
SCSI device sda: 71132959 512-byte hdwr sectors (36420 MB)
sda: Write Protect is off
sda: Mode Sense: ab 00 10 08
SCSI device sda: drive cache: write back w/ FUA
 sda: sda1 sda2 sda3
sd 0:0:0:0: Attached scsi disk sda
  Vendor: SEAGATE   Model: ST336607LW        Rev: DS10
  Type:   Direct-Access                      ANSI SCSI revision: 03
 target0:0:1: asynchronous
scsi0:A:1:0: Tagged Queuing enabled.  Depth 4
 target0:0:1: Beginning Domain Validation
 target0:0:1: wide asynchronous
 target0:0:1: FAST-160 WIDE SCSI 320.0 MB/s DT IU QAS RDSTRM RTI WRFLOW PCOMP (6.25 ns, offset 63)
 target0:0:1: Ending Domain Validation
SCSI device sdb: 71132959 512-byte hdwr sectors (36420 MB)
sdb: Write Protect is off
sdb: Mode Sense: ab 00 10 08
SCSI device sdb: drive cache: write back w/ FUA
SCSI device sdb: 71132959 512-byte hdwr sectors (36420 MB)
sdb: Write Protect is off
sdb: Mode Sense: ab 00 10 08
SCSI device sdb: drive cache: write back w/ FUA
 sdb: sdb1 sdb2 sdb3
sd 0:0:1:0: Attached scsi disk sdb
ACPI: PCI Interrupt 0000:04:02.1[B] -> GSI 19 (level, low) -> IRQ 19
scsi1 : Adaptec AIC79XX PCI-X SCSI HBA DRIVER, Rev 3.0
        <Adaptec (Dell OEM) 39320 Ultra320 SCSI adapter>
        aic7902: Ultra320 Wide Channel B, SCSI Id=7, PCI 33 or 66Mhz, 512 SCBs
...

 

Reimer

Re: Bad iostat numbers

From
Mark Kirkwood
Date:
Carlos H. Reimer wrote:
> While collecting performance data I discovered very bad numbers in the
> I/O subsystem, and I would like to know if I'm thinking correctly.
>
> Here is a typical iostat -x:
>
>
> avg-cpu:  %user   %nice %system %iowait   %idle
>           50.40    0.00    0.50    1.10   48.00
>
> Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
> sda          0.00   7.80  0.40  6.40   41.60  113.60    20.80    56.80    22.82 570697.50   10.59 147.06 100.00
> sdb          0.20   7.80  0.60  6.40   40.00  113.60    20.00    56.80    21.94 570697.50    9.83 142.86 100.00
> md1          0.00   0.00  1.20 13.40   81.60  107.20    40.80    53.60    12.93     0.00    0.00   0.00   0.00
> md0          0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
>
> Are they not saturated?
>

They look it (if I'm reading your typical numbers correctly) - %util 100
and svctm in the region of 100 ms!

On the face of it, it looks like you need something better than a RAID1
setup - probably RAID10 (RAID5 is probably no good, as it seems you are
writing more than you are reading). However, read on...

If this is a sudden change in system behavior, then it is probably worth
trying to figure out what is causing it (i.e. which queries) - for
instance, it might be that you have some new queries that are doing disk
based sorts (this would mean you really need more memory rather than
better disks...)
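
One quick way to check for that, assuming a stock 8.x data directory layout (the path below is only illustrative), is to watch for sort spill files while the workload runs:

  # disk-based sorts show up as files under each database's pgsql_tmp directory
  du -sh /var/lib/pgsql/data/base/*/pgsql_tmp 2>/dev/null
  ls -lh /var/lib/pgsql/data/base/*/pgsql_tmp/ 2>/dev/null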

Cheers

Mark



Re: Bad iostat numbers

From
David Boreham
Date:
Carlos H. Reimer wrote:

>
>
> avg-cpu:  %user   %nice %system %iowait   %idle
>           50.40    0.00    0.50    1.10   48.00
>
> Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
> sda          0.00   7.80  0.40  6.40   41.60  113.60    20.80    56.80    22.82 570697.50   10.59 147.06 100.00
> sdb          0.20   7.80  0.60  6.40   40.00  113.60    20.00    56.80    21.94 570697.50    9.83 142.86 100.00
> md1          0.00   0.00  1.20 13.40   81.60  107.20    40.80    53.60    12.93     0.00    0.00   0.00   0.00
> md0          0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
>
> Are they not saturated?
>
>
>
> What kind of parameters should I pay attention when comparing SCSI
> controllers and disks? I would like to discover how much cache is
> present in the controller, how can I find this value from Linux?
>
>
These numbers look a bit strange. I am wondering if there is a hardware
problem on one of the drives or on the controller. Check in syslog for
messages about disk timeouts etc. 100% util at only 6 writes/s is just
wrong (unless the drive is a 1980's vintage floppy).
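
A sketch of the kind of check being suggested; the log path is the usual Red Hat location and may differ elsewhere:

  # look for SCSI resets, timeouts or media errors logged by the kernel
  grep -iE 'scsi|sd[ab]|timeout|reset|error' /var/log/messages | tail -n 50
  dmesg | grep -iE 'timeout|reset|error' | tail -n 20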



Re: Bad iostat numbers

From
Mark Kirkwood
Date:
David Boreham wrote:

>
> These numbers look a bit strange. I am wondering if there is a hardware
> problem on one of the drives or on the controller. Check in syslog for
> messages about disk timeouts etc. 100% util at only 6 writes/s is just
> wrong (unless the drive is a 1980's vintage floppy).
>

Agreed - good call, I was misreading the wkB/s as wMB/s...

RES: Bad iostat numbers

From
"Carlos H. Reimer"
Date:
Hi,

I've taken a look at /var/log/messages and found some temperature
messages about the disk drives:

Nov 30 11:08:07 totall smartd[1620]: Device: /dev/sda, Temperature changed 2
Celsius to 51 Celsius since last report

Can this temperature influence the performance?

Reimer



RES: Bad iostat numbers

From
"Carlos H. Reimer"
Date:
Hi,

If you look at the iostat data, it shows that the system is doing many more
writes than reads. It is strange, because if you look at the pg_stat tables
we see a completely different scenario: many more reads than writes. I was
monitoring the presence of temporary files in the data directory, which could
indicate big sorts, but there was nothing there either.

But I think it is explained by the high number of indexes present on those
tables: one write to the base table means many more writes to the indexes.

Well, about the server behaviour: it has not changed suddenly, but the
performance is becoming worse day by day.
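
As a sketch of how that theory could be checked from the stats views (the database name is a placeholder, and row-level stats collection must be enabled for the counters to be populated):

  psql -d yourdb -c "
    SELECT s.relname,
           (SELECT count(*) FROM pg_index i WHERE i.indrelid = s.relid) AS indexes,
           s.n_tup_ins + s.n_tup_upd + s.n_tup_del AS tuple_writes
    FROM pg_stat_user_tables s
    ORDER BY tuple_writes DESC LIMIT 10;"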


Reimer



Re: RES: Bad iostat numbers

From
David Boreham
Date:
Carlos H. Reimer wrote:

>I've taken a look at /var/log/messages and found some temperature
>messages about the disk drives:
>
>Nov 30 11:08:07 totall smartd[1620]: Device: /dev/sda, Temperature changed 2
>Celsius to 51 Celsius since last report
>
>Can this temperature influence the performance?
>
>
it can influence 'working-ness' which I guess in turn affects performance ;)

But I'm not sure if 50C is too high for a disk drive; it might be OK.

If you are able to, I'd say just replace the drives and see if that
improves things.



Re: Bad iostat numbers

From
Greg Smith
Date:
On Thu, 30 Nov 2006, Carlos H. Reimer wrote:

> I would like to discover how much cache is present in
> the controller, how can I find this value from Linux?

As far as I know there is no cache on an Adaptec 39320.  The write-back
cache Linux was reporting on was the one in the drives, which is 8MB; see
http://www.seagate.com/cda/products/discsales/enterprise/tech/1,1593,541,00.html
Be warned that running your database with the combination of an uncached
controller plus disks with write caching is dangerous to your database
integrity.
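
For reference, a sketch of how to confirm this from Linux; sdparm and smartmontools may need to be installed, and the device names are the ones from this thread:

  # is the drive's own write cache (WCE) enabled?
  sdparm --get=WCE /dev/sda
  # drive identity and SMART capability
  smartctl -i /dev/sda
  # what the controller presents itself as (a plain HBA lists no cache module)
  lspci -v | grep -A 6 -i adaptec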

There is a common problem with the Linux driver for this card (aic7902)
where it enters what they're calling an "Infinite Interrupt Loop".
That seems to match your readings:

> Here is a typical iostat -x:
> Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
> sda          0.00   7.80  0.40  6.40   41.60  113.60    20.80    56.80    22.82 570697.50   10.59 147.06 100.00

An avgqu-sz of 570697.50 is extremely large.  That explains why the
utilization is 100%, because there's a massive number of I/O operations
queued up that aren't getting flushed out.  The read and write data says
these drives are barely doing anything, as 20kB/s and 57KB/s are
practically idle; they're not even remotely close to saturated.

See http://lkml.org/lkml/2005/10/1/47 for a suggested workaround that may
reduce the magnitude of this issue; lowering the card's speed to U160 in the
BIOS was also listed as a useful workaround.  You might get better results
by upgrading to a newer Linux kernel, and just rebooting to clear out the
garbage might help if you haven't tried that yet.
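
One hedged way to spot such an interrupt storm is to watch the card's interrupt counters; the driver name comes from the dmesg output earlier in the thread:

  # a counter that climbs wildly while the disks do almost nothing points at the loop
  watch -n 1 'grep -i aic /proc/interrupts'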

On the pessimistic side, other people reporting issues with this
controller are:

http://lkml.org/lkml/2005/12/17/55
http://www.ussg.iu.edu/hypermail/linux/kernel/0512.2/0390.html
http://www.linuxforums.org/forum/peripherals-hardware/59306-scsi-hangs-boot.html
and even under FreeBSD at
http://lists.freebsd.org/pipermail/aic7xxx/2003-August/003973.html

This Adaptec card just barely works under Linux, which happens regularly
with their controllers, and my guess is that you've run into one of the
ways it goes crazy sometimes.  I just chuckled when checking
http://linux.adaptec.com/ again and noticing they can't even be bothered
to keep that server up at all.  According to

http://www.adaptec.com/en-US/downloads/linux_source/linux_source_code?productId=ASC-39320-R&dn=Adaptec+SCSI+Card+39320-R

the driver for your card is "*minimally tested* for Linux Kernel v2.6 on
all platforms."  Adaptec doesn't care about Linux support on their
products; if you want a SCSI controller that actually works under Linux,
get an LSI MegaRAID.

If this were really a Postgres problem, I wouldn't expect %iowait=1.10.
Were the database engine waiting to read/write data, that number would be
dramatically higher.  Whatever is generating all these I/O requests, it's
not waiting for them to complete like the database would be.  Besides the
driver problems that I'm very suspicious of, I'd suspect a runaway process
writing garbage to the disks might also cause this behavior.
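
A sketch of hunting for such a process with the generic tools available on a 2.6 kernel of that era (no iotop yet):

  # processes stuck in uninterruptible (disk) sleep show state D
  ps -eo state,pid,user,cmd | awk '$1 == "D"'
  # block-I/O totals per interval, to correlate with what iostat reports
  vmstat 5 5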

> I've taken a look at /var/log/messages and found some temperature
> messages about the disk drives:
> Nov 30 11:08:07 totall smartd[1620]: Device: /dev/sda, Temperature changed 2
> Celsius to 51 Celsius since last report
> Can this temperature influence the performance?

That's close to the upper tolerance for this drive (55 degrees), which
means the drive is being cooked and will likely wear out quickly.  But
that won't slow it down, and you'd get much scarier messages out of smartd
if the drives had a real problem.  You should improve cooling in this case
if you want the drives to have a healthy life; odds are low that this is
relevant to your performance issue, though.
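
A sketch of checking that directly; smartctl comes from smartmontools, the same package as the smartd that logged the warning:

  # overall health verdict plus the drive's error log
  smartctl -H /dev/sda
  smartctl -l error /dev/sda
  # current temperature as reported by the drive
  smartctl -A /dev/sda | grep -i temp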

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Bad iostat numbers

From
"Alex Turner"
Date:
People recommend LSI MegaRAID controllers on here regularly, but I have found that they do not work that well.  I have bonnie++ numbers showing that the controller is not performing anywhere near the disks' saturation level in a simple RAID 1 on RedHat Linux EL4, on two separate machines provided by two different hosting companies.  In one case I asked them to replace the card, and the numbers got a bit better, but still not optimal.

LSI MegaRAID has proved to be a bit of a disappointment.  I have seen better numbers from the HP SmartArray 6i, and from 3ware cards with 7200RPM SATA drives.

For the output, see http://www.infoconinc.com/test/bonnie++.html (the first line is a six-drive RAID 10 on a 3ware 9500S; the next three are all RAID 1s on LSI MegaRAID controllers, verified by lspci).

Alex.


Re: Bad iostat numbers

From
Scott Marlowe
Date:
On Mon, 2006-12-04 at 01:17, Alex Turner wrote:
> People recommend LSI MegaRAID controllers on here regularly, but I
> have found that they do not work that well.  I have bonnie++ numbers
> that show the controller is not performing anywhere near the disk's
> saturation level in a simple RAID 1 on RedHat Linux EL4 on two
> separate machines provided by two different hosting companies.  In one
> case I asked them to replace the card, and the numbers got a bit
> better, but still not optimal.
>
> LSI MegaRAID has proved to be a bit of a disappointment.  I have seen
> better numbers from the HP SmartArray 6i, and from 3ware cards with
> 7200RPM SATA drives.
>
> for the output: http://www.infoconinc.com/test/bonnie++.html (the
> first line is a six drive RAID 10 on a 3ware 9500S, the next three are
> all RAID 1s on LSI MegaRAID controllers, verified by lspci).

Wait, you're comparing a MegaRAID running a RAID 1 against another
controller running a 6 disk RAID10?  That's hardly fair.

My experience with the LSI was that with the 1.18 series drivers, they
were slow but stable.

With the version 2.x drivers, I found that the performance was very good
with RAID-5 and fair with RAID-1 and that layered RAID was not any
better than unlayered (i.e. layering RAID0 over RAID1 resulted in basic
RAID-1 performance).

OTOH, with the choice at my last place of employment being LSI or
Adaptec, LSI was a much better choice.  :)

I'd ask which LSI megaraid you've tested, and what driver was used.
Does RHEL4 have the megaraid 2 driver?

Re: Bad iostat numbers

From
Scott Marlowe
Date:
On Mon, 2006-12-04 at 10:25, Scott Marlowe wrote:
>
> OTOH, with the choice at my last place of employment being LSI or
> Adaptec, LSI was a much better choice.  :)
>
> I'd ask which LSI megaraid you've tested, and what driver was used.
> Does RHEL4 have the megaraid 2 driver?

Just wanted to add that at my last company we used our database for lots
of mostly small writes / reads.  I.e. sequential throughput didn't really
matter, but random write speed did.  For that application, the LSI
MegaRAID with battery backed cache was great.

Last point, bonnie++ is a good benchmarking tool, but until you test
your app / PostgreSQL on top of the hardware, you can't really say how
well it will perform.

A controller that looks fast under a single bonnie++ thread might
perform poorly when there are 100+ pending writes, and vice versa, a
controller that looks mediocre under bonnie++ might shine when there's
heavy parallel write load to handle.
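
As a sketch, something closer to a concurrent database-style load than a single bonnie++ stream is contrib's pgbench; the scale factor, client count and database name here are arbitrary:

  createdb bench
  # roughly a 750 MB test database, then 50 concurrent clients doing small read/write transactions
  pgbench -i -s 50 bench
  pgbench -c 50 -t 1000 bench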

Re: Bad iostat numbers

From
"Alex Turner"
Date:
The RAID 10 was in there merely as filler, not really as a comparison; indeed it would be ludicrous to compare a RAID 1 to a 6 drive RAID 10!

How do I find out if it has version 2 of the driver?

This discussion I think is important, as it would be useful for this list to have a list of RAID cards that _do_ work well under Linux/BSD as recommended hardware for PostgreSQL.  So far, all I can recommend is what I've found to be good, which is 3ware 9500 series cards with 10k SATA drives.  Throughput was great until you reached higher levels of RAID 10 (the bonnie++ mark I posted showed write speed is a bit slow).  But that doesn't solve the problem for SCSI.  What cards in the SCSI arena solve the problem optimally?  Why should we settle for sub-optimal performance in SCSI when there are a number of almost optimally performing cards in the SATA world (Areca, 3Ware/AMCC, LSI)?

Thanks,

Alex


Re: Bad iostat numbers

From
Michael Stone
Date:
On Mon, Dec 04, 2006 at 12:37:29PM -0500, Alex Turner wrote:
>This discussion I think is important, as I think it would be useful for this
>list to have a list of RAID cards that _do_ work well under Linux/BSD for
>people as recommended hardware for Postgresql.   So far, all I can recommend
>is what I've found to be good, which is 3ware 9500 series cards with 10k
>SATA drives.  Throughput was great until you reached higher levels of RAID
>10 (the bonnie++ mark I posted showed write speed is a bit slow).  But that
>doesn't solve the problem for SCSI.  What cards in the SCSI arena solve the
>problem optimally?  Why should we settle for sub-optimal performance in SCSI
>when there are a number of almost optimally performing cards in the SATA
>world (Areca, 3Ware/AMCC, LSI).

Well, one factor is to be more precise about what you're looking for; an
HBA != a RAID controller, and you may be comparing apples and oranges. (If
you have an external array with an onboard controller you probably want
a simple HBA rather than a RAID controller.)

Mike Stone

Re: Bad iostat numbers

From
"Alex Turner"
Date:
http://en.wikipedia.org/wiki/RAID_controller

Alex


Re: Bad iostat numbers

From
Michael Stone
Date:
On Mon, Dec 04, 2006 at 12:52:46PM -0500, Alex Turner wrote:
>http://en.wikipedia.org/wiki/RAID_controller

What is the Wikipedia link supposed to prove? Pray tell, if you
consider RAID==HBA, what would you call a SCSI (e.g.) controller that
has no RAID functionality? If you'd call it an HBA, then there is a
useful distinction to be made, no?

Mike Stone

Re: Bad iostat numbers

From
Scott Marlowe
Date:
On Mon, 2006-12-04 at 11:43, Michael Stone wrote:
> On Mon, Dec 04, 2006 at 12:37:29PM -0500, Alex Turner wrote:
> >This discussion I think is important, as I think it would be useful for this
> >list to have a list of RAID cards that _do_ work well under Linux/BSD for
> >people as recommended hardware for Postgresql.   So far, all I can recommend
> >is what I've found to be good, which is 3ware 9500 series cards with 10k
> >SATA drives.  Throughput was great until you reached higher levels of RAID
> >10 (the bonnie++ mark I posted showed write speed is a bit slow).  But that
> >doesn't solve the problem for SCSI.  What cards in the SCSI arena solve the
> >problem optimally?  Why should we settle for sub-optimal performance in SCSI
> >when there are a number of almost optimally performing cards in the SATA
> >world (Areca, 3Ware/AMCC, LSI).
>
> Well, one factor is to be more precise about what you're looking for; a
> HBA != RAID controller, and you may be comparing apples and oranges. (If
> you have an external array with an onboard controller you probably want
> a simple HBA rather than a RAID controller.)

I think he's been pretty clear.  He's just talking about SCSI based RAID
controllers is all.

Re: Bad iostat numbers

From
Scott Marlowe
Date:
On Mon, 2006-12-04 at 11:37, Alex Turner wrote:
> The RAID 10 was in there merely for filling in, not really as a
> compare, indeed it would be ludicrous to compare a RAID 1 to a 6 drive
> RAID 10!!
>
> How do I find out if it has version 2 of the driver?

Go to the directory it lives in (on my Fedora Core 2 box, it's in
something like: /lib/modules/2.6.10-1.9_FC2/kernel/drivers/scsi )
and run modinfo on the driver:

modinfo megaraid.ko
author:         LSI Logic Corporation
description:    LSI Logic MegaRAID driver
license:        GPL
version:        2.00.3

SNIPPED extra stuff
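
If you'd rather not hunt for the .ko file, a hedged alternative is to ask the running system directly; depending on the kernel the module may be called megaraid or megaraid_mbox:

  # version of whichever megaraid module the kernel knows about
  modinfo megaraid megaraid_mbox 2>/dev/null | grep -iE '^(filename|version)'
  # or see what the driver announced at boot
  dmesg | grep -i megaraid | head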

> This discussion I think is important, as I think it would be useful
> for this list to have a list of RAID cards that _do_ work well under
> Linux/BSD for people as recommended hardware for Postgresql.   So far,
> all I can recommend is what I've found to be good, which is 3ware 9500
> series cards with 10k SATA drives.  Throughput was great until you
> reached higher levels of RAID 10 (the bonnie++ mark I posted showed
> write speed is a bit slow).  But that doesn't solve the problem for
> SCSI.  What cards in the SCSI arena solve the problem optimally?  Why
> should we settle for sub-optimal performance in SCSI when there are a
> number of almost optimally performing cards in the SATA world (Areca,
> 3Ware/AMCC, LSI).

Well, I think the LSI works VERY well under Linux.  And I've always made
it quite clear in my posts that while I find it an acceptable performer,
my main recommendation is based on its stability, not speed, and that
the Areca and 3Ware cards are generally regarded as faster.  And all
three beat the Adaptecs, which are observed to be rather unstable.

Does this LSI have battery backed cache?  Are you testing it under heavy
parallel load versus single threaded to get an idea how it scales with
multiple processes hitting it at once?

Don't get me wrong, I'm a big fan of running tools like bonnie to get a
basic idea of how good the hardware is, but benchmarks that simulate
real production loads are the only ones worth putting your trust in.

Re: Bad iostat numbers

From
Greg Smith
Date:
On Mon, 4 Dec 2006, Alex Turner wrote:

> People recommend LSI MegaRAID controllers on here regularly, but I have
> found that they do not work that well.  I have bonnie++ numbers that
> show the controller is not performing anywhere near the disk's
> saturation level in a simple RAID 1 on RedHat Linux EL4 on two separate
> machines provided by two different hosting companies.
> http://www.infoconinc.com/test/bonnie++.html

I don't know what's going on with your www-september-06 machine, but the
other two are giving 32-40MB/s writes and 53-68MB/s reads.  For a RAID-1
volume, these aren't awful numbers, but I agree they're not great.

My results are no better.  For your comparison, here's a snippet of
bonnie++ results from one of my servers: RHEL 4, P4 3GHz, MegaRAID
firmware 1L37, write-thru cache setup, RAID 1; I think the drives are 10K
RPM Seagate Cheetahs.  This is from the end of the drive where performance
is the worst (I partitioned the important stuff at the beginning where
it's fastest and don't have enough free space to run bonnie there):

------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
20708  50 21473   9  9603   3 34419  72 55799   7 467.1   1

21MB/s writes, 56MB/s reads.  Not too different from yours (especially if
your results were from the beginning of the disk), and certainly nothing
special.  I might be able to tune the write performance higher if I cared;
the battery backed cache sits unused and everything is tuned for paranoia
rather than performance.  On this machine it doesn't matter.

The thing is, even though it's rarely the top performing card even when
setup perfectly, the LSI SCSI Megaraid just works.  The driver is stable,
caching behavior is well defined, it's a pleasure to administer.  I'm
never concerned that it's lying to me or doing anything to put data at
risk.  The command-line tools for Linux work perfectly, let me look at or
control whatever I want, and it was straightforward for me to make my own
customized monitoring script using them.

> LSI MegaRAID has proved to be a bit of a disappointment.  I have seen
> better numbers from the HP SmartArray 6i, and from 3ware cards with
> 7200RPM SATA drives.

Whereas although I use 7200RPM SATA drives, I always try to keep an eye on
them because I never really trust them.  The performance list archives
here also have plenty of comments about people having issues with the
SmartArray controllers; search the archives for "cciss" and you'll see
what I'm talking about.

The Megaraid controller is very boring.  That's why I like it.  As a Linux
distribution, RedHat has similar characteristics.  If I were going for a
performance setup, I'd dump that, too, for something sexier with a newish
kernel.  It all depends on which side of the performance/stability
tradeoff you're aiming at.

On Mon, 4 Dec 2006, Scott Marlowe wrote:
> Does RHEL4 have the megaraid 2 driver?

This is from the moderately current RHEL4 installation I had results from
above.  Redhat has probably done a kernel rev since I last updated back in
September, haven't needed or wanted to reboot since then:

megaraid cmm: 2.20.2.6 (Release Date: Mon Mar 7 00:01:03 EST 2005)
megaraid: 2.20.4.6-rh2 (Release Date: Wed Jun 28 12:27:22 EST 2006)

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Bad iostat numbers

From
"Alex Turner"
Date:
I agree that MegaRAID is very stable, and it's very appealing from that perspective.  And two years ago I would never even have mentioned cciss based cards on this list, because they sucked wind big time, but I believe some people have started seeing better numbers from the 6i.  20MB/sec write, when the number should be closer to 60... that's off by a factor of 3.  For my data warehouse application, that's a big difference, and if I can get a better number from 7200RPM drives and a good SATA controller, I'm gonna do that, because my data isn't OLTP and I don't care if the whole system shits itself and I have to restore from backup one day.

My other and most important point is that I can't find any solid recommendations for a SCSI card that can perform optimally in Linux or *BSD.  Off by a factor of 3x is pretty sad IMHO. (and yes, we know the Adaptec cards suck worse, that doesn't bring us to a _good_ card).

Alex.


Re: Bad iostat numbers

From
Michael Stone
Date:
On Tue, Dec 05, 2006 at 01:21:38AM -0500, Alex Turner wrote:
>My other and most important point is that I can't find any solid
>recommendations for a SCSI card that can perform optimally in Linux or
>*BSD.  Off by a factor of 3x is pretty sad IMHO. (and yes, we know the
>Adaptec cards suck worse, that doesn't bring us to a _good_ card).

This gets back to my point about terminology. As a SCSI HBA the Adaptec
is decent: I can sustain about 300MB/s off a single channel of the
39320A using an external RAID controller. As a RAID controller I can't
even imagine using the Adaptec; I'm fairly certain they put that
"functionality" on there just so they could charge more for the card. It
may be that there's not much market for on-board SCSI RAID controllers;
between SATA on the low end and SAS & FC on the high end, there isn't a
whole lotta space left for SCSI. I definitely don't think much
R&D is going into SCSI controllers any more, compared to other solutions
like SATA or SAS RAID (the 39320 hasn't changed in at least 3 years,
IIRC). Anyway, since the Adaptec part is a decent SCSI controller and a
lousy RAID controller, have you tried just using software RAID?
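
For completeness, a sketch of the software RAID alternative with md; the disk names are placeholders, not anyone's actual layout:

  # a plain two-disk mirror...
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1
  # ...or, with four disks, md's native RAID10
  mdadm --create /dev/md1 --level=10 --raid-devices=4 /dev/sd[cdef]1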

Mike Stone

Re: Bad iostat numbers

From
"Alex Turner"
Date:
The problem I see with software RAID is the issue of a battery backed unit: if the computer loses power, then the 'cache', which is held in system memory, goes away and fubars your RAID.

Alex


Re: Bad iostat numbers

From
"Craig A. James"
Date:
Alex Turner wrote:
> The problem I see with software raid is the issue of a battery backed
> unit: If the computer loses power, then the 'cache' which is held in
> system memory, goes away, and fubars your RAID.

I'm not sure I see the difference.  If data are cached, they're not written, whether it is software or hardware RAID.
I guess if you're writing RAID 1, the N disks could be out of sync, but the system can synchronize them once the array
is restored, so that's no different than a single disk or a hardware RAID.  If you're writing RAID 5, then the blocks
are inherently error detecting/correcting, so you're still OK if a partial write occurs, right?

I'm not familiar with the inner details of software RAID, but the only circumstance I can see where things would get
corrupted is if the RAID driver writes a LOT of blocks to one disk of the array before synchronizing the others, but
my guess (and it's just a guess) is that the writes to the N disks are tightly coupled.

If I'm wrong about this, I'd like to know, because I'm using software RAID 1 and 1+0, and I'm pretty happy with it.
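
For what it's worth, md's own view of array consistency, and any resync in progress, can be inspected like this (device names follow the original poster's setup):

  cat /proc/mdstat
  mdadm --detail /dev/md1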

Craig

Re: Bad iostat numbers

From
Michael Stone
Date:
On Tue, Dec 05, 2006 at 07:57:43AM -0500, Alex Turner wrote:
>The problem I see with software raid is the issue of a battery backed unit:
>If the computer loses power, then the 'cache' which is held in system
>memory, goes away, and fubars your RAID.

Since the Adaptec doesn't have a BBU, it's a lateral move. Also, this is
less an issue of data integrity than performance; you can get exactly
the same level of integrity, you just have to wait for the data to sync
to disk. If you're read-mostly that's irrelevant.

Mike Stone

Re: Bad iostat numbers

From
Greg Smith
Date:
On Tue, 5 Dec 2006, Craig A. James wrote:

> I'm not familiar with the inner details of software RAID, but the only
> circumstance I can see where things would get corrupted is if the RAID driver
> writes a LOT of blocks to one disk of the array before synchronizing the
> others...

You're talking about whether the discs in the RAID are kept consistent.
While it's helpful with that, too, that's not the main reason the
battery-backed cache is so helpful.  When PostgreSQL writes to the WAL, it
waits until that data has really been placed on the drive before it enters
that update into the database.  In a normal situation, that means that you
have to pause until the disk has physically written the blocks out, and
that puts a fairly low upper limit on write performance that's based on
how fast your drives rotate.  RAID 0, RAID 1, none of that will speed up
the time it takes to complete a single synchronized WAL write.

When your controller has a battery-backed cache, it can immediately tell
Postgres that the WAL write completed successfully, while actually putting
it on the disk later.  On my systems, this results in simple writes going
2-4X as fast as they do without a cache.  Should there be a PC failure, as
long as power is restored before the battery runs out that transaction
will be preserved.

What Alex is rightly pointing out is that a software RAID approach doesn't
have this feature.  In fact, in this area performance can be even worse
under SW RAID than what you get from a single disk, because you may have
to wait for multiple discs to spin to the correct position and write data
out before you can consider the transaction complete.
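
A rough way to see this per-commit ceiling is a synchronous-write loop on the array in question; this assumes a GNU dd new enough to support oflag, and the path is just a scratch location:

  # 500 8 kB writes, each forced to disk - roughly what WAL commits look like without a write-back cache
  dd if=/dev/zero of=/var/lib/pgsql/synctest bs=8k count=500 oflag=dsync
  rm /var/lib/pgsql/synctest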

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Bad iostat numbers

From
Steve Atkins
Date:
On Dec 5, 2006, at 8:54 PM, Greg Smith wrote:

> On Tue, 5 Dec 2006, Craig A. James wrote:
>
>> I'm not familiar with the inner details of software RAID, but the
>> only circumstance I can see where things would get corrupted is if
>> the RAID driver writes a LOT of blocks to one disk of the array
>> before synchronizing the others...
>
> You're talking about whether the discs in the RAID are kept
> consistent. While it's helpful with that, too, that's not the main
> reason the battery-backed cache is so helpful.  When PostgreSQL
> writes to the WAL, it waits until that data has really been placed
> on the drive before it enters that update into the database.  In a
> normal situation, that means that you have to pause until the disk
> has physically written the blocks out, and that puts a fairly low
> upper limit on write performance that's based on how fast your
> drives rotate.  RAID 0, RAID 1, none of that will speed up the time
> it takes to complete a single synchronized WAL write.
>
> When your controller has a battery-backed cache, it can immediately
> tell Postgres that the WAL write completed successfully, while
> actually putting it on the disk later.  On my systems, this results
> in simple writes going 2-4X as fast as they do without a cache.
> Should there be a PC failure, as long as power is restored before
> the battery runs out that transaction will be preserved.
>
> What Alex is rightly pointing out is that a software RAID approach
> doesn't have this feature.  In fact, in this area performance can
> be even worse under SW RAID than what you get from a single disk,
> because you may have to wait for multiple discs to spin to the
> correct position and write data out before you can consider the
> transaction complete.

So... the ideal might be a RAID1 controller with BBU for the WAL and
something else, such as software RAID, for the main data array?

Cheers,
   Steve
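
A hedged sketch of that split on an 8.x install: keep the data directory on the big array and point pg_xlog at a small BBU-backed volume (paths are placeholders):

  # with the server stopped, relocate the WAL and leave a symlink behind
  pg_ctl -D /var/lib/pgsql/data stop
  mv /var/lib/pgsql/data/pg_xlog /wal/pg_xlog
  ln -s /wal/pg_xlog /var/lib/pgsql/data/pg_xlog
  pg_ctl -D /var/lib/pgsql/data start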