Thread: XFS options and benchmark woes

XFS options and benchmark woes

From

"mark"

Date:

09 August 2011, 11:37:22

Hello PG perf junkies,


Sorry, this may get a little long-winded. Apologies if the formatting gets
trashed.



Background:

I have been setting up some new servers for PG and I am getting some odd
numbers with zcav. I am hoping a second set of eyes here can point me in the
right direction. (Other tests like bonnie++ (1.03e) and dd also give me odd
(flat and low) numbers.)

I will preface this with: yes, I bought Greg's book. Yes, I read it, and it
has helped me in the past, but I seem to have hit an oddity.

(hardware, OS, and config stuff listed at the end)





Short version: my zcav and dd tests look to get I/O bound. My numbers in
ZCAV are flat like an SSD, which is odd for 15K rpm disks.




Long version:


In the past when dealing with storage I typically see a large gain with
moving from ext3 to XFS, provided I set readahead to 16384 on either
filesystem.

I also see the typical downward trend in MB/s (expected) and upward trend
in access times (expected) with either file system.


These blades + storage-blades are giving me atypical results.


I am not seeing a dramatic downturn in MB/s in zcav, nor am I seeing access
time really increase (something I have only seen before when I forgot to
set readahead high enough). Things are just flat at about 420MB/s in
zcav @ .6ms access time with XFS and ~470MB/s @ .56ms with ext3.

FWIW I sometimes get worthless results with zcav and bonnie++ using 1.03 or
1.96, which isn't something I have had happen before, even though Greg
does mention it.


Also, when running zcav I see kswapdX (0 and 1 in my two-socket case)
start to eat significant CPU time (~40-50% each). With dd, kswapd and
pdflush become very active as well. This only happens once free mem gets
low. zcav or dd also looks to get CPU bound at 100% while I/O wait stays
at almost 0.0 most of the time (iostat -x -d shows %util at 98% though). I
see this with either XFS or ext3. Also, when I cat /proc/zoneinfo it looks
like I am getting heavy contention for a single page in DMA while the tests
are running. (See end of email for zoneinfo.)
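
A rough way to put numbers on the "CPU bound while I/O wait stays near 0" observation is to diff two samples of the aggregate cpu line in /proc/stat. The sketch below assumes the column order documented in proc(5); the two sample lines are made up for illustration and are not measurements from these blades.

```python
# Sketch: split CPU time between two /proc/stat samples into per-state
# percentages. Column order follows proc(5):
#   user nice system idle iowait irq softirq steal
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Turn the aggregate 'cpu ...' line of /proc/stat into a dict of jiffies."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    # Older kernels may report fewer columns; treat missing ones as zero.
    values += [0] * (len(FIELDS) - len(values))
    return dict(zip(FIELDS, values))


def cpu_breakdown(before, after):
    """Percentage of elapsed CPU time spent in each state between two samples."""
    a, b = parse_cpu_line(before), parse_cpu_line(after)
    delta = {k: b[k] - a[k] for k in FIELDS}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("samples are identical or out of order")
    return {k: 100.0 * v / total for k, v in delta.items()}


if __name__ == "__main__":
    # Hypothetical samples; on a live box, read /proc/stat twice a few
    # seconds apart while the benchmark is running.
    t0 = "cpu 1000 0 2000 50000 100 0 50 0"
    t1 = "cpu 1100 0 2900 50950 105 0 95 0"
    for state, pct in cpu_breakdown(t0, t1).items():
        print(f"{state:8s} {pct:6.2f}%")
```

A high system share with a tiny iowait share would fit the picture of kswapd and the benchmark burning CPU in the kernel rather than waiting on the disks.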

Bonnie is reporting 99% CPU usage. Watching it while it runs, it
bounces between 100 and 99. kswapd goes nuts here as well.


I am led to believe that I may need a 2.6.32 (RHEL 6.1) or higher kernel to
see some of the kswapd issues go away (testing that hopefully later this
week). Maybe that will take care of everything. I don't know yet.

Side note: setting vm.swappiness to 10 (or 0) doesn't help, although others
on the RHEL support site indicated it did fix kswapd issues for them.



Running zcav on my home system (4 disk RAID 1+0, 3ware controller + BBWC,
ext4, Ubuntu 2.6.38-8), I don't see zcav near 100% CPU and I see lots of I/O
wait as expected, and my zoneinfo for DMA doesn't sit at 1.

I am not going to focus too much on ext3 since I am pretty sure I should be
able to get better numbers from XFS.



With mkfs.xfs I have done some reading, and it appears that it can't
automatically read the stripsize (aka stripe size to anyone other than HP)
or the number of disks. So I have been using the following:

mkfs.xfs -b size=4k -d su=256k,sw=6,agcount=256

(256K is the default HP stripsize for RAID 1+0. I have 12 disks in RAID 10,
so I used sw=6. I used an agcount of 256 because that is a (random) number
I got from Google that seemed in the ballpark.)






which gives me:
meta-data=/dev/cciss/c0d0        isize=256    agcount=256, agsize=839936 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=215012774, imaxpct=25
         =                       sunit=64     swidth=384 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=32768, version=2
         =                       sectsz=512   sunit=64 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0


(If I don't specify the agcount or su,sw stuff I get:
meta-data=/dev/cciss/c0d0        isize=256    agcount=4, agsize=53753194 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=215012774, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=32768, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
)
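
As a sanity check on the su/sw values, the arithmetic behind the sunit=64 and swidth=384 that xfs_info reports can be sketched as below. The helper names are made up for illustration; the only inputs are the numbers already given here (256K strip, 4k blocks, 12 disks in RAID 1+0).

```python
# Sketch: derive the sunit/swidth block counts that xfs_info prints
# from the su/sw values passed to mkfs.xfs. Helper names are hypothetical.

def parse_size(text):
    """Convert mkfs.xfs style sizes such as '256k' or '4k' to bytes."""
    units = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    text = text.strip().lower()
    if text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)


def data_disks_raid10(total_disks):
    """RAID 1+0 stripes across mirrored pairs, so half the disks carry unique data."""
    if total_disks % 2:
        raise ValueError("RAID 1+0 needs an even number of disks")
    return total_disks // 2


def stripe_geometry(su, sw, block_size="4k"):
    """Return (sunit, swidth) in filesystem blocks."""
    su_bytes = parse_size(su)
    bs_bytes = parse_size(block_size)
    if su_bytes % bs_bytes:
        raise ValueError("su must be a multiple of the filesystem block size")
    sunit = su_bytes // bs_bytes
    return sunit, sunit * sw


if __name__ == "__main__":
    sw = data_disks_raid10(12)                 # 6 data-bearing stripe members
    sunit, swidth = stripe_geometry("256k", sw)
    print(f"sw={sw} sunit={sunit} swidth={swidth}")  # sw=6 sunit=64 swidth=384
```

So the first xfs_info output is at least internally consistent with a 256K strip across 6 mirrored pairs; whether that geometry helps the benchmark is a separate question.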





So it seems like I should be giving it the extra parameters at mkfs.xfs
time... could someone confirm? In the past I have never specified su,
sw, or the AG count; I have taken the defaults. But since I am getting odd
numbers here I started playing with them, with little or no change.



for mounting:
logbufs=8,noatime,nodiratime,nobarrier,inode64,allocsize=16m



(I know that noatime also implies nodiratime according to xfs.org, but in
the past I seemed to get better numbers when having both.)

I am using nobarrier because I have a battery-backed RAID cache and the FAQ
@ XFS.org seems to indicate that is the right choice.


FWIW, if I put sunit and swidth in the mount options it seems to lower them
(when viewed with xfs_info), so I haven't been putting them in the mount
options.
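
One possible explanation for the values dropping, offered as an assumption rather than a diagnosis: the XFS mount options sunit and swidth are given in 512-byte units, while xfs_info prints them in filesystem blocks. If block counts were typed into the mount options, they would be read as much smaller sizes. A minimal sketch of the conversion (function names are hypothetical):

```python
# Sketch: convert stripe geometry between the 512-byte units used by the
# XFS sunit=/swidth= mount options and the filesystem blocks shown by xfs_info.
SECTOR = 512  # mount-time sunit/swidth are counted in 512-byte units


def mount_stripe_options(su_bytes, sw):
    """Build the mount option string for a given strip size and stripe width."""
    if su_bytes % SECTOR:
        raise ValueError("strip size must be a multiple of 512 bytes")
    sunit = su_bytes // SECTOR
    return f"sunit={sunit},swidth={sunit * sw}"


def blocks_seen_by_xfs_info(mount_value, block_size=4096):
    """What xfs_info would report (in blocks) for a mount-option value."""
    return mount_value * SECTOR // block_size


if __name__ == "__main__":
    # 256 KiB strip across 6 data-bearing members, as in the mkfs.xfs call.
    print(mount_stripe_options(256 * 1024, 6))   # sunit=512,swidth=3072
    # If the block counts 64 and 384 were used as mount options instead:
    print(blocks_seen_by_xfs_info(64), blocks_seen_by_xfs_info(384))  # 8 48
```

If that is what happened, mounting with the first line's values should leave xfs_info showing 64 and 384 blocks again.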




verify readahead:
blockdev --getra /dev/cciss/c0d0
16384
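
For scale, blockdev --getra reports readahead in 512-byte sectors, so 16384 works out to 8 MiB. The sketch below does that conversion and compares it with one full stripe of the array (256 KiB strip times 6 data-bearing members); the function names are made up for illustration.

```python
# Sketch: express the blockdev readahead setting in bytes and in full stripes.
SECTOR_BYTES = 512  # blockdev --getra / --setra count 512-byte sectors


def readahead_bytes(getra_sectors):
    """Readahead window in bytes for a blockdev --getra value."""
    return getra_sectors * SECTOR_BYTES


def full_stripes(ra_bytes, su_bytes, data_disks):
    """How many full stripes the readahead window spans."""
    return ra_bytes / (su_bytes * data_disks)


if __name__ == "__main__":
    ra = readahead_bytes(16384)
    print(ra // (1024 * 1024), "MiB")                      # 8 MiB
    print(round(full_stripes(ra, 256 * 1024, 6), 2), "stripes")  # 5.33 stripes
```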







If anyone wants the benchmark outputs I can send them, but basically zcav
being FLAT for both MB/s and access time tells me something is wrong. And
it will take days for me to rerun all the ones I have done. I didn't save
much once I saw results that didn't fit with what I thought I should get.




I haven't done much with pgbench yet as I figure it's pointless to move on
while the raw I/O numbers look off to me. At that time I am going to make
the call between WAL on the OS RAID 1, or going to 10 data disks, 2 OS, and
2 WAL.






I have gone up to 2.6.18-27(something, wanna say 2 or 4) to see if the issue
went away; it didn't. I have gone back to 2.6.18-238.5 and put in a new
CCISS driver directly from HP, and the issue still does not go away. People
at work are thinking it might be a kernel bug that we have somehow never
noticed before, which is why we are going to look at RHEL 6.1. We tried a
5.3 kernel that someone on RH bugzilla said didn't have the issue, but this
blade had a fit with it - no network, lots of other stuff not working, and
then it kernel panic'd, so we quickly gave up on that...



We may try to shoehorn in the 6.1 kernel and a few dependencies as well.
Moving to RHEL 6.1 will mean a long test period before it can go into prod,
and we want to get this new hardware in sooner than that can be done. (Even
with all its problems it's probably still faster than what it is replacing,
just from the 48GB of RAM and 3-generation-newer CPUs.)








Hardware and config stuff as it sits right now.



Blade Hardware:
ProLiant BL460c G7 (bios power flag set to high performance)
2 intel 5660 cpus. (HT left on)
48GB of ram (12x4GB @ 1333MHz)
Smart Array P410i (Embedded)
Points of interest from hpacucli -
    - Hardware Revision: Rev C
    - Firmware Version: 3.66
    - Cache Board Present: True
    - Elevator Sort: Enabled
    - Cache Status: OK
    - Cache Backup Power Source: Capacitors
       - Battery/Capacitor Count: 1
       - Battery/Capacitor Status: OK
    - Total Cache Size: 512 MB
       - Accelerator Ratio: 25% Read / 75% Write
    - Strip Size: 256 KB
    - 2x 15K RPM 146GB 6Gbps SAS in raid 1 for OS (ext3)
    - Array Accelerator: Enabled
    - Status: OK
    - drives firmware = HPD5

Blade Storage subsystem:
HP SB2200 (12 disk 15K )

Points of interest from hpacucli

Smart Array P410i in Slot 3
   Controller Status: OK
   Hardware Revision: Rev C
   Firmware Version: 3.66
   Elevator Sort: Enabled
   Wait for Cache Room: Disabled
   Cache Board Present: True
   Cache Status: OK
   Accelerator Ratio: 25% Read / 75% Write
   Drive Write Cache: Disabled
   Total Cache Size: 1024 MB
   No-Battery Write Cache: Disabled
   Cache Backup Power Source: Capacitors
   Battery/Capacitor Count: 1
   Battery/Capacitor Status: OK
   SATA NCQ Supported: True


      Logical Drive: 1
         Size: 820.2 GB
         Fault Tolerance: RAID 1+0
         Heads: 255
         Sectors Per Track: 32
         Cylinders: 65535
         Strip Size: 256 KB
         Status: OK
         Array Accelerator: Enabled
         Disk Name: /dev/cciss/c0d0
         Mount Points: /raid 820.2 GB
         OS Status: LOCKED

12 drives in Raid 1+0, using XFS.


OS:
OS: RHEL 5.6 (2.6.18-238.9.1.el5)
Database use: PG 9.0.2 for OLTP.


CCISS info:
filename:       /lib/modules/2.6.18-238.9.1.el5/kernel/drivers/block/cciss.ko
version:        3.6.22-RH1
description:    Driver for HP Controller SA5xxx SA6xxx version 3.6.22-RH1
author:         Hewlett-Packard Company

XFS INFO:
xfsdump-2.2.48-3.el5
xfsprogs-2.10.2-7.el5

Head of zoneinfo while zcav is running and kswapd is going nuts:
The min, low, high of 1 seems odd to me. On other systems these get above 1.

Node 0, zone      DMA
  pages free     2493
        min      1
        low      1
        high     1
        active   0
        inactive 0
        scanned  0 (a: 3 i: 3)
        spanned  4096
        present  2393
    nr_anon_pages 0
    nr_mapped    1
    nr_file_pages 0
    nr_slab      0
    nr_page_table_pages 0
    nr_dirty     0
    nr_writeback 0
    nr_unstable  0
    nr_bounce    0
    numa_hit     0
    numa_miss    0
    numa_foreign 0
    numa_interleave 0
    numa_local   0
    numa_other   0
        protection: (0, 3822, 24211, 24211)
  pagesets
  all_unreclaimable: 1
  prev_priority:     12
  start_pfn:         0



numastat (probably worthless since I have been pounding on this box for a
while before capturing it)

                           node0           node1
numa_hit              3126413031       247696913
numa_miss               95489353      2781917287
numa_foreign          2781917287        95489353
interleave_hit             81178           97872
local_node            3126297257       247706110
other_node              95605127      2781908090
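

Even with the caveat that these counters cover everything since boot, the per-node miss share is easy to pull out of the table above. The sketch below parses that numastat output and computes numa_miss / (numa_hit + numa_miss) for each node; the interpretation in the comments is a hedged reading, not a confirmed cause of the kswapd activity.

```python
# Sketch: compute the share of allocations on each node that were "misses",
# i.e. pages that landed on this node although another node was preferred.
NUMASTAT = """\
                           node0           node1
numa_hit              3126413031       247696913
numa_miss               95489353      2781917287
numa_foreign          2781917287        95489353
interleave_hit             81178           97872
local_node            3126297257       247706110
other_node              95605127      2781908090
"""


def parse_numastat(text):
    """Return {node: {counter: value}} from numastat's default table."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    nodes = rows[0]
    stats = {node: {} for node in nodes}
    for row in rows[1:]:
        name, values = row[0], row[1:]
        for node, value in zip(nodes, values):
            stats[node][name] = int(value)
    return stats


def miss_fraction(node_stats):
    """Fraction of pages allocated on a node that were intended for another node."""
    hit, miss = node_stats["numa_hit"], node_stats["numa_miss"]
    return miss / (hit + miss)


if __name__ == "__main__":
    for node, counters in parse_numastat(NUMASTAT).items():
        print(f"{node}: {100 * miss_fraction(counters):.1f}% of allocations were misses")
```

On these figures roughly 3% of node0's allocations were misses against roughly 92% on node1, which would be consistent with node0 running out of free pages and spilling onto node1, but other workloads on the box could have produced the same counters.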

Re: XFS options and benchmark woes

From

"mark"

Date:

09 August 2011, 11:37:24


> -----Original Message-----
> From: mark [mailto:mark@sm-a.net]
> Sent: Monday, August 08, 2011 12:15 AM
> To: 'pgsql-performance@postgresql.org'
> Subject: XFS options and benchmark woes
>
> Hello PG perf junkies,
>
>
> Sorry this may get a little long winded. Apologies if the formatting
> gets trashed.
>
>
>
> Background:
>
> I have been setting up some new servers for PG and I am getting some
> odd numbers with zcav. I am hoping a second set of eyes here can point
> me in the right direction. (Other tests like bonnie++ (1.03e) and dd
> also give me odd (flat and low) numbers.)
>
> I will preface this with: yes, I bought Greg's book. Yes, I read it, and
> it has helped me in the past, but I seem to have hit an oddity.
>
> (hardware,os, and config stuff listed at the end)
>
>
>
>
>
> Short version: my zcav and dd tests look to get I/O bound. My numbers
> in ZCAV are flat like an SSD, which is odd for 15K rpm disks.


Ugh, correction: the ZCAV numbers appear to be CPU bound, not I/O bound.
