> -----Original Message-----
> From: mark [mailto:mark@sm-a.net]
> Sent: Monday, August 08, 2011 12:15 AM
> To: 'pgsql-performance@postgresql.org'
> Subject: XFS options and benchmark woes
>
> Hello PG perf junkies,
>
>
> Sorry this may get a little long winded. Apologies if the formatting
> gets trashed.
>
>
>
> Background:
>
> I have been setting up some new servers for PG and I am getting some
> odd numbers with zcav. I am hoping a second set of eyes here can point
> me in the right direction. (Other tests like bonnie++ (1.03e) and dd
> also give me odd (flat and low) numbers.)
>
> I will preface this with: yes, I bought Greg's book. Yes, I read it, and
> it has helped me in the past, but I seem to have hit an oddity.
>
> (hardware,os, and config stuff listed at the end)
>
>
>
>
>
> Short version: my zcav and dd tests look to get I/O bound. My numbers
> in zcav are flat like an SSD, which is odd for 15K rpm disks.
Ugh, correction: the zcav numbers appear to be CPU bound, not I/O bound.
>
>
>
>
> Long version:
>
>
> In the past when dealing with storage I typically see a large gain with
> moving from ext3 to XFS, provided I set readahead to 16384 on either
> filesystem.
>
> I also see the typical downward trend in MB/s (expected) and upward
> trend in access times (expected) with either file system.
>
>
> These blades + storage-blades are giving me atypical results.
>
>
> I am not seeing a dramatic downturn in MB/s in zcav, nor am I seeing
> access time really increase (something I have only seen before when I
> forgot to set readahead high enough). Things are just flat at about
> 420MB/s in zcav @ .6ms access time with XFS and ~470MB/s @ .56ms for
> ext3.
>
> FWIW I sometimes get worthless results with zcav and bonnie++ (using
> 1.03 or 1.96), which isn't something I have had happen before even
> though Greg does mention it.
>
>
> Also when running zcav I will see kswapdX (0 and 1 in my two socket
> case) start to eat significant cpu time (~40-50% each). With dd,
> kswapd and pdflush become very active as well. This only happens once
> free mem gets low. Also, zcav or dd looks to get CPU bound at 100%
> while i/o wait stays almost at 0.0 most of the time (iostat -x -d
> shows util % at 98% though). I see this with either XFS or ext3. Also
> when I cat /proc/zoneinfo it looks like I am getting heavy contention
> for a single page in DMA while the tests are running. (See end of email
> for zoneinfo.)
>
> Bonnie is giving me 99% cpu usage reported. Watching it while running,
> it bounces between 100 and 99. kswapd goes nuts here as well.
>
>
> I am led to believe that I may need a 2.6.32 (rhel 6.1) or higher
> kernel to see some of the kswapd issues go away (testing that
> hopefully later this week). Maybe that will take care of everything. I
> don't know yet.
>
> Side note: Setting vm.swappiness to 10 (or 0) doesn't help, although
> others on the RHEL support site indicated it did fix kswapd issues for
> them.
>
>
>
> Running zcav on my home system (4 disk raid 1+0, 3ware controller +
> BBWC, ext4, Ubuntu 2.6.38-8), I don't see zcav near 100% cpu, I see
> lots of i/o wait as expected, and my zoneinfo for DMA doesn't sit at 1.
>
> Not going to focus too much on ext3 since I am pretty sure I should be
> able to get better numbers from XFS.
>
>
>
> With mkfs.xfs I have done some reading and it appears that it can't
> automatically read the stripsize (aka stripe size to anyone other than
> HP) or the number of disks. So I have been using the following:
>
> mkfs.xfs -b size=4k -d su=256k,sw=6,agcount=256
>
> (256K is the default hp stripsize for raid1+0, I have 12 disks in raid
> 10 so I used sw=6, agcount of 256 because that is a (random) number I
> got from google that seemed in the ball park.)
>
>
>
>
>
>
> which gives me:
> meta-data=/dev/cciss/c0d0   isize=256    agcount=256, agsize=839936 blks
>          =                  sectsz=512   attr=2
> data     =                  bsize=4096   blocks=215012774, imaxpct=25
>          =                  sunit=64     swidth=384 blks
> naming   =version 2         bsize=4096   ascii-ci=0
> log      =internal log      bsize=4096   blocks=32768, version=2
>          =                  sectsz=512   sunit=64 blks, lazy-count=1
> realtime =none              extsz=4096   blocks=0, rtextents=0
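
The sunit and swidth values in that output can be cross-checked against the su/sw values given to mkfs.xfs. The following is a minimal sketch of the arithmetic only; the function names are made up for illustration and are not part of xfsprogs. It assumes that a 12-disk RAID 1+0 has 6 data-bearing stripe members, as described in the message.

```python
# Sketch: convert mkfs.xfs su/sw values into the sunit/swidth figures
# (in filesystem blocks) that mkfs.xfs and xfs_info print.
# Function names are hypothetical, chosen for this example only.

def raid10_data_disks(total_disks):
    """RAID 1+0 mirrors pairs, so half the disks carry distinct data."""
    if total_disks % 2:
        raise ValueError("RAID 1+0 needs an even number of disks")
    return total_disks // 2

def stripe_geometry_blocks(su_bytes, sw, block_size):
    """Return (sunit, swidth) in filesystem blocks."""
    if su_bytes % block_size:
        raise ValueError("su must be a multiple of the block size")
    sunit = su_bytes // block_size
    return sunit, sunit * sw

sw = raid10_data_disks(12)                               # 6
sunit, swidth = stripe_geometry_blocks(256 * 1024, sw, 4096)
print(f"sw={sw} sunit={sunit} blks swidth={swidth} blks")
# prints: sw=6 sunit=64 blks swidth=384 blks
```

These figures match the sunit=64 and swidth=384 lines reported by mkfs.xfs, so the geometry was at least accepted as intended.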
>
>
> (If I don't specify the agcount or su,sw options I get:
> meta-data=/dev/cciss/c0d0   isize=256    agcount=4, agsize=53753194 blks
>          =                  sectsz=512   attr=2
> data     =                  bsize=4096   blocks=215012774, imaxpct=25
>          =                  sunit=0      swidth=0 blks
> naming   =version 2         bsize=4096   ascii-ci=0
> log      =internal log      bsize=4096   blocks=32768, version=2
>          =                  sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none              extsz=4096   blocks=0, rtextents=0)
>
>
>
>
>
>
> So it seems like I should be giving it the extra parameters at mkfs.xfs
> time... could someone confirm? In the past I have never specified su,
> sw, or agcount; I have taken the defaults. But since I am getting odd
> numbers here I started playing with them, and I am getting little or no
> change.
>
>
>
> for mounting:
> logbufs=8,noatime,nodiratime,nobarrier,inode64,allocsize=16m
>
>
>
> (I know that noatime also implies nodiratime according to xfs.org, but
> in the past I seemed to get better numbers when having both.)
>
> I am using nobarrier because I have a battery backed raid cache and the
> FAQ @ XFS.org seems to indicate that is the right choice.
>
>
> FWIW, if I put sunit and swidth in the mount options it seems to change
> them lower (when viewed with xfs_info) so I haven't been putting it in
> the mount options.
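
One possible explanation, not confirmed against this system: the XFS sunit and swidth mount options are specified in 512-byte units, while xfs_info reports them in filesystem blocks. If the xfs_info numbers (64 and 384) were passed to mount unchanged, they would describe a stripe one eighth the size. A small sketch of the conversion, with hypothetical helper names:

```python
# Sketch: XFS mount options sunit=/swidth= use 512-byte units, whereas
# xfs_info prints filesystem blocks. Helper names are illustrative only.

SECTOR_BYTES = 512

def mount_stripe_options(su_bytes, sw):
    """Return (sunit, swidth) in 512-byte units for the mount options."""
    if su_bytes % SECTOR_BYTES:
        raise ValueError("su must be a multiple of 512 bytes")
    sunit = su_bytes // SECTOR_BYTES
    return sunit, sunit * sw

def as_fs_blocks(units_512, block_size=4096):
    """What a 512-byte-unit value looks like when shown in fs blocks."""
    return units_512 * SECTOR_BYTES // block_size

sunit, swidth = mount_stripe_options(256 * 1024, 6)
print(f"sunit={sunit},swidth={swidth}")
# prints: sunit=512,swidth=3072

# Passing the xfs_info figures straight to mount would instead show up as:
print(as_fs_blocks(64), as_fs_blocks(384))
# prints: 8 48
```

If the lower values seen in xfs_info were 8 and 48, this unit mismatch would account for them.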
>
>
>
>
> verify readahead:
> blockdev --getra /dev/cciss/c0d0
> 16384
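
For reference, blockdev --getra reports readahead in 512-byte sectors, so 16384 is 8 MiB. The sketch below converts the value and compares it with one full stripe (6 x 256 KB, taken from the mkfs.xfs settings above); the function names are assumptions made for this example.

```python
# Sketch: interpret the value printed by `blockdev --getra`, which is
# expressed in 512-byte sectors. Function names are illustrative.

SECTOR_BYTES = 512

def readahead_bytes(getra_sectors):
    """Convert a --getra value to bytes."""
    return getra_sectors * SECTOR_BYTES

def full_stripes_covered(ra_bytes, su_bytes, sw):
    """How many full stripes (su * sw) one readahead window spans."""
    return ra_bytes / (su_bytes * sw)

ra = readahead_bytes(16384)
print(ra // (1024 * 1024), "MiB")
# prints: 8 MiB
print(round(full_stripes_covered(ra, 256 * 1024, 6), 2), "full stripes")
# prints: 5.33 full stripes
```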
>
>
>
>
>
>
>
> If anyone wants the benchmark outputs I can send them, but basically
> zcav being FLAT for both MB/s and access time tells me something is
> wrong. And it will take days for me to re-run all the ones I have done.
> I didn't save much once I saw results that didn't fit with what I
> thought I should get.
>
>
>
>
> I haven't done much with pgbench yet as I figure it's pointless to move
> on while the raw I/O numbers look off to me. Once they look right I am
> going to make the call between WAL on the OS raid 1 or going to 10 data
> disks, 2 OS and 2 WAL.
>
>
>
>
>
>
> I have gone up to 2.6.18-27(something, wanna say 2 or 4) to see if the
> issue went away; it didn't. I have gone back to 2.6.18-238.5 and put in
> a new CCISS driver directly from HP, and the issue also does not go
> away. People at work are thinking it might be a kernel bug that we have
> somehow never noticed before, which is why we are going to look at RHEL
> 6.1. We tried a 5.3 kernel that someone on rh bugzilla said didn't
> have the issue, but this blade had a fit with it - no network, lots of
> other stuff not working, and then it kernel panic'd, so we quickly gave
> up on that...
>
>
>
> We may try to shoehorn in the 6.1 kernel and a few dependencies as
> well. Moving to RHEL 6.1 will mean a long test period before it can go
> into prod, and we want to get this new hardware in sooner than that can
> be done. (Even with all its problems it's probably still faster than
> what it is replacing, just from the 48GB of ram and 3 gen newer CPUs.)
>
>
>
>
>
>
>
>
> Hardware and config stuff as it sits right now.
>
>
>
> Blade Hardware:
> ProLiant BL460c G7 (bios power flag set to high performance)
> 2 intel 5660 cpus. (HT left on)
> 48GB of ram (12x4GB @ 1333MHz)
> Smart Array P410i (Embedded)
> Points of interest from hpacucli -
> - Hardware Revision: Rev C
> - Firmware Version: 3.66
> - Cache Board Present: True
> - Elevator Sort: Enabled
> - Cache Status: OK
> - Cache Backup Power Source: Capacitors
> - Battery/Capacitor Count: 1
> - Battery/Capacitor Status: OK
> - Total Cache Size: 512 MB
> - Accelerator Ratio: 25% Read / 75% Write
> - Strip Size: 256 KB
> - 2x 15K RPM 146GB 6Gbps SAS in raid 1 for OS (ext3)
> - Array Accelerator: Enabled
> - Status: OK
> - drives firmware = HPD5
>
> Blade Storage subsystem:
> HP SB2200 (12 disk 15K )
>
> Points of interest from hpacucli
>
> Smart Array P410i in Slot 3
> Controller Status: OK
> Hardware Revision: Rev C
> Firmware Version: 3.66
> Elevator Sort: Enabled
> Wait for Cache Room: Disabled
> Cache Board Present: True
> Cache Status: OK
> Accelerator Ratio: 25% Read / 75% Write
> Drive Write Cache: Disabled
> Total Cache Size: 1024 MB
> No-Battery Write Cache: Disabled
> Cache Backup Power Source: Capacitors
> Battery/Capacitor Count: 1
> Battery/Capacitor Status: OK
> SATA NCQ Supported: True
>
>
> Logical Drive: 1
> Size: 820.2 GB
> Fault Tolerance: RAID 1+0
> Heads: 255
> Sectors Per Track: 32
> Cylinders: 65535
> Strip Size: 256 KB
> Status: OK
> Array Accelerator: Enabled
> Disk Name: /dev/cciss/c0d0
> Mount Points: /raid 820.2 GB
> OS Status: LOCKED
>
> 12 drives in Raid 1+0, using XFS.
>
>
> OS:
> OS: RHEL 5.6 (2.6.18-238.9.1.el5)
> Database use: PG 9.0.2 for OLTP.
>
>
> CCISS info:
> filename: /lib/modules/2.6.18-238.9.1.el5/kernel/drivers/block/cciss.ko
> version: 3.6.22-RH1
> description: Driver for HP Controller SA5xxx SA6xxx version 3.6.22-RH1
> author: Hewlett-Packard Company
>
> XFS INFO:
> xfsdump-2.2.48-3.el5
> xfsprogs-2.10.2-7.el5
>
> head of ZONEINFO while zcav is running and kswapd is going nuts:
> the min, low, high of 1 seems odd to me. On other systems these get
> above 1.
>
> Node 0, zone DMA
> pages free 2493
> min 1
> low 1
> high 1
> active 0
> inactive 0
> scanned 0 (a: 3 i: 3)
> spanned 4096
> present 2393
> nr_anon_pages 0
> nr_mapped 1
> nr_file_pages 0
> nr_slab 0
> nr_page_table_pages 0
> nr_dirty 0
> nr_writeback 0
> nr_unstable 0
> nr_bounce 0
> numa_hit 0
> numa_miss 0
> numa_foreign 0
> numa_interleave 0
> numa_local 0
> numa_other 0
> protection: (0, 3822, 24211, 24211)
> pagesets
> all_unreclaimable: 1
> prev_priority: 12
> start_pfn: 0
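
To compare these watermarks across machines without reading the whole file by eye, something like the following could be used. It is a minimal sketch that assumes the /proc/zoneinfo layout shown above (a "Node N, zone NAME" header followed by "pages free", "min", "low" and "high" lines); the function name and the embedded sample are for illustration only.

```python
# Sketch: pull free/min/low/high page counts per zone out of text in the
# /proc/zoneinfo layout quoted above. parse_zone_watermarks is a
# hypothetical helper, not an existing tool.
import re

ZONE_HEADER = re.compile(r"^Node\s+(\d+),\s+zone\s+(\S+)")
WATERMARKS = ("min", "low", "high")

def parse_zone_watermarks(text):
    """Return {(node, zone): {"free": n, "min": n, "low": n, "high": n}}."""
    zones = {}
    current = None
    for line in text.splitlines():
        header = ZONE_HEADER.match(line.strip())
        if header:
            key = (int(header.group(1)), header.group(2))
            current = zones.setdefault(key, {})
            continue
        if current is None:
            continue
        tokens = line.split()
        if len(tokens) == 3 and tokens[:2] == ["pages", "free"] and tokens[2].isdigit():
            current["free"] = int(tokens[2])
        elif len(tokens) == 2 and tokens[0] in WATERMARKS and tokens[1].isdigit():
            # Keep the first match only; per-cpu pageset lines use "high:".
            current.setdefault(tokens[0], int(tokens[1]))
    return zones

# Sample copied from the zoneinfo head above.
SAMPLE = """\
Node 0, zone DMA
pages free 2493
min 1
low 1
high 1
active 0
"""

for (node, zone), marks in parse_zone_watermarks(SAMPLE).items():
    print(f"node {node} zone {zone}: {marks}")
```

On a live system the same function could be fed the contents of /proc/zoneinfo instead of SAMPLE, which makes it easy to diff the output between the blade and a machine that behaves normally.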
>
>
>
> numastat (probably worthless since I have been pounding on this box for
> a while before capturing it)
>
> node0 node1
> numa_hit 3126413031 247696913
> numa_miss 95489353 2781917287
> numa_foreign 2781917287 95489353
> interleave_hit 81178 97872
> local_node 3126297257 247706110
> other_node 95605127 2781908090
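
With the caveat already given (these counters are cumulative and the box has been under test for a while), the table can still be reduced to one ratio per node. The sketch below computes numa_miss / (numa_hit + numa_miss), i.e. the share of pages placed on a node that had been intended for a different node. The function name is an assumption for this example.

```python
# Sketch: summarize the numastat counters quoted above as a per-node
# miss fraction. numa_miss_fraction is a hypothetical helper.

def numa_miss_fraction(numa_hit, numa_miss):
    """Share of allocations on this node that were meant for another node."""
    total = numa_hit + numa_miss
    return numa_miss / total if total else 0.0

# Counters copied from the numastat output above.
counters = {
    "node0": {"numa_hit": 3126413031, "numa_miss": 95489353},
    "node1": {"numa_hit": 247696913, "numa_miss": 2781917287},
}

for node, c in counters.items():
    frac = numa_miss_fraction(c["numa_hit"], c["numa_miss"])
    print(f"{node}: {frac:.1%} of allocations were misses")
```

For these numbers the fraction is roughly 3% on node0 and roughly 92% on node1, which suggests most of node1's pages were spillover from allocations that preferred node0.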
>
>
>