Re: XFS options and benchmark woes - Mailing list pgsql-performance
From: mark
Subject: Re: XFS options and benchmark woes
Date:
Msg-id: 008b01cc5592$fc36fcb0$f4a4f610$@net
In response to: XFS options and benchmark woes ("mark" <mark@sm-a.net>)
List: pgsql-performance
> -----Original Message-----
> From: mark [mailto:mark@sm-a.net]
> Sent: Monday, August 08, 2011 12:15 AM
> To: 'pgsql-performance@postgresql.org'
> Subject: XFS options and benchmark woes
>
> Hello PG perf junkies,
>
> Sorry, this may get a little long winded. Apologies if the formatting
> gets trashed.
>
> Background:
>
> I have been setting up some new servers for PG and I am getting some
> odd numbers with zcav; I am hoping a second set of eyes here can point
> me in the right direction. (Other tests like bonnie++ (1.03e) and dd
> also give me odd (flat and low) numbers.)
>
> I will preface this with: yes, I bought Greg's book. Yes, I read it, and
> it has helped me in the past, but I seem to have hit an oddity.
>
> (Hardware, OS, and config stuff is listed at the end.)
>
> Short version: my zcav and dd tests look to get I/O bound. My numbers
> in zcav are flat like an SSD, which is odd for 15K RPM disks.

uggg, ZCAV numbers appear to be CPU bound. Not I/O.

> Long version:
>
> In the past when dealing with storage I typically see a large gain when
> moving from ext3 to XFS, provided I set readahead to 16384 on either
> filesystem.
>
> I also see the typical downward trend in MB/s (expected) and upward
> trend in access times (expected) with either file system.
>
> These blades + storage blades are giving me atypical results.
>
> I am not seeing a dramatic downturn in MB/s in zcav, nor am I seeing
> access time really increase (something I have only seen before when I
> forget to set readahead high enough). Things are just flat at about
> 420 MB/s in zcav @ 0.6 ms access time with XFS, and ~470 MB/s @ 0.56 ms
> with ext3.
>
> FWIW, I sometimes get worthless results with zcav and bonnie++ using
> 1.03 or 1.96, which isn't something I have had happen before, even
> though Greg does mention it.
>
> Also, when running zcav I will see kswapdX (0 and 1 in my two-socket
> case) start to eat significant CPU time (~40-50% each); with dd,
> kswapd and pdflush become very active as well. This only happens once
> free memory gets low. As well, zcav or dd looks to get CPU bound at 100%
> while I/O wait stays almost at 0.0 most of the time (iostat -x -d
> shows util% at 98%, though). I see this with either XFS or ext3. Also,
> when I cat /proc/zoneinfo it looks like I am getting heavy contention
> for a single page in DMA while the tests are running. (See end of email
> for zoneinfo.)
>
> Bonnie is reporting 99% CPU usage. Watching it while running, it
> bounces between 100 and 99. kswapd goes nuts here as well.
>
> I am led to believe that I may need a 2.6.32 (RHEL 6.1) or higher
> kernel to see some of the kswapd issues go away (testing that
> hopefully later this week). Maybe that will take care of everything;
> I don't know yet.
>
> Side note: setting vm.swappiness to 10 (or 0) doesn't help, although
> others on the RHEL support site indicated it did fix kswapd issues for
> them.
>
> (Running zcav on my home system - 4-disk RAID 1+0, 3ware controller
> + BBWC, ext4, Ubuntu 2.6.38-8 - I don't see zcav near 100%, I see lots
> of I/O wait as expected, and my zoneinfo for DMA doesn't sit at 1.)
>
> Not going to focus too much on ext3 since I am pretty sure I should be
> able to get better numbers from XFS.
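One thing I still want to try before blaming the filesystem: take the page
cache out of the loop with direct I/O, so kswapd/pdflush never enter the
picture. A rough sketch only (assuming the storage blade is still
/dev/cciss/c0d0 - reads only, but double check the device name first):

# sequential read straight off the block device, bypassing the page cache
dd if=/dev/cciss/c0d0 of=/dev/null bs=1M count=20000 iflag=direct

# in another terminal, watch utilization and service times while it runs
iostat -x -d 5

If that comes back at the speed the array should manage, with low CPU and
sane await numbers, the bottleneck is in the buffered read path (readahead
/ page reclaim) rather than the controller or the disks.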
> With mkfs.xfs I have done some reading and it appears that it can't
> automatically read the stripsize (aka stripe size to anyone other than
> HP) or the number of disks. So I have been using the following:
>
> mkfs.xfs -b size=4k -d su=256k,sw=6,agcount=256
>
> (256K is the default HP stripsize for RAID 1+0; I have 12 disks in RAID
> 10, so I used sw=6. agcount of 256 because that is a (random) number I
> got from Google that seemed in the ballpark.)
>
> which gives me:
>
> meta-data=/dev/cciss/c0d0  isize=256    agcount=256, agsize=839936 blks
>          =                 sectsz=512   attr=2
> data     =                 bsize=4096   blocks=215012774, imaxpct=25
>          =                 sunit=64     swidth=384 blks
> naming   =version 2        bsize=4096   ascii-ci=0
> log      =internal log     bsize=4096   blocks=32768, version=2
>          =                 sectsz=512   sunit=64 blks, lazy-count=1
> realtime =none             extsz=4096   blocks=0, rtextents=0
>
> (If I don't specify the agcount or su,sw stuff I get:
>
> meta-data=/dev/cciss/c0d0  isize=256    agcount=4, agsize=53753194 blks
>          =                 sectsz=512   attr=2
> data     =                 bsize=4096   blocks=215012774, imaxpct=25
>          =                 sunit=0      swidth=0 blks
> naming   =version 2        bsize=4096   ascii-ci=0
> log      =internal log     bsize=4096   blocks=32768, version=2
>          =                 sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none             extsz=4096   blocks=0, rtextents=0)
>
> So it seems like I should be giving it the extra parameters at mkfs.xfs
> time... could someone confirm? In the past I have never specified the
> su or sw or ag groups; I have taken the defaults. But since I am getting
> odd numbers here I started playing with them. Getting little or no
> change.
>
> For mounting:
> logbufs=8,noatime,nodiratime,nobarrier,inode64,allocsize=16m
>
> (I know that noatime also implies nodiratime according to xfs.org, but
> in the past I seem to get better numbers when having both.)
>
> I am using nobarrier because I have a battery-backed RAID cache and the
> FAQ @ xfs.org seems to indicate that is the right choice.
>
> FWIW, if I put sunit and swidth in the mount options it seems to change
> them lower (when viewed with xfs_info), so I haven't been putting them
> in the mount options.
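Regarding sunit/swidth looking lower when set at mount time: if I'm reading
the xfs(5) man page right, the mount options take 512-byte sectors while
xfs_info reports filesystem blocks, so the same geometry needs different
numbers in each place. Rough arithmetic, worth double checking:

# xfs_info shows sunit/swidth in 4k filesystem blocks:
#   sunit  = 64 blocks  * 4k = 256k   (one strip)
#   swidth = 384 blocks * 4k = 1536k  (6 data disks * 256k)
# the mount options want 512-byte sectors, so the equivalent would be:
#   sunit  = 256k  / 512 = 512
#   swidth = 1536k / 512 = 3072
mount -t xfs -o logbufs=8,noatime,nobarrier,inode64,allocsize=16m,sunit=512,swidth=3072 /dev/cciss/c0d0 /raid

Since mkfs already recorded su/sw on the filesystem, leaving them out of the
mount options (as I am doing now) should be fine anyway.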
> Verify readahead:
> blockdev --getra /dev/cciss/c0d0
> 16384
>
> If anyone wants the benchmark outputs I can send them, but basically
> zcav being FLAT for both MB/s and access time tells me something is
> wrong. And it will take days for me to re-run all the ones I have done;
> I didn't save much once I saw results that didn't fit with what I
> thought I should get.
>
> I haven't done much with pgbench yet as I figure it's pointless to move
> on while the raw I/O numbers look off to me. At that time I am going to
> make the call between WAL on the OS RAID 1, or going to 10 data disks
> and 2 OS and 2 WAL.
>
> I have gone up to 2.6.18-27(something, wanna say 2 or 4) to see if the
> issue went away; it didn't. I have gone back to 2.6.18-238.5 and put in
> a new CCISS driver directly from HP, and the issue also does not go
> away. People at work are thinking it might be a kernel bug that we have
> somehow never noticed before, which is why we are going to look at RHEL
> 6.1. We tried a 5.3 kernel that someone on RH Bugzilla said didn't have
> the issue, but this blade had a fit with it - no network, lots of other
> stuff not working - and then it kernel panic'd, so we quickly gave up
> on that...
>
> We may try and shoehorn in the 6.1 kernel and a few dependencies as
> well. Moving to RHEL 6.1 will mean a long test period before it can go
> into prod, and we want to get this new hardware in sooner than that can
> be done. (Even with all its problems it's probably still faster than
> what it is replacing, just from the 48GB of RAM and 3-generations-newer
> CPUs.)
>
> Hardware and config stuff as it sits right now.
>
> Blade hardware:
> ProLiant BL460c G7 (BIOS power flag set to high performance)
> 2 Intel 5660 CPUs (HT left on)
> 48GB of RAM (12x4GB @ 1333MHz)
> Smart Array P410i (Embedded)
> Points of interest from hpacucli:
>   - Hardware Revision: Rev C
>   - Firmware Version: 3.66
>   - Cache Board Present: True
>   - Elevator Sort: Enabled
>   - Cache Status: OK
>   - Cache Backup Power Source: Capacitors
>   - Battery/Capacitor Count: 1
>   - Battery/Capacitor Status: OK
>   - Total Cache Size: 512 MB
>   - Accelerator Ratio: 25% Read / 75% Write
>   - Strip Size: 256 KB
>   - 2x 15K RPM 146GB 6Gbps SAS in RAID 1 for OS (ext3)
>   - Array Accelerator: Enabled
>   - Status: OK
>   - Drive firmware = HPD5
>
> Blade storage subsystem:
> HP SB2200 (12-disk 15K)
>
> Points of interest from hpacucli:
>
> Smart Array P410i in Slot 3
>    Controller Status: OK
>    Hardware Revision: Rev C
>    Firmware Version: 3.66
>    Elevator Sort: Enabled
>    Wait for Cache Room: Disabled
>    Cache Board Present: True
>    Cache Status: OK
>    Accelerator Ratio: 25% Read / 75% Write
>    Drive Write Cache: Disabled
>    Total Cache Size: 1024 MB
>    No-Battery Write Cache: Disabled
>    Cache Backup Power Source: Capacitors
>    Battery/Capacitor Count: 1
>    Battery/Capacitor Status: OK
>    SATA NCQ Supported: True
>
> Logical Drive: 1
>    Size: 820.2 GB
>    Fault Tolerance: RAID 1+0
>    Heads: 255
>    Sectors Per Track: 32
>    Cylinders: 65535
>    Strip Size: 256 KB
>    Status: OK
>    Array Accelerator: Enabled
>    Disk Name: /dev/cciss/c0d0
>    Mount Points: /raid 820.2 GB
>    OS Status: LOCKED
>
> 12 drives in RAID 1+0, using XFS.
>
> OS:
> OS: RHEL 5.6 (2.6.18-238.9.1.el5)
> Database use: PG 9.0.2 for OLTP.
>
> CCISS info:
> filename:    /lib/modules/2.6.18-238.9.1.el5/kernel/drivers/block/cciss.ko
> version:     3.6.22-RH1
> description: Driver for HP Controller SA5xxx SA6xxx version 3.6.22-RH1
> author:      Hewlett-Packard Company
>
> XFS info:
> xfsdump-2.2.48-3.el5
> xfsprogs-2.10.2-7.el5
>
> Head of zoneinfo while zcav is running and kswapd is going nuts.
> The min/low/high of 1 seems odd to me; on other systems these get
> above 1.
>
> Node 0, zone DMA
>   pages free     2493
>         min      1
>         low      1
>         high     1
>         active   0
>         inactive 0
>         scanned  0 (a: 3 i: 3)
>         spanned  4096
>         present  2393
>     nr_anon_pages        0
>     nr_mapped            1
>     nr_file_pages        0
>     nr_slab              0
>     nr_page_table_pages  0
>     nr_dirty             0
>     nr_writeback         0
>     nr_unstable          0
>     nr_bounce            0
>     numa_hit             0
>     numa_miss            0
>     numa_foreign         0
>     numa_interleave      0
>     numa_local           0
>     numa_other           0
>         protection: (0, 3822, 24211, 24211)
>   pagesets
>   all_unreclaimable: 1
>   prev_priority:     12
>   start_pfn:         0
>
> numastat (probably worthless since I have been pounding on this box for
> a while before capturing it):
>
>                 node0        node1
> numa_hit        3126413031   247696913
> numa_miss       95489353     2781917287
> numa_foreign    2781917287   95489353
> interleave_hit  81178        97872
> local_node      3126297257   247706110
> other_node      95605127     2781908090
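Looking at that numastat output again, numa_miss on node1 is huge compared
to its numa_hit, and kswapd chewing CPU while plenty of memory is free
smells like per-node reclaim rather than a real memory shortage. Two things
I plan to check - a sketch only, I haven't confirmed this is the culprit:

# is the kernel reclaiming within a node before it will allocate remote
# pages? non-zero here can wake kswapd early on NUMA boxes
cat /proc/sys/vm/zone_reclaim_mode

# rerun the benchmark pinned to one socket and its local memory, to see
# whether the flat numbers and the kswapd behaviour change
numactl --cpunodebind=0 --membind=0 zcav /dev/cciss/c0d0 > node0.zcav

vm.swappiness wouldn't touch either of those, which might be why lowering
it made no difference here.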