Weird XFS WAL problem - Mailing list pgsql-performance
From | Craig James |
---|---|
Subject | Weird XFS WAL problem |
Date | |
Msg-id | 4C06E994.2080905@emolecules.com |
In response to | Re: Random Page Cost and Planner (Cédric Villemain <cedric.villemain.debian@gmail.com>) |
List | pgsql-performance |
I'm testing/tuning a new midsize server and ran into an inexplicable problem. With a RAID10 drive, when I move the WAL to a separate RAID1 drive, TPS drops from over 1200 to less than 90! I've checked everything and can't find a reason.

Here are the details.

    8 cores (2x4 Intel Nehalem 2 GHz)
    12 GB memory
    12 x 7200 SATA 500 GB disks
    3WARE 9650SE-12ML RAID controller with bbu
      2 disks: RAID1 500GB ext4 blocksize=4096
      8 disks: RAID10 2TB, stripe size 64K, blocksize=4096 (ext4 or xfs - see below)
      2 disks: hot swap
    Ubuntu 10.04 LTS (Lucid)

With xfs or ext4 on the RAID10 I got decent bonnie++ and pgbench results (this one is for xfs):

    Version 1.03e       ------Sequential Output------ --Sequential Input- --Random-
                        -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
    Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
    argon        24064M 70491  99 288158 25 129918 16 65296  97 428210 23 558.9   1
                        ------Sequential Create------ --------Random Create--------
                        -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
                  files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                     16 23283  81 +++++ +++ 13775  56 20143  74 +++++ +++ 15152  54
    argon,24064M,70491,99,288158,25,129918,16,65296,97,428210,23,558.9,1,16,23283,81,+++++,+++,13775,56,20143,74,+++++,+++,15152,54

    pgbench -i -s 100 -U test
    pgbench -c 10 -t 10000 -U test
        scaling factor: 100
        query mode: simple
        number of clients: 10
        number of transactions per client: 10000
        number of transactions actually processed: 100000/100000
        tps = 1046.104635 (including connections establishing)
        tps = 1046.337276 (excluding connections establishing)

Now the mystery: I moved the pg_xlog directory to a RAID1 array (same 3WARE controller, two more SATA 7200 disks). Ran the same tests and ...

        tps = 82.325446 (including connections establishing)
        tps = 82.326874 (excluding connections establishing)

I thought I'd made a mistake, like maybe I moved the whole database to the RAID1 array, but I checked and double checked. I even watched the lights blink - the WAL was definitely on the RAID1 and the rest of Postgres on the RAID10.

So I moved the WAL back to the RAID10 array, and performance jumped right back up to the >1200 TPS range.

Next I checked the RAID1 itself:

    dd if=/dev/zero of=./bigfile bs=8192 count=2000000

which yielded 98.8 MB/sec - not bad. bonnie++ on the RAID1 pair showed good performance too:

    Version 1.03e       ------Sequential Output------ --Sequential Input- --Random-
                        -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
    Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
    argon        24064M 68601  99 110057 18 46534   6 59883  90 123053  7 471.3   1
                        ------Sequential Create------ --------Random Create--------
                        -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
                  files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                     16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
    argon,24064M,68601,99,110057,18,46534,6,59883,90,123053,7,471.3,1,16,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++

So ... anyone have any idea at all how TPS drops to below 90 when I move the WAL to a separate RAID1 disk? Does this make any sense at all? It's repeatable. It happens for both ext4 and xfs. It's weird.

You can even watch the disk lights and see it: the RAID10 disks are on almost constantly when the WAL is on the RAID10, but when you move the WAL over to the RAID1, its lights are dim and flicker a lot, like it's barely getting any data, and the RAID10 disks' lights barely go on at all.

Thanks,
Craig
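
The dd and bonnie++ runs above mostly measure sequential throughput, whereas with default settings each pgbench commit has to wait for its WAL records to be flushed to disk. A minimal sketch for timing small flushed writes in a directory follows. It was not part of the tests reported above, and it assumes Python 3 is available. The function name `fsync_rate`, the 8192-byte block size, and the write count are illustrative choices, not anything taken from PostgreSQL.

```python
import os
import sys
import tempfile
import time


def fsync_rate(directory, writes=200, block_size=8192):
    """Return the number of fsync'd writes per second achieved in `directory`.

    Hypothetical helper for illustration: it appends `writes` blocks of
    `block_size` bytes to a temporary file, calling fsync after each one.
    """
    block = b"\0" * block_size
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(writes):
            os.write(fd, block)
            # Ask the OS to push the block to stable storage before continuing,
            # roughly what a synchronous commit needs from the WAL device.
            os.fsync(fd)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)
    # Guard against a zero interval on very coarse clocks.
    return writes / max(elapsed, 1e-9)


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    print(f"{target}: {fsync_rate(target):.1f} fsync'd writes/sec")
```

Running it once against a directory on the RAID1 and once against a directory on the RAID10 would show whether the two arrays differ in flushed-write rate even though their sequential numbers look similar.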