Thread: Postgres on RAID5
Hi,

I have a RAID5 array (mdadm) with 14 disks + 1 spare. This partition has an Ext3 filesystem which is used by Postgres. Currently we are loading a 50G database on this server from a Postgres dump (copy, not insert) and are experiencing very slow write performance (35 records per second).

Top shows that the Postgres process (postmaster) is being constantly put into D state for extended periods of time (2-3 seconds) which I assume is because it's waiting for disk io. I have just started gathering system statistics and here is what sar -b shows: (this is while the db is being loaded - pg_restore)

                 tps      rtps      wtps   bread/s   bwrtn/s
01:35:01 PM   275.77     76.12    199.66    709.59   2315.23
01:45:01 PM   287.25     75.56    211.69    706.52   2413.06
01:55:01 PM   281.73     76.35    205.37    711.84   2389.86
02:05:01 PM   282.83     76.14    206.69    720.85   2418.51
02:15:01 PM   284.07     76.15    207.92    707.38   2443.60
02:25:01 PM   265.46     75.91    189.55    708.87   2089.21
02:35:01 PM   285.21     76.02    209.19    709.58   2446.46
Average:      280.33     76.04    204.30    710.66   2359.47

This is a Sun e450 with dual TI UltraSparc II processors and 2G of RAM. It is currently running Debian Sarge with a 2.4.27-sparc64-smp custom compiled kernel. Postgres is installed from the Debian package and uses all the configuration defaults.

I am also copying the pgsql-performance list.

Thanks in advance for any advice/pointers.

Arshavir

Following is some other info that might be helpful.

/proc/scsi# mdadm -D /dev/md1
/dev/md1:
        Version : 00.90.00
  Creation Time : Wed Feb 23 17:23:41 2005
     Raid Level : raid5
     Array Size : 123823616 (118.09 GiB 126.80 GB)
    Device Size : 8844544 (8.43 GiB 9.06 GB)
   Raid Devices : 15
  Total Devices : 17
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Thu Feb 24 10:05:38 2005
          State : active
 Active Devices : 15
Working Devices : 16
 Failed Devices : 1
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 64K

           UUID : 81ae2c97:06fa4f4d:87bfc6c9:2ee516df
         Events : 0.8

    Number   Major   Minor   RaidDevice State
       0       8       64        0      active sync   /dev/sde
       1       8       80        1      active sync   /dev/sdf
       2       8       96        2      active sync   /dev/sdg
       3       8      112        3      active sync   /dev/sdh
       4       8      128        4      active sync   /dev/sdi
       5       8      144        5      active sync   /dev/sdj
       6       8      160        6      active sync   /dev/sdk
       7       8      176        7      active sync   /dev/sdl
       8       8      192        8      active sync   /dev/sdm
       9       8      208        9      active sync   /dev/sdn
      10       8      224       10      active sync   /dev/sdo
      11       8      240       11      active sync   /dev/sdp
      12      65        0       12      active sync   /dev/sdq
      13      65       16       13      active sync   /dev/sdr
      14      65       32       14      active sync   /dev/sds

      15      65       48       15      spare   /dev/sdt

# dumpe2fs -h /dev/md1
dumpe2fs 1.35 (28-Feb-2004)
Filesystem volume name:   <none>
Last mounted on:          <not available>
Filesystem UUID:          1bb95bd6-94c7-4344-adf2-8414cadae6fc
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal dir_index needs_recovery large_file
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              15482880
Block count:              30955904
Reserved block count:     1547795
Free blocks:              28767226
Free inodes:              15482502
First block:              0
Block size:               4096
Fragment size:            4096
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         16384
Inode blocks per group:   512
Filesystem created:       Wed Feb 23 17:27:13 2005
Last mount time:          Wed Feb 23 17:45:25 2005
Last write time:          Wed Feb 23 17:45:25 2005
Mount count:              2
Maximum mount count:      28
Last checked:             Wed Feb 23 17:27:13 2005
Check interval:           15552000 (6 months)
Next check after:         Mon Aug 22 18:27:13 2005
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               128
Journal inode:            8
Default directory hash:   tea
Directory Hash Seed:      c35c0226-3b52-4dad-b102-f22feb773592
Journal backup:           inode blocks

# lspci | grep SCSI
0000:00:03.0 SCSI storage controller: LSI Logic / Symbios Logic 53c875 (rev 14)
0000:00:03.1 SCSI storage controller: LSI Logic / Symbios Logic 53c875 (rev 14)
0000:00:04.0 SCSI storage controller: LSI Logic / Symbios Logic 53c875 (rev 14)
0000:00:04.1 SCSI storage controller: LSI Logic / Symbios Logic 53c875 (rev 14)
0000:04:02.0 SCSI storage controller: LSI Logic / Symbios Logic 53c875 (rev 03)
0000:04:03.0 SCSI storage controller: LSI Logic / Symbios Logic 53c875 (rev 03)

/proc/scsi# more scsi
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: SEAGATE  Model: ST39103LCSUN9.0G Rev: 034A
  Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 01 Lun: 00
  Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi0 Channel: 00 Id: 02 Lun: 00
  Vendor: SEAGATE  Model: ST39103LCSUN9.0G Rev: 034A
  Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 03 Lun: 00
  Vendor: SEAGATE  Model: ST39103LCSUN9.0G Rev: 034A
  Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi1 Channel: 00 Id: 00 Lun: 00
  Vendor: SEAGATE  Model: ST39103LCSUN9.0G Rev: 034A
  Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi1 Channel: 00 Id: 01 Lun: 00
  Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi1 Channel: 00 Id: 02 Lun: 00
  Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi1 Channel: 00 Id: 03 Lun: 00
  Vendor: SEAGATE  Model: ST39103LCSUN9.0G Rev: 034A
  Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi2 Channel: 00 Id: 00 Lun: 00
  Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi2 Channel: 00 Id: 01 Lun: 00
  Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi2 Channel: 00 Id: 02 Lun: 00
  Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi2 Channel: 00 Id: 03 Lun: 00
  Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi3 Channel: 00 Id: 00 Lun: 00
  Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi3 Channel: 00 Id: 01 Lun: 00
  Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi3 Channel: 00 Id: 02 Lun: 00
  Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi3 Channel: 00 Id: 03 Lun: 00
  Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi4 Channel: 00 Id: 06 Lun: 00
  Vendor: TOSHIBA  Model: XM6201TASUN32XCD Rev: 1103
  Type:   CD-ROM                           ANSI SCSI revision: 02
Host: scsi5 Channel: 00 Id: 00 Lun: 00
  Vendor: FUJITSU  Model: MAG3091L SUN9.0G Rev: 1111
  Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi5 Channel: 00 Id: 01 Lun: 00
  Vendor: FUJITSU  Model: MAG3091L SUN9.0G Rev: 1111
  Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi5 Channel: 00 Id: 02 Lun: 00
  Vendor: FUJITSU  Model: MAG3091L SUN9.0G Rev: 1111
  Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi5 Channel: 00 Id: 03 Lun: 00
  Vendor: FUJITSU  Model: MAG3091L SUN9.0G Rev: 1111
  Type:   Direct-Access                    ANSI SCSI revision: 02

--
Arshavir Grigorian
Systems Administrator/Engineer
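
For a rough sense of scale, the "Average" row of the sar -b output above can be converted into throughput. This is only a back-of-the-envelope sketch: it assumes that sar's "blocks" are 512-byte sectors (as documented for sysstat on 2.4 and later kernels), and the variable names are made up for illustration.

```python
# Back-of-the-envelope conversion of the sar -b "Average" row quoted above.
# Assumption: one sar "block" is a 512-byte sector.
SECTOR_BYTES = 512

avg = {
    "tps": 280.33,       # total I/O requests per second
    "rtps": 76.04,       # read requests per second
    "wtps": 204.30,      # write requests per second
    "bread_s": 710.66,   # blocks read per second
    "bwrtn_s": 2359.47,  # blocks written per second
}


def mib_per_s(blocks_per_s):
    """Convert a blocks-per-second figure to MiB per second."""
    return blocks_per_s * SECTOR_BYTES / (1024 * 1024)


write_mib = mib_per_s(avg["bwrtn_s"])
read_mib = mib_per_s(avg["bread_s"])
# Average size of a single write request, in KiB.
avg_write_kib = avg["bwrtn_s"] / avg["wtps"] * SECTOR_BYTES / 1024

print(f"write throughput: {write_mib:.2f} MiB/s")
print(f"read throughput:  {read_mib:.2f} MiB/s")
print(f"average write request: {avg_write_kib:.1f} KiB")
```

Under that assumption the array is absorbing only about one MiB per second, in write requests of a few KiB each, which is far below one 64K chunk and very far below a full stripe.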
Arshavir Grigorian <ag@m-cam.com> writes:

> Hi,
>
> I have a RAID5 array (mdadm) with 14 disks + 1 spare. This partition has an
> Ext3 filesystem which is used by Postgres.

People are going to suggest moving to RAID1+0. I'm unconvinced that RAID5 across 14 drives shouldn't be able to keep up with RAID1 across 7 drives though. It would be interesting to see empirical data.

One thing that does scare me is the Postgres transaction log and the ext3 journal both sharing these disks with the data. Ideally both of these things should get (mirrored) disks of their own separate from the data files.

But 2-3s pauses seem disturbing. I wonder whether ext3 is issuing a cache flush on every fsync to get the journal pushed out. This is a new Linux feature that's necessary with IDE but shouldn't be necessary with SCSI.

It would be interesting to know whether Postgres performs differently with fsync=off. This would even be a reasonable mode to run under for initial database loads. It shouldn't make much of a difference with hardware like this though. And you should be aware that running under this mode in production would put your data at risk.

--
greg
A 14 drive stripe will max out the PCI bus long before anything else; the only reason for a stripe this size is to get the total accessible size up. A 6 drive RAID 10 on a good controller can get up to 400MB/sec, which is pushing the limit of the PCI bus (taken from official 3ware 9500S 8MI benchmarks). 14 drives is not going to beat 6 drives because you've run out of bandwidth on the PCI bus.

The debate on RAID 5 rages onward. The benchmarks I've seen suggest that RAID 5 is consistently slower than RAID 10 with the same number of drives, but others suggest that RAID 5 can be much faster than RAID 10 (see arstechnica.com). (Theoretical performance of RAID 5 is in line with a RAID 0 stripe of N-1 drives, while RAID 10 has only N/2 drives in a stripe, so RAID 5 performance should be nearly double, in theory of course.)

35 trans/sec is pretty slow, particularly if they are only one row at a time. I typically get 200-400/sec on our DB server on a bad day, and up to 1100 on a fresh database.

I suggested running a bonnie benchmark, or some other IO perftest, to determine if it's the array itself performing badly, or if there is something wrong with postgresql.

If the array isn't kicking out at least 50MB/sec read/write performance, something is wrong.

Until you've isolated the problem to either postgres or the array, everything else is simply speculation.

In a perfect world, you would have two 6 drive RAID 10s on two PCI busses, with system tables on a third partition, and archive logging on a fourth. Unsurprisingly this looks a lot like the Oracle recommended minimum config.

Also a note for interest is that this is _software_ raid...

Alex Turner
netEconomist

On 13 Mar 2005 23:36:13 -0500, Greg Stark <gsstark@mit.edu> wrote:
> People are going to suggest moving to RAID1+0. I'm unconvinced that RAID5
> across 14 drives shouldn't be able to keep up with RAID1 across 7 drives
> though. It would be interesting to see empirical data.
> [...]
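
The bus and stripe-width arithmetic above can be sketched as follows. The peak bus figures are the usual theoretical numbers for conventional PCI (bus width times clock); the per-drive streaming rate `PER_DRIVE_MB_S` is a purely hypothetical placeholder, not a measurement of the drives in this thread, and the function names are invented for the example.

```python
# Theoretical bus ceiling versus theoretical stripe bandwidth.
PER_DRIVE_MB_S = 25.0  # hypothetical sustained rate of one drive, in MB/s


def bus_peak_mb_s(width_bits, clock_mhz):
    """Peak transfer rate of a parallel bus: bytes per cycle times MHz."""
    return width_bits / 8 * clock_mhz


def data_drives(level, n_drives):
    """Number of drives' worth of data in one stripe, per the text above."""
    if level == "raid5":
        return n_drives - 1   # one drive's worth of parity per stripe
    if level == "raid10":
        return n_drives // 2  # every block is stored twice
    raise ValueError(f"unknown level: {level}")


def deliverable_mb_s(level, n_drives, bus_mb_s):
    """Stripe bandwidth, capped by what a single shared bus can carry."""
    return min(data_drives(level, n_drives) * PER_DRIVE_MB_S, bus_mb_s)


pci_32_33 = bus_peak_mb_s(32, 33.33)  # plain 32-bit / 33 MHz PCI
pci_64_66 = bus_peak_mb_s(64, 66.66)  # 64-bit / 66 MHz PCI

for level in ("raid5", "raid10"):
    for name, bus in (("32-bit/33MHz", pci_32_33), ("64-bit/66MHz", pci_64_66)):
        rate = deliverable_mb_s(level, 14, bus)
        print(f"{level:6s} x14 on {name}: {rate:6.1f} MB/s (bus peak {bus:.0f})")
```

With these assumed numbers a single 32-bit/33MHz bus caps both layouts at the same figure, which is the point being made about bus saturation; spreading the drives over several buses or a wider bus changes the picture.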
Alex Turner <armtuk@gmail.com> writes:

> a 14 drive stripe will max out the PCI bus long before anything else,

Hopefully anyone with a 14 drive stripe is using some combination of 64 bit PCI-X cards running at 66MHz...

> the only reason for a stripe this size is to get a total accessible
> size up.

Well, having many drives also cuts average latency. So even if you have no need for more bandwidth you still benefit from a lower average response time by adding more drives.

--
greg
Alex Turner wrote:
> [...]
> 35 Trans/sec is pretty slow, particularly if they are only one row at
> a time. I typically get 200-400/sec on our DB server on a bad day. Up
> to 1100 on a fresh database.

Well, by putting the pg_xlog directory on a separate disk/partition, I was able to increase this rate to about 50 or so per second (still pretty far from your numbers). Next I am going to try putting the pg_xlog on a RAID1+0 array and see if that helps.

> [...]
> In a perfect world, you would have two 6 drive RAID 10s on two PCI
> busses, with system tables on a third partition, and archive logging on
> a fourth. Unsurprisingly this looks a lot like the Oracle recommended
> minimum config.

Could you please elaborate on this setup a little more? How do you put system tables on a separate partition? I am still using version 7, and without tablespaces (which is how Oracle controls this), I can't figure out how to put different tables on different partitions.

Thanks.

Arshavir

> Also a note for interest is that this is _software_ raid...
>
> Alex Turner
> netEconomist

--
Arshavir Grigorian
Systems Administrator/Engineer
M-CAM, Inc.
ag@m-cam.com
+1 703-682-0570 ext. 432
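
Moving pg_xlog to another disk, as described above, is commonly done by stopping the server, moving the directory, and leaving a symlink at the old location. The following is a minimal sketch of just the move-and-symlink step, run against throwaway temporary directories; `relocate_dir` and all paths are hypothetical, and it does not stop or start a real Postgres server.

```python
import os
import shutil
import tempfile


def relocate_dir(src, dest_parent):
    """Move directory `src` under `dest_parent`, leaving a symlink at `src`.

    Assumes nothing is using `src` while it is moved (for a real pg_xlog
    that means the database server must be shut down first).
    """
    dest = os.path.join(dest_parent, os.path.basename(src))
    shutil.move(src, dest)  # works across filesystems (copy, then delete)
    os.symlink(dest, src)   # old path keeps working via the symlink
    return dest


# Demonstration on temporary directories standing in for two disks.
with tempfile.TemporaryDirectory() as root:
    data_dir = os.path.join(root, "pgdata")    # stand-in for the data directory
    other_disk = os.path.join(root, "disk2")   # stand-in for the mirror's mount point
    xlog = os.path.join(data_dir, "pg_xlog")
    os.makedirs(xlog)
    os.makedirs(other_disk)
    with open(os.path.join(xlog, "segment"), "w") as f:
        f.write("placeholder for a WAL segment")

    new_home = relocate_dir(xlog, other_disk)
    print("symlink in place:", os.path.islink(xlog))
    print("files visible through old path:", os.listdir(xlog))
    print("real location:", new_home)
```

This only relocates the transaction log. It does not answer the question about placing individual tables on different partitions, which is what tablespaces address.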
All,

I have a 13 disk (250G each) software raid 5 set using one 16 port Adaptec SATA controller. I am very happy with the performance. The reason I went with the 13 disk raid 5 set was for the space, NOT performance. I have a single postgresql database that is over 2 TB with about 500 GB free on the disk. This raid set performs about the same as my ICP SCSI raid controller (also with raid 5).

That said, now that postgresql 8 has tablespaces, I would NOT create one single raid 5 set, but 3 smaller sets. I also DO NOT have my WAL and logs on this raid set, but on a smaller 2 disk mirror.

Jim

---------- Original Message -----------
From: Greg Stark <gsstark@mit.edu>
To: Alex Turner <armtuk@gmail.com>
Cc: Greg Stark <gsstark@mit.edu>, Arshavir Grigorian <ag@m-cam.com>, linux-raid@vger.kernel.org, pgsql-performance@postgresql.org
Sent: 14 Mar 2005 15:17:11 -0500
Subject: Re: [PERFORM] Postgres on RAID5

> Alex Turner <armtuk@gmail.com> writes:
>
> > a 14 drive stripe will max out the PCI bus long before anything else,
>
> Hopefully anyone with a 14 drive stripe is using some combination of 64 bit
> PCI-X cards running at 66Mhz...
>
> > the only reason for a stripe this size is to get a total accessible
> > size up.
>
> Well, many drives also cuts average latency. So even if you have no need for
> more bandwidth you still benefit from a lower average response time by adding
> more drives.
>
> --
> greg
------- End of Original Message -------
Alex Turner wrote:
> 35 Trans/sec is pretty slow, particularly if they are only one row at
> a time. I typically get 200-400/sec on our DB server on a bad day. Up
> to 1100 on a fresh database.

Well, don't rule out that his raid controller is not caching his writes. His WAL sync method may be overriding his raid cache policy and flushing his writes to disk, always. Win32 has the same problem, and before Magnus's O_DIRECT patch, there was no way to easily work around it without turning fsync off. I'd suggest playing with different WAL sync methods before trying anything else.

Merlin
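
To get a feel for how much the choice of sync call matters on a given filesystem, a small timing sketch along these lines can help. It is not Postgres code: the method names merely mirror the documented `wal_sync_method` values (fsync, fdatasync, open_sync, open_datasync), "none" is an unsynced baseline added for comparison, and `time_writes` and `available_methods` are names invented here. It assumes a Unix-like system.

```python
import os
import tempfile
import time

ALL_METHODS = ("none", "fsync", "fdatasync", "open_sync", "open_datasync")


def available_methods():
    """Return the sync methods this platform's os module supports."""
    methods = ["none", "fsync"]
    if hasattr(os, "fdatasync"):
        methods.append("fdatasync")
    if hasattr(os, "O_SYNC"):
        methods.append("open_sync")
    if hasattr(os, "O_DSYNC"):
        methods.append("open_datasync")
    return methods


def time_writes(path, method, count=50, size=8192):
    """Write `count` blocks of `size` bytes, forcing each to disk via `method`.

    Returns the elapsed time in seconds. 8192 bytes matches the default
    Postgres page size.
    """
    if method not in ALL_METHODS:
        raise ValueError(f"unknown sync method: {method}")
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    if method == "open_sync":
        flags |= os.O_SYNC    # every write() returns only after data+metadata hit disk
    elif method == "open_datasync":
        flags |= os.O_DSYNC   # every write() returns only after the data hits disk
    buf = b"x" * size
    fd = os.open(path, flags, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, buf)
            if method == "fsync":
                os.fsync(fd)
            elif method == "fdatasync":
                os.fdatasync(fd)
        return time.perf_counter() - start
    finally:
        os.close(fd)


# Demo: the temp directory is only a convenient default; to learn anything
# about an array, point `target` at a file on that array's filesystem.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "sync_test.bin")
    for m in available_methods():
        elapsed = time_writes(target, m)
        print(f"{m:14s} {elapsed * 1000:9.2f} ms for 50 x 8 kB writes")
```

Large differences between the rows would suggest that the sync method is worth changing before touching the hardware layout; near-identical rows point elsewhere.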
He doesn't have a RAID controller, it's software RAID...

Alex Turner
netEconomist

On Mon, 14 Mar 2005 16:18:00 -0500, Merlin Moncure <merlin.moncure@rcsonline.com> wrote:
> Well, don't rule out that his raid controller is not caching his writes.
> His WAL sync method may be overriding his raid cache policy and flushing
> his writes to disk, always.
> [...]
In my experience, if you are concerned about filesystem performance, don't use ext3. It is one of the slowest filesystems I have ever used especially for writes. I would suggest either reiserfs or xfs.

--David Dougall

On Fri, 11 Mar 2005, Arshavir Grigorian wrote:
> Hi,
>
> I have a RAID5 array (mdadm) with 14 disks + 1 spare. This partition has
> an Ext3 filesystem which is used by Postgres. Currently we are loading a
> 50G database on this server from a Postgres dump (copy, not insert) and
> are experiencing very slow write performance (35 records per second).
> [...]
David Dougall wrote:
> In my experience, if you are concerned about filesystem performance, don't
> use ext3. It is one of the slowest filesystems I have ever used
> especially for writes. I would suggest either reiserfs or xfs.

I'm a bit afraid to start yet another filesystem flamewar, but. Please don't make such claims without providing actual numbers and config details. Pretty please.

ext3 performs well for databases, there's no reason for it to be slow. OK, enable data=journal and use it with e.g. Oracle, and you will see it is slow. But in that case it isn't the filesystem to blame, it's operator error, simple as that.

And especially reiserfs, with its tail packing enabled by default, is NOT suitable for databases...

/mjt
Arshavir Grigorian wrote:
> Alex Turner wrote:
> []
> Well, by putting the pg_xlog directory on a separate disk/partition, I
> was able to increase this rate to about 50 or so per second (still
> pretty far from your numbers). Next I am going to try putting the
> pg_xlog on a RAID1+0 array and see if that helps.

pg_xlog is written synchronously, right? It should be, or else the reliability of the database will be in serious question...

I posted a question on Feb-22 here in linux-raid, titled "*terrible* direct-write performance with raid5". There's a problem with write performance of a raid4/5/6 array, which is due to the design.

Consider a raid5 array (raid4 will be exactly the same, and for raid6, just double the parity writes) with N data blocks and 1 parity block per stripe. When a portion of data is written, the parity block must be updated too, to stay consistent and recoverable. And here, the size of the write plays a very significant role. If your write size is smaller than chunk_size*N (N = number of data blocks in a stripe), in order to calculate correct parity you have to read data from the remaining drives. The only case where you don't need to read data from other drives is when you're writing exactly chunk_size*N, AND the write is block-aligned. By default, chunk_size is 64KB (min is 4KB). So the only reasonable direct-write size for N drives is 64KB*N, or else the raid code will have to read the "missing" data to calculate the parity block. Of course, in 99% of cases you're writing in much smaller sizes, say 4KB or so. And here, the more drives you have, the LESS write speed you will have.

When using the O/S buffer and filesystem cache, the system has many more chances to re-order requests and sometimes even omit reading entirely (when you perform many sequential writes for example, without a sync in between), so buffered writes might be much faster. But not direct or synchronous writes, again especially when you're doing a lot of sequential writes...

So to me it looks like an inherent problem of the raid5 architecture with respect to database-like workloads: databases tend to use synchronous or direct writes to ensure good data consistency.

For pgsql, which (I don't know for sure but reportedly) uses synchronous writes only for the transaction log, it is a good idea to put that log on a raid1 or raid10 array, but NOT on a raid5 array.

Just IMHO of course.

/mjt
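
The parity bookkeeping discussed above can be illustrated with plain XOR. This is a sketch of generic RAID5 parity arithmetic, not of the md driver's actual code: a full-stripe write needs no reads, while a sub-stripe write can either re-read the untouched chunks ("reconstruct-write", the case described above) or re-read only the old data and old parity ("read-modify-write"). All function names are invented for the example, and the 14-data-chunk figure is taken from the 15-device array in the original post.

```python
from functools import reduce


def xor_blocks(*blocks):
    """Byte-wise XOR of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def full_stripe_parity(data_chunks):
    """Parity for a write covering the whole stripe: no reads required."""
    return xor_blocks(*data_chunks)


def reconstruct_write(untouched_chunks, new_chunk):
    """New parity from the new chunk plus all chunks NOT being written."""
    return xor_blocks(new_chunk, *untouched_chunks)


def read_modify_write(old_chunk, old_parity, new_chunk):
    """New parity from the old chunk, the old parity and the new chunk."""
    return xor_blocks(old_parity, old_chunk, new_chunk)


def reads_needed(strategy, n_data, chunks_written=1):
    """Chunk reads required before the parity can be written."""
    if strategy == "full_stripe":
        return 0
    if strategy == "reconstruct_write":
        return n_data - chunks_written
    if strategy == "read_modify_write":
        return chunks_written + 1  # old data chunk(s) plus old parity
    raise ValueError(f"unknown strategy: {strategy}")


# Tiny 4-byte chunks keep the demo readable; the array above uses 64K chunks.
stripe = [bytes([value] * 4) for value in (1, 2, 3, 4)]
parity = full_stripe_parity(stripe)

new_chunk = bytes([9] * 4)  # overwrite chunk 1 only
p_rcw = reconstruct_write(stripe[:1] + stripe[2:], new_chunk)
p_rmw = read_modify_write(stripe[1], parity, new_chunk)
print("both strategies agree:", p_rcw == p_rmw)

# Parity still lets us rebuild a lost chunk from the survivors.
updated = [stripe[0], new_chunk, stripe[2], stripe[3]]
rebuilt = xor_blocks(p_rmw, updated[0], updated[2], updated[3])
print("lost chunk recovered:", rebuilt == new_chunk)

N_DATA = 14  # 15 raid devices = 14 data chunks + 1 parity chunk per stripe
print("full stripe size:", N_DATA * 64, "KiB")
for strategy in ("full_stripe", "reconstruct_write", "read_modify_write"):
    print(f"{strategy:18s} reads for a 1-chunk write: {reads_needed(strategy, N_DATA)}")
```

Either way, a small synchronous write turns into several disk operations that must complete before the write can be acknowledged, which is the cost being described for a transaction log on raid5.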
You said:
"If your write size is smaller than chunk_size*N (N = number of data blocks in a stripe), in order to calculate correct parity you have to read data from the remaining drives."

Neil explained it in this message:
http://marc.theaimsgroup.com/?l=linux-raid&m=108682190730593&w=2

Guy

-----Original Message-----
From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Michael Tokarev
Sent: Monday, March 14, 2005 5:47 PM
To: Arshavir Grigorian
Cc: linux-raid@vger.kernel.org; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Postgres on RAID5

[...]
Greg Stark wrote:
> [...]
> But 2-3s pauses seem disturbing. I wonder whether ext3 is issuing a cache
> flush on every fsync to get the journal pushed out. This is a new linux
> feature that's necessary with ide but shouldn't be necessary with scsi.
> [...]

Hi,

I'm coming in from the raid list so I didn't get the full story.

May I ask what kernel?

I only ask because I upgraded to 2.6.11.2 and happened to be watching xosview on my (probably) completely different setup (1TB xfs/lvm2/raid5 served by nfs to a remote sustained read/write app), when I saw all read activity cease for 2-3 seconds whilst the disk wrote, then disk read resumed. This occurred repeatedly during a read/edit/write of a 3GB file.

Performance is not critical here, so it's on the "hmm, that's odd" todo list :)

David