Thread: Postgres on RAID5
Hi,

I have a RAID5 array (mdadm) with 14 disks + 1 spare. This partition has
an Ext3 filesystem which is used by Postgres. Currently we are loading a
50G database on this server from a Postgres dump (copy, not insert) and
are experiencing very slow write performance (35 records per second).

Top shows that the Postgres process (postmaster) is constantly being put
into D state for extended periods of time (2-3 seconds), which I assume
is because it's waiting for disk I/O. I have just started gathering
system statistics, and here is what sar -b shows (this is while the db
is being loaded - pg_restore):

             tps      rtps     wtps     bread/s   bwrtn/s
01:35:01 PM  275.77   76.12    199.66   709.59    2315.23
01:45:01 PM  287.25   75.56    211.69   706.52    2413.06
01:55:01 PM  281.73   76.35    205.37   711.84    2389.86
02:05:01 PM  282.83   76.14    206.69   720.85    2418.51
02:15:01 PM  284.07   76.15    207.92   707.38    2443.60
02:25:01 PM  265.46   75.91    189.55   708.87    2089.21
02:35:01 PM  285.21   76.02    209.19   709.58    2446.46
Average:     280.33   76.04    204.30   710.66    2359.47

This is a Sun E450 with dual TI UltraSPARC II processors and 2G of RAM.
It is currently running Debian Sarge with a 2.4.27-sparc64-smp custom
compiled kernel. Postgres is installed from the Debian package and uses
all the configuration defaults.

I am also copying the pgsql-performance list. Thanks in advance for any
advice/pointers.

Arshavir

Following is some other info that might be helpful.

/proc/scsi# mdadm -D /dev/md1
/dev/md1:
        Version : 00.90.00
  Creation Time : Wed Feb 23 17:23:41 2005
     Raid Level : raid5
     Array Size : 123823616 (118.09 GiB 126.80 GB)
    Device Size : 8844544 (8.43 GiB 9.06 GB)
   Raid Devices : 15
  Total Devices : 17
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Thu Feb 24 10:05:38 2005
          State : active
 Active Devices : 15
Working Devices : 16
 Failed Devices : 1
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 64K

           UUID : 81ae2c97:06fa4f4d:87bfc6c9:2ee516df
         Events : 0.8

    Number   Major   Minor   RaidDevice State
       0       8       64        0      active sync   /dev/sde
       1       8       80        1      active sync   /dev/sdf
       2       8       96        2      active sync   /dev/sdg
       3       8      112        3      active sync   /dev/sdh
       4       8      128        4      active sync   /dev/sdi
       5       8      144        5      active sync   /dev/sdj
       6       8      160        6      active sync   /dev/sdk
       7       8      176        7      active sync   /dev/sdl
       8       8      192        8      active sync   /dev/sdm
       9       8      208        9      active sync   /dev/sdn
      10       8      224       10      active sync   /dev/sdo
      11       8      240       11      active sync   /dev/sdp
      12      65        0       12      active sync   /dev/sdq
      13      65       16       13      active sync   /dev/sdr
      14      65       32       14      active sync   /dev/sds
      15      65       48       15      spare         /dev/sdt

# dumpe2fs -h /dev/md1
dumpe2fs 1.35 (28-Feb-2004)
Filesystem volume name:   <none>
Last mounted on:          <not available>
Filesystem UUID:          1bb95bd6-94c7-4344-adf2-8414cadae6fc
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal dir_index needs_recovery large_file
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              15482880
Block count:              30955904
Reserved block count:     1547795
Free blocks:              28767226
Free inodes:              15482502
First block:              0
Block size:               4096
Fragment size:            4096
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         16384
Inode blocks per group:   512
Filesystem created:       Wed Feb 23 17:27:13 2005
Last mount time:          Wed Feb 23 17:45:25 2005
Last write time:          Wed Feb 23 17:45:25 2005
Mount count:              2
Maximum mount count:      28
Last checked:             Wed Feb 23 17:27:13 2005
Check interval:           15552000 (6 months)
Next check after:         Mon Aug 22 18:27:13 2005
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               128
Journal inode:            8
Default directory hash:   tea
Directory Hash Seed:      c35c0226-3b52-4dad-b102-f22feb773592
Journal backup:           inode blocks

# lspci | grep SCSI
0000:00:03.0 SCSI storage controller: LSI Logic / Symbios Logic 53c875 (rev 14)
0000:00:03.1 SCSI storage controller: LSI Logic / Symbios Logic 53c875 (rev 14)
0000:00:04.0 SCSI storage controller: LSI Logic / Symbios Logic 53c875 (rev 14)
0000:00:04.1 SCSI storage controller: LSI Logic / Symbios Logic 53c875 (rev 14)
0000:04:02.0 SCSI storage controller: LSI Logic / Symbios Logic 53c875 (rev 03)
0000:04:03.0 SCSI storage controller: LSI Logic / Symbios Logic 53c875 (rev 03)

/proc/scsi# more scsi
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: SEAGATE  Model: ST39103LCSUN9.0G Rev: 034A
  Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 01 Lun: 00
  Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi0 Channel: 00 Id: 02 Lun: 00
  Vendor: SEAGATE  Model: ST39103LCSUN9.0G Rev: 034A
  Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 03 Lun: 00
  Vendor: SEAGATE  Model: ST39103LCSUN9.0G Rev: 034A
  Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi1 Channel: 00 Id: 00 Lun: 00
  Vendor: SEAGATE  Model: ST39103LCSUN9.0G Rev: 034A
  Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi1 Channel: 00 Id: 01 Lun: 00
  Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi1 Channel: 00 Id: 02 Lun: 00
  Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi1 Channel: 00 Id: 03 Lun: 00
  Vendor: SEAGATE  Model: ST39103LCSUN9.0G Rev: 034A
  Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi2 Channel: 00 Id: 00 Lun: 00
  Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi2 Channel: 00 Id: 01 Lun: 00
  Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi2 Channel: 00 Id: 02 Lun: 00
  Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi2 Channel: 00 Id: 03 Lun: 00
  Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi3 Channel: 00 Id: 00 Lun: 00
  Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi3 Channel: 00 Id: 01 Lun: 00
  Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi3 Channel: 00 Id: 02 Lun: 00
  Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi3 Channel: 00 Id: 03 Lun: 00
  Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi4 Channel: 00 Id: 06 Lun: 00
  Vendor: TOSHIBA  Model: XM6201TASUN32XCD Rev: 1103
  Type:   CD-ROM                           ANSI SCSI revision: 02
Host: scsi5 Channel: 00 Id: 00 Lun: 00
  Vendor: FUJITSU  Model: MAG3091L SUN9.0G Rev: 1111
  Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi5 Channel: 00 Id: 01 Lun: 00
  Vendor: FUJITSU  Model: MAG3091L SUN9.0G Rev: 1111
  Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi5 Channel: 00 Id: 02 Lun: 00
  Vendor: FUJITSU  Model: MAG3091L SUN9.0G Rev: 1111
  Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi5 Channel: 00 Id: 03 Lun: 00
  Vendor: FUJITSU  Model: MAG3091L SUN9.0G Rev: 1111
  Type:   Direct-Access                    ANSI SCSI revision: 02

-- 
Arshavir Grigorian
Systems Administrator/Engineer
Arshavir Grigorian <ag@m-cam.com> writes:
> I have a RAID5 array (mdadm) with 14 disks + 1 spare. This partition has
> an Ext3 filesystem which is used by Postgres. Currently we are loading a
> 50G database on this server from a Postgres dump (copy, not insert) and
> are experiencing very slow write performance (35 records per second).

What PG version is this?  What version of pg_dump made the dump file?
How are you measuring that write rate (seeing that pg_restore doesn't
provide any such info)?

> Postgres is installed from the Debian package and uses
> all the configuration defaults.

The defaults are made for a fairly small machine, not big iron.  At a
minimum you want to kick shared_buffers up to 10K or more.

			regards, tom lane
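P.S. On 7.4, with 8K pages, 10000 buffers is about 80MB of shared
memory, so on a stock 2.4 kernel you will likely also have to raise
SHMMAX before the postmaster will start with that setting. Roughly
(the config path shown is the Debian sarge layout - an assumption;
adjust for your install):

    # /etc/postgresql/postgresql.conf
    shared_buffers = 10000       # default is typically 1000; ~80MB at 8K/page

    # as root: allow shared memory segments bigger than the 2.4 default (32MB)
    echo 134217728 > /proc/sys/kernel/shmmax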
Tom Lane wrote:
> Arshavir Grigorian <ag@m-cam.com> writes:
>> I have a RAID5 array (mdadm) with 14 disks + 1 spare. This partition has
>> an Ext3 filesystem which is used by Postgres. Currently we are loading a
>> 50G database on this server from a Postgres dump (copy, not insert) and
>> are experiencing very slow write performance (35 records per second).
>
> What PG version is this?  What version of pg_dump made the dump file?
> How are you measuring that write rate (seeing that pg_restore doesn't
> provide any such info)?

Sorry I missed the version. Both (the db from which the dump was created
and the one it's being loaded on) run on Pg 7.4. Well, if the restore is
going on for X number of hours and you have Y records loaded, it's not
hard to ballpark.

>> Postgres is installed from the Debian package and uses
>> all the configuration defaults.
>
> The defaults are made for a fairly small machine, not big iron.  At a
> minimum you want to kick shared_buffers up to 10K or more.

Will do. Thanks.

Arshavir
On Fri, Mar 11, 2005 at 05:29:11PM -0500, Arshavir Grigorian wrote:
> Tom Lane wrote:
>> The defaults are made for a fairly small machine, not big iron.  At a
>> minimum you want to kick shared_buffers up to 10K or more.
>
> Will do. Thanks.

Also, it may help to bump up sort_mem while doing [the CREATE INDEX
part of] the restore.

-- 
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"We are who we choose to be", sang the goldfinch
when the sun is high (Sandman)
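For instance, something like this should do it for just the restoring
session (65536, i.e. 64MB, is only a guess at a sane value - sort_mem
is in KB in 7.4; PGOPTIONS is honored by any libpq client, pg_restore
included; database and dump names are placeholders):

    PGOPTIONS="-c sort_mem=65536" pg_restore -d mydb mydump.dump

The same effect can be had from within a session with
SET sort_mem = 65536; before the index builds run.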
Arshavir Grigorian <ag@m-cam.com> writes:
> Tom Lane wrote:
>> How are you measuring that write rate (seeing that pg_restore doesn't
>> provide any such info)?
>
> Well, if the restore is going on for X number of hours and you have Y
> records loaded, it's not hard to ballpark.

Yeah, but how do you know that you have Y records loaded?

What I'm trying to get at is what the restore is actually spending its
time on.  It seems unlikely that a COPY per se would run that slowly;
far more likely that the expense is involved with index construction
or foreign key verification.  You could possibly determine what's what
by watching the backend process with "ps" to see what statement type
it's executing most of the time.

BTW, is this a full database restore (schema + data), or are you trying
to load data into pre-existing tables?  The latter is generally a whole
lot slower because both index updates and foreign key checks have to be
done retail instead of wholesale.  There are various ways of working
around that, but you have to be aware of what you're doing.

Also, if it is indexing that's eating the time, boosting the sort_mem
setting for the server would help a lot.

			regards, tom lane
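On Linux the current statement type shows up in the process title, so
repeating something like this while the restore runs should reveal
whether COPY, CREATE INDEX, or ALTER TABLE ... ADD CONSTRAINT dominates
(a sketch; the exact ps display varies by platform):

    # titles look like "postgres: user db [local] COPY"
    watch -n 5 'ps auxww | grep "postgres:"'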
Many thanks for all the responses.

I guess there are a lot of things to change and tweak, and I wonder what
would be a good benchmarking sample dataset (size, contents).

My tables are very large (the smallest is 7+ mil records) and take
several days to load (if not weeks). It would be nice to have a sample
dataset that would be large enough to mimic my large datasets, but small
enough to load in a short period of time. Any suggestions?

Arshavir
A,

> This is a Sun e450 with dual TI UltraSparc II processors and 2G of RAM.
> It is currently running Debian Sarge with a 2.4.27-sparc64-smp custom
> compiled kernel. Postgres is installed from the Debian package and uses
> all the configuration defaults.

Please read http://www.powerpostgresql.com/PerfList

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
Look for the possibility that a foreign key check might not be using an
index. This would yield a seq scan for each insertion, which might be
your problem.

On Fri, 11 Mar 2005 19:22:56 -0500, Arshavir Grigorian <ag@m-cam.com> wrote:
> Many thanks for all the responses.
>
> I guess there are a lot of things to change and tweak, and I wonder what
> would be a good benchmarking sample dataset (size, contents).
>
> My tables are very large (the smallest is 7+ mil records) and take
> several days to load (if not weeks). It would be nice to have a sample
> dataset that would be large enough to mimic my large datasets, but small
> enough to load in a short period of time. Any suggestions?
>
> Arshavir
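One quick way to check (table and column names here are hypothetical):
each inserted row triggers a lookup of the referenced key, and that
lookup can fall back to a seq scan if the parent key isn't indexed or
the FK and PK column types don't match exactly. Run the equivalent
probe by hand:

    -- roughly what the per-row FK check does
    EXPLAIN SELECT 1 FROM parent_table WHERE id = 42;

    -- should show an Index Scan on the primary key;
    -- a Seq Scan over millions of rows would explain 35 rows/sec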
I would recommend running a bonnie++ benchmark on your array to see if
it's the array/controller/raid being crap, or whether it's postgres. I
have had some very surprising results from arrays that theoretically
should be fast, but turned out to be very slow.

I would also seriously have to recommend against a 14 drive RAID 5!
This is statistically as likely to fail as a 7 drive RAID 0 (not
counting the spare, but rebuilding onto a spare is very hard on the
existing drives).

Alex Turner
netEconomist

On Fri, 11 Mar 2005 16:13:05 -0500, Arshavir Grigorian <ag@m-cam.com> wrote:
> Hi,
>
> I have a RAID5 array (mdadm) with 14 disks + 1 spare. This partition has
> an Ext3 filesystem which is used by Postgres. Currently we are loading a
> 50G database on this server from a Postgres dump (copy, not insert) and
> are experiencing very slow write performance (35 records per second).
>
> Top shows that the Postgres process (postmaster) is constantly being put
> into D state for extended periods of time (2-3 seconds), which I assume
> is because it's waiting for disk I/O. I have just started gathering
> system statistics, and here is what sar -b shows (this is while the db
> is being loaded - pg_restore):
>
>              tps      rtps     wtps     bread/s   bwrtn/s
> 01:35:01 PM  275.77   76.12    199.66   709.59    2315.23
> 01:45:01 PM  287.25   75.56    211.69   706.52    2413.06
> 01:55:01 PM  281.73   76.35    205.37   711.84    2389.86
> [snip]
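For a box with 2G of RAM, something like this is a reasonable starting
invocation (the test file should be at least twice RAM so the page
cache can't flatter the numbers; the directory is a placeholder for a
path on the array):

    bonnie++ -d /path/on/array -s 4096 -u postgres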
> On Fri, 11 Mar 2005 16:13:05 -0500, Arshavir Grigorian <ag@m-cam.com> wrote:
>
>> Hi,
>>
>> I have a RAID5 array (mdadm) with 14 disks + 1 spare. This partition has
>> an Ext3 filesystem which is used by Postgres. Currently we are loading a
>> 50G database on this server from a Postgres dump (copy, not insert) and
>> are experiencing very slow write performance (35 records per second).

That isn't that surprising. RAID 5 has never been known for its write
performance. You should be running RAID 10.

Sincerely,

Joshua D. Drake

>> Top shows that the Postgres process (postmaster) is constantly being put
>> into D state for extended periods of time (2-3 seconds), which I assume
>> is because it's waiting for disk I/O. I have just started gathering
>> system statistics, and here is what sar -b shows (this is while the db
>> is being loaded - pg_restore):
>>
>>              tps      rtps     wtps     bread/s   bwrtn/s
>> 01:35:01 PM  275.77   76.12    199.66   709.59    2315.23
>> 01:45:01 PM  287.25   75.56    211.69   706.52    2413.06
>> 01:55:01 PM  281.73   76.35    205.37   711.84    2389.86
>
> [snip]

-- 
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com
PostgreSQL Replicator -- production quality replication for PostgreSQL
Hi Arshavir Grigorian,

0. If possible, move to 8.0.1 - the bgwriter will help you.
1. Create a RAID1 for redo (WAL) and place its drives on a separate
   SCSI channel (see the sketch below).
2. Update postgresql.conf:
      shared_buffers = 10000-50000
      work_mem = 100000-300000
      maintenance_work_mem = 100000-300000
      max_fsm_pages = 1500000
      max_fsm_relations = 16000
      wal_buffers = 32
      checkpoint_segments = 32          # 16MB each !!
      checkpoint_timeout = 600
      checkpoint_warning = 60
      effective_cache_size = 128000
      random_page_cost = 3
      default_statistics_target = 100
      log_min_error_statement = warning
      log_min_duration_statement = 1000 # for logging long SQL
3. If possible, migrate from RAID5 to RAID10.
4. Add (if needed) 2 new drives for the OS and use ALL 20x9GB drives
   for DB storage.
5. Remove the CDROM from the working configuration and start using its
   SCSI channel.

Best regards,
Alexander Kirpa
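A minimal sketch of item 1, assuming two free drives (sdu and sdv are
placeholders) and the Debian 7.4 data directory layout:

    # mirror two drives for WAL and move pg_xlog onto them
    mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdu /dev/sdv
    mke2fs -j /dev/md2
    mkdir /var/lib/postgres/wal
    mount /dev/md2 /var/lib/postgres/wal
    /etc/init.d/postgresql stop
    mv /var/lib/postgres/data/pg_xlog /var/lib/postgres/wal/
    ln -s /var/lib/postgres/wal/pg_xlog /var/lib/postgres/data/pg_xlog
    /etc/init.d/postgresql start

This keeps the sequential WAL writes off the spindles that are busy
with random data-file I/O.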
Josh Berkus wrote:
> A,
>
>> This is a Sun e450 with dual TI UltraSparc II processors and 2G of RAM.
>> It is currently running Debian Sarge with a 2.4.27-sparc64-smp custom
>> compiled kernel. Postgres is installed from the Debian package and uses
>> all the configuration defaults.
>
> Please read http://www.powerpostgresql.com/PerfList

I have read that document. Very informative/useful. Thanks.
Alex Turner wrote:
> I would recommend running a bonnie++ benchmark on your array to see if
> it's the array/controller/raid being crap, or whether it's postgres. I
> have had some very surprising results from arrays that theoretically
> should be fast, but turned out to be very slow.
>
> I would also seriously have to recommend against a 14 drive RAID 5!
> This is statistically as likely to fail as a 7 drive RAID 0 (not
> counting the spare, but rebuilding onto a spare is very hard on the
> existing drives).

Thanks for the reply.

Here are the results of the bonnie test on my array:

./bonnie -s 10000 -d . > oo 2>&1
File './Bonnie.23736', size: 10485760000
Writing with putc()...done
Rewriting...done
Writing intelligently...done
Reading with getc()...done
Reading intelligently...done
Seeker 1...Seeker 2...Seeker 3...start 'em...done...done...done...

       -------Sequential Output-------- ---Sequential Input-- --Random--
       -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
   MB  K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
10000   4762 96.0 46140 78.8 31180 61.0  3810 99.9 71586 67.7 411.8 13.1

On a different note, I am not sure how the probability of RAID5 over 15
disks failing is the same as that of a RAID0 array over 7 disks. RAID5
can operate in a degraded mode (14 disks - 1 bad); RAID0, on the other
hand, cannot operate with even one bad disk. Am I missing something?

Are you saying running RAID0 on a set of 2 RAID1 arrays of 7 each? That
would work fine, except I cannot afford to "lose" that much space.

Care to comment on these numbers? Thanks.

Arshavir
Actually, my statistics were off a bit, I realised. The chance of
failure for one drive is 1 in X, so the chance of failure for a 7 drive
RAID 0 is 7 in X. For a 14 drive RAID 5 to die, a first drive has to
fail (14 in X) and then a second (13 in X), so the total probability is
182 in X*X, which is much lower than RAID 0.

Your drive performance is less than stellar for a 14 drive stripe, and
CPU usage for writes is very high. Even so, this should be enough
throughput to get over 100 rows/sec, assuming you have virtually no
stored procs (I have noticed that stored procs in plpgsql REALLY slow
Postgres down).

Alex Turner
netEconomist

On Mon, 14 Mar 2005 15:54:34 -0500, Arshavir Grigorian <ag@m-cam.com> wrote:
> Alex Turner wrote:
>> I would recommend running a bonnie++ benchmark on your array to see if
>> it's the array/controller/raid being crap, or whether it's postgres. I
>> have had some very surprising results from arrays that theoretically
>> should be fast, but turned out to be very slow.
>>
>> I would also seriously have to recommend against a 14 drive RAID 5!
>> This is statistically as likely to fail as a 7 drive RAID 0 (not
>> counting the spare, but rebuilding onto a spare is very hard on the
>> existing drives).
>
> Thanks for the reply.
>
> Here are the results of the bonnie test on my array:
>
> ./bonnie -s 10000 -d . > oo 2>&1
> File './Bonnie.23736', size: 10485760000
> Writing with putc()...done
> Rewriting...done
> Writing intelligently...done
> Reading with getc()...done
> Reading intelligently...done
> Seeker 1...Seeker 2...Seeker 3...start 'em...done...done...done...
>
>        -------Sequential Output-------- ---Sequential Input-- --Random--
>        -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
>    MB  K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
> 10000   4762 96.0 46140 78.8 31180 61.0  3810 99.9 71586 67.7 411.8 13.1
>
> On a different note, I am not sure how the probability of RAID5 over 15
> disks failing is the same as that of a RAID0 array over 7 disks. RAID5
> can operate in a degraded mode (14 disks - 1 bad); RAID0, on the other
> hand, cannot operate with even one bad disk. Am I missing something?
>
> Are you saying running RAID0 on a set of 2 RAID1 arrays of 7 each? That
> would work fine, except I cannot afford to "lose" that much space.
>
> Care to comment on these numbers? Thanks.
>
> Arshavir