Thread: Postgres on RAID5

Postgres on RAID5

From: Arshavir Grigorian
Hi,

I have a RAID5 array (mdadm) with 14 disks + 1 spare. This partition has
an Ext3 filesystem which is used by Postgres. Currently we are loading a
50G database on this server from a Postgres dump (copy, not insert) and
are experiencing very slow write performance (35 records per second).

Top shows that the Postgres process (postmaster) is being constantly put
into D state for extended periods of time (2-3 seconds) which I assume
is because it's waiting for disk io. I have just started gathering
system statistics and here is what sar -b shows: (this is while the db
is being loaded - pg_restore)

             tps        rtps     wtps      bread/s  bwrtn/s
01:35:01 PM    275.77     76.12    199.66    709.59   2315.23
01:45:01 PM    287.25     75.56    211.69    706.52   2413.06
01:55:01 PM    281.73     76.35    205.37    711.84   2389.86
02:05:01 PM    282.83     76.14    206.69    720.85   2418.51
02:15:01 PM    284.07     76.15    207.92    707.38   2443.60
02:25:01 PM    265.46     75.91    189.55    708.87   2089.21
02:35:01 PM    285.21     76.02    209.19    709.58   2446.46
Average:       280.33     76.04    204.30    710.66   2359.47

This is a Sun e450 with dual TI UltraSparc II processors and 2G of RAM.
It is currently running Debian Sarge with a 2.4.27-sparc64-smp custom
compiled kernel. Postgres is installed from the Debian package and uses
all the configuration defaults.

I am also copying the pgsql-performance list.

Thanks in advance for any advice/pointers.


Arshavir

Following is some other info that might be helpful.

/proc/scsi# mdadm -D /dev/md1
/dev/md1:
         Version : 00.90.00
   Creation Time : Wed Feb 23 17:23:41 2005
      Raid Level : raid5
      Array Size : 123823616 (118.09 GiB 126.80 GB)
     Device Size : 8844544 (8.43 GiB 9.06 GB)
    Raid Devices : 15
   Total Devices : 17
Preferred Minor : 1
     Persistence : Superblock is persistent

     Update Time : Thu Feb 24 10:05:38 2005
           State : active
  Active Devices : 15
Working Devices : 16
  Failed Devices : 1
   Spare Devices : 1

          Layout : left-symmetric
      Chunk Size : 64K

            UUID : 81ae2c97:06fa4f4d:87bfc6c9:2ee516df
          Events : 0.8

     Number   Major   Minor   RaidDevice State
        0       8       64        0      active sync   /dev/sde
        1       8       80        1      active sync   /dev/sdf
        2       8       96        2      active sync   /dev/sdg
        3       8      112        3      active sync   /dev/sdh
        4       8      128        4      active sync   /dev/sdi
        5       8      144        5      active sync   /dev/sdj
        6       8      160        6      active sync   /dev/sdk
        7       8      176        7      active sync   /dev/sdl
        8       8      192        8      active sync   /dev/sdm
        9       8      208        9      active sync   /dev/sdn
       10       8      224       10      active sync   /dev/sdo
       11       8      240       11      active sync   /dev/sdp
       12      65        0       12      active sync   /dev/sdq
       13      65       16       13      active sync   /dev/sdr
       14      65       32       14      active sync   /dev/sds

       15      65       48       15      spare   /dev/sdt

# dumpe2fs -h /dev/md1
dumpe2fs 1.35 (28-Feb-2004)
Filesystem volume name:   <none>
Last mounted on:          <not available>
Filesystem UUID:          1bb95bd6-94c7-4344-adf2-8414cadae6fc
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal dir_index needs_recovery large_file
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              15482880
Block count:              30955904
Reserved block count:     1547795
Free blocks:              28767226
Free inodes:              15482502
First block:              0
Block size:               4096
Fragment size:            4096
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         16384
Inode blocks per group:   512
Filesystem created:       Wed Feb 23 17:27:13 2005
Last mount time:          Wed Feb 23 17:45:25 2005
Last write time:          Wed Feb 23 17:45:25 2005
Mount count:              2
Maximum mount count:      28
Last checked:             Wed Feb 23 17:27:13 2005
Check interval:           15552000 (6 months)
Next check after:         Mon Aug 22 18:27:13 2005
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               128
Journal inode:            8
Default directory hash:   tea
Directory Hash Seed:      c35c0226-3b52-4dad-b102-f22feb773592
Journal backup:           inode blocks

# lspci | grep SCSI
0000:00:03.0 SCSI storage controller: LSI Logic / Symbios Logic 53c875 (rev 14)
0000:00:03.1 SCSI storage controller: LSI Logic / Symbios Logic 53c875 (rev 14)
0000:00:04.0 SCSI storage controller: LSI Logic / Symbios Logic 53c875 (rev 14)
0000:00:04.1 SCSI storage controller: LSI Logic / Symbios Logic 53c875 (rev 14)
0000:04:02.0 SCSI storage controller: LSI Logic / Symbios Logic 53c875 (rev 03)
0000:04:03.0 SCSI storage controller: LSI Logic / Symbios Logic 53c875 (rev 03)

/proc/scsi# more scsi
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
   Vendor: SEAGATE  Model: ST39103LCSUN9.0G Rev: 034A
   Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 01 Lun: 00
   Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
   Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi0 Channel: 00 Id: 02 Lun: 00
   Vendor: SEAGATE  Model: ST39103LCSUN9.0G Rev: 034A
   Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 03 Lun: 00
   Vendor: SEAGATE  Model: ST39103LCSUN9.0G Rev: 034A
   Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi1 Channel: 00 Id: 00 Lun: 00
   Vendor: SEAGATE  Model: ST39103LCSUN9.0G Rev: 034A
   Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi1 Channel: 00 Id: 01 Lun: 00
   Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
   Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi1 Channel: 00 Id: 02 Lun: 00
   Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
   Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi1 Channel: 00 Id: 03 Lun: 00
   Vendor: SEAGATE  Model: ST39103LCSUN9.0G Rev: 034A
   Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi2 Channel: 00 Id: 00 Lun: 00
   Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
   Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi2 Channel: 00 Id: 01 Lun: 00
   Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
   Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi2 Channel: 00 Id: 02 Lun: 00
   Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
   Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi2 Channel: 00 Id: 03 Lun: 00
   Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
   Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi3 Channel: 00 Id: 00 Lun: 00
   Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
   Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi3 Channel: 00 Id: 01 Lun: 00
   Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
   Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi3 Channel: 00 Id: 02 Lun: 00
   Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
   Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi3 Channel: 00 Id: 03 Lun: 00
   Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
   Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi4 Channel: 00 Id: 06 Lun: 00
   Vendor: TOSHIBA  Model: XM6201TASUN32XCD Rev: 1103
   Type:   CD-ROM                           ANSI SCSI revision: 02
Host: scsi5 Channel: 00 Id: 00 Lun: 00
   Vendor: FUJITSU  Model: MAG3091L SUN9.0G Rev: 1111
   Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi5 Channel: 00 Id: 01 Lun: 00
   Vendor: FUJITSU  Model: MAG3091L SUN9.0G Rev: 1111
   Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi5 Channel: 00 Id: 02 Lun: 00
   Vendor: FUJITSU  Model: MAG3091L SUN9.0G Rev: 1111
   Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi5 Channel: 00 Id: 03 Lun: 00
   Vendor: FUJITSU  Model: MAG3091L SUN9.0G Rev: 1111
   Type:   Direct-Access                    ANSI SCSI revision: 02






--
Arshavir Grigorian
Systems Administrator/Engineer


Re: Postgres on RAID5

From: Tom Lane
Arshavir Grigorian <ag@m-cam.com> writes:
> I have a RAID5 array (mdadm) with 14 disks + 1 spare. This partition has
> an Ext3 filesystem which is used by Postgres. Currently we are loading a
> 50G database on this server from a Postgres dump (copy, not insert) and
> are experiencing very slow write performance (35 records per second).

What PG version is this?  What version of pg_dump made the dump file?
How are you measuring that write rate (seeing that pg_restore doesn't
provide any such info)?

> Postgres is installed from the Debian package and uses
> all the configuration defaults.

The defaults are made for a fairly small machine, not big iron.  At a
minimum you want to kick shared_buffers up to 10K or more.
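
For example, the relevant postgresql.conf line might look something like
this (just a sketch, not a tuned value; 10000 buffers is roughly 80 MB at
the default 8 kB page size):

shared_buffers = 10000          # 8 kB buffers, roughly 80 MB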

            regards, tom lane

Re: Postgres on RAID5

From: Arshavir Grigorian
Tom Lane wrote:
> Arshavir Grigorian <ag@m-cam.com> writes:
>
>>I have a RAID5 array (mdadm) with 14 disks + 1 spare. This partition has
>>an Ext3 filesystem which is used by Postgres. Currently we are loading a
>>50G database on this server from a Postgres dump (copy, not insert) and
>>are experiencing very slow write performance (35 records per second).
>
>
> What PG version is this?  What version of pg_dump made the dump file?
> How are you measuring that write rate (seeing that pg_restore doesn't
> provide any such info)?

Sorry I missed the version. Both (the db from which the dump was created
and the one it's being loaded on) run on Pg 7.4.

Well, if the restore is going on for X number of hours and you have Y
records loaded, it's not hard to ballpark.

>
>
>>Postgres is installed from the Debian package and uses
>>all the configuration defaults.
>
>
> The defaults are made for a fairly small machine, not big iron.  At a
> minimum you want to kick shared_buffers up to 10K or more.
>
>             regards, tom lane
Will do. Thanks.


Arshavir


Re: Postgres on RAID5

From: Alvaro Herrera
On Fri, Mar 11, 2005 at 05:29:11PM -0500, Arshavir Grigorian wrote:
> Tom Lane wrote:

> >The defaults are made for a fairly small machine, not big iron.  At a
> >minimum you want to kick shared_buffers up to 10K or more.
> >
> Will do. Thanks.

Also, it may help to bump up sort_mem while doing [the CREATE
INDEX part of] the restore.
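
One way to do that just for the restore (a sketch; in 7.4 sort_mem is in
kilobytes, and the database and dump file names below are placeholders) is
to pass it through the PGOPTIONS environment variable:

PGOPTIONS="-c sort_mem=262144" pg_restore -d mydb mydump.dump

That gives each sort up to 256 MB while the indexes are built, without
touching postgresql.conf.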

--
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"We are who we choose to be", sang the goldfinch
when the sun is high (Sandman)

Re: Postgres on RAID5

From: Tom Lane
Arshavir Grigorian <ag@m-cam.com> writes:
> Tom Lane wrote:
>> How are you measuring that write rate (seeing that pg_restore doesn't
>> provide any such info)?

> Well, if the restore is going on for X number of hours and you have Y
> records loaded, it's not hard to ballpark.

Yeah, but how do you know that you have Y records loaded?

What I'm trying to get at is what the restore is actually spending its
time on.  It seems unlikely that a COPY per se would run that slowly;
far more likely that the expense is involved with index construction
or foreign key verification.  You could possibly determine what's what
by watching the backend process with "ps" to see what statement type
it's executing most of the time.
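
For example, something along these lines (assuming the server runs as the
"postgres" user; the exact process title format varies a bit by platform):

# ps auxww | grep ^postgres

On Linux the backend's title normally ends with the current command, e.g.
"postgres: user dbname [local] COPY" or "... CREATE INDEX", so a few
samples taken during the restore make it obvious where the time is going.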

BTW, is this a full database restore (schema + data), or are you trying
to load data into pre-existing tables?  The latter is generally a whole
lot slower because both index updates and foreign key checks have to be
done retail instead of wholesale.  There are various ways of working
around that but you have to be aware of what you're doing.

Also, if it is indexing that's eating the time, boosting the sort_mem
setting for the server would help a lot.

            regards, tom lane

Re: Postgres on RAID5

From: Arshavir Grigorian
Many thanks for all the responses.

I guess there are a lot of things to change and tweak and I wonder what
would be a good benchmarking sample dataset (size, contents).

My tables are very large (the smallest is 7+ mil records) and take
several days to load (if not weeks). It would be nice to have a sample
dataset that would be large enough to mimic my large datasets, but small
enough to load in a short period of time. Any suggestions?


Arshavir

Re: Postgres on RAID5

From: Josh Berkus
A,

> This is a Sun e450 with dual TI UltraSparc II processors and 2G of RAM.
> It is currently running Debian Sarge with a 2.4.27-sparc64-smp custom
> compiled kernel. Postgres is installed from the Debian package and uses
> all the configuration defaults.

Please read http://www.powerpostgresql.com/PerfList

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

Re: Postgres on RAID5

From: PFC
    Look for the possibility that a foreign key check might not be using an
index. This would yield a seq scan for each insertion, which might be your
problem.
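
One quick way to check (a sketch; the database, table and column names are
made up) is to EXPLAIN the lookup the FK trigger has to do against the
referenced table for each inserted row:

# psql mydb -c "EXPLAIN SELECT 1 FROM parent_table WHERE parent_id = 42;"

A Seq Scan there (often the result of a data type mismatch between the
referencing and referenced columns) means every insert pays for a full
scan of the parent table.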


On Fri, 11 Mar 2005 19:22:56 -0500, Arshavir Grigorian <ag@m-cam.com>
wrote:

> Many thanks for all the responses.
>
> I guess there are a lot of things to change and tweak and I wonder what
> would be a good benchmarking sample dataset (size, contents).
>
> My tables are very large (the smallest is 7+ mil records) and take
> several days to load (if not weeks). It would be nice to have a sample
> dataset that would be large enough to mimic my large datasets, but small
> enough to load in a short period of time. Any suggestions?
>
>
> Arshavir
>



Re: Postgres on RAID5

From: Alex Turner
I would recommend running a bonnie++ benchmark on your array to see if
it's the array/controller/raid being crap, or whether it's postgres.  I
have had some very surprising results from arrays that theoretically
should be fast, but turned out to be very slow.
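
Something like this would do it (a sketch; the directory is a placeholder,
and -s should be at least twice RAM, here 4 GB against 2 GB of memory, to
defeat the OS cache):

# bonnie++ -d /path/on/the/array -s 4096 -u postgres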

I would also seriously have to recommend against a 14 drive RAID 5!
This is statistically as likely to fail as a 7 drive RAID 0 (not
counting the spare, but rebuilding a spare is very hard on existing
drives).

Alex Turner
netEconomist


On Fri, 11 Mar 2005 16:13:05 -0500, Arshavir Grigorian <ag@m-cam.com> wrote:
> Hi,
>
> I have a RAID5 array (mdadm) with 14 disks + 1 spare. This partition has
> an Ext3 filesystem which is used by Postgres. Currently we are loading a
> 50G database on this server from a Postgres dump (copy, not insert) and
> are experiencing very slow write performance (35 records per second).
>
> Top shows that the Postgres process (postmaster) is being constantly put
> into D state for extended periods of time (2-3 seconds) which I assume
> is because it's waiting for disk io. I have just started gathering
> system statistics and here is what sar -b shows: (this is while the db
> is being loaded - pg_restore)
>
>                tps        rtps     wtps      bread/s  bwrtn/s
> 01:35:01 PM    275.77     76.12    199.66    709.59   2315.23
> 01:45:01 PM    287.25     75.56    211.69    706.52   2413.06
> 01:55:01 PM    281.73     76.35    205.37    711.84   2389.86
>
[snip]

Re: Postgres on RAID5

From: "Joshua D. Drake"
>On Fri, 11 Mar 2005 16:13:05 -0500, Arshavir Grigorian <ag@m-cam.com> wrote:
>
>
>>Hi,
>>
>>I have a RAID5 array (mdadm) with 14 disks + 1 spare. This partition has
>>an Ext3 filesystem which is used by Postgres. Currently we are loading a
>>50G database on this server from a Postgres dump (copy, not insert) and
>>are experiencing very slow write performance (35 records per second).
>>
>>

That isn't that surprising. RAID 5 has never been known for its write
performance. You should be running RAID 10.
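
With a 2.4 kernel the usual way to get RAID 10 out of md is to nest it:
build RAID1 pairs, then stripe a RAID0 across them. A rough sketch only
(the md device numbers are made up; check the real drive layout before
touching anything):

# mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sde /dev/sdf
# mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sdg /dev/sdh
  (... five more pairs the same way, md4 through md8 ...)
# mdadm --create /dev/md9 --level=0 --chunk=64 --raid-devices=7 \
    /dev/md2 /dev/md3 /dev/md4 /dev/md5 /dev/md6 /dev/md7 /dev/md8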

Sincerely,

Joshua D. Drake


>>Top shows that the Postgres process (postmaster) is being constantly put
>>into D state for extended periods of time (2-3 seconds) which I assume
>>is because it's waiting for disk io. I have just started gathering
>>system statistics and here is what sar -b shows: (this is while the db
>>is being loaded - pg_restore)
>>
>>               tps        rtps     wtps      bread/s  bwrtn/s
>>01:35:01 PM    275.77     76.12    199.66    709.59   2315.23
>>01:45:01 PM    287.25     75.56    211.69    706.52   2413.06
>>01:55:01 PM    281.73     76.35    205.37    711.84   2389.86
>>
>>
>>
>[snip]
>
>


--
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com
PostgreSQL Replicator -- production quality replication for PostgreSQL



Re: Postgres on RAID5

From: "Alexander Kirpa"
Hi Arshavir Grigorian,

0. If possible move to 8.0.1 - the bgwriter will help you

1. Create a RAID1 for the redo log (WAL) and place its drives on a
separate SCSI channel (see the sketch after this list)

2. Update postgresql.conf:
shared_buffers = 10000-50000
work_mem             = 100000-300000
maintenance_work_mem = 100000-300000
max_fsm_pages = 1500000
max_fsm_relations = 16000
wal_buffers = 32
checkpoint_segments = 32 # 16MB each !!
checkpoint_timeout = 600
checkpoint_warning = 60
effective_cache_size = 128000
random_page_cost = 3
default_statistics_target = 100
log_min_error_statement = warning
log_min_duration_statement = 1000 # for logging long SQL

3. If possible migrate from RAID5 to RAID10.

4. Add (if needed) 2 new drives for the OS and use ALL
20 x 9GB drives for DB storage.

5. Remove the CDROM from the working configuration and start using
this SCSI channel.
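
For item 1, once the RAID1 exists, moving the WAL onto it is usually just a
symlink (a sketch only; /dev/md2, the mount point and the Debian 7.4 data
directory are guesses, so substitute the real paths, and the cluster must
be stopped first):

# /etc/init.d/postgresql stop
# mkfs.ext3 /dev/md2
# mkdir -p /var/lib/postgres/wal && mount /dev/md2 /var/lib/postgres/wal
# mv /var/lib/postgres/data/pg_xlog /var/lib/postgres/wal/pg_xlog
# ln -s /var/lib/postgres/wal/pg_xlog /var/lib/postgres/data/pg_xlog
# chown -R postgres:postgres /var/lib/postgres/wal
# /etc/init.d/postgresql start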

Best regards,
 Alexander Kirpa


Re: Postgres on RAID5

From: Arshavir Grigorian
Josh Berkus wrote:
> A,
>
>
>>This is a Sun e450 with dual TI UltraSparc II processors and 2G of RAM.
>>It is currently running Debian Sarge with a 2.4.27-sparc64-smp custom
>>compiled kernel. Postgres is installed from the Debian package and uses
>>all the configuration defaults.
>
>
> Please read http://www.powerpostgresql.com/PerfList
>
I have read that document. Very informative/useful. Thanks.

Re: Postgres on RAID5

From: Arshavir Grigorian
Alex Turner wrote:
> I would recommend running a bonnie++ benchmark on your array to see if
> it's the array/controller/raid being crap, or whether it's postgres.  I
> have had some very surprising results from arrays that theoretically
> should be fast, but turned out to be very slow.
>
> I would also seriously have to recommend against a 14 drive RAID 5!
> This is statistically as likely to fail as a 7 drive RAID 0 (not
> counting the spare, but rebuilding a spare is very hard on existing
> drives).

Thanks for the reply.

Here are the results of the bonnie test on my array:

./bonnie -s 10000 -d . > oo 2>&1
File './Bonnie.23736', size: 10485760000
Writing with putc()...done
Rewriting...done
Writing intelligently...done
Reading with getc()...done
Reading intelligently...done
Seeker 1...Seeker 2...Seeker 3...start 'em...done...done...done...
        -------Sequential Output-------- ---Sequential Input-- --Random--
        -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
     MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
  10000 4762  96.0 46140 78.8 31180 61.0 3810  99.9 71586 67.7 411.8 13.1

On a different note, I am not sure how the probability of RAID5 over 15
disks failing is the same as that of a RAID0 array over 7 disks. RAID5
can operate in a degraded mode (14 disks - 1 bad), RAID0 on the other
hand cannot operate on 6 disks (6 disks - 1 bad). Am I missing something?

Are you suggesting running RAID0 on a set of 2 RAID1 arrays of 7 disks
each? That would work fine, except I cannot afford to lose that much space.

Care to comment on these numbers? Thanks.



Arshavir

Re: Postgres on RAID5

From: Alex Turner
Actually my statistics were off a bit, I realised - the chance of failure
for one drive is 1 in X.  The chance of failure in RAID 0 is 7 in X; the
chance of one drive failure in a 14 drive RAID 5 is 14 in X, 13 in X for
the second drive, so the total probability is 182 in X*X, which is much
lower than RAID 0.

Your drive performance is less than stellar for a 14 drive stripe, and
CPU usage for writes is very high.  Even so - this should be enough
throughput to get over 100 rows/sec assuming you have virtually no
stored procs (I have noticed that stored procs in plpgsql REALLY slow
PostgreSQL down).

Alex Turner
netEconomist

On Mon, 14 Mar 2005 15:54:34 -0500, Arshavir Grigorian <ag@m-cam.com> wrote:
> Alex Turner wrote:
> > I would recommend running a bonnie++ benchmark on your array to see if
> > it's the array/controller/raid being crap, or whether it's postgres.  I
> > have had some very surprising results from arrays that theoretically
> > should be fast, but turned out to be very slow.
> >
> > I would also seriously have to recommend against a 14 drive RAID 5!
> > This is statistically as likely to fail as a 7 drive RAID 0 (not
> > counting the spare, but rebuilding a spare is very hard on existing
> > drives).
>
> Thanks for the reply.
>
> Here are the results of the bonnie test on my array:
>
> ./bonnie -s 10000 -d . > oo 2>&1
> File './Bonnie.23736', size: 10485760000
> Writing with putc()...done
> Rewriting...done
> Writing intelligently...done
> Reading with getc()...done
> Reading intelligently...done
> Seeker 1...Seeker 2...Seeker 3...start 'em...done...done...done...
>         -------Sequential Output-------- ---Sequential Input-- --Random--
>         -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
>      MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
>   10000 4762  96.0 46140 78.8 31180 61.0 3810  99.9 71586 67.7 411.8 13.1
>
> On a different note, I am not sure how the probability of RAID5 over 15
> disks failing is the same as that of a RAID0 array over 7 disks. RAID5
> can operate in a degraded mode (14 disks - 1 bad), RAID0 on the other
> hand cannot operate on 6 disks (6 disks - 1 bad). Am I missing something?
>
> Are you suggesting running RAID0 on a set of 2 RAID1 arrays of 7 disks
> each? That would work fine, except I cannot afford to lose that much space.
>
> Care to comment on these numbers? Thanks.
>
>
> Arshavir
>