Thread: Postgres on RAID5

Postgres on RAID5

From: Arshavir Grigorian

Hi,

I have a RAID5 array (mdadm) with 14 disks + 1 spare. This partition has
an Ext3 filesystem which is used by Postgres. Currently we are loading a
50G database on this server from a Postgres dump (copy, not insert) and
are experiencing very slow write performance (35 records per second).

Top shows that the Postgres process (postmaster) is constantly being put
into D state for extended periods of time (2-3 seconds), which I assume
is because it's waiting for disk I/O. I have just started gathering
system statistics, and here is what sar -b shows (this is while the db
is being loaded with pg_restore):

             tps        rtps     wtps      bread/s  bwrtn/s
01:35:01 PM    275.77     76.12    199.66    709.59   2315.23
01:45:01 PM    287.25     75.56    211.69    706.52   2413.06
01:55:01 PM    281.73     76.35    205.37    711.84   2389.86
02:05:01 PM    282.83     76.14    206.69    720.85   2418.51
02:15:01 PM    284.07     76.15    207.92    707.38   2443.60
02:25:01 PM    265.46     75.91    189.55    708.87   2089.21
02:35:01 PM    285.21     76.02    209.19    709.58   2446.46
Average:       280.33     76.04    204.30    710.66   2359.47

This is a Sun e450 with dual TI UltraSparc II processors and 2G of RAM.
It is currently running Debian Sarge with a 2.4.27-sparc64-smp custom
compiled kernel. Postgres is installed from the Debian package and uses
all the configuration defaults.

I am also copying the pgsql-performance list.

Thanks in advance for any advice/pointers.


Arshavir

Following is some other info that might be helpful.

/proc/scsi# mdadm -D /dev/md1
/dev/md1:
         Version : 00.90.00
   Creation Time : Wed Feb 23 17:23:41 2005
      Raid Level : raid5
      Array Size : 123823616 (118.09 GiB 126.80 GB)
     Device Size : 8844544 (8.43 GiB 9.06 GB)
    Raid Devices : 15
   Total Devices : 17
Preferred Minor : 1
     Persistence : Superblock is persistent

     Update Time : Thu Feb 24 10:05:38 2005
           State : active
  Active Devices : 15
Working Devices : 16
  Failed Devices : 1
   Spare Devices : 1

          Layout : left-symmetric
      Chunk Size : 64K

            UUID : 81ae2c97:06fa4f4d:87bfc6c9:2ee516df
          Events : 0.8

     Number   Major   Minor   RaidDevice State
        0       8       64        0      active sync   /dev/sde
        1       8       80        1      active sync   /dev/sdf
        2       8       96        2      active sync   /dev/sdg
        3       8      112        3      active sync   /dev/sdh
        4       8      128        4      active sync   /dev/sdi
        5       8      144        5      active sync   /dev/sdj
        6       8      160        6      active sync   /dev/sdk
        7       8      176        7      active sync   /dev/sdl
        8       8      192        8      active sync   /dev/sdm
        9       8      208        9      active sync   /dev/sdn
       10       8      224       10      active sync   /dev/sdo
       11       8      240       11      active sync   /dev/sdp
       12      65        0       12      active sync   /dev/sdq
       13      65       16       13      active sync   /dev/sdr
       14      65       32       14      active sync   /dev/sds

       15      65       48       15      spare   /dev/sdt

# dumpe2fs -h /dev/md1
dumpe2fs 1.35 (28-Feb-2004)
Filesystem volume name:   <none>
Last mounted on:          <not available>
Filesystem UUID:          1bb95bd6-94c7-4344-adf2-8414cadae6fc
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal dir_index needs_recovery large_file
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              15482880
Block count:              30955904
Reserved block count:     1547795
Free blocks:              28767226
Free inodes:              15482502
First block:              0
Block size:               4096
Fragment size:            4096
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         16384
Inode blocks per group:   512
Filesystem created:       Wed Feb 23 17:27:13 2005
Last mount time:          Wed Feb 23 17:45:25 2005
Last write time:          Wed Feb 23 17:45:25 2005
Mount count:              2
Maximum mount count:      28
Last checked:             Wed Feb 23 17:27:13 2005
Check interval:           15552000 (6 months)
Next check after:         Mon Aug 22 18:27:13 2005
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               128
Journal inode:            8
Default directory hash:   tea
Directory Hash Seed:      c35c0226-3b52-4dad-b102-f22feb773592
Journal backup:           inode blocks

# lspci | grep SCSI
0000:00:03.0 SCSI storage controller: LSI Logic / Symbios Logic 53c875 (rev 14)
0000:00:03.1 SCSI storage controller: LSI Logic / Symbios Logic 53c875 (rev 14)
0000:00:04.0 SCSI storage controller: LSI Logic / Symbios Logic 53c875 (rev 14)
0000:00:04.1 SCSI storage controller: LSI Logic / Symbios Logic 53c875 (rev 14)
0000:04:02.0 SCSI storage controller: LSI Logic / Symbios Logic 53c875 (rev 03)
0000:04:03.0 SCSI storage controller: LSI Logic / Symbios Logic 53c875 (rev 03)

/proc/scsi# more scsi
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
   Vendor: SEAGATE  Model: ST39103LCSUN9.0G Rev: 034A
   Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 01 Lun: 00
   Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
   Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi0 Channel: 00 Id: 02 Lun: 00
   Vendor: SEAGATE  Model: ST39103LCSUN9.0G Rev: 034A
   Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 03 Lun: 00
   Vendor: SEAGATE  Model: ST39103LCSUN9.0G Rev: 034A
   Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi1 Channel: 00 Id: 00 Lun: 00
   Vendor: SEAGATE  Model: ST39103LCSUN9.0G Rev: 034A
   Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi1 Channel: 00 Id: 01 Lun: 00
   Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
   Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi1 Channel: 00 Id: 02 Lun: 00
   Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
   Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi1 Channel: 00 Id: 03 Lun: 00
   Vendor: SEAGATE  Model: ST39103LCSUN9.0G Rev: 034A
   Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi2 Channel: 00 Id: 00 Lun: 00
   Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
   Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi2 Channel: 00 Id: 01 Lun: 00
   Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
   Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi2 Channel: 00 Id: 02 Lun: 00
   Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
   Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi2 Channel: 00 Id: 03 Lun: 00
   Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
   Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi3 Channel: 00 Id: 00 Lun: 00
   Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
   Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi3 Channel: 00 Id: 01 Lun: 00
   Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
   Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi3 Channel: 00 Id: 02 Lun: 00
   Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
   Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi3 Channel: 00 Id: 03 Lun: 00
   Vendor: SEAGATE  Model: ST39204LCSUN9.0G Rev: 4207
   Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi4 Channel: 00 Id: 06 Lun: 00
   Vendor: TOSHIBA  Model: XM6201TASUN32XCD Rev: 1103
   Type:   CD-ROM                           ANSI SCSI revision: 02
Host: scsi5 Channel: 00 Id: 00 Lun: 00
   Vendor: FUJITSU  Model: MAG3091L SUN9.0G Rev: 1111
   Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi5 Channel: 00 Id: 01 Lun: 00
   Vendor: FUJITSU  Model: MAG3091L SUN9.0G Rev: 1111
   Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi5 Channel: 00 Id: 02 Lun: 00
   Vendor: FUJITSU  Model: MAG3091L SUN9.0G Rev: 1111
   Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi5 Channel: 00 Id: 03 Lun: 00
   Vendor: FUJITSU  Model: MAG3091L SUN9.0G Rev: 1111
   Type:   Direct-Access                    ANSI SCSI revision: 02






--
Arshavir Grigorian
Systems Administrator/Engineer

Re: Postgres on RAID5

From: Greg Stark

Arshavir Grigorian <ag@m-cam.com> writes:

> Hi,
>
> I have a RAID5 array (mdadm) with 14 disks + 1 spare. This partition has an
> Ext3 filesystem which is used by Postgres.

People are going to suggest moving to RAID1+0. I'm unconvinced that RAID5
across 14 drives shouldn't be able to keep up with RAID1 across 7 drives,
though. It would be interesting to see empirical data.

One thing that does scare me is the Postgres transaction log and the ext3
journal both sharing these disks with the data. Ideally both of these things
should get (mirrored) disks of their own separate from the data files.

But 2-3s pauses seem disturbing. I wonder whether ext3 is issuing a cache
flush on every fsync to get the journal pushed out. This is a new linux
feature that's necessary with ide but shouldn't be necessary with scsi.

It would be interesting to know whether postgres performs differently with
fsync=off. This would even be a reasonable mode to run under for initial
database loads. It shouldn't make much of a difference with hardware like this
though. And you should be aware that running under this mode in production
would put your data at risk.
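
For the load itself that's only a couple of lines; a rough sketch (the
conf path and database/dump names below are placeholders, not taken from
this setup):

# disable fsync only for the bulk load, then turn it straight back on
CONF=/etc/postgresql/postgresql.conf       # adjust for your layout
echo "fsync = off" >> $CONF
/etc/init.d/postgresql restart
pg_restore -d mydb /tmp/mydb.dump          # placeholder db name and dump file
sed -i 's/^fsync = off/fsync = on/' $CONF
/etc/init.d/postgresql restart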

--
greg

Re: Postgres on RAID5

From: Alex Turner

A 14 drive stripe will max out the PCI bus long before anything else;
the only reason for a stripe this size is to get the total accessible
size up.  A 6 drive RAID 10 on a good controller can get up to
400MB/sec, which is pushing the limit of the PCI bus (taken from
official 3ware 9500S-8MI benchmarks).  14 drives is not going to beat
6 drives because you've run out of bandwidth on the PCI bus.
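
To put rough numbers on that (theoretical bus ceilings; the 30MB/sec
per-drive figure is just a ballpark, not measured on this hardware):

# theoretical peak of the shared bus vs what 14 spindles could feed it
echo "32-bit/33MHz PCI: $((32 / 8 * 33)) MB/s"      # ~133 MB/s at 33.33 MHz
echo "64-bit/66MHz PCI: $((64 / 8 * 66)) MB/s"      # ~533 MB/s at 66.67 MHz
echo "14 drives @ ~30MB/s each: $((14 * 30)) MB/s"  # already past plain PCI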

The debate on RAID 5 rages onward.  The benchmarks I've seen suggest
that RAID 5 is consistently slower than RAID 10 with the same number
of drives, but others suggest that RAID 5 can be much faster than
RAID 10 (see arstechnica.com).  (Theoretical performance of RAID 5 is
in line with a RAID 0 stripe of N-1 drives; RAID 10 has only N/2
drives in a stripe, so performance should be nearly double - in
theory of course.)

35 Trans/sec is pretty slow, particularly if they are only one row at
a time.  I typically get 200-400/sec on our DB server on a bad day.  Up
to 1100 on a fresh database.

I suggested running a bonnie benchmark, or some other IO perftest to
determine if it's the array itself performing badly, or if there is
something wrong with postgresql.

If the array isn't kicking out at least 50MB/sec read/write
performance, something is wrong.
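
For example (paths and sizes below are placeholders; the main thing is
to use a test file bigger than the 2G of RAM so the cache can't hide
the disks):

cd /mnt/raid                                          # wherever the array is mounted
time dd if=/dev/zero of=ddtest bs=8k count=1000000    # ~8GB sequential write
sync
time dd if=ddtest of=/dev/null bs=8k                  # sequential read back
rm ddtest
# or bonnie++ as a non-root user with a working set larger than RAM
bonnie++ -d /mnt/raid -s 8192 -u postgres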

Until you've isolated the problem to either postgres or the array,
everything else is simply speculation.

In a perfect world, you would have two 6 drive RAID 10s on two PCI
busses, with system tables on a third partition and archive logging on
a fourth.  Unsurprisingly, this looks a lot like the Oracle recommended
minimum config.
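
With md that's just nested arrays; e.g. one 6 drive RAID 10 built as
striped mirrors (device names here are placeholders, not the poster's
disks):

mdadm --create /dev/md10 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
mdadm --create /dev/md11 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1
mdadm --create /dev/md12 --level=1 --raid-devices=2 /dev/sde1 /dev/sdf1
# stripe across the three mirrors
mdadm --create /dev/md13 --level=0 --raid-devices=3 /dev/md10 /dev/md11 /dev/md12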

Also a note for interest is that this is _software_ raid...

Alex Turner
netEconomist

On 13 Mar 2005 23:36:13 -0500, Greg Stark <gsstark@mit.edu> wrote:
>
> [...]

Re: Postgres on RAID5

From: Greg Stark

Alex Turner <armtuk@gmail.com> writes:

> a 14 drive stripe will max out the PCI bus long before anything else,

Hopefully anyone with a 14 drive stripe is using some combination of 64-bit
PCI-X cards running at 66MHz...

> the only reason for a stripe this size is to get a total accessible
> size up.

Well, having many drives also cuts average latency. So even if you have no
need for more bandwidth, you still benefit from a lower average response
time by adding more drives.

--
greg

Re: Postgres on RAID5

From: Arshavir Grigorian

Alex Turner wrote:
> A 14 drive stripe will max out the PCI bus long before anything else;
> the only reason for a stripe this size is to get the total accessible
> size up.  A 6 drive RAID 10 on a good controller can get up to
> 400MB/sec, which is pushing the limit of the PCI bus (taken from
> official 3ware 9500S-8MI benchmarks).  14 drives is not going to beat
> 6 drives because you've run out of bandwidth on the PCI bus.
>
> The debate on RAID 5 rages onward.  The benchmarks I've seen suggest
> that RAID 5 is consistently slower than RAID 10 with the same number
> of drives, but others suggest that RAID 5 can be much faster than
> RAID 10 (see arstechnica.com).  (Theoretical performance of RAID 5 is
> in line with a RAID 0 stripe of N-1 drives; RAID 10 has only N/2
> drives in a stripe, so performance should be nearly double - in
> theory of course.)
>
> 35 Trans/sec is pretty slow, particularly if they are only one row at
> a time.  I typically get 200-400/sec on our DB server on a bad day.  Up
> to 1100 on a fresh database.

Well, by putting the pg_xlog directory on a separate disk/partition, I
was able to increase this rate to about 50 or so per second (still
pretty far from your numbers). Next I am going to try putting the
pg_xlog on a RAID1+0 array and see if that helps.
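
(For reference, the move itself was just the usual stop/move/symlink
dance -- paths below are illustrative rather than my exact layout:)

/etc/init.d/postgresql stop
mv /var/lib/postgres/data/pg_xlog /mnt/xlogdisk/pg_xlog
ln -s /mnt/xlogdisk/pg_xlog /var/lib/postgres/data/pg_xlog
chown -h postgres:postgres /var/lib/postgres/data/pg_xlog
/etc/init.d/postgresql start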

> I suggested running a bonnie benchmark, or some other IO perftest to
> determine if it's the array itself performing badly, or if there is
> something wrong with postgresql.
>
> If the array isn't kicking out at least 50MB/sec read/write
> performance, something is wrong.
>
> Until you've isolated the problem to either postgres or the array,
> everything else is simply speculation.
>
> In a perfect world, you would have two 6 drive RAID 10s on two PCI
> busses, with system tables on a third partition and archive logging on
> a fourth.  Unsurprisingly, this looks a lot like the Oracle recommended
> minimum config.

Could you please elaborate on this setup a little more? How do you put
system tables on a separate partition? I am still using version 7, and
without tablespaces (which is how Oracle controls this), I can't figure
out how to put different tables on different partitions. Thanks.


Arshavir



> Also a note for interest is that this is _software_ raid...
>
> Alex Turner
> netEconomist
>
> On 13 Mar 2005 23:36:13 -0500, Greg Stark <gsstark@mit.edu> wrote:
>
>> [...]


--
Arshavir Grigorian
Systems Administrator/Engineer
M-CAM, Inc.
ag@m-cam.com
+1 703-682-0570 ext. 432
Contents Confidential

Re: Postgres on RAID5

From: "Jim Buttafuoco"

All,

I have a 13 disk (250G each) software raid 5 set using one 16-port Adaptec
SATA controller.  I am very happy with the performance.  The reason I went
with the 13 disk raid 5 set was for the space, NOT performance.  I have a
single postgresql database that is over 2 TB with about 500 GB free on the
disk.  This raid set performs about the same as my ICP SCSI raid controller
(also with raid 5).

That said, now that postgresql 8 has tablespaces, I would NOT create 1
single raid 5 set, but 3 smaller sets.  I also do NOT have my WAL and logs
on this raid set, but on a smaller 2 disk mirror.
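
With 8.0 the split is only a few statements; roughly (mount points and
names below are made up for illustration):

# each smaller raid set gets its own mount point owned by postgres
mkdir -p /raid_a/pgdata && chown postgres:postgres /raid_a/pgdata
psql -d mydb -c "CREATE TABLESPACE raid_a LOCATION '/raid_a/pgdata';"
psql -d mydb -c "CREATE TABLE big_table (id int) TABLESPACE raid_a;"
# existing tables can be moved with ALTER TABLE ... SET TABLESPACE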

Jim

---------- Original Message -----------
From: Greg Stark <gsstark@mit.edu>
To: Alex Turner <armtuk@gmail.com>
Cc: Greg Stark <gsstark@mit.edu>, Arshavir Grigorian <ag@m-cam.com>, linux-raid@vger.kernel.org,
pgsql-performance@postgresql.org
Sent: 14 Mar 2005 15:17:11 -0500
Subject: Re: [PERFORM] Postgres on RAID5

> [...]
------- End of Original Message -------


Re: Postgres on RAID5

From: "Merlin Moncure"

Alex Turner wrote:
> 35 Trans/sec is pretty slow, particularly if they are only one row at
> a time.  I typically get 200-400/sec on our DB server on a bad day.  Up
> to 1100 on a fresh database.

Well, don't rule out that his raid controller is not caching his writes.
His WAL sync method may be overriding his raid cache policy and flushing
his writes to disk, always.  Win32 has the same problem, and before
Magnus's O_DIRECT patch, there was no way to easily work around it
without turning fsync off.  I'd suggest playing with different WAL sync
methods before trying anything else.
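
It's a one-line change per test in postgresql.conf; the usual candidates
(what's actually available depends on the platform) are sketched below.

# try each in turn, restart, and re-run the load:
#   wal_sync_method = fsync
#   wal_sync_method = fdatasync
#   wal_sync_method = open_sync      # uses O_SYNC
#   wal_sync_method = open_datasync  # uses O_DSYNC, where available
/etc/init.d/postgresql restart       # Debian-style restart, adjust as needed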

Merlin

Re: Postgres on RAID5

From: Alex Turner

He doesn't have a RAID controller, it's software RAID...

Alex Turner
netEconomist


On Mon, 14 Mar 2005 16:18:00 -0500, Merlin Moncure
<merlin.moncure@rcsonline.com> wrote:
> [...]

Re: Postgres on RAID5

From: David Dougall

In my experience, if you are concerned about filesystem performance, don't
use ext3.  It is one of the slowest filesystems I have ever used
especially for writes.  I would suggest either reiserfs or xfs.
--David Dougall


On Fri, 11 Mar 2005, Arshavir Grigorian wrote:

> Hi,
>
> I have a RAID5 array (mdadm) with 14 disks + 1 spare. This partition has
> an Ext3 filesystem which is used by Postgres. Currently we are loading a
> 50G database on this server from a Postgres dump (copy, not insert) and
> are experiencing very slow write performance (35 records per second).
>
> [...]

Re: Postgres on RAID5

From: Michael Tokarev

David Dougall wrote:
> In my experience, if you are concerned about filesystem performance, don't
> use ext3.  It is one of the slowest filesystems I have ever used
> especially for writes.  I would suggest either reiserfs or xfs.

I'm a bit afraid to start yet another filesystem flamewar, but: please
don't make such claims without providing actual numbers and config
details.  Pretty please.

ext3 performs well for databases; there's no reason for it to be
slow.  OK, enable data=journal and use it with e.g. Oracle - you will
see it is slow.  But in that case it isn't the filesystem to blame,
it's operator error, simple as that.
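
For a database partition the sane ext3 choices are just the defaults
plus noatime; an illustrative fstab line (device and mount point made
up for the example):

# in /etc/fstab:
#   /dev/md1  /var/lib/postgres  ext3  noatime,data=ordered  0  2
# data=ordered is the ext3 default; data=journal is the mode that makes
# ext3 look slow for this kind of load.  noatime can be tested right away:
mount -o remount,noatime /var/lib/postgres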

And especially reiserfs, with its tail packing enabled by default,
is NOT suitable for databases...

/mjt

Re: Postgres on RAID5

From: Michael Tokarev

Arshavir Grigorian wrote:
> Alex Turner wrote:
>
[]
> Well, by putting the pg_xlog directory on a separate disk/partition, I
> was able to increase this rate to about 50 or so per second (still
> pretty far from your numbers). Next I am going to try putting the
> pg_xlog on a RAID1+0 array and see if that helps.

pg_xlog is written synchronously, right?  It should be, or else the
reliability of the database would be in serious question...

I posted a question on Feb-22 here in linux-raid, titled "*terrible*
direct-write performance with raid5".  There's a problem with write
performance of a raid4/5/6 array, which is due to the design.

Consider a raid5 array (raid4 will be exactly the same, and for raid6,
just double the parity writes) with N data blocks and 1 parity block
per stripe.  When a portion of data is written, the parity block must
be updated too, to stay consistent and recoverable.  And here, the
size of the write plays a very significant role.  If your write size
is smaller than chunk_size*N (N = number of data blocks in a stripe),
then in order to calculate correct parity you have to read data from
the remaining drives.  The only case where you don't need to read
data from other drives is when you're writing exactly chunk_size*N,
AND the write is stripe-aligned.  By default, chunk_size is 64Kb (min
is 4Kb).  So the only reasonable direct-write size for N data drives
is 64Kb*N, or else the raid code will have to read the "missing" data
to calculate the parity block.  Of course, in 99% of cases you're
writing in much smaller sizes, say 4Kb or so.  And here, the more
drives you have, the LESS write speed you will have.
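
For the array in this thread the numbers look roughly like this (simple
arithmetic only, assuming 15 raid devices means 14 data chunks + 1
parity chunk per stripe):

chunk_kb=64; data_disks=14
echo "full stripe:    $((chunk_kb * data_disks)) KB"   # 896 KB
echo "postgres block: 8 KB"
# an 8Kb synchronous write covers a tiny part of the stripe, so md has to
# read before it can write the new parity -- hence the poor small-write speed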

When using the O/S buffer and filesystem cache, the system has many
more chances to re-order requests and sometimes even omit reading
entirely (when you perform many sequential writes, for example,
without a sync in between), so buffered writes can be much faster.
But not direct or synchronous writes, again especially when you're
doing a lot of sequential writes...

So to me it looks like an inherent problem of the raid5 architecture
with respect to database-like workloads -- databases tend to use
synchronous or direct writes to ensure good data consistency.

For pgsql, which (I don't know for sure, but reportedly) uses
synchronous writes only for the transaction log, it is a good idea to
put that log on a raid1 or raid10 array, but NOT on a raid5 array.
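
E.g. a small dedicated mirror for pg_xlog is enough (device names below
are examples only, not from the posted config):

mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdb1 /dev/sdc1
mke2fs -j /dev/md2                      # ext3 on the new mirror
mkdir /xlog && mount /dev/md2 /xlog
# then stop postgres, move pg_xlog onto /xlog and symlink it back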

Just IMHO, of course.

/mjt

Re: Postgres on RAID5

From: "Guy"

You said:
"If your write size is smaller than chunk_size*N (N = number of data blocks
in a stripe), in order to calculate correct parity you have to read data
from the remaining drives."

Neil explained it in this message:
http://marc.theaimsgroup.com/?l=linux-raid&m=108682190730593&w=2

Guy

-----Original Message-----
From: linux-raid-owner@vger.kernel.org
[mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Michael Tokarev
Sent: Monday, March 14, 2005 5:47 PM
To: Arshavir Grigorian
Cc: linux-raid@vger.kernel.org; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Postgres on RAID5

[...]


Re: Postgres on RAID5 (possible sync blocking read type

From: David Greaves

Greg Stark wrote:

> [...]
>But 2-3s pauses seem disturbing. I wonder whether ext3 is issuing a cache
>flush on every fsync to get the journal pushed out. This is a new linux
>feature that's necessary with ide but shouldn't be necessary with scsi.
>
>It would be interesting to know whether postgres performs differently with
>fsync=off. This would even be a reasonable mode to run under for initial
>database loads. It shouldn't make much of a difference with hardware like this
>though. And you should be aware that running under this mode in production
>would put your data at risk.
>
Hi
I'm coming in from the raid list so I didn't get the full story.

May I ask what kernel?

I only ask because I upgraded to 2.6.11.2 and happened to be watching
xosview on my (probably) completely different setup (1TB xfs/lvm2/raid5
served by nfs to a remote sustained read/write app), when I saw all read
activity cease for 2-3 seconds whilst the disk wrote, then disk reads
resumed.  This occurred repeatedly during a read/edit/write of a 3GB file.

Performance isn't critical here, so it's on the "hmm, that's odd" todo
list :)

David