Thread: What exactly is postgres doing during INSERT/UPDATE ?
If I run " dd if=/dev/zero bs=1024k of=file count=1000 " iostat shows me: Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 671.50 88.00 113496.00 176 226992 However postgres 8.3.7 doing a bulk data write (a slony slave, doing inserts and updates) doesn't go nearly as fast: Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 418.41 648.76 7052.74 1304 14176 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 237.50 44.00 3668.00 88 7336 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 221.50 444.00 3832.00 888 7664 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 166.00 248.00 3360.00 496 6720 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 163.00 480.00 3184.00 960 6368 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 102.50 724.00 1736.00 1448 3472 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 295.50 712.00 6004.00 1424 12008 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 109.45 433.83 2260.70 872 4544 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 180.00 640.00 3512.00 1280 7024 top shows the cpu usage of the pg process ranges from zero to never more than ten percent of a cpu, and that one cpu is always ninety some odd percent in iowait. So what is postgres doing (with fsync off) that causes the cpu to spend so much time in iowait? This is a 64 bit amd linux system with ext3 filesystem. free shows: total used free shared buffers cached Mem: 8116992 8085848 31144 0 103016 3098568 -/+ buffers/cache: 4884264 3232728 Swap: 6697296 2035508 4661788
> top shows the cpu usage of the pg process ranges from zero to never more
> than ten percent of a cpu, and that one cpu is always ninety some odd
> percent in iowait. So what is postgres doing (with fsync off) that
> causes the cpu to spend so much time in iowait?

Updating indexes?
On Fri, 28 Aug 2009, Joseph S wrote:

> If I run "dd if=/dev/zero bs=1024k of=file count=1000" iostat shows me:
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda             671.50        88.00    113496.00        176     226992

That's the sequential transfer rate of your drive. It's easier to present these numbers if you use "vmstat 1" instead; that shows the I/O in more useful units, and with the CPU stats on the same line.

> However postgres 8.3.7 doing a bulk data write (a slony slave, doing inserts
> and updates) doesn't go nearly as fast:

In PostgreSQL, an update is:

1) A read of the old data
2) Writing out the updated data
3) Marking the original data as dead
4) Updating any indexes involved
5) Later cleaning up after the now-dead row

On top of that, Slony may need to do its own metadata updates.

This sort of workload involves random I/O rather than sequential. On regular hard drives this normally happens at a tiny fraction of the speed because of how the disk has to seek around. Typically a single drive capable of 50-100MB/s on sequential I/O will only do 1-2MB/s on a completely random workload. You look like you're getting somewhere in the middle there, on the low side which doesn't surprise me.

The main two things you can do to improve this on the database side:

-Increase checkpoint_segments, which reduces how often updated data has to be flushed to disk

-Increase shared_buffers in order to hold more of the working set of data in RAM, so that more reads are satisfied by the database cache and less data gets evicted to disk.

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
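A rough sketch of where those two settings live; the parameter names are real 8.3 settings, but the values below are placeholders for illustration, not recommendations from the thread:

    # check the current values
    psql -c "SHOW checkpoint_segments"
    psql -c "SHOW shared_buffers"

    # postgresql.conf -- larger values spread checkpoints out and cache more of
    # the working set (shared_buffers needs a server restart to take effect)
    #   checkpoint_segments = 32        # default is 3
    #   shared_buffers = 2GB            # default is far smaller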
On Fri, Aug 28, 2009 at 2:08 AM, Greg Smith<gsmith@gregsmith.com> wrote:
>
> This sort of workload involves random I/O rather than sequential. On
> regular hard drives this normally happens at a tiny fraction of the speed
> because of how the disk has to seek around. Typically a single drive
> capable of 50-100MB/s on sequential I/O will only do 1-2MB/s on a completely
> random workload. You look like you're getting somewhere in the middle
> there, on the low side which doesn't surprise me.
>
> The main two things you can do to improve this on the database side:
>
> -Increase checkpoint_segments, which reduces how often updated data has to
> be flushed to disk
>
> -Increase shared_buffers in order to hold more of the working set of data in
> RAM, so that more reads are satisfied by the database cache and less data
> gets evicted to disk.

After that you have to start looking at hardware. Something as simple as a different drive for indexes, another for WAL, and another for the base tables can make a big difference.
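One common way to express the table/index split is with tablespaces; a sketch only, with made-up mount point, tablespace, table and index names:

    # a separate array mounted at /mnt/pg_index (hypothetical path)
    mkdir -p /mnt/pg_index/tblspc
    chown postgres:postgres /mnt/pg_index/tblspc

    psql -c "CREATE TABLESPACE indexspace LOCATION '/mnt/pg_index/tblspc'"
    psql -c "CREATE INDEX foo_bar_idx ON foo (bar) TABLESPACE indexspace"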
Greg Smith wrote:
> The main two things you can do to improve this on the database side:
>
> -Increase checkpoint_segments, which reduces how often updated data has
> to be flushed to disk

If fsync is turned off, does this matter so much?
Scott Marlowe wrote:
> After that you have to start looking at hardware. Something as
> simple as a different drive for indexes and another for WAL, and
> another for the base tables can make a big difference.

If I have 14 drives in a RAID 10 to split between data tables and indexes, what would be the best way to allocate the drives for performance?
---------- Forwarded message ----------
From: Joseph S <jks@selectacast.net>
To: Greg Smith <gsmith@gregsmith.com>, pgsql-performance@postgresql.org
Date: Fri, 28 Aug 2009 10:25:10 -0400
Subject: Re: What exactly is postgres doing during INSERT/UPDATE ?
Greg Smith wrote:
> The main two things you can do to improve this on the database side:
>
> -Increase checkpoint_segments, which reduces how often updated data has to be flushed to disk
If fsync is turned off, does this matter so much?
It still matters. The kernel is only willing to have so much dirty data sitting in the disk cache. Once it reaches that limit, user processes doing writes start blocking while the kernel flushes stuff on their behalf.
Jeff
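The limit Jeff is describing is the kernel's dirty-page threshold; a sketch of how to inspect it on Linux of that era (the sysctl names are standard, the watch interval is an arbitrary choice):

    # how much dirty data the kernel will buffer before throttling writers
    sysctl vm.dirty_ratio vm.dirty_background_ratio

    # watch dirty data pile up while the bulk load runs
    watch -n1 'grep -E "Dirty|Writeback" /proc/meminfo'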
Joseph S wrote
> If I have 14 drives in a RAID 10 to split between data tables
> and indexes what would be the best way to allocate the drives
> for performance?

RAID-5 can be much faster than RAID-10 for random reads and writes. It is much slower than RAID-10 for sequential writes, but about the same for sequential reads. For typical access patterns, I would put the data and indexes on RAID-5 unless you expect there to be lots of sequential scans.

If you do this, you can drop the random_page_cost from the default 4.0 to 1.0. That should also encourage postgres to use the index more often. I think the default costs for postgres assume that the data is on a RAID-1 array. Either that, or they are a compromise that isn't quite right for any system. On a plain old disk the random_page_cost should be 8.0 or 10.0.

The division of the drives into two arrays would depend on how much space will be occupied by the tables vs the indexes. This is very specific to your database. For example, if indexes take half as much space as tables, then you want 2/3rds for tables and 1/3rd for indexes: 8 drives for tables, 5 drives for indexes, and 1 for a hot standby. The smaller array may be a bit slower for some operations due to reduced parallelism. This also depends on the intelligence of your RAID controller.

Always put the transaction logs (WAL Files) on RAID-10 (or RAID-1 if you don't want to dedicate so many drives to the logs). The only significant performance difference between RAID-10 and RAID-1 is that RAID-1 is much slower (factor of 4 or 5) for random reads. I think the ratio of random reads from the transaction logs would typically be quite low. They are written sequentially and during checkpoint they are read sequentially. In the interim, the data is probably still in shared memory if it needs to be read.

You don't want your transaction logs or any swapfiles on RAID-5. The slow sequential write performance can be a killer.

-Luke
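Anyone experimenting with random_page_cost can try it per-session and compare plans before committing anything to postgresql.conf; a sketch only, with an invented value, table and column:

    # 1.5, some_table and some_indexed_col are placeholders, not recommendations
    psql -c "SET random_page_cost = 1.5; EXPLAIN ANALYZE SELECT * FROM some_table WHERE some_indexed_col = 42"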
On Sat, Aug 29, 2009 at 5:20 AM, Luke Koops<luke.koops@entrust.com> wrote:
> Joseph S wrote
>> If I have 14 drives in a RAID 10 to split between data tables
>> and indexes what would be the best way to allocate the drives
>> for performance?
>
> RAID-5 can be much faster than RAID-10 for random reads and writes. It is
> much slower than RAID-10 for sequential writes, but about the same for
> sequential reads. For typical access patterns, I would put the data and
> indexes on RAID-5 unless you expect there to be lots of sequential scans.

That's pretty much exactly backwards. RAID-5 will at best be slightly slower than RAID-0 or RAID-10 for sequential reads or random reads. For writes it performs *terribly*, especially for random writes. The only write pattern where it performs ok sometimes is sequential writes of large chunks.

> Always put the transaction logs (WAL Files) on RAID-10 (or RAID-1 if you
> don't want to dedicate so many drives to the logs). The only significant
> performance difference between RAID-10 and RAID-1 is that RAID-1 is much
> slower (factor of 4 or 5) for random reads.

no, RAID-10 and RAID-1 should perform the same for reads. RAID-10 will be slower at writes by about a factor equal to the number of mirror sides.

> I think the ratio of random reads from the transaction logs would typically
> be quite low.

During normal operation the logs are *never* read, neither randomly nor sequentially.

> You don't want your transaction logs or any swapfiles on RAID-5. The slow
> sequential write performance can be a killer.

As I mentioned, sequential writes are the only write case where RAID-5 is sometimes ok. However the picture is complicated by transaction syncing, which would make RAID-5 see it more as random i/o. In any case wal normally doesn't take much disk space so there's not much reason to use anything but RAID-1.

--
greg
http://mit.edu/~gsstark/resume.pdf
On Fri, Aug 28, 2009 at 8:19 PM, Jeff Janes<jeff.janes@gmail.com> wrote:
>> ---------- Forwarded message ----------
>> From: Joseph S <jks@selectacast.net>
>> To: Greg Smith <gsmith@gregsmith.com>, pgsql-performance@postgresql.org
>> Date: Fri, 28 Aug 2009 10:25:10 -0400
>> Subject: Re: What exactly is postgres doing during INSERT/UPDATE ?
>> Greg Smith wrote:
>>
>>> The main two things you can do to improve this on the database side:
>>>
>>> -Increase checkpoint_segments, which reduces how often updated data has
>>> to be flushed to disk
>>
>> If fsync is turned off, does this matter so much?
>
> It still matters. The kernel is only willing to have so much dirty data
> sitting in the disk cache. Once it reaches that limit, user processes doing
> writes start blocking while the kernel flushes stuff on their behalf.

it doesn't matter nearly as much though. if you are outrunning the
o/s write cache with fsync off, then it's time to start looking at new
hardware.

merlin
On Sat, Aug 29, 2009 at 2:46 AM, Greg Stark<gsstark@mit.edu> wrote:
> On Sat, Aug 29, 2009 at 5:20 AM, Luke Koops<luke.koops@entrust.com> wrote:
>> Joseph S wrote
>>> If I have 14 drives in a RAID 10 to split between data tables
>>> and indexes what would be the best way to allocate the drives
>>> for performance?
>>
>> RAID-5 can be much faster than RAID-10 for random reads and writes. It is
>> much slower than RAID-10 for sequential writes, but about the same for
>> sequential reads. For typical access patterns, I would put the data and
>> indexes on RAID-5 unless you expect there to be lots of sequential scans.
>
> That's pretty much exactly backwards. RAID-5 will at best be slightly
> slower than RAID-0 or RAID-10 for sequential reads or random reads.
> For writes it performs *terribly*, especially for random writes. The
> only write pattern where it performs ok sometimes is sequential writes
> of large chunks.

Note that while RAID-10 is theoretically always better than RAID-5, I've run into quite a few cheapie controllers that were heavily optimised for RAID-5 and de-optimised for RAID-10. However, if it's got battery backed cache and can run in JBOD mode, linux software RAID-10 or hybrid RAID-1 in hardware RAID-0 in software will almost always beat hardware RAID-5 on the same controller.
On Sat, Aug 29, 2009 at 1:46 AM, Greg Stark<gsstark@mit.edu> wrote:
> On Sat, Aug 29, 2009 at 5:20 AM, Luke Koops<luke.koops@entrust.com> wrote:
>> RAID-5 can be much faster than RAID-10 for random reads and writes. It is much slower than
>> RAID-10 for sequential writes, but about the same for sequential reads. For typical access
>> patterns, I would put the data and indexes on RAID-5 unless you expect there to be lots of
>> sequential scans.
>
> That's pretty much exactly backwards. RAID-5 will at best be slightly
> slower than RAID-0 or RAID-10 for sequential reads or random reads.
> For writes it performs *terribly*, especially for random writes. The
> only write pattern where it performs ok sometimes is sequential writes
> of large chunks.

Also note that how terribly RAID5 performs on those small random writes depends a LOT on the implementation. A good controller with a large BBU cache will be able to mitigate the performance penalty of having to read stripes before small writes to calculate parity (of course, if the writes are really random enough, it's still not going to help much).

>> Always put the transaction logs (WAL Files) on RAID-10 (or RAID-1 if you don't want to dedicate
>> so many drives to the logs). The only significant performance difference between RAID-10 and
>> RAID-1 is that RAID-1 is much slower (factor of 4 or 5) for random reads.
>
> no, RAID-10 and RAID-1 should perform the same for reads. RAID-10 will
> be slower at writes by about a factor equal to the number of mirror
> sides.

Let's keep in mind that a 2-disk RAID-10 is really the same as a 2-disk RAID-1, it just doesn't have any mirrors to stripe over. So since you really need 4 disks for a "true" RAID-10, the performance of a RAID-10 array compared to a RAID1 array is pretty much proportional to the number of disks in the array (more disks = more performance).

The "far" RAID-10 layout that is available when using Linux software raid is interesting. It will lay the data out on the disks so that you can get the streaming read performance of a RAID-0 array, but streaming write performance will suffer a bit since now the disk will have to seek to perform those writes. You can also use this layout with just 2 disks instead of RAID1. Some claim that the performance hit isn't noticeable due to write caching/IO ordering, but I have not tested its performance using PostgreSQL. Might be a nice thing for someone to try.

http://en.wikipedia.org/wiki/Non-standard_RAID_levels#Linux_MD_RAID_10

-Dave
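For reference, the "far" layout Dave describes is selected with mdadm's --layout option; a sketch with placeholder device names (and it destroys whatever is on them):

    # two-disk md RAID-10 using the "far 2" (f2) layout
    mdadm --create /dev/md0 --level=10 --layout=f2 --raid-devices=2 /dev/sdb1 /dev/sdc1
    mkfs.ext3 /dev/md0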
On Sat, Aug 29, 2009 at 6:26 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Fri, Aug 28, 2009 at 8:19 PM, Jeff Janes<jeff.janes@gmail.com> wrote:
>>> ---------- Forwarded message ----------
>>> From: Joseph S <jks@selectacast.net>
>>> To: Greg Smith <gsmith@gregsmith.com>, pgsql-performance@postgresql.org
>>> Date: Fri, 28 Aug 2009 10:25:10 -0400
>>> Subject: Re: What exactly is postgres doing during INSERT/UPDATE ?
>>> Greg Smith wrote:
>>>
>>>> The main two things you can do to improve this on the database side:
>>>>
>>>> -Increase checkpoint_segments, which reduces how often updated data has
>>>> to be flushed to disk
>>>
>>> If fsync is turned off, does this matter so much?
>>
>> It still matters. The kernel is only willing to have so much dirty data
>> sitting in the disk cache. Once it reaches that limit, user processes doing
>> writes start blocking while the kernel flushes stuff on their behalf.
>
> it doesn't matter nearly as much though.

True, but it matters enough that it ought not be ignored. I've run into it more than once, and I haven't been at this very long.

> if you are outrunning the o/s write cache with fsync off, then it's
> time to start looking at new hardware.
Or to start looking at tweaking the kernel VM settings. The kernel doesn't always handle these situations as gracefully as it could, and might produce a practical throughput that is much less than the theoretical one. But reducing the frequency of checkpoints is easier than either of these, and cheaper than buying new hardware. I don't see why the hardest and most expensive option would be the first choice.
Jeff
On Sat, Aug 29, 2009 at 9:59 AM, Scott Marlowe<scott.marlowe@gmail.com> wrote:
> On Sat, Aug 29, 2009 at 2:46 AM, Greg Stark<gsstark@mit.edu> wrote:
>> On Sat, Aug 29, 2009 at 5:20 AM, Luke Koops<luke.koops@entrust.com> wrote:
>>> Joseph S wrote
>>>> If I have 14 drives in a RAID 10 to split between data tables
>>>> and indexes what would be the best way to allocate the drives
>>>> for performance?
>>>
>>> RAID-5 can be much faster than RAID-10 for random reads and writes. It is
>>> much slower than RAID-10 for sequential writes, but about the same for
>>> sequential reads. For typical access patterns, I would put the data and
>>> indexes on RAID-5 unless you expect there to be lots of sequential scans.
>>
>> That's pretty much exactly backwards. RAID-5 will at best be slightly
>> slower than RAID-0 or RAID-10 for sequential reads or random reads.
>> For writes it performs *terribly*, especially for random writes. The
>> only write pattern where it performs ok sometimes is sequential writes
>> of large chunks.
>
> Note that while RAID-10 is theoretically always better than RAID-5,
> I've run into quite a few cheapie controllers that were heavily
> optimised for RAID-5 and de-optimised for RAID-10. However, if it's
> got battery backed cache and can run in JBOD mode, linux software
> RAID-10 or hybrid RAID-1 in hardware RAID-0 in software will almost
> always beat hardware RAID-5 on the same controller.

raid 5 can outperform raid 10 on sequential writes in theory. if you are writing 100mb of actual data on, say, an 8 drive array, the raid 10 system has to write 200mb of data and the raid 5 system has to write 100 * (8/7) or about 114mb. Of course, the raid 5 system has to do parity, etc.

For random writes, raid 5 has to write a minimum of two drives, the data being written and parity. Raid 10 also has to write two drives minimum. A lot of people think parity is a big deal in terms of raid 5 performance penalty, but I don't -- relative to the what's going on in the drive, xor calculation costs (one of the fastest operations in computing) are basically zero, and off-lined if you have a hardware raid controller.

I bet part of the problem with raid 5 is actually contention. since your write to a stripe can conflict with other writes to a different stripe. The other problem with raid 5 that I see is that you don't get very much extra protection -- it's pretty scary doing a rebuild even with a hot spare (and then you should probably be doing raid 6). On read performance RAID 10 wins all day long because more drives can be involved.

merlin
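The 200mb-vs-114mb figures above are just stripe arithmetic; a trivial check of them (8-drive array, one drive's worth of parity per stripe):

    # bytes hitting the platters for 100mb of application data, ignoring caches
    echo "raid 10: $((100 * 2)) mb      (every block written twice)"
    echo "raid 5:  $((100 * 8 / 7)) mb  (7 data blocks + 1 parity per stripe)"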
On Sun, Aug 30, 2009 at 4:40 PM, Merlin Moncure<mmoncure@gmail.com> wrote: > For random writes, raid 5 has to write a minimum of two drives, the > data being written and parity. Raid 10 also has to write two drives > minimum. A lot of people think parity is a big deal in terms of raid > 5 performance penalty, but I don't -- relative to the what's going on > in the drive, xor calculation costs (one of the fastest operations in > computing) are basically zero, and off-lined if you have a hardware > raid controller. The cost is that in order to calculate the parity block the RAID controller has to *read* in either the old data block being overwritten and the old parity block or all the other data blocks which participate in that parity block. So every random write becomes not just two writes but two reads + two writes. If you're always writing large sequential hunks at a time then this is minimized because the RAID controller can just calculate the new parity block for the whole new hunk. But if you often just seek to random places in the file and overwrite 8k at a time then things go very very poorly. -- greg http://mit.edu/~gsstark/resume.pdf
On 08/30/2009 11:40 AM, Merlin Moncure wrote: > For random writes, raid 5 has to write a minimum of two drives, the > data being written and parity. Raid 10 also has to write two drives > minimum. A lot of people think parity is a big deal in terms of raid > 5 performance penalty, but I don't -- relative to the what's going on > in the drive, xor calculation costs (one of the fastest operations in > computing) are basically zero, and off-lined if you have a hardware > raid controller. > > I bet part of the problem with raid 5 is actually contention. since > your write to a stripe can conflict with other writes to a different > stripe. The other problem with raid 5 that I see is that you don't > get very much extra protection -- it's pretty scary doing a rebuild > even with a hot spare (and then you should probably be doing raid 6). > On read performance RAID 10 wins all day long because more drives can > be involved. > In real life, with real life writes (i.e. not sequential from the start of the disk to the end of the disk), where the stripes on the disk being written are not already in RAM (to allow for XOR to be cheap), RAID 5 is horrible. I still recall naively playing with software RAID 5 on a three disk system and finding write performance to be 20% - 50% less than a single drive on its own. People need to realize that the cost of maintaining parity is not the XOR itself - XOR is cheap - the cost is having knowledge of all drives in the stripe in order to write the parity. This implies it is already in cache (requires a very large cache, or a very localized load such that the load all fits in cache), or it requires 1 or more reads before 2 or more writes. Latency is a killer here - latency is already the slowest part of the disk, so to effectively multiply latency x 2 has a huge impact. I will never use RAID 5 again unless I have a huge memory backed cache for it to cache writes against. By huge, I mean something approximately the size of the data normally read and written. Having 1 Gbytes of RAM dedicated to RAID 5 for a 1 Tbyte drive may not be enough. RAID 1+0 on the other hand, has never disappointed me yet. Disks are cheap, and paying x2 for single disk redundancy is an acceptable price. Cheers, mark -- Mark Mielke<mark@mielke.cc>
I've already learned my lesson and will never use raid 5 again. The question is what I do with my 14 drives. Should I use only 1 pair for indexes or should I use 4 drives? The wal logs are already slated for an SSD.

Scott Marlowe wrote:
> On Sat, Aug 29, 2009 at 2:46 AM, Greg Stark<gsstark@mit.edu> wrote:
>> On Sat, Aug 29, 2009 at 5:20 AM, Luke Koops<luke.koops@entrust.com> wrote:
>>> Joseph S wrote
>>>> If I have 14 drives in a RAID 10 to split between data tables
>>>> and indexes what would be the best way to allocate the drives
>>>> for performance?
>>> RAID-5 can be much faster than RAID-10 for random reads and writes. It is
>>> much slower than RAID-10 for sequential writes, but about the same for
>>> sequential reads. For typical access patterns, I would put the data and
>>> indexes on RAID-5 unless you expect there to be lots of sequential scans.
>> That's pretty much exactly backwards. RAID-5 will at best be slightly
>> slower than RAID-0 or RAID-10 for sequential reads or random reads.
>> For writes it performs *terribly*, especially for random writes. The
>> only write pattern where it performs ok sometimes is sequential writes
>> of large chunks.
>
> Note that while RAID-10 is theoretically always better than RAID-5,
> I've run into quite a few cheapie controllers that were heavily
> optimised for RAID-5 and de-optimised for RAID-10. However, if it's
> got battery backed cache and can run in JBOD mode, linux software
> RAID-10 or hybrid RAID-1 in hardware RAID-0 in software will almost
> always beat hardware RAID-5 on the same controller.
On Sun, Aug 30, 2009 at 1:36 PM, Mark Mielke<mark@mark.mielke.cc> wrote:
> On 08/30/2009 11:40 AM, Merlin Moncure wrote:
>> For random writes, raid 5 has to write a minimum of two drives, the
>> data being written and parity. Raid 10 also has to write two drives
>> minimum. A lot of people think parity is a big deal in terms of raid
>> 5 performance penalty, but I don't -- relative to the what's going on
>> in the drive, xor calculation costs (one of the fastest operations in
>> computing) are basically zero, and off-lined if you have a hardware
>> raid controller.
>>
>> I bet part of the problem with raid 5 is actually contention. since
>> your write to a stripe can conflict with other writes to a different
>> stripe. The other problem with raid 5 that I see is that you don't
>> get very much extra protection -- it's pretty scary doing a rebuild
>> even with a hot spare (and then you should probably be doing raid 6).
>> On read performance RAID 10 wins all day long because more drives can
>> be involved.
>
> In real life, with real life writes (i.e. not sequential from the start of
> the disk to the end of the disk), where the stripes on the disk being
> written are not already in RAM (to allow for XOR to be cheap), RAID 5 is
> horrible. I still recall naively playing with software RAID 5 on a three
> disk system and finding write performance to be 20% - 50% less than a single
> drive on its own.
>
> People need to realize that the cost of maintaining parity is not the XOR
> itself - XOR is cheap - the cost is having knowledge of all drives in the
> stripe in order to write the parity. This implies it is already in cache
> (requires a very large cache, or a very localized load such that the load
> all fits in cache), or it requires 1 or more reads before 2 or more writes.
> Latency is a killer here - latency is already the slowest part of the disk,
> so to effectively multiply latency x 2 has a huge impact.

This is not necessarily correct. As long as the data you are writing is less than the raid stripe size (say 64kb), then you only need the old data for that stripe (which is stored on one disk only), the parity (also stored on one disk only), and the data being written to recalculate the parity. A raid stripe is usually on one disk. So a raid 5 random write will only involve two drives if it's less than stripe size (and three drives if it's up to 2x stripe size, etc). IOW, if your stripe size is 64k:

64k written:
raid 10: two writes
raid 5: two writes, one read (but the read and one of the writes is same physical location)

128k written:
raid 10: four writes
raid 5: three writes, one read (but the read and one of the writes is same physical location)

192k written:
raid 10: six writes
raid 5: four writes, one read (but the read and one of the writes is same physical location)

now, by 'same physical' location, that may mean that the drive head has to move if the data is not in cache.

I realize that many raid 5 implementations tend to suck. That said, raid 5 should offer higher theoretical performance for writing than raid 10, both for sequential and random. (many, many online descriptions of raid get this wrong and stupidly blame the overhead of parity calculation). raid 10 wins on read all day long. Of course, on a typical system with lots of things going on, it gets a lot more complicated...

(just for the record, I use raid 10 on my databases always) :-)

merlin
On Sun, Aug 30, 2009 at 11:56 PM, Merlin Moncure<mmoncure@gmail.com> wrote: > 192k written > raid 10: six writes > raid 5: four writes, one read (but the read and one of the writes is > same physical location) > > now, by 'same physical' location, that may mean that the drive head > has to move if the data is not in cache. > > I realize that many raid 5 implementations tend to suck. That said, > raid 5 should offer higher theoretical performance for writing than > raid 10, both for sequential and random. In the above there are two problems. 1) 192kB is not a random access pattern. Any time you're writing a whole raid stripe or more then RAID5 can start performing reasonably but that's not random, that's sequential i/o. The relevant random i/o pattern is writing 8kB chunks at random offsets into a multi-terabyte storage which doesn't fit in cache. 2) It's not clear but I think you're saying "but the read and one of the writes is same physical location" on the basis that this mitigates the costs. In fact it's the worst case. It means after doing the read and calculating the parity block the drive must then spin a full rotation before being able to write it back out. So instead of an average latency of 1/2 of a rotation you have that plus a full rotation, or 3x as much latency before the write can be performed as without raid5. It's not a fault of the implementations, it's a fundamental problem with RAId5. Even a spectacular implementation of RAID5 will be awful for random access writes. The only saving grace some hardware implementations have is having huge amounts of battery backed cache which mean that they can usually buffer all the writes for long enough that the access patterns no longer look random. If you buffer enough then you can hope you'll eventually overwrite the whole stripe and can write out the new parity without reading the old data. Or failing that you can perform the reads of the old data when it's convenient because you're reading nearby data effectively turning it into sequential i/o. -- greg http://mit.edu/~gsstark/resume.pdf
On Sun, Aug 30, 2009 at 7:38 PM, Greg Stark<gsstark@mit.edu> wrote: > On Sun, Aug 30, 2009 at 11:56 PM, Merlin Moncure<mmoncure@gmail.com> wrote: >> 192k written >> raid 10: six writes >> raid 5: four writes, one read (but the read and one of the writes is >> same physical location) >> >> now, by 'same physical' location, that may mean that the drive head >> has to move if the data is not in cache. >> >> I realize that many raid 5 implementations tend to suck. That said, >> raid 5 should offer higher theoretical performance for writing than >> raid 10, both for sequential and random. > > In the above there are two problems. > > 1) 192kB is not a random access pattern. Any time you're writing a > whole raid stripe or more then RAID5 can start performing reasonably > but that's not random, that's sequential i/o. The relevant random i/o > pattern is writing 8kB chunks at random offsets into a multi-terabyte > storage which doesn't fit in cache. > > 2) It's not clear but I think you're saying "but the read and one of > the writes is same physical location" on the basis that this mitigates > the costs. In fact it's the worst case. It means after doing the read > and calculating the parity block the drive must then spin a full > rotation before being able to write it back out. So instead of an > average latency of 1/2 of a rotation you have that plus a full > rotation, or 3x as much latency before the write can be performed as > without raid5. > > It's not a fault of the implementations, it's a fundamental problem > with RAId5. Even a spectacular implementation of RAID5 will be awful > for random access writes. The only saving grace some hardware > implementations have is having huge amounts of battery backed cache > which mean that they can usually buffer all the writes for long enough > that the access patterns no longer look random. If you buffer enough > then you can hope you'll eventually overwrite the whole stripe and can > write out the new parity without reading the old data. Or failing that > you can perform the reads of the old data when it's convenient because > you're reading nearby data effectively turning it into sequential i/o. I agree, that's good analysis. The main point I was making was that if you have say a 10 disk raid 5, you don't involve 10 disks, only two...a very common misconception. I made another mistake that you didn't catch: you need to read *both* the data drive and the parity drive before writing, not just the parity drive. I wonder if flash SSD are a better fit for raid 5 since the reads are much cheaper than writes and there is no rotational latency. (also, $/gb is different, and so are the failure cases). merlin
* Merlin Moncure <mmoncure@gmail.com> [090831 10:38]: > I agree, that's good analysis. The main point I was making was that > if you have say a 10 disk raid 5, you don't involve 10 disks, only > two...a very common misconception. I made another mistake that you > didn't catch: you need to read *both* the data drive and the parity > drive before writing, not just the parity drive. > > I wonder if flash SSD are a better fit for raid 5 since the reads are > much cheaper than writes and there is no rotational latency. (also, > $/gb is different, and so are the failure cases). The other thing that scares me about raid-5 is the write-hole, and the possible delayed inconsistency that brings... Again, hopefully mitigated by a dependable controller w/ BBU... -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On Sun, Aug 30, 2009 at 1:01 PM, Joseph S <jks@selectacast.net> wrote:
> I've already learned my lesson and will never use raid 5 again. The
> question is what I do with my 14 drives. Should I use only 1 pair for
> indexes or should I use 4 drives? The wal logs are already slated for an SSD.
Why not just spread all your index data over 14 spindles, and do the same with your table data? I haven't encountered this debate in the pgsql world, but from the Oracle world it seems to me the "Stripe And Mirror Everything" people had the better argument than the "separate tables and indexes" people.
Jeff
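A sketch of the "spread everything" approach at the array level (device names are placeholders, and mdadm --create wipes them):

    # one RAID-10 array across all 14 drives; tables and indexes then simply
    # live in the default tablespace on top of it
    mdadm --create /dev/md0 --level=10 --raid-devices=14 /dev/sd[b-o]1
    mkfs.ext3 /dev/md0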
Jeff Janes <jeff.janes@gmail.com> wrote: > Joseph S <jks@selectacast.net> wrote: >> The question is what I do with my 14 drives. Should I use only 1 >> pair for indexes or should I use 4 drives? The wal logs are >> already slated for an SSD. > Why not just spread all your index data over 14 spindles, and do the > same with your table data? If you have the luxury of being able to test more than one configuration with something resembling your actual workload, I would strongly recommend including this as one of your configurations. Spreading everything over the larger number of spindles might well out-perform your most carefully hand-crafted tuning of object placement on smaller spindle sets. -Kevin
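A sketch of the kind of side-by-side test Kevin suggests, using pgbench; the scale factor, client count and database name are arbitrary examples:

    # build a test database, then run the same write-heavy load against each
    # candidate disk layout and compare the transactions-per-second numbers
    pgbench -i -s 100 testdb
    pgbench -c 16 -t 10000 testdb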
On Mon, Aug 31, 2009 at 10:31 AM, Kevin Grittner<Kevin.Grittner@wicourts.gov> wrote:
> Jeff Janes <jeff.janes@gmail.com> wrote:
>> Joseph S <jks@selectacast.net> wrote:
>
>>> The question is what I do with my 14 drives. Should I use only 1
>>> pair for indexes or should I use 4 drives? The wal logs are
>>> already slated for an SSD.
>
>> Why not just spread all your index data over 14 spindles, and do the
>> same with your table data?
>
> If you have the luxury of being able to test more than one
> configuration with something resembling your actual workload, I would
> strongly recommend including this as one of your configurations.
> Spreading everything over the larger number of spindles might well
> out-perform your most carefully hand-crafted tuning of object
> placement on smaller spindle sets.

The first thing I'd test would be if having a separate mirror set for pg_xlog helps. If you have a high-write environment, moving pg_xlog off of the main data set can help a lot.
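For the pg_xlog test, the usual 8.3-era procedure is a symlink; a sketch with placeholder paths (the server must be stopped cleanly first):

    pg_ctl -D /var/lib/pgsql/data stop
    mv /var/lib/pgsql/data/pg_xlog /mnt/wal/pg_xlog
    ln -s /mnt/wal/pg_xlog /var/lib/pgsql/data/pg_xlog
    pg_ctl -D /var/lib/pgsql/data start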