Thread: limiting performance impact of wal archiving.

limiting performance impact of wal archiving.

From: Laurent Laborde
Hi !
We recently had a problem with wal archiving badly impacting the
performance of our postgresql master.
And I discovered "cstream", which can limit the bandwidth of a pipe stream.

Here is our new archive command, FYI, that limits the IO bandwidth to 500 KB/s:
archive_command = '/bin/cat %p | cstream -i "" -o "" -t -500k | nice
gzip -9 -c | /usr/bin/ncftpput etc...'


PS: While writing that mail, I just found that I could replace:
cat %p | cstream -i "" ...
with
cstream -i %p ...
*grins*
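
For the archives, a minimal sketch of that simplified command (untested, and
the ncftpput arguments are elided here just like above):

archive_command = 'cstream -i %p -o "" -t -500k | nice gzip -9 -c | /usr/bin/ncftpput etc...'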


--
ker2x
Sysadmin & DBA @ http://Www.over-blog.com/

Re: limiting performance impact of wal archiving.

From: Kenneth Marshall
On Tue, Nov 10, 2009 at 12:55:42PM +0100, Laurent Laborde wrote:
> Hi !
> We recently had a problem with wal archiving badly impacting the
> performance of our postgresql master.
> And I discovered "cstream", which can limit the bandwidth of a pipe stream.
>
> Here is our new archive command, FYI, that limits the IO bandwidth to 500 KB/s:
> archive_command = '/bin/cat %p | cstream -i "" -o "" -t -500k | nice
> gzip -9 -c | /usr/bin/ncftpput etc...'
>
>
> PS : While writing that mail, i just found that i could replace :
> cat %p | cstream -i "" ...
> with
> cstream -i %p ...
> *grins*
>

And here is a simple perl program that I have used for a similar
reason. Obviously, it can be adapted to your specific needs.

Regards,
Ken

----throttle.pl-------
#!/usr/bin/perl -w

require 5.0;            # written for perl5, hasta labyebye perl4

use strict;
use Getopt::Std;

#
# This is a simple program to throttle network traffic to a
# specified  KB/second to allow a restore in the middle of the
# day over the network.
#

my($file, $chunksize, $len, $offset, $written, $rate, $buf );
my($options, $blocksize, $speed, %convert, $inv_rate, $verbose);

%convert = (              # conversion factors for $speed,$blocksize
    '',    '1',
    'w',    '2',
    'W',    '2',
    'b',    '512',
    'B',    '512',
    'k',    '1024',
    'K',    '1024',
);

$options = 'vhs:r:b:f:';

#
# set defaults
#
$speed = '100k';
$rate = '5';
$blocksize = '120k';              # Works for the DLT drives under SunOS
$file = '-';
$buf = '';
$verbose = 0;                     # default to quiet

sub usage {
  my($usage);

  $usage = "Usage: throttle [-s speed][-r rate/sec][-b blksize][-f file][-v][-h]
  (writes data to STDOUT)
  -s speed       max data rate in B/s - defaults to 100k
  -r rate        writes/sec - defaults to 5
  -b size        read blocksize - defaults to 120k
  -f file        file to read for input - defaults to STDIN
  -h             print this message
  -v             print parameters used
";

  print STDERR $usage;
  exit(1);
}

getopts($options) || usage;

if ($::opt_h || $::opt_h) {   # doubled to quiet -w's "used only once" warning
  usage;
}

usage unless $#ARGV < 0;

$speed = $::opt_s      if $::opt_s;
$rate = $::opt_r       if $::opt_r;
$blocksize = $::opt_b  if $::opt_b;
$file = $::opt_f       if $::opt_f;

#
# Convert $speed and $blocksize to bytes for use in the rest of the script
if ( $speed =~ /^(\d+)([wWbBkK]*)$/ ) {
  $speed = $1 * $convert{$2};
}
if ( $blocksize =~ /^(\d+)([wWbBkK]*)$/ ) {
  $blocksize = $1 * $convert{$2};
}
$inv_rate = 1/$rate;
$chunksize = int($speed/$rate);
$chunksize = 1 if $chunksize == 0;

if ($::opt_v || $::opt_v) {   # doubled to quiet -w's "used only once" warning
  print STDERR "speed = $speed B/s\nrate = $rate/sec\nblocksize = $blocksize B\nchunksize = $chunksize B\n";
}

# Return error if unable to open file
open(FILE, "<$file") or die "Cannot open $file: $!\n";

# Read data from the input file (STDIN by default) and write it to stdout
# at a rate based on $rate and $speed.
#
while($len = sysread(FILE, $buf, $blocksize)) {
  #
  # print out in chunks of $speed/$rate size to allow a smoother load
  $offset = 0;
  while ($len) {
    $written = syswrite(STDOUT, $buf, $chunksize, $offset);
      die "System write error: $!\n" unless defined $written;
    $len -= $written;
    $offset += $written;
    #
    # Now wait 1/$rate seconds before doing the next block
    #
    select(undef, undef, undef, $inv_rate);
  }
}

close(FILE);
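
For anyone wanting to try it, one way it might slot into WAL archiving (an
untested sketch; the script path and the upload command are placeholders):

# throttle the read of each WAL segment to ~500 KB/s before compressing/shipping
archive_command = '/usr/local/bin/throttle.pl -s 500k -f %p | nice gzip -9 -c | /usr/bin/ncftpput etc...'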

Re: limiting performance impact of wal archiving.

From: Ivan Voras
Laurent Laborde wrote:
> Hi !
> We recently had a problem with wal archiving badly impacting the
> performance of our postgresql master.

Hmmm, do you want to say that copying 16 MB files over the network (and
presumably you are not doing it absolutely continually - there are
pauses between log shipping - or you wouldn't be able to use bandwidth
limiting) in an age when desktop drives easily read 60 MB/s (and besides
most of the file should be cached by the OS anyway) is a problem for
you? Slow hardware?

(or I've misunderstood the problem...)

Re: limiting performance impact of wal archiving.

From: Laurent Laborde
On Tue, Nov 10, 2009 at 3:05 PM, Ivan Voras <ivoras@freebsd.org> wrote:
> Laurent Laborde wrote:
>>
>> Hi !
>> We recently had a problem with wal archiving badly impacting the
>> performance of our postgresql master.
>
> Hmmm, do you want to say that copying 16 MB files over the network (and
> presumably you are not doing it absolutely continually - there are pauses
> between log shipping - or you wouldn't be able to use bandwidth limiting) in
> an age when desktop drives easily read 60 MB/s (and besides most of the file
> should be cached by the OS anyway) is a problem for you? Slow hardware?
>
> (or I've misunderstood the problem...)

A desktop drive can easily do 60 MB/s in *sequential* read/write.
We use a high-performance array of 15,000 rpm SAS disks on an octo-core
server with 32 GB of RAM, and IO is always a problem.

Let me explain the problem:

This server (doing WAL archiving) is the master node of the
over-blog server farm:
hundreds of GB of data, tens of millions of articles and comments,
millions of users, ...
~250 read/write SQL requests per second on the master,
~500 read SQL requests per second on each slave.

Awfully random access overloads our array: 10 MB/s at best.
Of course, when doing sequential reads it goes to 250+ MB/s :)

Waiting for "cheap" memory to be cheap enough to have 512 GB of RAM per server ;)

We thought about SSDs.
But interleaved read/write kills SSD performance, leaving them no better
than what we already have. Just more expensive, with unknown behaviour as they age.

--
ker2x
sysadmin & DBA @ http://www.over-blog.com/

Re: limiting performance impact of wal archiving.

From: Ivan Voras
Laurent Laborde wrote:
> On Tue, Nov 10, 2009 at 3:05 PM, Ivan Voras <ivoras@freebsd.org> wrote:
>> Laurent Laborde wrote:
>>> Hi !
>>> We recently had a problem with wal archiving badly impacting the
>>> performance of our postgresql master.
>> Hmmm, do you want to say that copying 16 MB files over the network (and
>> presumably you are not doing it absolutely continually - there are pauses
>> between log shipping - or you wouldn't be able to use bandwidth limiting) in
>> an age when desktop drives easily read 60 MB/s (and besides most of the file
>> should be cached by the OS anyway) is a problem for you? Slow hardware?
>>
>> (or I've misunderstood the problem...)
>
> A desktop drive can easily do 60 MB/s in *sequential* read/write.

... and WAL files are big sequential chunks of data :)

> We use a high-performance array of 15,000 rpm SAS disks on an octo-core
> server with 32 GB of RAM, and IO is always a problem.
>
> I explain the problem :
>
> This server (doing wal archiving) is the master node of the
> over-blog's server farm.
> hundreds of GB of data, tens of millions of articles and comments,
> millions of user, ...
> ~250 read/write sql requests per seconds for the master
> ~500 read sql request per slave.
>
> Awfully random access overloads our array: 10 MB/s at best.

Ok, this explains it. It also means you are probably not getting much
runtime performance benefits from the logging and should think about
moving the logs to different drive(s), among other things because...

> Of course, when doing sequential read it goes to +250MB/s :)

... it means you cannot dedicate 0.064 of a second of the array's time to
reading through a single log file without your other transactions suffering.

> Waiting for "cheap" memory to be cheap enough to have 512Go of ram per server ;)
>
> We tought about SSD.
> But interleaved read/write kill any SSD performance and is not better
> than SSD. Just more expensive with an unknown behaviour over age.

Yes, this is the current attitude toward them.

Re: limiting performance impact of wal archiving.

From: Laurent Laborde
On Tue, Nov 10, 2009 at 4:11 PM, Ivan Voras <ivoras@freebsd.org> wrote:
> Laurent Laborde wrote:
>
> Ok, this explains it. It also means you are probably not getting much
> runtime performance benefits from the logging and should think about moving
> the logs to different drive(s), among other things because...

It is on a separate array which does everything but tablespace (on a
separate array) and indexspace (another separate array).

>> Of course, when doing sequential read it goes to +250MB/s :)
>
> ... it means you cannot dedicate 0.064 of second from the array to read
> through a single log file without your other transactions suffering.

Well, actually, I also changed the configuration to synchronous_commit=off.
It probably was *THE* problem with checkpoints and archiving :)

But adding cstream couldn't hurt performance, and I wanted to share
this with the list. :)

BTW, if you have any ideas to improve IO performance, I'll happily read them.
We're 100% IO bound.

e.g. historically, we use JFS with LVM on Linux, from the good old days
when IO wasn't a problem.
I heard that ext3 is no better for postgresql. What else ? XFS ?

*hugs*

--
ker2x
Sysadmin & DBA @ http://www.over-blog.com/

Re: limiting performance impact of wal archiving.

From: Laurent Laborde
checkpoint log :
--------------------

 checkpoint starting: time
 checkpoint complete: wrote 1972 buffers (0.8%); 0 transaction log
file(s) added, 0 removed, 13 recycled;
 write=179.123 s, sync=26.284 s, total=205.451 s

with a 10 minute checkpoint_timeout.
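
For reference, those lines come from log_checkpoints, i.e. roughly:

log_checkpoints = on           # emits the "checkpoint starting/complete" lines
checkpoint_timeout = 10min     # the 10 minute timeout mentioned above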

--
ker2x

Re: limiting performance impact of wal archiving.

From: Scott Marlowe
On Tue, Nov 10, 2009 at 8:00 AM, Laurent Laborde <kerdezixe@gmail.com> wrote:
>
> A desktop drive can easily do 60 MB/s in *sequential* read/write.
> We use a high-performance array of 15,000 rpm SAS disks on an octo-core
> server with 32 GB of RAM, and IO is always a problem.

How many drives in the array?  Controller?  RAID level?

> I explain the problem :
>
> This server (doing wal archiving) is the master node of the
> over-blog's server farm.
> hundreds of GB of data, tens of millions of articles and comments,
> millions of user, ...
> ~250 read/write sql requests per seconds for the master
> ~500 read sql request per slave.

That's really not very fast.

Re: limiting performance impact of wal archiving.

From: "Kevin Grittner"
Laurent Laborde <kerdezixe@gmail.com> wrote:

> BTW, if you have any idea to improve IO performance, i'll happily
> read it.  We're 100% IO bound.

At the risk of stating the obvious, you want to make sure you have
high quality RAID adapters with large battery backed cache configured
to write-back.

If you haven't already done so, you might want to try
elevator=deadline.

> xfs ?

If you use xfs and have the aforementioned BBU cache, be sure to turn
write barriers off.
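
For example, something along these lines (a sketch only; device names, mount
points and fstab details are placeholders for your own setup, and nobarrier
is safe only with a working battery-backed cache):

echo deadline > /sys/block/sda/queue/scheduler   # per-device, takes effect immediately
# or boot with elevator=deadline on the kernel command line

# fstab entry for an XFS data volume with write barriers off
/dev/sda1  /var/lib/pgsql/data  xfs  nobarrier,noatime  0  0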

-Kevin

Re: limiting performance impact of wal archiving.

From: Laurent Laborde
On Tue, Nov 10, 2009 at 4:48 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Laurent Laborde <kerdezixe@gmail.com> wrote:
>
>> BTW, if you have any idea to improve IO performance, i'll happily
>> read it.  We're 100% IO bound.
>
> At the risk of stating the obvious, you want to make sure you have
> high quality RAID adapters with large battery backed cache configured
> to write-back.

Not sure how "high quality" the 3ware is.
/c0 Driver Version = 2.26.08.004-2.6.18
/c0 Model = 9690SA-8I
/c0 Available Memory = 448MB
/c0 Firmware Version = FH9X 4.04.00.002
/c0 Bios Version = BE9X 4.01.00.010
/c0 Boot Loader Version = BL9X 3.08.00.001
/c0 Serial Number = L340501A7360026
/c0 PCB Version = Rev 041
/c0 PCHIP Version = 2.00
/c0 ACHIP Version = 1501290C
/c0 Controller Phys = 8
/c0 Connections = 8 of 128
/c0 Drives = 8 of 128
/c0 Units = 3 of 128
/c0 Active Drives = 8 of 128
/c0 Active Units = 3 of 32
/c0 Max Drives Per Unit = 32
/c0 Total Optimal Units = 2
/c0 Not Optimal Units = 1
/c0 Disk Spinup Policy = 1
/c0 Spinup Stagger Time Policy (sec) = 1
/c0 Auto-Carving Policy = off
/c0 Auto-Carving Size = 2048 GB
/c0 Auto-Rebuild Policy = on
/c0 Controller Bus Type = PCIe
/c0 Controller Bus Width = 8 lanes
/c0 Controller Bus Speed = 2.5 Gbps/lane


> If you haven't already done so, you might want to try
> elevator=deadline.

That's what we use.
Also tried "noop" scheduler without signifiant performance change.

--
ker2x

Re: limiting performance impact of wal archiving.

From: Greg Smith
Laurent Laborde wrote:
> It is on a separate array which does everything but tablespace (on a
> separate array) and indexspace (another separate array).
>
On Linux, the types of writes done to the WAL volume (where writes are
constantly being flushed) require the WAL volume not be shared with
anything else for that to perform well.  Typically you'll end up with
other things being written out too because it can't just selectively
flush just the WAL data.  The whole "write barriers" implementation
should fix that, but in practice rarely does.

If you put many drives into one big array, somewhere around 6 or more
drives, at that point you might put the WAL on that big volume too and
be OK (presuming a battery-backed cache which you have).  But if you're
carving up array sections so finely for other purposes, it doesn't sound
like your WAL data is on a big array.  Mixed onto a big shared array or
single dedicated disks (RAID1) are the two WAL setups that work well,
and if I have a bunch of drives I personally always prefer a dedicated
drive mainly because it makes it easy to monitor exactly how much WAL
activity is going on by watching that drive.
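
If you do give WAL its own pair of disks, moving pg_xlog there is just a
symlink job, roughly like this (paths here are placeholders, and the server
must be stopped first):

pg_ctl -D /var/lib/pgsql/data stop
mv /var/lib/pgsql/data/pg_xlog /mnt/wal_raid1/pg_xlog
ln -s /mnt/wal_raid1/pg_xlog /var/lib/pgsql/data/pg_xlog
pg_ctl -D /var/lib/pgsql/data start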

> Well, actually, i also change the configuration to synchronous_commit=off
> It probably was *THE* problem with checkpoint and archiving :)
>
This is basically turning off the standard WAL implementation for one
where you'll lose some data if there's a crash.  If you're OK with that,
great; if not, expect to lose some number of transactions if the server
ever goes down unexpectedly when configured like this.

Generally if checkpoints and archiving are painful, the first thing to
do is to increase checkpoint_segments to a very high amount (>100),
increase checkpoint_timeout too, and push shared_buffers up to be a
large chunk of memory.  Disabling synchronous_commit should be a last
resort if your performance issues are so bad you have no choice but to
sacrifice some data integrity just to keep things going, while you
rearchitect to improve things.
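
In postgresql.conf terms, that's roughly this kind of starting point (the
exact values are illustrative, not a recommendation for this particular box):

checkpoint_segments = 128      # ">100", up from the default of 3
checkpoint_timeout = 30min     # up from the default 5min
shared_buffers = 8GB           # a large chunk of the 32GB of RAM
#synchronous_commit = off      # last resort only, as discussed above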

> eg: historically, we use JFS with LVM on linux. from the good old time
> when IO wasn't a problem.
> i heard that ext3 is not better for postgresql. what else ? xfs ?
>
You never want to use LVM under Linux if you care about performance.  It
adds a bunch of overhead that drops throughput no matter what, and it's
filled with limitations.  For example, I mentioned write barriers being
one way to interleave WAL writes with other types of writes without having to
write the whole filesystem cache out.  Guess what:  they don't work at
all, regardless of filesystem, if you're using LVM.  Much like using virtual machines,
LVM is an approach only suitable for low to medium performance systems
where your priority is easier management rather than speed.

Given the current quality of Linux code, I hesitate to use anything but
ext3 because I consider that just barely reliable enough even as the
most popular filesystem by far.  JFS and XFS have some benefits to them,
but none so compelling to make up for how much less testing they get.
That said, there seem to be a fair number of people happily running
high-performance PostgreSQL instances on XFS.

--
Greg Smith    greg@2ndQuadrant.com    Baltimore, MD

Re: limiting performance impact of wal archiving.

From: Laurent Laborde
On Tue, Nov 10, 2009 at 5:35 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Laurent Laborde wrote:
>>
>> It is on a separate array which does everything but tablespace (on a
>> separate array) and indexspace (another separate array).
>>
>
> On Linux, the types of writes done to the WAL volume (where writes are
> constantly being flushed) require the WAL volume not be shared with anything
> else for that to perform well.  Typically you'll end up with other things
> being written out too because it can't just selectively flush just the WAL
> data.  The whole "write barriers" implementation should fix that, but in
> practice rarely does.
>
> If you put many drives into one big array, somewhere around 6 or more
> drives, at that point you might put the WAL on that big volume too and be OK
> (presuming a battery-backed cache which you have).  But if you're carving up
> array sections so finely for other purposes, it doesn't sound like your WAL
> data is on a big array.  Mixed onto a big shared array or single dedicated
> disks (RAID1) are the two WAL setups that work well, and if I have a bunch
> of drives I personally always prefer a dedicated drive mainly because it
> makes it easy to monitor exactly how much WAL activity is going on by
> watching that drive.

On the "new" slave i have 6 disk in raid-10 and 2 disk in raid-1.
I tought about doing the same thing with the master.


>> Well, actually, i also change the configuration to synchronous_commit=off
>> It probably was *THE* problem with checkpoint and archiving :)
>>
>
> This is basically turning off the standard WAL implementation for one where
> you'll lose some data if there's a crash.  If you're OK with that, great; if
> not, expect to lose some number of transactions if the server ever goes down
> unexpectedly when configured like this.

I have 1 spare dedicated to hot standby, doing nothing but waiting for
the master to fail,
+ 2 spare candidates for cluster mastering.

In theory, I could even disable fsync and all the "safety" features on the master.
In practice, I'd like to avoid using Slony's failover capabilities
if I can avoid it :)

> Generally if checkpoints and archiving are painful, the first thing to do is
> to increase checkpoint_segments to a very high amount (>100), increase
> checkpoint_timeout too, and push shared_buffers up to be a large chunk of
> memory.

shared_buffers is 2GB.
I'll reread the documentation about checkpoint_segments.
Thx.

> Disabling synchronous_commit should be a last resort if your
> performance issues are so bad you have no choice but to sacrifice some data
> integrity just to keep things going, while you rearchitect to improve
> things.
>
>> eg: historically, we use JFS with LVM on linux. from the good old time
>> when IO wasn't a problem.
>> i heard that ext3 is not better for postgresql. what else ? xfs ?
>>
>
> You never want to use LVM under Linux if you care about performance.  It
> adds a bunch of overhead that drops throughput no matter what, and it's
> filled with limitations.  For example, I mentioned write barriers being one
> way to interleave WAL writes without other types without having to write the
> whole filesystem cache out.  Guess what:  they don't work at all regardless
> if you're using LVM.  Much like using virtual machines, LVM is an approach
> only suitable for low to medium performance systems where your priority is
> easier management rather than speed.

*doh* !!
Everybody told me "nooo ! LVM is OK, no perceptible overhead, etc..."
Are you 100% sure about LVM ? I'll happily trash it :)

> Given the current quality of Linux code, I hesitate to use anything but ext3
> because I consider that just barely reliable enough even as the most popular
> filesystem by far.  JFS and XFS have some benefits to them, but none so
> compelling to make up for how much less testing they get.  That said, there
> seem to be a fair number of people happily running high-performance
> PostgreSQL instances on XFS.

Thx for the info :)

--
ker2x

Re: limiting performance impact of wal archiving.

From: Scott Marlowe
On Tue, Nov 10, 2009 at 9:52 AM, Laurent Laborde <kerdezixe@gmail.com> wrote:
> On Tue, Nov 10, 2009 at 5:35 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>> disks (RAID1) are the two WAL setups that work well, and if I have a bunch
>> of drives I personally always prefer a dedicated drive mainly because it
>> makes it easy to monitor exactly how much WAL activity is going on by
>> watching that drive.

I do the same thing for the same reasons.

> On the "new" slave i have 6 disk in raid-10 and 2 disk in raid-1.
> I tought about doing the same thing with the master.

It would be a worthy change to make.  As long as there's no heavy log
write load on the RAID-1, put the pg_xlog there.

>> Generally if checkpoints and archiving are painful, the first thing to do is
>> to increase checkpoint_segments to a very high amount (>100), increase
>> checkpoint_timeout too, and push shared_buffers up to be a large chunk of
>> memory.
>
> Shared_buffer is 2GB.

On some busy systems with lots of small transactions, a large
shared_buffers can cause it to run slower rather than faster due to
background writer overhead.

> I'll reread domcumentation about checkpoint_segments.
> thx.

Note that if you've got a slow IO subsystem, a large number of
checkpoint segments can result in REALLY long restart times after a
crash, as well as really long waits for shutdown and / or bgwriter
once you've filled them all up.

>> You never want to use LVM under Linux if you care about performance.  It
>> adds a bunch of overhead that drops throughput no matter what, and it's
>> filled with limitations.  For example, I mentioned write barriers being one
>> way to interleave WAL writes without other types without having to write the
>> whole filesystem cache out.  Guess what:  they don't work at all regardless
>> if you're using LVM.  Much like using virtual machines, LVM is an approach
>> only suitable for low to medium performance systems where your priority is
>> easier management rather than speed.
>
> *doh* !!
> Everybody told me "nooo ! LVM is ok, no perceptible overhead, etc ...)
> Are you 100% about LVM ? I'll happily trash it :)

Everyone who doesn't run databases thinks LVM is plenty fast.  Under a
database it is not so quick.  Do your own testing to be sure, but I've
seen slowdowns of about 1/2 under it for fast RAID arrays.

>> Given the current quality of Linux code, I hesitate to use anything but ext3
>> because I consider that just barely reliable enough even as the most popular
>> filesystem by far.  JFS and XFS have some benefits to them, but none so
>> compelling to make up for how much less testing they get.  That said, there
>> seem to be a fair number of people happily running high-performance
>> PostgreSQL instances on XFS.
>
> Thx for the info :)

Note that XFS gets a LOT of testing, especially under linux.  That
said it's still probably only 1/10th as many dbs (or fewer) as those
running on ext3 on linux.  I've used it before and it's a little
faster than ext3 at some stuff, especially deleting large files (or in
pg's case lots of 1G files) which can make ext3 crawl.

Re: limiting performance impact of wal archiving.

From: Craig James
On Tue, Nov 10, 2009 at 5:35 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Given the current quality of Linux code, I hesitate to use anything but ext3
> because I consider that just barely reliable enough even as the most popular
> filesystem by far.  JFS and XFS have some benefits to them, but none so
> compelling to make up for how much less testing they get.  That said, there
> seem to be a fair number of people happily running high-performance
> PostgreSQL instances on XFS.

I thought the common wisdom was to use ext2 for the WAL, since the WAL is a
journal system, and ext3 would essentially be journaling the journal.  Is
that not true?

Craig

Re: limiting performance impact of wal archiving.

From: Scott Marlowe
On Tue, Nov 10, 2009 at 10:07 AM, Craig James
<craig_james@emolecules.com> wrote:
> On Tue, Nov 10, 2009 at 5:35 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>>
>> Given the current quality of Linux code, I hesitate to use anything but
>> ext3
>> because I consider that just barely reliable enough even as the most
>> popular
>> filesystem by far.  JFS and XFS have some benefits to them, but none so
>> compelling to make up for how much less testing they get.  That said,
>> there
>> seem to be a fair number of people happily running high-performance
>> PostgreSQL instances on XFS.
>
> I thought the common wisdom was to use ext2 for the WAL, since the WAL is a
> journal system, and ext3 would essentially be journaling the journal.  Is
> that not true?

Yep, ext2 for pg_xlog is fine.
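
Setting one up is about as simple as it gets (a sketch; device and mount
point are placeholders, and you'd still point pg_xlog at it with a symlink):

mkfs.ext2 /dev/sdc1                     # no journal to journal the journal
mount -o noatime /dev/sdc1 /pg_xlog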

Re: limiting performance impact of wal archiving.

From: Greg Smith
Craig James wrote:
> On Tue, Nov 10, 2009 at 5:35 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>> Given the current quality of Linux code, I hesitate to use anything
>> but ext3
>> because I consider that just barely reliable enough even as the most
>> popular
>> filesystem by far.  JFS and XFS have some benefits to them, but none so
>> compelling to make up for how much less testing they get.  That said,
>> there
>> seem to be a fair number of people happily running high-performance
>> PostgreSQL instances on XFS.
>
> I thought the common wisdom was to use ext2 for the WAL, since the WAL
> is a journal system, and ext3 would essentially be journaling the
> journal.  Is that not true?
Using ext2 means that you're still exposed to fsck errors on boot after
a crash, which doesn't lose anything but you have to go out of your way
to verify you're not going to get stuck with your server down in that
case.  The state of things on the performance side is nicely benchmarked
at

http://www.commandprompt.com/blogs/joshua_drake/2008/04/is_that_performance_i_smell_ext2_vs_ext3_on_50_spindles_testing_for_postgresql/



Sure, it jumps from 85MB/s to 115MB/s if you use ext2, but if noatime
had been used I think even some of that fairly small gap would have
closed.  My experience is that it's really hard to saturate even a
single disk worth of bandwidth with WAL writes if there's a dedicated
WAL volume.  As such, I'll use ext3 until it's very clear that's the
actual bottleneck, and only then step back and ask if converting to ext2
is worth the performance boost and potential crash recovery mess.  I've
never actually reached that point in a real-world situation, only in
simulated burst write tests.

--
Greg Smith    greg@2ndQuadrant.com    Baltimore, MD

Re: limiting performance impact of wal archiving.

From: Greg Smith
Laurent Laborde wrote:
> On Tue, Nov 10, 2009 at 5:35 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>
> I have 1 spare dedicated to hot standby, doing nothing but waiting for
> the master to fail.
> + 2 spare candidate for cluster mastering.
>
> In theory, i could even disable fsync and all "safety" feature on the master.
>
There are two types of safety issues here:

1) Will the database be corrupted if there's a crash?  This can happen
if you turn off fsync, and you'll need to switch to a standby to easily
get back up again

2) Will you lose transactions that have been reported as committed to a
client if there's a crash?  This you're exposed to if synchronous_commit
is off, and whether you have a standby or not doesn't change that fact.

> Everybody told me "nooo ! LVM is ok, no perceptible overhead, etc ...)
> Are you 100% about LVM ? I'll happily trash it :)
Believing what people told you is how you got into trouble in the first
place.  You shouldn't believe me either--benchmark yourself and then
you'll know.  As a rule, any time someone suggests there's a
technological approach that makes it easier to manage disks, that
approach will also slow performance.  LVM vs. straight volumes, SAN vs.
direct-attached storage, VM vs. real hardware, it's always the same story.

--
Greg Smith    greg@2ndQuadrant.com    Baltimore, MD


Re: limiting performance impact of wal archiving.

From: Greg Smith
Scott Marlowe wrote:
> On some busy systems with lots of small transactions large
> shared_buffer can cause it to run slower rather than faster due to
> background writer overhead.
>
This is only really true in 8.2 and earlier, where background writer
computations are done as a percentage of shared_buffers.  The rewrite I
did in 8.3 changes that to where it's proportional to overall system
activity (specifically, buffer allocations) and you shouldn't see this
there.  However, large values for shared_buffers do increase the
potential for longer checkpoints, which is a similar kind of background
overhead starting in 8.3.  That's why I mention it hand in hand with
decreasing the checkpoint frequency: you really need to do that before
large shared_buffers values are viable.

This is actually a topic I meant to mention to Laurent:  if you're not
running at least PG8.3, you really should be considering what it would
take to upgrade to 8.4.  It's hard to justify the 8.3->8.4 upgrade just
based on that version's new performance features (unless you delete
things a lot), but the changes from 8.1 to 8.2 to 8.3 make the database
faster at a lot of common tasks.

> Note that if you've got a slow IO subsystem, a large number of
> checkpoint segments can result in REALLY long restart times after a
> crash, as well as really long waits for shutdown and / or bgwriter
> once you've filled them all up.
>
The setup here, with a decent number of disks and a 3ware controller,
shouldn't be that bad.  Ultimately you have to ask yourself whether
it's OK to suffer from the rare recovery issue this introduces if it
improves things a lot all of the rest of the time, which increasing
checkpoint_segments does.

> Note that XFS gets a LOT of testing, especially under linux.  That
> said it's still probably only 1/10th as many dbs (or fewer) as those
> running on ext3 on linux.  I've used it before and it's a little
> faster than ext3 at some stuff, especially deleting large files (or in
> pg's case lots of 1G files) which can make ext3 crawl.
>
While true, you have to consider whether the things it's better at
really happen during a regular day.  The whole "faster at deleting large
files" thing doesn't matter to me on a production DB server at all, so
that slam-dunk win for XFS doesn't even factor into my filesystem
ranking computations in that context.

--
Greg Smith    greg@2ndQuadrant.com    Baltimore, MD


Re: limiting performance impact of wal archiving.

From: Scott Marlowe
On Tue, Nov 10, 2009 at 10:48 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> Scott Marlowe wrote:
>>
>> On some busy systems with lots of small transactions large
>> shared_buffer can cause it to run slower rather than faster due to
>> background writer overhead.
>>
>
> This is only really true in 8.2 and earlier, where background writer
> computations are done as a percentage of shared_buffers.  The rewrite I did
> in 8.3 changes that to where it's proportional to overall system activity
> (specifically, buffer allocations) and you shouldn't see this there.

Nice to know since we converted to 8.3 a few months ago.  I did notice
the huge overall performance improvement from 8.2 to 8.3 and I assume
part of that was the code you wrote for WAL.  Thanks!

>  However, large values for shared_buffers do increase the potential for
> longer checkpoints though, which is similar background overhead starting in
> 8.3.  That's why I mention it hand in hand with decreasing the checkpoint
> frequency, you really need to do that before large shared_buffers values are
> viable.

Yeah.  We run 64 checkpoint segments and a 30 minute timeout and a
lower completion target (0.25 to 0.5) on most of our servers with good
behaviour in 8.3
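
In postgresql.conf terms that's roughly:

checkpoint_segments = 64
checkpoint_timeout = 30min
checkpoint_completion_target = 0.25    # we vary this between 0.25 and 0.5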

> This is actually a topic I meant to mention to Laurent:  if you're not
> running at least PG8.3, you really should be considering what it would take
> to upgrade to 8.4.  It's hard to justify the 8.3->8.4 upgrade just based on
> that version's new performance features (unless you delete things a lot),
> but the changes from 8.1 to 8.2 to 8.3 make the database faster at a lot of
> common tasks.

True++  8.3 is the minimum version of pg we run anywhere at work now.
8.4 isn't compelling yet for us, since we finally got fsm setup right.
 But for someone upgrading from 8.2 or before, I'd think the automatic
fsm stuff would be a big selling point.

>> Note that if you've got a slow IO subsystem, a large number of
>> checkpoint segments can result in REALLY long restart times after a
>> crash, as well as really long waits for shutdown and / or bgwriter
>> once you've filled them all up.
>>
>
> The setup here, with a decent number of disks and a 3ware controller,
> shouldn't be that bad here.

If he were running RAID-5 I'd agree. :) That's gonna slow down the
write speeds quite a bit during recovery.

> Ultimately you have to ask yourself whether
> it's OK to suffer from the rare recovery issue this introduces if it
> improves things a lot all of the rest of the time, which increasing
> checkpoint_segments does.

Note that 100% of the time I have to wait for recovery on start it's
because something went wrong with a -m fast shutdown that required
either hand killing all postgres backends and the postmaster, or a -m
immediate.  On the machines with 12 disk RAID-10 arrays this takes
seconds to do.  On the slaves with a pair of 7200RPM SATA drives, or
the one at the office on RAID-6, and 60 to 100+ WAL segments, it takes
a couple of minutes.

>> Note that XFS gets a LOT of testing, especially under linux.  That
>> said it's still probably only 1/10th as many dbs (or fewer) as those
>> running on ext3 on linux.  I've used it before and it's a little
>> faster than ext3 at some stuff, especially deleting large files (or in
>> pg's case lots of 1G files) which can make ext3 crawl.
>
> While true, you have to consider whether the things it's better at really
> happen during a regular day.  The whole "faster at deleting large files"
> thing doesn't matter to me on a production DB server at all, so that
> slam-dunk win for XFS doesn't even factor into my filesystem ranking
> computations in that context.

ahhhh.  I store backups on my pgdata directory, so it does start to
matter there.  Luckily, that's on a slave database so it's not as
horrible as it could be.  Still running ext3 on it because it just
works.

Re: limiting performance impact of wal archiving.

From: Jeff
On Nov 10, 2009, at 10:53 AM, Laurent Laborde wrote:

> On Tue, Nov 10, 2009 at 4:48 PM, Kevin Grittner
> <Kevin.Grittner@wicourts.gov> wrote:
>> Laurent Laborde <kerdezixe@gmail.com> wrote:
>>
>>> BTW, if you have any idea to improve IO performance, i'll happily
>>> read it.  We're 100% IO bound.
>>
>> At the risk of stating the obvious, you want to make sure you have
>> high quality RAID adapters with large battery backed cache configured
>> to write-back.
>
> Not sure how "high quality" the 3ware is.
> /c0 Driver Version = 2.26.08.004-2.6.18
> /c0 Model = 9690SA-8I
> /c0 Available Memory = 448MB

I'll note that I've had terrible experience with 3ware controllers and
getting a high number of iops using hardware raid mode.  If you switch
it to jbod and do softraid you'll get a large increase in iops - which
is the key metric for a db.  I've posted previously about my problems
with 3ware.

as for the ssd comment - I disagree.  I've been running ssd's for a
while now (probably closing in on a year by now) with great success.
A pair of intel x25-e's can get thousands of iops.  That being said
the key is I'm running the intel ssds - there are plenty of absolutely
miserable ssds floating around (I'm looking at you jmicron based disks!)

Have you gone through the normal process of checking your query plans
to ensure they are sane? There is always a possibility a new index can
vastly reduce IO.

--
Jeff Trout <jeff@jefftrout.com>
http://www.stuarthamm.net/
http://www.dellsmartexitin.com/




Re: limiting performance impact of wal archiving.

From: Greg Smith
Scott Carey wrote:
>> Using ext2 means that you're still exposed to fsck errors on boot after
>> a crash, which doesn't lose anything but you have to go out of your way
>> to verify you're not going to get stuck with your server down in that
>> case.
> fsck on a filesystem with 1 folder and <checkpoint_segments> files is very
> very fast.  Even if using WAL archiving, there won't be many
> files/directories to check.  Fsck is not an issue if the partition is
> exclusively for WAL.  You can even mount it direct, and avoid having the OS
> cache those pages if you are using a caching raid controller
Right; that sort of thing--switching to a more direct mount, making sure
fsck is setup to run automatically rather than dropping to a menu--is
what I was alluding to when I said you had to go out of your way to make
that work.  It's not complicated, really, but by the time you've set
everything up and done the proper testing to confirm it all worked as
expected you've just spent a modest chunk of time.  All I was trying to
suggest is that there is a cost and some complexity, and that I feel
there's nothing to justify that unless you really are bottlenecked
specifically at WAL write volume.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com


Re: limiting performance impact of wal archiving.

From: Scott Carey
> Using ext2 means that you're still exposed to fsck errors on boot after
> a crash, which doesn't lose anything but you have to go out of your way
> to verify you're not going to get stuck with your server down in that
> case.  The state of things on the performance side is nicely benchmarked
> at
> http://www.commandprompt.com/blogs/joshua_drake/2008/04/is_that_performance_i_smell_ext2_vs_ext3_on_50_spindles_testing_for_postgresql/
>

fsck on a filesystem with 1 folder and <checkpoint_segments> files is very
very fast.  Even if using WAL archiving, there won't be many
files/directories to check.  Fsck is not an issue if the partition is
exclusively for WAL.  You can even mount it direct, and avoid having the OS
cache those pages if you are using a caching raid controller.




Re: limiting performance impact of wal archiving.

From: Laurent Laborde
Hi !

Here is my plan :
- rebuilding a spare with ext3, raid10, without lvm
- switch the slony master to this new node.

We'll see ...
Thx for all the info !!!

--
Ker2x

Re: limiting performance impact of wal archiving.

From: david@lang.hm
On Tue, 10 Nov 2009, Greg Smith wrote:

> Laurent Laborde wrote:
>> It is on a separate array which does everything but tablespace (on a
>> separate array) and indexspace (another separate array).
>>
> On Linux, the types of writes done to the WAL volume (where writes are
> constantly being flushed) require the WAL volume not be shared with anything
> else for that to perform well.  Typically you'll end up with other things
> being written out too because it can't just selectively flush just the WAL
> data.  The whole "write barriers" implementation should fix that, but in
> practice rarely does.

I believe that this is more an ext3 problem than a Linux problem.

David Lang


Re: limiting performance impact of wal archiving.

From: Laurent Laborde
On Thu, Nov 12, 2009 at 3:21 PM, Laurent Laborde <kerdezixe@gmail.com> wrote:
> Hi !
>
> Here is my plan :
> - rebuilding a spare with ext3, raid10, without lvm
> - switch the slony master to this new node.

Done 3 days ago : Problem solved ! It totally worked. \o/

--
ker2x
sysadmin & DBA @ http://www.over-blog.com/