Thread: moving pg_xlog -- yeah, it's worth it!

From:
"Kevin Grittner"
Date:

Due to an error during an update to the system kernel on a database
server backing a web application, we ran for a month (mid-December
to mid-January) with the WAL files in the pg_xlog subdirectory on
the same 40-spindle array as the data -- only the OS was on a
separate mirrored pair of drives.  When we found the problem, we
moved the pg_xlog directory back to its own mirrored pair of drives
and symlinked to it.  It's a pretty dramatic difference.
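The relocation described above is just a move plus a symlink, done with the server stopped; a minimal sketch, where the WAL mount point is an assumption (only the data directory path appears in this thread):

```shell
# Sketch of the pg_xlog relocation; WALDIR is a hypothetical mount
# on the dedicated mirrored pair.  Run with PostgreSQL stopped.
DATA=/var/pgsql/data           # cluster data directory
WALDIR=/wal_mirror/pg_xlog     # assumed mount point for the RAID1 pair
mv "$DATA/pg_xlog" "$WALDIR"
ln -s "$WALDIR" "$DATA/pg_xlog"   # the server follows the symlink transparently
```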

Graph attached, this is response time for a URL request (like what a
user from the Internet would issue) which runs 15 queries and
formats the results.  Response time is 85 ms without putting WAL on
its own RAID; 50 ms with it on its own RAID.  This is a real-world,
production web site with 1.3 TB data, millions of hits per day, and
it's an active replication target 24/7.
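For scale, the numbers above work out to roughly a 41% drop in response time:

```shell
# The reported change, as a percentage (integer shell arithmetic):
before=85; after=50    # ms per URL request, per the attached graph
echo "$(( 100 * (before - after) / before ))% lower response time"
```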

Frankly, I was quite surprised by this, since some of the benchmarks
people have published on the effects of using a separate RAID for
the WAL files have only shown a one or two percent difference when
using a hardware RAID controller with BBU cache configured for
write-back.

No assistance needed; just posting a performance data point.

-Kevin

From:
Jesper Krogh
Date:

>
> Frankly, I was quite surprised by this, since some of the benchmarks
> people have published on the effects of using a separate RAID for
> the WAL files have only shown a one or two percent difference when
> using a hardware RAID controller with BBU cache configured for
> write-back.

Hi Kevin.

Nice report, but just a few questions.

Sorry if it is obvious.. but what filesystem/OS are you using, and do
you have BBU-writeback on the main data catalog also?

Jesper


From:
"Kevin Grittner"
Date:

Jesper Krogh <> wrote:

> Sorry if it is obvious.. but what filesystem/OS are you using and
> do you have BBU-writeback on the main data catalog also?

Sorry for not providing more context.

ATHENA:/var/pgsql/data # uname -a
Linux ATHENA 2.6.16.60-0.39.3-smp #1 SMP Mon May 11 11:46:34 UTC
2009 x86_64 x86_64 x86_64 GNU/Linux
ATHENA:/var/pgsql/data # cat /etc/SuSE-release
SUSE Linux Enterprise Server 10 (x86_64)
VERSION = 10
PATCHLEVEL = 2

File system is xfs noatime,nobarrier for all data; OS is on ext3.  I
*think* the pg_xlog mirrored pair is hanging off the same
BBU-writeback controller as the big RAID, but I'd have to track down
the hardware tech to confirm, and he's out today.  System has 16
Xeon CPUs and 64 GB RAM.

-Kevin

From:
Amitabh Kant
Date:

On Tue, Feb 9, 2010 at 10:03 PM, Kevin Grittner <> wrote:
> File system is xfs noatime,nobarrier for all data; OS is on ext3.  I
> *think* the pg_xlog mirrored pair is hanging off the same
> BBU-writeback controller as the big RAID, but I'd have to track down
> the hardware tech to confirm, and he's out today.  System has 16
> Xeon CPUs and 64 GB RAM.
>
> -Kevin


Hi Kevin

Just curious whether you have 16 physical CPUs, 16 cores on 4 CPUs, or 8 cores on 2 CPUs with HT.

With regards

Amitabh Kant
From:
"Kevin Grittner"
Date:

Amitabh Kant <> wrote:

> Just curious whether you have 16 physical CPUs, 16 cores on 4
> CPUs, or 8 cores on 2 CPUs with HT.

Four quad core CPUs:

vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           X7350  @ 2.93GHz
stepping        : 11
cpu MHz         : 2931.978
cache size      : 4096 KB
physical id     : 3
siblings        : 4
core id         : 0
cpu cores       : 4

-Kevin

From:
Bruce Momjian
Date:

Kevin Grittner wrote:
> Jesper Krogh <> wrote:
>
> > Sorry if it is obvious.. but what filesystem/OS are you using and
> > do you have BBU-writeback on the main data catalog also?
>
> Sorry for not providing more context.
>
> ATHENA:/var/pgsql/data # uname -a
> Linux ATHENA 2.6.16.60-0.39.3-smp #1 SMP Mon May 11 11:46:34 UTC
> 2009 x86_64 x86_64 x86_64 GNU/Linux
> ATHENA:/var/pgsql/data # cat /etc/SuSE-release
> SUSE Linux Enterprise Server 10 (x86_64)
> VERSION = 10
> PATCHLEVEL = 2
>
> File system is xfs noatime,nobarrier for all data; OS is on ext3.  I
> *think* the pg_xlog mirrored pair is hanging off the same
> BBU-writeback controller as the big RAID, but I'd have to track down
> the hardware tech to confirm, and he's out today.  System has 16
> Xeon CPUs and 64 GB RAM.

I would be surprised if the RAID controller had a BBU-writeback cache.
I don't think having xlog share a BBU-writeback makes things slower, and
if it does, I would love for someone to explain why.

--
  Bruce Momjian  <>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

From:
Scott Marlowe
Date:

On Thu, Feb 11, 2010 at 4:29 AM, Bruce Momjian <> wrote:
> Kevin Grittner wrote:
>> Jesper Krogh <> wrote:
>>
>> > Sorry if it is obvious.. but what filesystem/OS are you using and
>> > do you have BBU-writeback on the main data catalog also?
>>
>> Sorry for not providing more context.
>>
>> ATHENA:/var/pgsql/data # uname -a
>> Linux ATHENA 2.6.16.60-0.39.3-smp #1 SMP Mon May 11 11:46:34 UTC
>> 2009 x86_64 x86_64 x86_64 GNU/Linux
>> ATHENA:/var/pgsql/data # cat /etc/SuSE-release
>> SUSE Linux Enterprise Server 10 (x86_64)
>> VERSION = 10
>> PATCHLEVEL = 2
>>
>> File system is xfs noatime,nobarrier for all data; OS is on ext3.  I
>> *think* the pg_xlog mirrored pair is hanging off the same
>> BBU-writeback controller as the big RAID, but I'd have to track down
>> the hardware tech to confirm, and he's out today.  System has 16
>> Xeon CPUs and 64 GB RAM.
>
> I would be surprised if the RAID controller had a BBU-writeback cache.
> I don't think having xlog share a BBU-writeback makes things slower, and
> if it does, I would love for someone to explain why.

I believe when this discussion has come up in the past, the biggest
difference was attributed to WAL and data sharing the same file
system (rather than to pg_xlog being on separate spindles).  I recall
a noticeable performance gain even from having two file systems on
the same logical RAID device.

From:
"Kevin Grittner"
Date:

"Kevin Grittner" <> wrote:
> Jesper Krogh <> wrote:
>
>> Sorry if it is obvious.. but what filesystem/OS are you using and
>> do you have BBU-writeback on the main data catalog also?

> File system is xfs noatime,nobarrier for all data; OS is on ext3.
> I *think* the pg_xlog mirrored pair is hanging off the same
> BBU-writeback controller as the big RAID, but I'd have to track
> down the hardware tech to confirm

Another example of why I shouldn't trust my memory.  Per the
hardware tech:


OS:      /dev/sda   is RAID1  -  2  x  2.5" 15k SAS disk
pg_xlog: /dev/sdb   is RAID1  -  2  x  2.5" 15k SAS disk

These reside on a ServeRAID-MR10k controller with 256MB BB cache.


data:    /dev/sdc   is RAID5  -  30 x 3.5" 15k SAS disk

These reside on the DS3200 disk subsystem with 512MB BB cache per
controller and redundant drive loops.


At least I had the file systems and options right.  ;-)

-Kevin

From:
Alvaro Herrera
Date:

Kevin Grittner wrote:

> Another example of why I shouldn't trust my memory.  Per the
> hardware tech:
>
>
> OS:      /dev/sda   is RAID1  -  2  x  2.5" 15k SAS disk
> pg_xlog: /dev/sdb   is RAID1  -  2  x  2.5" 15k SAS disk
>
> These reside on a ServeRAID-MR10k controller with 256MB BB cache.
>
>
> data:    /dev/sdc   is RAID5  -  30 x 3.5" 15k SAS disk
>
> These reside on the DS3200 disk subsystem with 512MB BB cache per
> controller and redundant drive loops.

Hmm, so maybe the performance benefit is not from it being on a separate
array, but from it being RAID1 instead of RAID5?

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

From:
Aidan Van Dyk
Date:

* Alvaro Herrera <> [100211 12:58]:
> Hmm, so maybe the performance benefit is not from it being on a separate
> array, but from it being RAID1 instead of RAID5?

Or the cumulative effects of:
1) Dedicated spindles/RAID1
2) More BBU cache available (I can't imagine the OS pair writing much)
3) Not being queued behind data writes before getting to the controller
4) Not waiting for BBU cache to be available (which is shared with all
   data writes), which requires RAID5 writes to complete...

Really, there are *lots* of variables here.  The basics being that WAL
on the same FS as data, on a RAID5, even with BBU, is worse than WAL
on a dedicated set of RAID1 spindles with its own BBU.

Wow!

;-)


--
Aidan Van Dyk                                             Create like a god,
                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

From:
"Kevin Grittner"
Date:

Aidan Van Dyk <> wrote:
> Alvaro Herrera <> wrote:
>> Hmm, so maybe the performance benefit is not from it being on a
>> separate array, but from it being RAID1 instead of RAID5?
>
> Or the cumulative effects of:
> 1) Dedicated spindles/RAID1
> 2) More BBU cache available (I can't imagine the OS pair writing
>    much)
> 3) Not being queued behind data writes before getting to the
>    controller
> 4) Not waiting for BBU cache to be available (which is shared with
>    all data writes), which requires RAID5 writes to complete...
>
> Really, there are *lots* of variables here.  The basics being that
> WAL on the same FS as data, on a RAID5, even with BBU, is worse
> than WAL on a dedicated set of RAID1 spindles with its own BBU.
>
> Wow!

Sure, OK, but what surprised me was that a set of 15 read-only
queries (with pretty random reads) took almost twice as long when
the WAL files were on the same file system.  That's with OS writes
being only about 10% of reads, and *that's* with 128 GB of RAM which
keeps a lot of the reads from having to go to the disk.  I would not
have expected that a read-mostly environment like this would be that
sensitive to the WAL file placement.  (OK, I *did* request the
separate file system for them anyway, but I thought it was going to
be a marginal benefit, not something this big.)

-Kevin

From:
"Kevin Grittner"
Date:

Hannu Krosing  wrote:

> Can it be, that each request does at least 1 write op (update
> session or log something) ?

Well, the web application connects through a login which only has
SELECT rights; but, in discussing a previous issue we've pretty well
established that it's not unusual for a read to force a dirty buffer
to write to the OS.  Perhaps this is the issue here again.  Nothing
is logged on the database server for every request.

> If you can, set
>
> synchronous_commit = off;
>
> and see if it further increases performance.

I wonder if it might also pay to make the background writer even more
aggressive than we have, so that SELECT-only queries don't spend so
much time writing pages.  Anyway, given that these are replication
targets, and aren't the "database of origin" for any data of their
own, I guess there's no reason not to try asynchronous commit.
Thanks for the suggestion.

-Kevin



From:
Alvaro Herrera
Date:

Kevin Grittner wrote:
> Hannu Krosing  wrote:
>
> > Can it be, that each request does at least 1 write op (update
> > session or log something) ?
>
> Well, the web application connects through a login which only has
> SELECT rights; but, in discussing a previous issue we've pretty well
> established that it's not unusual for a read to force a dirty buffer
> to write to the OS.  Perhaps this is the issue here again.  Nothing
> is logged on the database server for every request.

I don't think it explains it, because dirty buffers are obviously
written to the data area, not pg_xlog.

> I wonder if it might also pay to make the background writer even more
> aggressive than we have, so that SELECT-only queries don't spend so
> much time writing pages.

That's worth trying.

> Anyway, given that these are replication
> targets, and aren't the "database of origin" for any data of their
> own, I guess there's no reason not to try asynchronous commit.

Yeah; since the transactions only ever write commit records to WAL, it
wouldn't matter a bit that they are lost on crash.  And you should see
an improvement, because they wouldn't have to flush at all.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

From:
Alvaro Herrera
Date:

Alvaro Herrera wrote:
> Kevin Grittner wrote:

> > Anyway, given that these are replication
> > targets, and aren't the "database of origin" for any data of their
> > own, I guess there's no reason not to try asynchronous commit.
>
> Yeah; since the transactions only ever write commit records to WAL, it
> wouldn't matter a bit that they are lost on crash.  And you should see
> an improvement, because they wouldn't have to flush at all.

Actually, a transaction that performed no writes doesn't get a commit
WAL record written, so it shouldn't make any difference at all.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

From:
"Kevin Grittner"
Date:

Alvaro Herrera <> wrote:
> Alvaro Herrera wrote:
>> Kevin Grittner wrote:
>
>> > Anyway, given that these are replication targets, and aren't
>> > the "database of origin" for any data of their own, I guess
>> > there's no reason not to try asynchronous commit.
>>
>> Yeah; since the transactions only ever write commit records to
>> WAL, it wouldn't matter a bit that they are lost on crash.  And
>> you should see an improvement, because they wouldn't have to
>> flush at all.
>
> Actually, a transaction that performed no writes doesn't get a
> commit WAL record written, so it shouldn't make any difference at
> all.

Well, concurrent to the web application is the replication.  Would
asynchronous commit of that potentially alter the pattern of writes
such that it had less impact on the reads?  I'm thinking, again, of
why the placement of the pg_xlog on a separate file system made such
a dramatic difference to the read-only response time -- might it
make less difference if the replication was using asynchronous
commit?

For background, our replication system works by having each target
keep track of how far it has replicated in the transaction stream
of each source, so as long as a *later* transaction from a source is
never persisted before an *earlier* one, there's no risk of data
loss; it's strictly a performance issue.  It will be able to catch
up from wherever it is in the transaction stream when it comes back
up after any down time (planned or otherwise).

By the way, I have no complaints about the performance with the
pg_xlog directory on its own file system (although if it could be
*even faster* with a configuration change I will certainly take
advantage of that).  I do like to understand the dynamics of these
things when I can, though.

-Kevin

From:
Alvaro Herrera
Date:

Kevin Grittner wrote:
> Alvaro Herrera <> wrote:

> > Actually, a transaction that performed no writes doesn't get a
> > commit WAL record written, so it shouldn't make any difference at
> > all.
>
> Well, concurrent to the web application is the replication.  Would
> asynchronous commit of that potentially alter the pattern of writes
> such that it had less impact on the reads?

Well, certainly async commit would completely change the pattern of
writes: it would give the controller an opportunity to reorder them
according to some scheduler.  Otherwise they are strictly serialized.

> I'm thinking, again, of
> why the placement of the pg_xlog on a separate file system made such
> a dramatic difference to the read-only response time -- might it
> make less difference if the replication was using asynchronous
> commit?

Yeah, I think it would have been less noticeable, but this is all
theoretical.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

From:
Greg Smith
Date:

Kevin Grittner wrote:
> I wonder if it might also pay to make the background writer even more
> aggressive than we have, so that SELECT-only queries don't spend so
> much time writing pages.
You can easily quantify if the BGW is aggressive enough.  Buffers leave
the cache three ways, and they each show up as separate counts in
pg_stat_bgwriter:  buffers_checkpoint, buffers_clean (the BGW), and
buffers_backend (the queries).  Cranking it up further tends to shift
writes out of buffers_backend, which are the ones you want to avoid,
toward buffers_clean instead.  If buffers_backend is already low on a
percentage basis compared to the other two, there's little benefit in
trying to make the BGW do more.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
  www.2ndQuadrant.com


From:
"Kevin Grittner"
Date:

Greg Smith <> wrote:

> You can easily quantify if the BGW is aggressive enough.  Buffers
> leave the cache three ways, and they each show up as separate
> counts in pg_stat_bgwriter:  buffers_checkpoint, buffers_clean
> (the BGW), and buffers_backend (the queries).  Cranking it up
> further tends to shift writes out of buffers_backend, which are
> the ones you want to avoid, toward buffers_clean instead.  If
> buffers_backend is already low on a percentage basis compared to
> the other two, there's little benefit in trying to make the BGW do
> more.

Here are the values from our two largest and busiest systems (where
we found the pg_xlog placement to matter so much).  It looks to me
like a more aggressive bgwriter would help, yes?

cir=> select * from pg_stat_bgwriter ;
-[ RECORD 1 ]------+------------
checkpoints_timed  | 125996
checkpoints_req    | 16932
buffers_checkpoint | 342972024
buffers_clean      | 343634920
maxwritten_clean   | 9928
buffers_backend    | 575589056
buffers_alloc      | 52397855471

cir=> select * from pg_stat_bgwriter ;
-[ RECORD 1 ]------+------------
checkpoints_timed  | 125992
checkpoints_req    | 16840
buffers_checkpoint | 260358442
buffers_clean      | 474768152
maxwritten_clean   | 9778
buffers_backend    | 565837397
buffers_alloc      | 71463873477
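Running Greg's percentage check against the first system's counters (a sketch, simple integer arithmetic):

```shell
# Share of buffer writes done by backends, per Greg's rule of thumb.
# Counters are from the first pg_stat_bgwriter output above.
checkpoint=342972024
clean=343634920
backend=575589056
total=$(( checkpoint + clean + backend ))
echo "buffers_backend: $(( 100 * backend / total ))% of all buffer writes"
```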

Current settings:

bgwriter_delay = '200ms'
bgwriter_lru_maxpages = 1000
bgwriter_lru_multiplier = 4

Any suggestions on how far to push it?

-Kevin

From:
Ben Chobot
Date:

On Feb 23, 2010, at 2:23 PM, Kevin Grittner wrote:


> Here are the values from our two largest and busiest systems (where
> we found the pg_xlog placement to matter so much).  It looks to me
> like a more aggressive bgwriter would help, yes?
>
> cir=> select * from pg_stat_bgwriter ;
> -[ RECORD 1 ]------+------------
> checkpoints_timed  | 125996
> checkpoints_req    | 16932
> buffers_checkpoint | 342972024
> buffers_clean      | 343634920
> maxwritten_clean   | 9928
> buffers_backend    | 575589056
> buffers_alloc      | 52397855471
>
> cir=> select * from pg_stat_bgwriter ;
> -[ RECORD 1 ]------+------------
> checkpoints_timed  | 125992
> checkpoints_req    | 16840
> buffers_checkpoint | 260358442
> buffers_clean      | 474768152
> maxwritten_clean   | 9778
> buffers_backend    | 565837397
> buffers_alloc      | 71463873477
>
> Current settings:
>
> bgwriter_delay = '200ms'
> bgwriter_lru_maxpages = 1000
> bgwriter_lru_multiplier = 4
>
> Any suggestions on how far to push it?

I don't know how far to push it, but you could start by reducing the delay time and observe how that affects performance.
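A minimal sketch of that experiment, assuming the conf file lives under the data directory shown earlier in the thread (the bgwriter_* settings take effect on a reload, no restart needed):

```shell
# Sketch: halve bgwriter_delay; the conf path is an assumption.
CONF=/var/pgsql/data/postgresql.conf
sed -i "s/^bgwriter_delay = '200ms'/bgwriter_delay = '100ms'/" "$CONF"
# then: pg_ctl reload -D /var/pgsql/data   # bgwriter_* apply on reload
```

Observe response times and pg_stat_bgwriter for a while before pushing further.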