Thread: partition question for new server setup

From:
Whit Armstrong
Date:

I have the opportunity to set up a new postgres server for our
production database.  I've read several times in various postgres
lists about the importance of separating logs from the actual database
data to avoid disk contention.

Can someone suggest a typical partitioning scheme for a postgres server?

My initial thought was to create /var/lib/postgresql as a partition on
a separate set of disks.

However, I can see that the xlog files will be stored here as well:
http://www.postgresql.org/docs/8.3/interactive/storage-file-layout.html

Should the xlog files be stored on a separate partition to improve performance?

Any suggestions would be very helpful.  Or if there is a document that
lays out some best practices for server setup, that would be great.

The database usage will be read heavy (financial data) with batch
writes occurring overnight and occasionally during the day.

server information:
Dell PowerEdge 2970, 8 core Opteron 2384
6 1TB hard drives with a PERC 6i
64GB of ram

We will be running Ubuntu 9.04.

Thanks in advance,
Whit

From:
Scott Marlowe
Date:

On Tue, Apr 28, 2009 at 10:56 AM, Whit Armstrong
<> wrote:
> I have the opportunity to set up a new postgres server for our
> production database.  I've read several times in various postgres
> lists about the importance of separating logs from the actual database
> data to avoid disk contention.
>
> Can someone suggest a typical partitioning scheme for a postgres server?

At work I have 16 SAS disks.  They are set up with 12 in a RAID-10, 2
in a RAID-1, and 2 hot spares.

The OS, /var/log, and postgres base go in the RAID-1.  I then create a
new data directory on the RAID-10, shut down pgsql, copy the base
directory over to the RAID-10 and replace the base dir in the pg data
directory with a link to the RAID-10's base directory and restart
postgres.  So, my pg_xlog and all OS and logging stuff goes on the
RAID-10 and the main store for the db goes on the RAID-10.
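The copy-and-symlink relocation described above can be sketched as follows. The paths here are scratch placeholders standing in for the real data directory and RAID-10 mount point, and the stop/start steps are only noted in comments:

```shell
# Sketch of the base-directory move, using scratch dirs as placeholder
# paths (not the actual /var/lib layout).
set -e
PGDATA=$(mktemp -d)            # stands in for the postgres data directory
RAID10=$(mktemp -d)            # stands in for the RAID-10 mount point
mkdir -p "$PGDATA/base"
echo "relation data" > "$PGDATA/base/16384"

# (stop postgres here on a real system)
cp -a "$PGDATA/base" "$RAID10/base"    # copy base onto the RAID-10
mv "$PGDATA/base" "$PGDATA/base.orig"  # keep the original until verified
ln -s "$RAID10/base" "$PGDATA/base"    # replace base with a symlink
# (restart postgres here)

cat "$PGDATA/base/16384"               # postgres now reads via the symlink
```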

From:
Whit Armstrong
Date:

Thanks, Scott.

Just to clarify you said:

> postgres.  So, my pg_xlog and all OS and logging stuff goes on the
> RAID-10 and the main store for the db goes on the RAID-10.

Is that meant to be that the pg_xlog and all OS and logging stuff go
on the RAID-1 and the real database (the
/var/lib/postgresql/8.3/main/base directory) goes on the RAID-10
partition?

This is very helpful.  Thanks for your feedback.

Additionally are there any clear choices w/ regard to filesystem
types?  Our choices would be xfs, ext3, or ext4.

Is anyone out there running ext4 on a production system?

-Whit

From:
Scott Marlowe
Date:

On Tue, Apr 28, 2009 at 11:48 AM, Whit Armstrong
<> wrote:
> Thanks, Scott.
>
> Just to clarify you said:
>
>> postgres.  So, my pg_xlog and all OS and logging stuff goes on the
>> RAID-10 and the main store for the db goes on the RAID-10.
>
> Is that meant to be that the pg_xlog and all OS and logging stuff go
> on the RAID-1 and the real database (the
> /var/lib/postgresql/8.3/main/base directory) goes on the RAID-10
> partition?

Yeah, an extra 0 jumped in there.  Faulty keyboard I guess. :)  OS
and everything but base is on the RAID-1.

> This is very helpful.  Thanks for your feedback.
>
> Additionally are there any clear choices w/ regard to filesystem
> types?  Our choices would be xfs, ext3, or ext4.

Well, there's a lot of people who use xfs and ext3.  XFS is generally
rated higher than ext3 both for performance and reliability.  However,
we run Centos 5 in production, and XFS isn't one of the blessed file
systems it comes with, so we're running ext3.  It's worked quite well
for us.

From:
Kenneth Marshall
Date:

On Tue, Apr 28, 2009 at 11:56:25AM -0600, Scott Marlowe wrote:
> On Tue, Apr 28, 2009 at 11:48 AM, Whit Armstrong
> <> wrote:
> > Thanks, Scott.
> >
> > Just to clarify you said:
> >
> >> postgres.  So, my pg_xlog and all OS and logging stuff goes on the
> >> RAID-10 and the main store for the db goes on the RAID-10.
> >
> > Is that meant to be that the pg_xlog and all OS and logging stuff go
> > on the RAID-1 and the real database (the
> > /var/lib/postgresql/8.3/main/base directory) goes on the RAID-10
> > partition?
>
> Yeah, and extra 0 jumped in there.  Faulty keyboard I guess. :)  OS
> and everything but base is on the RAID-1.
>
> > This is very helpful.  Thanks for your feedback.
> >
> > Additionally are there any clear choices w/ regard to filesystem
> > types?  Our choices would be xfs, ext3, or ext4.
>
> Well, there's a lot of people who use xfs and ext3.  XFS is generally
> rated higher than ext3 both for performance and reliability.  However,
> we run Centos 5 in production, and XFS isn't one of the blessed file
> systems it comes with, so we're running ext3.  It's worked quite well
> for us.
>

The other optimizations are using data=writeback when mounting the
ext3 filesystem for PostgreSQL and using elevator=deadline for
the disk scheduler. I do not know how you specify that on Ubuntu.
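On Ubuntu, the mount option would typically go in /etc/fstab; a sketch (device and mount point are examples, not a known layout):

```
# /etc/fstab -- postgres filesystem mounted with data=writeback
# (device name and mount point below are hypothetical)
/dev/sdb1  /var/lib/postgresql  ext3  noatime,data=writeback  0  2
```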

Cheers,
Ken

From:
Craig James
Date:

Whit Armstrong wrote:
> I have the opportunity to set up a new postgres server for our
> production database.  I've read several times in various postgres
> lists about the importance of separating logs from the actual database
> data to avoid disk contention.
>
> Can someone suggest a typical partitioning scheme for a postgres server?
>
> My initial thought was to create /var/lib/postgresql as a partition on
> a separate set of disks.
>
> However, I can see that the xlog files will be stored here as well:
> http://www.postgresql.org/docs/8.3/interactive/storage-file-layout.html
>
> Should the xlog files be stored on a separate partition to improve performance?
>
> Any suggestions would be very helpful.  Or if there is a document that
> lays out some best practices for server setup, that would be great.
>
> The database usage will be read heavy (financial data) with batch
> writes occurring overnight and occassionally during the day.
>
> server information:
> Dell PowerEdge 2970, 8 core Opteron 2384
> 6 1TB hard drives with a PERC 6i
> 64GB of ram

We're running a similar configuration: PowerEdge 8 core, PERC 6i, but we have 8 of the 2.5" 10K 384GB disks.

When I asked the same question on this forum, I was advised to just put all 8 disks into a single RAID 10, and forget
about separating things.  The performance of a battery-backed PERC 6i (you did get a battery-backed cache, right?) with
8 disks is quite good.

In order to separate the logs, OS and data, I'd have to split off at least two of the 8 disks, leaving only six for the
RAID 10 array.  But then my xlogs would be on a single disk, which might not be safe.  A more robust approach would be
to split off four of the disks, put the OS on a RAID 1, the xlog on a RAID 1, and the database data on a 4-disk RAID 10.
Now I've separated the data, but my primary partition has lost half its disks.

So, I took the advice, and just made one giant 8-disk RAID 10, and I'm very happy with it.  It has everything:
Postgres, OS and logs.  But since the RAID array is 8 disks instead of 4, the net performance seems to be quite good.

But ... your mileage may vary.  My box has just one thing running on it: Postgres.  There is almost no other disk
activity to interfere with the file-system caching.  If your server is going to have a bunch of other activity that
generates a lot of non-Postgres disk activity, then this advice might not apply.

Craig


From:
Alan Hodgson
Date:

On Tuesday 28 April 2009, Whit Armstrong <> wrote:
> Additionally are there any clear choices w/ regard to filesystem
> types?  Our choices would be xfs, ext3, or ext4.

xfs consistently delivers much higher sequential throughput than ext3 (up to
100%), at least on my hardware.

--
Even a sixth-grader can figure out that you can’t borrow money to pay off
your debt

From:
Craig James
Date:

Kenneth Marshall wrote:
>>> Additionally are there any clear choices w/ regard to filesystem
>>> types?  Our choices would be xfs, ext3, or ext4.
>> Well, there's a lot of people who use xfs and ext3.  XFS is generally
>> rated higher than ext3 both for performance and reliability.  However,
>> we run Centos 5 in production, and XFS isn't one of the blessed file
>> systems it comes with, so we're running ext3.  It's worked quite well
>> for us.
>>
>
> The other optimizations are using data=writeback when mounting the
> ext3 filesystem for PostgreSQL and using the elevator=deadline for
> the disk driver. I do not know how you specify that for Ubuntu.

After reading various articles, I thought that "noop" was the right choice when you're using a battery-backed RAID
controller. The RAID controller is going to cache all data and reschedule the writes anyway, so the kernel scheduler is
irrelevant at best, and can slow things down.

On Ubuntu, it's

  echo noop >/sys/block/hdx/queue/scheduler

where "hdx" is replaced by the appropriate device.

Craig


From:
"Kevin Grittner"
Date:

Craig James <> wrote:

> After a reading various articles, I thought that "noop" was the
> right choice when you're using a battery-backed RAID controller.
> The RAID controller is going to cache all data and reschedule the
> writes anyway, so the kernal schedule is irrelevant at best, and can
> slow things down.

Wouldn't that depend on the relative sizes of those caches?  In a
not-so-hypothetical example, we have machines with 120 GB OS cache,
and 256 MB BBU RAID controller cache.  We seem to benefit from
elevator=deadline at the OS level.

-Kevin

From:
Whit Armstrong
Date:

>  echo noop >/sys/block/hdx/queue/scheduler

can this go into /etc/init.d somewhere?

or does that change stick between reboots?

-Whit



From:
Kenneth Marshall
Date:

On Tue, Apr 28, 2009 at 01:30:59PM -0500, Kevin Grittner wrote:
> Craig James <> wrote:
>
> > After a reading various articles, I thought that "noop" was the
> > right choice when you're using a battery-backed RAID controller.
> > The RAID controller is going to cache all data and reschedule the
> > writes anyway, so the kernal schedule is irrelevant at best, and can
> > slow things down.
>
> Wouldn't that depend on the relative sizes of those caches?  In a
> not-so-hypothetical example, we have machines with 120 GB OS cache,
> and 256 MB BBU RAID controller cache.  We seem to benefit from
> elevator=deadline at the OS level.
>
> -Kevin
>
This was my understanding as well. If your RAID controller has a
lot of well-managed cache, then the noop scheduler is a win. Less
performant RAID controllers benefit from the deadline scheduler.

Cheers,
Ken

From:
"Kevin Grittner"
Date:

Whit Armstrong <> wrote:
>>   echo noop >/sys/block/hdx/queue/scheduler
>
> can this go into /etc/init.d somewhere?

We set the default for the kernel in the /boot/grub/menu.lst file.  On
a kernel line, add  elevator=xxx (where xxx is your choice of
scheduler).
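Concretely, the edited stanza might look like this (kernel version and root device below are examples, not taken from the thread):

```
# /boot/grub/menu.lst -- append elevator= to the kernel line
title  Ubuntu 9.04, kernel 2.6.28-11-server
kernel /boot/vmlinuz-2.6.28-11-server root=/dev/sda1 ro elevator=deadline
initrd /boot/initrd.img-2.6.28-11-server
```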

-Kevin


From:
Whit Armstrong
Date:

I see.

Thanks for everyone for replying.  The whole discussion has been very helpful.

Cheers,
Whit



From:
Scott Marlowe
Date:

On Tue, Apr 28, 2009 at 12:06 PM, Kenneth Marshall <> wrote:
> On Tue, Apr 28, 2009 at 11:56:25AM -0600, Scott Marlowe wrote:
>> On Tue, Apr 28, 2009 at 11:48 AM, Whit Armstrong
>> <> wrote:
>> > Thanks, Scott.
>> >
>> > Just to clarify you said:
>> >
> >> >> postgres.  So, my pg_xlog and all OS and logging stuff goes on the
>> >> RAID-10 and the main store for the db goes on the RAID-10.
>> >
>> > Is that meant to be that the pg_xlog and all OS and logging stuff go
>> > on the RAID-1 and the real database (the
>> > /var/lib/postgresql/8.3/main/base directory) goes on the RAID-10
>> > partition?
>>
>> Yeah, and extra 0 jumped in there.  Faulty keyboard I guess. :)  OS
>> and everything but base is on the RAID-1.
>>
> >> > This is very helpful.  Thanks for your feedback.
>> >
>> > Additionally are there any clear choices w/ regard to filesystem
> >> > types?  Our choices would be xfs, ext3, or ext4.
>>
>> Well, there's a lot of people who use xfs and ext3.  XFS is generally
>> rated higher than ext3 both for performance and reliability.  However,
>> we run Centos 5 in production, and XFS isn't one of the blessed file
>> systems it comes with, so we're running ext3.  It's worked quite well
>> for us.
>>
>
> The other optimizations are using data=writeback when mounting the
> ext3 filesystem for PostgreSQL and using the elevator=deadline for
> the disk driver. I do not know how you specify that for Ubuntu.

Yeah, we set the scheduler to deadline on our db servers and it
dropped the load and io wait noticeably, even with our rather fast
arrays and controller.  We also use data=writeback.

From:
Scott Marlowe
Date:

On Tue, Apr 28, 2009 at 12:37 PM, Whit Armstrong
<> wrote:
>>  echo noop >/sys/block/hdx/queue/scheduler
>
> can this go into /etc/init.d somewhere?
>
> or does that change stick between reboots?

I just stick it in /etc/rc.local
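For example (the device name below is a placeholder for the array's block device on your system):

```
# /etc/rc.local -- re-apply the scheduler choice on every boot
echo noop > /sys/block/sda/queue/scheduler   # sda: example device name
exit 0
```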

From:
Scott Marlowe
Date:

On Tue, Apr 28, 2009 at 12:40 PM, Kenneth Marshall <> wrote:
> On Tue, Apr 28, 2009 at 01:30:59PM -0500, Kevin Grittner wrote:
>> Craig James <> wrote:
>>
>> > After a reading various articles, I thought that "noop" was the
>> > right choice when you're using a battery-backed RAID controller.
>> > The RAID controller is going to cache all data and reschedule the
>> > writes anyway, so the kernal schedule is irrelevant at best, and can
>> > slow things down.
>>
>> Wouldn't that depend on the relative sizes of those caches?  In a
>> not-so-hypothetical example, we have machines with 120 GB OS cache,
>> and 256 MB BBU RAID controller cache.  We seem to benefit from
>> elevator=deadline at the OS level.
>>
>> -Kevin
>>
> This was my understanding as well. If your RAID controller had a
> lot of well managed cache, then the noop scheduler was a win. Less
> performant RAID controllers benefit from teh deadline scheduler.

I have an Areca 1680ix with 512M cache on a machine with 32Gig ram and
I get slightly better performance and lower load factors from deadline
than from noop, but it's not by much.

From:
Scott Carey
Date:

On 4/28/09 11:16 AM, "Craig James" <> wrote:

> Kenneth Marshall wrote:
>>>> Additionally are there any clear choices w/ regard to filesystem
>>>> types?  Our choices would be xfs, ext3, or ext4.
>>> Well, there's a lot of people who use xfs and ext3.  XFS is generally
>>> rated higher than ext3 both for performance and reliability.  However,
>>> we run Centos 5 in production, and XFS isn't one of the blessed file
>>> systems it comes with, so we're running ext3.  It's worked quite well
>>> for us.
>>>
>>
>> The other optimizations are using data=writeback when mounting the
>> ext3 filesystem for PostgreSQL and using the elevator=deadline for
>> the disk driver. I do not know how you specify that for Ubuntu.
>
> After a reading various articles, I thought that "noop" was the right choice
> when you're using a battery-backed RAID controller.  The RAID controller is
> going to cache all data and reschedule the writes anyway, so the kernal
> schedule is irrelevant at best, and can slow things down.
>
> On Ubuntu, it's
>
>   echo noop >/sys/block/hdx/queue/scheduler
>
> where "hdx" is replaced by the appropriate device.
>
> Craig
>

I've always had better performance from deadline than noop, no matter what
raid controller I have.  Perhaps with a really good one or a SAN that
changes (NOT a PERC 6 mediocre thingamabob).

PERC 6 really, REALLY needs to have the Linux "readahead" value set to at
least 1MB per effective spindle to get good sequential read performance.
Xfs helps with it too, but you can mitigate half of the ext3 vs xfs
sequential access performance gap with high readahead settings:

/sbin/blockdev --setra <value> <device>

Value is in blocks (512 bytes)

/sbin/blockdev --getra <device> to see its setting.   Google for more info.
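As a worked example of the sector math (the spindle count is an assumption for a 6-disk RAID 10, i.e. three mirrored pairs striped, giving 3 effective spindles):

```shell
# blockdev --setra takes 512-byte sectors, so "1 MB per effective
# spindle" on a 6-disk RAID 10 (3 effective spindles) works out to:
SPINDLES=3
BYTES_PER_SPINDLE=$((1024 * 1024))     # 1 MB
SECTORS=$(( SPINDLES * BYTES_PER_SPINDLE / 512 ))
echo "$SECTORS"                        # value to pass to --setra
```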

>
> --
> Sent via pgsql-performance mailing list ()
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>


From:
Scott Carey
Date:

>>
>> server information:
>> Dell PowerEdge 2970, 8 core Opteron 2384
>> 6 1TB hard drives with a PERC 6i
>> 64GB of ram
>
> We're running a similar configuration: PowerEdge 8 core, PERC 6i, but we have
> 8 of the 2.5" 10K 384GB disks.
>
> When I asked the same question on this forum, I was advised to just put all 8
> disks into a single RAID 10, and forget about separating things.  The
> performance of a battery-backed PERC 6i (you did get a battery-backed cache,
> right?) with 8 disks is quite good.
>
> In order to separate the logs, OS and data, I'd have to split off at least two
> of the 8 disks, leaving only six for the RAID 10 array.  But then my xlogs
> would be on a single disk, which might not be safe.  A more robust approach
> would be to split off four of the disks, put the OS on a RAID 1, the xlog on a
> RAID 1, and the database data on a 4-disk RAID 10.  Now I've separated the
> data, but my primary partition has lost half its disks.
>
> So, I took the advice, and just made one giant 8-disk RAID 10, and I'm very
> happy with it.  It has everything: Postgres, OS and logs.  But since the RAID
> array is 8 disks instead of 4, the net performance seems to quite good.
>

If you go this route, there are a few risks:
1.  If everything is on the same partition/file system, fsyncs from the
xlogs may cross-pollute to the data.  Ext3 is notorious for this; though
data=writeback limits the effect, you especially might not want
data=writeback on your OS partition.  I would recommend that the OS, data,
and xlogs live on three different partitions regardless of the number
of logical RAID volumes.
2. Cheap raid controllers (PERC, others) will see fsync for an array and
flush everything that is dirty (not just the partition or file data), which
is a concern if you aren't using it in write-back with battery backed cache,
even for a very read heavy db that doesn't need high fsync speed for
transactions.

> But ... your mileage may vary.  My box has just one thing running on it:
> Postgres.  There is almost no other disk activity to interfere with the
> file-system caching.  If your server is going to have a bunch of other
> activity that generate a lot of non-Postgres disk activity, then this advice
> might not apply.
>
> Craig
>

6 and 8 disk counts are tough.  My biggest single piece of advice is to have
the xlogs in a partition separate from the data (not necessarily a different
raid logical volume), with file system and mount options tuned for each case
separately.  I've seen this alone improve performance by a factor of 2.5 on
some file system / storage combinations.



From:
Whit Armstrong
Date:

are there any other xfs settings that should be tuned for postgres?

I see this post mentions "allocation groups."  does anyone have
suggestions for those settings?
http://archives.postgresql.org/pgsql-admin/2009-01/msg00144.php

what about raid stripe size?  does it really make a difference?  I
think the default for the perc is 64kb (but I'm not in front of the
server right now).

-Whit



From:
Whit Armstrong
Date:

Thanks, Scott.

So far, I've followed a pattern similar to Scott Marlowe's setup.  I
have configured 2 disks as a RAID 1 volume, and 4 disks as a RAID 10
volume.  So, the OS and xlogs will live on the RAID 1 vol and the data
will live on the RAID 10 vol.

I'm running the memtest on it now, so we still haven't locked
ourselves into any choices.

regarding your comment:
> 6 and 8 disk counts are tough.  My biggest single piece of advise is to have
> the xlogs in a partition separate from the data (not necessarily a different
> raid logical volume), with file system and mount options tuned for each case
> separately.  I've seen this alone improve performance by a factor of 2.5 on
> some file system / storage combinations.

can you suggest mount options for the various partitions?  I'm leaning
towards xfs for the filesystem format unless someone complains loudly
about data corruption on xfs for a recent 2.6 kernel.

-Whit



From:
Scott Marlowe
Date:

On Tue, Apr 28, 2009 at 5:58 PM, Scott Carey <> wrote:

> 1.  If everything is on the same partition/file system, fsyncs from the
> xlogs may cross-pollute to the data.  Ext3 is notorious for this; though
> data=writeback limits the effect, you especially might not want
> data=writeback on your OS partition.  I would recommend that the OS, data,
> and xlogs live on three different partitions regardless of the number
> of logical RAID volumes.

Note that I remember reading some comments a while back that just
having a different file system, on the same logical set, makes things
faster.  I.e. a partition for OS, one for xlog and one for pgdata on
the same large logical volume was noticeably faster than having it all
on the same big partition on a single logical volume.

From:
Scott Carey
Date:

On 4/28/09 5:02 PM, "Whit Armstrong" <> wrote:

> are there any other xfs settings that should be tuned for postgres?
>
> I see this post mentions "allocation groups."  does anyone have
> suggestions for those settings?
> http://archives.postgresql.org/pgsql-admin/2009-01/msg00144.php
>
> what about raid stripe size?  does it really make a difference?  I
> think the default for the perc is 64kb (but I'm not in front of the
> server right now).
>

When I tested a PERC I couldn't tell the difference between the 64k and 256k
settings.  The other settings that looked like they might improve things all
had worse performance (other than write back cache of course).

Also, if you have partitions at all on the data device, you'll want to try
and stripe align it.  The easiest way is to simply put the file system on
the raw device rather than a partition (e.g. /dev/sda rather than
/dev/sda1).  Partition alignment can be very annoying to do well.  It will
affect performance a little, less so with larger stripe sizes.




From:
Scott Carey
Date:

On 4/28/09 5:10 PM, "Whit Armstrong" <> wrote:

> Thanks, Scott.
>
> So far, I've followed a pattern similar to Scott Marlowe's setup.  I
> have configured 2 disks as a RAID 1 volume, and 4 disks as a RAID 10
> volume.  So, the OS and xlogs will live on the RAID 1 vol and the data
> will live on the RAID 10 vol.
>
> I'm running the memtest on it now, so we still haven't locked
> ourselves into any choices.
>

It's a fine option -- the only way to know whether one big volume with separate
partitions is better is to test your actual application, since it is highly
dependent on the use case.


> regarding your comment:
>> 6 and 8 disk counts are tough.  My biggest single piece of advise is to have
>> the xlogs in a partition separate from the data (not necessarily a different
>> raid logical volume), with file system and mount options tuned for each case
>> separately.  I've seen this alone improve performance by a factor of 2.5 on
>> some file system / storage combinations.
>
> can you suggest mount options for the various partitions?  I'm leaning
> towards xfs for the filesystem format unless someone complains loudly
> about data corruption on xfs for a recent 2.6 kernel.
>
> -Whit
>

I went with ext3 for the OS -- it makes Ops feel a lot better. ext2 for a
separate xlogs partition, and xfs for the data.
ext2's drawbacks are not relevant for a small partition with just xlog data,
but are a problem for the OS.

For a setup like yours, xlog speed is not going to limit you.
I suggest a partition for the OS with default ext3 mount options, and a
second ext3 partition mounted with data=writeback for postgres and the
xlogs, minus the data.

ext3 with default data=ordered on the xlogs causes performance issues as
others have mentioned here.  But data=ordered is probably the right thing
for the OS.  Your xlogs will not be a bottleneck and will probably be fine
either way -- and this is a mount-time option so you can switch.

I went with xfs for the data partition, and did not see benefit from
anything other than the 'noatime' mount option.  The default xfs settings
are fine, and the raid specific formatting options are primarily designed to
help raid 5 or 6 out.
If you go with ext3 for the data partition, make sure its data=writeback
with 'noatime'.  Both of these are mount time options.
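Pulled together as an /etc/fstab sketch (device names and mount points are examples; the options are the ones discussed above):

```
# OS: ext3 with the default data=ordered
/dev/sda1  /                ext3  defaults,noatime  0  1
# xlogs: ext2 (or ext3 with data=writeback) on a small partition
/dev/sda2  /pg_xlog         ext2  noatime           0  2
# data: xfs with noatime
/dev/sdb1  /var/lib/pgdata  xfs   noatime           0  2
```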

I said it before, but I'll repeat -- don't neglect the OS readahead setting
for the device, especially the data device.
Something like:
/sbin/blockdev --setra 8192 /dev/sd<X>
Where <X> is the right letter for your data raid volume
Will have a big impact on larger sequential scans.  This has to go in
rc.local or whatever script runs after boot on your distro.


>
> On Tue, Apr 28, 2009 at 7:58 PM, Scott Carey <> wrote:
>>>>
>>>> server information:
>>>> Dell PowerEdge 2970, 8 core Opteron 2384
>>>> 6 1TB hard drives with a PERC 6i
>>>> 64GB of ram
>>>
>>> We're running a similar configuration: PowerEdge 8 core, PERC 6i, but we
>>> have
>>> 8 of the 2.5" 10K 384GB disks.
>>>
>>> When I asked the same question on this forum, I was advised to just put all
>>> 8
>>> disks into a single RAID 10, and forget about separating things.  The
>>> performance of a battery-backed PERC 6i (you did get a battery-backed cache,
>>> right?) with 8 disks is quite good.
>>>
>>> In order to separate the logs, OS and data, I'd have to split off at
>>> least two of the 8 disks, leaving only six for the RAID 10 array.  But
>>> then my xlogs would be on a single disk, which might not be safe.  A more
>>> robust approach would be to split off four of the disks, put the OS on a
>>> RAID 1, the xlog on a RAID 1, and the database data on a 4-disk RAID 10.
>>> Now I've separated the data, but my primary partition has lost half its
>>> disks.
>>>
>>> So, I took the advice, and just made one giant 8-disk RAID 10, and I'm
>>> very happy with it.  It has everything: Postgres, OS and logs.  But since
>>> the RAID array is 8 disks instead of 4, the net performance seems quite
>>> good.
>>>
>>
>> If you go this route, there are a few risks:
>> 1.  If everything is on the same partition/file system, fsyncs from the
>> xlogs may cross-pollute to the data.  Ext3 is notorious for this, though
>> data=writeback limits the effect.  On the other hand, you probably don't
>> want data=writeback on your OS partition.  I would recommend that the OS, Data,
>> and xlogs + etc live on three different partitions regardless of the number
>> of logical RAID volumes.
>> 2. Cheap raid controllers (PERC, others) will see an fsync for an array and
>> flush everything that is dirty (not just the partition or file data), which
>> is a concern if you aren't running in write-back mode with a battery-backed
>> cache, even for a very read-heavy db that doesn't need high fsync speed for
>> transactions.
>>
>>> But ... your mileage may vary.  My box has just one thing running on it:
>>> Postgres.  There is almost no other disk activity to interfere with the
>>> file-system caching.  If your server is going to have a bunch of other
>>> activity that generate a lot of non-Postgres disk activity, then this advice
>>> might not apply.
>>>
>>> Craig
>>>
>>
>> 6 and 8 disk counts are tough.  My biggest single piece of advice is to have
>> the xlogs in a partition separate from the data (not necessarily a different
>> raid logical volume), with file system and mount options tuned for each case
>> separately.  I've seen this alone improve performance by a factor of 2.5 on
>> some file system / storage combinations.
>>
>>>
>>> --
>>> Sent via pgsql-performance mailing list ()
>>> To make changes to your subscription:
>>> http://www.postgresql.org/mailpref/pgsql-performance
>>>
>>
>>
>


From:
Scott Carey
Date:

On 4/29/09 7:28 AM, "Whit Armstrong" <> wrote:

> Thanks, Scott.
>
>> I went with ext3 for the OS -- it makes Ops feel a lot better. ext2 for a
>> separate xlogs partition, and xfs for the data.
>> ext2's drawbacks are not relevant for a small partition with just xlog data,
>> but are a problem for the OS.
>
> Can you suggest an appropriate size for the xlogs partition?  These
> files are controlled by checkpoint_segments, is that correct?
>
> We have checkpoint_segments set to 500 in the current setup, which is
> about 8GB.  So 10 to 15 GB xlogs partition?  Is that reasonable?
>

Yes and no.
If you are using or plan to ever use log shipping you'll need more space.
In most setups, it will keep logs around until shipping has succeeded and
they have been flagged for removal, which will allow them to accumulate.
There may be other reasons why the total files there might be greater and
I'm not an expert in all the possibilities there so others will probably
have to answer that.

With a basic install, however, it won't use much more than your calculation
above.
You probably want a little breathing room in general, and on most new
systems today it's not hard to carve out 50GB.  I'd be shocked if the mirror
you are carving this out of isn't at least 250GB, since it's SATA.

I will reiterate that on a system your size the xlog throughput won't be a
bottleneck (fsync latency might be, but that's what raid cards with battery
backup are for).  So the file system choice isn't a big deal once it's on its
own
partition -- the main difference at that point is almost entirely max write
throughput.
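For reference, the 8.3 documentation bounds the number of WAL segment files
at roughly (2 + checkpoint_completion_target) * checkpoint_segments + 1,
each segment being 16MB.  A quick sketch of the resulting peak pg_xlog size
for the settings above, assuming the 8.3 default checkpoint_completion_target
of 0.5:

```python
def peak_pg_xlog_gib(checkpoint_segments, completion_target=0.5, seg_mb=16):
    # Upper bound on segment files per the 8.3 docs:
    #   (2 + checkpoint_completion_target) * checkpoint_segments + 1
    files = (2 + completion_target) * checkpoint_segments + 1
    return files * seg_mb / 1024.0

print(round(peak_pg_xlog_gib(500), 1))  # 19.5 -> closer to 20GB than 8GB
```

So at peak a 10-15GB partition could run tight with checkpoint_segments=500,
which is another argument for the roomier 50GB figure.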


From:
Whit Armstrong
Date:

Thanks, Scott.

> I went with ext3 for the OS -- it makes Ops feel a lot better. ext2 for a
> separate xlogs partition, and xfs for the data.
> ext2's drawbacks are not relevant for a small partition with just xlog data,
> but are a problem for the OS.

Can you suggest an appropriate size for the xlogs partition?  These
files are controlled by checkpoint_segments, is that correct?

We have checkpoint_segments set to 500 in the current setup, which is
about 8GB.  So 10 to 15 GB xlogs partition?  Is that reasonable?

-Whit

From:
Whit Armstrong
Date:

Thanks to everyone who helped me arrive at the config for this server.
Here is my first set of benchmarks using the standard pgbench setup.

The benchmark numbers seem pretty reasonable to me, but I don't have a
good feel for what typical numbers are.  Any feedback is appreciated.

-Whit


the server is set up as follows:
6 1TB drives, all Seagate Barracuda ES.2
Dell PERC 6 RAID controller card
RAID 1 volume with the OS and pg_xlog mounted as a separate partition w/
noatime and data=writeback, both ext3
RAID 10 volume with pg_data as xfs

nodeadmin@node3:~$ /usr/lib/postgresql/8.3/bin/pgbench -t 10000 -c 10
-U dbadmin test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 10
number of transactions per client: 10000
number of transactions actually processed: 100000/100000
tps = 5498.740733 (including connections establishing)
tps = 5504.244984 (excluding connections establishing)
nodeadmin@node3:~$ /usr/lib/postgresql/8.3/bin/pgbench -t 10000 -c 10
-U dbadmin test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 10
number of transactions per client: 10000
number of transactions actually processed: 100000/100000
tps = 5627.047823 (including connections establishing)
tps = 5632.835873 (excluding connections establishing)
nodeadmin@node3:~$ /usr/lib/postgresql/8.3/bin/pgbench -t 10000 -c 10
-U dbadmin test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 10
number of transactions per client: 10000
number of transactions actually processed: 100000/100000
tps = 5629.213818 (including connections establishing)
tps = 5635.225116 (excluding connections establishing)
nodeadmin@node3:~$


