Thread: partition question for new server setup
I have the opportunity to set up a new postgres server for our production database. I've read several times in various postgres lists about the importance of separating logs from the actual database data to avoid disk contention. Can someone suggest a typical partitioning scheme for a postgres server? My initial thought was to create /var/lib/postgresql as a partition on a separate set of disks. However, I can see that the xlog files will be stored here as well: http://www.postgresql.org/docs/8.3/interactive/storage-file-layout.html Should the xlog files be stored on a separate partition to improve performance? Any suggestions would be very helpful. Or if there is a document that lays out some best practices for server setup, that would be great. The database usage will be read heavy (financial data) with batch writes occurring overnight and occasionally during the day. Server information: Dell PowerEdge 2970, 8-core Opteron 2384, six 1TB hard drives with a PERC 6i, 64GB of RAM. We will be running Ubuntu 9.04. Thanks in advance, Whit
On Tue, Apr 28, 2009 at 10:56 AM, Whit Armstrong <armstrong.whit@gmail.com> wrote: > I have the opportunity to set up a new postgres server for our > production database. I've read several times in various postgres > lists about the importance of separating logs from the actual database > data to avoid disk contention. > > Can someone suggest a typical partitioning scheme for a postgres server? At work I have 16 SAS disks. They are setup with 12 in a RAID-10, 2 in a RAID-1 and 2 hot spares. The OS, /var/log, and postgres base go in the RAID-1. I then create a new data directory on the RAID-10, shut down pgsql, copy the base directory over to the RAID-10 and replace the base dir in the pg data directory with a link to the RAID-10's base directory and restart postgres. So, my pg_xlog and all OS and logging stuff goes on the RAID-10 and the main store for the db goes on the RAID-10.
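For anyone following the same recipe, a minimal sketch of the relocation Scott describes, assuming a stock Ubuntu postgresql-8.3 install with the cluster in /var/lib/postgresql/8.3/main and the RAID-10 mounted at /raid10 (both paths are assumptions; run as root):

  /etc/init.d/postgresql-8.3 stop
  cp -a /var/lib/postgresql/8.3/main/base /raid10/base     # -a preserves ownership and permissions
  mv /var/lib/postgresql/8.3/main/base /var/lib/postgresql/8.3/main/base.old
  ln -s /raid10/base /var/lib/postgresql/8.3/main/base
  /etc/init.d/postgresql-8.3 start
  # once the cluster starts cleanly and the data checks out, remove base.old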
Thanks, Scott. Just to clarify you said: > postgres. So, my pg_xlog and all OS and logging stuff goes on the > RAID-10 and the main store for the db goes on the RAID-10. Is that meant to be that the pg_xlog and all OS and logging stuff go on the RAID-1 and the real database (the /var/lib/postgresql/8.3/main/base directory) goes on the RAID-10 partition? This is very helpful. Thanks for your feedback. Additionally are there any clear choices w/ regard to filesystem types? Our choices would be xfs, ext3, or ext4. Is anyone out there running ext4 on a production system? -Whit
On Tue, Apr 28, 2009 at 11:48 AM, Whit Armstrong <armstrong.whit@gmail.com> wrote: > Thanks, Scott. > > Just to clarify you said: > >> postgres. So, my pg_xlog and all OS and logging stuff goes on the >> RAID-10 and the main store for the db goes on the RAID-10. > > Is that meant to be that the pg_xlog and all OS and logging stuff go > on the RAID-1 and the real database (the > /var/lib/postgresql/8.3/main/base directory) goes on the RAID-10 > partition? Yeah, an extra 0 jumped in there. Faulty keyboard I guess. :) OS and everything but base is on the RAID-1. > This is very helpful. Thanks for your feedback. > > Additionally are there any clear choices w/ regard to filesystem > types? Our choices would be xfs, ext3, or ext4. Well, there's a lot of people who use xfs and ext3. XFS is generally rated higher than ext3 both for performance and reliability. However, we run Centos 5 in production, and XFS isn't one of the blessed file systems it comes with, so we're running ext3. It's worked quite well for us.
On Tue, Apr 28, 2009 at 11:56:25AM -0600, Scott Marlowe wrote: > On Tue, Apr 28, 2009 at 11:48 AM, Whit Armstrong > <armstrong.whit@gmail.com> wrote: > > Thanks, Scott. > > > > Just to clarify you said: > > > >> postgres. So, my pg_xlog and all OS and logging stuff goes on the > >> RAID-10 and the main store for the db goes on the RAID-10. > > > > Is that meant to be that the pg_xlog and all OS and logging stuff go > > on the RAID-1 and the real database (the > > /var/lib/postgresql/8.3/main/base directory) goes on the RAID-10 > > partition? > > Yeah, an extra 0 jumped in there. Faulty keyboard I guess. :) OS > and everything but base is on the RAID-1. > > > This is very helpful. Thanks for your feedback. > > > > Additionally are there any clear choices w/ regard to filesystem > > types? Our choices would be xfs, ext3, or ext4. > > Well, there's a lot of people who use xfs and ext3. XFS is generally > rated higher than ext3 both for performance and reliability. However, > we run Centos 5 in production, and XFS isn't one of the blessed file > systems it comes with, so we're running ext3. It's worked quite well > for us. > The other optimizations are using data=writeback when mounting the ext3 filesystem for PostgreSQL and using the elevator=deadline for the disk driver. I do not know how you specify that for Ubuntu. Cheers, Ken
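For the data=writeback half, a hedged sketch of what the mount could look like in /etc/fstab on Ubuntu (device and mount point are placeholders, and per later messages in the thread you probably do not want data=writeback on the OS/root partition):

  /dev/sdb1  /var/lib/postgresql  ext3  noatime,data=writeback  0  2

The elevator question is answered further down the thread (it can be set as a boot parameter or per device via sysfs).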
Whit Armstrong wrote: > I have the opportunity to set up a new postgres server for our > production database. I've read several times in various postgres > lists about the importance of separating logs from the actual database > data to avoid disk contention. > > Can someone suggest a typical partitioning scheme for a postgres server? > > My initial thought was to create /var/lib/postgresql as a partition on > a separate set of disks. > > However, I can see that the xlog files will be stored here as well: > http://www.postgresql.org/docs/8.3/interactive/storage-file-layout.html > > Should the xlog files be stored on a separate partition to improve performance? > > Any suggestions would be very helpful. Or if there is a document that > lays out some best practices for server setup, that would be great. > > The database usage will be read heavy (financial data) with batch > writes occurring overnight and occasionally during the day. > > server information: > Dell PowerEdge 2970, 8 core Opteron 2384 > 6 1TB hard drives with a PERC 6i > 64GB of ram We're running a similar configuration: PowerEdge 8 core, PERC 6i, but we have 8 of the 2.5" 10K 384GB disks. When I asked the same question on this forum, I was advised to just put all 8 disks into a single RAID 10, and forget about separating things. The performance of a battery-backed PERC 6i (you did get a battery-backed cache, right?) with 8 disks is quite good. In order to separate the logs, OS and data, I'd have to split off at least two of the 8 disks, leaving only six for the RAID 10 array. But then my xlogs would be on a single disk, which might not be safe. A more robust approach would be to split off four of the disks, put the OS on a RAID 1, the xlog on a RAID 1, and the database data on a 4-disk RAID 10. Now I've separated the data, but my primary partition has lost half its disks. So, I took the advice, and just made one giant 8-disk RAID 10, and I'm very happy with it. It has everything: Postgres, OS and logs. But since the RAID array is 8 disks instead of 4, the net performance seems to be quite good. But ... your mileage may vary. My box has just one thing running on it: Postgres. There is almost no other disk activity to interfere with the file-system caching. If your server is going to have a bunch of other activity that generates a lot of non-Postgres disk activity, then this advice might not apply. Craig
On Tuesday 28 April 2009, Whit Armstrong <armstrong.whit@gmail.com> wrote: > Additionally are there any clear choices w/ regard to filesystem > types? Our choices would be xfs, ext3, or ext4. xfs consistently delivers much higher sequential throughput than ext3 (up to 100%), at least on my hardware. -- Even a sixth-grader can figure out that you can’t borrow money to pay off your debt
Kenneth Marshall wrote: >>> Additionally are there any clear choices w/ regard to filesystem >>> types? Our choices would be xfs, ext3, or ext4. >> Well, there's a lot of people who use xfs and ext3. XFS is generally >> rated higher than ext3 both for performance and reliability. However, >> we run Centos 5 in production, and XFS isn't one of the blessed file >> systems it comes with, so we're running ext3. It's worked quite well >> for us. >> > > The other optimizations are using data=writeback when mounting the > ext3 filesystem for PostgreSQL and using the elevator=deadline for > the disk driver. I do not know how you specify that for Ubuntu. After reading various articles, I thought that "noop" was the right choice when you're using a battery-backed RAID controller. The RAID controller is going to cache all data and reschedule the writes anyway, so the kernel scheduler is irrelevant at best, and can slow things down. On Ubuntu, it's echo noop >/sys/block/hdx/queue/scheduler where "hdx" is replaced by the appropriate device. Craig
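To see which scheduler a device is currently using before echoing a new one, read the same sysfs file; the bracketed entry is the active one (the exact list depends on the kernel, and behind a PERC the devices normally show up as sdX rather than hdX):

  cat /sys/block/sdb/queue/scheduler
  noop anticipatory deadline [cfq]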
Craig James <craig_james@emolecules.com> wrote: > After a reading various articles, I thought that "noop" was the > right choice when you're using a battery-backed RAID controller. > The RAID controller is going to cache all data and reschedule the > writes anyway, so the kernal schedule is irrelevant at best, and can > slow things down. Wouldn't that depend on the relative sizes of those caches? In a not-so-hypothetical example, we have machines with 120 GB OS cache, and 256 MB BBU RAID controller cache. We seem to benefit from elevator=deadline at the OS level. -Kevin
> echo noop >/sys/block/hdx/queue/scheduler can this go into /etc/init.d somewhere? or does that change stick between reboots? -Whit On Tue, Apr 28, 2009 at 2:16 PM, Craig James <craig_james@emolecules.com> wrote: > Kenneth Marshall wrote: >>>> >>>> Additionally are there any clear choices w/ regard to filesystem >>>> types? ?Our choices would be xfs, ext3, or ext4. >>> >>> Well, there's a lot of people who use xfs and ext3. XFS is generally >>> rated higher than ext3 both for performance and reliability. However, >>> we run Centos 5 in production, and XFS isn't one of the blessed file >>> systems it comes with, so we're running ext3. It's worked quite well >>> for us. >>> >> >> The other optimizations are using data=writeback when mounting the >> ext3 filesystem for PostgreSQL and using the elevator=deadline for >> the disk driver. I do not know how you specify that for Ubuntu. > > After a reading various articles, I thought that "noop" was the right choice > when you're using a battery-backed RAID controller. The RAID controller is > going to cache all data and reschedule the writes anyway, so the kernal > schedule is irrelevant at best, and can slow things down. > > On Ubuntu, it's > > echo noop >/sys/block/hdx/queue/scheduler > > where "hdx" is replaced by the appropriate device. > > Craig > >
On Tue, Apr 28, 2009 at 01:30:59PM -0500, Kevin Grittner wrote: > Craig James <craig_james@emolecules.com> wrote: > > > After reading various articles, I thought that "noop" was the > > right choice when you're using a battery-backed RAID controller. > > The RAID controller is going to cache all data and reschedule the > > writes anyway, so the kernel scheduler is irrelevant at best, and can > > slow things down. > > Wouldn't that depend on the relative sizes of those caches? In a > not-so-hypothetical example, we have machines with 120 GB OS cache, > and 256 MB BBU RAID controller cache. We seem to benefit from > elevator=deadline at the OS level. > > -Kevin > This was my understanding as well. If your RAID controller has a lot of well-managed cache, then the noop scheduler is a win. Less performant RAID controllers benefit from the deadline scheduler. Cheers, Ken
Whit Armstrong <armstrong.whit@gmail.com> wrote: >> echo noop >/sys/block/hdx/queue/scheduler > > can this go into /etc/init.d somewhere? We set the default for the kernel in the /boot/grub/menu.lst file. On a kernel line, add elevator=xxx (where xxx is your choice of scheduler). -Kevin
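A sketch of what that kernel line might look like in /boot/grub/menu.lst on Ubuntu 9.04 (the kernel version and root device are assumptions; the relevant part is the trailing elevator=deadline). On Ubuntu's grub-legacy it is arguably cleaner to add the option to the "# kopt=" line and run update-grub so kernel upgrades keep it:

  kernel  /boot/vmlinuz-2.6.28-11-server root=/dev/sda1 ro quiet elevator=deadline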
I see. Thanks for everyone for replying. The whole discussion has been very helpful. Cheers, Whit On Tue, Apr 28, 2009 at 3:13 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > Whit Armstrong <armstrong.whit@gmail.com> wrote: >>> echo noop >/sys/block/hdx/queue/scheduler >> >> can this go into /etc/init.d somewhere? > > We set the default for the kernel in the /boot/grub/menu.lst file. On > a kernel line, add elevator=xxx (where xxx is your choice of > scheduler). > > -Kevin > >
On Tue, Apr 28, 2009 at 12:06 PM, Kenneth Marshall <ktm@rice.edu> wrote: > On Tue, Apr 28, 2009 at 11:56:25AM -0600, Scott Marlowe wrote: >> On Tue, Apr 28, 2009 at 11:48 AM, Whit Armstrong >> <armstrong.whit@gmail.com> wrote: >> > Thanks, Scott. >> > >> > Just to clarify you said: >> > >> >> postgres. ?So, my pg_xlog and all OS and logging stuff goes on the >> >> RAID-10 and the main store for the db goes on the RAID-10. >> > >> > Is that meant to be that the pg_xlog and all OS and logging stuff go >> > on the RAID-1 and the real database (the >> > /var/lib/postgresql/8.3/main/base directory) goes on the RAID-10 >> > partition? >> >> Yeah, and extra 0 jumped in there. Faulty keyboard I guess. :) OS >> and everything but base is on the RAID-1. >> >> > This is very helpful. ?Thanks for your feedback. >> > >> > Additionally are there any clear choices w/ regard to filesystem >> > types? ?Our choices would be xfs, ext3, or ext4. >> >> Well, there's a lot of people who use xfs and ext3. XFS is generally >> rated higher than ext3 both for performance and reliability. However, >> we run Centos 5 in production, and XFS isn't one of the blessed file >> systems it comes with, so we're running ext3. It's worked quite well >> for us. >> > > The other optimizations are using data=writeback when mounting the > ext3 filesystem for PostgreSQL and using the elevator=deadline for > the disk driver. I do not know how you specify that for Ubuntu. Yeah, we set the scheduler to deadline on our db servers and it dropped the load and io wait noticeably, even with our rather fast arrays and controller. We also use data=writeback.
On Tue, Apr 28, 2009 at 12:37 PM, Whit Armstrong <armstrong.whit@gmail.com> wrote: >> echo noop >/sys/block/hdx/queue/scheduler > > can this go into /etc/init.d somewhere? > > or does that change stick between reboots? I just stick in /etc/rc.local
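A sketch of the /etc/rc.local approach on Ubuntu (device names are placeholders; the script runs as root at the end of boot and needs to keep its trailing exit 0):

  # set the I/O scheduler on the database volumes at boot
  echo deadline > /sys/block/sda/queue/scheduler
  echo deadline > /sys/block/sdb/queue/scheduler
  exit 0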
On Tue, Apr 28, 2009 at 12:40 PM, Kenneth Marshall <ktm@rice.edu> wrote: > On Tue, Apr 28, 2009 at 01:30:59PM -0500, Kevin Grittner wrote: >> Craig James <craig_james@emolecules.com> wrote: >> >> > After a reading various articles, I thought that "noop" was the >> > right choice when you're using a battery-backed RAID controller. >> > The RAID controller is going to cache all data and reschedule the >> > writes anyway, so the kernal schedule is irrelevant at best, and can >> > slow things down. >> >> Wouldn't that depend on the relative sizes of those caches? In a >> not-so-hypothetical example, we have machines with 120 GB OS cache, >> and 256 MB BBU RAID controller cache. We seem to benefit from >> elevator=deadline at the OS level. >> >> -Kevin >> > This was my understanding as well. If your RAID controller had a > lot of well managed cache, then the noop scheduler was a win. Less > performant RAID controllers benefit from teh deadline scheduler. I have an Areca 1680ix with 512M cache on a machine with 32Gig ram and I get slightly better performance and lower load factors from deadline than from noop, but it's not by much.
On 4/28/09 11:16 AM, "Craig James" <craig_james@emolecules.com> wrote: > Kenneth Marshall wrote: >>>> Additionally are there any clear choices w/ regard to filesystem >>>> types? Our choices would be xfs, ext3, or ext4. >>> Well, there's a lot of people who use xfs and ext3. XFS is generally >>> rated higher than ext3 both for performance and reliability. However, >>> we run Centos 5 in production, and XFS isn't one of the blessed file >>> systems it comes with, so we're running ext3. It's worked quite well >>> for us. >>> >> >> The other optimizations are using data=writeback when mounting the >> ext3 filesystem for PostgreSQL and using the elevator=deadline for >> the disk driver. I do not know how you specify that for Ubuntu. > > After reading various articles, I thought that "noop" was the right choice > when you're using a battery-backed RAID controller. The RAID controller is > going to cache all data and reschedule the writes anyway, so the kernel > scheduler is irrelevant at best, and can slow things down. > > On Ubuntu, it's > > echo noop >/sys/block/hdx/queue/scheduler > > where "hdx" is replaced by the appropriate device. > > Craig > I've always had better performance from deadline than noop, no matter what raid controller I have. Perhaps with a really good one or a SAN that changes (NOT a PERC 6 mediocre thingamabob). PERC 6 really, REALLY needs to have the linux "readahead" value set up to at least 1MB per effective spindle to get good sequential read performance. Xfs helps with it too, but you can make up about half of the ext3 vs xfs sequential access gap with high readahead settings: /sbin/blockdev --setra <value> <device> Value is in blocks (512 bytes). /sbin/blockdev --getra <device> to see its setting. Google for more info.
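As a worked example of the readahead arithmetic (assuming the data volume ends up as a 4-disk RAID 10 that shows up as /dev/sdb, both assumptions): 4 spindles x 1MB = 4MB, and 4MB / 512 bytes = 8192 blocks.

  /sbin/blockdev --getra /dev/sdb         # the stock default is usually 256 (128KB)
  /sbin/blockdev --setra 8192 /dev/sdb    # 8192 * 512 bytes = 4MB of readahead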
>> >> server information: >> Dell PowerEdge 2970, 8 core Opteron 2384 >> 6 1TB hard drives with a PERC 6i >> 64GB of ram > > We're running a similar configuration: PowerEdge 8 core, PERC 6i, but we have > 8 of the 2.5" 10K 384GB disks. > > When I asked the same question on this forum, I was advised to just put all 8 > disks into a single RAID 10, and forget about separating things. The > performance of a battery-backed PERC 6i (you did get a battery-backed cache, > right?) with 8 disks is quite good. > > In order to separate the logs, OS and data, I'd have to split off at least two > of the 8 disks, leaving only six for the RAID 10 array. But then my xlogs > would be on a single disk, which might not be safe. A more robust approach > would be to split off four of the disks, put the OS on a RAID 1, the xlog on a > RAID 1, and the database data on a 4-disk RAID 10. Now I've separated the > data, but my primary partition has lost half its disks. > > So, I took the advice, and just made one giant 8-disk RAID 10, and I'm very > happy with it. It has everything: Postgres, OS and logs. But since the RAID > array is 8 disks instead of 4, the net performance seems to be quite good. > If you go this route, there are a few risks: 1. If everything is on the same partition/file system, fsyncs from the xlogs may cross-pollute to the data. Ext3 is notorious for this, though data=writeback limits the effect. You especially might not want data=writeback on your OS partition. I would recommend that the OS, Data, and xlogs + etc live on three different partitions regardless of the number of logical RAID volumes. 2. Cheap raid controllers (PERC, others) will see fsync for an array and flush everything that is dirty (not just the partition or file data), which is a concern if you aren't using it in write-back with battery-backed cache, even for a very read heavy db that doesn't need high fsync speed for transactions. > But ... your mileage may vary. My box has just one thing running on it: > Postgres. There is almost no other disk activity to interfere with the > file-system caching. If your server is going to have a bunch of other > activity that generates a lot of non-Postgres disk activity, then this advice > might not apply. > > Craig > 6 and 8 disk counts are tough. My biggest single piece of advice is to have the xlogs in a partition separate from the data (not necessarily a different raid logical volume), with file system and mount options tuned for each case separately. I've seen this alone improve performance by a factor of 2.5 on some file system / storage combinations.
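For concreteness, one way the three-way split could look, fstab-style, on a two-volume layout (devices, mount points, and options are assumptions; in practice many people mount elsewhere and symlink pg_xlog and base into the cluster directory, as Scott Marlowe described upthread):

  # OS on the RAID 1
  /dev/sda1  /          ext3  defaults                 0  1
  # xlogs, also on the RAID 1, symlinked from .../8.3/main/pg_xlog
  /dev/sda2  /pg_xlog   ext3  noatime,data=writeback   0  2
  # main data store on the RAID 10, symlinked from .../8.3/main/base
  /dev/sdb1  /pg_data   xfs   noatime                  0  2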
are there any other xfs settings that should be tuned for postgres? I see this post mentions "allocation groups." does anyone have suggestions for those settings? http://archives.postgresql.org/pgsql-admin/2009-01/msg00144.php what about raid stripe size? does it really make a difference? I think the default for the perc is 64kb (but I'm not in front of the server right now). -Whit On Tue, Apr 28, 2009 at 7:40 PM, Scott Carey <scott@richrelevance.com> wrote: > On 4/28/09 11:16 AM, "Craig James" <craig_james@emolecules.com> wrote: > >> Kenneth Marshall wrote: >>>>> Additionally are there any clear choices w/ regard to filesystem >>>>> types? ?Our choices would be xfs, ext3, or ext4. >>>> Well, there's a lot of people who use xfs and ext3. XFS is generally >>>> rated higher than ext3 both for performance and reliability. However, >>>> we run Centos 5 in production, and XFS isn't one of the blessed file >>>> systems it comes with, so we're running ext3. It's worked quite well >>>> for us. >>>> >>> >>> The other optimizations are using data=writeback when mounting the >>> ext3 filesystem for PostgreSQL and using the elevator=deadline for >>> the disk driver. I do not know how you specify that for Ubuntu. >> >> After a reading various articles, I thought that "noop" was the right choice >> when you're using a battery-backed RAID controller. The RAID controller is >> going to cache all data and reschedule the writes anyway, so the kernal >> schedule is irrelevant at best, and can slow things down. >> >> On Ubuntu, it's >> >> echo noop >/sys/block/hdx/queue/scheduler >> >> where "hdx" is replaced by the appropriate device. >> >> Craig >> > > I've always had better performance from deadline than noop, no matter what > raid controller I have. Perhaps with a really good one or a SAN that > changes (NOT a PERC 6 mediocre thingamabob). > > PERC 6 really, REALLY needs to have the linux "readahead" value set up to at > least 1MB per effective spindle to get good sequential read performance. > Xfs helps with it too, but you can mitigate half of the ext3 vs xfs > sequential access performance with high readahead settings: > > /sbin/blockdev --setra <value> <device> > > Value is in blocks (512 bytes) > > /sbin/blockdev --getra <device> to see its setting. Google for more info. > >> >> -- >> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) >> To make changes to your subscription: >> http://www.postgresql.org/mailpref/pgsql-performance >> > >
Thanks, Scott. So far, I've followed a pattern similar to Scott Marlowe's setup. I have configured 2 disks as a RAID 1 volume, and 4 disks as a RAID 10 volume. So, the OS and xlogs will live on the RAID 1 vol and the data will live on the RAID 10 vol. I'm running the memtest on it now, so we still haven't locked ourselves into any choices. regarding your comment: > 6 and 8 disk counts are tough. My biggest single piece of advise is to have > the xlogs in a partition separate from the data (not necessarily a different > raid logical volume), with file system and mount options tuned for each case > separately. I've seen this alone improve performance by a factor of 2.5 on > some file system / storage combinations. can you suggest mount options for the various partitions? I'm leaning towards xfs for the filesystem format unless someone complains loudly about data corruption on xfs for a recent 2.6 kernel. -Whit On Tue, Apr 28, 2009 at 7:58 PM, Scott Carey <scott@richrelevance.com> wrote: >>> >>> server information: >>> Dell PowerEdge 2970, 8 core Opteron 2384 >>> 6 1TB hard drives with a PERC 6i >>> 64GB of ram >> >> We're running a similar configuration: PowerEdge 8 core, PERC 6i, but we have >> 8 of the 2.5" 10K 384GB disks. >> >> When I asked the same question on this forum, I was advised to just put all 8 >> disks into a single RAID 10, and forget about separating things. The >> performance of a battery-backed PERC 6i (you did get a battery-backed cache, >> right?) with 8 disks is quite good. >> >> In order to separate the logs, OS and data, I'd have to split off at least two >> of the 8 disks, leaving only six for the RAID 10 array. But then my xlogs >> would be on a single disk, which might not be safe. A more robust approach >> would be to split off four of the disks, put the OS on a RAID 1, the xlog on a >> RAID 1, and the database data on a 4-disk RAID 10. Now I've separated the >> data, but my primary partition has lost half its disks. >> >> So, I took the advice, and just made one giant 8-disk RAID 10, and I'm very >> happy with it. It has everything: Postgres, OS and logs. But since the RAID >> array is 8 disks instead of 4, the net performance seems to quite good. >> > > If you go this route, there are a few risks: > 1. If everything is on the same partition/file system, fsyncs from the > xlogs may cross-pollute to the data. Ext3 is notorious for this, though > data=writeback limits the effect you especially might not want > data=writeback on your OS partition. I would recommend that the OS, Data, > and xlogs + etc live on three different partitions regardless of the number > of logical RAID volumes. > 2. Cheap raid controllers (PERC, others) will see fsync for an array and > flush everything that is dirty (not just the partition or file data), which > is a concern if you aren't using it in write-back with battery backed cache, > even for a very read heavy db that doesn't need high fsync speed for > transactions. > >> But ... your mileage may vary. My box has just one thing running on it: >> Postgres. There is almost no other disk activity to interfere with the >> file-system caching. If your server is going to have a bunch of other >> activity that generate a lot of non-Postgres disk activity, then this advice >> might not apply. >> >> Craig >> > > 6 and 8 disk counts are tough. 
> My biggest single piece of advice is to have > the xlogs in a partition separate from the data (not necessarily a different > raid logical volume), with file system and mount options tuned for each case > separately. I've seen this alone improve performance by a factor of 2.5 on > some file system / storage combinations.
On Tue, Apr 28, 2009 at 5:58 PM, Scott Carey <scott@richrelevance.com> wrote: > 1. If everything is on the same partition/file system, fsyncs from the > xlogs may cross-pollute to the data. Ext3 is notorious for this, though > data=writeback limits the effect you especially might not want > data=writeback on your OS partition. I would recommend that the OS, Data, > and xlogs + etc live on three different partitions regardless of the number > of logical RAID volumes. Note that I remember reading some comments a while back that just having a different file system, on the same logical set, makes things faster. I.e. a partition for OS, one for xlog and one for pgdata on the same large logical volume was noticeably faster than having it all on the same big partition on a single logical volume.
On 4/28/09 5:02 PM, "Whit Armstrong" <armstrong.whit@gmail.com> wrote: > are there any other xfs settings that should be tuned for postgres? > > I see this post mentions "allocation groups." does anyone have > suggestions for those settings? > http://archives.postgresql.org/pgsql-admin/2009-01/msg00144.php > > what about raid stripe size? does it really make a difference? I > think the default for the perc is 64kb (but I'm not in front of the > server right now). > When I tested a PERC I couldn't tell the difference between the 64k and 256k settings. The other settings that looked like they might improve things all had worse performance (other than write back cache of course). Also, if you have partitions at all on the data device, you'll want to try and stripe align it. The easiest way is to simply put the file system on the raw device rather than a partition (e.g. /dev/sda rather than /dev/sda1). Partition alignment can be very annoying to do well. It will affect performance a little, less so with larger stripe sizes. > -Whit > > > On Tue, Apr 28, 2009 at 7:40 PM, Scott Carey <scott@richrelevance.com> wrote:
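A sketch of the no-partition-table variant for the data volume (assuming /dev/sdb is the RAID 10 logical volume and xfs is the file system of choice); with no partition there is nothing to misalign:

  mkfs.xfs -f /dev/sdb
  mount -o noatime /dev/sdb /pg_data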
On 4/28/09 5:10 PM, "Whit Armstrong" <armstrong.whit@gmail.com> wrote: > Thanks, Scott. > > So far, I've followed a pattern similar to Scott Marlowe's setup. I > have configured 2 disks as a RAID 1 volume, and 4 disks as a RAID 10 > volume. So, the OS and xlogs will live on the RAID 1 vol and the data > will live on the RAID 10 vol. > > I'm running the memtest on it now, so we still haven't locked > ourselves into any choices. > It's a fine option -- the only way to know if one big volume with separate partitions is better is to test your actual application since it is highly dependent on the use case. > regarding your comment: >> 6 and 8 disk counts are tough. My biggest single piece of advice is to have >> the xlogs in a partition separate from the data (not necessarily a different >> raid logical volume), with file system and mount options tuned for each case >> separately. I've seen this alone improve performance by a factor of 2.5 on >> some file system / storage combinations. > > can you suggest mount options for the various partitions? I'm leaning > towards xfs for the filesystem format unless someone complains loudly > about data corruption on xfs for a recent 2.6 kernel. > > -Whit > I went with ext3 for the OS -- it makes Ops feel a lot better. ext2 for a separate xlogs partition, and xfs for the data. ext2's drawbacks are not relevant for a small partition with just xlog data, but are a problem for the OS. For a setup like yours, xlog speed is not going to limit you. I suggest a partition for the OS with default ext3 mount options, and a second partition for postgres/xlogs minus the data on ext3 with data=writeback. ext3 with default data=ordered on the xlogs causes performance issues as others have mentioned here. But data=ordered is probably the right thing for the OS. Your xlogs will not be a bottleneck and will probably be fine either way -- and this is a mount-time option so you can switch. I went with xfs for the data partition, and did not see benefit from anything other than the 'noatime' mount option. The default xfs settings are fine, and the raid-specific formatting options are primarily designed to help raid 5 or 6 out. If you go with ext3 for the data partition, make sure it's data=writeback with 'noatime'. Both of these are mount-time options. I said it before, but I'll repeat -- don't neglect the OS readahead setting for the device, especially the data device. Something like /sbin/blockdev --setra 8192 /dev/sd<X>, where <X> is the right letter for your data raid volume, will have a big impact on larger sequential scans. This has to go in rc.local or whatever script runs after boot on your distro. > > On Tue, Apr 28, 2009 at 7:58 PM, Scott Carey <scott@richrelevance.com> wrote: >>>> >>>> server information: >>>> Dell PowerEdge 2970, 8 core Opteron 2384 >>>> 6 1TB hard drives with a PERC 6i >>>> 64GB of ram >>> >>> We're running a similar configuration: PowerEdge 8 core, PERC 6i, but we >>> have >>> 8 of the 2.5" 10K 384GB disks. >>> >>> When I asked the same question on this forum, I was advised to just put all >>> 8 >>> disks into a single RAID 10, and forget about separating things. The >>> performance of a battery-backed PERC 6i (you did get a battery-backed cache, >>> right?) with 8 disks is quite good. >>> >>> In order to separate the logs, OS and data, I'd have to split off at least >>> two >>> of the 8 disks, leaving only six for the RAID 10 array. But then my xlogs >>> would be on a single disk, which might not be safe.
>>> A more robust approach >>> would be to split off four of the disks, put the OS on a RAID 1, the xlog on >>> a >>> RAID 1, and the database data on a 4-disk RAID 10. Now I've separated the >>> data, but my primary partition has lost half its disks. >>> >>> So, I took the advice, and just made one giant 8-disk RAID 10, and I'm very >>> happy with it. It has everything: Postgres, OS and logs. But since the >>> RAID >>> array is 8 disks instead of 4, the net performance seems to be quite good. >>> >> >> If you go this route, there are a few risks: >> 1. If everything is on the same partition/file system, fsyncs from the >> xlogs may cross-pollute to the data. Ext3 is notorious for this, though >> data=writeback limits the effect. You especially might not want >> data=writeback on your OS partition. I would recommend that the OS, Data, >> and xlogs + etc live on three different partitions regardless of the number >> of logical RAID volumes. >> 2. Cheap raid controllers (PERC, others) will see fsync for an array and >> flush everything that is dirty (not just the partition or file data), which >> is a concern if you aren't using it in write-back with battery backed cache, >> even for a very read heavy db that doesn't need high fsync speed for >> transactions. >> >>> But ... your mileage may vary. My box has just one thing running on it: >>> Postgres. There is almost no other disk activity to interfere with the >>> file-system caching. If your server is going to have a bunch of other >>> activity that generates a lot of non-Postgres disk activity, then this advice >>> might not apply. >>> >>> Craig >>> >> >> 6 and 8 disk counts are tough. My biggest single piece of advice is to have >> the xlogs in a partition separate from the data (not necessarily a different >> raid logical volume), with file system and mount options tuned for each case >> separately. I've seen this alone improve performance by a factor of 2.5 on >> some file system / storage combinations.
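Putting the file-system recommendation above into commands, as a hedged sketch (device names are placeholders: sda1/sda2 on the RAID 1, sdb on the RAID 10):

  mkfs.ext3 /dev/sda1      # OS partition, default journaling (data=ordered)
  mkfs.ext2 /dev/sda2      # xlog partition; a journal buys little on a partition this small
  mkfs.xfs -f /dev/sdb     # data volume; stock xfs settings, mounted with noatime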
On 4/29/09 7:28 AM, "Whit Armstrong" <armstrong.whit@gmail.com> wrote: > Thanks, Scott. > >> I went with ext3 for the OS -- it makes Ops feel a lot better. ext2 for a >> separate xlogs partition, and xfs for the data. >> ext2's drawbacks are not relevant for a small partition with just xlog data, >> but are a problem for the OS. > > Can you suggest an appropriate size for the xlogs partition? These > files are controlled by checkpoint_segments, is that correct? > > We have checkpoint_segments set to 500 in the current setup, which is > about 8GB. So 10 to 15 GB xlogs partition? Is that reasonable? > Yes and no. If you are using or plan to ever use log shipping you'll need more space. In most setups, it will keep logs around until successful shipping has happened and it has been told to remove them, which will allow them to grow. There may be other reasons why the total files there might be greater and I'm not an expert in all the possibilities there so others will probably have to answer that. With a basic install however, it won't use much more than your calculation above. You probably want a little breathing room in general, and in most new systems today it's not hard to carve out 50GB. I'd be shocked if your mirror that you are carving this out of isn't at least 250GB since it's SATA. I will reiterate that on a system your size the xlog throughput won't be a bottleneck (fsync latency might, but raid cards with battery backup are for that). So the file system choice isn't a big deal once it's on its own partition -- the main difference at that point is almost entirely max write throughput.
Thanks, Scott. > I went with ext3 for the OS -- it makes Ops feel a lot better. ext2 for a > separate xlogs partition, and xfs for the data. > ext2's drawbacks are not relevant for a small partition with just xlog data, > but are a problem for the OS. Can you suggest an appropriate size for the xlogs partition? These files are controlled by checkpoint_segments, is that correct? We have checkpoint_segments set to 500 in the current setup, which is about 8GB. So 10 to 15 GB xlogs partition? Is that reasonable? -Whit
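For reference, the 8.3 sizing arithmetic behind those numbers (checkpoint_completion_target of 0.5 is the default and an assumption here): each WAL segment is 16MB, and pg_xlog normally stays under about (2 + checkpoint_completion_target) * checkpoint_segments + 1 segments.

  500 segments * 16MB                          ~= 8GB in steady state
  (2 + 0.5) * 500 + 1 = 1251 segments * 16MB   ~= 20GB near the worst case

which is why a little extra headroom beyond 10 to 15 GB is not a bad idea.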
Thanks to everyone who helped me arrive at the config for this server. Here is my first set of benchmarks using the standard pgbench setup. The benchmark numbers seem pretty reasonable to me, but I don't have a good feel for what typical numbers are. Any feedback is appreciated. -Whit the server is set up as follows: 6 1TB drives, all Seagate Barracuda ES.2; Dell PERC 6 raid controller card; RAID 1 volume with OS and pg_xlog mounted as a separate partition w/ noatime and data=writeback, both ext3; RAID 10 volume with pg_data as xfs. nodeadmin@node3:~$ /usr/lib/postgresql/8.3/bin/pgbench -t 10000 -c 10 -U dbadmin test starting vacuum...end. transaction type: TPC-B (sort of) scaling factor: 100 number of clients: 10 number of transactions per client: 10000 number of transactions actually processed: 100000/100000 tps = 5498.740733 (including connections establishing) tps = 5504.244984 (excluding connections establishing) nodeadmin@node3:~$ /usr/lib/postgresql/8.3/bin/pgbench -t 10000 -c 10 -U dbadmin test starting vacuum...end. transaction type: TPC-B (sort of) scaling factor: 100 number of clients: 10 number of transactions per client: 10000 number of transactions actually processed: 100000/100000 tps = 5627.047823 (including connections establishing) tps = 5632.835873 (excluding connections establishing) nodeadmin@node3:~$ /usr/lib/postgresql/8.3/bin/pgbench -t 10000 -c 10 -U dbadmin test starting vacuum...end. transaction type: TPC-B (sort of) scaling factor: 100 number of clients: 10 number of transactions per client: 10000 number of transactions actually processed: 100000/100000 tps = 5629.213818 (including connections establishing) tps = 5635.225116 (excluding connections establishing) nodeadmin@node3:~$ On Wed, Apr 29, 2009 at 2:58 PM, Scott Carey <scott@richrelevance.com> wrote: > > On 4/29/09 7:28 AM, "Whit Armstrong" <armstrong.whit@gmail.com> wrote: > >> Thanks, Scott. >> >>> I went with ext3 for the OS -- it makes Ops feel a lot better. ext2 for a >>> separate xlogs partition, and xfs for the data. >>> ext2's drawbacks are not relevant for a small partition with just xlog data, >>> but are a problem for the OS. >> >> Can you suggest an appropriate size for the xlogs partition? These >> files are controlled by checkpoint_segments, is that correct? >> >> We have checkpoint_segments set to 500 in the current setup, which is >> about 8GB. So 10 to 15 GB xlogs partition? Is that reasonable? >> > > Yes and no. > If you are using or plan to ever use log shipping you'll need more space. > In most setups, it will keep logs around until successful shipping has > happened and it has been told to remove them, which will allow them to grow. > There may be other reasons why the total files there might be greater and > I'm not an expert in all the possibilities there so others will probably > have to answer that. > > With a basic install however, it won't use much more than your calculation > above. > You probably want a little breathing room in general, and in most new > systems today it's not hard to carve out 50GB. I'd be shocked if your mirror > that you are carving this out of isn't at least 250GB since it's SATA. > > I will reiterate that on a system your size the xlog throughput won't be a > bottleneck (fsync latency might, but raid cards with battery backup are for > that). So the file system choice isn't a big deal once it's on its own > partition -- the main difference at that point is almost entirely max write > throughput. > >
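Since the production load is read-heavy, a select-only run may say more about this box than the default TPC-B-ish test; a sketch using the same binary (the -i step rebuilds the pgbench tables, and scale 1000 is an arbitrary assumption, roughly 15GB; push it higher if you want the working set to exceed RAM):

  /usr/lib/postgresql/8.3/bin/pgbench -i -s 1000 -U dbadmin test
  /usr/lib/postgresql/8.3/bin/pgbench -S -t 10000 -c 10 -U dbadmin test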