Thread: Choosing a filesystem
I'm about to buy a new server. It will be a Xeon system with two processors (4 cores per processor) and 16GB RAM. Two RAID extenders will be attached to an Intel s5000 series motherboard, providing 12 SAS/Serial ATA connectors.

The server will run FreeBSD 7.0, PostgreSQL 8, apache, PHP, mail server, dovecot IMAP server and background programs for database maintenance. On our current system, I/O performance for PostgreSQL is the biggest problem, but sometimes all CPUs are at 100%. Number of users using this system:

PostgreSQL: 30 connections
Apache: 30 connections
IMAP server: 15 connections

The databases are mostly OLTP, but the background programs are creating historical data and statistic data continuously, and sometimes web site visitors/search engine robots run searches in bigger tables (with 3 million+ records).

There is an expert at the company who sells the server, and he recommended that I use SAS disks for the base system at least. I would like to use many SAS disks, but they are just too expensive. So the basic system will reside on a RAID 1 array, created from two SAS disks spinning at 15 000 rpm. I will buy 10 pieces of Seagate Barracuda 320GB SATA (ES 7200) disks for the rest.

The expert told me to use RAID 5 but I'm hesitating. I think that RAID 1+0 would be much faster, and I/O performance is what I really need.

I would like to put the WAL file on the SAS disks to improve performance, and create one big RAID 1+0 disk for the data directory. But maybe I'm completely wrong. Can you please advise how to create logical partitions? The hardware is capable of handling different types of RAID volumes on the same set of disks. For example, a smaller RAID 0 for indexes and a bigger RAID 5 etc.

If you need more information about the database, please ask. :-)

Thank you very much,

Laszlo
On Thu, Sep 11, 2008 at 06:29:36PM +0200, Laszlo Nagy wrote:

> The expert told me to use RAID 5 but I'm hesitating. I think that RAID 1+0
> would be much faster, and I/O performance is what I really need.

I think you're right. I think it's a big mistake to use RAID 5 in a database server where you're hoping for reasonable write performance. In theory RAID 5 ought to be fast for reads, but I've never seen it work that way.

> I would like to put the WAL file on the SAS disks to improve performance,
> and create one big RAID 1+0 disk for the data directory. But maybe I'm
> completely wrong. Can you please advise how to create logical partitions?

I would listen to yourself before you listen to the expert. You sound right to me :)

A

--
Andrew Sullivan
ajs@commandprompt.com
+1 503 667 4564 x104
http://www.commandprompt.com/
On Thu, 11 Sep 2008, Laszlo Nagy wrote:

> So the basic system will reside on a RAID 1 array, created from two SAS
> disks spinning at 15 000 rpm. I will buy 10 pieces of Seagate Barracuda
> 320GB SATA (ES 7200) disks for the rest.

That sounds good. Put RAID 1 on the pair, and RAID 1+0 on the rest. It'll be pretty good. Put the OS and the WAL on the pair, and the database on the large array.

However, one of the biggest things that will improve your performance (especially in OLTP) is to use a proper RAID controller with a battery-backed-up cache.

Matthew

--
X's book explains this very well, but, poor bloke, he did the Cambridge Maths Tripos...
-- Computer Science Lecturer
On Thu, Sep 11, 2008 at 06:18:37PM +0100, Matthew Wakeling wrote:
> On Thu, 11 Sep 2008, Laszlo Nagy wrote:
>> So the basic system will reside on a RAID 1 array, created from two SAS
>> disks spinning at 15 000 rpm. I will buy 10 pieces of Seagate Barracuda
>> 320GB SATA (ES 7200) disks for the rest.
>
> That sounds good. Put RAID 1 on the pair, and RAID 1+0 on the rest. It'll
> be pretty good. Put the OS and the WAL on the pair, and the database on the
> large array.
>
> However, one of the biggest things that will improve your performance
> (especially in OLTP) is to use a proper RAID controller with a
> battery-backed-up cache.
>
> Matthew

But remember that putting the WAL on a separate drive(set) will only help if you do not have competing I/O, such as system logging or paging, going to the same drives. This turns your fast sequential I/O into random I/O with the accompanying 10x or more performance decrease.

Ken
>>> Kenneth Marshall <ktm@rice.edu> wrote:
> On Thu, Sep 11, 2008 at 06:18:37PM +0100, Matthew Wakeling wrote:
>> On Thu, 11 Sep 2008, Laszlo Nagy wrote:
>>> So the basic system will reside on a RAID 1 array, created from two SAS
>>> disks spinning at 15 000 rpm. I will buy 10 pieces of Seagate Barracuda
>>> 320GB SATA (ES 7200) disks for the rest.
>>
>> That sounds good. Put RAID 1 on the pair, and RAID 1+0 on the rest. It'll
>> be pretty good. Put the OS and the WAL on the pair, and the database on the
>> large array.
>>
>> However, one of the biggest things that will improve your performance
>> (especially in OLTP) is to use a proper RAID controller with a
>> battery-backed-up cache.
>
> But remember that putting the WAL on a separate drive(set) will only
> help if you do not have competing I/O, such as system logging or paging,
> going to the same drives. This turns your fast sequential I/O into
> random I/O with the accompanying 10x or more performance decrease.

Unless you have a good RAID controller with battery-backed-up cache.

-Kevin
>> going to the same drives. This turns your fast sequential I/O into
>> random I/O with the accompanying 10x or more performance decrease.
>
> Unless you have a good RAID controller with battery-backed-up cache.

All right. :-) This is what I'll have:

Boxed Intel Server Board S5000PSLROMB with 8-port SAS ROMB card (supports 45nm processors: Harpertown and Wolfdale-DP)
Intel® RAID Activation key AXXRAK18E, enables full intelligent SAS RAID on S5000PAL, S5000PSL, SR4850HW4/M, SR6850HW4/M. RoHS Compliant.
512 MB 400MHz DDR2 ECC Registered CL3 DIMM, Single Rank, x8 (for S5000PSLROMB)
6-drive SAS/SATA backplane with expander (requires 2 SAS ports) for SC5400 and SC5299 (two pieces)
5410 Xeon 2.33 GHz/1333 FSB/12MB, boxed, passive cooling / 80W (2 pieces)
2048 MB 667MHz DDR2 ECC Fully Buffered CL5 DIMM, Dual Rank, x8 (8 pieces)

SAS disks will be: 146.8 GB, SAS 3G, 15000 RPM, 16 MB cache (two pieces)

SATA disks will be: HDD Server SEAGATE Barracuda ES 7200.1 (320GB, 16MB, SATA II-300) (10 pieces)

I cannot spend more money on this computer, but since you are all talking about battery backup, I'll try to get money from the management and buy this:

Intel® RAID Smart Battery AXXRSBBU3, optional battery backup for use with AXXRAK18E and SRCSAS144E. RoHS Compliant.

This server will also be an IMAP server, web server etc., so I'm 100% sure that the SAS disks will be used for logging. I have two spare 200GB SATA disks here in the office, but they are cheap ones designed for desktop computers. Is it okay to dedicate these disks to the WAL file in RAID 1? Will it improve performance? How much trouble would it cause if the WAL file goes wrong? Should I just put the WAL file on the RAID 1+0 array?

Thanks,

Laszlo
On Thu, Sep 11, 2008 at 10:29 AM, Laszlo Nagy <gandalf@shopzeus.com> wrote:
> I'm about to buy a new server. It will be a Xeon system with two processors
> (4 cores per processor) and 16GB RAM. Two RAID extenders will be attached
> to an Intel s5000 series motherboard, providing 12 SAS/Serial ATA
> connectors.
>
> The server will run FreeBSD 7.0, PostgreSQL 8, apache, PHP, mail server,
> dovecot IMAP server and background programs for database maintenance. On our
> current system, I/O performance for PostgreSQL is the biggest problem, but
> sometimes all CPUs are at 100%. Number of users using this system:

100% what? sys? user? iowait? If it's still iowait, then the newer, bigger, faster RAID should really help.

> PostgreSQL: 30 connections
> Apache: 30 connections
> IMAP server: 15 connections
>
> The databases are mostly OLTP, but the background programs are creating
> historical data and statistic data continuously, and sometimes web site
> visitors/search engine robots run searches in bigger tables (with 3 million+
> records).

This might be a good application to set up where you Slony-replicate to another server, then run your I/O intensive processes against the slave.

> There is an expert at the company who sells the server, and he recommended
> that I use SAS disks for the base system at least. I would like to use many
> SAS disks, but they are just too expensive. So the basic system will reside
> on a RAID 1 array, created from two SAS disks spinning at 15 000 rpm. I will
> buy 10 pieces of Seagate Barracuda 320GB SATA (ES 7200) disks for the rest.

SAS = a bit faster, and better at parallel work. However, short-stroking 7200 RPM SATA drives on the fastest parts of the platters can get you close to SAS territory for a fraction of the cost, plus you can then store backups etc. on the rest of the drives at night.

So, you're gonna put the OS on RAID 1, and pgsql on the rest... Makes sense. Consider setting up another RAID 1 for the pg_clog directory.

> The expert told me to use RAID 5 but I'm hesitating. I think that RAID 1+0
> would be much faster, and I/O performance is what I really need.

The expert is most certainly wrong for an OLTP database. If your RAID controller can't run RAID-10 quickly compared to RAID-5 then it's a crap card, and you need a better one. Or put it into JBOD and let the OS handle the RAID-10 work. Or split it into RAID-1 sets on the controller, RAID-0 in the OS.

> I would like to put the WAL file on the SAS disks to improve performance,

Actually, the WAL doesn't need SAS for good performance really. Except for the 15K.6 Seagate Cheetahs, most decent SATA drives are within a few percent of SAS drives for sequential write / read speed, which is what the WAL basically does.

> and create one big RAID 1+0 disk for the data directory. But maybe I'm
> completely wrong. Can you please advise how to create logical partitions?
> The hardware is capable of handling different types of RAID volumes on the
> same set of disks. For example, a smaller RAID 0 for indexes and a bigger
> RAID 5 etc.

Avoid RAID-5 on OLTP. Now, if you have a Slony slave for the aggregate work stuff, and you're doing big reads and writes, RAID-5 on a large SATA set may be a good and cost-effective solution.
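For reference, on PostgreSQL 8.x the usual way to give the WAL its own spindles, as suggested above, is to move the pg_xlog directory onto the dedicated array and symlink it back into the data directory. A minimal sketch, run as the postgres user with the server stopped; the /wal mount point and the /usr/local/pgsql/data path are illustrative assumptions, not paths from this thread:

  # stop the server first; pg_xlog cannot be moved while postgres is running
  pg_ctl -D /usr/local/pgsql/data stop

  # move the WAL directory onto the dedicated array and symlink it back
  mv /usr/local/pgsql/data/pg_xlog /wal/pg_xlog
  ln -s /wal/pg_xlog /usr/local/pgsql/data/pg_xlog

  pg_ctl -D /usr/local/pgsql/data start

The same symlink trick can in principle be used to put pg_clog on its own volume, since it is just another directory under the data directory.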
On Thu, Sep 11, 2008 at 11:47 AM, Laszlo Nagy <gandalf@shopzeus.com> wrote:
> I cannot spend more money on this computer, but since you are all talking
> about battery backup, I'll try to get money from the management and buy
> this:
>
> Intel(R) RAID Smart Battery AXXRSBBU3, optional battery backup for use with
> AXXRAK18E and SRCSAS144E. RoHS Compliant.

Sacrifice a couple of SAS drives to get that. I'd rather have all SATA drives and a BBU than SAS without one.
Laszlo Nagy wrote:
> I cannot spend more money on this computer, but since you are all
> talking about battery backup, I'll try to get money from the management
> and buy this:
>
> Intel® RAID Smart Battery AXXRSBBU3, optional battery backup for use
> with AXXRAK18E and SRCSAS144E. RoHS Compliant.

The battery backup is really important. You'd be better off dropping down to 8 disks in a RAID 1+0 and putting everything on it, if that meant you could use the savings to get the battery-backed RAID controller. The performance improvement of a BB cache is amazing.

Based on advice from this group, we configured our systems with a single 8-disk RAID 1+0 with a battery-backed cache. It holds the OS, WAL and database, and it is VERY fast. We're very happy with it.

Craig
On Thu, 11 Sep 2008, Laszlo Nagy wrote:
> The expert told me to use RAID 5 but I'm hesitating.

Your "expert" isn't--at least when it comes to database performance. Trust yourself here; you've got the right general idea.

But I can't make any sense out of exactly how your disks are going to be connected to the server with that collection of hardware. What I can tell is that you're approaching that part backwards, probably under the influence of the vendor you're dealing with, and since they don't understand what you're doing you're stuck sorting that out.

If you want your database to perform well on writes, the first thing you do is select a disk controller that performs well, has a well-known stable driver for your OS, has a reasonably large cache (>=256MB), and has a battery backup on it. I don't know anything about how well this Intel RAID performs under FreeBSD, but you should check that if you haven't already. From the little bit I read about it I'm concerned whether it's fast enough for as many drives as you're using. The wrong disk controller will make a slow mess out of any hardware you throw at it.

Then, you connect as many drives to the caching controller as you can for the database. OS drives can connect to another controller (like the ports on the motherboard), but you shouldn't use them for either the database data or the WAL. That's what I can't tell from your outline of the server configuration; if it presumes a couple of the SATA disks holding database data are using the motherboard ports, you need to stop there and get a larger battery-backed caching controller.

If you're on a limited budget and the choice is between more SATA disks or fewer SAS disks, get more of the SATA ones. Once you've filled the available disk slots on the controller or run out of room in the chassis, if there's money left over then you can go back and analyze whether replacing some of those with SAS disks makes sense--as long as they're still connected to a caching controller. I don't know what flexibility the "SAS/SATA backplane" you listed has here.

You've got enough disks that it may be worthwhile to set aside two of them to be dedicated WAL volumes. That call depends on the balance of OLTP writes (which are more likely to take advantage of that) versus the report scans (which would prefer more disks in the database array), and the only way you'll know for sure is to benchmark both configurations with something resembling your application. Since you should always do stress testing on any new hardware anyway before it goes into production, that's a good time to run comparisons like that.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
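One convenient way to run the benchmark comparison Greg suggests is contrib/pgbench, which generates a simple write-heavy OLTP-style workload. A minimal sketch; the database name, scale factor, client count and transaction count are illustrative assumptions, not figures from this thread:

  # build a test database of roughly 1.5 GB (scale factor 100)
  createdb pgbench
  pgbench -i -s 100 pgbench

  # 20 concurrent clients, 10000 transactions each; note the reported TPS
  pgbench -c 20 -t 10000 pgbench

Run the same test once with the WAL on the shared array and once with it on the dedicated pair, and compare the transactions-per-second figures.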
On Thu, 11 Sep 2008, Greg Smith wrote:
> If you want your database to perform well on writes, the first thing you do
> is select a disk controller that performs well, has a well-known stable
> driver for your OS, has a reasonably large cache (>=256MB), and has a battery
> backup on it.

Greg, it might be worth you listing a few good RAID controllers. It's almost an FAQ. From what I'm hearing, this Intel one doesn't sound like it would be on the list.

Matthew

--
Riker: Our memory pathways have become accustomed to your sensory input.
Data: I understand - I'm fond of you too, Commander. And you too Counsellor
Craig James <craig_james 'at' emolecules.com> writes:
> The performance improvement of a BB cache is amazing.

Could some of you share the insight on why this is the case? I cannot find much information on it on wikipedia, for example. Even http://linuxfinances.info/info/diskusage.html doesn't explain *why*.

Out of the blue, is it just because when postgresql fsync's after a write, on a normal system the write has to really happen on disk and waiting for it to be complete, whereas with BBU cache the fsync is almost immediate because the write cache actually replaces the "really on disk" write?

--
Guillaume Cottenceau, MNC Mobile News Channel SA, an Alcatel-Lucent Company
Av. de la Gare 10, 1003 Lausanne, Switzerland - direct +41 21 317 50 36
On Fri, 12 Sep 2008, Matthew Wakeling wrote:
> Greg, it might be worth you listing a few good RAID controllers. It's almost
> an FAQ.

I started doing that at the end of http://wiki.postgresql.org/wiki/SCSI_vs._IDE/SATA_Disks , that still needs some work. What I do periodically is sweep through old messages here that have useful FAQ text and dump them into the appropriate part of http://wiki.postgresql.org/wiki/Performance_Optimization

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Fri, 12 Sep 2008, Guillaume Cottenceau wrote:
> Out of the blue, is it just because when postgresql fsync's after a
> write, on a normal system the write has to really happen on disk and
> waiting for it to be complete, whereas with BBU cache the fsync is
> almost immediate because the write cache actually replaces the "really
> on disk" write?

That's the main thing, and nothing else you can do will accelerate that. Without a useful write cache (which usually means RAM with a BBU), you'll at best get about 100-200 write transactions per second for any one client, and something like 500/second even with lots of clients (queued up transaction fsyncs do get combined). Those numbers increase to several thousand per second the minute there's a good caching controller in the mix.

You might say "but I don't write that heavily, so what?" Even if the write volume is low enough that the disk can keep up, there's still latency. A person who is writing transactions is going to be delayed a few milliseconds after each commit, which drags some types of data loading to a crawl. Also, without a cache in place, mixes of fsync'd writes and reads can behave badly, with readers getting stuck behind writers much more often than in the cached situation.

The final factor is that additional layers of cache usually help improve physical grouping of blocks into ordered sections to lower seek overhead. The OS is supposed to be doing that for you, but a cache closer to the drives themselves helps smooth things out when the OS dumps a large block of data out for some reason. The classic example in PostgreSQL land, particularly before 8.3, was when a checkpoint happens.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
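To see the per-client commit ceiling Greg describes on your own hardware, you can simply time a stream of tiny autocommitted transactions through psql, since each one forces a WAL fsync. A minimal sketch; the database and table names are placeholders:

  # one INSERT per transaction, 1000 of them
  psql -d test -c "CREATE TABLE commit_test (id serial PRIMARY KEY, note text);"
  yes "INSERT INTO commit_test (note) VALUES ('x');" | head -n 1000 > /tmp/commits.sql

  # 1000 / elapsed seconds = commits per second for a single client;
  # expect roughly 100-200/s on bare disks versus thousands with a BBU write cache
  time psql -d test -q -f /tmp/commits.sql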
On Fri, Sep 12, 2008 at 5:11 AM, Greg Smith <gsmith@gregsmith.com> wrote:
> On Fri, 12 Sep 2008, Guillaume Cottenceau wrote:
>
> That's the main thing, and nothing else you can do will accelerate that.
> Without a useful write cache (which usually means RAM with a BBU), you'll at
> best get about 100-200 write transactions per second for any one client, and
> something like 500/second even with lots of clients (queued up transaction
> fsyncs do get combined). Those numbers increase to several thousand per
> second the minute there's a good caching controller in the mix.

While this is correct, if heavy writing is sustained, especially on large databases, you will eventually outrun the write cache on the controller and things will start to degrade towards the slow case. So it's fairer to say that caching raid controllers burst up to several thousand per second, with a sustained write rate somewhat better than write-through but much worse than the burst rate.

How fast things degrade from the burst rate depends on certain factors...how big the database is relative to the o/s read cache and the controller write cache, and how random the i/o is generally. One thing raid controllers are great at is smoothing bursty i/o during checkpoints for example.

Unfortunately when you outrun cache on raid controllers the behavior is not always very pleasant...in at least one case I've experienced (perc 5/i), when the cache fills up the card decides to clear it before continuing. This means that if fsync is on, you get unpredictable random freezing pauses while the cache is clearing.

merlin
On Fri, 12 Sep 2008, Merlin Moncure wrote:
> On Fri, Sep 12, 2008 at 5:11 AM, Greg Smith <gsmith@gregsmith.com> wrote:
>> On Fri, 12 Sep 2008, Guillaume Cottenceau wrote:
>>
>> That's the main thing, and nothing else you can do will accelerate that.
>> Without a useful write cache (which usually means RAM with a BBU), you'll at
>> best get about 100-200 write transactions per second for any one client, and
>> something like 500/second even with lots of clients (queued up transaction
>> fsyncs do get combined). Those numbers increase to several thousand per
>> second the minute there's a good caching controller in the mix.
>
> While this is correct, if heavy writing is sustained, especially on
> large databases, you will eventually outrun the write cache on the
> controller and things will start to degrade towards the slow case. So
> it's fairer to say that caching raid controllers burst up to several
> thousand per second, with a sustained write rate somewhat better than
> write-through but much worse than the burst rate.
>
> How fast things degrade from the burst rate depends on certain
> factors...how big the database is relative to the o/s read cache and
> the controller write cache, and how random the i/o is generally. One
> thing raid controllers are great at is smoothing bursty i/o during
> checkpoints for example.
>
> Unfortunately when you outrun cache on raid controllers the behavior
> is not always very pleasant...in at least one case I've experienced
> (perc 5/i), when the cache fills up the card decides to clear it before
> continuing. This means that if fsync is on, you get unpredictable
> random freezing pauses while the cache is clearing.

although for postgres the thing that you are doing the fsync on is the WAL log file. that is a single (usually) contiguous file. As such it is very efficient to write large chunks of it. so while you will degrade from the battery-only mode, the fact that the controller can flush many requests worth of writes out to the WAL log at once while you fill the cache with them one at a time is still a significant win.

David Lang
On Sat, Sep 13, 2008 at 5:26 PM, <david@lang.hm> wrote:
> On Fri, 12 Sep 2008, Merlin Moncure wrote:
>>
>> While this is correct, if heavy writing is sustained, especially on
>> large databases, you will eventually outrun the write cache on the
>> controller and things will start to degrade towards the slow case. So
>> it's fairer to say that caching raid controllers burst up to several
>> thousand per second, with a sustained write rate somewhat better than
>> write-through but much worse than the burst rate.
>>
>> How fast things degrade from the burst rate depends on certain
>> factors...how big the database is relative to the o/s read cache and
>> the controller write cache, and how random the i/o is generally. One
>> thing raid controllers are great at is smoothing bursty i/o during
>> checkpoints for example.
>>
>> Unfortunately when you outrun cache on raid controllers the behavior
>> is not always very pleasant...in at least one case I've experienced
>> (perc 5/i), when the cache fills up the card decides to clear it before
>> continuing. This means that if fsync is on, you get unpredictable
>> random freezing pauses while the cache is clearing.
>
> although for postgres the thing that you are doing the fsync on is the WAL
> log file. that is a single (usually) contiguous file. As such it is very
> efficient to write large chunks of it. so while you will degrade from the
> battery-only mode, the fact that the controller can flush many requests
> worth of writes out to the WAL log at once while you fill the cache with
> them one at a time is still a significant win.

The heap files have to be synced as well during checkpoints, etc.

merlin
Merlin Moncure wrote:
> > although for postgres the thing that you are doing the fsync on is the WAL
> > log file. that is a single (usually) contiguous file. As such it is very
> > efficient to write large chunks of it. so while you will degrade from the
> > battery-only mode, the fact that the controller can flush many requests
> > worth of writes out to the WAL log at once while you fill the cache with
> > them one at a time is still a significant win.
>
> The heap files have to be synced as well during checkpoints, etc.

True, but as of 8.3 those checkpoint fsyncs are spread over the interval between checkpoints.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
On Tue, 2008-09-23 at 13:02 -0400, Bruce Momjian wrote:
> Merlin Moncure wrote:
> > > although for postgres the thing that you are doing the fsync on is the WAL
> > > log file. that is a single (usually) contiguous file. As such it is very
> > > efficient to write large chunks of it. so while you will degrade from the
> > > battery-only mode, the fact that the controller can flush many requests
> > > worth of writes out to the WAL log at once while you fill the cache with
> > > them one at a time is still a significant win.
> >
> > The heap files have to be synced as well during checkpoints, etc.
>
> True, but as of 8.3 those checkpoint fsyncs are spread over the interval
> between checkpoints.

No, the fsyncs still all happen in a tight window after we have issued the writes. There are no waits in between them at all. The delays we introduced are all in the write phase. Whether that is important or not depends upon OS parameter settings.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
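For anyone tuning this on 8.3, the spreading Bruce and Simon are talking about is controlled by checkpoint_completion_target, together with checkpoint_segments and checkpoint_timeout, in postgresql.conf. A minimal sketch; the values below are illustrative assumptions, not recommendations from this thread:

  # postgresql.conf (PostgreSQL 8.3)
  checkpoint_segments = 30              # allow more WAL between checkpoints
  checkpoint_timeout = 15min
  checkpoint_completion_target = 0.9    # spread the checkpoint write phase over
                                        # ~90% of the interval; the fsyncs at the
                                        # end still arrive in a burst, as Simon notes

All three settings can be changed with a configuration reload (pg_ctl reload); no restart is needed.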