Thread: 12 disks raid setup
Hi,

I am in the process of setting up a PostgreSQL server with 12 SAS disks.

I am considering two options:

1) set up a 12-disk RAID 10 array to get maximum raw performance from the system and put everything on it (i.e. the whole pg cluster, including WAL and every tablespace)

2) set up 3 RAID 10 arrays of 4 disks each, and dispatch my data across them via tablespaces:

data1 = pg cluster + reference data (dimensions) tablespace
data2 = fact data tablespace
data3 = indices tablespace

Typical workload is either massive insert/update via ETL or complex queries on big (10 million tuples) tables with several joins (including Mondrian ROLAP).

Does anyone have an opinion on what would give the best results?

Thanks,
Franck
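[For reference, option 2's tablespace split is usually done by mounting each array separately and pointing tablespaces at the extra mounts. This is only a sketch: the /data1, /data2 and /data3 mount points, tablespace names and object names are assumptions, not values from the thread.]

    # Assumes the three arrays are mounted at /data1, /data2, /data3,
    # that the cluster itself lives on /data1, and that the postgres
    # OS user owns the extra directories.
    mkdir -p /data2/pg_fact /data3/pg_idx
    chown postgres:postgres /data2/pg_fact /data3/pg_idx

    psql -U postgres -c "CREATE TABLESPACE fact_data LOCATION '/data2/pg_fact';"
    psql -U postgres -c "CREATE TABLESPACE indices   LOCATION '/data3/pg_idx';"

    # Objects are then placed explicitly, e.g.:
    #   CREATE TABLE sales (...) TABLESPACE fact_data;
    #   CREATE INDEX sales_date_idx ON sales (sale_date) TABLESPACE indices;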
On Fri, Feb 29, 2008 at 5:51 AM, Franck Routier <franck.routier@axege.com> wrote:
> I am in the process of setting up a PostgreSQL server with 12 SAS disks.
>
> I am considering two options:
>
> 1) set up a 12-disk RAID 10 array to get maximum raw performance from
> the system and put everything on it (i.e. the whole pg cluster, including
> WAL and every tablespace)
>
> 2) set up 3 RAID 10 arrays of 4 disks each, and dispatch my data across
> them via tablespaces

How you set it up depends on your RAID controller as much as anything. Good battery-backed RAID controllers seem to work better with one big RAID-10 array. But as with anything, unless you benchmark it, you're really just guessing which is best.
Hi,

my RAID controller is an Adaptec 31205 SAS/RAID controller. The battery was an option, but I didn't know that at purchase time. So I have no battery, but the whole system is on a UPS.

I have done quite a few tests using bonnie++, focusing on 'random seek' results, and found that:

1) Linux md RAID 10 performs better than Adaptec hardware RAID in this area (random seek), by 15%, quite consistently
2) hardware RAID is better on sequential output
3) md outperforms it again on sequential read, especially with the far layout option

So in fact I think I will use md RAID, but I still don't know which layout (3x4 or 1x12).

What would you suggest as a benchmarking method? Simply issue a few big queries that I expect to be typical and see how long they take, or is there a more convenient and/or "scientific" method?

Thanks,
Franck
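[A far-layout md array of the kind discussed here can be built and exercised roughly along these lines. This is only a sketch: the device names, chunk size, filesystem and file-set size are assumptions, not values from the thread.]

    # 12-disk md RAID 10 with the 'far 2' layout; adjust devices before use.
    mdadm --create /dev/md0 --level=10 --layout=f2 --chunk=256 \
          --raid-devices=12 /dev/sd[b-m]1

    mkfs.xfs /dev/md0
    mount /dev/md0 /data

    # bonnie++ wants a file set roughly twice the size of RAM (16 GB assumed
    # here) so that the OS page cache does not mask the disks.
    bonnie++ -d /data -s 32g -u postgres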
On Fri, 29 Feb 2008, Franck Routier wrote:

> my RAID controller is an Adaptec 31205 SAS/RAID controller. The battery
> was an option, but I didn't know that at purchase time. So I have no
> battery, but the whole system is on a UPS.

The UPS is of no help here. The problem is that PostgreSQL forces the disk controller to commit WAL writes to disk after every transaction. If you have a controller with a battery-backed cache, you can use that cache to buffer those writes and dramatically increase write performance. The UPS doesn't give you the same write guarantees. Let's say someone trips over the server power cord (the simplest example of a whole class of failures). With the BBC controller, the cached writes will get committed when you plug the server back in. If all you've got is a UPS, writes that didn't make it to disk before the outage are lost. That means you can't buffer those writes without risking database corruption.

The general guideline here is that if you don't have a battery-backed cache on your controller, disk rotation speed will limit you to around 100 (7200 RPM) to 200 (15K RPM) commits/second for a single client, with each commit facing around a 2-4 ms delay. That rises to perhaps 500/s total with lots of clients. BBC configurations can easily clear 3000/s total, and individual commits don't have that couple-of-ms delay.

> So in fact I think I will use md RAID, but I still don't know which
> layout (3x4 or 1x12).

The only real downside of md RAID is that if you lose the boot device it can be tricky to get the system to start again; hardware RAID hides that little detail from the BIOS. Make sure you simulate a failure of the primary boot drive and are comfortable with recovering from that situation before you go into production with md.

The only way to know which layout will work better is to have a lot of knowledge of this application and how it bottlenecks under load. If you know, for example, that there's a particular set of tables/indexes that are critical to real-time users, whereas others are only used by batch operations, things like that can be used to figure out how to optimize the disk layout. If you don't know your database to that level, put everything into one big array and forget about it; you won't do any better than that.

> What would you suggest as a benchmarking method? Simply issue a few big
> queries that I expect to be typical and see how long they take, or is there
> a more convenient and/or "scientific" method?

Benchmarking is hard and you have to use a whole array of tests if you want to quantify the many aspects of performance. You're doing the right thing using bonnie++ to quantify disk speed. If you've got some typical queries, using those to fine-tune postgresql.conf parameters is a good idea; just make sure to set shared_buffers and effective_cache_size, and run ANALYZE on your tables. Be careful to note performance differences when the cache is already filled with data from previous runs. Measuring write/commit performance is probably easiest using pgbench.

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
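[A quick way to put numbers on that commits/second guideline is pgbench, which ships in PostgreSQL's contrib. A sketch only: the scale factor, client counts and transaction counts below are arbitrary assumptions.]

    # pgbench's default TPC-B-like test is commit-heavy, so the reported tps
    # gives a rough idea of the commits/second ceiling discussed above.
    createdb pgbench
    pgbench -i -s 100 pgbench       # initialize roughly 1.5 GB of test data

    # Single client: without a battery-backed cache this should sit near the
    # 100-200 tps bound imposed by disk rotation speed.
    pgbench -c 1 -t 5000 pgbench

    # Many clients: commits can be grouped, so the total rises.
    pgbench -c 20 -t 1000 pgbench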
Franck Routier wrote:
> I am in the process of setting up a postgresql server with 12 SAS disks.
>
> I am considering two options:
>
> 1) set up a 12-disk RAID 10 array to get maximum raw performance from
> the system and put everything on it (i.e. the whole pg cluster, including
> WAL and every tablespace)
>
> 2) set up 3 RAID 10 arrays of 4 disks, and dispatch my data on these
> disks via tablespaces:
>
> data1 = pg cluster + reference data (dimensions) tablespace
> data2 = fact data tablespace
> data3 = indices tablespace

Option 2. In fact, within one of the RAID 1 sets I would also use another soft partition and separate out pg_xlog as well (see the sketch below).

My 2 cents, based on my benchmarks.

-Jignesh
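[Separating pg_xlog onto its own filesystem is usually done with a symlink. A minimal sketch, assuming $PGDATA points at the cluster directory, /xlog is the hypothetical mount point for the dedicated partition, and the server has been shut down cleanly first.]

    pg_ctl -D $PGDATA stop

    mv $PGDATA/pg_xlog /xlog/pg_xlog
    ln -s /xlog/pg_xlog $PGDATA/pg_xlog

    pg_ctl -D $PGDATA start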
On Fri, 29 Feb 2008 12:17:29 -0500, "Jignesh K. Shah" <J.K.Shah@sun.com> wrote:
> Franck Routier wrote:
> > I am in the process of setting up a postgresql server with 12 SAS disks.
> >
> > I am considering two options:
> >
> > 1) set up a 12-disk RAID 10 array to get maximum raw performance
> > from the system and put everything on it (i.e. the whole pg cluster,
> > including WAL and every tablespace)

I would do this (assuming you have other spindles for the OS):

/data1 - RAID 10, journalled filesystem, + 1 spare (so 9 disks)
/xlogs - RAID 1, non-journalled filesystem, + 1 spare (so 3 disks)

You can create any number of tablespaces for whatever further growth you see fit, and move them as more IO becomes available.

Sincerely,

Joshua D. Drake

--
The PostgreSQL Company since 1997: http://www.commandprompt.com/
PostgreSQL Community Conference: http://www.postgresqlconference.org/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL SPI Liaison | SPI Director | PostgreSQL political pundit
On Fri, 29 Feb 2008, Joshua D. Drake wrote:
> /data1 - RAID 10, journalled filesystem, + 1 spare (so 9 disks)
> /xlogs - RAID 1, non-journalled filesystem, + 1 spare (so 3 disks)

Sounds good. Can't they share the hot spare, rather than having two?

However, I would recommend splashing out on the battery for the cache, and then just putting them all in one RAID 10 lump.

Matthew

--
To be or not to be   -- Shakespeare
To do is to be       -- Nietzsche
To be is to do       -- Sartre
Do be do be do       -- Sinatra
On Feb 29, 2008, at 9:51 AM, Franck Routier wrote:
> my RAID controller is an Adaptec 31205 SAS/RAID controller. The battery
> was an option, but I didn't know that at purchase time. So I have no
> battery, but the whole system is on a UPS.

Go find one on eBay or via a Google search, and plug it in. Adaptec batteries just snap in and sometimes have a bracket to clip them in place. Your performance will be awful without one, since you can't safely write-cache.

Also, if your card has upgradable RAM (though I've never seen an Adaptec card which could), max it out.
Greg Smith wrote:
>
> The only real downside of md RAID is that if you lose the boot device
> it can be tricky to get the system to start again; hardware RAID hides
> that little detail from the BIOS. Make sure you simulate a failure of
> the primary boot drive and are comfortable with recovering from that
> situation before you go into production with md.

+1

I usually ensure there is a separate /boot that is set up as RAID 1, with md using all the disks for that RAID 1 - that way I can keep the partition map the same for all the disks, otherwise it gets fiddly!

Cheers

Mark
Greg Smith wrote:
> On Fri, 29 Feb 2008, Franck Routier wrote:
>
>> my RAID controller is an Adaptec 31205 SAS/RAID controller. The battery
>> was an option, but I didn't know that at purchase time. So I have no
>> battery, but the whole system is on a UPS.
>
> The UPS is of no help here. The problem is that PostgreSQL forces the
> disk controller to commit WAL writes to disk after every transaction.
> If you have a controller with a battery-backed cache, you can use that
> cache to buffer those writes and dramatically increase write
> performance. The UPS doesn't give you the same write guarantees. Let's
> say someone trips over the server power cord (the simplest example of a
> whole class of failures). With the BBC controller, the cached writes
> will get committed when you plug the server back in. If all you've got
> is a UPS, writes that didn't make it to disk before the outage are
> lost. That means you can't buffer those writes without risking database
> corruption.
>
> The general guideline here is that if you don't have a battery-backed
> cache on your controller, disk rotation speed will limit you to around
> 100 (7200 RPM) to 200 (15K RPM) commits/second for a single client,
> with each commit facing around a 2-4 ms delay. That rises to perhaps
> 500/s total with lots of clients. BBC configurations can easily clear
> 3000/s total, and individual commits don't have that couple-of-ms delay.

It may be the way you have worded this, but it makes it sound like the cache and the battery backup are as one (or that the cache doesn't work unless you have the battery).

The cache may be optional (or plug-in) on some cards, even of varied size. The battery is normally optional. You can normally add/remove the battery without changing the cache options.

If the RAID card has the cache without the battery, you would get the performance figures you mentioned; you just wouldn't have the reliability of finishing writes after a power-off situation.

Correct me if I am wrong here.

--
Shane Ambler
pgSQL (at) Sheeky (dot) Biz

Get Sheeky @ http://Sheeky.Biz
On Sat, 1 Mar 2008, Shane Ambler wrote:

> It may be the way you have worded this, but it makes it sound like the
> cache and the battery backup are as one (or that the cache doesn't work
> unless you have the battery)... If the RAID card has the cache without
> the battery, you would get the performance figures you mentioned; you
> just wouldn't have the reliability of finishing writes after a power-off
> situation.

Wording is intentional--if you don't have a battery for it, the cache has to be turned off (or set to write-through so it's only being used on reads) in order for the database to be reliable. If you can't finish writes after a power off, you can't cache writes and expect your database to survive for too long.

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
Hi,

On Friday, 29 February 2008 at 23:56 -0500, Greg Smith wrote:
> Wording is intentional--if you don't have a battery for it, the cache has
> to be turned off (or set to write-through so it's only being used on
> reads) in order for the database to be reliable. If you can't finish
> writes after a power off, you can't cache writes and expect your database
> to survive for too long.

Well, am I just wrong, or might the file system also rely heavily on caching, especially as I use XFS?

So either way PostgreSQL has no way to know if the data is really on the disk, and in case of a brutal outage the system may well lose data, whether there is another level of caching (RAID controller) or not...

Right?
On Sat, Mar 1, 2008 at 4:27 AM, Franck Routier <franck.routier@axege.com> wrote:
> Hi,
>
> On Friday, 29 February 2008 at 23:56 -0500, Greg Smith wrote:
> > Wording is intentional--if you don't have a battery for it, the cache has
> > to be turned off (or set to write-through so it's only being used on
> > reads) in order for the database to be reliable. If you can't finish
> > writes after a power off, you can't cache writes and expect your database
> > to survive for too long.
>
> Well, am I just wrong, or might the file system also rely heavily on
> caching, especially as I use XFS?
>
> So either way PostgreSQL has no way to know if the data is really on the
> disk, and in case of a brutal outage the system may well lose data,
> whether there is another level of caching (RAID controller) or not...
>
> Right?

Nope. Assuming your disk subsystem doesn't lie about write completion, PostgreSQL can recover from a complete and sudden loss of power without any data loss.
"Scott Marlowe" <scott.marlowe@gmail.com> writes: > On Sat, Mar 1, 2008 at 4:27 AM, Franck Routier <franck.routier@axege.com> wrote: >> Well, am I just wrong, or the file system might also heavily rely on >> cache, especially as I use XFS ? >> >> So anyway Postgresql has no way to know if the data is really on the >> disk, and in case of a brutal outage, the system may definitely lose >> data, wether there is another level of caching (Raid controller) or >> not... > nope. assuming your disk subsystem doesn't lie about write > completion, then postgresql can recover from complete and sudden loss > of power without any data loss. Franck does have a point here: we are expecting the filesystem to tend to its own knitting. If a power failure corrupts the filesystem so badly that we can't find the WAL files, or their contents are badly scrambled, then we're screwed. Most modern filesystems defend themselves against that using journaling, which is exactly the same idea as WAL but applied to filesystem metadata. We do expect that when we fsync a file, by the time the OS reports that that's done both the file contents and its metadata are safely on disk. This is part of the specification for fsync, so the OS is clearly broken if it doesn't get that right. Whether the OS *can* guarantee it if the disk drive lies about write completion is something you'd have to ask the filesystem hackers about. regards, tom lane
We're upgrading to a medium-sized server, a Dell PowerEdge 2950, dual quad-core CPUs and 8 GB memory. This box can hold at most 8 disks (10K SCSI 2.5" 146 GB drives) and has Dell's Perc 6/i RAID controller.

I'm thinking of this:

6 disks  RAID 1+0  Postgres data
1 disk   WAL
1 disk   Linux

I've often seen RAID 1 recommended for the WAL. Is that strictly for reliability, or is there a performance advantage to RAID 1 for the WAL?

It seems to me that separating the OS and WAL onto two disks is better than making a single RAID 1 and sharing it, from a performance point of view.

Thanks,
Craig
On Sat, Mar 1, 2008 at 12:06 PM, Craig James <craig_james@emolecules.com> wrote:
> We're upgrading to a medium-sized server, a Dell PowerEdge 2950, dual quad-core CPUs and 8 GB memory. This box can hold at most 8 disks (10K SCSI 2.5" 146 GB drives) and has Dell's Perc 6/i RAID controller.
>
> I'm thinking of this:
>
> 6 disks  RAID 1+0  Postgres data
> 1 disk   WAL
> 1 disk   Linux
>
> I've often seen RAID 1 recommended for the WAL. Is that strictly for reliability, or is there a performance advantage to RAID 1 for the WAL?
>
> It seems to me that separating the OS and WAL onto two disks is better than making a single RAID 1 and sharing it, from a performance point of view.

It's a trade-off. Remember that if the single disk holding the xlog fails, you've quite possibly lost your database. I'd be inclined to use a RAID-1 of two disks for the OS and xlog, and have pgsql log to the 6-disk RAID-10 instead of the OS/xlog disk set.

More important, do you have battery-backed cache on the controller? A good controller with a battery-backed cache can usually outrun a larger array with no write cache when it comes to transactions / writing to the disks.
On Sat, 01 Mar 2008 10:06:54 -0800, Craig James <craig_james@emolecules.com> wrote:
> We're upgrading to a medium-sized server, a Dell PowerEdge 2950,
> dual quad-core CPUs and 8 GB memory. This box can hold at most 8 disks
> (10K SCSI 2.5" 146 GB drives) and has Dell's Perc 6/i RAID controller.
>
> I'm thinking of this:
>
> 6 disks  RAID 1+0  Postgres data
> 1 disk   WAL
> 1 disk   Linux
>
> I've often seen RAID 1 recommended for the WAL. Is that strictly for
> reliability, or is there a performance advantage to RAID 1 for the WAL?
>
> It seems to me that separating the OS and WAL onto two disks is better
> than making a single RAID 1 and sharing it, from a performance point of
> view.

This scares me... You lose the WAL, you are a goner. Combine your OS and WAL into a RAID 1.

Sincerely,

Joshua D. Drake

--
The PostgreSQL Company since 1997: http://www.commandprompt.com/
PostgreSQL Community Conference: http://www.postgresqlconference.org/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL SPI Liaison | SPI Director | PostgreSQL political pundit
Joshua D. Drake wrote:
> On Sat, 01 Mar 2008 10:06:54 -0800, Craig James <craig_james@emolecules.com> wrote:
>
>> We're upgrading to a medium-sized server, a Dell PowerEdge 2950,
>> dual quad-core CPUs and 8 GB memory. This box can hold at most 8 disks
>> (10K SCSI 2.5" 146 GB drives) and has Dell's Perc 6/i RAID controller.
>>
>> I'm thinking of this:
>>
>> 6 disks  RAID 1+0  Postgres data
>> 1 disk   WAL
>> 1 disk   Linux
>>
>> I've often seen RAID 1 recommended for the WAL. Is that strictly for
>> reliability, or is there a performance advantage to RAID 1 for the WAL?
>>
>> It seems to me that separating the OS and WAL onto two disks is better
>> than making a single RAID 1 and sharing it, from a performance point of
>> view.
>
> This scares me... You lose the WAL, you are a goner. Combine your OS and
> WAL into a RAID 1.

Right, I do understand that, but reliability is not a top priority in this system. The database will be replicated, and can be reproduced from the raw data. It's not an accounting system; it finds scientific results. That's not to say I *won't* take your advice; we may in fact combine the OS and WAL on one disk. Reliability is a good thing, but I need to know all of the tradeoffs, so that I can weigh performance, reliability, and cost and make the right choice.

So my question still stands: from a strictly performance point of view, would it be better to separate the OS and the WAL onto two disks? Is there any performance advantage to RAID 1? My understanding is that RAID 1 can give 2x seek performance during reads, but no advantage during writes. For the WAL, it seems to me that RAID 1 has no performance benefits, so separating the WAL and OS seems like a performance advantage.

Another option would be:

4 disks  RAID 1+0  Postgres data
2 disks  RAID 1    WAL
1 disk   Linux
1 disk   spare

This would give us reliability, but I think the performance would be considerably worse, since the primary Postgres data would come from 4 disks instead of six.

I guess we could also consider:

4 disks  RAID 1+0  Postgres data
4 disks  RAID 1+0  WAL and Linux

Or even:

8 disks  RAID 1+0  Everything

This is a dedicated system and does nothing but Apache/Postgres, so the OS should get very little traffic. But if that's the case, I guess you could argue that your suggestion of combining OS and WAL on a 2-disk RAID 1 would be the way to go, since the OS activity wouldn't affect the WAL very much.

I suppose the thing to do is get the system and run bonnie on various configurations. I've never run bonnie before -- can I get some useful results without a huge learning curve?

Thanks,
Craig
On Sat, Mar 1, 2008 at 3:53 PM, Craig James <craig_james@emolecules.com> wrote:
> Joshua D. Drake wrote:
> > This scares me... You lose the WAL, you are a goner. Combine your OS and
> > WAL into a RAID 1.
>
> Right, I do understand that, but reliability is not a top priority in this system. The database will be replicated, and can be reproduced from the raw data. It's not an accounting system; it finds scientific results. That's not to say I *won't* take your advice; we may in fact combine the OS and WAL on one disk. Reliability is a good thing, but I need to know all of the tradeoffs, so that I can weigh performance, reliability, and cost and make the right choice.

In that case you could always make the data partition a 6-disk RAID-0.

> So my question still stands: from a strictly performance point of view, would it be better to separate the OS and the WAL onto two disks? Is there any performance advantage to RAID 1? My understanding is that RAID 1 can give 2x seek performance during reads, but no advantage during writes. For the WAL, it seems to me that RAID 1 has no performance benefits, so separating the WAL and OS seems like a performance advantage.

Yes, only on reads. Correct.

> Another option would be:
>
> 4 disks  RAID 1+0  Postgres data
> 2 disks  RAID 1    WAL
> 1 disk   Linux
> 1 disk   spare
>
> This would give us reliability, but I think the performance would be considerably worse, since the primary Postgres data would come from 4 disks instead of six.

Performance-wise, RAID-10 with n disks is about the same as RAID-0 with n/2 disks. So you're losing about 1/3 of your peak performance, assuming 100% efficient controllers and that you aren't bottlenecking I/O with > 4 disks.

> I guess we could also consider:
>
> 4 disks  RAID 1+0  Postgres data
> 4 disks  RAID 1+0  WAL and Linux
>
> Or even:
>
> 8 disks  RAID 1+0  Everything

It really depends on the controller. Battery-backed write cache? Then the one big everything is often faster than any other method. No BB cache? Then splitting them up will help.

> I suppose the thing to do is get the system and run bonnie on various configurations. I've never run bonnie before -- can I get some useful results without a huge learning curve?

Yes, it's fairly easy to drive. It'll tell you more about your controller than anything else, which is very useful information. The way different controllers behave with different configurations can be very, very different from one controller to the next.
On Sat, 1 Mar 2008, Franck Routier wrote:

> Well, am I just wrong, or might the file system also rely heavily on
> caching, especially as I use XFS? So either way PostgreSQL has no way to
> know if the data is really on the disk, and in case of a brutal outage the
> system may well lose data, whether there is another level of caching (RAID
> controller) or not...

After PostgreSQL writes to the WAL, it calls fsync. If your filesystem doesn't then force a real write to disk at that point and clear whatever cache it might have, it's broken and unsuitable for database use. XFS is smart enough to understand that.

The only things people typically run into that will hear fsync and lie about the data actually being written to disk are a) caching controllers with the write cache turned on and b) cheap hard drives.

In case (a), having a battery backup for the cache is sufficient to survive most classes of outage without damage--if the system is without power for longer than the battery lasts you're in trouble, otherwise it shouldn't be a problem. In case (b), you have to turn the disk cache off to get reliable database operation.

I've put all the interesting trivia on this topic I've ever come across at http://www.westnet.com/~gsmith/content/postgresql/TuningPGWAL.htm if you're looking for some really exciting reading.

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
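[For case (b), the on-drive write cache can usually be switched off from the OS. A sketch only: the device names are assumptions, and sdparm/hdparm availability depends on the distribution.]

    # SAS/SCSI disks: clear the write-cache-enable (WCE) bit with sdparm.
    sdparm --clear=WCE --save /dev/sda
    sdparm --get WCE /dev/sda          # verify the setting

    # Plain ATA/SATA disks: the hdparm equivalent.
    hdparm -W0 /dev/sda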
On Sat, 1 Mar 2008, Craig James wrote:

> So my question still stands: from a strictly performance point of view,
> would it be better to separate the OS and the WAL onto two disks?

You're not getting a more useful answer here because you haven't yet mentioned a) what the disk controller is or b) how much writing activity is going on. If you can cache writes, most of the advantages to having a separate WAL disk aren't important unless you've got an extremely high write throughput (higher than you can likely sustain with only 8 disks), so you can put the WAL data just about anywhere.

> This is a dedicated system and does nothing but Apache/Postgres, so the OS
> should get very little traffic. But if that's the case, I guess you could
> argue that your suggestion of combining OS and WAL on a 2-disk RAID 1 would
> be the way to go, since the OS activity wouldn't affect the WAL very much.

The main thing to watch out for if the OS and WAL are on the same disk is that some random process spewing log files could fill the disk, and now the database is stalled.

I think there are two configurations that make sense for your situation:

> 8 disks  RAID 1+0  Everything

This maximizes potential sequential and seek throughput for the database, which is probably going to be your bottleneck unless you're writing lots of simple data, while still allowing survival of any one disk. The crazy log situation I mentioned above is less likely to be a problem, because having so much more disk space available to everything means it's more likely you'll notice it before the disk actually fills.

6 disks  RAID 0  Postgres data + WAL
2 disks  RAID 1  Linux

This puts some redundancy on the base OS, so no single disk loss can actually take down the system altogether. You get maximum throughput on the database. If you lose a database disk, you replace it and rebuild the whole database at that point.

> I suppose the thing to do is get the system and run bonnie on various
> configurations. I've never run bonnie before -- can I get some useful
> results without a huge learning curve?

I've collected some bonnie++ examples at http://www.westnet.com/~gsmith/content/postgresql/pg-disktesting.htm that you may find useful. With only 8 disks you should be able to get useful results without a learning curve; with significantly more it can be necessary to run more than one bonnie at once to really saturate the disks, and that's trickier.

I don't think you're going to learn anything useful from that, though (other than figuring out whether your disk+controller combination is fundamentally fast or not). As you put more disks into the array, sequential throughput and seeks/second will go up. This doesn't tell you anything useful about whether the WAL is going to get enough traffic to be a bottleneck such that it needs to be on a separate disk. To figure that out, you need to run some simulations of the real database and its application, and doing that fairly is a more serious benchmarking project.

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
On Sat, 1 Mar 2008, Craig James wrote:
> Right, I do understand that, but reliability is not a top priority in this
> system. The database will be replicated, and can be reproduced from the raw
> data.

So what you're saying is:

1. Reliability is not important.
2. There's zero write traffic once the database is set up.

If this is true, then RAID-0 is the way to go. I think Greg's options are good. Either:

2 discs RAID 1: OS
6 discs RAID 0: database + WAL

which is what we're using here (except with more discs), or:

8 discs RAID 10: everything

However, if reliability *really* isn't an issue, and you can accept reinstalling the system if you lose a disc, then there's a third option:

8 discs RAID 0: everything

Matthew

--
To be or not to be -- Shakespeare; To do is to be -- Nietzsche; To be is to do -- Sartre; Do be do be do -- Sinatra
Matthew wrote:
> On Sat, 1 Mar 2008, Craig James wrote:
>> Right, I do understand that, but reliability is not a top priority in
>> this system. The database will be replicated, and can be reproduced
>> from the raw data.
>
> So what you're saying is:
>
> 1. Reliability is not important.
> 2. There's zero write traffic once the database is set up.
>
> If this is true, then RAID-0 is the way to go. I think Greg's options
> are good. Either:
>
> 2 discs RAID 1: OS
> 6 discs RAID 0: database + WAL
>
> which is what we're using here (except with more discs), or:
>
> 8 discs RAID 10: everything

Has anybody been able to prove to themselves that RAID 0 really is faster than RAID 1+0 for these sorts of loads? My understanding is that RAID 1+0 *can* reduce latency for reads, but that it relies on random access, whereas RAID 0 performs best for sequential scans. Does PostgreSQL ever do enough random access to make RAID 1+0 shine?

Curious.

Thanks,

mark

--
Mark Mielke <mark@mielke.cc>
Matthew wrote:
> On Sat, 1 Mar 2008, Craig James wrote:
>> Right, I do understand that, but reliability is not a top priority in
>> this system. The database will be replicated, and can be reproduced
>> from the raw data.
>
> So what you're saying is:
>
> 1. Reliability is not important.
> 2. There's zero write traffic once the database is set up.

Well, I actually didn't say either of those things, but I appreciate the feedback. RAID 0 is an interesting suggestion, but given our constraints, it's not an option. Reliability is important, but not as important as in, say, a banking system. And as far as zero write traffic, I don't know where that came from. It's a "hitlist"-based system, where complex search results are saved for the user in tables, and the write traffic can be quite high.

> If this is true, then RAID-0 is the way to go. I think Greg's options
> are good. Either:
>
> 2 discs RAID 1: OS
> 6 discs RAID 0: database + WAL
>
> which is what we're using here (except with more discs), or:
>
> 8 discs RAID 10: everything

Right now, an 8-disk RAID 10 is looking like the best choice. The Dell Perc 6i has configurations that include a battery-backed cache, so performance should be quite good.

> However, if reliability *really* isn't an issue, and you can accept
> reinstalling the system if you lose a disc, then there's a third option:
>
> 8 discs RAID 0: everything

I imagine the MTBF on a system like this would be < 1 year, which is out of the question, even with a backup system that can take over. A failure completely wipes the system, OS and everything, so you're guaranteed that once or twice a year you have to rebuild your system from the ground up. I'd rather spend that time at the beach!

Craig
On Mon, 3 Mar 2008, Mark Mielke wrote:
> Has anybody been able to prove to themselves that RAID 0 really is faster
> than RAID 1+0 for these sorts of loads? My understanding is that RAID 1+0
> *can* reduce latency for reads, but that it relies on random access, whereas
> RAID 0 performs best for sequential scans. Does PostgreSQL ever do enough
> random access to make RAID 1+0 shine?

Theoretically the performance of RAID 0 and RAID 10 should be identical for reads, both seeks and throughput, assuming you have a sensible readahead and a good controller. For writes, RAID 10 needs to write to multiple drives, so it is slower. Whether this is true in reality is another matter, as all sorts of factors come in, not least how good your controller is at managing the arrangement.

Matthew

--
The only secure computer is one that's unplugged, locked in a safe, and buried 20 feet under the ground in a secret location... and I'm not even too sure about that one. -- Dennis Huges, FBI
Matthew wrote:
> On Mon, 3 Mar 2008, Mark Mielke wrote:
>> Has anybody been able to prove to themselves that RAID 0 really is faster
>> than RAID 1+0 for these sorts of loads? My understanding is that RAID 1+0
>> *can* reduce latency for reads, but that it relies on random access,
>> whereas RAID 0 performs best for sequential scans. Does PostgreSQL
>> ever do enough random access to make RAID 1+0 shine?
> Theoretically the performance of RAID 0 and RAID 10 should be
> identical for reads, both seeks and throughput, assuming you have a
> sensible readahead and a good controller. For writes, RAID 10 needs to
> write to multiple drives, so it is slower. Whether this is true in
> reality is another matter, as all sorts of factors come in, not least
> how good your controller is at managing the arrangement.

I don't think your statement that they should be identical is true - RAID 1+0 can satisfy any given read from at least two drives. A good controller can satisfy half the reads from one side of the array and half the reads from the other side, where the first set does not have to wait for the second set before continuing.

To contrast, sequential reads on a RAID 1+0 system are almost always HALF the speed of sequential reads on a RAID 0 system. The hardware read-ahead on the RAID 1+0 system is being wasted, as even if you did leap from one side of the array to the other, each side ends up "skipping" the data served by the other side, making any caching ineffective.

The question I have is not whether RAID 1+0 and RAID 0 show different characteristics. I know they do, based upon my own analysis. My question is whether PostgreSQL disk access patterns for certain loads ever benefit from RAID 1+0, or whether RAID 1+0 is always a bad choice for performance-only (completely ignoring reliability) loads.

Cheers,

mark

--
Mark Mielke <mark@mielke.cc>
On Mon, Mar 3, 2008 at 8:48 AM, Mark Mielke <mark@mark.mielke.cc> wrote:
> Has anybody been able to prove to themselves that RAID 0 really is faster
> than RAID 1+0 for these sorts of loads? My understanding is that RAID 1+0
> *can* reduce latency for reads, but that it relies on random access, whereas
> RAID 0 performs best for sequential scans. Does PostgreSQL ever do
> enough random access to make RAID 1+0 shine?

RAID 1+0 has certain theoretical advantages in parallel access scenarios that straight RAID-0 wouldn't have. I.e. if you used n>2 disks in a mirror and built a RAID-0 out of those types of mirrors, then you could theoretically have n users reading data on the same "drive" (the RAID-1 underneath the RAID-0) at the same time, where RAID-0 would only have the one disk to read from.

The effects of this advantage are dulled by caching, depending on how much of the data set you can cache. With a system that can cache its whole data set in memory (not uncommon for transactional systems), or at least a large percentage of it, the n>2 RAID-1 sets aren't that big of an advantage.

RAID-0 of n drives should behave pretty similarly to RAID-10 with 2n drives for most types of access, i.e. no better or worse for sequential or random access, if the number of drives is equivalent.
Joshua D. Drake wrote:
> This scares me... You lose the WAL, you are a goner. Combine your OS and
> WAL into a RAID 1.

Can someone elaborate on this? From the WAL concept and documentation at http://www.postgresql.org/docs/8.3/interactive/wal-intro.html I'd say the only data that should be lost are the transactions currently in the log but not yet transferred to permanent storage (the database files proper).
Ivan Voras wrote:
> Joshua D. Drake wrote:
>
>> This scares me... You lose the WAL, you are a goner. Combine your OS and
>> WAL into a RAID 1.
>
> Can someone elaborate on this? From the WAL concept and documentation at
> http://www.postgresql.org/docs/8.3/interactive/wal-intro.html I'd say
> the only data that should be lost are the transactions currently in the
> log but not yet transferred to permanent storage (the database files proper).

The log records what changes are made to your data files before the data files are changed (and it gets flushed to disk before the data files are changed).

In the event of power loss right in the middle of the data files being updated for a transaction, when power is restored, how do we know what changes were made to which data files and which changes are incomplete? Without the log files there is no way to be sure your data files are not full of "half-done transactions".

Chances are that 90% of the time everything is fine, but without the log files how do you check that your data files are as they should be? (Or do you expect to restore from backup after any power outage?)

Keeping the log files on a RAID 1 gives you a level of redundancy to get you past hardware failures that happen at the wrong time (as they all do).

--
Shane Ambler
pgSQL (at) Sheeky (dot) Biz

Get Sheeky @ http://Sheeky.Biz