Thread: Planning a new server - help needed
Hello,

I need to install a new server for postgresql 8.3. It will run two databases, a web server and some background programs. We already have a server, but it is becoming slow and we would like to have something faster. It is a cost-sensitive application, and I would like to get your opinion on some questions.

The database itself is an OLTP system. There are many smaller tables, and some bigger ones (biggest table with 1.2 million records, table size 966MB, indexes size 790MB). In the bigger tables only a few records are updated frequently; most of the other records never change. The smaller tables are updated continuously.

Question 1. We are going to use PostgreSQL 3.1 with FreeBSD. The pg docs say that it is better to use FreeBSD because it can alter the I/O priority of processes dynamically. The latest legacy release is 6.3, which is probably more stable. However, folks say that 7.0 has superior performance on the same hardware. Can I use 7.0 on a production server?

Question 2. SCSI or SATA? I plan to buy a RocketRAID 3520 controller with 8 SATA 2 disks. The operating system would be on another disk pair, connected to the motherboard's controller. I wonder if I can get more performance with SCSI for the same amount of money? (I can spend about $1500 on the controller and the disks; that would cover 10 SATA 2 disks and the controller.)

Question 3. FreeBSD 7.0 can use the ZFS file system. I suspect that UFS 2 + soft updates will be better, but I'm not sure. Which is better?

Question 4. How to make the partitions? This is the hardest question. Here is my plan:

- the OS resides on 2 disks, RAID 1
- the databases go on 8 disks, RAID 0+1

However, the transaction log file should be on a separate disk, and maybe I could gain more performance by putting indexes on a separate drive, but I do not want to reduce the number of disks in the RAID 0+1 array. Should I put indexes and the transaction log on the RAID 1 array? (A rough sketch of what I mean is at the end of this message.) Or should I invest a bit more money, add a SATA RAID controller with 16 channels and add more disks? Would it pay the bill? Another alternative is to put the biggest tables on a separate array, so that joins against these tables will be faster.

I know that it is hard to answer without knowing the structure of the databases. :-( I can run tests with different configurations later, but I would like to know your opinion first - what should I try?

Thanks,

Laszlo
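P.S. To clarify what I mean by separating the indexes and the transaction log: something like this, just to illustrate the idea (all paths and names here are made up, and the data directory is the FreeBSD port default):

    # a tablespace on the hypothetical second array for the big indexes
    mkdir /raid1/pg_indexes && chown pgsql /raid1/pg_indexes
    psql -U pgsql -c "CREATE TABLESPACE idxspace LOCATION '/raid1/pg_indexes'"
    psql -U pgsql -d mydb -c "ALTER INDEX big_table_pkey SET TABLESPACE idxspace"
    # and the transaction log moved over (with the server stopped):
    mv /usr/local/pgsql/data/pg_xlog /raid1/pg_xlog
    ln -s /raid1/pg_xlog /usr/local/pgsql/data/pg_xlog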
> I need to install a new server for postgresql 8.3. It will run two
> databases, web server and some background programs. We already have a
> server but it is becoming slow and we would like to have something that
> is faster. It is a cost sensitive application, and I would like to get
> your opinion in some questions.
>
> The database itself is an OLTP system. There are many smaller tables,
> and some bigger ones (biggest table with 1.2 million records, table size
> 966MB, indexes size 790MB). In the bigger tables there are only a few
> records updated frequently, most of the other records are not changed.
> The smaller tables are updated continuously.
>
> Question 1. We are going to use PostgreSQL 3.1 with FreeBSD. The pg docs
> say that it is better to use FreeBSD because it can alter the I/O
> priority of processes dynamically. The latest legacy release is 6.3
> which is probably more stable. However, folks say that 7.0 has superior
> performance on the same hardware. Can I use 7.0 on a production server?

I guess you mean postgresql 8.3.1? :-)

I use the FreeBSD 7 release on an 8-way HP DL360 G5 with a ciss controller. It works out of the box and I haven't had any issues with 7.0 at all.

> Question 2. SCSI or SATA? I plan to buy a RocketRAID 3520 controller
> with 8 SATA 2 disks. The operating system would be on another disk pair,
> connected to the motherboard's controller. I wonder if I can get more
> performance with SCSI, for the same amount of money? (I can spend about
> $1500 on the controller and the disks, that would cover 10 SATA 2 disks
> and the controller.)

SAS would probably be the way to go. I haven't tried the RocketRAID controller; I use the built-in P400i controller on my servers with the ciss driver. I've also heard many positive remarks about Areca.

> Question 3. FreeBSD 7.0 can use the ZFS file system. I suspect that UFS
> 2 + soft updates will be better, but I'm not sure. Which is better?

I'd stick with UFS2 at the moment. There are some issues with ZFS which have probably been ironed out by now, but UFS2 has been deployed for much longer. Performance-wise they are about the same.

> Question 4. How to make the partitions? This is the hardest question.
> Here is my plan:
>
> - the OS resides on 2 disks, RAID 1
> - the databases should go on 8 disks, RAID 0 + 1

If you have enough disks, RAID 6 should perform almost as well as RAID 1+0. I've set up 11 disks in RAID 6 plus one hot spare so I can get more space out of it. "Enough disks" is approximately eight and up.

--
regards
Claus

When lenity and cruelty play for a kingdom,
the gentlest gamester is the soonest winner.

Shakespeare
> Question 1. We are going to use PostgreSQL 3.1 with FreeBSD. The pg
> docs say that it is better to use FreeBSD because it can alter the
> I/O priority of processes dynamically. The latest legacy release is
> 6.3 which is probably more stable. However, folks say that 7.0 has
> superior performance on the same hardware. Can I use 7.0 on a
> production server?

FreeBSD 7.x is pretty stable, and it has the advantage of the new ULE scheduler and other things that can't be MFC'd to the 6.x branch. As a long-time FreeBSD enthusiast, with Cisco sponsoring DTrace, Nokia sponsoring scheduler development and so on, 7.x is in my opinion now the branch to install and start following, for ease of upgrading later. Of course, as always, check that your intended hardware is supported.

ULE, which is pretty much the key to the performance boost in the 7.x branch, isn't yet the default scheduler, but it will be in 7.1 and afterwards. This means you have to roll a custom kernel if you want to use ULE (a sketch of that is at the end of this message).

> Question 3. FreeBSD 7.0 can use the ZFS file system. I suspect that
> UFS 2 + soft updates will be better, but I'm not sure. Which is
> better?

For now I'd choose between UFS+gjournal or plain UFS, although with bigger disks journaling is a boon compared to fsck'ing a plain UFS partition. ZFS isn't yet ready for production I think, but it seems to be getting there. This opinion is based on bug reports and discussions on the stable and current mailing lists, not on personal testing. My experiences with gjournal have been positive so far.

On the drives and controller - I'm not sure whether SCSI/SAS will give any noticeable boost over SATA, but based on personal experience SCSI is still ahead in terms of drive reliability. Whatever technology I chose, for a production server getting a decent battery-backed controller would be the starting point - and of course a controller that does the RAID in hardware.

-Reko
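P.S. The custom kernel step is just the usual routine - roughly like this (untested sketch; the kernel config name is made up):

    cd /usr/src/sys/`uname -m`/conf
    cp GENERIC ULE_KERNEL
    # edit ULE_KERNEL: change "options SCHED_4BSD" to "options SCHED_ULE"
    cd /usr/src
    make buildkernel KERNCONF=ULE_KERNEL
    make installkernel KERNCONF=ULE_KERNEL
    # then reboot into the new kernel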
> I guess you mean postgresql 8.3.1? :-)

Yep. Sorry.

>> Question 3. FreeBSD 7.0 can use the ZFS file system. I suspect that UFS
>> 2 + soft updates will be better, but I'm not sure. Which is better?
>
> I'd stick with ufs2 atm. There are some issues with zfs which probably
> have been ironed out by now but ufs2 has been deployed for a longer
> time. Performance-wise they are about the same.

Thank you. I suspected the same, but it is good to get confirmation.

>> Question 4. How to make the partitions? This is the hardest question.
>> Here is my plan:
>>
>> - the OS resides on 2 disks, RAID 1
>> - the databases should go on 8 disks, RAID 0 + 1
>
> If you have enough disks raid-6 should perform almost as good as raid
> 1+0.

Hmm, I have heard that RAID 1 or RAID 1+0 should be used for databases, never RAID 5. I know nothing about RAID 6. I guess I should accept your suggestion since you have more experience than I have. :-) Obviously, it would be easier to manage a single RAID 6 array.

> I've setup 11 disks in raid-6 plus one hotspare so I can get more
> space out of it. "Enough disks" are approx. eight and up.

The RAID controller that I have selected can only handle 8 disks. I guess I need to find a different one with 16 channels and use more disks. So are you saying that with all disks in a bigger RAID 6 array, I will get the most out of the hardware? In that case, I'll try to get a bit more money from the management and build RAID 6 with 12 disks.

I also feel that I need to use a separate RAID 1 array (I prefer gmirror) for the base system - something along the lines of the sketch below.

Thanks,

Laszlo
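P.S. For the base system mirror I was thinking of roughly this (just a sketch; the disk device names are made up):

    # load the mirror module at boot and label the two OS disks
    echo 'geom_mirror_load="YES"' >> /boot/loader.conf
    gmirror label -v -b round-robin gm0 /dev/ad4 /dev/ad6
    # the mirror then appears as /dev/mirror/gm0 and can be
    # partitioned with bsdlabel and newfs'd (-U for soft updates) as usual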
On Fri, 28 Mar 2008, Laszlo Nagy wrote:

> We already have a server but it is becoming slow and we would like to
> have something that is faster.

What's it slow at? Have you identified the bottlenecks and the current sources of sluggish behavior? That sort of thing is much more informative to look into when redesigning for new hardware than trivia like disk layout. For all we know you're CPU-bound.

> The database itself is an OLTP system. There are many smaller tables, and
> some bigger ones (biggest table with 1.2 million records, table size 966MB,
> indexes size 790MB).

The total database size is the interesting number you left out here. And you didn't mention how much RAM either. That ratio has a lot of impact on how hard you'll push the disks.

> Question 1. We are going to use PostgreSQL 3.1 with FreeBSD. The pg docs say
> that it is better to use FreeBSD because it can alter the I/O priority of
> processes dynamically.

You shouldn't make an OS decision based on a technical detail that small. I won't knock FreeBSD because it's a completely reasonable choice, but there's no credible evidence it's a better performer for the workload you expect than, say, Linux or even Solaris x64. (The benchmarks the FreeBSD team posted as part of their 7.0 fanfare are not representative of real PostgreSQL performance, and are read-only as well.)

All the reasonable OS choices here are close enough to one another (as long as you get FreeBSD 7; earlier versions are really slow) that you should be thinking in terms of reliability, support, and features rather than viewing this from a narrow performance perspective. There's nothing in what you've described that sounds like it needs bleeding-edge performance. For reliability, I first look at how good the disk controller and its matching driver in the OS are, which brings us to:

> Question 2. SCSI or SATA? I plan to buy a RocketRAID 3520 controller with 8
> SATA 2 disks. The operating system would be on another disk pair, connected
> to the motherboard's controller. I wonder if I can get more performance with
> SCSI, for the same amount of money? (I can spend about $1500 on the
> controller and the disks, that would cover 10 SATA 2 disks and the
> controller.)

Highpoint has traditionally made disk controllers that were garbage. The 3520 is from a relatively new series of products from them, and it seems like a reasonable unit. However: do you want to be deploying your system on a new card with zero track record for reliability, from a company that has never done a good job before? I can't think of any reason at all why you should take that risk.

The standard SATA RAID controller choices people suggest here are 3ware, Areca, and LSI Logic. Again, unless you're really pushing what the hardware is capable of, these are all close to each other performance-wise (see http://femme.tweakblogs.net/blog/196/highpoint-rocketraid-3220.html for something that includes the Highpoint card). You should be thinking in terms of known reliability and stability when you select a database controller card, and Highpoint isn't even on the list of vendors to consider yet by those standards.

As for SCSI vs. SATA, I collected up the usual arguments on both sides at http://www.postgresqldocs.org/index.php/SCSI_vs._IDE/SATA_Disks

> However, the transaction log file should be on a separate disk and maybe I
> could gain more performance by putting indexes on a separate drive, but I do
> not want to reduce the number of disks in the RAID 0+1 array.
If you're looking at 8+ disks and have a caching controller with a battery backup, which appears to be your target configuration, there's little reason to expect a big performance improvement from splitting the transaction log out onto a separate disk. As you note, doing that reduces the number of disks the database is spread across, which may cost you more in performance than separate transaction logs gain.

It is worth considering creating a separate filesystem on the big array to hold the xlog data though, because that gives you more flexibility in terms of mount parameters there. For example, you can always turn off atime updates on the transaction log filesystem, and in many cases the filesystem journal updates can be optimized more usefully (the xlog doesn't require them). A rough sketch of that setup follows below.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
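P.S. Concretely, that might look something like this (device, mount point, and data directory are made-up examples):

    # /etc/fstab entry for a filesystem dedicated to the xlog data:
    #   /dev/da0s1e   /pg_xlog_fs   ufs   rw,noatime   2   2
    mount /pg_xlog_fs
    # then, with the server stopped, move pg_xlog there and symlink it back:
    mv /usr/local/pgsql/data/pg_xlog /pg_xlog_fs/pg_xlog
    ln -s /pg_xlog_fs/pg_xlog /usr/local/pgsql/data/pg_xlog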
Laszlo Nagy wrote:

> Question 4. How to make the partitions? This is the hardest question.
> Here is my plan:
>
> - the OS resides on 2 disks, RAID 1
> - the databases should go on 8 disks, RAID 0 + 1

Make sure you understand the difference between RAID 1+0 and RAID 0+1. I suspect you'll end up going with 1+0 instead.

-Dan
Laszlo Nagy schrieb:

[...]

> The RAID controller that I have selected can only handle 8 disks. I
> guess I need to find a different one with 16 channels and use more
> disks. So are you saying that with all disks in a bigger RAID 6 array, I
> will get the most out of the hardware? In that case, I'll try to get a
> bit more money from the management and build RAID 6 with 12 disks.

Here are some good SATA controllers for 4/8/12/16 disks:
http://www.tekram.com/product2/product_detail.asp?pid=51

Stefan
Greg Smith wrote:

> As for SCSI vs. SATA, I collected up the usual arguments on both sides
> at http://www.postgresqldocs.org/index.php/SCSI_vs._IDE/SATA_Disks

Why do you claim that 'More platters also means slower seeks and generally slower performance.'? On the face of it, it should mean that the number of track step operations is reduced, even if the drive doesn't buffer a slice of tracks across all platters (which would help if it did).

I'm not entirely sure why the extra platters should really count as more moving parts, since I think the platter assembly and head assembly are each a single part in effect, albeit more massive with more platters. I'm not sure how much extra bearing friction that will mean, but it is reasonable that some extra energy is going to be needed.

Recent figures I've seen suggest that the increased storage density per platter, plus the extra number of platters, means that the streaming speed of good 7200rpm SATA drives is very close to that of good 15000rpm SAS drives - and you can choose which part of the disk to use to reduce seek time and maximise transfer rate with the oversized drives. You can get about 100MB/s out of both technologies, streaming.

It may be worth considering an alternative approach. I suspect that a good RAID1 or RAID1+0 is worthwhile for WAL, but you might consider a RAID1 of a pair of SSDs for data. They will use a lot of your budget, but the seek time is negligible, so the effective usable performance is higher than you get with spinning disks - you might trade a fancy controller with battery-backed cache for straight SSD. I haven't done this, so YMMV. But the prices are getting interesting for OLTP, where most disks are massively oversized. The latest Samsung and SanDisk are expensive in the UK, but the Transcend 16GB TS16GSSD25S-S SATA is about $300 equivalent - it can do 'only' 'up to' 28MB/s write, and you wouldn't want to put WAL on one, but sustaining 15-20MB/s through random access to a real disk isn't trivial. If average access is 10ms, and you write 100MB/s streaming, then you have to ask yourself whether you're going to do 80 or more seeks a second.

James
> Why do you claim that 'More platters also means slower seeks
> and generally slower performance.'?

More platters -> more heads -> heavier head assembly -> slower seek time.

But: more platters -> higher density -> less seek distance (in mm of head movement) -> faster seek time.

As usual there is no clear-cut case; a real-life test would tell more interesting things.

> I'm not entirely sure why the extra platters should really count
> as more moving parts since I think the platter assembly and
> head assembly are both single parts in effect, albeit they will
> be more massive with more platters. I'm not sure how much
> extra bearing friction that will mean, but it is reasonable that
> some extra energy is going to be needed.

Since the bearings are only on one side of the axle (not both), a heavier platter assembly would put more stress on the bearing if the disk is subject to vibration (like all those RAID disks seeking together), which might shorten its life. Everything with conditionals, of course ;)

I remember reading a paper a year or so ago on vibration from many RAID disks: vibration from other disks seeking at the exact same time and in the same direction would cause resonances in the housing chassis and disturb the heads, slightly worsening seek times and reliability. On the other hand, the 7 disks raided in my home storage server never complained, even though the $30 computer case vibrates all over the place when they seek. Perhaps if they were subject to 24/7 heavy torture, a heavier and better-damped chassis would be a good investment.

> It may be worth considering an alternative approach. I suspect
> that a good RAID1 or RAID1+0 is worthwhile for WAL, but

Actually, now that 8.3 can sync to disk every second instead of at every commit (see the configuration sketch at the end of this message), I wonder, did someone do some enlightening benchmarks?

I remember benchmarking 8.2 on a forum-style load, and using a separate disk for WAL (SATA, write cache off) made a huge difference (as expected) versus one disk for everything (SATA, also with write cache off). Postgres beat the crap out of MyISAM, lol. Postgres seems to be one of the rare apps which gets faster and meaner with every release, instead of slower and more bloated like everyone else.

Also, there is a thing called write barriers, which supposedly could be used to implement fsync-like behaviour without the penalty, if the disk, the OS, the controller, and the filesystem all support it (that's a lot of ifs)...

> I haven't done this, so YMMV. But the prices are getting
> interesting for OLTP where most disks are massively
> oversized. The latest Samsung and SanDisk are expensive
> in the UK but the Transcend 16GB TS16GSSD25S-S SATA
> is about $300 equiv - it can do 'only' 'up to' 28MB/s write and

Gigabyte should revamp their i-RAM to use ECC RAM of a larger capacity... and longer-lasting battery backup...

I wonder how many write cycles those flash drives can take before reliability becomes a problem...
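The configuration I mean is roughly this - a sketch only, with the data directory path assumed to be the FreeBSD port default:

    # append to postgresql.conf (8.3) and reload; commits then return before
    # the WAL flush, and the background WAL writer flushes periodically:
    cat >> /usr/local/pgsql/data/postgresql.conf <<'EOF'
    synchronous_commit = off
    wal_writer_delay = 200ms
    EOF
    pg_ctl reload -D /usr/local/pgsql/data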
PFC wrote:

>> Why do you claim that 'More platters also means slower seeks
>> and generally slower performance.'?
>
> More platters -> more heads -> heavier head assembly -> slower
> seek time

Not sure I've seen a lot of evidence of that in drive specifications!

> Gigabyte should revamp their i-RAM to use ECC RAM of a larger
> capacity... and longer lasting battery backup...

You would think a decent capacitor or rechargeable button battery would be enough to dump it to flash memory. No problem with flash wear then.

> I wonder, how many write cycles those Flash drives can take before
> reliability becomes a problem...

Hard to get data, isn't it? I believe it's hundreds of thousands to millions now. Each record in most OLTP tables is rewritten a few times, unless it's stuff that can go into temp tables etc., which should be elsewhere. Index pages clearly get rewritten often. I suspect a mix of storage technologies will be handy for some time yet - WAL on disk, and temp tables on disk with no synchronous fsync requirement.

I think life is about to get interesting in DBMS storage. All good for us users.

James
Laszlo Nagy wrote:

> Question 1. We are going to use PostgreSQL 3.1 with FreeBSD. The pg docs
> say that it is better to use FreeBSD because it can alter the I/O
> priority of processes dynamically.

Where does it say that?
On Sat, 29 Mar 2008, PFC wrote:

>> Why do you claim that 'More platters also means slower seeks
>> and generally slower performance.'?
>
> More platters -> more heads -> heavier head assembly -> slower seek
> time

I recall seeing many designs with more platters that had slower seek times in benchmarks, and I always presumed this was the reason. That's the basis for that comment. I'll disclaim that section a bit.

> Actually, now that 8.3 can sync to disk every second instead of at every
> commit, I wonder, did someone do some enlightening benchmarks ?

I've seen some really heavy workloads where using async commit helped group commits into larger batches usefully, but I personally haven't found it to be all that useful if you've already got a caching controller to accelerate writes, which is the kind of hardware most people have. It's a great solution for situations without a usable write cache though.

> Also, there is a thing called write barriers, which supposedly could be
> used to implement fsync-like behaviour without the penalty, if the disk,
> the OS, the controller, and the filesystem support it (that's a lot of
> ifs)...

The database can't use fsync-like behavior for the things it calls fsync for; it needs the full semantics. You're either doing the full operation, or you're cheating and it doesn't do what it's supposed to. Write barriers aren't any improvement over a good direct I/O sync write setup for the WAL. There may be some limited value to that approach for the database writes at checkpoint time, but again there's a real fsync coming at the end of that, and it's not satisfied until everything is on disk (or in a good disk controller cache).

> Gigabyte should revamp their i-RAM to use ECC RAM of a larger
> capacity... and longer lasting battery backup...

I saw a rumor somewhere that they were close to having a new version of that using DDR2 ready, which would make it pretty easy to have 8GB on there.

> I wonder, how many write cycles those Flash drives can take before
> reliability becomes a problem...

The newer SSD drives with good write leveling should last at least as long as you'd expect a mechanical drive to, even in a WAL application. Lesser grades of flash used as disk could be a problem though.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Greg Smith wrote:

>> Gigabyte should revamp their i-RAM to use ECC RAM of a larger
>> capacity... and longer lasting battery backup...
>
> I saw a rumor somewhere that they were close to having a new version of
> that using DDR2 ready, which would make it pretty easy to have 8GB on
> there.

I'm hoping - probably in vain - that they'll also include a CF/SD slot or some onboard flash so it can dump its contents to flash using the battery backup.

For anybody wondering what the devices in question are, see:

http://www.anandtech.com/storage/showdoc.aspx?i=2480
http://www.gigabyte.com.tw/Products/Storage/Products_Overview.aspx?ProductID=2180

--
Craig Ringer
PFC wrote:

>> Why do you claim that 'More platters also means slower seeks
>> and generally slower performance.'?
>
> More platters -> more heads -> heavier head assembly -> slower seek
> time
> But..
> More platters -> higher density -> less seek distance (in mm of head
> movement) -> faster seek time

More platters also means more tracks under the read heads at a time, so generally *better* performance. All other things (like rotational speed) being equal, of course.

--
Guy Rouillier