Thread: 10K vs 15K rpm for analytics
Does anyone have any experience doing analytics with Postgres? In particular, whether 10K rpm drives are good enough vs using 15K rpm, over 24 drives. The price difference is $3,000. We rarely ever have more than 2 or 3 connections to the machine. So far, from what I have seen, throughput is more important than TPS for the queries we do. Usually we end up doing sequential scans to do summaries/aggregates.
Francisco Reyes wrote: > Anyone has any experience doing analytics with postgres. In particular > if 10K rpm drives are good enough vs using 15K rpm, over 24 drives. > Price difference is $3,000. > > Rarely ever have more than 2 or 3 connections to the machine. > > So far from what I have seen throughput is more important than TPS for > the queries we do. Usually we end up doing sequential scans to do > summaries/aggregates. > With 24 drives it'll probably be the controller that is the limiting factor of bandwidth. Our HP SAN controller with 28 15K drives delivers 170MB/s at maximum with raid 0 and about 155MB/s with raid 1+0. So I'd go for the 10K drives and put the saved money towards the controller (or maybe more than one controller). regards, Yeb Havinga
Yeb Havinga wrote: > With 24 drives it'll probably be the controller that is the limiting > factor of bandwidth. Our HP SAN controller with 28 15K drives delivers > 170MB/s at maximum with raid 0 and about 155MB/s with raid 1+0. You should be able to clear 1GB/s on sequential reads with 28 15K drives in a RAID10, given proper read-ahead adjustment. I get over 200MB/s out of the 3-disk RAID0 on my home server without even trying hard. Can you share what HP SAN controller you're using? -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
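A rough sanity check on that 1GB/s figure - assuming ~80MB/s sustained sequential per 15K spindle and counting only one disk from each mirror pair (both of which are assumptions, not measurements):

# 28 drives in RAID10 = 14 striped mirror pairs
# assume ~80 MB/s sustained sequential per 15K spindle
echo "14 * 80" | bc    # => 1120 MB/s aggregate, before any controller/bus limit

Which is why a 155-170MB/s ceiling points at the controller or its interconnect rather than the disks.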
Seconded .... these days even a single 5400rpm SATA drive can muster almost 100MB/sec on a sequential read.
The benefit of 15K rpm drives is seen when you have a lot of small, random accesses from a working set that is too big to cache .... the extra rotational speed translates to an average reduction of about 1ms on a random seek and read from the media.
Cheers
Dave
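That ~1ms figure falls straight out of average rotational latency, which is half a revolution at the platter speed; a quick back-of-the-envelope check:

# average rotational latency = half a revolution = (60 / RPM) / 2 seconds
awk 'BEGIN { printf "15K: %.1f ms  10K: %.1f ms  difference: %.1f ms\n",
             60/15000/2*1000, 60/10000/2*1000, (60/10000/2 - 60/15000/2)*1000 }'
# prints: 15K: 2.0 ms  10K: 3.0 ms  difference: 1.0 ms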
On Tue, Mar 2, 2010 at 1:42 PM, Francisco Reyes <lists@stringsutils.com> wrote: > Anyone has any experience doing analytics with postgres. In particular if > 10K rpm drives are good enough vs using 15K rpm, over 24 drives. Price > difference is $3,000. > > Rarely ever have more than 2 or 3 connections to the machine. > > So far from what I have seen throughput is more important than TPS for the > queries we do. Usually we end up doing sequential scans to do > summaries/aggregates. Then the real thing to compare is the throughput of the drives, not the rpm. Using older 15k drives would actually be slower than some more modern 10k or even 7.2k drives. Another issue would be whether or not to short stroke the drives. You may find that short-stroked 10k drives provide the same throughput for much less money. The 10k rpm 2.5" Ultrastar C10K300 drives have throughput numbers of 143 to 88 MB/sec, which is quite respectable, and you can put 24 into a 2U Supermicro case and save rack space too. The 15k 2.5" Ultrastar C15K147 drives are 159 to 116, only a bit faster. And if short stroked, the 10k drives should be competitive.
On Tue, 2 Mar 2010, Francisco Reyes wrote: > Anyone has any experience doing analytics with postgres. In particular if 10K > rpm drives are good enough vs using 15K rpm, over 24 drives. Price difference > is $3,000. > > Rarely ever have more than 2 or 3 connections to the machine. > > So far from what I have seen throughput is more important than TPS for the > queries we do. Usually we end up doing sequential scans to do > summaries/aggregates. With sequential scans you may be better off with the large SATA drives as they fit more data per track and so give great sequential read rates. If you end up doing a lot of seeking to retrieve the data, you may find that you get a benefit from the faster drives. David Lang
On Tue, Mar 2, 2010 at 2:14 PM, <david@lang.hm> wrote: > On Tue, 2 Mar 2010, Francisco Reyes wrote: > >> Anyone has any experience doing analytics with postgres. In particular if >> 10K rpm drives are good enough vs using 15K rpm, over 24 drives. Price >> difference is $3,000. >> >> Rarely ever have more than 2 or 3 connections to the machine. >> >> So far from what I have seen throughput is more important than TPS for the >> queries we do. Usually we end up doing sequential scans to do >> summaries/aggregates. > > With sequential scans you may be better off with the large SATA drives as > they fit more data per track and so give great sequential read rates. True, I just looked at the Hitachi 7200 RPM 2TB Ultrastar and it lists an average throughput of 134 Megabytes/second, which is quite good. While seek time is about double that of a 15krpm drive, short stroking can lower that quite a bit. Latency is still 2x as much, but there's not much to do about that.
Yeb Havinga writes: > With 24 drives it'll probably be the controller that is the limiting > factor of bandwidth. Going with a 3Ware SAS controller. > Our HP SAN controller with 28 15K drives delivers > 170MB/s at maximum with raid 0 and about 155MB/s with raid 1+0. Already have a similar machine in house. With RAID 1+0 Bonnie++ reports around 400MB/sec sequential read. > go for the 10K drives and put the saved money towards the controller (or > maybe more than one controller). Have some external enclosures with 16 15K rpm drives. They are older 15K rpm drives, but they should be good enough. Since the 15K rpm drives usually have better transactions per second, I will put WAL and indexes in the external enclosure.
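For anyone wanting to reproduce that kind of figure, a typical bonnie++ run for sequential throughput looks roughly like this; the mount point and file size are placeholders, and -s should be at least twice the machine's RAM so the page cache cannot hide the disks:

# -d: directory on the array under test, -s: total file size,
# -n 0: skip the small-file creation tests, -u: run as this user
bonnie++ -d /mnt/pgdata -s 64g -n 0 -u postgres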
Scott Marlowe writes: > Then the real thing to compare is the speed of the drives for > throughput not rpm. In a machine similar to what I plan to buy, already in house, 24 x 10K rpm gives me about 400MB/sec while 16 x 15K rpm (2 to 3 year old drives) gives me about 500MB/sec.
On Tue, Mar 2, 2010 at 1:51 PM, Yeb Havinga <yebhavinga@gmail.com> wrote: > With 24 drives it'll probably be the controller that is the limiting factor > of bandwidth. Our HP SAN controller with 28 15K drives delivers 170MB/s at > maximum with raid 0 and about 155MB/s with raid 1+0. So I'd go for the 10K > drives and put the saved money towards the controller (or maybe more than > one controller). That's horrifically bad numbers for that many drives. I can get those numbers for write performance on a RAID-6 on our office server. I wonder what's making your SAN setup so slow?
david@lang.hm writes: > With sequential scans you may be better off with the large SATA drives as > they fit more data per track and so give great sequential read rates. I lean more towards SAS because of writes. One common thing we do is create temp tables, so a typical pass may be:
* sequential scan
* create temp table with subset
* do queries against the subset + join to smaller tables.
I figure the concurrent read/write would be faster on SAS than on SATA. I am trying to move to having an external enclosure (we have several not in use or about to become free) so I could separate the read and the write of the temp tables. Lastly, it is likely we are going to do horizontal partitioning (i.e. master all data in one machine, replicate, and then change our code to read parts of the data from different machines), and I think at that time the better drives will do better as we have more concurrent queries.
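To make that pattern concrete, a minimal sketch of such a pass; the database, table and column names are made up purely for illustration:

psql -d analytics <<'SQL'
-- sequential scan over the big table into a temp subset
CREATE TEMP TABLE recent_orders AS
    SELECT customer_id, amount
    FROM   orders
    WHERE  order_date >= '2010-01-01';

-- aggregate the subset joined to a smaller table
SELECT c.region, sum(r.amount) AS total
FROM   recent_orders r
JOIN   customers c USING (customer_id)
GROUP  BY c.region;
SQL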
Greg Smith writes: > in a RAID10, given proper read-ahead adjustment. I get over 200MB/s out > of the 3-disk RAID0 Any links/suggested reads on read-ahead adjustment? It will probably be OS dependent, but any info would be useful.
On Tue, Mar 2, 2010 at 2:30 PM, Francisco Reyes <lists@stringsutils.com> wrote: > Scott Marlowe writes: > >> Then the real thing to compare is the speed of the drives for >> throughput not rpm. > > In a machine, simmilar to what I plan to buy, already in house 24 x 10K rpm > gives me about 400MB/sec while 16 x 15K rpm (2 to 3 year old drives) gives > me about 500MB/sec Have you tried short stroking the drives to see how they compare then? Or is the reduced primary storage not a valid path here? While 16x15k older drives doing 500Meg seems only a little slow, the 24x10k drives getting only 400MB/s seems way slow. I'd expect a RAID-10 of those to read at somewhere in or just past the gig per second range with a fast pcie (x8 or x16 or so) controller. You may find that a faster controller with only 8 or so fast and large SATA drives equals the 24 10k drives you're looking at now. I can write at about 300 to 350 Megs a second on a slower Areca 12xx series controller and 8 2TB Western Digital Green drives, which aren't even made for speed.
Greg Smith writes: > in a RAID10, given proper read-ahead adjustment. I get over 200MB/s out > of the 3-disk RAID0 on my home server without even trying hard. Can you Any links/suggested reading on "read-ahead adjustment"? I understand this may be OS specific, but any info would be helpful. Currently have 24 x 10K rpm drives and only getting about 400MB/sec.
Scott Marlowe writes: > Have you tried short stroking the drives to see how they compare then? > Or is the reduced primary storage not a valid path here? No, have not tried it. By the time I got the machine we needed it in production, so I could not test anything. When the 2 new machines come I should hopefully have time to try a few strategies, including RAID 0, to see what is the best setup for our needs. > RAID-10 of those to read at somewhere in or just past the gig per > second range with a fast pcie (x8 or x16 or so) controller. Thanks for the info. Contacted the vendor to see what PCIe speed the controller is connected to, especially since we are considering getting 2 more machines from them. > drives equals the 24 10k drives you're looking at now. I can write at > about 300 to 350 Megs a second on a slower Areca 12xx series > controller and 8 2TB Western Digital Green drives, which aren't even How about read speed?
On Tue, 2 Mar 2010, Scott Marlowe wrote: > On Tue, Mar 2, 2010 at 2:30 PM, Francisco Reyes <lists@stringsutils.com> wrote: >> Scott Marlowe writes: >> >>> Then the real thing to compare is the speed of the drives for >>> throughput not rpm. >> >> In a machine, simmilar to what I plan to buy, already in house 24 x 10K rpm >> gives me about 400MB/sec while 16 x 15K rpm (2 to 3 year old drives) gives >> me about 500MB/sec > > Have you tried short stroking the drives to see how they compare then? > Or is the reduced primary storage not a valid path here? > > While 16x15k older drives doing 500Meg seems only a little slow, the > 24x10k drives getting only 400MB/s seems way slow. I'd expect a > RAID-10 of those to read at somewhere in or just past the gig per > second range with a fast pcie (x8 or x16 or so) controller. You may > find that a faster controller with only 8 or so fast and large SATA > drives equals the 24 10k drives you're looking at now. I can write at > about 300 to 350 Megs a second on a slower Areca 12xx series > controller and 8 2TB Western Digital Green drives, which aren't even > made for speed. What filesystem is being used? There is a thread on the linux-kernel mailing list right now showing that ext4 seems to top out at ~360MB/sec while XFS is able to go to 500MB/sec+. On single disks the disk performance limits you, but on arrays where the disk performance is higher there may be other limits you are running into. David Lang
david@lang.hm writes: > what filesystem is being used. There is a thread on the linux-kernel > mailing list right now showing that ext4 seems to top out at ~360MB/sec > while XFS is able to go to 500MB/sec+ EXT3 on CentOS 5.4. If I have time with the new machines, I plan to try FreeBSD+ZFS. ZFS supposedly makes good use of memory, and the new machines will have 72GB of RAM.
Scott Marlowe writes: > While 16x15k older drives doing 500Meg seems only a little slow, the > 24x10k drives getting only 400MB/s seems way slow. I'd expect a > RAID-10 of those to read at somewhere in or just past the gig per Talked to the vendor. The likely issue is the card. They used a single card with an expander, and the card also drives an external enclosure through an external port. They have some ideas which they are going to test and report back, since we are in the process of getting 2 more machines from them. They believe that by splitting the internal drives onto one controller and the external onto a second controller, performance should go up. They will report back some numbers. Will post them to the list when I get the info.
Francisco Reyes wrote: > Going with a 3Ware SAS controller. > Already have simmilar machine in house. > With RAID 1+0 Bonne++ reports around 400MB/sec sequential read. Increase read-ahead and I'd bet you can add 50% to that easy--one area the 3Ware controllers need serious help, as they admit: http://www.3ware.com/kb/article.aspx?id=11050 Just make sure you ignore their dirty_ratio comments--those are completely the opposite of what you want on a database app. Still seems on the low side though. Short stroke and you could probably chop worst-case speeds in half too, on top of that. Note that 3Ware's controllers have seriously limited reporting on drive data when using SAS drives because they won't talk SMART to them: http://www.3ware.com/KB/Article.aspx?id=15383 I consider them still a useful vendor for SATA controllers, but would never buy a SAS solution from them again until this is resolved. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
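To spell out the read-ahead piece on Linux, since it was asked about earlier in the thread: it is a per-block-device setting changed with blockdev, and a quick dd run shows whether a change helped. The device name and the 16MB value below are only examples to start from, not recommendations:

# current readahead, in 512-byte sectors
blockdev --getra /dev/sda
# raise it to 16MB (32768 sectors), then re-measure
blockdev --setra 32768 /dev/sda
# crude sequential read check that bypasses the page cache
dd if=/dev/sda of=/dev/null bs=1M count=4096 iflag=direct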
Francisco Reyes wrote: > Anyone has any experience doing analytics with postgres. In particular > if 10K rpm drives are good enough vs using 15K rpm, over 24 drives. > Price difference is $3,000. > Rarely ever have more than 2 or 3 connections to the machine. > So far from what I have seen throughput is more important than TPS for > the queries we do. Usually we end up doing sequential scans to do > summaries/aggregates. For arrays this size, the first priority is to sort out what controller you're going to get, whether it can keep up with the array size, and how you're going to support/monitor it. Once you've got all that nailed down, if you still have the option of 10K vs. 15K the trade-offs are pretty simple:
- 10K drives are cheaper
- 15K drives will commit and seek faster.
If you have a battery-backed controller, commit speed doesn't matter very much. If you only have 2 or 3 connections, I can't imagine that the improved seek times of the 15K drives will be a major driving factor. As already suggested, 10K drives tend to be larger and can be extremely fast on sequential workloads, particularly if you short-stroke them and stick to putting the important stuff on the fast part of the disk. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
On Tue, Mar 2, 2010 at 4:50 PM, Greg Smith <greg@2ndquadrant.com> wrote: > If you only have 2 or 3 connections, I can't imagine that the improved seek > times of the 15K drives will be a major driving factor. As already > suggested, 10K drives tend to be larger and can be extremely fast on > sequential workloads, particularly if you short-stroke them and stick to > putting the important stuff on the fast part of the disk. The thing I like most about short stroking 7200RPM 1 to 2 TB drives is that you get great performance on one hand, and a ton of left over storage for backups and stuff. And honestly, you can't have enough extra storage laying about when working on databases.
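Mechanically, short stroking is usually nothing more than partitioning each drive so the array only uses the fast outer zone; a sketch with GNU parted, where the device name and the 30% split are just examples:

# keep only the first (outermost, fastest) 30% of the drive for the database array
parted -s /dev/sdb mklabel gpt
parted -s /dev/sdb mkpart primary 0% 30%
# the remaining 70% can be left idle, or partitioned separately for backups
parted -s /dev/sdb mkpart primary 30% 100%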
Scott Marlowe wrote: > True, I just looked at the Hitachi 7200 RPM 2TB Ultrastar and it lists > and average throughput of 134 Megabytes/second which is quite good. > Yeah, but have you tracked the reliability of any of the 2TB drives out there right now? They're terrible. I wouldn't deploy anything more than a 1TB drive right now in a server, everything with a higher capacity is still on the "too new to be stable yet" side of the fence to me. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
On Tue, Mar 2, 2010 at 4:57 PM, Greg Smith <greg@2ndquadrant.com> wrote: > Scott Marlowe wrote: >> >> True, I just looked at the Hitachi 7200 RPM 2TB Ultrastar and it lists >> and average throughput of 134 Megabytes/second which is quite good. >> > > Yeah, but have you tracked the reliability of any of the 2TB drives out > there right now? They're terrible. I wouldn't deploy anything more than a > 1TB drive right now in a server, everything with a higher capacity is still > on the "too new to be stable yet" side of the fence to me. We've had REAL good luck with the WD green and black drives. Out of about 35 or so drives we've had two failures in the last year, one of each black and green. The Seagate SATA drives have been horrific for us, with a 30% failure rate in the last 8 or so months. We only have something like 10 of the Seagates, so the sample's not as big as the WDs. Note that we only use the supposed "enterprise" class drives from each manufacturer. We just got a shipment of 8 1.5TB Seagates so I'll keep you informed of the failure rate of those drives. Wouldn't be surprised to see 1 or 2 die in the first few months tho.
Scott Marlowe wrote: > We've had REAL good luck with the WD green and black drives. Out of > about 35 or so drives we've had two failures in the last year, one of > each black and green. I've been happy with almost all the WD Blue drives around here (have about a dozen in service for around two years), with the sole exception that the one drive I did have go bad has turned into a terrible liar. Refuses to either acknowledge it's broken and produce an RMA code, or to work. At least the Seagate and Hitachi drives are honest about being borked once they've started producing heavy SMART errors. I have enough redundancy to deal with failure, but can't tolerate dishonesty one bit. The Blue drives are of course regular crappy consumer models though, so this is not necessarily indicative of how the Green/Black drives work. > The Seagate SATA drives have been horrific for > us, with a 30% failure rate in the last 8 or so months. We only have > something like 10 of the Seagates, so the sample's not as big as the > WDs. Note that we only use the supposed "enterprise" class drives > from each manufacturer. > > We just got a shipment of 8 1.5TB Seagates so I'll keep you informed > of the failure rate of those drives. Wouldn't be surprised to see 1 > or 2 die in the first few months tho. > Good luck with those--the consumer version of Seagate's 1.5TB drives have been perhaps the worst single drive model on the market over the last year. Something got seriously misplaced when they switched their manufacturing facility from Singapore to Thailand a few years ago, and now that the old plant is gone: http://www.theregister.co.uk/2009/08/04/seagate_closing_singapore_plant/ I don't expect them to ever recover from that. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
On Tue, Mar 2, 2010 at 6:03 PM, Greg Smith <greg@2ndquadrant.com> wrote: > Scott Marlowe wrote: >> >> We've had REAL good luck with the WD green and black drives. Out of >> about 35 or so drives we've had two failures in the last year, one of >> each black and green. > > I've been happy with almost all the WD Blue drives around here (have about a > dozen in service for around two years), with the sole exception that the one > drive I did have go bad has turned into a terrible liar. Refuses to either > acknowledge it's broken and produce an RMA code, or to work. At least the > Seagate and Hitachi drives are honest about being borked when once they've > started producing heavy SMART errors. I have enough redundancy to deal with > failure, but can't tolerate dishonesty one bit. Time to do the ESD shuffle I think. >> The Seagate SATA drives have been horrific for >> us, with a 30% failure rate in the last 8 or so months. We only have >> something like 10 of the Seagates, so the sample's not as big as the >> WDs. Note that we only use the supposed "enterprise" class drives >> from each manufacturer. >> >> We just got a shipment of 8 1.5TB Seagates so I'll keep you informed >> of the failure rate of those drives. Wouldn't be surprised to see 1 >> or 2 die in the first few months tho. >> > > Good luck with those--the consumer version of Seagate's 1.5TB drives have > been perhaps the worst single drive model on the market over the last year. > Something got seriously misplaced when they switched their manufacturing > facility from Singapore to Thailand a few years ago, and now that the old > plant is gone: > http://www.theregister.co.uk/2009/08/04/seagate_closing_singapore_plant/ I > don't expect them to ever recover from that. Yeah, I've got someone upstream in my chain of command who's a huge fan of seacrates, so that's how we got those 1.5TB drives. Our 15k5 seagates have been great, with 2 failures in 32 drives in 1.5 years of very heavy use. All our seagate SATAs, whether 500G or 2TB have been the problem children. I've pretty much given up on Seagate SATA drives. The new seagates we got are the consumer 7200.11 drives, but at least they have the latest firmware and all.
Scott Marlowe wrote: > Time to do the ESD shuffle I think. > Nah, I keep the crazy drive around as an interesting test case. Fun to see what happens when I connect it to a RAID card; very informative about how thorough the card's investigation of the drive is. > Our 15k5 > seagates have been great, with 2 failures in 32 drives in 1.5 years of > very heavy use. All our seagate SATAs, whether 500G or 2TB have been > the problem children. I've pretty much given up on Seagate SATA > drives. The new seagates we got are the consumer 7200.11 drives, but > at least they have the latest firmware and all. > Well, what I was pointing out was that all the 15K drives used to come out of this plant in Singapore, which is also where their good consumer drives used to come from too during the 2003-2007ish period where all their products were excellent. Then they moved the consumer production to this new location in Thailand, and all of the drives from there have been total junk. And as of August they closed the original plant, which had still been making the enterprise drives, altogether. So now you can expect the 15K drives to come from the same known source of garbage drives as everything else they've made recently, rather than the old, reliable plant. I recall the Singapore plant sucked for a while when it got started in the mid 90's too, so maybe this Thailand one will eventually get their issues sorted out. It seems like you can't just move a hard drive plant somewhere and have the new one work without a couple of years of practice first, I keep seeing this pattern repeat. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
Greg Smith writes: > http://www.3ware.com/KB/Article.aspx?id=15383 I consider them still a > useful vendor for SATA controllers, but would never buy a SAS solution > from them again until this is resolved. Who are you using for SAS? One thing I like about 3ware is their management utility works under both FreeBSD and Linux well.
On Tue, Mar 2, 2010 at 7:44 PM, Francisco Reyes <lists@stringsutils.com> wrote: > Greg Smith writes: > >> http://www.3ware.com/KB/Article.aspx?id=15383 I consider them still a >> useful vendor for SATA controllers, but would never buy a SAS solution from >> them again until this is resolved. > > > Who are you using for SAS? > One thing I like about 3ware is their management utility works under both > FreeBSD and Linux well. The non-open source nature of the command line tool for Areca makes me avoid their older cards. The 1680 has its own ethernet with a web interface with SNMP that is independent of the OS. This means that with something like a hung / panicked kernel, you can still check out the RAID array and check rebuild status and other stuff. We get a hang about every 180 to 460 days with them where the raid driver in linux hangs with the array going off-line. It's still reachable through the web interface on its own NIC. Newer kernels seem to trigger the failure far more often, once every 1 to 2 weeks, two months on the outside. The driver guy from Areca is supposed to be working on the driver for linux, so we'll see if it gets fixed. It's pretty stable on a RHEL 5.2 kernel; on anything after that I've tested, it'll hang every week or two. So I run RHEL 5.latest with a 5.2 kernel and it works pretty well. Note that this is a pretty heavily used machine with enough access going through 12 drives to use about 30% IOwait, 50% user, 10% sys at peak midday. Load factor 7 to 15. And they run really ultra-smooth between these hangs. They come back up uncorrupted, every time, every plug pull test etc. Other than the occasional rare hang, they're perfect.
Greg Smith wrote: > Yeb Havinga wrote: >> With 24 drives it'll probably be the controller that is the limiting >> factor of bandwidth. Our HP SAN controller with 28 15K drives >> delivers 170MB/s at maximum with raid 0 and about 155MB/s with raid 1+0. > > You should be able to clear 1GB/s on sequential reads with 28 15K > drives in a RAID10, given proper read-ahead adjustment. I get over > 200MB/s out of the 3-disk RAID0 on my home server without even trying > hard. Can you share what HP SAN controller you're using? Yeah, I should have mentioned a bit more, to allow for a better picture of the apples and pears. The controller is the built-in controller of the HP MSA1000 SAN - 14 disks, plus an extra 14 disks from an MSA30. It is connected through a 2Gbit/s fibre channel adapter, which should give up to roughly 250MB/s of bandwidth, maybe a bit less due to frame overhead and the GiB/GB difference. The controller has 256MB cache. It is three years old, however HP still sells it. I performed a few dozen tests with Oracle's free and standalone Orion tool (http://www.oracle.com/technology/software/tech/orion/index.html) with different raid and controller settings, where I varied:
- controller read/write cache ratio
- logical unit layout (like one big raidX, 3 LUNs with raid10 (giving a stripe width of 4 disks and 4 hot spares), 7 LUNs with raid10)
- stripe size set to maximum
- load type (random or sequential large IO)
- Linux IO scheduler (deadline / cfq etc)
- fibre channel adapter queue depth
- ratio between reads and writes issued by Orion (our production application has about 25% writes)
- I also did the short stroking that is talked about further in this thread, by only using one partition of about 30% size on each disk
- etc.
My primary goal was large IOPS for our typical load: mostly OLTP. The Orion tool tests in a matrix, with on one axis the number of concurrent small IOs and on the other axis the number of concurrent large IOs. Its output numbers are also in a matrix, with MBps, IOPS and latency. I put several of these numbers into Matlab to produce 3D pictures, and that showed some interesting stuff - it's probably bad netiquette here to post one of those pictures. One of the striking things was that I could see something that looked like a mountain where the top was neatly cut off - my guess: controller maximum reached. Below is the output data of a recent test, where a 4Gbit/s FC adapter was connected. From these numbers I conclude that in our setup, the controller is maxed out at 155MB/s for raid 1+0 *with this setup*. In a test run I constructed to try and see what the maximum MBps of the controller would be (100% reads, sequential large IO), it went to 170MBps. I'm particularly proud of the IOPS of this test. Please note: the large load is random, not sequential! So to come back to my original claim: the controller is important when you have 24 disks. I believe I have backed up this claim with this mail. Also please note that for our setup - a lot of concurrent users on a medium size database (~160GB) - random IO is what we needed, and for that purpose the HP MSA has proved rock solid. But the setup that Francisco mentioned is different: a few users doing mostly sequential IO. For that load, our setup is far from optimal, mainly because of the (single) controller.
regards, Yeb Havinga

ORION VERSION 10.2.0.1.0
Commandline: -run advanced -testname r10-7 -num_disks 24 -size_small 4 -size_large 1024 -type rand -simulate concat -verbose -write 25 -duration 15 -matrix detailed -cache_size 256
This maps to this test:
Test: r10-7
Small IO size: 4 KB
Large IO size: 1024 KB
IO Types: Small Random IOs, Large Random IOs
Simulated Array Type: CONCAT
Write: 25%
Cache Size: 256 MB
Duration for each Data Point: 15 seconds
Small Columns: 0 to 120 concurrent small IOs
Large Columns: 0 to 48 concurrent large IOs
Total Data Points: 416
7 FILEs found: /dev/sda1 through /dev/sdg1, 72834822144 bytes each.

Maximum Large MBPS=155.05 @ Small=2 and Large=48
Maximum Small IOPS=6261 @ Small=120 and Large=0
Minimum Small Latency=3.93 @ Small=1 and Large=0

[Full MBps and IOPS matrices (16 large-IO rows x 26 small-IO columns) omitted; the MBps figures rise with the number of concurrent large IOs and plateau around 152-155MB/s from roughly 8 concurrent large IOs onward, tapering only slightly as the concurrent small-IO load grows, while small IOPS peak at 6261 with no large IOs running.]
Scott Marlowe wrote: > On Tue, Mar 2, 2010 at 1:51 PM, Yeb Havinga <yebhavinga@gmail.com> wrote: > >> With 24 drives it'll probably be the controller that is the limiting factor >> of bandwidth. Our HP SAN controller with 28 15K drives delivers 170MB/s at >> maximum with raid 0 and about 155MB/s with raid 1+0. So I'd go for the 10K >> drives and put the saved money towards the controller (or maybe more than >> one controller). >> > > That's horrifically bad numbers for that many drives. I can get those > numbers for write performance on a RAID-6 on our office server. I > wonder what's making your SAN setup so slow? > Pre scriptum: a few minutes ago I mailed detailed information in the same thread, but as a reply to an earlier response - it tells more about the setup and gives results of a raid 1+0 test. I just have to react to "horrifically bad" and "slow" :-) : the HP SAN can also do raid5 on 28 disks at about 155MBps: 28 disks divided into 7 logical units with raid5; Orion results are below. Please note that this time I did sequential large IO. The mixed read/write MBps maximum here is comparable: around 155MBps. regards, Yeb Havinga

ORION VERSION 10.2.0.1.0
Commandline: -run advanced -testname msa -num_disks 24 -size_small 4 -size_large 1024 -type seq -simulate concat -verbose -write 50 -duration 15 -matrix detailed -cache_size 256
This maps to this test:
Test: msa
Small IO size: 4 KB
Large IO size: 1024 KB
IO Types: Small Random IOs, Large Sequential Streams
Number of Concurrent IOs Per Stream: 4
Force streams to separate disks: No
Simulated Array Type: CONCAT
Write: 50%
Cache Size: 256 MB
Duration for each Data Point: 15 seconds
Total Data Points: 416
7 FILEs found: /dev/sda1 through /dev/sdg1, 109256361984 bytes each.
Maximum Large MBPS=157.75 @ Small=0 and Large=1
Maximum Small IOPS=3595 @ Small=66 and Large=1
Minimum Small Latency=2.81 @ Small=1 and Large=0

[Full MBps and IOPS matrices (16 large-IO rows x 26 small-IO columns) omitted; with this sequential, 50%-write mix the MBps figures peak at 157.75 with a single large stream and no small IOs, and settle into roughly the 95-135MB/s range as concurrent large and small IOs are added, while small IOPS peak at 3595.]
Francisco Reyes wrote: > > Going with a 3Ware SAS controller. > > > Have some external enclosures with 16 15Krpm drives. They are older > 15K rpms, but they should be good enough. > Since the 15K rpms usually have better Transanctions per second I will > put WAL and indexes in the external enclosure. It sounds like you have a lot of hardware around - my guess is it would be worthwhile to do a test setup with one server hooked up to two 3ware controllers. Also, I am not sure if it is wise to put the WAL on the same logical disk as the indexes, but that is maybe for a different thread (it is unwise to mix random and sequential IO, and the WAL also has demands when it comes to write cache). regards, Yeb Havinga
>>> With 24 drives it'll probably be the controller that is the limiting >>> factor of bandwidth. Our HP SAN controller with 28 15K drives delivers >>> 170MB/s at maximum with raid 0 and about 155MB/s with raid 1+0. I get about 150-200 MB/s on a Linux software RAID of 3 cheap Samsung SATA 1TB drives (which is my home multimedia server)... IOPS would of course be horrendous - it's RAID-5 - but that's not the point here. For raw sequential throughput, dumb drives with dumb software RAID can be pretty fast, IF each drive has a dedicated channel (SATA ensures this) and the controller is on a fast PCI Express link (in my case, the chipset SATA controller). I don't suggest you use software RAID with cheap consumer drives, just that any expensive setup that doesn't deliver MUCH more of the performance that is useful to you (i.e. in your case sequential IO) maybe isn't worth the extra price... There are many bottlenecks...
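For reference, the sort of dumb software RAID being described is a one-liner with mdadm; device names are examples:

# 3-drive software RAID-5, one drive per dedicated SATA channel
mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd
# watch the initial sync / rebuild progress
cat /proc/mdstat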
Yeb Havinga writes: > controllers. Also, I am not sure if it is wise to put the WAL on the > same logical disk as the indexes, If I only have two controllers, would it then be better to put the WAL on the first along with all the data, and the indexes on the external? Especially since the external enclosure will have 15K rpm drives vs 10K rpm in the internal. > thread (unwise to mix random and sequential io and also the wal has > demands when it comes to write cache). Thanks for pointing that out. With any luck I will actually be able to do some tests on the new hardware. The current one I literally only got a few hours of stress testing before having to put it in production right away.
Francisco Reyes wrote: > Who are you using for SAS? > One thing I like about 3ware is their management utility works under > both FreeBSD and Linux well. 3ware has turned into a division within LSI now, so I have my doubts about their long-term viability as a separate product as well. LSI used to be the reliable, open, but somewhat slower cards around, but that doesn't seem to be the case with their SAS products anymore. I've worked on two systems using their MegaRAID SAS 1078 chipset in RAID10 recently and been very impressed with both. That's what I'm recommending to clients now too--especially people who liked Dell anyway. (HP customers are still getting pointed toward their P400/600/800. 3ware in white box systems, still OK, but only SATA. Areca is fast, but they're really not taking the whole driver thing seriously.) You can get that direct from LSI as the MegaRAID SAS 8888ELP: http://www.lsi.com/storage_home/products_home/internal_raid/megaraid_sas/megaraid_sas_8888elp/ as well as some similar models. And that's what Dell sells as their PERC6. Here's what a typical one looks like from Linux's perspective, just to confirm which card/chipset/driver I'm talking about: # lspci -v 03:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 1078 (rev 04) Subsystem: Dell PERC 6/i Integrated RAID Controller $ /sbin/modinfo megaraid_sas filename: /lib/modules/2.6.18-164.el5/kernel/drivers/scsi/megaraid/megaraid_sas.ko description: LSI MegaRAID SAS Driver author: megaraidlinux@lsi.com version: 00.00.04.08-RH2 license: GPL As for the management utility, LSI ships "MegaCli[64]" as a statically linked Linux binary. Plenty of reports of people running it on FreeBSD with no problems via Linux emulation libraries--it's a really basic CLI tool and whatever interface it talks to card via seems to emulate just fine. UI is awful, but once you find the magic cheat sheet at http://tools.rapidsoft.de/perc/ it's not too bad. No direct SMART monitoring here either, which is disappointing, but you can get some pretty detailed data out of MegaCli so it's not terrible. I've seen >100MB/s per drive on reads out of small RAID10 arrays, and cleared 1GB/s on larger ones (all on RHEL5+ext3) with this controller on recent installs. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
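For anyone picking MegaCli up for the first time, the handful of calls that cover most day-to-day checking look roughly like this; exact output varies a bit by firmware:

# adapter, logical drive and physical drive status for every card in the box
MegaCli64 -AdpAllInfo -aALL
MegaCli64 -LDInfo -Lall -aALL
MegaCli64 -PDList -aALL
# battery backup unit state (decides whether write-back cache is safe to leave on)
MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL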
Francisco Reyes wrote: > Yeb Havinga writes: > >> controllers. Also, I am not sure if it is wise to put the WAL on the >> same logical disk as the indexes, > > If I only have two controllers would it then be better to put WAL on > the first along with all the data and the indexes on the external? > Specially since the external enclosure will have 15K rpm vs 10K rpm in > the internal. It sounds like you're going to create a single logical unit / raid array on each of the controllers. Depending on the number of disks, this is a bad idea, because if you read/write data sequentially, all drive heads will be aligned to roughly the same place on the disks. If another process wants to read/write as well, this will interfere and bring down both IOPS and MBps. However, with three concurrent users... hmm, but then again, queries will scan multiple tables/indexes, so there will be mixed IO to several locations. What would be interesting is to see what the MBps maximum of a single controller is. Then calculate how many disks are needed to feed that, which would give a figure for the number of disks per logical unit. The challenge with having a few logical units / raid arrays available is how to divide the data over them (with tablespaces). What is good for your physical data depends on the schema and the queries that are most important. For 2 relations, 2 indexes and 4 arrays, it would be clear. There's not much general to say here, except: do not mix table or index data with the WAL. In other words: if you could make a separate raid array for the WAL (a 2-disk raid1 is probably good enough), that would be ok, and it doesn't matter on which controller or enclosure it happens, because IO to that disk is not mixed with the data IO. > > Thanks for pointing that out. > With any luck I will actually be able to do some tests for the new > hardware. The curernt one I literaly did a few hours stress test and > had to put in production right away. I've heard that before ;-) If you do get around to doing some tests, I'm interested in the results / hard numbers. regards, Yeb Havinga
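The mechanics of that kind of split are simple enough; a sketch with the mount points, database name and index name as placeholders, moving pg_xlog via a symlink with the cluster shut down and the indexes via a tablespace:

# WAL onto its own small RAID1 (paths are examples)
pg_ctl stop -D /var/lib/pgsql/data -m fast
mv /var/lib/pgsql/data/pg_xlog /mnt/wal_raid1/pg_xlog
ln -s /mnt/wal_raid1/pg_xlog /var/lib/pgsql/data/pg_xlog
pg_ctl start -D /var/lib/pgsql/data

# indexes onto the external 15K enclosure via a tablespace
psql -d analytics -c "CREATE TABLESPACE idx_space LOCATION '/mnt/external15k/pg_idx';"
psql -d analytics -c "ALTER INDEX some_big_index SET TABLESPACE idx_space;"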
On Wed, Mar 3, 2010 at 4:53 AM, Hannu Krosing <hannu@krosing.net> wrote: > On Wed, 2010-03-03 at 10:41 +0100, Yeb Havinga wrote: >> Scott Marlowe wrote: >> > On Tue, Mar 2, 2010 at 1:51 PM, Yeb Havinga <yebhavinga@gmail.com> wrote: >> > >> >> With 24 drives it'll probably be the controller that is the limiting factor >> >> of bandwidth. Our HP SAN controller with 28 15K drives delivers 170MB/s at >> >> maximum with raid 0 and about 155MB/s with raid 1+0. So I'd go for the 10K >> >> drives and put the saved money towards the controller (or maybe more than >> >> one controller). >> >> >> > >> > That's horrifically bad numbers for that many drives. I can get those >> > numbers for write performance on a RAID-6 on our office server. I >> > wonder what's making your SAN setup so slow? >> > >> Pre scriptum: >> A few minutes ago I mailed detailed information in the same thread but >> as reply to an earlier response - it tells more about setup and gives >> results of a raid1+0 test. >> >> I just have to react to "horrifically bad" and "slow" :-) : The HP san >> can do raid5 on 28 disks also on about 155MBps: > > SAN-s are "horrifically bad" and "slow" mainly because of the 2MBit sec > fiber channel. > But older ones may be just slow internally as well. > The fact that it is expensive does not make it fast. > If you need fast thrughput, use direct attached storage Let me be clear that the only number mentioned at the beginning was throughput. If you're designing a machine to run huge queries and return huge amounts of data that matters. OLAP. If you're designing for OLTP you're likely to only have a few megs a second passing through but in thousands of xactions per second. So, when presented with the only metric of throughput, I figured that's what the OP was designing for. For OLTP his SAN is plenty fast.
On Mar 2, 2010, at 1:36 PM, Francisco Reyes wrote: > david@lang.hm writes: > >> With sequential scans you may be better off with the large SATA drives as >> they fit more data per track and so give great sequential read rates. > > I lean more towards SAS because of writes. > One common thing we do is create temp tables.. so a typical pass may be: > * sequential scan > * create temp table with subset > * do queries against subset+join to smaller tables. > > I figure the concurrent read/write would be faster on SAS than on SATA. I am > trying to move to having an external enclosure (we have several not in use > or about to become free) so I could separate the read and the write of the > temp tables. > Concurrent read/write performance has far more to do with OS and filesystem choice and tuning than with what type of drive it is. > Lastly, it is likely we are going to do horizontal partitioning (ie master > all data in one machine, replicate and then change our code to read parts of > data from different machine) and I think at that time the better drives will > do better as we have more concurrent queries.
On Mar 2, 2010, at 2:10 PM, <david@lang.hm> wrote: > On Tue, 2 Mar 2010, Scott Marlowe wrote: > >> On Tue, Mar 2, 2010 at 2:30 PM, Francisco Reyes <lists@stringsutils.com> wrote: >>> Scott Marlowe writes: >>> >>>> Then the real thing to compare is the speed of the drives for >>>> throughput not rpm. >>> >>> In a machine, simmilar to what I plan to buy, already in house 24 x 10K rpm >>> gives me about 400MB/sec while 16 x 15K rpm (2 to 3 year old drives) gives >>> me about 500MB/sec >> >> Have you tried short stroking the drives to see how they compare then? >> Or is the reduced primary storage not a valid path here? >> >> While 16x15k older drives doing 500Meg seems only a little slow, the >> 24x10k drives getting only 400MB/s seems way slow. I'd expect a >> RAID-10 of those to read at somewhere in or just past the gig per >> second range with a fast pcie (x8 or x16 or so) controller. You may >> find that a faster controller with only 8 or so fast and large SATA >> drives equals the 24 10k drives you're looking at now. I can write at >> about 300 to 350 Megs a second on a slower Areca 12xx series >> controller and 8 2TB Western Digital Green drives, which aren't even >> made for speed. > > what filesystem is being used. There is a thread on the linux-kernel > mailing list right now showing that ext4 seems to top out at ~360MB/sec > while XFS is able to go to 500MB/sec+ I have CentOS 5.4 with 10 7200RPM 1TB SAS drives in RAID 10 (Seagate ES.2, same perf as the SATA ones), XFS, Adaptec 5805, and get ~750MB/sec read and write sequential throughput. A RAID 0 of two of these stops around 1000MB/sec because it is CPU bound in postgres -- for select count(*). If it is select * piped to /dev/null, it is CPU bound below 300MB/sec converting data to text. For XFS, set readahead to 16MB or so (2MB or so per stripe) (--setra 32768 is 16MB) and absolutely make sure that the XFS mount parameter 'allocsize' is set to about the same size or more. For large sequential operations, you want to make sure interleaved writes don't interleave files on disk. I use 80MB allocsize, and 40MB readahead for the reporting data. Later Linux kernels have significantly improved readahead systems that don't need to be tuned quite as much. For high sequential throughput, nothing is as optimized as XFS on Linux yet. It has weaknesses elsewhere however. And 3Ware on Linux + high throughput sequential = slow. PERC 6 was 20% faster, and Adaptec was 70% faster with the same drives, and with experiments to filesystem and readahead for all. From what I hear, Areca is a significant notch above Adaptec on that too. > > on single disks the disk performance limits you, but on arrays where the > disk performance is higher there may be other limits you are running into. > > David Lang
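Spelled out, that allocsize tuning is just an XFS mount option (and an fstab entry if it works out). allocsize takes power-of-two values, so 64m stands in below for the ~80MB mentioned above; the device and mount point are examples:

# large end-of-file preallocation so tables and indexes that grow at the same time
# don't get interleaved on disk
mount -t xfs -o allocsize=64m,noatime /dev/sdb1 /data
# persistent version, in /etc/fstab:
# /dev/sdb1  /data  xfs  allocsize=64m,noatime  0 0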
Scott Carey wrote: > For high sequential throughput, nothing is as optimized as XFS on Linux yet. It has weaknesses elsewhere however. > I'm curious what you feel those weaknesses are. The recent addition of XFS back into a more mainstream position in the RHEL kernel as of their 5.4 update greatly expands where I can use it now, have been heavily revisiting it since that release. I've already noted how well it does on sequential read/write tasks relative to ext3, and it looks like the main downsides I used to worry about with it (mainly crash recovery issues) were also squashed in recent years. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
On Tue, 09 Mar 2010 08:00:50 +0100, Greg Smith <greg@2ndquadrant.com> wrote:

> Scott Carey wrote:
>> For high sequential throughput, nothing is as optimized as XFS on Linux
>> yet.  It has weaknesses elsewhere however.

When files are extended one page at a time (as postgres does), fragmentation can be pretty high on some filesystems (ext3, but NTFS is the absolute worst) if several files (indexes + table) grow simultaneously.  XFS has delayed allocation, which really helps.

> I'm curious what you feel those weaknesses are.

Handling lots of small files, especially deleting them, is really slow on XFS.  Databases don't care about that.

There is also the dark side of delayed allocation: if your application is broken, it will manifest itself very painfully.  Since XFS keeps a lot of unwritten data in its buffers, an app that doesn't fsync correctly can lose lots of data if you don't have a UPS.  Fortunately, postgres handles fsync the way it should.

A word of advice though: a few years ago, we lost a few terabytes on XFS (and after that, restoring from backup was quite slow!) because a faulty SCSI cable crashed the server, then crashed it again during xfs_repair.  So if you do xfs_repair on a suspicious system, please image the disks first.
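A minimal sketch of the "image the disks first" advice, assuming a hypothetical /dev/sdb data array and enough free space under /backup; device names and paths are illustrative:

    # take a raw image of each suspect device before touching it
    dd if=/dev/sdb of=/backup/sdb.img bs=1M conv=noerror,sync

    # dry-run the repair first (-n reports problems without modifying anything)
    xfs_repair -n /dev/sdb1

    # only then run the real repair, with the image available as a fallback
    xfs_repair /dev/sdb1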
Pierre C wrote:
> On Tue, 09 Mar 2010 08:00:50 +0100, Greg Smith <greg@2ndquadrant.com> wrote:
>> I'm curious what you feel those weaknesses are.
>
> Handling lots of small files, especially deleting them, is really slow
> on XFS.  Databases don't care about that.
>
> A word of advice though: a few years ago, we lost a few terabytes on
> XFS (and after that, restoring from backup was quite slow!) because a
> faulty SCSI cable crashed the server, then crashed it again during
> xfs_repair.  So if you do xfs_repair on a suspicious system, please
> image the disks first.

Then, which filesystem do you recommend for the PostgreSQL data directory?  I have been seeing that ZFS brings very cool features for that.  The problem with ZFS is that it is only available on Solaris, OpenSolaris, FreeBSD and Mac OS X Server, not on Linux systems.  What do you think about that?

Regards

--
Ing. Marcos Luís Ortíz Valmaseda
Database Architect/Administrator -- PostgreSQL RDBMS
Comunidad Técnica Cubana de PostgreSQL
http://postgresql.uci.cu
Universidad de las Ciencias Informáticas
http://www.uci.cu
On Tue, 9 Mar 2010, Pierre C wrote:

>> I'm curious what you feel those weaknesses are.
>
> Handling lots of small files, especially deleting them, is really slow on
> XFS.
> Databases don't care about that.

Accessing lots of small files works really well on XFS compared to ext* (I use XFS with a cyrus mail server, which keeps each message as a separate file, and XFS vastly outperforms ext2/3 there).  Deleting is slow, as you say.

David Lang
"Pierre C" <lists@peufeu.com> wrote: > Greg Smith <greg@2ndquadrant.com> wrote: >> I'm curious what you feel those weaknesses are. > > Handling lots of small files, especially deleting them, is really > slow on XFS. > Databases don't care about that. I know of at least one exception to that -- when we upgraded and got a newer version of the kernel where XFS has write barriers on by default, some database transactions which were creating and dropping temporary tables in a loop became orders of magnitude slower. Now, that was a silly approach to getting the data that was needed and I helped them rework the transactions, but something which had worked acceptably suddenly didn't anymore. Since we have a BBU hardware RAID controller, we can turn off write barriers safely, at least according to this page: http://xfs.org/index.php/XFS_FAQ#Q._Should_barriers_be_enabled_with_storage_which_has_a_persistent_write_cache.3F This reduces the penalty for creating and deleting lots of small files. -Kevin
Do keep the postgres xlog on a separate ext2 partition for best performance.  Other than that, xfs is definitely a good performer.

Mike Stone
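As a sketch of one way to set that up -- the partition, data directory path and init script name below are hypothetical, and the server must be stopped while pg_xlog is moved:

    # stop PostgreSQL before touching pg_xlog
    /etc/init.d/postgresql stop

    # create and mount a small ext2 filesystem for the xlog
    mkfs.ext2 /dev/sdc1
    mkdir -p /pg_xlog
    mount /dev/sdc1 /pg_xlog
    chown postgres:postgres /pg_xlog

    # move the existing xlog and leave a symlink behind
    mv /var/lib/pgsql/data/pg_xlog/* /pg_xlog/
    rmdir /var/lib/pgsql/data/pg_xlog
    ln -s /pg_xlog /var/lib/pgsql/data/pg_xlog

    /etc/init.d/postgresql start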
On Mar 8, 2010, at 11:00 PM, Greg Smith wrote:

> Scott Carey wrote:
>> For high sequential throughput, nothing is as optimized as XFS on Linux yet.  It has weaknesses elsewhere however.
>
> I'm curious what you feel those weaknesses are.

My somewhat negative experiences have been:

* Metadata operations are a bit slow; this manifests itself mostly with lots of small files being updated or deleted.
* Improper use of the file system or hardware configuration will likely break worse (ext3 'ordered' mode makes poorly written apps safer).
* At least with CentOS 5.3 and their xfs version (non-Redhat, CentOS extras), sparse random writes could almost hang a file system.  They were VERY slow.  I have not tested since.

None of the above affect Postgres.

I'm also not sure how up to date RedHat's xfs version is -- there have been enhancements to xfs in the kernel mainline regularly for a long time.

In non-postgres contexts, I've grown to appreciate some other qualities: unlike ext2/3, I can have more than 32K directories in another directory -- XFS will do millions, and though it will slow down, at least it doesn't just throw an error to the application.  And although XFS is slow to delete lots of small things, it can delete large files much faster -- I deal with lots of large files and it is comical to see ext3 take a minute to delete a 30GB file while XFS does it almost instantly.

I have been happy with XFS for Postgres data directories, and ext2 for a dedicated xlog partition.  Although I have not risked the online defragmentation on a live DB, I have defragmented an 8TB DB during maintenance and seen the performance improve.
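As a sketch of what that defragmentation pass looks like -- the device and mount point are hypothetical, and xfs_fsr is I/O-heavy, so run it in a maintenance window as described above:

    # report the current fragmentation factor (read-only)
    xfs_db -r -c frag /dev/sdb1

    # defragment the mounted filesystem, verbosely, for at most two hours
    xfs_fsr -v -t 7200 /data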
On Mar 9, 2010, at 4:39 PM, Scott Carey wrote:

>> * At least with CentOS 5.3 and their xfs version (non-Redhat, CentOS extras), sparse random writes could almost hang a file system.  They were VERY slow.  I have not tested since.

Just to be clear, I mean random writes to a _sparse file_.

You can cause this condition with the 'fio' tool, which will by default allocate a file for write as a sparse file, then write to it.  If the whole thing is written to first, then random writes are fine.  Postgres only writes randomly when it overwrites a page; otherwise it's always an append operation, AFAIK.
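One rough sketch of reproducing the two cases with fio; the paths are hypothetical and the file-preallocation behavior differs between fio versions, so treat this as an illustration rather than a benchmark recipe:

    # case 1: random 8k writes into a freshly created (typically sparse) file
    fio --name=sparse-randwrite --filename=/data/fio.sparse \
        --rw=randwrite --bs=8k --size=4G --runtime=60

    # case 2: fill the file sequentially first, then random-write into it
    fio --name=fill --filename=/data/fio.filled --rw=write --bs=1M --size=4G
    fio --name=filled-randwrite --filename=/data/fio.filled \
        --rw=randwrite --bs=8k --size=4G --runtime=60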
Scott Carey wrote:
> I'm also not sure how up to date RedHat's xfs version is -- there have been enhancements to xfs in the kernel mainline regularly for a long time.

They seem to be following SGI's XFS repo quite carefully and cherry-picking bug fixes out of there; I'm not sure how that relates to mainline kernel development right now.  For example:

https://bugzilla.redhat.com/show_bug.cgi?id=509902 (July 2009 SGI commit, now active for RHEL5.4)
https://bugzilla.redhat.com/show_bug.cgi?id=544349 (November 2009 SGI commit, may be merged into RHEL5.5, currently in beta)

As far as I've been able to tell, this is all being driven by the desire for >16TB filesystems, i.e. https://bugzilla.redhat.com/show_bug.cgi?id=213744 , and the whole thing will be completely mainstream (bundled into the installer, and hopefully with 32-bit support available) by RHEL6: https://bugzilla.redhat.com/show_bug.cgi?id=522180

Thanks for the comments.  From all the info I've been able to gather, "works fine for what PostgreSQL does with the filesystem, not necessarily suitable for your root volume" seems to be a fair characterization of where XFS is at right now.  Which is reasonable -- that's the context I'm getting more requests to use it in, just as the filesystem where the database lives.  Those who don't have a separate volume and filesystem for the db tend not to care about filesystem performance differences either.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us