Thread: How to improve db performance with $7K?
Situation: A 24/7 animal hospital (100 employees) runs their business on CentOS 3.3 (RHEL 3) and Postgres 7.4.2 (because they have to), off a 2-CPU Xeon 2.8GHz, 4GB of RAM, and (3) SCSI disks in RAID 0 (zcav value 35MB per sec). The database is 11GB, comprising over 100 tables and indexes from 1MB to 2GB in size.

I recently told the hospital management team that, worst-case scenario, they need to get the database onto its own drive array, since the RAID 0 is a disaster waiting to happen. I said ideally a new dual AMD server with a 6/7-disk configuration would be ideal for safety and performance, but they don't have $15K. I said a separate drive array offers a balance of safety and performance.

I have been given a budget of $7K to accomplish a safer/faster database through hardware upgrades. The objective is to get a drive array, but I can use the budget any way I see fit to accomplish the goal.

Since I am a DBA novice, I did not physically build this server, nor did I write the application the hospital runs on, but I have the opportunity to make it better, so I thought I should seek some advice from those who have been down this road before. Suggestions/ideas anyone?

Thanks.

Steve Poe
Steve Poe <spoe@sfnet.cc> writes: > Situation: An 24/7 animal hospital (100 employees) runs their business > on Centos 3.3 (RHEL 3) Postgres 7.4.2 (because they have to) [ itch... ] Surely they could at least move to 7.4.7 without pain. There are serious data-loss bugs known in 7.4.2. regards, tom lane
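For what it's worth, a 7.4.x minor release is a binary swap with no dump/reload, since minor releases share the on-disk format. A minimal sketch of the upgrade, assuming a source build installed under /usr/local/pgsql and an init script named postgresql (both assumptions; take a backup regardless):

  # precautionary logical backup before touching anything (run as the postgres user)
  pg_dumpall > /backup/pre_747_dumpall.sql

  # stop the 7.4.2 server
  /etc/init.d/postgresql stop

  # install the 7.4.7 binaries over the old ones (source build shown; updated RPMs work too)
  cd /usr/local/src/postgresql-7.4.7
  ./configure --prefix=/usr/local/pgsql && make && make install

  # restart against the existing data directory and confirm the version
  /etc/init.d/postgresql start
  psql -c "SELECT version();"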
You can purchase a whole new dual Opteron 740, with 6 gigs of RAM, a case to match, and six 74GB Ultra320 SCA drives for about $7K. I know because that's what I bought one for two weeks ago, using Tyan's dual board. If you need some details and are willing to go that route, let me know and I'll get you the information.

Sincerely,

Will LaShell

Steve Poe wrote:
> Situation: A 24/7 animal hospital (100 employees) runs their business on CentOS 3.3 (RHEL 3) and Postgres 7.4.2 (because they have to), off a 2-CPU Xeon 2.8GHz, 4GB of RAM, and (3) SCSI disks in RAID 0 (zcav value 35MB per sec). The database is 11GB, comprising over 100 tables and indexes from 1MB to 2GB in size.
> [...]
Steve Poe wrote:
> I have been given a budget of $7K to accomplish a safer/faster database
> through hardware upgrades. The objective is to get a drive array, but
> I can use the budget any way I see fit to accomplish the goal.

You could build a dual Opteron with 4GB of RAM and twelve 10K Raptor SATA drives with a battery-backed cache for about $7K or less.

Or, if they are not CPU bound but just IO bound, you could easily add an external 12-drive array (even SCSI) for less than $7K.

Sincerely,

Joshua D. Drake

-- Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC Postgresql support, programming shared hosting and dedicated hosting. +1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com PostgreSQL Replicator -- production quality replication for PostgreSQL
Tom,

From what I understand, the vendor used ProIV for development; when they attempted to use 7.4.3, they had ODBC issues and something else I honestly don't know about, but I was told that data was not coming through properly. They're somewhat at the mercy of the ProIV people to give them the stamp of approval, and then the vendor will tell us what they support.

Thanks.

Steve Poe

Tom Lane wrote:
>Steve Poe <spoe@sfnet.cc> writes:
>>Situation: A 24/7 animal hospital (100 employees) runs their business
>>on CentOS 3.3 (RHEL 3) Postgres 7.4.2 (because they have to)
>
>[ itch... ] Surely they could at least move to 7.4.7 without pain.
>There are serious data-loss bugs known in 7.4.2.
>
> regards, tom lane
> You could build a dual Opteron with 4GB of RAM and twelve 10K Raptor SATA
> drives with a battery-backed cache for about $7K or less.

Okay. You trust SATA drives? I've been leery of them for a production database. Pardon my ignorance, but what is a "battery backed cache"? I know the drives have a built-in cache, but I don't know if that's the same. Are the 12 drives internal or in an external chassis? Could you point me to a place where this configuration exists?

> Or, if they are not CPU bound but just IO bound, you could easily add
> an external 12-drive array (even SCSI) for less than $7K.

I don't believe it is CPU bound. At our busiest hour, the CPU is idle about 70% on average, down to 30% idle at its heaviest. Context switching averages about 4-5K per hour, with momentary peaks to 25-30K for a minute. Overall disk performance is poor (35MB per sec).

Thanks for your input.

Steve Poe
Hi Steve,

> Okay. You trust SATA drives? I've been leery of them for a production
> database. Pardon my ignorance, but what is a "battery backed cache"? I
> know the drives have a built-in cache, but I don't know if that's the same.
> Are the 12 drives internal or in an external chassis? Could you point me to
> a place where this configuration exists?

Get 12 or 16 x 74GB Western Digital Raptor S-ATA drives, one 3ware 9500S-12 or two 3ware 9500S-8 RAID controllers with a battery backup unit (in case of power loss the controller saves unflushed data), a decent Tyan board for the existing dual Xeon with 2 PCI-X slots, and a matching 3U case for 12 drives (12 drives internal).

Here in Germany, chassis by Chenbro are quite popular; a matching one for your needs would be the Chenbro RM312 or RM414 (http://61.30.15.60/product/product_preview.php?pid=90 and http://61.30.15.60/product/product_preview.php?pid=95 respectively).

Take 6 or 10 drives in RAID 10 for pgdata, a 2-drive RAID 1 for the transaction log (pg_xlog), a 2-drive RAID 1 for OS and swap, and 2 spare disks. That should give you about 250 MB/s reads and a 70 MB/s sustained write rate with xfs.

Regards,

Bjoern
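To make that split concrete: PostgreSQL 7.4 has no tablespaces, so the WAL is separated with a symlink once the arrays are mounted. A minimal sketch, in which the device names, mount points, and data directory path are all assumptions:

  # assume the RAID 10 data array shows up as /dev/sda1 and the WAL mirror as /dev/sdb1
  mkfs.xfs /dev/sda1
  mkfs.xfs /dev/sdb1
  mkdir -p /pgdata /pgxlog
  mount /dev/sda1 /pgdata
  mount /dev/sdb1 /pgxlog

  # relocate the cluster and its WAL with the server stopped
  /etc/init.d/postgresql stop
  cp -a /var/lib/pgsql/data /pgdata/
  mv /pgdata/data/pg_xlog /pgxlog/pg_xlog
  ln -s /pgxlog/pg_xlog /pgdata/data/pg_xlog
  # point the init script's PGDATA at /pgdata/data, then:
  /etc/init.d/postgresql start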
Bjoern, Josh, Steve,

> Get 12 or 16 x 74GB Western Digital Raptor S-ATA drives, one 3ware
> 9500S-12 or two 3ware 9500S-8 RAID controllers with a battery backup
> unit (in case of power loss the controller saves unflushed data), a
> decent Tyan board for the existing dual Xeon with 2 PCI-X slots and a
> matching 3U case for 12 drives (12 drives internal).

Based on both my testing and feedback from one of the WD Raptor engineers, Raptors are still only optimal for 90% read applications. This makes them a great buy for web applications (which are usually 95% read) but a bad choice for OLTP applications, which sounds more like what Steve's describing. For those, it would be better to get 6 quality SCSI drives than 12 Raptors.

The reason for this is that SATA still doesn't do bi-directional traffic very well (simultaneous read and write), and OSes and controllers simply haven't caught up with the drive spec and features. WD hopes that in a year they will be able to offer a Raptor that performs all operations as well as a 10K SCSI drive, for 25% less ... but that's in the next generation of drives, controllers and drivers.

Steve, can we clarify that you are not currently having any performance issues, and that you're just worried about failure? Recommendations should be based on whether improving application speed is a requirement ...

> Here in Germany, chassis by Chenbro are quite popular; a matching one for
> your needs would be the Chenbro RM312 or RM414
> (http://61.30.15.60/product/product_preview.php?pid=90 and
> http://61.30.15.60/product/product_preview.php?pid=95 respectively).

The Chenbros are nice, but kinda pricey ($800) if Steve doesn't need the machine to be rackable.

If your primary goal is redundancy, you may wish to consider the possibility of building a brand-new machine for $7K (you can do a lot of machine for $7,000 if it doesn't have to be rackable) and re-configuring the old machine to use as a replication or PITR backup. This would allow you to configure the new machine with only a moderate amount of hardware redundancy while still having 100% confidence in staying running.

-- Josh Berkus Aglio Database Solutions San Francisco
>Steve, can we clarify that you are not currently having any performance
>issues, you're just worried about failure? Recommendations should be based
>on whether improving application speed is a requirement ...

Josh,

The priorities are: 1) improve safety/failure-prevention, 2) improve performance.

The owner of the company wants greater performance (and I concur, to a certain degree), but the owner's vote is only 1/7 of the management team. And the rest of the management team is not as focused on performance. They all agree on safety/failure-prevention.

Steve
>The Chenbros are nice, but kinda pricey ($800) if Steve doesn't need the
>machine to be rackable.
>
>If your primary goal is redundancy, you may wish to consider the possibility
>of building a brand-new machine for $7k (you can do a lot of machine for
>$7000 if it doesn't have to be rackable) and re-configuring the old machine
>and using it as a replication or PITR backup. This would allow you to
>configure the new machine with only a moderate amount of hardware redundancy
>while still having 100% confidence in staying running.

Our servers are not racked, so a new one does not have to be. *If* it is possible, I'd like to replace the main server with a new one. I could tweak the new one the way I need it and work with the vendor to make sure everything works well. In either case, I'll still need to test how positioning the tables/indexes across a RAID 10 will perform.

I am also waiting on feedback from the ProIV developers. If their ProIV modules will not run under AMD64, or won't take advantage of the processor, then I'll stick with the server we have.

Steve Poe
1. For the empty PCI-X slot, buy a single- or dual-channel SCSI-320 hardware RAID controller, like a MegaRAID SCSI 320-2X (don't forget to check the driver for your OS), plus battery backup, plus (optionally) expanding the controller RAM to the 256MB maximum - approx $1K.
2. Buy new Maxtor drives - Atlas 15K II (4 x 36.7GB) - approx 4 x $400.
3. SCSI-320 cable set.
4. Use the old drives (2) for the OS (and optionally the DB log) files in RAID 1 mode, possibly over one channel of the MegaRAID.
5. New drives (4+) in RAID 10 mode for the DB.
6. Start tuning Postgres + OS: more shared RAM, etc.

Best regards, Alexander Kirpa
Have you already considered application/database tuning? Adding indexes? shared_buffers large enough? etc. Your database doesn't seem that large for the hardware you've already got. I'd hate to spend $7k and end up back in the same boat. :) On Sat, 2005-03-26 at 13:04 +0000, Steve Poe wrote: > >Steve, can we clarify that you are not currently having any performance > >issues, you're just worried about failure? Recommendations should be based > >on whether improving applicaiton speed is a requirement ... > > Josh, > > The priorities are: 1)improve safety/failure-prevention, 2) improve > performance. > > The owner of the company wants greater performance (and, I concure to > certain degree), but the owner's vote is only 1/7 of the managment team. > And, the rest of the management team is not as focused on performance. > They all agree in safety/failure-prevention. > > Steve > > > > > > > > ---------------------------(end of broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
Cott Lang wrote:
>Have you already considered application/database tuning? Adding
>indexes? shared_buffers large enough? etc.
>
>Your database doesn't seem that large for the hardware you've already
>got. I'd hate to spend $7k and end up back in the same boat. :)

Cott,

I agree with you. Unfortunately, I am not the developer of the application. The vendor uses ProIV, which connects via ODBC. The vendor could certainly do some tuning and create more indexes where applicable. I am encouraging the vendor to take a more active role so we can work together on this.

With hardware tuning, I am sure we can do better than 35MB per sec. Also, moving the top 3 or 5 tables and indexes to their own slice of a RAID 10, and moving pg_xlog to its own drive, will help too.

Since you asked about tuned settings, here's what we're using:

kernel.shmmax = 1073741824
shared_buffers = 10000
sort_mem = 8192
vacuum_mem = 65536
effective_cache_size = 65536

Steve Poe
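For comparison only, here is roughly what more aggressive 7.4-era settings might look like on a dedicated 4GB box. The numbers are illustrative assumptions to be validated against the actual workload, not recommendations, and kernel.shmmax must be large enough to hold shared_buffers:

  # OS side: allow a larger shared memory segment (2GB shown)
  sysctl -w kernel.shmmax=2147483648

  # postgresql.conf, 7.4 units (shared_buffers and effective_cache_size are in 8KB pages)
  shared_buffers       = 30000        # ~235MB; 7.4 sees diminishing returns much beyond this
  sort_mem             = 16384        # KB per sort; watch the total under concurrency
  vacuum_mem           = 131072
  effective_cache_size = 300000       # ~2.3GB, roughly what the OS cache will hold
  checkpoint_segments  = 16           # fewer, larger checkpoints for write bursts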
> With hardware tuning, I am sure we can do better than 35MB per sec.

WTF ?

My Laptop does 19 MB/s (reading <10 KB files, reiser4) !

A recent desktop 7200rpm IDE drive:

  # hdparm -t /dev/hdc1
  /dev/hdc1:
   Timing buffered disk reads: 148 MB in 3.02 seconds = 49.01 MB/sec

  # ll "DragonBall 001.avi"
  -r--r--r-- 1 peufeu users 218M mar 9 20:07 DragonBall 001.avi

  # time cat "DragonBall 001.avi" >/dev/null
  real 0m4.162s
  user 0m0.020s
  sys 0m0.510s

(The file was not in the cache.)
=> about 52 MB/s (reiser3.6)

So, you have a problem with your hardware...
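To separate sequential read and write throughput on the database array itself, a crude dd test works anywhere; a sketch, assuming /pgdata is the array's mount point (run it on an idle system, and use a file at least twice the size of RAM so the cache cannot satisfy it):

  # sequential write, ~8GB, with the final sync included in the timing
  time sh -c 'dd if=/dev/zero of=/pgdata/ddtest bs=8k count=1000000 && sync'

  # sequential read of the same file; being larger than RAM, little of it is cached
  time dd if=/pgdata/ddtest of=/dev/null bs=8k

  rm /pgdata/ddtest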
Yeah, 35MB per sec is slow for a RAID controller - the 3ware mirrored is about 50MB/sec, and striped is about 100.

Dave

PFC wrote:
>> With hardware tuning, I am sure we can do better than 35MB per sec.
>
> WTF ?
>
> My Laptop does 19 MB/s (reading <10 KB files, reiser4) !
>
> [...]
>
> So, you have a problem with your hardware...

-- Dave Cramer http://www.postgresintl.com 519 939 0336 ICQ#14675561
On Mon, 2005-03-28 at 17:36 +0000, Steve Poe wrote:
> I agree with you. Unfortunately, I am not the developer of the
> application. The vendor uses ProIV, which connects via ODBC. The vendor
> could certainly do some tuning and create more indexes where applicable.
> I am encouraging the vendor to take a more active role so we can work
> together on this.

I've done a lot of browsing through pg_stat_activity, looking for queries that either hang around for a while or show up very often, and using explain to find out if they can use some assistance.

You may also find that a dump and restore with a reconfiguration to mirrored drives speeds you up a lot - just from the dump and restore.

> With hardware tuning, I am sure we can do better than 35MB per sec. Also,
> moving the top 3 or 5 tables and indexes to their own slice of a RAID 10,
> and moving pg_xlog to its own drive, will help too.

If your database activity involves a lot of random I/O, 35MB per second wouldn't be too bad.

While conventional wisdom is that pg_xlog on its own drives (I know you meant plural :) ) is a big boost, in my particular case I could never get a measurable boost that way. Obviously, YMMV.
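A minimal example of the kind of digging described above, via psql. It assumes stats_command_string is enabled in postgresql.conf so current_query is populated, and the table and column in the second command are placeholders for one of your own suspect queries:

  # what each backend is running right now
  psql -d yourdb -c "SELECT procpid, usename, current_query FROM pg_stat_activity;"

  # then inspect the plan of a frequently seen query
  psql -d yourdb -c "EXPLAIN ANALYZE SELECT * FROM visits WHERE patient_id = 42;"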
Dave Cramer <pg@fastcrypt.com> writes:
> PFC wrote:
> > My Laptop does 19 MB/s (reading <10 KB files, reiser4) !
> Yeah, 35MB per sec is slow for a RAID controller - the 3ware mirrored is
> about 50MB/sec, and striped is about 100

Well, you're comparing apples and oranges here. A modern 7200rpm drive should be capable of doing 40-50MB/s depending on the location of the data on the disk. But that's only doing sequential access of data using something like dd, and without other processes intervening and causing seeks.

In practice it seems busy databases see random_page_costs of about 4, which for a drive with 10ms seek time translates to only about 3.2MB/s.

I think the first order of business is getting pg_xlog onto its own device. That alone should remove a lot of the seeking. If it's an ext3 device I would also consider moving the journal to a dedicated drive. (Or, if they're SCSI drives or you're sure the RAID controller is safe from write caching, then just switch file systems to something that doesn't journal data.)

-- greg
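If the data stays on ext3, the journaling mode mentioned above is just a mount option. A sketch of an /etc/fstab entry for a dedicated data partition; the device and mount point are assumptions, and data=writeback only relaxes ordering of file contents, which is commonly considered acceptable for a PostgreSQL data directory because the WAL provides the real crash guarantee:

  # /etc/fstab (illustrative): metadata-only journaling plus noatime for the data partition
  /dev/sdb1   /pgdata   ext3   noatime,data=writeback   0 2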
Thanks for everyone's feedback on how to best improve our PostgreSQL database for the animal hospital. I re-read the PostgreSQL 8.0 Performance Checklist just to keep focused.

We purchased (2) 4 x 146GB 10,000rpm SCSI U320 SCA drive arrays ($2600) and (1) Sun W2100z dual AMD64 workstation with 4GB RAM ($2500). We did not need a rack-mount server, so I thought Sun's workstation would do fine. I'll double the RAM. Hopefully, this should out-perform our dual 2.8 Xeon with 4GB of RAM.

Now we need to purchase a good U320 RAID card. Any suggestions for ones which run well under Linux?

These two drive arrays' main purpose is our database. For those who have messed with drive arrays before, how would you slice up the drive array? Will database performance be affected by how our RAID 10 is configured? Any suggestions?

Thanks.

Steve Poe
I'd use two of your drives to create a mirrored partition where pg_xlog resides separate from the actual data. RAID 10 is probably appropriate for the remaining drives. Fortunately, you're not using Dell, so you don't have to worry about the Perc3/Di RAID controller, which is not so compatible with Linux... -tfo -- Thomas F. O'Connell Co-Founder, Information Architect Sitening, LLC http://www.sitening.com/ 110 30th Avenue North, Suite 6 Nashville, TN 37203-6320 615-260-0005 On Mar 31, 2005, at 9:01 PM, Steve Poe wrote: > Thanks for everyone's feedback on to best improve our Postgresql > database for the animal hospital. I re-read the PostgreSQL 8.0 > Performance Checklist just to keep focused. > > We purchased (2) 4 x 146GB 10,000rpm SCSI U320 SCA drive arrays > ($2600) and (1) Sun W2100z dual AMD64 workstation with 4GB RAM > ($2500). We did not need a rack-mount server, so I though Sun's > workstation would do fine. I'll double the RAM. Hopefully, this should > out-perform our dual 2.8 Xeon with 4GB of RAM. > > Now, we need to purchase a good U320 RAID card now. Any suggestions > for those which run well under Linux? > > These two drive arrays main purpose is for our database. For those > messed with drive arrays before, how would you slice-up the drive > array? Will database performance be effected how our RAID10 is > configured? Any suggestions? > > Thanks. > > Steve Poe
On Mar 31, 2005, at 9:01 PM, Steve Poe wrote:
> Now we need to purchase a good U320 RAID card. Any suggestions
> for ones which run well under Linux?

Not sure if it works with Linux, but under FreeBSD 5, the LSI MegaRAID cards are well supported. You should be able to pick up a 320-2X with 128MB battery-backed cache for about $1K. Wicked fast... I'm surprised you didn't go for the 15K RPM drives for a small extra cost.
Vivek Khera wrote:
> On Mar 31, 2005, at 9:01 PM, Steve Poe wrote:
>> Now we need to purchase a good U320 RAID card. Any suggestions
>> for ones which run well under Linux?
>
> Not sure if it works with Linux, but under FreeBSD 5, the LSI MegaRAID
> cards are well supported. You should be able to pick up a 320-2X with
> 128MB battery-backed cache for about $1K. Wicked fast... I'm surprised
> you didn't go for the 15K RPM drives for a small extra cost.

Wow, okay, so I'm not sure where everyone's email went, but I got over a week's worth of list emails at once. Several of you have sent me requests on where we purchased our systems. Compsource was the vendor, www.c-source.com or www.compsource.com. The sales rep we have is Steve Taylor, or you can talk to the sales manager, Tom. I've bought hardware from them for the last 2 years and I've been very pleased. I'm sorry I wasn't able to respond sooner.

Steve,

The LSI MegaRAID cards are where it's at. I've had -great- luck with them over the years. There were a few weird problems with a series awhile back where the Linux driver needed to be tweaked by the developers along with a new BIOS update. The 320 series is just as Vivek said, wicked fast. Very strong cards. Be sure though, when you order it, to specify the battery backup either with it, or make sure you buy the right one for it. There are a couple of options with battery cache on the cards that can trip you up.

Good luck on your systems! Now that I've got my email problems resolved, I'm definitely more than happy to give any information you all need.
To be honest, I've yet to run across a SCSI configuration that can touch the 3ware SATA controllers. I have yet to see one top 80MB/sec, let alone 180MB/sec read or write, which is why we moved _away_ from SCSI. I've seen Compaq, Dell and LSI controllers all do pathetically badly on RAID 1, RAID 5 and RAID 10. 35MB/sec for a three drive RAID 0 is not bad, it's appalling. The hardware manufacturer should be publicly embarassed for this kind of speed. A single U320 10k drive can do close to 70MB/sec sustained. If someone can offer benchmarks to the contrary (particularly in linux), I would be greatly interested. Alex Turner netEconomist On Mar 29, 2005 8:17 AM, Dave Cramer <pg@fastcrypt.com> wrote: > Yeah, 35Mb per sec is slow for a raid controller, the 3ware mirrored is > about 50Mb/sec, and striped is about 100 > > Dave > > PFC wrote: > > > > >> With hardware tuning, I am sure we can do better than 35Mb per sec. Also > > > > > > WTF ? > > > > My Laptop does 19 MB/s (reading <10 KB files, reiser4) ! > > > > A recent desktop 7200rpm IDE drive > > # hdparm -t /dev/hdc1 > > /dev/hdc1: > > Timing buffered disk reads: 148 MB in 3.02 seconds = 49.01 MB/sec > > > > # ll "DragonBall 001.avi" > > -r--r--r-- 1 peufeu users 218M mar 9 20:07 DragonBall > > 001.avi > > > > # time cat "DragonBall 001.avi" >/dev/null > > real 0m4.162s > > user 0m0.020s > > sys 0m0.510s > > > > (the file was not in the cache) > > => about 52 MB/s (reiser3.6) > > > > So, you have a problem with your hardware... > > > > ---------------------------(end of broadcast)--------------------------- > > TIP 7: don't forget to increase your free space map settings > > > > > > -- > Dave Cramer > http://www.postgresintl.com > 519 939 0336 > ICQ#14675561 > > ---------------------------(end of broadcast)--------------------------- > TIP 6: Have you searched our list archives? > > http://archives.postgresql.org >
Alex Turner wrote:
>To be honest, I've yet to run across a SCSI configuration that can
>touch the 3ware SATA controllers. I have yet to see one top 80MB/sec,
>let alone 180MB/sec read or write, which is why we moved _away_ from
>SCSI. I've seen Compaq, Dell and LSI controllers all do pathetically
>badly on RAID 1, RAID 5 and RAID 10.

Alex,

How does the 3ware controller do on heavy writes back to the database? It may have been Josh, but someone said that SATA does well with reads but not writes. Would not an equal number of SCSI drives outperform SATA? I don't want to start a "who's better" war, I am just trying to learn here. It would seem the more drives you could place in a RAID configuration, the more the performance would increase.

Steve Poe
I'm no drive expert, but it seems to me that our write performance is excellent. I think what most are concerned about is OLTP, where you are doing heavy write _and_ heavy read performance at the same time.

Our system is mostly read during the day, but we do a full system update every night that is all writes, and it's very fast compared to the smaller SCSI system we moved off of. Nearly a 6x speed improvement, as fast as 900 rows/sec with a 48 byte record, one row per transaction.

I don't know enough about how SATA works to really comment on its performance as a protocol compared with SCSI. If anyone has a useful link on that, it would be greatly appreciated.

More drives will give more throughput/sec, but not necessarily more transactions/sec. For that you will need more RAM on the controller, and definitely a BBU to keep your data safe.

Alex Turner netEconomist

On Apr 4, 2005 10:39 AM, Steve Poe <spoe@sfnet.cc> wrote:
> Alex,
>
> How does the 3ware controller do on heavy writes back to the database?
> It may have been Josh, but someone said that SATA does well with reads
> but not writes. Would not an equal number of SCSI drives outperform SATA?
> [...]
On Apr 4, 2005, at 3:12 PM, Alex Turner wrote: > Our system is mostly read during the day, but we do a full system > update everynight that is all writes, and it's very fast compared to > the smaller SCSI system we moved off of. Nearly a 6x spead > improvement, as fast as 900 rows/sec with a 48 byte record, one row > per transaction. > Well, if you're not heavily multitasking, the advantage of SCSI is lost on you. Vivek Khera, Ph.D. +1-301-869-4449 x806
I'm doing some research on SATA vs SCSI right now, but to be honest I'm not turning up much at the protocol level. A lot of stupid benchmarks comparing 10K Raptor drives against top-of-the-line 15K drives, where unsurprisingly the SCSI drives win but of course cost 4 times as much. Although even in some of those, SATA wins, or draws. I'm trying to find something more apples to apples: 10K to 10K.

Alex Turner netEconomist

On Apr 4, 2005 3:23 PM, Vivek Khera <vivek@khera.org> wrote:
> On Apr 4, 2005, at 3:12 PM, Alex Turner wrote:
> > Our system is mostly read during the day, but we do a full system
> > update every night that is all writes, and it's very fast compared to
> > the smaller SCSI system we moved off of.
> > [...]
>
> Well, if you're not heavily multitasking, the advantage of SCSI is lost
> on you.
>
> Vivek Khera, Ph.D.
> +1-301-869-4449 x806
Thomas F.O'Connell wrote: > I'd use two of your drives to create a mirrored partition where pg_xlog > resides separate from the actual data. > > RAID 10 is probably appropriate for the remaining drives. > > Fortunately, you're not using Dell, so you don't have to worry about > the Perc3/Di RAID controller, which is not so compatible with > Linux... Hmm...I have to wonder how true this is these days. My company has a Dell 2500 with a Perc3/Di running Debian Linux, with the 2.6.10 kernel. The controller seems to work reasonably well, though I wouldn't doubt that it's slower than a different one might be. But so far we haven't had any reliability issues with it. Now, the performance is pretty bad considering the setup -- a RAID 5 with five 73.6 gig SCSI disks (10K RPM, I believe). Reads through the filesystem come through at about 65 megabytes/sec, writes about 35 megabytes/sec (at least, so says "bonnie -s 8192"). This is on a system with a single 3 GHz Xeon and 1 gigabyte of memory. I'd expect much better read performance from what is essentially a stripe of 4 fast SCSI disks. While compatibility hasn't really been an issue, at least as far as the basics go, I still agree with your general sentiment -- stay away from the Dells, at least if they have the Perc3/Di controller. You'll probably get much better performance out of something else. -- Kevin Brown kevin@sysexperts.com
Alex Turner wrote: > I'm no drive expert, but it seems to me that our write performance is > excellent. I think what most are concerned about is OLTP where you > are doing heavy write _and_ heavy read performance at the same time. > > Our system is mostly read during the day, but we do a full system > update everynight that is all writes, and it's very fast compared to > the smaller SCSI system we moved off of. Nearly a 6x spead > improvement, as fast as 900 rows/sec with a 48 byte record, one row > per transaction. I've started with SATA in a multi-read/multi-write environment. While it ran pretty good with 1 thread writing, the addition of a 2nd thread (whether reading or writing) would cause exponential slowdowns. I suffered through this for a week and then switched to SCSI. Single threaded performance was pretty similar but with the advanced command queueing SCSI has, I was able to do multiple reads/writes simultaneously with only a small performance hit for each thread. Perhaps having a SATA caching raid controller might help this situation. I don't know. It's pretty hard justifying buying a $$$ 3ware controller just to test it when you could spend the same money on SCSI and have a guarantee it'll work good under multi-IO scenarios.
On Tue, Apr 05, 2005 at 09:44:56PM -0700, Kevin Brown wrote: > Now, the performance is pretty bad considering the setup -- a RAID 5 > with five 73.6 gig SCSI disks (10K RPM, I believe). Reads through the > filesystem come through at about 65 megabytes/sec, writes about 35 > megabytes/sec (at least, so says "bonnie -s 8192"). This is on a > system with a single 3 GHz Xeon and 1 gigabyte of memory. I'd expect > much better read performance from what is essentially a stripe of 4 > fast SCSI disks. Data point here: We have a Linux software RAID quite close to the setup you describe, with an onboard Adaptec controller and four 146GB 10000rpm disks, and we get about 65MB/sec sustained when writing to an ext3 filesystem (actually, when wgetting a file off the gigabit LAN :-) ). I haven't tested reading, though. /* Steinar */ -- Homepage: http://www.sesse.net/
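For an apples-to-apples number across these boxes, the same bonnie++ run on each is a reasonable common yardstick; a minimal invocation, where the directory and user are assumptions and the file size should be at least twice RAM:

  # 8GB test set on a 4GB machine, run against the database partition
  bonnie++ -d /pgdata/bench -s 8192 -u postgres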
> and we get about 65MB/sec sustained when writing to an ext3 filesystem
> (actually, when wgetting a file off the gigabit LAN :-) ). I haven't
> tested reading, though.
>
> /* Steinar */

Well, unless you have 64-bit PCI, "standard" PCI does 133 MB/s, which is then split exactly in two: 66.5 MB/s for 1) reading from the PCI network card and 66.5 MB/s for 2) writing to the PCI hard disk controller. No wonder you get this figure - you're able to saturate your PCI bus, but it does not tell you a thing about the performance of your disk or network card... Note that the server which serves the file is limited in the same way, unless the file is in cache (RAM) or it's PCI-64. So...
On Wed, Apr 06, 2005 at 03:26:33PM +0200, PFC wrote: > Well, unless you have PCI 64 bits, the "standard" PCI does 133 MB/s > which is then split exactly in two times 66.5 MB/s for 1) reading from the > PCI network card and 2) writing to the PCI harddisk controller. No wonder > you get this figure, you're able to saturate your PCI bus, but it does not > tell you a thing on the performance of your disk or network card... Note > that the server which serves the file is limited in the same way unless > the file is in cache (RAM) or it's PCI64. So... This is PCI-X. /* Steinar */ -- Homepage: http://www.sesse.net/
It's hardly the same money - the drives are twice as much.

It's all about the controller, baby, with any kind of drive. A bad SCSI controller will give sucky performance too, believe me. We had a Compaq Smart Array 5304, and its performance was _very_ sub-par.

If someone has a simple benchmark test database to run, I would be happy to run it on our hardware here.

Alex Turner

On Apr 6, 2005 3:30 AM, William Yu <wyu@talisys.com> wrote:
> I've started with SATA in a multi-read/multi-write environment. While it
> ran pretty good with 1 thread writing, the addition of a 2nd thread
> (whether reading or writing) would cause exponential slowdowns.
>
> I suffered through this for a week and then switched to SCSI. Single
> threaded performance was pretty similar but with the advanced command
> queueing SCSI has, I was able to do multiple reads/writes simultaneously
> with only a small performance hit for each thread.
> [...]
It's the same money if you factor in the 3ware controller. Even without a caching controller, SCSI works well in multi-threaded IO (notwithstanding crappy shit from Dell or Compaq). You can get such cards from LSI for $75. And of course, many server MBs come with LSI controllers built-in. Our older 32-bit production servers all use Linux software RAID w/ SCSI and there are no issues when multiple users/processes hit the DB.

*Maybe* a 3ware controller w/ onboard cache + battery backup might do much better for multi-threaded IO than just plain-jane SATA. Unfortunately, I have not been able to find anything online that can confirm or deny this. Hence, the choice is: spend $$$ on the 3ware controller and hope it meets your needs -- or spend $$$ on SCSI drives and be sure.

Now, if you want to run such tests, we'd all be delighted to see the results so we have another option for building servers.

Alex Turner wrote:
> It's hardly the same money, the drives are twice as much.
>
> It's all about the controller, baby, with any kind of drive. A bad SCSI
> controller will give sucky performance too, believe me. We had a
> Compaq Smart Array 5304, and its performance was _very_ sub-par.
>
> If someone has a simple benchmark test database to run, I would be
> happy to run it on our hardware here.
>
> Alex Turner
> [...]
Well - unfortunately software RAID isn't appropriate for everyone, and some of us need a hardware RAID controller. The LSI MegaRAID 320-2 card is almost exactly the same price as the 3ware 9500S-12 card (although I will concede that a 320-2 card can handle at most 2x14 devices, compared with the 12 on the 9500S).

If someone can come up with a test, I will be happy to run it and see how it goes. I would be _very_ interested in the results, having just spent $7K on a new DB server!!

I have also seen really bad performance out of SATA. It was with either an on-board controller or a cheap RAID controller from HighPoint. As soon as I put in a decent controller, things went much better. I think it's unfair to base your opinion of SATA on a test that had a poor controller. I know I'm not the only one here running SATA RAID and being very satisfied with the results.

Thanks,

Alex Turner netEconomist

On Apr 6, 2005 4:01 PM, William Yu <wyu@talisys.com> wrote:
> It's the same money if you factor in the 3ware controller. Even without
> a caching controller, SCSI works well in multi-threaded IO (notwithstanding
> crappy shit from Dell or Compaq). You can get such cards from LSI for $75.
> And of course, many server MBs come with LSI controllers built-in. Our
> older 32-bit production servers all use Linux software RAID w/ SCSI and
> there are no issues when multiple users/processes hit the DB.
>
> *Maybe* a 3ware controller w/ onboard cache + battery backup might do
> much better for multi-threaded IO than just plain-jane SATA.
> Unfortunately, I have not been able to find anything online that can
> confirm or deny this. Hence, the choice is: spend $$$ on the 3ware
> controller and hope it meets your needs -- or spend $$$ on SCSI drives
> and be sure.
>
> Now, if you want to run such tests, we'd all be delighted to see the
> results so we have another option for building servers.
> [...]
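In the spirit of the test being offered above, a crude way to see how a controller holds up under concurrent readers, with nothing to install: time one sequential reader, then several at once over different large files. The paths are placeholders, and the files need to be larger than RAM and not already cached:

  # single-reader baseline
  time dd if=/pgdata/bigfile1 of=/dev/null bs=8k

  # four concurrent readers; a controller/drive combination that queues well
  # should not see aggregate throughput collapse
  time sh -c 'for f in bigfile1 bigfile2 bigfile3 bigfile4; do
                dd if=/pgdata/$f of=/dev/null bs=8k &
              done; wait'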
Sorry if I'm pointing out the obvious here, but it seems worth mentioning. AFAIK all 3ware controllers are set up so that each SATA drive gets its own SATA bus. My understanding is that, by and large, SATA still suffers from a general inability to have multiple outstanding commands on the bus at once, unlike SCSI. Therefore, to get good performance out of SATA you need to have a separate bus for each drive. Theoretically, it shouldn't really matter that it's SATA over ATA, other than I certainly wouldn't want to try and cram 8 ATA cables into a machine...

Incidentally, when we were investigating storage options at a previous job we talked to someone who deals with RS/6000 storage. He had a bunch of info about their serial controller protocol (which I can't think of the name of) vs SCSI. SCSI had a lot more overhead, so you could end up saturating even a 160MB SCSI bus with only 2 or 3 drives.

People are finally realizing how important bandwidth has become in modern machines. Memory bandwidth is why RS/6000 was (and maybe still is) cleaning Sun's clock, and it's why the Opteron blows Itaniums out of the water. Likewise it's why SCSI is so much better than IDE (unless you just give each drive its own dedicated bandwidth).

-- Jim C. Nasby, Database Consultant decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
I guess I'm setting myself up here, and I'm really not being ignorant, but can someone explain exactly how SCSI is supposed to be better than SATA?

Both systems use drives with platters. Each drive can physically only read one thing at a time.

SATA gives each drive its own channel, but you have to share in SCSI. A SATA controller typically can do 3Gb/sec (384MB/sec) per drive, but SCSI can only do 320MB/sec across the entire array.

What am I missing here?

Alex Turner netEconomist

On Apr 6, 2005 5:41 PM, Jim C. Nasby <decibel@decibel.org> wrote:
> Sorry if I'm pointing out the obvious here, but it seems worth
> mentioning. AFAIK all 3ware controllers are set up so that each SATA
> drive gets its own SATA bus. My understanding is that, by and large,
> SATA still suffers from a general inability to have multiple outstanding
> commands on the bus at once, unlike SCSI. Therefore, to get good
> performance out of SATA you need to have a separate bus for each drive.
> [...]
Ok - so I found this fairly good online review of various SATA cards out there, with 3ware not doing too hot on RAID 5, but ok on RAID 10. http://www.tweakers.net/reviews/557/ Very interesting stuff. Alex Turner netEconomist On Apr 6, 2005 7:32 PM, Alex Turner <armtuk@gmail.com> wrote: > I guess I'm setting myself up here, and I'm really not being ignorant, > but can someone explain exactly how is SCSI is supposed to better than > SATA? > > Both systems use drives with platters. Each drive can physically only > read one thing at a time. > > SATA gives each drive it's own channel, but you have to share in SCSI. > A SATA controller typicaly can do 3Gb/sec (384MB/sec) per drive, but > SCSI can only do 320MB/sec across the entire array. > > What am I missing here? > > Alex Turner > netEconomist > > On Apr 6, 2005 5:41 PM, Jim C. Nasby <decibel@decibel.org> wrote: > > Sorry if I'm pointing out the obvious here, but it seems worth > > mentioning. AFAIK all 3ware controllers are setup so that each SATA > > drive gets it's own SATA bus. My understanding is that by and large, > > SATA still suffers from a general inability to have multiple outstanding > > commands on the bus at once, unlike SCSI. Therefore, to get good > > performance out of SATA you need to have a seperate bus for each drive. > > Theoretically, it shouldn't really matter that it's SATA over ATA, other > > than I certainly wouldn't want to try and cram 8 ATA cables into a > > machine... > > > > Incidentally, when we were investigating storage options at a previous > > job we talked to someone who deals with RS/6000 storage. He had a bunch > > of info about their serial controller protocol (which I can't think of > > the name of) vs SCSI. SCSI had a lot more overhead, so you could end up > > saturating even a 160MB SCSI bus with only 2 or 3 drives. > > > > People are finally realizing how important bandwidth has become in > > modern machines. Memory bandwidth is why RS/6000 was (and maybe still > > is) cleaning Sun's clock, and it's why the Opteron blows Itaniums out of > > the water. Likewise it's why SCSI is so much better than IDE (unless you > > just give each drive it's own dedicated bandwidth). > > -- > > Jim C. Nasby, Database Consultant decibel@decibel.org > > Give your computer some brain candy! www.distributed.net Team #1828 > > > > Windows: "Where do you want to go today?" > > Linux: "Where do you want to go tomorrow?" > > FreeBSD: "Are you guys coming, or what?" > > >
Ok - I take it back - I'm reading through this now, and realising that the reviews are pretty clueless in several places... On Apr 6, 2005 8:12 PM, Alex Turner <armtuk@gmail.com> wrote: > Ok - so I found this fairly good online review of various SATA cards > out there, with 3ware not doing too hot on RAID 5, but ok on RAID 10. > > http://www.tweakers.net/reviews/557/ > > Very interesting stuff. > > Alex Turner > netEconomist > > On Apr 6, 2005 7:32 PM, Alex Turner <armtuk@gmail.com> wrote: > > I guess I'm setting myself up here, and I'm really not being ignorant, > > but can someone explain exactly how is SCSI is supposed to better than > > SATA? > > > > Both systems use drives with platters. Each drive can physically only > > read one thing at a time. > > > > SATA gives each drive it's own channel, but you have to share in SCSI. > > A SATA controller typicaly can do 3Gb/sec (384MB/sec) per drive, but > > SCSI can only do 320MB/sec across the entire array. > > > > What am I missing here? > > > > Alex Turner > > netEconomist > > > > On Apr 6, 2005 5:41 PM, Jim C. Nasby <decibel@decibel.org> wrote: > > > Sorry if I'm pointing out the obvious here, but it seems worth > > > mentioning. AFAIK all 3ware controllers are setup so that each SATA > > > drive gets it's own SATA bus. My understanding is that by and large, > > > SATA still suffers from a general inability to have multiple outstanding > > > commands on the bus at once, unlike SCSI. Therefore, to get good > > > performance out of SATA you need to have a seperate bus for each drive. > > > Theoretically, it shouldn't really matter that it's SATA over ATA, other > > > than I certainly wouldn't want to try and cram 8 ATA cables into a > > > machine... > > > > > > Incidentally, when we were investigating storage options at a previous > > > job we talked to someone who deals with RS/6000 storage. He had a bunch > > > of info about their serial controller protocol (which I can't think of > > > the name of) vs SCSI. SCSI had a lot more overhead, so you could end up > > > saturating even a 160MB SCSI bus with only 2 or 3 drives. > > > > > > People are finally realizing how important bandwidth has become in > > > modern machines. Memory bandwidth is why RS/6000 was (and maybe still > > > is) cleaning Sun's clock, and it's why the Opteron blows Itaniums out of > > > the water. Likewise it's why SCSI is so much better than IDE (unless you > > > just give each drive it's own dedicated bandwidth). > > > -- > > > Jim C. Nasby, Database Consultant decibel@decibel.org > > > Give your computer some brain candy! www.distributed.net Team #1828 > > > > > > Windows: "Where do you want to go today?" > > > Linux: "Where do you want to go tomorrow?" > > > FreeBSD: "Are you guys coming, or what?" > > > > > >
Alex Turner <armtuk@gmail.com> writes: > SATA gives each drive it's own channel, but you have to share in SCSI. > A SATA controller typicaly can do 3Gb/sec (384MB/sec) per drive, but > SCSI can only do 320MB/sec across the entire array. SCSI controllers often have separate channels for each device too. In any case the issue with the IDE protocol is that fundamentally you can only have a single command pending. SCSI can have many commands pending. This is especially important for a database like postgres that may be busy committing one transaction while another is trying to read. Having several commands queued on the drive gives it a chance to execute any that are "on the way" to the committing transaction. However I'm under the impression that 3ware has largely solved this problem. Also, if you save a few dollars and can afford one additional drive that additional drive may improve your array speed enough to overcome that inefficiency. -- greg
Yeah - the more reading I'm doing, the more I'm finding out. Allegedly the Western Digital Raptor drives implement a version of ATA-4 Tagged Queuing, which allows reordering of commands. Some controllers support this. The 3ware docs say that the controller supports both reordering on the controller and to the drive. *shrug*

This of course is all supposed to go away with SATA II, which has NCQ, Native Command Queueing. Of course the 3ware controllers don't support SATA II, but a few others do, and I'm sure 3ware will come out with a controller that does.

Alex Turner netEconomist

On 06 Apr 2005 23:00:54 -0400, Greg Stark <gsstark@mit.edu> wrote:
> SCSI controllers often have separate channels for each device too.
>
> In any case the issue with the IDE protocol is that fundamentally you can only
> have a single command pending. SCSI can have many commands pending. This is
> especially important for a database like postgres that may be busy committing
> one transaction while another is trying to read. Having several commands
> queued on the drive gives it a chance to execute any that are "on the way" to
> the committing transaction.
>
> However I'm under the impression that 3ware has largely solved this problem.
> [...]
Greg Stark <gsstark@mit.edu> writes: > In any case the issue with the IDE protocol is that fundamentally you > can only have a single command pending. SCSI can have many commands > pending. That's the bottom line: the SCSI protocol was designed (twenty years ago!) to allow the drive to do physical I/O scheduling, because the CPU can issue multiple commands before the drive has to report completion of the first one. IDE isn't designed to do that. I understand that the latest revisions to the IDE/ATA specs allow the drive to do this sort of thing, but support for it is far from widespread. regards, tom lane
Things might've changed somewhat over the past year, but this is from _the_ Linux guy at Dell...

-tfo

-- Thomas F. O'Connell Co-Founder, Information Architect Sitening, LLC Strategic Open Source — Open Your i™ http://www.sitening.com/ 110 30th Avenue North, Suite 6 Nashville, TN 37203-6320 615-260-0005

Date: Mon, 26 Apr 2004 14:15:02 -0500
From: Matt Domsch <Matt_Domsch@dell.com>
To: linux-poweredge@dell.com
Subject: PERC3/Di failure workaround hypothesis

On Mon, Apr 26, 2004 at 11:10:36AM -0500, Sellek, Greg wrote:
> Short of ordering a Perc4 for every 2650 that I want to upgrade to RH
> ES, is there anything else I can do to get around the Perc3/Di
> problem?

Our working hypothesis for a workaround is to do as follows. In afacli, set:

Read Cache: enabled
Write Cache: enabled when protected

Then unplug the ROMB battery. A reboot is not necessary. The firmware will immediately drop into Write-Through Cache mode, which in our testing has not exhibited the problem. Setting the write cache to disabled in afacli doesn't seem to help - you've got to unplug the battery with it in the above settings.

We are continuing to search for the root cause of the problem, and will update the list when we can.

Thanks, Matt

-- Matt Domsch Sr. Software Engineer, Lead Engineer Dell Linux Solutions linux.dell.com & www.dell.com/linux Linux on Dell mailing lists @ http://lists.us.dell.com

On Apr 5, 2005, at 11:44 PM, Kevin Brown wrote:
> Thomas F.O'Connell wrote:
>> Fortunately, you're not using Dell, so you don't have to worry about
>> the Perc3/Di RAID controller, which is not so compatible with
>> Linux...
>
> Hmm...I have to wonder how true this is these days.
>
> My company has a Dell 2500 with a Perc3/Di running Debian Linux, with
> the 2.6.10 kernel. The controller seems to work reasonably well,
> though I wouldn't doubt that it's slower than a different one might
> be. But so far we haven't had any reliability issues with it.
> [...]
You asked for it! ;-) If you want cheap, get SATA. If you want fast under *load* conditions, get SCSI. Everything else at this time is marketing hype, either intentional or learned. Ignoring dollars, expect to see SCSI beat SATA by 40%. * * * What I tell you three times is true * * * Also, compare the warranty you get with any SATA drive with any SCSI drive. Yes, you still have some change leftover to buy more SATA drives when they fail, but... it fundamentally comes down to some actual implementation and not what is printed on the cardboard box. Disk systems are bound by the rules of queueing theory. You can hit the sales rep over the head with your queueing theory book. Ultra320 SCSI is king of the hill for high concurrency databases. If you're only streaming or serving files, save some money and get a bunch of SATA drives. But if you're reading/writing all over the disk, the simple first-come-first-serve SATA heuristic will hose your performance under load conditions. Next year, they will *try* bring out some SATA cards that improve on first-come-first-serve, but they ain't here now. There are a lot of rigged performance tests out there... Maybe by the time they fix the queueing problems, serial Attached SCSI (a/k/a SAS) will be out. Looks like Ultra320 is the end of the line for parallel SCSI, as Ultra640 SCSI (a/k/a SPI-5) is dead in the water. Ultra320 SCSI. Ultra320 SCSI. Ultra320 SCSI. Serial Attached SCSI. Serial Attached SCSI. Serial Attached SCSI. For future trends, see: http://www.incits.org/archive/2003/in031163/in031163.htm douglas p.s. For extra credit, try comparing SATA and SCSI drives when they're 90% full. On Apr 6, 2005, at 8:32 PM, Alex Turner wrote: > I guess I'm setting myself up here, and I'm really not being ignorant, > but can someone explain exactly how is SCSI is supposed to better than > SATA? > > Both systems use drives with platters. Each drive can physically only > read one thing at a time. > > SATA gives each drive it's own channel, but you have to share in SCSI. > A SATA controller typicaly can do 3Gb/sec (384MB/sec) per drive, but > SCSI can only do 320MB/sec across the entire array. > > What am I missing here? > > Alex Turner > netEconomist
A good one-page discussion on the future of SCSI and SATA can be found in the latest CHIPS (The Department of the Navy Information Technology Magazine, formerly CHIPS AHOY) in an article by Patrick G. Koehler and Lt. Cmdr. Stan Bush. Click below if you don't mind being logged as visiting Space and Naval Warfare Systems Center Charleston: http://www.chips.navy.mil/archives/05_Jan/web_pages/scuzzy.htm
Another simple question: Why is SCSI more expensive? After the eleventy-millionth controller is made, it seems like SCSI and SATA are using a controller board and a spinning disk. Is somebody still making money by licensing SCSI technology? Rick pgsql-performance-owner@postgresql.org wrote on 04/06/2005 11:58:33 PM: > You asked for it! ;-) > > If you want cheap, get SATA. If you want fast under > *load* conditions, get SCSI. Everything else at this > time is marketing hype, either intentional or learned. > Ignoring dollars, expect to see SCSI beat SATA by 40%. > > * * * What I tell you three times is true * * * > > Also, compare the warranty you get with any SATA > drive with any SCSI drive. Yes, you still have some > change leftover to buy more SATA drives when they > fail, but... it fundamentally comes down to some > actual implementation and not what is printed on > the cardboard box. Disk systems are bound by the > rules of queueing theory. You can hit the sales rep > over the head with your queueing theory book. > > Ultra320 SCSI is king of the hill for high concurrency > databases. If you're only streaming or serving files, > save some money and get a bunch of SATA drives. > But if you're reading/writing all over the disk, the > simple first-come-first-serve SATA heuristic will > hose your performance under load conditions. > > Next year, they will *try* bring out some SATA cards > that improve on first-come-first-serve, but they ain't > here now. There are a lot of rigged performance tests > out there... Maybe by the time they fix the queueing > problems, serial Attached SCSI (a/k/a SAS) will be out. > Looks like Ultra320 is the end of the line for parallel > SCSI, as Ultra640 SCSI (a/k/a SPI-5) is dead in the > water. > > Ultra320 SCSI. > Ultra320 SCSI. > Ultra320 SCSI. > > Serial Attached SCSI. > Serial Attached SCSI. > Serial Attached SCSI. > > For future trends, see: > http://www.incits.org/archive/2003/in031163/in031163.htm > > douglas > > p.s. For extra credit, try comparing SATA and SCSI drives > when they're 90% full. > > On Apr 6, 2005, at 8:32 PM, Alex Turner wrote: > > > I guess I'm setting myself up here, and I'm really not being ignorant, > > but can someone explain exactly how is SCSI is supposed to better than > > SATA? > > > > Both systems use drives with platters. Each drive can physically only > > read one thing at a time. > > > > SATA gives each drive it's own channel, but you have to share in SCSI. > > A SATA controller typicaly can do 3Gb/sec (384MB/sec) per drive, but > > SCSI can only do 320MB/sec across the entire array. > > > > What am I missing here? > > > > Alex Turner > > netEconomist > > > ---------------------------(end of broadcast)--------------------------- > TIP 9: the planner will ignore your desire to choose an index scan if your > joining column's datatypes do not match
Based on the reading I'm doing, and somebody please correct me if I'm wrong, it seems that SCSI drives contain an on disk controller that has to process the tagged queue. SATA-I doesn't have this. This additional controller, is basicaly an on board computer that figures out the best order in which to process commands. I believe you are also paying for the increased tolerance that generates a better speed. If you compare an 80Gig 7200RPM IDE drive to a WD Raptor 76G 10k RPM to a Seagate 10k.6 drive to a Seagate Cheatah 15k drive, each one represents a step up in parts and technology, thereby generating a cost increase (at least thats what the manufactures tell us). I know if you ever held a 15k drive in your hand, you can notice a considerable weight difference between it and a 7200RPM IDE drive. Alex Turner netEconomist On Apr 7, 2005 11:37 AM, Richard_D_Levine@raytheon.com <Richard_D_Levine@raytheon.com> wrote: > Another simple question: Why is SCSI more expensive? After the > eleventy-millionth controller is made, it seems like SCSI and SATA are > using a controller board and a spinning disk. Is somebody still making > money by licensing SCSI technology? > > Rick > > pgsql-performance-owner@postgresql.org wrote on 04/06/2005 11:58:33 PM: > > > You asked for it! ;-) > > > > If you want cheap, get SATA. If you want fast under > > *load* conditions, get SCSI. Everything else at this > > time is marketing hype, either intentional or learned. > > Ignoring dollars, expect to see SCSI beat SATA by 40%. > > > > * * * What I tell you three times is true * * * > > > > Also, compare the warranty you get with any SATA > > drive with any SCSI drive. Yes, you still have some > > change leftover to buy more SATA drives when they > > fail, but... it fundamentally comes down to some > > actual implementation and not what is printed on > > the cardboard box. Disk systems are bound by the > > rules of queueing theory. You can hit the sales rep > > over the head with your queueing theory book. > > > > Ultra320 SCSI is king of the hill for high concurrency > > databases. If you're only streaming or serving files, > > save some money and get a bunch of SATA drives. > > But if you're reading/writing all over the disk, the > > simple first-come-first-serve SATA heuristic will > > hose your performance under load conditions. > > > > Next year, they will *try* bring out some SATA cards > > that improve on first-come-first-serve, but they ain't > > here now. There are a lot of rigged performance tests > > out there... Maybe by the time they fix the queueing > > problems, serial Attached SCSI (a/k/a SAS) will be out. > > Looks like Ultra320 is the end of the line for parallel > > SCSI, as Ultra640 SCSI (a/k/a SPI-5) is dead in the > > water. > > > > Ultra320 SCSI. > > Ultra320 SCSI. > > Ultra320 SCSI. > > > > Serial Attached SCSI. > > Serial Attached SCSI. > > Serial Attached SCSI. > > > > For future trends, see: > > http://www.incits.org/archive/2003/in031163/in031163.htm > > > > douglas > > > > p.s. For extra credit, try comparing SATA and SCSI drives > > when they're 90% full. > > > > On Apr 6, 2005, at 8:32 PM, Alex Turner wrote: > > > > > I guess I'm setting myself up here, and I'm really not being ignorant, > > > but can someone explain exactly how is SCSI is supposed to better than > > > SATA? > > > > > > Both systems use drives with platters. Each drive can physically only > > > read one thing at a time. > > > > > > SATA gives each drive it's own channel, but you have to share in SCSI. 
> > > A SATA controller typicaly can do 3Gb/sec (384MB/sec) per drive, but > > > SCSI can only do 320MB/sec across the entire array. > > > > > > What am I missing here? > > > > > > Alex Turner > > > netEconomist > > > > > > ---------------------------(end of broadcast)--------------------------- > > TIP 9: the planner will ignore your desire to choose an index scan if > your > > joining column's datatypes do not match > >
Yep, that's it, as well as increased quality control. I found this from Seagate: http://www.seagate.com/content/docs/pdf/whitepaper/D2c_More_than_Interface_ATA_vs_SCSI_042003.pdf With this quote (note that ES stands for Enterprise System and PS stands for Personal System): There is significantly more silicon on ES products. The following comparison comes from a study done in 2000: · the ES ASIC gate count is more than 2x a PS drive, · the embedded SRAM space for program code is 2x, · the permanent flash memory for program code is 2x, · data SRAM and cache SRAM space is more than 10x. The complexity of the SCSI/FC interface compared to the IDE/ATA interface shows up here due in part to the more complex system architectures in which ES drives find themselves. ES interfaces support multiple initiators or hosts. The drive must keep track of separate sets of information for each host to which it is attached, e.g., maintaining the processor pointer sets for multiple initiators and tagged commands. The capability of SCSI/FC to efficiently process commands and tasks in parallel has also resulted in a higher overhead “kernel” structure for the firmware. All of these complexities and an overall richer command set result in the need for a more expensive PCB to carry the electronics. Rick Alex Turner <armtuk@gmail.com> wrote on 04/07/2005 10:46:31 AM: > Based on the reading I'm doing, and somebody please correct me if I'm > wrong, it seems that SCSI drives contain an on disk controller that > has to process the tagged queue. SATA-I doesn't have this. This > additional controller, is basicaly an on board computer that figures > out the best order in which to process commands. I believe you are > also paying for the increased tolerance that generates a better speed. > If you compare an 80Gig 7200RPM IDE drive to a WD Raptor 76G 10k RPM > to a Seagate 10k.6 drive to a Seagate Cheatah 15k drive, each one > represents a step up in parts and technology, thereby generating a > cost increase (at least thats what the manufactures tell us). I know > if you ever held a 15k drive in your hand, you can notice a > considerable weight difference between it and a 7200RPM IDE drive. > > Alex Turner > netEconomist > > On Apr 7, 2005 11:37 AM, Richard_D_Levine@raytheon.com > <Richard_D_Levine@raytheon.com> wrote: > > Another simple question: Why is SCSI more expensive? After the > > eleventy-millionth controller is made, it seems like SCSI and SATA are > > using a controller board and a spinning disk. Is somebody still making > > money by licensing SCSI technology? > > > > Rick > > > > pgsql-performance-owner@postgresql.org wrote on 04/06/2005 11:58:33 PM: > > > > > You asked for it! ;-) > > > > > > If you want cheap, get SATA. If you want fast under > > > *load* conditions, get SCSI. Everything else at this > > > time is marketing hype, either intentional or learned. > > > Ignoring dollars, expect to see SCSI beat SATA by 40%. > > > > > > * * * What I tell you three times is true * * * > > > > > > Also, compare the warranty you get with any SATA > > > drive with any SCSI drive. Yes, you still have some > > > change leftover to buy more SATA drives when they > > > fail, but... it fundamentally comes down to some > > > actual implementation and not what is printed on > > > the cardboard box. Disk systems are bound by the > > > rules of queueing theory. You can hit the sales rep > > > over the head with your queueing theory book. > > > > > > Ultra320 SCSI is king of the hill for high concurrency > > > databases. 
If you're only streaming or serving files, > > > save some money and get a bunch of SATA drives. > > > But if you're reading/writing all over the disk, the > > > simple first-come-first-serve SATA heuristic will > > > hose your performance under load conditions. > > > > > > Next year, they will *try* bring out some SATA cards > > > that improve on first-come-first-serve, but they ain't > > > here now. There are a lot of rigged performance tests > > > out there... Maybe by the time they fix the queueing > > > problems, serial Attached SCSI (a/k/a SAS) will be out. > > > Looks like Ultra320 is the end of the line for parallel > > > SCSI, as Ultra640 SCSI (a/k/a SPI-5) is dead in the > > > water. > > > > > > Ultra320 SCSI. > > > Ultra320 SCSI. > > > Ultra320 SCSI. > > > > > > Serial Attached SCSI. > > > Serial Attached SCSI. > > > Serial Attached SCSI. > > > > > > For future trends, see: > > > http://www.incits.org/archive/2003/in031163/in031163.htm > > > > > > douglas > > > > > > p.s. For extra credit, try comparing SATA and SCSI drives > > > when they're 90% full. > > > > > > On Apr 6, 2005, at 8:32 PM, Alex Turner wrote: > > > > > > > I guess I'm setting myself up here, and I'm really not being ignorant, > > > > but can someone explain exactly how is SCSI is supposed to better than > > > > SATA? > > > > > > > > Both systems use drives with platters. Each drive can physically only > > > > read one thing at a time. > > > > > > > > SATA gives each drive it's own channel, but you have to share in SCSI. > > > > A SATA controller typicaly can do 3Gb/sec (384MB/sec) per drive, but > > > > SCSI can only do 320MB/sec across the entire array. > > > > > > > > What am I missing here? > > > > > > > > Alex Turner > > > > netEconomist > > > > > > > > > ---------------------------(end of broadcast)--------------------------- > > > TIP 9: the planner will ignore your desire to choose an index scan if > > your > > > joining column's datatypes do not match > > > >
Tom Lane wrote: > Greg Stark <gsstark@mit.edu> writes: > > In any case the issue with the IDE protocol is that fundamentally you > > can only have a single command pending. SCSI can have many commands > > pending. > > That's the bottom line: the SCSI protocol was designed (twenty years ago!) > to allow the drive to do physical I/O scheduling, because the CPU can > issue multiple commands before the drive has to report completion of the > first one. IDE isn't designed to do that. I understand that the latest > revisions to the IDE/ATA specs allow the drive to do this sort of thing, > but support for it is far from widespread. My question is: why does this (physical I/O scheduling) seem to matter so much? Before you flame me for asking a terribly idiotic question, let me provide some context. The operating system maintains a (sometimes large) buffer cache, with each buffer being mapped to a "physical" (which in the case of RAID is really a virtual) location on the disk. When the kernel needs to flush the cache (e.g., during a sync(), or when it needs to free up some pages), it doesn't write the pages in memory address order, it writes them in *device* address order. And it, too, maintains a queue of disk write requests. Now, unless some of the blocks on the disk are remapped behind the scenes such that an ordered list of blocks in the kernel translates to an out of order list on the target disk (which should be rare, since such remapping usually happens only when the target block is bad), how can the fact that the disk controller doesn't do tagged queuing *possibly* make any real difference unless the kernel's disk scheduling algorithm is suboptimal? In fact, if the kernel's scheduling algorithm is close to optimal, wouldn't the disk queuing mechanism *reduce* the overall efficiency of disk writes? After all, the kernel's queue is likely to be much larger than the disk controller's, and the kernel has knowledge of things like the filesystem layout that the disk controller and disks do not have. If the controller is only able to execute a subset of the write commands that the kernel has in its queue, at the very least the controller may end up leaving the head(s) in a suboptimal position relative to the next set of commands that it hasn't received yet, unless it simply writes the blocks in the order it receives it, right (admittedly, this is somewhat trivially dealt with by having the controller exclude the first and last blocks in the request from its internal sort). I can see how you might configure the RAID controller so that the kernel's scheduling algorithm will screw things up horribly. For instance, if the controller has several RAID volumes configured in such a way that the volumes share spindles, the kernel isn't likely to know about that (since each volume appears as its own device), so writes to multiple volumes can cause head movement where the kernel might be treating the volumes as completely independent. But that just means that you can't be dumb about how you configure your RAID setup. So what gives? Given the above, why is SCSI so much more efficient than plain, dumb SATA? And why wouldn't you be much better off with a set of dumb controllers in conjunction with (kernel-level) software RAID? -- Kevin Brown kevin@sysexperts.com
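To make the kernel-side ordering described above concrete, here is a minimal sketch with hypothetical page names and block numbers (the real kernel elevator is of course far more involved): dirty buffers get flushed in device-block order rather than memory order, so even a drive that cannot reorder commands sees a mostly one-way sweep of the head.

    # Hypothetical dirty-buffer map: memory page -> device block number.
    dirty_buffers = {
        "page_a": 481220,
        "page_b": 17,
        "page_c": 90154,
        "page_d": 481221,
    }

    def flush(buffers):
        # One elevator sweep: issue the writes in ascending device-block order,
        # regardless of where the pages happen to sit in memory.
        for page, block in sorted(buffers.items(), key=lambda kv: kv[1]):
            print(f"write block {block:>7} from {page}")

    flush(dirty_buffers)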
Kevin Brown <kevin@sysexperts.com> writes: > My question is: why does this (physical I/O scheduling) seem to matter > so much? > > Before you flame me for asking a terribly idiotic question, let me > provide some context. > > The operating system maintains a (sometimes large) buffer cache, with > each buffer being mapped to a "physical" (which in the case of RAID is > really a virtual) location on the disk. When the kernel needs to > flush the cache (e.g., during a sync(), or when it needs to free up > some pages), it doesn't write the pages in memory address order, it > writes them in *device* address order. And it, too, maintains a queue > of disk write requests. I think you're being misled by analyzing the write case. Consider the read case. When a user process requests a block and that read makes its way down to the driver level, the driver can't just put it aside and wait until it's convenient. It has to go ahead and issue the read right away. In the 10ms or so that it takes to seek to perform that read *nothing* gets done. If the driver receives more read or write requests it just has to sit on them and wait. 10ms is a lifetime for a computer. In that time dozens of other processes could have been scheduled and issued reads of their own. If any of those requests would have lied on the intervening tracks the drive missed a chance to execute them. Worse, it actually has to backtrack to get to them meaning another long seek. The same thing would happen if you had lots of processes issuing lots of small fsynced writes all over the place. Postgres doesn't really do that though. It sort of does with the WAL logs, but that shouldn't cause a lot of seeking. Perhaps it would mean that having your WAL share a spindle with other parts of the OS would have a bigger penalty on IDE drives than on SCSI drives though? -- greg
Greg Stark wrote: > I think you're being misled by analyzing the write case. > > Consider the read case. When a user process requests a block and > that read makes its way down to the driver level, the driver can't > just put it aside and wait until it's convenient. It has to go ahead > and issue the read right away. Well, strictly speaking it doesn't *have* to. It could delay for a couple of milliseconds to see if other requests come in, and then issue the read if none do. If there are already other requests being fulfilled, then it'll schedule the request in question just like the rest. > In the 10ms or so that it takes to seek to perform that read > *nothing* gets done. If the driver receives more read or write > requests it just has to sit on them and wait. 10ms is a lifetime for > a computer. In that time dozens of other processes could have been > scheduled and issued reads of their own. This is true, but now you're talking about a situation where the system goes from an essentially idle state to one of furious activity. In other words, it's a corner case that I strongly suspect isn't typical in situations where SCSI has historically made a big difference. Once the first request has been fulfilled, the driver can now schedule the rest of the queued-up requests in disk-layout order. I really don't see how this is any different between a system that has tagged queueing to the disks and one that doesn't. The only difference is where the queueing happens. In the case of SCSI, the queueing happens on the disks (or at least on the controller). In the case of SATA, the queueing happens in the kernel. I suppose the tagged queueing setup could begin the head movement and, if another request comes in that requests a block on a cylinder between where the head currently is and where it's going, go ahead and read the block in question. But is that *really* what happens in a tagged queueing system? It's the only major advantage I can see it having. > The same thing would happen if you had lots of processes issuing > lots of small fsynced writes all over the place. Postgres doesn't > really do that though. It sort of does with the WAL logs, but that > shouldn't cause a lot of seeking. Perhaps it would mean that having > your WAL share a spindle with other parts of the OS would have a > bigger penalty on IDE drives than on SCSI drives though? Perhaps. But I rather doubt that has to be a huge penalty, if any. When a process issues an fsync (or even a sync), the kernel doesn't *have* to drop everything it's doing and get to work on it immediately. It could easily gather a few more requests, bundle them up, and then issue them. If there's a lot of disk activity, it's probably smart to do just that. All fsync and sync require is that the caller block until the data hits the disk (from the point of view of the kernel). The specification doesn't require that the kernel act on the calls immediately or write only the blocks referred to by the call in question. -- Kevin Brown kevin@sysexperts.com
Kevin Brown <kevin@sysexperts.com> writes: > I really don't see how this is any different between a system that has > tagged queueing to the disks and one that doesn't. The only > difference is where the queueing happens. In the case of SCSI, the > queueing happens on the disks (or at least on the controller). In the > case of SATA, the queueing happens in the kernel. That's basically what it comes down to: SCSI lets the disk drive itself do the low-level I/O scheduling whereas the ATA spec prevents the drive from doing so (unless it cheats, ie, caches writes). Also, in SCSI it's possible for the drive to rearrange reads as well as writes --- which AFAICS is just not possible in ATA. (Maybe in the newest spec...) The reason this is so much more of a win than it was when ATA was designed is that in modern drives the kernel has very little clue about the physical geometry of the disk. Variable-size tracks, bad-block sparing, and stuff like that make for a very hard-to-predict mapping from linear sector addresses to actual disk locations. Combine that with the fact that the drive controller can be much smarter than it was twenty years ago, and you can see that the case for doing I/O scheduling in the kernel and not in the drive is pretty weak. regards, tom lane
while you weren't looking, Kevin Brown wrote: [reordering bursty reads] > In other words, it's a corner case that I strongly suspect > isn't typical in situations where SCSI has historically made a big > difference. [...] > But I rather doubt that has to be a huge penalty, if any. When a > process issues an fsync (or even a sync), the kernel doesn't *have* to > drop everything it's doing and get to work on it immediately. It > could easily gather a few more requests, bundle them up, and then > issue them. To make sure I'm following you here, are you or are you not suggesting that the kernel could sit on -all- IO requests for some small handful of ms before actually performing any IO to address what you "strongly suspect" is a "corner case"? /rls -- :wq
Imagine a system in "furious activity" with two (2) processes regularly occurring. Process One: Looooong read (or write). Takes 20ms to do seek, latency, and stream off. Runs over and over. Process Two: Single block read (or write). Typical database row access. Optimally, could be submillisecond. Happens more or less randomly. Let's say process one starts, and then process two. Assume, for the sake of this discussion, that P2's block lies w/in P1's swath. (But it doesn't have to...) Now, every time, process two has to wait at LEAST 20ms to complete. In a queue-reordering system, it could be a lot faster. And I, looking for disk service times on P2, keep wondering "why does a single disk-block read keep taking >20ms?" Soooo... it doesn't need to be "a read" or "a write". It doesn't need to be "furious activity" (two processes is not furious, even for a single-user desktop.) This is not a "corner case", and while it doesn't take into account kernel/drive-cache/UBC buffering issues, I think it shines a light on why command re-ordering might be useful. <shrug> YMMV. -----Original Message----- From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Kevin Brown Sent: Thursday, April 14, 2005 4:36 AM To: pgsql-performance@postgresql.org Subject: Re: [PERFORM] How to improve db performance with $7K? Greg Stark wrote: > I think you're being misled by analyzing the write case. > > Consider the read case. When a user process requests a block and that > read makes its way down to the driver level, the driver can't just put > it aside and wait until it's convenient. It has to go ahead and issue > the read right away. Well, strictly speaking it doesn't *have* to. It could delay for a couple of milliseconds to see if other requests come in, and then issue the read if none do. If there are already other requests being fulfilled, then it'll schedule the request in question just like the rest. > In the 10ms or so that it takes to seek to perform that read > *nothing* gets done. If the driver receives more read or write > requests it just has to sit on them and wait. 10ms is a lifetime for a > computer. In that time dozens of other processes could have been > scheduled and issued reads of their own. This is true, but now you're talking about a situation where the system goes from an essentially idle state to one of furious activity. In other words, it's a corner case that I strongly suspect isn't typical in situations where SCSI has historically made a big difference. Once the first request has been fulfilled, the driver can now schedule the rest of the queued-up requests in disk-layout order. I really don't see how this is any different between a system that has tagged queueing to the disks and one that doesn't. The only difference is where the queueing happens. In the case of SCSI, the queueing happens on the disks (or at least on the controller). In the case of SATA, the queueing happens in the kernel. I suppose the tagged queueing setup could begin the head movement and, if another request comes in that requests a block on a cylinder between where the head currently is and where it's going, go ahead and read the block in question. But is that *really* what happens in a tagged queueing system? It's the only major advantage I can see it having. > The same thing would happen if you had lots of processes issuing lots > of small fsynced writes all over the place. Postgres doesn't really do > that though. It sort of does with the WAL logs, but that shouldn't > cause a lot of seeking. Perhaps it would mean that having your WAL > share a spindle with other parts of the OS would have a bigger penalty > on IDE drives than on SCSI drives though? Perhaps. But I rather doubt that has to be a huge penalty, if any. When a process issues an fsync (or even a sync), the kernel doesn't *have* to drop everything it's doing and get to work on it immediately. It could easily gather a few more requests, bundle them up, and then issue them. If there's a lot of disk activity, it's probably smart to do just that. All fsync and sync require is that the caller block until the data hits the disk (from the point of view of the kernel). The specification doesn't require that the kernel act on the calls immediately or write only the blocks referred to by the call in question. -- Kevin Brown kevin@sysexperts.com ---------------------------(end of broadcast)--------------------------- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly
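Purely illustrative arithmetic for the two-process scenario above (20ms is the figure from the example; the 25% offset of P2's block within P1's swath is an arbitrary assumption):

    # All values hypothetical; 20 ms is the figure from the example above.
    p1_request_ms = 20.0       # P1's long read: seek + latency + streaming
    p2_offset_frac = 0.25      # assume P2's block sits 25% of the way into P1's swath

    fcfs_wait_ms = p1_request_ms                        # P2 queued behind the whole P1 read
    reordered_wait_ms = p1_request_ms * p2_offset_frac  # picked up in passing (ignores the
                                                        # tiny extra transfer time)
    print(f"P2 wait, first-come-first-served: {fcfs_wait_ms:.1f} ms")
    print(f"P2 wait, queue-reordering drive:  {reordered_wait_ms:.1f} ms")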
On 4/14/05, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > That's basically what it comes down to: SCSI lets the disk drive itself > do the low-level I/O scheduling whereas the ATA spec prevents the drive > from doing so (unless it cheats, ie, caches writes). Also, in SCSI it's > possible for the drive to rearrange reads as well as writes --- which > AFAICS is just not possible in ATA. (Maybe in the newest spec...) > > The reason this is so much more of a win than it was when ATA was > designed is that in modern drives the kernel has very little clue about > the physical geometry of the disk. Variable-size tracks, bad-block > sparing, and stuff like that make for a very hard-to-predict mapping > from linear sector addresses to actual disk locations. Combine that > with the fact that the drive controller can be much smarter than it was > twenty years ago, and you can see that the case for doing I/O scheduling > in the kernel and not in the drive is pretty weak. > > So if you all were going to choose between two hard drives where: drive A has capacity C and spins at 15K rpms, and drive B has capacity 2 x C and spins at 10K rpms and all other features are the same, the price is the same and C is enough disk space which would you choose? I've noticed that on IDE drives, as the capacity increases the data density increases and there is a pereceived (I've not measured it) performance increase. Would the increased data density of the higher capacity drive be of greater benefit than the faster spindle speed of drive A? -- Matthew Nuzum <matt@followers.net> www.followers.net - Makers of Elite Content Management System View samples of Elite CMS in action by visiting http://www.followers.net/portfolio/
"Matthew Nuzum" <matt.followers@gmail.com> writes: > drive A has capacity C and spins at 15K rpms, and > drive B has capacity 2 x C and spins at 10K rpms and > all other features are the same, the price is the same and C is enough > disk space which would you choose? In this case you always choose the 15k RPM drive, at least for Postgres. The 15kRPM reduces the latency which improves performance when fsyncing transaction commits. The real question is whether you choose the single 15kRPM drive or additional drives at 10kRPM... Additional spindles would give a much bigger bandwidth improvement but questionable latency improvement. > Would the increased data density of the higher capacity drive be of > greater benefit than the faster spindle speed of drive A? actually a 2xC capacity drive probably just has twice as many platters which means it would perform identically to the C capacity drive. If it has denser platters that might improve performance slightly. -- greg
Kevin Brown <kevin@sysexperts.com> writes: > Greg Stark wrote: > > > > I think you're being misled by analyzing the write case. > > > > Consider the read case. When a user process requests a block and > > that read makes its way down to the driver level, the driver can't > > just put it aside and wait until it's convenient. It has to go ahead > > and issue the read right away. > > Well, strictly speaking it doesn't *have* to. It could delay for a > couple of milliseconds to see if other requests come in, and then > issue the read if none do. If there are already other requests being > fulfilled, then it'll schedule the request in question just like the > rest. But then the cure is worse than the disease. You're basically describing exactly what does happen anyways, only you're delaying more requests than necessary. That intervening time isn't really idle, it's filled with all the requests that were delayed during the previous large seek... > Once the first request has been fulfilled, the driver can now schedule > the rest of the queued-up requests in disk-layout order. > > I really don't see how this is any different between a system that has > tagged queueing to the disks and one that doesn't. The only > difference is where the queueing happens. And *when* it happens. Instead of being able to issue requests while a large seek is happening and having some of them satisfied they have to wait until that seek is finished and get acted on during the next large seek. If my theory is correct then I would expect bandwidth to be essentially equivalent but the latency on SATA drives to be increased by about 50% of the average seek time. Ie, while a busy SCSI drive can satisfy most requests in about 10ms a busy SATA drive would satisfy most requests in 15ms. (add to that that 10k RPM and 15kRPM SCSI drives have even lower seek times and no such IDE/SATA drives exist...) In reality higher latency feeds into a system feedback loop causing your application to run slower causing bandwidth demands to be lower as well. It's often hard to distinguish root causes from symptoms when optimizing complex systems. -- greg
> -----Original Message----- > From: Greg Stark [mailto:gsstark@mit.edu] > Sent: Thursday, April 14, 2005 12:55 PM > To: matt@followers.net > Cc: pgsql-performance@postgresql.org > Subject: Re: [PERFORM] How to improve db performance with $7K? > > "Matthew Nuzum" <matt.followers@gmail.com> writes: > > > drive A has capacity C and spins at 15K rpms, and > > drive B has capacity 2 x C and spins at 10K rpms and > > all other features are the same, the price is the same and > > C is enough disk space which would you choose? > > In this case you always choose the 15k RPM drive, at least > for Postgres. The 15kRPM reduces the latency which improves > performance when fsyncing transaction commits. I think drive B is clearly the best choice. Matt said "all other features are the same", including price. I take that to mean that the seek time and throughput are also identical. However, I think it's fairly clear that there is no such pair of actual devices. If Matt really meant that they have the same cache size, interface, etc, then I would agree with you. The 15k drive is likely to have the better seek time. > The real question is whether you choose the single 15kRPM > drive or additional drives at 10kRPM... Additional spindles > would give a much bigger bandwidth improvement but questionable > latency improvement. Under the assumption that the seek times and throughput are realistic rather than contrived as in the stated example, I would say the 15k drive is the likely winner. It probably has the better seek time, and it seems that latency is more important than bandwidth for DB apps. > > Would the increased data density of the higher capacity drive > > be of greater benefit than the faster spindle speed of drive > > A? > > actually a 2xC capacity drive probably just has twice as many > platters which means it would perform identically to the C > capacity drive. If it has denser platters that might improve > performance slightly. Well, according to the paper referenced by Richard, twice as many platters means that it probably has slightly worse seek time (because of the increased mass of the actuator/rw-head). Yet another reason why the smaller drive might be preferable. Of course, the data density is certainly a factor, as you say. But since the drives are within a factor of 2, it seems likely that real drives would have comparable densities. __ David B. Held Software Engineer/Array Services Group 200 14th Ave. East, Sartell, MN 56377 320.534.3637 320.253.7800 800.752.8129
"Matthew Nuzum" <matt.followers@gmail.com> writes: > So if you all were going to choose between two hard drives where: > drive A has capacity C and spins at 15K rpms, and > drive B has capacity 2 x C and spins at 10K rpms and > all other features are the same, the price is the same and C is enough > disk space which would you choose? > I've noticed that on IDE drives, as the capacity increases the data > density increases and there is a pereceived (I've not measured it) > performance increase. > Would the increased data density of the higher capacity drive be of > greater benefit than the faster spindle speed of drive A? Depends how they got the 2x capacity increase. If they got it by increased bit density --- same number of tracks, but more sectors per track --- then drive B actually has a higher transfer rate, because in one rotation it can transfer twice as much data as drive A. More tracks per cylinder (ie, more platters) can also be a speed win since you can touch more data before you have to seek to another cylinder. Drive B will lose if the 2x capacity was all from adding cylinders (unless its seek-time spec is way better than A's ... which is unlikely but not impossible, considering the cylinders are probably closer together). Usually there's some-of-each involved, so it's hard to make any definite statement without more facts. regards, tom lane
I've been doing some reading up on this, trying to keep up here, and have found out that (experts, just yawn and cover your ears) 1) some SATA drives (just type II, I think?) have a "Phase Zero" implementation of Tagged Command Queueing (the special sauce for SCSI). 2) This SATA "TCQ" is called NCQ and I believe it basically allows the disk software itself to do the reordering (this is called "simple" in TCQ terminology) It does not yet allow the TCQ "head of queue" command, allowing the current tagged request to go to head of queue, which is a simple way of manifesting a "high priority" request. 3) SATA drives are not yet multi-initiator? Largely b/c of 2 and 3, multi-initiator SCSI RAID'ed drives are likely to whomp SATA II drives for a while yet (read: a year or two) in multiuser PostGres applications. -----Original Message----- From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Greg Stark Sent: Thursday, April 14, 2005 2:04 PM To: Kevin Brown Cc: pgsql-performance@postgresql.org Subject: Re: [PERFORM] How to improve db performance with $7K? Kevin Brown <kevin@sysexperts.com> writes: > Greg Stark wrote: > > > > I think you're being misled by analyzing the write case. > > > > Consider the read case. When a user process requests a block and > > that read makes its way down to the driver level, the driver can't > > just put it aside and wait until it's convenient. It has to go ahead > > and issue the read right away. > > Well, strictly speaking it doesn't *have* to. It could delay for a > couple of milliseconds to see if other requests come in, and then > issue the read if none do. If there are already other requests being > fulfilled, then it'll schedule the request in question just like the > rest. But then the cure is worse than the disease. You're basically describing exactly what does happen anyways, only you're delayingmore requests than necessary. That intervening time isn't really idle, it's filled with all the requests that weredelayed during the previous large seek... > Once the first request has been fulfilled, the driver can now schedule > the rest of the queued-up requests in disk-layout order. > > I really don't see how this is any different between a system that has > tagged queueing to the disks and one that doesn't. The only > difference is where the queueing happens. And *when* it happens. Instead of being able to issue requests while a large seek is happening and having some of them satisfiedthey have to wait until that seek is finished and get acted on during the next large seek. If my theory is correct then I would expect bandwidth to be essentially equivalent but the latency on SATA drives to be increasedby about 50% of the average seek time. Ie, while a busy SCSI drive can satisfy most requests in about 10ms a busySATA drive would satisfy most requests in 15ms. (add to that that 10k RPM and 15kRPM SCSI drives have even lower seektimes and no such IDE/SATA drives exist...) In reality higher latency feeds into a system feedback loop causing your application to run slower causing bandwidth demandsto be lower as well. It's often hard to distinguish root causes from symptoms when optimizing complex systems. -- greg ---------------------------(end of broadcast)--------------------------- TIP 7: don't forget to increase your free space map settings
> The real question is whether you choose the single 15kRPM drive or > additional > drives at 10kRPM... Additional spindles would give a much bigger And the bonus question: expensive fast drives in a RAID for everything, or, for the same price, many more slower drives (even SATA) so you can put the transaction log, tables, and indexes all on separate physical drives? Like putting one very frequently used table on its own disk? For the same amount of money, which one would be more interesting?
If SATA drives don't have the ability to replace SCSI for a multi-user Postgres apps, but you needed to save on cost (ALWAYS an issue), could/would you implement SATA for your logs (pg_xlog) and keep the rest on SCSI? Steve Poe Mohan, Ross wrote: >I've been doing some reading up on this, trying to keep up here, >and have found out that (experts, just yawn and cover your ears) > >1) some SATA drives (just type II, I think?) have a "Phase Zero" > implementation of Tagged Command Queueing (the special sauce > for SCSI). >2) This SATA "TCQ" is called NCQ and I believe it basically > allows the disk software itself to do the reordering > (this is called "simple" in TCQ terminology) It does not > yet allow the TCQ "head of queue" command, allowing the > current tagged request to go to head of queue, which is > a simple way of manifesting a "high priority" request. > >3) SATA drives are not yet multi-initiator? > >Largely b/c of 2 and 3, multi-initiator SCSI RAID'ed drives >are likely to whomp SATA II drives for a while yet (read: a >year or two) in multiuser PostGres applications. > > > >-----Original Message----- >From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Greg Stark >Sent: Thursday, April 14, 2005 2:04 PM >To: Kevin Brown >Cc: pgsql-performance@postgresql.org >Subject: Re: [PERFORM] How to improve db performance with $7K? > > >Kevin Brown <kevin@sysexperts.com> writes: > > > >>Greg Stark wrote: >> >> >> >> >>>I think you're being misled by analyzing the write case. >>> >>>Consider the read case. When a user process requests a block and >>>that read makes its way down to the driver level, the driver can't >>>just put it aside and wait until it's convenient. It has to go ahead >>>and issue the read right away. >>> >>> >>Well, strictly speaking it doesn't *have* to. It could delay for a >>couple of milliseconds to see if other requests come in, and then >>issue the read if none do. If there are already other requests being >>fulfilled, then it'll schedule the request in question just like the >>rest. >> >> > >But then the cure is worse than the disease. You're basically describing exactly what does happen anyways, only you're delayingmore requests than necessary. That intervening time isn't really idle, it's filled with all the requests that weredelayed during the previous large seek... > > > >>Once the first request has been fulfilled, the driver can now schedule >>the rest of the queued-up requests in disk-layout order. >> >>I really don't see how this is any different between a system that has >>tagged queueing to the disks and one that doesn't. The only >>difference is where the queueing happens. >> >> > >And *when* it happens. Instead of being able to issue requests while a large seek is happening and having some of them satisfiedthey have to wait until that seek is finished and get acted on during the next large seek. > >If my theory is correct then I would expect bandwidth to be essentially equivalent but the latency on SATA drives to beincreased by about 50% of the average seek time. Ie, while a busy SCSI drive can satisfy most requests in about 10ms abusy SATA drive would satisfy most requests in 15ms. (add to that that 10k RPM and 15kRPM SCSI drives have even lower seektimes and no such IDE/SATA drives exist...) > >In reality higher latency feeds into a system feedback loop causing your application to run slower causing bandwidth demandsto be lower as well. 
It's often hard to distinguish root causes from symptoms when optimizing complex systems. > > >
> -----Original Message----- > From: Mohan, Ross [mailto:RMohan@arbinet.com] > Sent: Thursday, April 14, 2005 1:30 PM > To: pgsql-performance@postgresql.org > Subject: Re: [PERFORM] How to improve db performance with $7K? > > Greg Stark wrote: > > > > Kevin Brown <kevin@sysexperts.com> writes: > > > > > Greg Stark wrote: > > > > > > > I think you're being misled by analyzing the write case. > > > > > > > > Consider the read case. When a user process requests a block > > > > and that read makes its way down to the driver level, the > > > > driver can't just put it aside and wait until it's convenient. > > > > It has to go ahead and issue the read right away. > > > > > > Well, strictly speaking it doesn't *have* to. It could delay > > > for a couple of milliseconds to see if other requests come in, > > > and then issue the read if none do. If there are already other > > > requests being fulfilled, then it'll schedule the request in > > > question just like the rest. > > > > But then the cure is worse than the disease. You're basically > > describing exactly what does happen anyways, only you're > > delaying more requests than necessary. That intervening time > > isn't really idle, it's filled with all the requests that > > were delayed during the previous large seek... > > [...] > > [...] > 1) some SATA drives (just type II, I think?) have a "Phase Zero" > implementation of Tagged Command Queueing (the special sauce > for SCSI). > [...] > Largely b/c of 2 and 3, multi-initiator SCSI RAID'ed drives > are likely to whomp SATA II drives for a while yet (read: a > year or two) in multiuser PostGres applications. I would say it depends on the OS. What Kevin is describing sounds just like the Anticipatory I/O Scheduler in Linux 2.6: http://www.linuxjournal.com/article/6931 For certain application contexts, it looks like a big win. Not entirely sure if Postgres is one of them, though. If SCSI beats SATA, it sounds like it will be mostly due to better seek times. __ David B. Held Software Engineer/Array Services Group 200 14th Ave. East, Sartell, MN 56377 320.534.3637 320.253.7800 800.752.8129
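On a 2.6 kernel you can usually see which elevator each block device is using by reading sysfs; the active scheduler is the bracketed entry (e.g. "noop [anticipatory] deadline cfq"). A small sketch, assuming the usual sysfs layout:

    import glob, os

    # The active elevator is the bracketed entry in each scheduler file.
    for path in sorted(glob.glob("/sys/block/*/queue/scheduler")):
        dev = path.split(os.sep)[3]        # /sys/block/<dev>/queue/scheduler
        with open(path) as f:
            print(f"{dev}: {f.read().strip()}")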
Steve Poe wrote: > If SATA drives don't have the ability to replace SCSI for a multi-user I don't think it is a matter of not having the ability. SATA all in all is fine as long as it is battery backed. It isn't as high performing as SCSI but who says it has to be? There are plenty of companies running databases on SATA without issue. Would I put it on a database that is expecting to have 500 connections at all times? No. Then again, if you have an application with that requirement, you have the money to buy a big fat SCSI array. Sincerely, Joshua D. Drake > Postgres apps, but you needed to save on cost (ALWAYS an issue), > could/would you implement SATA for your logs (pg_xlog) and keep the > rest on SCSI? > > Steve Poe > > Mohan, Ross wrote: > >> I've been doing some reading up on this, trying to keep up here, and >> have found out that (experts, just yawn and cover your ears) >> >> 1) some SATA drives (just type II, I think?) have a "Phase Zero" >> implementation of Tagged Command Queueing (the special sauce >> for SCSI). >> 2) This SATA "TCQ" is called NCQ and I believe it basically >> allows the disk software itself to do the reordering >> (this is called "simple" in TCQ terminology) It does not >> yet allow the TCQ "head of queue" command, allowing the >> current tagged request to go to head of queue, which is >> a simple way of manifesting a "high priority" request. >> >> 3) SATA drives are not yet multi-initiator? >> >> Largely b/c of 2 and 3, multi-initiator SCSI RAID'ed drives >> are likely to whomp SATA II drives for a while yet (read: a >> year or two) in multiuser PostGres applications. >> >> >> -----Original Message----- >> From: pgsql-performance-owner@postgresql.org >> [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Greg Stark >> Sent: Thursday, April 14, 2005 2:04 PM >> To: Kevin Brown >> Cc: pgsql-performance@postgresql.org >> Subject: Re: [PERFORM] How to improve db performance with $7K? >> >> >> Kevin Brown <kevin@sysexperts.com> writes: >> >> >> >>> Greg Stark wrote: >>> >>> >>> >>> >>>> I think you're being misled by analyzing the write case. >>>> >>>> Consider the read case. When a user process requests a block and >>>> that read makes its way down to the driver level, the driver can't >>>> just put it aside and wait until it's convenient. It has to go >>>> ahead and issue the read right away. >>>> >>> >>> Well, strictly speaking it doesn't *have* to. It could delay for a >>> couple of milliseconds to see if other requests come in, and then >>> issue the read if none do. If there are already other requests >>> being fulfilled, then it'll schedule the request in question just >>> like the rest. >>> >> >> >> But then the cure is worse than the disease. You're basically >> describing exactly what does happen anyways, only you're delaying >> more requests than necessary. That intervening time isn't really >> idle, it's filled with all the requests that were delayed during the >> previous large seek... >> >> >> >>> Once the first request has been fulfilled, the driver can now >>> schedule the rest of the queued-up requests in disk-layout order. >>> >>> I really don't see how this is any different between a system that >>> has tagged queueing to the disks and one that doesn't. The only >>> difference is where the queueing happens. >>> >> >> >> And *when* it happens. 
Instead of being able to issue requests while >> a large seek is happening and having some of them satisfied they have >> to wait until that seek is finished and get acted on during the next >> large seek. >> >> If my theory is correct then I would expect bandwidth to be >> essentially equivalent but the latency on SATA drives to be increased >> by about 50% of the average seek time. Ie, while a busy SCSI drive >> can satisfy most requests in about 10ms a busy SATA drive would >> satisfy most requests in 15ms. (add to that that 10k RPM and 15kRPM >> SCSI drives have even lower seek times and no such IDE/SATA drives >> exist...) >> >> In reality higher latency feeds into a system feedback loop causing >> your application to run slower causing bandwidth demands to be lower >> as well. It's often hard to distinguish root causes from symptoms >> when optimizing complex systems. >> >> >> > > > ---------------------------(end of broadcast)--------------------------- > TIP 9: the planner will ignore your desire to choose an index scan if > your > joining column's datatypes do not match
Tom Lane wrote: > Kevin Brown <kevin@sysexperts.com> writes: > > I really don't see how this is any different between a system that has > > tagged queueing to the disks and one that doesn't. The only > > difference is where the queueing happens. In the case of SCSI, the > > queueing happens on the disks (or at least on the controller). In the > > case of SATA, the queueing happens in the kernel. > > That's basically what it comes down to: SCSI lets the disk drive itself > do the low-level I/O scheduling whereas the ATA spec prevents the drive > from doing so (unless it cheats, ie, caches writes). Also, in SCSI it's > possible for the drive to rearrange reads as well as writes --- which > AFAICS is just not possible in ATA. (Maybe in the newest spec...) > > The reason this is so much more of a win than it was when ATA was > designed is that in modern drives the kernel has very little clue about > the physical geometry of the disk. Variable-size tracks, bad-block > sparing, and stuff like that make for a very hard-to-predict mapping > from linear sector addresses to actual disk locations. Yeah, but it's not clear to me, at least, that this is a first-order consideration. A second-order consideration, sure, I'll grant that. What I mean is that when it comes to scheduling disk activity, knowledge of the specific physical geometry of the disk isn't really important. What's important is whether or not the disk conforms to a certain set of expectations. Namely, that the general organization is such that addressing the blocks in block number order guarantees maximum throughput. Now, bad block remapping destroys that guarantee, but unless you've got a LOT of bad blocks, it shouldn't destroy your performance, right? > Combine that with the fact that the drive controller can be much > smarter than it was twenty years ago, and you can see that the case > for doing I/O scheduling in the kernel and not in the drive is > pretty weak. Well, I certainly grant that allowing the controller to do the I/O scheduling is faster than having the kernel do it, as long as it can handle insertion of new requests into the list while it's in the middle of executing a request. The most obvious case is when the head is in motion and the new request can be satisfied by reading from the media between where the head is at the time of the new request and where the head is being moved to. My argument is that a sufficiently smart kernel scheduler *should* yield performance results that are reasonably close to what you can get with that feature. Perhaps not quite as good, but reasonably close. It shouldn't be an orders-of-magnitude type difference. -- Kevin Brown kevin@sysexperts.com
3ware claim that their 'software' implemented command queueing performs at 95% effectiveness compared to the hardware queueing on a SCSI drive, so I would say that they agree with you. I'm still learning, but as I read it, the bits are split across the platters and there is only 'one' head, but happens to be reading from multiple platters. The 'further' in linear distance the data is from the current position, the longer it's going to take to get there. This seems to be true based on a document that was circulated. A hard drive takes considerable amount of time to 'find' a track on the platter compared to the rotational speed, which would agree with the fact that you can read 70MB/sec, but it takes up to 13ms to seek. the ATA protocol is just how the HBA communicates with the drive, there is no reason why the HBA can't reschedule reads and writes just the like SCSI drive would do natively, and this is what infact 3ware claims. I get the feeling based on my own historical experience that generaly drives don't just have a bunch of bad blocks. This all leads me to believe that you can predict with pretty good accuracy how expensive it is to retrieve a given block knowing it's linear increment. Alex Turner netEconomist On 4/14/05, Kevin Brown <kevin@sysexperts.com> wrote: > Tom Lane wrote: > > Kevin Brown <kevin@sysexperts.com> writes: > > > I really don't see how this is any different between a system that has > > > tagged queueing to the disks and one that doesn't. The only > > > difference is where the queueing happens. In the case of SCSI, the > > > queueing happens on the disks (or at least on the controller). In the > > > case of SATA, the queueing happens in the kernel. > > > > That's basically what it comes down to: SCSI lets the disk drive itself > > do the low-level I/O scheduling whereas the ATA spec prevents the drive > > from doing so (unless it cheats, ie, caches writes). Also, in SCSI it's > > possible for the drive to rearrange reads as well as writes --- which > > AFAICS is just not possible in ATA. (Maybe in the newest spec...) > > > > The reason this is so much more of a win than it was when ATA was > > designed is that in modern drives the kernel has very little clue about > > the physical geometry of the disk. Variable-size tracks, bad-block > > sparing, and stuff like that make for a very hard-to-predict mapping > > from linear sector addresses to actual disk locations. > > Yeah, but it's not clear to me, at least, that this is a first-order > consideration. A second-order consideration, sure, I'll grant that. > > What I mean is that when it comes to scheduling disk activity, > knowledge of the specific physical geometry of the disk isn't really > important. What's important is whether or not the disk conforms to a > certain set of expectations. Namely, that the general organization is > such that addressing the blocks in block number order guarantees > maximum throughput. > > Now, bad block remapping destroys that guarantee, but unless you've > got a LOT of bad blocks, it shouldn't destroy your performance, right? > > > Combine that with the fact that the drive controller can be much > > smarter than it was twenty years ago, and you can see that the case > > for doing I/O scheduling in the kernel and not in the drive is > > pretty weak. 
> > Well, I certainly grant that allowing the controller to do the I/O > scheduling is faster than having the kernel do it, as long as it can > handle insertion of new requests into the list while it's in the > middle of executing a request. The most obvious case is when the head > is in motion and the new request can be satisfied by reading from the > media between where the head is at the time of the new request and > where the head is being moved to. > > My argument is that a sufficiently smart kernel scheduler *should* > yield performance results that are reasonably close to what you can > get with that feature. Perhaps not quite as good, but reasonably > close. It shouldn't be an orders-of-magnitude type difference. > > -- > Kevin Brown kevin@sysexperts.com > > ---------------------------(end of broadcast)--------------------------- > TIP 8: explain analyze is your friend >
Kevin Brown <kevin@sysexperts.com> writes: > Tom Lane wrote: >> The reason this is so much more of a win than it was when ATA was >> designed is that in modern drives the kernel has very little clue about >> the physical geometry of the disk. Variable-size tracks, bad-block >> sparing, and stuff like that make for a very hard-to-predict mapping >> from linear sector addresses to actual disk locations. > What I mean is that when it comes to scheduling disk activity, > knowledge of the specific physical geometry of the disk isn't really > important. Oh? Yes, you can probably assume that blocks with far-apart numbers are going to require a big seek, and you might even be right in supposing that a block with an intermediate number should be read on the way. But you have no hope at all of making the right decisions at a more local level --- say, reading various sectors within the same cylinder in an optimal fashion. You don't know where the track boundaries are, so you can't schedule in a way that minimizes rotational latency. You're best off to throw all the requests at the drive together and let the drive sort it out. This is not to say that there's not a place for a kernel-side scheduler too. The drive will probably have a fairly limited number of slots in its command queue. The optimal thing is for those slots to be filled with requests that are in the same area of the disk. So you can still get some mileage out of an elevator algorithm that works on logical block numbers to give the drive requests for nearby block numbers at the same time. But there's also a lot of use in letting the drive do its own low-level scheduling. > My argument is that a sufficiently smart kernel scheduler *should* > yield performance results that are reasonably close to what you can > get with that feature. Perhaps not quite as good, but reasonably > close. It shouldn't be an orders-of-magnitude type difference. That might be the case with respect to decisions about long seeks, but not with respect to rotational latency. The kernel simply hasn't got the information. regards, tom lane
Tom Lane wrote: > Kevin Brown <kevin@sysexperts.com> writes: > > Tom Lane wrote: > >> The reason this is so much more of a win than it was when ATA was > >> designed is that in modern drives the kernel has very little clue about > >> the physical geometry of the disk. Variable-size tracks, bad-block > >> sparing, and stuff like that make for a very hard-to-predict mapping > >> from linear sector addresses to actual disk locations. > > > What I mean is that when it comes to scheduling disk activity, > > knowledge of the specific physical geometry of the disk isn't really > > important. > > Oh? > > Yes, you can probably assume that blocks with far-apart numbers are > going to require a big seek, and you might even be right in supposing > that a block with an intermediate number should be read on the way. > But you have no hope at all of making the right decisions at a more > local level --- say, reading various sectors within the same cylinder > in an optimal fashion. You don't know where the track boundaries are, > so you can't schedule in a way that minimizes rotational latency. This is true, but has to be examined in the context of the workload. If the workload is a sequential read, for instance, then the question becomes whether or not giving the controller a set of sequential blocks (in block ID order) will get you maximum read throughput. Given that the manufacturers all attempt to generate the biggest read throughput numbers, I think it's reasonable to assume that (a) the sectors are ordered within a cylinder such that reading block x + 1 immediately after block x will incur the smallest possible amount of delay if requested quickly enough, and (b) the same holds true when block x + 1 is on the next cylinder. In the case of pure random reads, you'll end up having to wait an average of half of a rotation before beginning the read. Where SCSI buys you something here is when you have sequential chunks of reads that are randomly distributed. The SCSI drive can determine which block in the set to start with first. But for that to really be a big win, the chunks themselves would have to span more than half a track at least, else you'd have a greater than half a track gap in the middle of your two sorted sector lists for that track (a really well-engineered SCSI disk could take advantage of the fact that there are multiple platters and fill the "gap" with reads from a different platter). Admittedly, this can be quite a big win. With an average rotational latency of 4 milliseconds on a 7200 RPM disk, being able to begin the read at the earliest possible moment will shave at most 25% off the total average random-access latency, if the average seek time is 12 milliseconds. > That might be the case with respect to decisions about long seeks, > but not with respect to rotational latency. The kernel simply hasn't > got the information. True, but that should reduce the total latency by something like 17% (on average). Not trivial, to be sure, but not an order of magnitude, either. -- Kevin Brown kevin@sysexperts.com
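A quick back-of-envelope check of the figures above, using the same assumed numbers (12 ms average seek, 7200 RPM); this is only the upper bound on what rotational reordering can recover:

# Back-of-envelope check of the numbers above (assumed figures, not measurements):
# average rotational latency is half a revolution.
def rotational_latency_ms(rpm):
    return (60_000 / rpm) / 2               # half a revolution, in ms

seek_ms = 12.0                              # assumed average seek time
rot_ms = rotational_latency_ms(7200)        # ~4.17 ms
total = seek_ms + rot_ms
print(f"avg rotational latency: {rot_ms:.2f} ms")
print(f"share of a random access: {rot_ms / total:.0%}")  # ~26%, roughly the 'at most 25%' bound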
Kevin Brown <kevin@sysexperts.com> writes: > In the case of pure random reads, you'll end up having to wait an > average of half of a rotation before beginning the read. You're assuming the conclusion. The above is true if the disk is handed one request at a time by a kernel that doesn't have any low-level timing information. If there are multiple random requests on the same track, the drive has an opportunity to do better than that --- if it's got all the requests in hand. regards, tom lane
> My argument is that a sufficiently smart kernel scheduler *should* > yield performance results that are reasonably close to what you can > get with that feature. Perhaps not quite as good, but reasonably > close. It shouldn't be an orders-of-magnitude type difference. And a controller card (or drive) has a lot less RAM to use as a cache / queue for reordering stuff than the OS has; potentially the OS can use most of the available RAM, which can be gigabytes on a big server, whereas in the drive there are at most a few tens of megabytes... However, all this is a bit like looking at the problem through the wrong end. The OS should provide a multi-read call for the applications to pass a list of blocks they'll need, then reorder them and read them the fastest possible way, clustering them with similar requests from other threads. Right now when a thread/process issues a read() it will block until the block is delivered to this thread. The OS does not know if this thread will then need the next block (which can be had very cheaply if you know ahead of time you'll need it) or not. Thus it must make guesses, read ahead (sometimes), etc...
> platter compared to the rotational speed, which would agree with the > fact that you can read 70MB/sec, but it takes up to 13ms to seek. Actually: - the head has to be moved; this time depends on the distance, for instance moving from a cylinder to the next is very fast (it needs to, to get good throughput) - then you have to wait for the disk to spin until the information you want comes in front of the head... statistically you have to wait a half rotation. And this does not depend on the distance between the cylinders, it depends on the position of the data in the cylinder. The more RPMs you have, the less you wait, which is why faster-RPM drives have faster seeks (they must also have faster actuators to move the head)...
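The same half-rotation arithmetic for the RPM classes mentioned in the thread (illustrative only; seek time is a separate, independent component):

# Half-rotation wait for the RPM classes discussed in the thread
# (illustrative only; real drives also differ in seek time).
for rpm in (7200, 10_000, 15_000):
    half_rev_ms = (60_000 / rpm) / 2
    print(f"{rpm:>6} RPM: ~{half_rev_ms:.2f} ms average rotational latency")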
PFC wrote: > > >> My argument is that a sufficiently smart kernel scheduler *should* >> yield performance results that are reasonably close to what you can >> get with that feature. Perhaps not quite as good, but reasonably >> close. It shouldn't be an orders-of-magnitude type difference. > > > And a controller card (or drive) has a lot less RAM to use as a > cache / queue for reordering stuff than the OS has, potentially the > OS can us most of the available RAM, which can be gigabytes on a big > server, whereas in the drive there are at most a few tens of > megabytes... > > However all this is a bit looking at the problem through the wrong > end. The OS should provide a multi-read call for the applications to > pass a list of blocks they'll need, then reorder them and read them > the fastest possible way, clustering them with similar requests from > other threads. > > Right now when a thread/process issues a read() it will block > until the block is delivered to this thread. The OS does not know if > this thread will then need the next block (which can be had very > cheaply if you know ahead of time you'll need it) or not. Thus it > must make guesses, read ahead (sometimes), etc... All true. Which is why high performance computing folks use aio_read()/aio_write() and load up the kernel with all the requests they expect to make. The kernels that I'm familiar with will do read ahead on files based on some heuristics: when you read the first byte of a file the OS will typically load up several pages of the file (depending on file size, etc). If you continue doing read() calls without a seek() on the file descriptor the kernel will get the hint that you're doing a sequential read and continue caching up the pages ahead of time, usually using the pages you just read to hold the new data so that one isn't bloating out memory with data that won't be needed again. Throw in a seek() and the amount of read ahead caching may be reduced. One point that is being missed in all this discussion is that the file system also imposes some constraints on how IO's can be done. For example, simply doing a write(fd, buf, 100000000) doesn't emit a stream of sequential blocks to the drives. Some file systems (UFS was one) would force portions of large files into other cylinder groups so that small files could be located near the inode data, thus avoiding/reducing the size of seeks. Similarly, extents need to be allocated and the bitmaps recording this data usually need synchronous updates, which will require some seeks, etc. Not to mention the need to update inode data, etc. Anyway, my point is that the allocation policies of the file system can confuse the situation. Also, the seek times one sees reported are an average. One really needs to look at the track-to-track seek time and also the "full stoke" seek times. It takes a *long* time to move the heads across the whole platter. I've seen people partition drives to only use small regions of the drives to avoid long seeks and to better use the increased number of bits going under the head in one rotation. A 15K drive doesn't need to have a faster seek time than a 10K drive because the rotational speed is higher. The average seek time might be faster just because the 15K drives are smaller with fewer number of cylinders. -- Alan
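A rough sketch of the "hand the kernel a batch of requests" idea above, using POSIX readahead hints rather than real aio (Linux-specific; the file path and block numbers are hypothetical, and this is only one way to approximate the effect):

# Hand the kernel a batch of "I will need these soon" hints instead of
# issuing blocking read()s one at a time.
import os

def prefetch(path, block_numbers, block=8192):
    fd = os.open(path, os.O_RDONLY)
    try:
        for n in block_numbers:
            # Ask the kernel to start readahead for each region now...
            os.posix_fadvise(fd, n * block, block, os.POSIX_FADV_WILLNEED)
        # ...then read them; with luck the data is already in the page cache.
        return [os.pread(fd, block, n * block) for n in block_numbers]
    finally:
        os.close(fd)

# blocks = prefetch("/tmp/some_datafile", [1024, 7, 4096, 33])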
On Apr 14, 2005, at 10:03 PM, Kevin Brown wrote: > Now, bad block remapping destroys that guarantee, but unless you've > got a LOT of bad blocks, it shouldn't destroy your performance, right? > ALL disks have bad blocks, even when you receive them. you honestly think that these large disks made today (18+ GB is the smallest now) that there are no defects on the surfaces? /me remembers trying to cram an old donated 5MB (yes M) disk into an old 8088 Zenith PC in college... Vivek Khera, Ph.D. +1-301-869-4449 x806
Vivek Khera wrote: > > On Apr 14, 2005, at 10:03 PM, Kevin Brown wrote: > >> Now, bad block remapping destroys that guarantee, but unless you've >> got a LOT of bad blocks, it shouldn't destroy your performance, right? >> > > ALL disks have bad blocks, even when you receive them. you honestly > think that these large disks made today (18+ GB is the smallest now) > that there are no defects on the surfaces? That is correct. It is just that the HD makers will mark the bad blocks so that the OS knows not to use them. You can also run the bad blocks command to try and find new bad blocks. Over time hard drives get bad blocks. It doesn't always mean you have to replace the drive but it does mean you need to maintain it and usually at least backup, low level (if scsi) and mark bad blocks. Then restore. Sincerely, Joshua D. Drake > > /me remembers trying to cram an old donated 5MB (yes M) disk into an old > 8088 Zenith PC in college... > > Vivek Khera, Ph.D. > +1-301-869-4449 x806 > -- Your PostgreSQL solutions provider, Command Prompt, Inc. 24x7 support - 1.800.492.2240, programming, and consulting Home of PostgreSQL Replicator, plPHP, plPerlNG and pgPHPToolkit http://www.commandprompt.com / http://www.postgresql.org
On Apr 15, 2005, at 11:58 AM, Joshua D. Drake wrote: >> ALL disks have bad blocks, even when you receive them. you honestly >> think that these large disks made today (18+ GB is the smallest now) >> that there are no defects on the surfaces? > > That is correct. It is just that the HD makers will mark the bad blocks > so that the OS knows not to use them. You can also run the bad blocks > command to try and find new bad blocks. > My point was that you cannot assume a linear correlation between block number and physical location, since the bad blocks will be mapped all over the place. Vivek Khera, Ph.D. +1-301-869-4449 x806
Tom Lane <tgl@sss.pgh.pa.us> writes: > Yes, you can probably assume that blocks with far-apart numbers are > going to require a big seek, and you might even be right in supposing > that a block with an intermediate number should be read on the way. > But you have no hope at all of making the right decisions at a more > local level --- say, reading various sectors within the same cylinder > in an optimal fashion. You don't know where the track boundaries are, > so you can't schedule in a way that minimizes rotational latency. > You're best off to throw all the requests at the drive together and > let the drive sort it out. Consider for example three reads, one at the beginning of the disk, one at the very end, and one in the middle. If the three are performed in the logical order (assuming the head starts at the beginning), then the drive has to seek, say, 4ms to get to the middle and 4ms to get to the end. But if the middle block requires a full rotation to reach it from when the head arrives, that adds another 8ms of rotational delay (assuming a 7200RPM drive). Whereas the drive could have seeked over to the last block, then seeked back in 8ms and gotten there just in time to perform the read for free. I'm not entirely convinced this explains all of the SCSI drives' superior performance, though. The above is about a worst-case scenario; it should really only have a small effect, and it's not like the drive firmware can really schedule things perfectly either. I think most of the difference is that the drive manufacturers just don't package their high-end drives with ATA interfaces. So there are no 10k RPM ATA drives and no 15k RPM ATA drives. I think WD is making fast SATA drives but most of the manufacturers aren't even doing that. -- greg
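Working the three-read example through with the stated (assumed) numbers, 4 ms per short seek and an 8 ms rotation at 7200 RPM:

# The three-read example above, with all figures assumed: 4 ms beginning->middle,
# 4 ms middle->end, 4 ms end->middle, ~8 ms beginning->end, 8 ms per rotation,
# and the middle sector has just passed under the head.
b_to_m, m_to_e, e_to_m, b_to_e = 4.0, 4.0, 4.0, 8.0
rotation = 8.0

# Logical order: seek to the middle, wait a full turn, then seek to the end.
logical = b_to_m + rotation + m_to_e                 # 16 ms

# Drive-chosen order: detour to the end first, then arrive back at the middle
# just as its sector rotates around -- the end read comes "for free".
reordered = max(b_to_e + e_to_m, b_to_m + rotation)  # 12 ms
print(logical, reordered)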
Greg, et al. I never found any evidence of a "stop and get an intermediate request" functionality in the TCQ protocol. IIRC, what is there is 1) Ordered 2) Head First 3) Simple implemented as choices. *VERY* roughly, that'd be like (1) the disk subsystem satisfies requests as submitted, (2) lets the "this" request be put at the very head of the per se disk queue after the currently-running disk request is complete, and (3) is "let the per se disk and its software reorder the requests on-hand as per its onboard software". (N.B. in the last, it's the DISK not the controller making those decisions). (N.B. too, that this last is essentially what NCQ (cf. TCQ) is doing.) I know we've been batting around a hypothetical case of SCSI where it "stops and gets smth. on the way", but I can find no proof (yet) that this is done, pro forma, by SCSI drives. In other words, SCSI is a necessary, but not sufficient cause for intermediate reading. FWIW - Ross -----Original Message----- From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Greg Stark Sent: Friday, April 15, 2005 2:02 PM To: Tom Lane Cc: Kevin Brown; pgsql-performance@postgresql.org Subject: Re: [PERFORM] How to improve db performance with $7K? Tom Lane <tgl@sss.pgh.pa.us> writes: > Yes, you can probably assume that blocks with far-apart numbers are > going to require a big seek, and you might even be right in supposing > that a block with an intermediate number should be read on the way. > But you have no hope at all of making the right decisions at a more > local level --- say, reading various sectors within the same cylinder > in an optimal fashion. You don't know where the track boundaries are, > so you can't schedule in a way that minimizes rotational latency. > You're best off to throw all the requests at the drive together and > let the drive sort it out. Consider for example three reads, one at the beginning of the disk, one at the very end, and one in the middle. If the three are performed in the logical order (assuming the head starts at the beginning), then the drive has to seek, say, 4ms to get to the middle and 4ms to get to the end. But if the middle block requires a full rotation to reach it from when the head arrives that adds another 8ms of rotational delay (assuming a 7200RPM drive). Whereas the drive could have seeked over to the last block, then seeked back in 8ms and gotten there just in time to perform the read for free. I'm not entirely convinced this explains all of the SCSI drives' superior performance though. The above is about a worst-case scenario; it should really only have a small effect, and it's not like the drive firmware can really schedule things perfectly either. I think most of the difference is that the drive manufacturers just don't package their high end drives with ATA interfaces. So there are no 10k RPM ATA drives and no 15k RPM ATA drives. I think WD is making fast SATA drives but most of the manufacturers aren't even doing that. -- greg ---------------------------(end of broadcast)--------------------------- TIP 8: explain analyze is your friend
Tom Lane wrote: > Kevin Brown <kevin@sysexperts.com> writes: > > In the case of pure random reads, you'll end up having to wait an > > average of half of a rotation before beginning the read. > > You're assuming the conclusion. The above is true if the disk is handed > one request at a time by a kernel that doesn't have any low-level timing > information. If there are multiple random requests on the same track, > the drive has an opportunity to do better than that --- if it's got all > the requests in hand. True, but see below. Actually, I suspect what matters is if they're on the same cylinder (which may be what you're talking about here). And in the above, I was assuming randomly distributed single-sector reads. In that situation, we can't generically know the probability that more than one will appear on the same cylinder without knowing something about the drive geometry. That said, most modern drives have tens of thousands of cylinders (the Seagate ST380011a, an 80 gigabyte drive, has 94,600 tracks per inch according to its datasheet), but much, much smaller queue lengths (tens of entries, hundreds at most, I'd expect; hard data on this would be appreciated). For purely random reads, the probability that two or more requests in the queue happen to be in the same cylinder is going to be quite small. -- Kevin Brown kevin@sysexperts.com
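A rough birthday-problem estimate of how often two queued random requests share a cylinder; the cylinder count and queue depths below are illustrative guesses, not datasheet values:

# Rough birthday-problem estimate of the chance that at least two of Q
# uniformly random requests land on the same cylinder.
def collision_probability(cylinders, queue_depth):
    p_all_distinct = 1.0
    for i in range(queue_depth):
        p_all_distinct *= (cylinders - i) / cylinders
    return 1.0 - p_all_distinct

for q in (8, 32, 256):
    print(q, round(collision_probability(50_000, q), 4))
# With ~50,000 cylinders, a 32-deep queue collides only about 1% of the time;
# only at hundreds of outstanding requests does it become common.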
Vivek Khera wrote: > > On Apr 14, 2005, at 10:03 PM, Kevin Brown wrote: > > >Now, bad block remapping destroys that guarantee, but unless you've > >got a LOT of bad blocks, it shouldn't destroy your performance, right? > > > > ALL disks have bad blocks, even when you receive them. you honestly > think that these large disks made today (18+ GB is the smallest now) > that there are no defects on the surfaces? Oh, I'm not at all arguing that you won't have bad blocks. My argument is that the probability of any given block read or write operation actually dealing with a remapped block is going to be relatively small, unless the fraction of bad blocks to total blocks is large (in which case you basically have a bad disk). And so the ability to account for remapped blocks shouldn't itself represent a huge improvement in overall throughput. -- Kevin Brown kevin@sysexperts.com
Rosser Schwarz wrote: > while you weren't looking, Kevin Brown wrote: > > [reordering bursty reads] > > > In other words, it's a corner case that I strongly suspect > > isn't typical in situations where SCSI has historically made a big > > difference. > > [...] > > > But I rather doubt that has to be a huge penalty, if any. When a > > process issues an fsync (or even a sync), the kernel doesn't *have* to > > drop everything it's doing and get to work on it immediately. It > > could easily gather a few more requests, bundle them up, and then > > issue them. > > To make sure I'm following you here, are you or are you not suggesting > that the kernel could sit on -all- IO requests for some small handful > of ms before actually performing any IO to address what you "strongly > suspect" is a "corner case"? The kernel *can* do so. Whether or not it's a good idea depends on the activity in the system. You'd only consider doing this if you didn't already have a relatively large backlog of I/O requests to handle. You wouldn't do this for every I/O request. Consider this: I/O operations to a block device are so slow compared with the speed of other (non I/O) operations on the system that the system can easily wait for, say, a hundredth of the typical latency on the target device before issuing requests to it and not have any real negative impact on the system's I/O throughput. A process running on my test system, a 3 GHz Xeon, can issue a million read system calls per second (I've measured it. I can post the rather trivial source code if you're interested). That's the full round trip of issuing the system call and having the kernel return back. That means that in the span of a millisecond, the system could receive 1000 requests if the system were busy enough. If the average latency for a random read from the disk (including head movement and everything) is 10 milliseconds, and we decide to delay the issuance of the first I/O request for a tenth of a millisecond (a hundredth of the latency), then the system might receive 100 additional I/O requests, which it could then put into the queue and sort by block address before issuing the read request. As long as the system knows what the last block that was requested from that physical device was, it can order the requests properly and then begin issuing them. Since the latency on the target device is so high, this is likely to be a rather big win for overall throughput. -- Kevin Brown kevin@sysexperts.com
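The batching arithmetic above, restated as code with the same assumed figures (one million read() calls per second, 10 ms device latency, a delay of 1/100th of that):

# Restating the batching arithmetic above; all figures are the ones assumed
# in the message, not measurements of any particular system.
syscalls_per_sec = 1_000_000
disk_latency_ms = 10.0
delay_ms = disk_latency_ms / 100          # wait 1/100th of the device latency

requests_gathered = syscalls_per_sec * (delay_ms / 1000)
print(f"delay {delay_ms} ms -> up to {requests_gathered:.0f} extra requests to sort")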
Problem with this strategy. You want battery-backed write caching for best performance & safety. (I've tried IDE for WAL before w/ write caching off -- the DB got crippled whenever I had to copy files from/to the drive on the WAL partition -- ended up just moving WAL back on the same SCSI drive as the main DB.) That means in addition to a $$$ SCSI caching controller, you also need a $$$ SATA caching controller. From my glance at prices, advanced SATA controllers seem to cost nearly as their SCSI counterparts. This also looks to be the case for the drives themselves. Sure you can get super cheap 7200RPM SATA drives but they absolutely suck for database work. Believe me, I gave it a try once -- ugh. The highend WD 10K Raptors look pretty good though -- the benchmarks @ storagereview seem to put these drives at about 90% of SCSI 10Ks for both single-user and multi-user. However, they're also priced like SCSIs -- here's what I found @ Mwave (going through pricewatch to find WD740GDs): Seagate 7200 SATA -- 80GB $59 WD 10K SATA -- 72GB $182 Seagate 10K U320 -- 72GB $289 Using the above prices for a fixed budget for RAID-10, you could get: SATA 7200 -- 680MB per $1000 SATA 10K -- 200MB per $1000 SCSI 10K -- 125MB per $1000 For a 99% read-only DB that required lots of disk space (say something like Wikipedia or blog host), using consumer level SATA probably is ok. For anything else, I'd consider SATA 10K if (1) I do not need 15K RPM and (2) I don't have SCSI intrastructure already. Steve Poe wrote: > If SATA drives don't have the ability to replace SCSI for a multi-user > Postgres apps, but you needed to save on cost (ALWAYS an issue), > could/would you implement SATA for your logs (pg_xlog) and keep the rest > on SCSI? > > Steve Poe > > Mohan, Ross wrote: > >> I've been doing some reading up on this, trying to keep up here, and >> have found out that (experts, just yawn and cover your ears) >> >> 1) some SATA drives (just type II, I think?) have a "Phase Zero" >> implementation of Tagged Command Queueing (the special sauce >> for SCSI). >> 2) This SATA "TCQ" is called NCQ and I believe it basically >> allows the disk software itself to do the reordering >> (this is called "simple" in TCQ terminology) It does not >> yet allow the TCQ "head of queue" command, allowing the >> current tagged request to go to head of queue, which is >> a simple way of manifesting a "high priority" request. >> >> 3) SATA drives are not yet multi-initiator? >> >> Largely b/c of 2 and 3, multi-initiator SCSI RAID'ed drives >> are likely to whomp SATA II drives for a while yet (read: a >> year or two) in multiuser PostGres applications. >> >> >> -----Original Message----- >> From: pgsql-performance-owner@postgresql.org >> [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Greg Stark >> Sent: Thursday, April 14, 2005 2:04 PM >> To: Kevin Brown >> Cc: pgsql-performance@postgresql.org >> Subject: Re: [PERFORM] How to improve db performance with $7K? >> >> >> Kevin Brown <kevin@sysexperts.com> writes: >> >> >> >>> Greg Stark wrote: >>> >>> >>> >>> >>>> I think you're being misled by analyzing the write case. >>>> >>>> Consider the read case. When a user process requests a block and >>>> that read makes its way down to the driver level, the driver can't >>>> just put it aside and wait until it's convenient. It has to go ahead >>>> and issue the read right away. >>>> >>> >>> Well, strictly speaking it doesn't *have* to. 
It could delay for a >>> couple of milliseconds to see if other requests come in, and then >>> issue the read if none do. If there are already other requests being >>> fulfilled, then it'll schedule the request in question just like the >>> rest. >>> >> >> >> But then the cure is worse than the disease. You're basically >> describing exactly what does happen anyways, only you're delaying more >> requests than necessary. That intervening time isn't really idle, it's >> filled with all the requests that were delayed during the previous >> large seek... >> >> >> >>> Once the first request has been fulfilled, the driver can now >>> schedule the rest of the queued-up requests in disk-layout order. >>> >>> I really don't see how this is any different between a system that >>> has tagged queueing to the disks and one that doesn't. The only >>> difference is where the queueing happens. >>> >> >> >> And *when* it happens. Instead of being able to issue requests while a >> large seek is happening and having some of them satisfied they have to >> wait until that seek is finished and get acted on during the next >> large seek. >> >> If my theory is correct then I would expect bandwidth to be >> essentially equivalent but the latency on SATA drives to be increased >> by about 50% of the average seek time. Ie, while a busy SCSI drive can >> satisfy most requests in about 10ms a busy SATA drive would satisfy >> most requests in 15ms. (add to that that 10k RPM and 15kRPM SCSI >> drives have even lower seek times and no such IDE/SATA drives exist...) >> >> In reality higher latency feeds into a system feedback loop causing >> your application to run slower causing bandwidth demands to be lower >> as well. It's often hard to distinguish root causes from symptoms when >> optimizing complex systems. >> >> >> > > > ---------------------------(end of broadcast)--------------------------- > TIP 9: the planner will ignore your desire to choose an index scan if your > joining column's datatypes do not match >
William Yu <wyu@talisys.com> writes: > Using the above prices for a fixed budget for RAID-10, you could get: > > SATA 7200 -- 680MB per $1000 > SATA 10K -- 200MB per $1000 > SCSI 10K -- 125MB per $1000 What a lot of these analyses miss is that cheaper == faster, because cheaper means you can buy more spindles for the same price. I'm assuming you picked equal-sized drives to compare, so that 200MB/$1000 for SATA is almost twice as many spindles as the 125MB/$1000. That means it would have almost double the bandwidth. And the 7200 RPM case would have more than 5x the bandwidth. While 10k RPM drives have lower seek times, and SCSI drives have a natural seek time advantage, under load a RAID array with fewer spindles will start hitting contention sooner, which results in higher latency. If the controller works well, the larger SATA arrays above should be able to maintain their mediocre latency much better under load than the SCSI array with fewer drives would maintain its low latency response time, despite its drives' lower average seek time. -- greg
This is fundamentaly untrue. A mirror is still a mirror. At most in a RAID 10 you can have two simultaneous seeks. You are always going to be limited by the seek time of your drives. It's a stripe, so you have to read from all members of the stripe to get data, requiring all drives to seek. There is no advantage to seek time in adding more drives. By adding more drives you can increase throughput, but the max throughput of the PCI-X bus isn't that high (I think around 400MB/sec) You can easily get this with a six or seven drive RAID 5, or a ten drive RAID 10. At that point you start having to factor in the cost of a bigger chassis to hold more drives, which can be big bucks. Alex Turner netEconomist On 18 Apr 2005 10:59:05 -0400, Greg Stark <gsstark@mit.edu> wrote: > > William Yu <wyu@talisys.com> writes: > > > Using the above prices for a fixed budget for RAID-10, you could get: > > > > SATA 7200 -- 680MB per $1000 > > SATA 10K -- 200MB per $1000 > > SCSI 10K -- 125MB per $1000 > > What a lot of these analyses miss is that cheaper == faster because cheaper > means you can buy more spindles for the same price. I'm assuming you picked > equal sized drives to compare so that 200MB/$1000 for SATA is almost twice as > many spindles as the 125MB/$1000. That means it would have almost double the > bandwidth. And the 7200 RPM case would have more than 5x the bandwidth. > > While 10k RPM drives have lower seek times, and SCSI drives have a natural > seek time advantage, under load a RAID array with fewer spindles will start > hitting contention sooner which results into higher latency. If the controller > works well the larger SATA arrays above should be able to maintain their > mediocre latency much better under load than the SCSI array with fewer drives > would maintain its low latency response time despite its drives' lower average > seek time. > > -- > greg > > > ---------------------------(end of broadcast)--------------------------- > TIP 9: the planner will ignore your desire to choose an index scan if your > joining column's datatypes do not match >
> -----Original Message----- > From: Greg Stark [mailto:gsstark@mit.edu] > Sent: Monday, April 18, 2005 9:59 AM > To: William Yu > Cc: pgsql-performance@postgresql.org > Subject: Re: [PERFORM] How to improve db performance with $7K? > > William Yu <wyu@talisys.com> writes: > > > Using the above prices for a fixed budget for RAID-10, you > > could get: > > > > SATA 7200 -- 680MB per $1000 > > SATA 10K -- 200MB per $1000 > > SCSI 10K -- 125MB per $1000 > > What a lot of these analyses miss is that cheaper == faster > because cheaper means you can buy more spindles for the same > price. I'm assuming you picked equal sized drives to compare > so that 200MB/$1000 for SATA is almost twice as many spindles > as the 125MB/$1000. That means it would have almost double > the bandwidth. And the 7200 RPM case would have more than 5x > the bandwidth. > [...] Hmm...so you're saying that at some point, quantity beats quality? That's an interesting point. However, it presumes that you can actually distribute your data over a larger number of drives. If you have a db with a bottleneck of one or two very large tables, the extra spindles won't help unless you break up the tables and glue them together with query magic. But it's still a point to consider. __ David B. Held Software Engineer/Array Services Group 200 14th Ave. East, Sartell, MN 56377 320.534.3637 320.253.7800 800.752.8129
Alex Turner <armtuk@gmail.com> writes: > This is fundamentaly untrue. > > A mirror is still a mirror. At most in a RAID 10 you can have two > simultaneous seeks. You are always going to be limited by the seek > time of your drives. It's a stripe, so you have to read from all > members of the stripe to get data, requiring all drives to seek. > There is no advantage to seek time in adding more drives. Adding drives will not let you get lower response times than the average seek time on your drives*. But it will let you reach that response time more often. The actual response time for a random access to a drive is the seek time plus the time waiting for your request to actually be handled. Under heavy load that could be many milliseconds. The more drives you have the fewer requests each drive has to handle. Look at the await and svctime columns of iostat -x. Under heavy random access load those columns can show up performance problems more accurately than the bandwidth columns. You could be doing less bandwidth but be having latency issues. While reorganizing data to allow for more sequential reads is the normal way to address that, simply adding more spindles can be surprisingly effective. > By adding more drives you can increase throughput, but the max throughput of > the PCI-X bus isn't that high (I think around 400MB/sec) You can easily get > this with a six or seven drive RAID 5, or a ten drive RAID 10. At that point > you start having to factor in the cost of a bigger chassis to hold more > drives, which can be big bucks. You could use software raid to spread the drives over multiple PCI-X cards. But if 400MB/s isn't enough bandwidth then you're probably in the realm of "enterprise-class" hardware anyways. * (Actually even that's possible: you could limit yourself to a portion of the drive surface to reduce seek time) -- greg
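A toy queueing model of the await-versus-svctm point: split an assumed request rate evenly over the spindles and apply the M/M/1 response-time formula. The drive counts, service times, and load below are invented for illustration; real arrays behave differently, but the shape of the trade-off is the point:

# Toy open-queue model: requests split evenly across spindles, each spindle
# treated as an M/M/1 server. Purely illustrative.
def avg_response_ms(total_iops, n_disks, service_ms):
    per_disk_rate = total_iops / n_disks              # requests/s per spindle
    utilization = per_disk_rate * service_ms / 1000.0
    if utilization >= 1.0:
        return float("inf")                           # spindle saturated
    return service_ms / (1.0 - utilization)           # service time plus queueing wait

for n_disks, service_ms in ((6, 8.0), (14, 13.0)):    # few fast drives vs. many slower ones
    print(n_disks, "spindles:", round(avg_response_ms(600, n_disks, service_ms), 1), "ms")
# At 600 random IO/s the six fast drives answer in ~40 ms on average,
# the fourteen slower ones in ~29 ms, despite the worse per-drive service time.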
[snip] > > Adding drives will not let you get lower response times than the average seek > time on your drives*. But it will let you reach that response time more often. > [snip] I believe your assertion is fundamentally flawed. Adding more drives will not let you reach that response time more often. All drives are required to fill every request in all RAID levels (except possibly 0+1, but that isn't used for enterprise applications). Most requests in OLTP require most of the request time to seek, not to read. Only in single large block data transfers will you get any benefit from adding more drives, which is atypical in most database applications. For most database applications, the only way to increase transactions/sec is to decrease request service time, which is generally achieved with better seek times or a better controller card, or possibly spreading your database across multiple tablespaces on separate partitions. My assertion therefore is that simply adding more drives to an already competent* configuration is about as likely to increase your database effectiveness as Swiss cheese is to make your car run faster. Alex Turner netEconomist *Assertion here is that the DBA didn't simply configure all tables and xlog on a single 7200 RPM disk, but has separate physical drives for xlog and tablespace, at least on 10k drives.
Hi, At 18:56 18/04/2005, Alex Turner wrote: >All drives are required to fill every request in all RAID levels No, this is definitely wrong. In many cases, most drives don't actually have the data requested, how could they handle the request? When reading one random sector, only *one* drive out of N is ever used to service any given request, be it RAID 0, 1, 0+1, 1+0 or 5. When writing: - in RAID 0, 1 drive - in RAID 1, RAID 0+1 or 1+0, 2 drives - in RAID 5, you need to read on all drives and write on 2. Otherwise, what would be the point of RAID 0, 0+1 or 1+0? Jacques.
Alex Turner wrote: >[snip] > > >>Adding drives will not let you get lower response times than the average seek >>time on your drives*. But it will let you reach that response time more often. >> >> >> >[snip] > >I believe your assertion is fundamentaly flawed. Adding more drives >will not let you reach that response time more often. All drives are >required to fill every request in all RAID levels (except possibly >0+1, but that isn't used for enterprise applicaitons). Most requests >in OLTP require most of the request time to seek, not to read. Only >in single large block data transfers will you get any benefit from >adding more drives, which is atypical in most database applications. >For most database applications, the only way to increase >transactions/sec is to decrease request service time, which is >generaly achieved with better seek times or a better controller card, >or possibly spreading your database accross multiple tablespaces on >seperate paritions. > >My assertion therefore is that simply adding more drives to an already >competent* configuration is about as likely to increase your database >effectiveness as swiss cheese is to make your car run faster. > > Consider the case of a mirrored file system with a mostly read() workload. Typical behavior is to use a round-robin method for issueing the read operations to each mirror in turn, but one can use other methods like a geometric algorithm that will issue the reads to the drive with the head located closest to the desired track. Some systems have many mirrors of the data for exactly this behavior. In fact, one can carry this logic to the extreme and have one drive for every cylinder in the mirror, thus removing seek latencies completely. In fact this extreme case would also remove the rotational latency as the cylinder will be in the disks read cache. :-) Of course, writing data would be a bit slow! I'm not sure I understand your assertion that "all drives are required to fill every request in all RAID levels". After all, in mirrored reads only one mirror needs to read any given block of data, so I don't know what goal is achieved in making other mirrors read the same data. My assertion (based on ample personal experience) is that one can *always* get improved performance by adding more drives. Just limit the drives to use the first few cylinders so that the average seek time is greatly reduced and concatenate the drives together. One can then build the usual RAID device out of these concatenated metadevices. Yes, one is wasting lots of disk space, but that's life. If your goal is performance, then you need to put your money on the table. The system will be somewhat unreliable because of the device count, additional SCSI buses, etc., but that too is life in the high performance world. -- Alan
Alex Turner wrote: >[snip] > > >>Adding drives will not let you get lower response times than the average seek >>time on your drives*. But it will let you reach that response time more often. >> >> >> >[snip] > >I believe your assertion is fundamentaly flawed. Adding more drives >will not let you reach that response time more often. All drives are >required to fill every request in all RAID levels (except possibly >0+1, but that isn't used for enterprise applicaitons). > Actually 0+1 is the recommended configuration for postgres databases (both for xlog and for the bulk data), because the write speed of RAID5 is quite poor. Hence you base assumption is not correct, and adding drives *does* help. >Most requests >in OLTP require most of the request time to seek, not to read. Only >in single large block data transfers will you get any benefit from >adding more drives, which is atypical in most database applications. >For most database applications, the only way to increase >transactions/sec is to decrease request service time, which is >generaly achieved with better seek times or a better controller card, >or possibly spreading your database accross multiple tablespaces on >seperate paritions. > > This is probably true. However, if you are doing lots of concurrent connections, and things are properly spread across multiple spindles (using RAID0+1, or possibly tablespaces across multiple raids). Then each seek occurs on a separate drive, which allows them to occur at the same time, rather than sequentially. Having 2 processes competing for seeking on the same drive is going to be worse than having them on separate drives. John =:->
Hi, At 16:59 18/04/2005, Greg Stark wrote: >William Yu <wyu@talisys.com> writes: > > > Using the above prices for a fixed budget for RAID-10, you could get: > > > > SATA 7200 -- 680MB per $1000 > > SATA 10K -- 200MB per $1000 > > SCSI 10K -- 125MB per $1000 > >What a lot of these analyses miss is that cheaper == faster because cheaper >means you can buy more spindles for the same price. I'm assuming you picked >equal sized drives to compare so that 200MB/$1000 for SATA is almost twice as >many spindles as the 125MB/$1000. That means it would have almost double the >bandwidth. And the 7200 RPM case would have more than 5x the bandwidth. > >While 10k RPM drives have lower seek times, and SCSI drives have a natural >seek time advantage, under load a RAID array with fewer spindles will start >hitting contention sooner which results into higher latency. If the controller >works well the larger SATA arrays above should be able to maintain their >mediocre latency much better under load than the SCSI array with fewer drives >would maintain its low latency response time despite its drives' lower average >seek time. I would definitely agree. More factors in favor of more cheap drives: - cheaper drives (7200 rpm) have larger disks (3.7" diameter against 2.6 or 3.3). That means the outer tracks hold more data, and the same amount of data is held on a smaller area, which means less tracks, which means reduced seek times. You can roughly count the real average seek time as (average seek time over full disk * size of dataset / capacity of disk). And you actually need to physicall seek less often too. - more disks means less data per disk, which means the data is further concentrated on outer tracks, which means even lower seek times Also, what counts is indeed not so much the time it takes to do one single random seek, but the number of random seeks you can do per second. Hence, more disks means more seeks per second (if requests are evenly distributed among all disks, which a good stripe size should achieve). Not taking into account TCQ/NCQ or write cache optimizations, the important parameter (random seeks per second) can be approximated as: N * 1000 / (lat + seek * ds / (N * cap)) Where: N is the number of disks lat is the average rotational latency in milliseconds (500/(rpm/60)) seek is the average seek over the full disk in milliseconds ds is the dataset size cap is the capacity of each disk Using this formula and a variety of disks, counting only the disks themselves (no enclosures, controllers, rack space, power, maintenance...), trying to maximize the number of seeks/second for a fixed budget (1000 euros) with a dataset size of 100 GB makes SATA drives clear winners: you can get more than 4000 seeks/second (with 21 x 80GB disks) where SCSI cannot even make it to the 1400 seek/second point (with 8 x 36 GB disks). Results can vary quite a lot based on the dataset size, which illustrates the importance of "staying on the edges" of the disks. I'll try to make the analysis more complete by counting some of the "overhead" (obviously 21 drives has a lot of other implications!), but I believe SATA drives still win in theory. It would be interesting to actually compare this to real-world (or nearly-real-world) benchmarks to measure the effectiveness of features like TCQ/NCQ etc. Jacques.
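The seeks-per-second formula above, written out as runnable code. The drive parameters plugged in are illustrative guesses, not the exact figures behind the 4000-versus-1400 numbers quoted in the message:

# seeks/s ~= N * 1000 / (lat + seek * ds / (N * cap)), per the message above.
def seeks_per_second(n_disks, rpm, full_seek_ms, disk_gb, dataset_gb):
    lat = 500.0 / (rpm / 60.0)                                  # avg rotational latency, ms
    effective_seek = full_seek_ms * dataset_gb / (n_disks * disk_gb)
    return n_disks * 1000.0 / (lat + effective_seek)

# 1000-euro budget, 100 GB dataset, as in the example in the message:
print("SATA 7200, 21 x 80 GB:", round(seeks_per_second(21, 7200, 9.0, 80, 100)))
print("SCSI 10k,   8 x 36 GB:", round(seeks_per_second(8, 10_000, 8.0, 36, 100)))
# With these assumed seek/latency figures the SATA set lands around 4500 seeks/s
# and the SCSI set around 1400, roughly matching the numbers quoted above.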
Alex, In the situation of the animal hospital server I oversee, their application is OLTP. Adding hard drives (6-8) does help performance. Benchmarks like pgbench and OSDB agree with it, but in reality users could not see a noticeable change. However, moving the top 5/10 tables and indexes to their own space made a greater impact. Someone who reads the PostgreSQL 8.0 Performance Checklist is going to see that point #1, add more disks, is the key. How about adding a subpoint explaining when more disks isn't enough or applicable? I may be generalizing the complexity of tuning an OLTP application, but some clarity could help. Steve Poe
Ok - well - I am partially wrong... If your stripe size is 64Kb, and you are reading 256k worth of data, it will be spread across four drives, so you will need to read from four devices to get your 256k of data (RAID 0 or 5 or 10), but if you are only reading 64kb of data, I guess you would only need to read from one disk. So my assertion that adding more drives doesn't help is pretty wrong... particularly with OLTP, because it's always dealing with blocks that are smaller than the stripe size. Alex Turner netEconomist On 4/18/05, Jacques Caron <jc@directinfos.com> wrote: > Hi, > > At 18:56 18/04/2005, Alex Turner wrote: > >All drives are required to fill every request in all RAID levels > > No, this is definitely wrong. In many cases, most drives don't actually > have the data requested, how could they handle the request? > > When reading one random sector, only *one* drive out of N is ever used to > service any given request, be it RAID 0, 1, 0+1, 1+0 or 5. > > When writing: > - in RAID 0, 1 drive > - in RAID 1, RAID 0+1 or 1+0, 2 drives > - in RAID 5, you need to read on all drives and write on 2. > > Otherwise, what would be the point of RAID 0, 0+1 or 1+0? > > Jacques. > >
Not true - the recommended RAID level is RAID 10, not RAID 0+1 (at least I would never recommend 1+0 for anything). RAID 10 and RAID 0+1 are _quite_ different. One gives you very good redundancy, the other is only slightly better than RAID 5, but operates faster in degraded mode (single drive). Alex Turner netEconomist On 4/18/05, John A Meinel <john@arbash-meinel.com> wrote: > Alex Turner wrote: > > >[snip] > > > > > >>Adding drives will not let you get lower response times than the average seek > >>time on your drives*. But it will let you reach that response time more often. > >> > >> > >> > >[snip] > > > >I believe your assertion is fundamentaly flawed. Adding more drives > >will not let you reach that response time more often. All drives are > >required to fill every request in all RAID levels (except possibly > >0+1, but that isn't used for enterprise applicaitons). > > > Actually 0+1 is the recommended configuration for postgres databases > (both for xlog and for the bulk data), because the write speed of RAID5 > is quite poor. > Hence you base assumption is not correct, and adding drives *does* help. > > >Most requests > >in OLTP require most of the request time to seek, not to read. Only > >in single large block data transfers will you get any benefit from > >adding more drives, which is atypical in most database applications. > >For most database applications, the only way to increase > >transactions/sec is to decrease request service time, which is > >generaly achieved with better seek times or a better controller card, > >or possibly spreading your database accross multiple tablespaces on > >seperate paritions. > > > > > This is probably true. However, if you are doing lots of concurrent > connections, and things are properly spread across multiple spindles > (using RAID0+1, or possibly tablespaces across multiple raids). > Then each seek occurs on a separate drive, which allows them to occur at > the same time, rather than sequentially. Having 2 processes competing > for seeking on the same drive is going to be worse than having them on > separate drives. > John > =:-> > > >
I think the add more disks thing is really from the point of view that one disk isn't enough ever. You should really have at least four drives configured into two RAID 1s. Most DBAs will know this, but most average Joes won't. Alex Turner netEconomist On 4/18/05, Steve Poe <spoe@sfnet.cc> wrote: > Alex, > > In the situation of the animal hospital server I oversee, their > application is OLTP. Adding hard drives (6-8) does help performance. > Benchmarks like pgbench and OSDB agree with it, but in reality users > could not see noticeable change. However, moving the top 5/10 tables and > indexes to their own space made a greater impact. > > Someone who reads PostgreSQL 8.0 Performance Checklist is going to see > point #1 add more disks is the key. How about adding a subpoint to > explaining when more disks isn't enough or applicable? I maybe > generalizing the complexity of tuning an OLTP application, but some > clarity could help. > > Steve Poe > >
So I wonder if one could take this stripe size thing further and say that a larger stripe size is more likely to result in requests getting served parallelized across disks, which would lead to increased performance? Again, thanks to all the people on this list; I know that I have learnt a _hell_ of a lot since subscribing. Alex Turner netEconomist On 4/18/05, Alex Turner <armtuk@gmail.com> wrote: > Ok - well - I am partially wrong... > > If you're stripe size is 64Kb, and you are reading 256k worth of data, > it will be spread across four drives, so you will need to read from > four devices to get your 256k of data (RAID 0 or 5 or 10), but if you > are only reading 64kb of data, I guess you would only need to read > from one disk. > > So my assertion that adding more drives doesn't help is pretty > wrong... particularly with OLTP because it's always dealing with > blocks that are smaller that the stripe size. > > Alex Turner > netEconomist > > On 4/18/05, Jacques Caron <jc@directinfos.com> wrote: > > Hi, > > > > At 18:56 18/04/2005, Alex Turner wrote: > > >All drives are required to fill every request in all RAID levels > > > > No, this is definitely wrong. In many cases, most drives don't actually > > have the data requested, how could they handle the request? > > > > When reading one random sector, only *one* drive out of N is ever used to > > service any given request, be it RAID 0, 1, 0+1, 1+0 or 5. > > > > When writing: > > - in RAID 0, 1 drive > > - in RAID 1, RAID 0+1 or 1+0, 2 drives > > - in RAID 5, you need to read on all drives and write on 2. > > > > Otherwise, what would be the point of RAID 0, 0+1 or 1+0? > > > > Jacques. > > > > >
Jacques Caron <jc@directinfos.com> writes: > When writing: > - in RAID 0, 1 drive > - in RAID 1, RAID 0+1 or 1+0, 2 drives > - in RAID 5, you need to read on all drives and write on 2. Actually RAID 5 only really needs to read from two drives. The existing parity block and the block you're replacing. It just xors the old block, the new block, and the existing parity block to generate the new parity block. -- greg
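A tiny demonstration of that partial-stripe update rule (new parity = old parity xor old data xor new data), with random bytes standing in for blocks:

# RAID 5 partial-stripe write: read the old data block and old parity,
# xor in the change, write both back. No other member needs to be read.
import os

def raid5_update(old_data, new_data, old_parity):
    return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data))

d0, d1, d2 = os.urandom(16), os.urandom(16), os.urandom(16)
parity = bytes(a ^ b ^ c for a, b, c in zip(d0, d1, d2))

new_d1 = os.urandom(16)
parity = raid5_update(d1, new_d1, parity)

# Parity is still the xor of the (updated) data blocks:
assert parity == bytes(a ^ b ^ c for a, b, c in zip(d0, new_d1, d2))
print("parity updated with reads of only the old block and the old parity")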
Alex Turner wrote: > Not true - the recommended RAID level is RAID 10, not RAID 0+1 (at > least I would never recommend 1+0 for anything). Uhmm I was under the impression that 1+0 was RAID 10 and that 0+1 is NOT RAID 10. Ref: http://www.acnc.com/raid.html Sincerely, Joshua D. Drake > ---------------------------(end of broadcast)--------------------------- > TIP 3: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly
Hi, At 20:16 18/04/2005, Alex Turner wrote: >So my assertion that adding more drives doesn't help is pretty >wrong... particularly with OLTP because it's always dealing with >blocks that are smaller that the stripe size. When doing random seeks (which is what a database needs most of the time), the number of disks helps improve the number of seeks per second (which is the bottleneck in this case). When doing sequential reads, the number of disks helps improve total throughput (which is the bottleneck in that case). In short: it always helps :-) Jacques.
Don't you think "optimal stripe width" would be a good question to research the binaries for? I'd think that drives the answer, largely. (uh oh, pun alert) EG, oracle issues IO requests (this may have changed _just_ recently) in 64KB chunks, regardless of what you ask for. So when I did my striping (many moons ago, when the Earth was young...) I did it in 128KB widths, and set the oracle "multiblock read count" according. For oracle, any stripe size under 64KB=stupid, anything much over 128K/258K=wasteful. I am eager to find out how PG handles all this. - Ross p.s. <Brooklyn thug accent> 'You want a database record? I gotcher record right here' http://en.wikipedia.org/wiki/Akashic_Records -----Original Message----- From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Alex Turner Sent: Monday, April 18, 2005 2:21 PM To: Jacques Caron Cc: Greg Stark; William Yu; pgsql-performance@postgresql.org Subject: Re: [PERFORM] How to improve db performance with $7K? So I wonder if one could take this stripe size thing further and say that a larger stripe size is more likely to result inrequests getting served parallized across disks which would lead to increased performance? Again, thanks to all people on this list, I know that I have learnt a _hell_ of alot since subscribing. Alex Turner netEconomist On 4/18/05, Alex Turner <armtuk@gmail.com> wrote: > Ok - well - I am partially wrong... > > If you're stripe size is 64Kb, and you are reading 256k worth of data, > it will be spread across four drives, so you will need to read from > four devices to get your 256k of data (RAID 0 or 5 or 10), but if you > are only reading 64kb of data, I guess you would only need to read > from one disk. > > So my assertion that adding more drives doesn't help is pretty > wrong... particularly with OLTP because it's always dealing with > blocks that are smaller that the stripe size. > > Alex Turner > netEconomist > > On 4/18/05, Jacques Caron <jc@directinfos.com> wrote: > > Hi, > > > > At 18:56 18/04/2005, Alex Turner wrote: > > >All drives are required to fill every request in all RAID levels > > > > No, this is definitely wrong. In many cases, most drives don't > > actually have the data requested, how could they handle the request? > > > > When reading one random sector, only *one* drive out of N is ever > > used to service any given request, be it RAID 0, 1, 0+1, 1+0 or 5. > > > > When writing: > > - in RAID 0, 1 drive > > - in RAID 1, RAID 0+1 or 1+0, 2 drives > > - in RAID 5, you need to read on all drives and write on 2. > > > > Otherwise, what would be the point of RAID 0, 0+1 or 1+0? > > > > Jacques. > > > > > ---------------------------(end of broadcast)--------------------------- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Mistype.. I meant 0+1 in the second instance :( On 4/18/05, Joshua D. Drake <jd@commandprompt.com> wrote: > Alex Turner wrote: > > Not true - the recommended RAID level is RAID 10, not RAID 0+1 (at > > least I would never recommend 1+0 for anything). > > Uhmm I was under the impression that 1+0 was RAID 10 and that 0+1 is NOT > RAID 10. > > Ref: http://www.acnc.com/raid.html > > Sincerely, > > Joshua D. Drake > > > > ---------------------------(end of broadcast)--------------------------- > > TIP 3: if posting/reading through Usenet, please send an appropriate > > subscribe-nomail command to majordomo@postgresql.org so that your > > message can get through to the mailing list cleanly > >
Hi,

At 20:21 18/04/2005, Alex Turner wrote:
>So I wonder if one could take this stripe size thing further and say
>that a larger stripe size is more likely to result in requests getting
>served parallized across disks which would lead to increased
>performance?

Actually, it would be pretty much the opposite. The smaller the stripe size, the more evenly distributed data is, and the more disks can be used to serve requests. If your stripe size is too large, many random accesses within one single file (whose size is smaller than the stripe size/number of disks) may all end up on the same disk, rather than being split across multiple disks (the extreme case being stripe size = total size of all disks, which means concatenation). If all accesses had the same cost (i.e. no seek time, only transfer time), the ideal would be the smallest possible stripe size, so that every access is spread across all the disks.

But below a certain size, you're going to use multiple disks to serve one single request which would not have taken much more time from a single disk (reading even a large number of consecutive blocks within one cylinder does not take much more time than reading a single block), so you would add unnecessary seeks on a disk that could have served another request in the meantime. You should definitely not go below the filesystem block size or the database block size.

There is an interesting discussion of the optimal stripe size in the vinum manpage on FreeBSD:

http://www.freebsd.org/cgi/man.cgi?query=vinum&apropos=0&sektion=0&manpath=FreeBSD+5.3-RELEASE+and+Ports&format=html

(look for "Performance considerations", towards the end -- note however that some of the calculations are not entirely correct).

Basically it says the optimal stripe size is somewhere between 256KB and 4MB, preferably an odd number, and that some hardware RAID controllers don't like big stripe sizes. YMMV, as always.

Jacques.
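One way to make the "too small a stripe" effect concrete is to estimate how often a single request ends up touching more than one disk. A minimal sketch, assuming requests start at a uniformly random offset within a stripe unit and ignoring alignment and filesystem block size:

    def fraction_spanning(request_kb, stripe_kb):
        """Approximate fraction of randomly placed requests of request_kb
        that cross a stripe-unit boundary and therefore occupy two disks."""
        if request_kb >= stripe_kb:
            return 1.0
        return float(request_kb) / stripe_kb

    for stripe in (8, 64, 256, 1024):
        print("stripe %4d KB: an 8 KB read spans 2 disks %3.0f%% of the time, "
              "a 256 KB read %3.0f%% of the time"
              % (stripe, 100 * fraction_spanning(8, stripe),
                 100 * fraction_spanning(256, stripe)))

Under this toy model a stripe unit well above the typical request size keeps most small random reads on a single disk, while a stripe unit well below it guarantees that even small requests tie up several disks -- which is exactly the trade-off described above.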
On 4/18/05, Jacques Caron <jc@directinfos.com> wrote:
> Hi,
>
> At 20:21 18/04/2005, Alex Turner wrote:
> >So I wonder if one could take this stripe size thing further and say
> >that a larger stripe size is more likely to result in requests getting
> >served parallized across disks which would lead to increased
> >performance?
>
> Actually, it would be pretty much the opposite. The smaller the stripe
> size, the more evenly distributed data is, and the more disks can be used
> to serve requests. If your stripe size is too large, many random accesses
> within one single file (whose size is smaller than the stripe size/number
> of disks) may all end up on the same disk, rather than being split across
> multiple disks (the extreme case being stripe size = total size of all
> disks, which means concatenation). If all accesses had the same cost (i.e.
> no seek time, only transfer time), the ideal would be the smallest
> possible stripe size, so that every access is spread across all the disks.
>
[snip]

Ahh yes - but the critical distinction is this:

The smaller the stripe size, the more disks will be used to serve _a_ request - which is bad for OLTP, because you want fewer disks per request so that you can have more requests per second, because the cost is mostly seek. If more than one disk has to seek to serve a single request, you are preventing that disk from serving a second request at the same time. To have more throughput in MB/sec, you want a smaller stripe size so that you have more disks serving a single request, allowing you to multiply per-drive bandwidth by the number of effective drives to get total bandwidth.

Because OLTP is made up of small reads and writes to a small number of different files, I would guess that you want those files split up across your RAID, but not so much that a single small read or write operation would traverse more than one disk. That would imply that your optimal stripe size is somewhere on the right side of the bell curve that represents your database read and write block count distribution. If on average the db writer never flushes less than 1MB to disk at a time, then I guess your best stripe size would be 1MB, but that seems very large to me.

So I think therefore that I may be contending the exact opposite of what you are postulating!

Alex Turner
netEconomist
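The OLTP side of that trade-off can be sketched with the same sort of toy arithmetic: if every small request ties up d disks, an array of N disks can only work on roughly N/d requests at once. The per-disk seeks-per-second figure below is an assumption, not a measurement:

    def requests_per_sec(n_disks, disks_per_request, per_disk_iops=120.0):
        """Toy OLTP model: each request costs one seek on every disk it touches."""
        return n_disks * per_disk_iops / disks_per_request

    for d in (1, 2, 4):
        print("6-disk array, %d disk(s) touched per request: ~%.0f small requests/sec"
              % (d, requests_per_sec(6, d)))

Which is the same conclusion from the other direction: for many small concurrent requests you want each request confined to as few disks as possible, while a single large sequential stream wants exactly the opposite.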
Oooops, I revived the never-ending $7K thread. :)

Well, part of my message is to first relook at the idea that SATA is cheap but slow. Most people look at SATA from the view of consumer-level drives, no NCQ/TCQ -- basically these drives are IDEs that can connect to SATA cables. But if you then look at the server-level SATAs from WD, you see performance close to server-level 10K SCSIs and pricing also close.

Starting with the idea of using 20 consumer-level SATA drives versus 4 10K SCSIs, the main problem of course is the lack of advanced queueing in these drives. I'm sure there's some threshold where the number-of-drives advantage exceeds the disadvantage of no queueing -- what that is, I don't have a clue.

Now if you stuffed a ton of memory onto a SATA caching controller and these controllers did the queue management instead of the drives, that would eliminate most of the performance issues. Then you're just left with the management issues: getting those 20 drives stuffed into a big case, and keeping a close eye on the drives, since a drive failure will be a much bigger deal.

Greg Stark wrote:
> William Yu <wyu@talisys.com> writes:
>
>
>>Using the above prices for a fixed budget for RAID-10, you could get:
>>
>>SATA 7200 -- 680MB per $1000
>>SATA 10K -- 200MB per $1000
>>SCSI 10K -- 125MB per $1000
>
>
> What a lot of these analyses miss is that cheaper == faster because cheaper
> means you can buy more spindles for the same price. I'm assuming you picked
> equal sized drives to compare so that 200MB/$1000 for SATA is almost twice as
> many spindles as the 125MB/$1000. That means it would have almost double the
> bandwidth. And the 7200 RPM case would have more than 5x the bandwidth.
>
> While 10k RPM drives have lower seek times, and SCSI drives have a natural
> seek time advantage, under load a RAID array with fewer spindles will start
> hitting contention sooner which results into higher latency. If the controller
> works well the larger SATA arrays above should be able to maintain their
> mediocre latency much better under load than the SCSI array with fewer drives
> would maintain its low latency response time despite its drives' lower average
> seek time.
>
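The spindles-per-dollar argument is easy to sketch as well. The drive prices and per-drive random-read rates below are made-up placeholders, not quotes for any real product; the point is only the shape of the comparison:

    # Hypothetical prices and per-drive random-read rates, purely for illustration.
    drives = {
        "SATA 7200, no TCQ/NCQ": {"price": 100, "iops": 80},
        "SCSI 10K, TCQ":         {"price": 400, "iops": 130},
    }

    budget = 4000
    for name, d in sorted(drives.items()):
        spindles = budget // d["price"]
        # Crude ceiling: assumes the controller can keep every spindle busy.
        print("%-22s %2d spindles, ~%5d aggregate random-read IOPS"
              % (name, spindles, spindles * d["iops"]))

The open question raised above -- at what point drives without command queueing stop scaling like this in practice -- is exactly what a back-of-the-envelope calculation cannot answer; only benchmarks can.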
Kevin Brown wrote:
> Greg Stark wrote:
> >
> > I think you're being misled by analyzing the write case.
> >
> > Consider the read case. When a user process requests a block and
> > that read makes its way down to the driver level, the driver can't
> > just put it aside and wait until it's convenient. It has to go ahead
> > and issue the read right away.
>
> Well, strictly speaking it doesn't *have* to. It could delay for a
> couple of milliseconds to see if other requests come in, and then
> issue the read if none do. If there are already other requests being
> fulfilled, then it'll schedule the request in question just like the
> rest.

The idea with SCSI or any command queuing is that you don't have to wait for another request to come in --- you can send the request as it arrives, then if another shows up, you send that too, and the drive optimizes the grouping at a later time, knowing what the drive is doing, rather than queueing in the kernel.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Does it really matter at which end of the cable the queueing is done (Assuming both ends know as much about drive geometry etc..)? Alex Turner netEconomist On 4/18/05, Bruce Momjian <pgman@candle.pha.pa.us> wrote: > Kevin Brown wrote: > > Greg Stark wrote: > > > > > > > I think you're being misled by analyzing the write case. > > > > > > Consider the read case. When a user process requests a block and > > > that read makes its way down to the driver level, the driver can't > > > just put it aside and wait until it's convenient. It has to go ahead > > > and issue the read right away. > > > > Well, strictly speaking it doesn't *have* to. It could delay for a > > couple of milliseconds to see if other requests come in, and then > > issue the read if none do. If there are already other requests being > > fulfilled, then it'll schedule the request in question just like the > > rest. > > The idea with SCSI or any command queuing is that you don't have to wait > for another request to come in --- you can send the request as it > arrives, then if another shows up, you send that too, and the drive > optimizes the grouping at a later time, knowing what the drive is doing, > rather queueing in the kernel. > > -- > Bruce Momjian | http://candle.pha.pa.us > pgman@candle.pha.pa.us | (610) 359-1001 > + If your life is a hard drive, | 13 Roberts Road > + Christ can be your backup. | Newtown Square, Pennsylvania 19073 > > ---------------------------(end of broadcast)--------------------------- > TIP 9: the planner will ignore your desire to choose an index scan if your > joining column's datatypes do not match >
On Mon, Apr 18, 2005 at 06:49:44PM -0400, Alex Turner wrote:
> Does it really matter at which end of the cable the queueing is done
> (Assuming both ends know as much about drive geometry etc..)?

That is a pretty strong assumption, isn't it?

Also you seem to be assuming that the controller<->disk protocol (some internal, unknown to mere mortals, mechanism) is as powerful as the host<->controller protocol (SATA, SCSI, etc).

I'm lost as to whether this thread is about what is possible with current, in-market technology, or about what could in theory be possible [if you were to design "open source" disk controllers and disks.]

--
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"Strength does not lie in physical means but resides in an indomitable will" (Gandhi)
On 4/14/05, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> That's basically what it comes down to: SCSI lets the disk drive itself
> do the low-level I/O scheduling whereas the ATA spec prevents the drive
> from doing so (unless it cheats, ie, caches writes). Also, in SCSI it's
> possible for the drive to rearrange reads as well as writes --- which
> AFAICS is just not possible in ATA. (Maybe in the newest spec...)
>
> The reason this is so much more of a win than it was when ATA was
> designed is that in modern drives the kernel has very little clue about
> the physical geometry of the disk. Variable-size tracks, bad-block
> sparing, and stuff like that make for a very hard-to-predict mapping
> from linear sector addresses to actual disk locations. Combine that
> with the fact that the drive controller can be much smarter than it was
> twenty years ago, and you can see that the case for doing I/O scheduling
> in the kernel and not in the drive is pretty weak.
>
>

So if you all were going to choose between two hard drives where:
drive A has capacity C and spins at 15K rpms, and
drive B has capacity 2 x C and spins at 10K rpms and
all other features are the same, the price is the same, and C is enough disk space, which would you choose?

I've noticed that on IDE drives, as the capacity increases the data density increases and there is a perceived (I've not measured it) performance increase.

Would the increased data density of the higher capacity drive be of greater benefit than the faster spindle speed of drive A?

--
Matthew Nuzum
www.bearfruit.org
Alex Turner wrote: > Does it really matter at which end of the cable the queueing is done > (Assuming both ends know as much about drive geometry etc..)? Good question. If the SCSI system was moving the head from track 1 to 10, and a request then came in for track 5, could the system make the head stop at track 5 on its way to track 10? That is something that only the controller could do. However, I have no idea if SCSI does that. The only part I am pretty sure about is that real-world experience shows SCSI is better for a mixed I/O environment. Not sure why, exactly, but the command queueing obviously helps, and I am not sure what else does. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
> -----Original Message----- > From: Alex Turner [mailto:armtuk@gmail.com] > Sent: Monday, April 18, 2005 5:50 PM > To: Bruce Momjian > Cc: Kevin Brown; pgsql-performance@postgresql.org > Subject: Re: [PERFORM] How to improve db performance with $7K? > > Does it really matter at which end of the cable the queueing is done > (Assuming both ends know as much about drive geometry etc..)? > [...] The parenthetical is an assumption I'd rather not make. If my performance depends on my kernel knowing how my drive is laid out, I would always be wondering if a new drive is going to break any of the kernel's geometry assumptions. Drive geometry doesn't seem like a kernel's business any more than a kernel should be able to decode the ccd signal of an optical mouse. The kernel should queue requests at a level of abstraction that doesn't depend on intimate knowledge of drive geometry, and the drive should queue requests on the concrete level where geometry matters. A drive shouldn't guess whether a process is trying to read a file sequentially, and a kernel shouldn't guess whether sector 30 is contiguous with sector 31 or not. __ David B. Held Software Engineer/Array Services Group 200 14th Ave. East, Sartell, MN 56377 320.534.3637 320.253.7800 800.752.8129
Good question. If the SCSI system was moving the head from track 1 to 10, and a request then came in for track 5, could the system make the head stop at track 5 on its way to track 10? That is something that only the controller could do. However, I have no idea if SCSI does that.

|| SCSI, AFAIK, does NOT do this. What SCSI can do is allow "next" request insertion into the head of the request queue (queue-jumping), and/or defer request ordering to be done by the drive per se (queue re-ordering). I have looked, in vain, for evidence that SCSI somehow magically "stops in the middle of a request to pick up data" (my words, not yours).

The only part I am pretty sure about is that real-world experience shows SCSI is better for a mixed I/O environment. Not sure why, exactly, but the command queueing obviously helps, and I am not sure what else does.

|| TCQ is the secret sauce, no doubt. I think NCQ (the SATA version of per se drive request reordering) should go a looong way (but not all the way) toward making SATA 'enterprise acceptable'. Multiple initiators (e.g. more than one host being able to talk to a drive) is a biggie, too. AFAIK only SCSI drives/controllers do that for now.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
       joining column's datatypes do not match
Mohan, Ross wrote: > The only part I am pretty sure about is that real-world experience shows SCSI is better for a mixed I/O environment. Notsure why, exactly, but the command queueing obviously helps, and I am not sure what else does. > > || TCQ is the secret sauce, no doubt. I think NCQ (the SATA version of per se drive request reordering) > should go a looong way (but not all the way) toward making SATA 'enterprise acceptable'. Multiple > initiators (e.g. more than one host being able to talk to a drive) is a biggie, too. AFAIK only SCSI > drives/controllers do that for now. What is 'multiple initiators' used for in the real world? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Clustered file systems is the first/best example that comes to mind. Host A and Host B can both request from diskfarm, eg. -----Original Message----- From: Bruce Momjian [mailto:pgman@candle.pha.pa.us] Sent: Tuesday, April 19, 2005 12:10 PM To: Mohan, Ross Cc: pgsql-performance@postgresql.org Subject: Re: [PERFORM] How to improve db performance with $7K? Mohan, Ross wrote: > The only part I am pretty sure about is that real-world experience > shows SCSI is better for a mixed I/O environment. Not sure why, > exactly, but the command queueing obviously helps, and I am not sure > what else does. > > || TCQ is the secret sauce, no doubt. I think NCQ (the SATA version > || of per se drive request reordering) > should go a looong way (but not all the way) toward making SATA 'enterprise acceptable'. Multiple > initiators (e.g. more than one host being able to talk to a drive) is a biggie, too. AFAIK only SCSI > drives/controllers do that for now. What is 'multiple initiators' used for in the real world? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Mohan, Ross wrote: > Clustered file systems is the first/best example that > comes to mind. Host A and Host B can both request from diskfarm, eg. So one host writes to part of the disk and another host writes to a different part? --------------------------------------------------------------------------- > -----Original Message----- > From: Bruce Momjian [mailto:pgman@candle.pha.pa.us] > Sent: Tuesday, April 19, 2005 12:10 PM > To: Mohan, Ross > Cc: pgsql-performance@postgresql.org > Subject: Re: [PERFORM] How to improve db performance with $7K? > > > Mohan, Ross wrote: > > The only part I am pretty sure about is that real-world experience > > shows SCSI is better for a mixed I/O environment. Not sure why, > > exactly, but the command queueing obviously helps, and I am not sure > > what else does. > > > > || TCQ is the secret sauce, no doubt. I think NCQ (the SATA version > > || of per se drive request reordering) > > should go a looong way (but not all the way) toward making SATA 'enterprise acceptable'. Multiple > > initiators (e.g. more than one host being able to talk to a drive) is a biggie, too. AFAIK only SCSI > > drives/controllers do that for now. > > What is 'multiple initiators' used for in the real world? > > -- > Bruce Momjian | http://candle.pha.pa.us > pgman@candle.pha.pa.us | (610) 359-1001 > + If your life is a hard drive, | 13 Roberts Road > + Christ can be your backup. | Newtown Square, Pennsylvania 19073 > > ---------------------------(end of broadcast)--------------------------- > TIP 6: Have you searched our list archives? > > http://archives.postgresql.org > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
pgsql-performance-owner@postgresql.org wrote on 04/19/2005 11:10:22 AM: > > What is 'multiple initiators' used for in the real world? I asked this same question and got an answer off list: Somebody said their SAN hardware used multiple initiators. I would try to check the archives for you, but this thread is becoming more of a rope. Multiple initiators means multiple sources on the bus issuing I/O instructions to the drives. In theory you can have two computers on the same SCSI bus issuing I/O requests to the same drive, or to anything else on the bus, but I've never seen this implemented. Others have noted this feature as being a big deal, so somebody is benefiting from it. Rick > > -- > Bruce Momjian | http://candle.pha.pa.us > pgman@candle.pha.pa.us | (610) 359-1001 > + If your life is a hard drive, | 13 Roberts Road > + Christ can be your backup. | Newtown Square, Pennsylvania 19073 > > ---------------------------(end of broadcast)--------------------------- > TIP 7: don't forget to increase your free space map settings
Well, more like they both are allowed to issue disk requests and the magical "clustered file system" manages locking, etc. In reality, any disk is only reading/writing to one part of the disk at any given time, of course, but that in the multiple initiator deal, multiple streams of requests from multiple hosts can be queued. -----Original Message----- From: Bruce Momjian [mailto:pgman@candle.pha.pa.us] Sent: Tuesday, April 19, 2005 12:16 PM To: Mohan, Ross Cc: pgsql-performance@postgresql.org Subject: Re: [PERFORM] How to improve db performance with $7K? Mohan, Ross wrote: > Clustered file systems is the first/best example that > comes to mind. Host A and Host B can both request from diskfarm, eg. So one host writes to part of the disk and another host writes to a different part? --------------------------------------------------------------------------- > -----Original Message----- > From: Bruce Momjian [mailto:pgman@candle.pha.pa.us] > Sent: Tuesday, April 19, 2005 12:10 PM > To: Mohan, Ross > Cc: pgsql-performance@postgresql.org > Subject: Re: [PERFORM] How to improve db performance with $7K? > > > Mohan, Ross wrote: > > The only part I am pretty sure about is that real-world experience > > shows SCSI is better for a mixed I/O environment. Not sure why, > > exactly, but the command queueing obviously helps, and I am not sure > > what else does. > > > > || TCQ is the secret sauce, no doubt. I think NCQ (the SATA version > > || of per se drive request reordering) > > should go a looong way (but not all the way) toward making SATA 'enterprise acceptable'. Multiple > > initiators (e.g. more than one host being able to talk to a drive) is a biggie, too. AFAIK only SCSI > > drives/controllers do that for now. > > What is 'multiple initiators' used for in the real world? > > -- > Bruce Momjian | http://candle.pha.pa.us > pgman@candle.pha.pa.us | (610) 359-1001 > + If your life is a hard drive, | 13 Roberts Road > + Christ can be your backup. | Newtown Square, Pennsylvania 19073 > > ---------------------------(end of > broadcast)--------------------------- > TIP 6: Have you searched our list archives? > > http://archives.postgresql.org > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
On Thu, Apr 14, 2005 at 10:51:46AM -0500, Matthew Nuzum wrote: > So if you all were going to choose between two hard drives where: > drive A has capacity C and spins at 15K rpms, and > drive B has capacity 2 x C and spins at 10K rpms and > all other features are the same, the price is the same and C is enough > disk space which would you choose? > > I've noticed that on IDE drives, as the capacity increases the data > density increases and there is a pereceived (I've not measured it) > performance increase. > > Would the increased data density of the higher capacity drive be of > greater benefit than the faster spindle speed of drive A? The increased data density will help transfer speed off the platter, but that's it. It won't help rotational latency. -- Jim C. Nasby, Database Consultant decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
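The rotational-latency half of that is easy to quantify: on average the platter has to turn half a revolution before the target sector comes under the head, no matter how dense the platters are. A quick check:

    def avg_rotational_latency_ms(rpm):
        """Average rotational latency: half a revolution, in milliseconds."""
        return 0.5 * 60000.0 / rpm

    for rpm in (7200, 10000, 15000):
        print("%5d RPM: ~%.1f ms average rotational latency"
              % (rpm, avg_rotational_latency_ms(rpm)))

So the 15K drive shaves roughly a millisecond off every random access compared to the 10K drive (and about two compared to 7200 RPM), while the denser platters of the bigger drive only pay off on transfer, i.e. on sequential work.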
On Mon, Apr 18, 2005 at 07:41:49PM +0200, Jacques Caron wrote: > It would be interesting to actually compare this to real-world (or > nearly-real-world) benchmarks to measure the effectiveness of features like > TCQ/NCQ etc. I was just thinking that it would be very interesting to benchmark different RAID configurations using dbt2. I don't know if this is something that the lab is setup for or capable of, though. -- Jim C. Nasby, Database Consultant decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
On Mon, Apr 18, 2005 at 10:20:36AM -0500, Dave Held wrote: > Hmm...so you're saying that at some point, quantity beats quality? > That's an interesting point. However, it presumes that you can > actually distribute your data over a larger number of drives. If > you have a db with a bottleneck of one or two very large tables, > the extra spindles won't help unless you break up the tables and > glue them together with query magic. But it's still a point to > consider. Huh? Do you know how RAID10 works? -- Jim C. Nasby, Database Consultant decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
On Mon, Apr 18, 2005 at 06:41:37PM -0000, Mohan, Ross wrote: > Don't you think "optimal stripe width" would be > a good question to research the binaries for? I'd > think that drives the answer, largely. (uh oh, pun alert) > > EG, oracle issues IO requests (this may have changed _just_ > recently) in 64KB chunks, regardless of what you ask for. > So when I did my striping (many moons ago, when the Earth > was young...) I did it in 128KB widths, and set the oracle > "multiblock read count" according. For oracle, any stripe size > under 64KB=stupid, anything much over 128K/258K=wasteful. > > I am eager to find out how PG handles all this. AFAIK PostgreSQL requests data one database page at a time (normally 8k). Of course the OS might do something different. -- Jim C. Nasby, Database Consultant decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
On Tue, Apr 19, 2005 at 11:22:17AM -0500, Richard_D_Levine@raytheon.com wrote: > > > pgsql-performance-owner@postgresql.org wrote on 04/19/2005 11:10:22 AM: > > > > What is 'multiple initiators' used for in the real world? > > I asked this same question and got an answer off list: Somebody said their > SAN hardware used multiple initiators. I would try to check the archives > for you, but this thread is becoming more of a rope. > > Multiple initiators means multiple sources on the bus issuing I/O > instructions to the drives. In theory you can have two computers on the > same SCSI bus issuing I/O requests to the same drive, or to anything else > on the bus, but I've never seen this implemented. Others have noted this > feature as being a big deal, so somebody is benefiting from it. It's a big deal for Oracle clustering, which relies on shared drives. Of course most people doing Oracle clustering are probably using a SAN and not raw SCSI... -- Jim C. Nasby, Database Consultant decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
Now that we've hashed out which drives are quicker and more money equals faster...

Let's say you had a server with 6 separate 15k RPM SCSI disks, what raid option would you use for a standalone postgres server?

a) 3xRAID1 - 1 for data, 1 for xlog, 1 for os?
b) 1xRAID1 for OS/xlog, 1xRAID5 for data
c) 1xRAID10 for OS/xlog/data
d) 1xRAID1 for OS, 1xRAID10 for data
e) .....

I was initially leaning towards b, but after talking to Josh a bit, I suspect that with only 4 disks the raid5 might be a performance detriment vs 3 raid 1s or some sort of split raid10 setup.

--
Jeff Frost, Owner       <jeff@frostconsultingllc.com>
Frost Consulting, LLC   http://www.frostconsultingllc.com/
Phone: 650-780-7908     FAX: 650-649-1954
http://stats.distributed.net is setup with the OS, WAL, and temp on a RAID1 and the database on a RAID10. The drives are 200G SATA with a 3ware raid card. I don't think the controller has battery-backed cache, but I'm not sure. In any case, it's almost never disk-bound on the mirror; when it's disk-bound it's usually the RAID10. But this is a read-mostly database. If it was write-heavy, that might not be the case. Also, in general, I see very little disk activity from the OS itself, so I don't think there's a large disadvantage to having it on the same drives as part of your database. I would recommend different filesystems for each, though. (ie: not one giant / partition) On Tue, Apr 19, 2005 at 06:00:42PM -0700, Jeff Frost wrote: > Now that we've hashed out which drives are quicker and more money equals > faster... > > Let's say you had a server with 6 separate 15k RPM SCSI disks, what raid > option would you use for a standalone postgres server? > > a) 3xRAID1 - 1 for data, 1 for xlog, 1 for os? > b) 1xRAID1 for OS/xlog, 1xRAID5 for data > c) 1xRAID10 for OS/xlong/data > d) 1xRAID1 for OS, 1xRAID10 for data > e) ..... > > I was initially leaning towards b, but after talking to Josh a bit, I > suspect that with only 4 disks the raid5 might be a performance detriment > vs 3 raid 1s or some sort of split raid10 setup. > > -- > Jeff Frost, Owner <jeff@frostconsultingllc.com> > Frost Consulting, LLC http://www.frostconsultingllc.com/ > Phone: 650-780-7908 FAX: 650-649-1954 > > ---------------------------(end of broadcast)--------------------------- > TIP 5: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faq > -- Jim C. Nasby, Database Consultant decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
Jeff, > Let's say you had a server with 6 separate 15k RPM SCSI disks, what raid > option would you use for a standalone postgres server? > > a) 3xRAID1 - 1 for data, 1 for xlog, 1 for os? > b) 1xRAID1 for OS/xlog, 1xRAID5 for data > c) 1xRAID10 for OS/xlong/data > d) 1xRAID1 for OS, 1xRAID10 for data > e) ..... > > I was initially leaning towards b, but after talking to Josh a bit, I > suspect that with only 4 disks the raid5 might be a performance detriment > vs 3 raid 1s or some sort of split raid10 setup. Knowing that your installation is read-heavy, I'd recommend (d), with the WAL on the same disk as the OS, i.e. RAID1 2 disks OS, pg_xlog RAID 1+0 4 disks pgdata -- --Josh Josh Berkus Aglio Database Solutions San Francisco
My experience: 1xRAID10 for postgres 1xRAID1 for OS + WAL Jeff Frost wrote: > Now that we've hashed out which drives are quicker and more money equals > faster... > > Let's say you had a server with 6 separate 15k RPM SCSI disks, what raid > option would you use for a standalone postgres server? > > a) 3xRAID1 - 1 for data, 1 for xlog, 1 for os? > b) 1xRAID1 for OS/xlog, 1xRAID5 for data > c) 1xRAID10 for OS/xlong/data > d) 1xRAID1 for OS, 1xRAID10 for data > e) ..... > > I was initially leaning towards b, but after talking to Josh a bit, I > suspect that with only 4 disks the raid5 might be a performance > detriment vs 3 raid 1s or some sort of split raid10 setup. >
> RAID1 2 disks OS, pg_xlog > RAID 1+0 4 disks pgdata Looks like the consensus is RAID 1 for OS, pg_xlog and RAID10 for pgdata. Now here's another performance related question: I've seen quite a few folks touting the Opteron as 2.5x faster with postgres than a Xeon box. What makes the Opteron so quick? Is it that Postgres really prefers to run in 64-bit mode? When I look at AMD's TPC-C scores where they are showing off the Opteron http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_8796_8800~96125,00.html It doesn't appear 2.5x as fast as the Xeon systems, though I have heard from a few Postgres folks that a dual Opteron is 2.5x as fast as a dual Xeon. I would think that AMD would be all over that press if they could show it, so what am I missing? Is it a bus speed thing? Better south bridge on the boards? -- Jeff Frost, Owner <jeff@frostconsultingllc.com> Frost Consulting, LLC http://www.frostconsultingllc.com/ Phone: 650-780-7908 FAX: 650-649-1954
>I've seen quite a few folks touting the Opteron as 2.5x >faster with postgres than a Xeon box. What makes the >Opteron so quick? Is it that Postgres really prefers to >run in 64-bit mode? I don't know about 2.5x faster (perhaps on specific types of loads), but the reason Opterons rock for database applications is their insanely good memory bandwidth and latency that scales much better than the Xeon. Opterons also have a ccNUMA-esque I/O fabric and two dedicated on-die memory channels *per processor* -- no shared bus there, closer to real UNIX server iron than a glorified PC. We run a large Postgres database on a dual Opteron in 32-bit mode that crushes Xeons running at higher clock speeds. It has little to do with bitness or theoretical instruction dispatch, and everything to do with the superior memory controller and I/O fabric. Databases are all about moving chunks of data around and the Opteron systems were engineered to do this very well and in a very scalable fashion. For the money, it is hard to argue with the price/performance of Opteron based servers. We started with one dual Opteron postgres server just over a year ago (with an equivalent uptime) and have considered nothing but Opterons for database servers since. Opterons really are clearly superior to Xeons for this application. I don't work for AMD, just a satisfied customer. :-) re: 6 disks. Unless you are tight on disk space, a hot spare might be nice as well depending on your needs. Cheers, J. Andrew Rogers
On Tue, 19 Apr 2005, J. Andrew Rogers wrote: > I don't know about 2.5x faster (perhaps on specific types of loads), but the > reason Opterons rock for database applications is their insanely good memory > bandwidth and latency that scales much better than the Xeon. Opterons also > have a ccNUMA-esque I/O fabric and two dedicated on-die memory channels *per > processor* -- no shared bus there, closer to real UNIX server iron than a > glorified PC. Thanks J! That's exactly what I was suspecting it might be. Actually, I found an anandtech benchmark that shows the Opteron coming in at close to 2.0x performance: http://www.anandtech.com/linux/showdoc.aspx?i=2163&p=2 It's an Opteron 150 (2.4ghz) vs. Xeon 3.6ghz from August. I wonder if the differences are more pronounced with the newer Opterons. -Jeff
On 4/19/05, Mohan, Ross <RMohan@arbinet.com> wrote: > Clustered file systems is the first/best example that > comes to mind. Host A and Host B can both request from diskfarm, eg. Something like a Global File System? http://www.redhat.com/software/rha/gfs/ (I believe some other company did develop it some time in the past; hmm, probably the guys doing LVM stuff?). Anyway the idea is that two machines have same filesystem mounted and they share it. The locking I believe is handled by communication between computers using "host to host" SCSI commands. I never used it, I've only heard about it from a friend who used to work with it in CERN. Regards, Dawid
I posted this link a few months ago and there was some surprise over the difference in postgresql compared to other DBs. (Not much surprise in Opteron stomping on Xeon in pgsql as most people here have had that experience -- the surprise was in how much smaller the difference was in other DBs.) If it was across the board +100% in MS-SQL, MySQL, etc -- you can chalk it up to overall better CPU architecture. Most of the time though, the numbers I've seen show +0-30% for [insert DB here] and a huge whopping +++++ for pgsql. Why the preference for postgresql is so pronounced was never fully explained, as far as I can tell.

BTW, the Anandtech test compares single CPU systems w/ 1GB of RAM. Go to dual/quad and SMP Xeon will suffer even more since it has to share a fixed amount of FSB/memory bandwidth amongst all CPUs. Xeons also seem to suffer more from context-switch storms. Go > 4GB of RAM and the Xeon suffers another hit due to the lack of a 64-bit IOMMU. Devices cannot map to addresses > 4GB which means the OS has to do extra work in copying data from/to > 4GB anytime you have IO. (Although this penalty might exist all the time in 64-bit mode for Xeon if Linux/Windows took the expedient and less-buggy route of using a single method versus checking whether target addresses are > or < 4GB.)

Jeff Frost wrote:
> On Tue, 19 Apr 2005, J. Andrew Rogers wrote:
>
>> I don't know about 2.5x faster (perhaps on specific types of loads),
>> but the reason Opterons rock for database applications is their
>> insanely good memory bandwidth and latency that scales much better
>> than the Xeon. Opterons also have a ccNUMA-esque I/O fabric and two
>> dedicated on-die memory channels *per processor* -- no shared bus
>> there, closer to real UNIX server iron than a glorified PC.
>
>
> Thanks J! That's exactly what I was suspecting it might be. Actually,
> I found an anandtech benchmark that shows the Opteron coming in at close
> to 2.0x performance:
>
> http://www.anandtech.com/linux/showdoc.aspx?i=2163&p=2
>
> It's an Opteron 150 (2.4ghz) vs. Xeon 3.6ghz from August. I wonder if
> the differences are more pronounced with the newer Opterons.
>
> -Jeff
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: the planner will ignore your desire to choose an index scan if your
> joining column's datatypes do not match
>
On Apr 19, 2005, at 11:07 PM, Josh Berkus wrote: > RAID1 2 disks OS, pg_xlog > RAID 1+0 4 disks pgdata > This is my preferred setup, but I do it with 6 disks on RAID10 for data, and since I have craploads of disk space I set checkpoint segments to 256 (and checkpoint timeout to 5 minutes) Vivek Khera, Ph.D. +1-301-869-4449 x806
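One thing worth noting alongside that setting: with the standard 16 MB WAL segment size, the documentation of that era puts the normal ceiling on pg_xlog at roughly 2 * checkpoint_segments + 1 segment files, so it is worth sanity-checking the disk space that implies before copying the value:

    def wal_footprint_gb(checkpoint_segments, segment_mb=16):
        """Rough upper bound on pg_xlog size: about 2*checkpoint_segments + 1
        WAL segment files of segment_mb each (rule of thumb for 7.x/8.x)."""
        return (2 * checkpoint_segments + 1) * segment_mb / 1024.0

    print("checkpoint_segments = 256 -> up to ~%.1f GB of pg_xlog" % wal_footprint_gb(256))

About 8 GB -- no problem when, as above, WAL lives on a mirror with plenty of room, but worth knowing before putting pg_xlog on a small dedicated pair of disks.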
On Apr 20, 2005, at 12:40 AM, Jeff Frost wrote: > I've seen quite a few folks touting the Opteron as 2.5x faster with > postgres than a Xeon box. What makes the Opteron so quick? Is it > that Postgres really prefers to run in 64-bit mode? > The I/O path on the opterons seems to be much faster, and having 64-bit all the way to the disk controller helps... just be sure to run a 64-bit version of your OS. Vivek Khera, Ph.D. +1-301-869-4449 x806
kewl. Well, 8k request out of PG kernel might turn into an "X"Kb request at disk/OS level, but duly noted. Did you scan the code for this, or are you pulling this recollection from the cognitive archives? :-) -----Original Message----- From: Jim C. Nasby [mailto:decibel@decibel.org] Sent: Tuesday, April 19, 2005 8:12 PM To: Mohan, Ross Cc: pgsql-performance@postgresql.org Subject: Re: [PERFORM] How to improve db performance with $7K? On Mon, Apr 18, 2005 at 06:41:37PM -0000, Mohan, Ross wrote: > Don't you think "optimal stripe width" would be > a good question to research the binaries for? I'd > think that drives the answer, largely. (uh oh, pun alert) > > EG, oracle issues IO requests (this may have changed _just_ > recently) in 64KB chunks, regardless of what you ask for. > So when I did my striping (many moons ago, when the Earth > was young...) I did it in 128KB widths, and set the oracle > "multiblock read count" according. For oracle, any stripe size > under 64KB=stupid, anything much over 128K/258K=wasteful. > > I am eager to find out how PG handles all this. AFAIK PostgreSQL requests data one database page at a time (normally 8k). Of course the OS might do something different. -- Jim C. Nasby, Database Consultant decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
right, the oracle system uses a second "low latency" bus to manage locking information (at the block level) via a distributed lock manager. (but this is slightly different albeit related to a clustered file system and OS-managed locking, eg) -----Original Message----- From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Dawid Kuroczko Sent: Wednesday, April 20, 2005 4:56 AM To: pgsql-performance@postgresql.org Subject: Re: [PERFORM] How to improve db performance with $7K? On 4/19/05, Mohan, Ross <RMohan@arbinet.com> wrote: > Clustered file systems is the first/best example that > comes to mind. Host A and Host B can both request from diskfarm, eg. Something like a Global File System? http://www.redhat.com/software/rha/gfs/ (I believe some other company did develop it some time in the past; hmm, probably the guys doing LVM stuff?). Anyway the idea is that two machines have same filesystem mounted and they share it. The locking I believe is handled bycommunication between computers using "host to host" SCSI commands. I never used it, I've only heard about it from a friend who used to work with it in CERN. Regards, Dawid ---------------------------(end of broadcast)--------------------------- TIP 8: explain analyze is your friend
I wonder if that's something to think about adding to Postgresql? A setting for multiblock read count like Oracle (although having said that I believe that Oracle natively caches pages much more aggressively than postgresql, which allows the OS to do the file caching).

Alex Turner
netEconomist

P.S. Oracle changed this with 9i, you can change the Database block size on a tablespace by tablespace basis, making it smaller for OLTP tablespaces and larger for Warehousing tablespaces (at least I think it's on a tablespace, might be on a whole DB).

On 4/19/05, Jim C. Nasby <decibel@decibel.org> wrote:
> On Mon, Apr 18, 2005 at 06:41:37PM -0000, Mohan, Ross wrote:
> > Don't you think "optimal stripe width" would be
> > a good question to research the binaries for? I'd
> > think that drives the answer, largely. (uh oh, pun alert)
> >
> > EG, oracle issues IO requests (this may have changed _just_
> > recently) in 64KB chunks, regardless of what you ask for.
> > So when I did my striping (many moons ago, when the Earth
> > was young...) I did it in 128KB widths, and set the oracle
> > "multiblock read count" according. For oracle, any stripe size
> > under 64KB=stupid, anything much over 128K/258K=wasteful.
> >
> > I am eager to find out how PG handles all this.
>
> AFAIK PostgreSQL requests data one database page at a time (normally
> 8k). Of course the OS might do something different.
> --
> Jim C. Nasby, Database Consultant               decibel@decibel.org
> Give your computer some brain candy! www.distributed.net Team #1828
>
> Windows: "Where do you want to go today?"
> Linux: "Where do you want to go tomorrow?"
> FreeBSD: "Are you guys coming, or what?"
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster
>
> -----Original Message----- > From: Alex Turner [mailto:armtuk@gmail.com] > Sent: Wednesday, April 20, 2005 12:04 PM > To: Dave Held > Cc: pgsql-performance@postgresql.org > Subject: Re: [PERFORM] How to improve db performance with $7K? > > [...] > Lets say we invented a new protocol that including the drive telling > the controller how it was layed out at initialization time so that the > controller could make better decisions about re-ordering seeks. It > would be more cost effective to have that set of electronics just once > in the controller, than 8 times on each drive in an array, which would > yield better performance to cost ratio. Assuming that a single controller would be able to service 8 drives without delays. The fact that you want the controller to have fairly intimate knowledge of the drives implies that this is a semi-soft solution requiring some fairly fat hardware compared to firmware that is hard-wired for one drive. Note that your controller has to be 8x as fast as the on-board drive firmware. There's definitely a balance there, and it's not entirely clear to me where the break-even point is. > Therefore I would suggest it is something that should be investigated. > After all, why implemented TCQ on each drive, if it can be handled more > effeciently at the other end by the controller for less money?! Because it might not cost less. ;) However, I can see where you might want the controller to drive the actual hardware when you have a RAID setup that requires synchronized seeks, etc. But in that case, it's doing one computation for multiple drives, so there really is a win. __ David B. Held Software Engineer/Array Services Group 200 14th Ave. East, Sartell, MN 56377 320.534.3637 320.253.7800 800.752.8129
Whilst I admire your purist approach, I would say that if it is beneficial to performance that a kernel understand drive geometry, then it is worth investigating teaching it how to deal with that!

I was less referring to the kernel as I was to the controller.

Let's say we invented a new protocol that included the drive telling the controller how it is laid out at initialization time, so that the controller could make better decisions about re-ordering seeks. It would be more cost effective to have that set of electronics just once in the controller, than 8 times on each drive in an array, which would yield a better performance-to-cost ratio. Therefore I would suggest it is something that should be investigated. After all, why implement TCQ on each drive, if it can be handled more efficiently at the other end by the controller for less money?!

Alex Turner
netEconomist

On 4/19/05, Dave Held <dave.held@arrayservicesgrp.com> wrote:
> > -----Original Message-----
> > From: Alex Turner [mailto:armtuk@gmail.com]
> > Sent: Monday, April 18, 2005 5:50 PM
> > To: Bruce Momjian
> > Cc: Kevin Brown; pgsql-performance@postgresql.org
> > Subject: Re: [PERFORM] How to improve db performance with $7K?
> >
> > Does it really matter at which end of the cable the queueing is done
> > (Assuming both ends know as much about drive geometry etc..)?
> > [...]
>
> The parenthetical is an assumption I'd rather not make. If my
> performance depends on my kernel knowing how my drive is laid
> out, I would always be wondering if a new drive is going to
> break any of the kernel's geometry assumptions. Drive geometry
> doesn't seem like a kernel's business any more than a kernel
> should be able to decode the ccd signal of an optical mouse.
> The kernel should queue requests at a level of abstraction that
> doesn't depend on intimate knowledge of drive geometry, and the
> drive should queue requests on the concrete level where geometry
> matters. A drive shouldn't guess whether a process is trying to
> read a file sequentially, and a kernel shouldn't guess whether
> sector 30 is contiguous with sector 31 or not.
>
> __
> David B. Held
> Software Engineer/Array Services Group
> 200 14th Ave. East, Sartell, MN 56377
> 320.534.3637 320.253.7800 800.752.8129
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: don't forget to increase your free space map settings
>
Alex et al.,

I wonder if that's something to think about adding to Postgresql? A setting for multiblock read count like Oracle (although

|| I would think so, yea. GMTA: I was just having this micro-chat with Mr. Jim Nasby.

having said that I believe that Oracle natively caches pages much more aggressively than postgresql, which allows the OS to do the file caching).

|| Yea...and it can rely on what is likely a lot more robust and nuanced caching algorithm, but...i don't know enough (read: anything) about PG's to back that comment up.

Alex Turner
netEconomist

P.S. Oracle changed this with 9i, you can change the Database block size on a tablespace by tablespace basis, making it smaller for OLTP tablespaces and larger for Warehousing tablespaces (at least I think it's on a tablespace, might be on a whole DB).

||Yes, it's tspace level.

On 4/19/05, Jim C. Nasby <decibel@decibel.org> wrote:
> On Mon, Apr 18, 2005 at 06:41:37PM -0000, Mohan, Ross wrote:
> > Don't you think "optimal stripe width" would be
> > a good question to research the binaries for? I'd
> > think that drives the answer, largely. (uh oh, pun alert)
> >
> > EG, oracle issues IO requests (this may have changed _just_
> > recently) in 64KB chunks, regardless of what you ask for. So when I
> > did my striping (many moons ago, when the Earth was young...) I did
> > it in 128KB widths, and set the oracle "multiblock read count"
> > according. For oracle, any stripe size under 64KB=stupid, anything
> > much over 128K/258K=wasteful.
> >
> > I am eager to find out how PG handles all this.
>
> AFAIK PostgreSQL requests data one database page at a time (normally
> 8k). Of course the OS might do something different.
> --
> Jim C. Nasby, Database Consultant               decibel@decibel.org
> Give your computer some brain candy! www.distributed.net Team #1828
>
> Windows: "Where do you want to go today?"
> Linux: "Where do you want to go tomorrow?"
> FreeBSD: "Are you guys coming, or what?"
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster
>
The Linux kernel is definitely headed this way. The 2.6 kernel allows for several different I/O scheduling algorithms. A brief overview of the different modes:

http://nwc.serverpipeline.com/highend/60400768

Although a much older article from the beta-2.5 days, here is more in-depth info from one of the programmers who developed the AS scheduler and worked on the deadline scheduler:

http://kerneltrap.org/node/657

I think I'm going to start testing the deadline scheduler for our data processing server for a few weeks before trying it on our production servers.

Alex Turner wrote:
> Whilst I admire your purist approach, I would say that if it is
> beneficial to performance that a kernel understand drive geometry,
> then it is worth investigating teaching it how to deal with that!
>
> I was less referrring to the kernel as I was to the controller.
>
> Lets say we invented a new protocol that including the drive telling
> the controller how it was layed out at initialization time so that the
> controller could make better decisions about re-ordering seeks. It
> would be more cost effective to have that set of electronics just once
> in the controller, than 8 times on each drive in an array, which would
> yield better performance to cost ratio. Therefore I would suggest it
> is something that should be investigated. After all, why implemented
> TCQ on each drive, if it can be handled more effeciently at the other
> end by the controller for less money?!
>
> Alex Turner
> netEconomist
>
> On 4/19/05, Dave Held <dave.held@arrayservicesgrp.com> wrote:
>
>>>-----Original Message-----
>>>From: Alex Turner [mailto:armtuk@gmail.com]
>>>Sent: Monday, April 18, 2005 5:50 PM
>>>To: Bruce Momjian
>>>Cc: Kevin Brown; pgsql-performance@postgresql.org
>>>Subject: Re: [PERFORM] How to improve db performance with $7K?
>>>
>>>Does it really matter at which end of the cable the queueing is done
>>>(Assuming both ends know as much about drive geometry etc..)?
>>>[...]
>>
>>The parenthetical is an assumption I'd rather not make. If my
>>performance depends on my kernel knowing how my drive is laid
>>out, I would always be wondering if a new drive is going to
>>break any of the kernel's geometry assumptions. Drive geometry
>>doesn't seem like a kernel's business any more than a kernel
>>should be able to decode the ccd signal of an optical mouse.
>>The kernel should queue requests at a level of abstraction that
>>doesn't depend on intimate knowledge of drive geometry, and the
>>drive should queue requests on the concrete level where geometry
>>matters. A drive shouldn't guess whether a process is trying to
>>read a file sequentially, and a kernel shouldn't guess whether
>>sector 30 is contiguous with sector 31 or not.
>>
>>__
>>David B. Held
>>Software Engineer/Array Services Group
>>200 14th Ave. East, Sartell, MN 56377
>>320.534.3637 320.253.7800 800.752.8129
>>
>>---------------------------(end of broadcast)---------------------------
>>TIP 7: don't forget to increase your free space map settings
>>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
>                http://www.postgresql.org/docs/faq
>
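For anyone who wants to repeat that experiment: on a 2.6 kernel the active I/O scheduler for each block device is exposed through sysfs, so it can be inspected and (as root) switched at run time without rebooting. A minimal sketch in Python; the device name sda is an assumption, substitute the disk(s) your database lives on:

    SYSFS = "/sys/block/sda/queue/scheduler"   # assumed device name

    # Prints the available schedulers with the active one in brackets,
    # e.g. "noop anticipatory deadline [cfq]" on a stock 2.6 kernel.
    with open(SYSFS) as f:
        print(f.read().strip())

    # To switch, write the scheduler name back (root only) -- uncomment to apply:
    # with open(SYSFS, "w") as f:
    #     f.write("deadline")

The same choice can also be made system-wide at boot time with the elevator= kernel parameter (e.g. elevator=deadline).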