Thread: Quad processor options
Hi,

I am curious if there are any real life production quad processor setups running postgresql out there. Since postgresql lacks a proper replication/cluster solution, we have to buy a bigger machine.

Right now we are running on a dual 2.4 Xeon, 3 GB Ram and U160 SCSI hardware-raid 10.

Does anyone have experience with quad Xeon or quad Opteron setups? I am looking at the appropriate boards from Tyan, which would be the only option for us to buy such a beast. The 30k+ setups from Dell etc. don't fit our budget.

I am thinking of the following:

Quad processor (xeon or opteron)
5 x SCSI 15K RPM for Raid 10 + spare drive
2 x IDE for system
ICP-Vortex battery backed U320 Hardware Raid
4-8 GB Ram

Would be nice to hear from you.

Regards,
Bjoern
We use XEON Quads (PowerEdge 6650s) and they work nice, provided you configure the postgres properly. Dell is the cheapest quad you can buy, I think. You shouldn't be paying 30K unless you are getting high CPU-cache on each processor and tons of memory.

I am actually curious, have you researched/attempted any postgresql clustering solutions? I agree, you can't just keep buying bigger machines.

They have 5 internal drives (4 in RAID 10, 1 spare) on U320, 128MB cache on the PERC controller, 8GB RAM.

Thanks,
Anjan
It's very good to understand the specific choke points you're trying to address by upgrading, so you don't get disappointed. Are you truly CPU constrained, or is it memory footprint or IO throughput that makes you want to upgrade?

IMO the best way to begin understanding system choke points is vmstat output. Would you mind forwarding the output of "vmstat 10 120" under a peak load period? (I'm assuming this is Linux or a Unix variant.) A brief description of what is happening during the vmstat sample would help a lot too.
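For reference, a minimal way to capture that sample during the peak window might look like the following (a sketch only: it assumes a Linux box with the sysstat package providing iostat, and the log file names are just examples):

    # 10-second samples for 20 minutes, started shortly before the peak
    vmstat 10 120 > vmstat-peak.log 2>&1 &
    iostat 10 120 > iostat-peak.log 2>&1 &
    # note down what the application is doing during the window,
    # then send both logs along with that description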
On Tue, 11 May 2004, Bjoern Metzdorf wrote:

> Does anyone have experience with quad Xeon or quad Opteron setups? I am
> looking at the appropriate boards from Tyan, which would be the only
> option for us to buy such a beast. The 30k+ setups from Dell etc. don't
> fit our budget.

Well, from what I've read elsewhere on the internet, it would seem the Opterons scale better to 4 CPUs than the basic Xeons do. Of course, the exception to this is SGI's Altix, which uses their own chipset and runs the Itanium with very good memory bandwidth.

But, do you really need more CPU horsepower?

Are you I/O or CPU or memory or memory bandwidth bound? If you're sitting at 99% idle, and iostat says your drives are only running at some small percentage of what you know they could, you might be memory or memory bandwidth limited. Adding two more CPUs will not help with that situation.

If your I/O is saturated, then the answer may well be a better RAID array, with many more drives plugged into it. Do you have any spare drives you can toss on the machine to see if that helps? Sometimes going from 4 drives in a RAID 1+0 to 6 or 8 or more can give a big boost in performance.

In short, don't expect 4 CPUs to solve the problem if the problem isn't really the CPUs being maxed out.

Also, what type of load are you running? Mostly read, mostly written, few connections handling lots of data, lots of connections each handling a little data, lots of transactions, etc...

If you are doing lots of writing, make SURE you have a controller that supports battery backed cache and is configured to write-back, not write-through.
Anjan Dave wrote:
> We use XEON Quads (PowerEdge 6650s) and they work nice,
> provided you configure the postgres properly.
> Dell is the cheapest quad you can buy i think.
> You shouldn't be paying 30K unless you are getting high CPU-cache
> on each processor and tons of memory.

Good to hear. I tried to configure a quad Xeon online here at Dell Germany, but the 6650 is not available for online configuration. At Dell USA it works. I will give them a call tomorrow.

> I am actually curious, have you researched/attempted any
> postgresql clustering solutions?
> I agree, you can't just keep buying bigger machines.

There are many asynchronous, trigger based solutions out there (eRserver etc.), but what we need is basically a master <-> master setup, which does not seem to be available for postgresql any time soon.

Our current dual Xeon runs at 60-70% average cpu load, which is really a lot. I cannot afford any trigger overhead here. This machine is responsible for over 30M page impressions per month, 50 page impressions per second at peak times. The autovacuum daemon is a godsend :)

I'm curious how the recently announced MySQL cluster will perform, although it is not an option for us. postgresql has far superior functionality.

> They have 5 internal drives (4 in RAID 10, 1 spare) on U320,
> 128MB cache on the PERC controller, 8GB RAM.

Could you tell me what you paid approximately for this setup? How does it perform? It certainly won't be twice as fast as a dual Xeon, but I remember benchmarking a quad P3 Xeon some time ago, and it was disappointingly slow...

Regards,
Bjoern
Did you mean to say the trigger-based clustering solution is loading the dual CPUs 60-70% right now?

Performance will not be linear with more processors, but it does help with more processes. We haven't benchmarked it, but we haven't had any problems so far in terms of performance either.

Price would vary with your relation/yearly purchase, etc., but a 6650 with 2.0GHz/1MB cache/8GB memory, RAID card, drives, etc., should definitely cost you less than 20K USD.

-anjan
scott.marlowe wrote:

> Well, from what I've read elsewhere on the internet, it would seem the
> Opterons scale better to 4 CPUs than the basic Xeons do. Of course, the
> exception to this is SGI's Altix, which uses their own chipset and runs
> the Itanium with very good memory bandwidth.

This is basically what I read too. But I cannot spend money on a quad opteron just for testing purposes :)

> But, do you really need more CPU horsepower?
>
> Are you I/O or CPU or memory or memory bandwidth bound? If you're sitting
> at 99% idle, and iostat says your drives are only running at some small
> percentage of what you know they could, you might be memory or memory
> bandwidth limited. Adding two more CPUs will not help with that
> situation.

Right now we have a dual Xeon 2.4, 3 GB Ram, Mylex Extremeraid controller, running 2 Compaq BD018122C0, 1 Seagate ST318203LC and 1 Quantum ATLAS_V_18_SCA.

iostat shows between 20 and 60 % user avg-cpu. And this is not even peak time.

I attached a "vmstat 10 120" output for perhaps 60-70% peak load.

> If your I/O is saturated, then the answer may well be a better RAID
> array, with many more drives plugged into it. Do you have any spare
> drives you can toss on the machine to see if that helps? Sometimes going
> from 4 drives in a RAID 1+0 to 6 or 8 or more can give a big boost in
> performance.

Next drives I'll buy will certainly be 15k scsi drives.

> In short, don't expect 4 CPUs to solve the problem if the problem isn't
> really the CPUs being maxed out.
>
> Also, what type of load are you running? Mostly read, mostly written, few
> connections handling lots of data, lots of connections each handling a
> little data, lots of transactions, etc...

In peak times we can get up to 700-800 connections at the same time. There are quite some updates involved; without having exact numbers I think we have about 70% selects and 30% updates/inserts.

> If you are doing lots of writing, make SURE you have a controller that
> supports battery backed cache and is configured to write-back, not
> write-through.

Could you recommend a certain controller type? The only battery backed one that I found on the net is the newest model from icp-vortex.com.

Regards,
Bjoern

~# vmstat 10 120
   procs                      memory     swap           io      system          cpu
 r  b  w   swpd   free   buff   cache   si  so    bi    bo    in    cs   us  sy  id
 1  1  0  24180  10584  32468 2332208    0   1     0     2     1     2    2   0   0
 0  2  0  24564  10480  27812 2313528    8   0  7506   574  1199  8674   30   7  63
 2  1  0  24692  10060  23636 2259176    0  18  8099   298  2074  6328   25   7  68
 2  0  0  24584  18576  21056 2299804    3   6 13208   305  1598  8700   23   6  71
 1 21  1  24504  16588  20912 2309468    4   0  1442  1107   754  6874   42  13  45
 6  1  0  24632  13148  19992 2319400    0   0  2627   499  1184  9633   37   6  58
 5  1  0  24488  10912  19292 2330080    5   0  3404   150  1466 10206   32   6  61
 4  1  0  24488  12180  18824 2342280    3   0  2934    40  1052  3866   19   3  78
 0  0  0  24420  14776  19412 2347232    6   0   403   216  1123  4702   22   3  74
 0  0  0  24548  14408  17380 2321780    4   0   522   715   965  6336   25   5  71
 4  0  0  24676  12504  17756 2322988    0   0   564   830   883  7066   31   6  63
 0  3  0  24676  14060  18232 2325224    0   0   483   388  1097  3401   21   3  76
 0  2  1  24676  13044  18700 2322948    0   0   701   195  1078  5187   23   3  74
 2  0  0  24676  21576  18752 2328168    0   0   467   177  1552  3574   18   3  78
On Tue, 11 May 2004, Bjoern Metzdorf wrote:

> scott.marlowe wrote:
>
> > Well, from what I've read elsewhere on the internet, it would seem the
> > Opterons scale better to 4 CPUs than the basic Xeons do. Of course, the
> > exception to this is SGI's Altix, which uses their own chipset and runs
> > the Itanium with very good memory bandwidth.
>
> This is basically what I read too. But I cannot spend money on a quad
> opteron just for testing purposes :)

Wouldn't it be nice to just have a lab full of these things?

> > If your I/O is saturated, then the answer may well be a better RAID
> > array, with many more drives plugged into it. Do you have any spare
> > drives you can toss on the machine to see if that helps? Sometimes going
> > from 4 drives in a RAID 1+0 to 6 or 8 or more can give a big boost in
> > performance.
>
> Next drives I'll buy will certainly be 15k scsi drives.

Better to buy more 10k drives than fewer 15k drives. Other than slightly faster select times, the 15ks aren't really any faster.

> > In short, don't expect 4 CPUs to solve the problem if the problem isn't
> > really the CPUs being maxed out.
> >
> > Also, what type of load are you running? Mostly read, mostly written, few
> > connections handling lots of data, lots of connections each handling a
> > little data, lots of transactions, etc...
>
> In peak times we can get up to 700-800 connections at the same time.
> There are quite some updates involved; without having exact numbers I
> think we have about 70% selects and 30% updates/inserts.

Wow, a lot of writes then.

> > If you are doing lots of writing, make SURE you have a controller that
> > supports battery backed cache and is configured to write-back, not
> > write-through.
>
> Could you recommend a certain controller type? The only battery backed
> one that I found on the net is the newest model from icp-vortex.com.

Sure, adaptec makes one, so does lsi megaraid. Dell resells both of these, the PERC3DI and the PERC3DC are adaptec, then lsi, in that order, I believe. We run the lsi megaraid with 64 megs battery backed cache. Intel also makes one, but I've heard nothing about it.

If you get the LSI megaraid, make sure you're running the latest megaraid 2 driver, not the older, slower 1.18 series. If you are running Linux, look for the dkms packaged version. dkms (Dynamic Kernel Module System) automagically compiles and installs source rpms for drivers when you install them, and configures the machine to use them at boot. Most drivers seem to be slowly headed that way in the Linux universe, and I really like the simplicity and power of dkms.

I haven't directly tested anything but the adaptec and the lsi megaraid. Here at work we've had massive issues trying to get the adaptec cards configured and installed, while the megaraid was a snap. Installed RH, installed the dkms rpm, installed the dkms enabled megaraid driver and rebooted. Literally, that's all it took.
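For what it's worth, a quick way to check which megaraid driver generation a box is actually running (a sketch; the module may show up as "megaraid" or "megaraid2" depending on how the v2 driver was packaged):

    lsmod | grep -i megaraid
    modinfo megaraid 2>/dev/null | head
    # the driver usually logs its own version, and the card's firmware
    # revision, when it loads:
    dmesg | grep -i megaraid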
Paul Tuckfield wrote:
> Would you mind forwarding the output of "vmstat 10 120" under a peak
> load period? (I'm assuming this is Linux or a Unix variant.) A brief
> description of what is happening during the vmstat sample would help a
> lot too.

See my other mail. We are running Linux, kernel 2.4. As soon as the next Debian version comes out, I'll happily switch to 2.6 :)

Regards,
Bjoern
Anjan Dave wrote:
> Did you mean to say the trigger-based clustering solution
> is loading the dual CPUs 60-70% right now?

No, this is without any triggers involved.

> Performance will not be linear with more processors,
> but it does help with more processes.
> We haven't benchmarked it, but we haven't had any
> problems so far in terms of performance either.

From the number of processes point of view, we can certainly saturate a quad setup :)

> Price would vary with your relation/yearly purchase, etc,
> but a 6650 with 2.0GHz/1MB cache/8GB Memory, RAID card,
> drives, etc, should definitely cost you less than 20K USD.

Which is still quite a lot. Does anyone have experience with a self-built quad Xeon using the Tyan Thunder board?

Regards,
Bjoern
scott.marlowe wrote:

>> Next drives I'll buy will certainly be 15k scsi drives.
>
> Better to buy more 10k drives than fewer 15k drives. Other than slightly
> faster select times, the 15ks aren't really any faster.

Good to know. I'll remember that.

>> In peak times we can get up to 700-800 connections at the same time.
>> There are quite some updates involved; without having exact numbers I
>> think we have about 70% selects and 30% updates/inserts.
>
> Wow, a lot of writes then.

Yes, it could certainly also be only 15-20% updates/inserts, but that is still not negligible.

> Sure, adaptec makes one, so does lsi megaraid. Dell resells both of
> these, the PERC3DI and the PERC3DC are adaptec, then lsi, in that order, I
> believe. We run the lsi megaraid with 64 megs battery backed cache.

The LSI sounds good.

> Intel also makes one, but I've heard nothing about it.

It could well be the ICP Vortex one; ICP was bought by Intel some time ago.

> I haven't directly tested anything but the adaptec and the lsi megaraid.
> Here at work we've had massive issues trying to get the adaptec cards
> configured and installed, while the megaraid was a snap. Installed RH,
> installed the dkms rpm, installed the dkms enabled megaraid driver and
> rebooted. Literally, that's all it took.

I didn't hear anything about dkms for Debian, so I will be hand-patching as usual :)

Regards,
Bjoern
On Tue, 11 May 2004, Bjoern Metzdorf wrote:

> scott.marlowe wrote:
> > Sure, adaptec makes one, so does lsi megaraid. Dell resells both of
> > these, the PERC3DI and the PERC3DC are adaptec, then lsi, in that order, I
> > believe. We run the lsi megaraid with 64 megs battery backed cache.
>
> The LSI sounds good.
>
> > Intel also makes one, but I've heard nothing about it.
>
> It could well be the ICP Vortex one; ICP was bought by Intel some time ago.

Also, there are bigger, faster external RAID boxes as well that make the internal cards seem puny. They're nice because all you need in your main box is a good U320 controller to plug into the external RAID array. That URL I mentioned earlier that had prices has some of the external boxes listed. No price, not for sale on the web, get out the checkbook and write a blank check is my guess. I.e. they're not cheap.

The other nice thing about the LSI cards is that you can install >1 and they act like one big RAID array. I.e. install two cards with a 20 drive RAID0, then make a RAID1 across them, and if one or the other card itself fails, you've still got 100% of your data sitting there. Nice to know you can survive the complete failure of one half of your chain.

> > I haven't directly tested anything but the adaptec and the lsi megaraid.
> > Here at work we've had massive issues trying to get the adaptec cards
> > configured and installed, while the megaraid was a snap. Installed RH,
> > installed the dkms rpm, installed the dkms enabled megaraid driver and
> > rebooted. Literally, that's all it took.
>
> I didn't hear anything about dkms for Debian, so I will be hand-patching
> as usual :)

Yeah, it seems to be an RPM kind of thing. But I'm thinking the 2.0 drivers got included in the latest 2.6 kernels, so no biggie. I was looking around on Google, and it definitely appears the 2.x and 1.x megaraid drivers were merged into a "unified" driver in the 2.6 kernel.
-----Original Message-----
From: pgsql-performance-owner@postgresql.org
[mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Bjoern Metzdorf
Sent: Tuesday, May 11, 2004 3:11 PM
To: scott.marlowe
Cc: pgsql-performance@postgresql.org; Pgsql-Admin (E-mail)
Subject: Re: [PERFORM] Quad processor options

-------------------------

Personally I would stay away from anything Intel over 2 processors. I have done some research, and if memory serves it goes something like this: Intel's architecture makes each processor compete for bandwidth on the bus to the RAM. AMD differs in that each processor has its own bus to the RAM. Don't take this as god's honest fact, but just keep it in mind when considering a Xeon solution; it may be worth your time to do some deeper research into this. There is some material on this here:

http://www4.tomshardware.com/cpu/20030422/

Rob
On 2004-05-11T15:29:46-0600, scott.marlowe wrote:
> The other nice thing about the LSI cards is that you can install >1 and
> they act like one big RAID array. i.e. install two cards with a 20 drive
> RAID0 then make a RAID1 across them, and if one or the other card itself
> fails, you've still got 100% of your data sitting there. Nice to know you
> can survive the complete failure of one half of your chain.

... unless that dying controller corrupted your file system. Depending on your tolerance for risk, you may not want to operate for long with a file system in an unknown state.

Btw, the Intel and LSI Logic RAID controller cards have suspiciously similar specifications, so I would not be surprised if one is an OEM version of the other.

/Allan
--
Allan Wind
P.O. Box 2022
Woburn, MA 01888-0022
USA
On Tue, 2004-05-11 at 12:06, Bjoern Metzdorf wrote:
> Does anyone have experience with quad Xeon or quad Opteron setups? I am
> looking at the appropriate boards from Tyan, which would be the only
> option for us to buy such a beast. The 30k+ setups from Dell etc. don't
> fit our budget.
>
> I am thinking of the following:
>
> Quad processor (xeon or opteron)
> 5 x SCSI 15K RPM for Raid 10 + spare drive
> 2 x IDE for system
> ICP-Vortex battery backed U320 Hardware Raid
> 4-8 GB Ram

Just to add my two cents to the fray:

We use dual Opterons around here and prefer them to the Xeons for database servers. As others have pointed out, the Opteron systems will scale well to more than two processors, unlike the Xeon. I know a couple of people with quad Opterons and it apparently scales very nicely, unlike quad Xeons which don't give you much more. On some supercomputing hardware lists I'm on, they seem to be of the opinion that the current Opteron fabric won't really show saturation until you have 6-8 CPUs connected to it.

Like the other folks said, skip the 15k drives. Those will only give you a marginal improvement for an integer factor price increase over 10k drives. Instead spend your money on a nice RAID controller with a fat cache and a backup battery, and maybe some extra spindles for your array. I personally like the LSI MegaRAID 320-2, which I always max out to 256MB of cache RAM and the required battery. A maxed out LSI 320-2 should set you back <$1k. Properly configured, you will notice large improvements in the performance of your disk subsystem, especially if you have a lot of writing going on.

I would recommend getting the Opterons, and spending the relatively modest amount of money to get a nice RAID controller with a large write-back cache while sticking with 10k drives. Depending on precisely how you configure it, this should cost you no more than $10-12k. We just built a very similar configuration, but with dual Opterons on an HDAMA motherboard rather than a quad Tyan, and it cost <$6k inclusive of everything. Add the money for 4 of the 8xx processors and the Tyan quad motherboard, and the sum comes out to a very reasonable number for what you are getting.

j. andrew rogers
I'm confused why you say the system is 70% busy: the vmstat output shows 70% *idle*.

The vmstat you sent shows good things and ambiguous things:

- si and so are zero, so you're not paging/swapping. That's always step 1; you're fine.

- bi and bo (physical IO) show pretty high numbers for how many disks you have (assuming random IO), so please send an "iostat 10" sampling during peak.

- Note that the CPU is only 30% busy. That should mean that adding CPUs will *not* help.

- The "cache" column shows that Linux is using 2.3G for cache (way too much). You generally want to give memory to postgres to keep it "close" to the user, not leave it unused to be claimed by the Linux cache (you need to leave *some* for Linux, though).

My recommendations:

- I'll bet you have a low value for shared buffers, like 10000. On your 3G system you should ramp up the value to at least 1G (125000 8k buffers) unless something else runs on the system. It's best not to do things too drastically, so if I'm right and you sit at 10000 now, try going to 30000, then 60000, then 125000 or above. (The arithmetic behind those numbers is sketched below.)

- If the above is off base, then I wonder why we see high run queue numbers in spite of over 60% idle CPU. Maybe some serialization is happening somewhere. Also, depending on how you've laid out your 4 disk drives, you may see all IOs going to one drive; the 7M/sec is on the high side if that's the case. iostat numbers will reveal whether it's skewed, and whether it's random, though Linux iostat doesn't seem to report response times (sigh). Response times are the golden metric when diagnosing IO throughput in an OLTP / striped situation.
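The buffer arithmetic above, and the kernel shared memory ceiling it implies, can be sketched like this (a rough sketch only; PostgreSQL needs somewhat more shared memory than the buffer pool alone, so the shmmax figure leaves headroom):

    # 1 GB of shared buffers at PostgreSQL's default 8 kB block size:
    echo $(( 1024 * 1024 / 8 ))          # = 131072 buffers (~125000 as above)
    # kernel.shmmax has to cover shared_buffers * 8192 plus overhead:
    echo $(( 131072 * 8192 ))            # = 1073741824 bytes
    sysctl -w kernel.shmmax=1200000000   # and persist it in /etc/sysctl.conf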
On Tue, 11 May 2004, Allan Wind wrote:

> On 2004-05-11T15:29:46-0600, scott.marlowe wrote:
> > The other nice thing about the LSI cards is that you can install >1 and
> > they act like one big RAID array. i.e. install two cards with a 20 drive
> > RAID0 then make a RAID1 across them, and if one or the other card itself
> > fails, you've still got 100% of your data sitting there. Nice to know you
> > can survive the complete failure of one half of your chain.
>
> ... unless that dying controller corrupted your file system. Depending
> on your tolerance for risk, you may not want to operate for long with a
> file system in an unknown state.

It would have to be the primary controller for that to happen. The way the LSIs work is that you disable the BIOS on the 2nd to 4th cards, and the first card, with the active BIOS, acts as the primary controller. In this case, that means the main card is doing the RAID1 work, then handing off the data to the subordinate cards. The subordinate cards do all their own RAID0 work.

mobo ---controller 1--<array1 of disks in RAID0
     |--controller 2--<array2 of disks in RAID0

and whichever controller fails just kind of disappears. Note that if it is the master controller, then you'll have to shut down and enable the BIOS on one of the secondary (now primary) controllers. So while it's possible for the master card failing to corrupt the RAID1 set, it's still a more reliable system than with just one card. But nothing is 100% reliable, sadly.

> Btw, the Intel and LSI Logic RAID controller cards have suspiciously
> similar specifications, so I would not be surprised if one is an OEM
> version of the other.

Hmmm. I'll take a closer look.
On Tue, 11 May 2004, Bjoern Metzdorf wrote:

> I am curious if there are any real life production quad processor setups
> running postgresql out there. Since postgresql lacks a proper
> replication/cluster solution, we have to buy a bigger machine.

Do you run the latest version of PG? I've read the thread but have not seen any information about the pg version. All I've seen was a reference to Debian, which might just as well mean that you run pg 7.2 (probably not, but I have to ask). Some classes of queries run much faster in pg 7.4 than in older versions, so if you are lucky that can help.

--
/Dennis Björklund
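For what it's worth, a quick way to confirm the running server version (a sketch; it assumes local access as the database superuser, and template1 is simply a database that always exists):

    psql -U postgres -d template1 -t -c 'SELECT version();'
    # the installed client tools also report their own version:
    psql --version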
...and on Tue, May 11, 2004 at 03:02:24PM -0600, scott.marlowe used the keyboard:
>
> If you get the LSI megaraid, make sure you're running the latest megaraid
> 2 driver, not the older, slower 1.18 series. If you are running Linux,
> look for the dkms packaged version. dkms (Dynamic Kernel Module System)
> automagically compiles and installs source rpms for drivers when you
> install them, and configures the machine to use them at boot. Most
> drivers seem to be slowly headed that way in the Linux universe, and I
> really like the simplicity and power of dkms.

Hi,

Given the fact that the LSI MegaRAID seems to be a popular solution around here, and many of you folx use Linux as well, I thought sharing this piece of info might be of use.

Running the v2 megaraid driver on a 2.4 kernel is actually not a good idea _at_ _all_, as it will silently corrupt your data in the event of a disk failure.

Sorry to have to say so, but we tested it (on kernels up to 2.4.25, not sure about 2.4.26 yet) and it turns out it doesn't do hotswap the way it should.

Somehow the replaced disk drives are not _really_ added to the array, which continues to work in degraded mode for a while and (even worse than that) then starts to think the replaced disk is in order without actually having resynced it, thus beginning to issue writes to non-existent areas of it.

The 2.6 megaraid driver indeed seems to be a merged version of the above driver and the old one, giving both improved performance and correct functionality in the event of a hotswap taking place.

Hope this helped,
--
Grega Bremec
Senior Administrator
Noviforum Ltd., Software & Media
http://www.noviforum.si/
BM> See my other mail.
BM> We are running Linux, kernel 2.4. As soon as the next Debian version
BM> comes out, I'll happily switch to 2.6 :)

It's very simple to use 2.6 with the testing version, but if you like woody you can simply install several packages from testing or backports.org. If you care about performance you should use the latest version of the postgresql server - it can be installed from testing or backports.org too (but postgresql from testing depends on many other testing packages).

I think if you upgrade an existing system you can use backports.org for the newest packages; if you install a new one, use testing - it can be used on production servers today.
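As a rough sketch of what that looks like on a woody box (the sources.list line, and whether testing or backports.org is preferable, are assumptions to adapt; pulling from testing can drag in many testing dependencies, as noted above):

    # add a testing source alongside the existing woody entries
    echo 'deb http://ftp.debian.org/debian testing main' >> /etc/apt/sources.list
    apt-get update
    # install just postgresql (and whatever it depends on) from testing
    apt-get -t testing install postgresql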
On Tue, 11 May 2004 15:46:25 -0700, Paul Tuckfield <paul@tuckfield.com> wrote:

> - the "cache" column shows that linux is using 2.3G for cache. (way too much)

There is no such thing as "way too much cache".

> you generally want to give memory to postgres to keep it "close" to the user,

Yes, but only a moderate amount of memory.

> not leave it unused to be claimed by linux cache

Cache is not unused memory.

> - I'll bet you have a low value for shared buffers, like 10000. On your 3G system
>   you should ramp up the value to at least 1G (125000 8k buffers)

In most cases this is almost the worst thing you can do. The only thing even worse would be setting it to 1.5 G.

Postgres is just happy with a moderate shared_buffers setting. We usually recommend something like 10000. You could try 20000, but don't increase it beyond that without strong evidence that it helps in your particular case.

This has been discussed several times here, on -hackers and on -general. Search the archives for more information.

Servus
 Manfred
On Wed, 12 May 2004, Grega Bremec wrote:

> Running the v2 megaraid driver on a 2.4 kernel is actually not a good idea
> _at_ _all_, as it will silently corrupt your data in the event of a disk
> failure.
>
> Somehow the replaced disk drives are not _really_ added to the array, which
> continues to work in degraded mode for a while and (even worse than that)
> then starts to think the replaced disk is in order without actually having
> resynced it, thus beginning to issue writes to non-existent areas of it.

This doesn't make any sense to me, since the hot swapping is handled by the card autonomously. I also tested it with a hot spare and pulled one drive, and it worked fine during our acceptance testing.

However, I've got a hot spare machine I can test on, so I'll try it again and see if I can make it fail.

When testing it, was the problem present in certain RAID configurations only, or in one type, or what? I'm curious to try and reproduce this problem, since I've never heard of it before. Also, what firmware version were those megaraid cards? Ours is fairly new, as we got it at the beginning of this year, and I'm wondering if it is a firmware issue.
Hi,

First of all, many thanks for your valuable replies. On my quest for the ultimate hardware platform I'll try to summarize the things I learned.

-------------------------------------------------------------

This is our current setup:

Hardware:
Dual Xeon DP 2.4 on a TYAN S2722-533 with HT enabled
3 GB Ram (2 x 1 GB + 2 x 512 MB)
Mylex Extremeraid Controller U160 running RAID 10 with 4 x 18 GB SCSI 10K RPM, no other drives involved (system, pgdata and wal are all on the same volume).

Software:
Debian 3.0 Woody
Postgresql 7.4.1 (selfcompiled, no special optimizations)
Kernel 2.4.22 + fixes

Database specs:
Size of a gzipped -9 full dump is roughly 1 gb
70-80% selects, 20-30% updates (roughly estimated)
up to 700-800 connections during peak times
kernel.shmall = 805306368
kernel.shmmax = 805306368
max_connections = 900
shared_buffers = 20000
sort_mem = 16384
checkpoint_segments = 6
statistics collector is enabled (for pg_autovacuum)

Loads:
We are experiencing average CPU loads of up to 70% during peak hours. As Paul Tuckfield correctly pointed out, my vmstat output didn't support this. That output was not taken during peak times; it was freshly grabbed when I wrote my initial mail and resembles perhaps 50-60% of peak time load (30% cpu usage). iostat does not give results about disk usage, I don't know exactly why - the blk_read/wrtn columns are just empty (perhaps due to the Mylex rd driver, I don't know).

-------------------------------------------------------------

Suggestions and solutions given:

Anjan Dave reported that he is pretty confident with his quad Xeon setups, which will cost less than $20K at Dell with a reasonable hardware setup (Dell 6650 with 2.0GHz/1MB cache/8GB memory, 5 internal drives (4 in RAID 10, 1 spare) on U320, 128MB cache on the PERC controller).

Scott Marlowe pointed out that one should consider more than 4 drives (6 to 8; 10K rpm is enough, 15K is a rip-off) for a RAID 10 setup, because that can boost performance quite a lot. One should also be using a battery backed raid controller. Scott has good experiences with the LSI Megaraid single channel controller, which is reasonably priced at ~$500. He also stated that 20-30% writes on a database is quite a lot.

Next, Rob Sell told us about his research on more-than-2-way Intel based systems. The memory bandwidth on the Xeon platform is always shared between the CPUs. While a 2-way Xeon may perform quite well, a 4-way system will suffer due to the reduced memory bandwidth available for each processor.

J. Andrew Rogers supports this. He said that 4-way Opteron systems scale much better than a 4-way Xeon system; scaling limits begin at 6-8 CPUs on the Opteron platform. He also says that a fully equipped dual channel LSI Megaraid 320 with 256MB cache RAM will be less than $1K. A complete 4-way Opteron system will be at $10K-$12K.

Paul Tuckfield then gave the suggestion to bump up my shared_buffers. With a 3GB memory system, I could happily be using 1GB for shared buffers (125000). This was questioned by Andrew McMillian, Manfred Kolzar and Halford Dace, who say that common tuning advice limits reasonable settings to 10000-20000 shared buffers, because the OS is better at caching than the database.

-------------------------------------------------------------

Conclusion:

After having read some comparisons between n-way Xeon and Opteron systems:

http://www.anandtech.com/IT/showdoc.html?i=1982
http://www.aceshardware.com/read.jsp?id=60000275

I was given the impression that an Opteron system is the way to go.
This is what I am considering the ultimate platform for postgresql:

Hardware:
Tyan Thunder K8QS board
2-4 x Opteron 848 in NUMA mode
4-8 GB RAM (DDR400 ECC Registered 1 GB modules, 2 for each processor)
LSI Megaraid 320-2 with 256 MB cache ram and battery backup
6 x 36GB SCSI 10K drives + 1 spare running in RAID 10, split over both channels (3 + 4) for pgdata including indexes and wal.
2 x 80 GB S-ATA IDE for system, running linux software raid 1 or available onboard hardware raid (perhaps also 2 x 36 GB SCSI)

Software:
Debian Woody in amd64 biarch mode, or perhaps Redhat/SuSE Enterprise 64bit distributions.
Kernel 2.6
Postgres 7.4.2 in 64bit mode
shared_buffers = 20000
a bumped up effective_cache_size (a rough way to estimate it is sketched below)

Now the only problem left (besides my budget) is the availability of such a system.

I have found some vendors which ship similar systems, so I will have to talk to them about my dream configuration. I will not build this system myself, there are too many obstacles.

I expect this system to come out at about 12-15K Euro. Very optimistic, I know :)

These are the vendors I found up to now:

http://www.appro.com/product/server_4144h.asp
http://www.appro.com/product/server_4145h.asp
http://www.pyramid.de/d/builttosuit/server/4opteron.shtml
http://www.rainbow-it.co.uk/productslist.aspx?CategoryID=4&selection=2
http://www.quadopteron.com/

They all seem to sell more or less the same system. I also found some other vendors which build systems on Celestica or AMD boards, but they are way too expensive.

Buying such a machine is worth some careful thought. If budget is a limit and such a machine might not be maxed out during the next few months, it would make more sense to go for a slightly slower system and an upgrade when more power is needed.

Thanks again for all your replies. I hope to have given a somewhat clear summary.

Regards,
Bjoern
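A rough way to put a number on that effective_cache_size (a sketch only: in 7.4 the setting is expressed in 8 kB disk pages, and the "cached" column of free is only an approximation of what the kernel will actually keep available for postgres):

    # convert the kernel page cache size reported by free (in kB) into 8 kB pages
    free | awk '/^Mem:/ { printf "effective_cache_size = %d\n", $7 / 8 }'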
This is something I wish more of us did on the lists. The list archives have solutions and workarounds for every variety of problem, but very few summary emails exist.

A good example of this practice is on the sun-managers mailing list. The original poster sends a "SUMMARY" reply to the list with the original problem included and all solutions found. It also makes searching the list archives easier.

Simply a suggestion for us all, including myself.

Greg

Bjoern Metzdorf wrote:
> at first, many thanks for your valuable replies. On my quest for the
> ultimate hardware platform I'll try to summarize the things I learned.

--
Greg Spiegelberg
Product Development Manager
Cranel, Incorporated.
Phone: 614.318.4314
Fax:   614.431.8388
Email: gspiegelberg@cranel.com
Technology. Integrity. Focus.
V-Solve!
Bjoern Metzdorf wrote:
> This is what I am considering the ultimate platform for postgresql:
>
> Hardware:
> Tyan Thunder K8QS board
> 2-4 x Opteron 848 in NUMA mode
> 4-8 GB RAM (DDR400 ECC Registered 1 GB modules, 2 for each processor)
> LSI Megaraid 320-2 with 256 MB cache ram and battery backup
> 6 x 36GB SCSI 10K drives + 1 spare running in RAID 10, split over both
> channels (3 + 4) for pgdata including indexes and wal.

You might also consider configuring the Postgres data drives for a RAID 10 SAME configuration as described in the Oracle paper "Optimal Storage Configuration Made Easy" (http://otn.oracle.com/deploy/availability/pdf/oow2000_same.pdf). Has anyone delved into this before?

--
James Thornton
______________________________________________________
Internet Business Consultant, http://jamesthornton.com
James Thornton wrote:

>> This is what I am considering the ultimate platform for postgresql:
>>
>> Hardware:
>> Tyan Thunder K8QS board
>> 2-4 x Opteron 848 in NUMA mode
>> 4-8 GB RAM (DDR400 ECC Registered 1 GB modules, 2 for each processor)
>> LSI Megaraid 320-2 with 256 MB cache ram and battery backup
>> 6 x 36GB SCSI 10K drives + 1 spare running in RAID 10, split over both
>> channels (3 + 4) for pgdata including indexes and wal.
>
> You might also consider configuring the Postgres data drives for a RAID
> 10 SAME configuration as described in the Oracle paper "Optimal Storage
> Configuration Made Easy"
> (http://otn.oracle.com/deploy/availability/pdf/oow2000_same.pdf). Has
> anyone delved into this before?

Ok, if I understand it correctly the paper recommends the following:

1. Get many drives and stripe them into a RAID0 with a stripe width of 1MB. I am not quite sure if this stripe width is to be controlled at the application level (does postgres support this?) or if e.g. the "chunk size" of the Linux software RAID driver is meant. Normally a chunk size of 4KB is recommended, so 1MB sounds fairly large. (A sketch of the chunk-size interpretation follows below.)

2. Mirror your RAID0 and get a RAID10.

3. Use primarily the fast, outer regions of your disks. In practice this might be achieved by putting only half of the disk (the outer half) into your stripe set. E.g. put only the outer 18GB of your 36GB disks into the stripe set. Btw, is it common for all drives that the outer region is on the higher block numbers? Or is it sometimes on the lower block numbers?

4. Subset data by partition, not disk. If you have 8 disks, then don't take a 4 disk RAID10 for data and the other one for log or indexes, but make a global 8 drive RAID10 and partition it so that data and log + indexes are located on all drives.

They say - which is very interesting, as it is really contrary to what is normally recommended - that it is as good or better to have one big stripe set over all available disks than to put log + indexes on a separate stripe set. Having one big stripe set means that the speed of this big stripe set is available to all data. In practice this setup is as fast as or even faster than the "old" approach.

----------------------------------------------------------------

Bottom line for a normal, less than 10 disk setup:

Get many disks (8 + spare), create a RAID0 with 4 disks and mirror it to the other 4 disks for a RAID10. Make sure to create the RAID on the outer half of the disks (the setup may depend on the disk model and raid controller used), leaving the inner half empty. Use a logical volume manager (LVM), which always helps when adding disk space, and create 2 partitions on your RAID10: one for data and one for log + indexes. This should look like this:

-----  -----  -----  -----
| 1 |  | 1 |  | 1 |  | 1 |
-----  -----  -----  -----   <- outer, faster half of the disk,
| 2 |  | 2 |  | 2 |  | 2 |      part of the RAID10
-----  -----  -----  -----
|   |  |   |  |   |  |   |
|   |  |   |  |   |  |   |   <- inner, slower half of the disk,
|   |  |   |  |   |  |   |      not used at all
-----  -----  -----  -----

Partition 1 for data, partition 2 for log + indexes. All mirrored to the other 4 disks not shown.

If you take 36GB disks, this should end up like this:

RAID10 has a size of 36 / 2 * 4 = 72GB
Partition 1 is 36 GB
Partition 2 is 36 GB

If 36GB is not enough for your pgdata set, you might consider moving to 72GB disks, or (even better) make a 16 drive RAID10 out of 36GB disks, which both will end up in a size of 72GB for your data (but the 16 drive version will be faster).

Any comments?

Regards,
Bjoern
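As a concrete reading of points 1 and 2 for Linux software RAID (a sketch only: the device names are hypothetical, each /dev/sdX1 would be the outer-half partition from point 3, and mdadm's --chunk option takes kilobytes, so 1024 means a 1 MB stripe depth):

    # mirror pairs first, then stripe across the mirrors (RAID 1+0)
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
    mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1
    mdadm --create /dev/md2 --level=0 --chunk=1024 --raid-devices=2 /dev/md0 /dev/md1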
Bjoern Metzdorf wrote: >> You might also consider configuring the Postgres data drives for a >> RAID 10 SAME configuration as described in the Oracle paper "Optimal >> Storage Configuration Made Easy" >> (http://otn.oracle.com/deploy/availability/pdf/oow2000_same.pdf). Has >> anyone delved into this before? > > Ok, if I understand it correctly the papers recommends the following: > > 1. Get many drives and stripe them into a RAID0 with a stripe width of > 1MB. I am not quite sure if this stripe width is to be controlled at the > application level (does postgres support this?) or if e.g. the "chunk > size" of the linux software driver is meant. Normally a chunk size of > 4KB is recommended, so 1MB sounds fairly large. > > 2. Mirror your RAID0 and get a RAID10. Don't use RAID 0+1 -- use RAID 1+0 instead. Performance is the same, but if a disk fails in a RAID 0+1 configuration, you are left with a RAID 0 array. In a RAID 1+0 configuration, multiple disks can fail. A few weeks ago I called LSI asking about the Dell PERC4-Di card, which is actually an LSI Megaraid 320-2. Dell's documentation said that its support for RAID 10 was in the form of RAID-1 concatenated, but LSI said that this is incorrect and that it supports RAID 10 proper. > 3. Use primarily the fast, outer regions of your disks. In practice this > might be achieved by putting only half of the disk (the outer half) into > your stripe set. E.g. put only the outer 18GB of your 36GB disks into > the stripe set. You can still use the inner-half of the drives, just relegate it to less-frequently accessed data. You also need to consider the filesystem. SGI and IBM did a detailed study on Linux filesystem performance, which included XFS, ext2, ext3 (various modes), ReiserFS, and JFS, and the results are presented in a paper entitled "Filesystem Performance and Scalability in Linux 2.4.17" (http://oss.sgi.com/projects/xfs/papers/filesystem-perf-tm.pdf). The scaling and load are key factors when selecting a filesystem. Since Postgres data is stored in large files, ReiserFS is not the ideal choice since it has been optimized for small files. XFS is probably the best choice for a database server running on a quad processor box. However, Dr. Bert Scalzo of Quest argues that general file system benchmarks aren't ideal for benchmarking a filesystem for a database server. In a paper entitled "Tuning an Oracle8i Database running Linux" (http://otn.oracle.com/oramag/webcolumns/2002/techarticles/scalzo_linux02.html), he says, "The trouble with these tests-for example, Bonnie, Bonnie++, Dbench, Iobench, Iozone, Mongo, and Postmark-is that they are basic file system throughput tests, so their results generally do not pertain in any meaningful fashion to the way relational database systems access data files." Instead he suggests using these two well-known and widely accepted database benchmarks: * AS3AP: a scalable, portable ANSI SQL relational database benchmark that provides a comprehensive set of tests of database-processing power; has built-in scalability and portability for testing a broad range of systems; minimizes human effort in implementing and running benchmark tests; and provides a uniform, metric, straightforward interpretation of the results. * TPC-C: an online transaction processing (OLTP) benchmark that involves a mix of five concurrent transactions of various types and either executes completely online or queries for deferred execution. The database comprises nine types of tables, having a wide range of record and population sizes. 
This benchmark measures the number of transactions per second. In the paper, Scalzo benchmarks ext2, ext3, ReiserFS, and JFS, but not XFS. Surprisingly, ext3 won, but Scalzo didn't address scaling/load. The results are surprising because most think ext3 is just ext2 with journaling, and thus carries extra overhead from journaling. If you read papers on ext3, you'll discover that it has some optimizations that reduce disk head movement. For example, Daniel Robbins' "Advanced filesystem implementor's guide, Part 7: Introducing ext3" (http://www-106.ibm.com/developerworks/library/l-fs7/) says:

"The approach that the [ext3 Journaling Block Device layer API] uses is called physical journaling, which means that the JBD uses complete physical blocks as the underlying currency for implementing the journal...the use of full blocks allows ext3 to perform some additional optimizations, such as "squishing" multiple pending IO operations within a single block into the same in-memory data structure. This, in turn, allows ext3 to write these multiple changes to disk in a single write operation, rather than many. In addition, because the literal block data is stored in memory, little or no massaging of the in-memory data is required before writing it to disk, greatly reducing CPU overhead."

I suspect that fewer writes may be the key factor in ext3 winning Scalzo's DB benchmark. But as I said, Scalzo didn't benchmark XFS and he didn't address scaling. XFS has a feature called delayed allocation that reduces IO (http://www-106.ibm.com/developerworks/library/l-fs9/), and it scales much better than ext3, so while I haven't tested it, I suspect it may be the ideal choice for large Linux DB servers:

"XFS handles allocation by breaking it into a two-step process. First, when XFS receives new data to be written, it records the pending transaction in RAM and simply reserves an appropriate amount of space on the underlying filesystem. However, while XFS reserves space for the new data, it doesn't decide what filesystem blocks will be used to store the data, at least not yet. XFS procrastinates, delaying this decision to the last possible moment, right before this data is actually written to disk. By delaying allocation, XFS gains many opportunities to optimize write performance. When it comes time to write the data to disk, XFS can now allocate free space intelligently, in a way that optimizes filesystem performance. In particular, if a bunch of new data is being appended to a single file, XFS can allocate a single, contiguous region on disk to store this data. If XFS hadn't delayed its allocation decision, it may have unknowingly written the data into multiple non-contiguous chunks, reducing write performance significantly. But, because XFS delayed its allocation decision, it was able to write the data in one fell swoop, improving write performance as well as reducing overall filesystem fragmentation. Delayed allocation also has another performance benefit. In situations where many short-lived temporary files are created, XFS may never need to write these files to disk at all. Since no blocks are ever allocated, there's no need to deallocate any blocks, and the underlying filesystem metadata doesn't even get touched."

For further study, I have compiled a list of Linux filesystem resources at: http://jamesthornton.com/hotlist/linux-filesystems/.

-- James Thornton ______________________________________________________ Internet Business Consultant, http://jamesthornton.com
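If XFS is the route taken, a minimal sketch of creating and mounting it for a Postgres volume might look like this. The logical volume name and mount point are just the made-up examples from the earlier sketch, and the options shown are starting points to test, not recommendations from the papers above:

  # A larger journal helps metadata-heavy workloads.
  mkfs.xfs -l size=64m /dev/pgvg/pgdata

  # noatime avoids an extra metadata write for every read; logbufs trades a
  # little RAM for fewer journal IOs.
  mount -o noatime,logbufs=8 /dev/pgvg/pgdata /var/lib/pgsql/data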
Hadley Willan wrote: > To answer question 1, if you use software raid the chunk size is part of > the /etc/raidtab file that is used on initial container creation. 4KB is > the standard, and a LARGE chunk size of 1MB may affect performance if > you're not writing down to blocks in that size continuously. If you > make it too big and you're constantly needing to write out smaller chunks > of information, then you will find the disk "always" working, which would > be an inefficient use of the blocks. There is some free info around > about calculating the ideal chunk size. Look for "Calculating chunk > size for RAID" on Google.

The SAME paper addresses the stripe width question directly: "Why does the SAME configuration recommend a one megabyte stripe width? Let’s examine the reasoning behind this choice. Why not use a stripe depth smaller than one megabyte? Smaller stripe depths can improve disk throughput for a single process by spreading a single IO across multiple disks. However IOs that are much smaller than a megabyte can cause seek time to become a large fraction of the total IO time. Therefore, the overall efficiency of the storage system is reduced. In some cases it may be worth trading off some efficiency for the increased throughput that smaller stripe depths provide. In general it is not necessary to do this though. Parallel execution at database level achieves high disk throughput while keeping efficiency high. Also, remember that the degree of parallelism can be dynamically tuned, whereas the stripe depth is very costly to change. Why not use a stripe depth bigger than one megabyte? One megabyte is large enough that a sequential scan will spend most of its time transferring data instead of positioning the disk head. A bigger stripe depth will improve scan efficiency but only modestly. One megabyte is small enough that a large IO operation will not “hog” a single disk for very long before moving to the next one. Further, one megabyte is small enough that Oracle’s asynchronous readahead operations access multiple disks. One megabyte is also small enough that a single stripe unit will not become a hot-spot. Any access hot-spot that is smaller than a megabyte should fit comfortably in the database buffer cache. Therefore it will not create a hot-spot on disk."

The paper also says that to ensure large IO operations aren't broken up between the DB and the disk, the database's multi-block read count (Oracle has a param called db_file_multiblock_read_count -- does Postgres?) should match the stripe width, and the OS IO limits should be at least this size. Also, it says, "Ideally we would like to stripe the log files using the same one megabyte stripe width as the rest of the files. However, the log files are written sequentially, and many storage systems limit the maximum size of a single write operation to one megabyte (or even less). If the maximum write size is limited, then using a one megabyte stripe width for the log files may not work well. In this case, a smaller stripe width such as 64K may work better. Caching RAID controllers are an exception to this. If the storage subsystem can cache write operations in nonvolatile RAM, then a one megabyte stripe width will work well for the log files. In this case, the write operation will be buffered in cache and the next log writes can be issued before the previous write is destaged to disk."

-- James Thornton ______________________________________________________ Internet Business Consultant, http://jamesthornton.com
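As far as I know, Postgres has no direct equivalent of db_file_multiblock_read_count -- it reads in 8k blocks and leaves coalescing and readahead to the OS -- but the OS-side limits the paper mentions can at least be inspected. A sketch, assuming a 2.6 kernel and the hypothetical /dev/md0 from earlier:

  # Chunk size the array was built with.
  mdadm --detail /dev/md0 | grep -i chunk

  # Largest single IO (in KB) the block layer will issue to a disk; if this is
  # far below the stripe unit, large sequential IOs get split before they ever
  # reach a single drive.
  cat /sys/block/sda/queue/max_sectors_kb
  cat /sys/block/sda/queue/max_hw_sectors_kb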
One big caveat re. the "SAME" striping strategy is that readahead can really hurt in an OLTP environment. Mind you, if you're going from a few disks to a caching array with many disks, it'll be hard not to see a big improvement. But if you push the envelope of the array with a "SAME" configuration, readahead will hurt. Readahead is good for sequential reads but bad for random reads, because the various caches (array and filesystem) get flooded with all the blocks that happen to come after whatever random blocks you're reading. Because they're random reads, these extra blocks are generally *not* read by subsequent queries if the database is large enough to be much larger than the cache itself. Of course, the readahead blocks are good if you're doing sequential scans, but you're not doing sequential scans because it's an OLTP database, right? So this'll probably incite flames, but: in an OLTP environment of decent size, readahead is bad. The ideal would be to adjust it dynamically until optimal (likely no readahead) if the array allows it, but most people are fooled by the good performance of readahead on simple single-threaded or small-dataset tests, and get bitten by this under concurrent loads or large datasets. A sketch of checking and adjusting Linux readahead follows after the quoted message below. James Thornton wrote: > >>> This is what I am considering the ultimate platform for postgresql: >>> >>> Hardware: >>> Tyan Thunder K8QS board >>> 2-4 x Opteron 848 in NUMA mode >>> 4-8 GB RAM (DDR400 ECC Registered 1 GB modules, 2 for each processor) >>> LSI Megaraid 320-2 with 256 MB cache ram and battery backup >>> 6 x 36GB SCSI 10K drives + 1 spare running in RAID 10, split over >>> both channels (3 + 4) for pgdata including indexes and wal. >> You might also consider configuring the Postgres data drives for a >> RAID 10 SAME configuration as described in the Oracle paper "Optimal >> Storage Configuration Made Easy" >> (http://otn.oracle.com/deploy/availability/pdf/oow2000_same.pdf). Has >> anyone delved into this before? > > Ok, if I understand it correctly the papers recommends the following: > > 1. Get many drives and stripe them into a RAID0 with a stripe width of > 1MB. I am not quite sure if this stripe width is to be controlled at > the application level (does postgres support this?) or if e.g. the > "chunk size" of the linux software driver is meant. Normally a chunk > size of 4KB is recommended, so 1MB sounds fairly large. > > 2. Mirror your RAID0 and get a RAID10. > > 3. Use primarily the fast, outer regions of your disks. In practice > this might be achieved by putting only half of the disk (the outer > half) into your stripe set. E.g. put only the outer 18GB of your 36GB > disks into the stripe set. Btw, is it common for all drives that the > outer region is on the higher block numbers? Or is it sometimes on the > lower block numbers? > > 4. Subset data by partition, not disk. If you have 8 disks, then don't > take a 4 disk RAID10 for data and the other one for log or indexes, > but make a global 8 drive RAID10 and have it partitioned the way that > data and log + indexes are located on all drives. > > They say, which is very interesting, as it is really contrary to what > is normally recommended, that it is good or better to have one big > stripe set over all disks available, than to put log + indexes on a > separated stripe set. Having one big stripe set means that the speed > of this big stripe set is available to all data. In practice this > setup is as fast as or even faster than the "old" approach.
> > ---------------------------------------------------------------- > > Bottom line for a normal, less than 10 disk setup: > > Get many disks (8 + spare), create a RAID0 with 4 disks and mirror it > to the other 4 disks for a RAID10. Make sure to create the RAID on the > outer half of the disks (setup may depend on the disk model and raid > controller used), leaving the inner half empty. > Use a logical volume manager (LVM), which always helps when adding > disk space, and create 2 partitions on your RAID10. One for data and > one for log + indexes. This should look like this:
>
> ----- ----- ----- -----
> | 1 | | 1 | | 1 | | 1 |
> ----- ----- ----- -----   <- outer, faster half of the disk
> | 2 | | 2 | | 2 | | 2 |      part of the RAID10
> ----- ----- ----- -----
> |   | |   | |   | |   |
> |   | |   | |   | |   |   <- inner, slower half of the disk
> |   | |   | |   | |   |      not used at all
> ----- ----- ----- -----
>
> Partition 1 for data, partition 2 for log + indexes. All mirrored to > the other 4 disks not shown. > > If you take 36GB disks, this should end up like this: > > RAID10 has size of 36 / 2 * 4 = 72GB > Partition 1 is 36 GB > Partition 2 is 36 GB > > If 36GB is not enough for your pgdata set, you might consider moving > to 72GB disks, or (even better) make a 16 drive RAID10 out of 36GB > disks, which both will end up in a size of 72GB for your data (but the > 16 drive version will be faster). > > Any comments? > > Regards, > Bjoern > > ---------------------------(end of > broadcast)--------------------------- > TIP 3: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly >
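Following up on the readahead point above: on Linux the per-device readahead can be inspected and changed at run time, so it is easy to experiment under a realistic concurrent load rather than trusting a single-stream test. A sketch (the device name is a placeholder):

  # Current readahead, in 512-byte sectors (256 = 128KB).
  blockdev --getra /dev/md0

  # Turn it down for a random-IO OLTP mix ...
  blockdev --setra 64 /dev/md0

  # ... or effectively off, if testing shows the cache pollution hurts.
  blockdev --setra 0 /dev/md0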
On Tue, 2004-05-11 at 15:46 -0700, Paul Tuckfield wrote: > - the "cache" column shows that linux is using 2.3G for cache. (way too > much) you generally want to give memory to postgres to keep it "close" to > the user, not leave it unused to be claimed by linux cache (need to leave > *some* for linux tho) > > My recommendations: > - I'll bet you have a low value for shared buffers, like 10000. On > your 3G system you should ramp up the value to at least 1G (125000 8k buffers) > unless something else runs on the system. It's best to not do things too > drastically, so if I'm right and you sit at 10000 now, try going to > 30000 then 60000 then 125000 or above. Huh? Doesn't this run counter to almost every piece of PostgreSQL performance tuning advice given? I run my own boxes with buffers set to around 10000-20000 and an effective_cache_size = 375000 (8k pages, around 3G). That's working well with PostgreSQL 7.4.2 under Debian "woody" (using Oliver Elphick's backported packages from http://people.debian.org/~elphick/debian/). Regards, Andrew. ------------------------------------------------------------------------- Andrew @ Catalyst .Net .NZ Ltd, PO Box 11-053, Manners St, Wellington WEB: http://catalyst.net.nz/ PHYS: Level 2, 150-154 Willis St DDI: +64(4)916-7201 MOB: +64(21)635-694 OFFICE: +64(4)499-2267 Q: How much does it cost to ride the Unibus? A: 2 bits. -------------------------------------------------------------------------
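For reference, both settings being argued about here are expressed in 8k pages on 7.x, and the arithmetic is easy to check by hand (the figures below are just the ones quoted in this thread, not recommendations):

  #   shared_buffers       = 20000  -> roughly 160 MB managed by Postgres itself
  #   effective_cache_size = 375000 -> tells the planner ~3 GB of OS cache exists
  psql -c "SHOW shared_buffers;"
  psql -c "SHOW effective_cache_size;"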
I would recommend trying out several stripe sizes, and making your own measurements. A while ago I was involved in building a data warehouse system (Oracle, DB2) and after several file and db benchmark exercises we used 256K stripes, as these gave the best overall performance results for both systems. I am not saying "1M is wrong", but I am saying "1M may not be right" :-) regards Mark Bjoern Metzdorf wrote: > > 1. Get many drives and stripe them into a RAID0 with a stripe width of > 1MB. I am not quite sure if this stripe width is to be controlled at > the application level (does postgres support this?) or if e.g. the > "chunk size" of the linux software driver is meant. Normally a chunk > size of 4KB is recommended, so 1MB sounds fairly large. > >
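Seconding the "measure it yourself" advice, a crude sketch of such an exercise with Linux software RAID is below. Device names and sizes are invented, and a streaming dd is only a first pass -- the winning stripe size should ultimately be chosen with something close to the real workload, e.g. pgbench or a restore of the production database:

  for chunk in 64 256 1024; do
      mdadm --create /dev/md0 --level=10 --chunk=$chunk --raid-devices=8 \
            --run /dev/sd[b-i]1
      # Ideally wait for the initial resync (watch /proc/mdstat) before timing.
      mkfs.xfs -f /dev/md0
      mount /dev/md0 /mnt/test

      # Simple sequential write test; repeat with reads and with concurrent
      # random IO before drawing conclusions.
      time sh -c 'dd if=/dev/zero of=/mnt/test/bigfile bs=1M count=4096; sync'

      umount /mnt/test
      mdadm --stop /dev/md0
  done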
After reading the replies to this, it is clear that this is a Lintel-centric question, but I will throw in my experience. > I am curious if there are any real life production > quad processor setups running postgresql out there. Yes. We are running a 24/7 operation on a quad CPU Sun V880. > Since postgresql lacks a proper replication/cluster > solution, we have to buy a bigger machine. This was a compelling reason for us to stick with SPARC and avoid Intel/AMD when picking a DB server. We moved off of an IBM mainframe in 1993 to Sun gear and never looked back. We can upgrade to our heart's content with minimal disruption and are only on our third box in 11 years, with plenty of life left in our current one. > Right now we are running on a dual 2.4 Xeon, 3 GB Ram > and U160 SCSI hardware-raid 10. A couple of people mentioned hardware RAID, which I completely agree with. I prefer an external box with a SCSI or FC connector; there are no driver issues that way. We boot from our arrays. The Nexsan ATABoy2 is a nice blend of performance, reliability and cost. Some of these with 1TB and 2TB of space were recently spotted on ebay for under $5k. We run a VERY random i/o mix on ours and it consistently sustains 15 MB/s of blended read and write i/o at well over 1200 io/s. These are IDE drives, which fail more often than SCSI, so run RAID1 or RAID5; the cache on these pretty much eliminates the RAID5 penalties. > The 30k+ setups from Dell etc. don't fit our budget. For that kind of money you could get a lower end Sun box (or IBM RS/6000, I would imagine) and give yourself an astounding amount of headroom for future growth. Sincerely, Marty