Thread: Quad processor options
Hi,

I am curious whether there are any real-life production quad processor setups running PostgreSQL out there. Since PostgreSQL lacks a proper replication/cluster solution, we have to buy a bigger machine.

Right now we are running on a dual 2.4 GHz Xeon, 3 GB RAM and U160 SCSI hardware RAID 10.

Does anyone have experience with quad Xeon or quad Opteron setups? I am looking at the appropriate boards from Tyan, which would be the only option for us to buy such a beast. The 30k+ setups from Dell etc. don't fit our budget.

I am thinking of the following:

Quad processor (Xeon or Opteron)
5 x SCSI 15K RPM for RAID 10 + spare drive
2 x IDE for system
ICP-Vortex battery-backed U320 hardware RAID
4-8 GB RAM

Would be nice to hear from you.

Regards,
Bjoern
It's very good to understand the specific choke points you're trying to address by upgrading, so you don't get disappointed. Are you truly CPU constrained, or is it memory footprint or I/O throughput that makes you want to upgrade? IMO the best way to begin understanding system choke points is vmstat output. Would you mind forwarding the output of "vmstat 10 120" during a peak load period? (I'm assuming this is Linux or a Unix variant.) A brief description of what is happening during the vmstat sample would help a lot too. > I am curious if there are any real life production quad processor > setups running postgresql out there. Since postgresql lacks a proper > replication/cluster solution, we have to buy a bigger machine. > > Right now we are running on a dual 2.4 Xeon, 3 GB Ram and U160 SCSI > hardware-raid 10. > > Has anyone experiences with quad Xeon or quad Opteron setups? I am > looking at the appropriate boards from Tyan, which would be the only > option for us to buy such a beast. The 30k+ setups from Dell etc. > don't fit our budget. > > I am thinking of the following: > > Quad processor (xeon or opteron) > 5 x SCSI 15K RPM for Raid 10 + spare drive > 2 x IDE for system > ICP-Vortex battery backed U320 Hardware Raid > 4-8 GB Ram > > Would be nice to hear from you. > > Regards, > Bjoern
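For reference, a minimal way to collect what Paul is asking for during the peak window might look like the following sketch (iostat comes from the sysstat package; on some 2.4 setups the extended -x statistics stay empty, in which case plain "iostat 10" still shows aggregate numbers; file names are just examples):

    # sample every 10 seconds for 20 minutes during peak load
    vmstat 10 120    > vmstat-peak.txt &
    iostat -x 10 120 > iostat-peak.txt &
    # note down what the application is doing while the samples run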
On Tue, 11 May 2004, Bjoern Metzdorf wrote: > Hi, > > I am curious if there are any real life production quad processor setups > running postgresql out there. Since postgresql lacks a proper > replication/cluster solution, we have to buy a bigger machine. > > Right now we are running on a dual 2.4 Xeon, 3 GB Ram and U160 SCSI > hardware-raid 10. > > Has anyone experiences with quad Xeon or quad Opteron setups? I am > looking at the appropriate boards from Tyan, which would be the only > option for us to buy such a beast. The 30k+ setups from Dell etc. don't > fit our budget. > > I am thinking of the following: > > Quad processor (xeon or opteron) > 5 x SCSI 15K RPM for Raid 10 + spare drive > 2 x IDE for system > ICP-Vortex battery backed U320 Hardware Raid > 4-8 GB Ram Well, from what I've read elsewhere on the internet, it would seem the Opterons scale better to 4 CPUs than the basic Xeons do. Of course, the exception to this is SGI's altix, which uses their own chipset and runs the itanium with very good memory bandwidth. But, do you really need more CPU horsepower? Are you I/O or CPU or memory or memory bandwidth bound? If you're sitting at 99% idle, and iostat says your drives are only running at some small percentage of what you know they could, you might be memory or memory bandwidth limited. Adding two more CPUs will not help with that situation. If your I/O is saturated, then the answer may well be a better RAID array, with many more drives plugged into it. Do you have any spare drives you can toss on the machine to see if that helps? Sometimes going from 4 drives in a RAID 1+0 to 6 or 8 or more can give a big boost in performance. In short, don't expect 4 CPUs to solve the problem if the problem isn't really the CPUs being maxed out. Also, what type of load are you running? Mostly read, mostly written, few connections handling lots of data, lots of connections each handling a little data, lots of transactions, etc... If you are doing lots of writing, make SURE you have a controller that supports battery backed cache and is configured to write-back, not write-through.
scott.marlowe wrote: > Well, from what I've read elsewhere on the internet, it would seem the > Opterons scale better to 4 CPUs than the basic Xeons do. Of course, the > exception to this is SGI's altix, which uses their own chipset and runs > the itanium with very good memory bandwidth. This is basically what I read too. But I cannot spend money on a quad opteron just for testing purposes :) > But, do you really need more CPU horsepower? > > Are you I/O or CPU or memory or memory bandwidth bound? If you're sitting > at 99% idle, and iostat says your drives are only running at some small > percentage of what you know they could, you might be memory or memory > bandwidth limited. Adding two more CPUs will not help with that > situation. Right now we have a dual xeon 2.4, 3 GB Ram, Mylex extremeraid controller, running 2 Compaq BD018122C0, 1 Seagate ST318203LC and 1 Quantum ATLAS_V_18_SCA. iostat shows between 20 and 60 % user avg-cpu. And this is not even peak time. I attached a "vmstat 10 120" output for perhaps 60-70% peak load. > If your I/O is saturated, then the answer may well be a better RAID > array, with many more drives plugged into it. Do you have any spare > drives you can toss on the machine to see if that helps? Sometimes going > from 4 drives in a RAID 1+0 to 6 or 8 or more can give a big boost in > performance. Next drives I'll buy will certainly be 15k scsi drives. > In short, don't expect 4 CPUs to solve the problem if the problem isn't > really the CPUs being maxed out. > > Also, what type of load are you running? Mostly read, mostly written, few > connections handling lots of data, lots of connections each handling a > little data, lots of transactions, etc... In peak times we can get up to 700-800 connections at the same time. There are quite a lot of updates involved; without having exact numbers, I think that we have about 70% selects and 30% updates/inserts. > If you are doing lots of writing, make SURE you have a controller that > supports battery backed cache and is configured to write-back, not > write-through. Could you recommend a certain controller type? The only battery backed one that I found on the net is the newest model from icp-vortex.com. Regards, Bjoern

~# vmstat 10 120
   procs                      memory      swap          io     system         cpu
 r  b  w   swpd   free   buff    cache  si  so    bi    bo    in     cs  us sy id
 1  1  0  24180  10584  32468  2332208   0   1     0     2     1      2   2  0  0
 0  2  0  24564  10480  27812  2313528   8   0  7506   574  1199   8674  30  7 63
 2  1  0  24692  10060  23636  2259176   0  18  8099   298  2074   6328  25  7 68
 2  0  0  24584  18576  21056  2299804   3   6 13208   305  1598   8700  23  6 71
 1 21  1  24504  16588  20912  2309468   4   0  1442  1107   754   6874  42 13 45
 6  1  0  24632  13148  19992  2319400   0   0  2627   499  1184   9633  37  6 58
 5  1  0  24488  10912  19292  2330080   5   0  3404   150  1466  10206  32  6 61
 4  1  0  24488  12180  18824  2342280   3   0  2934    40  1052   3866  19  3 78
 0  0  0  24420  14776  19412  2347232   6   0   403   216  1123   4702  22  3 74
 0  0  0  24548  14408  17380  2321780   4   0   522   715   965   6336  25  5 71
 4  0  0  24676  12504  17756  2322988   0   0   564   830   883   7066  31  6 63
 0  3  0  24676  14060  18232  2325224   0   0   483   388  1097   3401  21  3 76
 0  2  1  24676  13044  18700  2322948   0   0   701   195  1078   5187  23  3 74
 2  0  0  24676  21576  18752  2328168   0   0   467   177  1552   3574  18  3 78
Paul Tuckfield wrote: > Would you mind forwarding the output of "vmstat 10 120" under peak load > period? (I'm assuming this is a Linux or Unix variant) a brief > description of what is happening during the vmstat sample would help a > lot too. See my other mail. We are running Linux, Kernel 2.4. As soon as the next Debian version comes out, I'll happily switch to 2.6 :) Regards, Bjoern
On Tue, 11 May 2004, Bjoern Metzdorf wrote: > scott.marlowe wrote: > > > Well, from what I've read elsewhere on the internet, it would seem the > > Opterons scale better to 4 CPUs than the basic Xeons do. Of course, the > > exception to this is SGI's altix, which uses their own chipset and runs > > the itanium with very good memory bandwidth. > > This is basically what I read too. But I cannot spent money on a quad > opteron just for testing purposes :) Wouldn't it be nice to just have a lab full of these things? > > If your I/O is saturated, then the answer may well be a better RAID > > array, with many more drives plugged into it. Do you have any spare > > drives you can toss on the machine to see if that helps? Sometimes going > > from 4 drives in a RAID 1+0 to 6 or 8 or more can give a big boost in > > performance. > > Next drives I'll buy will certainly be 15k scsi drives. Better to buy more 10k drives than fewer 15k drives. Other than slightly faster select times, the 15ks aren't really any faster. > > In short, don't expect 4 CPUs to solve the problem if the problem isn't > > really the CPUs being maxed out. > > > > Also, what type of load are you running? Mostly read, mostly written, few > > connections handling lots of data, lots of connections each handling a > > little data, lots of transactions, etc... > > In peak times we can get up to 700-800 connections at the same time. > There are quite some updates involved, without having exact numbers I'll > think that we have about 70% selects and 30% updates/inserts. Wow, a lot of writes then. > > If you are doing lots of writing, make SURE you have a controller that > > supports battery backed cache and is configured to write-back, not > > write-through. > > Could you recommend a certain controller type? The only battery backed > one that I found on the net is the newest model from icp-vortex.com. Sure, Adaptec makes one, and so does LSI. Dell resells both of these; the PERC3DI and the PERC3DC are Adaptec and LSI respectively, I believe. We run the LSI MegaRAID with 64 MB of battery-backed cache. Intel also makes one, but I've heard nothing about it. If you get the LSI MegaRAID, make sure you're running the latest megaraid 2 driver, not the older, slower 1.18 series. If you are running Linux, look for the DKMS-packaged version. DKMS (Dynamic Kernel Module Support) automagically compiles and installs source RPMs for drivers when you install them, and configures the machine to use them at boot. Most drivers seem to be slowly headed that way in the Linux universe, and I really like the simplicity and power of DKMS. I haven't directly tested anything but the Adaptec and the LSI MegaRAID. Here at work we've had massive issues trying to get the Adaptec cards configured and installed, while the MegaRAID was a snap: installed RH, installed the dkms RPM, installed the dkms-enabled megaraid driver and rebooted. Literally, that's all it took.
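A quick way to check which megaraid driver series a box is actually running (a sketch; module names vary between driver generations, so adjust accordingly):

    lsmod | grep -i megaraid                    # is the module loaded, and under which name?
    modinfo megaraid 2>/dev/null | grep -i version
    dkms status                                 # lists DKMS-managed modules, if dkms is installed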
scott.marlowe wrote: >>Next drives I'll buy will certainly be 15k scsi drives. > > Better to buy more 10k drives than fewer 15k drives. Other than slightly > faster select times, the 15ks aren't really any faster. Good to know. I'll remember that. >>In peak times we can get up to 700-800 connections at the same time. >>There are quite some updates involved, without having exact numbers I'll >>think that we have about 70% selects and 30% updates/inserts. > > Wow, a lot of writes then. Yes, it could also be only 15-20% updates/inserts, but that is still not negligible. > Sure, adaptec makes one, so does lsi megaraid. Dell resells both of > these, the PERC3DI and the PERC3DC are adaptec, then lsi in that order, I > believe. We run the lsi megaraid with 64 megs battery backed cache. The LSI sounds good. > Intel also makes one, but I've heard nothing about it. It could well be the ICP Vortex one; ICP was bought by Intel some time ago. > I haven't directly tested anything but the adaptec and the lsi megaraid. > Here at work we've had massive issues trying to get the adaptec cards > configured and installed on, while the megaraid was a snap. Installed RH, > installed the dkms rpm, installed the dkms enabled megaraid driver and > rebooted. Literally, that's all it took. I haven't heard anything about DKMS for Debian, so I will be hand-patching as usual :) Regards, Bjoern
On Tue, 11 May 2004, Bjoern Metzdorf wrote: > scott.marlowe wrote: > > Sure, adaptec makes one, so does lsi megaraid. Dell resells both of > > these, the PERC3DI and the PERC3DC are adaptec, then lsi in that order, I > > believe. We run the lsi megaraid with 64 megs battery backed cache. > > The LSI sounds good. > > > Intel also makes one, but I've heard nothing about it. > > It could well be the ICP Vortex one, ICP was bought by Intel some time ago.. There are also bigger, faster external RAID boxes that make the internal cards seem puny. They're nice because all you need in your main box is a good U320 controller to plug into the external RAID array. That URL I mentioned earlier that had prices has some of the external boxes listed. No price, not for sale on the web, get out the checkbook and write a blank check is my guess. I.e. they're not cheap. The other nice thing about the LSI cards is that you can install more than one and they act like one big RAID array. I.e. install two cards with a 20 drive RAID0 on each, then make a RAID1 across them, and if one or the other card itself fails, you've still got 100% of your data sitting there. Nice to know you can survive the complete failure of one half of your chain. > > I haven't directly tested anything but the adaptec and the lsi megaraid. > > Here at work we've had massive issues trying to get the adaptec cards > > configured and installed on, while the megaraid was a snap. Installed RH, > > installed the dkms rpm, installed the dkms enabled megaraid driver and > > rebooted. Literally, that's all it took. > > I didn't hear anything about dkms for debian, so I will be hand-patching > as usual :) Yeah, it seems to be an RPM kind of thing. But I'm thinking the 2.0 drivers got included in the latest 2.6 kernels, so no biggie. I was looking around on Google, and it definitely appears the 2.x and 1.x megaraid drivers were merged into a "unified" driver in the 2.6 kernel.
I'm confused why you say the system is 70% busy: the vmstat output shows 70% *idle*.

The vmstat you sent shows good things and ambiguous things:

- si and so are zero, so you're not paging/swapping. That's always step 1; you're fine.
- bi and bo (physical IO) show pretty high numbers for how many disks you have (assuming random IO), so please send an "iostat 10" sampling during peak.
- note that the CPU is only 30% busy. That should mean that adding CPUs will *not* help.
- the "cache" column shows that Linux is using 2.3G for cache (way too much). You generally want to give memory to postgres to keep it "close" to the user, not leave it unused to be claimed by the Linux cache (you need to leave *some* for Linux, though).

My recommendations:

- I'll bet you have a low value for shared buffers, like 10000. On your 3G system you should ramp up the value to at least 1G (125000 8k buffers) unless something else runs on the system. It's best not to do things too drastically, so if I'm right and you sit at 10000 now, try going to 30000, then 60000, then 125000 or above.
- if the above is off base, then I wonder why we see high run-queue numbers in spite of over 60% idle CPU. Maybe some serialization is happening somewhere. Also, depending on how you've laid out your 4 disk drives, you may see all IOs going to one drive; the 7M/sec is on the high side if that's the case. iostat numbers will reveal whether it's skewed and whether it's random, though Linux iostat doesn't seem to report response times (sigh). Response times are the golden metric when diagnosing IO throughput in an OLTP / stripe situation.

On May 11, 2004, at 1:41 PM, Bjoern Metzdorf wrote: > scott.marlowe wrote: > >> Well, from what I've read elsewhere on the internet, it would seem >> the Opterons scale better to 4 CPUs than the basic Xeons do. Of >> course, the exception to this is SGI's altix, which uses their own >> chipset and runs the itanium with very good memory bandwidth. > > This is basically what I read too. But I cannot spent money on a quad > opteron just for testing purposes :) > >> But, do you really need more CPU horsepower? >> Are you I/O or CPU or memory or memory bandwidth bound? If you're >> sitting at 99% idle, and iostat says your drives are only running at >> some small percentage of what you know they could, you might be >> memory or memory bandwidth limited. Adding two more CPUs will not >> help with that situation. > > Right now we have a dual xeon 2.4, 3 GB Ram, Mylex extremeraid > controller, running 2 Compaq BD018122C0, 1 Seagate ST318203LC and 1 > Quantum ATLAS_V_18_SCA. > > iostat show between 20 and 60 % user avg-cpu. And this is not even > peak time. > > I attached a "vmstat 10 120" output for perhaps 60-70% peak load. > >> If your I/O is saturated, then the answer may well be a better RAID >> array, with many more drives plugged into it. Do you have any spare >> drives you can toss on the machine to see if that helps? Sometimes >> going from 4 drives in a RAID 1+0 to 6 or 8 or more can give a big >> boost in performance. > > Next drives I'll buy will certainly be 15k scsi drives. > >> In short, don't expect 4 CPUs to solve the problem if the problem >> isn't really the CPUs being maxed out. >> Also, what type of load are you running? Mostly read, mostly >> written, few connections handling lots of data, lots of connections >> each handling a little data, lots of transactions, etc... > > In peak times we can get up to 700-800 connections at the same time.
> There are quite some updates involved, without having exact numbers > I'll think that we have about 70% selects and 30% updates/inserts. > >> If you are doing lots of writing, make SURE you have a controller >> that supports battery backed cache and is configured to write-back, >> not write-through. > > Could you recommend a certain controller type? The only battery backed > one that I found on the net is the newest model from icp-vortex.com. > > Regards, > Bjoern >
> ~# vmstat 10 120
>    procs                      memory      swap          io     system         cpu
>  r  b  w   swpd   free   buff    cache  si  so    bi    bo    in     cs  us sy id
>  1  1  0  24180  10584  32468  2332208   0   1     0     2     1      2   2  0  0
>  0  2  0  24564  10480  27812  2313528   8   0  7506   574  1199   8674  30  7 63
>  2  1  0  24692  10060  23636  2259176   0  18  8099   298  2074   6328  25  7 68
>  2  0  0  24584  18576  21056  2299804   3   6 13208   305  1598   8700  23  6 71
>  1 21  1  24504  16588  20912  2309468   4   0  1442  1107   754   6874  42 13 45
>  6  1  0  24632  13148  19992  2319400   0   0  2627   499  1184   9633  37  6 58
>  5  1  0  24488  10912  19292  2330080   5   0  3404   150  1466  10206  32  6 61
>  4  1  0  24488  12180  18824  2342280   3   0  2934    40  1052   3866  19  3 78
>  0  0  0  24420  14776  19412  2347232   6   0   403   216  1123   4702  22  3 74
>  0  0  0  24548  14408  17380  2321780   4   0   522   715   965   6336  25  5 71
>  4  0  0  24676  12504  17756  2322988   0   0   564   830   883   7066  31  6 63
>  0  3  0  24676  14060  18232  2325224   0   0   483   388  1097   3401  21  3 76
>  0  2  1  24676  13044  18700  2322948   0   0   701   195  1078   5187  23  3 74
>  2  0  0  24676  21576  18752  2328168   0   0   467   177  1552   3574  18  3 78
On Tue, 11 May 2004, Bjoern Metzdorf wrote: > I am curious if there are any real life production quad processor setups > running postgresql out there. Since postgresql lacks a proper > replication/cluster solution, we have to buy a bigger machine. Do you run the latest version of PG? I've read the thread but have not seen any information about which pg version you use. All I've seen was a reference to Debian, which might just as well mean that you run pg 7.2 (probably not, but I have to ask). Some classes of queries run much faster in pg 7.4 than in older versions, so if you are lucky that can help. -- /Dennis Björklund
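For completeness, the running server version can be confirmed with a one-liner (template1 exists in every installation):

    psql -d template1 -c 'SELECT version();'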
...and on Tue, May 11, 2004 at 03:02:24PM -0600, scott.marlowe used the keyboard: > > If you get the LSI megaraid, make sure you're running the latest megaraid > 2 driver, not the older, slower 1.18 series. If you are running linux, > look for the dkms packaged version. dkms, (Dynamic Kernel Module System) > automagically compiles and installs source rpms for drivers when you > install them, and configures the machine to use them to boot up. Most > drivers seem to be slowly headed that way in the linux universe, and I > really like the simplicity and power of dkms. > Hi, Given the fact that the LSI MegaRAID seems to be a popular solution around here, and many of you folks use Linux as well, I thought sharing this piece of info might be of use. Running the v2 megaraid driver on a 2.4 kernel is actually not a good idea _at_ _all_, as it will silently corrupt your data in the event of a disk failure. Sorry to have to say so, but we tested it (on kernels up to 2.4.25, not sure about 2.4.26 yet) and it turns out it doesn't do hotswap the way it should. Somehow the replaced disk drives are not _really_ added to the array, which continues to work in degraded mode for a while and (even worse than that) then starts to think the replaced disk is in order without actually having resynced it, thus beginning to issue writes to non-existent areas of it. The 2.6 megaraid driver indeed seems to be a merged version of the above driver and the old one, giving both improved performance and correct functionality in the event of a hotswap taking place. Hope this helped, -- Grega Bremec Senior Administrator Noviforum Ltd., Software & Media http://www.noviforum.si/
On Tue, 11 May 2004 15:46:25 -0700, Paul Tuckfield <paul@tuckfield.com> wrote: >- the "cache" column shows that linux is using 2.3G for cache. (way too >much) There is no such thing as "way too much cache". > you generally want to give memory to postgres to keep it "close" to >the user, Yes, but only a moderate amount of memory. > not leave it unused to be claimed by linux cache Cache is not unused memory. >- I'll bet you have a low value for shared buffers, like 10000. On >your 3G system > you should ramp up the value to at least 1G (125000 8k buffers) In most cases this is almost the worst thing you can do. The only thing even worse would be setting it to 1.5 G. Postgres is just happy with a moderate shared_buffers setting. We usually recommend something like 10000. You could try 20000, but don't increase it beyond that without strong evidence that it helps in your particular case. This has been discussed several times here, on -hackers and on -general. Search the archives for more information. Servus Manfred
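As a concrete sketch of the moderate setting described here (values are illustrative, not tuned recommendations):

    # postgresql.conf
    shared_buffers = 20000          # 20000 x 8 kB = roughly 160 MB of shared memory

    # the kernel must allow a shared memory segment at least that big,
    # e.g. in /etc/sysctl.conf (value in bytes, with some headroom)
    kernel.shmmax = 268435456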
On 12 May 2004, at 12:17 PM, Manfred Koizar wrote: > On Tue, 11 May 2004 15:46:25 -0700, Paul Tuckfield <paul@tuckfield.com> > wrote: > >> - I'll bet you have a low value for shared buffers, like 10000. On >> your 3G system >> you should ramp up the value to at least 1G (125000 8k buffers) > > In most cases this is almost the worst thing you can do. The only > thing > even worse would be setting it to 1.5 G. > > Postgres is just happy with a moderate shared_buffers setting. We > usually recommend something like 10000. You could try 20000, but don't > increase it beyond that without strong evidence that it helps in your > particular case. > > This has been discussed several times here, on -hackers and on > -general. > Search the archives for more information. We have definitely found this to be true here. We have some fairly complex queries running on a rather underpowered box (beautiful but steam-driven old Silicon Graphics Challenge DM). We ended up using a very slight increase to shared buffers, but gaining ENORMOUSLY through proper optimisation of queries, appropriate indices and the use of optimizer-bludgeons like "SET ENABLE_SEQSCAN = OFF" Hal
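Such bludgeons are usually best applied per session while comparing plans rather than globally; a sketch of the workflow (database name, table and predicate are purely hypothetical):

    psql -d yourdb <<'SQL'
    EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;
    SET enable_seqscan = off;    -- affects only this session, to compare plans
    EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;
    SQL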
It appears that your CPU is 'slow' while your disk subsystem is 'fast'. I once had such a situation with 15 kRPM drives and a ~500MHz Pentium III. On that system, the best solution was to either increase effective_cache_size or decrease random_page_cost (the latter obviously has to do with the fast disk, the former with the large amount of RAM). In any case, proper optimization of queries always helps. :-) Daniel >>>Halford Dace said: > > On 12 May 2004, at 12:17 PM, Manfred Koizar wrote: > > > On Tue, 11 May 2004 15:46:25 -0700, Paul Tuckfield <paul@tuckfield.com> > > wrote: > > > >> - I'll bet you have a low value for shared buffers, like 10000. On > >> your 3G system > >> you should ramp up the value to at least 1G (125000 8k buffers) > > > > In most cases this is almost the worst thing you can do. The only > > thing > > even worse would be setting it to 1.5 G. > > > > Postgres is just happy with a moderate shared_buffers setting. We > > usually recommend something like 10000. You could try 20000, but don't > > increase it beyond that without strong evidence that it helps in your > > particular case. > > > > This has been discussed several times here, on -hackers and on > > -general. > > Search the archives for more information. > > We have definitely found this to be true here. We have some fairly > complex queries running on a rather underpowered box (beautiful but > steam-driven old Silicon Graphics Challenge DM). We ended up using a > very slight increase to shared buffers, but gaining ENORMOUSLY through > proper optimisation of queries, appropriate indices and the use of > optimizer-bludgeons like "SET ENABLE_SEQSCAN = OFF" > > Hal
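For the box discussed in this thread, that advice might translate into something like the following (illustrative values; measure your own queries before and after):

    # postgresql.conf
    effective_cache_size = 250000   # in 8 kB pages; ~2 GB, roughly the Linux cache seen in the vmstat output
    random_page_cost = 2            # default is 4; lower it when random reads are comparatively cheap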
On Wed, 12 May 2004, Grega Bremec wrote: > ...and on Tue, May 11, 2004 at 03:02:24PM -0600, scott.marlowe used the keyboard: > > > > If you get the LSI megaraid, make sure you're running the latest megaraid > > 2 driver, not the older, slower 1.18 series. If you are running linux, > > look for the dkms packaged version. dkms, (Dynamic Kernel Module System) > > automagically compiles and installs source rpms for drivers when you > > install them, and configures the machine to use them to boot up. Most > > drivers seem to be slowly headed that way in the linux universe, and I > > really like the simplicity and power of dkms. > > > > Hi, > > Given the fact LSI MegaRAID seems to be a popular solution around here, and > many of you folx use Linux as well, I thought sharing this piece of info > might be of use. > > Running v2 megaraid driver on a 2.4 kernel is actually not a good idea _at_ > _all_, as it will silently corrupt your data in the event of a disk failure. > > Sorry to have to say so, but we tested it (on kernels up to 2.4.25, not sure > about 2.4.26 yet) and it comes out it doesn't do hotswap the way it should. > > Somehow the replaced disk drives are not _really_ added to the array, which > continues to work in degraded mode for a while and (even worse than that) > then starts to think the replaced disk is in order without actually having > resynced it, thus beginning to issue writes to non-existant areas of it. > > The 2.6 megaraid driver indeed seems to be a merged version of the above > driver and the old one, giving both improved performance and correct > functionality in the event of a hotswap taking place. This doesn't make any sense to me, since the hot swapping is handled by the card autonomously. I also tested it with a hot spare and pulled one drive and it worked fine during our acceptance testing. However, I've got a hot spare machine I can test on, so I'll try it again and see if I can make it fail. When testing it, was the problem present in certain RAID configurations or only one type or what? I'm curious to try and reproduce this problem, since I've never heard of it before. Also, what firmware version were those megaraid cards running? Ours is fairly new, as we got it at the beginning of this year, and I'm wondering if it is a firmware issue.
Hi,

at first, many thanks for your valuable replies. On my quest for the ultimate hardware platform I'll try to summarize the things I learned.

-------------------------------------------------------------

This is our current setup:

Hardware:
Dual Xeon DP 2.4 on a TYAN S2722-533 with HT enabled
3 GB RAM (2 x 1 GB + 2 x 512 MB)
Mylex Extremeraid Controller U160 running RAID 10 with 4 x 18 GB SCSI 10K RPM, no other drives involved (system, pgdata and wal are all on the same volume).

Software:
Debian 3.0 Woody
Postgresql 7.4.1 (self-compiled, no special optimizations)
Kernel 2.4.22 + fixes

Database specs:
Size of a gzipped -9 full dump is roughly 1 GB
70-80% selects, 20-30% updates (roughly estimated)
up to 700-800 connections during peak times
kernel.shmall = 805306368
kernel.shmmax = 805306368
max_connections = 900
shared_buffers = 20000
sort_mem = 16384
checkpoint_segments = 6
statistics collector is enabled (for pg_autovacuum)

Loads:
We are experiencing average CPU loads of up to 70% during peak hours. As Paul Tuckfield correctly pointed out, my vmstat output didn't support this. That output was not taken during peak times; it was freshly grabbed when I wrote my initial mail and resembles perhaps 50-60% of peak time load (30% CPU usage). iostat does not give results about disk usage, I don't know exactly why; the blk_read/wrtn columns are just empty (perhaps due to the Mylex rd driver, I don't know).

-------------------------------------------------------------

Suggestions and solutions given:

Anjan Dave reported that he is pretty confident with his quad Xeon setups, which will cost less than $20K at Dell with a reasonable hardware setup (Dell 6650 with 2.0GHz/1MB cache/8GB memory, 5 internal drives (4 in RAID 10, 1 spare) on U320, 128MB cache on the PERC controller).

Scott Marlowe pointed out that one should consider more than 4 drives (6 to 8; 10K rpm is enough, 15K is a rip-off) for a RAID 10 setup, because that can boost performance quite a lot. One should also be using a battery-backed RAID controller. Scott has good experiences with the LSI Megaraid single channel controller, which is reasonably priced at ~$500. He also stated that 20-30% writes on a database is quite a lot.

Next, Rob Sell told us about his research on more-than-2-way Intel based systems. The memory bandwidth on the Xeon platform is always shared between the CPUs. While a 2-way Xeon may perform quite well, a 4-way system will suffer due to the reduced memory bandwidth available for each processor.

J. Andrew Roberts supports this. He said that 4-way Opteron systems scale much better than a 4-way Xeon system; scaling limits begin at 6-8 CPUs on the Opteron platform. He also says that a fully equipped dual channel LSI Megaraid 320 with 256MB cache RAM will be less than $1K. A complete 4-way Opteron system will be at $10K-$12K.

Paul Tuckfield then gave the suggestion to bump up my shared_buffers. With a 3GB memory system, I could happily be using 1GB for shared buffers (125000). This was questioned by Andrew McMillian, Manfred Koizar and Halford Dace, who say that common tuning advice limits reasonable settings to 10000-20000 shared buffers, because the OS is better at caching than the database.

-------------------------------------------------------------

Conclusion:

After having read some comparisons between n-way Xeon and Opteron systems:

http://www.anandtech.com/IT/showdoc.html?i=1982
http://www.aceshardware.com/read.jsp?id=60000275

I was given the impression that an Opteron system is the way to go.
This is what I am considering the ultimate platform for postgresql:

Hardware:
Tyan Thunder K8QS board
2-4 x Opteron 848 in NUMA mode
4-8 GB RAM (DDR400 ECC Registered 1 GB modules, 2 for each processor)
LSI Megaraid 320-2 with 256 MB cache RAM and battery backup
6 x 36GB SCSI 10K drives + 1 spare running in RAID 10, split over both channels (3 + 4) for pgdata including indexes and wal.
2 x 80 GB S-ATA IDE for system, running Linux software RAID 1 or the available onboard hardware RAID (perhaps also 2 x 36 GB SCSI)

Software:
Debian Woody in amd64 biarch mode, or perhaps Redhat/SuSE Enterprise 64bit distributions.
Kernel 2.6
Postgres 7.4.2 in 64bit mode
shared_buffers = 20000
a bumped-up effective_cache_size

Now the only problem left (besides my budget) is the availability of such a system.

I have found some vendors which ship similar systems, so I will have to talk to them about my dream configuration. I will not self-build this system, there are too many obstacles.

I expect this system to come in at about 12-15K Euro. Very optimistic, I know :)

These are the vendors I found up to now:

http://www.appro.com/product/server_4144h.asp
http://www.appro.com/product/server_4145h.asp
http://www.pyramid.de/d/builttosuit/server/4opteron.shtml
http://www.rainbow-it.co.uk/productslist.aspx?CategoryID=4&selection=2
http://www.quadopteron.com/

They all seem to sell more or less the same system. I also found some other vendors which build systems on Celestica or AMD boards, but they are way too expensive.

Buying such a machine is worth some good thought. If budget is a limit and such a machine might not be maxed out during the next few months, it would make more sense to go for a slightly slower system and an upgrade when more power is needed.

Thanks again for all your replies. I hope to have given a somewhat clear summary.

Regards,
Bjoern
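A first-pass postgresql.conf for such a box, extrapolated from the settings above, might look like this (illustrative only; every value should be benchmarked against the real workload):

    max_connections      = 900
    shared_buffers       = 20000      # ~160 MB, per the advice earlier in the thread
    sort_mem             = 16384
    checkpoint_segments  = 16         # more WAL headroom for the write-heavy load
    effective_cache_size = 750000     # 8 kB pages; ~6 GB of expected OS cache on an 8 GB box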
This is something I wish more of us did on the lists. The list archives have solutions and workarounds for every variety of problem but very few summary emails exist. A good example of this practice is in the sun-managers mailing list. The original poster sends a "SUMMARY" reply to the list with the original problem included and all solutions found. Also makes searching the list archives easier. Simply a suggestion for us all including myself. Greg Bjoern Metzdorf wrote: > Hi, > > at first, many thanks for your valuable replies. On my quest for the > ultimate hardware platform I'll try to summarize the things I learned. > > ------------------------------------------------------------- > > This is our current setup: > > Hardware: > Dual Xeon DP 2.4 on a TYAN S2722-533 with HT enabled > 3 GB Ram (2 x 1 GB + 2 x 512 MB) > Mylex Extremeraid Controller U160 running RAID 10 with 4 x 18 GB SCSI > 10K RPM, no other drives involved (system, pgdata and wal are all on the > same volume). > > Software: > Debian 3.0 Woody > Postgresql 7.4.1 (selfcompiled, no special optimizations) > Kernel 2.4.22 + fixes > > Database specs: > Size of a gzipped -9 full dump is roughly 1 gb > 70-80% selects, 20-30% updates (roughly estimated) > up to 700-800 connections during peak times > kernel.shmall = 805306368 > kernel.shmmax = 805306368 > max_connections = 900 > shared_buffers = 20000 > sort_mem = 16384 > checkpoint_segments = 6 > statistics collector is enabled (for pg_autovacuum) > > Loads: > We are experiencing average CPU loads of up to 70% during peak hours. As > Paul Tuckfield correctly pointed out, my vmstat output didn't support > this. This output was not taken during peak times, it was freshly > grabbed when I wrote my initial mail. It resembles perhaps 50-60% peak > time load (30% cpu usage). iostat does not give results about disk > usage, I don't know exactly why, the blk_read/wrtn columns are just > empty. (Perhaps due to the Mylex rd driver, I don't know). > > ------------------------------------------------------------- > > Suggestions and solutions given: > > Anjan Dave reported, that he is pretty confident with his Quad Xeon > setups, which will cost less than $20K at Dell with a reasonable > hardware setup. ( Dell 6650 with 2.0GHz/1MB cache/8GB Memory, 5 internal > drives (4 in RAID 10, 1 spare) on U320, 128MB cache on the PERC controller) > > Scott Marlowe pointed out, that one should consider more than 4 drives > (6 to 8, 10K rpm is enough, 15K is rip-off) for a Raid 10 setup, because > that can boost performance quite a lot. One should also be using a > battery backed raid controller. Scott has good experiences with the LSI > Megaraid single channel controller, which is reasonably priced at ~ > $500. He also stated, that 20-30% writes on a database is quite a lot. > > Next Rob Sell told us about his research on more-than-2-way Intel based > systems. The memory bandwidth on the xeon platform is always shared > between the cpus. While a 2way xeon may perform quite well, a 4way > system will be suffering due to the reduced memory bandwith available > for each processor. > > J. Andrew Roberts supports this. He said that 4way opteron systems scale > much better than a 4way xeon system. Scaling limits begin at 6-8 cpus on > the opteron platform. He also says that a fully equipped dual channel > LSI Megaraid 320 with 256MB cache ram will be less that $1K. A complete > 4way opteron system will be at $10K-$12K. > > Paul Tuckfield then gave the suggestion to bump up my shared_buffers.
> With a 3GB memory system, I could happily be using 1GB for shared > buffers (125000). This was questioned by Andrew McMillian, Manfred > Kolzar and Halford Dace, who say that common tuning advices limit > reasonable settings to 10000-20000 shared buffers, because the OS is > better at caching than the database. > > ------------------------------------------------------------- > > Conclusion: > > After having read some comparisons between n-way xeon and opteron systems: > > http://www.anandtech.com/IT/showdoc.html?i=1982 > http://www.aceshardware.com/read.jsp?id=60000275 > > I was given the impression, that an opteron system is the way to go. > > This is what I am considering the ultimate platform for postgresql: > > Hardware: > Tyan Thunder K8QS board > 2-4 x Opteron 848 in NUMA mode > 4-8 GB RAM (DDR400 ECC Registered 1 GB modules, 2 for each processor) > LSI Megaraid 320-2 with 256 MB cache ram and battery backup > 6 x 36GB SCSI 10K drives + 1 spare running in RAID 10, split over both > channels (3 + 4) for pgdata including indexes and wal. > 2 x 80 GB S-ATA IDE for system, running linux software raid 1 or > available onboard hardware raid (perhaps also 2 x 36 GB SCSI) > > Software: > Debian Woody in amd64 biarch mode, or perhaps Redhat/SuSE Enterprise > 64bit distributions. > Kernel 2.6 > Postgres 7.4.2 in 64bit mode > shared_buffers = 20000 > a bumbed up effective_cache_size > > Now the only problem left (besides my budget) is the availability of > such a system. > > I have found some vendors which ship similar systems, so I will have to > talk to them about my dream configuration. I will not self build this > system, there are too many obstacles. > > I expect this system to come out on about 12-15K Euro. Very optimistic, > I know :) > > These are the vendors I found up to now: > > http://www.appro.com/product/server_4144h.asp > http://www.appro.com/product/server_4145h.asp > http://www.pyramid.de/d/builttosuit/server/4opteron.shtml > http://www.rainbow-it.co.uk/productslist.aspx?CategoryID=4&selection=2 > http://www.quadopteron.com/ > > They all seem to sell more or less the same system. I found also some > other vendors which built systems on celestica or amd boards, but they > are way too expensive. > > Buying such a machine is worth some good thoughts. If budget is a limit > and such a machine might not be maxed out during the next few months, it > would make more sense to go for a slightly slower system and an upgrade > when more power is needed. > > Thanks again for all your replies. I hope to have given a somehow clear > summary. > > Regards, > Bjoern -- Greg Spiegelberg Product Development Manager Cranel, Incorporated. Phone: 614.318.4314 Fax: 614.431.8388 Email: gspiegelberg@cranel.com Technology. Integrity. Focus. V-Solve!
Bjoern Metzdorf wrote: > Hi, > > at first, many thanks for your valuable replies. On my quest for the > ultimate hardware platform I'll try to summarize the things I learned. > > ------------------------------------------------------------- > This is what I am considering the ultimate platform for postgresql: > > Hardware: > Tyan Thunder K8QS board > 2-4 x Opteron 848 in NUMA mode > 4-8 GB RAM (DDR400 ECC Registered 1 GB modules, 2 for each processor) > LSI Megaraid 320-2 with 256 MB cache ram and battery backup > 6 x 36GB SCSI 10K drives + 1 spare running in RAID 10, split over both > channels (3 + 4) for pgdata including indexes and wal. You might also consider configuring the Postgres data drives for a RAID 10 SAME configuration as described in the Oracle paper "Optimal Storage Configuration Made Easy" (http://otn.oracle.com/deploy/availability/pdf/oow2000_same.pdf). Has anyone delved into this before? -- James Thornton ______________________________________________________ Internet Business Consultant, http://jamesthornton.com
James Thornton wrote: >> This is what I am considering the ultimate platform for postgresql: >> >> Hardware: >> Tyan Thunder K8QS board >> 2-4 x Opteron 848 in NUMA mode >> 4-8 GB RAM (DDR400 ECC Registered 1 GB modules, 2 for each processor) >> LSI Megaraid 320-2 with 256 MB cache ram and battery backup >> 6 x 36GB SCSI 10K drives + 1 spare running in RAID 10, split over both >> channels (3 + 4) for pgdata including indexes and wal. > > You might also consider configuring the Postgres data drives for a RAID > 10 SAME configuration as described in the Oracle paper "Optimal Storage > Configuration Made Easy" > (http://otn.oracle.com/deploy/availability/pdf/oow2000_same.pdf). Has > anyone delved into this before?

Ok, if I understand it correctly, the paper recommends the following:

1. Get many drives and stripe them into a RAID0 with a stripe width of 1MB. I am not quite sure whether this stripe width is to be controlled at the application level (does postgres support this?) or whether e.g. the "chunk size" of the Linux software RAID driver is meant. Normally a chunk size of 4KB is recommended, so 1MB sounds fairly large.

2. Mirror your RAID0 and get a RAID10.

3. Use primarily the fast, outer regions of your disks. In practice this might be achieved by putting only half of the disk (the outer half) into your stripe set, e.g. put only the outer 18GB of your 36GB disks into the stripe set. Btw, is it common for all drives that the outer region is at the higher block numbers? Or is it sometimes at the lower block numbers?

4. Subset data by partition, not disk. If you have 8 disks, then don't take a 4 disk RAID10 for data and the other one for log or indexes; instead make a global 8 drive RAID10 and partition it so that data and log + indexes are located on all drives. Interestingly, and quite contrary to what is normally recommended, they say it is as good as or better to have one big stripe set over all available disks than to put log + indexes on a separate stripe set. Having one big stripe set means that the speed of this big stripe set is available to all data. In practice this setup is as fast as or even faster than the "old" approach.

----------------------------------------------------------------

Bottom line for a normal, less than 10 disk setup:

Get many disks (8 + spare), create a RAID0 with 4 disks and mirror it to the other 4 disks for a RAID10. Make sure to create the RAID on the outer half of the disks (the setup may depend on the disk model and RAID controller used), leaving the inner half empty. Use a logical volume manager (LVM), which always helps when adding disk space, and create 2 partitions on your RAID10: one for data and one for log + indexes. This should look like this:

 -----   -----   -----   -----
 | 1 |   | 1 |   | 1 |   | 1 |
 -----   -----   -----   -----   <- outer, faster half of the disk,
 | 2 |   | 2 |   | 2 |   | 2 |      part of the RAID10
 -----   -----   -----   -----
 |   |   |   |   |   |   |   |
 |   |   |   |   |   |   |   |   <- inner, slower half of the disk,
 |   |   |   |   |   |   |   |      not used at all
 -----   -----   -----   -----

Partition 1 for data, partition 2 for log + indexes. All mirrored to the other 4 disks, not shown.

If you take 36GB disks, this should end up like this:

RAID10 has a size of 36 / 2 * 4 = 72GB
Partition 1 is 36 GB
Partition 2 is 36 GB

If 36GB is not enough for your pgdata set, you might consider moving to 72GB disks, or (even better) make a 16 drive RAID10 out of 36GB disks; both will end up with 72GB for your data (but the 16 drive version will be faster).

Any comments?

Regards,
Bjoern
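A hedged sketch of how the two partitions could be carved out with LVM once the RAID 10 device exists (all device and volume names here are made up; /dev/md0 stands for the array built from the outer halves):

    pvcreate /dev/md0                     # the RAID 10 built from the outer halves
    vgcreate pgvg /dev/md0
    lvcreate -L 36G -n data pgvg          # partition 1: table data
    lvcreate -L 36G -n logidx pgvg        # partition 2: WAL + indexes
    # filesystems then go on /dev/pgvg/data and /dev/pgvg/logidx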
Bjoern Metzdorf wrote: >> You might also consider configuring the Postgres data drives for a >> RAID 10 SAME configuration as described in the Oracle paper "Optimal >> Storage Configuration Made Easy" >> (http://otn.oracle.com/deploy/availability/pdf/oow2000_same.pdf). Has >> anyone delved into this before? > > Ok, if I understand it correctly the papers recommends the following: > > 1. Get many drives and stripe them into a RAID0 with a stripe width of > 1MB. I am not quite sure if this stripe width is to be controlled at the > application level (does postgres support this?) or if e.g. the "chunk > size" of the linux software driver is meant. Normally a chunk size of > 4KB is recommended, so 1MB sounds fairly large. > > 2. Mirror your RAID0 and get a RAID10. Don't use RAID 0+1 -- use RAID 1+0 instead. Performance is the same, but if a disk fails in a RAID 0+1 configuration, you are left with a RAID 0 array. In a RAID 1+0 configuration, multiple disks can fail. A few weeks ago I called LSI asking about the Dell PERC4-Di card, which is actually an LSI Megaraid 320-2. Dell's documentation said that its support for RAID 10 was in the form of RAID-1 concatenated, but LSI said that this is incorrect and that it supports RAID 10 proper. > 3. Use primarily the fast, outer regions of your disks. In practice this > might be achieved by putting only half of the disk (the outer half) into > your stripe set. E.g. put only the outer 18GB of your 36GB disks into > the stripe set. You can still use the inner-half of the drives, just relegate it to less-frequently accessed data. You also need to consider the filesystem. SGI and IBM did a detailed study on Linux filesystem performance, which included XFS, ext2, ext3 (various modes), ReiserFS, and JFS, and the results are presented in a paper entitled "Filesystem Performance and Scalability in Linux 2.4.17" (http://oss.sgi.com/projects/xfs/papers/filesystem-perf-tm.pdf). The scaling and load are key factors when selecting a filesystem. Since Postgres data is stored in large files, ReiserFS is not the ideal choice since it has been optimized for small files. XFS is probably the best choice for a database server running on a quad processor box. However, Dr. Bert Scalzo of Quest argues that general file system benchmarks aren't ideal for benchmarking a filesystem for a database server. In a paper entitled "Tuning an Oracle8i Database running Linux" (http://otn.oracle.com/oramag/webcolumns/2002/techarticles/scalzo_linux02.html), he says, "The trouble with these tests-for example, Bonnie, Bonnie++, Dbench, Iobench, Iozone, Mongo, and Postmark-is that they are basic file system throughput tests, so their results generally do not pertain in any meaningful fashion to the way relational database systems access data files." Instead he suggests using these two well-known and widely accepted database benchmarks: * AS3AP: a scalable, portable ANSI SQL relational database benchmark that provides a comprehensive set of tests of database-processing power; has built-in scalability and portability for testing a broad range of systems; minimizes human effort in implementing and running benchmark tests; and provides a uniform, metric, straightforward interpretation of the results. * TPC-C: an online transaction processing (OLTP) benchmark that involves a mix of five concurrent transactions of various types and either executes completely online or queries for deferred execution. The database comprises nine types of tables, having a wide range of record and population sizes. 
This benchmark measures the number of transactions per second. In the paper, Scalzo benchmarks ext2, ext3, ReiserFS, and JFS, but not XFS. Surprisingly ext3 won, but Scalzo didn't address scaling/load. The results are surprising because most think ext3 is just ext2 with journaling, thus having extra overhead from journaling. If you read papers on ext3, you'll discover that it has some optimizations that reduce disk head movement. For example, Daniel Robbins' "Advanced filesystem implementor's guide, Part 7: Introducing ext3" (http://www-106.ibm.com/developerworks/library/l-fs7/) says:

"The approach that the [ext3 Journaling Block Device layer API] uses is called physical journaling, which means that the JBD uses complete physical blocks as the underlying currency for implementing the journal...the use of full blocks allows ext3 to perform some additional optimizations, such as "squishing" multiple pending IO operations within a single block into the same in-memory data structure. This, in turn, allows ext3 to write these multiple changes to disk in a single write operation, rather than many. In addition, because the literal block data is stored in memory, little or no massaging of the in-memory data is required before writing it to disk, greatly reducing CPU overhead."

I suspect that fewer writes may be the key factor in ext3 winning Scalzo's DB benchmark. But as I said, Scalzo didn't benchmark XFS and he didn't address scaling. XFS has a feature called delayed allocation that reduces IO (http://www-106.ibm.com/developerworks/library/l-fs9/), and it scales much better than ext3, so while I haven't tested it, I suspect that it may be the ideal choice for large Linux DB servers:

"XFS handles allocation by breaking it into a two-step process. First, when XFS receives new data to be written, it records the pending transaction in RAM and simply reserves an appropriate amount of space on the underlying filesystem. However, while XFS reserves space for the new data, it doesn't decide what filesystem blocks will be used to store the data, at least not yet. XFS procrastinates, delaying this decision to the last possible moment, right before this data is actually written to disk. By delaying allocation, XFS gains many opportunities to optimize write performance. When it comes time to write the data to disk, XFS can now allocate free space intelligently, in a way that optimizes filesystem performance. In particular, if a bunch of new data is being appended to a single file, XFS can allocate a single, contiguous region on disk to store this data. If XFS hadn't delayed its allocation decision, it may have unknowingly written the data into multiple non-contiguous chunks, reducing write performance significantly. But, because XFS delayed its allocation decision, it was able to write the data in one fell swoop, improving write performance as well as reducing overall filesystem fragmentation. Delayed allocation also has another performance benefit. In situations where many short-lived temporary files are created, XFS may never need to write these files to disk at all. Since no blocks are ever allocated, there's no need to deallocate any blocks, and the underlying filesystem metadata doesn't even get touched."

For further study, I have compiled a list of Linux filesystem resources at: http://jamesthornton.com/hotlist/linux-filesystems/. -- James Thornton ______________________________________________________ Internet Business Consultant, http://jamesthornton.com
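If XFS were chosen, creating and mounting a Postgres data volume might look like this sketch (the logical volume name continues the hypothetical LVM layout above, and the mount point is just an example):

    mkfs.xfs -f /dev/pgvg/data
    mount -t xfs -o noatime,logbufs=8 /dev/pgvg/data /var/lib/postgres/data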
I see you've got an LSI Megaraid card with oodles of Cache. However, don't underestimate the power of the software RAID implementation that Red Hat Linux comes with.
We're using RHE 2.1 and I can recommend Red Hat Enterprise Linux if you want an excellent implementation of software RAID. In fact we have found the software implementation more flexible than that of some expensive hardware controllers. In addition there are also tools to enhance the base implementation even further, making setup and maintenance even easier. An advantage of the software implementation is being able to RAID by partition, not necessarily entire disks.
To answer question 1, if you use software RAID the chunk size is part of the /etc/raidtab file that is used on initial container creation. 4KB is the standard, and a LARGE chunk size of 1MB may affect performance if you're not continuously writing blocks of that size. If you make it too big and you're constantly needing to write out smaller chunks of information, then you will find the disk "always" working, and it would be an inefficient use of the blocks. There is some free info around about calculating the ideal chunk size. Look for "Calculating chunk size for RAID" on Google.
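For illustration, a striped device described in /etc/raidtab with an explicit chunk-size would look roughly like this (device names and the 64 kB value are examples, not recommendations):

    raiddev /dev/md0
        raid-level            0
        nr-raid-disks         2
        persistent-superblock 1
        chunk-size            64          # in kB; benchmark 32/64/128 rather than guessing
        device                /dev/sda6
        raid-disk             0
        device                /dev/sdb6
        raid-disk             1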
In the software implementation, after setup the raidtab is unnecessary, as the superblocks of the disks now contain their relevant information.
As for the application knowing any of this, no, the application layers are entirely unaware of the lower implementation. They simply function as normal by writing to directories that are now mounted a different way. The kernel takes care of the underlying RAID writes and syncs.
3 is easy to implement with software RAID under Linux. You simply partition the drive like normal, mark the partitions you want to "raid" as 'fd' ('Linux raid autodetect'), then configure /etc/raidtab and do a mkraid /dev/mdXX, where mdXX is the matching md device for the RAID setup. You can map them any way you want, but it can get confusing if you're mapping /dev/sda6 > /dev/sdb8 and calling it /dev/md7.
We've found it easier to make them all line up, /dev/sda6 > /dev/sdb6 > /dev/md6
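Putting those steps together (illustrative, for a device already described in /etc/raidtab):

    # partitions already marked type 'fd' (Linux raid autodetect) with fdisk
    mkraid /dev/md6                 # builds the array described in raidtab
    cat /proc/mdstat                # shows array state and the initial resync progress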
FYI, if you want better performance, use 15K SCSI disks, and make sure you've got more than 8MB of cache per disk. Also, you're correct in splitting the drives across the channel, that's a trap for young players ;-)
Bjoern is right to recommend an LVM; it will allow you to dynamically allocate new space to the RAID volume when you add more disks. However, I've no experience implementing LVM on top of software RAID under Linux, though I believe it can be done.
The software RAID implementation allows you to stop and start software RAID devices as desired, add new hot spare disks to the containers as needed and rebuild containers on the fly. You can even change kernel options to speed up or slow down the sync speed when rebuilding the container.
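The resync speed knobs referred to are the md proc entries, for example (values in kB/s, purely illustrative):

    cat /proc/sys/dev/raid/speed_limit_min /proc/sys/dev/raid/speed_limit_max
    echo 100000 > /proc/sys/dev/raid/speed_limit_max   # let an off-peak rebuild run faster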
Anyway, have fun, cause striping is the hot rod of the RAID implementations ;-)
Regards.
Hadley
On Fri, 2004-05-14 at 09:53, Bjoern Metzdorf wrote:
We're using RHE 2.1 and I can recommend Red Hat Enterprise Linux if you want an excellent implementation of software RAID. In fact we have found the software implementation more flexible than that of some expensive hardware controllers. In addition there are also tools to enhance the base implementation even further, making setup and maintenance even easier. An advantage of the software implementation is being able to RAID by partition, not necessarily entire disks.
To answer question 1, if you use software raid the chunk size is part of the /etc/raidtab file that is used on initial container creation. 4KB is the standard and a LARGE chunk size of 1MB may affect performance if you're not writing down to blocks in that size continuously. If you make it to big and you're constantly needing to write out smaller chunks of information, then you will find the disk "always" working and would be an inefficient use of the blocks. There is some free info around about calculating the ideal chunk size. Looking for "Calculating chunk size for RAID" through google.
In the software implementation, after setup the raidtab is uncessary as the superblocks of the disks now contain their relevant information.
As for the application knowing any of this, no, the application layers are entirely unaware of the lower implementation. They simply function as normal by writing to directories that are now mounted a different way. The kernel takes care of the underlying RAID writes and syncs.
3 is easy to implement with software raid under linux. You simply partition the drive like normal, mark the partitions you want to "raid" as 'fd' 'linux raid autodetect', then configure the /etc/raidtab and do a mkraid /dev/mdxx where mdxx is the matching partition for the raid setup. You can map them anyway you want, but it can get confusing if you're mapping /dev/sda6 > /dev/sdb8 and calling it /dev/md7.
We've found it easier to make them all line up, /dev/sda6 > /dev/sdb6 > /dev/md6
FYI, if you want better performance, use 15K SCSI disks, and make sure you've got more than 8MB of cache per disk. Also, you're correct in splitting the drives across the channel, that's a trap for young players ;-)
Bjoern is right to recommend an LVM, it will allow you to dynamically allocate new size to the RAID volume when you add more disks. However I've no experience in implementation with an LVM under the software RAID for Linux, though I believe it can be done.
The software RAID implementation allows you to stop and start software RAID devices as desired, add new hot spare disks to the containers as needed and rebuild containers on the fly. You can even change kernel options to speed up or slow down the sync speed when rebuilding the container.
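For instance (example device names; these are the 2.4-era raidtools commands and /proc knobs):

raidstop /dev/md6                       # stop an array
raidstart /dev/md6                      # start it again
raidhotadd /dev/md6 /dev/sdc6           # add a disk as a hot spare / rebuild target

# throttle or unleash the resync speed (values in KB/s)
echo 1000   > /proc/sys/dev/raid/speed_limit_min
echo 100000 > /proc/sys/dev/raid/speed_limit_max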
Anyway, have fun, cause striping is the hot rod of the RAID implementations ;-)
Regards.
Hadley
On Fri, 2004-05-14 at 09:53, Bjoern Metzdorf wrote:
James Thornton wrote:

>> This is what I am considering the ultimate platform for postgresql:
>>
>> Hardware:
>> Tyan Thunder K8QS board
>> 2-4 x Opteron 848 in NUMA mode
>> 4-8 GB RAM (DDR400 ECC Registered 1 GB modules, 2 for each processor)
>> LSI Megaraid 320-2 with 256 MB cache ram and battery backup
>> 6 x 36GB SCSI 10K drives + 1 spare running in RAID 10, split over both
>> channels (3 + 4) for pgdata including indexes and wal.
>
> You might also consider configuring the Postgres data drives for a RAID
> 10 SAME configuration as described in the Oracle paper "Optimal Storage
> Configuration Made Easy"
> (http://otn.oracle.com/deploy/availability/pdf/oow2000_same.pdf). Has
> anyone delved into this before?

Ok, if I understand it correctly, the paper recommends the following:

1. Get many drives and stripe them into a RAID 0 with a stripe width of
1 MB. I am not quite sure whether this stripe width is meant to be
controlled at the application level (does Postgres support this?) or
whether it refers to e.g. the "chunk size" of the Linux software RAID
driver. Normally a chunk size of 4 KB is recommended, so 1 MB sounds
fairly large.

2. Mirror your RAID 0 and get a RAID 10.

3. Use primarily the fast, outer regions of your disks. In practice this
might be achieved by putting only the outer half of each disk into your
stripe set, e.g. only the outer 18 GB of your 36 GB disks. Btw, is it
common for all drives that the outer region is at the higher block
numbers? Or is it sometimes at the lower block numbers?

4. Subset data by partition, not by disk. If you have 8 disks, don't
build a 4-disk RAID 10 for data and another one for log or indexes;
instead make a global 8-drive RAID 10 and partition it so that data and
log + indexes are spread over all drives.

Interestingly, and quite contrary to what is normally recommended, they
say it is as good or better to have one big stripe set over all
available disks than to put log + indexes on a separate stripe set.
Having one big stripe set means that its full speed is available to all
data. In practice this setup is as fast as or even faster than the "old"
approach.

----------------------------------------------------------------

Bottom line for a normal, less-than-10-disk setup:

Get many disks (8 + spare), create a RAID 0 with 4 disks and mirror it
to the other 4 disks for a RAID 10. Make sure to create the RAID on the
outer half of the disks (the setup may depend on the disk model and RAID
controller used), leaving the inner half empty. Use a logical volume
manager (LVM), which always helps when adding disk space, and create 2
partitions on your RAID 10: one for data and one for log + indexes. This
should look like this:

----- ----- ----- -----
| 1 | | 1 | | 1 | | 1 |
----- ----- ----- -----   <- outer, faster half of the disk
| 2 | | 2 | | 2 | | 2 |      part of the RAID10
----- ----- ----- -----
|   | |   | |   | |   |
|   | |   | |   | |   |   <- inner, slower half of the disk
|   | |   | |   | |   |      not used at all
----- ----- ----- -----

Partition 1 for data, partition 2 for log + indexes. All mirrored to the
other 4 disks not shown.

If you take 36 GB disks, this ends up like this:

RAID 10 has a size of 36 / 2 * 4 = 72 GB
Partition 1 is 36 GB
Partition 2 is 36 GB

If 36 GB is not enough for your pgdata set, you might consider moving to
72 GB disks, or (even better) make a 16-drive RAID 10 out of 36 GB
disks; both end up with 72 GB for your data (but the 16-drive version
will be faster).

Any comments?
Regards,
Bjoern

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?
       http://archives.postgresql.org
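To make the bottom line above concrete, here is a rough sketch of how such a layout could be built with 2.4-era raidtools by striping four mirror pairs (all device names, the outer-half partitions sda1..sdh1 and the 1 MB chunk are assumptions drawn from the discussion, not a tested recipe):

# outer ~18 GB of each 36 GB disk partitioned as sda1..sdh1 (type 'fd'); inner half unused
# /etc/raidtab (abridged): four RAID 1 pairs...
raiddev /dev/md1
    raid-level            1
    nr-raid-disks         2
    persistent-superblock 1
    device                /dev/sda1
    raid-disk             0
    device                /dev/sde1
    raid-disk             1
# ...md2, md3, md4 built the same way from sdb1/sdf1, sdc1/sdg1, sdd1/sdh1...

# ...then one RAID 0 striped across the four mirrors with a 1 MB chunk
raiddev /dev/md10
    raid-level            0
    nr-raid-disks         4
    persistent-superblock 1
    chunk-size            1024        # KB, i.e. the 1 MB stripe the SAME paper suggests
    device                /dev/md1
    raid-disk             0
    device                /dev/md2
    raid-disk             1
    device                /dev/md3
    raid-disk             2
    device                /dev/md4
    raid-disk             3

Run mkraid on /dev/md1 through /dev/md4 first, then on /dev/md10, and carve the data and log + indexes areas (or an LVM) out of /dev/md10.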
Hadley Willan wrote:
> To answer question 1: with software RAID the chunk size is set in
> /etc/raidtab and is used when the array is first created. 4 KB is the
> standard; a large chunk size such as 1 MB may hurt performance if you
> are not consistently writing blocks of that size. If you make it too
> big while constantly writing out smaller pieces of data, you will find
> the disks "always" working, and the blocks will be used inefficiently.
> There is some free information around about calculating the ideal
> chunk size; search for "calculating chunk size for RAID" on Google.

The paper explains the reasoning behind the 1 MB figure:

"Why does the SAME configuration recommend a one megabyte stripe width?
Let's examine the reasoning behind this choice.

Why not use a stripe depth smaller than one megabyte? Smaller stripe
depths can improve disk throughput for a single process by spreading a
single IO across multiple disks. However, IOs that are much smaller than
a megabyte can cause seek time to become a large fraction of the total
IO time. Therefore, the overall efficiency of the storage system is
reduced. In some cases it may be worth trading off some efficiency for
the increased throughput that smaller stripe depths provide. In general
it is not necessary to do this, though. Parallel execution at the
database level achieves high disk throughput while keeping efficiency
high. Also, remember that the degree of parallelism can be dynamically
tuned, whereas the stripe depth is very costly to change.

Why not use a stripe depth bigger than one megabyte? One megabyte is
large enough that a sequential scan will spend most of its time
transferring data instead of positioning the disk head. A bigger stripe
depth will improve scan efficiency, but only modestly. One megabyte is
small enough that a large IO operation will not "hog" a single disk for
very long before moving to the next one. Further, one megabyte is small
enough that Oracle's asynchronous readahead operations access multiple
disks. One megabyte is also small enough that a single stripe unit will
not become a hot spot. Any access hot spot that is smaller than a
megabyte should fit comfortably in the database buffer cache, and
therefore it will not create a hot spot on disk."

The SAME paper also says that, to ensure large IO operations aren't
broken up between the database and the disk, the database's multi-block
read size (Oracle has a parameter called db_file_multiblock_read_count;
does Postgres have an equivalent?) should match the stripe width, and
the OS IO limits should be at least that size.

Also, it says, "Ideally we would like to stripe the log files using the
same one megabyte stripe width as the rest of the files. However, the
log files are written sequentially, and many storage systems limit the
maximum size of a single write operation to one megabyte (or even less).
If the maximum write size is limited, then using a one megabyte stripe
width for the log files may not work well. In this case, a smaller
stripe width such as 64K may work better. Caching RAID controllers are
an exception to this. If the storage subsystem can cache write
operations in nonvolatile RAM, then a one megabyte stripe width will
work well for the log files. In this case, the write operation will be
buffered in cache and the next log writes can be issued before the
previous write is destaged to disk."

--
James Thornton ______________________________________________________
Internet Business Consultant, http://jamesthornton.com
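To put rough numbers on the paper's seek-versus-transfer argument (the figures below are assumptions for a 2004-era 10K RPM SCSI disk, roughly 8 ms of combined seek and rotational latency and about 50 MB/s of sequential transfer, not measurements): a 64 KB IO transfers in about 1.3 ms, so roughly 85% of each IO is spent positioning the head, while a 1 MB IO transfers in about 20 ms, so positioning drops to under 30% of the IO time. That is the efficiency gap the paper is describing.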
One big caveat regarding the "SAME" striping strategy: readahead can
really hurt an OLTP workload. Mind you, if you're going from a few disks
to a caching array with many disks, it will be hard not to see a big
improvement. But if you push the envelope of the array with a "SAME"
configuration, readahead will hurt.

Readahead is good for sequential reads but bad for random reads, because
the various caches (array and filesystem) get flooded with all the
blocks that happen to come after whatever random blocks you're reading.
Because the reads are random, these extra blocks are generally *not*
read by subsequent queries if the database is much larger than the cache
itself. Of course, the readahead blocks are useful if you're doing
sequential scans, but you're not doing sequential scans because it's an
OLTP database, right?

So this will probably incite flames, but: in an OLTP environment of
decent size, readahead is bad. Ideally you would adjust it dynamically
until you find the optimum (likely no readahead at all), if the array
allows it. Most people are fooled by the good performance of readahead
in simple single-threaded or small-dataset tests and then get bitten by
it under concurrent loads or large datasets. (A small sketch of tuning
kernel-level readahead follows the quoted message below.)

> James Thornton wrote:
>
>>> This is what I am considering the ultimate platform for postgresql:
>>>
>>> Hardware:
>>> Tyan Thunder K8QS board
>>> 2-4 x Opteron 848 in NUMA mode
>>> 4-8 GB RAM (DDR400 ECC Registered 1 GB modules, 2 for each processor)
>>> LSI Megaraid 320-2 with 256 MB cache ram and battery backup
>>> 6 x 36GB SCSI 10K drives + 1 spare running in RAID 10, split over
>>> both channels (3 + 4) for pgdata including indexes and wal.
>>
>> You might also consider configuring the Postgres data drives for a
>> RAID 10 SAME configuration as described in the Oracle paper "Optimal
>> Storage Configuration Made Easy"
>> (http://otn.oracle.com/deploy/availability/pdf/oow2000_same.pdf). Has
>> anyone delved into this before?
>
> Ok, if I understand it correctly the papers recommends the following:
>
> 1. Get many drives and stripe them into a RAID0 with a stripe width of
> 1MB. I am not quite sure if this stripe width is to be controlled at
> the application level (does postgres support this?) or if e.g. the
> "chunk size" of the linux software driver is meant. Normally a chunk
> size of 4KB is recommended, so 1MB sounds fairly large.
>
> 2. Mirror your RAID0 and get a RAID10.
>
> 3. Use primarily the fast, outer regions of your disks. In practice
> this might be achieved by putting only half of the disk (the outer
> half) into your stripe set. E.g. put only the outer 18GB of your 36GB
> disks into the stripe set. Btw, is it common for all drives that the
> outer region is on the higher block numbers? Or is it sometimes on the
> lower block numbers?
>
> 4. Subset data by partition, not disk. If you have 8 disks, then don't
> take a 4 disk RAID10 for data and the other one for log or indexes,
> but make a global 8 drive RAID10 and have it partitioned the way that
> data and log + indexes are located on all drives.
>
> They say, which is very interesting, as it is really contrary to what
> is normally recommended, that it is good or better to have one big
> stripe set over all disks available, than to put log + indexes on a
> separated stripe set. Having one big stripe set means that the speed
> of this big stripe set is available to all data. In practice this
> setup is as fast as or even faster than the "old" approach.
>
> ----------------------------------------------------------------
>
> Bottom line for a normal, less than 10 disk setup:
>
> Get many disks (8 + spare), create a RAID0 with 4 disks and mirror it
> to the other 4 disks for a RAID10. Make sure to create the RAID on the
> outer half of the disks (setup may depend on the disk model and raid
> controller used), leaving the inner half empty.
> Use a logical volume manager (LVM), which always helps when adding
> disk space, and create 2 partitions on your RAID10. One for data and
> one for log + indexes. This should look like this:
>
> ----- ----- ----- -----
> | 1 | | 1 | | 1 | | 1 |
> ----- ----- ----- -----   <- outer, faster half of the disk
> | 2 | | 2 | | 2 | | 2 |      part of the RAID10
> ----- ----- ----- -----
> |   | |   | |   | |   |
> |   | |   | |   | |   |   <- inner, slower half of the disk
> |   | |   | |   | |   |      not used at all
> ----- ----- ----- -----
>
> Partition 1 for data, partition 2 for log + indexes. All mirrored to
> the other 4 disks not shown.
>
> If you take 36GB disks, this should end up like this:
>
> RAID10 has size of 36 / 2 * 4 = 72GB
> Partition 1 is 36 GB
> Partition 2 is 36 GB
>
> If 36GB is not enough for your pgdata set, you might consider moving
> to 72GB disks, or (even better) make a 16 drive RAID10 out of 36GB
> disks, which both will end up in a size of 72GB for your data (but the
> 16 drive version will be faster).
>
> Any comments?
>
> Regards,
> Bjoern
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: if posting/reading through Usenet, please send an appropriate
> subscribe-nomail command to majordomo@postgresql.org so that your
> message can get through to the mailing list cleanly
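A minimal sketch of inspecting and reducing kernel-level readahead on Linux, as mentioned above (the device name is only an example, and readahead inside a caching array has to be tuned with the controller vendor's own tools):

blockdev --getra /dev/sda      # current readahead, in 512-byte sectors
blockdev --setra 0 /dev/sda    # try little or no readahead for random OLTP IO
# re-run the workload (and vmstat/iostat) after each change before settling on a value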
I would recommend trying out several stripe sizes and making your own
measurements. A while ago I was involved in building a data warehouse
system (Oracle, DB2), and after several file and database benchmark
exercises we settled on 256K stripes, as these gave the best overall
performance for both systems.

I am not saying "1M is wrong", but I am saying "1M may not be right" :-)
(A rough way to run such a comparison is sketched after the quote
below.)

regards

Mark

Bjoern Metzdorf wrote:
>
> 1. Get many drives and stripe them into a RAID0 with a stripe width of
> 1MB. I am not quite sure if this stripe width is to be controlled at
> the application level (does postgres support this?) or if e.g. the
> "chunk size" of the linux software driver is meant. Normally a chunk
> size of 4KB is recommended, so 1MB sounds fairly large.
>
>
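For what it's worth, a throwaway sketch of such a comparison using only dd and the old raidtools (it assumes /dev/md0 is a scratch test array, the only one described in /etc/raidtab, built on partitions whose data you can destroy; flag spellings may vary between raidtools versions, and a real benchmark would use bonnie++ or your own query mix rather than a streaming dd):

#!/bin/sh
# Rebuild a scratch RAID 0 with different chunk sizes and time a large streaming write.
for chunk in 4 64 256 1024; do                      # chunk sizes in KB
    sed -i "s/^[ \t]*chunk-size.*/    chunk-size $chunk/" /etc/raidtab
    mkraid --really-force /dev/md0                  # destroys data on the test partitions!
    mke2fs -q /dev/md0
    mount /dev/md0 /mnt/test
    echo "chunk size ${chunk}K:"
    time dd if=/dev/zero of=/mnt/test/bigfile bs=8k count=250000   # ~2 GB of 8 KB writes
    umount /mnt/test
    raidstop /dev/md0
done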