Thread: Scaling further up
All:
We have a Quad-Intel XEON 2.0GHz (1MB cache), 12GB memory, running RH9, PG 7.4.0. There's an internal U320, 10K RPM RAID-10 setup on 4 drives.
We are expecting a pretty high load: a few thousand 'concurrent' users executing select, insert, or update statements.
What is the next step up in terms of handling very heavy loads? Clustering?
Are there any standard, recommended clustering options?
How about this? http://c-jdbc.objectweb.org
Also, in terms of hardware, overall, what benefits more, a SunFire 880 (6 or 8 CPUs, lots of RAM, internal FC Drives) type of machine, or an IA-64 architecture?
Appreciate any inputs,
Thanks,
Anjan
> We are expecting a pretty high load, a few thousand 'concurrent' users executing
> select, insert, or update statements.
> What is the next step up in terms of handling very heavy loads? Clustering?

I'd look at adding more disks first. Depending on what type of query load you get, that box sounds like it will be very much I/O bound. More spindles = more parallel operations = faster under load. Consider adding 15K RPM disks as well; they're not all that much more expensive, and should give you better performance than 10K RPM.

Also, make sure you put your WAL disks on a separate RAID set if possible (not just a separate partition on the existing RAID set).

Finally, if you don't already have it, look for a battery-backed RAID controller that can do write-back caching, and enable that. (Don't even think about enabling it unless it's battery backed!) And add as much RAM as you can to that controller.

//Magnus
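A minimal sketch of the separate-WAL-spindles suggestion, assuming PostgreSQL 7.4's pg_xlog layout; the data directory and mount point below are made-up examples, not paths from this thread:

    # Illustrative only -- adjust PGDATA and the WAL mount point to your setup.
    pg_ctl stop -D /var/lib/pgsql/data                  # stop the postmaster first
    mv /var/lib/pgsql/data/pg_xlog /raid_wal/pg_xlog    # /raid_wal = dedicated RAID set
    ln -s /raid_wal/pg_xlog /var/lib/pgsql/data/pg_xlog
    pg_ctl start -D /var/lib/pgsql/data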
For the disks part - I am looking at a SAN implementation, and I will be planning a separate RAID group for the WAL. The controller is a PERC with 128MB cache, and I think it is write-back.

Other than the disks, I am curious what other people are using in terms of the horsepower needed. The Quad server has been keeping up, but we are expecting quite high loads in the near future, and I am not sure that just having the disks on high-end storage will do it.

Thanks,
Anjan

-----Original Message-----
From: Magnus Hagander [mailto:mha@sollentuna.net]
Sent: Monday, March 01, 2004 3:54 PM
To: Anjan Dave; pgsql-performance@postgresql.org
Subject: RE: [PERFORM] Scaling further up
Anjan, > Other than the disks, I am curious what other people are using in terms > of the horsepower needed. The Quad server has been keeping up, but we > are expecting quite high loads in the near future, and I am not sure if > just by having the disks on a high-end storage will do it. Do a performance analysis of RH9. My experience with RH on Xeon has been quite discouraging lately, and I've been recommending swapping stock kernels for the RH kernel. Of course, if this is RHES, rather than the standard, then test & talk to RH instead. -- -Josh Berkus Aglio Database Solutions San Francisco
After a long battle with technology, josh@agliodbs.com (Josh Berkus), an earthling, wrote: >> Other than the disks, I am curious what other people are using in >> terms of the horsepower needed. The Quad server has been keeping >> up, but we are expecting quite high loads in the near future, and I >> am not sure if just by having the disks on a high-end storage will >> do it. > > Do a performance analysis of RH9. My experience with RH on Xeon has > been quite discouraging lately, and I've been recommending swapping > stock kernels for the RH kernel. By that, you mean that you recommend that RHAT kernels be replaced by "stock" ones? > Of course, if this is RHES, rather than the standard, then test & > talk to RH instead. If you're spending the money, better demand value from the vendor... (And if RHAT is going to charge the big bucks, they'll have to provide service...) -- (reverse (concatenate 'string "gro.gultn" "@" "enworbbc")) http://www.ntlug.org/~cbbrowne/rdbms.html "I take it all back. Microsoft Exchange is RFC compliant. RFC 1925, point three." -- Author unknown
Anjan Dave wrote: > We have a Quad-Intel XEON 2.0GHz (1MB cache), 12GB memory, running RH9, > PG 7.4.0. There's an internal U320, 10K RPM RAID-10 setup on 4 drives. > > We are expecting a pretty high load, a few thousands of 'concurrent' > users executing either select, insert, update, statments. The quick and dirty method would be to upgrade to the recently announced 3GHz Xeon MPs with 4MB of L3. My semi-educated guess is that you'd get another +60% there due to the huge L3 hiding the Xeon's shared bus penalty.
On Tue, 2004-03-02 at 17:42, William Yu wrote: > Anjan Dave wrote: > > We have a Quad-Intel XEON 2.0GHz (1MB cache), 12GB memory, running RH9, > > PG 7.4.0. There's an internal U320, 10K RPM RAID-10 setup on 4 drives. > > > > We are expecting a pretty high load, a few thousands of 'concurrent' > > users executing either select, insert, update, statments. > > The quick and dirty method would be to upgrade to the recently announced > 3GHz Xeon MPs with 4MB of L3. My semi-educated guess is that you'd get > another +60% there due to the huge L3 hiding the Xeon's shared bus penalty. If you are going to have thousands of 'concurrent' users you should seriously consider the 2.6 kernel if you are running Linux or as an alternative going with FreeBSD. You will need to load test your system and become an expert on tuning Postgres to get the absolute maximum performance from each and every query you have. And you will need lots of hard drives. By lots I mean dozen(s) in a raid 10 array with a good controller. Thousands of concurrent users means hundreds or thousands of transactions per second. I've personally seen it scale that far but in my opinion you will need a lot more hard drives and ram than cpu.
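For the load-testing step, contrib/pgbench (shipped with 7.4) is a quick way to get a rough baseline before testing with your real queries; the database name, scale factor, and client count below are placeholders to tune toward your own concurrency:

    # Rough pgbench sketch -- the numbers are placeholders, not recommendations.
    createdb pgbench_test
    pgbench -i -s 50 pgbench_test          # initialize test tables at scale factor 50
    pgbench -c 100 -t 1000 pgbench_test    # 100 clients x 1000 transactions each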
"By lots I mean dozen(s) in a raid 10 array with a good controller." I believe, for RAID-10, I will need even number of drives. Currently, the size of the database is about 13GB, and is not expected to grow exponentially with thousands of concurrent users, so total space is not of paramount importance compared to performance. Does this sound reasonable setup? 10x36GB FC drives on RAID-10 4x36GB FC drives for the logs on RAID-10 (not sure if this is the correct ratio)? 1 hotspare Total=15 Drives per enclosure. Tentatively, I am looking at an entry-level EMC CX300 product with 2GB RAID cache, etc. Question - Are 73GB drives supposed to give better performance because of higher number of platters? Thanks, Anjan -----Original Message----- From: Fred Moyer [mailto:fred@redhotpenguin.com] Sent: Tuesday, March 02, 2004 5:57 AM To: William Yu; Anjan Dave Cc: pgsql-performance@postgresql.org Subject: Re: [PERFORM] Scaling further up On Tue, 2004-03-02 at 17:42, William Yu wrote: > Anjan Dave wrote: > > We have a Quad-Intel XEON 2.0GHz (1MB cache), 12GB memory, running > > RH9, > > PG 7.4.0. There's an internal U320, 10K RPM RAID-10 setup on 4 drives. > > > > We are expecting a pretty high load, a few thousands of 'concurrent' > > users executing either select, insert, update, statments. > > The quick and dirty method would be to upgrade to the recently > announced > 3GHz Xeon MPs with 4MB of L3. My semi-educated guess is that you'd get > another +60% there due to the huge L3 hiding the Xeon's shared bus penalty. If you are going to have thousands of 'concurrent' users you should seriously consider the 2.6 kernel if you are running Linux or as an alternative going with FreeBSD. You will need to load test your system and become an expert on tuning Postgres to get the absolute maximum performance from each and every query you have. And you will need lots of hard drives. By lots I mean dozen(s) in a raid 10 array with a good controller. Thousands of concurrent users means hundreds or thousands of transactions per second. I've personally seen it scale that far but in my opinion you will need a lot more hard drives and ram than cpu.
Hi all,

If you have a DB of 'only' 13 GB and you do not expect it to grow much, it might be advisable to have enough memory (RAM) to hold the entire DB in shared memory (everything is cached). If you have a server with, say, 24 GB of memory and can allocate 20 GB for cache, you don't care about the speed of disks any more - all you worry about is the speed of your memory and your network connection. I believe this is not possible using 32-bit technology; you would have to go to some 64-bit platform, but if it's speed you want ...

You can also try solid state hard disk drives. These are actually just memory; there are no moving parts, but they look and behave like very, very fast disk drives. I have seen them at capacities of 73 GB - but they didn't mention the price (I'd probably have a heart attack when I look at the price tag).

Best regards,
Chris
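If you go the lots-of-RAM route Chris describes, the knobs involved are the SysV shared memory ceiling and shared_buffers, while the ordinary OS cache holds the rest of the database; a sketch with purely illustrative values (7.4 measures shared_buffers in 8 kB pages):

    # Illustrative values only -- size these to your own RAM.
    sysctl -w kernel.shmmax=1073741824     # allow ~1 GB of SysV shared memory
    # then in postgresql.conf (units: 8 kB pages):
    #   shared_buffers = 65536             # ~512 MB of PostgreSQL buffers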
Anjan,

> Question - Are 73GB drives supposed to give better performance because
> of higher number of platters?

Not for your situation, no. Your issue is random seek times for multiple simultaneous seek requests and batched checkpoint updates. Things that help with this are:

More spindles
Better controllers, both RAID and individual disks
Faster drives

Particularly, I'd check out stuff like reports from Tom's Hardware for evaluating the real speed of drives and seek times. Often a really good 10000 RPM SCSI will beat a 15000 RPM SCSI if the latter has poor onboard programming.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco
On Tue, 2 Mar 2004, Anjan Dave wrote:

> "By lots I mean dozen(s) in a raid 10 array with a good controller."
>
> I believe, for RAID-10, I will need even number of drives.

Correct.

> Currently, the size of the database is about 13GB, and is not expected to grow
> exponentially with thousands of concurrent users, so total space is not
> of paramount importance compared to performance.
>
> Does this sound reasonable setup?
> 10x36GB FC drives on RAID-10
> 4x36GB FC drives for the logs on RAID-10 (not sure if this is the
> correct ratio)?
> 1 hotspare
> Total=15 Drives per enclosure.

Putting the logs on RAID-10 is likely to be slower than, or no faster than, putting them on RAID-1, since the RAID-10 will have to write to 4 drives, while the RAID-1 will only have to write to two drives. Now, if you were reading in the logs a lot, it might help to have the RAID-10.

> Tentatively, I am looking at an entry-level EMC CX300 product with 2GB
> RAID cache, etc.

Pick up a spare, I'll get you my home address, etc... :-)

Seriously, that's huge. At that point you may well find that putting EVERYTHING on a big old RAID-5 performs best, since you've got lots of caching / write buffering going on.

> Question - Are 73GB drives supposed to give better performance because
> of higher number of platters?

Generally, larger hard drives perform better than smaller hard drives because they a: have more heads and / or b: have a higher areal density.

It's a common misconception that faster RPM drives are a lot faster, when, in fact, their only speed advantage is slightly faster seeks. The areal density of faster-spinning hard drives tends to be somewhat less than that of slower-spinning drives, since the maximum frequency the heads can work at on both drives, assuming the same technology, is the same. I.e. the speed at which you can read data off of the platter doesn't usually go up with a higher RPM drive, only the speed with which you can get to the first sector.
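One cheap way to sanity-check the RAID-1 versus RAID-10 choice for the log volume is to time a WAL-like sequential write on each candidate array before settling on a layout; the mount points here are made up:

    # Crude sequential-write comparison; /mnt/raid1 and /mnt/raid10 are made-up mount points.
    time sh -c 'dd if=/dev/zero of=/mnt/raid1/walsim bs=8k count=262144 && sync'    # ~2 GB
    time sh -c 'dd if=/dev/zero of=/mnt/raid10/walsim bs=8k count=262144 && sync'
    rm /mnt/raid1/walsim /mnt/raid10/walsim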
That was part of my original question - whether it makes sense to go for a mid-range SunFire machine (64-bit HW, 64-bit OS), which is scalable to high amounts of memory and shouldn't have any issues addressing it all. I've had that kind of setup once, temporarily, on a V480 (quad UltraSparc, 16GB RAM) machine, and it did well in production use. Without having the time/resources to do extensive testing, I am not sure if Postgres/Solaris 9 is really suggested by the community for high performance, as opposed to a XEON/Linux setup. Storage being a separate discussion.

Thanks,
Anjan

-----Original Message-----
From: Chris Ruprecht [mailto:chris@ruprecht.org]
Sent: Tuesday, March 02, 2004 4:17 PM
To: Anjan Dave; fred@redhotpenguin.com; William Yu
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Scaling further up
Here's what I recorded today from iostat (Linux, iostat -x -k, sda3 is the pg slice, logs included) during peak time on the RAID-10 array. What I see is mostly writes, and sometimes quite a bit of writing, during which the average wait times shoot up.

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
/dev/sda3 18.81 113.21 3.90 36.33 181.54 1207.75 90.77 603.88 34.54 0.49 0.73 0.22 0.87
/dev/sda3 0.00 208.00 0.00 150.00 0.00 2884.00 0.00 1442.00 19.23 0.75 0.50 0.33 5.00
/dev/sda3 0.00 239.00 0.00 169.00 0.00 3264.00 0.00 1632.00 19.31 2.15 1.27 0.33 5.50
/dev/sda3 0.00 224.50 0.00 158.00 0.00 3060.00 0.00 1530.00 19.37 1.90 1.20 0.28 4.50
/dev/sda3 0.00 157.00 0.00 117.00 0.00 2192.00 0.00 1096.00 18.74 0.40 0.34 0.30 3.50
/dev/sda3 0.00 249.50 0.00 179.00 0.00 3596.00 0.00 1798.00 20.09 21.40 10.78 0.39 7.00
/dev/sda3 0.00 637.50 0.00 620.50 0.00 9936.00 0.00 4968.00 16.01 1137.15 183.55 1.85 115.00
/dev/sda3 0.00 690.00 0.00 548.50 0.00 9924.00 0.00 4962.00 18.09 43.10 7.82 0.46 25.50
/dev/sda3 0.00 485.00 0.00 392.00 0.00 7028.00 0.00 3514.00 17.93 86.90 22.21 1.14 44.50
/dev/sda3 0.00 312.50 0.00 206.50 0.00 4156.00 0.00 2078.00 20.13 3.50 1.69 0.53 11.00
/dev/sda3 0.00 386.50 0.00 275.50 0.00 5336.00 0.00 2668.00 19.37 16.80 6.10 0.60 16.50
/dev/sda3 0.00 259.00 0.00 176.50 0.00 3492.00 0.00 1746.00 19.78 3.25 1.84 0.40 7.00
/dev/sda3 0.00 196.00 0.00 99.00 0.00 2360.00 0.00 1180.00 23.84 0.10 0.10 0.10 1.00
/dev/sda3 0.00 147.00 0.00 100.00 0.00 1976.00 0.00 988.00 19.76 0.50 0.50 0.45 4.50
/dev/sda3 0.00 126.50 0.00 94.50 0.00 1768.00 0.00 884.00 18.71 0.20 0.21 0.21 2.00
/dev/sda3 0.00 133.50 0.00 106.50 0.00 1920.00 0.00 960.00 18.03 0.50 0.47 0.47 5.00
/dev/sda3 0.00 146.50 0.00 118.00 0.00 2116.00 0.00 1058.00 17.93 0.20 0.21 0.17 2.00
/dev/sda3 0.00 156.00 0.00 128.50 0.00 2276.00 0.00 1138.00 17.71 0.35 0.27 0.27 3.50
/dev/sda3 0.00 145.00 0.00 105.00 0.00 2000.00 0.00 1000.00 19.05 0.25 0.24 0.24 2.50
/dev/sda3 0.00 72.96 0.00 54.51 0.00 1019.74 0.00 509.87 18.71 0.17 0.31 0.31 1.72
/dev/sda3 0.00 168.50 0.00 139.50 0.00 2464.00 0.00 1232.00 17.66 0.65 0.47 0.39 5.50
/dev/sda3 0.00 130.50 0.00 100.00 0.00 1844.00 0.00 922.00 18.44 0.00 0.00 0.00 0.00
/dev/sda3 0.00 122.00 0.00 101.00 0.00 1784.00 0.00 892.00 17.66 0.25 0.25 0.25 2.50
/dev/sda3 0.00 143.00 0.00 121.50 0.00 2116.00 0.00 1058.00 17.42 0.25 0.21 0.21 2.50
/dev/sda3 0.00 134.50 0.00 96.50 0.00 1848.00 0.00 924.00 19.15 0.35 0.36 0.36 3.50
/dev/sda3 0.00 153.50 0.00 115.00 0.00 2148.00 0.00 1074.00 18.68 0.35 0.30 0.30 3.50
/dev/sda3 0.00 101.50 0.00 80.00 0.00 1452.00 0.00 726.00 18.15 0.20 0.25 0.25 2.00
/dev/sda3 0.00 108.50 0.00 92.50 0.00 1608.00 0.00 804.00 17.38 0.25 0.27 0.27 2.50
/dev/sda3 0.00 179.00 0.00 132.50 0.00 2492.00 0.00 1246.00 18.81 0.55 0.42 0.42 5.50
/dev/sda3 1.00 113.00 1.00 83.00 16.00 1568.00 8.00 784.00 18.86 0.15 0.18 0.12 1.00
/dev/sda3 0.00 117.00 0.00 97.50 0.00 1716.00 0.00 858.00 17.60 0.20 0.21 0.21 2.00
/dev/sda3 0.00 541.00 0.00 415.50 0.00 7696.00 0.00 3848.00 18.52 146.50 35.09 1.37 57.00
/dev/sda3 0.00 535.00 0.00 392.50 0.00 7404.00 0.00 3702.00 18.86 123.70 31.67 1.31 51.50
/dev/sda3 0.00 993.50 0.00 697.50 0.00 13544.00 0.00 6772.00 19.42 174.25 24.98 1.25 87.00
/dev/sda3 0.00 245.00 0.00 108.50 0.00 2832.00 0.00 1416.00 26.10 0.55 0.51 0.51 5.50

-----Original Message-----
From: scott.marlowe [mailto:scott.marlowe@ihs.com]
Sent: Tuesday, March 02, 2004 4:16 PM
To: Anjan Dave
Cc: fred@redhotpenguin.com; William Yu; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Scaling further up
On Tue, 2 Mar 2004, Anjan Dave wrote:

> That was part of my original question - whether it makes sense to go for
> a mid-range SunFire machine (64bit HW, 64bit OS), which is scalable to
> high amounts of memory, and shouldn't have any issues addressing it all.
> I've had that kind of setup once temporarily on a V480 (quad UltraSparc,
> 16GB RAM) machine, and it did well in production use. Without having the
> time/resources to do extensive testing, I am not sure if
> Postgres/Solaris9 is really suggested by the community for
> high-performance, as opposed to a XEON/Linux setup. Storage being a
> separate discussion.

Some folks on the list have experience with Postgresql on Solaris, and they generally say they use Solaris not for performance reasons, but for reliability reasons. I.e. the bigger Sun hardware is fault tolerant.

For speed, the X86 32 and 64 bit architectures seem to be noticeably faster than Sparc. However, running Linux or BSD on Sparc makes them pretty fast too, but you lose the fault-tolerance support for things like hot-swappable CPUs or memory.
> For speed, the X86 32 and 64 bit architectures seem to be noticeable > faster than Sparc. However, running Linux or BSD on Sparc make them > pretty fast too, but you lose the fault tolerant support for things like > hot swappable CPUs or memory. Agreed.. You can get a Quad Opteron with 16GB memory for around 20K. Grab 3, a cheap SAN and setup a little master/slave replication with failover (how is Slony coming?), and you're all set.
On Tue, Mar 02, 2004 at 02:16:24PM -0700, scott.marlowe wrote: > It's a common misconception that faster RPM drives are a lot faster, > when, in fact, their only speed advantage is slight faster seeks. > The areal density of faster spinning hard drives tends to be > somewhat less than the slower spinning drives, since the maximum > frequency the heads can work in on both drives, assuming the same > technology, is the same. I.e. the speed at which you can read data > off of the platter doesn't usually go up with a higher RPM drive, > only the speed with which you can get to the first sector. This would imply that an upgrade in drive RPM should be accompanied by a decrease in random_page_cost, correct? random_page_cost should be set with the following things taken into account: - seek speed - likelihood of page to be cached in memory by the kernel - anything else? Sorry, i realize this pulls the thread a bit off-topic, but i've heard that about RPM speeds before, and i just want some confirmation that my deductions are reasonable. -johnnnnnnnnnnn
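One way to ground this empirically rather than deduce it from drive specs is to re-run a representative query under a couple of random_page_cost settings and see where the planner's choices and the actual times line up; the database, table, and query below are hypothetical placeholders:

    # Hypothetical example -- substitute a query that matters to your workload.
    psql mydb -c "SET random_page_cost = 4; EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;"
    psql mydb -c "SET random_page_cost = 2; EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;"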
Can you describe the vendors/components of a "cheap SAN setup?" Thanks, Anjan -----Original Message----- From: Rod Taylor [mailto:pg@rbt.ca] Sent: Tuesday, March 02, 2004 5:57 PM To: Scott Marlowe Cc: Anjan Dave; Chris Ruprecht; fred@redhotpenguin.com; William Yu; Postgresql Performance Subject: Re: [PERFORM] Scaling further up > For speed, the X86 32 and 64 bit architectures seem to be noticeable > faster than Sparc. However, running Linux or BSD on Sparc make them > pretty fast too, but you lose the fault tolerant support for things like > hot swappable CPUs or memory. Agreed.. You can get a Quad Opteron with 16GB memory for around 20K. Grab 3, a cheap SAN and setup a little master/slave replication with failover (how is Slony coming?), and you're all set.
On Mar 2, 2004, at 5:36 PM, scott.marlowe wrote:

> Some folks on the list have experience with Postgresql on Solaris, and
> they generally say they use Solaris not for performance reasons, but for
> reliability reasons. I.e. the bigger Sun hardware is fault tolerant.

Solaris isn't nearly as bad for PG as it used to be. But as you say - the #1 reason to use Sun is reliability. (In my case, it was because we had a giant Sun laying around :)

I'm trying to remember exactly what happens, but I know on Sun, if it has a severe memory error, it kills off processes with data on that DIMM (since it has no idea whether the data is bad or not; thanks to ECC this is very rare, but it can happen). I want to say that if a CPU dies, any processes running on it at that moment are also killed, but the more I think about it the more I don't think that is the case.

As for x86: if RAM or a CPU goes bad, you're SOL.

Although Opterons are sexy, you need to remember they really are brand new CPUs - I'm sure AMD has done tons of testing, but Sun UltraSparcs have been in every situation conceivable in production. If you are going to really have thousands of users, you probably want to bet the farm on something proven.

Lots and lots of spindles.
Lots and lots of RAM.

You may also want to look into a replication solution as a hot backup.

--
Jeff Trout <jeff@jefftrout.com>
http://www.jefftrout.com/
http://www.stuarthamm.net/
On Tue, 2004-03-02 at 18:24, Anjan Dave wrote:
> Can you describe the vendors/components of a "cheap SAN setup?"

Heh.. Excellent point. My point was that you can get away with a smaller setup (fewer disks) if it doesn't have to deal with many reads, and the writes are not time-dependent, than you would need if you attempt to pull 500MB/sec off the disks.

If it is foreseeable that the database can be held in RAM, that is a much easier and cheaper way to get high I/O than physical disks.
On 02/03/2004 23:25 johnnnnnn wrote: > [snip] > random_page_cost should be set with the following things taken into > account: > - seek speed Which is not exactly the same thing as spindle speed as it's a combination of spindle speed and track-to-track speed. I think you'll find that a 15K rpm disk, whilst it will probably have a lower seek time than a 10K rpm disk, won't have a proportionately (i.e., 2/3rds) lower seek time. > - likelihood of page to be cached in memory by the kernel That's effective cache size. -- Paul Thomas +------------------------------+---------------------------------------------+ | Thomas Micro Systems Limited | Software Solutions for the Smaller Business | | Computer Consultants | http://www.thomas-micro-systems-ltd.co.uk | +------------------------------+---------------------------------------------+
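On the effective cache size point: a common rough way to pick the value on Linux is from the 'cached' figure that free reports, converted to the 8 kB pages PostgreSQL 7.4 expects; a heuristic sketch, not an exact science:

    # Rough estimate of effective_cache_size from the kernel's page cache.
    free -b | awk '/^Mem:/ { printf "effective_cache_size ~ %d (8 kB pages)\n", $7/8192 }'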
On Tue, Mar 02, 2004 at 04:50:04PM -0500, Anjan Dave wrote: > time/resources to do extensive testing, I am not sure if > Postgres/Solaris9 is really suggested by the community for > high-performance, as opposed to a XEON/Linux setup. Storage being a > separate discussion. I can tell you from experience that performance on Solaris is nowhere close to what you'd expect, given the coin you're forking over for it. I think the reason to use Solaris is its support for all the nifty hot-swappable hardware, and not for its speed or any putative benefit you might get from having 64 bits at your disposal. A -- Andrew Sullivan | ajs@crankycanuck.ca The fact that technology doesn't work is no bar to success in the marketplace. --Philip Greenspun
John, > This would imply that an upgrade in drive RPM should be accompanied by > a decrease in random_page_cost, correct? Maybe. Maybe not. Tom's Hardware did some Bonnie++ testing with a variety of new drives last year. They were moderately surprised to find that there were "faster" drives (i.e. higher RPM) which had lower real throughput due to poor onboard software and hardware, such as a small and slow onboard cache. So, it would be reasonable to assume that a 10,000 RPM Barracuda could support marginally lower random_page_cost than a 7,200 RPM Barracuda ... but that tells you nothing about a 10,000 RPM Maxtor Diamond (as an example). Also, many other factors influence real random_page_cost; the size and access pattern of your database is probably much more important than your RPM. -- -Josh Berkus Aglio Database Solutions San Francisco
On Wed, 3 Mar 2004, Paul Thomas wrote:

> > random_page_cost should be set with the following things taken into
> > account:
> > - seek speed
>
> Which is not exactly the same thing as spindle speed as it's a combination
> of spindle speed and track-to-track speed. I think you'll find that a 15K
> rpm disk, whilst it will probably have a lower seek time than a 10K rpm
> disk, won't have a proportionately (i.e., 2/3rds) lower seek time.

There are three factors that affect how fast you can get to the next sector:

seek time
settle time
rotational latency

Most drives only list the first, and don't bother to mention the other two.

On many modern drives, the seek times are around 5 to 10 milliseconds. The settle time varies as well; the longer the seek, the longer the settle, generally. This is the time it takes for the head to stop shaking and rest quietly over a particular track. Rotational latency is the amount of time you have to wait, on average, for the sector you want to come under the heads.

Assuming an 8 ms seek, and 2 ms settle (typical numbers), and that the rotational latency on average is 1/2 of a rotation: at 10k rpm, a rotation takes 1/166.667 of a second, or 6 ms, so half a rotation is approximately 3 ms. By going to a 15k rpm drive, the latency drops to 2 ms. So, if we add them up, on the same basic drive, one being 10k and one being 15k, we get:

10k rpm: 8+2+3 = 13 ms
15k rpm: 8+2+2 = 12 ms

So, based on the decrease in rotational latency being the only advantage the 15k rpm drive has over the 10k rpm drive, we get a decrease in access time of only 1 ms, or only about an 8% decrease in actual access time. So, if your random_page_cost on 10k rpm drives was 1.7, you'd need to drop it to 1.57 or so to reflect the speed increase from 15k rpm drives.

I.e. it's much more likely that going from 1 gig to 2 gigs of RAM will make a noticeable difference than going from 10k to 15k drives.
On 03/03/2004 18:23 scott.marlowe wrote: > [snip] > There are three factors that affect how fast you can get to the next > sector: > > seek time > settle time > rotational latency > > Most drives only list the first, and don't bother to mention the other > two. Ah yes, one of my (very) few still functioning brain cells was nagging about another bit of time in the equation :) > On many modern drives, the seek times are around 5 to 10 milliseconds. > [snip] Going back to the OPs posting about random_page_cost, imagine I have 2 servers identical in every way except the disk drive. Server A has a 10K rpm drive and server B has a 15K rpm drive. Seek/settle times aren't spectacularly different between the 2 drives. I'm wondering if drive B might actually merit a _higher_ random_page_cost than drive A as, once it gets settled on a disk track, it can suck the data off a lot faster. opinions/experiences anyone? -- Paul Thomas +------------------------------+---------------------------------------------+ | Thomas Micro Systems Limited | Software Solutions for the Smaller Business | | Computer Consultants | http://www.thomas-micro-systems-ltd.co.uk | +------------------------------+---------------------------------------------+
Great response, Thanks. Regarding 12GB memory and 13G db, and almost no I/O, one thing I don't understand is that even though the OS caches most of the memory and PG can use it if it needs it, why would the system swap (not much, only during peak times)? The SHMMAX is set to 512MB, shared_buffers is 150MB, effective cache size is 2GB, sort mem is 2MB, rest is default values. It also happens that a large query (reporting type) can hold up the other queries, and the load averages shoot up during peak times. Regarding a baseline - -We have docs and monitor for frequency of sql statements, most expensive ones, etc. (IronEye) -I am monitoring disk reads/writes using iostat -How do I measure commit frequency, and system events like checkpoint? (vacuum is done nightly during less or no load) Thanks, Anjan -----Original Message----- From: Aaron W [mailto:aaronwerman@yahoo.com] Sent: Thursday, March 04, 2004 8:58 AM To: pgsql-performance@postgresql.org; Anjan Dave Subject: Re: Scaling further up .... I'd look at adding more disks first. Depending on what type of query load you get, that box sounds like it will be very much I/O bound.... Given a a 13G database on a 12G system, with a low growth rate, it is likely that there is almost no I/O for most activities. The exception is checkpointing. The first thing I'd do is try to build a spreadsheet model of: - select frequency, and # logical and physical reads involved - insert/delete/update frequency, and # logical and physical read and writes involved - commit frequency, etc. (start out with simplistic assumptions, and do it for peak load) - system events (checkpoints, vacuum) I assume that the only high I/O you will see will be for logging. The RAID issue there is basically obviated by the sequential write nature of WAL. If that is the case, EMC is not the most cost effective or obvious solution - since the value they provide is mostly manageability for disaster recovery. The goal in this case is to write at the application max speed, and with mimimal latency. Any responsible battery backed up write through (mirrored) cached controller can do that for you. On the other hand, if your requests are not *all* trivial, you are going to test the hardware and scheduling algorithms of OS and pg. Even if 0.1% of 3,000 tps take a second - that ends up generating 3 seconds of load.... Any, even slightly, slow transactions will generate enormous queues which slow down everything. In most systems of this volume I've seen, the mix of activities is constantly invalidating cache, making L2 caching less important. Memory to CPU bus speed is a limiting factor, as well as raw CPU speed in processing the requests. Xeon is not a great architecture for this because of FSB contention; I suspect a 4-way will be completely FSB bottlenecked so a more than 4 way would likely not change performance. I would try to get a simple model/benchmark going and test against it. You should be talking to the big iron vendors for their take on your issues and get their capacity benchmarks. __________________________________ Do you Yahoo!? Yahoo! Search - Find what you're looking for faster http://search.yahoo.com
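On the "how do I measure commit frequency" question above: if the statistics collector is running (stats_start_collector = true), sampling pg_stat_database at intervals gives a commit/rollback rate; a rough sketch, with the database name as a placeholder:

    # Sample twice, 60 seconds apart; the deltas are your per-minute commit/rollback rates.
    psql mydb -t -c "SELECT xact_commit, xact_rollback FROM pg_stat_database WHERE datname = current_database();"
    sleep 60
    psql mydb -t -c "SELECT xact_commit, xact_rollback FROM pg_stat_database WHERE datname = current_database();"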
On Thu, 4 Mar 2004, Paul Thomas wrote:

> Going back to the OP's posting about random_page_cost, imagine I have 2
> servers identical in every way except the disk drive. Server A has a 10K
> rpm drive and server B has a 15K rpm drive. Seek/settle times aren't
> spectacularly different between the 2 drives. I'm wondering if drive B
> might actually merit a _higher_ random_page_cost than drive A as, once it
> gets settled on a disk track, it can suck the data off a lot faster.
> opinions/experiences anyone?

It might well be that you have higher settle times that offset the small gain in rotational latency. I haven't looked into it, so I don't know one way or the other, but it seems a reasonable assumption.

However, a common misconception is that the higher angular velocity of the 15k rpm drives would allow you to read data faster. In fact, the limit of how fast you can read is set by the head. There's a maximum frequency that it can read, and the areal density / rpm have to be such that you don't exceed that frequency. Often, the speed at which you read off the platters is exactly the same between a 10k and a 15k drive of the same family. The required lower areal density is the reason 15k rpm drives show up in the lower capacities first.
Anjan Dave wrote: > Great response, Thanks. > > Regarding 12GB memory and 13G db, and almost no I/O, one thing I don't > understand is that even though the OS caches most of the memory and PG > can use it if it needs it, why would the system swap (not much, only > during peak times)? The SHMMAX is set to 512MB, shared_buffers is 150MB, > effective cache size is 2GB, sort mem is 2MB, rest is default values. It > also happens that a large query (reporting type) can hold up the other > queries, and the load averages shoot up during peak times. In regards to your system going to swap, the only item I see is sort_mem at 2MB. How many simultaneous transactions do you get? If you get hundreds or thousands like your first message stated, every select sort would take up 2MB of memory regardless of whether it needed it or not. That could cause your swap activity during peak traffic. The only other item to bump up is the effective cache size -- I'd set it to 12GB.
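For reference, a sketch of what bumping effective_cache_size looks like in 7.4, where the setting is measured in 8 kB pages; the arithmetic is illustrative, and the right number should reflect what the OS actually keeps cached:

    # postgresql.conf sketch (illustrative): 12 GB = 12 * 1024 * 1024 / 8 = 1572864 pages
    #   effective_cache_size = 1572864
    # then have the postmaster re-read the file:
    pg_ctl reload -D /var/lib/pgsql/data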
On Mon, 2004-03-08 at 11:40, William Yu wrote: > Anjan Dave wrote: > > Great response, Thanks. > > > > Regarding 12GB memory and 13G db, and almost no I/O, one thing I don't > > understand is that even though the OS caches most of the memory and PG > > can use it if it needs it, why would the system swap (not much, only > > during peak times)? The SHMMAX is set to 512MB, shared_buffers is 150MB, > > effective cache size is 2GB, sort mem is 2MB, rest is default values. It > > also happens that a large query (reporting type) can hold up the other > > queries, and the load averages shoot up during peak times. > > In regards to your system going to swap, the only item I see is sort_mem > at 2MB. How many simultaneous transactions do you get? If you get > hundreds or thousands like your first message stated, every select sort > would take up 2MB of memory regardless of whether it needed it or not. > That could cause your swap activity during peak traffic. > > The only other item to bump up is the effective cache size -- I'd set it > to 12GB. > Was surprised that no one corrected this bit of erroneous info (or at least I didn't see it) so thought I would for completeness. a basic explanation is that sort_mem controls how much memory a given query is allowed to use before spilling to disk, but it will not grab that much memory if it doesn't need it. See the docs for a more detailed explanation: http://www.postgresql.org/docs/7.4/interactive/runtime-config.html#RUNTIME-CONFIG-RESOURCE Robert Treat -- Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL
We upgraded from 8GB to 12GB RAM a month or so ago, but even in the past, I've never seen the system exhaust its system cache (~6GB, in 'top') while it's swapping.

Someone had mentioned why not have the entire DB in memory? How do I configure that, for knowledge?

Max connections is set to 500, and we haven't bumped it yet. (I've seen over 200 active queries, but the traffic is seasonal, hence the high connection value.)

Thanks,
Anjan

-----Original Message-----
From: Robert Treat [mailto:xzilla@users.sourceforge.net]
Sent: Friday, March 12, 2004 6:02 PM
To: William Yu
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Scaling further up
I have some suggestions based on my anecdotal experience.

1. This is a relatively small DB -- the working set will likely be in RAM at any moment in time, making read I/O time mostly irrelevant.

2. The killer will be write times -- specifically log writes. Small and heavily synchronized writes, log and data writes, will drag down an impressive hardware RAID setup. We run mirrored hardware RAID 5 arrays with write-back cache and are constantly seeking ways to improve write performance. We do a lot of batch processing, though, so we do a lot of write I/Os.

3. Be very careful with "battery backed write cache." It usually works as advertised. More than once in the past decade I have seen spontaneous cache corruption after power loss. The corruption usually happens when some admin, including me, has assumed that the cache will ALWAYS survive a power failure unblemished and has no "plan B." Make sure you have a contingency plan for corruption, or don't enable the cache.

4. RAID 10 will likely have bigger stripe sizes on the RAID 0 portion of the setup, and might hinder, not help, small write I/O performance.

5. Most (almost all) of the I/O time will be due to the access time (head seek + head settle + rotational latency) and very little of the I/O time will be due to data transfer time. In other words, getting drives that provide faster transfer rates will barely improve performance. The secret is lowering the access time.

6. A relatively cheap way to drastically drop the access time is to get large drive(s) and only use a portion of them for storage. The less space used on the drive, the less area the heads need to cover for seeks. At one extreme, you could make the partition the size of a single cylinder. This would make access time (ignoring OS and controller overhead) identical to rotational latency, which is as low as 4.2 ms for a cheap 7200 RPM drive.

7. A drive with a 5 ms average service time, servicing 8 KB blocks, will yield as much as 1.6 MB/s sustained write throughput. Not bad for a cheap uncached solution. Any OS aggregation of writes during the fsync() call will further improve this number -- it is basically a lower bound for throughput.

8. Many people, especially managers, cannot stomach buying disk space and only using a portion of it. In many cases, it seems more palatable to purchase a much more expensive solution to get to the same speeds.

Good luck.
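Marty's point 7, spelled out as a back-of-envelope calculation (the 5 ms service time and 8 KB block size are his stated assumptions):

    # 5 ms per synchronous 8 kB write  ->  IOPS and sustained throughput
    awk 'BEGIN { svc = 0.005; bs_kb = 8; iops = 1/svc; printf "%.0f IOPS, %.2f MB/s\n", iops, iops*bs_kb/1024 }'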
Sorry about not chiming in before - I've been too swamped to think. I agree with most of the points, but a lot of these posts are interesting and seem to describe systems from an SA perspective to my DBA-centric view. ----- Original Message ----- From: "Marty Scholes" <marty@outputservices.com> To: <pgsql-performance@postgresql.org> Sent: Wednesday, March 10, 2004 6:29 PM Subject: Re: [PERFORM] Scaling further up > I have some suggestions based on my anecdotal experience. > > 1. This is a relatively small DB -- the working set will likely be in > RAM at any moment in time, making read I/O time mostly irrelevant. > > 2. The killer will be write times -- specifically log writes. Small and > heavily synchronized writes, log and data writes, will drag down an > impressive hardware RAID setup. We run mirrored hardware RAID 5 arrays > with write back cache and are constantly seeking ways to improve write > performance. We do a lot of batch processing, though, so we do a lot of > write I/Os. My experience with RAID5 for streaming sequential writes is bad. This is sometimes helped by the hardware caching to cover the cost of the additional I/Os for striping (write through RAID5 + big cache acts like RAID 1+0 until you run out of cache). Batch processing is different from high concurrency transactions because it needs faster volume streaming, while TP is dependant on the speed of ack'ing (few big writes with less synchronous waits vs. lots of small writes which serialize everyone). (RAID 3 worked for me in the past for logging, but I haven't used it in years.) > > 3. Be very careful with "battery backed write cache." It usually works > as advertised. More than once in the past decade I have seen > spontaneous cache corruption after power losss. The corruption usually > happens when some admin, including me, has assumed that the cache will > ALWAYS survive a power failure unblemished and has no "plan B." Make > sure you have a contingency plan for corruption, or don't enable the cache. I agree strongly. There is also the same problem with disk write back cache and even with SCSI controllers with write through enabled. PITR would help here. A lot of these problems are due to procedural error post crash. > > 4. RAID 10 will likely have bigger stripe sizes on the RAID 0 portion of > the setup, and might hinder, not help small write I/O performance. In a high volume system without write caching you are almost always going to see queuing, which can make the larger buffer mostly irrelevant, if it's not huge. Write caching thrives on big block sizes (which is a key reason why Symmetrix doesn't do worse than it does) by reducing I/O counts. Most shops I've set up or seen use mirroring or RAID 10 for logs. Note also that many RAID 10 controllers in a non-write cached setup allows having a race between the two writers, acknowledging when the first of the two completes - increasing throughput by about 1/4. > > 5. Most (almost all) of the I/O time will be due to the access time > (head seek + head settle + rotational latency) and very little of the > I/O time will due to data transfer time. In other words, getting drives > that provide faster transfer rates will barely improve performance. The > secret is lowering the access time. True. This is very much a latency story. Even in volume batch, you can see access time that clearly shows some other system configuration bottleneck that happens elsewhere before hitting I/O capacity. > > 6. 
A relatively cheap way to drastically drop the access time is to get > large drive(s) and only use a portion of them for storage. The less > space used on the drive, the less area the heads need to cover for > seeks. At one extreme, you could make the partition the size of a > single cylinder. This would make access time (ignoring OS and > controller overhead) identical to rotational latency, which is as low as > 4.2 ms for a cheap 7200 RPM drive. This is a good strategy for VLDB, and may not be relevant in this case. Also - big sequential writes and 15K rpm drives, in the case of writethrough, is a beautiful thing - they look like a manufacturers' demo. A primary performance role of a RDBMS is to convert random I/O to sequential (by buffering reads and using a streaming log to defer random writes to checkpoints). RDBMS's are the prime beneficiaries of the drive speed improvements - since logging, backups, and copies are about the only things (ignoring bad perl scripts and find commands) that generate loads of 50+ mB/sec. /Aaron
On Fri, Mar 12, 2004 at 06:25:48PM -0500, Anjan Dave wrote: > We upgraded from 8GB to 12GB RAM a month or so ago, but even in the > past, I've never seen the system exhaust on it's system cache (~6GB, in > 'top'), while it's swapping. > > Some one had mentioned why not have the entire DB in memory? How do I > configure that, for knowledge? You don't. It'll automatically be in memory if (a) you have enough memory, (b) you don't have anything else on the machine using the memory, and (c) it's been read at least one time. A -- Andrew Sullivan | ajs@crankycanuck.ca
Quoting Andrew Sullivan <ajs@crankycanuck.ca>:

> You don't. It'll automatically be in memory if (a) you have enough
> memory, (b) you don't have anything else on the machine using the
> memory, and (c) it's been read at least one time.

This is the preferred method, but you could create a memory disk if running Linux. This has several caveats, though.

1. You may have to recompile the kernel for support.
2. You must store the database on a hard drive partition during reboots.
3. Because of #2, this option is generally only useful if you have static content that is loaded to the MD upon startup of the system.

You could have some fancy methodology of shutting down the system and then copying the data to a disk-based filesystem, but this is inherently bad since at any moment a power outage would erase any updates/changes.

The option is there to start with all data in memory, but in general this is probably not what you want. Just an FYI.
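For completeness, a sketch of the memory-disk idea using tmpfs on a reasonably recent 2.4/2.6 kernel (so no kernel rebuild); every caveat Matt lists applies, anything written there is gone on power loss or reboot, and the paths and size are made up:

    # DANGER: purely illustrative -- data on tmpfs vanishes at reboot or power-off.
    mkdir /mnt/pgram
    mount -t tmpfs -o size=14336m tmpfs /mnt/pgram
    cp -a /var/lib/pgsql/data /mnt/pgram/data        # seed from the on-disk copy (as root)
    su - postgres -c 'pg_ctl start -D /mnt/pgram/data'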
On Tue, 2004-03-16 at 07:28, Matt Davies wrote: > This is the preferred method, but you could create a memory disk if running > linux. This has several caveats, though. > > 1. You may have to recompile the kernel for support. > 2. You must store the database on a hard drive partition during reboots. > 3. Because of #2 this option is generally useful if you have static content that > is loaded to the MD upon startup of the system. And 4. You use twice as much memory - one lot for the FS, the second for buffer cache. It's generally going to be slower than simply doing some typical queries to preload the data into buffer cache, I think. Stephen