Thread: High CPU Utilization
Our production database is seeing very heavy CPU utilization - anyone have
any ideas/input considering the following?

CPU utilization gradually increases during the day until it approaches
90%-100% at our peak time. When this happens our transactions/sec drops
and our site becomes very slow. When in this state, I can see hundreds of
queries in pg_stat_activity that are not waiting on locks but sit there
for minutes. When the database is not in this state, those same queries
can complete in fractions of a second - faster than my script that watches
pg_stat_activity can keep track of them.

This server has dual quad-core Xeon 5310s, 32 GB RAM, and a few different
disk arrays (all managed in hardware by either the Perc5/i or Perc5/e
adapter). The Postgres data is on a 14-disk 7.2k SATA RAID 10. This server
runs nothing but Postgres. The PostgreSQL database (according to
pg_database_size) is 55GB and we are running PostgreSQL 8.3.5 and the
2.6.28.7-2 kernel under Arch Linux.

Right now (not under peak load) this server is running at 68% CPU
utilization and its SATA RAID 10 is doing about 2MB/s writes and 11MB/s
reads. When I run dd I can hit 200+MB/s writes and 230+MB/s reads, so we
are barely using the available IO. Further, when I run dd the CPU
utilization of that process only approaches 20%-30% of one core.

Additionally, when I view "top -c" I generally see a dozen or so "idle"
postgres processes (they appear and drop away quickly though) consuming
very large chunks of CPU (as much as 60% of a core). At any given time we
have hundreds of idle postgres processes due to the JDBC connection
pooling, but most of them are at 0% as I would expect them to be. I also
see selects and inserts consuming very large percentages of CPU, but I
view that as less strange since they are doing work.

Any ideas as to what is causing our CPUs to struggle? Is the fact that our
RAM covers a significant portion of the database causing our CPUs to do a
bunch of thrashing as they work with memory while our disk controllers sit
idle? According to top we barely use any swap.

We currently have max_connections set to 1000 (roughly the sum of the JDBC
pools on our application servers). Would decreasing this value help? We
can decrease the JDBC pools or switch to pgbouncer for pooling if this is
the case.

Really just looking for any input/ideas. Our workload is primarily OLTP in
nature - essentially a social network. By transactions/sec at the start I
am using the xact_commit value in pg_stat_database. Please let me know if
this value is not appropriate for getting a tps guess. Right now with the
60% CPU utilization and low IO use, xact_commit is increasing at a rate of
1070 a second.

I have an identical PITR slave I can pause the PITR sync on to run any
test against. I will happily provide any additional information that would
be helpful. Any assistance is greatly appreciated.

Joe Uhl
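(A sketch of the xact_commit-based tps estimate described above - this
assumes local psql access and a hypothetical database name 'mydb';
xact_commit is cumulative, so tps is just the delta over the sample
window:)

# rough commits/sec from pg_stat_database
prev=$(psql -At -c "SELECT xact_commit FROM pg_stat_database WHERE datname = 'mydb'")
sleep 10
cur=$(psql -At -c "SELECT xact_commit FROM pg_stat_database WHERE datname = 'mydb'")
echo "approx tps: $(( (cur - prev) / 10 ))"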
On Monday 16 March 2009, Joe Uhl <joeuhl@gmail.com> wrote:
> Right now (not under peak load) this server is running at 68% CPU
> utilization and its SATA raid 10 is doing about 2MB/s writes and 11MB/s
> reads. When I run dd I can hit 200+MB/s writes and 230+MB/s reads,
> so we are barely using the available IO. Further when I run dd the
> CPU utilization of that process only approaches 20%-30% of one core.

What does vmstat say when it's slow? The output of "vmstat 1 30" would be
informative.

note: dd is sequential I/O. Normal database usage is random I/O.

--
Even a sixth-grader can figure out that you can’t borrow money to pay off
your debt
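(One way to put a random-I/O number next to those dd figures - a sketch
only, assuming GNU dd with iflag=direct and a hypothetical device path;
single 8kB reads at random offsets approximate seeks/second:)

# time 1000 random single-page reads; 1000 / elapsed seconds ~ seeks/sec
time for i in $(seq 1 1000); do
    off=$(( (RANDOM * 32768 + RANDOM) % 10000000 ))
    dd if=/dev/sdb of=/dev/null bs=8k count=1 skip=$off iflag=direct 2>/dev/null
done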
Here is vmstat 1 30. We are under peak load right now so I can gather
information from the real deal :)

Had an almost complete lockup a moment ago; the number of non-idle
postgres connections was 637. Going to drop our JDBC pool sizes a bit and
bounce everything.

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd     free  buff    cache  si  so   bi   bo    in    cs us sy id wa
12 35  95056 11102380 56856 14954948   3   4  669  541     1     2 23  3 54 19
12 39  95056 11092484 56876 14963204   0   0 6740 1204 10066 13277 91  5  0  4
 8 42  95056 11081712 56888 14972244   0   0 8620 1168 10659 17020 78  6  0 15
10 30  95052 11069768 56904 14982628   0   0 8944  976  9809 15109 81  6  1 12
 4 27  95048 11059576 56916 14991296   0   0 8852  440  7652 13294 63  4  2 32
 5 42  95048 11052524 56932 14996496   0   0 4700  384  6383 11249 64  4  4 28
 5 33  95048 11047492 56956 15001428   0   0 3852  572  6029 14010 36  4  5 56
 7 35  95048 11041184 56960 15005480   0   0 3964  136  5042 10802 40  3  1 56
 1 33  95048 11037988 56968 15009240   0   0 3892  168  3384  6479 26  1  3 69
 3 28  95048 11029332 56980 15015744   0   0 6724  152  4964 12844 11  2  8 79
 0 34  95048 11025880 56988 15020168   0   0 3852  160  3616  8614 11  1  6 82
 3 25  95048 10996356 57044 15044796   0   0 7892  456  3126  7115  4  3  8 85
 1 26  95048 10991692 57052 15050100   0   0 5188  176  2566  5976  3  2 12 83
 0 29  95048 10985408 57060 15054968   0   0 4200   80  2586  6582  4  1 12 83
 1 29  95048 10980828 57064 15058992   0   0 4560   64  2966  7557  7  2  6 85
 2 28  95048 10977192 57072 15063176   0   0 3860   72  2695  6742 11  1  7 81
 2 29  95048 10969120 57088 15067808   0   0 5084   84  3296  8067 14  1  0 84
 0 25  95048 10962096 57104 15072984   0   0 4440  500  2721  6263 12  1  6 80
 0 23  95044 10955320 57108 15079260   0   0 5712  232  2678  5990  6  1  6 87
 2 25  95044 10948644 57120 15084524   0   0 5120  184  3499  8143 20  3  9 69
 3 21  95044 10939744 57128 15090644   0   0 5756  264  4724 10272 32  3  5 60
 1 19  95040 10933196 57144 15095024  12   0 4440  180  2585  5244 13  2 15 70
 0 21  95040 10927596 57148 15098684   0   0 3248  136  2973  7292  8  1  9 81
 1 20  95040 10920708 57164 15104244   0   0 5192  360  1865  4547  3  1  9 87
 1 24  95040 10914552 57172 15105856   0   0 2308   16  1948  4450  6  1  1 93
 0 24  95036 10909148 57176 15110240   0   0 3824  152  1330  2632  3  1  6 90
 1 21  95036 10900628 57192 15116332   0   0 5680  180  1898  3986  4  1 11 84
 0 19  95036 10888356 57200 15121736   0   0 5952  120  2252  3991 12  1  8 79
 2 22  95036 10874336 57204 15128252   0   0 6320  112  2831  6755  5  2  8 85
 3 26  95036 10857592 57220 15134020   0   0 5124  216  3067  5296 32  6  3 59

Alan, my apologies if you get this twice. Didn't reply back to the list on
first try.

On Mar 16, 2009, at 3:52 PM, Alan Hodgson wrote:

> What does vmstat say when it's slow? The output of "vmstat 1 30" would be
> informative.
>
> note: dd is sequential I/O. Normal database usage is random I/O.
On Mon, 16 Mar 2009, Joe Uhl wrote:

> Here is vmstat 1 30. We are under peak load right now so I can gather
> information from the real deal

Quite helpful, reformatting a bit and picking an informative section:

procs -----------memory---------- ---swap- ----io--- -system-- ----cpu----
 r  b   swpd     free  buff    cache  si  so   bi   bo   in   cs us sy id wa
 0 34  95048 11025880 56988 15020168   0   0 3852  160 3616 8614 11  1  6 82
 3 25  95048 10996356 57044 15044796   0   0 7892  456 3126 7115  4  3  8 85
 1 26  95048 10991692 57052 15050100   0   0 5188  176 2566 5976  3  2 12 83

This says that your server is spending all its time waiting for I/O;
actual CPU utilization is minimal. You're only achieving around 3-8MB/s of
random I/O. That's the reality of what your disk I/O subsystem is capable
of, regardless of what its sequential performance with dd looks like. If
you were to run a more complicated benchmark like bonnie++ instead, I'd
bet that your "seeks/second" results are very low, even though sequential
read/write is fine.

The Perc5 controllers have a pretty bad reputation for performance on this
list, even in RAID10. Not much you can do about that beyond scrapping the
controller and getting a better one.

What you might do in order to reduce the total number of writes needed is
some standard postgresql.conf tuning; see
http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

What you could do here is increase shared_buffers, checkpoint_segments,
and checkpoint_completion_target as described there. Having more buffers
dedicated to the database and having fewer checkpoints can result in less
random I/O showing up, as popular data pages will stay in RAM for longer
without getting written out so much.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
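(A sketch of what those overrides might look like - the values below are
illustrative assumptions for a 32GB box on 8.3, not figures from the wiki
page, and note that shared_buffers needs a full restart, not a reload:)

# append illustrative tuning overrides, then restart
cat >> $PGDATA/postgresql.conf <<'EOF'
shared_buffers = 8GB                  # ~25% of RAM
checkpoint_segments = 32              # fewer, larger checkpoints
checkpoint_completion_target = 0.9    # spread checkpoint writes out
EOF
pg_ctl restart -D $PGDATA -m fast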
I dropped the pool sizes and brought things back up. Things are stable,
site is fast, CPU utilization is still high. Probably just a matter of
time before the issue comes back (we get slammed as kids get out of school
in the US).

Now when I run vmstat 1 30 it looks very different (below). Waiting is
minimal, user is very high. Under nontrivial load, according to
xact_commit in pg_stat_database we are doing 1800+ tps.

Appreciate the input and explanation on vmstat. I am going to throw some
of these numbers into zabbix so I can keep a better eye on them. This
server is a couple years old so the purchase of a new controller and/or
disks is not out of the question.

On a final note, I have made several changes to postgresql.conf. Some of
those here:

max_connections = 1000
shared_buffers = 7680MB
work_mem = 30MB
synchronous_commit = off
checkpoint_segments = 50
effective_cache_size = 20000MB

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free  buff    cache  si  so   bi    bo    in    cs us sy id wa
 9  8  73036 500164 82200 23497748   3   4  669   541     1     1 23  3 54 19
20  4  73036 497252 82200 23500836   0   0 2500   680 11145 15168 91  4  2  2
21  1  73036 491416 82204 23503832   0   0 1916   920 10303 14032 94  4  1  1
23  5  73036 489580 82212 23505860   0   0 1348  3296 11682 15970 94  5  1  0
31  1  73036 481408 82220 23507752   0   0  984  8988 10123 11289 97  3  0  0
25  4  73036 483248 82232 23509420   0   0 1268  1312 10705 14063 96  4  0  0
23  4  73036 480096 82232 23512380   0   0 2372   472  9805 13996 94  5  1  1
24  4  73036 476732 82236 23515196   0   0 2012   720 10365 14307 96  3  1  0
22  1  73036 474468 82236 23516584   0   0  944  3108  9838 12831 95  4  1  0
14  1  73036 455756 82284 23534548   0   0  908  3284  9096 11333 94  4  1  0
10  2  73036 455224 82292 23536304   0   0 1760   416 12454 17736 89  6  3  2
17  0  73036 460620 82292 23538888   0   0 1292   968 12030 18333 90  7  2  1
13  4  73036 459764 82292 23539724   0   0  332   288  9722 14197 92  5  2  1
17  5  73036 457516 82292 23542176   0   0 1872 17752 10458 15465 91  5  2  1
19  4  73036 450804 82300 23545640   0   0 2980   640 10602 15621 90  6  2  2
24  0  73036 447660 82312 23547644   0   0 1736 10724 12401 15413 93  6  1  0
20  6  73036 444380 82320 23550692   0   0 2064   476  9008 10985 94  4  1  0
22  2  73036 442880 82328 23553640   0   0 2496  3156 10739 15211 93  5  1  1
11  1  73036 441448 82328 23555632   0   0 1452  3552 10812 15337 93  5  2  1
 6  2  73036 439812 82348 23557420   0   0 1052  1128  8603 10514 91  3  3  2
 6  3  73036 433456 82348 23560860   0   0 2484   656  7636 13033 68  4 14 14
 6  3  73036 433084 82348 23562628   0   0 1400   408  6046 11778 70  3 18  9
 5  0  73036 430776 82356 23564264   0   0 1108  1300  7549 13754 73  4 16  7
 5  2  73036 430124 82360 23565580   0   0 1016  2216  7844 14507 72  4 18  6
 4  2  73036 429652 82380 23567480   0   0 1168  2468  7694 15466 58  4 24 14
 6  2  73036 427304 82384 23569668   0   0 1132   752  5993 13606 49  5 36 10
 7  1  73036 423020 82384 23571932   0   0 1244   824  8085 18072 56  3 30 10
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free  buff    cache  si  so   bi    bo    in    cs us sy id wa
 4  0  73036 420816 82392 23573824   0   0 1292   820  5370 10958 46  2 41 10
 9  1  73020 418048 82392 23576900  52   0 1632  2592  5931 11629 60  3 29  8
 4  2  73004 415164 82424 23578620  56   0 1812  4116  7503 14674 71  3 15 12

On Mar 16, 2009, at 4:19 PM, Dave Youatt wrote:

> Last column "wa" is % cpu time spent waiting (for IO to complete). 80s
> and 90s is pretty high, probably too high.
>
> Might also want to measure the IO/s performance of your RAID controller.
> From the descriptions, it will be much more important than long
> sequential reads/writes for characterizing your workload.
> There are also some disappointing HW RAID controllers out there.
> Generally, Areca and Promise are good, Adaptec good, depending on model,
> and the ones that Dell ship w/their servers haven't had good
> reviews/reports.
>
> On 03/16/2009 01:10 PM, Joe Uhl wrote:
>> [Joe's vmstat 1 30 output and the earlier exchange with Alan Hodgson,
>> quoted in full, snipped]
On Mar 16, 2009, at 4:35 PM, Greg Smith wrote:

> [Greg Smith's earlier reply, quoted in full, snipped]
On Mon, Mar 16, 2009 at 2:50 PM, Joe Uhl <joeuhl@gmail.com> wrote:
> I dropped the pool sizes and brought things back up. Things are stable,
> site is fast, CPU utilization is still high. Probably just a matter of
> time before issue comes back (we get slammed as kids get out of school
> in the US).

Yeah, I'm guessing your server (or more specifically its RAID card) just
isn't up to the task. We had the same problem last year with a machine
with 16 Gig ram and dual dual-core 3.0GHz Xeons with a Perc 5 something or
other. No matter how we tuned it or played with it, we just couldn't get
good random performance out of it. It's since been replaced by a white box
unit with a Tyan mobo, dual 4-core Opterons, an Areca 1680, and a 12-drive
RAID-10. We can sustain 30 to 60 megs a second random access with 0 to 10%
iowait. Here's a typical vmstat 10 output when our load factor is hovering
around 8...

 r  b swpd   free  buff    cache si so   bi    bo   in    cs us sy id wa st
 4  1  460 170812 92856 29928156  0  0  604  3986 4863 10146 74  3 20  3  0
 7  1  460 124160 92912 29939660  0  0  812  5701 4829  9733 70  3 23  3  0
13  0  460 211036 92984 29947636  0  0  589  3178 4429  9964 69  3 25  3  0
 7  2  460  90968 93068 29963368  0  0 1067  4463 4915 11081 78  3 14  5  0
 7  3  460 115216 93100 29963336  0  0 3008  3197 4032 11812 69  4 15 12  0
 6  1  460 142120 93088 29923736  0  0 1112  6390 4991 11023 75  4 15  6  0
 6  0  460 157896 93208 29932576  0  0  698  2196 4151  8877 71  2 23  3  0
11  0  460 124868 93296 29948824  0  0  963  3645 4891 10382 74  3 19  4  0
 5  3  460  95960 93272 29918064  0  0  592 30055 5550  7430 56  3 18 23  0
 9  0  460  95408 93196 29914556  0  0 1090  3522 4463 10421 71  3 21  5  0
 9  0  460 128632 93176 29916412  0  0  883  4774 4757 10378 76  4 17  3  0

Note the bursty parts where we're shoving out 30 megs a second and the
wait jumps to 23%. That's about as bad as it gets during the day for us.

Note that in your vmstat output your bi column appears to be dominating
your bo column, so it looks like you're reaching a point where the write
cache on the controller gets full and your real throughput is shown to be
~1 megabyte a second outbound, and the inbound traffic either has priority
or is just filling in the gaps. It looks to me like your RAID card is
prioritizing reads over writes, and the whole system is just slowing to a
crawl. I'm willing to bet that if you were running pure SW RAID with no
RAID controller you'd get better numbers.
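(A sketch of what that SW RAID comparison could look like on the PITR
standby - device names and mount point are hypothetical, and this destroys
any data on the member disks:)

# WARNING: wipes the listed disks; build a 14-disk md RAID-10 to compare
mdadm --create /dev/md0 --level=10 --raid-devices=14 /dev/sd[b-o]
mkfs.xfs /dev/md0
mount /dev/md0 /mnt/pgtest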
Greg Smith <gsmith@gregsmith.com> writes:

> This says that your server is spending all its time waiting for I/O;
> actual CPU utilization is minimal. You're only achieving around 3-8MB/s
> of random I/O.
>
> The Perc5 controllers have a pretty bad reputation for performance on
> this list, even in RAID10. Not much you can do about that beyond
> scrapping the controller and getting a better one.

Hm, well the tests I ran for posix_fadvise were actually on a Perc5 --
though who knows if it was the same under the hood -- and I saw better
performance than this. I saw about 4MB/s for a single drive and up to
about 35MB/s for 15 drives. However this was using linux md raid-0, not
hardware raid.

But you shouldn't get your hopes up too much for random i/o. 3-8MB/s seems
low, but consider the following:

$ units
2445 units, 71 prefixes, 33 nonlinear units

You have: 8kB / .5|7200min
You want: MB/s
        * 1.92
        / 0.52083333

That is, one 8kB page per half-rotation of a 7200rpm drive works out to
about 1.92MB/s per spindle, so a few MB/s of random I/O across an array is
not far out of line.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!
On Mon, 16 Mar 2009, Joe Uhl wrote:

> Now when I run vmstat 1 30 it looks very different (below).

That looks much better. Obviously you'd like some more headroom on the CPU
situation than you're seeing, but that's way better than having so much
time spent waiting for I/O.

> max_connections = 1000
> work_mem = 30MB

Be warned that you need to be careful with this combination. If all 1000
connections were to sort something at once, you could end up with >30GB
worth of RAM used for that purpose. It's probably quite unlikely that will
happen, but 30MB is on the high side with that many connections.

I wonder if your pool might work better, in terms of lowering total CPU
usage, if you reduced the number of incoming connections. Each connection
adds some overhead, and now that you've got the I/O situation under better
control you might get by with fewer simultaneous ones. Something to
consider.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
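(The arithmetic behind that warning, plus the usual workaround of keeping
the global value modest and raising work_mem only in the sessions that
need it - the table and column names below are hypothetical:)

# worst case: 1000 connections x 30MB work_mem = ~30GB of RAM, and a
# single query can use work_mem once per sort/hash node, so it can be worse
psql <<'EOF'
SET work_mem = '128MB';  -- applies to this session only
SELECT user_id, count(*) FROM page_views
GROUP BY user_id ORDER BY count(*) DESC LIMIT 50;
EOF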
On Tue, 17 Mar 2009, Gregory Stark wrote:

> Hm, well the tests I ran for posix_fadvise were actually on a Perc5 --
> though who knows if it was the same under the hood -- and I saw better
> performance than this. I saw about 4MB/s for a single drive and up to
> about 35MB/s for 15 drives. However this was using linux md raid-0, not
> hardware raid.

Right, it's the hardware RAID on the Perc5 I think people mainly complain
about. If you use it in JBOD mode and let the higher performance CPU in
your main system drive the RAID functions it's not so bad.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Mar 17, 2009, at 12:19 AM, Greg Smith wrote:

> Right, it's the hardware RAID on the Perc5 I think people mainly
> complain about. If you use it in JBOD mode and let the higher
> performance CPU in your main system drive the RAID functions it's not
> so bad.

I have not yet had a chance to try software raid on the standby server
(still planning to) but wanted to follow up to see if there was any good
way to figure out what the postgresql processes are spending their CPU
time on.

We are under peak load right now, and I have Zabbix plotting CPU
utilization and CPU wait (from vmstat output) along with all sorts of
other vitals on charts. CPU utilization is a sustained 90% - 95% and CPU
wait is hanging below 10%. Since being pointed at vmstat by this list I
have been watching CPU wait, and it does get high at times (hence still
wanting to try the Perc5 in JBOD), but then there are sustained periods,
right now included, where our CPUs are just getting crushed while wait and
IO (only doing about 1.5 MB/sec right now) are very low.

This high CPU utilization only occurs when under peak load and when our
JDBC pools are fully loaded. We are moving more things into our cache and
constantly tuning indexes/tables, but just want to see if there is some
underlying cause that is killing us.

Any recommendations for figuring out what our database is spending its CPU
time on?
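(One low-tech way to get at that on 8.3 - a sketch that samples
pg_stat_activity once a second and tallies what the busy backends are
running; it assumes a superuser connection so current_query is visible.
Below the SQL level, a profiler such as oprofile is the usual next step:)

# tally the non-idle statements over a one-minute window
for i in $(seq 1 60); do
    psql -At -c "SELECT current_query FROM pg_stat_activity
                 WHERE current_query <> '<IDLE>'"
    sleep 1
done | sort | uniq -c | sort -rn | head -20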
On Fri, Mar 20, 2009 at 2:26 PM, Joe Uhl <joeuhl@gmail.com> wrote:
> We are under peak load right now, and I have Zabbix plotting CPU
> utilization and CPU wait (from vmstat output) along with all sorts of
> other vitals on charts. CPU utilization is a sustained 90% - 95% and
> CPU wait is hanging below 10%. [...]
>
> Any recommendations for figuring out what our database is spending its
> CPU time on?

What does the cs entry on vmstat say at this time? If your cs is
skyrocketing then you're getting a context switch storm, which is usually
a sign that there are just too many things going on at once / you've got
an old kernel, things like that.
On Mar 20, 2009, at 4:29 PM, Scott Marlowe wrote:

> What does the cs entry on vmstat say at this time? If your cs is
> skyrocketing then you're getting a context switch storm, which is
> usually a sign that there are just too many things going on at once /
> you've got an old kernel, things like that.

cs column (plus cpu columns) of vmstat 1 30 reads as follows:

   cs us sy id wa
11172 95  4  1  0
12498 94  5  1  0
14121 91  7  1  1
11310 90  7  1  1
12918 92  6  1  1
10613 93  6  1  1
 9382 94  4  1  1
14023 89  8  2  1
10138 92  6  1  1
11932 94  4  1  1
15948 93  5  2  1
12919 92  5  3  1
10879 93  4  2  1
14014 94  5  1  1
 9083 92  6  2  0
11178 94  4  2  0
10717 94  5  1  0
 9279 97  2  1  0
12673 94  5  1  0
 8058 82 17  1  1
 8150 94  5  1  1
11334 93  6  0  0
13884 91  8  1  0
10159 92  7  0  0
 9382 96  4  0  0
11450 95  4  1  0
11947 96  3  1  0
 8616 95  4  1  0
10717 95  3  1  0

We are running on the 2.6.28.7-2 kernel. I am unfamiliar with vmstat
output, but reading the man page (and that cs = "context switches per
second") makes my numbers seem very high.

Our sum JDBC pools currently top out at 400 connections (and we are doing
work on all 400 right now). I may try dropping those pools down even
smaller. Are there any general rules of thumb for figuring out how many
connections you should service at maximum? I know of the memory
constraints, but thinking more along the lines of connections per CPU
core.
On Fri, Mar 20, 2009 at 2:49 PM, Joe Uhl <joeuhl@gmail.com> wrote:
> cs column (plus cpu columns) of vmstat 1 30 reads as follows:
>
> [cs table snipped]
>
> We are running on the 2.6.28.7-2 kernel. I am unfamiliar with vmstat
> output, but reading the man page (and that cs = "context switches per
> second") makes my numbers seem very high.

No, those aren't really all that high. If you were hitting cs contention,
I'd expect it to be in the 25k to 100k range. <10k average under load is
pretty reasonable.

> Our sum JDBC pools currently top out at 400 connections (and we are
> doing work on all 400 right now). I may try dropping those pools down
> even smaller. Are there any general rules of thumb for figuring out how
> many connections you should service at maximum? I know of the memory
> constraints, but thinking more along the lines of connections per CPU
> core.

Well, maximum efficiency is usually somewhere in the range of 1 to 2 times
the number of cores you have, so trying to get the pool down to a dozen or
two connections would be the direction to generally head. May not be
reasonable or doable though.
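(A transaction-pooling front end is the usual way to get the backend count
down that far without shrinking the JDBC pools - a sketch of a minimal
pgbouncer setup; the database name, paths, auth choice, and pool size are
all assumptions to adjust:)

# illustrative pgbouncer config: 1000 client slots funneled into ~16 backends
cat > /etc/pgbouncer/pgbouncer.ini <<'EOF'
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = trust
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
default_pool_size = 16
max_client_conn = 1000
EOF
pgbouncer -d /etc/pgbouncer/pgbouncer.ini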
On Mar 20, 2009, at 4:58 PM, Scott Marlowe wrote:

> Well, maximum efficiency is usually somewhere in the range of 1 to 2
> times the number of cores you have, so trying to get the pool down to a
> dozen or two connections would be the direction to generally head. May
> not be reasonable or doable though.

Thanks for the info. Figure I can tune our pools down and monitor
throughput/CPU/IO and look for a sweet spot with our existing hardware.
Just wanted to see if tuning connections down could potentially help.

I feel as though we are going to have to replicate this DB before too
long. We've got an almost identical server doing nothing but PITR with 8
CPU cores mostly idle that could be better spent. Our pgfouine reports,
though only logging queries that take over 1 second, show 90% reads.

I have heard much about Slony, but has anyone used the newer version of
Mammoth Replicator (or looks to be called PostgreSQL + Replication now) on
8.3? From the documentation, it appears to be easier to set up and less
invasive, but I struggle to find usage information/stories online.
On Mar 20, 2009, at 4:58 PM, Scott Marlowe wrote:

> Well, maximum efficiency is usually somewhere in the range of 1 to 2
> times the number of cores you have, so trying to get the pool down to a
> dozen or two connections would be the direction to generally head. May
> not be reasonable or doable though.

Turns out we may have an opportunity to purchase a new database server
with this increased load. Seems that the best route, based on feedback to
this thread, is to go whitebox, get quad Opterons, and get a very good
disk controller.

Can anyone recommend a whitebox vendor?

Is there a current controller anyone on this list has experience with that
they could recommend?

This will be a bigger purchase so we will be doing research and
benchmarking, but any general pointers to a vendor/controller are greatly
appreciated.
On Tue, 24 Mar 2009, Joe Uhl wrote:

> Can anyone recommend a whitebox vendor?

I dumped a list of recommended vendors from a discussion here a while back
at http://wiki.postgresql.org/wiki/SCSI_vs._IDE/SATA_Disks you could get
started with.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Tue, Mar 24, 2009 at 1:29 PM, Greg Smith <gsmith@gregsmith.com> wrote:
> On Tue, 24 Mar 2009, Joe Uhl wrote:
>> Can anyone recommend a whitebox vendor?
>
> I dumped a list of recommended vendors from a discussion here a while
> back at http://wiki.postgresql.org/wiki/SCSI_vs._IDE/SATA_Disks you
> could get started with.

I'd add Aberdeen Inc to that list. They supply quality white box servers
with 3ware, Areca, or LSI controllers, and provide a 5 year all inclusive
warranty. Their customer service is top notch too.
At 02:47 PM 3/24/2009, Joe Uhl wrote:

> Turns out we may have an opportunity to purchase a new database server
> with this increased load. Seems that the best route, based on feedback
> to this thread, is to go whitebox, get quad opterons, and get a very
> good disk controller.
>
> Can anyone recommend a whitebox vendor?

I'll 2nd the Aberdeen recommendation. I'll add Pogolinux to that list as
well.

> Is there a current controller anyone on this list has experience with
> that they could recommend?

The 2 best performing RAID controller vendors at this time are AMCC (AKA
3Ware) and Areca. In general, the 8+ port Arecas with their BB cache maxed
outperform every other controller available.

> This will be a bigger purchase so will be doing research and
> benchmarking but any general pointers to a vendor/controller greatly
> appreciated.

Be =very= careful to thoroughly bench both the AMD and Intel CPU options.
It is far from clear which is the better purchase.

I'd be very interested to see the results of your research and benchmarks
posted here on pgsql-performance.

Ron Peacetree
On Tue, Mar 24, 2009 at 4:58 PM, Ron <rjpeace@earthlink.net> wrote:

> The 2 best performing RAID controller vendors at this time are AMCC
> (AKA 3Ware) and Areca. In general, the 8+ port Arecas with their BB
> cache maxed outperform every other controller available.
>
> Be =very= careful to thoroughly bench both the AMD and Intel CPU
> options. It is far from clear which is the better purchase.

My anecdotal experience has been that the Opterons stay afloat longer as
load increases, but I haven't had machines with similar enough hardware to
really test that.

> I'd be very interested to see the results of your research and
> benchmarks posted here on pgsql-performance.

Me too. I'm gonna spend some time this summer benchmarking and tuning the
database servers that I pretty much had to burn in and put in production
this year due to time pressures.
On 3/24/09 4:16 PM, "Scott Marlowe" <scott.marlowe@gmail.com> wrote:

> On Tue, Mar 24, 2009 at 4:58 PM, Ron <rjpeace@earthlink.net> wrote:
>> The 2 best performing RAID controller vendors at this time are AMCC
>> (AKA 3Ware) and Areca. In general, the 8+ port Arecas with their BB
>> cache maxed outperform every other controller available.

I personally have had rather bad performance experiences with 3Ware
9550/9650 SATA cards. I have no experience with the AMCC SAS stuff though.
Adaptec demolished the 9650 on arrays larger than 4 drives, and Areca will
do better at the very high end. However, if CPU is the issue for this
particular case, then the RAID controller details are less significant.

I don't know how much data you have, but don't forget the option of SSDs,
or a mix of hard drives and SSDs for different data. Ideally, you would
want the OS to just extend its page cache onto a SSD, but only OpenSolaris
can do that right now and it is rather new (needs to be persistent across
reboots).
http://blogs.sun.com/brendan/entry/test
http://blogs.sun.com/brendan/entry/l2arc_screenshots

>> Be =very= careful to thoroughly bench both the AMD and Intel CPU
>> options. It is far from clear which is the better purchase.
>
> My anecdotal experience has been that the Opterons stay afloat longer
> as load increases, but I haven't had machines with similar enough
> hardware to really test that.

One may want to note that Intel's next generation servers are due out
within 45 days from what I can sense ('Q2' traditionally means ~April 1
for Intel when on time). These should be a rather significant bump for a
database, as they adopt the AMD / Alpha style memory-controller-on-CPU
architecture and add a lot of cache. Other relevant improvements:
increased performance on compare-and-swap operations, the return of
hyper-threading, and ridiculous memory bandwidth per CPU (3 DDR3 memory
channels per CPU).