Thread: How to analyze load average?
Hello, can someone tell me how I can analyze where my server's load average comes from? I have one server with 128 GB memory, 32 CPUs (x86_64), and a RAID5 array of 3 x 15k SAS HDDs with an ext4 filesystem. It is my production server, and it is also configured to ship WAL files over the network. Here is my configuration:

max_connections = 500
shared_buffers = 32GB
work_mem = 192MB
maintenance_work_mem = 6GB
max_stack_depth = 6MB
bgwriter_delay = 200ms
bgwriter_lru_maxpages = 100
bgwriter_lru_multiplier = 2.0
wal_level = hot_standby
fsync = on
synchronous_commit = on
wal_sync_method = fdatasync
full_page_writes = on
wal_buffers = -1
checkpoint_segments = 32
checkpoint_timeout = 5min
checkpoint_completion_target = 0.5
max_wal_senders = 5
wal_sender_delay = 1s
wal_keep_segments = 64
enable_bitmapscan = on
enable_hashagg = on
enable_hashjoin = on
enable_indexscan = on
enable_material = on
enable_mergejoin = on
enable_nestloop = on
enable_seqscan = on
enable_sort = on
enable_tidscan = on
seq_page_cost = 1.0
random_page_cost = 2.0
cpu_tuple_cost = 0.01
cpu_index_tuple_cost = 0.005
cpu_operator_cost = 0.0025
effective_cache_size = 64GB
autovacuum = on

The write cache on my on-board RAID controller is OFF. When I connect to the server I see only 2 queries in select * from pg_stat_activity; and they are not complicated, e.g. select rid from table where id = 1; Both tables have indexes on the most frequently used columns. Still, when I check, my server load average is 0.88 0.94 0.87. I'm trying to find out why that load avg is so high; only PostgreSQL 9.1.4 is running on that server. Can someone point me to where I should start digging? I think my configuration of connections and shared buffers is right, as I followed the documentation. I suspect the slowdown could be because the cache on the RAID card is OFF. As I read on the PostgreSQL wiki pages, if I turn that setting ON I might lose some data in case of a failure, but the company has a UPS and I also have streaming replication, so I wouldn't lose much data.

My iostat shows:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.90    0.00    1.06    0.00    0.00   98.04

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.92    0.00    1.06    0.00    0.00   97.02

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               0.00         0.00         0.00          0          0

And my vmstat:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd     free     buff    cache   si   so   bi   bo    in    cs us sy id wa
 0  0      0 99307408  334300 31144708    0    0    1   18     1     0  1  1 98  0
 0  0      0 99303808  334300 31144716    0    0    0    0   926   715  0  0 99  0
 0  0      0 99295232  334300 31144716    0    0    0    0   602   532  0  0 99  0
 4  0      0 99268160  334300 31144716    0    0    0   32   975   767  2  2 96  0
 1  0      0 99298544  334300 31144716    0    0    0    0   801   445  3  2 95  0
 0  0      0 99311336  334300 31144716    0    0    0    0   320   175  1  0 98  0
 2  0      0 99298920  334300 31144716    0    0    0    0  1195   996  1  1 97  0
 0  0      0 99307184  334300 31144716    0    0    0    0   843   645  0  1 98  0
 0  0      0 99301024  334300 31144716    0    0    0   12  1346  1040  2  2 96  0

Can anyone tell me how I can find out why that load average is so high? Thanks
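To narrow down where load like this comes from, a minimal first pass (a sketch, assuming a Linux box with the sysstat package installed for pidstat) is to correlate the load average with per-process CPU usage:

    # current 1/5/15-minute load and count of runnable tasks
    cat /proc/loadavg

    # per-process CPU usage, 1-second interval, 5 samples
    pidstat 1 5

    # top CPU consumers right now (postgres backends show up by name)
    ps -eo pid,pcpu,pmem,stat,comm --sort=-pcpu | head -20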
On 6 Srpen 2012, 16:23, Condor wrote:
> Hello,
>
> can someone tell me how I can analyze where my server's load average
> comes from?
>
> ...
>
> When I connect to the server I see only 2 queries in select * from
> pg_stat_activity; and they are not complicated, e.g. select rid from
> table where id = 1; Both tables have indexes on the most frequently
> used columns. Still, when I check, my server load average is
> 0.88 0.94 0.87
>
> ...
>
> Can anyone tell me how I can find out why that load average is so
> high?

Errr, what? Why do you think the load average is high?

Load average is defined as the number of processes in the run queue (i.e. using or waiting for a CPU). So the load average "0.88 0.94 0.87" means there was less than one process waiting for a CPU most of the time. I wouldn't call that a "high load average", especially not on a 32-core system.

Tomas
On Mon, 06 Aug 2012 09:38:33 -0500, Tomas Vondra <tv@fuzzy.cz> wrote:

> Load average is defined as the number of processes in the run queue

That depends on whether he's running Linux or BSD.

http://www.undeadly.org/cgi?action=article&sid=20090715034920
On 6 Srpen 2012, 16:54, Mark Felder wrote:
> On Mon, 06 Aug 2012 09:38:33 -0500, Tomas Vondra <tv@fuzzy.cz> wrote:
>
>> Load average is defined as the number of processes in the run queue
>
> That depends on whether he's running Linux or BSD.
>
> http://www.undeadly.org/cgi?action=article&sid=20090715034920

Well, even this link states that "... most unixen load average is some measure of the size of the run queue - or the number of runnable processes over a set period", and in this sense what I said is true even on BSD systems. But you're right, the definitions are a bit different.

The OP mentioned he's using ext4, so I suppose he's running Linux (although I know there was some ext4 support e.g. in FreeBSD). Still, a load average of 0.88 means the system is almost idle, especially when there's no I/O activity etc.

Tomas
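One practical difference worth noting: on Linux the load average counts not only runnable tasks but also tasks in uninterruptible sleep (typically waiting on disk I/O). A quick sketch, using only standard procps tools, to see which processes are being counted at a given instant:

    # list processes currently runnable (R) or in uninterruptible sleep (D);
    # these are the states that feed the Linux load average
    ps -eo pid,stat,pcpu,wchan:32,comm | awk 'NR==1 || $2 ~ /^[RD]/'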
On Mon, 06 Aug 2012 10:27:18 -0500, Tomas Vondra <tv@fuzzy.cz> wrote:
>
> The OP mentioned he's using ext4, so I suppose he's running Linux
> (although I know there was some ext4 support e.g. in FreeBSD).
> Still, a load average of 0.88 means the system is almost idle,
> especially when there's no I/O activity etc.

Ahh, I didn't see the mention of ext4 initially. I tend to just use iostat to get a better baseline of what's truly happening on the system. At least on FreeBSD (not sure about Linux at the moment) the iostat output also lists CPU usage in the last columns, and if "id" (idle) is not close to zero it's probably OK. :-)
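On Linux (with the sysstat package) iostat gives the same kind of baseline; a minimal sketch, with the interval and report count chosen only as examples:

    # CPU summary plus extended per-device utilization, 1-second samples, 5 reports
    iostat -x 1 5

    # CPU summary only
    iostat -c 1 5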
On 2012-08-06 17:38, Tomas Vondra wrote:
> On 6 Srpen 2012, 16:23, Condor wrote:
>> Hello,
>>
>> can someone tell me how I can analyze where my server's load average
>> comes from?
>>
>> ...
>>
>> When I connect to the server I see only 2 queries in select * from
>> pg_stat_activity; and they are not complicated, e.g. select rid from
>> table where id = 1; Both tables have indexes on the most frequently
>> used columns. Still, when I check, my server load average is
>> 0.88 0.94 0.87
>>
>> ...
>>
>> Can anyone tell me how I can find out why that load average is so
>> high?
>
> Errr, what? Why do you think the load average is high?
>
> Load average is defined as the number of processes in the run queue
> (i.e. using or waiting for a CPU). So the load average "0.88 0.94 0.87"
> means there was less than one process waiting for a CPU most of the
> time. I wouldn't call that a "high load average", especially not on a
> 32-core system.
>
> Tomas

I think the load avg is high because before I changed servers, my production server had 16 CPUs and 24 GB memory, and the load avg on that server was 0.24. The database is the same, the users of the server are the same, nothing has changed. I dumped the DB from the old server and imported it into the new one a few days ago, and because it is the new server with more resources I am monitoring its load avg, and I think it is too high. For that reason I'm asking whether there is a way to find out why my load avg is 0.88. When I run select * from pg_stat_activity; I don't see more than 3-4 queries, none of them very complicated, and I have already run them through EXPLAIN to see the plans.

I know what load average means; I was an OpenBSD user for a few years, and now I use Slackware with kernel 3.5.

Hristo
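For completeness, a quick way to re-check those queries from the shell is to run them through EXPLAIN ANALYZE; in the sketch below, dbname and mytable are placeholders for the real names from the example query above:

    # show the actual plan and execution time for one of the observed queries
    psql -d dbname -c "EXPLAIN ANALYZE SELECT rid FROM mytable WHERE id = 1;"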
Condor <condor@stz-bg.com> wrote:

> For that reason I'm asking whether there is a way to find out why my
> load avg is 0.88. When I run select * from pg_stat_activity;

So, on a 32-core system, if you run vmstat or iostat with a short interval during such an episode, you should be seeing about 97% idle time for your CPUs. If you want to know what's sucking up the other 3%, you might want to try oprofile.

-Kevin
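A minimal sketch of what that profiling could look like on Linux; the exact commands depend on the oprofile version installed (newer releases ship operf instead of opcontrol), and perf is an alternative if the linux-tools package is available:

    # legacy oprofile workflow
    opcontrol --no-vmlinux          # skip kernel-image profiling
    opcontrol --start               # start system-wide profiling
    sleep 60                        # collect samples under normal load
    opcontrol --dump
    opreport --symbols | head -30   # top CPU consumers by symbol
    opcontrol --shutdown

    # or, with perf
    perf top                        # live view of the hottest functions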
> I think the load avg is high because before I changed servers, my
> production server had 16 CPUs and 24 GB memory, and the load avg on
> that server was 0.24. The database is the same, the users of the
> server are the same, nothing has changed. I dumped the DB from the old
> server and imported it into the new one a few days ago, and because it
> is the new server with more resources I am monitoring its load avg,
> and I think it is too high. For that reason I'm asking whether there
> is a way to find out why my load avg is 0.88. When I run select * from
> pg_stat_activity; I don't see more than 3-4 queries, none of them very
> complicated, and I have already run them through EXPLAIN to see the
> plans.

Well, the load average is a bit difficult to analyze because of the exponential damping. Also, I find it a bit artificial, and if there are no sudden peaks or slowdowns I wouldn't bother analyzing it.

A wild guess is that the new server has more CPUs but at a lower frequency, therefore the tasks run longer and impact the load average accordingly. There are other such things (e.g. maintenance of larger shared buffers takes more time).

Have you verified that the performance of the new hardware matches expectations and that it's actually faster than the old server?

> I know what load average means; I was an OpenBSD user for a few years,
> and now I use Slackware with kernel 3.5.

So you do have 3.5 in production? Wow, you're quite adventurous.

Tomas
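One simple way to sanity-check the new hardware against the old one is a short pgbench run on each box; a minimal sketch, assuming a scratch database named benchdb and with the scale factor, client count, and duration chosen only as examples:

    # create and populate a test database at scale factor 100 (roughly 1.5 GB)
    createdb benchdb
    pgbench -i -s 100 benchdb

    # run 60 seconds with 16 concurrent clients and 4 worker threads,
    # then compare the reported tps between the old and new servers
    pgbench -c 16 -j 4 -T 60 benchdb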
On Mon, Aug 06, 2012 at 08:06:05PM +0300, Condor wrote:
> I think the load avg is high because before I changed servers, my
> production server had 16 CPUs and 24 GB memory, and the load avg on
> that server was 0.24. The database is the same,

Our monitoring system starts worrying about the load average if it ever goes above 0.75 * number of cores. In your example it looks a bit like you paid for 15 more cores than necessary.

Especially at the lower end you have to take the load with a large grain of salt. Lots of short-running processes (like a make run) will make the load fluctuate. But even things like it taking a while for your disk cache to reach a steady state after a reboot can mean that you see a higher-than-normal load for a while. Still, 0.88 is really nothing to worry about.

Perhaps it is just a slower core or a slower memory bus.

Have a nice day,

--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts. -- Arthur Schopenhauer
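As a rough illustration of that kind of alerting rule, a one-liner sketch (the 0.75 factor is just the threshold mentioned above, not a universal constant):

    # warn if the 1-minute load average exceeds 0.75 * number of CPU cores
    awk -v cores=$(nproc) '$1 > 0.75 * cores {print "load is high:", $1}' /proc/loadavg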
Tomas Vondra wrote:
>> I think the load avg is high because before I changed servers, my
>> production server had 16 CPUs and 24 GB memory, and the load avg on
>> that server was 0.24. The database is the same, the users of the
>> server are the same, nothing has changed. I dumped the DB from the
>> old server and imported it into the new one a few days ago, and
>> because it is the new server with more resources I am monitoring its
>> load avg, and I think it is too high. For that reason I'm asking
>> whether there is a way to find out why my load avg is 0.88. When I
>> run select * from pg_stat_activity; I don't see more than 3-4
>> queries, none of them very complicated, and I have already run them
>> through EXPLAIN to see the plans.
>
> Well, the load average is a bit difficult to analyze because of the
> exponential damping. Also, I find it a bit artificial, and if there
> are no sudden peaks or slowdowns I wouldn't bother analyzing it.
>
> A wild guess is that the new server has more CPUs but at a lower
> frequency, therefore the tasks run longer and impact the load average
> accordingly. There are other such things (e.g. maintenance of larger
> shared buffers takes more time).
>
> Have you verified that the performance of the new hardware matches
> expectations and that it's actually faster than the old server?
>
>> I know what load average means; I was an OpenBSD user for a few
>> years, and now I use Slackware with kernel 3.5.
>
> So you do have 3.5 in production? Wow, you're quite adventurous.

Yep, that's me :)

> Tomas

Hello to everyone again, sorry for my late reply, but I found the problem (I think). I changed the default I/O scheduler from noop to deadline and my load average dropped down to 0.23.
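For anyone wanting to check or reproduce this, a minimal sketch of inspecting and switching the I/O scheduler on Linux (sda is just an example device; the runtime change does not survive a reboot unless made permanent, e.g. via the elevator= kernel boot parameter):

    # show the available schedulers; the active one is in brackets,
    # e.g. "noop [deadline] cfq"
    cat /sys/block/sda/queue/scheduler

    # switch the scheduler at runtime (as root)
    echo deadline > /sys/block/sda/queue/scheduler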