oprofile results for stats collector test - Mailing list pgsql-hackers

From Tom Lane
Subject oprofile results for stats collector test
Date
Msg-id 22408.1150570607@sss.pgh.pa.us
List pgsql-hackers
I've gotten some insight into the stats collection issues by monitoring
Bruce's test case with oprofile (http://oprofile.sourceforge.net/).

Test conditions: PG CVS HEAD, built with --enable-debug --disable-cassert
(debug symbols are needed for oprofile), on current Fedora Core 5
(Linux kernel 2.6.16-1.2133_FC5 #1 SMP, gcc 4.1.1), running on dual Xeon
EM64T with hyperthreading enabled.  All configuration parameters are at
defaults except for stats_command_string.  I extended the test script to
be 240000 "SELECT 1;" commands so that it would run long enough to get a
trustworthy profile.  In the profile data below, I cut off each listing
where the percentages fell below about 1% of global runtime.
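For anyone wanting to set up a similar load, here is a minimal sketch (not the script actually used; the file name is invented, the command count is taken from the description above) that writes out the 240000-command test file:

/* gen_select1.c: write 240000 "SELECT 1;" commands to a test file.
 * Illustrative reconstruction only, not the original test script. */
#include <stdio.h>

int
main(void)
{
    FILE   *f = fopen("select1.sql", "w");   /* file name is made up */
    int     i;

    if (f == NULL)
    {
        perror("fopen");
        return 1;
    }
    for (i = 0; i < 240000; i++)
        fputs("SELECT 1;\n", f);
    fclose(f);
    return 0;
}

The resulting file can then be fed to the backend with something like "psql -f select1.sql" while the profiler is sampling.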


With stats_command_string off:

21.7 sec to run script (with oprofile off)
22.7 sec to run script (with oprofile active)
These numbers stay pretty consistent over repeated trials.

oprofile global report:
CPU: P4 / Xeon with 2 hyper-threads, speed 2793.03 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 240000
GLOBAL_POWER_E...| samples|      %|
------------------
 156544 47.2570 /usr/lib/debug/lib/modules/2.6.16-1.2133_FC5/vmlinux
  84114 25.3921 /home/tgl/testversion/bin/postgres
  63873 19.2818 /lib64/libc-2.4.so
  13621  4.1119 /home/tgl/testversion/bin/psql
  10101  3.0493 /home/tgl/testversion/lib/libpq.so.5.0
   1324  0.3997 /usr/bin/oprofiled

kernel (note percentages here are percentage of kernel time, not total time):
samples  %        symbol name
19326    12.3454  schedule
11738     7.4982  _raw_spin_lock
11631     7.4299  __switch_to
9897      6.3222  mwait_idle
7153      4.5693  __find_first_bit
6395      4.0851  try_to_wake_up
3972      2.5373  unix_stream_recvmsg
3813      2.4357  system_call
3739      2.3885  thread_return
3612      2.3073  find_busiest_group
3446      2.2013  resched_task

postgres:
samples  %        symbol name
6415      7.6266  base_yyparse
6115      7.2699  AllocSetAlloc
3525      4.1907  hash_search

libc:
samples  %        symbol name
9276     14.5226  memcpy
8686     13.5989  _int_malloc
5865      9.1823  strncpy
5538      8.6703  strlen
5286      8.2758  strspn
4351      6.8120  vfprintf


With stats_command_string on:

45.2 sec to run script (with oprofile off)
47.5 sec to run script (with oprofile active)
Also repeatable.

oprofile global report:
CPU: P4 / Xeon with 2 hyper-threads, speed 2793.03 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 240000
GLOBAL_POWER_E...| samples|      %|
------------------
 785756 71.9386 /usr/lib/debug/lib/modules/2.6.16-1.2133_FC5/vmlinux
 135820 12.4348 /home/tgl/testversion/bin/postgres
 105828  9.6889 /lib64/libc-2.4.so
  19903  1.8222 /home/tgl/testversion/lib/libpq.so.5.0
  19293  1.7663 /home/tgl/testversion/bin/psql
   7489  0.6856 /ip_conntrack
   4555  0.4170 /usr/bin/oprofiled

kernel:
samples  %        symbol name
62337     7.9334  security_port_sid
61566     7.8353  schedule
37897     4.8230  _raw_spin_lock
36410     4.6338  __switch_to
26580     3.3827  mwait_idle
21797     2.7740  __find_first_bit
19118     2.4331  avc_has_perm_noaudit
18929     2.4090  sys_select
17005     2.1642  system_call
15979     2.0336  do_select
15762     2.0060  fget
14838     1.8884  try_to_wake_up
13488     1.7166  find_busiest_group
11956     1.5216  _raw_read_lock
11685     1.4871  thread_return
10639     1.3540  resched_task
10542     1.3416  copy_user_generic

postgres:
samples  %        symbol name
7817      5.7554  PgstatBufferMain
7541      5.5522  base_yyparse
7247      5.3357  AllocSetAlloc
6725      4.9514  hash_search

libc:
samples  %        symbol name
15012    14.1853  strncpy
12265    11.5896  _int_malloc
11949    11.2910  memcpy


This system is running with SELinux enabled (in permissive mode).
I presume that if I turned it off, security_port_sid() and
avc_has_perm_noaudit() would drop off the radar screen but things
wouldn't change much otherwise.

Also, you'd expect to see PgstatCollectorMain() somewhere but it's
not in the oprofile report at all.  I believe that gcc sees that it's
a static singly-referenced function and inlines it into
PgstatBufferMain; so the time shown above for PgstatBufferMain should
be read as the joint time for those two functions.  Also it looks like
all of the pgstat_recv_foo functions got inlined into it too.  The
only pgstat functions that show separately in the oprofile output are

1214      0.8938  pgstat_add_backend
860       0.6332  pgstat_report_tabstat
749       0.5515  pgstat_report_activity
434       0.3195  pgstat_write_statsfile
330       0.2430  pgstat_setheader
196       0.1443  pgstat_count_xact_commit
106       0.0780  pgstat_send
91        0.0670  pgstat_get_db_entry

So the total time spent directly in pgstat.c is less than 9% of the postgres
executable's time (about 5.8% for PgstatBufferMain plus roughly 2.9% for the
functions listed above), and since the postgres executable accounts for only
12.4% of all samples, that works out to only about 1% of total runtime.
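As an aside, the inlining effect is easy to demonstrate outside the backend.
The following toy program (purely illustrative, nothing to do with the actual
pgstat code) has a static function with a single call site; built with
gcc -O2, that function typically disappears as a separate symbol, which is
exactly why PgstatCollectorMain's samples get attributed to PgstatBufferMain:

/* inline_demo.c: a static, singly-referenced function is normally
 * folded into its caller at -O2, so it never shows up in a profile. */
#include <stdio.h>

static long
collector_loop(long n)          /* only called from main() below */
{
    long    sum = 0;
    long    i;

    for (i = 0; i < n; i++)
        sum += i;
    return sum;
}

int
main(void)
{
    /* After "gcc -O2 inline_demo.c", "nm a.out | grep collector_loop"
     * usually prints nothing: the body has been merged into main(). */
    printf("%ld\n", collector_loop(1000000));
    return 0;
}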

Comparing the sample counts between the two cases ought to be legitimate
because in both cases the oprofile run covered the full run of the test
script.  However the second case shows twice as many hits in psql and
libpq.so, which ought to be doing the same amount of work, so I think
there's some measurement artifact involved there.  In any case, the
kernel time has gone up way more than the userland time (a result
confirmed by "top" and "vmstat").

Conclusions:

* Kernel time is what's hurting us.

* Most of the kernel time is associated with process scheduling.
In our current system, a simple command with no stats overhead
involves two process dispatches (the backend and psql) per command
round trip.  Enabling stats_command_string increases that to six,
because we'll also dispatch the buffer process and collector process
twice per command.

* If you are using SELinux, UDP datagrams are expensive because the
kernel permission-checks each one separately (even with a bound socket,
apparently).  We might want to think about replacing the UDP mechanism
with a Unix socket where available (a rough sketch follows below).
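Purely for illustration, and assuming nothing about how the real pgstat code
would be restructured, the receiving end of such a Unix-domain datagram
socket could look like this (the socket path is invented):

/* unix_dgram_demo.c: bind a Unix-domain datagram socket, the kind of
 * replacement suggested above for the stats UDP socket. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

int
main(void)
{
    struct sockaddr_un addr;
    int         fd = socket(AF_UNIX, SOCK_DGRAM, 0);

    if (fd < 0)
    {
        perror("socket");
        return 1;
    }

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, "/tmp/.s.PGSTAT.demo", sizeof(addr.sun_path) - 1);

    unlink(addr.sun_path);      /* remove any stale socket file */
    if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0)
    {
        perror("bind");
        close(fd);
        return 1;
    }

    /* Senders would sendto() this address just as they do with the UDP
     * socket today; only the address family (and hence the per-datagram
     * permission checking) differs. */
    close(fd);
    return 0;
}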

In short, it sure looks like the separate buffer process is a huge loss,
at least on this platform.  Getting rid of it should reduce the kernel
overhead by a third.  It's hard to tell how much userland time it would
save, but the userland time isn't the problem.
        regards, tom lane

