Re: [HACKERS] Postgres Speed or lack thereof - Mailing list pgsql-hackers

From Tom Lane
Subject Re: [HACKERS] Postgres Speed or lack thereof
Date
Msg-id 27367.916637503@sss.pgh.pa.us
Whole thread Raw
In response to Re: [HACKERS] Postgres Speed or lack thereof  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: [HACKERS] Postgres Speed or lack thereof  (Bruce Momjian <maillist@candle.pha.pa.us>)
List pgsql-hackers
I wrote:
> As for where the time is going: "top" showed that the time for the
> INSERT ops was almost all going into backend CPU time.  My guess is
> that most of the cost is in parsing and planning the INSERT
> statements.

Having an idle hour this evening, I thought it'd be interesting to build
a backend with profiling enabled, so as to confirm or deny the above
guess.  It seems that indeed a lot of time is being wasted, but where
it's being wasted might surprise you!

I profiled a backend built from this morning's current CVS sources,
being driven by a single psql run reading a series of exactly 5000
INSERT commands.  The first few are:

INSERT INTO "fehistory_1" values (64,14,252,'D','-','-','s',1,4027,NULL,'S',1,4088,NULL,32,'Thu Jan 07 15:15:00 1999
EST','FriJan 08 08:30:38 1999 EST',4027,32,NULL,NULL,46,NULL);
 
INSERT INTO "fehistory_1" values (65,14,252,'P','-','-','S',1,4086,NULL,NULL,NULL,NULL,NULL,32,'Fri Jan 08 16:00:00
1999EST',NULL,NULL,NULL,NULL,NULL,45,NULL);
 
INSERT INTO "fehistory_1" values (66,14,135,'P','-','-','S',1,13619,NULL,'s',1,12967,NULL,100,'Fri Jan 08 02:00:00 1999
EST',NULL,NULL,NULL,NULL,NULL,44,NULL);

and it goes on for a long while in that vein :-).  Total runtime was
278.09 CPU seconds in the backend, 307 seconds elapsed realtime.  The
table being inserted into has no indexes and was initially empty.

The profiler is GNU gprof, which provides both flat and structured
profile data.  The top few entries in the flat profile are:
 %   cumulative   self              self     total           time   seconds   seconds    calls  ms/call  ms/call  name
 34.96     97.21    97.21                             _mcount 7.18    117.19    19.98  8047816     0.00     0.00
tree_cut6.56    135.43    18.24   962187     0.02     0.02  recv 3.55    145.31     9.88  8234278     0.00     0.00
tree_insert3.49    155.02     9.71  4292947     0.00     0.01  malloc 3.31    164.22     9.20  1965056     0.00
0.01 SearchSysCache 3.24    173.23     9.01  4291775     0.00     0.01  free 2.44    180.01     6.78  6171667     0.00
  0.00  tree_concatenate 2.17    186.04     6.03  8234069     0.00     0.00  tree_delete 1.07    189.02     2.98
10250    0.29     0.29  write 1.05    191.94     2.92                             _psort_cmp 0.95    194.57     2.63
2983301    0.00     0.00  newNode 0.92    197.14     2.57  4251762     0.00     0.00  OrderedElemPop 0.88    199.60
2.46 2746172     0.00     0.00  comphash 0.82    201.88     2.28  4252495     0.00     0.01  AllocSetAlloc 0.77
204.02    2.14  1965056     0.00     0.00  SearchSysCacheTuple 0.73    206.05     2.03   265000     0.01     0.01
yylex0.69    207.96     1.91                             $$remU 0.68    209.86     1.90  4252500     0.00     0.00
OrderedElemPushHead0.62    211.58     1.72  4247473     0.00     0.00  palloc 0.60    213.26     1.68    25004     0.07
   0.07  send 0.56    214.81     1.55  1965056     0.00     0.00  CatalogCacheComputeHashIndex 0.55    216.33     1.52
  5000     0.30     1.07  yyparse 0.54    217.82     1.49  4251762     0.00     0.01  AllocSetFree 0.53    219.29
1.47 4252494     0.00     0.00  MemoryContextAlloc 0.51    220.71     1.42  4252500     0.00     0.00
OrderedElemPushInto0.49    222.07     1.36  3430844     0.00     0.00  strncmp 0.48    223.41     1.34  4205327
0.00    0.00  OrderedSetGetHead 0.47    224.73     1.32                             elog 0.45    225.99     1.26
1964976    0.00     0.00  DLRemove 0.43    227.19     1.20  2510857     0.00     0.00  strcmp 0.43    228.38     1.19
4200327    0.00     0.00  OrderedElemGetBase 0.42    229.56     1.18  4245740     0.00     0.01  PortalHeapMemoryAlloc
0.38   230.62     1.06  4252500     0.00     0.00  OrderedElemPush 0.37    231.66     1.04  4205327     0.00     0.00
AllocSetGetFirst0.36    232.66     1.00  1965145     0.00     0.00  DLAddHead
 

_mcount is part of the profiling apparatus and can be ignored (although
the fact that it's so high up implies that we're executing an awful lot
of short routine calls, because mcount runs once per entry and exit of
profilable routines).  tree_cut, tree_insert, tree_concatenate, and
tree_delete are subroutines of malloc/free.  Altogether, malloc/free
and friends accounted for 61.39 seconds out of the 180.88 non-overhead
CPU seconds in this run.

In other words, we're spending a third of our time mallocing and freeing
memory.  A tad high, what?

Actually, it's worse than that, because AllocSetAlloc,
PortalHeapMemoryAlloc, AllocSetFree, and all of the OrderedElemXXX
routines represent our own bookkeeping layer atop malloc/free.
That's another 18.66 seconds spent in these top routines, which means
that we are real close to expending half the backend's runtime on
memory bookkeeping.  This needs work.

The other thing that jumps out here is the unreasonably high position of
recv(), which is called 962187 times.  The script being read by psql was
only 957186 characters.  Evidently we're invoking a kernel recv() call
once per character read from the frontend.  I suspect this is an
inefficiency introduced by Magnus Hagander's recent rewrite of backend
libpq (see, I told you there was a reason for using stdio ;-)).  We're
gonna have to do something about that, though it's not as critical as
the memory-allocation issue.  It also appears that send() is now being
invoked multiple times per backend reply, which is going to create
inefficiencies outside the backend (ie, multiple packets per reply).
On a test case with a lot of SELECTs that would show up more than it
does here.


Moving on to the dynamic profile, the most interesting items are:

index % time    self  children    called     name
-----------------------------------------------               0.05  118.06       1/1           DoBackend [7]
[8]     65.3    0.05  118.06       1         PostgresMain [8]               0.04   57.64    5000/5000
pg_exec_query[9]               0.01   39.83    5000/5000        CommitTransactionCommand [11]               0.02
18.86   5001/5001        ReadCommand [27]               0.01    1.18    5000/5000        StartTransactionCommand [118]
            0.01    0.36    5001/5001        ReadyForQuery [190]               0.04    0.00    5001/5001
EnableNotifyInterrupt[385]               0.04    0.00    5000/5000        strspn [388]               0.00    0.01
1/1          InitPostgres [495]               0.00    0.00    5000/548573      strlen [172]               0.00    0.00
 5001/93349       memset [366]               0.00    0.00       1/1           read_pg_options [601]               0.00
 0.00       2/2           pq_putint [637]               0.00    0.00       1/15002       pq_putnchar [123]
0.00    0.00       2/10013       getenv [241]               0.00    0.00       1/1           SetPgUserName [683]
      0.00    0.00       1/1           pq_init [687]               0.00    0.00       1/1           proc_exit [704]
         0.00    0.00       1/1           pq_close [721]               0.00    0.00       1/5002        getpgusername
[521]              0.00    0.00       1/2           FindExec [722]               0.00    0.00       5/8
getopt[728]               0.00    0.00    5001/5001        DisableNotifyInterrupt [761]               0.00    0.00
9/34          pqsignal [806]               0.00    0.00       2/5           atoi [837]               0.00    0.00
1/2          SetProcessingMode [868]               0.00    0.00       1/10002       TransactionFlushEnabled [749]
       0.00    0.00       1/1           sigsetjmp [934]
 

("self" is the amount of time (CPU sec) spent directly in the named function,
while "children" is the amount of time spent in its callees, pro-rated
by the number of calls.  For example, strlen was actually measured to
use 0.47 sec in the whole run, but since PostgresMain called it less
than 1% of all the times it was called, PostgresMain gets blamed for
less than 0.01 sec of that total.)

The interesting thing here is the relative times indicated for
pg_exec_query, CommitTransactionCommand, and ReadCommand.  We already
found out why ReadCommand is so slow (recv() per character) ... but why
is CommitTransactionCommand so high, when I already discovered that
doing only one commit doesn't help much?  Patience, we'll get there.

-----------------------------------------------               0.03   57.61    5000/5000        pg_exec_query [9]
[10]    31.9    0.03   57.61    5000         pg_exec_query_dest [10]               0.09   39.12    5000/5000
pg_parse_and_plan[13]               0.00   17.37    5000/5000        ProcessQuery [36]               0.02    0.60
5000/5000       SetQuerySnapshot [159]               0.01    0.34    5000/5000        CommandCounterIncrement [196]
         0.02    0.05   10000/4291775     free [22]
 
-----------------------------------------------               0.09   39.12    5000/5000        pg_exec_query_dest [10]
[13]    21.7    0.09   39.12    5000         pg_parse_and_plan [13]               0.02   19.28    5000/5000
parser[26]               0.03   17.63    5000/5000        planner [34]               0.00    1.98    5000/5000
QueryRewrite[90]               0.02    0.05   10000/4292947     malloc [21]               0.02    0.05   10000/4291775
  free [22]               0.00    0.02    5000/497779      lappend [91]               0.01    0.00    5000/5000
IsAbortedTransactionBlockState[529]               0.00    0.00    5000/72779       length [342]
 

In other words, parsing, planning, and executing an INSERT each take
about the same amount of time.

-----------------------------------------------               0.01   39.83    5000/5000        PostgresMain [8]
[11]    22.0    0.01   39.83    5000         CommitTransactionCommand [11]               0.02   39.81    5000/5000
 CommitTransaction [12]
 
-----------------------------------------------               0.02   39.81    5000/5000        CommitTransactionCommand
[11]
[12]    22.0    0.02   39.81    5000         CommitTransaction [12]               0.00   38.12    5000/5000
AtCommit_Memory[16]               0.01    1.10    5000/5000        RecordTransactionCommit [122]               0.00
0.30   5000/5000        AtCommit_Locks [211]               0.02    0.10    5000/5000        AtEOXact_portals [286]
        0.03    0.07    5000/5000        DestroyTempRels [305]               0.03    0.00    5000/5000
_lo_commit[425]               0.01    0.00    5000/5000        AtCommit_Notify [522]               0.01    0.00
5000/5000       CloseSequences [523]               0.01    0.00    5000/5000        RelationPurgeLocalRelation [531]
          0.00    0.00    5000/10000       AtCommit_Cache [750]
 
-----------------------------------------------               0.00   38.12    5000/5000        CommitTransaction [12]
[16]    21.1    0.00   38.12    5000         AtCommit_Memory [16]               0.01   38.11    5000/5000
EndPortalAllocMode[17]               0.00    0.00    5000/10552       MemoryContextSwitchTo [747]
 
-----------------------------------------------               0.01   38.11    5000/5000        AtCommit_Memory [16]
[17]    21.1    0.01   38.11    5000         EndPortalAllocMode [17]               0.54   37.55    5000/5000
AllocSetReset[18]               0.01    0.00    5000/10000       PortalHeapMemoryGetVariableMemory [449]
0.01   0.00    5000/5000        FixedStackPop [528]               0.00    0.00    5000/51434       MemoryContextFree
[367]
-----------------------------------------------               0.54   37.55    5000/5000        EndPortalAllocMode [17]
[18]    21.1    0.54   37.55    5000         AllocSetReset [18]               1.47   32.51 4200327/4251762
AllocSetFree[20]               1.04    2.53 4205327/4205327     AllocSetGetFirst [64]
 

In other words, essentially *all* of the CPU time spent in
CommitTransaction is spent freeing memory.  That's probably why
ganging the transactions doesn't help --- it's the same number of
memory blocks getting allocated and freed.
               0.02   19.28    5000/5000        pg_parse_and_plan [13]
[26]    10.7    0.02   19.28    5000         parser [26]               0.02   13.88    5000/5000        parse_analyze
[41]              1.52    3.81    5000/5000        yyparse [54]               0.01    0.01    5000/5000        init_io
[474]              0.00    0.01    5000/420496      pstrdup [124]               0.01    0.00    5000/5000
parser_init[533]
 

Thomas might know why parse_analyze is taking so much time compared to
the rest of the parsing machinery...


I won't bother the list with the rest of the profile, although I'll be
happy to send it to anyone who wants to see all of it.  Our immediate
performance problems seem pretty clear, however: time to rethink memory
management.
        regards, tom lane


pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: I need a PostgreSQL vacation
Next
From: Vadim Mikheev
Date:
Subject: Re: [HACKERS] Syntax errors in current tree