Re: [PATCH] pgbench --throttle (submission 7 - with lag measurement) - Mailing list pgsql-hackers

From Fabien COELHO
Subject Re: [PATCH] pgbench --throttle (submission 7 - with lag measurement)
Date
Msg-id alpine.DEB.2.02.1306221634211.23902@localhost6.localdomain6
Whole thread Raw
In response to Re: [PATCH] pgbench --throttle (submission 7 - with lag measurement)  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: [PATCH] pgbench --throttle (submission 7 - with lag measurement)
Re: [PATCH] pgbench --throttle (submission 7 - with lag measurement)
List pgsql-hackers
Dear Robert and Greg,

> I think so.  If it doesn't get fixed now, it's not likely to get fixed
> later.  And the fact that nobody understands why it's happening is
> kinda worrisome...

Possibly, but I thing that it is not my fault:-)

So, I spent some time at performance debugging...

First, I finally succeeded in reproducing Greg Smith spikes on my ubuntu 
laptop. It needs short transactions and many clients per thread so as to 
be a spike. With "pgbench" standard transactions and not too many clients 
per thread it is more of a little bump, or even it is not there at all.

After some poking around, and pursuing various red herrings, I resorted to 
measure the delay for calling "PQfinish()", which is really the only 
special thing going around at the end of pgbench run...

BINGO!

In his tests Greg is using one thread. Once the overall timer is exceeded, 
clients start closing their connections by calling PQfinish once their 
transactions are done. This call takes between a few us and a few ... ms. 
So if some client closing time hit the bad figures, the transactions of 
other clients are artificially delayed by this time, and it seems they 
have a high latency, but it is really because the thread is in another 
client's PQfinish and was not available to process the data. If you have 
one thread per client, no problem, especially as the final PQfinish() time 
is not measured at all by pgbench:-) Also, the more clients, the higher 
the spike because more are to be stopped and may hit the bad figures.

Here is a trace, with the simple SELECT transaction.
  sh> ./pgbench --rate=14000 -T 10 -r -l -c 30 -S bench  ...
  sh> less pgbench_log.*
 # short transactions, about 0.250 ms ... 20 4849 241 0 1371916583.455400 21 4844 256 0 1371916583.455470 22 4832 348 0
1371916583.45556923 4829 218 0 1371916583.455627
 
   ** TIMER EXCEEDED **
 25 4722 390 0 1371916583.455820 25 done in 1560 [1,2]              # BING, 1560 us for PQfinish() 26 4557 1969 0
1371916583.45740726 done in 21 [1,2] 27 4372 1969 0 1371916583.457447 27 done in 19 [1,2] 28 4009 1910 0
1371916583.45748628 done in 1445 [1,2]              # BOUM 2 interrupted in 1300 [0,0]        # BANG 3 interrupted in
15[0,0] 4 interrupted in 40 [0,0] 5 interrupted in 203 [0,0]         # boum? 6 interrupted in 1352 [0,0]        # ZIP 7
interruptedin 18 [0,0] ... 23 interrupted in 12 [0,0]
 
 ## the cumulated PQfinish() time above is about 6 ms which ## appears as an apparent latency for the next clients:
  0 4880 6521 0 1371916583.462157  0 done in 9 [1,2]  1 4877 6397 0 1371916583.462194  1 done in 9 [1,2] 24 4807 6796 0
1371916583.46221724 done in 9 [1,2] ...
 

Note that the bad measures also appear when there is no throttling:
  sh> ./pgbench -T 10 -r -l -c 30 -S bench  sh> grep 'done.*[0-9][0-9][0-9]' pgbench_log.*   0 done in 1974 [1,2]   1
donein 312 [1,2]   3 done in 2159 [1,2]   7 done in 409 [1,2]  11 done in 393 [1,2]  12 done in 2212 [1,2]  13 done in
1458[1,2]  ## other clients execute PQfinish in less than 100 us
 

This "done" is issued by my instrumented version of clientDone().

The issue does also appear if I instrument pgbench from master, without 
anything from the throttling patch at all:
  sh> git diff master  diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c  index 1303217..7c5ea81
100644 --- a/contrib/pgbench/pgbench.c  +++ b/contrib/pgbench/pgbench.c  @@ -869,7 +869,15 @@ clientDone(CState *st,
boolok)
 
          if (st->con != NULL)          {  +               instr_time now;  +               int64 s0, s1;  +
  INSTR_TIME_SET_CURRENT(now);  +               s0 = INSTR_TIME_GET_MICROSEC(now);                  PQfinish(st->con);
+              INSTR_TIME_SET_CURRENT(now);  +               s1 = INSTR_TIME_GET_MICROSEC(now);  +
fprintf(stderr,"%d done in %ld [%d,%d]\n",  +                               st->id, s1-s0, st->listen, st->state);
           st->con = NULL;          }          return false;                           /* always false */
 
  sh> ./pgbench -T 10 -r -l -c 30 -S bench 2> x.err
  sh> grep 'done.*[0-9][0-9][0-9]' x.err  14 done in 1985 [1,2]  16 done in 276 [1,2]  17 done in 1418 [1,2]

So my argumented conclusion is that the issue is somewhere within 
PQfinish(), possibly in interaction with pgbench doings, but is *NOT* 
related in any way to the throttling patch, as it is preexisting it. Gregs 
stumbled upon it because he looked at latencies.

I'll submit a slightly improved v12 shortly.

-- 
Fabien.



pgsql-hackers by date:

Previous
From: Simon Riggs
Date:
Subject: Re: A better way than tweaking NTUP_PER_BUCKET
Next
From: Alvaro Herrera
Date:
Subject: Re: Support for REINDEX CONCURRENTLY