Re: [PATCH] pgbench --throttle (submission 7 - with lag measurement) - Mailing list pgsql-hackers
From | Fabien COELHO |
---|---|
Subject | Re: [PATCH] pgbench --throttle (submission 7 - with lag measurement) |
Date | |
Msg-id | alpine.DEB.2.02.1306221634211.23902@localhost6.localdomain6 Whole thread Raw |
In response to | Re: [PATCH] pgbench --throttle (submission 7 - with lag measurement) (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: [PATCH] pgbench --throttle (submission 7 - with lag
measurement)
Re: [PATCH] pgbench --throttle (submission 7 - with lag measurement) |
List | pgsql-hackers |
Dear Robert and Greg, > I think so. If it doesn't get fixed now, it's not likely to get fixed > later. And the fact that nobody understands why it's happening is > kinda worrisome... Possibly, but I thing that it is not my fault:-) So, I spent some time at performance debugging... First, I finally succeeded in reproducing Greg Smith spikes on my ubuntu laptop. It needs short transactions and many clients per thread so as to be a spike. With "pgbench" standard transactions and not too many clients per thread it is more of a little bump, or even it is not there at all. After some poking around, and pursuing various red herrings, I resorted to measure the delay for calling "PQfinish()", which is really the only special thing going around at the end of pgbench run... BINGO! In his tests Greg is using one thread. Once the overall timer is exceeded, clients start closing their connections by calling PQfinish once their transactions are done. This call takes between a few us and a few ... ms. So if some client closing time hit the bad figures, the transactions of other clients are artificially delayed by this time, and it seems they have a high latency, but it is really because the thread is in another client's PQfinish and was not available to process the data. If you have one thread per client, no problem, especially as the final PQfinish() time is not measured at all by pgbench:-) Also, the more clients, the higher the spike because more are to be stopped and may hit the bad figures. Here is a trace, with the simple SELECT transaction. sh> ./pgbench --rate=14000 -T 10 -r -l -c 30 -S bench ... sh> less pgbench_log.* # short transactions, about 0.250 ms ... 20 4849 241 0 1371916583.455400 21 4844 256 0 1371916583.455470 22 4832 348 0 1371916583.45556923 4829 218 0 1371916583.455627 ** TIMER EXCEEDED ** 25 4722 390 0 1371916583.455820 25 done in 1560 [1,2] # BING, 1560 us for PQfinish() 26 4557 1969 0 1371916583.45740726 done in 21 [1,2] 27 4372 1969 0 1371916583.457447 27 done in 19 [1,2] 28 4009 1910 0 1371916583.45748628 done in 1445 [1,2] # BOUM 2 interrupted in 1300 [0,0] # BANG 3 interrupted in 15[0,0] 4 interrupted in 40 [0,0] 5 interrupted in 203 [0,0] # boum? 6 interrupted in 1352 [0,0] # ZIP 7 interruptedin 18 [0,0] ... 23 interrupted in 12 [0,0] ## the cumulated PQfinish() time above is about 6 ms which ## appears as an apparent latency for the next clients: 0 4880 6521 0 1371916583.462157 0 done in 9 [1,2] 1 4877 6397 0 1371916583.462194 1 done in 9 [1,2] 24 4807 6796 0 1371916583.46221724 done in 9 [1,2] ... Note that the bad measures also appear when there is no throttling: sh> ./pgbench -T 10 -r -l -c 30 -S bench sh> grep 'done.*[0-9][0-9][0-9]' pgbench_log.* 0 done in 1974 [1,2] 1 donein 312 [1,2] 3 done in 2159 [1,2] 7 done in 409 [1,2] 11 done in 393 [1,2] 12 done in 2212 [1,2] 13 done in 1458[1,2] ## other clients execute PQfinish in less than 100 us This "done" is issued by my instrumented version of clientDone(). The issue does also appear if I instrument pgbench from master, without anything from the throttling patch at all: sh> git diff master diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c index 1303217..7c5ea81 100644 --- a/contrib/pgbench/pgbench.c +++ b/contrib/pgbench/pgbench.c @@ -869,7 +869,15 @@ clientDone(CState *st, boolok) if (st->con != NULL) { + instr_time now; + int64 s0, s1; + INSTR_TIME_SET_CURRENT(now); + s0 = INSTR_TIME_GET_MICROSEC(now); PQfinish(st->con); + INSTR_TIME_SET_CURRENT(now); + s1 = INSTR_TIME_GET_MICROSEC(now); + fprintf(stderr,"%d done in %ld [%d,%d]\n", + st->id, s1-s0, st->listen, st->state); st->con = NULL; } return false; /* always false */ sh> ./pgbench -T 10 -r -l -c 30 -S bench 2> x.err sh> grep 'done.*[0-9][0-9][0-9]' x.err 14 done in 1985 [1,2] 16 done in 276 [1,2] 17 done in 1418 [1,2] So my argumented conclusion is that the issue is somewhere within PQfinish(), possibly in interaction with pgbench doings, but is *NOT* related in any way to the throttling patch, as it is preexisting it. Gregs stumbled upon it because he looked at latencies. I'll submit a slightly improved v12 shortly. -- Fabien.
pgsql-hackers by date: