Tatsuo Ishii <ishii@sraoss.co.jp> writes:
>> ... I wonder if it would help for pgbench to
>> fork off multiple sub-processes and have each sub-process tend just one
>> backend.

> I'm not sure a multiple-sub-process version of pgbench would perform
> better than the current implementation, because of process context
> switching overhead. Maybe threading is better? Mr. Yasuo Ohgaki
> implemented a pthread version of pgbench.

Oh, I wasn't aware someone had done it. Last night I rewrote pgbench
to fork() off one subprocess for each client. At least on the TPC-B
script, it's a bit slower, which lends weight to your worry about extra
context-switch overhead. But there is something funny going on on the
backend side too. I don't have the numbers in front of me right now,
but very roughly this is what I was seeing:

                    pgbench alone    + idle transaction

stock pgbench       680 tps          330 tps
fork() version      640 tps          50 tps

It's hard to explain that last number as being pgbench's fault. I think
that somehow the fork() version is stressing the backends more than
stock pgbench does. It might have to do with the fact that the stock
version delivers commands to multiple backends in lockstep whereas the
fork() version is more random. I've been poking at it to determine why
without a lot of success, but one interesting thing I've found out is
that with the fork() version there is a *whole* lot more traffic on
the SubTransControlLock. I hacked lwlock.c to print out per-process
counts of LWLock acquisitions and number of times blocked, and this is
what I got (with the prototype patch I posted the other day to take
shared lock for SubTransGetParent):

stock pgbench, no idle xact:
PID 31529 lwlock 14: shacq 438 exacq 1 blk 0
PID 31530 lwlock 14: shacq 401 exacq 0 blk 0
PID 31531 lwlock 14: shacq 378 exacq 0 blk 0
PID 31532 lwlock 14: shacq 381 exacq 1 blk 0
PID 31533 lwlock 14: shacq 377 exacq 2 blk 0
PID 31534 lwlock 14: shacq 354 exacq 0 blk 0
PID 31535 lwlock 14: shacq 373 exacq 0 blk 0
PID 31536 lwlock 14: shacq 373 exacq 0 blk 0
PID 31537 lwlock 14: shacq 370 exacq 1 blk 0
PID 31538 lwlock 14: shacq 377 exacq 0 blk 0

fork(), no idle xact:
PID 414 lwlock 14: shacq 82401 exacq 0 blk 0
PID 415 lwlock 14: shacq 82500 exacq 3 blk 0
PID 417 lwlock 14: shacq 77727 exacq 2 blk 0
PID 419 lwlock 14: shacq 83272 exacq 2 blk 0
PID 421 lwlock 14: shacq 78579 exacq 2 blk 0
PID 424 lwlock 14: shacq 82704 exacq 0 blk 0
PID 426 lwlock 14: shacq 82252 exacq 2 blk 0
PID 429 lwlock 14: shacq 86002 exacq 0 blk 0
PID 431 lwlock 14: shacq 86617 exacq 2 blk 0
PID 432 lwlock 14: shacq 78842 exacq 1 blk 0

stock pgbench + idle xact:
PID 17868 lwlock 14: shacq 3342147 exacq 3250 blk 67
PID 17869 lwlock 14: shacq 3318728 exacq 3477 blk 74
PID 17870 lwlock 14: shacq 3324261 exacq 3858 blk 102
PID 17871 lwlock 14: shacq 3388431 exacq 3436 blk 120
PID 17872 lwlock 14: shacq 3409427 exacq 4232 blk 108
PID 17873 lwlock 14: shacq 3416117 exacq 5763 blk 130
PID 17874 lwlock 14: shacq 3396471 exacq 4860 blk 70
PID 17875 lwlock 14: shacq 3369113 exacq 4828 blk 161
PID 17876 lwlock 14: shacq 3428814 exacq 5286 blk 193
PID 17877 lwlock 14: shacq 3476198 exacq 5073 blk 147

fork() + idle xact:
PID 519 lwlock 14: shacq 83979662 exacq 2 blk 7
PID 526 lwlock 14: shacq 94968544 exacq 1 blk 1
PID 529 lwlock 14: shacq 91672324 exacq 0 blk 2
PID 530 lwlock 14: shacq 92307866 exacq 3 blk 16
PID 531 lwlock 14: shacq 93694118 exacq 0 blk 2
PID 532 lwlock 14: shacq 90776114 exacq 1 blk 2
PID 533 lwlock 14: shacq 89445464 exacq 1 blk 2
PID 534 lwlock 14: shacq 94407745 exacq 2 blk 2
PID 535 lwlock 14: shacq 88223627 exacq 2 blk 2
PID 536 lwlock 14: shacq 87223449 exacq 3 blk 6
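
For anyone wanting to collect counts like these, the instrumentation
amounts to roughly the following. This is a minimal standalone sketch,
not the actual lwlock.c hack: the names count_lwlock_acquire,
format_lwlock_stats, print_lwlock_stats and NUM_TRACKED_LOCKS are made
up for illustration, and in a real backend the counting call would have
to sit inside the lock acquisition path, with the report triggered at
process exit.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical bound; the real number of LWLocks is different. */
#define NUM_TRACKED_LOCKS 32

/* Stand-ins for the shared and exclusive lock modes. */
typedef enum { SKETCH_SHARED, SKETCH_EXCLUSIVE } SketchLockMode;

/* Each backend is its own process, so plain statics are per-process. */
static unsigned long shacq_count[NUM_TRACKED_LOCKS];
static unsigned long exacq_count[NUM_TRACKED_LOCKS];
static unsigned long blk_count[NUM_TRACKED_LOCKS];

/* Zero all counters (useful at process start or between test runs). */
void
reset_lwlock_stats(void)
{
    memset(shacq_count, 0, sizeof(shacq_count));
    memset(exacq_count, 0, sizeof(exacq_count));
    memset(blk_count, 0, sizeof(blk_count));
}

/* Record one acquisition; blocked is nonzero if the caller had to sleep. */
void
count_lwlock_acquire(int lockid, SketchLockMode mode, int blocked)
{
    assert(lockid >= 0 && lockid < NUM_TRACKED_LOCKS);
    if (mode == SKETCH_SHARED)
        shacq_count[lockid]++;
    else
        exacq_count[lockid]++;
    if (blocked)
        blk_count[lockid]++;
}

/* Format one report line in the same shape as the output above. */
int
format_lwlock_stats(char *buf, size_t buflen, pid_t pid, int lockid)
{
    assert(lockid >= 0 && lockid < NUM_TRACKED_LOCKS);
    return snprintf(buf, buflen,
                    "PID %d lwlock %d: shacq %lu exacq %lu blk %lu",
                    (int) pid, lockid,
                    shacq_count[lockid], exacq_count[lockid],
                    blk_count[lockid]);
}

/* Print a line to stderr for every lock this process touched. */
void
print_lwlock_stats(void)
{
    char buf[128];
    int  i;

    for (i = 0; i < NUM_TRACKED_LOCKS; i++)
    {
        if (shacq_count[i] == 0 && exacq_count[i] == 0)
            continue;
        format_lwlock_stats(buf, sizeof(buf), getpid(), i);
        fprintf(stderr, "%s\n", buf);
    }
}
```

Counting acquisitions and blocks separately is what makes the numbers
above readable: shacq and exacq show how often the lock is taken, while
blk shows how often that actually cost a wait.
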
I don't know yet why pgbench would be able to affect this, but it's
repeatable. If anyone's interested in trying to duplicate it, I'll
post my version of pgbench.
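
In the meantime, for anyone who just wants the shape of it, here is a
minimal sketch of the fork()-per-client structure. It is not the actual
patch: run_clients, client_fn and dummy_client are hypothetical names,
and the placeholder client does no database work at all. A real client
would open its own connection in the child and run the transaction
script against a single backend.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical per-client work function: returns 0 on success. */
typedef int (*client_fn) (int client_id);

/*
 * Fork one child per client, run work(client_id) in each child, and wait
 * for all of them.  Returns the number of children that exited with
 * status 0, or -1 if memory allocation or any fork() failed.
 */
int
run_clients(int nclients, client_fn work)
{
    pid_t *pids;
    int    started = 0;
    int    succeeded = 0;
    int    fork_failed = 0;
    int    i;

    assert(nclients > 0 && work != NULL);

    pids = malloc((size_t) nclients * sizeof(pid_t));
    if (pids == NULL)
        return -1;

    for (i = 0; i < nclients; i++)
    {
        pid_t pid = fork();

        if (pid < 0)
        {
            fork_failed = 1;
            break;
        }
        if (pid == 0)
        {
            /* Child: tend exactly one client, then exit immediately. */
            _exit(work(i) == 0 ? 0 : 1);
        }
        pids[started++] = pid;
    }

    /* Parent: reap every child that was actually started. */
    for (i = 0; i < started; i++)
    {
        int status;

        if (waitpid(pids[i], &status, 0) == pids[i] &&
            WIFEXITED(status) && WEXITSTATUS(status) == 0)
            succeeded++;
    }

    free(pids);
    return fork_failed ? -1 : succeeded;
}

/* Placeholder client; a real one would talk to its own backend here. */
int
dummy_client(int client_id)
{
    return (client_id >= 0) ? 0 : 1;
}
```

Calling run_clients(10, dummy_client) would start ten children, matching
the ten backends in the output above. Because each child runs
independently, nothing keeps the clients in lockstep the way a single
process looping over all its connections does.
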
regards, tom lane