Based on feedback after the sessions, I ran a few more tests that might
be useful to share.
One suggestion was to have each client do more work and to reduce the
number of clients. The igen benchmark is flexible, so I removed all
think time from it and repeated the test until scalability stopped.
(This was done with the CVS source downloaded yesterday.)
Note that with no think time, each client can be about 75% CPU busy
from what I observed. Running it, I found that scaling now saturates at
about 60 clients (compared to 500 in the original test). The peak
throughput was at about 50 users (using synchronous_commit=off).
Here is the interesting DTrace lock output (lock id, lock mode, and
time in ns spent waiting for the lock in a 10-sec snapshot). I am only
showing the top few, in ascending order.
With fewer than 20 users, WALInsert (lock 7) is at the top:
52 Exclusive 721950129
4 Exclusive 768537190
46 Exclusive 842063837
7 Exclusive 1031851713
With 35 users:
52 Exclusive 2599074739
4 Exclusive 2647927574
46 Exclusive 2789581991
7 Exclusive 3220008691
At about 50 users (the peak throughput I saw earlier):
46 Exclusive 3669210393
4 Exclusive 6024966938
52 Exclusive 6529168107
7 Exclusive 9408290367
With about 60 users, where the throughput actually starts to drop:
41 Exclusive 4570660567
52 Exclusive 10706741643
46 Exclusive 13152005125
4 Exclusive 13550187806
7 Exclusive 22146882562
With about 100 users (throughput below the peak value):
42 Exclusive 4238582775
46 Exclusive 6773515243
7 Exclusive 7467346038
52 Exclusive 9846216440
4 Shared 22528501166
4 Exclusive 223043774037
So it seems that when both the Shared and Exclusive waits for
ProcArrayLock (lock 4) are the top two, the system is basically
saturated in terms of the throughput it can handle.
Optimizing the wait queues will help improve the Shared waits, which
might help the Exclusive waits a bit, but eventually the Exclusive
ProcArrayLock waits will limit scaling with as few as 60-70 users.
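
To put the wait numbers above on a more intuitive scale, here is a rough
sketch that divides each summed wait time by the 10-sec snapshot length.
It assumes the DTrace totals are summed across all backends, so the
result reads as the average number of backends blocked on that lock at
any instant. The lock names are the ones used in this post (4 =
ProcArrayLock, 7 = WALInsert); the function and variable names are made
up for illustration.

```python
# Snapshot length stated in the post: 10 seconds, expressed in ns.
SNAPSHOT_NS = 10 * 1_000_000_000

# Wait times (ns) copied from the snapshots above.
# Assumption: each value is the total wait summed over all backends.
WAITS = [
    ("<20", "ProcArrayLock", "Exclusive", 768537190),
    ("<20", "WALInsert", "Exclusive", 1031851713),
    ("35", "ProcArrayLock", "Exclusive", 2647927574),
    ("35", "WALInsert", "Exclusive", 3220008691),
    ("50", "ProcArrayLock", "Exclusive", 6024966938),
    ("50", "WALInsert", "Exclusive", 9408290367),
    ("60", "ProcArrayLock", "Exclusive", 13550187806),
    ("60", "WALInsert", "Exclusive", 22146882562),
    ("100", "WALInsert", "Exclusive", 7467346038),
    ("100", "ProcArrayLock", "Shared", 22528501166),
    ("100", "ProcArrayLock", "Exclusive", 223043774037),
]


def avg_waiters(wait_ns, snapshot_ns=SNAPSHOT_NS):
    """Average number of backends blocked on a lock during the snapshot."""
    return wait_ns / snapshot_ns


def report(waits=WAITS):
    """Return (users, lock, mode, average waiters) for every row."""
    return [
        (users, lock, mode, round(avg_waiters(ns), 2))
        for users, lock, mode, ns in waits
    ]


if __name__ == "__main__":
    for users, lock, mode, waiters in report():
        print(f"{users:>4} users  {lock:<14} {mode:<9} {waiters:6.2f}")
```

Read this way, the 100-user snapshot has roughly 22 backends queued on
the Exclusive ProcArrayLock at any time, against about 2 on WALInsert at
60 users.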
Lock hold times are below (though taken from a different run).
With 30 users:
Lock Id Mode Combined Time (ns)
1616992 Exclusive 1199791629
4 Exclusive 1399371867
34 Exclusive 1426153620
1616978 Exclusive 1528327035
1616990 Exclusive 1546374298
1616988 Exclusive 1553461559
5 Exclusive 2477558484
With 50+ users
Lock Id Mode Combined Time (ns)
4 Exclusive 1438509198
1616992 Exclusive 1450973466
1616978 Exclusive 1505626978
1616990 Exclusive 1850432217
1616988 Exclusive 2033226225
34 Exclusive 2098542547
5 Exclusive 3280151374
With 100 users
Lock Id Mode Combined Time (ns)
1616992 Exclusive 1206516505
1616988 Exclusive 1486704087
1616990 Exclusive 1521900997
34 Exclusive 1532815803
1616978 Exclusive 1541986895
5 Exclusive 2179043424
5 2395098279
(Why was 5 printed with a blank mode?)
Rerunning it with a slight variation of the script:
Lock Id Mode Combined Time (ns)
1616996 0 1167708953
36 0 1291958451
5 4299305160 1344486968
4 0 1347557908
1616978 0 1377931882
34 0 1724752938
5 0 2079012548
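
One quick way to look at the odd Mode value printed for lock 5 above is
to check it in hex and against the 32-bit range. This is only a sanity
check on the number itself, not an explanation of where it comes from;
the helper name is made up for illustration.

```python
# Value printed in the Mode column for lock 5 in the rerun above.
MODE_VALUE = 4299305160


def describe(value):
    """Show the value in hex and whether it fits in 32 bits."""
    return {
        "hex": hex(value),
        "fits_in_32_bits": value < 2**32,
        "low_32_bits": value & 0xFFFFFFFF,
    }


if __name__ == "__main__":
    for key, val in describe(MODE_VALUE).items():
        print(f"{key}: {val}")
```

The value is wider than 32 bits, and even its low 32 bits are neither 0
nor 1, so it may not be a lock mode at all (for example, the script could
be printing some other argument for that row). That is a guess, not
something the data here confirms.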
The trend in lock 4's hold time looks similar to the previous runs,
though the new kid is lock 5 with a mode other than 0 or 1. I am not
sure if that is causing problems. What mode is "4299305160" for lock 5
(SInvalLock)? Anyway, at this point the wait time for lock 4 increases
to the point where the database is not scaling anymore.
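
The contrast between flat hold times and growing wait times for lock 4
can be made explicit by expressing each series relative to its first
entry. The numbers are copied from the tables above. Note the hold and
wait figures come from different runs at slightly different user counts
(and the last hold value is from the rerun, taken here to be the
100-user case), so this is only a rough comparison. The names are made
up for illustration.

```python
# Lock 4 Exclusive hold times (ns), from the hold-time tables.
HOLD_NS = {
    "30 users": 1399371867,
    "50+ users": 1438509198,
    "100 users (rerun)": 1347557908,
}

# Lock 4 Exclusive wait times (ns), from the 10-sec wait snapshots.
WAIT_NS = {
    "35 users": 2647927574,
    "50 users": 6024966938,
    "100 users": 223043774037,
}


def relative_to_first(series):
    """Scale every value by the first entry of the series."""
    first = next(iter(series.values()))
    return {label: round(ns / first, 2) for label, ns in series.items()}


if __name__ == "__main__":
    print("hold:", relative_to_first(HOLD_NS))
    print("wait:", relative_to_first(WAIT_NS))
```

The hold time stays within a few percent of its 30-user value, while the
wait time at 100 users is about 84 times the 35-user value.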
Any thoughts?
-Jignesh