Thread: User concurrency thresholding: where do I look?
Folks,

I've run into this a number of times with various PostgreSQL users, so we tested it at Sun. What seems to be happening is that at some specific number of connections, average throughput drops 30% and response time quadruples or worse. The threshold seems to vary per machine; I've seen it variously at 95, 1050, 1700, or 2800 connections. Tinkering with postgresql.conf parameters doesn't seem to affect this threshold.

As an example of this behavior:

Users  Txn/User  Resp. Time
  50    105.38      0.01
 100    113.05      0.01
 150    114.05      0.01
 200    113.51      0.01
 250    113.38      0.01
 300    112.14      0.01
 350    112.26      0.01
 400    111.43      0.01
 450    110.72      0.01
 500    110.44      0.01
 550    109.36      0.01
 600    107.01      0.02
 650    105.71      0.02
 700    106.95      0.02
 750    107.69      0.02
 800    106.78      0.02
 850    108.59      0.02
 900    106.03      0.02
 950    106.13      0.02
1000     64.58      0.15
1050     52.32      0.23
1100     49.79      0.25

Tinkering with shared_buffers has had no effect on this thresholding (the above was with 3 GB to 6 GB of shared_buffers). Any ideas on where we should look for the source of the bottleneck?

-- Josh Berkus PostgreSQL @ Sun San Francisco
Josh Berkus wrote:
> Folks,
>
>  650    105.71      0.02
>  700    106.95      0.02
>  750    107.69      0.02
>  800    106.78      0.02
>  850    108.59      0.02
>  900    106.03      0.02
>  950    106.13      0.02
> 1000     64.58      0.15
> 1050     52.32      0.23
> 1100     49.79      0.25
>
> Tinkering with shared_buffers has had no effect on this thresholding (the above
> was with 3 GB to 6 GB of shared_buffers). Any ideas on where we should look
> for the source of the bottleneck?

I have seen this as well. I always chalked it up to PG having to manage so many connections, but there is some interesting evidence to review:

The amount of memory "each" connection takes up. Consider 4-11 MB per connection, depending on various things like the number of prepared queries.

The number of CPUs. Obviously 500 connections over 4 CPUs isn't the same as 500 connections over 8 CPUs.

That number of connections generally means a higher velocity, and a higher velocity means more checkpoint segments. Wrong settings for your checkpoint segments, bgwriter and checkpoints will cause you to start falling down.

I would also note that our experience is that PG falls down a little higher, more toward 2500 connections last time I checked, but this was likely on different hardware.

Joshua D. Drake

-- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL Replication: http://www.commandprompt.com/products/
Josh Berkus wrote:
> Folks,
>
> I've run into this a number of times with various PostgreSQL users, so we
> tested it at Sun. What seems to be happening is that at some specific number
> of connections, average throughput drops 30% and response time quadruples or
> worse. The threshold seems to vary per machine; I've seen it variously at 95,
> 1050, 1700, or 2800 connections. Tinkering with postgresql.conf parameters
> doesn't seem to affect this threshold.
>
> As an example of this behavior:
>
> Users  Txn/User  Resp. Time
>   50    105.38      0.01
>  100    113.05      0.01
>  150    114.05      0.01
>  200    113.51      0.01
>  250    113.38      0.01
>  300    112.14      0.01
>  350    112.26      0.01
>  400    111.43      0.01
>  450    110.72      0.01
>  500    110.44      0.01
>  550    109.36      0.01
>  600    107.01      0.02
>  650    105.71      0.02
>  700    106.95      0.02
>  750    107.69      0.02
>  800    106.78      0.02
>  850    108.59      0.02
>  900    106.03      0.02
>  950    106.13      0.02
> 1000     64.58      0.15
> 1050     52.32      0.23
> 1100     49.79      0.25
>
> Tinkering with shared_buffers has had no effect on this thresholding (the above
> was with 3 GB to 6 GB of shared_buffers). Any ideas on where we should look
> for the source of the bottleneck?

Have you messed with max_connections and/or max_locks_per_transaction while testing this? The lock table is sized to max_locks_per_xact times max_connections, and shared memory hash tables get slower when they are full. Of course, the saturation point would depend on the avg number of locks acquired per user, which would explain why you are seeing a lower number for some users and higher for others (simpler/more complex queries).

This is just a guess though. No profiling or measuring at all, really.

-- Alvaro Herrera http://www.amazon.com/gp/registry/5ZYLFMCVHXC "How amazing is that? I call it a night and come back to find that a bug has been identified and patched while I sleep." (Robert Davidson) http://archives.postgresql.org/pgsql-sql/2006-03/msg00378.php
Alvaro, > Have you messed with max_connections and/or max_locks_per_transaction > while testing this? The lock table is sized to max_locks_per_xact times > max_connections, and shared memory hash tables get slower when they are > full. Of course, the saturation point would depend on the avg number of > locks acquired per user, which would explain why you are seeing a lower > number for some users and higher for others (simpler/more complex > queries). That's an interesting thought. Let me check lock counts and see if this is possibly the case. -- Josh Berkus PostgreSQL @ Sun San Francisco
Josh Berkus <josh@agliodbs.com> writes: > Alvaro, >> Have you messed with max_connections and/or max_locks_per_transaction >> while testing this? The lock table is sized to max_locks_per_xact times >> max_connections, and shared memory hash tables get slower when they are >> full. Of course, the saturation point would depend on the avg number of >> locks acquired per user, which would explain why you are seeing a lower >> number for some users and higher for others (simpler/more complex >> queries). > That's an interesting thought. Let me check lock counts and see if this is > possibly the case. AFAIK you'd get hard failures, not slowdowns, if you ran out of lock space entirely; and the fact that you can continue the curve upwards says that you're not on the edge of running out. However I agree that it's worth experimenting with those two parameters to see if the curve moves around at all. Another resource that might be interesting is the number of open files. Also, have you tried watching vmstat or local equivalent to confirm that the machine's not starting to swap? regards, tom lane
Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
> > Alvaro,
> >> Have you messed with max_connections and/or max_locks_per_transaction
> >> while testing this? The lock table is sized to max_locks_per_xact times
> >> max_connections, and shared memory hash tables get slower when they are
> >> full. Of course, the saturation point would depend on the avg number of
> >> locks acquired per user, which would explain why you are seeing a lower
> >> number for some users and higher for others (simpler/more complex
> >> queries).
>
> > That's an interesting thought. Let me check lock counts and see if this is
> > possibly the case.
>
> AFAIK you'd get hard failures, not slowdowns, if you ran out of lock
> space entirely;

Well, if there still is shared memory available, the lock hash can continue to grow, but it would slow down according to this comment in ShmemInitHash:

 * max_size is the estimated maximum number of hashtable entries. This is
 * not a hard limit, but the access efficiency will degrade if it is
 * exceeded substantially (since it's used to compute directory size and
 * the hash table buckets will get overfull).

For the lock hash tables this max_size is (MaxBackends + max_prepared_xacts) * max_locks_per_xact.

So maybe this does not make much sense in normal operation, thus not applicable to what Josh Berkus is reporting.

However I was talking to Josh Drake yesterday and he told me that pg_dump was spending some significant amount of time in LOCK TABLE when there are lots of tables (say 300k).

-- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
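For reference, the sizing Alvaro describes corresponds roughly to this fragment of src/backend/storage/lmgr/lock.c (paraphrased from 8.2-era sources; exact code varies by version):

    /* Estimated number of entries the shared lock table must hold. */
    #define NLOCKENTS() \
        mul_size(max_locks_per_xact, add_size(MaxBackends, max_prepared_xacts))

    /*
     * In InitLocks(), roughly: the hash table may keep growing past
     * max_table_size while shared memory lasts, but lookups degrade.
     */
    max_table_size  = NLOCKENTS();
    init_table_size = max_table_size / 2;
    LockMethodLockHash = ShmemInitHash("LOCK hash",
                                       init_table_size,
                                       max_table_size,
                                       &info,
                                       hash_flags);

With the shipped default of max_locks_per_transaction = 64, that works out to 64 entries per allowed backend, so a short-transaction workload would have to hold an unusually large number of locks per connection to overflow it.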
Alvaro Herrera <alvherre@commandprompt.com> writes: > Tom Lane wrote: >> AFAIK you'd get hard failures, not slowdowns, if you ran out of lock >> space entirely; > Well, if there still is shared memory available, the lock hash can > continue to grow, but it would slow down according to this comment in > ShmemInitHash: Right, but there's not an enormous amount of headroom in shared memory beyond the intended size of the hash tables. I'd think that you'd start seeing hard failures not very far beyond the point at which performance impacts became visible. Of course this is all speculation; I quite agree with varying the table-size parameters to see if it makes a difference. Josh, what sort of workload is being tested here --- read-mostly, write-mostly, a mixture? > However I was talking to Josh Drake yesterday and he told me that > pg_dump was spending some significant amount of time in LOCK TABLE when > there are lots of tables (say 300k). I wouldn't be too surprised if there's some O(N^2) effects when a single transaction holds that many locks, because of the linked-list proclock data structures. This would not be relevant to Josh's case though. regards, tom lane
Alvaro Herrera wrote: > Tom Lane wrote: >> Josh Berkus <josh@agliodbs.com> writes: > So maybe this does not make much sense in normal operation, thus not > applicable to what Josh Berkus is reporting. > > However I was talking to Josh Drake yesterday and he told me that > pg_dump was spending some significant amount of time in LOCK TABLE when > there are lots of tables (say 300k). Less, 128k Joshua D. Drake >
On Thu, 19 Jul 2007, Josh Berkus wrote: > What seems to be happening is that at some specific number of > connections average throughput drops 30% and response time quadruples or > worse. Could you characterize what each connection is doing and how you're generating the load? I don't know how productive speculating about the cause here will be until there's a test script available so other people can see where the tipping point is on their system. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Tom, all: > Also, have you tried watching vmstat or local equivalent to confirm that > the machine's not starting to swap? We're not swapping. > Josh, what sort of workload is being tested here --- read-mostly, > write-mostly, a mixture? It's a TPCC-like workload, so heavy single-row updates, and the updates/inserts are what's being measured. For that matter, when I've seen this before it was with heavy-write workloads and we were measuring the number of updates/inserts and not the number of reads. -- Josh Berkus PostgreSQL @ Sun San Francisco
Josh Berkus <josh@agliodbs.com> writes: >> Josh, what sort of workload is being tested here --- read-mostly, >> write-mostly, a mixture? > It's a TPCC-like workload, so heavy single-row updates, and the > updates/inserts are what's being measured. For that matter, when I've seen > this before it was with heavy-write workloads and we were measuring the > number of updates/inserts and not the number of reads. Well, if the load is a lot of short writing transactions then you'd expect the throughput to depend on how fast stuff can be pushed down to WAL. What have you got wal_buffers set to? Are you using a commit delay? What's the I/O system anyway (any BB write cache on the WAL disk?) and what wal sync method are you using? While I'm asking questions, exactly what were the data columns you presented? Txn/User doesn't make much sense to me, and I'm not sure what "response time" you were measuring either. regards, tom lane
Tom,

> Well, if the load is a lot of short writing transactions then you'd
> expect the throughput to depend on how fast stuff can be pushed down to
> WAL. What have you got wal_buffers set to? Are you using a commit
> delay? What's the I/O system anyway (any BB write cache on the WAL
> disk?) and what wal sync method are you using?

You know, I think Jignesh needs to be on this list so I can stop relaying questions on a workload I didn't design. Let me get him.

-- Josh Berkus PostgreSQL @ Sun San Francisco
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > Josh Berkus <josh@agliodbs.com> writes: > >> That's an interesting thought. Let me check lock counts and see if this is >> possibly the case. > > AFAIK you'd get hard failures, not slowdowns, if you ran out of lock > space entirely I assume you've checked the server logs and are sure that you aren't in fact getting errors. I could, for example, envision a situation where a fraction of the transactions are getting some error and those transactions are therefore not being counted against the txn/s result. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
On Thu, 19 Jul 2007, Josh Berkus wrote:
> It's a TPCC-like workload, so heavy single-row updates, and the
> updates/inserts are what's being measured.

There's so much going on with a TPC-C kind of workload. Has anyone ever looked into quantifying scaling for more fundamental operations? There are so many places a complicated workload could get caught up that starting there is hard. I've found it's helpful to see the breaking points for simpler operations, then compare how things change as each new transaction element is introduced.

As an example, take a look at the MySQL SysBench tool: http://sysbench.sourceforge.net/docs/

Specifically their "oltp" tests. Note how you can get a handle on how simple selects scale, then simple inserts, then updates, and so on. The only thing I can think of that they missed is testing a trivial operation that doesn't even touch the buffer cache ('SELECT 1'?), which would let you quantify just general connection-scaling issues.

It seems to me that you could narrow the list of possible causes here much more quickly if you had a good handle on the upper concurrency of lower-level operations.

[Note: it's possible to run SysBench against a PG database, but the code is very immature. Last time I tried there were plenty of crashes and there seemed to be some transaction wrapping issues that caused deadlocks with some tests.]

-- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
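As a rough illustration of that "trivial operation" idea, a bare-bones libpq harness along these lines can show how pure connection overhead grows with N. This is only a sketch: the conninfo string and connection count are placeholders, and it runs the queries sequentially from one process, whereas a real test would drive the connections concurrently from many client processes.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include "libpq-fe.h"

    int
    main(int argc, char **argv)
    {
        int      nconns = (argc > 1) ? atoi(argv[1]) : 100;  /* number of backends to open */
        PGconn **conns = malloc(nconns * sizeof(PGconn *));
        struct timeval start, stop;
        int      i;

        /* Open all connections first, so the timing below covers only the queries. */
        for (i = 0; i < nconns; i++)
        {
            conns[i] = PQconnectdb("dbname=test");           /* hypothetical conninfo */
            if (PQstatus(conns[i]) != CONNECTION_OK)
            {
                fprintf(stderr, "connection %d failed: %s", i, PQerrorMessage(conns[i]));
                return 1;
            }
        }

        /* One trivial query per connection: no buffer-cache or WAL work involved. */
        gettimeofday(&start, NULL);
        for (i = 0; i < nconns; i++)
        {
            PGresult *res = PQexec(conns[i], "SELECT 1");
            PQclear(res);
        }
        gettimeofday(&stop, NULL);

        printf("%d connections, one SELECT 1 each: %.3f ms\n", nconns,
               (stop.tv_sec - start.tv_sec) * 1000.0 +
               (stop.tv_usec - start.tv_usec) / 1000.0);

        for (i = 0; i < nconns; i++)
            PQfinish(conns[i]);
        free(conns);
        return 0;
    }

Something like "cc scale.c -o scale -lpq" (with the libpq headers and library on the search paths, e.g. via pg_config --includedir and --libdir) builds it; running it at increasing N brackets where plain connection handling starts to hurt, independently of any TPC-C machinery.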
Greg, > There's so much going on with a TPC-C kind of workload. Has anyone ever > looked into quantifying scaling for more fundamental operations? There > are so many places a complicated workload could get caught up that > starting there is hard. I've found it's helpful to see the breaking > points for simpler operations, then compare how things change as each > new transaction element is introduced. ... eagerly awaiting Michael Doilson's PgUnitTest .... --Josh
Awww Josh, I was just enjoying the chat on the picket fence! :-)

Anyway, the workload is mixed (reads, writes) with simple to medium queries. The workload is known to scale well. But in order to provide substantial input I am still trying to eliminate things that can bottleneck. Currently I have eliminated CPU (plenty free), RAM (there is 48 GB of RAM in this server for a 32-bit postgresql instance) and storage I/O (used the free RAM for a /tmp database to eliminate I/O), and am still trying to eliminate any network bottlenecks before saying for sure we have a problem in PostgreSQL. So yes, till that final thing is confirmed (the network, which can very well be the case), the problem could be somewhere else. However, the thing that worries me is the big drop, instead of throughput remaining constant out there.

Anyway, more on this within a day or two once I add more network NICs between the systems to eliminate network problems (even though stats don't show them as problems right now) and also reduce malloc lock penalties, if any.

As for the other questions:

max_locks_per_transaction is set to the default (10, I believe); increasing it still seems to degrade the overall throughput number.

max_connections is set to 1500 for now, till I get decent scaling up to 1400-1500 users.

There are no hard failures reported anywhere. The log_min_duration_statement output does show that queries are now slowing down and taking longer.

The OS is not swapping, and I/O is also eliminated by putting the whole database on /tmp.

So while I finish adding more network connections between the two systems (need to get cards), do enjoy the following URL :-)

http://www.spec.org/jAppServer2004/results/res2007q3/jAppServer2004-20070703-00073.html

Of course all of postgresql.conf still remains from the old test, so no flames on that one again :-)

Regards,
Jignesh

Josh Berkus wrote:
> Tom,
>
>> Well, if the load is a lot of short writing transactions then you'd
>> expect the throughput to depend on how fast stuff can be pushed down to
>> WAL. What have you got wal_buffers set to? Are you using a commit
>> delay? What's the I/O system anyway (any BB write cache on the WAL
>> disk?) and what wal sync method are you using?
>
> You know, I think Jignesh needs to be on this list so I can stop relaying
> questions on a workload I didn't design. Let me get him.
I forgot to add one more piece of information: I also tried the same test with 64-bit PostgreSQL and 6 GB of shared_buffers, and the results are the same; it drops around the same point, which to me sounds like a bottleneck.

More later.

-Jignesh

Jignesh K. Shah wrote:
> Awww Josh, I was just enjoying the chat on the picket fence! :-)
>
> Anyway, the workload is mixed (reads, writes) with simple to medium
> queries. The workload is known to scale well. But in order to provide
> substantial input I am still trying to eliminate things that can
> bottleneck. Currently I have eliminated CPU (plenty free), RAM (there is
> 48 GB of RAM in this server for a 32-bit postgresql instance) and storage
> I/O (used the free RAM for a /tmp database to eliminate I/O), and am
> still trying to eliminate any network bottlenecks before saying for sure
> we have a problem in PostgreSQL. So yes, till that final thing is
> confirmed (the network, which can very well be the case), the problem
> could be somewhere else. However, the thing that worries me is the big
> drop, instead of throughput remaining constant out there.
>
> Anyway, more on this within a day or two once I add more network NICs
> between the systems to eliminate network problems (even though stats
> don't show them as problems right now) and also reduce malloc lock
> penalties, if any.
>
> As for the other questions:
>
> max_locks_per_transaction is set to the default (10, I believe);
> increasing it still seems to degrade the overall throughput number.
>
> max_connections is set to 1500 for now, till I get decent scaling up to
> 1400-1500 users.
>
> There are no hard failures reported anywhere. The
> log_min_duration_statement output does show that queries are now slowing
> down and taking longer.
>
> The OS is not swapping, and I/O is also eliminated by putting the whole
> database on /tmp.
>
> So while I finish adding more network connections between the two
> systems (need to get cards), do enjoy the following URL :-)
>
> http://www.spec.org/jAppServer2004/results/res2007q3/jAppServer2004-20070703-00073.html
>
> Of course all of postgresql.conf still remains from the old test, so no
> flames on that one again :-)
>
> Regards,
> Jignesh
>
> Josh Berkus wrote:
>> Tom,
>>
>>> Well, if the load is a lot of short writing transactions then you'd
>>> expect the throughput to depend on how fast stuff can be pushed down to
>>> WAL. What have you got wal_buffers set to? Are you using a commit
>>> delay? What's the I/O system anyway (any BB write cache on the WAL
>>> disk?) and what wal sync method are you using?
>>
>> You know, I think Jignesh needs to be on this list so I can stop
>> relaying questions on a workload I didn't design. Let me get him.
"Jignesh K. Shah" <J.K.Shah@Sun.COM> writes: > There are no hard failures reported anywhere. Log min durations does > show that queries are now slowing down and taking longer. > OS is not swapping and also eliminated IO by putting the whole database > on /tmp Hmm. Do you see any evidence of a context swap storm (ie, a drastic increase in the context swaps/second reading reported by vmstat)? regards, tom lane
Yes, I did see an increase in context switches and CPU migrations at that point using mpstat.

Regards,
Jignesh

Tom Lane wrote:
> "Jignesh K. Shah" <J.K.Shah@Sun.COM> writes:
>
>> There are no hard failures reported anywhere. Log min durations does
>> show that queries are now slowing down and taking longer.
>> OS is not swapping and also eliminated IO by putting the whole database
>> on /tmp
>
> Hmm. Do you see any evidence of a context swap storm (ie, a drastic
> increase in the context swaps/second reading reported by vmstat)?
>
> regards, tom lane
"Jignesh K. Shah" <J.K.Shah@Sun.COM> writes: > Yes I did see increase in context switches and CPU migrations at that > point using mpstat. So follow that up --- try to determine which lock is being contended for. There's some very crude code in the sources that you can enable with -DLWLOCK_STATS, but probably DTrace would be a better tool. regards, tom lane
Tom Lane wrote: > "Jignesh K. Shah" <J.K.Shah@Sun.COM> writes: >> Yes I did see increase in context switches and CPU migrations at that >> point using mpstat. > > So follow that up --- try to determine which lock is being contended > for. There's some very crude code in the sources that you can enable > with -DLWLOCK_STATS, but probably DTrace would be a better tool. > > regards, tom lane Using plockstat -A -s 5 -p $pid on bgwriter: doesnt report anything On one of the many connections: This one is hard to read easily By default, plockstat monitors all lock con- tention events, gathers frequency and timing data about those events, and displays the data in decreasing frequency order, so that the most common events appear first. ^Cbash-3.00# plockstat -A -s 5 -p 6401 ^C Mutex hold ------------------------------------------------------------------------------- Count nsec Lock Caller 59 186888 0x10059e280 libumem.so.1`process_free+0x12c nsec ---- Time Distribution --- count Stack 16384 | | 1 libumem.so.1`process_free+0x12c 32768 |@@@@@ | 14 postgres`AllocSetDelete+0x98 65536 |@@ | 5 postgres`MemoryContextDelete+0x78 131072 | | 0 postgres`CommitTransaction+0x240 262144 |@@@@@@@@@@@@@@@ | 39 ------------------------------------------------------------------------------- Count nsec Lock Caller 530 12226 0x10059e280 libumem.so.1`umem_cache_alloc+0x200 nsec ---- Time Distribution --- count Stack 4096 |@@@@@@@@@@@@@@@ | 338 libumem.so.1`umem_cache_alloc+0x200 8192 |@ | 24 libumem.so.1`umem_alloc+0x5c 16384 |@ | 37 libumem.so.1`malloc+0x40 32768 |@@@@@ | 131 postgres`AllocSetAlloc+0x1c4 ------------------------------------------------------------------------------- Count nsec Lock Caller 324 10214 0x100578030 libumem.so.1`vmem_xfree+0x164 nsec ---- Time Distribution --- count Stack 4096 |@@@@@@@@@@@@@@ | 192 libumem.so.1`vmem_xfree+0x164 8192 |@@@@ | 56 libumem.so.1`process_free+0x12c 16384 |@ | 26 postgres`AllocSetDelete+0x98 32768 |@@@ | 50 postgres`MemoryContextDelete+0x78 ------------------------------------------------------------------------------- Count nsec Lock Caller 161 13585 0x10059e280 libumem.so.1`process_free+0x12c nsec ---- Time Distribution --- count Stack 4096 |@@@@@@@@@@@@@@@@@ | 118 libumem.so.1`process_free+0x12c 8192 | | 4 postgres`AllocSetDelete+0x98 16384 |@ | 10 postgres`MemoryContextDelete+0x78 32768 |@@@ | 24 postgres`PortalDrop+0x160 65536 | | 3 131072 | | 0 262144 | | 2 ------------------------------------------------------------------------------- Count nsec Lock Caller 326 6081 libumem.so.1`vmem0+0xc38 libumem.so.1`vmem_xalloc+0x630 nsec ---- Time Distribution --- count Stack 4096 |@@@@@@@@@@@@ | 170 libumem.so.1`vmem_xalloc+0x630 8192 |@@@@@@@@@@@ | 155 libumem.so.1`vmem_alloc+0x1f8 16384 | | 1 libumem.so.1`vmem_xalloc+0x524 libumem.so.1`vmem_alloc+0x1f8 ------------------------------------------------------------------------------- Count nsec Lock Caller 326 5867 libumem.so.1`vmem0+0x30 libumem.so.1`vmem_alloc+0x248 nsec ---- Time Distribution --- count Stack 4096 |@@@@@@@@@@@@@ | 185 libumem.so.1`vmem_alloc+0x248 8192 |@@@@@@@@@@ | 141 libumem.so.1`vmem_sbrk_alloc+0x30 libumem.so.1`vmem_xalloc+0x524 libumem.so.1`vmem_alloc+0x1f8 ------------------------------------------------------------------------------- Count nsec Lock Caller 318 5873 0x100578030 libumem.so.1`vmem_alloc+0x1d0 nsec ---- Time Distribution --- count Stack 4096 |@@@@@@@@@@@@@@@@@ | 228 libumem.so.1`vmem_alloc+0x1d0 8192 |@@@@@ | 78 libumem.so.1`umem_alloc+0xec 16384 | | 6 libumem.so.1`malloc+0x40 
32768 | | 6 postgres`AllocSetAlloc+0x1c4 ------------------------------------------------------------------------------- Count nsec Lock Caller 326 5591 0x100578030 libumem.so.1`vmem_xalloc+0x630 nsec ---- Time Distribution --- count Stack 4096 |@@@@@@@@@@@@@@@ | 213 libumem.so.1`vmem_xalloc+0x630 8192 |@@@@@@@@ | 112 libumem.so.1`vmem_alloc+0x1f8 16384 | | 0 libumem.so.1`umem_alloc+0xec 32768 | | 1 libumem.so.1`malloc+0x40 ------------------------------------------------------------------------------- Count nsec Lock Caller 324 5208 libumem.so.1`vmem0+0xc38 libumem.so.1`vmem_xfree+0x164 nsec ---- Time Distribution --- count Stack 4096 |@@@@@@@@@@@@@@@@@ | 236 libumem.so.1`vmem_xfree+0x164 8192 |@@@@@@ | 88 libumem.so.1`process_free+0x12c postgres`AllocSetDelete+0x98 postgres`MemoryContextDelete+0x78 ------------------------------------------------------------------------------- Count nsec Lock Caller 326 4108 libumem.so.1`vmem0+0xc38 libumem.so.1`vmem_alloc+0x1d0 nsec ---- Time Distribution --- count Stack 4096 |@@@@@@@@@@@@@@@@@@@@@@@ | 325 libumem.so.1`vmem_alloc+0x1d0 8192 | | 1 libumem.so.1`vmem_xalloc+0x524 libumem.so.1`vmem_alloc+0x1f8 libumem.so.1`umem_alloc+0xec ------------------------------------------------------------------------------- Count nsec Lock Caller 326 4108 0x100578030 libumem.so.1`vmem_xalloc+0x50c nsec ---- Time Distribution --- count Stack 4096 |@@@@@@@@@@@@@@@@@@@@@@@ | 325 libumem.so.1`vmem_xalloc+0x50c 8192 | | 1 libumem.so.1`vmem_alloc+0x1f8 libumem.so.1`umem_alloc+0xec libumem.so.1`malloc+0x40 ------------------------------------------------------------------------------- Count nsec Lock Caller 326 4096 libumem.so.1`vmem0+0xc38 libumem.so.1`vmem_xalloc+0x50c nsec ---- Time Distribution --- count Stack 4096 |@@@@@@@@@@@@@@@@@@@@@@@@| 326 libumem.so.1`vmem_xalloc+0x50c libumem.so.1`vmem_alloc+0x1f8 libumem.so.1`vmem_xalloc+0x524 libumem.so.1`vmem_alloc+0x1f8 ------------------------------------------------------------------------------- Count nsec Lock Caller 240 5444 libumem.so.1`vmem0+0x30 libumem.so.1`process_free+0x12c nsec ---- Time Distribution --- count Stack 4096 |@@@@@@@@@@@@@@@@ | 167 libumem.so.1`process_free+0x12c 8192 |@@@@@@@ | 72 postgres`AllocSetDelete+0x98 16384 | | 0 postgres`MemoryContextDelete+0x78 32768 | | 1 postgres`ExecutorEnd+0x40 ------------------------------------------------------------------------------- Count nsec Lock Caller 123 9057 0x10059e1d0 libumem.so.1`umem_depot_alloc+0xb8 nsec ---- Time Distribution --- count Stack 4096 |@@@@@@@@@@@ | 60 libumem.so.1`umem_depot_alloc+0xb8 8192 |@@@@ | 24 libumem.so.1`umem_cache_free+0xc4 16384 |@@@@@@@ | 37 libumem.so.1`process_free+0x12c 32768 | | 2 postgres`AllocSetDelete+0x98 ------------------------------------------------------------------------------- Count nsec Lock Caller 200 4935 0x10059e280 libumem.so.1`process_free+0x12c nsec ---- Time Distribution --- count Stack 4096 |@@@@@@@@@@@@@@@@@@@@@@ | 185 libumem.so.1`process_free+0x12c 8192 | | 4 postgres`AllocSetDelete+0x98 16384 |@ | 10 postgres`MemoryContextDelete+0x78 32768 | | 1 postgres`ExecutorEnd+0x40 ------------------------------------------------------------------------------- Count nsec Lock Caller 164 5219 0x100595700 libumem.so.1`umem_cache_alloc+0x200 nsec ---- Time Distribution --- count Stack 4096 |@@@@@@@@@@@@@@@@@ | 121 libumem.so.1`umem_cache_alloc+0x200 8192 |@@@@@@ | 42 libumem.so.1`umem_alloc+0x5c 16384 | | 1 libumem.so.1`malloc+0x40 postgres`AllocSetAlloc+0x1c4 
------------------------------------------------------------------------------- Count nsec Lock Caller 122 6748 0x10059e1d0 libumem.so.1`umem_depot_alloc+0xb8 nsec ---- Time Distribution --- count Stack 4096 |@@@@@@@@ | 43 libumem.so.1`umem_depot_alloc+0xb8 8192 |@@@@@@@@@@@@@@@ | 79 libumem.so.1`umem_cache_alloc+0xa0 libumem.so.1`umem_alloc+0x5c libumem.so.1`malloc+0x40 ------------------------------------------------------------------------------- Count nsec Lock Caller 163 4146 0x100595700 libumem.so.1`process_free+0x12c nsec ---- Time Distribution --- count Stack 4096 |@@@@@@@@@@@@@@@@@@@@@@@ | 161 libumem.so.1`process_free+0x12c 8192 | | 2 postgres`AllocSetDelete+0x98 postgres`MemoryContextDelete+0x78 postgres`PortalDrop+0x160 ------------------------------------------------------------------------------- Count nsec Lock Caller 50 12615 0x10059e280 libumem.so.1`process_free+0x12c nsec ---- Time Distribution --- count Stack 4096 |@@@@@@@@@@@@@ | 28 libumem.so.1`process_free+0x12c 8192 |@ | 3 postgres`AllocSetDelete+0x98 16384 |@@@ | 8 postgres`MemoryContextDelete+0x78 32768 |@@@@@ | 11 postgres`FreeExecutorState+0x6c ------------------------------------------------------------------------------- Count nsec Lock Caller 123 4096 0x10059e1d0 libumem.so.1`umem_cache_free+0xec nsec ---- Time Distribution --- count Stack 4096 |@@@@@@@@@@@@@@@@@@@@@@@@| 123 libumem.so.1`umem_cache_free+0xec libumem.so.1`process_free+0x12c postgres`AllocSetDelete+0x98 postgres`MemoryContextDelete+0x78 ------------------------------------------------------------------------------- Count nsec Lock Caller 122 4096 0x10059e1d0 libumem.so.1`umem_cache_alloc+0xc4 nsec ---- Time Distribution --- count Stack 4096 |@@@@@@@@@@@@@@@@@@@@@@@@| 122 libumem.so.1`umem_cache_alloc+0xc4 libumem.so.1`umem_alloc+0x5c libumem.so.1`malloc+0x40 postgres`AllocSetAlloc+0x1c4 ------------------------------------------------------------------------------- Count nsec Lock Caller 37 7970 libumem.so.1`vmem0+0x30 libumem.so.1`process_free+0x12c nsec ---- Time Distribution --- count Stack 4096 |@ | 2 libumem.so.1`process_free+0x12c 8192 |@@@@@@@@@@@@@@@@@@@@@@ | 35 postgres`AllocSetDelete+0x98 postgres`MemoryContextDelete+0x78 postgres`exec_parse_message+0x130 ------------------------------------------------------------------------------- Count nsec Lock Caller 37 5867 0x10059e280 libumem.so.1`process_free+0x12c nsec ---- Time Distribution --- count Stack 4096 |@@@@@@@@@@@@@@@@@@@@@ | 33 libumem.so.1`process_free+0x12c 8192 |@ | 2 postgres`AllocSetDelete+0x98 16384 | | 0 postgres`MemoryContextDelete+0x78 32768 |@ | 2 postgres`exec_parse_message+0x130 ------------------------------------------------------------------------------- Count nsec Lock Caller 39 4516 libumem.so.1`vmem0+0x30 libumem.so.1`process_free+0x12c nsec ---- Time Distribution --- count Stack 4096 |@@@@@@@@@@@@@@@@@@@@@ | 35 libumem.so.1`process_free+0x12c 8192 |@@ | 4 postgres`AllocSetDelete+0x98 postgres`MemoryContextDelete+0x78 postgres`PortalDrop+0x160 ------------------------------------------------------------------------------- Count nsec Lock Caller 37 4428 0x10058b700 libumem.so.1`umem_cache_alloc+0x200 nsec ---- Time Distribution --- count Stack 4096 |@@@@@@@@@@@@@@@@@@@@@@ | 34 libumem.so.1`umem_cache_alloc+0x200 8192 |@ | 3 libumem.so.1`umem_alloc+0x5c libumem.so.1`malloc+0x40 postgres`base_yy_scan_buffer+0x38 ------------------------------------------------------------------------------- Count nsec Lock Caller 37 4206 0x10058b700 libumem.so.1`process_free+0x12c 
nsec ---- Time Distribution --- count Stack 4096 |@@@@@@@@@@@@@@@@@@@@@@@ | 36 libumem.so.1`process_free+0x12c 8192 | | 1 postgres`scanner_finish+0x50 postgres`raw_parser+0x3c postgres`pg_parse_query+0x54 ------------------------------------------------------------------------------- Count nsec Lock Caller 11 10426 0x10059e280 libumem.so.1`process_free+0x12c nsec ---- Time Distribution --- count Stack 4096 |@@@@@@@@@@@@@ | 6 libumem.so.1`process_free+0x12c 8192 |@@@@@@ | 3 postgres`AllocSetDelete+0x98 16384 | | 0 postgres`MemoryContextDelete+0x78 32768 |@@@@ | 2 postgres`ExecEndAgg+0x68 ------------------------------------------------------------------------------- Count nsec Lock Caller 8 5120 libumem.so.1`vmem0+0x30 libumem.so.1`process_free+0x12c nsec ---- Time Distribution --- count Stack 4096 |@@@@@@@@@@@@@@@@@@ | 6 libumem.so.1`process_free+0x12c 8192 |@@@@@@ | 2 postgres`AllocSetDelete+0x98 postgres`MemoryContextDelete+0x78 postgres`ExecEndSort+0x24 ------------------------------------------------------------------------------- Count nsec Lock Caller 8 4096 0x10059e280 libumem.so.1`process_free+0x12c nsec ---- Time Distribution --- count Stack 4096 |@@@@@@@@@@@@@@@@@@@@@@@@| 8 libumem.so.1`process_free+0x12c postgres`AllocSetDelete+0x98 postgres`MemoryContextDelete+0x78 postgres`ExecEndSort+0x24 ------------------------------------------------------------------------------- Count nsec Lock Caller 8 4096 0x100578030 libumem.so.1`vmem_alloc+0x1d0 nsec ---- Time Distribution --- count Stack 4096 |@@@@@@@@@@@@@@@@@@@@@@@@| 8 libumem.so.1`vmem_alloc+0x1d0 libumem.so.1`umem_alloc+0xec libumem.so.1`malloc+0x40 postgres`AllocSetAlloc+0x314 ------------------------------------------------------------------------------- Count nsec Lock Caller 3 4096 0x10059e280 libumem.so.1`process_free+0x12c nsec ---- Time Distribution --- count Stack 4096 |@@@@@@@@@@@@@@@@@@@@@@@@| 3 libumem.so.1`process_free+0x12c postgres`AllocSetDelete+0x98 postgres`MemoryContextDelete+0x78 postgres`tbm_free+0x10 bash-3.00#
"Jignesh K. Shah" <J.K.Shah@Sun.COM> writes: > Tom Lane wrote: >> So follow that up --- try to determine which lock is being contended >> for. There's some very crude code in the sources that you can enable >> with -DLWLOCK_STATS, but probably DTrace would be a better tool. > Using plockstat -A -s 5 -p $pid I don't know what that is, but it doesn't appear to have anything to do with Postgres LWLocks or spinlocks, which are the locks I was thinking of. Try asking Robert Lor about this --- IIRC he had some dtrace probes to work with our locks. regards, tom lane
Sorry. These are Solaris mutex locks used by the postgresql process.

What it's saying is that there are holds/waits in trying to get locks which are taken at the Solaris user-library level, called from the postgresql functions. For example, both of the following functions are hitting the same mutex lock 0x10059e280 in a Solaris library call:

postgres`AllocSetDelete+0x98
postgres`AllocSetAlloc+0x1c4

I need to enable the DTrace probes on my builds.

-Jignesh

Tom Lane wrote:
> "Jignesh K. Shah" <J.K.Shah@Sun.COM> writes:
>
>> Tom Lane wrote:
>>
>>> So follow that up --- try to determine which lock is being contended
>>> for. There's some very crude code in the sources that you can enable
>>> with -DLWLOCK_STATS, but probably DTrace would be a better tool.
>
>> Using plockstat -A -s 5 -p $pid
>
> I don't know what that is, but it doesn't appear to have anything to do
> with Postgres LWLocks or spinlocks, which are the locks I was thinking of.
> Try asking Robert Lor about this --- IIRC he had some dtrace probes to
> work with our locks.
>
> regards, tom lane
"Jignesh K. Shah" <J.K.Shah@Sun.COM> writes: > What its saying is that there are holds/waits in trying to get locks > which are locked at Solaris user library levels called from the > postgresql functions: > For example both the following functions are hitting on the same mutex > lock 0x10059e280 in Solaris Library call: > postgres`AllocSetDelete+0x98 > postgres`AllocSetAlloc+0x1c4 That's a perfect example of the sort of useless overhead that I was complaining of just now in pgsql-patches. Having malloc/free use an internal mutex is necessary in multi-threaded programs, but the backend isn't multi-threaded. And yet, apparently you can't turn that off in Solaris. (Fortunately, the palloc layer is probably insulating us from malloc's performance enough that this isn't a huge deal. But it's annoying.) regards, tom lane
True you cant switch off the locks since libthread has been folded into libc in Solaris 10. Anyway just to give you an idea of the increase in context switching at the break point here are the mpstat (taken at 10 second interval) on this 8-socket Sun Fire V890. The low icsw (Involuntary Context Switches) is about 950-1000 user mark after which a context switch storm starts at users above 1000-1050 mark and drops in total throughput drops about 30% instantaneously.. I will try rebuilding the postgresql with dtrace probes to get more clues. (NOTE you will see 1 cpu (cpuid:22) doing more system work... thats the one doing handling the network interrupts) CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl 0 57 0 27 108 6 4072 98 1749 416 1 7763 47 13 0 40 1 46 0 24 22 6 4198 11 1826 427 0 7547 45 13 0 42 2 42 0 34 104 8 4103 91 1682 424 1 7797 46 13 0 41 3 51 0 22 21 6 4125 10 1734 435 0 7399 45 13 0 43 4 65 0 27 19 6 4015 8 1706 411 0 7292 44 15 0 41 5 54 0 21 21 6 4297 10 1702 464 0 7708 45 13 0 42 6 36 0 16 66 47 4218 12 1713 426 0 7685 47 11 0 42 7 40 0 100 318 206 3699 10 1534 585 0 6851 45 14 0 41 16 41 0 30 87 5 3780 78 1509 401 1 7604 45 13 0 42 17 39 0 24 22 5 3970 12 1631 408 0 7265 44 12 0 44 18 42 0 24 99 5 3829 89 1519 401 1 7343 45 12 0 43 19 39 0 31 78830 5 3588 8 1509 400 0 6629 43 13 0 44 20 22 0 20 19 6 3925 9 1577 419 0 7364 44 12 0 44 21 38 0 31 23 5 3792 13 1566 407 0 7133 45 12 0 44 22 8 0 110 7053 7045 1641 8 728 838 0 2917 16 50 0 33 23 62 0 29 21 5 3985 10 1579 449 0 7368 44 12 0 44 CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl 0 13 0 27 123 6 4228 113 1820 433 1 8084 49 13 0 38 1 16 0 63 26 6 4253 15 1875 420 0 7754 47 14 0 39 2 11 0 31 110 8 4178 97 1741 425 1 8095 48 14 0 38 3 8 0 24 20 6 4257 9 1818 444 0 7807 47 13 0 40 4 13 0 54 28 6 4145 17 1774 426 1 7732 46 16 0 38 5 12 0 35 23 6 4412 12 1775 447 0 8249 48 13 0 39 6 8 0 24 38 15 4323 14 1760 422 0 8016 49 11 0 39 7 8 0 120 323 206 3801 15 1599 635 0 7290 47 15 0 38 16 11 0 44 107 5 3896 98 1582 393 1 7997 47 15 0 39 17 15 0 29 24 5 4120 14 1716 416 0 7648 46 13 0 41 18 9 0 35 113 5 3933 103 1594 399 1 7714 47 13 0 40 19 8 0 34 83271 5 3702 12 1564 403 0 7010 45 14 0 41 20 7 0 28 27 6 3997 16 1624 400 0 7676 46 13 0 41 21 8 0 28 25 5 3997 15 1664 402 0 7658 47 12 0 41 22 4 0 97 7741 7731 1586 11 704 906 0 2933 17 51 0 32 23 13 0 28 25 5 4144 15 1658 437 0 7810 47 12 0 41 CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl 0 0 0 141 315 6 9262 301 2812 330 0 10905 49 16 0 35 1 1 0 153 199 6 9400 186 2808 312 0 11066 48 16 0 37 2 0 0 140 256 8 8798 242 2592 310 0 10111 47 15 0 38 3 1 0 141 189 6 8803 172 2592 314 0 10171 47 15 0 39 4 0 0 120 214 6 9540 207 2801 322 0 10531 46 17 0 36 5 1 0 152 180 6 8764 161 2564 342 0 9904 47 15 0 38 6 1 0 107 344 148 8180 181 2512 290 0 9314 51 14 0 35 7 0 0 665 443 204 8733 153 2574 404 0 9892 43 21 0 37 16 0 0 113 217 5 6446 201 1975 265 0 7552 45 12 0 44 17 0 0 107 153 5 6568 140 2021 274 0 7586 44 11 0 45 18 0 0 121 215 5 6072 201 1789 276 1 7690 44 12 0 44 19 1 0 102 47142 5 6123 126 1829 262 0 7185 43 12 0 45 20 0 0 102 143 6 6451 129 1939 262 0 7450 43 13 0 44 21 1 0 106 150 5 6538 133 1997 285 0 7425 44 11 0 44 22 0 0 494 5949 5876 3586 73 1040 399 0 4058 26 39 0 34 23 0 0 102 159 5 6393 142 1942 324 0 7226 43 12 0 46 CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl 0 0 0 217 441 7 10763 426 3234 363 0 12449 47 18 0 35 1 0 0 210 322 7 11113 309 3273 351 0 12527 46 17 0 37 2 1 0 212 
387 8 10306 370 2977 354 0 11320 45 16 0 38 3 0 0 230 276 7 10332 257 2947 341 0 11901 43 16 0 40 4 0 0 234 306 7 11324 290 3265 352 0 12805 45 18 0 37 5 0 0 212 284 7 10590 262 3042 388 0 11789 44 17 0 39 6 1 0 154 307 48 9583 241 2903 324 0 10564 50 15 0 35 7 0 0 840 535 206 10354 247 3035 428 0 11700 42 22 0 37 16 0 0 169 303 5 7446 286 2250 290 0 8361 42 13 0 45 17 0 0 173 240 5 7640 225 2288 295 0 8674 41 13 0 47 18 0 0 170 289 5 7445 270 2108 286 0 8167 41 12 0 47 19 0 0 176 51118 5 7365 197 2138 288 0 7934 40 13 0 47 20 1 0 172 222 6 7835 204 2323 298 0 8759 40 14 0 46 21 0 0 167 233 5 7749 218 2339 326 0 8264 42 13 0 46 22 0 0 749 6612 6516 4173 97 1166 421 0 4741 23 44 0 33 23 0 0 181 239 6 7709 219 2258 383 0 8402 41 12 0 47 CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl 0 0 0 198 439 6 10364 417 3113 327 0 11962 49 17 0 34 1 0 0 210 299 6 10655 282 3135 346 0 12463 47 17 0 36 2 0 0 202 352 8 9960 332 2890 320 0 11261 47 16 0 37 3 0 0 182 276 6 9950 255 2857 334 0 11021 46 16 0 38 4 0 0 200 305 6 10841 286 3127 325 0 12440 48 18 0 35 5 0 0 240 286 6 9983 272 2912 358 0 11450 46 16 0 37 6 0 0 153 323 81 9062 233 2767 300 0 9675 49 18 0 33 7 0 0 850 556 206 10027 271 2910 415 0 11048 43 22 0 35 16 0 0 152 306 5 7261 291 2216 266 0 8055 44 12 0 44 17 0 0 151 236 5 7193 217 2170 283 0 8099 43 12 0 45 18 0 0 170 263 5 7008 246 2009 254 0 7836 43 12 0 46 19 0 0 165 47738 5 6824 197 1989 273 0 7663 42 12 0 46 20 0 0 188 217 6 7496 197 2222 280 0 8435 43 13 0 44 21 0 0 179 248 5 7352 234 2233 309 0 8237 43 12 0 44 22 0 0 813 6041 5963 4006 82 1125 448 0 4442 25 42 0 33 23 0 0 162 241 5 7364 225 2170 355 0 7720 43 11 0 45 Tom Lane wrote: > "Jignesh K. Shah" <J.K.Shah@Sun.COM> writes: > >> What its saying is that there are holds/waits in trying to get locks >> which are locked at Solaris user library levels called from the >> postgresql functions: >> For example both the following functions are hitting on the same mutex >> lock 0x10059e280 in Solaris Library call: >> postgres`AllocSetDelete+0x98 >> postgres`AllocSetAlloc+0x1c4 >> > > That's a perfect example of the sort of useless overhead that I was > complaining of just now in pgsql-patches. Having malloc/free use > an internal mutex is necessary in multi-threaded programs, but the > backend isn't multi-threaded. And yet, apparently you can't turn > that off in Solaris. > > (Fortunately, the palloc layer is probably insulating us from malloc's > performance enough that this isn't a huge deal. But it's annoying.) > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 1: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly >
Tom Lane wrote:
> Having malloc/free use
> an internal mutex is necessary in multi-threaded programs, but the
> backend isn't multi-threaded.

Hmm... confused. I'm not following why there would then be contention for the mutex. Surely this has to be some other mutex that is in contention, not a heap lock? It'd be handy to see the call stack for the wait state -- if the thing is spending a significant proportion of its time in contention, it should be easy to get that with a simple tool such as pstack or a debugger.
David Boreham <david_list@boreham.org> writes: > Tom Lane wrote: >> Having malloc/free use >> an internal mutex is necessary in multi-threaded programs, but the >> backend isn't multi-threaded. > Hmm...confused. I'm not following why then there is contention for the > mutex. There isn't any contention for that mutex; Jignesh's results merely show that it was taken and released a lot of times. regards, tom lane
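To illustrate the "insulation" Tom described a couple of messages back, here is a toy sketch of the general region-allocator idea (illustration only; this is not PostgreSQL's actual aset.c code): a memory context grabs a sizable block from malloc() occasionally and carves small palloc()-style requests out of it, so libc's internal mutex is touched once per block rather than once per allocation.

    #include <stdlib.h>

    typedef struct Block
    {
        struct Block *next;     /* previously filled blocks */
        char         *freeptr;  /* next free byte in this block */
        char         *endptr;   /* one past the end of this block */
    } Block;

    static Block *current_block = NULL;

    /* Hand out 'size' bytes, touching malloc() only when a fresh block is needed. */
    static void *
    toy_palloc(size_t size)
    {
        void *result;

        size = (size + 7) & ~(size_t) 7;          /* MAXALIGN-style rounding */

        if (current_block == NULL ||
            (size_t) (current_block->endptr - current_block->freeptr) < size)
        {
            size_t  datasize = (size > 8192) ? size : 8192;   /* one malloc per ~8 kB */
            Block  *blk = malloc(sizeof(Block) + datasize);

            if (blk == NULL)
                return NULL;
            blk->next = current_block;
            blk->freeptr = (char *) (blk + 1);
            blk->endptr = blk->freeptr + datasize;
            current_block = blk;
        }

        result = current_block->freeptr;
        current_block->freeptr += size;
        return result;
    }

The real allocator additionally keeps per-size freelists and lets an entire context be reset or deleted at once, which is why the plockstat stacks above show AllocSetAlloc and AllocSetDelete as the places that finally reach malloc/free.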
On Fri, 2007-07-20 at 16:57 -0400, Tom Lane wrote: > "Jignesh K. Shah" <J.K.Shah@Sun.COM> writes: > > What its saying is that there are holds/waits in trying to get locks > > which are locked at Solaris user library levels called from the > > postgresql functions: > > For example both the following functions are hitting on the same mutex > > lock 0x10059e280 in Solaris Library call: > > postgres`AllocSetDelete+0x98 > > postgres`AllocSetAlloc+0x1c4 > > That's a perfect example of the sort of useless overhead that I was > complaining of just now in pgsql-patches. Having malloc/free use > an internal mutex is necessary in multi-threaded programs, but the > backend isn't multi-threaded. And yet, apparently you can't turn > that off in Solaris. > > (Fortunately, the palloc layer is probably insulating us from malloc's > performance enough that this isn't a huge deal. But it's annoying.) There is one thing that the palloc layer doesn't handle: EState. All other memory contexts have a very well chosen initial allocation that prevents mallocs during low-medium complexity OLTP workloads. EState is about 8300 bytes, so just above the large allocation limit. This means that every time we request an EState, i.e. at least once per statement we need to malloc() and then later free(). Would it be worth a special case in the palloc system to avoid having to repeatedly issue external memory allocation calls? -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes: > EState is about 8300 bytes, What? (gdb) p sizeof(EState) $1 = 112 This is on a 32-bit machine, but even on 64-bit it wouldn't be more than double that. > Would it be worth a special case in the palloc system to avoid having to > repeatedly issue external memory allocation calls? The appropriate hack would be to change the AllocSetContextCreate initial-size parameter for the containing context. But I really have no idea what you're on about. regards, tom lane
On Mon, 2007-07-23 at 10:11 -0400, Tom Lane wrote: > "Simon Riggs" <simon@2ndquadrant.com> writes: > > EState is about 8300 bytes, > > What? > > (gdb) p sizeof(EState) > $1 = 112 > > This is on a 32-bit machine, but even on 64-bit it wouldn't be more than > double that. > > > Would it be worth a special case in the palloc system to avoid having to > > repeatedly issue external memory allocation calls? > > The appropriate hack would be to change the AllocSetContextCreate > initial-size parameter for the containing context. But I really have > no idea what you're on about. I looked at this last May and my notes say "ExecutorState". I guess that was wrong, but my analysis showed there was a single malloc of 8228 bytes happening once per query during my tests. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes: > I looked at this last May and my notes say "ExecutorState". I guess that > was wrong, but my analysis showed there was a single malloc of 8228 > bytes happening once per query during my tests. Well, if you can track down where it's coming from, we could certainly hack the containing context's parameters. But EState's not it. regards, tom lane
On Mon, 2007-07-23 at 10:54 -0400, Tom Lane wrote: > "Simon Riggs" <simon@2ndquadrant.com> writes: > > I looked at this last May and my notes say "ExecutorState". I guess that > > was wrong, but my analysis showed there was a single malloc of 8228 > > bytes happening once per query during my tests. > > Well, if you can track down where it's coming from, we could certainly > hack the containing context's parameters. But EState's not it. Well, I discover there is an allocation of 8232 (inflation...) made once per statement by a memory context called... ExecutorState. Still not sure exactly which allocation this is, but its definitely once per statement on pgbench, which should narrow it down. Plan, query etc? I don't see a way to hack the allocation, since the max chunk size is 8K. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
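For reference, two fragments of the 8.2-era sources (paraphrased; exact code varies by version) show why the initial-size hack doesn't reach this allocation. The ExecutorState context is created in CreateExecutorState() in src/backend/executor/execUtils.c with the stock block sizes, and AllocSetAlloc() in src/backend/utils/mmgr/aset.c sends any request above ALLOC_CHUNK_LIMIT straight to malloc() regardless of those sizes:

    /* execUtils.c: the containing context Tom suggested tuning */
    qcontext = AllocSetContextCreate(CurrentMemoryContext,
                                     "ExecutorState",
                                     ALLOCSET_DEFAULT_MINSIZE,   /* 0 */
                                     ALLOCSET_DEFAULT_INITSIZE,  /* 8 * 1024 */
                                     ALLOCSET_DEFAULT_MAXSIZE);  /* 8 * 1024 * 1024 */

    /* aset.c: oversized requests bypass the context's block pool entirely */
    #define ALLOC_CHUNK_LIMIT   8192

    if (size > ALLOC_CHUNK_LIMIT)
    {
        chunk_size = MAXALIGN(size);
        blksize = chunk_size + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ;
        block = (AllocBlock) malloc(blksize);
        /* ...this single-chunk block is linked into the context and handed
         * back to libc when the context is reset or deleted... */
    }

So an 8232-byte request exceeds ALLOC_CHUNK_LIMIT and hits malloc() every time it is made, no matter how large the context's initBlockSize is set.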
"Simon Riggs" <simon@2ndquadrant.com> writes: > Well, I discover there is an allocation of 8232 (inflation...) made once > per statement by a memory context called... ExecutorState. Still not > sure exactly which allocation this is, but its definitely once per > statement on pgbench, which should narrow it down. Plan, query etc? Are you working with stock sources? The only allocation exceeding 1K that I can see during pgbench is BTScanOpaqueData, which is 6600 bytes. (Checked by setting a conditional breakpoint on AllocSetAlloc.) The path that allocates a single-chunk block is never taken at all. regards, tom lane
On Mon, 2007-07-23 at 16:48 +0100, Simon Riggs wrote: > On Mon, 2007-07-23 at 10:54 -0400, Tom Lane wrote: > > "Simon Riggs" <simon@2ndquadrant.com> writes: > > > I looked at this last May and my notes say "ExecutorState". I guess that > > > was wrong, but my analysis showed there was a single malloc of 8228 > > > bytes happening once per query during my tests. > > > > Well, if you can track down where it's coming from, we could certainly > > hack the containing context's parameters. But EState's not it. > > Well, I discover there is an allocation of 8232 (inflation...) made once > per statement by a memory context called... ExecutorState. Still not > sure exactly which allocation this is, but its definitely once per > statement on pgbench, which should narrow it down. Plan, query etc? > > I don't see a way to hack the allocation, since the max chunk size is > 8K. It is the allocation of BTScanOpaqueData called from btrescan() in nbtree.c currPos and markPos are defined as BTScanPosData, which is an array of BTScanPosItems. That makes BTScanOpaqueData up to 8232 bytes, which seems wasteful since markPos is only ever used during merge joins. Most of that space isn't even used during merge joins either, we just do that to slightly optimise the speed of the restore during merge joins. Seems like we should allocate the memory when we do the first mark. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
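The structures Simon names look roughly like this (abridged from src/include/access/nbtree.h, 8.2-era sources, with comments trimmed). items[] is dimensioned by MaxIndexTuplesPerPage, around 510 entries on a MAXALIGN-4 build with 8 kB pages at 8 bytes each, so each BTScanPosData comes out near 4 kB and the two copies inside BTScanOpaqueData (currPos and markPos) are what push the whole allocation past 8 kB:

    typedef struct BTScanPosItem   /* what we remember about each match */
    {
        ItemPointerData heapTid;   /* TID of referenced heap item */
        OffsetNumber indexOffset;  /* index item's location within page */
    } BTScanPosItem;

    typedef struct BTScanPosData
    {
        Buffer      buf;           /* if valid, the buffer is pinned */
        BlockNumber nextPage;      /* page to move to next or prev */
        bool        moreLeft;      /* may there be matches to the left? */
        bool        moreRight;     /* ... or to the right? */
        int         firstItem;     /* first valid index in items[] */
        int         lastItem;      /* last valid index in items[] */
        int         itemIndex;     /* current index in items[] */
        BTScanPosItem items[MaxIndexTuplesPerPage];   /* MUST BE LAST */
    } BTScanPosData;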
On Mon, 2007-07-23 at 12:35 -0400, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > Well, I discover there is an allocation of 8232 (inflation...) made once
> > per statement by a memory context called... ExecutorState. Still not
> > sure exactly which allocation this is, but its definitely once per
> > statement on pgbench, which should narrow it down. Plan, query etc?
>
> Are you working with stock sources? The only allocation exceeding 1K
> that I can see during pgbench is BTScanOpaqueData, which is 6600 bytes.
> (Checked by setting a conditional breakpoint on AllocSetAlloc.) The
> path that allocates a single-chunk block is never taken at all.

I do have the bitmap patch currently applied, but it doesn't touch that part of the code.

(gdb) p size
$1 = 8232
(gdb) p sizeof(int)
$2 = 4
(gdb) p sizeof(BTScanPosData)
$3 = 4104

Since my notes say I got 8228 last year, seems reasonable.

-- Simon Riggs EnterpriseDB http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes: > currPos and markPos are defined as BTScanPosData, which is an array of > BTScanPosItems. That makes BTScanOpaqueData up to 8232 bytes, which > seems wasteful since markPos is only ever used during merge joins. Most > of that space isn't even used during merge joins either, we just do that > to slightly optimise the speed of the restore during merge joins. Ah. I was seeing it as 6600 bytes on HPPA and 6608 on x86_64, but I forgot that both of those architectures have MAXALIGN = 8. On a MAXALIGN = 4 machine, MaxIndexTuplesPerPage will be significantly larger, leading to larger BTScanPosData. Not sure it's worth fooling with, given that these days almost everyone who's seriously concerned about performance is probably using 64bit hardware. One less malloc cycle per indexscan is never going to be a measurable savings anyway... regards, tom lane
On Mon, 2007-07-23 at 14:19 -0400, Tom Lane wrote: > "Simon Riggs" <simon@2ndquadrant.com> writes: > > currPos and markPos are defined as BTScanPosData, which is an array of > > BTScanPosItems. That makes BTScanOpaqueData up to 8232 bytes, which > > seems wasteful since markPos is only ever used during merge joins. Most > > of that space isn't even used during merge joins either, we just do that > > to slightly optimise the speed of the restore during merge joins. > > Ah. I was seeing it as 6600 bytes on HPPA and 6608 on x86_64, but > I forgot that both of those architectures have MAXALIGN = 8. On a > MAXALIGN = 4 machine, MaxIndexTuplesPerPage will be significantly > larger, leading to larger BTScanPosData. > > Not sure it's worth fooling with, given that these days almost everyone > who's seriously concerned about performance is probably using 64bit > hardware. One less malloc cycle per indexscan is never going to be a > measurable savings anyway... Oh sure, I was thinking to avoid Solaris' mutex by avoiding malloc() completely. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Here is how I got the numbers.. I had about 1600 users login into postgresql. Then started the run with 500 users and using DTrace I started tracking Postgresql Locking "as viewed from one user/connection". Echo statements indicate how many users were active at that point and how was throughput performing. All IO is done on /tmp which means on a RAM disk. bash-3.00# echo 500 users - baseline number 500 users bash-3.00# ./3_lwlock_acquires.d 19178 Lock Id Mode Count FirstLockMgrLock Exclusive 1 RelCacheInitLock Exclusive 2 SInvalLock Exclusive 2 WALInsertLock Exclusive 10 BufMappingLock Exclusive 12 CheckpointLock Shared 29 CheckpointStartLock Shared 29 OidGenLock Exclusive 29 XidGenLock Exclusive 29 FirstLockMgrLock Shared 33 CheckpointStartLock Exclusive 78 FreeSpaceLock Exclusive 114 OidGenLock Shared 126 XidGenLock Shared 152 ProcArrayLock Shared 482 Lock Id Combined Time (ns) SInvalLock 29800 RelCacheInitLock 30300 BufMappingLock 168800 FirstLockMgrLock 414300 FreeSpaceLock 1281700 ProcArrayLock 7869900 WALInsertLock 11113200 CheckpointStartLock 13494700 OidGenLock 25719100 XidGenLock 26443300 CheckpointLock 194267800 bash-3.00# echo 600 users - Throughput rising 600 users bash-3.00# ./3_lwlock_acquires.d 19178 Lock Id Mode Count RelCacheInitLock Exclusive 1 SInvalLock Exclusive 1 BufMappingLock Exclusive 2 CLogControlLock Exclusive 2 WALInsertLock Exclusive 11 FirstLockMgrLock Shared 20 CheckpointLock Shared 24 CheckpointStartLock Shared 24 OidGenLock Exclusive 24 XidGenLock Exclusive 24 CheckpointStartLock Exclusive 72 FreeSpaceLock Exclusive 102 OidGenLock Shared 106 XidGenLock Shared 128 ProcArrayLock Shared 394 Lock Id Combined Time (ns) SInvalLock 15600 RelCacheInitLock 15700 BufMappingLock 31000 CLogControlLock 41500 FirstLockMgrLock 289000 FreeSpaceLock 3045400 CheckpointStartLock 7371800 WALInsertLock 9383200 ProcArrayLock 10457900 OidGenLock 20005900 XidGenLock 20331500 CheckpointLock 187067900 bash-3.00# echo 700 users - Throughput rising 700 users bash-3.00# ./3_lwlock_acquires.d 19178 Lock Id Mode Count RelCacheInitLock Exclusive 1 SInvalLock Exclusive 1 BufMappingLock Exclusive 2 WALInsertLock Exclusive 17 CheckpointLock Shared 33 CheckpointStartLock Shared 33 OidGenLock Exclusive 33 XidGenLock Exclusive 33 FirstLockMgrLock Shared 81 CheckpointStartLock Exclusive 87 FreeSpaceLock Exclusive 124 OidGenLock Shared 125 XidGenLock Shared 150 ProcArrayLock Shared 500 Lock Id Combined Time (ns) RelCacheInitLock 15100 SInvalLock 15400 BufMappingLock 47400 FirstLockMgrLock 3021000 FreeSpaceLock 3794300 WALInsertLock 7567300 XidGenLock 18427400 ProcArrayLock 20884000 CheckpointStartLock 24084900 OidGenLock 26399500 CheckpointLock 256549800 bash-3.00# echo 800 users - Throughput rising 800 users bash-3.00# ./3_lwlock_acquires.d 19178 Lock Id Mode Count BufMappingLock Exclusive 1 RelCacheInitLock Exclusive 1 SInvalLock Exclusive 1 WALWriteLock Exclusive 1 WALInsertLock Exclusive 11 CheckpointLock Shared 27 CheckpointStartLock Shared 27 OidGenLock Exclusive 27 XidGenLock Exclusive 27 FirstLockMgrLock Shared 32 CheckpointStartLock Exclusive 73 FreeSpaceLock Exclusive 110 OidGenLock Shared 118 XidGenLock Shared 140 ProcArrayLock Shared 442 Lock Id Combined Time (ns) WALWriteLock 13900 SInvalLock 15000 RelCacheInitLock 15500 BufMappingLock 18600 FirstLockMgrLock 391100 WALInsertLock 3953700 FreeSpaceLock 4801300 CheckpointStartLock 13131800 ProcArrayLock 14480500 OidGenLock 17736500 XidGenLock 21723100 CheckpointLock 206423500 bash-3.00# echo 850 users - SLIGHT DROP in throughput 850 
users bash-3.00# ./3_lwlock_acquires.d 19178 Lock Id Mode Count FirstLockMgrLock Exclusive 1 RelCacheInitLock Exclusive 1 SInvalLock Exclusive 1 WALWriteLock Exclusive 1 BufMappingLock Exclusive 3 WALInsertLock Exclusive 7 CheckpointLock Shared 39 CheckpointStartLock Shared 39 OidGenLock Exclusive 39 XidGenLock Exclusive 39 FirstLockMgrLock Shared 47 CheckpointStartLock Exclusive 113 FreeSpaceLock Exclusive 152 OidGenLock Shared 162 XidGenLock Shared 194 ProcArrayLock Shared 621 Lock Id Combined Time (ns) WALWriteLock 14200 RelCacheInitLock 15100 SInvalLock 15600 BufMappingLock 64100 WALInsertLock 2073200 FirstLockMgrLock 3040300 FreeSpaceLock 7329500 OidGenLock 21619100 CheckpointStartLock 23261300 ProcArrayLock 23917500 XidGenLock 24873100 CheckpointLock 309221200 bash-3.00# echo 900 users - ANOTHER SLIGHT DROP IN THROUGPUT 900 users bash-3.00# ./3_lwlock_acquires.d 19178 Lock Id Mode Count WALWriteLock Exclusive 1 WALInsertLock Exclusive 7 CheckpointStartLock Shared 13 OidGenLock Exclusive 13 CheckpointLock Shared 14 XidGenLock Exclusive 14 FirstLockMgrLock Shared 15 FreeSpaceLock Exclusive 51 OidGenLock Shared 51 XidGenLock Shared 62 CheckpointStartLock Exclusive 170 ProcArrayLock Shared 202 Lock Id Combined Time (ns) WALWriteLock 16800 FirstLockMgrLock 170300 FreeSpaceLock 601500 ProcArrayLock 3971300 WALInsertLock 7757200 OidGenLock 8261900 XidGenLock 18450900 CheckpointStartLock 39155100 CheckpointLock 143751500 bash-3.00# echo 950 users - BIG DROP IN THROUGHPUT 950 users bash-3.00# ./3_lwlock_acquires.d 19178 Lock Id Mode Count WALInsertLock Exclusive 3 FirstLockMgrLock Shared 4 CheckpointLock Shared 7 CheckpointStartLock Shared 7 OidGenLock Exclusive 7 XidGenLock Exclusive 7 FreeSpaceLock Exclusive 29 OidGenLock Shared 30 XidGenLock Shared 36 ProcArrayLock Shared 115 CheckpointStartLock Exclusive 134 Lock Id Combined Time (ns) FirstLockMgrLock 64400 FreeSpaceLock 342300 WALInsertLock 1759600 OidGenLock 4276900 ProcArrayLock 6234300 XidGenLock 6865000 CheckpointStartLock 37590800 CheckpointLock 58994300 bash-3.00# echo 1000 users - STEADY AT PREVIOUS LOW VALUE 1000 users bash-3.00# ./3_lwlock_acquires.d 19178 Lock Id Mode Count BufMappingLock Exclusive 1 RelCacheInitLock Exclusive 1 SInvalLock Exclusive 1 WALInsertLock Exclusive 3 CheckpointLock Shared 9 CheckpointStartLock Shared 9 OidGenLock Exclusive 9 XidGenLock Exclusive 9 FirstLockMgrLock Shared 14 FreeSpaceLock Exclusive 33 OidGenLock Shared 37 XidGenLock Shared 44 CheckpointStartLock Exclusive 122 ProcArrayLock Shared 145 Lock Id Combined Time (ns) RelCacheInitLock 14300 SInvalLock 15600 BufMappingLock 21400 FirstLockMgrLock 184000 FreeSpaceLock 366200 WALInsertLock 1769500 ProcArrayLock 5076500 XidGenLock 5898400 OidGenLock 9244800 CheckpointStartLock 31077500 CheckpointLock 91861900 bash-3.00# echo 1050 users - SMALL INCREASE 1050 users bash-3.00# ./3_lwlock_acquires.d 19178 Lock Id Mode Count BufMappingLock Exclusive 2 WALInsertLock Exclusive 9 CheckpointLock Shared 24 XidGenLock Exclusive 24 CheckpointStartLock Shared 25 OidGenLock Exclusive 25 FirstLockMgrLock Shared 30 FreeSpaceLock Exclusive 100 OidGenLock Shared 107 XidGenLock Shared 129 CheckpointStartLock Exclusive 153 ProcArrayLock Shared 400 Lock Id Combined Time (ns) BufMappingLock 36600 FirstLockMgrLock 420600 FreeSpaceLock 2998400 WALInsertLock 3818300 ProcArrayLock 8986900 OidGenLock 18127200 XidGenLock 18569200 CheckpointStartLock 44795700 CheckpointLock 206488400 bash-3.00# echo 1100 users - SMALL DROP AGAIN 1100 users bash-3.00# ./3_lwlock_acquires.d 
19178 Lock Id Mode Count BufMappingLock Exclusive 1 WALInsertLock Exclusive 6 CheckpointLock Shared 11 XidGenLock Exclusive 11 CheckpointStartLock Shared 12 OidGenLock Exclusive 12 FirstLockMgrLock Shared 24 FreeSpaceLock Exclusive 39 OidGenLock Shared 44 XidGenLock Shared 51 CheckpointStartLock Exclusive 88 ProcArrayLock Shared 171 Lock Id Combined Time (ns) BufMappingLock 19500 FirstLockMgrLock 302700 FreeSpaceLock 511200 ProcArrayLock 5042300 WALInsertLock 5592800 CheckpointStartLock 25009900 OidGenLock 25231600 XidGenLock 108045300 CheckpointLock 379734000 bash-3.00# echo 1150 users - STEADY AT LOW VALUE 1150 users bash-3.00# ./3_lwlock_acquires.d 19178 Lock Id Mode Count WALWriteLock Exclusive 1 WALInsertLock Exclusive 2 CheckpointLock Shared 5 CheckpointStartLock Shared 6 OidGenLock Exclusive 6 XidGenLock Exclusive 6 FirstLockMgrLock Shared 8 FreeSpaceLock Exclusive 21 OidGenLock Shared 26 XidGenLock Shared 31 ProcArrayLock Shared 93 CheckpointStartLock Exclusive 122 Lock Id Combined Time (ns) WALWriteLock 14900 WALInsertLock 116900 FirstLockMgrLock 120600 FreeSpaceLock 2177800 XidGenLock 4899200 ProcArrayLock 20721700 CheckpointStartLock 27805200 CheckpointLock 76369300 OidGenLock 470145800 bash-3.00# echo 1250 users - STEADY AT LOW VALUE 1250 users bash-3.00# ./3_lwlock_acquires.d 19178 Lock Id Mode Count CheckpointLock Shared 2 CheckpointStartLock Shared 2 OidGenLock Exclusive 2 WALInsertLock Exclusive 2 XidGenLock Exclusive 2 FreeSpaceLock Exclusive 9 OidGenLock Shared 10 XidGenLock Shared 12 ProcArrayLock Shared 36 CheckpointStartLock Exclusive 135 Lock Id Combined Time (ns) WALInsertLock 39500 FreeSpaceLock 98600 ProcArrayLock 318800 XidGenLock 1379900 OidGenLock 3437700 CheckpointLock 9565200 CheckpointStartLock 56547900 bash-3.00#
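The D scripts themselves never appear in this thread, so it helps to spell out roughly what they measure. Below is a minimal, hypothetical sketch of a per-backend LWLock acquisition profiler in the spirit of 3_lwlock_acquires.d. The probe names, their argument order (lock id, then mode), and the 10-second window are assumptions based on the --enable-dtrace probes of that era and on how the script is invoked above, not the actual script; lock ids are printed numerically here, whereas the real script translated them through a hand-copied table from lwlock.h (which is where trouble starts later). Note that "Combined Time" in this variant would be the time the lock is held, which matches Tom's reading further down.

#!/usr/sbin/dtrace -qs
/*
 * Hypothetical sketch in the spirit of 3_lwlock_acquires.d.
 * Assumes a PostgreSQL build with --enable-dtrace exposing the probes
 * lwlock-acquire(lockid, mode) and lwlock-release(lockid).
 * $1 is the backend PID passed on the command line.
 */

postgresql$1:::lwlock-acquire
{
        /* count acquisitions per (lock id, mode) and note when we got the lock */
        @acquires[arg0, arg1] = count();
        self->held[arg0] = timestamp;
}

postgresql$1:::lwlock-release
/self->held[arg0]/
{
        /* "Combined Time" here is total time the lock was held */
        @held_ns[arg0] = sum(timestamp - self->held[arg0]);
        self->held[arg0] = 0;
}

tick-10sec
{
        exit(0);
}

dtrace:::END
{
        printf("%25s %10s %8s\n", "Lock Id", "Mode", "Count");
        printa("%25d %10d %@8d\n", @acquires);
        printf("\n%25s %22s\n", "Lock Id", "Combined Time (ns)");
        printa("%25d %@22d\n", @held_ns);
}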
"Jignesh K. Shah" <J.K.Shah@Sun.COM> writes: > Here is how I got the numbers.. > I had about 1600 users login into postgresql. Then started the run with > 500 users and using DTrace I started tracking Postgresql Locking "as > viewed from one user/connection". Echo statements indicate how many > users were active at that point and how was throughput performing. All > IO is done on /tmp which means on a RAM disk. > bash-3.00# echo 500 users - baseline number > 500 users > bash-3.00# ./3_lwlock_acquires.d 19178 > Lock Id Mode Count > FirstLockMgrLock Exclusive 1 > RelCacheInitLock Exclusive 2 > SInvalLock Exclusive 2 > WALInsertLock Exclusive 10 > BufMappingLock Exclusive 12 > CheckpointLock Shared 29 > CheckpointStartLock Shared 29 > OidGenLock Exclusive 29 > XidGenLock Exclusive 29 > FirstLockMgrLock Shared 33 > CheckpointStartLock Exclusive 78 > FreeSpaceLock Exclusive 114 > OidGenLock Shared 126 > XidGenLock Shared 152 > ProcArrayLock Shared 482 I don't think I believe these numbers. For one thing, CheckpointLock is simply not ever taken in shared mode. The ratios of counts for different locks seems pretty improbable too, eg there is no way on earth that the LockMgrLocks are taken more often shared than exclusive (I would expect no shared acquires at all in the sort of test you are running). Not to mention that the absolute number of counts seems way too low. So I think the counting tool is broken. > Lock Id Combined Time (ns) > SInvalLock 29800 > RelCacheInitLock 30300 > BufMappingLock 168800 > FirstLockMgrLock 414300 > FreeSpaceLock 1281700 > ProcArrayLock 7869900 > WALInsertLock 11113200 > CheckpointStartLock 13494700 > OidGenLock 25719100 > XidGenLock 26443300 > CheckpointLock 194267800 Combined time of what exactly? It looks like this must be the total duration the lock is held, at least assuming that the time for CheckpointLock is correctly reported. It'd be much more useful to see the total time spent waiting to acquire the lock. regards, tom lane
The count is only for a 10-second snapshot.. Plus remember there are about 1000 users running so the connection being profiled only gets 0.01 of the period on CPU.. And the count is for that CONNECTION only. Anyway using the lock wait script it shows the real picture as you requested. Here the combined time means time "spent waiting" for the lock. bash-3.00# echo 500 users 500 users bash-3.00# ./4_lwlock_waits.d 20764 Lock Id Mode Count OidGenLock Exclusive 1 CheckpointStartLock Shared 3 OidGenLock Shared 4 WALInsertLock Exclusive 7 FreeSpaceLock Exclusive 8 XidGenLock Exclusive 15 CheckpointStartLock Exclusive 16 Lock Id Combined Time (ns) XidGenLock 3825000 OidGenLock 5307100 WALInsertLock 6317800 FreeSpaceLock 7244100 CheckpointStartLock 22199200 bash-3.00# echo 600 users 600 users bash-3.00# ./4_lwlock_waits.d 20764 Lock Id Mode Count OidGenLock Exclusive 1 WALInsertLock Exclusive 1 CheckpointStartLock Shared 4 CheckpointStartLock Exclusive 11 XidGenLock Exclusive 21 Lock Id Combined Time (ns) OidGenLock 1728200 WALInsertLock 2040700 XidGenLock 19878500 CheckpointStartLock 24156500 bash-3.00# echo 700 users 700 users bash-3.00# ./4_lwlock_waits.d 20764 Lock Id Mode Count OidGenLock Shared 1 XidGenLock Shared 1 CheckpointStartLock Shared 2 WALInsertLock Exclusive 4 CheckpointStartLock Exclusive 6 FreeSpaceLock Exclusive 6 XidGenLock Exclusive 13 Lock Id Combined Time (ns) OidGenLock 1730000 WALInsertLock 7253400 FreeSpaceLock 10977700 CheckpointStartLock 13356800 XidGenLock 38220500 bash-3.00# echo 800 users 800 users bash-3.00# ./4_lwlock_waits.d 20764 Lock Id Mode Count CheckpointStartLock Shared 1 WALInsertLock Exclusive 2 XidGenLock Shared 2 CheckpointStartLock Exclusive 5 FreeSpaceLock Exclusive 8 XidGenLock Exclusive 12 Lock Id Combined Time (ns) WALInsertLock 3746800 FreeSpaceLock 7628900 CheckpointStartLock 11297500 XidGenLock 16649000 bash-3.00# echo 900 users - BIG DROP IN THROUGHPUT OCCURS... 900 users bash-3.00# ./4_lwlock_waits.d 20764 Lock Id Mode Count OidGenLock Exclusive 1 OidGenLock Shared 1 XidGenLock Shared 1 FreeSpaceLock Exclusive 2 WALInsertLock Exclusive 2 CheckpointStartLock Shared 6 XidGenLock Exclusive 12 CheckpointStartLock Exclusive 121 Lock Id Combined Time (ns) OidGenLock 1968100 FreeSpaceLock 2076300 WALInsertLock 2190400 XidGenLock 20259400 CheckpointStartLock 1407294300 bash-3.00# echo 950 users 950 users bash-3.00# ./4_lwlock_waits.d 20764 Lock Id Mode Count OidGenLock Exclusive 1 OidGenLock Shared 2 CheckpointStartLock Shared 3 WALInsertLock Exclusive 4 FreeSpaceLock Exclusive 5 XidGenLock Exclusive 11 CheckpointStartLock Exclusive 50 Lock Id Combined Time (ns) WALInsertLock 5577100 FreeSpaceLock 9115900 XidGenLock 13765800 OidGenLock 50155500 CheckpointStartLock 759872200 bash-3.00# echo 1000 users 1000 users bash-3.00# ./4_lwlock_waits.d 20764 Lock Id Mode Count OidGenLock Shared 1 WALInsertLock Exclusive 1 XidGenLock Exclusive 5 CheckpointStartLock Shared 6 CheckpointStartLock Exclusive 102 Lock Id Combined Time (ns) OidGenLock 21900 WALInsertLock 82500 XidGenLock 3313400 CheckpointStartLock 1448289900 bash-3.00# echo 1050 users 1050 users bash-3.00# ./4_lwlock_waits.d 20764 Lock Id Mode Count FreeSpaceLock Exclusive 1 CheckpointStartLock Shared 3 XidGenLock Exclusive 3 CheckpointStartLock Exclusive 146 Lock Id Combined Time (ns) FreeSpaceLock 18400 XidGenLock 1900900 CheckpointStartLock 2392893700 bash-3.00# -Jignesh Tom Lane wrote: > "Jignesh K. Shah" <J.K.Shah@Sun.COM> writes: > >> Here is how I got the numbers.. 
>> I had about 1600 users login into postgresql. Then started the run with >> 500 users and using DTrace I started tracking Postgresql Locking "as >> viewed from one user/connection". Echo statements indicate how many >> users were active at that point and how was throughput performing. All >> IO is done on /tmp which means on a RAM disk. >> > > >> bash-3.00# echo 500 users - baseline number >> 500 users >> bash-3.00# ./3_lwlock_acquires.d 19178 >> > I don't think I believe these numbers. For one thing, CheckpointLock > is simply not ever taken in shared mode. The ratios of counts for > different locks seems pretty improbable too, eg there is no way on > earth that the LockMgrLocks are taken more often shared than > exclusive (I would expect no shared acquires at all in the sort of > test you are running). Not to mention that the absolute number of > counts seems way too low. So I think the counting tool is broken. > > > Combined time of what exactly? It looks like this must be the total > duration the lock is held, at least assuming that the time for > CheckpointLock is correctly reported. It'd be much more useful to see > the total time spent waiting to acquire the lock. > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 3: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faq >
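For reference, the wait-time variant presumably differs only in which probes it instruments. A hedged sketch of the core of a 4_lwlock_waits.d-style script follows, again assuming lwlock-startwait/lwlock-endwait probes carrying lock id and mode as arguments; the surrounding tick/END reporting clauses would be the same as in the sketch above.

postgresql$1:::lwlock-startwait
{
        /* assumption: this probe fires only when the lock could not be
           granted immediately, i.e. the backend is about to block */
        self->wait[arg0] = timestamp;
}

postgresql$1:::lwlock-endwait
/self->wait[arg0]/
{
        @waits[arg0, arg1] = count();
        /* "Combined Time" now means time spent blocked waiting for the lock */
        @wait_ns[arg0]     = sum(timestamp - self->wait[arg0]);
        self->wait[arg0] = 0;
}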
"Jignesh K. Shah" <J.K.Shah@Sun.COM> writes: > The count is only for a 10-second snapshot.. Plus remember there are > about 1000 users running so the connection being profiled only gets > 0.01 of the period on CPU.. And the count is for that CONNECTION only. OK, that explains the low absolute levels of the counts, but if the counts are for a regular backend then there's still a lot of bogosity here. Backends won't be taking the CheckpointLock at all, nor do they take CheckpointStartLock in exclusive mode. The bgwriter would do that but it'd not be taking most of these other locks. So I think the script is mislabeling the locks somehow. Also, elementary statistics should tell you that a sample taken as above is going to have enormous amounts of noise. You should be sampling over a much longer period, say on the order of a minute of runtime, to have numbers that are trustworthy. regards, tom lane
On Thu, 2007-07-26 at 10:29 -0400, Jignesh K. Shah wrote: > The count is only for a 10-second snapshot.. Plus remember there are > about 1000 users running so the connection being profiled only gets > 0.01 of the period on CPU.. And the count is for that CONNECTION only. Is that for one process, or all processes aggregated in some way? > CheckpointStartLock Shared 6 > CheckpointStartLock Exclusive 102 That's definitely whacked. Surely we didn't start 102 checkpoints yet attempt to commit 6 times? -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
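The one-process-versus-everything question comes down to which provider instances the probe description matches: the scripts in this thread pin the provider to a single backend PID (Jignesh confirms the counts are for one connection), whereas globbing the provider name would aggregate over every backend on the box. A small, hypothetical counterpoint, with the same assumed probes as above and a one-minute window in line with Tom's sampling advice:

/*
 * Hypothetical all-backends variant: the glob matches every postgresql
 * provider instance and DTrace merges the aggregations across processes.
 * Keying by pid keeps the per-backend breakdown visible.
 */
postgresql*:::lwlock-startwait
{
        self->ts[arg0] = timestamp;
}

postgresql*:::lwlock-endwait
/self->ts[arg0]/
{
        @wait_ns[pid, arg0] = sum(timestamp - self->ts[arg0]);
        self->ts[arg0] = 0;
}

tick-60sec
{
        printa("%8d %8d %@16d\n", @wait_ns);
        exit(0);
}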
I will look for runs with longer samples.. Now the script could have mislabeled lock names.. Anyway digging into the one that seems to increase over time... I did stack profiles on how that increases... and here are some numbers.. For 600-850 Users: that potential mislabeled CheckPointStartLock or LockID==12 comes from various sources where the top source (while system is still doing great) comes from: postgres`LWLockAcquire+0x1c8 postgres`SimpleLruReadPage_ReadOnly+0xc postgres`TransactionIdGetStatus+0x14 postgres`TransactionLogFetch+0x58 postgres`TransactionIdDidCommit+0x4 postgres`HeapTupleSatisfiesSnapshot+0x234 postgres`heap_release_fetch+0x1a8 postgres`index_getnext+0xf4 postgres`IndexNext+0x7c postgres`ExecScan+0x8c postgres`ExecProcNode+0xb4 postgres`ExecutePlan+0xdc postgres`ExecutorRun+0xb0 postgres`PortalRunSelect+0x9c postgres`PortalRun+0x244 postgres`exec_execute_message+0x3a0 postgres`PostgresMain+0x1300 postgres`BackendRun+0x278 postgres`ServerLoop+0x63c postgres`PostmasterMain+0xc40 8202100 postgres`LWLockAcquire+0x1c8 postgres`TransactionIdSetStatus+0x1c postgres`RecordTransactionCommit+0x2a8 postgres`CommitTransaction+0xc8 postgres`CommitTransactionCommand+0x90 postgres`finish_xact_command+0x60 postgres`exec_execute_message+0x3d8 postgres`PostgresMain+0x1300 postgres`BackendRun+0x278 postgres`ServerLoop+0x63c postgres`PostmasterMain+0xc40 postgres`main+0x394 postgres`_start+0x108 30822100 However at 900 Users where the big drop in throughput occurs: It gives a different top "consumer" of time: postgres`LWLockAcquire+0x1c8 postgres`TransactionIdSetStatus+0x1c postgres`RecordTransactionCommit+0x2a8 postgres`CommitTransaction+0xc8 postgres`CommitTransactionCommand+0x90 postgres`finish_xact_command+0x60 postgres`exec_execute_message+0x3d8 postgres`PostgresMain+0x1300 postgres`BackendRun+0x278 postgres`ServerLoop+0x63c postgres`PostmasterMain+0xc40 postgres`main+0x394 postgres`_start+0x108 406601300 postgres`LWLockAcquire+0x1c8 postgres`SimpleLruReadPage+0x1ac postgres`TransactionIdGetStatus+0x14 postgres`TransactionLogFetch+0x58 postgres`TransactionIdDidCommit+0x4 postgres`HeapTupleSatisfiesUpdate+0x360 postgres`heap_lock_tuple+0x27c postgres`ExecutePlan+0x33c postgres`ExecutorRun+0xb0 postgres`PortalRunSelect+0x9c postgres`PortalRun+0x244 postgres`exec_execute_message+0x3a0 postgres`PostgresMain+0x1300 postgres`BackendRun+0x278 postgres`ServerLoop+0x63c postgres`PostmasterMain+0xc40 postgres`main+0x394 postgres`_start+0x108 444523100 postgres`LWLockAcquire+0x1c8 postgres`SimpleLruReadPage+0x1ac postgres`TransactionIdGetStatus+0x14 postgres`TransactionLogFetch+0x58 postgres`TransactionIdDidCommit+0x4 postgres`HeapTupleSatisfiesSnapshot+0x234 postgres`heap_release_fetch+0x1a8 postgres`index_getnext+0xf4 postgres`IndexNext+0x7c postgres`ExecScan+0x8c postgres`ExecProcNode+0xb4 postgres`ExecutePlan+0xdc postgres`ExecutorRun+0xb0 postgres`PortalRunSelect+0x9c postgres`PortalRun+0x244 postgres`exec_execute_message+0x3a0 postgres`PostgresMain+0x1300 postgres`BackendRun+0x278 postgres`ServerLoop+0x63c postgres`PostmasterMain+0xc40 1661300000 Maybe you all will understand more than I do on what it does here.. Looks like IndexNext has a problem at high number of users to me.. but I could be wrong.. -Jignesh Tom Lane wrote: > "Jignesh K. Shah" <J.K.Shah@Sun.COM> writes: > >> The count is only for a 10-second snapshot.. Plus remember there are >> about 1000 users running so the connection being profiled only gets >> 0.01 of the period on CPU.. And the count is for that CONNECTION only. 
>> > > OK, that explains the low absolute levels of the counts, but if the > counts are for a regular backend then there's still a lot of bogosity > here. Backends won't be taking the CheckpointLock at all, nor do they > take CheckpointStartLock in exclusive mode. The bgwriter would do that > but it'd not be taking most of these other locks. So I think the script > is mislabeling the locks somehow. > > Also, elementary statistics should tell you that a sample taken as above > is going to have enormous amounts of noise. You should be sampling over > a much longer period, say on the order of a minute of runtime, to have > numbers that are trustworthy. > > regards, tom lane >
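The call-path figures above come from charging lock time to the blocked user-level stack. Whether those particular numbers are hold time or wait time, the attribution technique is the same; here is a hypothetical sketch in the spirit of the 6_lwlock_stack.d script used later in the thread, which appears to take a numeric lock id and a backend PID ("./6_lwlock_stack.d 4 7056").

#!/usr/sbin/dtrace -qs
/*
 * Hypothetical stack attribution for a single LWLock: $1 is the lock id
 * to watch, $2 the backend PID.  Wait time is summed per user stack,
 * which is how call paths like CommitTransaction and SimpleLruReadPage
 * get ranked against each other.
 */

postgresql$2:::lwlock-startwait
/arg0 == $1/
{
        self->ts = timestamp;
}

postgresql$2:::lwlock-endwait
/arg0 == $1 && self->ts/
{
        @by_stack[ustack()] = sum(timestamp - self->ts);
        self->ts = 0;
}

dtrace:::END
{
        /* stacks print with their summed wait time, largest last */
        printa(@by_stack);
}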
"Jignesh K. Shah" <J.K.Shah@Sun.COM> writes: > For 600-850 Users: that potential mislabeled CheckPointStartLock or > LockID==12 comes from various sources where the top source (while system > is still doing great) comes from: > postgres`LWLockAcquire+0x1c8 > postgres`SimpleLruReadPage_ReadOnly+0xc > postgres`TransactionIdGetStatus+0x14 That path would be taking CLogControlLock ... so you're off by at least one. Compare the script to src/include/storage/lwlock.h. regards, tom lane
On Thu, 2007-07-26 at 11:27 -0400, Jignesh K. Shah wrote: > However at 900 Users where the big drop in throughput occurs: > It gives a different top "consumer" of time: postgres`LWLockAcquire+0x1c8 > postgres`SimpleLruReadPage+0x1ac > postgres`TransactionIdGetStatus+0x14 > postgres`TransactionLogFetch+0x58 TransactionIdGetStatus doesn't directly call SimpleLruReadPage(). Presumably the compiler has been rearranging things?? Looks like you're out of clog buffers. It seems like the clog buffers aren't big enough to hold clog pages for long enough and the SELECT FOR SHARE processing is leaving lots of additional read locks that are increasing the number of clog requests for older xids. Try the enclosed patch. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com

BEAUTIFUL!!! Using the Patch I can now go upto 1300 users without dropping.. But now it still repeats at 1300-1350 users.. I corrected the Lock Descriptions based on what I got from lwlock.h and retried the whole scalability again with profiling again.. This time it looks like the ProcArrayLock bash-3.00# echo 600 users 600 users bash-3.00# ./4_lwlock_waits.d 7056 Lock Id Mode Count XidGenLock Exclusive 1 CLogControlLock Shared 2 XidGenLock Shared 2 WALWriteLock Exclusive 4 WALInsertLock Exclusive 8 CLogControlLock Exclusive 9 ProcArrayLock Exclusive 9 Lock Id Combined Time (ns) WALWriteLock 2842300 XidGenLock 4951000 CLogControlLock 11151800 WALInsertLock 13035600 ProcArrayLock 20040000 bash-3.00# echo 700 users 700 users bash-3.00# ./4_lwlock_waits.d 7056 Lock Id Mode Count XidGenLock Exclusive 1 WALWriteLock Exclusive 2 XidGenLock Shared 2 CLogControlLock Shared 3 CLogControlLock Exclusive 8 WALInsertLock Exclusive 9 ProcArrayLock Exclusive 22 Lock Id Combined Time (ns) XidGenLock 4093300 WALWriteLock 4914800 WALInsertLock 7389100 ProcArrayLock 10248200 CLogControlLock 11989400 bash-3.00# echo 800 users 800 users bash-3.00# ./4_lwlock_waits.d 7056 Lock Id Mode Count WALWriteLock Exclusive 1 XidGenLock Shared 2 CLogControlLock Shared 3 CLogControlLock Exclusive 7 WALInsertLock Exclusive 7 ProcArrayLock Exclusive 31 Lock Id Combined Time (ns) WALWriteLock 319100 XidGenLock 5388700 WALInsertLock 9901400 CLogControlLock 13465000 ProcArrayLock 42979700 bash-3.00# echo 900 users 900 users bash-3.00# ./4_lwlock_waits.d 7056 Lock Id Mode Count CLogControlLock Shared 1 XidGenLock Exclusive 1 WALWriteLock Exclusive 2 CLogControlLock Exclusive 6 WALInsertLock Exclusive 9 ProcArrayLock Exclusive 25 Lock Id Combined Time (ns) XidGenLock 3197700 WALWriteLock 3887100 CLogControlLock 15774500 WALInsertLock 38268700 ProcArrayLock 162073100 bash-3.00# ./6_lwlock_stack.d 4 7056 Lock Id Mode Count ProcArrayLock Shared 1 ProcArrayLock Exclusive 67 Lock Id Combined Time (ns) ProcArrayLock 216773800 Lock Id Combined Time (ns) postgres`LWLockAcquire+0x1c8 postgres`GetSnapshotData+0x118 postgres`GetTransactionSnapshot+0x5c postgres`PortalStart+0x150 postgres`exec_bind_message+0x81c postgres`PostgresMain+0x12b8 postgres`BackendRun+0x278 postgres`ServerLoop+0x63c postgres`PostmasterMain+0xc40 postgres`main+0x394 postgres`_start+0x108 2779000 postgres`LWLockAcquire+0x1c8 postgres`CommitTransaction+0xe0 postgres`CommitTransactionCommand+0x90 postgres`finish_xact_command+0x60 postgres`exec_execute_message+0x3d8 postgres`PostgresMain+0x1300 postgres`BackendRun+0x278 postgres`ServerLoop+0x63c postgres`PostmasterMain+0xc40 postgres`main+0x394 postgres`_start+0x108 213994800 bash-3.00# echo 1000 users - SLIGHT DROP 1000 users bash-3.00# ./4_lwlock_waits.d 7056 Lock Id Mode Count WALWriteLock Exclusive 1 CLogControlLock Exclusive 2 XidGenLock Shared 2 CLogControlLock Shared 3 WALInsertLock Exclusive 4 ProcArrayLock Exclusive 26 Lock Id Combined Time (ns) WALWriteLock 1807400 XidGenLock 2024000 WALInsertLock 2177500 CLogControlLock 9064700 ProcArrayLock 199216000 bash-3.00# ./6_lwlock_stack.d 4 7056 Lock Id Mode Count ProcArrayLock Shared 3 ProcArrayLock Exclusive 38 Lock Id Combined Time (ns) ProcArrayLock 858238600 Lock Id Combined Time (ns) postgres`LWLockAcquire+0x1c8 postgres`TransactionIdIsInProgress+0x50 postgres`HeapTupleSatisfiesVacuum+0x2ec postgres`_bt_check_unique+0x2a0 postgres`_bt_doinsert+0x98 postgres`btinsert+0x54 postgres`FunctionCall6+0x44 postgres`index_insert+0x90 postgres`ExecInsertIndexTuples+0x1bc 
postgres`ExecUpdate+0x500 postgres`ExecutePlan+0x704 postgres`ExecutorRun+0x60 postgres`PortalRunMulti+0x2a0 postgres`PortalRun+0x310 postgres`exec_execute_message+0x3a0 postgres`PostgresMain+0x1300 postgres`BackendRun+0x278 postgres`ServerLoop+0x63c postgres`PostmasterMain+0xc40 postgres`main+0x394 167100 postgres`LWLockAcquire+0x1c8 postgres`GetSnapshotData+0x118 postgres`GetTransactionSnapshot+0x5c postgres`PortalRunMulti+0x22c postgres`PortalRun+0x310 postgres`exec_execute_message+0x3a0 postgres`PostgresMain+0x1300 postgres`BackendRun+0x278 postgres`ServerLoop+0x63c postgres`PostmasterMain+0xc40 postgres`main+0x394 postgres`_start+0x108 7125900 postgres`LWLockAcquire+0x1c8 postgres`CommitTransaction+0xe0 postgres`CommitTransactionCommand+0x90 postgres`finish_xact_command+0x60 postgres`exec_execute_message+0x3d8 postgres`PostgresMain+0x1300 postgres`BackendRun+0x278 postgres`ServerLoop+0x63c postgres`PostmasterMain+0xc40 postgres`main+0x394 postgres`_start+0x108 850945600 bash-3.00# echo 1100 users - DROP .... 1100 users bash-3.00# ./4_lwlock_waits.d 7056 Lock Id Mode Count CLogControlLock Shared 1 WALWriteLock Exclusive 1 XidGenLock Exclusive 1 ProcArrayLock Shared 2 WALInsertLock Exclusive 2 XidGenLock Shared 2 CLogControlLock Exclusive 4 ProcArrayLock Exclusive 20 Lock Id Combined Time (ns) WALWriteLock 4179500 XidGenLock 6249400 CLogControlLock 20411100 WALInsertLock 29707600 ProcArrayLock 207923700 bash-3.00# ./6_lwlock_stack.d 4 7056 Lock Id Mode Count ProcArrayLock Exclusive 40 Lock Id Combined Time (ns) ProcArrayLock 692242100 Lock Id Combined Time (ns) postgres`LWLockAcquire+0x1c8 postgres`CommitTransaction+0xe0 postgres`CommitTransactionCommand+0x90 postgres`finish_xact_command+0x60 postgres`exec_execute_message+0x3d8 postgres`PostgresMain+0x1300 postgres`BackendRun+0x278 postgres`ServerLoop+0x63c postgres`PostmasterMain+0xc40 postgres`main+0x394 postgres`_start+0x108 692242100 bash-3.00# LockID for ProcArrayLock is 4 or the 5 entry in lwlock.h which seems to indicate it is lwlock.h Any tweaks for that? -Jignesh Simon Riggs wrote: > On Thu, 2007-07-26 at 11:27 -0400, Jignesh K. Shah wrote: > > >> However at 900 Users where the big drop in throughput occurs: >> It gives a different top "consumer" of time: >> > > > postgres`LWLockAcquire+0x1c8 > >> postgres`SimpleLruReadPage+0x1ac >> postgres`TransactionIdGetStatus+0x14 >> postgres`TransactionLogFetch+0x58 >> > > TransactionIdGetStatus doesn't directly call SimpleLruReadPage(). > Presumably the compiler has been rearranging things?? > > Looks like you're out of clog buffers. It seems like the clog buffers > aren't big enough to hold clog pages for long enough and the SELECT FOR > SHARE processing is leaving lots of additional read locks that are > increasing the number of clog requests for older xids. > > Try the enclosed patch. > > > ------------------------------------------------------------------------ > > Index: src/include/access/clog.h > =================================================================== > RCS file: /projects/cvsroot/pgsql/src/include/access/clog.h,v > retrieving revision 1.19 > diff -c -r1.19 clog.h > *** src/include/access/clog.h 5 Jan 2007 22:19:50 -0000 1.19 > --- src/include/access/clog.h 26 Jul 2007 15:44:58 -0000 > *************** > *** 29,35 **** > > > /* Number of SLRU buffers to use for clog */ > ! #define NUM_CLOG_BUFFERS 8 > > > extern void TransactionIdSetStatus(TransactionId xid, XidStatus status); > --- 29,35 ---- > > > /* Number of SLRU buffers to use for clog */ > ! 
#define NUM_CLOG_BUFFERS 64 > > > extern void TransactionIdSetStatus(TransactionId xid, XidStatus status); > > ------------------------------------------------------------------------ > > > ---------------------------(end of broadcast)--------------------------- > TIP 5: don't forget to increase your free space map settings >
Tom Lane wrote: > > That path would be taking CLogControlLock ... so you're off by at least > one. Compare the script to src/include/storage/lwlock.h. > Indeed, the indexing was off by one: BufMappingLock was removed from src/include/storage/lwlock.h between 8.1 and 8.2, and the DTrace script's lock-name table had not been updated to match. Thanks, Robert
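For anyone reproducing this: the scripts can only label locks by translating the numeric lock id through a table copied by hand from lwlock.h, so the table has to be regenerated for the exact server version being traced. A hedged fragment showing the shape of that table, limited to the two ids this thread actually pins down (ProcArrayLock as id 4, per the 6_lwlock_stack.d calls above, and id 12, which the stale table printed as CheckpointStartLock but which the current lwlock.h calls CLogControlLock):

/*
 * Hypothetical fragment of the lock-name table such a script carries.
 * Only the entries confirmed in this thread are shown; the rest must be
 * filled in verbatim from the server's own src/include/storage/lwlock.h,
 * since any drift (like the BufMappingLock removal between 8.1 and 8.2)
 * shifts every later name by one.
 */
dtrace:::BEGIN
{
        lockname[4]  = "ProcArrayLock";    /* confirmed as id 4 above */
        lockname[12] = "CLogControlLock";  /* printed as CheckpointStartLock
                                              by the stale table */
        /* ... remaining ids copied from lwlock.h ... */
}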
On Thu, 2007-07-26 at 15:44 -0400, Jignesh K. Shah wrote: > BEAUTIFUL!!! > > Using the Patch I can now go upto 1300 users without dropping.. But now > it still repeats at 1300-1350 users.. OK, can you try again with 16 and 32 buffers please? We need to know how many is enough and whether this number needs to be variable via a parameter, or just slightly larger by default. Thanks. > I corrected the Lock Descriptions based on what I got from lwlock.h and > retried the whole scalability again with profiling again.. This time it > looks like the ProcArrayLock That's what I would expect with that many users. > Lock Id Mode Count > XidGenLock Exclusive 1 > CLogControlLock Shared 2 > XidGenLock Shared 2 > WALWriteLock Exclusive 4 > WALInsertLock Exclusive 8 > CLogControlLock Exclusive 9 > ProcArrayLock Exclusive 9 ...but as Tom mentioned, we need to do longer runs now so these counts get to somewhere in the hundreds so we have some statistical validity. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Will try 16 and 32 CLOGBUFFER tomorrow: But here is locks data again with about increased time profiling (about 2 minutes) for the connection with about 2000 users: bash-3.00# time ./4_lwlock_waits.d 13583 ^C Lock Id Mode Count ProcArrayLock Shared 5 XidGenLock Exclusive 13 CLogControlLock Shared 14 XidGenLock Shared 15 CLogControlLock Exclusive 21 WALInsertLock Exclusive 77 WALWriteLock Exclusive 175 ProcArrayLock Exclusive 275 Lock Id Combined Time (ns) XidGenLock 194966200 WALInsertLock 517955000 CLogControlLock 679665100 WALWriteLock 2838716200 ProcArrayLock 44181002600 Top Wait time seems to come from the following code path for ProcArrayLock: Lock Id Mode Count ProcArrayLock Exclusive 21 Lock Id Combined Time (ns) ProcArrayLock 5255937500 Lock Id Combined Time (ns) postgres`LWLockAcquire+0x1f0 postgres`CommitTransaction+0x104 postgres`CommitTransactionCommand+0xbc postgres`finish_xact_command+0x78 postgres`exec_execute_message+0x42c postgres`PostgresMain+0x1838 postgres`BackendRun+0x2f8 postgres`ServerLoop+0x680 postgres`PostmasterMain+0xda8 postgres`main+0x3d0 postgres`_start+0x17c 5255937500 Regards, Jignesh Simon Riggs wrote: > On Thu, 2007-07-26 at 15:44 -0400, Jignesh K. Shah wrote: > >> BEAUTIFUL!!! >> >> Using the Patch I can now go upto 1300 users without dropping.. But now >> it still repeats at 1300-1350 users.. >> > > OK, can you try again with 16 and 32 buffers please? We need to know > how many is enough and whether this number needs to be variable via a > parameter, or just slightly larger by default. Thanks. > > >> I corrected the Lock Descriptions based on what I got from lwlock.h and >> retried the whole scalability again with profiling again.. This time it >> looks like the ProcArrayLock >> > > That's what I would expect with that many users. > > >> Lock Id Mode Count >> XidGenLock Exclusive 1 >> CLogControlLock Shared 2 >> XidGenLock Shared 2 >> WALWriteLock Exclusive 4 >> WALInsertLock Exclusive 8 >> CLogControlLock Exclusive 9 >> ProcArrayLock Exclusive 9 >> > > ...but as Tom mentioned, we need to do longer runs now so these counts > get to somewhere in the hundreds so we have some statistical validity. > >
Jignesh K. Shah wrote: > Top Wait time seems to come from the following code path for > ProcArrayLock: > > Lock Id Mode Count > ProcArrayLock Exclusive 21 > > Lock Id Combined Time (ns) > ProcArrayLock 5255937500 > > Lock Id Combined Time (ns) > > > postgres`LWLockAcquire+0x1f0 > postgres`CommitTransaction+0x104 Yeah, ProcArrayLock is pretty contended. I think it would be kinda neat if we could split it up in partitions. This lock is quite particular though. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Fri, 2007-07-27 at 04:58 -0400, Alvaro Herrera wrote: > Jignesh K. Shah wrote: > > > Top Wait time seems to come from the following code path for > > ProcArrayLock: > > > > Lock Id Mode Count > > ProcArrayLock Exclusive 21 > > > > Lock Id Combined Time (ns) > > ProcArrayLock 5255937500 > > > > Lock Id Combined Time (ns) > > > > > > postgres`LWLockAcquire+0x1f0 > > postgres`CommitTransaction+0x104 > > Yeah, ProcArrayLock is pretty contended. I think it would be kinda neat > if we could split it up in partitions. This lock is quite particular > though. Maybe, if we did we should set the partitions according to numbers of users, so lower numbers of users are all in one partition. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
On Thu, 2007-07-26 at 17:17 -0400, Jignesh K. Shah wrote: > Lock Id Combined Time (ns) > XidGenLock 194966200 > WALInsertLock 517955000 > CLogControlLock 679665100 > WALWriteLock 2838716200 > ProcArrayLock 44181002600 Is this the time the lock is held for or the time that we wait for that lock? It would be good to see the break down of time separately for shared and exclusive. Can we have a table like this: LockId,LockMode,SumTimeLockHeld,SumTimeLockWait > Top Wait time seems to come from the following code path for > ProcArrayLock: > > Lock Id Mode Count > ProcArrayLock Exclusive 21 > > Lock Id Combined Time (ns) > ProcArrayLock 5255937500 > > Lock Id Combined Time (ns) > > > postgres`LWLockAcquire+0x1f0 > postgres`CommitTransaction+0x104 > postgres`CommitTransactionCommand+0xbc > postgres`finish_xact_command+0x78 Well thats pretty weird. That code path clearly only happens once per transaction and ought to be fast. The other code paths that take ProcArrayLock like TransactionIdIsInProgress() and GetSnapshotData() ought to spend more time holding the lock. Presumably you are running with a fair number of SERIALIZABLE transactions? Are you running with commit_delay > 0? Its possible that the call to CountActiveBackends() is causing pinging of the procarray by other backends while we're trying to read it during CommitTransaction(). If so, try the attached patch. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Yes I can try to breakup the Shared and exclusive time.. Also yes I use commit delays =10, it helps out a lot in reducing IO load.. I will try out the patch soon. -Jignesh Simon Riggs wrote: > On Thu, 2007-07-26 at 17:17 -0400, Jignesh K. Shah wrote: > >> Lock Id Combined Time (ns) >> XidGenLock 194966200 >> WALInsertLock 517955000 >> CLogControlLock 679665100 >> WALWriteLock 2838716200 >> ProcArrayLock 44181002600 >> > > Is this the time the lock is held for or the time that we wait for that > lock? It would be good to see the break down of time separately for > shared and exclusive. > > Can we have a table like this: > LockId,LockMode,SumTimeLockHeld,SumTimeLockWait > > >> Top Wait time seems to come from the following code path for >> ProcArrayLock: >> >> Lock Id Mode Count >> ProcArrayLock Exclusive 21 >> >> Lock Id Combined Time (ns) >> ProcArrayLock 5255937500 >> >> Lock Id Combined Time (ns) >> >> >> postgres`LWLockAcquire+0x1f0 >> postgres`CommitTransaction+0x104 >> postgres`CommitTransactionCommand+0xbc >> postgres`finish_xact_command+0x78 >> > > Well thats pretty weird. That code path clearly only happens once per > transaction and ought to be fast. The other code paths that take > ProcArrayLock like TransactionIdIsInProgress() and GetSnapshotData() > ought to spend more time holding the lock. Presumably you are running > with a fair number of SERIALIZABLE transactions? > > Are you running with commit_delay > 0? Its possible that the call to > CountActiveBackends() is causing pinging of the procarray by other > backends while we're trying to read it during CommitTransaction(). If > so, try the attached patch. > >
I tried CLOG Buffers 32 and the performance is as good as 64 bit.. (I havent tried 16 yet though.. ) I am going to try your second patch now.. Also here is the breakup by Mode. The combined time is the total time it waits for all counts. Lock Id Mode Count ProcArrayLock Shared 1 CLogControlLock Exclusive 4 CLogControlLock Shared 4 XidGenLock Shared 4 XidGenLock Exclusive 7 WALInsertLock Exclusive 21 WALWriteLock Exclusive 62 ProcArrayLock Exclusive 79 Lock Id Mode Combined Time (ns) CLogControlLock Exclusive 325200 CLogControlLock Shared 4509200 XidGenLock Exclusive 11839600 ProcArrayLock Shared 40506600 XidGenLock Shared 119013700 WALInsertLock Exclusive 148063100 WALWriteLock Exclusive 347052100 ProcArrayLock Exclusive 1054780600 Here is another one at higher user count 1600: bash-3.00# ./4_lwlock_waits.d 9208 Lock Id Mode Count CLogControlLock Exclusive 1 CLogControlLock Shared 2 XidGenLock Shared 7 WALInsertLock Exclusive 12 WALWriteLock Exclusive 50 ProcArrayLock Exclusive 82 Lock Id Mode Combined Time (ns) CLogControlLock Exclusive 27300 XidGenLock Shared 14689300 CLogControlLock Shared 72664900 WALInsertLock Exclusive 101431300 WALWriteLock Exclusive 534357400 ProcArrayLock Exclusive 4110350300 Now I will try with your second patch. Regards, Jignesh Simon Riggs wrote: > On Thu, 2007-07-26 at 17:17 -0400, Jignesh K. Shah wrote: > >> Lock Id Combined Time (ns) >> XidGenLock 194966200 >> WALInsertLock 517955000 >> CLogControlLock 679665100 >> WALWriteLock 2838716200 >> ProcArrayLock 44181002600 >> > > Is this the time the lock is held for or the time that we wait for that > lock? It would be good to see the break down of time separately for > shared and exclusive. > > Can we have a table like this: > LockId,LockMode,SumTimeLockHeld,SumTimeLockWait > > >> Top Wait time seems to come from the following code path for >> ProcArrayLock: >> >> Lock Id Mode Count >> ProcArrayLock Exclusive 21 >> >> Lock Id Combined Time (ns) >> ProcArrayLock 5255937500 >> >> Lock Id Combined Time (ns) >> >> >> postgres`LWLockAcquire+0x1f0 >> postgres`CommitTransaction+0x104 >> postgres`CommitTransactionCommand+0xbc >> postgres`finish_xact_command+0x78 >> > > Well thats pretty weird. That code path clearly only happens once per > transaction and ought to be fast. The other code paths that take > ProcArrayLock like TransactionIdIsInProgress() and GetSnapshotData() > ought to spend more time holding the lock. Presumably you are running > with a fair number of SERIALIZABLE transactions? > > Are you running with commit_delay > 0? Its possible that the call to > CountActiveBackends() is causing pinging of the procarray by other > backends while we're trying to read it during CommitTransaction(). If > so, try the attached patch. > > > ------------------------------------------------------------------------ > > Index: src/backend/access/transam/xact.c > =================================================================== > RCS file: /projects/cvsroot/pgsql/src/backend/access/transam/xact.c,v > retrieving revision 1.245 > diff -c -r1.245 xact.c > *** src/backend/access/transam/xact.c 7 Jun 2007 21:45:58 -0000 1.245 > --- src/backend/access/transam/xact.c 27 Jul 2007 09:09:08 -0000 > *************** > *** 820,827 **** > * are fewer than CommitSiblings other backends with active > * transactions. > */ > ! if (CommitDelay > 0 && enableFsync && > ! 
CountActiveBackends() >= CommitSiblings) > pg_usleep(CommitDelay); > > XLogFlush(recptr); > --- 820,826 ---- > * are fewer than CommitSiblings other backends with active > * transactions. > */ > ! if (CommitDelay > 0 && enableFsync) > pg_usleep(CommitDelay); > > XLogFlush(recptr); > > ------------------------------------------------------------------------ > > > ---------------------------(end of broadcast)--------------------------- > TIP 2: Don't 'kill -9' the postmaster >
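The per-mode breakdown above presumably comes from keying the time aggregation by lock mode as well as lock id. A minimal sketch of that change, assuming (as in the earlier sketches) that arg1 of the lwlock wait probes carries the mode and that LW_EXCLUSIVE and LW_SHARED map to 0 and 1 in that era's lwlock.h:

postgresql$1:::lwlock-endwait
/self->wait[arg0]/
{
        /* assumed mapping: 0 = LW_EXCLUSIVE, 1 = LW_SHARED (verify against lwlock.h) */
        this->mode = (arg1 == 0) ? "Exclusive" : "Shared";
        @waits[arg0, this->mode]   = count();
        @wait_ns[arg0, this->mode] = sum(timestamp - self->wait[arg0]);
        self->wait[arg0] = 0;
}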
Using CLOG Buffers 32 and the commit sibling check patch I still see a drop at 1200-1300 users.. bash-3.00# ./4_lwlock_waits.d 18250 Lock Id Mode Count XidGenLock Shared 1 CLogControlLock Shared 2 ProcArrayLock Shared 2 XidGenLock Exclusive 4 CLogControlLock Exclusive 15 WALInsertLock Exclusive 18 WALWriteLock Exclusive 38 ProcArrayLock Exclusive 77 Lock Id Mode Combined Time (ns) XidGenLock Shared 88700 WALInsertLock Exclusive 69556000 ProcArrayLock Shared 95656800 XidGenLock Exclusive 139634100 CLogControlLock Exclusive 148822200 CLogControlLock Shared 161630000 WALWriteLock Exclusive 332781800 ProcArrayLock Exclusive 5688265500 bash-3.00# ./4_lwlock_waits.d 18599 Lock Id Mode Count ProcArrayLock Shared 2 XidGenLock Exclusive 3 XidGenLock Shared 4 CLogControlLock Shared 5 WALInsertLock Exclusive 10 CLogControlLock Exclusive 21 WALWriteLock Exclusive 28 ProcArrayLock Exclusive 54 Lock Id Mode Combined Time (ns) XidGenLock Exclusive 5688800 WALInsertLock Exclusive 11424700 CLogControlLock Shared 55589100 ProcArrayLock Shared 135220400 WALWriteLock Exclusive 177906900 XidGenLock Shared 524146500 CLogControlLock Exclusive 524563900 ProcArrayLock Exclusive 5828744500 bash-3.00# bash-3.00# ./6_lwlock_stack.d 4 18599 Lock Id Mode Count ProcArrayLock Shared 1 ProcArrayLock Exclusive 52 Lock Id Mode Combined Time (ns) ProcArrayLock Shared 41428300 ProcArrayLock Exclusive 3858386500 Lock Id Combined Time (ns) postgres`LWLockAcquire+0x1f0 postgres`GetSnapshotData+0x120 postgres`GetTransactionSnapshot+0x80 postgres`PortalStart+0x198 postgres`exec_bind_message+0x84c postgres`PostgresMain+0x17f8 postgres`BackendRun+0x2f8 postgres`ServerLoop+0x680 postgres`PostmasterMain+0xda8 postgres`main+0x3d0 postgres`_start+0x17c Shared 41428300 postgres`LWLockAcquire+0x1f0 postgres`CommitTransaction+0x104 postgres`CommitTransactionCommand+0xbc postgres`finish_xact_command+0x78 postgres`exec_execute_message+0x42c postgres`PostgresMain+0x1838 postgres`BackendRun+0x2f8 postgres`ServerLoop+0x680 postgres`PostmasterMain+0xda8 postgres`main+0x3d0 postgres`_start+0x17c Exclusive 3858386500 -Jignesh
With CLOG 16 the drp[s comes at about 1150 users with the following lock stats bash-3.00# ./4_lwlock_waits.d 16404 Lock Id Mode Count ProcArrayLock Shared 2 XidGenLock Exclusive 2 XidGenLock Shared 4 WALInsertLock Exclusive 7 CLogControlLock Shared 8 WALWriteLock Exclusive 46 ProcArrayLock Exclusive 64 CLogControlLock Exclusive 263 Lock Id Mode Combined Time (ns) XidGenLock Exclusive 528300 ProcArrayLock Shared 968800 WALInsertLock Exclusive 4090900 XidGenLock Shared 73987600 WALWriteLock Exclusive 86200700 ProcArrayLock Exclusive 130756000 CLogControlLock Shared 240471000 CLogControlLock Exclusive 4115158500 So I think 32 is a better option for CLogs before ProcArrayLock becomes the bottleneck. Though I havent seen what we can do with ProcArrayLock problem. Regards, Jignesh Jignesh K. Shah wrote: > Using CLOG Buffers 32 and the commit sibling check patch I still see a > drop at 1200-1300 users.. > > > > bash-3.00# ./4_lwlock_waits.d 18250 > > Lock Id Mode Count > XidGenLock Shared 1 > CLogControlLock Shared 2 > ProcArrayLock Shared 2 > XidGenLock Exclusive 4 > CLogControlLock Exclusive 15 > WALInsertLock Exclusive 18 > WALWriteLock Exclusive 38 > ProcArrayLock Exclusive 77 > > Lock Id Mode Combined Time (ns) > XidGenLock Shared 88700 > WALInsertLock Exclusive 69556000 > ProcArrayLock Shared 95656800 > XidGenLock Exclusive 139634100 > CLogControlLock Exclusive 148822200 > CLogControlLock Shared 161630000 > WALWriteLock Exclusive 332781800 > ProcArrayLock Exclusive 5688265500 > > bash-3.00# ./4_lwlock_waits.d 18599 > > Lock Id Mode Count > ProcArrayLock Shared 2 > XidGenLock Exclusive 3 > XidGenLock Shared 4 > CLogControlLock Shared 5 > WALInsertLock Exclusive 10 > CLogControlLock Exclusive 21 > WALWriteLock Exclusive 28 > ProcArrayLock Exclusive 54 > > Lock Id Mode Combined Time (ns) > XidGenLock Exclusive 5688800 > WALInsertLock Exclusive 11424700 > CLogControlLock Shared 55589100 > ProcArrayLock Shared 135220400 > WALWriteLock Exclusive 177906900 > XidGenLock Shared 524146500 > CLogControlLock Exclusive 524563900 > ProcArrayLock Exclusive 5828744500 > > bash-3.00# > bash-3.00# ./6_lwlock_stack.d 4 18599 > > Lock Id Mode Count > ProcArrayLock Shared 1 > ProcArrayLock Exclusive 52 > > Lock Id Mode Combined Time (ns) > ProcArrayLock Shared 41428300 > ProcArrayLock Exclusive 3858386500 > > Lock Id Combined Time (ns) > > > postgres`LWLockAcquire+0x1f0 > postgres`GetSnapshotData+0x120 > postgres`GetTransactionSnapshot+0x80 > postgres`PortalStart+0x198 > postgres`exec_bind_message+0x84c > postgres`PostgresMain+0x17f8 > postgres`BackendRun+0x2f8 > postgres`ServerLoop+0x680 > postgres`PostmasterMain+0xda8 > postgres`main+0x3d0 > postgres`_start+0x17c > Shared 41428300 > > postgres`LWLockAcquire+0x1f0 > postgres`CommitTransaction+0x104 > postgres`CommitTransactionCommand+0xbc > postgres`finish_xact_command+0x78 > postgres`exec_execute_message+0x42c > postgres`PostgresMain+0x1838 > postgres`BackendRun+0x2f8 > postgres`ServerLoop+0x680 > postgres`PostmasterMain+0xda8 > postgres`main+0x3d0 > postgres`_start+0x17c > Exclusive 3858386500 > > > -Jignesh > > > ---------------------------(end of broadcast)--------------------------- > TIP 5: don't forget to increase your free space map settings
Simon, > Well thats pretty weird. That code path clearly only happens once per > transaction and ought to be fast. The other code paths that take > ProcArrayLock like TransactionIdIsInProgress() and GetSnapshotData() > ought to spend more time holding the lock. Presumably you are running > with a fair number of SERIALIZABLE transactions? Given that this is a TPC-C analog, I'd assume that we are. -- --Josh Josh Berkus PostgreSQL @ Sun San Francisco
Yep quite a bit of transactions .. But the piece that's slow is where it is clearing it up in CommitTransaction(). I am not sure of how ProcArrayLock is designed to work and hence not clear what we are seeing is what we expect. Do we have some design doc on ProcArrayLock to understand its purpose? Thanks. Regards, Jignesh Josh Berkus wrote: > Simon, > > >> Well thats pretty weird. That code path clearly only happens once per >> transaction and ought to be fast. The other code paths that take >> ProcArrayLock like TransactionIdIsInProgress() and GetSnapshotData() >> ought to spend more time holding the lock. Presumably you are running >> with a fair number of SERIALIZABLE transactions? >> > > Given that this is TPCC-analog, I'd assume that we are. > > Jignesh? > >
Hi Simon, This patch seems to work well (both with 32 and 64 value but not with 16 and the default 8). Is there a way we can integrate this in 8.3? This will improve out of box performance quite a bit for high number of users (atleat 30% in my OLTP test) Regards, Jignesh Simon Riggs wrote: > On Thu, 2007-07-26 at 11:27 -0400, Jignesh K. Shah wrote: > > >> However at 900 Users where the big drop in throughput occurs: >> It gives a different top "consumer" of time: >> > > > postgres`LWLockAcquire+0x1c8 > >> postgres`SimpleLruReadPage+0x1ac >> postgres`TransactionIdGetStatus+0x14 >> postgres`TransactionLogFetch+0x58 >> > > TransactionIdGetStatus doesn't directly call SimpleLruReadPage(). > Presumably the compiler has been rearranging things?? > > Looks like you're out of clog buffers. It seems like the clog buffers > aren't big enough to hold clog pages for long enough and the SELECT FOR > SHARE processing is leaving lots of additional read locks that are > increasing the number of clog requests for older xids. > > Try the enclosed patch. > > > ------------------------------------------------------------------------ > > Index: src/include/access/clog.h > =================================================================== > RCS file: /projects/cvsroot/pgsql/src/include/access/clog.h,v > retrieving revision 1.19 > diff -c -r1.19 clog.h > *** src/include/access/clog.h 5 Jan 2007 22:19:50 -0000 1.19 > --- src/include/access/clog.h 26 Jul 2007 15:44:58 -0000 > *************** > *** 29,35 **** > > > /* Number of SLRU buffers to use for clog */ > ! #define NUM_CLOG_BUFFERS 8 > > > extern void TransactionIdSetStatus(TransactionId xid, XidStatus status); > --- 29,35 ---- > > > /* Number of SLRU buffers to use for clog */ > ! #define NUM_CLOG_BUFFERS 64 > > > extern void TransactionIdSetStatus(TransactionId xid, XidStatus status); > > ------------------------------------------------------------------------ > > > ---------------------------(end of broadcast)--------------------------- > TIP 5: don't forget to increase your free space map settings >
On Fri, 2007-08-03 at 16:09 -0400, Jignesh K. Shah wrote: > This patch seems to work well (both with 32 and 64 value but not with 16 > and the default 8). Could you test at 24 please also? Tom has pointed out the additional cost of setting this higher, even in workloads that don't benefit from the I/O-induced contention reduction. > Is there a way we can integrate this in 8.3? I just replied to Josh's thread on -hackers about this. > This will improve out of box performance quite a bit for high number of > users (atleat 30% in my OLTP test) Yes, thats good. Will this have a dramatic effect on a particular benchmark, or for what reason might we need this? Tom has questioned the use case here, so I think it would be good to explain a little more for everyone. Thanks. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
I tried with CLOG 24 also and I got linear performance upto 1250 users after which it started to tank. 32 got us to 1350 users before some other bottleneck overtook it. Based on what Tom said earlier, it might then make sense to make it a tunable with the default of 8 but something one can change for high number of users. Thanks. Regards, Jignesh Simon Riggs wrote: > On Fri, 2007-08-03 at 16:09 -0400, Jignesh K. Shah wrote: > > >> This patch seems to work well (both with 32 and 64 value but not with 16 >> and the default 8). >> > > Could you test at 24 please also? Tom has pointed out the additional > cost of setting this higher, even in workloads that don't benefit from > the I/O-induced contention reduction. > > >> Is there a way we can integrate this in 8.3? >> > > I just replied to Josh's thread on -hackers about this. > > >> This will improve out of box performance quite a bit for high number of >> users (atleat 30% in my OLTP test) >> > > Yes, thats good. Will this have a dramatic effect on a particular > benchmark, or for what reason might we need this? Tom has questioned the > use case here, so I think it would be good to explain a little more for > everyone. Thanks. > >
On Fri, 2007-08-10 at 13:54 -0400, Jignesh K. Shah wrote: > I tried with CLOG 24 also and I got linear performance upto 1250 users > after which it started to tank. 32 got us to 1350 users before some > other bottleneck overtook it. Jignesh, Thanks for testing that. It's not yet clear to everybody why supporting an extra 100 users matters, and it would certainly help your case if you could explain. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com