Thread: Linux/PostgreSQL scalability issue - problem with 8 cores
Hi all,

we have a PostgreSQL-dedicated Linux server with 8 cores (2x X5355). We came across a strange issue: when running with all 8 cores enabled, approximately once a minute (the period differs) the system is very busy for a few seconds (~5-10 s) and we don't know why. The issue doesn't show up when we tell Linux to use only 2 cores; with 4 cores the problem is there but it is still better than with 8 cores - all on the same machine, same config, same workload. We don't see any apparent reason for these peaks. We'd like to investigate it further but we don't know what to try next. Any suggestions? Any tuning tips for Linux+PostgreSQL on an 8-way system? Can this be connected with our heavy use of LISTEN/NOTIFY and hundreds of backends in listen mode?

More details are below.

Thanks,

Kuba

System: HP DL360 2x5355, 8G RAM, P600+MSA50 - internal 2x72GB RAID 10 for OS, 10x72G disks RAID 10 for PostgreSQL data and WAL
OS: Linux 2.6 64bit (kernel 2.6.21, 22, 23 makes little difference)
PostgreSQL: 8.2.4 (64bit), shared buffers 1G

Nothing other than PostgreSQL is running on the server. Circa 800 concurrent backends. The majority of backends sit in LISTEN doing nothing. The client interface for most backends is ecpg+libpq.

Problem description:

The system is usually 80-95% idle. Approximately once a minute, for circa 5-10 s, there is a peak in activity which looks like this:

vmstat (and top or atop) reports 0% idle, 100% in user mode, very low iowait, low IO activity, a higher number of context switches than usual but not exceedingly high (2000-4000 cs/s, usually 1500 cs/s), and a few hundred waiting processes per second (usually 0-1/s). From looking at top and the running processes we can't see any obvious reason for the peak. According to the PostgreSQL log, the long-running commands from these moments are e.g. BEGIN TRANSACTION lasting several seconds.

When only 2 cores are enabled (via the kernel command line) everything runs smoothly. 4 cores exhibit slightly better behavior than 8 cores but worse than 2 cores - the peaks are visible.

We've tried kernel versions 2.6.21-23 (latest revisions as of the beginning of December from kernel.org); the pattern changed slightly, but it may also be that the workload changed slightly.

pgbench or any other stress testing runs smoothly on the server.

The only strange thing about our usage pattern I can think of is heavy use of LISTEN/NOTIFY, especially hundreds of backends in listen mode.

When restarting our connected clients the peaks are not there from time 0; they become visible after a while - it seems something gets synchronized and causes trouble then.

Since the server is PostgreSQL-dedicated and none of our client applications run on it - and there is a difference between 2 and 8 cores enabled - we think the peaks are not caused by our client applications.

How can we diagnose what is happening during the peaks?
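A quick way to get a rough picture of the situation described above (how many backends are idle, how many LISTENs are registered) on 8.2 - pg_listener is the pre-9.0 mechanism, and pg_stat_activity.current_query needs stats_command_string enabled:

    -- How many backends are currently idle?
    SELECT count(*) FROM pg_stat_activity WHERE current_query = '<IDLE>';

    -- How many LISTEN registrations exist, and on how many distinct conditions?
    SELECT count(*) AS listens, count(DISTINCT relname) AS conditions
    FROM pg_listener;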
Hi Jakub,

I do have a similar server (from DELL), which performs well with our PostgreSQL application. I guess the peak in context switches is the only thing you can see. Anyhow, I think it is your LISTEN/NOTIFY approach which causes that behaviour. I guess all backends listen to the same notification. I don't know the exact implementation, but I can imagine that all backends access the same section of shared memory, which causes the increase in context switches. More cores means more access at the same time.

Can you change your implementation?
- split your problem - create multiple notifications if possible
- do an UNLISTEN if possible
- use another signalling technique

Regards
Sven

--
Sven Geisler <sgeisler@aeccom.com>    Tel +49.30.921017.81  Fax .50
Senior Developer, AEC/communications GmbH & Co. KG Berlin, Germany
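A minimal sketch of what Sven is suggesting - per-event notification conditions plus explicit UNLISTEN - with made-up condition names:

    -- Listener side: subscribe only to the conditions this backend cares about.
    LISTEN orders_changed;
    LISTEN invoices_changed;

    -- Notifier side: wake only the backends listening on the relevant condition.
    NOTIFY orders_changed;

    -- Drop a registration as soon as it is no longer needed
    -- (UNLISTEN * drops all of this backend's registrations).
    UNLISTEN orders_changed;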
Jakub Ouhrabka <kuba@comgate.cz> writes: > we have a PostgreSQL dedicated Linux server with 8 cores (2xX5355). We > came accross a strange issue: when running with all 8 cores enabled > approximatly once a minute (period differs) the system is very busy for > a few seconds (~5-10s) and we don't know why - this issue don't show up > when we tell Linux to use only 2 cores, with 4 cores the problem is here > but it is still better than with 8 cores - all on the same machine, same > config, same workload. We don't see any apparent reason for these peaks. Interesting. Maybe you could use oprofile to try to see what's happening? It sounds a bit like momentary contention for a spinlock, but exactly what isn't clear. > Can this be connected with our heavy use of listen/notify and hundreds > backends in listen mode? Perhaps. Have you tried logging executions of NOTIFY to see if they are correlated with the spikes? regards, tom lane
Hi Tom,

> Interesting. Maybe you could use oprofile to try to see what's
> happening? It sounds a bit like momentary contention for a spinlock,
> but exactly what isn't clear.

OK, we're going to try oprofile and will let you know...

> Perhaps. Have you tried logging executions of NOTIFY to see if they
> are correlated with the spikes?

We didn't log the notifies, but I don't think they're correlated. We'll have a detailed look next time we try it (with oprofile).

Thanks for the suggestions!

Kuba
Hi Sven,

> I guess all backends listen to the same notification.

Unfortunately no. The backends are listening for different notifications in different databases. Usually there are only a few listens per database, with one exception - there are many (hundreds of) listens in one database, but all for different notifications.

> Can you change your implementation?
> - split your problem - create multiple notifications if possible

Yes, it is already like this.

> - do an UNLISTEN if possible

Yes, we're issuing UNLISTENs when appropriate.

> - use another signalling technique

We're planning to reduce the number of databases/backends/listens, but anyway we'd like to run the system on 8 cores if it runs without any problems on 2 cores...

Thanks for the suggestions!

Kuba
Jakub Ouhrabka wrote: > > - do an UNLISTEN if possible > > Yes, we're issuing unlistens when appropriate. You are vacuuming pg_listener periodically, yes? Not that this seems to have any relationship to your problem, but ... -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro,

>>> - do an UNLISTEN if possible
>> Yes, we're issuing UNLISTENs when appropriate.
>
> You are vacuuming pg_listener periodically, yes? Not that this seems
> to have any relationship to your problem, but ...

Yes, autovacuum should take care of this. But we're looking forward to multiple autovacuum workers in 8.3, as that should help us during high-load periods (some tables may wait too long for autovacuum now - but it's not that big a problem for us...).

Thanks for the great work!

Kuba
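For completeness, the periodic maintenance Alvaro is referring to looks like this (only needed if autovacuum is not keeping up; pg_listener exists only up to 8.4):

    -- Reclaim the dead rows left in pg_listener by LISTEN/UNLISTEN/NOTIFY churn
    -- and report how large it has grown:
    VACUUM VERBOSE pg_listener;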
Jakub Ouhrabka wrote: > How can we diagnose what is happening during the peaks? Can you try forcing a core from a bunch of the busy processes? (Hmm - does Linux have an equivalent to the useful Solaris pstacks?) James
James Mansion wrote: > Jakub Ouhrabka wrote: >> How can we diagnose what is happening during the peaks? > Can you try forcing a core from a bunch of the busy processes? (Hmm - > does Linux have an equivalent to the useful Solaris pstacks?) There's a 'pstack' for Linux, shipped at least in Red Hat distributions (and possibly others, I'm not sure). It's a shell script wrapper around gdb, so easily ported to any Linux.
Hi Tom & all,

>> It sounds a bit like momentary contention for a spinlock,
>> but exactly what isn't clear.

> ok, we're going to try oprofile, will let you know...

Yes, it does look like contention for a spinlock, if I'm interpreting oprofile correctly: around 60% of the time during the spikes is spent in s_lock (for details see below).

We've tried several times to get a stack trace from some of the running backends during the spikes; we always got this:

0x00002b005d00a9a9 in semop () from /lib/libc.so.6
#0  0x00002b005d00a9a9 in semop () from /lib/libc.so.6
#1  0x000000000054fe53 in PGSemaphoreLock (sema=0x2b00a04e5090, interruptOK=0 '\0') at pg_sema.c:411
#2  0x0000000000575d95 in LWLockAcquire (lockid=SInvalLock, mode=LW_EXCLUSIVE) at lwlock.c:455
#3  0x000000000056fbfe in ReceiveSharedInvalidMessages (invalFunction=0x5e9a30 <LocalExecuteInvalidationMessage>, resetFunction=0x5e9df0 <InvalidateSystemCaches>) at sinval.c:159
#4  0x0000000000463505 in StartTransactionCommand () at xact.c:1439
#5  0x000000000056fa4b in ProcessCatchupEvent () at sinval.c:347
#6  0x000000000056fb20 in CatchupInterruptHandler (postgres_signal_arg=<value optimized out>) at sinval.c:221
#7  0x00002b005cf6f110 in killpg () from /lib/libc.so.6
#8  0x0000000000000000 in ?? ()

Is this enough info to guess what's happening? What should we try next?

Thanks,

Kuba

oprofile results [I've shortened the path from /usr/local/pg... to just pg for better readability]:

# opreport --long-filenames
CPU: Core 2, speed 2666.76 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
CPU_CLK_UNHALT...|
  samples|      %|
------------------
   125577 90.7584 pg-8.2.4/bin/postgres
     3792  2.7406 /lib/libc-2.3.6.so
     3220  2.3272 /usr/src/linux-2.6.22.15/vmlinux
     2145  1.5503 /usr/bin/oprofiled
     1540  1.1130 /xfs
      521  0.3765 pg-8.2.4/lib/plpgsql.so
      441  0.3187 /cciss
      374  0.2703 /oprofile
...
CPU: Core 2, speed 2666.76 MHz (estimated) Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000 samples % app name symbol name 85355 61.6887 pg-8.2.4/bin/postgres s_lock 9803 7.0849 pg-8.2.4/bin/postgres LWLockRelease 5535 4.0003 pg-8.2.4/bin/postgres LWLockAcquire 3792 2.7406 /lib/libc-2.3.6.so (no symbols) 3724 2.6915 pg-8.2.4/bin/postgres DropRelFileNodeBuffers 2145 1.5503 /usr/bin/oprofiled (no symbols) 2069 1.4953 pg-8.2.4/bin/postgres GetSnapshotData 1540 1.1130 /xfs (no symbols) 1246 0.9005 pg-8.2.4/bin/postgres hash_search_with_hash_value 1052 0.7603 pg-8.2.4/bin/postgres AllocSetAlloc 1015 0.7336 pg-8.2.4/bin/postgres heapgettup 879 0.6353 pg-8.2.4/bin/postgres hash_any 862 0.6230 /usr/src/linux-2.6.22.15/vmlinux mwait_idle 740 0.5348 pg-8.2.4/bin/postgres hash_seq_search 674 0.4871 pg-8.2.4/bin/postgres HeapTupleSatisfiesNow 557 0.4026 pg-8.2.4/bin/postgres SIGetDataEntry 552 0.3989 pg-8.2.4/bin/postgres equal 469 0.3390 pg-8.2.4/bin/postgres SearchCatCache 441 0.3187 /cciss (no symbols) 433 0.3129 /usr/src/linux-2.6.22.15/vmlinux find_busiest_group 413 0.2985 pg-8.2.4/bin/postgres PinBuffer 393 0.2840 pg-8.2.4/bin/postgres MemoryContextAllocZeroAligned 374 0.2703 /oprofile (no symbols) 275 0.1988 pg-8.2.4/bin/postgres ExecInitExpr 253 0.1829 pg-8.2.4/bin/postgres base_yyparse 206 0.1489 pg-8.2.4/bin/postgres CatalogCacheFlushRelation 201 0.1453 pg-8.2.4/bin/postgres MemoryContextAlloc 194 0.1402 pg-8.2.4/bin/postgres _bt_compare 188 0.1359 /nf_conntrack (no symbols) 158 0.1142 /bnx2 (no symbols) 147 0.1062 pg-8.2.4/bin/postgres pgstat_initstats 139 0.1005 pg-8.2.4/bin/postgres fmgr_info_cxt_security 132 0.0954 /usr/src/linux-2.6.22.15/vmlinux task_rq_lock 131 0.0947 /bin/bash (no symbols) 129 0.0932 pg-8.2.4/bin/postgres AllocSetFree 125 0.0903 pg-8.2.4/bin/postgres ReadBuffer 124 0.0896 pg-8.2.4/bin/postgres MemoryContextCreate 124 0.0896 pg-8.2.4/bin/postgres SyncOneBuffer 124 0.0896 pg-8.2.4/bin/postgres XLogInsert 123 0.0889 pg-8.2.4/bin/postgres _equalAggref 122 0.0882 pg-8.2.4/bin/postgres HeapTupleSatisfiesSnapshot 112 0.0809 pg-8.2.4/bin/postgres copyObject 102 0.0737 pg-8.2.4/bin/postgres UnpinBuffer 99 0.0716 pg-8.2.4/bin/postgres _SPI_execute_plan 99 0.0716 pg-8.2.4/bin/postgres nocachegetattr 98 0.0708 /usr/src/linux-2.6.22.15/vmlinux __wake_up_bit 97 0.0701 pg-8.2.4/bin/postgres TransactionIdIsInProgress 94 0.0679 pg-8.2.4/bin/postgres check_stack_depth 93 0.0672 pg-8.2.4/bin/postgres base_yylex 91 0.0658 pg-8.2.4/bin/postgres pfree 89 0.0643 pg-8.2.4/lib/plpgsql.so exec_stmts 86 0.0622 /usr/src/linux-2.6.22.15/vmlinux __switch_to 85 0.0614 pg-8.2.4/bin/postgres LockAcquire 83 0.0600 pg-8.2.4/bin/postgres FunctionCall2 82 0.0593 pg-8.2.4/bin/postgres ExecInitAgg 82 0.0593 /usr/src/linux-2.6.22.15/vmlinux system_call 79 0.0571 pg-8.2.4/bin/postgres pgstat_write_statsfile 77 0.0557 pg-8.2.4/bin/postgres heap_getsysattr 73 0.0528 pg-8.2.4/bin/postgres .plt 72 0.0520 /lib/ld-2.3.6.so (no symbols) 71 0.0513 pg-8.2.4/bin/postgres SearchSysCache 71 0.0513 pg-8.2.4/bin/postgres _bt_checkkeys 68 0.0491 /usr/src/linux-2.6.22.15/vmlinux apic_timer_interrupt 67 0.0484 pg-8.2.4/bin/postgres slot_deform_tuple 64 0.0463 pg-8.2.4/bin/postgres TupleDescInitEntry 64 0.0463 pg-8.2.4/bin/postgres newarc 63 0.0455 /usr/src/linux-2.6.22.15/vmlinux __wake_up 62 0.0448 pg-8.2.4/bin/postgres LocalExecuteInvalidationMessage 61 0.0441 /usr/src/linux-2.6.22.15/vmlinux try_to_wake_up 60 0.0434 pg-8.2.4/bin/postgres 
ReceiveSharedInvalidMessages 59 0.0426 pg-8.2.4/bin/postgres ExecutorStart 58 0.0419 pg-8.2.4/bin/postgres DirectFunctionCall1 58 0.0419 pg-8.2.4/bin/postgres ScanKeywordLookup 57 0.0412 pg-8.2.4/bin/postgres hash_search 56 0.0405 pg-8.2.4/bin/postgres CatalogCacheComputeHashValue 56 0.0405 pg-8.2.4/bin/postgres ExecProject 55 0.0398 pg-8.2.4/bin/postgres _bt_first 54 0.0390 pg-8.2.4/lib/plpgsql.so exec_eval_simple_expr 53 0.0383 pg-8.2.4/bin/postgres AllocSetDelete 53 0.0383 pg-8.2.4/bin/postgres CleanupTempFiles 52 0.0376 pg-8.2.4/bin/postgres ExecCreateTupleTable 52 0.0376 pg-8.2.4/lib/plpgsql.so copy_plpgsql_datum 49 0.0354 pg-8.2.4/bin/postgres MemoryContextAllocZero 49 0.0354 pg-8.2.4/bin/postgres SIDelExpiredDataEntries 49 0.0354 pg-8.2.4/bin/postgres fix_opfuncids_walker 47 0.0340 pg-8.2.4/bin/postgres expression_tree_walker 46 0.0332 pg-8.2.4/bin/postgres LockBuffer 45 0.0325 pg-8.2.4/bin/postgres lappend 45 0.0325 /usr/src/linux-2.6.22.15/vmlinux do_IRQ 44 0.0318 pg-8.2.4/bin/postgres LockReleaseAll 43 0.0311 pg-8.2.4/bin/postgres ExecutorRun 43 0.0311 pg-8.2.4/bin/postgres exprTypmod 42 0.0304 pg-8.2.4/bin/postgres ExecClearTuple 42 0.0304 pg-8.2.4/bin/postgres heap_fill_tuple 41 0.0296 pg-8.2.4/bin/postgres ExecIndexBuildScanKeys 40 0.0289 pg-8.2.4/bin/postgres _bt_readpage 40 0.0289 pg-8.2.4/lib/plpgsql.so plpgsql_exec_function 40 0.0289 /usr/src/linux-2.6.22.15/vmlinux lock_timer_base 39 0.0282 pg-8.2.4/bin/postgres heap_release_fetch 38 0.0275 pg-8.2.4/bin/postgres ExecEvalVar 38 0.0275 pg-8.2.4/bin/postgres ExecTypeFromTLInternal 37 0.0267 pg-8.2.4/bin/postgres LockReassignCurrentOwner 37 0.0267 /usr/src/linux-2.6.22.15/vmlinux scheduler_tick 36 0.0260 pg-8.2.4/bin/postgres ReleaseAndReadBuffer 36 0.0260 pg-8.2.4/bin/postgres _bt_checkpage 36 0.0260 pg-8.2.4/bin/postgres freearc 36 0.0260 pg-8.2.4/bin/postgres heapgetpage 36 0.0260 /usr/src/linux-2.6.22.15/vmlinux resched_task 35 0.0253 pg-8.2.4/bin/postgres PGSemaphoreLock 35 0.0253 pg-8.2.4/bin/postgres optimize 35 0.0253 pg-8.2.4/bin/postgres slot_getattr 35 0.0253 /usr/src/linux-2.6.22.15/vmlinux effective_prio 35 0.0253 /usr/src/linux-2.6.22.15/vmlinux hrtimer_run_queues 34 0.0246 /ip_tables (no symbols) 34 0.0246 pg-8.2.4/bin/postgres ExecCountSlotsNode 34 0.0246 pg-8.2.4/bin/postgres ExecMakeFunctionResultNoSets 34 0.0246 pg-8.2.4/bin/postgres ResourceOwnerForgetBuffer 34 0.0246 pg-8.2.4/bin/postgres TransactionIdPrecedes 34 0.0246 pg-8.2.4/bin/postgres _bt_getroot 34 0.0246 pg-8.2.4/lib/plpgsql.so exec_stmt_block 34 0.0246 /usr/src/linux-2.6.22.15/vmlinux do_page_fault 34 0.0246 /usr/src/linux-2.6.22.15/vmlinux run_rebalance_domains 33 0.0239 pg-8.2.4/bin/postgres AtEOXact_GUC 33 0.0239 /usr/src/linux-2.6.22.15/vmlinux __exit_idle 33 0.0239 /usr/src/linux-2.6.22.15/vmlinux sched_clock 32 0.0231 pg-8.2.4/bin/postgres _bt_moveright 32 0.0231 pg-8.2.4/bin/postgres compact 32 0.0231 /usr/src/linux-2.6.22.15/vmlinux __mod_timer 31 0.0224 pg-8.2.4/bin/postgres CreateExprContext 31 0.0224 pg-8.2.4/bin/postgres ExecInitNode 31 0.0224 pg-8.2.4/bin/postgres ExecMakeFunctionResult 31 0.0224 pg-8.2.4/bin/postgres ReleaseCatCache 31 0.0224 pg-8.2.4/bin/postgres exprType 30 0.0217 pg-8.2.4/bin/postgres _bt_binsrch 30 0.0217 pg-8.2.4/bin/postgres new_tail_cell 30 0.0217 /usr/src/linux-2.6.22.15/vmlinux run_timer_softirq 29 0.0210 pg-8.2.4/bin/postgres ExecAssignScanProjectionInfo 29 0.0210 pg-8.2.4/bin/postgres heap_form_tuple 28 0.0202 /usr/lib/gconv/ISO8859-1.so (no symbols) 28 0.0202 pg-8.2.4/bin/postgres AllocSetContextCreate 
28 0.0202 pg-8.2.4/bin/postgres ResourceOwnerForgetCatCacheRef 28 0.0202 pg-8.2.4/bin/postgres hashoid 28 0.0202 pg-8.2.4/bin/postgres new_list 27 0.0195 pg-8.2.4/bin/postgres CreateTemplateTupleDesc 27 0.0195 pg-8.2.4/bin/postgres MemoryContextStrdup 27 0.0195 pg-8.2.4/bin/postgres SPI_execute_plan 27 0.0195 pg-8.2.4/bin/postgres btint4cmp 27 0.0195 /usr/src/linux-2.6.22.15/vmlinux IRQ0xa9_interrupt 26 0.0188 pg-8.2.4/bin/postgres ExecBuildProjectionInfo 26 0.0188 pg-8.2.4/bin/postgres ExecDropTupleTable 26 0.0188 pg-8.2.4/bin/postgres ExecEndNode 26 0.0188 pg-8.2.4/bin/postgres ExecEvalParam 26 0.0188 pg-8.2.4/bin/postgres FreeExecutorState 26 0.0188 pg-8.2.4/bin/postgres MemoryContextReset 26 0.0188 /usr/src/linux-2.6.22.15/vmlinux smp_apic_timer_interrupt 25 0.0181 pg-8.2.4/bin/postgres ExecInitSubPlan 25 0.0181 pg-8.2.4/bin/postgres FuncnameGetCandidates 25 0.0181 pg-8.2.4/bin/postgres MemoryContextDelete 25 0.0181 pg-8.2.4/bin/postgres SearchCatCacheList 25 0.0181 /usr/src/linux-2.6.22.15/vmlinux handle_edge_irq 25 0.0181 /usr/src/linux-2.6.22.15/vmlinux msecs_to_jiffies 24 0.0173 /nf_conntrack_ipv4 (no symbols) 24 0.0173 pg-8.2.4/bin/postgres ExecInitIndexScan 24 0.0173 pg-8.2.4/bin/postgres ExecProcNode 24 0.0173 pg-8.2.4/bin/postgres InternalCreateExecutorState 24 0.0173 pg-8.2.4/bin/postgres PostgresMain 24 0.0173 pg-8.2.4/bin/postgres RelationIncrementReferenceCount 24 0.0173 pg-8.2.4/bin/postgres heapgettup_pagemode 24 0.0173 /usr/src/linux-2.6.22.15/vmlinux __wake_up_common 24 0.0173 /usr/src/linux-2.6.22.15/vmlinux get_task_mm 24 0.0173 /usr/src/linux-2.6.22.15/vmlinux page_waitqueue 23 0.0166 pg-8.2.4/bin/postgres ExecCheckRTPerms 23 0.0166 pg-8.2.4/bin/postgres TransactionIdFollowsOrEquals 23 0.0166 pg-8.2.4/bin/postgres internal_putbytes 23 0.0166 pg-8.2.4/bin/postgres scanner_init 23 0.0166 pg-8.2.4/lib/plpgsql.so exec_stmt_execsql 23 0.0166 /usr/src/linux-2.6.22.15/vmlinux IRQ0xc1_interrupt 23 0.0166 /usr/src/linux-2.6.22.15/vmlinux __do_softirq 22 0.0159 pg-8.2.4/bin/postgres AfterTriggerBeginXact 22 0.0159 pg-8.2.4/bin/postgres GetTransactionSnapshot 22 0.0159 pg-8.2.4/bin/postgres ResourceOwnerReleaseInternal 22 0.0159 pg-8.2.4/bin/postgres lcons 22 0.0159 pg-8.2.4/bin/postgres markcanreach 22 0.0159 pg-8.2.4/lib/plpgsql.so exec_eval_datum 22 0.0159 pg-8.2.4/lib/plpgsql.so exec_eval_expr 22 0.0159 pg-8.2.4/lib/utf8_and_win.so win_to_utf8 22 0.0159 /usr/src/linux-2.6.22.15/vmlinux enqueue_task 22 0.0159 /usr/src/linux-2.6.22.15/vmlinux math_state_restore 21 0.0152 pg-8.2.4/bin/postgres AtCommit_Notify 21 0.0152 pg-8.2.4/bin/postgres CreateTupleDescCopy 21 0.0152 pg-8.2.4/bin/postgres ExecScan 21 0.0152 pg-8.2.4/bin/postgres ResourceOwnerEnlargeCatCacheRefs 21 0.0152 pg-8.2.4/bin/postgres ResourceOwnerForgetRelationRef 21 0.0152 pg-8.2.4/bin/postgres ResourceOwnerRememberCatCacheRef 21 0.0152 pg-8.2.4/bin/postgres SHMQueueInsertBefore 21 0.0152 pg-8.2.4/bin/postgres exec_simple_query 21 0.0152 pg-8.2.4/bin/postgres get_hash_value 21 0.0152 pg-8.2.4/bin/postgres index_getnext 21 0.0152 pg-8.2.4/bin/postgres next_token 21 0.0152 pg-8.2.4/bin/postgres pgstat_report_activity 21 0.0152 pg-8.2.4/lib/plpgsql.so exec_run_select 20 0.0145 pg-8.2.4/bin/postgres AllocSetReset 20 0.0145 pg-8.2.4/bin/postgres AtProcExit_Buffers 20 0.0145 pg-8.2.4/bin/postgres LocalToUtf 20 0.0145 pg-8.2.4/bin/postgres index_getprocinfo 20 0.0145 pg-8.2.4/bin/postgres spi_dest_startup 20 0.0145 pg-8.2.4/bin/postgres tag_hash 20 0.0145 /usr/src/linux-2.6.22.15/vmlinux sysret_check 19 0.0137 
pg-8.2.4/bin/postgres Async_Unlisten 19 0.0137 pg-8.2.4/bin/postgres FileSeek 19 0.0137 pg-8.2.4/bin/postgres PortalStart 19 0.0137 pg-8.2.4/bin/postgres heap_open 19 0.0137 pg-8.2.4/bin/postgres pg_mblen 19 0.0137 /usr/src/linux-2.6.22.15/vmlinux prepare_to_wait 18 0.0130 /iptable_filter (no symbols) 18 0.0130 /iptable_nat (no symbols) 18 0.0130 /lib/libm-2.3.6.so (no symbols) 18 0.0130 pg-8.2.4/bin/postgres CopySnapshot 18 0.0130 pg-8.2.4/bin/postgres ExecSetSlotDescriptor 18 0.0130 pg-8.2.4/bin/postgres SPI_connect 18 0.0130 pg-8.2.4/bin/postgres cleartraverse 18 0.0130 pg-8.2.4/bin/postgres pg_mbcliplen 18 0.0130 pg-8.2.4/bin/postgres strlcpy 18 0.0130 pg-8.2.4/lib/plpgsql.so plpgsql_compile 18 0.0130 /usr/src/linux-2.6.22.15/vmlinux account_system_time 18 0.0130 /usr/src/linux-2.6.22.15/vmlinux bit_waitqueue 17 0.0123 pg-8.2.4/bin/postgres AtEOXact_MultiXact 17 0.0123 pg-8.2.4/bin/postgres CreateTupleDescCopyConstr 17 0.0123 pg-8.2.4/bin/postgres ExecAgg 17 0.0123 pg-8.2.4/bin/postgres ExecResult 17 0.0123 pg-8.2.4/bin/postgres FlushBuffer 17 0.0123 pg-8.2.4/bin/postgres RelationGetIndexScan 17 0.0123 pg-8.2.4/bin/postgres SHMQueueDelete 17 0.0123 pg-8.2.4/bin/postgres SendRowDescriptionMessage 17 0.0123 pg-8.2.4/bin/postgres _bt_search 17 0.0123 pg-8.2.4/bin/postgres _copyTargetEntry 17 0.0123 pg-8.2.4/bin/postgres heap_getnext 17 0.0123 pg-8.2.4/bin/postgres markreachable 17 0.0123 pg-8.2.4/bin/postgres miss 17 0.0123 pg-8.2.4/bin/postgres scanRTEForColumn 17 0.0123 pg-8.2.4/lib/plpgsql.so exec_assign_value 17 0.0123 /usr/src/linux-2.6.22.15/vmlinux do_softirq 16 0.0116 pg-8.2.4/bin/postgres ExecEvalConst 16 0.0116 pg-8.2.4/bin/postgres ExecReScan 16 0.0116 pg-8.2.4/bin/postgres ExecSetParamPlan 16 0.0116 pg-8.2.4/bin/postgres downcase_truncate_identifier 16 0.0116 pg-8.2.4/bin/postgres transformExpr 16 0.0116 pg-8.2.4/bin/postgres varstr_cmp 16 0.0116 pg-8.2.4/lib/plpgsql.so plpgsql_call_handler 16 0.0116 pg-8.2.4/lib/plpgsql.so plpgsql_estate_setup 16 0.0116 /usr/src/linux-2.6.22.15/vmlinux cpu_idle 15 0.0108 pg-8.2.4/bin/postgres CommitTransactionCommand 15 0.0108 pg-8.2.4/bin/postgres ExecEvalFuncArgs 15 0.0108 pg-8.2.4/bin/postgres GrantLockLocal 15 0.0108 pg-8.2.4/bin/postgres PrepareToInvalidateCacheTuple 15 0.0108 pg-8.2.4/bin/postgres UtfToLocal 15 0.0108 pg-8.2.4/bin/postgres cost_nestloop 15 0.0108 pg-8.2.4/bin/postgres errstart 15 0.0108 pg-8.2.4/bin/postgres index_open 15 0.0108 pg-8.2.4/bin/postgres init_fcache 15 0.0108 pg-8.2.4/bin/postgres list_delete_cell 14 0.0101 pg-8.2.4/bin/postgres ExecStoreTuple 14 0.0101 pg-8.2.4/bin/postgres ReadNewTransactionId 14 0.0101 pg-8.2.4/bin/postgres ReleaseBuffer 14 0.0101 pg-8.2.4/bin/postgres ResourceOwnerRememberBuffer 14 0.0101 pg-8.2.4/bin/postgres SHMQueueNext 14 0.0101 pg-8.2.4/bin/postgres btgettuple 14 0.0101 pg-8.2.4/bin/postgres duptraverse 14 0.0101 pg-8.2.4/bin/postgres heap_compute_data_size 14 0.0101 pg-8.2.4/bin/postgres index_rescan 14 0.0101 pg-8.2.4/bin/postgres list_copy 14 0.0101 pg-8.2.4/bin/postgres smgrclosenode 14 0.0101 pg-8.2.4/lib/plpgsql.so .plt 14 0.0101 /usr/src/linux-2.6.22.15/vmlinux idle_cpu 14 0.0101 /usr/src/linux-2.6.22.15/vmlinux run_workqueue 13 0.0094 pg-8.2.4/bin/postgres Async_Listen 13 0.0094 pg-8.2.4/bin/postgres DLMoveToFront 13 0.0094 pg-8.2.4/bin/postgres FreeExprContext 13 0.0094 pg-8.2.4/bin/postgres IndexNext 13 0.0094 pg-8.2.4/bin/postgres LockRelease 13 0.0094 pg-8.2.4/bin/postgres ResourceOwnerEnlargeBuffers 13 0.0094 pg-8.2.4/bin/postgres ShutdownExprContext 13 0.0094 
pg-8.2.4/bin/postgres cleanup 13 0.0094 pg-8.2.4/bin/postgres datumCopy 13 0.0094 pg-8.2.4/bin/postgres grouping_planner 13 0.0094 pg-8.2.4/bin/postgres newstate 13 0.0094 pg-8.2.4/bin/postgres pgstat_start 13 0.0094 pg-8.2.4/bin/postgres spi_printtup 13 0.0094 pg-8.2.4/bin/postgres transformStmt 12 0.0087 pg-8.2.4/bin/postgres ExecEvalOper 12 0.0087 pg-8.2.4/bin/postgres ExecQual 12 0.0087 pg-8.2.4/bin/postgres IsSharedRelation 12 0.0087 pg-8.2.4/bin/postgres RelationIdGetRelation 12 0.0087 pg-8.2.4/bin/postgres StartTransactionCommand 12 0.0087 pg-8.2.4/bin/postgres _bt_insertonpg 12 0.0087 pg-8.2.4/bin/postgres advance_aggregates 12 0.0087 pg-8.2.4/bin/postgres getvacant 12 0.0087 pg-8.2.4/bin/postgres list_delete_ptr 12 0.0087 pg-8.2.4/bin/postgres namestrcpy 12 0.0087 pg-8.2.4/bin/postgres oideq 12 0.0087 /usr/src/linux-2.6.22.15/vmlinux local_bh_enable_ip 12 0.0087 /usr/src/linux-2.6.22.15/vmlinux sys_rt_sigreturn 11 0.0080 pg-8.2.4/bin/postgres CommitTransaction 11 0.0080 pg-8.2.4/bin/postgres ExecEvalNullTest 11 0.0080 pg-8.2.4/bin/postgres LockRelationOid 11 0.0080 pg-8.2.4/bin/postgres ProcessUtility 11 0.0080 pg-8.2.4/bin/postgres UnGrantLock 11 0.0080 pg-8.2.4/bin/postgres XidInSnapshot 11 0.0080 pg-8.2.4/bin/postgres _bt_doinsert 11 0.0080 pg-8.2.4/bin/postgres _bt_steppage 11 0.0080 pg-8.2.4/bin/postgres appendBinaryStringInfo 11 0.0080 pg-8.2.4/bin/postgres okcolors 11 0.0080 pg-8.2.4/bin/postgres pq_getmessage 11 0.0080 pg-8.2.4/lib/plpgsql.so exec_move_row 11 0.0080 /usr/src/linux-2.6.22.15/vmlinux clocksource_get_next 11 0.0080 /usr/src/linux-2.6.22.15/vmlinux handle_IRQ_event 11 0.0080 /usr/src/linux-2.6.22.15/vmlinux local_bh_enable 11 0.0080 /usr/src/linux-2.6.22.15/vmlinux mod_timer 10 0.0072 pg-8.2.4/bin/postgres CommandEndInvalidationMessages 10 0.0072 pg-8.2.4/bin/postgres CopyTriggerDesc 10 0.0072 pg-8.2.4/bin/postgres ExecAssignExprContext 10 0.0072 pg-8.2.4/bin/postgres ExecEvalScalarArrayOp 10 0.0072 pg-8.2.4/bin/postgres FunctionCall5 10 0.0072 pg-8.2.4/bin/postgres GetCurrentSubTransactionId 10 0.0072 pg-8.2.4/bin/postgres RecordTransactionCommit 10 0.0072 pg-8.2.4/bin/postgres StrategyGetBuffer 10 0.0072 pg-8.2.4/bin/postgres _bt_getbuf 10 0.0072 pg-8.2.4/bin/postgres _bt_preprocess_keys 10 0.0072 pg-8.2.4/bin/postgres eval_const_expressions_mutator 10 0.0072 pg-8.2.4/bin/postgres index_beginscan_internal 10 0.0072 pg-8.2.4/bin/postgres new_head_cell 10 0.0072 pg-8.2.4/bin/postgres perform_default_encoding_conversion 10 0.0072 pg-8.2.4/bin/postgres pg_regcomp 10 0.0072 pg-8.2.4/lib/plpgsql.so exec_eval_boolean 10 0.0072 pg-8.2.4/lib/plpgsql.so exec_stmt_if 10 0.0072 /usr/src/linux-2.6.22.15/vmlinux __rcu_pending 10 0.0072 /usr/src/linux-2.6.22.15/vmlinux __rcu_process_callbacks
Jakub Ouhrabka wrote: > We've tried several times to get stacktrace from some of the running > backends during spikes, we got always this: > > 0x00002b005d00a9a9 in semop () from /lib/libc.so.6 > #0 0x00002b005d00a9a9 in semop () from /lib/libc.so.6 > #1 0x000000000054fe53 in PGSemaphoreLock (sema=0x2b00a04e5090, > interruptOK=0 '\0') at pg_sema.c:411 > #2 0x0000000000575d95 in LWLockAcquire (lockid=SInvalLock, > mode=LW_EXCLUSIVE) at lwlock.c:455 > #3 0x000000000056fbfe in ReceiveSharedInvalidMessages > (invalFunction=0x5e9a30 <LocalExecuteInvalidationMessage>, > resetFunction=0x5e9df0 <InvalidateSystemCaches>) at sinval.c:159 > #4 0x0000000000463505 in StartTransactionCommand () at xact.c:1439 > #5 0x000000000056fa4b in ProcessCatchupEvent () at sinval.c:347 > #6 0x000000000056fb20 in CatchupInterruptHandler > (postgres_signal_arg=<value optimized out>) at sinval.c:221 > #7 0x00002b005cf6f110 in killpg () from /lib/libc.so.6 > #8 0x0000000000000000 in ?? () Perhaps it would make sense to try to take the "fast path" in SIDelExpiredDataEntries with only a shared lock rather than exclusive. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Jakub Ouhrabka <kuba@comgate.cz> writes: > We've tried several times to get stacktrace from some of the running > backends during spikes, we got always this: > 0x00002b005d00a9a9 in semop () from /lib/libc.so.6 > #0 0x00002b005d00a9a9 in semop () from /lib/libc.so.6 > #1 0x000000000054fe53 in PGSemaphoreLock (sema=0x2b00a04e5090, > interruptOK=0 '\0') at pg_sema.c:411 > #2 0x0000000000575d95 in LWLockAcquire (lockid=SInvalLock, > mode=LW_EXCLUSIVE) at lwlock.c:455 > #3 0x000000000056fbfe in ReceiveSharedInvalidMessages > (invalFunction=0x5e9a30 <LocalExecuteInvalidationMessage>, > resetFunction=0x5e9df0 <InvalidateSystemCaches>) at sinval.c:159 > #4 0x0000000000463505 in StartTransactionCommand () at xact.c:1439 > #5 0x000000000056fa4b in ProcessCatchupEvent () at sinval.c:347 > #6 0x000000000056fb20 in CatchupInterruptHandler > (postgres_signal_arg=<value optimized out>) at sinval.c:221 CatchupInterruptHandler, eh? That seems to let NOTIFY off the hook, and instead points in the direction of sinval processing; which is to say, propagation of changes to system catalogs. Does your app create and destroy a tremendous number of temp tables, or anything else in the way of frequent DDL commands? regards, tom lane
Jakub Ouhrabka <jakub.ouhrabka@comgate.cz> writes: >>> Does your app create and destroy a tremendous number of temp tables, >>> or anything else in the way of frequent DDL commands? > Hmm. I can't think of anything like this. Maybe there are few backends > which create temp tables but not tremendous number. I don't think our > applications issue DDL statements either. Can LOCK TABLE IN EXCLUSIVE > MODE cause this? No. I did some experimenting to see exactly how large the sinval message buffer is in today's terms, and what I find is that about 22 cycles of create temp table foo (f1 int, f2 text); drop table foo; is enough to force a CatchupInterrupt to a sleeping backend. This case is a bit more complex than it appears since the text column forces the temp table to have a toast table; but even with only fixed-width columns, if you were creating one temp table a second that would be plenty to explain once-a-minute-or-so CatchupInterrupt processing. And if you've got a lot of backends that normally *don't* wake up that often, then they'd be sitting and failing to absorb the sinval traffic in any more timely fashion. So I'm betting that we've got the source of the spike identified. You could check this theory out by strace'ing some of the idle backends and seeing if their activity spikes are triggered by receipt of SIGUSR1 signals. regards, tom lane
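For anyone wanting to reproduce this on a test system, a hypothetical helper like the one below generates the kind of catalog churn Tom describes (each temp table with a text column also brings a toast table and toast index with it); it is a sketch, not something from the thread:

    CREATE OR REPLACE FUNCTION churn_temp_tables(n integer) RETURNS void AS $$
    BEGIN
        FOR i IN 1 .. n LOOP
            -- EXECUTE sidesteps plan-caching surprises in pre-8.3 plpgsql
            EXECUTE 'CREATE TEMP TABLE sinval_probe (f1 int, f2 text)';
            EXECUTE 'DROP TABLE sinval_probe';
        END LOOP;
    END;
    $$ LANGUAGE plpgsql;

    -- Roughly the number of cycles Tom found sufficient to trigger a catchup:
    SELECT churn_temp_tables(25);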
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Perhaps it would make sense to try to take the "fast path" in
> SIDelExpiredDataEntries with only a shared lock rather than exclusive.

I think the real problem here is that sinval catchup processing is well designed to create contention :-(. Once we've decided that the message queue is getting too full, we SIGUSR1 all the backends at once (or as fast as the postmaster can do it anyway), then they all go off and try to touch the sinval queue. Backends that haven't awoken even once since the last time will have to process the entire queue contents, and they're all trying to do that at the same time. What's worse, they take and release the SInvalLock once for each message they take off the queue. This isn't so horrid for one-core machines (since each process will monopolize the CPU for probably less than one timeslice while it's catching up) but it's pretty obvious where all the contention is coming from on an 8-core.

Some ideas for improving matters:

1. Try to avoid having all the backends hit the queue at once. Instead of SIGUSR1'ing everybody at the same time, maybe hit only the process with the oldest message pointer, and have him hit the next oldest after he's done reading the queue.

2. Try to take more than one message off the queue per SInvalLock cycle. (There is a tuning tradeoff here, since it would mean holding the lock for longer at a time.)

3. Try to avoid having every backend run SIDelExpiredDataEntries every time through ReceiveSharedInvalidMessages. It's not critical to delete entries until the queue starts getting full --- maybe we could rejigger the logic so it only happens once when somebody notices the queue is getting full, or so that only the guy(s) who had nextMsgNum == minMsgNum do it, or something like that?

			regards, tom lane
> You could check this theory out by strace'ing some of the idle
> backends and seeing if their activity spikes are triggered by receipt
> of SIGUSR1 signals.

Yes, I can confirm that it's triggered by SIGUSR1 signals.

If I understand it correctly, we have the following choices now:

1) Use only 2 cores (out of 8 cores).

2) Lower the number of idle backends - or at least force the backends to do something at different times to eliminate the spikes - is "SELECT 1" enough to force processing of the queue?

3) Is there any chance of this being fixed/improved in 8.3 or even 8.2? It's a (performance) bug from our point of view. I realize we're the first to notice it and it's not a typical use case to have so many idle backends. But a large installation with connection pooling is something similar, and that's not that uncommon - one active backend doing DDL can then cause unexpected spikes during otherwise quiet hours...

4) Sure, we'll try to reduce the number of DDL statements (though in fact we're not sure where exactly they are coming from), but I guess that would only lengthen the time between spikes, not make them any smoother.

Any other suggestions?

Thanks,

Kuba
Jakub Ouhrabka <kuba@comgate.cz> writes: > Yes, I can confirm that it's triggered by SIGUSR1 signals. OK, that confirms the theory that it's sinval-queue contention. > If I understand it correctly we have following choices now: > 1) Use only 2 cores (out of 8 cores) > 2) Lower the number of idle backends - at least force backends to do > something at different times to eliminate spikes - is "select 1" enough > to force processing the queue? Yeah, if you could get your clients to issue trivial queries every few seconds (not all at the same time) the spikes should go away. If you don't want to change your clients, one possible amelioration is to reduce the signaling threshold in SIInsertDataEntry --- instead of 70% of MAXNUMMESSAGES, maybe signal at 10% or 20%. That would make the spikes more frequent but smaller, which might help ... or not. > 3) Is there any chance of this being fixed/improved in 8.3 or even 8.2? I doubt we'd risk destabilizing 8.3 at this point, for a problem that affects so few people; let alone back-patching into 8.2. There are some other known performance problems in the sinval signaling (for instance, that a queue overflow results in cache resets for all backends, not only the slowest), so I think addressing all of them at once would be the thing to do. That would be a large enough patch that it would certainly need to go through beta testing before I'd want to inflict it on the world... This discussion has raised the priority of the problem in my mind, so I'm thinking it should be worked on in 8.4; but it's too late for 8.3. regards, tom lane
Hi Tom,

> I doubt we'd risk destabilizing 8.3 at this point, for a problem that
> affects so few people; let alone back-patching into 8.2.

Understood.

> OK, that confirms the theory that it's sinval-queue contention.

We've tried hard to identify what's filling the sinval queue. We went through the query logs as well as the function bodies stored in the database. We were not able to find any DDL, temp table creation, etc.

We did the following experiment: we stopped one of our clients, so a queue of events (i.e. rows in the db) for it to process built up. Then we started the client again and it started processing the queue - that means calling simple selects, updates and complex plpgsql function(s). And at that moment the spike started, even though it shouldn't have, given the usual periodicity. It's consistent. We are pretty sure that this client is not doing any DDL...

What should we look for to find the cause?

Thanks for any hints,

Kuba
Jakub Ouhrabka <kuba@comgate.cz> writes:
> We've tried hard to identify what's filling the sinval queue.
> We went through the query logs as well as the function bodies stored in
> the database. We were not able to find any DDL, temp table creation, etc.

Strange. The best idea that comes to mind is to add some debugging code to SendSharedInvalidMessage to log the content of each message that's sent out. That would at least tell us *what* is going into the queue, even if not directly *why*. Try something like (untested)

void
SendSharedInvalidMessage(SharedInvalidationMessage *msg)
{
	bool	insertOK;

+	elog(LOG, "sending inval msg %d %u %u %u %u %u",
+		 msg->cc.id,
+		 msg->cc.tuplePtr.ip_blkid.bi_hi,
+		 msg->cc.tuplePtr.ip_blkid.bi_lo,
+		 msg->cc.tuplePtr.ip_posid,
+		 msg->cc.dbId,
+		 msg->cc.hashValue);
+
	LWLockAcquire(SInvalLock, LW_EXCLUSIVE);
	insertOK = SIInsertDataEntry(shmInvalBuffer, msg);
	LWLockRelease(SInvalLock);

			regards, tom lane
Hi Tom, > Strange. The best idea that comes to mind is to add some debugging > code to SendSharedInvalidMessage to log the content of each message > that's sent out. That would at least tell us *what* is going into > the queue, even if not directly *why*. we've patched postgresql and run one of our plpgsql complex procedures. There are many of sinval messages - log output is below. What does it mean? Thanks, Kuba LOG: sending inval msg 30 0 26 13 30036 4294936593 LOG: sending inval msg 29 0 26 13 30036 337030170 LOG: sending inval msg 30 0 25 46 30036 4294936593 LOG: sending inval msg 29 0 25 46 30036 337030170 LOG: sending inval msg 30 0 26 13 30036 4294936593 LOG: sending inval msg 29 0 26 13 30036 337030170 LOG: sending inval msg 30 0 25 45 30036 4294936595 LOG: sending inval msg 29 0 25 45 30036 2019111801 LOG: sending inval msg 30 0 26 11 30036 4294936595 LOG: sending inval msg 29 0 26 11 30036 2019111801 LOG: sending inval msg 30 0 25 44 30036 4294936597 LOG: sending inval msg 29 0 25 44 30036 3703878920 LOG: sending inval msg 30 0 26 10 30036 4294936597 LOG: sending inval msg 29 0 26 10 30036 3703878920 LOG: sending inval msg 30 0 26 9 30036 4294936616 LOG: sending inval msg 29 0 26 9 30036 3527122063 LOG: sending inval msg 30 0 25 43 30036 4294936616 LOG: sending inval msg 29 0 25 43 30036 3527122063 LOG: sending inval msg 30 0 26 9 30036 4294936616 LOG: sending inval msg 29 0 26 9 30036 3527122063 LOG: sending inval msg 30 0 25 41 30036 4294936618 LOG: sending inval msg 29 0 25 41 30036 2126866956 LOG: sending inval msg 30 0 26 7 30036 4294936618 LOG: sending inval msg 29 0 26 7 30036 2126866956 LOG: sending inval msg 30 0 25 40 30036 4294936620 LOG: sending inval msg 29 0 25 40 30036 1941919314 LOG: sending inval msg 30 0 26 5 30036 4294936620 LOG: sending inval msg 29 0 26 5 30036 1941919314 LOG: sending inval msg 30 0 26 4 30036 4294936633 LOG: sending inval msg 29 0 26 4 30036 544523647 LOG: sending inval msg 30 0 25 39 30036 4294936633 LOG: sending inval msg 29 0 25 39 30036 544523647 LOG: sending inval msg 30 0 26 4 30036 4294936633 LOG: sending inval msg 29 0 26 4 30036 544523647 LOG: sending inval msg 30 0 25 38 30036 4294936635 LOG: sending inval msg 29 0 25 38 30036 2557582018 LOG: sending inval msg 30 0 26 3 30036 4294936635 LOG: sending inval msg 29 0 26 3 30036 2557582018 LOG: sending inval msg 30 0 25 37 30036 4294936637 LOG: sending inval msg 29 0 25 37 30036 2207280630 LOG: sending inval msg 30 0 26 2 30036 4294936637 LOG: sending inval msg 29 0 26 2 30036 2207280630 LOG: sending inval msg 30 0 26 1 30036 4294936669 LOG: sending inval msg 29 0 26 1 30036 1310188568 LOG: sending inval msg 30 0 25 36 30036 4294936669 LOG: sending inval msg 29 0 25 36 30036 1310188568 LOG: sending inval msg 30 0 26 1 30036 4294936669 LOG: sending inval msg 29 0 26 1 30036 1310188568 LOG: sending inval msg 30 0 25 35 30036 4294936671 LOG: sending inval msg 29 0 25 35 30036 2633053415 LOG: sending inval msg 30 0 25 48 30036 4294936671 LOG: sending inval msg 29 0 25 48 30036 2633053415 LOG: sending inval msg 30 0 25 33 30036 4294936673 LOG: sending inval msg 29 0 25 33 30036 2049964857 LOG: sending inval msg 30 0 25 47 30036 4294936673 LOG: sending inval msg 29 0 25 47 30036 2049964857 LOG: sending inval msg -1 0 30036 0 30700 3218341912 LOG: sending inval msg -2 2084 1663 0 30036 50335 LOG: sending inval msg -2 0 1663 0 30036 50336 LOG: sending inval msg -1 2075 30036 0 30702 30036 LOG: sending inval msg -2 0 1663 0 30036 50324 LOG: sending inval msg -1 0 30036 0 30702 30036 LOG: 
sending inval msg -2 0 1663 0 30036 50336 LOG: sending inval msg -2 0 1663 0 30036 50323 LOG: sending inval msg -1 0 30036 0 30700 30036 LOG: sending inval msg -2 0 1663 0 30036 50335 LOG: sending inval msg -2 0 1663 0 30036 50322 LOG: sending inval msg -1 0 30036 0 30698 30036 LOG: sending inval msg -2 0 1663 0 30036 50334 LOG: sending inval msg -1 0 30036 0 30677 3218341912 LOG: sending inval msg -2 2084 1663 0 30036 50332 LOG: sending inval msg -2 0 1663 0 30036 50333 LOG: sending inval msg -1 2075 30036 0 30679 30036 LOG: sending inval msg -2 0 1663 0 30036 50321 LOG: sending inval msg -1 0 30036 0 30679 30036 LOG: sending inval msg -2 0 1663 0 30036 50333 LOG: sending inval msg -2 0 1663 0 30036 50320 LOG: sending inval msg -1 0 30036 0 30677 30036 LOG: sending inval msg -2 0 1663 0 30036 50332 LOG: sending inval msg -2 0 1663 0 30036 50319 LOG: sending inval msg -1 0 30036 0 30675 30036 LOG: sending inval msg -2 0 1663 0 30036 50331 LOG: sending inval msg -1 0 30036 0 30660 3218341912 LOG: sending inval msg -2 2084 1663 0 30036 50329 LOG: sending inval msg -2 0 1663 0 30036 50330 LOG: sending inval msg -1 2075 30036 0 30662 30036 LOG: sending inval msg -2 0 1663 0 30036 50318 LOG: sending inval msg -1 0 30036 0 30662 30036 LOG: sending inval msg -2 0 1663 0 30036 50330 LOG: sending inval msg -2 0 1663 0 30036 50317 LOG: sending inval msg -1 0 30036 0 30660 30036 LOG: sending inval msg -2 0 1663 0 30036 50329 LOG: sending inval msg -2 0 1663 0 30036 50316 LOG: sending inval msg -1 0 30036 0 30658 30036 LOG: sending inval msg -2 0 1663 0 30036 50328 LOG: sending inval msg -1 0 30036 0 30624 3218341912 LOG: sending inval msg -2 2084 1663 0 30036 50326 LOG: sending inval msg -2 0 1663 0 30036 50327 LOG: sending inval msg -1 2075 30036 0 30626 30036 LOG: sending inval msg -2 0 1663 0 30036 50315 LOG: sending inval msg -1 0 30036 0 30626 30036 LOG: sending inval msg -2 0 1663 0 30036 50327 LOG: sending inval msg -2 0 1663 0 30036 50314 LOG: sending inval msg -1 0 30036 0 30624 30036 LOG: sending inval msg -2 0 1663 0 30036 50326 LOG: sending inval msg -2 0 1663 0 30036 50313 LOG: sending inval msg -1 0 30036 0 30622 30036 LOG: sending inval msg -2 0 1663 0 30036 50325 Tom Lane napsal(a): > Jakub Ouhrabka <kuba@comgate.cz> writes: >> We'we tried hard to identify what's the cause of filling sinval-queue. >> We went through query logs as well as function bodies stored in the >> database. We were not able to find any DDL, temp table creations etc. > > Strange. The best idea that comes to mind is to add some debugging > code to SendSharedInvalidMessage to log the content of each message > that's sent out. That would at least tell us *what* is going into > the queue, even if not directly *why*. Try something like (untested) > > void > SendSharedInvalidMessage(SharedInvalidationMessage *msg) > { > bool insertOK; > > + elog(LOG, "sending inval msg %d %u %u %u %u %u", > + msg->cc.id, > + msg->cc.tuplePtr.ip_blkid.bi_hi, > + msg->cc.tuplePtr.ip_blkid.bi_lo, > + msg->cc.tuplePtr.ip_posid, > + msg->cc.dbId, > + msg->cc.hashValue); > + > LWLockAcquire(SInvalLock, LW_EXCLUSIVE); > insertOK = SIInsertDataEntry(shmInvalBuffer, msg); > LWLockRelease(SInvalLock); > > regards, tom lane
Jakub Ouhrabka <kuba@comgate.cz> writes: > What does it mean? Look at src/include/storage/sinval.h and src/include/utils/syscache.h. What you seem to have here is a bunch of tuple updates in pg_class (invalidating caches 29 and 30, which in 8.2 correspond to RELNAMENSP and RELOID), followed by a bunch of SharedInvalRelcacheMsg and SharedInvalSmgrMsg. What I find interesting is that the hits are coming against nearly-successive tuple CTIDs in pg_class, eg these are all on pages 25 and 26 of pg_class: > LOG: sending inval msg 30 0 25 45 30036 4294936595 > LOG: sending inval msg 29 0 25 45 30036 2019111801 > LOG: sending inval msg 30 0 26 11 30036 4294936595 > LOG: sending inval msg 29 0 26 11 30036 2019111801 > LOG: sending inval msg 30 0 25 44 30036 4294936597 > LOG: sending inval msg 29 0 25 44 30036 3703878920 > LOG: sending inval msg 30 0 26 10 30036 4294936597 > LOG: sending inval msg 29 0 26 10 30036 3703878920 > LOG: sending inval msg 30 0 26 9 30036 4294936616 > LOG: sending inval msg 29 0 26 9 30036 3527122063 > LOG: sending inval msg 30 0 25 43 30036 4294936616 > LOG: sending inval msg 29 0 25 43 30036 3527122063 The ordering is a little strange --- not sure what's producing that. I can think of three things that might be producing this: 1. DDL operations ... but most sorts of DDL on a table would touch more catalogs than just pg_class, so this is a bit hard to credit. 2. VACUUM. 3. Some sort of direct update of pg_class. The fact that we have a bunch of catcache invals followed by relcache/smgr invals says that this all happened in one transaction, else they'd have been intermixed better. That lets VACUUM off the hook, because it processes each table in a separate transaction. I am wondering if maybe your app does one of those sneaky things like fooling with pg_class.reltriggers. If so, the problem might be soluble by just avoiding unnecessary updates. regards, tom lane
Hi Tom,

> I can think of three things that might be producing this:

We've found it: TRUNCATE.

We'll try to eliminate the use of TRUNCATE, and the periodic spikes should go away. There will still be the possibility of spikes because of database creation etc. - we'll try to handle this by issuing trivial commands from the idle backends, and in the longer run by decreasing the number of backends/databases/listens. This is the way to go, right?

One more question: is it OK to do a mass regexp update of pg_proc.prosrc changing TRUNCATEs to DELETEs? Anything we should be aware of? Sure, we'll do our own testing, but in case you see any obvious downsides...

Many thanks for your guidance!

Kuba
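For the record, the kind of mass update Kuba means would look roughly like the sketch below - it edits pg_proc directly, so dump the database and try it on a copy first, and review the result by hand (the pattern is crude and would also rewrite the word inside comments or string literals):

    UPDATE pg_proc
    SET    prosrc = regexp_replace(prosrc,
                                   'truncate[[:space:]]+(table[[:space:]]+)?',
                                   'DELETE FROM ',
                                   'gi')
    WHERE  prosrc ~* 'truncate'
      AND  prolang = (SELECT oid FROM pg_language WHERE lanname = 'plpgsql');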
Hi > > I can think of three things that might be producing this: > > we've found it: TRUNCATE I haven't been following this thread. Can someone please explain to me why TRUNCATE causes these spikes? -- Adrian Moisey System Administrator | CareerJunction | Your Future Starts Here. Web: www.careerjunction.co.za | Email: adrian@careerjunction.co.za Phone: +27 21 686 6820 | Mobile: +27 82 858 7830 | Fax: +27 21 686 6842
Jakub Ouhrabka <kuba@comgate.cz> writes: > we've found it: TRUNCATE Huh. One transaction truncating a dozen tables? That would match the sinval trace all right ... > One more question: is it ok to do mass regexp update of pg_proc.prosrc > changing TRUNCATEs to DELETEs? You might be throwing the baby out with the bathwater, performance-wise. Mass DELETEs will require cleanup by VACUUM, and that'll likely eat more cycles and I/O than you save. I'd think in terms of trying to spread out the TRUNCATEs or check to see if you really need one (maybe the table's already empty), rather than this. I do plan to look at the sinval code once 8.3 is out the door, so another possibility if you can wait a few weeks/months is to leave your app alone and see if the eventual patch looks sane to back-patch. (I don't think the community would consider doing so, but you could run a locally modified Postgres with it.) regards, tom lane
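To make the first point concrete: the DELETE-based variant only stays cheap if the dead rows get reclaimed, e.g. (table and column names are made up):

    -- Inside the application's transaction, instead of TRUNCATE:
    DELETE FROM queue_state;
    INSERT INTO queue_state (last_processed_id) VALUES (12345);

    -- Periodically, outside any transaction block (or just let autovacuum do it),
    -- so the dead row versions left by repeated DELETEs don't bloat the table:
    VACUUM queue_state;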
> Huh. One transaction truncating a dozen tables? That would match the
> sinval trace all right ...

It should be 4 tables - does the log shown suggest there were more truncates?

> You might be throwing the baby out with the bathwater,
> performance-wise.

Yes, performance was the initial reason to use TRUNCATE instead of DELETE many years ago. But today the truncated tables usually contain exactly one row - quick measurements now show that it's faster to issue DELETE instead of TRUNCATE in this case.

Again, many thanks for your invaluable advice!

Kuba
Adrian Moisey <adrian@careerjunction.co.za> writes: >> we've found it: TRUNCATE > I haven't been following this thread. Can someone please explain to me > why TRUNCATE causes these spikes? It's not so much the TRUNCATE as the overhead of broadcasting the resultant catalog changes to the many hundreds of (mostly idle) backends he's got --- all of which respond by trying to lock the shared sinval message queue at about the same time. You could see the whole thread as an object lesson in why connection pooling is a good idea. But certainly it seems that sinval is the next bottleneck in terms of being able to scale Postgres up to very large numbers of backends. regards, tom lane
Jakub Ouhrabka <kuba@comgate.cz> writes: >>> Huh. One transaction truncating a dozen tables? That would match the >>> sinval trace all right ... > It should be 4 tables - the shown log looks like there were more truncates? Actually, counting up the entries, there are close to 2 dozen relations apparently being truncated in the trace you showed. But that might be only four tables at the user level, since each index on these tables would appear separately, and you might have a toast table plus index for each one too. If you want to dig down, the table OIDs are visible in the trace, in the messages with type -1: >> LOG: sending inval msg -1 0 30036 0 30700 3218341912 ^^^^^ ^^^^^ DBOID RELOID so you could look into pg_class to confirm what's what. > Yes, performance was the initial reason to use truncate instead of > delete many years ago. But today the truncated tables usualy contain > exactly one row - quick measurements now show that it's faster to issue > delete instead of truncate in this case. Okay, for a table of just a few entries I agree that DELETE is probably better. But don't forget you're going to need to have those tables vacuumed fairly regularly now, else they'll start to bloat. regards, tom lane
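The pg_class lookup Tom describes, using a few of the relation OIDs from the -1 messages in the trace (run it while connected to the database whose OID is 30036):

    SELECT oid, relname, relkind
    FROM   pg_class
    WHERE  oid IN (30700, 30702, 30698, 30679, 30677, 30675);
    -- relkind distinguishes the tables themselves ('r') from their
    -- indexes ('i') and toast tables ('t').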
> Okay, for a table of just a few entries I agree that DELETE is probably
> better.  But don't forget you're going to need to have those tables
> vacuumed fairly regularly now, else they'll start to bloat.

I think we'll go with DELETE for another reason as well: just after we
figured out the cause of the spikes, we started to investigate a
long-term issue we had with PostgreSQL - pg_dump of a big database was
blocking some of our applications. And yes, we replaced TRUNCATE with
DELETE and everything is running as expected.

Looking at the docs now, I see there is a new paragraph in the 8.3 docs
mentioning that TRUNCATE is not MVCC-safe, and also the blocking issue.
It's a pity that the warning wasn't there back in the 7.1 days :-)

Thanks,

Kuba

Tom Lane wrote:
> Jakub Ouhrabka <kuba@comgate.cz> writes:
> >>>> Huh.  One transaction truncating a dozen tables?  That would match the
> >>>> sinval trace all right ...
>
> >> It should be 4 tables - but the log shown suggests there were more
> >> truncates?
>
> Actually, counting up the entries, there are close to 2 dozen relations
> apparently being truncated in the trace you showed.  But that might be
> only four tables at the user level, since each index on these tables
> would appear separately, and you might have a toast table plus index
> for each one too.  If you want to dig down, the table OIDs are visible
> in the trace, in the messages with type -1:
>
> >>> LOG: sending inval msg -1 0 30036 0 30700 3218341912
>                               ^^^^^   ^^^^^
>                               DBOID   RELOID
>
> so you could look into pg_class to confirm what's what.
>
> >> Yes, performance was the initial reason to use TRUNCATE instead of
> >> DELETE many years ago. But today the truncated tables usually contain
> >> exactly one row - quick measurements now show that it's faster to
> >> issue DELETE instead of TRUNCATE in this case.
>
> Okay, for a table of just a few entries I agree that DELETE is probably
> better.  But don't forget you're going to need to have those tables
> vacuumed fairly regularly now, else they'll start to bloat.
>
> 			regards, tom lane
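A minimal sketch of the replacement pattern, with a made-up one-row table
named current_state, is simply a DELETE inside the transaction plus a
regular VACUUM (or autovacuum) so the table doesn't bloat:

    -- current_state is a hypothetical one-row table
    BEGIN;
    DELETE FROM current_state;
    INSERT INTO current_state (val) VALUES (1);
    COMMIT;

    -- run periodically (e.g. from cron) now that rows are deleted rather
    -- than truncated, so dead tuples get reclaimed
    VACUUM current_state;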
On Mon, 2008-01-07 at 19:54 -0500, Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Perhaps it would make sense to try to take the "fast path" in
> > SIDelExpiredDataEntries with only a shared lock rather than exclusive.
>
> I think the real problem here is that sinval catchup processing is well
> designed to create contention :-(.

Thinking some more about handling TRUNCATEs...

> Some ideas for improving matters:
>
> 1. Try to avoid having all the backends hit the queue at once.  Instead
> of SIGUSR1'ing everybody at the same time, maybe hit only the process
> with the oldest message pointer, and have him hit the next oldest after
> he's done reading the queue.
>
> 2. Try to take more than one message off the queue per SInvalLock cycle.
> (There is a tuning tradeoff here, since it would mean holding the lock
> for longer at a time.)
>
> 3. Try to avoid having every backend run SIDelExpiredDataEntries every
> time through ReceiveSharedInvalidMessages.  It's not critical to delete
> entries until the queue starts getting full --- maybe we could rejigger
> the logic so it only happens once when somebody notices the queue is
> getting full, or so that only the guy(s) who had nextMsgNum == minMsgNum
> do it, or something like that?

(2) is unnecessary if we can reduce the number of exclusive lockers so
that repeated access to the backend's messages is not contended.

(1) would do this, but seems like it would be complex. We can reduce the
possibility of multiple re-signals, though.

(3) seems like the easiest route, as long as we find a reasonable
algorithm for bringing the access rate down to an acceptable level.

I'm posting a patch for discussion to -patches now that will do this. It
seems straightforward enough to include in 8.3; that may raise a few
eyebrows, but read the patch first.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com
On Fri, 25 Jan 2008, Simon Riggs wrote:
>> 1. Try to avoid having all the backends hit the queue at once.  Instead
>> of SIGUSR1'ing everybody at the same time, maybe hit only the process
>> with the oldest message pointer, and have him hit the next oldest after
>> he's done reading the queue.

My feeling is that an "obvious" way to deal with this is to implement
some sort of "random early detect". That is, randomly SIGUSR1 processes
as entries are added to the queue. The signals should become more
frequent as the queue length increases, until we reach the current
cut-off of signalling everyone when the queue really is full. The hope
would be that it never actually gets that far.

Matthew
Added to TODO:

* Improve performance of shared invalidation queue for multiple CPUs

  http://archives.postgresql.org/pgsql-performance/2008-01/msg00023.php

---------------------------------------------------------------------------

Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Perhaps it would make sense to try to take the "fast path" in
> > SIDelExpiredDataEntries with only a shared lock rather than exclusive.
>
> I think the real problem here is that sinval catchup processing is well
> designed to create contention :-(.  Once we've decided that the message
> queue is getting too full, we SIGUSR1 all the backends at once (or as
> fast as the postmaster can do it anyway), then they all go off and try
> to touch the sinval queue.  Backends that haven't awoken even once
> since the last time will have to process the entire queue contents,
> and they're all trying to do that at the same time.  What's worse, they
> take and release the SInvalLock once for each message they take off the
> queue.  This isn't so horrid for one-core machines (since each process
> will monopolize the CPU for probably less than one timeslice while it's
> catching up) but it's pretty obvious where all the contention is coming
> from on an 8-core.
>
> Some ideas for improving matters:
>
> 1. Try to avoid having all the backends hit the queue at once.  Instead
> of SIGUSR1'ing everybody at the same time, maybe hit only the process
> with the oldest message pointer, and have him hit the next oldest after
> he's done reading the queue.
>
> 2. Try to take more than one message off the queue per SInvalLock cycle.
> (There is a tuning tradeoff here, since it would mean holding the lock
> for longer at a time.)
>
> 3. Try to avoid having every backend run SIDelExpiredDataEntries every
> time through ReceiveSharedInvalidMessages.  It's not critical to delete
> entries until the queue starts getting full --- maybe we could rejigger
> the logic so it only happens once when somebody notices the queue is
> getting full, or so that only the guy(s) who had nextMsgNum == minMsgNum
> do it, or something like that?
>
> 			regards, tom lane

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +