Thread: Idle processes chewing up CPU?
I recently migrated from MSSQL2000 to Postgres 8.3 for Windows, and overall it’s running great.
Using the Process Explorer tool, I've noticed that a child postgres.exe is chewing up 25% of total CPU (we have two dual-core CPUs, so presumably it's saturating one core). Using SELECT * FROM pg_stat_activity, I located the process id (#3884), and it showed:
datid | datname | procpid | usesysid | usename | current_query | waiting | xact_start | query_start | backend_start | client_addr | client_port |
11511 | postgres | 3884 | 16395 | Brendan.Hill | <IDLE> | f | 2009-07-28 13:54:50.055+10 | 2009-07-28 13:54:44.508+10 | ### | 3353 |
No processes are waiting. This was taken at 2:19pm, so it had been running like this for about 25 minutes. This has happened before - two runaway processes each chewed up 25% of the CPU for several weeks, and we had to kill them forcefully before we could restart the Postgres service.
Other requests are being served; however, we've had reports of slowness (which sparked the initial investigation). I can't confirm this runaway process is the cause, but it shouldn't be happening either way. I've killed it just in case.
Any thoughts on what is causing this, or how I could diagnose the problem further?
Regards,
Brendan Hill
Chief Information Officer
Jims Group Pty Ltd
48 Edinburgh Rd
Mooroolbark VIC 3138
For all Jims IT enquiries: infotech@jims.net
For emergencies: 1300 130 490 (intl +61 4 3456 5776)
"Brendan Hill" <brendanh@jims.net> writes: > Using the Process Explorer tool, I've noticed that a child postgres.exe is > chewing up 25% of the CPU usage each (we have two dual-core CPUs, presumably > it's chewing up one core). Using SELECT * FROM pg_stat_activity, I located > the process id (#3884), and it showed: > <IDLE> Hmph. When a new command is received, there's some amount of parsing overhead that happens before it changes the status display to not-IDLE, but it shouldn't be minutes' worth. Can you get a stack trace next time that happens? regards, tom lane
Hi Tom,

Given it's on Windows, any suggestion for how I would get hold of this? (Process Monitor tool perhaps?)

Regards,

-Brendan
On Tue, Jul 28, 2009 at 7:17 PM, Brendan Hill <brendanh@jims.net> wrote:
> Hi Tom,
>
> Given it's on Windows, any suggestion for how I would get hold of this?
> (Process Monitor tool perhaps?)

I'd bet there's a windows faq somewhere on system monitoring (googles). Is this at all helpful, or is this problem beyond it?

http://technet.microsoft.com/en-us/sysinternals/default.aspx
Brendan Hill wrote:
> Hi Tom,
>
> Given it's on Windows, any suggestion for how I would get hold of this?
> (Process Monitor tool perhaps?)

I think you can get stack traces from Process Monitor using "Tools -> Stack Summary". I find it a bit hard to interpret this data, though, and I'm not sure how useful it is for this sort of thing.

[ The following instructions may be put on the PostgreSQL wiki as advice for getting debugging details for runaway PostgreSQL processes on Windows if desired ]:

You're better off using Process Explorer in conjunction with the Debugging Tools for Windows.

Install the Debugging Tools for Windows:

  http://www.microsoft.com/whdc/devtools/debugging/installx86.mspx

and Process Explorer:

  http://technet.microsoft.com/en-us/sysinternals/bb896653.aspx

Now launch Process Explorer (procexp.exe) and in the Options menu choose "Configure symbols". Replace the dbghelp.dll path with:

  C:\Program Files\Debugging Tools for Windows (x86)\dbghelp.dll

In the same dialog put this string in the symbol path field (all one line, the only space being between "Program" and "Files"):

  C:\Program Files\PostgreSQL\8.4\symbols;SRV*c:\localsymbols*http://msdl.microsoft.com/download/symbols;

(Change the path to your PostgreSQL symbols directory as appropriate for your version.)

This will tell Process Explorer where to look for PostgreSQL debug symbols, and tell it to use the Microsoft symbol server to get windows symbols, caching them in c:\localsymbols, which it will create if required.

Now you can choose "Properties" on a process in Process Explorer and in the Thread tab see the active threads and what they're doing. You should see something like:

  TID     CSwitch delta   Start Address
  1260    8               postgres.exe!mainCRTStartup
  2792                    postgres.exe!pg_signal_thread

(If you see lots of references to "pg_init" or similar instead, your symbol path is wrong and Process Explorer can't find the PostgreSQL symbols.)

Now you need to identify the active thread and get a stack trace from it. It'll usually have a non-zero cswitch delta. Select it and click "Stack". After a short delay, a stack trace will be shown. Select it all, and click "Copy".

Here's a stack trace obtained from PostgreSQL 8.4 while it's executing:

  select generate_series(1,100000000000000);

... just so you have some idea what to expect:

ntkrnlpa.exe!NtInitialUserProcessBuffer+0x26
ntkrnlpa.exe!RtlFreeHeapSlowly+0x88
ntkrnlpa.exe!NtCallbackReturn+0x29
ntkrnlpa.exe!MiGrowWsleHash+0xb0
ntkrnlpa.exe!MiTrimWorkingSet+0xc4
hal.dll!HalpApcInterrupt+0xc6
postgres.exe!ExecProcNode+0x5
postgres.exe!ExecutePlan+0x93
postgres.exe!standard_ExecutorRun+0x7a
postgres.exe!PortalRunSelect+0x6a
postgres.exe!PortalRun+0x14f
postgres.exe!exec_simple_query+0x381
postgres.exe!PostgresMain+0xc67
postgres.exe!BackendRun+0x204
postgres.exe!SubPostmasterMain+0x224
postgres.exe!main+0x177
postgres.exe!__tmainCRTStartup+0x10f
kernel32.dll!BaseProcessStart+0x23

You can also, if you prefer, use windbg.exe from the Debugging Tools for Windows to get a stack trace. This is widely documented.

--
Craig Ringer
Craig Ringer wrote:
> Brendan Hill wrote:
>> Hi Tom,
>>
>> Given it's on Windows, any suggestion for how I would get hold of this?
>> (Process Monitor tool perhaps?)
>
> I think you can get stack traces from Process Monitor using "Tools ->
> Stack Summary". I find it a bit hard to interpret this data, though, and
> I'm not sure how useful it is for this sort of thing.
>
> [ The following instructions may be put on the PostgreSQL wiki as advice
> for getting debugging details for runaway PostgreSQL processes on
> Windows if desired ]:

Actually, I've expanded on the instructions and done it. See:

http://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Windows

Accessible from "General Articles and Guides" -> "Troubleshooting" -> "Generating_a_stack_trace_of_a_PostgreSQL_backend".

It'd be rather helpful if others could fill in the equivalent for gdb on Linux/bsd/other unix as linked to here:

http://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend

--
Craig Ringer
On Wed, Jul 29, 2009 at 12:08, Craig Ringer <craig@postnewspapers.com.au> wrote:
> Actually, I've expanded on the instructions and done it. See:
>
> http://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Windows
>
> Accessible from "General Articles and Guides" -> "Troubleshooting" ->
> "Generating_a_stack_trace_of_a_PostgreSQL_backend".

This is very useful, thanks for putting it up!

--
Magnus Hagander
Self: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Hi all,

We managed to trap further details of the problem today. Some large queries were terminated prematurely, perhaps causing a dirty SSL or TCP disconnect, and they seem to have left a few runaway processes. SELECT * FROM pg_stat_activity showed each process was <IDLE>, while Process Explorer showed them each chewing up 25% CPU time (ie. one core on a quad core system).

I copied a few of the stack traces (at the end of this email); they kept changing each time I looked. Mostly LIBEAY32.DLL related, suggesting some connection to SSL - our bin\libeay32.DLL is version 0.9.8.5, last modified 2007/02/28.

In case it was simply a dirty SSL disconnect, I tried running a large query, unplugged my network cable, and monitored the process. Rather than getting stuck in a loop, the child process just shut down at the end of the query.

So I'd appreciate any advice you have on what may be causing this and how we can get around it in future. If necessary, we'll write a temporary program to poll pg_stat_activity and the currently running processes, and alert us if one goes <IDLE> while chewing CPU, but obviously this isn't a long term solution.

Thanks for your help,

-Brendan

ntkrnlpa.exe+0x8dafe
ntkrnlpa.exe+0x29a82
ntkrnlpa.exe+0x33198
hal.dll+0x6199
hal.dll+0x63d9
hal.dll+0x6577
hal.dll+0x3902
mswsock.dll+0x1445
WSOCK32.dll!recv+0x31
LIBEAY32.dll!BIO_sock_should_retry+0x57

ntkrnlpa.exe+0x8dafe
ntkrnlpa.exe+0x29a82
ntkrnlpa.exe+0x33198
hal.dll+0x6199
hal.dll+0x63d9
hal.dll+0x6577
hal.dll+0x3902
postgres.exe!process_implied_equality+0x18d50e

ntkrnlpa.exe+0x8dafe
ntkrnlpa.exe+0x29a82
ntkrnlpa.exe+0x33198
hal.dll+0x6199
hal.dll+0x63d9
hal.dll+0x6577
hal.dll+0x3902
LIBEAY32.dll!ERR_get_state

ntkrnlpa.exe+0x8dafe
ntkrnlpa.exe+0x29a82
ntkrnlpa.exe+0x33198
hal.dll+0x6199
hal.dll+0x63d9
hal.dll+0x6577
hal.dll+0x3902
ntkrnlpa.exe+0x89751
ntdll.dll!KiFastSystemCallRet
WS2_32.dll!WSARecv+0x65
WSOCK32.dll!recv+0x31
LIBEAY32.dll!BIO_sock_should_retry+0x57

ntkrnlpa.exe+0x8dafe
ntkrnlpa.exe+0x29a82
ntkrnlpa.exe+0x33198
hal.dll+0x6199
hal.dll+0x63d9
hal.dll+0x6456
ntkrnlpa.exe+0x312be
ntkrnlpa.exe+0x2ab9b
ntkrnlpa.exe+0x1e257
afd.sys+0x11905
afd.sys+0x10978
afd.sys+0xf097
ntkrnlpa.exe+0x1df85
ntkrnlpa.exe+0xf5437
ntkrnlpa.exe+0xf61bf
ntkrnlpa.exe+0xeed08
ntkrnlpa.exe+0x897bc
ntdll.dll!KiFastSystemCallRet
WS2_32.dll!WSARecv+0x65
WSOCK32.dll!recv+0x31
LIBEAY32.dll!BIO_sock_should_retry+0x57
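A minimal sketch of the kind of temporary watchdog Brendan mentions above might look like the following. This is hypothetical and not from the thread: it assumes libpq is available, the connection string is a placeholder, it uses the 8.3-era column names (procpid, current_query), and it only lists backends pg_stat_activity reports as <IDLE>; cross-checking CPU usage (e.g. with Process Explorer or GetProcessTimes) is left to the operator or a scheduler that runs it periodically.

```c
/*
 * watchdog.c - hypothetical sketch of a poller for idle backends.
 * Assumes PostgreSQL 8.3 column names and libpq; run it from a
 * scheduler (e.g. every minute) and alert on the output.
 */
#include <stdio.h>
#include <libpq-fe.h>

int main(void)
{
    /* Placeholder connection string - adjust for the real server. */
    PGconn   *conn = PQconnectdb("host=localhost dbname=postgres user=postgres");
    PGresult *res;
    int       i;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }

    /* 8.3 uses procpid and current_query; later releases renamed these. */
    res = PQexec(conn,
                 "SELECT procpid, usename, query_start "
                 "FROM pg_stat_activity "
                 "WHERE current_query = '<IDLE>'");

    if (PQresultStatus(res) == PGRES_TUPLES_OK)
    {
        /* Print candidates; the CPU check happens outside this sketch. */
        for (i = 0; i < PQntuples(res); i++)
            printf("idle backend pid=%s user=%s since %s\n",
                   PQgetvalue(res, i, 0),
                   PQgetvalue(res, i, 1),
                   PQgetvalue(res, i, 2));
    }
    else
        fprintf(stderr, "query failed: %s", PQerrorMessage(conn));

    PQclear(res);
    PQfinish(conn);
    return 0;
}
```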
On Wed, 2009-08-05 at 15:26 +1000, Brendan Hill wrote:
> I copied a few of the stack traces (at the end of this email); they kept
> changing each time I looked.

Yep, that's to be expected. If the process is busy, unless it's in a _REALLY_ simple infinite loop, it'll be looping through some non-trivial sequence of function calls. Thus, most stack traces will be different.

Unfortunately, you don't appear to have set up your debugging environment, so your stack traces aren't annotated with any sort of symbolic information that might tell the reader what's actually happening. The numeric offsets can be converted to symbolic names if you know the _EXACT_ version of the module (eg "hal.dll") in question, but you haven't provided that, and even if you did it's a pain to do.

If you did set your symbol path, you might have an overzealous software firewall interfering with requests to the symbol server. Make sure that HTTP (port 80) access to msdl.microsoft.com for windbg.exe and procexp.exe is unimpeded, or download the symbol bundle for your OS (warning: huge download) from http://www.microsoft.com/whdc/devtools/debugging/symbolpkg.mspx .

Because your symbol path doesn't appear to be configured, there's no guarantee the symbols for postgres.exe are right either, and more importantly there's no "depth" to the trace; it doesn't show the call path within postgres / libeay32 / etc. Take this trace, for example:

ntkrnlpa.exe+0x8dafe
ntkrnlpa.exe+0x29a82
ntkrnlpa.exe+0x33198
hal.dll+0x6199
hal.dll+0x63d9
hal.dll+0x6577
hal.dll+0x3902
postgres.exe!process_implied_equality+0x18d50e

process_implied_equality(...) from backend/optimizer/plan/initsplan.c is deep in the query planner, and is "currently used only when an EquivalenceClass is found to contain pseudoconstants". It's not going to be calling into hal.dll - not, at least, without a fairly long call chain between it and hal.dll. So we can't trust that postgres was even in the "process_implied_equality" function when it called into hal.dll, and if it was we don't have any idea how it got there. The call trace is basically useless.

(New section added to the wiki article for this: http://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Windows#How_to_make_sure_a_stack_trace_is_useful )

A stack trace that includes symbols for the kernel / HAL calls should look something like this:

ntkrnlpa.exe!KiSwapContext+0x2f
ntkrnlpa.exe!KiSwapThread+0x8a
ntkrnlpa.exe!KeWaitForSingleObject+0x1c2
ntkrnlpa.exe!KiSuspendThread+0x18
ntkrnlpa.exe!KiDeliverApc+0x124
ntkrnlpa.exe!KiSwapThread+0xa8
ntkrnlpa.exe!KeWaitForMultipleObjects+0x284
ntkrnlpa.exe!NtWaitForMultipleObjects+0x297
ntkrnlpa.exe!KiFastCallEntry+0xfc
ntdll.dll!KiFastSystemCallRet
ntdll.dll!ZwWaitForMultipleObjects+0xc
kernel32.dll!WaitForMultipleObjectsEx+0x12c
postgres.exe!pgwin32_waitforsinglesocket+0x1f0
postgres.exe!pgwin32_recv+0x90
postgres.exe!secure_read+0x17d
postgres.exe!pq_recvbuf+0x71
postgres.exe!pq_getbyte+0x15
postgres.exe!SocketBackend+0x6
postgres.exe!PostgresMain+0xbe8
postgres.exe!BackendRun+0x204
postgres.exe!SubPostmasterMain+0x224
postgres.exe!main+0x177
postgres.exe!__tmainCRTStartup+0x10f
kernel32.dll!BaseProcessStart+0x23

See how it has "ModuleNames!SymbolicNames+offsets" instead of just "ModuleNames+offsets"?

--
Craig Ringer
On Wed, 2009-08-05 at 16:44 +1000, Brendan Hill wrote:
> Hi Craig,
>
> Sorry, I had the stack trace so I thought it was enough. I'll make sure the
> debug environment is set up and post the full stack traces again.

No worries. Sorry it cost you time.

I've extended the wiki article on win32 debug info to (hopefully) explain how to identify a useful stack trace, or at least a likely bogus one. If you feel like taking a look I'd value your opinion, especially when it comes to clarity/readability.

http://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Windows

--
Craig Ringer
Hi Craig,

Sorry, I had the stack trace so I thought it was enough. I'll make sure the debug environment is set up and post the full stack traces again.

-Brendan
Hi Craig, the instructions are great. I've gone through them and confirmed I'm getting the full details in the stack traces. When it happens again, I'll post the most useful stack traces and we'll be able to look at it properly.

Thanks again for your help so far, the responsiveness is terrific.

-Brendan
Hi Craig/Tom,

I've managed to trap the full stack trace this time - 2 processes chewing up 25% each (1 core each on a quad core server), while SELECT * FROM pg_stat_activity revealed they were <IDLE>.

I also confirmed that the two runaway processes were started by a developer remotely connecting pgAdmin via SSL. It appears that no actual queries were run, just the connection established. Unfortunately I couldn't confirm if the connections were dirty disconnected.

Relevant DLL versions are:

  8.3\bin\LIBEAY32.DLL - 0.9.8.5
  8.3\bin\SSLEAY32.DLL - 0.9.8.5
  Other DLLs - standard for Windows 2003 Server SP2

The juiciest stack traces I captured (in no particular order) were:

ntkrnlpa.exe!KiUnexpectedInterrupt+0x48
ntkrnlpa.exe!KeWaitForSingleObject+0x346
ntkrnlpa.exe!ZwYieldExecution+0x3514
hal.dll!KfRaiseIrql+0xe5
hal.dll!KeRaiseIrqlToSynchLevel+0x8d
hal.dll!HalEndSystemInterrupt+0x67
hal.dll!HalInitializeProcessor+0x856
LIBEAY32.dll!lh_doall_arg+0x1c4

ntkrnlpa.exe!KiUnexpectedInterrupt+0x48
ntkrnlpa.exe!KeWaitForSingleObject+0x346
ntkrnlpa.exe!ZwYieldExecution+0x3514
hal.dll!KfRaiseIrql+0xe5
hal.dll!KeRaiseIrqlToSynchLevel+0x8d
hal.dll!KfLowerIrql+0x62
ntkrnlpa.exe!ZwYieldExecution+0x163a
ntkrnlpa.exe!KeInsertQueueApc+0x57
ntkrnlpa.exe!IoCancelIrp+0x27d
hal.dll!HalEndSystemInterrupt+0x6e
hal.dll!HalInitializeProcessor+0x856
ntkrnlpa.exe!IofCallDriver+0x45
ntkrnlpa.exe!NtWriteFile+0x2943
ntkrnlpa.exe!NtWriteFile+0x36cb
ntkrnlpa.exe!NtDeviceIoControlFile+0x2a
ntkrnlpa.exe!KeReleaseInStackQueuedSpinLockFromDpcLevel+0xb64
ntdll.dll!KiFastSystemCallRet
WS2_32.dll!WSARecv+0x65
WSOCK32.dll!recv+0x31
LIBEAY32.dll!BIO_sock_should_retry+0x57
postgres.exe!my_sock_read+0x1b
LIBEAY32.dll!BIO_read+0x6f
SSLEAY32.dll!SSLv3_client_method+0x1ee1
SSLEAY32.dll!SSLv3_client_method+0x22ea
mswsock.dll!StartWsdpService+0x500
SSLEAY32.dll!SSLv3_client_method+0x225a
SSLEAY32.dll!SSLv3_client_method+0x2a15
postgres.exe!pgwin32_waitforsinglesocket+0x1ed
postgres.exe!secure_read+0x26
postgres.exe!pq_recvbuf+0x71
postgres.exe!pq_getbyte+0x15
postgres.exe!SocketBackend+0x6
postgres.exe!PostgresMain+0xbf8
postgres.exe!BackendRun+0x200
postgres.exe!SubPostmasterMain+0x21d
postgres.exe!main+0x177
postgres.exe!__tmainCRTStartup+0x10f
kernel32.dll!ProcessIdToSessionId+0x209

ntkrnlpa.exe!ZwYieldExecution+0x163a
ntkrnlpa.exe!KeSetEvent+0xcc
ntkrnlpa.exe!PsLookupThreadByThreadId+0x54bc
ntkrnlpa.exe!KiDeliverApc+0xbb
ntkrnlpa.exe!ZwYieldExecution+0x3801
ntkrnlpa.exe!KeWaitForSingleObject+0x346
ntkrnlpa.exe!ZwYieldExecution+0x3514
ntkrnlpa.exe!KiCheckForKernelApcDelivery+0x1c
ntkrnlpa.exe!wctomb+0x4229
ntkrnlpa.exe!IofCallDriver+0x45
ntkrnlpa.exe!NtWriteFile+0x2943
ntkrnlpa.exe!NtWriteFile+0x36cb
ntkrnlpa.exe!NtDeviceIoControlFile+0x2a
ntkrnlpa.exe!KeReleaseInStackQueuedSpinLockFromDpcLevel+0xb64
ntdll.dll!KiFastSystemCallRet
WS2_32.dll!WSARecv+0x65
WSOCK32.dll!recv+0x31
LIBEAY32.dll!BIO_sock_should_retry+0x57
postgres.exe!my_sock_read+0x1b
LIBEAY32.dll!BIO_read+0x6f
SSLEAY32.dll!SSLv3_client_method+0x1ee1
SSLEAY32.dll!SSLv3_client_method+0x22ea
mswsock.dll!StartWsdpService+0x500
SSLEAY32.dll!SSLv3_client_method+0x225a
SSLEAY32.dll!SSLv3_client_method+0x2a15
postgres.exe!pgwin32_waitforsinglesocket+0x1ed
postgres.exe!secure_read+0x26
postgres.exe!pq_recvbuf+0x71
postgres.exe!pq_getbyte+0x15
postgres.exe!SocketBackend+0x6
postgres.exe!PostgresMain+0xbf8
postgres.exe!BackendRun+0x200
postgres.exe!SubPostmasterMain+0x21d
postgres.exe!main+0x177
postgres.exe!__tmainCRTStartup+0x10f
kernel32.dll!ProcessIdToSessionId+0x209
ntkrnlpa.exe!KiUnexpectedInterrupt+0x48
ntkrnlpa.exe!KeWaitForSingleObject+0x346
ntkrnlpa.exe!ZwYieldExecution+0x3514
ntkrnlpa.exe!KiCheckForKernelApcDelivery+0x1c
ntkrnlpa.exe!NtWriteFile+0x3195
ntkrnlpa.exe!NtDeviceIoControlFile+0x2a
ntkrnlpa.exe!KeReleaseInStackQueuedSpinLockFromDpcLevel+0xb64
ntdll.dll!KiFastSystemCallRet
WS2_32.dll!WSARecv+0x65
WSOCK32.dll!recv+0x31
LIBEAY32.dll!BIO_sock_should_retry+0x57
postgres.exe!my_sock_read+0x1b
LIBEAY32.dll!BIO_read+0x6f
SSLEAY32.dll!SSLv3_client_method+0x1ee1
SSLEAY32.dll!SSLv3_client_method+0x22ea
mswsock.dll!StartWsdpService+0x500
SSLEAY32.dll!SSLv3_client_method+0x225a
SSLEAY32.dll!SSLv3_client_method+0x2a15
postgres.exe!pgwin32_waitforsinglesocket+0x1ed
postgres.exe!secure_read+0x26
postgres.exe!pq_recvbuf+0x71
postgres.exe!pq_getbyte+0x15
postgres.exe!SocketBackend+0x6
postgres.exe!PostgresMain+0xbf8
postgres.exe!BackendRun+0x200
postgres.exe!SubPostmasterMain+0x21d
postgres.exe!main+0x177
postgres.exe!__tmainCRTStartup+0x10f
kernel32.dll!ProcessIdToSessionId+0x209

I'd appreciate any help diagnosing this problem - cutting off remote access via SSL isn't the ideal solution.

Regards,

-Brendan
On 19/08/2009 12:31 PM, Brendan Hill wrote:
> Hi Craig/Tom,
>
> I've managed to trap the full stack trace this time

The common part of those traces is:

> ntdll.dll!KiFastSystemCallRet
> WS2_32.dll!WSARecv+0x65
> WSOCK32.dll!recv+0x31
> LIBEAY32.dll!BIO_sock_should_retry+0x57
> postgres.exe!my_sock_read+0x1b
> LIBEAY32.dll!BIO_read+0x6f
> SSLEAY32.dll!SSLv3_client_method+0x1ee1
> SSLEAY32.dll!SSLv3_client_method+0x22ea
> mswsock.dll!StartWsdpService+0x500
> SSLEAY32.dll!SSLv3_client_method+0x225a
> SSLEAY32.dll!SSLv3_client_method+0x2a15
> postgres.exe!pgwin32_waitforsinglesocket+0x1ed
> postgres.exe!secure_read+0x26
> postgres.exe!pq_recvbuf+0x71
> postgres.exe!pq_getbyte+0x15
> postgres.exe!SocketBackend+0x6
> postgres.exe!PostgresMain+0xbf8
> postgres.exe!BackendRun+0x200
> postgres.exe!SubPostmasterMain+0x21d
> postgres.exe!main+0x177
> postgres.exe!__tmainCRTStartup+0x10f
> kernel32.dll!ProcessIdToSessionId+0x209

Now, it's not possible to tell for sure from the traces alone whether this part of the trace shows the same instances of the same function calls, or whether there's a loop happening such that (eg) pgwin32_waitforsinglesocket is being called over and over and over. To find that out, you'd need to attach a debugger and set a breakpoint somewhere suitable.

Personally, though, as a somewhat informed stab in the dark I suspect that the above part of the call stack is actually entered once and not left. I'd say that when Pg calls my_sock_read(...), resulting in a call to recv(...) and from there a kernel system call, that's as far as it goes. The system call never returns.

Why? I suspect you have a buggy network driver or faulty network card. The unexpected interrupt handler being called in one of the stack traces certainly has to make you wonder.

> I'd appreciate any help diagnosing this problem - cutting off remote access
> via SSL isn't the ideal solution.

I'd replace the NIC with one from a different manufacturer, at least temporarily. I won't be shocked if the problem goes away.

--
Craig Ringer
Hi Craig, thanks for the analysis. If I attach a debugger to the runaway child process, will this halt execution for all the other child processes (ie. freeze the server)? And can I attach Visual Studio C++ 2008, or is there a recommended debugger for Windows debugging?

Given the reliability of the server in the past, I'd probably be expecting an issue with OpenSSL instead, but with debugging attached I should be able to say for sure.

Regards,

-Brendan
On 19/08/2009 1:34 PM, Brendan Hill wrote:
> Hi Craig, thanks for the analysis. If I attach a debugger to the runaway
> child process, will this halt execution for all the other child processes
> (ie. freeze the server)? And can I attach Visual Studio C++ 2008, or is
> there a recommended debugger for Windows debugging?

Visual C++ 2008's debugger should be fine - and it's certainly a lot nicer to use than windbg.exe. LOTS nicer. I'm surprised you have VS 2008 on your production server, though - or are you planning on using remote debugging?

Anyway: if you attach to a given backend, execution of the other backends won't freeze. If you promptly unpause execution of the backend you attached to, everything will run normally. You might not want to interrupt the backend's execution for too long at a time, though, as my understanding is that Pg does have tasks that require synchronization across all backends, and leaving one in a state of paused execution for too long might slow things down.

I did some quick testing before posting. First, I downloaded and unpacked the 8.4.0 sources, since that's what I'm running on my workstation. I then established two sessions to an otherwise idle 8.4 DB on WinXP, then attached VS 2008 EE's debugger to one of them: Tools -> Attach to Process, check "show processes from all users", select the target postgres.exe by pid, attach. It took a while for VS to load symbols for the first time, but the other backend was responsive during that time. When VS finished loading symbols it auto-resumed execution of the backend.

When I pause execution the other backend remains responsive. I can still establish new connections too.

With execution running normally I added a breakpoint at pq_recvbuf:

  Debug -> New Breakpoint -> Break at Function (CTRL-B), "pq_recvbuf", line 1 char 1, language "C", OK

then issued a query to the backend I was debugging. It processed the query and then execution stopped at the breakpoint. I was prompted to locate the source file I'd broken in, and when I did so it showed an execution marker at the appropriate point, I could step execution through the sources, etc. When I was done, I just detached from the process with Tools -> Detach All, leaving it running as before.

In your position I'd start by waiting until you have an out-of-control backend, attaching to it without pausing it, and setting a breakpoint at my_sock_read. If the breakpoint is hit then something's called my_sock_read again; it won't trigger if my_sock_read is somewhere on the call stack, only when the current point of execution enters the function. You can step through execution from there to see where it's looping. If you find that my_sock_read isn't being called repeatedly, then the infinite loop is in my_sock_read or something it's calling. Break into execution and step through to see what Pg is doing.

> Given the reliability of the server in the past, I'd probably be expecting
> an issue with OpenSSL instead, but with debugging attached I should be able
> to say for sure.

Yep. If, for example, you waited until a backend was in the problem state where it was using 100% CPU, attached the debugger, and set a breakpoint at the start of my_sock_read in postgres.exe, then you could see if my_sock_read(...) was being called repeatedly or just once.

--
Craig Ringer
Hi Craig,

I've debugged the runaway process, though I'm not sure of the solution yet. My best interpretation is that an SSL client dirty disconnected while running a request. This caused an infinite loop in pq_recvbuf(), calling secure_read(), triggering my_sock_read() over and over. Calling SSL_get_error() in secure_read() returns 10045 (either connection reset, or WSAEOPNOTSUPP, I'm not sure) - after this, pq_recvbuf() appears to think errno=EINTR has occurred, so it immediately tries again.

I'd appreciate any further interpretation and advice on this, how to diagnose it further, or whether it's been addressed in a later version. Hope you appreciate I've erred on the side of too much detail in the following analysis.

As a side note, I had the postgres.exe process in debug mode for 2 hours and experienced no server downtime at all, which is good to know.

Regards,

-Brendan

=========================================
Main call stack:
=========================================

> postgres.exe!pq_recvbuf() Line 746 + 0x1a bytes
  postgres.exe!pq_getbyte() Line 795 + 0x5 bytes
  postgres.exe!SocketBackend(StringInfoData * inBuf=0x00000000) Line 317 + 0x5 bytes
  postgres.exe!PostgresMain(int argc=4, char * * argv=0x0155db48, const char * username=0x01064e20) Line 3595 + 0x23 bytes
  postgres.exe!BackendRun(Port * port=0x00bdfd40) Line 3208
  postgres.exe!SubPostmasterMain(int argc=3, char * * argv=0x01063f28) Line 3674 + 0x8 bytes
  postgres.exe!main(int argc=3, char * * argv=0x01063f28) Line 165 + 0x7 bytes
  postgres.exe!__tmainCRTStartup() Line 597 + 0x17 bytes
  kernel32.dll!_BaseProcessStart@4() + 0x23 bytes

Relevant variables:

PqRecvLength:  0
PqRecvPointer: 0
PqRecvBuffer:  Q...,SELECT COUNT(*) FROM. (SELECT tgargs from pg_trigger tr. LEFT JOIN p-g_depend dep ON dep.objid=tr.oid AND deptype = 'i'. LEFT JOIN pg_constraint co ON refobjid = co.oid AND contype = 'f'. WHERE tgisconstraint AND co.oid IS NULL. GROUP BY tgargs. HAVING count(1) = 3) AS foo.M pg_namespace nsp. LEFT OUTER JOIN pg_description des ON des.objoid=nsp.oid. WHERE NOT ((nspname = 'pg_catalog' and (SELECT count(*) FROM pg_class WHERE relname = 'pg_class' AND relnamespace = nsp.oid) > 0) OR.(nspname = 'pgagent' and (SELECT count(*) FROM pg_class WHERE relname = 'pga_job' AND relnamespace = nsp.oid) > 0) OR.(nspname = 'information_schema' and (SELECT count(*) FROM pg_class WHERE relname = 'tables' AND relnamespace = nsp.oid) >0x00789E10 20 30 29 20 4f 52 0a 28 6e 73 70 6e 61 6d 65 20 0) OR.(nspname = 'dbo' and (SELECT count(*) FROM pg_class WHERE relname = 'systables' AND relnamespace = nsp.oid) > 0) OR.(nspname = 'sys' and (SELECT count(*) FROM pg_class WHERE relname = 'all_tables' AND relnamespace = nsp.oid) > 0)). ORDER BY 1, nspname

MyProcPort: 0x00bdfd40
  sock 2044
  proto 196608
  laddr {addr={...} salen=16 }
  raddr {addr={...} salen=16 }
  remote_host 0x0106acf0 "<removed>"
  remote_port 0x010694a0 "3307"
  canAcceptConnections CAC_OK
  database_name 0x01064df8 "<removed>"
  user_name 0x01064e20 "<removed>"
  cmdline_options 0x00000000 <Bad Ptr>
  guc_options 0x00000000 {type=??? length=??? head=??? ...}
  auth_method uaMD5
  auth_arg 0x00000000 <Bad Ptr>
  md5Salt 0x00bdfe7c "<removed>"
  cryptSalt 0x00bdfe80 "UA"
  SessionStartTime 303975637670000
  default_keepalives_idle 0
  default_keepalives_interval 0
  default_keepalives_count 0
  keepalives_idle 0
  keepalives_interval 0
  keepalives_count 0
  gss 0x010694e0 {outbuf={...} cred=0x00000000 ctx=0x00000000 ...}
  ssl 0x003fbb78 {version=769 type=8192 method=0x1002b298 ...}
  peer 0x00000000 {cert_info=??? sig_alg=??? signature=??? ...}
  peer_dn 0x00bdfeb4 "(anonymous)"
  peer_cn 0x00bdff35 "(anonymous)"
  count 5287 (unsigned long)

pq_recvbuf() calls "secure_read(MyProcPort, PqRecvBuffer + PqRecvLength, PQ_BUFFER_SIZE - PqRecvLength);", which returns -1. I've inferred from control flow that errno=EINTR (can't debug the value unfortunately), so it continues the loop.

=========================================
Within secure_read()
=========================================

SSL_get_error() returns 10054 (connection reset or WSAEOPNOTSUPP, I'm not sure). The switch() command jumps to SSL_ERROR_SYSCALL, then immediately jumps to SSL_ERROR_NONE !?, and port->count is NOT incremented - this was strange, suggesting a source code mismatch or debugger problem perhaps. secure_read() returns -1, popping back to pq_recvbuf().

=========================================
Within my_sock_read()
=========================================

secure_read() causes a call to my_sock_read(), which has a separate stack trace (waiting on single object, I assume):

> postgres.exe!my_sock_read(bio_st * h=0x003fbcf0, char * buf=0x0165bc98, int size=5) Line 443  C
  libeay32.dll!0026a78f()
  [Frames below may be incorrect and/or missing, no symbols loaded for libeay32.dll]
  mswsock.dll!_WSPEventSelect@16() + 0x92 bytes
  ffffffff()

Within prepare_for_client_read(): DoingCommandRead = 1
Within EnableNotifyInterrupt(): IsTransactionOrTransactionBlock() = 0, notifyInterruptOccurred = 0
Within EnableCatchupInterrupt(): catchupInterruptOccurred = 0
After "res = std_sock_read(h, buf, size);" (can't debug the result, unfortunately), (buf, size, h) hadn't changed at all
Within client_read_ended(): DoingCommandRead = 1

Exiting my_sock_read() and wandering through some assembly, it finds itself back in socket_read(), then back in pq_recvbuf(), and the loop continues.
"Brendan Hill" <brendanh@jims.net> writes: > My best interpretation is that an SSL client dirty disconnected while > running a request. This caused an infinite loop in pq_recvbuf(), calling > secure_read(), triggering my_sock_read() over and over. Calling > SSL_get_error() in secure_read() returns 10045 (either connection reset, or > WSAEOPNOTSUPP, I'm not sure) - after this, pq_recvbuf() appears to think > errno=EINTR has occurred, so it immediately tries again. I wonder if this would be a good idea: #ifdef USE_SSL if (port->ssl) { int err; rloop: + errno = 0; n = SSL_read(port->ssl, ptr, len); err = SSL_get_error(port->ssl, n); switch (err) { case SSL_ERROR_NONE: port->count += n; break; It looks to me like the basic issue is that pq_recvbuf is expecting a relevant value of errno when secure_read returns -1, and there's some path in the Windows case where errno doesn't get set, and if it just happens to have been EINTR then we've got a loop. regards, tom lane
Hi Tom,

Makes sense to me. Seems to be happening rarely now. I'm not all that familiar with the open source process - is this likely to be included in the next release version?

-Brendan
"Brendan Hill" <brendanh@jims.net> writes: > Makes sense to me. Seems to be happening rarely now. > I'm not all that familiar with the open source process, is this likely to be > included in the next release version? Can you confirm that that change actually fixes the problem you're seeing? I'm happy to apply it if it does, but I'd like to know that the problem is dealt with. regards, tom lane > -----Original Message----- > From: Tom Lane [mailto:tgl@sss.pgh.pa.us] > Sent: Monday, 21 September 2009 5:25 AM > To: Brendan Hill > Cc: 'Craig Ringer'; pgsql-general@postgresql.org > Subject: Re: [GENERAL] Idle processes chewing up CPU? > "Brendan Hill" <brendanh@jims.net> writes: >> My best interpretation is that an SSL client dirty disconnected while >> running a request. This caused an infinite loop in pq_recvbuf(), calling >> secure_read(), triggering my_sock_read() over and over. Calling >> SSL_get_error() in secure_read() returns 10045 (either connection reset, > or >> WSAEOPNOTSUPP, I'm not sure) - after this, pq_recvbuf() appears to think >> errno=EINTR has occurred, so it immediately tries again. > I wonder if this would be a good idea: > #ifdef USE_SSL > if (port->ssl) > { > int err; > rloop: > + errno = 0; > n = SSL_read(port->ssl, ptr, len); > err = SSL_get_error(port->ssl, n); > switch (err) > { > case SSL_ERROR_NONE: > port->count += n; > break; > It looks to me like the basic issue is that pq_recvbuf is expecting > a relevant value of errno when secure_read returns -1, and there's > some path in the Windows case where errno doesn't get set, and if > it just happens to have been EINTR then we've got a loop. > regards, tom lane > -- > Sent via pgsql-general mailing list (pgsql-general@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general
Hi Tom,

Bit of a catch 22 - since it happens rarely, there's no definitive confirmation that it's fixed the problem. Also, I'm not sure I'm comfortable applying the change and recompiling myself; I wouldn't have a clue where to start.

I can't see how the change would hurt though; it seems like a good idea regardless. What do you think?

Regards,

-Brendan
"Brendan Hill" <brendanh@jims.net> writes: > Bit of a catch 22 - since it happens rarely, there's no definitive > confirmation that it's fixed the problem. > Also, not sure if I'm comfortable applying the change and recompiling > myself, wouldn't have a clue where to start. Uh, so you haven't actually tested it at all? > I can't see how the change would hurt though, seems like a good idea > regardless, what do you think? I'm not really inclined to throw it in there just on speculation that it *might* mask a bug that's really somewhere else. I can't help you with producing a recompiled executable for Windows, but there are other people around the list who can. regards, tom lane
"Brendan Hill" <brendanh@jims.net> writes: > I think I've confirmed the fix. Using a dirty disconnect generator, I was > able to reliably recreate the problem within about 30-60 seconds. The > symptoms were the same as before, however it occurred around SSL_write > instead of SSL_read - I assume this was due to the artificial nature of the > dirty disconnect (easier for the client to artificially break the connection > while waiting/receiving, than sending). > The solution you proposed solved it for SSL_write (ran for 30 minutes, no > runaway processes), and I think it's safe to assume SSL_read too. So I > suggest two additions: Done, and also the same on the libpq side, since presumably it could happen at that end as well. I suspect there is some underlying OpenSSL bug we are masking here, but it's cheap enough that we might as well just do it. regards, tom lane
Hi Tom,

I think I've confirmed the fix. Using a dirty disconnect generator, I was able to reliably recreate the problem within about 30-60 seconds. The symptoms were the same as before, however it occurred around SSL_write instead of SSL_read - I assume this was due to the artificial nature of the dirty disconnect (easier for the client to artificially break the connection while waiting/receiving, than sending).

The solution you proposed solved it for SSL_write (ran for 30 minutes, no runaway processes), and I think it's safe to assume SSL_read too. So I suggest two additions:

====================================================
  rloop:
  +     errno = 0;
        n = SSL_read(port->ssl, ptr, len);
        err = SSL_get_error(port->ssl, n);
        switch (err)
        {
            case SSL_ERROR_NONE:
                port->count += n;
                break;
====================================================

And:

====================================================
  wloop:
  +     errno = 0;
        n = SSL_write(port->ssl, ptr, len);
        err = SSL_get_error(port->ssl, n);
        switch (err)
        {
            case SSL_ERROR_NONE:
                port->count += n;
                break;
====================================================

I'm not comfortable running my own compiled version in production (it was rather difficult to get it working), so I'm interested to know when the next release is planned. We can test beta copies on a non-critical load balancing server if necessary.

Cheers,

-Brendan