Thread: Idle processes chewing up CPU?
I recently migrated from MSSQL2000 to Postgres 8.3 for Windows, and overall it’s running great.
Using the Process Explorer tool, I've noticed that a child postgres.exe is chewing up 25% of total CPU (we have two dual-core CPUs, so presumably it's saturating one core). Using SELECT * FROM pg_stat_activity, I located the process id (#3884), and it showed:
datid | datname | procpid | usesysid | usename | current_query | waiting | xact_start | query_start | backend_start | client_addr | client_port |
11511 | postgres | 3884 | 16395 | Brendan.Hill | <IDLE> | f | 2009-07-28 13:54:50.055+10 | 2009-07-28 13:54:44.508+10 | ### | 3353 |
No processes are waiting. This was taken at 2:19pm, so it had been running like this for about 25 minutes. This has happened before - two runaway processes each chewed up 25% of the CPU for several weeks, and we had to kill them forcefully before we could restart the Postgres service.
Other requests are being served; however, we've had reports of slowness (which sparked the initial investigation). I can't confirm this runaway process is the cause, but it shouldn't be happening either way. I've killed it just in case.
Any thoughts on what is causing this, or how I could diagnose the problem further?
Regards,
Brendan Hill
Chief Information Officer
Jims Group Pty Ltd
48 Edinburgh Rd
Mooroolbark VIC 3138
For all Jims IT enquiries: infotech@jims.net
For emergencies: 1300 130 490 (intl +61 4 3456 5776)
"Brendan Hill" <brendanh@jims.net> writes: > Using the Process Explorer tool, I've noticed that a child postgres.exe is > chewing up 25% of the CPU usage each (we have two dual-core CPUs, presumably > it's chewing up one core). Using SELECT * FROM pg_stat_activity, I located > the process id (#3884), and it showed: > <IDLE> Hmph. When a new command is received, there's some amount of parsing overhead that happens before it changes the status display to not-IDLE, but it shouldn't be minutes' worth. Can you get a stack trace next time that happens? regards, tom lane
Hi Tom,

Given it's on Windows, any suggestion for how I would get hold of this? (Process Monitor tool perhaps?)

Regards,

-Brendan
On Tue, Jul 28, 2009 at 7:17 PM, Brendan Hill <brendanh@jims.net> wrote:
> Hi Tom,
>
> Given it's on Windows, any suggestion for how I would get hold of this?
> (Process Monitor tool perhaps?)

I'd bet there's a windows faq somewhere on system monitoring (googles). Is this at all helpful, or is this problem beyond it?

http://technet.microsoft.com/en-us/sysinternals/default.aspx
Brendan Hill wrote:
> Hi Tom,
>
> Given it's on Windows, any suggestion for how I would get hold of this?
> (Process Monitor tool perhaps?)

I think you can get stack traces from Process Monitor using "Tools -> Stack Summary". I find it a bit hard to interpret this data, though, and I'm not sure how useful it is for this sort of thing.

[ The following instructions may be put on the PostgreSQL wiki as advice for getting debugging details for runaway PostgreSQL processes on Windows if desired ]:

You're better off using Process Explorer in conjunction with the Debugging Tools for Windows.

Install the Debugging Tools for Windows:

  http://www.microsoft.com/whdc/devtools/debugging/installx86.mspx

and Process Explorer:

  http://technet.microsoft.com/en-us/sysinternals/bb896653.aspx

Now launch Process Explorer (procexp.exe) and in the Options menu choose "Configure symbols". Replace the dbghelp.dll path with:

  C:\Program Files\Debugging Tools for Windows (x86)\dbghelp.dll

In the same dialog put this string in the symbol path field (all one line, the only space being between "Program" and "Files"):

  C:\Program Files\PostgreSQL\8.4\symbols;SRV*c:\localsymbols*http://msdl.microsoft.com/download/symbols;

(Change the path to your PostgreSQL symbols directory as appropriate for your version.)

This will tell Process Explorer where to look for PostgreSQL debug symbols, and tell it to use the Microsoft symbol server to get windows symbols, caching them in c:\localsymbols, which it will create if required.

Now you can choose "Properties" on a process in Process Explorer and in the Thread tab see the active threads and what they're doing. You should see something like:

  TID     CSwitch delta   Start Address
  1260    8               postgres.exe!mainCRTStartup
  2792                    postgres.exe!pg_signal_thread

(If you see lots of references to "pg_init" or similar instead, your symbol path is wrong and Process Explorer can't find the PostgreSQL symbols.)

Now you need to identify the active thread and get a stack trace from it. It'll usually have a non-zero cswitch delta. Select it and click "Stack". After a short delay, a stack trace will be shown. Select it all, and click "Copy".

Here's a stack trace obtained from PostgreSQL 8.4 while it's executing:

  select generate_series(1,100000000000000);

... just so you have some idea what to expect:

ntkrnlpa.exe!NtInitialUserProcessBuffer+0x26
ntkrnlpa.exe!RtlFreeHeapSlowly+0x88
ntkrnlpa.exe!NtCallbackReturn+0x29
ntkrnlpa.exe!MiGrowWsleHash+0xb0
ntkrnlpa.exe!MiTrimWorkingSet+0xc4
hal.dll!HalpApcInterrupt+0xc6
postgres.exe!ExecProcNode+0x5
postgres.exe!ExecutePlan+0x93
postgres.exe!standard_ExecutorRun+0x7a
postgres.exe!PortalRunSelect+0x6a
postgres.exe!PortalRun+0x14f
postgres.exe!exec_simple_query+0x381
postgres.exe!PostgresMain+0xc67
postgres.exe!BackendRun+0x204
postgres.exe!SubPostmasterMain+0x224
postgres.exe!main+0x177
postgres.exe!__tmainCRTStartup+0x10f
kernel32.dll!BaseProcessStart+0x23

You can also, if you prefer, use windbg.exe from the Debugging Tools for Windows to get a stack trace. This is widely documented.

--
Craig Ringer
Craig Ringer wrote:
> Brendan Hill wrote:
>> Hi Tom,
>>
>> Given it's on Windows, any suggestion for how I would get hold of this?
>> (Process Monitor tool perhaps?)
>
> I think you can get stack traces from Process Monitor using "Tools ->
> Stack Summary". I find it a bit hard to interpret this data, though, and
> I'm not sure how useful it is for this sort of thing.
>
> [ The following instructions may be put on the PostgreSQL wiki as advice
> for getting debugging details for runaway PostgreSQL processes on
> Windows if desired ]:

Actually, I've expanded on the instructions and done it. See:

http://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Windows

Accessible from "General Articles and Guides" -> "Troubleshooting" -> "Generating_a_stack_trace_of_a_PostgreSQL_backend".

It'd be rather helpful if others could fill in the equivalent for gdb on Linux/bsd/other unix as linked to here:

http://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend

--
Craig Ringer
On Wed, Jul 29, 2009 at 12:08, Craig Ringer <craig@postnewspapers.com.au> wrote:
> Actually, I've expanded on the instructions and done it. See:
>
> http://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Windows
>
> Accessible from "General Articles and Guides" -> "Troubleshooting" ->
> "Generating_a_stack_trace_of_a_PostgreSQL_backend".

This is very useful, thanks for putting it up!

--
Magnus Hagander
Self: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Hi all,

We managed to trap further details of the problem today. Some large queries were terminated prematurely, perhaps causing a dirty SSL or TCP disconnect, and they seem to have left a few runaway processes. SELECT * FROM pg_stat_activity showed each process was <IDLE>, while Process Explorer showed them each chewing up 25% CPU time (ie. one core on a quad core system).

I copied a few of the stack traces (at the end of this email); they kept changing each time I looked. Mostly LIBEAY32.DLL related, suggesting some connection to SSL - our bin\libeay32.DLL is version 0.9.8.5, last modified 2007/02/28.

In case it was simply a dirty SSL disconnect, I tried running a large query, unplugged my network cable, and monitored the process. Rather than getting stuck in a loop, the child process just shut down at the end of the query.

So I'd appreciate any advice you have on what may be causing this and how we can get around it in future. If necessary, we'll write a temporary program to poll pg_stat_activity and the currently running processes, and alert us if one goes <IDLE> while chewing CPU, but obviously this isn't a long term solution.

Thanks for your help,

-Brendan

ntkrnlpa.exe+0x8dafe
ntkrnlpa.exe+0x29a82
ntkrnlpa.exe+0x33198
hal.dll+0x6199
hal.dll+0x63d9
hal.dll+0x6577
hal.dll+0x3902
mswsock.dll+0x1445
WSOCK32.dll!recv+0x31
LIBEAY32.dll!BIO_sock_should_retry+0x57

ntkrnlpa.exe+0x8dafe
ntkrnlpa.exe+0x29a82
ntkrnlpa.exe+0x33198
hal.dll+0x6199
hal.dll+0x63d9
hal.dll+0x6577
hal.dll+0x3902
postgres.exe!process_implied_equality+0x18d50e

ntkrnlpa.exe+0x8dafe
ntkrnlpa.exe+0x29a82
ntkrnlpa.exe+0x33198
hal.dll+0x6199
hal.dll+0x63d9
hal.dll+0x6577
hal.dll+0x3902
LIBEAY32.dll!ERR_get_state

ntkrnlpa.exe+0x8dafe
ntkrnlpa.exe+0x29a82
ntkrnlpa.exe+0x33198
hal.dll+0x6199
hal.dll+0x63d9
hal.dll+0x6577
hal.dll+0x3902
ntkrnlpa.exe+0x89751
ntdll.dll!KiFastSystemCallRet
WS2_32.dll!WSARecv+0x65
WSOCK32.dll!recv+0x31
LIBEAY32.dll!BIO_sock_should_retry+0x57

ntkrnlpa.exe+0x8dafe
ntkrnlpa.exe+0x29a82
ntkrnlpa.exe+0x33198
hal.dll+0x6199
hal.dll+0x63d9
hal.dll+0x6456
ntkrnlpa.exe+0x312be
ntkrnlpa.exe+0x2ab9b
ntkrnlpa.exe+0x1e257
afd.sys+0x11905
afd.sys+0x10978
afd.sys+0xf097
ntkrnlpa.exe+0x1df85
ntkrnlpa.exe+0xf5437
ntkrnlpa.exe+0xf61bf
ntkrnlpa.exe+0xeed08
ntkrnlpa.exe+0x897bc
ntdll.dll!KiFastSystemCallRet
WS2_32.dll!WSARecv+0x65
WSOCK32.dll!recv+0x31
LIBEAY32.dll!BIO_sock_should_retry+0x57
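A minimal sketch of the kind of temporary watchdog Brendan mentions above might look like the following. This is hypothetical and not from the thread: it assumes libpq is available, the connection string is a placeholder, it uses the 8.3-era column names (procpid, current_query), and it only lists backends pg_stat_activity reports as <IDLE>; cross-checking CPU usage (e.g. with Process Explorer or GetProcessTimes) is left to the operator or a scheduler that runs it periodically.

```c
/*
 * watchdog.c - hypothetical sketch of a poller for idle backends.
 * Assumes PostgreSQL 8.3 column names and libpq; run it from a
 * scheduler (e.g. every minute) and alert on the output.
 */
#include <stdio.h>
#include <libpq-fe.h>

int main(void)
{
    /* Placeholder connection string - adjust for the real server. */
    PGconn   *conn = PQconnectdb("host=localhost dbname=postgres user=postgres");
    PGresult *res;
    int       i;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }

    /* 8.3 uses procpid and current_query; later releases renamed these. */
    res = PQexec(conn,
                 "SELECT procpid, usename, query_start "
                 "FROM pg_stat_activity "
                 "WHERE current_query = '<IDLE>'");

    if (PQresultStatus(res) == PGRES_TUPLES_OK)
    {
        /* Print candidates; the CPU check happens outside this sketch. */
        for (i = 0; i < PQntuples(res); i++)
            printf("idle backend pid=%s user=%s since %s\n",
                   PQgetvalue(res, i, 0),
                   PQgetvalue(res, i, 1),
                   PQgetvalue(res, i, 2));
    }
    else
        fprintf(stderr, "query failed: %s", PQerrorMessage(conn));

    PQclear(res);
    PQfinish(conn);
    return 0;
}
```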
On Wed, 2009-08-05 at 15:26 +1000, Brendan Hill wrote:
> I copied a few of the stack traces (at the end of this email); they kept
> changing each time I looked.

Yep, that's to be expected. If the process is busy, unless it's in a _REALLY_ simple infinite loop, it'll be looping through some non-trivial sequence of function calls. Thus, most stack traces will be different.

Unfortunately, you don't appear to have set up your debugging environment, so your stack traces aren't annotated with any sort of symbolic information that might tell the reader what's actually happening. The numeric offsets can be converted to symbolic names if you know the _EXACT_ version of the module (eg "hal.dll") in question, but you haven't provided that, and even if you did it's a pain to do.

If you did set your symbol path, you might have an overzealous software firewall interfering with requests to the symbol server. Make sure that HTTP (port 80) access to msdl.microsoft.com for windbg.exe and procexp.exe is unimpeded, or download the symbol bundle for your OS (warning: huge download) from http://www.microsoft.com/whdc/devtools/debugging/symbolpkg.mspx .

Because your symbol path doesn't appear to be configured, there's no guarantee the symbols for postgres.exe are right either, and more importantly there's no "depth" to the trace; it doesn't show the call path within postgres / libeay32 / etc. Take this trace, for example:

ntkrnlpa.exe+0x8dafe
ntkrnlpa.exe+0x29a82
ntkrnlpa.exe+0x33198
hal.dll+0x6199
hal.dll+0x63d9
hal.dll+0x6577
hal.dll+0x3902
postgres.exe!process_implied_equality+0x18d50e

process_implied_equality(...) from backend/optimizer/plan/initsplan.c is deep in the query planner, and is "currently used only when an EquivalenceClass is found to contain pseudoconstants". It's not going to be calling into hal.dll - not, at least, without a fairly long call chain between it and hal.dll. So we can't trust that postgres was even in the "process_implied_equality" function when it called into hal.dll, and if it was we don't have any idea how it got there. The call trace is basically useless.

(New section added to the wiki article for this: http://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Windows#How_to_make_sure_a_stack_trace_is_useful )

A stack trace that includes symbols for the kernel / HAL calls should look something like this:

ntkrnlpa.exe!KiSwapContext+0x2f
ntkrnlpa.exe!KiSwapThread+0x8a
ntkrnlpa.exe!KeWaitForSingleObject+0x1c2
ntkrnlpa.exe!KiSuspendThread+0x18
ntkrnlpa.exe!KiDeliverApc+0x124
ntkrnlpa.exe!KiSwapThread+0xa8
ntkrnlpa.exe!KeWaitForMultipleObjects+0x284
ntkrnlpa.exe!NtWaitForMultipleObjects+0x297
ntkrnlpa.exe!KiFastCallEntry+0xfc
ntdll.dll!KiFastSystemCallRet
ntdll.dll!ZwWaitForMultipleObjects+0xc
kernel32.dll!WaitForMultipleObjectsEx+0x12c
postgres.exe!pgwin32_waitforsinglesocket+0x1f0
postgres.exe!pgwin32_recv+0x90
postgres.exe!secure_read+0x17d
postgres.exe!pq_recvbuf+0x71
postgres.exe!pq_getbyte+0x15
postgres.exe!SocketBackend+0x6
postgres.exe!PostgresMain+0xbe8
postgres.exe!BackendRun+0x204
postgres.exe!SubPostmasterMain+0x224
postgres.exe!main+0x177
postgres.exe!__tmainCRTStartup+0x10f
kernel32.dll!BaseProcessStart+0x23

See how it has "ModuleNames!SymbolicNames+offsets" instead of just "ModuleNames+offsets"?

--
Craig Ringer
On Wed, 2009-08-05 at 16:44 +1000, Brendan Hill wrote:
> Hi Craig,
>
> Sorry, I had the stack trace so I thought it was enough. I'll make sure the
> debug environment is set up and post the full stack traces again.

No worries. Sorry it cost you time.

I've extended the wiki article on win32 debug info to (hopefully) explain how to identify a useful stack trace, or at least a likely bogus one. If you feel like taking a look I'd value your opinion, especially when it comes to clarity/readability.

http://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Windows

--
Craig Ringer
Hi Craig,

Sorry, I had the stack trace so I thought it was enough. I'll make sure the debug environment is set up and post the full stack traces again.

-Brendan
Hi Craig, the instructions are great. I've gone through them and confirmed I'm getting the full details in the stack traces. When it happens again, I'll post the most useful stack traces and we'll be able to look at it properly.

Thanks again for your help so far, the responsiveness is terrific.

-Brendan
Hi Craig/Tom,

I've managed to trap the full stack trace this time - 2 processes chewing up 25% each (1 core each on a quad core server), while SELECT * FROM pg_stat_activity revealed they were <IDLE>.

I also confirmed that the two runaway processes were started by a developer remotely connecting pgAdmin via SSL. It appears that no actual queries were run, just the connection established. Unfortunately I couldn't confirm if the connections were dirty disconnected.

Relevant DLL versions are:

  8.3\bin\LIBEAY32.DLL - 0.9.8.5
  8.3\bin\SSLEAY32.DLL - 0.9.8.5
  Other DLLs - standard for Windows 2003 Server SP2

The juiciest stack traces I captured (in no particular order) were:

ntkrnlpa.exe!KiUnexpectedInterrupt+0x48
ntkrnlpa.exe!KeWaitForSingleObject+0x346
ntkrnlpa.exe!ZwYieldExecution+0x3514
hal.dll!KfRaiseIrql+0xe5
hal.dll!KeRaiseIrqlToSynchLevel+0x8d
hal.dll!HalEndSystemInterrupt+0x67
hal.dll!HalInitializeProcessor+0x856
LIBEAY32.dll!lh_doall_arg+0x1c4

ntkrnlpa.exe!KiUnexpectedInterrupt+0x48
ntkrnlpa.exe!KeWaitForSingleObject+0x346
ntkrnlpa.exe!ZwYieldExecution+0x3514
hal.dll!KfRaiseIrql+0xe5
hal.dll!KeRaiseIrqlToSynchLevel+0x8d
hal.dll!KfLowerIrql+0x62
ntkrnlpa.exe!ZwYieldExecution+0x163a
ntkrnlpa.exe!KeInsertQueueApc+0x57
ntkrnlpa.exe!IoCancelIrp+0x27d
hal.dll!HalEndSystemInterrupt+0x6e
hal.dll!HalInitializeProcessor+0x856
ntkrnlpa.exe!IofCallDriver+0x45
ntkrnlpa.exe!NtWriteFile+0x2943
ntkrnlpa.exe!NtWriteFile+0x36cb
ntkrnlpa.exe!NtDeviceIoControlFile+0x2a
ntkrnlpa.exe!KeReleaseInStackQueuedSpinLockFromDpcLevel+0xb64
ntdll.dll!KiFastSystemCallRet
WS2_32.dll!WSARecv+0x65
WSOCK32.dll!recv+0x31
LIBEAY32.dll!BIO_sock_should_retry+0x57
postgres.exe!my_sock_read+0x1b
LIBEAY32.dll!BIO_read+0x6f
SSLEAY32.dll!SSLv3_client_method+0x1ee1
SSLEAY32.dll!SSLv3_client_method+0x22ea
mswsock.dll!StartWsdpService+0x500
SSLEAY32.dll!SSLv3_client_method+0x225a
SSLEAY32.dll!SSLv3_client_method+0x2a15
postgres.exe!pgwin32_waitforsinglesocket+0x1ed
postgres.exe!secure_read+0x26
postgres.exe!pq_recvbuf+0x71
postgres.exe!pq_getbyte+0x15
postgres.exe!SocketBackend+0x6
postgres.exe!PostgresMain+0xbf8
postgres.exe!BackendRun+0x200
postgres.exe!SubPostmasterMain+0x21d
postgres.exe!main+0x177
postgres.exe!__tmainCRTStartup+0x10f
kernel32.dll!ProcessIdToSessionId+0x209

ntkrnlpa.exe!ZwYieldExecution+0x163a
ntkrnlpa.exe!KeSetEvent+0xcc
ntkrnlpa.exe!PsLookupThreadByThreadId+0x54bc
ntkrnlpa.exe!KiDeliverApc+0xbb
ntkrnlpa.exe!ZwYieldExecution+0x3801
ntkrnlpa.exe!KeWaitForSingleObject+0x346
ntkrnlpa.exe!ZwYieldExecution+0x3514
ntkrnlpa.exe!KiCheckForKernelApcDelivery+0x1c
ntkrnlpa.exe!wctomb+0x4229
ntkrnlpa.exe!IofCallDriver+0x45
ntkrnlpa.exe!NtWriteFile+0x2943
ntkrnlpa.exe!NtWriteFile+0x36cb
ntkrnlpa.exe!NtDeviceIoControlFile+0x2a
ntkrnlpa.exe!KeReleaseInStackQueuedSpinLockFromDpcLevel+0xb64
ntdll.dll!KiFastSystemCallRet
WS2_32.dll!WSARecv+0x65
WSOCK32.dll!recv+0x31
LIBEAY32.dll!BIO_sock_should_retry+0x57
postgres.exe!my_sock_read+0x1b
LIBEAY32.dll!BIO_read+0x6f
SSLEAY32.dll!SSLv3_client_method+0x1ee1
SSLEAY32.dll!SSLv3_client_method+0x22ea
mswsock.dll!StartWsdpService+0x500
SSLEAY32.dll!SSLv3_client_method+0x225a
SSLEAY32.dll!SSLv3_client_method+0x2a15
postgres.exe!pgwin32_waitforsinglesocket+0x1ed
postgres.exe!secure_read+0x26
postgres.exe!pq_recvbuf+0x71
postgres.exe!pq_getbyte+0x15
postgres.exe!SocketBackend+0x6
postgres.exe!PostgresMain+0xbf8
postgres.exe!BackendRun+0x200
postgres.exe!SubPostmasterMain+0x21d
postgres.exe!main+0x177
postgres.exe!__tmainCRTStartup+0x10f
kernel32.dll!ProcessIdToSessionId+0x209
ntkrnlpa.exe!KiUnexpectedInterrupt+0x48
ntkrnlpa.exe!KeWaitForSingleObject+0x346
ntkrnlpa.exe!ZwYieldExecution+0x3514
ntkrnlpa.exe!KiCheckForKernelApcDelivery+0x1c
ntkrnlpa.exe!NtWriteFile+0x3195
ntkrnlpa.exe!NtDeviceIoControlFile+0x2a
ntkrnlpa.exe!KeReleaseInStackQueuedSpinLockFromDpcLevel+0xb64
ntdll.dll!KiFastSystemCallRet
WS2_32.dll!WSARecv+0x65
WSOCK32.dll!recv+0x31
LIBEAY32.dll!BIO_sock_should_retry+0x57
postgres.exe!my_sock_read+0x1b
LIBEAY32.dll!BIO_read+0x6f
SSLEAY32.dll!SSLv3_client_method+0x1ee1
SSLEAY32.dll!SSLv3_client_method+0x22ea
mswsock.dll!StartWsdpService+0x500
SSLEAY32.dll!SSLv3_client_method+0x225a
SSLEAY32.dll!SSLv3_client_method+0x2a15
postgres.exe!pgwin32_waitforsinglesocket+0x1ed
postgres.exe!secure_read+0x26
postgres.exe!pq_recvbuf+0x71
postgres.exe!pq_getbyte+0x15
postgres.exe!SocketBackend+0x6
postgres.exe!PostgresMain+0xbf8
postgres.exe!BackendRun+0x200
postgres.exe!SubPostmasterMain+0x21d
postgres.exe!main+0x177
postgres.exe!__tmainCRTStartup+0x10f
kernel32.dll!ProcessIdToSessionId+0x209

I'd appreciate any help diagnosing this problem - cutting off remote access via SSL isn't the ideal solution.

Regards,

-Brendan
On 19/08/2009 12:31 PM, Brendan Hill wrote:
> Hi Craig/Tom,
>
> I've managed to trap the full stack trace this time

The common part of those traces is:

> ntdll.dll!KiFastSystemCallRet
> WS2_32.dll!WSARecv+0x65
> WSOCK32.dll!recv+0x31
> LIBEAY32.dll!BIO_sock_should_retry+0x57
> postgres.exe!my_sock_read+0x1b
> LIBEAY32.dll!BIO_read+0x6f
> SSLEAY32.dll!SSLv3_client_method+0x1ee1
> SSLEAY32.dll!SSLv3_client_method+0x22ea
> mswsock.dll!StartWsdpService+0x500
> SSLEAY32.dll!SSLv3_client_method+0x225a
> SSLEAY32.dll!SSLv3_client_method+0x2a15
> postgres.exe!pgwin32_waitforsinglesocket+0x1ed
> postgres.exe!secure_read+0x26
> postgres.exe!pq_recvbuf+0x71
> postgres.exe!pq_getbyte+0x15
> postgres.exe!SocketBackend+0x6
> postgres.exe!PostgresMain+0xbf8
> postgres.exe!BackendRun+0x200
> postgres.exe!SubPostmasterMain+0x21d
> postgres.exe!main+0x177
> postgres.exe!__tmainCRTStartup+0x10f
> kernel32.dll!ProcessIdToSessionId+0x209

Now, it's not possible to tell for sure from the traces alone whether this part of the trace shows the same instances of the same function calls, or whether there's a loop happening such that (eg) pgwin32_waitforsinglesocket is being called over and over and over. To find that out, you'd need to attach a debugger and set a breakpoint somewhere suitable.

Personally, though, as a somewhat informed stab in the dark I suspect that the above part of the call stack is actually entered once and not left. I'd say that when Pg calls my_sock_read(...), resulting in a call to recv(...) and from there a kernel system call, that's as far as it goes. The system call never returns.

Why? I suspect you have a buggy network driver or faulty network card. The unexpected interrupt handler being called in one of the stack traces certainly has to make you wonder.

> I'd appreciate any help diagnosing this problem - cutting off remote access
> via SSL isn't the ideal solution.

I'd replace the NIC with one from a different manufacturer, at least temporarily. I won't be shocked if the problem goes away.

--
Craig Ringer
Hi Craig, thanks for the analysis. If I attach a debugger to the runaway child process, will this halt execution for all the other child processes (ie. freeze the server)? And can I attach Visual Studio C++ 2008, or is there a recommended debugger for Windows debugging?

Given the reliability of the server in the past, I'd probably be expecting an issue with OpenSSL instead, but with debugging attached I should be able to say for sure.

Regards,

-Brendan
On 19/08/2009 1:34 PM, Brendan Hill wrote:
> Hi Craig, thanks for the analysis. If I attach a debugger to the runaway
> child process, will this halt execution for all the other child processes
> (ie. freeze the server)? And can I attach Visual Studio C++ 2008, or is
> there a recommended debugger for Windows debugging?

Visual C++ 2008's debugger should be fine - and it's certainly a lot nicer to use than windbg.exe. LOTS nicer. I'm surprised you have VS 2008 on your production server, though - or are you planning on using remote debugging?

Anyway: if you attach to a given backend, execution of the other backends won't freeze. If you promptly unpause execution of the backend you attached to, everything will run normally. You might not want to interrupt the backend's execution for too long at a time, though, as my understanding is that Pg does have tasks that require synchronization across all backends, and leaving one in a state of paused execution for too long might slow things down.

I did some quick testing before posting. First, I downloaded and unpacked the 8.4.0 sources, since that's what I'm running on my workstation. I then established two sessions to an otherwise idle 8.4 DB on WinXP, then attached VS 2008 EE's debugger to one of them: Tools -> Attach to Process, check "show processes from all users", select the target postgres.exe by pid, attach. It took a while for VS to load symbols for the first time, but the other backend was responsive during that time. When VS finished loading symbols it auto-resumed execution of the backend.

When I pause execution the other backend remains responsive. I can still establish new connections too.

With execution running normally I added a breakpoint at pq_recvbuf:

  Debug -> New Breakpoint -> Break at Function (CTRL-B), "pq_recvbuf", line 1 char 1, language "C", OK

then issued a query to the backend I was debugging. It processed the query and then execution stopped at the breakpoint. I was prompted to locate the source file I'd broken in, and when I did so it showed an execution marker at the appropriate point, I could step execution through the sources, etc. When I was done, I just detached from the process with Tools -> Detach All, leaving it running as before.

In your position I'd start by waiting until you have an out-of-control backend, attaching to it without pausing it, and setting a breakpoint at my_sock_read. If the breakpoint is hit then something's called my_sock_read again; it won't trigger if my_sock_read is somewhere on the call stack, only when the current point of execution enters the function. You can step through execution from there to see where it's looping. If you find that my_sock_read isn't being called repeatedly, then the infinite loop is in my_sock_read or something it's calling. Break into execution and step through to see what Pg is doing.

> Given the reliability of the server in the past, I'd probably be expecting
> an issue with OpenSSL instead, but with debugging attached I should be able
> to say for sure.

Yep. If, for example, you waited until a backend was in the problem state where it was using 100% CPU, attached the debugger, and set a breakpoint at the start of my_sock_read in postgres.exe, then you could see if my_sock_read(...) was being called repeatedly or just once.

--
Craig Ringer
Hi Craig,

I've debugged the runaway process, though I'm not sure of the solution yet. My best interpretation is that an SSL client dirty disconnected while running a request. This caused an infinite loop in pq_recvbuf(), calling secure_read(), triggering my_sock_read() over and over. Calling SSL_get_error() in secure_read() returns 10045 (either connection reset, or WSAEOPNOTSUPP, I'm not sure) - after this, pq_recvbuf() appears to think errno=EINTR has occurred, so it immediately tries again.

I'd appreciate any further interpretation and advice on this, how to diagnose it further, or whether it's been addressed in a later version. Hope you appreciate I've erred on the side of too much detail in the following analysis.

As a side note, I had the postgres.exe process in debug mode for 2 hours and experienced no server downtime at all, which is good to know.

Regards,

-Brendan

=========================================
Main call stack:
=========================================

> postgres.exe!pq_recvbuf() Line 746 + 0x1a bytes
  postgres.exe!pq_getbyte() Line 795 + 0x5 bytes
  postgres.exe!SocketBackend(StringInfoData * inBuf=0x00000000) Line 317 + 0x5 bytes
  postgres.exe!PostgresMain(int argc=4, char * * argv=0x0155db48, const char * username=0x01064e20) Line 3595 + 0x23 bytes
  postgres.exe!BackendRun(Port * port=0x00bdfd40) Line 3208
  postgres.exe!SubPostmasterMain(int argc=3, char * * argv=0x01063f28) Line 3674 + 0x8 bytes
  postgres.exe!main(int argc=3, char * * argv=0x01063f28) Line 165 + 0x7 bytes
  postgres.exe!__tmainCRTStartup() Line 597 + 0x17 bytes
  kernel32.dll!_BaseProcessStart@4() + 0x23 bytes

Relevant variables:

PqRecvLength:  0
PqRecvPointer: 0
PqRecvBuffer:  Q...,SELECT COUNT(*) FROM. (SELECT tgargs from pg_trigger tr. LEFT JOIN p-g_depend dep ON dep.objid=tr.oid AND deptype = 'i'. LEFT JOIN pg_constraint co ON refobjid = co.oid AND contype = 'f'. WHERE tgisconstraint AND co.oid IS NULL. GROUP BY tgargs. HAVING count(1) = 3) AS foo.M pg_namespace nsp. LEFT OUTER JOIN pg_description des ON des.objoid=nsp.oid. WHERE NOT ((nspname = 'pg_catalog' and (SELECT count(*) FROM pg_class WHERE relname = 'pg_class' AND relnamespace = nsp.oid) > 0) OR.(nspname = 'pgagent' and (SELECT count(*) FROM pg_class WHERE relname = 'pga_job' AND relnamespace = nsp.oid) > 0) OR.(nspname = 'information_schema' and (SELECT count(*) FROM pg_class WHERE relname = 'tables' AND relnamespace = nsp.oid) >0x00789E10 20 30 29 20 4f 52 0a 28 6e 73 70 6e 61 6d 65 20 0) OR.(nspname = 'dbo' and (SELECT count(*) FROM pg_class WHERE relname = 'systables' AND relnamespace = nsp.oid) > 0) OR.(nspname = 'sys' and (SELECT count(*) FROM pg_class WHERE relname = 'all_tables' AND relnamespace = nsp.oid) > 0)). ORDER BY 1, nspname

MyProcPort: 0x00bdfd40
  sock 2044
  proto 196608
  laddr {addr={...} salen=16 }
  raddr {addr={...} salen=16 }
  remote_host 0x0106acf0 "<removed>"
  remote_port 0x010694a0 "3307"
  canAcceptConnections CAC_OK
  database_name 0x01064df8 "<removed>"
  user_name 0x01064e20 "<removed>"
  cmdline_options 0x00000000 <Bad Ptr>
  guc_options 0x00000000 {type=??? length=??? head=??? ...}
  auth_method uaMD5
  auth_arg 0x00000000 <Bad Ptr>
  md5Salt 0x00bdfe7c "<removed>"
  cryptSalt 0x00bdfe80 "UA"
  SessionStartTime 303975637670000
  default_keepalives_idle 0
  default_keepalives_interval 0
  default_keepalives_count 0
  keepalives_idle 0
  keepalives_interval 0
  keepalives_count 0
  gss 0x010694e0 {outbuf={...} cred=0x00000000 ctx=0x00000000 ...}
  ssl 0x003fbb78 {version=769 type=8192 method=0x1002b298 ...}
  peer 0x00000000 {cert_info=??? sig_alg=??? signature=??? ...}
  peer_dn 0x00bdfeb4 "(anonymous)"
  peer_cn 0x00bdff35 "(anonymous)"
  count 5287 (unsigned long)

pq_recvbuf() calls "secure_read(MyProcPort, PqRecvBuffer + PqRecvLength, PQ_BUFFER_SIZE - PqRecvLength);", which returns -1. I've inferred from control flow that errno=EINTR (can't debug the value unfortunately), so it continues the loop.

=========================================
Within secure_read()
=========================================

SSL_get_error() returns 10054 (connection reset or WSAEOPNOTSUPP, I'm not sure). The switch() command jumps to SSL_ERROR_SYSCALL, then immediately jumps to SSL_ERROR_NONE !?, and port->count is NOT incremented - this was strange, suggesting a source code mismatch or debugger problem perhaps. secure_read() returns -1, popping back to pq_recvbuf().

=========================================
Within my_sock_read()
=========================================

secure_read() causes a call to my_sock_read(), which has a separate stack trace (waiting on single object, I assume):

> postgres.exe!my_sock_read(bio_st * h=0x003fbcf0, char * buf=0x0165bc98, int size=5) Line 443  C
  libeay32.dll!0026a78f()
  [Frames below may be incorrect and/or missing, no symbols loaded for libeay32.dll]
  mswsock.dll!_WSPEventSelect@16() + 0x92 bytes
  ffffffff()

Within prepare_for_client_read(): DoingCommandRead = 1
Within EnableNotifyInterrupt(): IsTransactionOrTransactionBlock() = 0, notifyInterruptOccurred = 0
Within EnableCatchupInterrupt(): catchupInterruptOccurred = 0
After "res = std_sock_read(h, buf, size);" (can't debug the result, unfortunately), (buf, size, h) hadn't changed at all
Within client_read_ended(): DoingCommandRead = 1

Exiting my_sock_read() and wandering through some assembly, it finds itself back in socket_read(), then back in pq_recvbuf(), and the loop continues.
"Brendan Hill" <brendanh@jims.net> writes: > My best interpretation is that an SSL client dirty disconnected while > running a request. This caused an infinite loop in pq_recvbuf(), calling > secure_read(), triggering my_sock_read() over and over. Calling > SSL_get_error() in secure_read() returns 10045 (either connection reset, or > WSAEOPNOTSUPP, I'm not sure) - after this, pq_recvbuf() appears to think > errno=EINTR has occurred, so it immediately tries again. I wonder if this would be a good idea: #ifdef USE_SSL if (port->ssl) { int err; rloop: + errno = 0; n = SSL_read(port->ssl, ptr, len); err = SSL_get_error(port->ssl, n); switch (err) { case SSL_ERROR_NONE: port->count += n; break; It looks to me like the basic issue is that pq_recvbuf is expecting a relevant value of errno when secure_read returns -1, and there's some path in the Windows case where errno doesn't get set, and if it just happens to have been EINTR then we've got a loop. regards, tom lane
Hi Tom,

Makes sense to me. Seems to be happening rarely now. I'm not all that familiar with the open source process - is this likely to be included in the next release version?

-Brendan
"Brendan Hill" <brendanh@jims.net> writes: > Makes sense to me. Seems to be happening rarely now. > I'm not all that familiar with the open source process, is this likely to be > included in the next release version? Can you confirm that that change actually fixes the problem you're seeing? I'm happy to apply it if it does, but I'd like to know that the problem is dealt with. regards, tom lane > -----Original Message----- > From: Tom Lane [mailto:tgl@sss.pgh.pa.us] > Sent: Monday, 21 September 2009 5:25 AM > To: Brendan Hill > Cc: 'Craig Ringer'; pgsql-general@postgresql.org > Subject: Re: [GENERAL] Idle processes chewing up CPU? > "Brendan Hill" <brendanh@jims.net> writes: >> My best interpretation is that an SSL client dirty disconnected while >> running a request. This caused an infinite loop in pq_recvbuf(), calling >> secure_read(), triggering my_sock_read() over and over. Calling >> SSL_get_error() in secure_read() returns 10045 (either connection reset, > or >> WSAEOPNOTSUPP, I'm not sure) - after this, pq_recvbuf() appears to think >> errno=EINTR has occurred, so it immediately tries again. > I wonder if this would be a good idea: > #ifdef USE_SSL > if (port->ssl) > { > int err; > rloop: > + errno = 0; > n = SSL_read(port->ssl, ptr, len); > err = SSL_get_error(port->ssl, n); > switch (err) > { > case SSL_ERROR_NONE: > port->count += n; > break; > It looks to me like the basic issue is that pq_recvbuf is expecting > a relevant value of errno when secure_read returns -1, and there's > some path in the Windows case where errno doesn't get set, and if > it just happens to have been EINTR then we've got a loop. > regards, tom lane > -- > Sent via pgsql-general mailing list (pgsql-general@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general
Hi Tom,

Bit of a catch 22 - since it happens rarely, there's no definitive confirmation that it's fixed the problem. Also, I'm not sure I'm comfortable applying the change and recompiling myself; I wouldn't have a clue where to start.

I can't see how the change would hurt though; it seems like a good idea regardless. What do you think?

Regards,

-Brendan
"Brendan Hill" <brendanh@jims.net> writes: > Bit of a catch 22 - since it happens rarely, there's no definitive > confirmation that it's fixed the problem. > Also, not sure if I'm comfortable applying the change and recompiling > myself, wouldn't have a clue where to start. Uh, so you haven't actually tested it at all? > I can't see how the change would hurt though, seems like a good idea > regardless, what do you think? I'm not really inclined to throw it in there just on speculation that it *might* mask a bug that's really somewhere else. I can't help you with producing a recompiled executable for Windows, but there are other people around the list who can. regards, tom lane
"Brendan Hill" <brendanh@jims.net> writes: > I think I've confirmed the fix. Using a dirty disconnect generator, I was > able to reliably recreate the problem within about 30-60 seconds. The > symptoms were the same as before, however it occurred around SSL_write > instead of SSL_read - I assume this was due to the artificial nature of the > dirty disconnect (easier for the client to artificially break the connection > while waiting/receiving, than sending). > The solution you proposed solved it for SSL_write (ran for 30 minutes, no > runaway processes), and I think it's safe to assume SSL_read too. So I > suggest two additions: Done, and also the same on the libpq side, since presumably it could happen at that end as well. I suspect there is some underlying OpenSSL bug we are masking here, but it's cheap enough that we might as well just do it. regards, tom lane
Hi Tom,

I think I've confirmed the fix. Using a dirty disconnect generator, I was able to reliably recreate the problem within about 30-60 seconds. The symptoms were the same as before, however it occurred around SSL_write instead of SSL_read - I assume this was due to the artificial nature of the dirty disconnect (easier for the client to artificially break the connection while waiting/receiving, than sending).

The solution you proposed solved it for SSL_write (ran for 30 minutes, no runaway processes), and I think it's safe to assume SSL_read too. So I suggest two additions:

====================================================
  rloop:
  +     errno = 0;
        n = SSL_read(port->ssl, ptr, len);
        err = SSL_get_error(port->ssl, n);
        switch (err)
        {
            case SSL_ERROR_NONE:
                port->count += n;
                break;
====================================================

And:

====================================================
  wloop:
  +     errno = 0;
        n = SSL_write(port->ssl, ptr, len);
        err = SSL_get_error(port->ssl, n);
        switch (err)
        {
            case SSL_ERROR_NONE:
                port->count += n;
                break;
====================================================

I'm not comfortable running my own compiled version in production (it was rather difficult to get it working), so I'm interested to know when the next release is planned. We can test beta copies on a non-critical load balancing server if necessary.

Cheers,

-Brendan