Thread: Debugging a backend stuck consuming CPU

Debugging a backend stuck consuming CPU

From
"ktm@rice.edu"
Date:
Hi PostgreSQL community,

I am investigating a problem with a backend that appears to be stuck
and spinning while performing a "DISCARD ALL" command. The system is
running an older release 9.2.2. Are there any bugs that could be
causing this behavior? How can I tell what the process is actually
doing? It does not respond to either pg_cancel_backend() or to
pg_terminate_backend() so it is acting like it is not receiving the
signals. Any ideas on how to debug this would be appriciated.

Regards,
Ken


Re: Debugging a backend stuck consuming CPU

From
Tom Lane
Date:
"ktm@rice.edu" <ktm@rice.edu> writes:
> I am investigating a problem with a backend that appears to be stuck
> and spinning while performing a "DISCARD ALL" command. The system is
> running an older release 9.2.2.

You do realize that the current release in that series is 9.2.17.

> Are there any bugs that could be causing this behavior?

Known bugs are summarized here:
http://www.postgresql.org/docs/9.2/static/release.html

> How can I tell what the process is actually doing?

Getting a stack trace with gdb might be informative:
https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend

            regards, tom lane


Re: Debugging a backend stuck consuming CPU

From
Adrian Klaver
Date:
On 05/19/2016 06:46 AM, ktm@rice.edu wrote:
> Hi PostgreSQL community,
>
> I am investigating a problem with a backend that appears to be stuck
> and spinning while performing a "DISCARD ALL" command. The system is
> running an older release 9.2.2. Are there any bugs that could be
> causing this behavior? How can I tell what the process is actually
> doing? It does not respond to either pg_cancel_backend() or to
> pg_terminate_backend() so it is acting like it is not receiving the
> signals. Any ideas on how to debug this would be appriciated.

What does the Postgres log show around the time of the command and/or
currently?

What OS and does it's system log show anything?

Can you query pg_stat_activity?

http://www.postgresql.org/docs/9.2/static/monitoring-stats.html#PG-STAT-ACTIVITY-VIEW

>
> Regards,
> Ken
>
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: Debugging a backend stuck consuming CPU

From
"ktm@rice.edu"
Date:
On Thu, May 19, 2016 at 09:58:45AM -0400, Tom Lane wrote:
> "ktm@rice.edu" <ktm@rice.edu> writes:
> > I am investigating a problem with a backend that appears to be stuck
> > and spinning while performing a "DISCARD ALL" command. The system is
> > running an older release 9.2.2.
>
> You do realize that the current release in that series is 9.2.17.
>
> > Are there any bugs that could be causing this behavior?
>
> Known bugs are summarized here:
> http://www.postgresql.org/docs/9.2/static/release.html
>
> > How can I tell what the process is actually doing?
>
> Getting a stack trace with gdb might be informative:
> https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend
>
>             regards, tom lane
>

Hi,

We do have an upgrade scheduled. I will get a stack trace and see what
information that provides.

Regards,
Ken


Re: Debugging a backend stuck consuming CPU

From
Merlin Moncure
Date:
On Thu, May 19, 2016 at 8:46 AM, ktm@rice.edu <ktm@rice.edu> wrote:
> Hi PostgreSQL community,
>
> I am investigating a problem with a backend that appears to be stuck
> and spinning while performing a "DISCARD ALL" command. The system is
> running an older release 9.2.2. Are there any bugs that could be
> causing this behavior? How can I tell what the process is actually
> doing? It does not respond to either pg_cancel_backend() or to
> pg_terminate_backend() so it is acting like it is not receiving the
> signals. Any ideas on how to debug this would be appriciated.

1. Grab a stack trace and a 'perf top' for forensic analysis in case
it happens again
2. Install latest 9.2 binaries
3. Restart database

There's not much value in analyzing such an out of date database;
there's a very high probability your bug has already been fixed.

merlin


Re: Debugging a backend stuck consuming CPU

From
"ktm@rice.edu"
Date:
On Thu, May 19, 2016 at 09:58:45AM -0400, Tom Lane wrote:
> "ktm@rice.edu" <ktm@rice.edu> writes:
> > I am investigating a problem with a backend that appears to be stuck
> > and spinning while performing a "DISCARD ALL" command. The system is
> > running an older release 9.2.2.
>
> You do realize that the current release in that series is 9.2.17.
>
> > Are there any bugs that could be causing this behavior?
>
> Known bugs are summarized here:
> http://www.postgresql.org/docs/9.2/static/release.html
>
> > How can I tell what the process is actually doing?
>
> Getting a stack trace with gdb might be informative:
> https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend
>
>             regards, tom lane
>

Hi,

The stack trace just appeared to be what I would expect while a 'DISCARD ALL'
command was being run:

Continuing.

Program received signal SIGINT, Interrupt.
0x000000000073bc7c in MemoryContextSetParent ()
#0  0x000000000073bc7c in MemoryContextSetParent ()
#1  0x000000000073bde3 in MemoryContextDelete ()
#2  0x000000000054e3a9 in DropAllPreparedStatements ()
#3  0x00000000005365f3 in DiscardCommand ()
#4  0x00000000006582c7 in ?? ()
#5  0x00000000006592bd in ?? ()
#6  0x0000000000659a42 in PortalRun ()
#7  0x000000000065603d in ?? ()
#8  0x0000000000656ed0 in PostgresMain ()
#9  0x0000000000613b91 in ?? ()
#10 0x00000000006167fc in PostmasterMain ()
#11 0x00000000005b5290 in main ()
Continuing.

Program received signal SIGINT, Interrupt.
0x000000000073bc7c in MemoryContextSetParent ()
#0  0x000000000073bc7c in MemoryContextSetParent ()
#1  0x000000000073bde3 in MemoryContextDelete ()
#2  0x000000000054e3a9 in DropAllPreparedStatements ()
#3  0x00000000005365f3 in DiscardCommand ()
#4  0x00000000006582c7 in ?? ()
#5  0x00000000006592bd in ?? ()
#6  0x0000000000659a42 in PortalRun ()
#7  0x000000000065603d in ?? ()
#8  0x0000000000656ed0 in PostgresMain ()
#9  0x0000000000613b91 in ?? ()
#10 0x00000000006167fc in PostmasterMain ()
#11 0x00000000005b5290 in main ()
Continuing.

Program received signal SIGINT, Interrupt.
0x000000000073bc7c in MemoryContextSetParent ()
#0  0x000000000073bc7c in MemoryContextSetParent ()
#1  0x000000000073bde3 in MemoryContextDelete ()
#2  0x000000000070e7df in DropCachedPlan ()
#3  0x000000000054e3a9 in DropAllPreparedStatements ()
#4  0x00000000005365f3 in DiscardCommand ()
#5  0x00000000006582c7 in ?? ()
#6  0x00000000006592bd in ?? ()
#7  0x0000000000659a42 in PortalRun ()
#8  0x000000000065603d in ?? ()
#9  0x0000000000656ed0 in PostgresMain ()
#10 0x0000000000613b91 in ?? ()
#11 0x00000000006167fc in PostmasterMain ()
#12 0x00000000005b5290 in main ()
Continuing.

Program received signal SIGINT, Interrupt.
0x000000000073bc7c in MemoryContextSetParent ()
#0  0x000000000073bc7c in MemoryContextSetParent ()
#1  0x000000000073bde3 in MemoryContextDelete ()
#2  0x000000000054e3a9 in DropAllPreparedStatements ()
#3  0x00000000005365f3 in DiscardCommand ()
#4  0x00000000006582c7 in ?? ()
#5  0x00000000006592bd in ?? ()
#6  0x0000000000659a42 in PortalRun ()
#7  0x000000000065603d in ?? ()
#8  0x0000000000656ed0 in PostgresMain ()
#9  0x0000000000613b91 in ?? ()
#10 0x00000000006167fc in PostmasterMain ()
#11 0x00000000005b5290 in main ()
Continuing.

Program received signal SIGINT, Interrupt.
0x000000000070e7ff in DropCachedPlan ()
#0  0x000000000070e7ff in DropCachedPlan ()
#1  0x000000000054e3a9 in DropAllPreparedStatements ()
#2  0x00000000005365f3 in DiscardCommand ()
#3  0x00000000006582c7 in ?? ()
#4  0x00000000006592bd in ?? ()
#5  0x0000000000659a42 in PortalRun ()
#6  0x000000000065603d in ?? ()
#7  0x0000000000656ed0 in PostgresMain ()
#8  0x0000000000613b91 in ?? ()
#9  0x00000000006167fc in PostmasterMain ()
#10 0x00000000005b5290 in main ()
Detaching from program: /usr/pgsql-9.2/bin/postgres, process 38604
Undefined command: "exit".  Try "help".
Continuing.

Program received signal SIGINT, Interrupt.
0x000000000070e7ff in DropCachedPlan ()
#0  0x000000000070e7ff in DropCachedPlan ()
#1  0x000000000054e3a9 in DropAllPreparedStatements ()
#2  0x00000000005365f3 in DiscardCommand ()
#3  0x00000000006582c7 in ?? ()
#4  0x00000000006592bd in ?? ()
#5  0x0000000000659a42 in PortalRun ()
#6  0x000000000065603d in ?? ()
#7  0x0000000000656ed0 in PostgresMain ()
#8  0x0000000000613b91 in ?? ()
#9  0x00000000006167fc in PostmasterMain ()
#10 0x00000000005b5290 in main ()
Continuing.

Program received signal SIGINT, Interrupt.
0x000000000070e7ff in DropCachedPlan ()
#0  0x000000000070e7ff in DropCachedPlan ()
#1  0x000000000054e3a9 in DropAllPreparedStatements ()
#2  0x00000000005365f3 in DiscardCommand ()
#3  0x00000000006582c7 in ?? ()
#4  0x00000000006592bd in ?? ()
#5  0x0000000000659a42 in PortalRun ()
#6  0x000000000065603d in ?? ()
#7  0x0000000000656ed0 in PostgresMain ()
#8  0x0000000000613b91 in ?? ()
#9  0x00000000006167fc in PostmasterMain ()
#10 0x00000000005b5290 in main ()
Continuing.

Program received signal SIGINT, Interrupt.
0x000000000073bc7c in MemoryContextSetParent ()
#0  0x000000000073bc7c in MemoryContextSetParent ()
#1  0x000000000073bde3 in MemoryContextDelete ()
#2  0x000000000070e7df in DropCachedPlan ()
#3  0x000000000054e3a9 in DropAllPreparedStatements ()
#4  0x00000000005365f3 in DiscardCommand ()
#5  0x00000000006582c7 in ?? ()
#6  0x00000000006592bd in ?? ()
#7  0x0000000000659a42 in PortalRun ()
#8  0x000000000065603d in ?? ()
#9  0x0000000000656ed0 in PostgresMain ()
#10 0x0000000000613b91 in ?? ()
#11 0x00000000006167fc in PostmasterMain ()
#12 0x00000000005b5290 in main ()
Continuing.

Does a DISCARD command take alot of time, or is it like TRUNCATE? The
backend does have a very large memory footprint (12GB).

Regards,
Ken


Re: Debugging a backend stuck consuming CPU

From
Tom Lane
Date:
"ktm@rice.edu" <ktm@rice.edu> writes:
> The stack trace just appeared to be what I would expect while a 'DISCARD ALL'
> command was being run:

> #0  0x000000000073bc7c in MemoryContextSetParent ()
> #1  0x000000000073bde3 in MemoryContextDelete ()
> #2  0x000000000054e3a9 in DropAllPreparedStatements ()
> #3  0x00000000005365f3 in DiscardCommand ()

Hmm, what it seems from these traces is that you've got a whole heck of
a lot of prepared statements.

> The backend does have a very large memory footprint (12GB).

Um.

The most likely explanation is that you are hitting O(N^2) behavior as
a consequence of MemoryContextSetParent being O(N) in the number of
sibling contexts of the context to be deleted.  We fixed that for 9.6
(commit 25c539233044c235e97fd7c9dc600fb5f08fe065) but there's no easy
solution in older branches, short of not using so many prepared
statements.  I'm a bit surprised that you could have gotten up to 12GB
worth of prepared statements in an application that sends DISCARD ALL
periodically.

            regards, tom lane


Re: Debugging a backend stuck consuming CPU

From
"ktm@rice.edu"
Date:
On Thu, May 19, 2016 at 05:52:26PM -0400, Tom Lane wrote:
> "ktm@rice.edu" <ktm@rice.edu> writes:
> > The stack trace just appeared to be what I would expect while a 'DISCARD ALL'
> > command was being run:
>
> > #0  0x000000000073bc7c in MemoryContextSetParent ()
> > #1  0x000000000073bde3 in MemoryContextDelete ()
> > #2  0x000000000054e3a9 in DropAllPreparedStatements ()
> > #3  0x00000000005365f3 in DiscardCommand ()
>
> Hmm, what it seems from these traces is that you've got a whole heck of
> a lot of prepared statements.
>
> > The backend does have a very large memory footprint (12GB).
>
> Um.
>
> The most likely explanation is that you are hitting O(N^2) behavior as
> a consequence of MemoryContextSetParent being O(N) in the number of
> sibling contexts of the context to be deleted.  We fixed that for 9.6
> (commit 25c539233044c235e97fd7c9dc600fb5f08fe065) but there's no easy
> solution in older branches, short of not using so many prepared
> statements.  I'm a bit surprised that you could have gotten up to 12GB
> worth of prepared statements in an application that sends DISCARD ALL
> periodically.
>
>             regards, tom lane
>

Hi,

The DISCARD ALL is only sent by pgbouncer at the end of the processing.
The actual process builds up a cache to be used later whose size is
proportional to the number of items. The initial run is large, but the
regular runs are much smaller and cleanup quickly. I was more concerned
with incorrect behavior leading to DB corruption. Thank you for your
suggestions and assistance.

Regards,
Ken