Thread: Pg stuck at 100% cpu, for multiple days

Pg stuck at 100% cpu, for multiple days

From
hubert depesz lubaczewski
Date:
Hi,

Originally I posted it on -general, but Joe Conway suggested I repost in
here for greater visibility...

We hit a problem with Pg 12.6 (I know, we should upgrade, but that will
take a long time to prepare).

Anyway - it's 12.6 on aarch64.
A couple of days ago a replication slot consumer started, and now the
walsender seems to be stuck.

=# select * from pg_stat_activity  where pid = 22697 \gx
─[ RECORD 1 ]────┬──────────────────────────────────────────────────────────
datid            │ 16591
datname          │ canvas
pid              │ 22697
usesysid         │ 16505
usename          │ <CENSORED>
application_name │ PostgreSQL JDBC Driver
client_addr      │ <CENSORED>
client_hostname  │ [null]
client_port      │ 43160
backend_start    │ 2021-08-18 02:12:05.758297+00
xact_start       │ [null]
query_start      │ 2021-08-18 02:12:05.772271+00
state_change     │ 2021-08-18 02:12:05.773428+00
wait_event_type  │ [null]
wait_event       │ [null]
state            │ active
backend_xid      │ [null]
backend_xmin     │ [null]
query            │ SELECT COUNT(1) FROM pg_publication WHERE pubname = 'cdc'
backend_type     │ walsender

=# select pg_current_wal_lsn(), pg_size_pretty( pg_current_wal_lsn() - sent_lsn), * from pg_stat_replication where pid = 22697 \gx
─[ RECORD 1 ]──────┬──────────────────────────────
pg_current_wal_lsn │ 1B14/718EA0B8
pg_size_pretty     │ 290 GB
pid                │ 22697
usesysid           │ 16505
usename            │ <CENSORED>
application_name   │ PostgreSQL JDBC Driver
client_addr        │ <CENSORED>
client_hostname    │ [null]
client_port        │ 43160
backend_start      │ 2021-08-18 02:12:05.758297+00
backend_xmin       │ [null]
state              │ streaming
sent_lsn           │ 1ACC/D8689A8
write_lsn          │ 1ACC/D527BD8
flush_lsn          │ 1ACC/C97DF48
replay_lsn         │ 1ACC/C97DF48
write_lag          │ 00:00:00.257041
flush_lag          │ 00:00:01.26665
replay_lag         │ 00:00:01.26665
sync_priority      │ 0
sync_state         │ async
reply_time         │ 1999-12-21 03:15:13.449225+00

top shows the process using 100% of cpu. I tried strace'ing, but strace doesn't
show *anything* - it just sits there.

Got backtrace:

~# gdb --pid=22697 --batch -ex backtrace
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
hash_seq_search (status=status@entry=0xffffdd90f380) at ./build/../src/backend/utils/hash/dynahash.c:1448
1448    ./build/../src/backend/utils/hash/dynahash.c: No such file or directory.
#0  hash_seq_search (status=status@entry=0xffffdd90f380) at ./build/../src/backend/utils/hash/dynahash.c:1448
#1  0x0000aaaac3042060 in RelfilenodeMapInvalidateCallback (arg=<optimized out>, relid=105496194) at ./build/../src/backend/utils/cache/relfilenodemap.c:64
#2  0x0000aaaac3033aa4 in LocalExecuteInvalidationMessage (msg=0xffff9b66eec8) at ./build/../src/backend/utils/cache/inval.c:595
#3  0x0000aaaac2ec8274 in ReorderBufferExecuteInvalidations (rb=0xaaaac326bb00 <errordata>, txn=0xaaaac326b998 <formatted_start_time>, txn=0xaaaac326b998 <formatted_start_time>) at ./build/../src/backend/replication/logical/reorderbuffer.c:2149
#4  ReorderBufferCommit (rb=0xaaaac326bb00 <errordata>, xid=xid@entry=2668396569, commit_lsn=187650393290540, end_lsn=<optimized out>, commit_time=commit_time@entry=683222349268077, origin_id=origin_id@entry=0, origin_lsn=origin_lsn@entry=0) at ./build/../src/backend/replication/logical/reorderbuffer.c:1770
#5  0x0000aaaac2ebd314 in DecodeCommit (xid=2668396569, parsed=0xffffdd90f7e0, buf=0xffffdd90f960, ctx=0xaaaaf5d396a0) at ./build/../src/backend/replication/logical/decode.c:640
#6  DecodeXactOp (ctx=ctx@entry=0xaaaaf5d396a0, buf=0xffffdd90f960, buf@entry=0xffffdd90f9c0) at ./build/../src/backend/replication/logical/decode.c:248
#7  0x0000aaaac2ebd42c in LogicalDecodingProcessRecord (ctx=0xaaaaf5d396a0, record=0xaaaaf5d39938) at ./build/../src/backend/replication/logical/decode.c:117
#8  0x0000aaaac2ecfdfc in XLogSendLogical () at ./build/../src/backend/replication/walsender.c:2840
#9  0x0000aaaac2ed2228 in WalSndLoop (send_data=send_data@entry=0xaaaac2ecfd98 <XLogSendLogical>) at ./build/../src/backend/replication/walsender.c:2189
#10 0x0000aaaac2ed2efc in StartLogicalReplication (cmd=0xaaaaf5d175a8) at ./build/../src/backend/replication/walsender.c:1133
#11 exec_replication_command (cmd_string=cmd_string@entry=0xaaaaf5c0eb00 "START_REPLICATION SLOT cdc LOGICAL 1A2D/4B3640 (\"proto_version\" '1', \"publication_names\" 'cdc')") at ./build/../src/backend/replication/walsender.c:1549
#12 0x0000aaaac2f258a4 in PostgresMain (argc=<optimized out>, argv=argv@entry=0xaaaaf5c78cd8, dbname=<optimized out>, username=<optimized out>) at ./build/../src/backend/tcop/postgres.c:4257
#13 0x0000aaaac2eac338 in BackendRun (port=0xaaaaf5c68070, port=0xaaaaf5c68070) at ./build/../src/backend/postmaster/postmaster.c:4484
#14 BackendStartup (port=0xaaaaf5c68070) at ./build/../src/backend/postmaster/postmaster.c:4167
#15 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1725
#16 0x0000aaaac2ead364 in PostmasterMain (argc=<optimized out>, argv=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:1398
#17 0x0000aaaac2c3ca5c in main (argc=5, argv=0xaaaaf5c07720) at ./build/../src/backend/main/main.c:228

The thing is - I can't close it with pg_terminate_backend(), and I'd
rather not kill -9, as it will, I think, close all other connections,
and this is prod server.

The other end of the connection was something in kubernetes, and it no
longer exists.

Is there anything we could do about it?

Best regards,

depesz




Re: Pg stuck at 100% cpu, for multiple days

From
Laurenz Albe
Date:
On Mon, 2021-08-30 at 17:18 +0200, hubert depesz lubaczewski wrote:
> The thing is - I can't close it with pg_terminate_backend(), and I'd
> rather not kill -9, as it will, I think, close all other connections,
> and this is prod server.

Of course the cause should be fixed, but to serve your immediate need:

After calling pg_terminate_backend(), you can attach gdb to the backend and then run

  print ProcessInterrupts()

That will cause the backend to exit normally without crashing the server.
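A possible session, sketched with the PID from the first message (gdb output elided; the gdb commands here follow the advice above, not a tested recipe):

```
-- in psql: sets the backend's pending "terminate" interrupt flag
=# SELECT pg_terminate_backend(22697);

# on the server, as a user allowed to ptrace the backend:
~# gdb --pid=22697
(gdb) print ProcessInterrupts()
(gdb) detach
(gdb) quit
```

Since `pg_terminate_backend()` only sets a flag that the stuck code never checks, calling `ProcessInterrupts()` by hand makes the backend act on it and exit cleanly.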

Yours,
Laurenz Albe




Re: Pg stuck at 100% cpu, for multiple days

From
Justin Pryzby
Date:
On Mon, Aug 30, 2021 at 09:09:20PM +0200, Laurenz Albe wrote:
> On Mon, 2021-08-30 at 17:18 +0200, hubert depesz lubaczewski wrote:
> > The thing is - I can't close it with pg_terminate_backend(), and I'd
> > rather not kill -9, as it will, I think, close all other connections,
> > and this is prod server.
> 
> Of course the cause should be fixed, but to serve your immediate need:

You might save a coredump of the process using gdb's gcore before killing it, in
case someone works out how to debug it later.

Depending on your OS, you might have to do something special to get shared
buffers included in the dump (or excluded, if that's what's desirable).

I wonder how far up the stacktrace it's stuck ?
You could set a breakpoint on LogicalDecodingProcessRecord and then "c"ontinue,
and see if it hits the breakpoint in a few seconds.  If not, try the next
frame until you know which one is being called repeatedly.
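Sketched against the frames in the backtrace from the first message (gdb output elided; a bisection sketch, not a tested recipe):

```
~# gdb --pid=22697
(gdb) break LogicalDecodingProcessRecord
(gdb) continue
# If the breakpoint is hit again within seconds, that function is
# being re-entered by a loop further out; if it is never hit, the
# process is spinning somewhere below it (frames #0-#6).  Then:
(gdb) delete breakpoints
(gdb) break ReorderBufferExecuteInvalidations
(gdb) continue
```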

Maybe a CHECK_FOR_INTERRUPTS() call should be added somewhere...

-- 
Justin



Re: Pg stuck at 100% cpu, for multiple days

From
Joe Conway
Date:
On 8/30/21 3:34 PM, Justin Pryzby wrote:
> On Mon, Aug 30, 2021 at 09:09:20PM +0200, Laurenz Albe wrote:
>> On Mon, 2021-08-30 at 17:18 +0200, hubert depesz lubaczewski wrote:
>> > The thing is - I can't close it with pg_terminate_backend(), and I'd
>> > rather not kill -9, as it will, I think, close all other connections,
>> > and this is prod server.
>> 
>> Of course the cause should be fixed, but to serve your immediate need:
> 
> You might save a coredump of the process using gdb gcore before killing it, in
> case someone thinks how to debug it next month.
> 
> Depending on your OS, you might have to do something special to get shared
> buffers included in the dump (or excluded, if that's what's desirable).
> 
> I wonder how far up the stacktrace it's stuck ?
> You could set a breakpoint on LogicalDecodingProcessRecord and then "c"ontinue,
> and see if it hits the breakpoint in a few seconds.  If not, try the next
> frame until you know which one is being called repeatedly.
> 
> Maybe a CHECK_FOR_INTERRUPTS() call should be added somewhere...

The spot in the backtrace...

#0  hash_seq_search (status=status@entry=0xffffdd90f380) at 
./build/../src/backend/utils/hash/dynahash.c:1448

...is in the middle of this while loop:
8<-----------------------------------------
     while ((curElem = segp[segment_ndx]) == NULL)
     {
         /* empty bucket, advance to next */
         if (++curBucket > max_bucket)
         {
             status->curBucket = curBucket;
             hash_seq_term(status);
             return NULL;  /* search is done */
         }
         if (++segment_ndx >= ssize)
         {
             segment_num++;
             segment_ndx = 0;
             segp = hashp->dir[segment_num];
         }
     }
8<-----------------------------------------

It would be interesting to step through a few times to see if it is 
really stuck in that loop. That would be consistent with 100% CPU and 
not checking for interrupts I think.
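For intuition, the skip-empty-buckets scan can be modeled in a few lines of Python (a hypothetical, much-simplified model: real dynahash keeps bucket chains in a directory of fixed-size segments, and `SSIZE`, `make_table`, and `seq_search` here are invented names). Note that with a self-consistent table the loop always terminates, because curBucket strictly increases:

```python
# Simplified model of the bucket scan in hash_seq_search().
# Hypothetical structure; real dynahash uses chains of entries
# per bucket and a larger segment size.

SSIZE = 4  # buckets per segment (invented value for the model)

def make_table(buckets):
    """Split a flat bucket list into SSIZE-sized segments."""
    return [buckets[i:i + SSIZE] for i in range(0, len(buckets), SSIZE)]

def seq_search(table, max_bucket, cur_bucket):
    """Advance to the next non-empty bucket, like the while loop above.

    Returns (element, next_bucket), or (None, next_bucket) when the
    scan has passed max_bucket and is done.
    """
    while cur_bucket <= max_bucket:
        segment_num, segment_ndx = divmod(cur_bucket, SSIZE)
        elem = table[segment_num][segment_ndx]
        if elem is not None:
            return elem, cur_bucket + 1
        cur_bucket += 1  # empty bucket, advance to next
    return None, cur_bucket  # search is done

# Two entries scattered among empty buckets:
table = make_table([None, 'rel_a', None, None,
                    None, None, 'rel_b', None])
found = []
bucket = 0
while True:
    elem, bucket = seq_search(table, max_bucket=7, cur_bucket=bucket)
    if elem is None:
        break
    found.append(elem)
```

Since the scan itself can only run off the end of the table, sustained spinning would more plausibly come from a caller restarting the scan over and over.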

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development



Re: Pg stuck at 100% cpu, for multiple days

From
Tom Lane
Date:
Joe Conway <mail@joeconway.com> writes:
> It would be interesting to step through a few times to see if it is 
> really stuck in that loop.

Yeah, this single data point is not enough justification to blame
dynahash.c (which is *extremely* battle-tested code, you'll recall).
I'm inclined to guess that the looping is happening a few stack levels
further up, in the logical-decoding code (which is, um, not so much).

            regards, tom lane



Re: Pg stuck at 100% cpu, for multiple days

From
Joe Conway
Date:
On 8/30/21 8:22 PM, Tom Lane wrote:
> Joe Conway <mail@joeconway.com> writes:
>> It would be interesting to step through a few times to see if it is 
>> really stuck in that loop.
> 
> Yeah, this single data point is not enough justification to blame
> dynahash.c (which is *extremely* battle-tested code, you'll recall).
> I'm inclined to guess that the looping is happening a few stack levels
> further up, in the logical-decoding code (which is, um, not so much).


Heh, fair point :-)

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development



Re: Pg stuck at 100% cpu, for multiple days

From
Michael Paquier
Date:
On Mon, Aug 30, 2021 at 09:16:51PM -0400, Joe Conway wrote:
> On 8/30/21 8:22 PM, Tom Lane wrote:
>> Yeah, this single data point is not enough justification to blame
>> dynahash.c (which is *extremely* battle-tested code, you'll recall).
>> I'm inclined to guess that the looping is happening a few stack levels
>> further up, in the logical-decoding code (which is, um, not so much).
>
> Heh, fair point :-)

It looks like something is messed up with the list of invalidation messages
to process in this code path.  Maybe some incorrect memory context
handling?
--
Michael


Re: Pg stuck at 100% cpu, for multiple days

From
hubert depesz lubaczewski
Date:
On Mon, Aug 30, 2021 at 09:09:20PM +0200, Laurenz Albe wrote:
> On Mon, 2021-08-30 at 17:18 +0200, hubert depesz lubaczewski wrote:
> > The thing is - I can't close it with pg_terminate_backend(), and I'd
> > rather not kill -9, as it will, I think, close all other connections,
> > and this is prod server.
> 
> Of course the cause should be fixed, but to serve your immediate need:
> 
> After calling pg_terminate_backend(), you can attach gdb to the backend and then run
> 
>   print ProcessInterrupts()
> 
> That will cause the backend to exit normally without crashing the server.

I got this mail too late, and the decision was made to restart Pg. After
restart all cleaned up nicely.

So, while I can't help more with diagnosing the problem, I think it
might be good to try to figure out what could have happened.

On my end I gathered some more data:
1. the logical replication app is debezium
2. as far as I can tell it was patched against
   https://issues.redhat.com/browse/DBZ-1596
3. the app was gone (the kubernetes cluster was shut down) in the meantime.
4. the backend was up and running for 12 days, in the tight loop.

depesz



Re: Pg stuck at 100% cpu, for multiple days

From
hubert depesz lubaczewski
Date:
On Mon, Aug 30, 2021 at 08:15:24PM -0400, Joe Conway wrote:
> It would be interesting to step through a few times to see if it is really
> stuck in that loop. That would be consistent with 100% CPU and not checking
> for interrupts I think.

If the problem happens again, I will do my best to get more
information.

Best regards,

depesz




Re: Pg stuck at 100% cpu, for multiple days

From
Amit Kapila
Date:
On Tue, Aug 31, 2021 at 11:41 AM hubert depesz lubaczewski
<depesz@depesz.com> wrote:
>
> On Mon, Aug 30, 2021 at 09:09:20PM +0200, Laurenz Albe wrote:
> > On Mon, 2021-08-30 at 17:18 +0200, hubert depesz lubaczewski wrote:
> > > The thing is - I can't close it with pg_terminate_backend(), and I'd
> > > rather not kill -9, as it will, I think, close all other connections,
> > > and this is prod server.
> >
> > Of course the cause should be fixed, but to serve your immediate need:
> >
> > After calling pg_terminate_backend(), you can attach gdb to the backend and then run
> >
> >   print ProcessInterrupts()
> >
> > That will cause the backend to exit normally without crashing the server.
>
> I got this mail too late, and the decision was made to restart Pg. After
> restart all cleaned up nicely.
>
> So, while I can't help more with diagnosing the problem, I think it
> might be good to try to figure out what could have happened.
>

One possibility could be that there are quite a few DDLs happening in this
application at some particular point in time, which can lead to high
CPU usage. Prior to commit d7eb52d718 in PG-14, we used to execute all
invalidations at each command end during logical decoding, which might
lead to such behavior temporarily. A bit of debugging when it
shows this symptom could help us identify whether it is the problem I am
speculating about here.
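The scenario above can be sketched as a toy cost model in Python (a hypothetical illustration, not PostgreSQL code; `work_reexecute_all` and `work_execute_once` are invented names): replaying the full accumulated invalidation list at every command end makes total work grow quadratically with the number of DDL-generated messages, which for a large transaction can look like a days-long CPU-bound loop.

```python
# Toy cost model for invalidation execution during decoding.

def work_reexecute_all(num_commands, msgs_per_command=1):
    """Each command end replays every message accumulated so far
    (the pre-d7eb52d718 behavior, per the description above)."""
    executed = 0
    accumulated = 0
    for _ in range(num_commands):
        accumulated += msgs_per_command
        executed += accumulated  # replay the whole list again
    return executed

def work_execute_once(num_commands, msgs_per_command=1):
    """Each message is executed exactly once."""
    return num_commands * msgs_per_command

# 10,000 DDL commands in one decoded transaction:
quadratic = work_reexecute_all(10_000)  # sum 1..10000 = 50,005,000
linear = work_execute_once(10_000)      # 10,000
```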

-- 
With Regards,
Amit Kapila.



Re: Pg stuck at 100% cpu, for multiple days

From
hubert depesz lubaczewski
Date:
On Tue, Aug 31, 2021 at 04:00:14PM +0530, Amit Kapila wrote:
> One possibility could be there are quite a few DDLs happening in this
> application at some particular point in time which can lead to high

While not impossible, I'd say it's not very likely. We don't use
temporary tables, and while there are DB migrations, I don't think they
happen often, and they definitely don't take many days.

Best regards,

depesz