Thread: Pg stuck at 100% cpu, for multiple days
Hi, Originally I posted it on -general, but Joe Conway suggested I repost in here for greater visibility... We hit a problem with Pg 12.6 (I know, we should upgrade, but that will take long time to prepare). Anyway - it's 12.6 on aarm64. Couple of days there was replication slot started, and now it seems to be stuck. =# select * from pg_stat_activity where pid = 22697 \gx ─[ RECORD 1 ]────┬────────────────────────────────────────────────────────── datid │ 16591 datname │ canvas pid │ 22697 usesysid │ 16505 usename │ <CENSORED> application_name │ PostgreSQL JDBC Driver client_addr │ <CENSORED> client_hostname │ [null] client_port │ 43160 backend_start │ 2021-08-18 02:12:05.758297+00 xact_start │ [null] query_start │ 2021-08-18 02:12:05.772271+00 state_change │ 2021-08-18 02:12:05.773428+00 wait_event_type │ [null] wait_event │ [null] state │ active backend_xid │ [null] backend_xmin │ [null] query │ SELECT COUNT(1) FROM pg_publication WHERE pubname = 'cdc' backend_type │ walsender =# select pg_current_wal_lsn(), pg_size_pretty( pg_current_wal_lsn() - sent_lsn), * from pg_stat_replication where pid =22697 \gx ─[ RECORD 1 ]──────┬────────────────────────────── pg_current_wal_lsn │ 1B14/718EA0B8 pg_size_pretty │ 290 GB pid │ 22697 usesysid │ 16505 usename │ <CENSORED> application_name │ PostgreSQL JDBC Driver client_addr │ <CENSORED> client_hostname │ [null] client_port │ 43160 backend_start │ 2021-08-18 02:12:05.758297+00 backend_xmin │ [null] state │ streaming sent_lsn │ 1ACC/D8689A8 write_lsn │ 1ACC/D527BD8 flush_lsn │ 1ACC/C97DF48 replay_lsn │ 1ACC/C97DF48 write_lag │ 00:00:00.257041 flush_lag │ 00:00:01.26665 replay_lag │ 00:00:01.26665 sync_priority │ 0 sync_state │ async reply_time │ 1999-12-21 03:15:13.449225+00 top shows the process using 100% of cpu. I tried strace'ing, but strace doesn't show *anything* - it just sits there. Got backtrace: ~# gdb --pid=22697 --batch -ex backtrace [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1". hash_seq_search (status=status@entry=0xffffdd90f380) at ./build/../src/backend/utils/hash/dynahash.c:1448 1448 ./build/../src/backend/utils/hash/dynahash.c: No such file or directory. #0 hash_seq_search (status=status@entry=0xffffdd90f380) at ./build/../src/backend/utils/hash/dynahash.c:1448 #1 0x0000aaaac3042060 in RelfilenodeMapInvalidateCallback (arg=<optimized out>, relid=105496194) at ./build/../src/backend/utils/cache/relfilenodemap.c:64 #2 0x0000aaaac3033aa4 in LocalExecuteInvalidationMessage (msg=0xffff9b66eec8) at ./build/../src/backend/utils/cache/inval.c:595 #3 0x0000aaaac2ec8274 in ReorderBufferExecuteInvalidations (rb=0xaaaac326bb00 <errordata>, txn=0xaaaac326b998 <formatted_start_time>,txn=0xaaaac326b998 <formatted_start_time>) at ./build/../src/backend/replication/logical/reorderbuffer.c:2149 #4 ReorderBufferCommit (rb=0xaaaac326bb00 <errordata>, xid=xid@entry=2668396569, commit_lsn=187650393290540, end_lsn=<optimizedout>, commit_time=commit_time@entry=683222349268077, origin_id=origin_id@entry=0, origin_lsn=origin_lsn@entry=0)at ./build/../src/backend/replication/logical/reorderbuffer.c:1770 #5 0x0000aaaac2ebd314 in DecodeCommit (xid=2668396569, parsed=0xffffdd90f7e0, buf=0xffffdd90f960, ctx=0xaaaaf5d396a0) at./build/../src/backend/replication/logical/decode.c:640 #6 DecodeXactOp (ctx=ctx@entry=0xaaaaf5d396a0, buf=0xffffdd90f960, buf@entry=0xffffdd90f9c0) at ./build/../src/backend/replication/logical/decode.c:248 #7 0x0000aaaac2ebd42c in LogicalDecodingProcessRecord (ctx=0xaaaaf5d396a0, record=0xaaaaf5d39938) at ./build/../src/backend/replication/logical/decode.c:117 #8 0x0000aaaac2ecfdfc in XLogSendLogical () at ./build/../src/backend/replication/walsender.c:2840 #9 0x0000aaaac2ed2228 in WalSndLoop (send_data=send_data@entry=0xaaaac2ecfd98 <XLogSendLogical>) at ./build/../src/backend/replication/walsender.c:2189 #10 0x0000aaaac2ed2efc in StartLogicalReplication (cmd=0xaaaaf5d175a8) at ./build/../src/backend/replication/walsender.c:1133 #11 exec_replication_command (cmd_string=cmd_string@entry=0xaaaaf5c0eb00 "START_REPLICATION SLOT cdc LOGICAL 1A2D/4B3640(\"proto_version\" '1', \"publication_names\" 'cdc')") at ./build/../src/backend/replication/walsender.c:1549 #12 0x0000aaaac2f258a4 in PostgresMain (argc=<optimized out>, argv=argv@entry=0xaaaaf5c78cd8, dbname=<optimized out>, username=<optimizedout>) at ./build/../src/backend/tcop/postgres.c:4257 #13 0x0000aaaac2eac338 in BackendRun (port=0xaaaaf5c68070, port=0xaaaaf5c68070) at ./build/../src/backend/postmaster/postmaster.c:4484 #14 BackendStartup (port=0xaaaaf5c68070) at ./build/../src/backend/postmaster/postmaster.c:4167 #15 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1725 #16 0x0000aaaac2ead364 in PostmasterMain (argc=<optimized out>, argv=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:1398 #17 0x0000aaaac2c3ca5c in main (argc=5, argv=0xaaaaf5c07720) at ./build/../src/backend/main/main.c:228 The thing is - I can't close it with pg_terminate_backend(), and I'd rather not kill -9, as it will, I think, close all other connections, and this is prod server. The other end of the connection was something in kubernetes, and it no longer exists. Is there anything we could do about it? Best regards, depesz
On Mon, 2021-08-30 at 17:18 +0200, hubert depesz lubaczewski wrote: > The thing is - I can't close it with pg_terminate_backend(), and I'd > rather not kill -9, as it will, I think, close all other connections, > and this is prod server. Of course the cause should be fixed, but to serve your immediate need: After calling pg_terminate_backend(), you can attach gdb to the backend and then run print ProcessInterrupts() That will cause the backend to exit normally without crashing the server. Yours, Laurenz Albe
On Mon, Aug 30, 2021 at 09:09:20PM +0200, Laurenz Albe wrote: > On Mon, 2021-08-30 at 17:18 +0200, hubert depesz lubaczewski wrote: > > The thing is - I can't close it with pg_terminate_backend(), and I'd > > rather not kill -9, as it will, I think, close all other connections, > > and this is prod server. > > Of course the cause should be fixed, but to serve your immediate need: You might save a coredump of the process using gdb gcore before killing it, in case someone thinks how to debug it next month. Depending on your OS, you might have to do something special to get shared buffers included in the dump (or excluded, if that's what's desirable). I wonder how far up the stacktrace it's stuck ? You could set a breakpoint on LogicalDecodingProcessRecord and then "c"ontinue, and see if it hits the breakpoint in a few seconds. If not, try the next frame until you know which one is being called repeatedly. Maybe CheckForInterrupts should be added somewhere... -- Justin
On 8/30/21 3:34 PM, Justin Pryzby wrote: > On Mon, Aug 30, 2021 at 09:09:20PM +0200, Laurenz Albe wrote: >> On Mon, 2021-08-30 at 17:18 +0200, hubert depesz lubaczewski wrote: >> > The thing is - I can't close it with pg_terminate_backend(), and I'd >> > rather not kill -9, as it will, I think, close all other connections, >> > and this is prod server. >> >> Of course the cause should be fixed, but to serve your immediate need: > > You might save a coredump of the process using gdb gcore before killing it, in > case someone thinks how to debug it next month. > > Depending on your OS, you might have to do something special to get shared > buffers included in the dump (or excluded, if that's what's desirable). > > I wonder how far up the stacktrace it's stuck ? > You could set a breakpoint on LogicalDecodingProcessRecord and then "c"ontinue, > and see if it hits the breakpoint in a few seconds. If not, try the next > frame until you know which one is being called repeatedly. > > Maybe CheckForInterrupts should be added somewhere... The spot in the backtrace... #0 hash_seq_search (status=status@entry=0xffffdd90f380) at ./build/../src/backend/utils/hash/dynahash.c:1448 ...is in the middle of this while loop: 8<----------------------------------------- while ((curElem = segp[segment_ndx]) == NULL) { /* empty bucket, advance to next */ if (++curBucket > max_bucket) { status->curBucket = curBucket; hash_seq_term(status); return NULL; /* search is done */ } if (++segment_ndx >= ssize) { segment_num++; segment_ndx = 0; segp = hashp->dir[segment_num]; } } 8<----------------------------------------- It would be interesting to step through a few times to see if it is really stuck in that loop. That would be consistent with 100% CPU and not checking for interrupts I think. Joe -- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development
Joe Conway <mail@joeconway.com> writes: > It would be interesting to step through a few times to see if it is > really stuck in that loop. Yeah, this single data point is not enough justification to blame dynahash.c (which is *extremely* battle-tested code, you'll recall). I'm inclined to guess that the looping is happening a few stack levels further up, in the logical-decoding code (which is, um, not so much). regards, tom lane
On 8/30/21 8:22 PM, Tom Lane wrote: > Joe Conway <mail@joeconway.com> writes: >> It would be interesting to step through a few times to see if it is >> really stuck in that loop. > > Yeah, this single data point is not enough justification to blame > dynahash.c (which is *extremely* battle-tested code, you'll recall). > I'm inclined to guess that the looping is happening a few stack levels > further up, in the logical-decoding code (which is, um, not so much). Heh, fair point :-) Joe -- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development
On Mon, Aug 30, 2021 at 09:16:51PM -0400, Joe Conway wrote: > On 8/30/21 8:22 PM, Tom Lane wrote: >> Yeah, this single data point is not enough justification to blame >> dynahash.c (which is *extremely* battle-tested code, you'll recall). >> I'm inclined to guess that the looping is happening a few stack levels >> further up, in the logical-decoding code (which is, um, not so much). > > Heh, fair point :-) It looks something is messed up with the list of invalidation messages to process in this code path. Maybe some incorrect memory context handling? -- Michael
Attachment
On Mon, Aug 30, 2021 at 09:09:20PM +0200, Laurenz Albe wrote: > On Mon, 2021-08-30 at 17:18 +0200, hubert depesz lubaczewski wrote: > > The thing is - I can't close it with pg_terminate_backend(), and I'd > > rather not kill -9, as it will, I think, close all other connections, > > and this is prod server. > > Of course the cause should be fixed, but to serve your immediate need: > > After calling pg_terminate_backend(), you can attach gdb to the backend and then run > > print ProcessInterrupts() > > That will cause the backend to exit normally without crashing the server. I got this mail too late, and the decision was made to restart Pg. After restart all cleaned up nicely. So, while I can't help more with diagnosing the problem, I think it might be good to try to figure out what could have happened. On my end I gathered some more data: 1. the logical replication app is debezium 2. as far as I can tell it was patched against https://issues.redhat.com/browse/DBZ-1596 3. app was gone (kubernetes cluister was shut down) in the mean time. 4. the backend was up and running for 12 days, in the tight loop. depesz
On Mon, Aug 30, 2021 at 08:15:24PM -0400, Joe Conway wrote: > It would be interesting to step through a few times to see if it is really > stuck in that loop. That would be consistent with 100% CPU and not checking > for interrupts I think. If the problem will happen again, will do my best to get more information. Best regards, depesz
On Tue, Aug 31, 2021 at 11:41 AM hubert depesz lubaczewski <depesz@depesz.com> wrote: > > On Mon, Aug 30, 2021 at 09:09:20PM +0200, Laurenz Albe wrote: > > On Mon, 2021-08-30 at 17:18 +0200, hubert depesz lubaczewski wrote: > > > The thing is - I can't close it with pg_terminate_backend(), and I'd > > > rather not kill -9, as it will, I think, close all other connections, > > > and this is prod server. > > > > Of course the cause should be fixed, but to serve your immediate need: > > > > After calling pg_terminate_backend(), you can attach gdb to the backend and then run > > > > print ProcessInterrupts() > > > > That will cause the backend to exit normally without crashing the server. > > I got this mail too late, and the decision was made to restart Pg. After > restart all cleaned up nicely. > > So, while I can't help more with diagnosing the problem, I think it > might be good to try to figure out what could have happened. > One possibility could be there are quite a few DDLs happening in this application at some particular point in time which can lead to high CPU usage. Prior to commit d7eb52d718 in PG-14, we use to execute all invalidations at each command end during logical decoding which might lead to such behavior temporarily. I think a bit of debugging when it shows this symptom could help us to identify if it is the problem I am speculating here. -- With Regards, Amit Kapila.
On Tue, Aug 31, 2021 at 04:00:14PM +0530, Amit Kapila wrote: > One possibility could be there are quite a few DDLs happening in this > application at some particular point in time which can lead to high While not impossible, I'd rather say it's not very likely. We don't use temporary tables, and while there are DB migrations, I don't think they are often, and definitely don't take many days. Best regards, depesz