Pg stuck at 100% cpu, for multiple days - Mailing list pgsql-hackers
From      | hubert depesz lubaczewski
Subject   | Pg stuck at 100% cpu, for multiple days
Date      |
Msg-id    | 20210830151829.GB9945@depesz.com
Responses | Re: Pg stuck at 100% cpu, for multiple days
List      | pgsql-hackers
Hi,

Originally I posted this on -general, but Joe Conway suggested I repost it here for greater visibility...

We hit a problem with Pg 12.6 (I know, we should upgrade, but that will take a long time to prepare). Anyway - it's 12.6 on aarch64. A couple of days ago replication from a logical slot was started, and now it seems to be stuck.

=# select * from pg_stat_activity where pid = 22697 \gx
─[ RECORD 1 ]────┬──────────────────────────────────────────────────────────
datid            │ 16591
datname          │ canvas
pid              │ 22697
usesysid         │ 16505
usename          │ <CENSORED>
application_name │ PostgreSQL JDBC Driver
client_addr      │ <CENSORED>
client_hostname  │ [null]
client_port      │ 43160
backend_start    │ 2021-08-18 02:12:05.758297+00
xact_start       │ [null]
query_start      │ 2021-08-18 02:12:05.772271+00
state_change     │ 2021-08-18 02:12:05.773428+00
wait_event_type  │ [null]
wait_event       │ [null]
state            │ active
backend_xid      │ [null]
backend_xmin     │ [null]
query            │ SELECT COUNT(1) FROM pg_publication WHERE pubname = 'cdc'
backend_type     │ walsender

=# select pg_current_wal_lsn(), pg_size_pretty(pg_current_wal_lsn() - sent_lsn), * from pg_stat_replication where pid = 22697 \gx
─[ RECORD 1 ]──────┬──────────────────────────────
pg_current_wal_lsn │ 1B14/718EA0B8
pg_size_pretty     │ 290 GB
pid                │ 22697
usesysid           │ 16505
usename            │ <CENSORED>
application_name   │ PostgreSQL JDBC Driver
client_addr        │ <CENSORED>
client_hostname    │ [null]
client_port        │ 43160
backend_start      │ 2021-08-18 02:12:05.758297+00
backend_xmin       │ [null]
state              │ streaming
sent_lsn           │ 1ACC/D8689A8
write_lsn          │ 1ACC/D527BD8
flush_lsn          │ 1ACC/C97DF48
replay_lsn         │ 1ACC/C97DF48
write_lag          │ 00:00:00.257041
flush_lag          │ 00:00:01.26665
replay_lag         │ 00:00:01.26665
sync_priority      │ 0
sync_state         │ async
reply_time         │ 1999-12-21 03:15:13.449225+00
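Since the slot keeps WAL around via its restart_lsn, a rough way to check how much it is retaining (assuming the slot name 'cdc' from the START_REPLICATION command visible in the backtrace below) is something along these lines:

=# select slot_name, active, active_pid,
          pg_size_pretty(pg_current_wal_lsn() - restart_lsn) as retained_wal
     from pg_replication_slots
    where slot_name = 'cdc';

While the walsender is still attached, active_pid there should point back at 22697.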
top shows the process using 100% of cpu. I tried strace'ing it, but strace doesn't show *anything* - it just sits there.

Got a backtrace:

~# gdb --pid=22697 --batch -ex backtrace
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
hash_seq_search (status=status@entry=0xffffdd90f380) at ./build/../src/backend/utils/hash/dynahash.c:1448
1448    ./build/../src/backend/utils/hash/dynahash.c: No such file or directory.
#0  hash_seq_search (status=status@entry=0xffffdd90f380) at ./build/../src/backend/utils/hash/dynahash.c:1448
#1  0x0000aaaac3042060 in RelfilenodeMapInvalidateCallback (arg=<optimized out>, relid=105496194) at ./build/../src/backend/utils/cache/relfilenodemap.c:64
#2  0x0000aaaac3033aa4 in LocalExecuteInvalidationMessage (msg=0xffff9b66eec8) at ./build/../src/backend/utils/cache/inval.c:595
#3  0x0000aaaac2ec8274 in ReorderBufferExecuteInvalidations (rb=0xaaaac326bb00 <errordata>, txn=0xaaaac326b998 <formatted_start_time>, txn=0xaaaac326b998 <formatted_start_time>) at ./build/../src/backend/replication/logical/reorderbuffer.c:2149
#4  ReorderBufferCommit (rb=0xaaaac326bb00 <errordata>, xid=xid@entry=2668396569, commit_lsn=187650393290540, end_lsn=<optimized out>, commit_time=commit_time@entry=683222349268077, origin_id=origin_id@entry=0, origin_lsn=origin_lsn@entry=0) at ./build/../src/backend/replication/logical/reorderbuffer.c:1770
#5  0x0000aaaac2ebd314 in DecodeCommit (xid=2668396569, parsed=0xffffdd90f7e0, buf=0xffffdd90f960, ctx=0xaaaaf5d396a0) at ./build/../src/backend/replication/logical/decode.c:640
#6  DecodeXactOp (ctx=ctx@entry=0xaaaaf5d396a0, buf=0xffffdd90f960, buf@entry=0xffffdd90f9c0) at ./build/../src/backend/replication/logical/decode.c:248
#7  0x0000aaaac2ebd42c in LogicalDecodingProcessRecord (ctx=0xaaaaf5d396a0, record=0xaaaaf5d39938) at ./build/../src/backend/replication/logical/decode.c:117
#8  0x0000aaaac2ecfdfc in XLogSendLogical () at ./build/../src/backend/replication/walsender.c:2840
#9  0x0000aaaac2ed2228 in WalSndLoop (send_data=send_data@entry=0xaaaac2ecfd98 <XLogSendLogical>) at ./build/../src/backend/replication/walsender.c:2189
#10 0x0000aaaac2ed2efc in StartLogicalReplication (cmd=0xaaaaf5d175a8) at ./build/../src/backend/replication/walsender.c:1133
#11 exec_replication_command (cmd_string=cmd_string@entry=0xaaaaf5c0eb00 "START_REPLICATION SLOT cdc LOGICAL 1A2D/4B3640 (\"proto_version\" '1', \"publication_names\" 'cdc')") at ./build/../src/backend/replication/walsender.c:1549
#12 0x0000aaaac2f258a4 in PostgresMain (argc=<optimized out>, argv=argv@entry=0xaaaaf5c78cd8, dbname=<optimized out>, username=<optimized out>) at ./build/../src/backend/tcop/postgres.c:4257
#13 0x0000aaaac2eac338 in BackendRun (port=0xaaaaf5c68070, port=0xaaaaf5c68070) at ./build/../src/backend/postmaster/postmaster.c:4484
#14 BackendStartup (port=0xaaaaf5c68070) at ./build/../src/backend/postmaster/postmaster.c:4167
#15 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1725
#16 0x0000aaaac2ead364 in PostmasterMain (argc=<optimized out>, argv=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:1398
#17 0x0000aaaac2c3ca5c in main (argc=5, argv=0xaaaaf5c07720) at ./build/../src/backend/main/main.c:228

The thing is - I can't close it with pg_terminate_backend(), and I'd rather not kill -9 it, as that will, I think, close all other connections, and this is a prod server. The other end of the connection was something in kubernetes, and it no longer exists.

Is there anything we could do about it?

Best regards,

depesz
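PS. In case it helps: one rough way to tell whether the walsender is hard-looping or just crawling through an enormous pile of invalidation messages would be to sample the backtrace a few times and compare the relid argument in the RelfilenodeMapInvalidateCallback frame, e.g.:

~# for i in 1 2 3; do gdb --pid=22697 --batch -ex backtrace 2>/dev/null | grep RelfilenodeMapInvalidateCallback; sleep 30; done

If the relid changes between samples, it is at least slowly working through the invalidations for that commit; if it stays at 105496194 the whole time, it is most likely not making progress at all.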