Re: Pg stuck at 100% cpu, for multiple days - Mailing list pgsql-general

From Joe Conway
Subject Re: Pg stuck at 100% cpu, for multiple days
Msg-id cb86c11d-9c9d-d7ac-8261-c06fba3a6612@joeconway.com
In response to Re: Pg stuck at 100% cpu, for multiple days  (hubert depesz lubaczewski <depesz@depesz.com>)
On 8/30/21 10:36 AM, hubert depesz lubaczewski wrote:
> Anyway - it's 12.6 on aarm64. Couple of days there was replication
> slot started, and now it seems to be stuck.

> #0  hash_seq_search (status=status@entry=0xffffdd90f380) at ./build/../src/backend/utils/hash/dynahash.c:1448
> #1  0x0000aaaac3042060 in RelfilenodeMapInvalidateCallback (arg=<optimized out>, relid=105496194) at
./build/../src/backend/utils/cache/relfilenodemap.c:64
> #2  0x0000aaaac3033aa4 in LocalExecuteInvalidationMessage (msg=0xffff9b66eec8) at
./build/../src/backend/utils/cache/inval.c:595
> #3  0x0000aaaac2ec8274 in ReorderBufferExecuteInvalidations (rb=0xaaaac326bb00 <errordata>, txn=0xaaaac326b998
<formatted_start_time>, txn=0xaaaac326b998 <formatted_start_time>) at
./build/../src/backend/replication/logical/reorderbuffer.c:2149
> #4  ReorderBufferCommit (rb=0xaaaac326bb00 <errordata>, xid=xid@entry=2668396569, commit_lsn=187650393290540,
end_lsn=<optimized out>, commit_time=commit_time@entry=683222349268077, origin_id=origin_id@entry=0,
origin_lsn=origin_lsn@entry=0) at ./build/../src/backend/replication/logical/reorderbuffer.c:1770
> #5  0x0000aaaac2ebd314 in DecodeCommit (xid=2668396569, parsed=0xffffdd90f7e0, buf=0xffffdd90f960,
ctx=0xaaaaf5d396a0) at ./build/../src/backend/replication/logical/decode.c:640
> #6  DecodeXactOp (ctx=ctx@entry=0xaaaaf5d396a0, buf=0xffffdd90f960, buf@entry=0xffffdd90f9c0) at
./build/../src/backend/replication/logical/decode.c:248
> #7  0x0000aaaac2ebd42c in LogicalDecodingProcessRecord (ctx=0xaaaaf5d396a0, record=0xaaaaf5d39938) at
./build/../src/backend/replication/logical/decode.c:117
> #8  0x0000aaaac2ecfdfc in XLogSendLogical () at ./build/../src/backend/replication/walsender.c:2840
> #9  0x0000aaaac2ed2228 in WalSndLoop (send_data=send_data@entry=0xaaaac2ecfd98 <XLogSendLogical>) at
./build/../src/backend/replication/walsender.c:2189
> #10 0x0000aaaac2ed2efc in StartLogicalReplication (cmd=0xaaaaf5d175a8) at
./build/../src/backend/replication/walsender.c:1133
> #11 exec_replication_command (cmd_string=cmd_string@entry=0xaaaaf5c0eb00 "START_REPLICATION SLOT cdc LOGICAL
1A2D/4B3640(\"proto_version\" '1', \"publication_names\" 'cdc')") at
./build/../src/backend/replication/walsender.c:1549
> #12 0x0000aaaac2f258a4 in PostgresMain (argc=<optimized out>, argv=argv@entry=0xaaaaf5c78cd8, dbname=<optimized out>,
username=<optimized out>) at ./build/../src/backend/tcop/postgres.c:4257
> #13 0x0000aaaac2eac338 in BackendRun (port=0xaaaaf5c68070, port=0xaaaaf5c68070) at
./build/../src/backend/postmaster/postmaster.c:4484
> #14 BackendStartup (port=0xaaaaf5c68070) at ./build/../src/backend/postmaster/postmaster.c:4167
> #15 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1725
> #16 0x0000aaaac2ead364 in PostmasterMain (argc=<optimized out>, argv=<optimized out>) at
./build/../src/backend/postmaster/postmaster.c:1398
> #17 0x0000aaaac2c3ca5c in main (argc=5, argv=0xaaaaf5c07720) at ./build/../src/backend/main/main.c:228
> 
> The thing is - I can't close it with pg_terminate_backend(), and I'd
> rather not kill -9, as it will, I think, close all other connections,
> and this is prod server.

> still makes me ask: why does Pg end up in such place, where it
> doesn't do any syscalls, doesn't accept pg_terminate_backend(), and
> is using 100% of cpu?

src/backend/utils/hash/dynahash.c:1448 is in the middle of a while loop,
which is apparently not exiting.

There is no check for interrupts in that loop, and it is a fairly tight
loop with no syscalls in it, which would explain both symptoms: the
backend ignoring pg_terminate_backend() and spinning at 100% CPU.
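
To illustrate the general pattern (this is a simplified sketch, not the
actual dynahash.c code): pg_terminate_backend() sends SIGTERM, the signal
handler only sets a flag, and the backend acts on that flag the next time
it reaches a CHECK_FOR_INTERRUPTS() call. A loop that never reaches one
cannot be cancelled. All names below (interrupt_pending, note_interrupt,
scan_for_entry, max_steps) are hypothetical, and max_steps exists only so
the demo terminates; a real stuck loop has no such cap.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { SCAN_FOUND, SCAN_CANCELLED, SCAN_GAVE_UP } ScanResult;

/* Hypothetical stand-in for a backend's "interrupt pending" flag. */
static volatile sig_atomic_t interrupt_pending = 0;

/* Signal handler: only records the request, it does not act on it. */
static void
note_interrupt(int signo)
{
    (void) signo;
    interrupt_pending = 1;
}

/*
 * Walk a circular array of buckets until a non-empty one is found.
 * If every bucket is empty (think: corrupted table), the loop never
 * exits on its own.  With check_interrupts = false the pending flag
 * is never looked at, so a termination request has no effect.
 */
static ScanResult
scan_for_entry(const int *buckets, size_t nbuckets,
               bool check_interrupts, unsigned long max_steps)
{
    size_t        i = 0;
    unsigned long steps = 0;

    assert(nbuckets > 0);

    while (buckets[i] == 0)
    {
        /* The cooperative check that the tight loop is missing. */
        if (check_interrupts && interrupt_pending)
            return SCAN_CANCELLED;

        /* Demo-only safety cap so the example cannot spin forever. */
        if (++steps >= max_steps)
            return SCAN_GAVE_UP;

        i = (i + 1) % nbuckets;
    }
    return SCAN_FOUND;
}
```

With note_interrupt installed for SIGTERM and a signal already delivered,
scanning an all-empty array returns SCAN_CANCELLED when check_interrupts
is true, but keeps spinning until the demo cap when it is false. The
second case is the behaviour seen here: pure userspace work, no syscalls,
and no reaction to the termination request.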

As to how it got that way, I have to assume data corruption or a bug of
some sort. I would repost the details to pgsql-hackers for better
visibility.

Joe
-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development


