Re: WAL segments removed from primary despite the fact that logical replication slot needs it. - Mailing list pgsql-bugs

From hubert depesz lubaczewski
Subject Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
Date
Msg-id Y+ZVPHHcYirQDgJF@depesz.com
In response to Re: WAL segments removed from primary despite the fact that logical replication slot needs it.  (Masahiko Sawada <sawada.mshk@gmail.com>)
List pgsql-bugs
Hi,
So, we have another bit of interesting information. Maybe related, maybe
not.

We noticed a weird situation on two clusters we're trying to upgrade.

In both cases the situation looked the same:

1. there was another process (debezium) connected to the source (pg12)
   using logical replication
2. pg12 -> pg14 replication failed with the message 'ERROR:  requested
   WAL segment ... has already been '
3. some time afterwards (most likely a couple of hours) the pg process
   that is/was responsible for the debezium replication stopped
   handling WAL, and is instead eating 100% of CPU.

When this situation happens, we can't pg_cancel_backend(pid) the
"broken" WAL sender, and pg_terminate_backend(pid) doesn't work on it
either!

strace of the process doesn't show anything.

When I tried to get a backtrace from gdb, all I got was:

(gdb) bt
#0  0x0000aaaad270521c in hash_seq_search ()
#1  0x0000ffff806c86cc in ?? () from /usr/lib/postgresql/12/lib/pgoutput.so
#2  0x0000aaaad26e3644 in CallSyscacheCallbacks ()
#3  0x0000aaaad26e3644 in CallSyscacheCallbacks ()
#4  0x0000aaaad257764c in ReorderBufferCommit ()
#5  0x0000aaaad256c804 in ?? ()
#6  0x0000aaaaf303d280 in ?? ()

If I quit gdb, restart it, and redo bt, I get:

#0  0x0000ffff806c81a8 in hash_seq_search@plt () from /usr/lib/postgresql/12/lib/pgoutput.so
#1  0x0000ffff806c86cc in ?? () from /usr/lib/postgresql/12/lib/pgoutput.so
#2  0x0000aaaad291ae58 in ?? ()

or

#0  0x0000aaaad2705244 in hash_seq_search ()
#1  0x0000ffff806c86cc in ?? () from /usr/lib/postgresql/12/lib/pgoutput.so
#2  0x0000aaaad26e3644 in CallSyscacheCallbacks ()
#3  0x0000aaaad26e3644 in CallSyscacheCallbacks ()
#4  0x0000aaaad257764c in ReorderBufferCommit ()
#5  0x0000aaaad256c804 in ?? ()
#6  0x0000aaaaf303d280 in ?? ()
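
Re-attaching gdb several times, as in the backtraces above, can be scripted to see which frame the process is in most often. The following is only a minimal sketch: the helper names parse_frames and sample_backtraces are hypothetical, and it assumes gdb is on PATH and that the caller is allowed to attach to the backend pid. Without debug symbols, frames will still show up as "??".

```python
import re
import subprocess
import time
from collections import Counter

# Matches gdb frame lines such as:
#   #0  0x0000aaaad270521c in hash_seq_search ()
#   #1  0x0000ffff806c86cc in ?? () from /usr/lib/postgresql/12/lib/pgoutput.so
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(\S+)\s*\(")


def parse_frames(bt_text):
    """Return the function names from a gdb backtrace, innermost frame first."""
    frames = []
    for line in bt_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append(match.group(2))
    return frames


def sample_backtraces(pid, samples=5, delay=1.0):
    """Attach gdb to `pid` several times and count the innermost frame seen.

    Each attach briefly stops the process, so keep `samples` small.
    """
    counts = Counter()
    for _ in range(samples):
        result = subprocess.run(
            ["gdb", "-p", str(pid), "-batch", "-ex", "bt"],
            capture_output=True,
            text=True,
            check=False,
        )
        frames = parse_frames(result.stdout)
        if frames:
            counts[frames[0]] += 1
        time.sleep(delay)
    return counts
```

Calling sample_backtraces(pid) with the pid of the stuck WAL sender returns a Counter keyed by the innermost function name; if one function dominates across samples, that supports the idea that the process is looping there rather than making progress.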

At this moment, the only thing that we can do is kill -9 the process (or
restart pg).

I don't know if it's relevant, but I have this case *right now*, and if
it's helpful I can provide more information before we have to kill
it.

Best regards,

depesz



