Re: WAL segments removed from primary despite the fact that logical replication slot needs it. - Mailing list pgsql-bugs

From	hubert depesz lubaczewski
Subject	Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
Date	February 10, 2023 17:31:24
Msg-id	Y+ZVPHHcYirQDgJF@depesz.com
In response to	Re: WAL segments removed from primary despite the fact that logical replication slot needs it. (Masahiko Sawada <sawada.mshk@gmail.com>)
List	pgsql-bugs

Hi,
So, we have another bit of interesting information. Maybe related, maybe
not.

We noticed a weird situation on two clusters we're trying to upgrade.

In both cases the situation looked the same:

1. there was another process (debezium) connected to source (pg12) using
   logical replication
2. pg12 -> pg14 replication failed with the message 'ERROR:  requested
   WAL segment ... has already been removed'
3. some time afterwards (most likely a couple of hours) the process that
   is/was responsible for debezium replication (pg process) stopped
   handling WAL, and instead is eating 100% of cpu.

When this situation happens, pg_cancel_backend(pid) has no effect on the
"broken" wal sender, and neither does pg_terminate_backend()!

strace of the process doesn't show anything.

When I tried to get backtrace from gdb all I got was:

(gdb) bt
#0  0x0000aaaad270521c in hash_seq_search ()
#1  0x0000ffff806c86cc in ?? () from /usr/lib/postgresql/12/lib/pgoutput.so
#2  0x0000aaaad26e3644 in CallSyscacheCallbacks ()
#3  0x0000aaaad26e3644 in CallSyscacheCallbacks ()
#4  0x0000aaaad257764c in ReorderBufferCommit ()
#5  0x0000aaaad256c804 in ?? ()
#6  0x0000aaaaf303d280 in ?? ()

If I quit gdb, restart it, and redo bt, I get:

#0  0x0000ffff806c81a8 in hash_seq_search@plt () from /usr/lib/postgresql/12/lib/pgoutput.so
#1  0x0000ffff806c86cc in ?? () from /usr/lib/postgresql/12/lib/pgoutput.so
#2  0x0000aaaad291ae58 in ?? ()

or

#0  0x0000aaaad2705244 in hash_seq_search ()
#1  0x0000ffff806c86cc in ?? () from /usr/lib/postgresql/12/lib/pgoutput.so
#2  0x0000aaaad26e3644 in CallSyscacheCallbacks ()
#3  0x0000aaaad26e3644 in CallSyscacheCallbacks ()
#4  0x0000aaaad257764c in ReorderBufferCommit ()
#5  0x0000aaaad256c804 in ?? ()
#6  0x0000aaaaf303d280 in ?? ()
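
To compare repeated backtraces like the ones above, a small script can count
which resolved function names show up in each sample. This is only a rough
sketch: the names FRAME_RE, parse_frames, summarize and SAMPLES are made up for
illustration, and the regex assumes the plain "bt" line format shown in this
message. Three samples are far too few to prove anything; it just makes the
recurring frames easier to see.

```python
import re
from collections import Counter

# Matches gdb "bt" frame lines such as
#   #0  0x0000aaaad270521c in hash_seq_search ()
#   #1  0x0000ffff806c86cc in ?? () from /usr/lib/postgresql/12/lib/pgoutput.so
# (assumption: plain "bt" output without source file/line information)
FRAME_RE = re.compile(
    r"^#(?P<level>\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(?P<func>\S+)\s*\(.*?\)"
    r"(?:\s+from\s+(?P<lib>\S+))?"
)


def parse_frames(backtrace):
    """Return a list of (function, library) tuples, innermost frame first."""
    frames = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append((match.group("func"), match.group("lib")))
    return frames


def summarize(samples):
    """Count in how many samples each resolved function name appears."""
    counts = Counter()
    for sample in samples:
        # Skip unresolved frames ("??") and fold "name@plt" into "name".
        names = {
            func.split("@")[0]
            for func, _lib in parse_frames(sample)
            if func != "??"
        }
        counts.update(names)
    return counts


# The three backtraces from this message, copied as-is.
SAMPLES = [
    """#0  0x0000aaaad270521c in hash_seq_search ()
#1  0x0000ffff806c86cc in ?? () from /usr/lib/postgresql/12/lib/pgoutput.so
#2  0x0000aaaad26e3644 in CallSyscacheCallbacks ()
#3  0x0000aaaad26e3644 in CallSyscacheCallbacks ()
#4  0x0000aaaad257764c in ReorderBufferCommit ()
#5  0x0000aaaad256c804 in ?? ()
#6  0x0000aaaaf303d280 in ?? ()""",
    """#0  0x0000ffff806c81a8 in hash_seq_search@plt () from /usr/lib/postgresql/12/lib/pgoutput.so
#1  0x0000ffff806c86cc in ?? () from /usr/lib/postgresql/12/lib/pgoutput.so
#2  0x0000aaaad291ae58 in ?? ()""",
    """#0  0x0000aaaad2705244 in hash_seq_search ()
#1  0x0000ffff806c86cc in ?? () from /usr/lib/postgresql/12/lib/pgoutput.so
#2  0x0000aaaad26e3644 in CallSyscacheCallbacks ()
#3  0x0000aaaad26e3644 in CallSyscacheCallbacks ()
#4  0x0000aaaad257764c in ReorderBufferCommit ()
#5  0x0000aaaad256c804 in ?? ()
#6  0x0000aaaaf303d280 in ?? ()""",
]

if __name__ == "__main__":
    for name, seen in summarize(SAMPLES).most_common():
        print(f"{name}: seen in {seen} of {len(SAMPLES)} samples")
```

On these three samples, hash_seq_search shows up in all three, while
CallSyscacheCallbacks and ReorderBufferCommit show up in two. That fits a
process that keeps looping in or around a hash table scan called from
pgoutput.so, but it doesn't confirm it.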

At this moment, the only thing that we can do is kill -9 the process (or
restart pg).

I don't know if it's relevant, but I have this case *right now*, and if
it's helpful I can provide more information before we have to kill
it.

Best regards,

depesz
