Re: BUG #17438: Logical replication hangs on master after huge DB load - Mailing list pgsql-bugs

From Amit Kapila
Subject Re: BUG #17438: Logical replication hangs on master after huge DB load
Date
Msg-id CAA4eK1JO_zijrTqoZdzMn0FtTfV=Nj6Fr++BfdsBkHZqfA_cPw@mail.gmail.com
In response to BUG #17438: Logical replication hangs on master after huge DB load  (PG Bug reporting form <noreply@postgresql.org>)
Responses Re: BUG #17438: Logical replication hangs on master after huge DB load  (Sergey Belyashov <sergey.belyashov@gmail.com>)
List pgsql-bugs
On Mon, Mar 14, 2022 at 11:49 PM PG Bug reporting form
<noreply@postgresql.org> wrote:
>
> The following bug has been logged on the website:
>
> Bug reference:      17438
> Logged by:          Sergey Belyashov
> Email address:      sergey.belyashov@gmail.com
> PostgreSQL version: 14.2
> Operating system:   Debian 11, GNU/Linux x86_64
> Description:
>
> The master DB has a few tables: A (a few inserts per second, about 200
> updates per second, ~100 deletes every 5 minutes), B (~100 inserts every
> 5 minutes), and C (~200 inserts and ~200 updates per second). B and C are
> large tables partitioned by range (36 and 12 partitions). A is a small
> table of about 10K entries (frequently updated). Table A is published for
> inserts and deletes. Table B is published for all operations except
> truncate, via the root.
>
> I do some maintenance work: I stop the production load on the DB and run
> some heavy operations on table C (for example: "insert into D select *
> from C"). After completion, replication for A and B freezes and loads the
> CPU at 50-99% without any actual data transmission. I tried to
> disable/enable/refresh the subscriptions, with no effect. I tried
> restarting the master, with no result. Only dropping and re-creating the
> subscriptions helps.
>

Is it possible to get some reproducible script/test for this problem?

> The publisher logs many messages like the following:
> 2022-03-14 19:57:02.907 MSK [1771976] user@DB ERROR:  replication slot
> "A_sub" is active for PID 1766849
> 2022-03-14 19:57:02.907 MSK [1771976] user@DB STATEMENT:  START_REPLICATION
> SLOT "A_sub" LOGICAL 28C/60150F50 (proto_version '2', publication_names
> '"A_pub"')
> 2022-03-14 19:57:02.909 MSK [1771977] user@DB ERROR:  replication slot
> "B_sub" is active for PID 1766828
> 2022-03-14 19:57:02.909 MSK [1771977] user@DB STATEMENT:  START_REPLICATION
> SLOT "B_sub" LOGICAL 28C/AE2B7D8 (proto_version '2',
> publication_names '"B_pub"')
>
> The subscriber logs many messages like the following:
> 2022-03-14 19:56:52.709 MSK [3266082] LOG:  logical replication apply worker
> for subscription "B_sub" has started
> 2022-03-14 19:56:52.710 MSK [993] LOG:  background worker "logical
> replication worker" (PID 3266080) exited with exit code 1
> 2022-03-14 19:56:52.814 MSK [3266081] ERROR:  could not start WAL streaming:
> ERROR:  replication slot "A_sub" is active for PID 1766849
> 2022-03-14 19:56:52.815 MSK [993] LOG:  background worker "logical
> replication worker" (PID 3266081) exited with exit code 1
> 2022-03-14 19:56:52.818 MSK [3266082] ERROR:  could not start WAL streaming:
> ERROR:  replication slot "B_sub" is active for PID 1766828
> 2022-03-14 19:56:52.819 MSK [993] LOG:  background worker "logical
> replication worker" (PID 3266082) exited with exit code 1
>

Judging from these logs alone, it seems the subscriber-side workers are
exiting due to some error while the publisher-side walsender keeps
running, which I think is why we see 'replication slot "A_sub" is active
for PID 1766849'. Do you see any other type of error in the
subscriber-side logs?
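
To make the pattern in the logs above easier to see across a long log
file, a small script along these lines could tally which PID is reported
as holding each slot. This is only a sketch: it assumes each message sits
on a single line in the raw log (unlike the wrapped quotes above), and the
function name count_slot_conflicts is hypothetical, not part of any
PostgreSQL tool.

```python
import re
from collections import Counter

# Matches the error text seen in the logs above, e.g.
#   replication slot "A_sub" is active for PID 1766849
SLOT_BUSY = re.compile(
    r'replication slot\s+"([^"]+)"\s+is active for PID\s+(\d+)'
)


def count_slot_conflicts(log_text):
    """Count 'slot is active for PID' errors per (slot_name, holder_pid)."""
    return Counter(
        (slot, int(pid)) for slot, pid in SLOT_BUSY.findall(log_text)
    )


if __name__ == "__main__":
    # Two lines taken from the publisher log quoted above (unwrapped).
    sample = (
        '2022-03-14 19:57:02.907 MSK [1771976] user@DB ERROR:  '
        'replication slot "A_sub" is active for PID 1766849\n'
        '2022-03-14 19:57:02.909 MSK [1771977] user@DB ERROR:  '
        'replication slot "B_sub" is active for PID 1766828\n'
    )
    for (slot, pid), n in sorted(count_slot_conflicts(sample).items()):
        print(f"{slot}: held by PID {pid}, {n} failed attempt(s)")
```

If the same holder PID shows up for every failed attempt, that is the
walsender process worth inspecting on the publisher.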

-- 
With Regards,
Amit Kapila.


