Re: Logical replication timeout problem - Mailing list pgsql-hackers

From Fabrice Chapuis
Subject Re: Logical replication timeout problem
Msg-id CAA5-nLDF=fVzN4opizMpUCGJbAiB4mPMaPYZNe+SmuLjCQuwfA@mail.gmail.com
In response to RE: Logical replication timeout problem  ("wangw.fnst@fujitsu.com" <wangw.fnst@fujitsu.com>)
Responses RE: Logical replication timeout problem
List pgsql-hackers
Yes, the refresh of the MV is on the publisher side.
Thanks for your draft patch, I'll try it.
I'll get back to you as soon as possible.

One question: why is the refresh of the MV treated as DDL rather than DML?

Regards

Fabrice 
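
(For context, the log quoted below invokes a PL/pgSQL wrapper named
refresh_materialized_view(text); its body is not shown in the thread, so the
following is only a minimal sketch of what such a wrapper might look like,
not the actual code from Fabrice's system:

    CREATE OR REPLACE FUNCTION refresh_materialized_view(mv_name text)
    RETURNS void
    LANGUAGE plpgsql
    AS $$
    BEGIN
        -- Full (non-concurrent) refresh: the view is rewritten from scratch,
        -- which is what generates the large WAL volume discussed below.
        -- %s is used here because the argument may be schema-qualified.
        EXECUTE format('REFRESH MATERIALIZED VIEW %s', mv_name);
    END;
    $$;

It would be called on the publisher exactly as in the quoted log:
select refresh_materialized_view('sxxxa00.table_base');)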

On Wed, 19 Oct 2022, 10:15 wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote:
On Tue, Oct 18, 2022 at 22:35 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
> Hello Amit,
>
> In version 14.4 the timeout problem for logical replication happens again despite
> the patch provided for this issue in this version. When bulky materialized views
> are refreshed, logical replication breaks. It is possible to work around this
> problem by using your new "streaming" option.
> Have you ever had this issue reported to you?
>
> Regards
>
> Fabrice
>
> 2022-10-10 17:19:02 CEST [538424]: [17-1]
> user=postgres,db=dbxxxa00,client=[local] CONTEXT:  SQL statement "REFRESH
> MATERIALIZED VIEW sxxxa00.table_base"
>         PL/pgSQL function refresh_materialized_view(text) line 5 at EXECUTE
> 2022-10-10 17:19:02 CEST [538424]: [18-1]
> user=postgres,db=dbxxxa00,client=[local] STATEMENT:  select
> refresh_materialized_view('sxxxa00.table_base');
> 2022-10-10 17:19:02 CEST [538424]: [19-1]
> user=postgres,db=dbxxxa00,client=[local] LOG:  duration: 264815.652
> ms  statement: select refresh_materialized_view('sxxxa00.table_base');
> 2022-10-10 17:19:27 CEST [559156]: [1-1] user=,db=,client= LOG:  automatic
> vacuum of table "dbxxxa00.sxxxa00.table_base": index scans: 0
>         pages: 0 removed, 296589 remain, 0 skipped due to pins, 0 skipped frozen
>         tuples: 0 removed, 48472622 remain, 0 are dead but not yet removable,
> oldest xmin: 1501528
>         index scan not needed: 0 pages from table (0.00% of total) had 0 dead item
> identifiers removed
>         I/O timings: read: 1.494 ms, write: 0.000 ms
>         avg read rate: 0.028 MB/s, avg write rate: 107.952 MB/s
>         buffer usage: 593301 hits, 77 misses, 294605 dirtied
>         WAL usage: 296644 records, 46119 full page images, 173652718 bytes
>         system usage: CPU: user: 17.26 s, system: 0.29 s, elapsed: 21.32 s
> 2022-10-10 17:19:28 CEST [559156]: [2-1] user=,db=,client= LOG:  automatic
> analyze of table "dbxxxa00.sxxxa00.table_base"
>         I/O timings: read: 0.043 ms, write: 0.000 ms
>         avg read rate: 0.026 MB/s, avg write rate: 0.026 MB/s
>         buffer usage: 30308 hits, 2 misses, 2 dirtied
>         system usage: CPU: user: 0.54 s, system: 0.00 s, elapsed: 0.59 s
> 2022-10-10 17:19:34 CEST [3898111]: [6840-1] user=,db=,client= LOG:  checkpoint
> complete: wrote 1194 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled;
> write=269.551 s, sync=0.002 s, total=269.560 s; sync files=251, longest=0.001 s,
> average=0.001 s; distance=583790 kB, estimate=583790 kB
> 2022-10-10 17:20:02 CEST [716163]: [2-1] user=,db=,client= ERROR:  terminating
> logical replication worker due to timeout
> 2022-10-10 17:20:02 CEST [3897921]: [13-1] user=,db=,client= LOG:  background
> worker "logical replication worker" (PID 716163) exited with exit code 1
> 2022-10-10 17:20:02 CEST [561346]: [1-1] user=,db=,client= LOG:  logical
> replication apply worker for subscription "subxxx_sxxxa00" has started

Thanks for reporting!

There is one thing I want to confirm:
Is the statement `select refresh_materialized_view('sxxxa00.table_base');`
executed on the publisher side?

If so, I think the reason for this timeout problem could be that during the DDL
(`REFRESH MATERIALIZED VIEW`), a lot of temporary data is generated by the
rewrite. Since this temporary data is not processed by the pgoutput plugin, our
previous fix for DML has no effect on this case.
I think setting the "streaming" option to "on" could work around this problem.
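For reference, a minimal sketch of enabling that workaround on an existing
subscription (the subscription name is taken from the quoted log; this assumes
PostgreSQL 14 or later on both sides):

    -- Stream large in-progress transactions to the subscriber instead of
    -- spilling them only on the publisher.
    ALTER SUBSCRIPTION subxxx_sxxxa00 SET (streaming = on);

    -- Verify the setting took effect.
    SELECT subname, substream FROM pg_subscription;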

I wrote a draft patch (see attachment) against REL_14_4 to fix this.
It seems to work in my local testing.
Could you please confirm whether the problem is fixed after applying this
draft patch?

If this draft patch works, I will improve it and try to fix this problem.
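
As a side note, the "terminating logical replication worker due to timeout"
error in the quoted log is governed on the subscriber by wal_receiver_timeout
(default 1min). Raising it is only a stopgap while testing, not a fix for the
missing keepalives; a sketch of such subscriber-side tuning:

    SHOW wal_receiver_timeout;                       -- default is 1min
    ALTER SYSTEM SET wal_receiver_timeout = '5min';  -- stopgap only
    SELECT pg_reload_conf();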

Regards,
Wang wei
