RE: Logical replication timeout problem - Mailing list pgsql-hackers

From wangw.fnst@fujitsu.com
Subject RE: Logical replication timeout problem
Date
Msg-id OS3PR01MB627527AD69BACB66F944928A9E5A9@OS3PR01MB6275.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Re: Logical replication timeout problem  (Fabrice Chapuis <fabrice636861@gmail.com>)
Responses Re: Logical replication timeout problem
List pgsql-hackers

On Wed, Jan 19, 2022 at 9:53 PM Fabrice Chapuis fabrice636861@gmail.com wrote:

> Hello Amit,

> If it takes little work for you, can you please send me a piece of code

> with the change needed to do the test

 

I wrote a patch(Send-keepalive.patch, please refer to attachment) according to

Amit's suggestions. But after I did some simple test about this patch by the

test script "test.sh"(please refer to attachment), I found the timeout problem

has not been fixed by this patch.

 

So I add some logs(please refer to Add-some-logs-to-debug.patch) to confirm newly

added WalSndKeepaliveIfNecessary() send keepalive message or not.

 

After applying the Send-keepalive.patch and Add-some-logs-to-debug.patch, I

found that the added message "send keep alive message" was not printed in

publisher-side log.

 

[publisher-side log]:

2022-01-20 15:21:50.057 CST [2400278] LOG:  checkpoint complete: wrote 61 buffers (0.4%); 0 WAL file(s) added, 0 removed, 0 recycled; write=9.838 s, sync=0.720 s, total=10.559 s; sync files=4, longest=0.563 s, average=0.180 s; distance=538053 kB, estimate=543889 kB

2022-01-20 15:21:50.977 CST [2400278] LOG:  checkpoints are occurring too frequently (11 seconds apart)

2022-01-20 15:21:50.977 CST [2400278] HINT:  Consider increasing the configuration parameter "max_wal_size".

2022-01-20 15:21:50.988 CST [2400278] LOG:  checkpoint starting: wal

2022-01-20 15:21:52.853 CST [2400404] LOG:  begin load changes

2022-01-20 15:21:52.853 CST [2400404] STATEMENT:  START_REPLICATION SLOT "sub" LOGICAL 0/0 (proto_version '3', publication_names '"pub"')

2022-01-20 15:22:52.969 CST [2410649] ERROR:  replication slot "sub" is active for PID 2400404

2022-01-20 15:22:52.969 CST [2410649] STATEMENT:  START_REPLICATION SLOT "sub" LOGICAL 0/0 (proto_version '3', publication_names '"pub"')

2022-01-20 15:22:57.980 CST [2410657] ERROR:  replication slot "sub" is active for PID 2400404

 

[subscriber-side log]:

2022-01-20 15:16:10.975 CST [2400335] LOG:  checkpoint starting: time

2022-01-20 15:16:16.052 CST [2400335] LOG:  checkpoint complete: wrote 51 buffers (0.3%); 0 WAL file(s) added, 0 removed, 0 recycled; write=4.830 s, sync=0.135 s, total=5.078 s; sync files=39, longest=0.079 s, average=0.004 s; distance=149 kB, estimate=149 kB

2022-01-20 15:22:52.738 CST [2400400] ERROR:  terminating logical replication worker due to timeout

2022-01-20 15:22:52.738 CST [2400332] LOG:  background worker "logical replication worker" (PID 2400400) exited with exit code 1

2022-01-20 15:22:52.740 CST [2410648] LOG:  logical replication apply worker for subscription "sub" has started

2022-01-20 15:22:52.969 CST [2410648] ERROR:  could not start WAL streaming: ERROR:  replication slot "sub" is active for PID 2400404

2022-01-20 15:22:52.970 CST [2400332] LOG:  background worker "logical replication worker" (PID 2410648) exited with exit code 1

2022-01-20 15:22:57.977 CST [2410656] LOG:  logical replication apply worker for subscription "sub" has started

 

It seems WalSndKeepaliveIfNecessary did not send keepalive message in testing. I

am still doing some research about the cause.

 

Attach the patches and test script mentioned above, in case someone wants to try.

If I miss something, please let me know.

 

Regards,

Wang wei

Attachment

pgsql-hackers by date:

Previous
From: "kuroda.hayato@fujitsu.com"
Date:
Subject: RE: [Proposal] Add foreign-server health checks infrastructure
Next
From: Peter Eisentraut
Date:
Subject: Re: pg_upgrade should truncate/remove its logs before running