Fwd: BUG #18433: Logical replication timeout - Mailing list pgsql-bugs

From Kostiantyn Tomakh
Subject Fwd: BUG #18433: Logical replication timeout
Msg-id CAJP09w6nGkXPSd0zZ1PyMiRGMOa4-2gCobb9TzJJoJry-6i0fQ@mail.gmail.com
In response to BUG #18433: Logical replication timeout  (PG Bug reporting form <noreply@postgresql.org>)
List pgsql-bugs

Hi, Shlok Kyal.
1. We set wal_sender_timeout = 1h and wal_receiver_timeout = 1h (a configuration sketch is included near the end of this message).
2. Logs after these changes, for a table of about 39 GB:
DB_source logs
2024-04-22 14:13:25 UTC:10.10.10.10(64442):replication_role@db:[16497]:STATEMENT: CREATE_REPLICATION_SLOT "db_name_public_subscription" LOGICAL pgoutput NOEXPORT_SNAPSHOT
2024-04-22 14:13:25 UTC:10.10.10.10(64446):replication_role@db:[16500]:LOG: starting logical decoding for slot "db_name_public_subscription"
2024-04-22 14:13:25 UTC:10.10.10.10(64446):replication_role@db:[16500]:DETAIL: Streaming transactions committing after 139C2/AD4A6FA0, reading WAL from 139C2/AD414CC0.
2024-04-22 14:13:25 UTC:10.10.10.10(64446):replication_role@db:[16500]:STATEMENT: START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 0/0 (proto_version '1', publication_names '"db_name_public_publication"')
2024-04-22 14:13:25 UTC:10.10.10.10(64446):replication_role@db:[16500]:LOG: logical decoding found initial starting point at 139C2/AD416018
2024-04-22 14:13:25 UTC:10.10.10.10(64446):replication_role@db:[16500]:DETAIL: Waiting for transactions (approximately 3) older than 2782438580 to end.
2024-04-22 14:13:25 UTC:10.10.10.10(64446):replication_role@db:[16500]:STATEMENT: START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 0/0 (proto_version '1', publication_names '"db_name_public_publication"')
2024-04-22 14:13:25 UTC:10.10.10.10(64446):replication_role@db:[16500]:LOG: logical decoding found consistent point at 139C2/AD4A6F68
2024-04-22 14:13:25 UTC:10.10.10.10(64446):replication_role@db:[16500]:DETAIL: There are no running transactions.
2024-04-22 14:13:25 UTC:10.10.10.10(64446):replication_role@db:[16500]:STATEMENT: START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 0/0 (proto_version '1', publication_names '"db_name_public_publication"')
2024-04-22 14:13:25 UTC:10.10.10.10(64462):replication_role@db:[16501]:LOG: logical decoding found consistent point at 139C2/AD637D58
2024-04-22 14:13:25 UTC:10.10.10.10(64462):replication_role@db:[16501]:DETAIL: There are no running transactions.
2024-04-22 14:13:25 UTC:10.10.10.10(64462):replication_role@db:[16501]:STATEMENT: CREATE_REPLICATION_SLOT "db_name_public_subscription_18989252_sync_17090" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT
2024-04-22 14:38:35 UTC:10.10.10.10(64462):replication_role@db:[16501]:LOG: starting logical decoding for slot "db_name_public_subscription_18989252_sync_17090"
2024-04-22 14:38:35 UTC:10.10.10.10(64462):replication_role@db:[16501]:DETAIL: Streaming transactions committing after 139C2/AD637D90, reading WAL from 139C2/AD6317D8.
2024-04-22 14:38:35 UTC:10.10.10.10(64462):replication_role@db:[16501]:STATEMENT: START_REPLICATION SLOT "db_name_public_subscription_18989252_sync_17090" LOGICAL 139C2/AD637D90 (proto_version '1', publication_names '"db_name_public_publication"')
2024-04-22 14:38:35 UTC:10.10.10.10(64462):replication_role@db:[16501]:LOG: logical decoding found consistent point at 139C2/AD637D58
2024-04-22 14:38:35 UTC:10.10.10.10(64462):replication_role@db:[16501]:DETAIL: There are no running transactions.
2024-04-22 14:38:35 UTC:10.10.10.10(64462):replication_role@db:[16501]:STATEMENT: START_REPLICATION SLOT "db_name_public_subscription_18989252_sync_17090" LOGICAL 139C2/AD637D90 (proto_version '1', publication_names '"db_name_public_publication"')
2024-04-22 15:38:35 UTC:10.10.10.10(64446):replication_role@db:[16500]:LOG: terminating walsender process due to replication timeout
2024-04-22 15:38:35 UTC:10.10.10.10(64446):replication_role@db:[16500]:CONTEXT: slot "db_name_public_subscription", output plugin "pgoutput", in the commit callback, associated LSN 139C6/3C286C98
2024-04-22 15:38:35 UTC:10.10.10.10(64446):replication_role@db:[16500]:STATEMENT: START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 0/0 (proto_version '1', publication_names '"db_name_public_publication"')
2024-04-22 15:38:35 UTC:10.10.10.10(64462):replication_role@db:[16501]:LOG: terminating walsender process due to replication timeout
2024-04-22 15:38:35 UTC:10.10.10.10(64462):replication_role@db:[16501]:CONTEXT: slot "db_name_public_subscription_18989252_sync_17090", output plugin "pgoutput", in the begin callback, associated LSN 139C5/4C0A92B8
2024-04-22 15:38:35 UTC:10.10.10.10(64462):replication_role@db:[16501]:STATEMENT: START_REPLICATION SLOT "db_name_public_subscription_18989252_sync_17090" LOGICAL 139C2/AD637D90 (proto_version '1', publication_names '"db_name_public_publication"')

DB_destination logs

2024-04-22 14:13:25 UTC::@:[847]:LOG: logical replication apply worker for subscription "db_name_public_subscription" has started
2024-04-22 14:13:25 UTC::@:[848]:LOG: logical replication table synchronization worker for subscription "db_name_public_subscription", table "table" has started
2024-04-22 15:00:57 UTC::@:[403]:LOG: checkpoint starting: time
2024-04-22 15:00:58 UTC::@:[403]:LOG: checkpoint complete: wrote 17 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=1.752 s, sync=0.005 s, total=1.782 s; sync files=4, longest=0.005 s, average=0.002 s; distance=65540 kB, estimate=4065870 kB
2024-04-22 15:05:57 UTC::@:[403]:LOG: checkpoint starting: time
2024-04-22 15:05:58 UTC::@:[403]:LOG: checkpoint complete: wrote 12 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=1.249 s, sync=0.005 s, total=1.277 s; sync files=4, longest=0.005 s, average=0.002 s; distance=65487 kB, estimate=3665831 kB
2024-04-22 15:10:57 UTC::@:[403]:LOG: checkpoint starting: time
2024-04-22 15:11:00 UTC::@:[403]:LOG: checkpoint complete: wrote 24 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=2.455 s, sync=0.005 s, total=2.483 s; sync files=4, longest=0.005 s, average=0.002 s; distance=65631 kB, estimate=3305811 kB
2024-04-22 15:15:57 UTC::@:[403]:LOG: checkpoint starting: time
2024-04-22 15:15:59 UTC::@:[403]:LOG: checkpoint complete: wrote 16 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=1.652 s, sync=0.006 s, total=1.719 s; sync files=4, longest=0.006 s, average=0.002 s; distance=65541 kB, estimate=2981784 kB
2024-04-22 15:20:57 UTC::@:[403]:LOG: checkpoint starting: time
2024-04-22 15:20:59 UTC::@:[403]:LOG: checkpoint complete: wrote 17 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=1.751 s, sync=0.006 s, total=1.783 s; sync files=4, longest=0.006 s, average=0.002 s; distance=65522 kB, estimate=2690158 kB
2024-04-22 15:25:57 UTC::@:[403]:LOG: checkpoint starting: time
2024-04-22 15:25:58 UTC::@:[403]:LOG: checkpoint complete: wrote 6 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=0.645 s, sync=0.004 s, total=0.673 s; sync files=3, longest=0.004 s, average=0.002 s; distance=65456 kB, estimate=2427688 kB
2024-04-22 15:30:57 UTC::@:[403]:LOG: checkpoint starting: time
2024-04-22 15:30:59 UTC::@:[403]:LOG: checkpoint complete: wrote 15 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=1.550 s, sync=0.005 s, total=1.580 s; sync files=4, longest=0.005 s, average=0.002 s; distance=65573 kB, estimate=2191476 kB
2024-04-22 15:35:57 UTC::@:[403]:LOG: checkpoint starting: time
2024-04-22 15:35:59 UTC::@:[403]:LOG: checkpoint complete: wrote 17 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=1.751 s, sync=0.005 s, total=1.779 s; sync files=4, longest=0.005 s, average=0.002 s; distance=65541 kB, estimate=1978883 kB
2024-04-22 15:40:57 UTC::@:[403]:LOG: checkpoint starting: time
2024-04-22 15:40:58 UTC::@:[403]:LOG: checkpoint complete: wrote 7 buffers (0.0%); 0 WAL file(s) added, 2 removed, 0 recycled; write=0.746 s, sync=0.004 s, total=0.778 s; sync files=3, longest=0.004 s, average=0.002 s; distance=131032 kB, estimate=1794098 kB
2024-04-22 15:45:57 UTC::@:[403]:LOG: checkpoint starting: time
2024-04-22 15:45:59 UTC::@:[403]:LOG: checkpoint complete: wrote 11 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=1.148 s, sync=0.005 s, total=1.180 s; sync files=3, longest=0.005 s, average=0.002 s; distance=65533 kB, estimate=1621241 kB
2024-04-22 15:48:33 UTC::@:[847]:FATAL: terminating logical replication worker due to administrator command
2024-04-22 15:48:33 UTC::@:[400]:LOG: background worker "logical replication worker" (PID 847) exited with exit code 1
2024-04-22 15:48:33 UTC::@:[848]:FATAL: terminating logical replication worker due to administrator command
2024-04-22 15:48:33 UTC::@:[400]:LOG: background worker "logical replication worker" (PID 848) exited with exit code 1
3. This log was captured for a 39 GB table, but we have tables much bigger than this.
Our workflow for the 39 GB table was:
1) Download the schema from the source database instance.
2) Delete the PK, FK, and indexes from the table schema (we did this to speed up the data load and startup process).
3) Upload the schema to the destination DB.
4) Configure REPLICA IDENTITY FULL on the source DB for this table.
5) Configure logical replication of this table between the source and destination DB (see the sketch near the end of this message).
6) During the catchup process on this table we got the messages you can see in the logs above.
7) We also tried to create a primary key for this table during the catchup state, but that process was blocked by the logical replication worker, so we had to cancel the primary key creation.
I want to highlight that we used this workflow on PostgreSQL 10 and it worked.
4. Can you clarify the point 'One important point. If there is no request to source DB logical replication works fine for big tables.'?
I meant that if the source database has no DELETE, INSERT, or UPDATE queries, the catchup process is very short and we don't have this problem.
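
For reference, here is a minimal sketch of the configuration from points 1 and 3 above. The table name and connection string are placeholders, and on AWS RDS the timeout parameters are normally changed through a DB parameter group rather than ALTER SYSTEM; the publication and subscription names are the ones that appear in the logs.

    -- Source DB (publisher): timeout from point 1 and steps 4-5 of the workflow
    ALTER SYSTEM SET wal_sender_timeout = '1h';
    SELECT pg_reload_conf();
    ALTER TABLE public.big_table REPLICA IDENTITY FULL;   -- placeholder table name
    CREATE PUBLICATION db_name_public_publication FOR TABLE public.big_table;

    -- Destination DB (subscriber)
    ALTER SYSTEM SET wal_receiver_timeout = '1h';
    SELECT pg_reload_conf();
    CREATE SUBSCRIPTION db_name_public_subscription
        CONNECTION 'host=source-host dbname=db user=replication_role'  -- placeholder connection string
        PUBLICATION db_name_public_publication;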
Thank you in advance.



On Fri, 19 Apr 2024 at 14:51, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:
Hi,

> Hello, Shlok Kyal. Thank you very much for your answer.
> Yes, we tried to increase wal_sender_timeout and wal_receiver_timeout from 30 seconds up to 300 seconds on both the source and destination DB, but we got the same error.
> We set REPLICA IDENTITY FULL on PostgreSQL 13 because we got an error on the destination DB: "ERROR: logical replication target relation "public.table" has neither REPLICA IDENTITY index nor PRIMARY KEY and published relation does not have REPLICA IDENTITY FULL". We didn't have such errors on PostgreSQL 10.
>
>
> On Wed, 17 Apr 2024 at 13:48, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:
>>
>> Hi,
>>
>> On Mon, 15 Apr 2024 at 13:09, PG Bug reporting form
>> <noreply@postgresql.org> wrote:
>> >
>> > The following bug has been logged on the website:
>> >
>> > Bug reference:      18433
>> > Logged by:          Kostiantyn
>> > Email address:      tomahkvt@gmail.com
>> > PostgreSQL version: 13.14
>> > Operating system:   AWS RDS
>> > Description:
>> >
>> > On PostgreSQL 10 we used the following approach for the big tables:
>> > 1) Download schema from the source database instance
>> > 2) Deleted PK, FK, and Indexes for tables bigger than 100 GB from the
>> > schema
>> > 3) Upload the schema to the destination DB.
>> > 4) Configure Logical replication between source and destination DB.
>> > 5) When the logical replication table synchronization worker for
>> > subscription "db_name_public_subscription", table "table_name", had
>> > finished for the last table,
>> > 6) we created all the necessary PK, FK, and Indexes.
>> > This approach allowed us to upload data more quickly and was
>> > working great on PostgreSQL 10.
>> >
>> > We tried the same approach on PostgreSQL 13, but we got an error.
>> > 1) Download schema from the source database instance
>> > 2) Deleted PK, FK, and Indexes for tables bigger than 100 GB from the
>> > schema
>> > 3) Upload the schema to the destination DB.
>> > 4) Configured REPLICA IDENTITY FULL at the source DB for tables bigger
>> > than 100 GB
>> > 5) Configured Logical replication between source and destination DB.
>> > 6) During the catchup process on this big table we got the following
>> > messages:
>> > Source DB
>> > 2024-04-08 15:38:34 UTC:(27994):replication_role@:[22047]:LOG: terminating
>> > walsender process due to replication timeout
>> > 2024-04-08 15:38:34 UTC:(27994):replication_role@:[22047]:CONTEXT: slot
>> > "db_name_public_subscription", output plugin "pgoutput", in the begin
>> > callback, associated LSN 13705/2E913050
>> > 2024-04-08 15:38:34 UTC:(27994):replication_role@:[22047]:STATEMENT:
>> > START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 13702/C2C8FB30
>> > (proto_version '1', publication_names '"db_name_public_publication"')
>> > 2024-04-08 15:38:34 UTC:(36862):replication_role@:[22811]:LOG: terminating
>> > walsender process due to replication timeout
>> > 2024-04-08 15:38:34 UTC:(36862):replication_role@:[22811]:CONTEXT: slot
>> > "db_name_public_subscription_18989108_sync_17127", output plugin "pgoutput",
>> > in the begin callback, associated LSN 13703/27942B70
>> > 2024-04-08 15:38:34 UTC:(36862):replication_role@:[22811]:STATEMENT:
>> > START_REPLICATION SLOT "db_name_public_subscription_18989108_sync_17127"
>> > LOGICAL 13703/17622B58 (proto_version '1', publication_names
>> > '"db_name_public_publication"')
>> >
>> > One important point. If there is no request to source DB logical replication
>> > works fine for big tables.
>> > I saw the messages in PostgreSQL bugs like
>> > https://www.postgresql.org/message-id/flat/718213.1601410160%40sss.pgh.pa.us#7e61dd07661901b505bcbd74ce5f5f28
>> > But I also did some tests and increased wal_keep_size
>> > and max_slot_wal_keep_size to 1GB. And I set wal_sender_timeout to 1h but
>> > without success. The setup works in PG 13 only with a small amount of data.
>>
>> I went through the issue and I think that the given logs are appearing
>> due to some delay in the Apply Worker process.
>> I could reproduce it in my development environment by applying delays
>> in Apply Worker.
>>
>> I think this issue can be resolved by setting 'wal_sender_timeout' to
>> a greater value. Can you try setting 'wal_sender_timeout' to a greater
>> value?
>>
>> Also, I noticed that while using PostgreSQL 13 you are configuring the
>> table in the SourceDB as REPLICA IDENTITY FULL but not doing the same in
>> PostgreSQL 10. Is there any specific reason for it?
>> I pointed it out because REPLICA IDENTITY FULL has a different
>> behaviour and sets the entire row as key.

A few suggestions:
1. Can you try increasing wal_sender_timeout and wal_receiver_timeout
to more than 1 hour? In [1] you mentioned that you were getting the same
issue with wal_sender_timeout set to 1 hour.
2. Do you see any other unusual logs in the SourceDB, or any error logs
in the DestinationDB? If so, please share them.
3. Is it possible to share a reproducible test case for the issue?
4. I observe that if we take a lock on a table in the DestinationDB that is
subscribed to a publication in the SourceDB, we may end up getting the same
logs as you mentioned in [1]. Can you check whether anything similar is
happening? (See the query sketch after this list.)
5. Can you clarify the point 'One important point. If there is no
request to source DB logical replication works fine for big tables.'?
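
As a rough illustration of point 4 (not something from the original report), a lock held on the subscribed table in the DestinationDB can be spotted with a query against pg_locks and pg_stat_activity; the table name below is a placeholder:

    -- Run on the DestinationDB: list sessions holding or waiting for locks
    -- on the subscribed table (replace public.big_table with the real name).
    SELECT l.pid,
           l.mode,
           l.granted,
           a.backend_type,
           a.state,
           a.query
    FROM pg_locks l
    JOIN pg_stat_activity a ON a.pid = l.pid
    WHERE l.relation = 'public.big_table'::regclass;

If another session holds a conflicting lock on the table while the synchronization worker is applying changes, the walsender on the SourceDB may hit wal_sender_timeout as described in point 4.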

[1]: https://www.postgresql.org/message-id/18433-538407560ac0c70b%40postgresql.org

Thanks and Regards
Shlok Kyal
