Re: Logical replication failed with SSL SYSCALL error - Mailing list pgsql-hackers

From shaurya jain
Subject Re: Logical replication failed with SSL SYSCALL error
Date
Msg-id CAHHJ3NTgRi70cwAWFULzAc+fsPeBf_=O_VAMO8tH5FbLXFFjag@mail.gmail.com
In response to Re: Logical replication failed with SSL SYSCALL error  (vignesh C <vignesh21@gmail.com>)
List pgsql-hackers
Hi Vignesh,

That's really prompt and solves our problem. Thank you buddy.

Please go through my inline comments:-


On Thu, Apr 20, 2023 at 11:49 AM vignesh C <vignesh21@gmail.com> wrote:
On Wed, 19 Apr 2023 at 17:26, shaurya jain <12345shaurya@gmail.com> wrote:
>
> Hi Team,
>
> Could you please help me with this? It's urgent for the production environment.
>
> On Wed, Apr 19, 2023 at 3:44 PM shaurya jain <12345shaurya@gmail.com> wrote:
>>
>> Hi Team,
>>
>> Could you please help? It's urgent for the production env.
>>
>> On Sun, Apr 16, 2023 at 2:40 AM shaurya jain <12345shaurya@gmail.com> wrote:
>>>
>>> Hi Team,
>>>
>>> Postgres Version:- 13.8
>>> Issue:- Logical replication failing with SSL SYSCALL error
>>> Priority:-High
>>>
>>> We are migrating our database through logical replication, and all of a sudden the errors below appeared in the source and target logs, with no clear indication of the cause.
>>>
>>> Logs from Source:-
>>> LOG:  could not send data to client: Connection reset by peer
>>> STATEMENT:  COPY public.test TO STDOUT
>>> FATAL:  connection to client lost
>>> STATEMENT:  COPY public.test TO STDOUT
>>>
>>> Logs from Target:-
>>> 2023-04-15 19:07:02 UTC::@:[1250]:ERROR: could not receive data from WAL stream: SSL SYSCALL error: Connection timed out
>>> 2023-04-15 19:07:02 UTC::@:[1250]:CONTEXT: COPY test, line 365326932
>>> 2023-04-15 19:07:03 UTC::@:[505]:LOG: background worker "logical replication worker" (PID 1250) exited with exit code 1
>>> 2023-04-15 19:07:03 UTC::@:[7155]:LOG: logical replication table synchronization worker for subscription " sub_tables_2_180", table "test" has started
>>> 2023-04-15 19:12:05 UTC:10.144.19.34(33276):postgres@webadmit_staging:[7112]:WARNING: there is no transaction in progress
>>> 2023-04-15 19:14:08 UTC:10.144.19.34(33324):postgres@webadmit_staging:[6052]:LOG: could not receive data from client: Connection reset by peer
>>> 2023-04-15 19:17:23 UTC::@:[2112]:ERROR: could not receive data from WAL stream: SSL SYSCALL error: Connection timed out
>>> 2023-04-15 19:17:23 UTC::@:[1089]:ERROR: could not receive data from WAL stream: SSL SYSCALL error: Connection timed out
>>> 2023-04-15 19:17:23 UTC::@:[2556]:ERROR: could not receive data from WAL stream: SSL SYSCALL error: Connection timed out
>>> 2023-04-15 19:17:23 UTC::@:[505]:LOG: background worker "logical replication worker" (PID 2556) exited with exit code 1
>>> 2023-04-15 19:17:23 UTC::@:[505]:LOG: background worker "logical replication worker" (PID 2112) exited with exit code 1
>>> 2023-04-15 19:17:23 UTC::@:[505]:LOG: background worker "logical replication worker" (PID 1089) exited with exit code 1
>>> 2023-04-15 19:17:23 UTC::@:[7287]:LOG: logical replication apply worker for subscription "sub_tables_2_180" has started
>>> 2023-04-15 19:17:23 UTC::@:[7288]:LOG: logical replication apply worker for subscription "sub_tables_3_192" has started
>>> 2023-04-15 19:17:23 UTC::@:[7289]:LOG: logical replication apply worker for subscription "sub_tables_1_180" has started
>>>
>>> Just after this error, all other replication slots become inactive for some time and then come back online, and the COPY command reappears with a new PID in pg_stat_activity.
>>>
>>> I have a few queries regarding this:-
>>>
>>> What is the exact reason for the disconnection? (Some articles blame memory, others the network.)
This might be because of a network failure. Did you notice any network
instability? Could you check the TCP settings?
You could check the following configurations: tcp_keepalives_idle,
tcp_keepalives_interval and tcp_keepalives_count.
After tcp_keepalives_idle seconds without network activity, a keepalive
probe is sent to the peer. If the probe is not acknowledged within
tcp_keepalives_interval seconds, it is sent again, and the connection is
considered dead after tcp_keepalives_count unacknowledged probes.
---Yes, you were correct. The issue was related to the network: the VPN tunnel got restarted because of a misconfiguration on the tunnel side. Since that was fixed, the issue has not recurred. These params were set to the values below:-
  1. keepalives_idle 60
  2. keepalives_interval 100
  3. keepalives_count 60
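
To put the values above in perspective, here is a rough sketch of how long a silently dead connection could go unnoticed with a given set of keepalive settings. It assumes the usual TCP keepalive behaviour (one idle period, then one interval per unacknowledged probe), so the result is an approximation and not an exact figure for any particular OS. The function names are made up for illustration; the keepalives_* names are the libpq connection parameters.

```python
def keepalive_detection_seconds(idle: int, interval: int, count: int) -> int:
    """Approximate worst-case time (seconds) before a dead peer is noticed.

    Assumption: the first probe goes out after `idle` seconds of silence,
    and the connection is dropped after `count` probes, each spaced
    `interval` seconds apart, go unacknowledged.
    """
    return idle + interval * count


def keepalive_conninfo(idle: int, interval: int, count: int) -> str:
    """Build the keepalive part of a libpq connection string."""
    return (
        f"keepalives=1 keepalives_idle={idle} "
        f"keepalives_interval={interval} keepalives_count={count}"
    )


# Values reported in this thread: idle 60, interval 100, count 60.
reported = keepalive_detection_seconds(60, 100, 60)
print(reported, "seconds, i.e. about", reported // 60, "minutes")
print(keepalive_conninfo(60, 100, 60))
```

With the reported values the estimate comes to 6060 seconds, so a broken tunnel could in principle go undetected for well over an hour. Smaller interval and count values would shorten that window, at the cost of dropping connections more readily on a flaky link.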

>>> Will it lead to data inconsistency?
It will not lead to inconsistency. In case of failure the failed
transaction will be rolled back.
---Yes, the migration went as expected after the network was fixed.

>>> Does this new PID COPY command again migrate the whole data of the test table once again?
Yes, it will migrate the whole table data again in case of failures.
---Yes, understood. Is there any way to do an rsync-style transfer instead of a plain COPY?

Regards,
Vignesh


--
Thanks and Regards,
Shaurya Jain
Mobile:- +91-8802809405
