Re: BUG? Slave don't reconnect to the master - Mailing list pgsql-general

From Олег Самойлов
Subject Re: BUG? Slave don't reconnect to the master
Date
Msg-id 2045A2B7-E972-44EF-B1D3-B86618FC9780@ya.ru
In response to Re: BUG? Slave don't reconnect to the master  (Jehan-Guillaume de Rorthais <jgdr@dalibo.com>)
Responses Re: BUG? Slave don't reconnect to the master
List pgsql-general
Hi, Jehan.

> On 9 Sep 2020, at 18:19, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
>
> On Mon, 7 Sep 2020 23:46:17 +0300
> Олег Самойлов <splarv@ya.ru> wrote:
>
>>> [...]
>>>>>> 10:30:55.965 FATAL:  terminating walreceiver process due to administrator cmd
>>>>>> 10:30:55.966 LOG:  redo done at 0/1600C4B0
>>>>>> 10:30:55.966 LOG:  last completed transaction was at log time 10:25:38.76429
>>>>>> 10:30:55.968 LOG:  selected new timeline ID: 4
>>>>>> 10:30:56.001 LOG:  archive recovery complete
>>>>>> 10:30:56.005 LOG:  database system is ready to accept connections
>>>>>
>>>>>> The slave that didn't reconnect replication, tuchanka3c. I also
>>>>>> separated the logs copied from the old master with a blank line:
>>>>>>
>>>>>> [...]
>>>>>>
>>>>>> 10:20:25.168 LOG:  database system was interrupted; last known up at 10:20:19
>>>>>> 10:20:25.180 LOG:  entering standby mode
>>>>>> 10:20:25.181 LOG:  redo starts at 0/11000098
>>>>>> 10:20:25.183 LOG:  consistent recovery state reached at 0/11000A68
>>>>>> 10:20:25.183 LOG:  database system is ready to accept read only connections
>>>>>> 10:20:25.193 LOG:  started streaming WAL from primary at 0/12000000 on tl 3
>>>>>> 10:25:05.370 LOG:  could not send data to client: Connection reset by peer
>>>>>> 10:26:38.655 FATAL:  terminating walreceiver due to timeout
>>>>>> 10:26:38.655 LOG:  record with incorrect prev-link 0/1200C4B0 at 0/1600C4D8
>>>>>
>>>>> This message appears before the effective promotion of tuchanka3b. Do you
>>>>> have logs about what happens *after* the promotion?
>>>>
>>>> This is the end of the slave log. Nothing. Replication is just absent.
>>>
>>> This is unusual. Could you log some more details about replication
>>> attempts to your PostgreSQL logs? Set log_replication_commands and lower
>>> log_min_messages to debug?
>>
>> Sure, these are the PostgreSQL logs for the cluster tuchanka3.
>> Tuchanka3a is the old (failed) master.
>
> According to your logs:
>
> 20:29:41 tuchanka3a: freeze
> 20:30:39 tuchanka3c: wal receiver timeout (default 60s timeout)
> 20:30:39 tuchanka3c: switched to archives, and error'ed (expected)
> 20:30:39 tuchanka3c: switched to stream again (expected)
>                     no more news from this new wal receiver
> 20:34:21 tuchanka3b: promoted
>
> I'm not sure where your floating IP is located at 20:30:39, but I suppose it
> is still on tuchanka3a, as the wal receiver doesn't hit any connection error
> and tuchanka3b is not promoted yet.

I think so.

>
> So at this point, I suppose the wal receiver is stuck in libpqrcv_connect
> waiting for frozen tuchanka3a to answer, with no connection timeout. You might
> track tcp sockets on tuchanka3a to confirm this.

I don't know how to do this.
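
One possible way to look at TCP sockets, sketched here in Python and assuming a little-endian Linux host that exposes /proc/net/tcp (IPv4 only). The function names, the default port 5432, and the SAMPLE text are illustrative assumptions, not data from this thread. The sketch only lists connections on a port together with their TCP state; interpreting the states is left to the reader.

```python
import socket
import struct

# TCP state codes as printed (in hex) in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def decode_addr(hex_addr):
    """Turn '0100007F:1538' into ('127.0.0.1', 5432).

    Assumes a little-endian host, where the IPv4 address is printed
    byte-swapped.
    """
    ip_hex, port_hex = hex_addr.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)


def sockets_on_port(proc_net_tcp_text, port=5432):
    """Return (local, remote, state) for sockets using `port` on either end."""
    rows = []
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 4:
            continue
        local = decode_addr(fields[1])
        remote = decode_addr(fields[2])
        if port in (local[1], remote[1]):
            state = TCP_STATES.get(fields[3], fields[3])
            rows.append((local, remote, state))
    return rows


# Made-up sample in the /proc/net/tcp layout, for illustration only.
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0100007F:1538 00000000:0000 0A 00000000:00000000 00:00000000 00000000    26        0 20000 1
   1: 0100007F:D431 0100007F:1538 02 00000001:00000000 01:00000064 00000000    26        0 20001 1
"""

for local, remote, state in sockets_on_port(SAMPLE):
    print(local, remote, state)
```

On a real host, the text would come from open("/proc/net/tcp").read() instead of SAMPLE.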

>
> To avoid such a wait, try adding e.g. connect_timeout=2 to your primary_conninfo
> parameter. See:
> https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-PARAMKEYWORDS

Nope, this was not enough. But I went further and added TCP keepalive options too. So now the PAF file, for instance on
tuchanka3c, is:

# recovery.conf for krogan3, pgsqlms pacemaker module
primary_conninfo = 'host=krogan3 user=replicant application_name=tuchanka3c connect_timeout=5 keepalives=1 keepalives_idle=1 keepalives_interval=3 keepalives_count=3'
recovery_target_timeline = 'latest'
standby_mode = 'on'
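
As a rough illustration of what these keepalive values mean, the sketch below (Python, with hypothetical helper names) estimates how long a silent peer can go unnoticed on an idle connection. It uses the usual rule of thumb for TCP keepalive: idle time before the first probe, plus the probe interval times the number of unanswered probes. It assumes the operating system honours all three settings and that the conninfo string contains no quoted values.

```python
def parse_conninfo(conninfo):
    """Split a simple space-separated key=value conninfo string.

    Assumption: no quoted values or embedded spaces, as in the
    primary_conninfo shown in this message.
    """
    return dict(item.split("=", 1) for item in conninfo.split())


def keepalive_detection_seconds(params):
    """Approximate time to declare an idle connection dead:
    idle + interval * count."""
    idle = int(params["keepalives_idle"])
    interval = int(params["keepalives_interval"])
    count = int(params["keepalives_count"])
    return idle + interval * count


CONNINFO = (
    "host=krogan3 user=replicant application_name=tuchanka3c "
    "connect_timeout=5 keepalives=1 keepalives_idle=1 "
    "keepalives_interval=3 keepalives_count=3"
)

params = parse_conninfo(CONNINFO)
print(keepalive_detection_seconds(params))  # 1 + 3 * 3 = 10
```

With the values above the estimate is about 10 seconds, which is much shorter than the default 60 s wal_receiver_timeout mentioned earlier in the thread.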

And now the problem with PostgreSQL-STOP is solved. But I am surprised: why was this needed? I thought that
wal_receiver_timeout would be enough for this case.

I need some more time to check this solution with other tests.

