Re: BUG? Slave don't reconnect to the master - Mailing list pgsql-general
From | Jehan-Guillaume de Rorthais |
---|---|
Subject | Re: BUG? Slave don't reconnect to the master |
Date | |
Msg-id | 20200929113100.70b57e60@firost |
In response to | Re: BUG? Slave don't reconnect to the master (Олег Самойлов <splarv@ya.ru>) |
Responses | Re: BUG? Slave don't reconnect to the master |
List | pgsql-general |
On Thu, 24 Sep 2020 15:22:46 +0300
Олег Самойлов <splarv@ya.ru> wrote:

> Hi, Jehan.
>
> > On 9 Sep 2020, at 18:19, Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
> > wrote:
> >
> > On Mon, 7 Sep 2020 23:46:17 +0300
> > Олег Самойлов <splarv@ya.ru> wrote:
> >
> >>> [...]
> >>>>>> 10:30:55.965 FATAL: terminating walreceiver process due to administrator cmd
> >>>>>> 10:30:55.966 LOG: redo done at 0/1600C4B0
> >>>>>> 10:30:55.966 LOG: last completed transaction was at log time 10:25:38.76429
> >>>>>> 10:30:55.968 LOG: selected new timeline ID: 4
> >>>>>> 10:30:56.001 LOG: archive recovery complete
> >>>>>> 10:30:56.005 LOG: database system is ready to accept connections
> >>>>>
> >>>>>> The slave with didn't reconnected replication, tuchanka3c. Also I
> >>>>>> separated logs copied from the old master by a blank line:
> >>>>>>
> >>>>>> [...]
> >>>>>>
> >>>>>> 10:20:25.168 LOG: database system was interrupted; last known up at 10:20:19
> >>>>>> 10:20:25.180 LOG: entering standby mode
> >>>>>> 10:20:25.181 LOG: redo starts at 0/11000098
> >>>>>> 10:20:25.183 LOG: consistent recovery state reached at 0/11000A68
> >>>>>> 10:20:25.183 LOG: database system is ready to accept read only connections
> >>>>>> 10:20:25.193 LOG: started streaming WAL from primary at 0/12000000 on tl 3
> >>>>>> 10:25:05.370 LOG: could not send data to client: Connection reset by peer
> >>>>>> 10:26:38.655 FATAL: terminating walreceiver due to timeout
> >>>>>> 10:26:38.655 LOG: record with incorrect prev-link 0/1200C4B0 at 0/1600C4D8
> >>>>>
> >>>>> This message appear before the effective promotion of tuchanka3b. Do you
> >>>>> have logs about what happen *after* the promotion?
> >>>>
> >>>> This is end of the slave log. Nothing. Just absent replication.
> >>>
> >>> This is unusual. Could you log some more details about replication
> >>> tryouts to your PostgreSQL logs? Set log_replication_commands and lower
> >>> log_min_messages to debug ?
> >>
> >> Sure, this is PostgreSQL logs for the cluster tuchanka3.
> >> Tuchanka3a is an old (failed) master.
> >
> > According to your logs:
> >
> > 20:29:41 tuchanka3a: freeze
> > 20:30:39 tuchanka3c: wal receiver timeout (default 60s timeout)
> > 20:30:39 tuchanka3c: switched to archives, and error'ed (expected)
> > 20:30:39 tuchanka3c: switched to stream again (expected)
> >          no more news from this new wal receiver
> > 20:34:21 tuchanka3b: promoted
> >
> > I'm not sure where your floating IP is located at 20:30:39, but I suppose it
> > is still on tuchanka3a as the wal receiver don't hit any connection error
> > and tuchanka3b is not promoted yet.
>
> I think so.
>
> > So at this point, I suppose the wal receiver is stuck in libpqrcv_connect
> > waiting for frozen tuchanka3a to answer, with no connection timeout. You
> > might track tcp sockets on tuchanka3a to confirm this.
>
> I don't know how to do this.

Use ss, see its manual page. Here is an example, using the standard 5432
pgsql port:

  ss -tapn 'dport = 5432 or sport = 5432'

Look for the Local and Peer addresses and their status.

> > To avoid such a wait, try to add eg. connect_timeout=2 to your
> > primary_conninfo parameter. See:
> > https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-PARAMKEYWORDS
>
> Nope, this was not enough. But I went further and I added tcp keepalive
> options too. So now paf file, for instance in tuchanka3c, is:
>
> # recovery.conf for krogan3, pgsqlms pacemaker module
> primary_conninfo = 'host=krogan3 user=replicant application_name=tuchanka3c
> connect_timeout=5 keepalives=1 keepalives_idle=1 keepalives_interval=3
> keepalives_count=3'
> recovery_target_timeline = 'latest'
> standby_mode = 'on'
>
> And now the problem with PostgreSQL-STOP is solved. But I surprised, why this
> was needed? I though that wal_receiver_timeout must be enough for this case.

Because wal_receiver_timeout applies to already established, streaming
connections, when the server end of the stream becomes silent.

The timeout you hit happens during connection establishment, where
connect_timeout takes effect.

Regarding the keepalive parameters, I am a bit surprised. According to the
source code, the parameter defaults are:

  keepalives=1
  keepalives_idle=1
  keepalives_interval=1
  keepalives_count=1

But I just had a quick look there, so I probably missed something.

Regards,
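
A rough sketch of the arithmetic behind the keepalive settings quoted above, for readers who want to estimate how long an idle, already established connection takes to notice a frozen peer. It assumes the usual TCP keepalive behaviour: a first probe after keepalives_idle seconds of silence, then keepalives_count unanswered probes spaced keepalives_interval seconds apart. The helper names parse_conninfo and keepalive_detection_seconds are made up for illustration and are not part of PostgreSQL or libpq; the parser only handles simple unquoted key=value pairs like those in the quoted primary_conninfo.

```python
# Hypothetical helpers for illustration only; not part of PostgreSQL or libpq.

def parse_conninfo(conninfo: str) -> dict:
    """Split a simple 'key=value key=value' conninfo string into a dict.

    Assumes no quoted values and no embedded spaces, as in the quoted example.
    """
    return dict(item.split("=", 1) for item in conninfo.split())


def keepalive_detection_seconds(params: dict) -> int:
    """Approximate time to declare a silent peer dead on an idle connection.

    Idle time before the first probe, plus one interval per unanswered probe.
    """
    idle = int(params["keepalives_idle"])
    interval = int(params["keepalives_interval"])
    count = int(params["keepalives_count"])
    return idle + interval * count


# Connection string taken from the recovery.conf quoted in the message.
conninfo = (
    "host=krogan3 user=replicant application_name=tuchanka3c "
    "connect_timeout=5 keepalives=1 keepalives_idle=1 "
    "keepalives_interval=3 keepalives_count=3"
)
params = parse_conninfo(conninfo)
print(keepalive_detection_seconds(params))  # 1 + 3 * 3 = 10
```

With these values a dead peer would be detected after roughly 10 seconds of silence. This bound is independent of connect_timeout, which only covers connection establishment.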