Re: BUG? Slave don't reconnect to the master - Mailing list pgsql-general

From Jehan-Guillaume de Rorthais
Subject Re: BUG? Slave don't reconnect to the master
Date
Msg-id 20200909171931.64fce7bb@firost
In response to Re: BUG? Slave don't reconnect to the master  (Олег Самойлов <splarv@ya.ru>)
Responses Re: BUG? Slave don't reconnect to the master
List pgsql-general
On Mon, 7 Sep 2020 23:46:17 +0300
Олег Самойлов <splarv@ya.ru> wrote:

> [...]
> >>> why did you add "monitor interval=15"? No harm, but it is redundant with
> >>> "monitor interval=16 role=Master" and "monitor interval=17
> >>> role=Slave".
> >>
> >> I can't remember clearly. :) Look what happens without it.
> >>
> >> + pcs -f configured_cib.xml resource create krogan2DB ocf:heartbeat:pgsqlms
> >> bindir=/usr/pgsql-11/bin pgdata=/var/lib/pgsql/krogan2
> >> recovery_template=/var/lib/pgsql/krogan2.paf meta master notify=true
> >> resource-stickiness=10
> >> Warning: changing a monitor operation interval from 15 to 16 to make the
> >> operation unique
> >> Warning: changing a monitor operation interval from 16 to 17 to make the
> >> operation unique
> >
> > Something is fishy here. This command lacks op monitor settings. Pacemaker
> > doesn't add any default monitor operation with a default interval if you
> > don't give one at resource creation.
> >
> > If you create such a resource with no monitoring, the cluster will
> > start/stop it when needed, but will NOT check for its health. See:
> >
> > https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-resource-monitoring.html
>
> Maybe. But keep in mind that I use `pcs`, I do not edit the xml file
> directly. And I use an old pacemaker; the default package of CentOS 7 is
> pacemaker-1.1.21-4.el7.x86_64, while your documentation link is for
> Pacemaker 2.0.

The behavior is the same in both 2.0 and 1.1, but... (see below)

> >> So a trivial monitor always exists by default with an interval of 15.
> >
> > nope.
>
> This is not true for CentOS 7. I removed my monitor options for this example.
>
> pcs cluster cib original_cib.xml
> cp original_cib.xml configured_cib.xml
> pcs -f configured_cib.xml resource create krogan3DB ocf:heartbeat:pgsqlms
> bindir=/usr/pgsql-11/bin pgdata=/var/lib/pgsql/krogan3
> recovery_template=/var/lib/pgsql/krogan3.paf meta master notify=true
> resource-stickiness=10

I tried your command, and indeed, pcs creates the missing monitor operation
with a default interval of 15. This is surprising; it's the first time I have
come across these warning messages. Thanks for the information, I wasn't aware
of this pcs behavior.

Anyway, it is not recommended to create your resources without specifying an
interval and a timeout for each operation. See the PAF docs. Just create the
two monitor operations related to both roles and you won't get these warnings.
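
For example, something along these lines avoids the auto-added operation
entirely. This is only a sketch: the operation timeouts are adapted from the
PAF configuration page and the trailing meta options are copied from your own
command, so adjust everything to your environment:

  pcs -f configured_cib.xml resource create krogan3DB ocf:heartbeat:pgsqlms \
      bindir=/usr/pgsql-11/bin pgdata=/var/lib/pgsql/krogan3 \
      recovery_template=/var/lib/pgsql/krogan3.paf \
      op start timeout=60s \
      op stop timeout=60s \
      op promote timeout=30s \
      op demote timeout=120s \
      op monitor interval=15s timeout=10s role="Master" \
      op monitor interval=16s timeout=10s role="Slave" \
      op notify timeout=60s \
      meta master notify=true resource-stickiness=10

With both role-specific monitor operations declared explicitly, pcs has no
reason to invent a third one or to shift intervals around.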

> > [...]
> > OK, I understand now. If you want to edit an existing resource, use "pcs
> > resource update". Make sure to read the pcs manual about how to use it to
> > edit/remove/add operations on a resource.
>
> This is not so easy. To edit an existing resource I must know the "interval"
> of this resource, but in this case I am not sure what the interval will be
> for the monitor operation of the master role. :) Because
> >>
> >> Warning: changing a monitor operation interval from 15 to 16 to make the
> >> operation unique
> >> Warning: changing a monitor operation interval from 16 to 17 to make the
> >> operation unique
>
> I am not sure in what order they are applied and what they will end up as.
> That's why I configured it the way I did. It just works.

Now that we know where these warnings come from, you have a solution (set both
of them explicitly).
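
If you would rather fix an already-created resource than recreate it, something
like this should do it (a sketch; the interval given to "op remove" must match
the auto-added operation as stored in the CIB, 15 in your case):

  pcs resource op remove krogan3DB monitor interval=15
  pcs resource op add krogan3DB monitor interval=15s timeout=10s role="Master"
  pcs resource op add krogan3DB monitor interval=16s timeout=10s role="Slave"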

> >> It looked like the default timeout of 10 was not enough for the "master".
> >
> > It's written in PAF doc. See:
> > https://clusterlabs.github.io/PAF/configuration.html#resource-agent-actions
> >
> > Do not hesitate to report or submit some enhancements to the doc if
> > needed.
>
> Maybe the documentation has been improved. Thanks for pointing me to it.
> After moving to CentOS 8 I will check with the recommended parameters
> according to the documentation.

You can do it right now with CentOS 7. The recommended parameters are the same.

> > [...]
> >>>> 10:30:55.965 FATAL:  terminating walreceiver process due to administrator cmd
> >>>> 10:30:55.966 LOG:  redo done at 0/1600C4B0
> >>>> 10:30:55.966 LOG:  last completed transaction was at log time 10:25:38.76429
> >>>> 10:30:55.968 LOG:  selected new timeline ID: 4
> >>>> 10:30:56.001 LOG:  archive recovery complete
> >>>> 10:30:56.005 LOG:  database system is ready to accept connections
> >>>
> >>>> The slave that didn't reconnect replication is tuchanka3c. Also I
> >>>> separated the logs copied from the old master by a blank line:
> >>>>
> >>>> [...]
> >>>>
> >>>> 10:20:25.168 LOG:  database system was interrupted; last known up at 10:20:19
> >>>> 10:20:25.180 LOG:  entering standby mode
> >>>> 10:20:25.181 LOG:  redo starts at 0/11000098
> >>>> 10:20:25.183 LOG:  consistent recovery state reached at 0/11000A68
> >>>> 10:20:25.183 LOG:  database system is ready to accept read only connections
> >>>> 10:20:25.193 LOG:  started streaming WAL from primary at 0/12000000 on tl 3
> >>>> 10:25:05.370 LOG:  could not send data to client: Connection reset by peer
> >>>> 10:26:38.655 FATAL:  terminating walreceiver due to timeout
> >>>> 10:26:38.655 LOG:  record with incorrect prev-link 0/1200C4B0 at 0/1600C4D8
> >>>
> >>> This message appears before the effective promotion of tuchanka3b. Do you
> >>> have logs about what happens *after* the promotion?
> >>
> >> This is the end of the slave log. Nothing. Replication is simply absent.
> >
> > This is unusual. Could you log some more details about replication attempts
> > in your PostgreSQL logs? Set log_replication_commands and lower
> > log_min_messages to debug?
>
> Sure, these are the PostgreSQL logs for the cluster tuchanka3.
> Tuchanka3a is the old (failed) master.
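
As a side note, the extra logging mentioned above is only two settings in
postgresql.conf (a sketch; debug2 is just an illustrative level):

  log_replication_commands = on
  log_min_messages = debug2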

According to your logs:

20:29:41 tuchanka3a: freeze
20:30:39 tuchanka3c: wal receiver timeout (default 60s timeout)
20:30:39 tuchanka3c: switched to archives, and error'ed (expected)
20:30:39 tuchanka3c: switched to stream again (expected)
                     no more news from this new wal receiver
20:34:21 tuchanka3b: promoted

I'm not sure where your floating IP is located at 20:30:39, but I suppose it
is still on tuchanka3a, as the wal receiver doesn't hit any connection error
and tuchanka3b is not promoted yet.

So at this point, I suppose the wal receiver is stuck in libpqrcv_connect,
waiting for the frozen tuchanka3a to answer, with no connection timeout. You
might track the TCP sockets on tuchanka3a to confirm this.
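
For example, once tuchanka3a answers again (or from the tuchanka3c side to see
the wal receiver's outgoing connection), something like this lists the
replication sockets and their state (a sketch; 5432 is the assumed PostgreSQL
port):

  ss -tanp | grep 5432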

To avoid such a wait, try adding e.g. connect_timeout=2 to your
primary_conninfo parameter. See:
https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-PARAMKEYWORDS
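
With PAF on PostgreSQL 11, primary_conninfo lives in the recovery template
file. A sketch, where host, port and user are purely illustrative and
application_name is the local node name:

  # /var/lib/pgsql/krogan3.paf
  standby_mode = on
  recovery_target_timeline = 'latest'
  primary_conninfo = 'host=<master-vip> port=5432 user=replication application_name=tuchanka3c connect_timeout=2'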

Regards,


