Re: Replication with Patroni not working after killing secondary and starting again - Mailing list pgsql-general

From Zb B
Subject Re: Replication with Patroni not working after killing secondary and starting again
Msg-id CAKwARkbqwVc35dZWFLvrwL_6FxvwJSq-UEzFareEcoLvqqNYsA@mail.gmail.com
In response to Re: Replication with Patroni not working after killing secondary and starting again  ("Peter J. Holzer" <hjp-pgsql@hjp.at>)
Responses Re: Replication with Patroni not working after killing secondary and starting again  ("Peter J. Holzer" <hjp-pgsql@hjp.at>)
List pgsql-general
> What does `patronictl list` show during that interval?

Well, I can't reproduce the situation anymore. Replication now starts immediately after Patroni is started on the secondary. I did run several switchover commands in the meantime, though.

Meanwhile I ran another test: a Java app executes a large number of *short* transactions (inserts), and while it is running I issue the Patroni switchover command:

patronictl -c /etc/patroni/patroni.yml switchover

It turned out that the records were not replicated to the secondary, and when I tried to execute the switchover command on the primary I got the following error:
Error: This cluster has no master

When I executed the switchover command on the secondary, it worked. However, because the primary and the secondary had diverged, the records on the old primary were rolled back: the number of records on both nodes became the same, equal to what the old secondary had before the switchover.

Apparently there is something wrong with my cluster. How can I debug it? Do I need to configure anything to make the replication synchronous?
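
One way to check whether the secondary is behind before a switchover is to compare WAL positions (LSNs) from both nodes, for example the text output of pg_current_wal_lsn() on the primary and pg_last_wal_replay_lsn() on the secondary. The following is only a minimal Python sketch of that comparison, assuming the two LSNs have already been fetched as strings; the helper names lsn_to_int and lag_bytes are hypothetical and not part of Patroni or PostgreSQL.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '0/7000000' to a byte position.

    An LSN is printed as two hexadecimal numbers separated by a slash:
    the high 32 bits and the low 32 bits of a 64-bit WAL position.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """Return how many bytes of WAL the replica is behind the primary."""
    return lsn_to_int(primary_lsn) - lsn_to_int(replica_lsn)


# Example: lag_bytes("0/7000100", "0/7000000") -> 256
```

A lag that stays above zero while the test application is idle would suggest that the secondary is not receiving or replaying WAL at all.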

On Fri, 29 Apr 2022 at 22:33, Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
On 2022-04-28 11:09:12 +0200, Zb B wrote:
> > When the secondary starts up it should continue replicating from where
> > it stopped. However, it can only do this if the necessary information is
> > still available. If WAL files have been deleted in the meantime, it
> > can't replay them. There should be error messages in your logs on what
> > went wrong.
>
> I did another test using different wal_sender_timeout parameter, as the time of
> the secondary being shut down was longer than the default 60s for this
> parameter.

I don't think this will help. It will just make the primary slower in
noticing that the secondary is gone.


> I was hoping it would help but the result was the same (records were not
> replicated to the secondary after the patroni start). Well, I just verified
> again that the records were replicated after about 15 minutes to the secondary,
> so probably the timeout setting helped, or I was not patient enough before.

The latter, I suspect. Although I'm surprised that it takes so long. In
my experience, that takes only a few seconds, certainly less than a
minute for replication to start (how long it takes to finish depends on
the amount of data, of course).

Patroni can nuke the secondary database and create a fresh copy
(using basebackup). That might take 15 minutes (depending on the
database size). I don't think it does that automatically, though. Also I
think you would have noticed that.

What does `patronictl list` show during that interval?


> Is it normal to wait so long for the replication? (the original
> transaction in primary took about 5 minutes and was about 3000 small
> records). I am providing more details for completeness below:
>
> I get the following errors on the primary DB:
> 2022-04-28 04:36:50.544 EDT [13794] WARNING:  archive_mode enabled, yet
> archive_command is not set
> 2022-04-28 04:37:34.893 EDT [14755] ERROR:  replication slot "xyzd3riardb05"
> does not exist
> 2022-04-28 04:37:34.893 EDT [14755] STATEMENT:  START_REPLICATION SLOT
> "xyzd3riardb05" 0/7000000 TIMELINE 18
...
> and after some time such errors stop appearing.

So the replication slot is probably created after some time and then
replication starts to work.

I think that replication slot is managed by Patroni. So the question
would be: Why does Patroni take so long to create it? Did it log
anything?
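
To see how long the primary kept rejecting the secondary, the PostgreSQL log can be scanned for the first and last "replication slot ... does not exist" error per slot, and that window can then be compared with the Patroni log. This is only a rough Python sketch: it assumes each log entry is on a single line with the same prefix as the lines quoted above (timestamp, time zone, PID in brackets), and the function name slot_error_window is made up for illustration.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2022-04-28 04:37:34.893 EDT [14755] ERROR:  replication slot "xyzd3riardb05" does not exist
# The prefix layout is an assumption; it depends on log_line_prefix.
SLOT_ERROR = re.compile(
    r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) \S+ \[\d+\] ERROR:\s+'
    r'replication slot "([^"]+)" does not exist'
)


def slot_error_window(lines):
    """Return {slot_name: (first_seen, last_seen, count)} for missing-slot errors."""
    windows = {}
    for line in lines:
        match = SLOT_ERROR.match(line)
        if match is None:
            continue  # not a missing-slot error, ignore
        seen = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
        slot = match.group(2)
        first, last, count = windows.get(slot, (seen, seen, 0))
        windows[slot] = (min(first, seen), max(last, seen), count + 1)
    return windows
```

Called with the lines of the primary's log file (for example `slot_error_window(open(path))`, where `path` is wherever the server log lives), the difference between last_seen and first_seen gives a lower bound on how long the slot was missing.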

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
