Re: Replication with Patroni not working after killing secondary and starting again - Mailing list pgsql-general

From Peter J. Holzer
Subject Re: Replication with Patroni not working after killing secondary and starting again
Msg-id 20220429203307.hjlpiqulgw76rzdl@hjp.at
In response to Re: Replication with Patroni not working after killing secondary and starting again  (Zb B <zbig.poland@gmail.com>)
Responses Re: Replication with Patroni not working after killing secondary and starting again  (Zb B <zbig.poland@gmail.com>)
List pgsql-general
On 2022-04-28 11:09:12 +0200, Zb B wrote:
> > When the secondary starts up it should continue replicating from where
> > it stopped. However, it can only do this if the necessary information is
> > still available. If WAL files have been deleted in the meantime, it
> > can't replay them. There should be error messages in your logs on what
> > went wrong.
>
> I did another test using different wal_sender_timeout parameter, as the time of
> the secondary being shut down was longer than the default 60s for this
> parameter.

I don't think this will help. It will just make the primary slower in
noticing that the secondary is gone.


> I was hoping it would help but the result was the same (records were not
> replicated to the secondary after the patroni start). Well, I just verified
> again that the records were replicated after about 15 minutes to the secondary,
> so probably the timeout setting helped, or I was not patient enough before.

The latter, I suspect. Although I'm surprised that it takes so long. In
my experience it takes only a few seconds, certainly less than a
minute, for replication to start (how long it takes to finish depends on
the amount of data, of course).

Patroni can nuke the secondary database and create a fresh copy
(using basebackup). That might take 15 minutes (depending on the
database size). I don't think it does that automatically, though. Also I
think you would have noticed that.

What does `patronictl list` show during that interval?
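
To capture that, something along these lines could record the output at
regular intervals with a timestamp. This is only a rough sketch: the
helper name `poll`, the interval, and the sample count are made up for
illustration, and depending on the setup `patronictl` may need extra
arguments (a config file or cluster name) that are not shown here.

```python
import subprocess
import time
from datetime import datetime


def poll(cmd, interval=10.0, count=90):
    """Run cmd `count` times, `interval` seconds apart.

    Yields (timestamp, output) pairs as they are collected, so the
    caller can print or log each sample immediately.
    """
    for i in range(count):
        result = subprocess.run(cmd, capture_output=True, text=True)
        stamp = datetime.now().isoformat(timespec="seconds")
        # Keep stderr too: an error message is also useful evidence.
        yield stamp, result.stdout + result.stderr
        if i < count - 1:
            time.sleep(interval)


# Hypothetical usage (about 15 minutes at the defaults above):
#
#   for stamp, output in poll(["patronictl", "list"]):
#       print(stamp)
#       print(output)
```

Comparing the timestamps at which the secondary's state changes with the
timestamps in the PostgreSQL log should show what happened during the
gap.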


> Is it normal to wait so long for the replication? (the original
> transaction in primary took about 5 minutes and was about 3000 small
> records). I am providing more details for completeness below:
>
> I get the following errors on the primary DB:
> 2022-04-28 04:36:50.544 EDT [13794] WARNING:  archive_mode enabled, yet
> archive_command is not set
> 2022-04-28 04:37:34.893 EDT [14755] ERROR:  replication slot "xyzd3riardb05"
> does not exist
> 2022-04-28 04:37:34.893 EDT [14755] STATEMENT:  START_REPLICATION SLOT
> "xyzd3riardb05" 0/7000000 TIMELINE 18
...
> and after some time such errors stop appearing.

So the replication slot is probably created after some time and then
replication starts to work.

I think that replication slot is managed by Patroni. So the question
would be: Why does Patroni take so long to create it? Did it log
anything?
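
To see exactly how long the slot was missing, the primary's log can be
scanned for the first and last occurrence of that error. A minimal
sketch, assuming the log lines look like the ones quoted above
(timestamp, time zone, PID in brackets, and the message on a single
line, not wrapped as in this mail); the function name and the log path
in the usage comment are hypothetical.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2022-04-28 04:37:34.893 EDT [14755] ERROR:  replication slot "x" does not exist
SLOT_ERROR = re.compile(
    r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) \S+ \[\d+\] '
    r'ERROR:\s+replication slot "([^"]+)" does not exist'
)


def missing_slot_windows(lines):
    """Return {slot_name: (first_seen, last_seen, count)} for the
    'replication slot ... does not exist' errors found in `lines`."""
    windows = {}
    for line in lines:
        m = SLOT_ERROR.match(line)
        if not m:
            continue
        stamp = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
        slot = m.group(2)
        first, _, count = windows.get(slot, (stamp, stamp, 0))
        windows[slot] = (first, stamp, count + 1)
    return windows


# Hypothetical usage:
#
#   with open("postgresql.log") as f:
#       for slot, (first, last, n) in missing_slot_windows(f).items():
#           print(slot, first, last, last - first, n)
```

The time of the last error is roughly when the slot appeared, which is
the point to look for in the Patroni log.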

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
