Thread: In-place upgrade with streaming replicas

In-place upgrade with streaming replicas

From: richard@kojedz.in
Dear All,

I am trying to follow the instructions for an in-place upgrade with 
streaming replica servers. The documentation here: 
https://www.postgresql.org/docs/13/pgupgrade.html#:~:text=Prepare%20for%20standby%20server%20upgrades 
says that I should check the 'Latest checkpoint location' on the primary 
and the replica servers. I want to make this process automatic, so I am 
looking for a reliable way to guarantee that the checkpoint locations 
match. During the automated upgrade procedure, I restart all servers on 
different TCP ports, so no legitimate clients can connect to the primary 
and no changes are made. Then I issue CHECKPOINT on the primary, retrieve 
pg_current_wal_lsn() on the primary, and wait until all replicas report 
the same value from pg_last_wal_replay_lsn(); then I issue a CHECKPOINT 
on each replica, which according to the documentation creates a 
restartpoint there. I repeat this until pg_current_wal_lsn() no longer 
changes on the primary. If I then shut down the cluster with the primary 
first and the replicas right after, the checkpoint locations will match. 
However, if I accidentally shut down a replica before the primary, the 
checkpoint locations won't match.
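
For illustration, this is roughly the loop I have in mind (host names, 
ports and credentials below are only placeholders for my setup):

    # issue a checkpoint on the primary and note its current WAL position
    PRIMARY="-h primary.example -p 5433 -U postgres -d postgres"
    psql $PRIMARY -c "CHECKPOINT;"
    TARGET=$(psql $PRIMARY -Atc "SELECT pg_current_wal_lsn();")

    # wait until every replica has replayed up to that position,
    # then trigger a restartpoint on it
    for r in replica1.example replica2.example; do
        until [ "$(psql -h "$r" -p 5433 -U postgres -d postgres -Atc \
              "SELECT pg_last_wal_replay_lsn() >= '$TARGET'::pg_lsn")" = t ]; do
            sleep 1
        done
        psql -h "$r" -p 5433 -U postgres -d postgres -c "CHECKPOINT;"
    done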

This brings me to my question: after the primary has been shut down, what 
guarantees that the replicas end up with the same checkpoint location? 
Why does the order of shutting down the servers matter? And what would be 
a truly exact and reliable way to ensure that the replicas have the same 
checkpoint location as the primary?

Thanks in advance,
Richard



Re: In-place upgrade with streaming replicas

From: Álvaro Herrera
Date: 2025-02-19 16:54
On 2025-Feb-19, richard@kojedz.in wrote:

> This brings me to my question: after the primary has been shut down, what
> guarantees that the replicas end up with the same checkpoint location?
> Why does the order of shutting down the servers matter? And what would be
> a truly exact and reliable way to ensure that the replicas have the same
> checkpoint location as the primary?

The replicas can't write WAL by themselves, but they will replay
whatever the primary has sent; by shutting down the primary first and
letting the replicas catch up, you ensure that the replicas will
actually receive the shutdown record and replay it.  If you shut down
the replicas first, they can obviously never catch up with the shutdown
checkpoint of the primary.

As I recall, if you do shut down the primary first, one potential danger
is that the primary fails to send the checkpoint record before shutting
down, so the replicas won't receive it and obviously will not replay it;
or simply that they are behind enough that they receive it but don't
replay it.

You could use pg_controldata to read the last checkpoint info from all
nodes.  You can run it on the primary after shutting it down, and then
on each replica while it's still running to ensure that the correct
restartpoint has been created.
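
For example (binary and data directory paths here are only illustrative):

    # on the primary, after it has been shut down
    /usr/lib/postgresql/13/bin/pg_controldata /var/lib/postgresql/13/main |
        grep 'Latest checkpoint location'

    # on each replica, while it is still running; the value shown is the
    # location of its latest restartpoint
    /usr/lib/postgresql/13/bin/pg_controldata /var/lib/postgresql/13/main |
        grep 'Latest checkpoint location'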

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Someone said that it is at least an order of magnitude more work to do
production software than a prototype. I think he is wrong by at least
an order of magnitude."                              (Brian Kernighan)



Re: In-place upgrade with streaming replicas

From: richard@kojedz.in
Dear Alvaro,

Thanks for your answer. I was not aware of the shutdown record; that 
changes things. So I definitely have to stop the primary first and then 
use pg_controldata to obtain the checkpoint info. Can I then query the 
replicas, while they are still up and running, to find out whether they 
have received the shutdown record? In other words, after shutting down 
the primary, how will I know that a replica has received that record and 
is safe to shut down?

Thanks for the clarifications.

Best regards,
Richard



Re: In-place upgrade with streaming replicas

From: Jerry Sievers
Date: 2025-02-20 08:49
richard@kojedz.in writes:

> Dear Alvaro,
>
> Thanks for your answer. I was not aware of the shutdown record; that
> changes things. So I definitely have to stop the primary first and then
> use pg_controldata to obtain the checkpoint info. Can I then query the
> replicas, while they are still up and running, to find out whether they
> have received the shutdown record? In other words, after shutting down
> the primary, how will I know that a replica has received that record and
> is safe to shut down?



Hmmm, not sure about that, but what we do is stop the primary, wait a 
$short time, then stop the replicas...

Then run pg_controldata on all nodes, filter out only the line 
indicating the latest checkpoint, and sort -u the output. Expect only a 
single line if all nodes match.

You may also wish to first ensure that you got the same number of lines 
as the total node count before doing the sorting and uniquing.

Very rarely, on our huge systems, we'd have a mismatch after the 
verification; in those cases, our automated upgrade procedure restarts 
all nodes and then does the shutdown and verify check again.
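
Roughly like this (node names, paths, and ssh access are just how our 
setup happens to look):

    NODES="primary1 replica1 replica2"       # placeholder node names
    PGDATA=/var/lib/postgresql/13/main       # placeholder data directory

    # collect the latest-checkpoint line from every node
    lines=$(for n in $NODES; do
        ssh "$n" "pg_controldata $PGDATA" | grep 'Latest checkpoint location'
    done)

    # sanity check: we should have exactly one line per node
    [ "$(echo "$lines" | wc -l)" -eq "$(echo "$NODES" | wc -w)" ] ||
        echo "some node did not report a checkpoint location" >&2

    # if all nodes match, sort -u collapses this to a single line
    echo "$lines" | sort -u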

HTH



Re: In-place upgrade with streaming replicas

From: richard@kojedz.in
Dear Jerry,

So yes, it turns out that some kind of loop must be involved here, as 
you described:

1. ensure the cluster is running
2. stop the primary
3. wait some time
4. stop the replicas
5. check whether the checkpoint locations match; repeat from step 1 if 
   they are out of sync

My concern is that the unreliable step is the third one. Can we query 
the running replica to find out whether it has caught up? I mean, after 
stopping the primary we can obtain its checkpoint location from 
pg_controldata, so can we somehow ask the still-running replica about 
that location?
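
What I would like to do is something along these lines (paths and host 
names are placeholders); I just do not know whether this is a sufficient 
guarantee:

    # after the primary has been stopped cleanly, read its final
    # checkpoint location from the control file
    CKPT=$(/usr/lib/postgresql/13/bin/pg_controldata /var/lib/postgresql/13/main |
           awk -F': *' '/^Latest checkpoint location/ {print $2}')

    # ask a still-running replica whether it has replayed past that location
    psql -h replica1.example -p 5433 -U postgres -d postgres -Atc \
        "SELECT pg_last_wal_replay_lsn() > '$CKPT'::pg_lsn"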

Thanks in advance,
Richard



Re: In-place upgrade with streaming replicas

From: Jerry Sievers
Date: 2025-02-21 04:57
richard@kojedz.in writes:

> Dear Jerry,
>
> So yes, it turns out that some kind of loop must be involved here, as
> you described:
>
> 1. ensure the cluster is running
> 2. stop the primary
> 3. wait some time
> 4. stop the replicas
> 5. check whether the checkpoint locations match; repeat from step 1 if
>    they are out of sync
>
> My concern is that the unreliable step is the third one. Can we query
> the running replica to find out whether it has caught up? I mean, after
> stopping the primary we can obtain its checkpoint location from
> pg_controldata, so can we somehow ask the still-running replica about
> that location?

Assuming your client traffic has been stopped ahead of time, and perhaps 
you did a lockout via HBA or other means, including forcible termination 
of persistent clients (we usually do a restart of the primary to ensure 
this)...

We don't wait more than a few seconds before also stopping the replicas, 
and the vast majority of the time all nodes end up at the same 
checkpoint.
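
For the forcible-termination part, something like this is what I mean 
(run as a superuser; host and port are placeholders):

    # kick out every remaining client session, but leave walsenders
    # and background workers alone
    psql -h primary.example -p 5433 -U postgres -d postgres -c "
        SELECT pg_terminate_backend(pid)
        FROM pg_stat_activity
        WHERE backend_type = 'client backend'
          AND pid <> pg_backend_pid();"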

Cheers!



Re: In-place upgrade with streaming replicas

From: richard@kojedz.in
Dear Jerry,

Thanks for sharing your experience; I will implement our upgrades in a 
similar way: terminate clients and restart on a different port, wait for 
catch-up, stop the primary, check the replicas somehow (e.g. using 
pg_wal_lsn_diff()), then stop the replicas too, check that the 
pg_controldata output matches, and repeat if it does not.
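
The per-replica check I have in mind is roughly this (paths and host 
names are placeholders again):

    # final checkpoint location of the already-stopped primary
    CKPT=$(/usr/lib/postgresql/13/bin/pg_controldata /var/lib/postgresql/13/main |
           awk -F': *' '/^Latest checkpoint location/ {print $2}')

    # a negative difference means the replica has already replayed past
    # the start of the primary's shutdown checkpoint record
    psql -h replica1.example -p 5433 -U postgres -d postgres -Atc \
        "SELECT pg_wal_lsn_diff('$CKPT', pg_last_wal_replay_lsn()) < 0"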

Regards,
Richard
