Re: In-place upgrade with streaming replicas - Mailing list pgsql-admin
From: richard@kojedz.in
Subject: Re: In-place upgrade with streaming replicas
Date:
Msg-id: 71ed34f8c1ed858262a7cf50b633682f@kojedz.in
In response to: Re: In-place upgrade with streaming replicas (Jerry Sievers <gsievers19@comcast.net>)
Responses: Re: In-place upgrade with streaming replicas
List: pgsql-admin
Dear Jerry,

So, yes, it turns out that some kind of loop must be involved here, as you described:

1. ensure the cluster is running
2. stop the primary
3. wait some time
4. stop the replicas
5. check whether the checkpoint locations match; repeat from step 1 if out of sync

My question is that the unreliable step here is the 3rd one. Can we query the running replica to see whether it has caught up? I mean that after stopping the primary we can obtain the checkpoint location from pg_controldata; can we then somehow ask the still-running replica about that?

Thanks in advance,
Richard

On 2025-02-20 08:49, Jerry Sievers wrote:
> richard@kojedz.in writes:
>
>> Dear Alvaro,
>>
>> Thanks for your answers. Unfortunately, I was unaware of a shutdown
>> record; that makes a difference then. So, I definitely must stop the
>> primary first, then use pg_controldata to obtain the checkpoint
>> info. Then, can I query the replicas while they are up and running to
>> see whether they've received the shutdown record or not? So, after
>> shutting down the primary, how will I know whether a replica has
>> received the mentioned record and is safe to shut down?
>
> Hmmm, not sure about that, but what we do is stop the primary, wait a
> short time, then stop the replicas...
>
> Then run pg_controldata on all nodes, filter out only the line
> indicating the latest checkpoint, and sort -u the output. Expect only a
> single line if all are matched.
>
> You may also wish to first ensure that you got the same number of
> lines as the total node count before doing the sorting and uniqueing.
>
> Very rarely on our huge systems we'd have a mismatch after the
> verification; in those cases, our automated upgrade procedure restarts
> all nodes and then does the shutdown and verify check again.
>
> HTH
>
>> Thanks for the clarifications.
>>
>> Best regards,
>> Richard
>>
>> On 2025-02-19 16:54, Álvaro Herrera wrote:
>>> On 2025-Feb-19, richard@kojedz.in wrote:
>>>
>>>> With this, I have the question: after the shutdown of the primary,
>>>> what is the guarantee that the replicas have the same checkpoint
>>>> location? Why does the order of shutting down the servers matter?
>>>> What would be a really exact and reliable way to ensure that the
>>>> replicas will have the same checkpoint location as the primary?
>>>
>>> The replicas can't write WAL by themselves, but they will replay
>>> whatever the primary has sent; by shutting down the primary first and
>>> letting the replicas catch up, you ensure that the replicas will
>>> actually receive the shutdown record and replay it. If you shut down
>>> the replicas first, they can obviously never catch up with the
>>> shutdown checkpoint of the primary.
>>>
>>> As I recall, if you do shut down the primary first, one potential
>>> danger is that the primary fails to send the checkpoint record before
>>> shutting down, so the replicas won't receive it and obviously will
>>> not replay it; or simply that they are behind enough that they
>>> receive it but don't replay it.
>>>
>>> You could use pg_controldata to read the last checkpoint info from
>>> all nodes. You can run it on the primary after shutting it down, and
>>> then on each replica while it's still running to ensure that the
>>> correct restartpoint has been created.
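[Editor's note] A minimal sketch of the check Richard is asking about, under stated assumptions: the primary has already been stopped, its data directory path and the replica's connection string are placeholders (PGDATA_PRIMARY, REPLICA_CONNINFO are hypothetical names), and the standard functions pg_last_wal_replay_lsn() and pg_wal_lsn_diff() are available on the replica. This is not a tested upgrade procedure, only an illustration of comparing the stopped primary's "Latest checkpoint location" with the running replica's replay position.

    #!/usr/bin/env bash
    # Rough sketch only -- adjust paths and connection info for your setup.
    set -euo pipefail

    PGDATA_PRIMARY="/path/to/primary/data"            # hypothetical path
    REPLICA_CONNINFO="host=replica1 dbname=postgres"  # hypothetical conninfo

    # Latest checkpoint location of the already-stopped primary.
    primary_ckpt=$(pg_controldata "$PGDATA_PRIMARY" \
                   | awk -F': *' '/^Latest checkpoint location/ {print $2}')

    # Last WAL position the still-running replica has replayed.
    replica_replay=$(psql "$REPLICA_CONNINFO" -XAt \
                     -c "SELECT pg_last_wal_replay_lsn();")

    echo "primary latest checkpoint: $primary_ckpt"
    echo "replica replayed up to:    $replica_replay"

    # pg_wal_lsn_diff(a, b) = a - b in bytes. The shutdown checkpoint
    # record *starts* at the reported checkpoint location, so the replica
    # has replayed it only once its replay LSN has moved past that point,
    # i.e. the diff is strictly negative.
    behind=$(psql "$REPLICA_CONNINFO" -XAt \
             -c "SELECT pg_wal_lsn_diff('$primary_ckpt', pg_last_wal_replay_lsn());")

    if [ "${behind%%.*}" -lt 0 ]; then
        echo "replica appears to have replayed the shutdown checkpoint; safe to stop"
    else
        echo "replica is still about $behind bytes behind -- wait and re-check"
    fi

Even with a check like this, the definitive verification is still the one Jerry describes: once all nodes are stopped, run pg_controldata on each, keep only the "Latest checkpoint location" line, and sort -u the output; a single unique line (and the expected line count beforehand) means everything matched.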