Thread: In-place upgrade with streaming replicas
Dear All,

I am trying to follow the instructions for an in-place upgrade with streaming replica servers. The documentation here: https://www.postgresql.org/docs/13/pgupgrade.html#:~:text=Prepare%20for%20standby%20server%20upgrades says that I should check 'Latest checkpoint location' on the primary and the replica servers. I want to make this process automatic, so I would like to know a reliable way to guarantee that the checkpoint locations match.

During the automated upgrade procedure I restart all servers on different TCP ports, so no legitimate clients can connect to the primary, and therefore no changes are made. Then I issue CHECKPOINT on the primary, retrieve pg_current_wal_lsn() on the primary, and wait until all replicas report the same value from pg_last_wal_replay_lsn(); then I issue CHECKPOINT on the replicas. According to the documentation this creates a restartpoint on the replicas. I repeat this until pg_current_wal_lsn() no longer changes on the primary.

If I then shut down the cluster so that the primary stops first and the replicas immediately afterwards, the checkpoint locations match. However, if I accidentally shut down a replica before the primary, the checkpoint locations won't match.

This leads to my questions: after the primary has shut down, what guarantees that the replicas have the same checkpoint location? Why does the order of shutting down the servers matter? What would be a truly exact and reliable way to ensure that the replicas end up with the same checkpoint location as the primary?

Thanks in advance,
Richard
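[For reference, a minimal sketch of the catch-up wait described above. The connection strings, ports and host names are made-up placeholders, not details from this thread.]

    #!/bin/bash
    # Hypothetical connection strings; adjust for your environment.
    PRIMARY="port=5499 dbname=postgres"
    REPLICAS=("host=replica1 port=5499 dbname=postgres" \
              "host=replica2 port=5499 dbname=postgres")

    # Force a checkpoint and note the current write position on the primary.
    psql "$PRIMARY" -Atc "CHECKPOINT"
    TARGET=$(psql "$PRIMARY" -Atc "SELECT pg_current_wal_lsn()")

    # Wait until every replica has replayed at least up to that LSN,
    # then request a restartpoint there (CHECKPOINT on a standby does that).
    for r in "${REPLICAS[@]}"; do
        until [ "$(psql "$r" -Atc \
            "SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), '$TARGET') >= 0")" = "t" ]; do
            sleep 1
        done
        psql "$r" -Atc "CHECKPOINT"
    done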
On 2025-Feb-19, richard@kojedz.in wrote:

> This leads to my questions: after the primary has shut down, what
> guarantees that the replicas have the same checkpoint location? Why does
> the order of shutting down the servers matter? What would be a truly
> exact and reliable way to ensure that the replicas end up with the same
> checkpoint location as the primary?

The replicas can't write WAL by themselves, but they will replay whatever the primary has sent; by shutting down the primary first and letting the replicas catch up, you ensure that the replicas will actually receive the shutdown record and replay it. If you shut down the replicas first, they can obviously never catch up with the shutdown checkpoint of the primary.

As I recall, if you do shut down the primary first, one potential danger is that the primary fails to send the checkpoint record before shutting down, so the replicas won't receive it and obviously will not replay it; or simply that they are behind enough that they receive it but don't replay it.

You could use pg_controldata to read the last checkpoint info from all nodes. You can run it on the primary after shutting it down, and then on each replica while it's still running to ensure that the correct restartpoint has been created.

-- 
Álvaro Herrera         PostgreSQL Developer — https://www.EnterpriseDB.com/
"Someone said that it is at least an order of magnitude more work to do
production software than a prototype. I think he is wrong by at least
an order of magnitude."                              (Brian Kernighan)
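[As a concrete example of the pg_controldata check suggested here; the data directory path is a placeholder for your installation.]

    # on the primary, after it has been shut down
    pg_controldata /var/lib/pgsql/13/data | grep 'Latest checkpoint location'

    # on each replica, while it is still running
    # (pg_controldata only reads the pg_control file, so this is safe)
    pg_controldata /var/lib/pgsql/13/data | grep 'Latest checkpoint location'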
Dear Alvaro,

Thanks for your answers. Unfortunately, I was unaware of the shutdown record; that makes a difference then. So I definitely must stop the primary first and then use pg_controldata to obtain the checkpoint info. Can I query the replicas while they are still up and running to see whether they have received the shutdown record? In other words, after shutting down the primary, how will I know that a replica has received that record and is safe to shut down?

Thanks for the clarifications.

Best regards,
Richard

On 2025-02-19 16:54, Álvaro Herrera wrote:
> The replicas can't write WAL by themselves, but they will replay
> whatever the primary has sent; by shutting down the primary first and
> letting the replicas catch up, you ensure that the replicas will
> actually receive the shutdown record and replay it. If you shut down
> the replicas first, they can obviously never catch up with the shutdown
> checkpoint of the primary.
>
> As I recall, if you do shut down the primary first, one potential danger
> is that the primary fails to send the checkpoint record before shutting
> down, so the replicas won't receive it and obviously will not replay it;
> or simply that they are behind enough that they receive it but don't
> replay it.
>
> You could use pg_controldata to read the last checkpoint info from all
> nodes. You can run it on the primary after shutting it down, and then
> on each replica while it's still running to ensure that the correct
> restartpoint has been created.
richard@kojedz.in writes:

> Dear Alvaro,
>
> Thanks for your answers. Unfortunately, I was unaware of the shutdown
> record; that makes a difference then. So I definitely must stop the
> primary first and then use pg_controldata to obtain the checkpoint info.
> Can I query the replicas while they are still up and running to see
> whether they have received the shutdown record? In other words, after
> shutting down the primary, how will I know that a replica has received
> that record and is safe to shut down?

Hmmm, not sure about that, but what we do is stop the primary, wait a $short time, then stop the replicas...

Then run pg_controldata on all nodes, filter out only the line indicating the latest checkpoint, and sort -u the output. Expect only a single line if all nodes match.

You may also wish to first ensure that you got the same number of lines as the total node count before doing the sorting and de-duplicating.

Very rarely on our huge systems we'd have a mismatch after that verification; in those cases our automated upgrade procedure restarts all nodes and then does the shutdown-and-verify check again.

HTH
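[A sketch of that verification, assuming hypothetical host names, SSH access between nodes and a placeholder data directory path.]

    #!/bin/bash
    NODES=("primary1" "replica1" "replica2")   # hypothetical host names
    PGDATA=/var/lib/pgsql/13/data              # hypothetical data directory

    # Collect the 'Latest checkpoint location' line from every node.
    lines=$(for n in "${NODES[@]}"; do
        ssh "$n" "pg_controldata $PGDATA" | grep 'Latest checkpoint location'
    done)

    # Sanity check: one line per node, otherwise pg_controldata failed somewhere.
    [ "$(echo "$lines" | wc -l)" -eq "${#NODES[@]}" ] || { echo "missing output"; exit 1; }

    # All nodes match only if sort -u collapses the output to a single line.
    if [ "$(echo "$lines" | sort -u | wc -l)" -eq 1 ]; then
        echo "checkpoint locations match"
    else
        echo "mismatch -- restart the cluster and repeat"
        exit 1
    fi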
Dear Jerry,

So yes, it turns out that some kind of loop must be involved here, as you described:

1. ensure the cluster is running
2. stop the primary
3. wait some time
4. stop the replicas
5. check whether the checkpoint locations match; repeat from step 1 if out of sync

My question is that the unreliable step here is the third one. Can we ask the replica at runtime whether it has caught up? That is, after stopping the primary we can obtain its checkpoint location from pg_controldata; can we then somehow query the running replica about that location?

Thanks in advance,
Richard

On 2025-02-20 08:49, Jerry Sievers wrote:
> Hmmm, not sure about that, but what we do is stop the primary, wait a
> $short time, then stop the replicas...
>
> Then run pg_controldata on all nodes, filter out only the line
> indicating the latest checkpoint, and sort -u the output. Expect only a
> single line if all nodes match.
>
> You may also wish to first ensure that you got the same number of lines
> as the total node count before doing the sorting and de-duplicating.
>
> Very rarely on our huge systems we'd have a mismatch after that
> verification; in those cases our automated upgrade procedure restarts
> all nodes and then does the shutdown-and-verify check again.
>
> HTH
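[A rough sketch of that stop-and-verify loop. Host names, the data directory path and verify_checkpoints.sh, a stand-in for the pg_controldata comparison shown earlier, are all hypothetical.]

    #!/bin/bash
    PGDATA=/var/lib/pgsql/13/data
    while true; do
        # 2. stop the primary first so it writes its shutdown checkpoint
        ssh primary1 "pg_ctl stop -D $PGDATA -m fast"
        # 3. give the replicas a moment to receive and replay it
        sleep 5
        # 4. stop the replicas
        for r in replica1 replica2; do ssh "$r" "pg_ctl stop -D $PGDATA -m fast"; done
        # 5. compare 'Latest checkpoint location' across nodes
        ./verify_checkpoints.sh && break
        # out of sync: start everything again (step 1) and retry;
        # in practice you would probably use your service manager here
        for n in primary1 replica1 replica2; do
            ssh "$n" "pg_ctl start -D $PGDATA -l $PGDATA/startup.log"
        done
    done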
richard@kojedz.in writes:

> Dear Jerry,
>
> So yes, it turns out that some kind of loop must be involved here, as
> you described:
>
> 1. ensure the cluster is running
> 2. stop the primary
> 3. wait some time
> 4. stop the replicas
> 5. check whether the checkpoint locations match; repeat from step 1 if
>    out of sync
>
> My question is that the unreliable step here is the third one. Can we
> ask the replica at runtime whether it has caught up? That is, after
> stopping the primary we can obtain its checkpoint location from
> pg_controldata; can we then somehow query the running replica about
> that location?

Assuming your client traffic has been stopped ahead of time, and perhaps you did a lockout via HBA or other means, including forcible termination of persistent clients (we usually do a restart of the primary to ensure this)...

We don't wait more than a few seconds before also stopping the replicas, and the vast majority of times all nodes are at the same checkpoint.

Cheers!
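[One common way to do the forcible-termination part mentioned here, offered only as a sketch and not necessarily what Jerry's procedure uses: after moving the primary to a private port and/or locking clients out in pg_hba.conf, any remaining sessions can be kicked while leaving the walsenders (replication) untouched. The connection string is a placeholder.]

    # terminate every client session except our own; walsenders are not
    # 'client backend' entries, so replication keeps streaming
    psql "port=5499 dbname=postgres" -Atc \
      "SELECT pg_terminate_backend(pid) FROM pg_stat_activity
       WHERE pid <> pg_backend_pid() AND backend_type = 'client backend';"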
Dear Jerry,

Thanks for sharing your experience; I will implement our upgrades in a similar way: terminate clients and restart on a different port, wait for catch-up, stop the primary, check the replicas somehow (using pg_wal_lsn_diff()), then stop the replicas too, check that pg_controldata matches, and repeat if it doesn't.

Regards,
Richard

On 2025-02-21 04:57, Jerry Sievers wrote:
> Assuming your client traffic has been stopped ahead of time, and perhaps
> you did a lockout via HBA or other means, including forcible termination
> of persistent clients (we usually do a restart of the primary to ensure
> this)...
>
> We don't wait more than a few seconds before also stopping the replicas,
> and the vast majority of times all nodes are at the same checkpoint.
>
> Cheers!
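[The "check the replicas somehow" step could be sketched like this; the paths and connection string are placeholders. The idea is only that pg_last_wal_replay_lsn() on a still-running replica must have moved past the stopped primary's final checkpoint location before the replica is safe to stop.]

    #!/bin/bash
    PRIMARY_PGDATA=/var/lib/pgsql/13/data                 # hypothetical path on the primary
    REPLICA_CONN="host=replica1 port=5499 dbname=postgres"

    # Final checkpoint written by the primary's shutdown (primary is already stopped).
    TARGET=$(pg_controldata "$PRIMARY_PGDATA" \
             | awk -F': *' '/^Latest checkpoint location/ {print $2}')

    # On the running replica: replay position is reported as the end of the
    # last replayed record, so it is strictly past TARGET once the shutdown
    # checkpoint record has been replayed.
    psql "$REPLICA_CONN" -Atc \
      "SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), '$TARGET') > 0"
    # prints 't' once the replica has caught up and is safe to stop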