Re: Postgresql 9.6.12 switchover failing - Mailing list pgsql-admin
From | Jehan-Guillaume (ioguix) de Rorthais
---|---
Subject | Re: Postgresql 9.6.12 switchover failing
Date |
Msg-id | 20190311155536.1e3604e7@firost
In response to | Re: Postgresql 9.6.12 switchover failing (M Jane <mskiesewetter@gmail.com>)
Responses | Re: Postgresql 9.6.12 switchover failing
List | pgsql-admin
On Mon, 11 Mar 2019 12:17:59 +0100 M Jane <mskiesewetter@gmail.com> wrote:
> Hi,
> Thanks for your reply.
> I will try to put in all the logging I collected when doing the switchover.
> I did the following:
>
> 1. Check if the standby is up to date
> I ran the following query on the standby:
> SELECT pg_xlog_location_diff(pg_last_xlog_receive_location(),
> pg_last_xlog_replay_location()) AS replication_delay_bytes;
> This resulted in 0.

This is not required. You don't need to check whether some received XLOG has
not been replayed on data yet. As long as the XLOG is on disk on the standby
side, it will be replayed before promoting by default. Quoting the doc:

  pg_last_xlog_receive_location(): Get last transaction log location received
  and synced to disk by streaming replication.

In this step, you should check that the standby is connected to the primary,
either by listing your processes on both sides, checking pg_stat_replication,
etc.

> 2. I shut down the primary instance
> and checked the alertlog of the primary:
[...]
> < 2019-03-08 09:46:41.159 CET >LOG: received smart shutdown request
> < 2019-03-08 09:46:41.160 CET >LOG: autovacuum launcher shutting down
> < 2019-03-08 09:46:41.160 CET >FATAL: terminating connection due to
> administrator command

What is this FATAL? Did you issue some more administrative commands after the
initial "pg_ctl -m smart stop"?

I would recommend using "pg_ctl -m fast stop" and nothing else.

> 3. Then I promoted the slave with pg_ctl promote and checked the alertlog:

What was the last received LSN? What was the last WAL file the standby
retrieved from the archives? Do you have log lines before "LOG: selected new
timeline ID: 4"?

[...]
> 4. On the former primary I created a recovery.conf
> cat recovery.conf
> restore_command = 'cat archive_dir/%f | gunzip > %p'

"cat" is useless here. I recommend: "gzip -dc archive_dir/%f > %p"

> recovery_target_timeline = 'latest'
> standby_mode = on
> primary_conninfo = 'host=newprimary port=5432 user=replication password=xxxx'
> trigger_file = '/tmp/instance.trigger.5432'

You should not put the trigger file in a publicly accessible folder: anyone
could trigger the promotion... You should rely only on "pg_ctl promote" if
possible.

> 5. And started the instance on the former primary
> alertlog former primary:
[...]
> < 2019-03-08 09:53:04.591 CET >FATAL: requested timeline 4 is not a child
> of this server's history
> < 2019-03-08 09:53:04.591 CET >DETAIL: Latest checkpoint is at 2/A8000028
> on timeline 3, but in the history of the requested timeline, the server
> forked off from that timeline at 2/A70000D0.

According to these log lines:

* during shutdown, the old primary archived 0000000300000002000000A7 and wrote
  its shutdown checkpoint as the first xlog record in
  0000000300000002000000A8. The following command should show it:
  pg_xlogdump "pg_xlog/0000000300000002000000A8"
* for some reason, the standby never received any content from
  0000000300000002000000A8, neither by streaming replication nor restored
  from the archives
* the promoted standby forked its timeline from the last LSN it received,
  which was early in file 0000000300000002000000A7 (2/A70000D0)
* you end up with a new master that forked (probably) right before the
  shutdown checkpoint of the old master (assuming the next record after
  2/A70000D0 was an XLOG SWITCH).

In conclusion, the old master is right: the fork happened before its own last
writes.

> Am I missing something?

You should really take care of the first steps. Why did the standby not
receive the shutdown checkpoint? Was it connected when you issued the
"pg_ctl stop"? Was the wal_sender process killed? A network issue? Why was
the standby not able to restore the last segment from the archives before
promoting?

I usually compare the last checkpoint LSN in the controldata of the old
master with what the standby to be promoted has received, before promoting
it.
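As a rough sketch of that last check (assuming shell access to the old
primary's data directory and psql access to the standby; adjust the path to
your environment):

  # on the old primary, once "pg_ctl -m fast stop" has returned:
  pg_controldata /path/to/old_primary_data | grep "Latest checkpoint location"

  # on the standby, before promoting it:
  psql -XAtc "SELECT pg_last_xlog_receive_location();"

The location reported by the standby should be past the old primary's latest
checkpoint location. If it is not, the shutdown checkpoint never made it to
the standby, and promoting it would fork the timeline exactly as it did here.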
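And for step 1, this is roughly what I mean by checking that the standby is
actually connected, rather than checking the receive/replay delta (column
names as in 9.6):

  -- on the primary: the standby should show up here, state 'streaming'
  SELECT application_name, client_addr, state, sent_location, replay_location
  FROM pg_stat_replication;

  -- on the standby: the WAL receiver should be up and streaming
  SELECT status, received_lsn, latest_end_lsn FROM pg_stat_wal_receiver;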
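For reference, a minimal recovery.conf along the lines discussed above,
keeping your archive_dir and connection settings (which I obviously can't
verify) and dropping the world-writable trigger_file:

  restore_command = 'gzip -dc archive_dir/%f > %p'
  recovery_target_timeline = 'latest'
  standby_mode = on
  primary_conninfo = 'host=newprimary port=5432 user=replication password=xxxx'

Promotion would then be done with "pg_ctl promote" only.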
> Please advise?
> Thanks in advance.
> regards,
> Marian
>
> On Fri, 8 Mar 2019 at 17:51, Jehan-Guillaume (ioguix) de Rorthais <
> ioguix@free.fr> wrote:
>
> > On Fri, 8 Mar 2019 22:10:15 +0530
> > Shreeyansh Dba <shreeyansh2014@gmail.com> wrote:
> >
> > > Hi Marian,
> > >
> > > Yes, we have faced the same issue. Generally, this type of issue occurs
> > > due to a missing history file, i.e. 00000000.history, during the
> > > replication setup using pg_basebackup.
> > >
> > > To resolve this type of issue you need to create an empty history file
> > > named 00000000.history on the master server, then rebuild the old
> > > master by initiating pg_basebackup.
> >
> > Uh, what? Do you have further info about this? I fail to understand how
> > this is related to the OP's question...
> >
> > > On Fri, Mar 8, 2019 at 8:11 PM M Jane <mskiesewetter@gmail.com> wrote:
> > ...
> > > > But bringing up the old primary as a slave doesn't work. I get errors
> > > > (like FATAL: requested timeline 5 is not a child of this server's
> > > > history) and the instance shuts down automatically.
> >
> > While doing a clean switchover, I have never seen this (yet).
> >
> > How are you checking that both servers are up to date? Do you check that
> > the standby received everything from the old primary, including its
> > shutdown checkpoint, before promoting?
> >
> > Make sure your standby stays connected to your old primary until the
> > latter is completely shut down. Nothing should disconnect them during the
> > whole process. And make sure to use mode fast or smart to stop the old
> > primary.
> >
> > Are you able to share with us the restart checkpoint on the old primary
> > and the history file created by the new primary? How far are the restart
> > LSN and the timeline fork from each other?
> >
> > Regards,

--
Jehan-Guillaume de Rorthais
Dalibo