Thread: Postgresql 9.6.12 switchover failing

Postgresql 9.6.12 switchover failing

From
M Jane
Date:
Hi everyone,

I have a strange issue on my testing PostgreSQL cluster.
I have 2 nodes running PostgreSQL 9.6.12 on RHEL 7.
The slave is kept up to date through hot standby streaming replication.

In the past (9.6.8 and earlier), I was able to do a clean switchover without any issues.
I did the following steps (a rough sketch of the commands is below):
1. Check that the standby is up to date
2. If so, shut down the primary
3. pg_ctl promote the standby
4. Create a recovery.conf on the former primary and start the instance.
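
In practice the commands look roughly like this (the data directory and
connection details are only placeholders for my setup):

  # 1. on the standby: check it has received and replayed everything
  psql -c "SELECT pg_last_xlog_receive_location(), pg_last_xlog_replay_location();"

  # 2. on the primary: stop the instance cleanly and wait for it to finish
  pg_ctl stop -D "$PGDATA" -w

  # 3. on the standby: promote it
  pg_ctl promote -D "$PGDATA"

  # 4. on the former primary: put a recovery.conf pointing at the new primary, then start it
  pg_ctl start -D "$PGDATA" -w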

This worked just fine, and in the end the roles of the two nodes were swapped.

Now I am on version 9.6.12 and am trying to do the same.
Steps 1 to 3 work just fine.
But bringing up the old primary as a slave doesn't work. I get errors (like "FATAL:  requested timeline 5 is not a child of this server's history") and the instance shuts down automatically.
I know I can fix it with pg_rewind (I have done so), but I am confused about this.

Has anyone encountered the same problem?

Thanks in advance.
regards,
Marian



Re: Postgresql 9.6.12 switchover failing

From
Shreeyansh Dba
Date:
Hi Marian,

Yes, we have faced the same issue. Generally this type of issue occurs because the history file (i.e. 00000000.history) is missing after the replication setup with pg_basebackup.

To resolve this type of issue, create an empty history file named 00000000.history on the master server, then rebuild the old master with pg_basebackup.
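
For what it is worth, a rough sketch of what I mean (the file location, host
name and replication user below are only examples and depend on your setup):

  # on the master: create an empty history file (here assuming the default pg_xlog layout)
  touch "$PGDATA/pg_xlog/00000000.history"

  # rebuild the old master from the new primary ($PGDATA must be emptied first)
  pg_basebackup -h <new_primary_host> -U replication -D "$PGDATA" -X stream -R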

Hope this helps.

Thanks & Regards,
Shreeyansh DBA Team
www.shreeyansh.com


On Fri, Mar 8, 2019 at 8:11 PM M Jane <mskiesewetter@gmail.com> wrote:
[...]



Re: Postgresql 9.6.12 switchover failing

From
"Jehan-Guillaume (ioguix) de Rorthais"
Date:
On Fri, 8 Mar 2019 22:10:15 +0530
Shreeyansh Dba <shreeyansh2014@gmail.com> wrote:

> Hi Marian,
> 
> Yes, We have faced the same issue, generally, this type of issue occurs due
> to missing of history file i.e(00000000.history) during the replication
> setup using the pg_basebackup.
> 
> To resolve this type issue you need to create an empty history file with
> name 00000000.history on the master server then rebuild old master by
> initiating the pg_basebackup.

Uh, what? Do you have further info about this? I fail to understand how this is
related to the OP's question...

> On Fri, Mar 8, 2019 at 8:11 PM M Jane <mskiesewetter@gmail.com> wrote:
...
> > But bringing up the old primary as a slave doesn't work. I get errors (
> > like >FATAL:  requested timeline 5 is not a child of this server's history)
> > and the instances shuts down automatically.

I have never seen this during a clean switchover (yet).

How are you checking that both servers are up to date? Do you check that the
standby received everything from the old primary, including its shutdown
checkpoint, before promoting?

Make sure your standby stays connected to your old primary until the latter is
completely shut down. Nothing should disconnect them during the whole
process. And make sure to use the fast or smart mode to stop the old primary.

Are you able to share with us the restart checkpoint on the old
primary and the history file created by the new primary? How far apart are the
restart LSN and the timeline fork?
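
For example, something along these lines should give us that information (the
data directory path is just an example):

  # on the old primary, once it is stopped:
  pg_controldata "$PGDATA" | grep 'Latest checkpoint location'

  # on the new primary, the history file written at promotion:
  cat "$PGDATA"/pg_xlog/0000000*.history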

Regards,


Re: Postgresql 9.6.12 switchover failing

From
M Jane
Date:
Hi,
Thanks for your reply
I will include all the logging I collected while doing the switchover.
I did the following:
1. Check that the standby is up to date
   I ran the following query on the standby:
       SELECT pg_xlog_location_diff(pg_last_xlog_receive_location(), pg_last_xlog_replay_location())  AS replication_delay_bytes;
   This resulted in 0.
2. I shut down the primary instance 
   and checked the alertlog of the primary:
< 2019-03-08 09:45:31.553 CET >LOG:  database system was shut down at 2019-03-08 09:45:06 CET
< 2019-03-08 09:45:31.554 CET >FATAL:  the database system is starting up
< 2019-03-08 09:45:31.557 CET >LOG:  MultiXact member wraparound protections are now enabled
< 2019-03-08 09:45:31.559 CET >LOG:  autovacuum launcher started
< 2019-03-08 09:45:31.559 CET >LOG:  database system is ready to accept connections
< 2019-03-08 09:46:41.159 CET >LOG:  received smart shutdown request
< 2019-03-08 09:46:41.160 CET >LOG:  autovacuum launcher shutting down
< 2019-03-08 09:46:41.160 CET >FATAL:  terminating connection due to administrator command
< 2019-03-08 09:46:41.162 CET >LOG:  shutting down
< 2019-03-08 09:46:41.182 CET >FATAL:  the database system is shutting down
< 2019-03-08 09:46:41.332 CET >LOG:  checkpoint starting: shutdown immediate
< 2019-03-08 09:46:41.345 CET >LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=0.000 s, sync=0.000 s, total=0.015 s; sync files=0, longest=0.000 s, average=0.000 s; distance=16384 kB, estimate=16384 kB
< 2019-03-08 09:46:41.357 CET >LOG:  database system is shut down

3. Then I promoted the slave with pg_ctl promote and checked the alertlog:
< 2019-03-08 09:49:55.249 CET >LOG:  selected new timeline ID: 4
< 2019-03-08 09:49:55.369 CET >LOG:  archive recovery complete
< 2019-03-08 09:49:55.377 CET >LOG:  restored log file "00000003.history" from archive
< 2019-03-08 09:49:55.390 CET >LOG:  MultiXact member wraparound protections are now enabled
< 2019-03-08 09:49:55.391 CET >LOG:  checkpoint starting: force
< 2019-03-08 09:49:55.393 CET >LOG:  autovacuum launcher started
< 2019-03-08 09:49:55.393 CET >LOG:  database system is ready to accept connections
< 2019-03-08 09:49:55.402 CET >LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.000 s, sync=0.000 s, total=0.011 s; sync files=0, longest=0.000 s, average=0.000 s; distance=0 kB, estimate=29490 kB

4. On the former primary I created a recovery.conf
cat recovery.conf
restore_command = 'cat archive_dir/%f | gunzip >  %p'
recovery_target_timeline = 'latest'
standby_mode = on
primary_conninfo = 'host=newprimary port=5432 user=replication password=xxxx'
trigger_file = '/tmp/instance.trigger.5432'

5. And started the instance on the former primary
   alertlog former primary:
< 2019-03-08 09:53:04.543 CET >LOG:  database system was shut down at 2019-03-08 09:46:41 CET
< 2019-03-08 09:53:04.566 CET >LOG:  restored log file "00000004.history" from archive
cat: archive_dir/00000005.history: No such file or directory

gzip: stdin: unexpected end of file
< 2019-03-08 09:53:04.571 CET >LOG:  entering standby mode
< 2019-03-08 09:53:04.577 CET >LOG:  restored log file "00000004.history" from archive
cat: archive_dir/0000000400000002000000A8: No such file or directory

gzip: stdin: unexpected end of file
cat: archive_dir/0000000300000002000000A8: No such file or directory

gzip: stdin: unexpected end of file
< 2019-03-08 09:53:04.591 CET >FATAL:  requested timeline 4 is not a child of this server's history
< 2019-03-08 09:53:04.591 CET >DETAIL:  Latest checkpoint is at 2/A8000028 on timeline 3, but in the history of the requested timeline, the server forked off from that timeline at 2/A70000D0.
< 2019-03-08 09:53:04.592 CET >LOG:  startup process (PID 29629) exited with exit code 1
< 2019-03-08 09:53:04.592 CET >LOG:  aborting startup due to startup process failure
< 2019-03-08 09:53:04.595 CET >LOG:  database system is shut down

Am I missing something?
Please advise.
Thanks in advance.
regards,
Marian

On Fri, 8 Mar 2019 at 17:51, Jehan-Guillaume (ioguix) de Rorthais <ioguix@free.fr> wrote:
[...]

Re: Postgresql 9.6.12 switchover failing

From
"Jehan-Guillaume (ioguix) de Rorthais"
Date:
On Mon, 11 Mar 2019 12:17:59 +0100
M Jane <mskiesewetter@gmail.com> wrote:

> Hi,
> Thanks for your reply
> I will try to put in all the logging I collected when doing the switchover.
> I did the following:
> 1. Check if the standby is uptodate
>    I ran the following query on the standby:
>        SELECT pg_xlog_location_diff(pg_last_xlog_receive_location(),
> pg_last_xlog_replay_location())  AS replication_delay_bytes;
>    This resulted in 0.

This is not required. You don't need to check whether some received XLOG has not
been replayed yet. As long as the XLOG is on disk on the standby side, it will
be replayed before the promotion by default. Quoting the doc:

  pg_last_xlog_receive_location():
    Get last transaction log location received and synced to disk by streaming
    replication.

In this step, you should check that the standby is connected to the primary, either
by listing the processes on both sides, checking pg_stat_replication, etc.
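
For example (column names as of 9.6), on the primary just before the
switchover, something like:

  psql -x -c "SELECT application_name, client_addr, state,
                     sent_location, flush_location, replay_location
              FROM pg_stat_replication;"

A connected, caught-up standby shows state 'streaming' with locations close to
pg_current_xlog_location() on the primary.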

> 2. I shutdown the primary instance
>    and checked the alertlog of the primary:
[...]
> < 2019-03-08 09:46:41.159 CET >LOG:  received smart shutdown request
> < 2019-03-08 09:46:41.160 CET >LOG:  autovacuum launcher shutting down
> < 2019-03-08 09:46:41.160 CET >FATAL:  terminating connection due to
> administrator command

What is this FATAL? Did you issue some more administrative commands after the
initial "pg_ctl -m smart stop"?

I would recommend using "pg_ctl -m fast stop" and nothing else.

> 3. Then i promoted the slave with pg_ctl promote and checked the alertlog:

What was the last received LSN? What was the last WAL file the standby
retrieved from archives?

Do you have log lines before "LOG:  selected new timeline ID: 4" ?

[...]
> 4. On the former primary I created a recovery.conf
> cat recovery.conf
> restore_command = 'cat archive_dir/%f | gunzip >  %p'

"cat" is useless here. I recommend: "gzip -dc archive_dir/%f > %p"

> recovery_target_timeline = 'latest'
> standby_mode = on
> primary_conninfo = 'host=newprimary port=5432 user=replication password=xxxx'
> trigger_file = '/tmp/instance.trigger.5432'

You should not put the trigger file in a publicly accessible folder. Anyone could
trigger the promotion...
You should rely only on "pg_ctl promote" if possible.
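
For example, a recovery.conf along these lines (host and user taken from your
example) avoids both problems:

  restore_command = 'gzip -dc archive_dir/%f > %p'
  recovery_target_timeline = 'latest'
  standby_mode = on
  primary_conninfo = 'host=newprimary port=5432 user=replication password=xxxx'
  # no trigger_file; promote with "pg_ctl promote" instead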

> 5. And started te instance on the former primary
>    alertlog former primary:
[...]
> < 2019-03-08 09:53:04.591 CET >FATAL:  requested timeline 4 is not a child
> of this server's history
> < 2019-03-08 09:53:04.591 CET >DETAIL:  Latest checkpoint is at 2/A8000028
> on timeline 3, but in the history of the requested timeline, the server
> forked off from that timeline at 2/A70000D0.

According to these log lines:

* during shutdown, the old primary archived 0000000300000002000000A7 and wrote its
  shutdown checkpoint as the first xlog record in 0000000300000002000000A8. The
  following command should show it:

  pg_xlogdump "pg_xlog/0000000300000002000000A8"

* for some reason, the standby never received any content from
  0000000300000002000000A8, neither via streaming replication nor restored from the archives

* the promoted standby forked its timeline from the last received LSN, which was
  early in the 0000000300000002000000A7 file (2/A70000D0)

* you end up with a new master that forked (probably) right before the shutdown
  checkpoint of the old master (the next record after 2/A70000D0 presumably being an
  XLOG SWITCH).

In conclusion, the old master is right: the fork happened before its own last
writes.

> Am I missing something?

You should really take care with the first steps. Why did the standby not
receive the shutdown checkpoint? Was it connected when you issued the "pg_ctl
stop"? Was the walsender process killed? A network issue? Why was the
standby not able to restore the last segment from the archives before
promoting?

Before promoting, I usually compare the last checkpoint LSN in the controldata
of the old master with what the standby to promote has received.
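
For example (the data directory path is just an example):

  # on the old primary, after it has been cleanly stopped:
  pg_controldata "$PGDATA" | grep 'Latest checkpoint location'

  # on the standby you are about to promote:
  psql -c "SELECT pg_last_xlog_receive_location();"

The standby's received location must be at or past the old primary's latest
(shutdown) checkpoint location; if it is behind, do not promote yet.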

> Please advise?
> Thanks in advance.
> regards,
> Marian
> 
> [...]



-- 
Jehan-Guillaume de Rorthais
Dalibo


Re: Postgresql 9.6.12 switchover failing

From
M Jane
Date:
Hi,
I will do the test tomorrow morning with your suggestions and will capture the logs and steps.
Thanks for the advice.
regards,
Marian


On Mon, 11 Mar 2019 at 15:55, Jehan-Guillaume (ioguix) de Rorthais <ioguix@free.fr> wrote:
[...]

Re: Postgresql 9.6.12 switchover failing

From
M Jane
Date:
Hi Jehan-Guillaume,
Thanks for the support.
The issue is solved, thanks to your help.
The culprit was the service script, which did not stop the instance correctly, resulting in a standby that did not receive all the XLOGs.
I fixed the service script and I am now able to switch over without any problem.
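
For anyone hitting the same thing: the essential part is that the stop action
must do a clean, waited shutdown, along the lines of (paths and timeout are
just examples for my setup):

  # stop the instance with a fast shutdown and wait until it has finished
  su - postgres -c "/usr/pgsql-9.6/bin/pg_ctl stop -m fast -w -t 300 -D /var/lib/pgsql/9.6/data"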

Regards,
Marian


On Mon, 11 Mar 2019 at 17:44, M Jane <mskiesewetter@gmail.com> wrote:
[...]