Re: Slave promotion problem... - Mailing list pgsql-general

From marin@kset.org
Subject Re: Slave promotion problem...
Msg-id 02c9f0b656add4fae12ec2453fdc6b84@kset.org
In response to Re: Slave promotion problem...  (Martín Marqués <martin@2ndquadrant.com>)
Responses Re: Slave promotion problem...
List pgsql-general
On 2015-08-31 14:38, Martín Marqués wrote:
> El 31/08/15 a las 03:29, marin@kset.org escribió:
>> Last week we had some problems on the master server which caused a
>> failover on the slave (the master was completely unresponsive due to
>> reasons still unknown). The slave received the promote signal (pg_ctl
>> promote) and logged that in the logs:
>> 2015-08-28 23:05:10 UTC [6]: [50-1] user=,db= LOG:  received promote
>> request
>> 2015-08-28 23:05:10 UTC [467]: [2-1] user=,db= FATAL:  terminating
>> walreceiver process due to administrator command
>>
>> 5 hours later the slave still didn't promote. Meanwhile we fixed the
>> master and restarted it. The slave was restarted and it behaved just
>> like the promote signal didn't arrive, connecting to the master as a
>> regular slave.
>
> Aren't there any further logs after the walreceiver termination?
> Up to here everything looks fine, but we have no idea on what was
> logged afterwards.
There are logs (quite a few, about 5 hours' worth), with something like
this every second:
2015-08-28 23:05:12 UTC [79867]: [1-1] user=[unknown],db=[unknown] LOG:  connection received: host=[local]
2015-08-28 23:05:12 UTC [79867]: [2-1] user=postgres,db=postgres LOG:  connection authorized: user=postgres database=postgres
These entries come from the process that probes whether the server is
alive.

I was expecting to see something like:
redo done at xxxxx
last completed transaction was at log time xxxxxxx

But those lines still hadn't appeared after 5 hours. As I understand it,
they are written before the server uses the restore_command to get WAL
and history files from the archive.
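
For anyone who wants to check a log for the same pattern, here is a
minimal Python sketch. It assumes each log entry is on a single line and
uses the timestamp prefix shown above; the function name
promotion_status and the marker strings chosen are illustrative
assumptions, not part of our actual tooling.

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the log lines quoted in this thread,
# e.g. "2015-08-28 23:05:10 UTC [6]: ...". This depends on log_line_prefix.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) UTC ")

PROMOTE_MARK = "received promote request"
# Messages I expected to see once recovery finished.
DONE_MARKS = ("redo done at", "last completed transaction was at log time")


def promotion_status(lines):
    """Return (promote_time, completed) for an iterable of log lines.

    promote_time is None if no promote request was logged. completed is
    True only if one of DONE_MARKS appears after the latest request.
    """
    promote_time = None
    completed = False
    for line in lines:
        m = TS_RE.match(line)
        if m and PROMOTE_MARK in line:
            # A new promote request resets the state.
            promote_time = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            completed = False
        elif promote_time is not None and any(k in line for k in DONE_MARKS):
            completed = True
    return promote_time, completed
```

If this returns a promote time with completed still False long after the
request, the log shows the same situation described here.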

>
>> I am unsure if this promotion failure is a bug/glitch, but the promote
>> procedure is automated and tested a couple of hundred times so I am
>> certain we initiated the promote correctly.
>
> Are you using homemade scripts? Maybe you need to test them more
> thoroughly, with different environment parameters.

We use a custom script for the restore_command, but it seems that it was
not invoked.

Regards,
Mladen Marinović


