Thread: ...

...

From

Henry Korszun

Date:

25 April 2014, 17:37:19

I'm using PostgreSQL 9.2.4 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.1.2 20070115 (SUSE Linux), 64-bit.

I set up streaming replication for a read/write primary and a read/only standby. The replication works fine for a while, and then out of the blue BOTH machines become read/write, but with no replication from the original primary to the newly read/write standby.

The only log entry that seems relevant is as follows: FATAL,57P01,"terminating walreceiver process due to administrator command",,,,,,,,"ProcessWalRcvInterrupts, walreceiver.c:150",""

Any help/guidance would be appreciated. Thanks in advance.

Re:

From

Scott Whitney

Date:

25 April 2014, 17:43:50

Sounds like you might have a "trigger_file" set in your recovery.conf. Do you? That or someone is issuing a pg_ctl promote command.
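
A quick way to answer that question is to look for the setting in the standby's recovery.conf. The sketch below is only an illustration: the sample contents, host name, and paths are made up, and the regex assumes the common `name = 'value'` form rather than every syntax the server accepts.

```python
import re
from typing import Optional

# Matches an uncommented line such as: trigger_file = '/tmp/pg_failover'
_TRIGGER_RE = re.compile(r"^\s*trigger_file\s*=\s*'([^']*)'", re.MULTILINE)


def find_trigger_file(conf_text: str) -> Optional[str]:
    """Return the trigger_file path found in recovery.conf text, or None."""
    match = _TRIGGER_RE.search(conf_text)
    return match.group(1) if match else None


# Hypothetical recovery.conf contents, for illustration only.
sample = """
standby_mode = 'on'
primary_conninfo = 'host=primary.example port=5432'
# trigger_file = '/tmp/old_trigger'
trigger_file = '/tmp/pg_failover'
"""

print(find_trigger_file(sample))
```

If this returns a path, anything on the standby host that creates that file will promote the standby.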

Re:

From

Payal Singh

Date:

25 April 2014, 17:46:57

I would check for any automated jobs that touch the trigger file on the slave.
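
One rough way to run that check is to scan job definitions (crontab output, wrapper scripts) for the trigger file path. This is a minimal sketch; the crontab text, script names, and trigger path are hypothetical, and a plain substring match will miss jobs that build the path indirectly.

```python
from typing import List


def lines_mentioning(path: str, job_text: str) -> List[str]:
    """Return non-blank, non-comment job lines that reference `path`."""
    hits = []
    for line in job_text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#") and path in stripped:
            hits.append(stripped)
    return hits


# Hypothetical crontab contents, for illustration only.
crontab = """
# m h dom mon dow command
*/5 * * * * /usr/local/bin/check_primary.sh || touch /tmp/pg_failover
0 2 * * * /usr/local/bin/backup.sh
"""

for hit in lines_mentioning("/tmp/pg_failover", crontab):
    print(hit)
```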

Payal Singh,
Database Administrator,

OmniTI Computer Consulting Inc.
Phone: 240.646.0770 x 253

Re:

From

Payal Singh

Date:

25 April 2014, 17:48:08

Also, if you are using chef/puppet to automate the configurations, maybe the file recovery.conf is being overwritten or removed.

Payal Singh,
Database Administrator,

OmniTI Computer Consulting Inc.
Phone: 240.646.0770 x 253

Re:

From

Henry Korszun

Date:

25 April 2014, 17:57:22

There IS a trigger file, which does appear to have been "touch"ed. But the problem is that a fail-over hasn't really occurred since the original read/write primary continues to be a fully functioning read/write machine. But it's no longer replicating to the erstwhile standby, which has become read/write. Bottom line, I now have 2 read/write machines, but with no replication between them.

Re:

From

Payal Singh

Date:

25 April 2014, 18:00:14

Once a trigger file is touched on the slave, it makes the slave standalone, but it doesn't stop the old primary server
automatically. You have to handle that, by either stopping the old primary altogether or pointing a virtual IP to the
newly promoted slave.
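
The resulting state, two servers that both accept writes, can be detected by asking each node `SELECT pg_is_in_recovery()`: a standby returns true and a read/write server returns false. The sketch below assumes those results have already been collected into a dictionary; the node names are hypothetical.

```python
from typing import Dict, List


def writable_nodes(in_recovery: Dict[str, bool]) -> List[str]:
    """Nodes whose pg_is_in_recovery() result is false, i.e. read/write."""
    return sorted(name for name, recovering in in_recovery.items() if not recovering)


def has_split_brain(in_recovery: Dict[str, bool]) -> bool:
    """True when more than one node is accepting writes."""
    return len(writable_nodes(in_recovery)) > 1


# Hypothetical results after the standby was promoted unexpectedly.
state = {"primary": False, "standby": False}
print(writable_nodes(state), has_split_brain(state))
```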

Re:

From

Scott Whitney

Date:

25 April 2014, 18:00:56

The slave doesn't "turn off" the master. The trigger file is intended to be touched _when the master is down_.

Since the master never WENT down (or came back up) and the trigger file was touched, the slave got promoted.

You'll need to stop the slave, run your select pg_start_backup(), rsync, etc. to get your slave back to slave mode.
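
As a rough outline of that sequence, the sketch below only builds the command strings in order and does not run anything. The host name, data directory, and backup label are hypothetical, and details such as WAL retention, authentication, and extra rsync exclusions depend on the installation.

```python
from typing import List


def resync_steps(primary_host: str, data_dir: str, label: str = "resync") -> List[str]:
    """Build (not run) an ordered command list for rebuilding a standby."""
    return [
        # 1. Stop the wrongly promoted standby.
        f"pg_ctl -D {data_dir} stop -m fast",
        # 2. Put the primary into backup mode.
        f"psql -h {primary_host} -c \"SELECT pg_start_backup('{label}', true)\"",
        # 3. Copy the primary's data directory over the standby's, keeping
        #    the standby's own recovery.done so it can be restored in step 5.
        f"rsync -a --delete --exclude=postmaster.pid --exclude=recovery.done "
        f"{primary_host}:{data_dir}/ {data_dir}/",
        # 4. End backup mode on the primary.
        f"psql -h {primary_host} -c \"SELECT pg_stop_backup()\"",
        # 5. Promotion renamed recovery.conf to recovery.done; put it back.
        f"mv {data_dir}/recovery.done {data_dir}/recovery.conf",
        # 6. Start the server again as a standby.
        f"pg_ctl -D {data_dir} start",
    ]


for step in resync_steps("primary.example", "/var/lib/pgsql/data"):
    print(step)
```

Before restarting, it is worth confirming that whatever created the trigger file will not create it again, or the standby will simply be promoted a second time.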

Re:

From

Henry Korszun

Date:

25 April 2014, 18:10:05

I understand what you're saying, but I don't know what's causing the "touch" in the first place. I guess I need to further examine/debug. Thanks for your help.

Re:

From

Jerry Sievers

Date:

25 April 2014, 18:24:07

Henry Korszun <henryk302@yahoo.com> writes:

> I understand what you're saying, but I don't know what's causing the "touch" in the first place.  I guess I need to further examine/debug. Thanks for your help.

Nor do we.

Possibly your system is running some HA software and it's onlining your
standby due to a false positive.

--
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consulting@comcast.net
p: 312.241.7800

Re:

From

Jim Mercer

Date:

27 April 2014, 19:49:50

On Fri, Apr 25, 2014 at 01:23:30PM -0500, Jerry Sievers wrote:
> Henry Korszun <henryk302@yahoo.com> writes:
> > I understand what you're saying, but I don't know what's causing the "touch" in the first place.  I guess I need to further examine/debug. Thanks for your help.

it may be a semantic difference, but is recovery_mode dependent on the existence
of the trigger file, or on its timestamp?

"touch" in the above context could either be the creation of the file,
or simply updating the timestamp of the file.

i suspect that the recovery is triggered by the mere existence of the file,
while henry might be talking in the 'update timestamp' context.

--jim

--
Jim Mercer     Reptilian Research      jim@reptiles.org    +1 416 410-5633
"He who dies with the most toys is nonetheless dead"

Re:

From

David G Johnston

Date:

27 April 2014, 20:26:25

Jim Mercer wrote
> On Fri, Apr 25, 2014 at 01:23:30PM -0500, Jerry Sievers wrote:
>> Henry Korszun <henryk302@> writes:
>> > I understand what you're saying, but I don't know what's causing the
>> "touch" in the first place.  I guess I need to further examine/debug.
>> Thanks for your help.
>
> it may be a semantic difference, but is recovery_mode dependent on the
> existence of the trigger file, or the timestamp.
>
> "touch" in the above context could either be the creation of the file,
> or simply updating the timestamp of the file.
>
> i suspect that the recovery is triggered by the mere existence of the
> file, while henry might be talking in the 'update timestamp' context.

I seriously doubt the timestamp matters. It would almost always be in the
past (specifying a future recovery date doesn't make sense anyway), and no
arbitrary age is a reasonable cutoff for making the file invalid; such a
cutoff would be extremely confusing.

"Touch" is a shorthand for "a file whose mere existence is all that is
necessary" and by convention implies that what is in the file doesn't matter
(since actually touching a non-existent file creates a new empty file).

If the file already existed (not that timeframes are being defined all that
well here) the system would never have been in recovery mode...

Henry's last comment is that it is not known what process is creating the
empty trigger file in the first place - whether that process uses touch or
some other means to create the file is irrelevant to the issue at hand.
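
The distinction can be modeled in a few lines. This is a sketch of an existence-only check under the assumption described above, not PostgreSQL's actual code; the file name is hypothetical and the file lives in a throwaway temporary directory.

```python
import os
import tempfile


def promotion_requested(trigger_path: str) -> bool:
    """Existence-only check: file contents and timestamps are ignored."""
    return os.path.exists(trigger_path)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pg_failover")  # hypothetical trigger file name
    before = promotion_requested(path)       # no file yet
    open(path, "w").close()                  # what touch does for a missing file
    os.utime(path, (0, 0))                   # backdate the timestamps to 1970
    after = promotion_requested(path)        # the old timestamp changes nothing

print(before, after)
```

Under this model, creating the empty file is what matters; updating the timestamp of a file that already exists adds nothing.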

David J.

--
View this message in context: http://postgresql.1045698.n5.nabble.com/no-subject-tp5801536p5801671.html
Sent from the PostgreSQL - admin mailing list archive at Nabble.com.