Re: unable to fail over to warm standby server - Mailing list pgsql-bugs

From Heikki Linnakangas
Subject Re: unable to fail over to warm standby server
Date
Msg-id 4B615DA2.3040306@enterprisedb.com
In response to unable to fail over to warm standby server  (Mason Hale <mason@onespot.com>)
Responses Re: unable to fail over to warm standby server  (Mason Hale <mason@onespot.com>)
List pgsql-bugs
Mason Hale wrote:
>  ERROR: could not remove "/tmp/pgsql.trigger.5432": Operation not
> permittedtrigger file found
>
>  ERROR: could not remove "/tmp/pgsql.trigger.5432": Operation not permitted
>
> This file was not looked at until after the attempt to recover was
> aborted. Clearly the permissions on /tmp/pgsql.trigger.5432 were a
> problem,
> but we don't see how that would explain the error messages, which seem
> to indicate that data on the standby server was corrupted.

Yes, that permission problem seems to be the root cause of the troubles.
If pg_standby fails to remove the trigger file, it exit()s with whatever
return code the unlink() call returned:

>         /*
>          * If trigger file found, we *must* delete it. Here's why: When
>          * recovery completes, we will be asked again for the same file from
>          * the archive using pg_standby so must remove trigger file so we can
>          * reload file again and come up correctly.
>          */
>         rc = unlink(triggerPath);
>         if (rc != 0)
>         {
>             fprintf(stderr, "\n ERROR: could not remove \"%s\": %s", triggerPath, strerror(errno));
>             fflush(stderr);
>             exit(rc);
>         }

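unlink() returns -1 on error, so pg_standby calls exit(-1). -1 is out of
the range of normal return codes: only the low 8 bits of an exit status
survive, so the process actually exits with 255. The server apparently
reports the raw wait status of the command, where the exit code sits in
the next-higher byte, and 255 << 8 = 65280, which is the mysterious code
you saw in the logs. The server treats that as a fatal error, and dies.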
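
To make the arithmetic concrete, here is a minimal sketch that forks a
child, has it exit with a given code, and returns the raw wait status the
parent sees. The function name raw_wait_status_for_exit is made up for
illustration and is not part of pg_standby or the server. The exact bit
layout of the raw status is platform-specific, so the 65280 value is what I
would expect on Linux and similar systems, not something POSIX promises.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child that exits with `code` and return the
 * raw wait status as filled in by waitpid(), or -1 if fork/waitpid fails.
 */
int raw_wait_status_for_exit(int code)
{
    int   status = 0;
    pid_t pid = fork();

    if (pid < 0)
        return -1;              /* fork failed */

    if (pid == 0)
        _exit(code);            /* child: only the low 8 bits are kept */

    if (waitpid(pid, &status, 0) < 0)
        return -1;              /* waitpid failed */

    /* Portable code should decode this with WIFEXITED/WEXITSTATUS. */
    return status;
}
```

Calling raw_wait_status_for_exit(-1) and passing the result through
WEXITSTATUS() should give 255; the undecoded value is what ends up looking
like 65280 on Linux.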

That seems like a bug in pg_standby, but I'm not sure what it should do
if the unlink() fails. It could exit with some other exit code, so that
the server wouldn't die, but the lingering trigger file could cause
problems, as the comment explains. If it should indeed cause FATAL, it
should do so in a more robust way than the exit(rc) call above.
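
As a rough sketch of what "more robust" could mean, the cleanup could
return an explicitly chosen exit code instead of passing unlink()'s -1
through to exit(). Everything named here is an assumption for
illustration: remove_trigger_file and EXIT_TRIGGER_CLEANUP_FAILED are
hypothetical, and the value 200 is arbitrary. Which code the server should
actually receive, and whether it should be fatal, is exactly the open
question above.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical exit code, chosen only to stay within the 0..255 range. */
#define EXIT_TRIGGER_CLEANUP_FAILED 200

/*
 * Hypothetical helper: try to remove the trigger file. Returns 0 on
 * success, or a deliberate, in-range exit code on failure, so the caller
 * can do exit(remove_trigger_file(path)) without relying on unlink()'s -1.
 */
int remove_trigger_file(const char *trigger_path)
{
    if (unlink(trigger_path) != 0)
    {
        fprintf(stderr, "ERROR: could not remove \"%s\": %s\n",
                trigger_path, strerror(errno));
        fflush(stderr);
        return EXIT_TRIGGER_CLEANUP_FAILED;
    }
    return 0;
}
```

The point is only that the value handed to exit() is picked on purpose,
rather than being whatever the failing system call happened to return.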

BTW, this changed in PostgreSQL 8.4; pg_standby no longer tries to
delete the trigger file (so that problematic block of code is gone), but
there's a new recovery_end_command option in recovery.conf instead, where
you're supposed to put 'rm <triggerfile>'. I think in that
configuration, the standby would've started up, even though removal of
the trigger file would've still failed.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
