Re: unable to fail over to warm standby server - Mailing list pgsql-bugs

From Mason Hale
Subject Re: unable to fail over to warm standby server
Date
Msg-id 1e85dd391001280703l4c13e231m77e50e2630f34975@mail.gmail.com
In response to Re: unable to fail over to warm standby server  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
Responses Re: unable to fail over to warm standby server  (Fujii Masao <masao.fujii@gmail.com>)
List pgsql-bugs
Hello Heikki --

Thank you for investigating this issue and clearing up this mystery.
I do not believe it is obvious that the postgres process needs to be able to
remove the trigger file.

My naive assumption was that the trigger file was merely a flag to signal
that recovery mode needed to be stopped. If I were to guess the steps
involved in stopping recovery, I would assume the following:

   - detect the presence of the trigger file
   - stop the postgres process safely (e.g. pg_ctl ... stop)
   - rename recovery.conf to recovery.done
   - restart the postgres process (e.g. pg_ctl ... start)

It is not obvious that the trigger file needs to be removed.
And if permissions prevent it from being removed, the last thing that should
happen is for the database to become corrupted.

At minimum the pg_standby documentation should make this requirement clear.
I suggest language to the effect of the following:

> Note: it is critical that the trigger file be created with permissions that
> allow the postgres process to remove the file. Generally this is best done by
> creating the file from the postgres user account. Data corruption may result
> if the trigger file permissions prevent deletion of the trigger file.


Of course the best solution is to avoid this issue entirely. Something as
easy to miss as file permissions should not cause data corruption,
especially in the process meant to fail over from a crashing primary
database.
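
As one possible direction, the sketch below reports a failed removal with a
small, explicit status value instead of passing unlink()'s raw return value
to exit(). This is only an illustration under my own assumptions, not the fix
adopted in PostgreSQL: the function name and the two status constants are
hypothetical, and how the server should react to the non-zero status is a
separate question that this sketch does not answer.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical status values; pg_standby does not define these names. */
#define TRIGGER_REMOVED        0
#define TRIGGER_REMOVE_FAILED  2

/*
 * Try to remove the trigger file. On failure, log the reason and return
 * a well-defined status in the 0..255 range rather than unlink()'s -1.
 */
int
remove_trigger_file(const char *triggerPath)
{
    if (unlink(triggerPath) != 0)
    {
        fprintf(stderr, "ERROR: could not remove \"%s\": %s\n",
                triggerPath, strerror(errno));
        fflush(stderr);
        return TRIGGER_REMOVE_FAILED;
    }
    return TRIGGER_REMOVED;
}
```

A caller could then decide explicitly what to do with the result, for example
exit(remove_trigger_file(triggerPath)), so that the status seen by the parent
process is one the program chose deliberately.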

thanks,

Mason Hale
http://www.onespot.com
direct +1 800.618.0768 ext 701



On Thu, Jan 28, 2010 at 3:49 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:

> Mason Hale wrote:
> >  ERROR: could not remove "/tmp/pgsql.trigger.5432": Operation not
> > permittedtrigger file found
> >
> >  ERROR: could not remove "/tmp/pgsql.trigger.5432": Operation not permitted
> >
> > This file was not looked at until after the attempt to recover was
> > aborted. Clearly the permissions on /tmp/pgsql.trigger.5432 were a
> > problem,
> > but we don't see how that would explain the error messages, which seem
> > to indicate that data on the standby server was corrupted.
>
> Yes, that permission problem seems to be the root cause of the troubles.
> If pg_standby fails to remove the trigger file, it exit()s with whatever
> return code the unlink() call returned:
>
> >               /*
> >                * If trigger file found, we *must* delete it. Here's why: When
> >                * recovery completes, we will be asked again for the same file from
> >                * the archive using pg_standby so must remove trigger file so we can
> >                * reload file again and come up correctly.
> >                */
> >               rc = unlink(triggerPath);
> >               if (rc != 0)
> >               {
> >                       fprintf(stderr, "\n ERROR: could not remove \"%s\": %s", triggerPath, strerror(errno));
> >                       fflush(stderr);
> >                       exit(rc);
> >               }
>
> unlink() returns -1 on error, so pg_standby calls exit(-1). -1 is out of
> the range of normal return codes, and apparently gets mangled into the
> mysterious 65280 code you saw in the logs. The server treats that as a
> fatal error, and dies.
>
> That seems like a bug in pg_standby, but I'm not sure what it should do
> if the unlink() fails. It could exit with some other exit code, so that
> the server wouldn't die, but the lingering trigger file could cause
> problems, as the comment explains. If it should indeed cause FATAL, it
> should do so in a more robust way than the exit(rc) call above.
>
> BTW, this changed in PostgreSQL 8.4; pg_standby no longer tries to
> delete the trigger file (so that problematic block of code is gone), but
> there's a new recovery_end_command option in recovery.conf instead, where
> you're supposed to put 'rm <triggerfile>'. I think in that
> configuration, the standby would've started up, even though removal of
> the trigger file would've still failed.
>
> --
>  Heikki Linnakangas
>  EnterpriseDB   http://www.enterprisedb.com
>
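
The 65280 value mentioned in the quoted explanation can be reproduced with a
small experiment. The sketch below is not taken from pg_standby or the
server; raw_status_of_exit is a hypothetical helper that forks a child,
has it exit with a given code, and returns the raw wait status the parent
receives. Only the low 8 bits of the exit code survive, so -1 becomes 255,
and on Linux and most Unix systems the raw status stores that value in the
high byte: 255 * 256 = 65280.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper for illustration only.
 *
 * Fork a child that exits with "code" and return the raw wait status the
 * parent sees, or -1 if fork() or waitpid() fails.
 */
int
raw_status_of_exit(int code)
{
    int     status;
    pid_t   pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);    /* child: _exit avoids flushing the parent's stdio buffers */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return status;
}
```

Calling raw_status_of_exit(-1) and passing the result through WEXITSTATUS()
gives 255, which is the exit code the parent actually observes; the
unprocessed status is what shows up as 65280 in the logs.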
