Thread: "Bogus data in lock file" shouldn't be FATAL?

"Bogus data in lock file" shouldn't be FATAL?

From
Tom Lane
Date:
This complaint:
http://archives.postgresql.org/pgsql-admin/2010-08/msg00111.php

seems to suggest that this code in CreateLockFile() is not well-thought-out:
    if (other_pid <= 0)        elog(FATAL, "bogus data in lock file \"%s\": \"%s\"",             filename, buffer);

as it means that a corrupted (empty, in this case) postmaster.pid file
prevents the server from starting until somebody intervenes manually.

I think that the original concern was that if we couldn't read valid
data out of postmaster.pid then we couldn't be sure if there were a
conflicting postmaster running.  But if that's the plan then
CreateLockFile is violating it further down, where it happily skips the 
PGSharedMemoryIsInUse check if it fails to pull shmem ID numbers from
the file.

We could perhaps address that risk another way: after having written
postmaster.pid, try to read it back to verify that it contains what we
wrote, and abort if not.  Then, if we can't read it during startup,
it's okay to assume there is no conflicting postmaster.

Or, given the infrequency of complaints, maybe it's better not to touch
this.  Thoughts?
        regards, tom lane


Re: "Bogus data in lock file" shouldn't be FATAL?

From
Robert Haas
Date:
On Mon, Aug 16, 2010 at 11:18 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> This complaint:
> http://archives.postgresql.org/pgsql-admin/2010-08/msg00111.php
>
> seems to suggest that this code in CreateLockFile() is not well-thought-out:
>
>                if (other_pid <= 0)
>                        elog(FATAL, "bogus data in lock file \"%s\": \"%s\"",
>                                 filename, buffer);
>
> as it means that a corrupted (empty, in this case) postmaster.pid file
> prevents the server from starting until somebody intervenes manually.
>
> I think that the original concern was that if we couldn't read valid
> data out of postmaster.pid then we couldn't be sure if there were a
> conflicting postmaster running.  But if that's the plan then
> CreateLockFile is violating it further down, where it happily skips the
> PGSharedMemoryIsInUse check if it fails to pull shmem ID numbers from
> the file.
>
> We could perhaps address that risk another way: after having written
> postmaster.pid, try to read it back to verify that it contains what we
> wrote, and abort if not.  Then, if we can't read it during startup,
> it's okay to assume there is no conflicting postmaster.

What if it was readable when written but has since become unreadable?

My basic feeling on this is that manual intervention to start the
server is really undesirable and we should try hard to avoid needing
it.  That having been said, accidentally starting two postmasters at
the same time that are accessing the same data files would be several
orders of magnitude worse.  We can't afford to compromise on any
interlock mechanisms that are necessary to prevent that from
happening.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: "Bogus data in lock file" shouldn't be FATAL?

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Aug 16, 2010 at 11:18 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> We could perhaps address that risk another way: after having written
>> postmaster.pid, try to read it back to verify that it contains what we
>> wrote, and abort if not. �Then, if we can't read it during startup,
>> it's okay to assume there is no conflicting postmaster.

> What if it was readable when written but has since become unreadable?

Yup, that's the weak spot in any such assumption.  One might also draw
an analogy to the case of failing to open postmaster.pid because of
permissions change, which seems at least as likely as a data change.
And we consider that as fatal, for good reason I think.

> My basic feeling on this is that manual intervention to start the
> server is really undesirable and we should try hard to avoid needing
> it.  That having been said, accidentally starting two postmasters at
> the same time that are accessing the same data files would be several
> orders of magnitude worse.  We can't afford to compromise on any
> interlock mechanisms that are necessary to prevent that from
> happening.

Yeah.  At the same time, it's really really bad to encourage people to
remove postmaster.pid manually as the first attempt to fix anything.
That completely destroys whatever interlock you thought you had.  So
it's not too hard to make a case that avoiding this scenario will really
make things safer not less so.

The bottom line here is that it's not clear to me whether changing this
would be a net reliability improvement or not.  Maybe better to leave
it alone.
        regards, tom lane


Re: "Bogus data in lock file" shouldn't be FATAL?

From
Alvaro Herrera
Date:
Excerpts from Tom Lane's message of lun ago 16 11:18:32 -0400 2010:

> We could perhaps address that risk another way: after having written
> postmaster.pid, try to read it back to verify that it contains what we
> wrote, and abort if not.  Then, if we can't read it during startup,
> it's okay to assume there is no conflicting postmaster.

Makes sense.

BTW some past evidence:
http://archives.postgresql.org/message-id/e3e180dc0905070719q58136caai23fbb777fd3c0df7@mail.gmail.com

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: "Bogus data in lock file" shouldn't be FATAL?

From
Alvaro Herrera
Date:
Excerpts from Tom Lane's message of lun ago 16 11:49:51 -0400 2010:

> The bottom line here is that it's not clear to me whether changing this
> would be a net reliability improvement or not.  Maybe better to leave
> it alone.

In that case, maybe consider fsync'ing it.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: "Bogus data in lock file" shouldn't be FATAL?

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Excerpts from Tom Lane's message of lun ago 16 11:49:51 -0400 2010:
>> The bottom line here is that it's not clear to me whether changing this
>> would be a net reliability improvement or not.  Maybe better to leave
>> it alone.

> In that case, maybe consider fsync'ing it.

Hrm ... I had supposed we did fsync lockfiles, but a look at the code
shows not.  This seems like a clear oversight.  I think we should not
only add that, but back-patch it.  It seems entirely plausible that the
lack of an fsync is exactly what led to the original complaint.
        regards, tom lane


Re: "Bogus data in lock file" shouldn't be FATAL?

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> In that case, maybe consider fsync'ing it.

BTW, I see you already proposed that in the thread at
http://archives.postgresql.org/message-id/e3e180dc0905070719q58136caai23fbb777fd3c0df7@mail.gmail.com
I'm not sure how come the idea fell through the cracks, but we should
surely have done it then.  I think I was reading the initial complaint
as being just logfile corruption, and failed to make the connection to
the postmaster.pid file.
        regards, tom lane