Thread: "Bogus data in lock file" shouldn't be FATAL?
This complaint: http://archives.postgresql.org/pgsql-admin/2010-08/msg00111.php seems to suggest that this code in CreateLockFile() is not well-thought-out: if (other_pid <= 0) elog(FATAL, "bogus data in lock file \"%s\": \"%s\"", filename, buffer); as it means that a corrupted (empty, in this case) postmaster.pid file prevents the server from starting until somebody intervenes manually. I think that the original concern was that if we couldn't read valid data out of postmaster.pid then we couldn't be sure if there were a conflicting postmaster running. But if that's the plan then CreateLockFile is violating it further down, where it happily skips the PGSharedMemoryIsInUse check if it fails to pull shmem ID numbers from the file. We could perhaps address that risk another way: after having written postmaster.pid, try to read it back to verify that it contains what we wrote, and abort if not. Then, if we can't read it during startup, it's okay to assume there is no conflicting postmaster. Or, given the infrequency of complaints, maybe it's better not to touch this. Thoughts? regards, tom lane
On Mon, Aug 16, 2010 at 11:18 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > This complaint: > http://archives.postgresql.org/pgsql-admin/2010-08/msg00111.php > > seems to suggest that this code in CreateLockFile() is not well-thought-out: > > if (other_pid <= 0) > elog(FATAL, "bogus data in lock file \"%s\": \"%s\"", > filename, buffer); > > as it means that a corrupted (empty, in this case) postmaster.pid file > prevents the server from starting until somebody intervenes manually. > > I think that the original concern was that if we couldn't read valid > data out of postmaster.pid then we couldn't be sure if there were a > conflicting postmaster running. But if that's the plan then > CreateLockFile is violating it further down, where it happily skips the > PGSharedMemoryIsInUse check if it fails to pull shmem ID numbers from > the file. > > We could perhaps address that risk another way: after having written > postmaster.pid, try to read it back to verify that it contains what we > wrote, and abort if not. Then, if we can't read it during startup, > it's okay to assume there is no conflicting postmaster. What if it was readable when written but has since become unreadable? My basic feeling on this is that manual intervention to start the server is really undesirable and we should try hard to avoid needing it. That having been said, accidentally starting two postmasters at the same time that are accessing the same data files would be several orders of magnitude worse. We can't afford to compromise on any interlock mechanisms that are necessary to prevent that from happening. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Aug 16, 2010 at 11:18 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> We could perhaps address that risk another way: after having written >> postmaster.pid, try to read it back to verify that it contains what we >> wrote, and abort if not. �Then, if we can't read it during startup, >> it's okay to assume there is no conflicting postmaster. > What if it was readable when written but has since become unreadable? Yup, that's the weak spot in any such assumption. One might also draw an analogy to the case of failing to open postmaster.pid because of permissions change, which seems at least as likely as a data change. And we consider that as fatal, for good reason I think. > My basic feeling on this is that manual intervention to start the > server is really undesirable and we should try hard to avoid needing > it. That having been said, accidentally starting two postmasters at > the same time that are accessing the same data files would be several > orders of magnitude worse. We can't afford to compromise on any > interlock mechanisms that are necessary to prevent that from > happening. Yeah. At the same time, it's really really bad to encourage people to remove postmaster.pid manually as the first attempt to fix anything. That completely destroys whatever interlock you thought you had. So it's not too hard to make a case that avoiding this scenario will really make things safer not less so. The bottom line here is that it's not clear to me whether changing this would be a net reliability improvement or not. Maybe better to leave it alone. regards, tom lane
Excerpts from Tom Lane's message of lun ago 16 11:18:32 -0400 2010: > We could perhaps address that risk another way: after having written > postmaster.pid, try to read it back to verify that it contains what we > wrote, and abort if not. Then, if we can't read it during startup, > it's okay to assume there is no conflicting postmaster. Makes sense. BTW some past evidence: http://archives.postgresql.org/message-id/e3e180dc0905070719q58136caai23fbb777fd3c0df7@mail.gmail.com -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Excerpts from Tom Lane's message of lun ago 16 11:49:51 -0400 2010: > The bottom line here is that it's not clear to me whether changing this > would be a net reliability improvement or not. Maybe better to leave > it alone. In that case, maybe consider fsync'ing it. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes: > Excerpts from Tom Lane's message of lun ago 16 11:49:51 -0400 2010: >> The bottom line here is that it's not clear to me whether changing this >> would be a net reliability improvement or not. Maybe better to leave >> it alone. > In that case, maybe consider fsync'ing it. Hrm ... I had supposed we did fsync lockfiles, but a look at the code shows not. This seems like a clear oversight. I think we should not only add that, but back-patch it. It seems entirely plausible that the lack of an fsync is exactly what led to the original complaint. regards, tom lane
Alvaro Herrera <alvherre@commandprompt.com> writes: > In that case, maybe consider fsync'ing it. BTW, I see you already proposed that in the thread at http://archives.postgresql.org/message-id/e3e180dc0905070719q58136caai23fbb777fd3c0df7@mail.gmail.com I'm not sure how come the idea fell through the cracks, but we should surely have done it then. I think I was reading the initial complaint as being just logfile corruption, and failed to make the connection to the postmaster.pid file. regards, tom lane