On March 7, 2018 5:51:29 PM PST, Craig Ringer <craig@2ndquadrant.com> wrote:
> My favourite remains an organisation that kept "fixing" an issue by
> kill -9'ing the postmaster and removing postmaster.pid to make it start up
> again. Without killing all the leftover backends. Of course, the system
> kept getting more unstable and broken, so they did it more and more often.
> They were working on scripting it when they gave up and asked for help.
Maybe I'm missing something, but that ought to not work. The shmem segment that we keep around would be a conflict, no?
As I understand it, because we allow multiple Pg instances on a system, we identify the small sysv shmem segment we use by the postmaster's pid. If you remove the DirLockFile (postmaster.pid) you remove the interlock against starting a new postmaster. It'll think it's a new independent instance on the same host, make a new shmem segment and go merrily on its way mangling data horribly.
See CreateLockFile(). Also commit 7e2a18a9161. In particular src/backend/utils/init/miscinit.c +938:
    if (isDDLock)
    {
        ....
        if (PGSharedMemoryIsInUse(id1, id2))
            ereport(FATAL,
                    (errcode(ERRCODE_LOCK_FILE_EXISTS),
                     errmsg("pre-existing shared memory block "
                            "(key %lu, ID %lu) is still in use",
                            id1, id2),
                     errhint("If you're sure there are no old "
                             "server processes still running, remove "
                             "the shared memory block "
                             "or just delete the file \"%s\".",
                             filename)));
        ....
    }
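To make it concrete why leftover backends trip that check, here is a minimal sketch of the idea. It is not the real PGSharedMemoryIsInUse() code, which does more than this; segment_has_attachments() and demo_leftover_attachment() are names made up for illustration. The assumption is simply that the kernel's attach count for a SysV segment stays non-zero for as long as any process is still attached.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/*
 * Hypothetical helper: returns 1 if at least one process is still attached
 * to the SysV segment, 0 if none are, -1 if the segment cannot be inspected.
 */
int
segment_has_attachments(int shmid)
{
    struct shmid_ds st;

    if (shmctl(shmid, IPC_STAT, &st) < 0)
        return -1;
    return st.shm_nattch > 0 ? 1 : 0;
}

/*
 * Hypothetical walk-through: our own attachment stands in for a leftover
 * backend. Returns 1 if the segment reads as "in use" while attached and
 * as "not in use" after detaching.
 */
int
demo_leftover_attachment(void)
{
    int   shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    void *addr;
    int   while_attached;
    int   after_detach;

    assert(shmid >= 0);
    addr = shmat(shmid, NULL, 0);
    assert(addr != (void *) -1);

    while_attached = segment_has_attachments(shmid);   /* "backend" alive */
    shmdt(addr);
    after_detach = segment_has_attachments(shmid);     /* "backend" gone */

    shmctl(shmid, IPC_RMID, NULL);                      /* clean up */
    return while_attached == 1 && after_detach == 0;
}
```

The segment ID to check is read from postmaster.pid, so once that file is deleted there is nothing left to point a new postmaster at the old segment, no matter how many backends are still attached to it.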
I still think that error is a bit optimistic, and should really say "make very sure there are no 'postgres' processes associated with this data directory, then ...".
It'd be nice if the OS offered us some support here. Something like opening a lockfile in exclusive lock mode, with every child inheriting the FD and the lock along with it, so the exclusive lock wouldn't get released until all FDs associated with it are closed. But AFAIK nothing like that is present, let alone portable.
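For what it's worth, flock(2) on Linux and the BSDs appears to come close to this: the lock belongs to the open file description, so fds inherited across fork() share it and it is only dropped once every copy is closed. A rough sketch of the behaviour, under the assumption of a local filesystem; flock() is not in POSIX, so the portability concern stands. try_lock_exclusive() and lock_outlives_parent() are hypothetical names, and the "postmaster" and "backend" roles are just a parent and a forked child.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/file.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Open path and try to take an exclusive flock() without blocking.
 * Returns the locked fd, or -1 if the file is locked elsewhere or on error.
 */
int
try_lock_exclusive(const char *path)
{
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    if (flock(fd, LOCK_EX | LOCK_NB) != 0)
    {
        close(fd);
        return -1;
    }
    return fd;
}

/*
 * The "postmaster" takes the lock, forks a "backend", then closes its own
 * fd. Returns 1 if the lock stayed held while the child lived and became
 * free once the child had exited, 0 otherwise.
 */
int
lock_outlives_parent(const char *path)
{
    int   go[2];
    int   fd;
    int   probe;
    int   held_while_child_alive;
    pid_t child;

    fd = try_lock_exclusive(path);
    if (fd < 0 || pipe(go) != 0)
        return 0;

    child = fork();
    if (child < 0)
        return 0;
    if (child == 0)
    {
        char c;

        /* Backend: keep the inherited fd open until the pipe hits EOF. */
        close(go[1]);
        if (read(go[0], &c, 1) < 0)
            _exit(1);
        _exit(0);
    }
    close(go[0]);

    close(fd);                          /* the postmaster's copy goes away */

    probe = try_lock_exclusive(path);   /* a would-be new postmaster */
    held_while_child_alive = (probe == -1);
    if (probe >= 0)
        close(probe);

    close(go[1]);                       /* let the backend exit */
    waitpid(child, NULL, 0);

    probe = try_lock_exclusive(path);   /* all inherited fds are closed now */
    if (probe < 0)
        return 0;
    close(probe);
    return held_while_child_alive;
}
```

Calling lock_outlives_parent() on a scratch file should return 1 on systems with these flock() semantics: the second opener is refused for as long as any process holding an inherited fd is alive.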