Re: How to shoot yourself in the foot: kill -9 postmaster - Mailing list pgsql-hackers

From Alfred Perlstein
Subject Re: How to shoot yourself in the foot: kill -9 postmaster
Msg-id 20010306102246.N8663@fw.wintelcom.net
In response to Re: How to shoot yourself in the foot: kill -9 postmaster  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: How to shoot yourself in the foot: kill -9 postmaster  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
* Tom Lane <tgl@sss.pgh.pa.us> [010306 10:10] wrote:
> Alfred Perlstein <bright@wintelcom.net> writes:
> > I'm sure some sort of encoding of the PGDATA directory along with
> > the pids stored in the shm segment...
> 
> I thought about this too, but it strikes me as not very trustworthy.
> The problem is that there's no guarantee that the new postmaster will
> even notice the old shmem segment: it might select a different shmem
> key.  (The 7.1 coding of shmem key selection makes this more likely
> than it used to be, but even under 7.0, it will certainly fail to work
> if I choose to start the new postmaster using a different port number
> than the old one had.  The shmem key is driven primarily by port number
> not data directory ...)

This seems like a mistake.  

I'm surprised you guys aren't just using some form of the FreeBSD
ftok() algorithm for this:

FTOK(3)                FreeBSD Library Functions Manual                FTOK(3)

...
    The ftok() function attempts to create a unique key suitable for use with
    the msgget(3), semget(2) and shmget(2) functions given the path of an
    existing file and a user-selectable id.

    The specified path must specify an existing file that is accessible to
    the calling process or the call will fail.  Also, note that links to
    files will return the same key, given the same id.

BUGS
    The returned key is computed based on the device minor number and inode
    of the specified path in combination with the lower 8 bits of the given
    id.  Thus it is quite possible for the routine to return duplicate keys.


The "BUGS" section seems to describe exactly what you guys are looking
for: a somewhat reliable method of obtaining a system id.  If that sounds
evil, read below for an alternate suggestion.
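
For illustration only, here is a minimal sketch of deriving an IPC key from the data directory with ftok().  The function name datadir_ipc_key and the choice of project id are assumptions made for this example, not anything taken from the PostgreSQL sources.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>

/*
 * Hypothetical helper: derive a System V IPC key from the path of the
 * data directory.  Returns (key_t) -1 if the path does not exist or is
 * not accessible to the calling process.
 */
key_t datadir_ipc_key(const char *datadir, int proj_id)
{
    key_t key = ftok(datadir, proj_id);

    if (key == (key_t) -1)
        perror("ftok");
    return key;
}
```

Calling this twice with the same existing directory and the same id gives the same key, which is the property of interest here.  As the BUGS text says, two different paths can still collide.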

> The interlock has to be tightly tied to the PGDATA directory, because
> what we're trying to protect is the files in and under that directory.
> It seems that something based on file(s) in that directory is the way
> to go.
> 
> The best idea I've seen so far is Hiroshi's idea of having all the
> backends hold fcntl locks on the same file (probably postmaster.pid
> would do fine).  Then the new postmaster can test whether any backends
> are still alive by trying to lock the old postmaster.pid file.
> Unfortunately, I read in the fcntl man page:
> 
>     Locks are not inherited by a child process in a fork(2) system call.
> 
> This makes the idea much less attractive than I originally thought:
> a new backend would not automatically inherit a lock on the
> postmaster.pid file from the postmaster, but would have to open/lock it
> for itself.  That means there's a window where the new backend exists
> but would be invisible to a hypothetical new postmaster.
> 
> We could work around this with the following, very ugly protocol:
> 
> 1. Postmaster normally maintains fcntl read lock on its postmaster.pid
> file.  Each spawned backend immediately opens and read-locks
> postmaster.pid, too, and holds that file open until it dies.  (Thus
> wasting a kernel FD per backend, which is one of the less attractive
> things about this.)  If the backend is unable to obtain read lock on
> postmaster.pid, then it complains and dies.  We must use read locks
> here so that all these processes can hold them separately.
> 
> 2. If a newly started postmaster sees a pre-existing postmaster.pid
> file, it tries to obtain a *write* lock on that file.  If it fails,
> conclude that an old postmaster or backend is still alive; complain
> and quit.  If it succeeds, sit for say 1 second before deleting the file
> and creating a new one.  (The delay here is to allow any just-started
> old backends to fail to acquire read lock and quit.  A possible
> objection is that we have no way to guarantee 1 second is enough, though
> it ought to be plenty if the lock acquisition is just after the fork.)
> 
> One thing that worries me a little bit is that this means an fcntl
> read-lock request will exist inside the kernel for each active backend.
> Does anyone know of any performance problems or hard kernel limits we
> might run into with large numbers of backends (lots and lots of fcntl
> locks)?  At least the locks are on a file that we don't actually touch
> in the normal course of business.
> 
> A small savings is that the backends don't actually need to open new FDs
> for the postmaster.pid file; they can use the one they inherit from the
> postmaster, even though they do need to lock it again.  I'm not sure how
> much that saves inside the kernel, but at least something.
> 
> There are also the usual set of concerns about portability of flock,
> though this time we're locking a plain file and not a socket, so it
> shouldn't be as much trouble as it was before.
> 
> Comments?  Does anyone see a better way to do it?
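
The two-step protocol quoted above could be sketched roughly as follows.  This is only an illustration under stated assumptions: the function names are hypothetical, the file descriptor is assumed to be open read-write on postmaster.pid, and the 1-second delay from step 2 is left to the caller.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Try to take a non-blocking fcntl lock of the given type on the whole file. */
static int try_lock_whole_file(int fd, short type)
{
    struct flock fl = {0};

    fl.l_type = type;          /* F_RDLCK or F_WRLCK */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;              /* 0 means "through end of file" */
    return fcntl(fd, F_SETLK, &fl);
}

/*
 * Step 1 (hypothetical name): each backend takes its own read lock
 * right after the fork, since fcntl locks are not inherited.
 * Returns 0 on success, -1 if the lock could not be obtained.
 */
int backend_hold_read_lock(int fd)
{
    return try_lock_whole_file(fd, F_RDLCK);
}

/*
 * Step 2 (hypothetical name): a new postmaster tries for a write lock.
 * Returns 1 if the lock was obtained, 0 if some other process still
 * holds a lock on the file, and -1 on any other error.
 */
int new_postmaster_may_proceed(int fd)
{
    if (try_lock_whole_file(fd, F_WRLCK) == 0)
        return 1;
    if (errno == EACCES || errno == EAGAIN)
        return 0;
    return -1;
}
```

Note that fcntl locks never conflict within a single process, so the interesting case (step 2 returning 0) only shows up when the read lock is held by a different process.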

Possibly...

What about encoding the shm id in the pidfile?  Then one can just ask
how many processes are attached to that segment.  (If it doesn't
exist, one can assume all backends have exited.)

You want the field 'shm_nattch':
    The shmid_ds struct is defined as follows:

    struct shmid_ds {
        struct ipc_perm shm_perm;     /* operation permission structure */
        int             shm_segsz;    /* size of segment in bytes */
        pid_t           shm_lpid;     /* process ID of last shared memory op */
        pid_t           shm_cpid;     /* process ID of creator */
        short           shm_nattch;   /* number of current attaches */
        time_t          shm_atime;    /* time of last shmat() */
        time_t          shm_dtime;    /* time of last shmdt() */
        time_t          shm_ctime;    /* time of last change by shmctl() */
        void           *shm_internal; /* sysv stupidity */
    };
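
A minimal sketch of the check, assuming the shm id has already been read back from the pidfile (that part is not shown, and the name shm_attach_count is made up for this example):

```c
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/*
 * Hypothetical helper: return the number of processes currently
 * attached to the segment, or -1 if the segment does not exist or
 * cannot be examined.
 */
long shm_attach_count(int shmid)
{
    struct shmid_ds ds;

    if (shmctl(shmid, IPC_STAT, &ds) == -1)
        return -1;
    return (long) ds.shm_nattch;
}
```

A new postmaster could treat a result of -1 or 0 as "no old backends left" and anything greater as "somebody is still attached".  This assumes it has read permission on the segment, otherwise IPC_STAT fails for reasons unrelated to whether the segment exists.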
 


--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]

