Feature: POSIX Shared memory support

From

Chris Marcellino

Date:

06 February 2007, 09:26:32

On Mac OS X and other BSDs, the default System V shared memory
limits are often very low and require adjustment for acceptable
performance. In particular, when Postgres is included as part of
larger end-user-friendly software products, these kernel settings are
often difficult to change, for two reasons:

1. The (arbitrarily) limited resources must be shared by all programs
that use System V shared memory. For example on my Mac OS X computer,
I have Postgres running a standalone database, but also as part of
Apple Remote Desktop. Without manual adjustment, running both
simultaneously causes one of them to fail. Correcting this in any
robust way is challenging to automate for consumer-style (e.g. Mac)
installers.

2. On these BSDs, this System V shared memory is wired down and
cannot be swapped out for any reason. If Postgres is running as part
of another software program or is a lower priority, other programs
cannot use the potentially limited memory. This places the user or
developer in a tricky position of having to minimize overall system
impact, while permitting enough shared memory for Postgres to perform
well.

To this end, I have "ported" the sysv_shmem.c layer to use the POSIX
calls (which are in some ways more robust w.r.t. reducing collisions,
since they use strings as shared memory IDs instead of ints).
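
For readers who have not used the POSIX calls, the basic lifecycle can be sketched as below. This is a minimal illustration, not the patch itself: the helper names posix_segment_create and posix_segment_destroy and the segment name passed by the caller are hypothetical, and error handling is reduced to returning NULL or -1.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create a named POSIX shared memory object of `size` bytes and map it.
 * Returns the mapping, or NULL on failure (errno == EEXIST means the name
 * is already taken).  Hypothetical helper, for illustration only.
 */
static void *
posix_segment_create(const char *name, size_t size)
{
    void       *addr;
    int         save_errno;
    int         fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0600);

    if (fd < 0)
        return NULL;

    /* A new object has length zero; it must be sized before mapping. */
    if (ftruncate(fd, (off_t) size) == 0)
    {
        addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (addr != MAP_FAILED)
        {
            close(fd);          /* the mapping outlives the descriptor */
            return addr;
        }
    }

    /* Sizing or mapping failed: clean up, preserving the original errno. */
    save_errno = errno;
    close(fd);
    shm_unlink(name);
    errno = save_errno;
    return NULL;
}

/* Unmap the segment and remove its name.  Returns 0 on success. */
static int
posix_segment_destroy(const char *name, void *addr, size_t size)
{
    int         rc = munmap(addr, size);

    if (shm_unlink(name) < 0)
        rc = -1;
    return rc;
}
```

Note that nothing in this interface reports how many processes currently have the object mapped.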

In principle, this should not have any significant effect on
performance. Running PGBench on a few different load types gives very
similar results (-3%/+1%), differences that aren't statistically
significant. Of course, on an untuned Mac OS X machine (where the
original SysV version is limited to the default 4MB) the POSIX
version performs significantly better (+250%). Using the POSIX calls
helps minimize the kernel side of the tuning, which is a big plus for
integrated uses of Postgres, but also for other amateur installations
(e.g. Fink).

If this is appropriate for the distribution, it could become a
'contrib' add-on or it could be an autoconf custom build option until
it reaches greater maturity.

Any thoughts? Suggestions? I would also appreciate any advice on more
sophisticated ways to measure the performance impact of a change like
this.

Thanks,
Chris Marcellino
Apple Computer, Inc.






src/backend/port/posix_shmem.c
===================================================================
/*-------------------------------------------------------------------------
  *
  * posix_shmem.c
  *      Implement shared memory using POSIX facilities
  *
  * These routines represent a fairly thin layer on top of POSIX shared
  * memory functionality.
  *
  * Portions Copyright (c) 1996-2006, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
  *-------------------------------------------------------------------------
  */
#include "postgres.h"

#include <signal.h>
#include <unistd.h>
#include <sys/file.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#ifdef HAVE_KERNEL_OS_H
#include <kernel/OS.h>
#endif

#include "miscadmin.h"
#include "storage/ipc.h"
#include "storage/pg_shmem.h"


#define IPCProtection    (0600)    /* access/modify by user only */
#define IPCNameLength        32    /* must be long enough to contain all
                                 * possible format strings;
                                 * see GenerateIPCName */


unsigned long UsedShmemSegID = 0;
void       *UsedShmemSegAddr = NULL;

static void GenerateIPCName(int memKey, char *dest);
static void *InternalIpcMemoryCreate(int memKey, Size size);
static void IpcMemoryDetach(int status, Datum shmaddr);
static void IpcMemoryDelete(int status, Datum memKey);
static PGShmemHeader *PGSharedMemoryAttach(int key);


/*
  *    GenerateIPCName(key, dest)
  *
  * Generate a shared memory object key name using the argument key.
  * This uses the magic number and text to prevent collisions from other
  * apps.
  */
static void
GenerateIPCName(int memKey, char *dest)
{
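    /* This must be 31 characters or less for portability (i.e. Mac OS X) */
    /* Both values are ints, so print them as unsigned int, bounded by the buffer */
    snprintf(dest, IPCNameLength, "PostgreSQL.%x.%x",
             (unsigned int) PGShmemMagic, (unsigned int) memKey);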
}

/*
  *    InternalIpcMemoryCreate(memKey, size)
  *
  * Attempt to create a new shared memory segment with the specified key.
  * Will fail (return NULL) if such a segment already exists.  If successful,
  * attach the segment to the current process and return its attached address.
  * On success, callbacks are registered with on_shmem_exit to detach and
  * delete the segment when on_shmem_exit is called.
  *
  * If we fail with a failure code other than collision-with-existing-segment,
  * print out an error and abort.  Other types of errors are not recoverable.
  */
static void *
InternalIpcMemoryCreate(int memKey, Size size)
{
    int            fd;
    void       *memAddress;
    char        keyName[IPCNameLength];
    struct        stat statbuf;

    GenerateIPCName(memKey, keyName);
    fd = shm_open(keyName, O_RDWR | O_CREAT | O_EXCL, IPCProtection);

    if (fd < 0)
    {
        /*
         * Fail quietly if error indicates a collision with existing segment.
         * One would expect EEXIST, given that we said O_EXCL.
         */
        if (errno == EEXIST || errno == EACCES)
            return NULL;

        /*
         * Else complain and abort
         */
        ereport(FATAL,
                (errmsg("could not create shared memory segment: %m"),
          errdetail("Failed system call was shm_open(name=%s, oflag=%lu, mode=%lu).",
                    keyName, (unsigned long) (O_RDWR | O_CREAT | O_EXCL),
                    (unsigned long) IPCProtection),
                 (errno == EMFILE) ?
                 errhint("This error means that the process has reached its limit "
                         "for open file descriptors.") : 0,
                 (errno == ENOSPC) ?
                 errhint("This error means the process has run out of address "
                         "space.") : 0));
    }

    /* Register on-exit routine to delete the new segment */
    on_shmem_exit(IpcMemoryDelete, Int32GetDatum(memKey));

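    /*
     * A newly created object has zero length, so grow it to the desired
     * size.  Check the result explicitly rather than relying on mmap to
     * fail later; the on-exit callback registered above removes the object.
     */
    if (fstat(fd, &statbuf) < 0 ||
        (statbuf.st_size < (off_t) size && ftruncate(fd, (off_t) size) < 0))
    {
        int            save_errno = errno;

        close(fd);
        errno = save_errno;
        elog(FATAL, "could not resize shared memory segment \"%s\": %m",
             keyName);
    }

    /* OK, should be able to attach to the segment */
    memAddress = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* Report before close() so that %m still reflects the mmap failure */
    if (memAddress == MAP_FAILED)
        elog(FATAL, "mmap(fd=%d) failed: %m", fd);

    /* The mapping remains valid after the descriptor is closed */
    close(fd);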

    /* Register on-exit routine to detach new segment before deleting */
    on_shmem_exit(IpcMemoryDetach, PointerGetDatum(memAddress));

    /* Record key and ID in lockfile for data directory. */
    RecordSharedMemoryInLockFile((unsigned long) memKey, 0);

    return memAddress;
}

/****************************************************************************/
/*    IpcMemoryDetach(status, shmaddr)    removes a shared memory segment        */
/*                                        from process' address space        */
/*    (called as an on_shmem_exit callback, hence funny argument list)        */
/****************************************************************************/
static void
IpcMemoryDetach(int status, Datum shmaddr)
{
    PGShmemHeader  *hdr;
    hdr = (PGShmemHeader *) DatumGetPointer(shmaddr);
    if (munmap(DatumGetPointer(shmaddr), hdr->totalsize) < 0)
        elog(LOG, "munmap(%p, ...) failed: %m", DatumGetPointer(shmaddr));
}

/****************************************************************************/
/*    IpcMemoryDelete(status, memKey)    deletes a shared memory segment        */
/*    (called as an on_shmem_exit callback, hence funny argument list)        */
/****************************************************************************/
static void
IpcMemoryDelete(int status, Datum memKey)
{
    char        keyName[IPCNameLength];

    /* memKey arrives as a Datum; convert it back to the int key */
    GenerateIPCName(DatumGetInt32(memKey), keyName);

    if (shm_unlink(keyName) < 0)
        elog(LOG, "shm_unlink(%s) failed: %m", keyName);
}

/*
  * PGSharedMemoryIsInUse
  *
  * Is a previously-existing shmem segment still existing and in use?
  *
  * The point of this exercise is to detect the case where a prior postmaster
  * crashed, but it left child backends that are still running.  Therefore
  * we only care about shmem segments that are associated with the intended
  * DataDir.  This is an important consideration since accidental matches of
  * shmem segment IDs are reasonably common.
  */
bool
PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2)
{
    char        keyName[IPCNameLength];
    int            fd;

#ifndef WIN32
    PGShmemHeader  *hdr;
    bool        isForeign;
    struct stat statbuf;
#endif

    GenerateIPCName((int) id1, keyName);

    /*
     * We detect whether a shared memory segment is in use by seeing whether
     * it (a) exists and (b) has any processes are attached to it.
     */
    fd = shm_open(keyName, O_RDWR, 0);
    if (fd < 0)
    {
        /*
         * ENOENT means the segment no longer exists.
         */
        if (errno == ENOENT)
            return false;

        /*
         * EACCES implies that the segment belongs to some other userid, which
         * means it is not a Postgres shmem segment that is relevant to our
         * data directory.
         */
        if (errno == EACCES)
            return false;

        /*
         * Otherwise, we had better assume that the segment is in use.
         */
        return true;
    }

    /*
     * Try to attach to the segment and see if it matches our data directory.
     * This avoids fd-conflict problems on machines that are running
     * several postmasters under the same userid.  On Windows, which doesn't
     * have useful inode numbers, we can't do this so we punt and assume there
     * is a conflict.
     */
#ifdef WIN32
    close(fd);
    return true;
#else
    if (stat(DataDir, &statbuf) < 0)
    {
        close(fd);
        return true;            /* if can't stat, be conservative */
    }

    hdr = (PGShmemHeader *) mmap(NULL, sizeof(PGShmemHeader), PROT_READ,
                                 MAP_SHARED, fd, 0);
    close(fd);

    if (hdr == (PGShmemHeader *) MAP_FAILED)
        return true;            /* if can't attach, be conservative */

    /* Read the header fields first, then unmap exactly once */
    isForeign = hdr->magic != PGShmemMagic ||
        hdr->device != statbuf.st_dev ||
        hdr->inode != statbuf.st_ino;
    munmap((void *) hdr, sizeof(PGShmemHeader));

    if (isForeign)
    {
        /*
         * It's either not a Postgres segment, or not one for my data
         * directory.  In either case it poses no threat.
         */
        return false;
    }

    /* Trouble --- looks a lot like there's still live backends */

    return true;
#endif
}


/*
  * PGSharedMemoryCreate
  *
  * Create a shared memory segment of the given size and initialize its
  * standard header.  Also, register an on_shmem_exit callback to release
  * the storage.
  *
  * Dead Postgres segments are recycled if found, but we do not fail upon
  * collision with non-Postgres shmem segments.  The idea here is to detect and
  * re-use keys that may have been assigned by a crashed postmaster or backend.
  *
  * makePrivate means to always create a new segment, rather than attach to
  * or recycle any existing segment.
  *
  * The port number is passed for possible use as a key (for SysV, we use
  * it to generate the starting shmem key).  In a standalone backend,
  * zero will be passed.
  */
PGShmemHeader *
PGSharedMemoryCreate(Size size, bool makePrivate, int port)
{
    int            NextShmemSegID;
    void       *memAddress;
    PGShmemHeader *hdr;
    char        keyName[IPCNameLength];

#ifndef WIN32
    struct stat statbuf;
#endif

    /* Room for a header? */
    Assert(size > MAXALIGN(sizeof(PGShmemHeader)));

    /* Make sure PGSharedMemoryAttach doesn't fail without need */
    UsedShmemSegAddr = NULL;

    /* Loop till we find a free IPC key */
    NextShmemSegID = port * 1000;

    for (NextShmemSegID++;; NextShmemSegID++)
    {
        /* Try to create new segment */
        memAddress = InternalIpcMemoryCreate(NextShmemSegID, size);
        if (memAddress)
            break;                /* successful create and attach */

        /* Check shared memory and possibly remove and recreate */

        if (makePrivate)        /* a standalone backend shouldn't do this */
            continue;

        if ((memAddress = PGSharedMemoryAttach(NextShmemSegID)) == NULL)
            continue;            /* can't attach, not one of mine */

        /*
         * If I am not the creator and it belongs to an extant process,
         * continue.
         */
        hdr = (PGShmemHeader *) memAddress;
        if (hdr->creatorPID != getpid())
        {
            if (kill(hdr->creatorPID, 0) == 0 || errno != ESRCH)
            {
                munmap(memAddress, hdr->totalsize);
                continue;        /* segment belongs to a live process */
            }
        }

        /*
         * The segment appears to be from a dead Postgres process, or from a
         * previous cycle of life in this same process.  Zap it, if possible.
         * This probably shouldn't fail, but if it does, assume the segment
         * belongs to someone else after all, and continue quietly.
         */
        GenerateIPCName(NextShmemSegID, keyName);

        munmap(memAddress, hdr->totalsize);
        if (shm_unlink(keyName) < 0)
            continue;

        /*
         * Now try again to create the segment.
         */
        memAddress = InternalIpcMemoryCreate(NextShmemSegID, size);
        if (memAddress)
            break;                /* successful create and attach */

        /*
         * Can only get here if some other process managed to create the same
         * shmem key before we did.  Let him have that one, loop around to try
         * next key.
         */
    }

    /*
     * OK, we created a new segment.  Mark it as created by this process. The
     * order of assignments here is critical so that another Postgres process
     * can't see the header as valid but belonging to an invalid PID!
     */
    hdr = (PGShmemHeader *) memAddress;
    hdr->creatorPID = getpid();
    hdr->magic = PGShmemMagic;

#ifndef WIN32
    /* Fill in the data directory ID info, too */
    if (stat(DataDir, &statbuf) < 0)
        ereport(FATAL,
                (errcode_for_file_access(),
                 errmsg("could not stat data directory \"%s\": %m",
                        DataDir)));
    hdr->device = statbuf.st_dev;
    hdr->inode = statbuf.st_ino;
#endif

    /*
     * Initialize space allocation status for segment.
     */
    hdr->totalsize = size;
    hdr->freeoffset = MAXALIGN(sizeof(PGShmemHeader));

    /* Save info for possible future use */
    UsedShmemSegAddr = memAddress;
    UsedShmemSegID = (unsigned long) NextShmemSegID;

    return hdr;
}

#ifdef EXEC_BACKEND

/*
  * PGSharedMemoryReAttach
  *
  * Re-attach to an already existing shared memory segment.  In the non
  * EXEC_BACKEND case this is not used, because postmaster children inherit
  * the shared memory segment attachment via fork().
  *
  * UsedShmemSegID and UsedShmemSegAddr are implicit parameters to this
  * routine.  The caller must have already restored them to the postmaster's
  * values.
  */
void
PGSharedMemoryReAttach(void)
{
    void       *hdr;
    void       *origUsedShmemSegAddr = UsedShmemSegAddr;

    Assert(UsedShmemSegAddr != NULL);
    Assert(IsUnderPostmaster);

#ifdef __CYGWIN__
    /* cygipc (currently) appears to not detach on exec. */
    PGSharedMemoryDetach();
    UsedShmemSegAddr = origUsedShmemSegAddr;
#endif

    elog(DEBUG3, "attaching to %p", UsedShmemSegAddr);
    hdr = (void *) PGSharedMemoryAttach((int) UsedShmemSegID);
    if (hdr == NULL)
        elog(FATAL, "could not reattach to shared memory (key=%d, addr=%p): %m",
             (int) UsedShmemSegID, UsedShmemSegAddr);
    if (hdr != origUsedShmemSegAddr)
        elog(FATAL, "reattaching to shared memory returned unexpected address (got %p, expected %p)",
             hdr, origUsedShmemSegAddr);

    UsedShmemSegAddr = hdr;        /* probably redundant */
}
#endif   /* EXEC_BACKEND */

/*
  * PGSharedMemoryDetach
  *
  * Detach from the shared memory segment, if still attached.  This is not
  * intended for use by the process that originally created the segment
  * (it will have an on_shmem_exit callback registered to do that).  Rather,
  * this is for subprocesses that have inherited an attachment and want to
  * get rid of it.
  */
void
PGSharedMemoryDetach(void)
{
    PGShmemHeader  *hdr;
    if (UsedShmemSegAddr != NULL)
    {
        hdr = (PGShmemHeader *) UsedShmemSegAddr;
        if (munmap(UsedShmemSegAddr, hdr->totalsize) < 0)
            elog(LOG, "munmap(%p) failed: %m", UsedShmemSegAddr);
        UsedShmemSegAddr = NULL;
    }
}


/*
  * Attach to shared memory and make sure it has a Postgres header
  *
  * Returns attach address if OK, else NULL
  */
static PGShmemHeader *
PGSharedMemoryAttach(int key)
{
    PGShmemHeader *hdr;
    char        keyName[IPCNameLength];
    Size        size;
    int            fd;

    GenerateIPCName(key, keyName);
    if ((fd = shm_open(keyName, O_RDWR, 0)) < 0)
        return NULL;

    hdr = (PGShmemHeader *) mmap(UsedShmemSegAddr, sizeof(PGShmemHeader),
                                 PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    if (hdr == (PGShmemHeader *) -1)
    {
        close(fd);
        return NULL;            /* failed: must be some other app's */
    }

    if (hdr->magic != PGShmemMagic)
    {
        close(fd);
        munmap((void *) hdr, sizeof(PGShmemHeader));
        return NULL;            /* segment belongs to a non-Postgres app */
    }

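    /*
     * Since the segment has a valid Postgres header, unmap and re-map it
     * with the proper size.
     */
    size = hdr->totalsize;
    munmap((void *) hdr, sizeof(PGShmemHeader));
    hdr = (PGShmemHeader *) mmap(UsedShmemSegAddr, size,
                                 PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);

    /* Without this check a failed re-map would be returned as non-NULL */
    if (hdr == (PGShmemHeader *) MAP_FAILED)
        return NULL;            /* could not map the full segment */

    return hdr;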
}

Attachment

posix_shmem.c

Re: Feature: POSIX Shared memory support

From

Tom Lane

Date:

06 February 2007, 14:17:37

Chris Marcellino <maps@levelview.com> writes:
> To this end, I have "ported" the sysv_shmem.c layer to use the POSIX
> calls (which are some ways more robust w.r.t reducing collision by
> using strings as shared memory id's, instead of ints).

This has been suggested before, and rejected before, on the grounds that
the POSIX API provides no way to detect whether anyone else is attached
to the segment.  Not being able to tell that is a tremendous robustness
hit for us.  We are not going to risk destroying someone's database
(or in the alternative, failing to restart after most crashes, which
it looks like your patch would do) in order to make installation
fractionally easier.

I read through your patch in the hopes that you had a solution for this,
but all I find is a copied-and-pasted comment

>     /*
>      * We detect whether a shared memory segment is in use by seeing whether
>      * it (a) exists and (b) has any processes are attached to it.
>      */

followed by code that does no such thing.
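
For readers unfamiliar with the attachment count being discussed: System V exposes it through shmctl(IPC_STAT) as the shm_nattch field, and the POSIX shm_open interface has no counterpart. A minimal sketch of reading it follows; the helper name sysv_attach_count is hypothetical and this is not the PostgreSQL code itself.

```c
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>

/*
 * Return the number of processes currently attached to the System V
 * segment `shmid`, or -1 if the segment cannot be examined.
 * Hypothetical helper, for illustration only.
 */
static long
sysv_attach_count(int shmid)
{
    struct shmid_ds buf;

    if (shmctl(shmid, IPC_STAT, &buf) < 0)
        return -1;

    /* The kernel maintains this count; it drops when a process detaches or exits */
    return (long) buf.shm_nattch;
}
```

Because the kernel maintains the count, it stays accurate even when the attached processes died without running any cleanup code.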

            regards, tom lane

Re: Feature: POSIX Shared memory support

From

Michael Paesold

Date:

06 February 2007, 14:51:35

Tom Lane wrote:
> Chris Marcellino <maps@levelview.com> writes:
>> To this end, I have "ported" the sysv_shmem.c layer to use the POSIX
>> calls (which are some ways more robust w.r.t reducing collision by
>> using strings as shared memory id's, instead of ints).
>
> This has been suggested before, and rejected before, on the grounds that
> the POSIX API provides no way to detect whether anyone else is attached
> to the segment.  Not being able to tell that is a tremendous robustness
> hit for us.  We are not going to risk destroying someone's database
> (or in the alternative, failing to restart after most crashes, which
> it looks like your patch would do) in order to make installation
> fractionally easier.
>
> I read through your patch in the hopes that you had a solution for this,
> but all I find is a copied-and-pasted comment
>
>>     /*
>>      * We detect whether a shared memory segment is in use by seeing whether
>>      * it (a) exists and (b) has any processes are attached to it.
>>      */
>
> followed by code that does no such thing.

Just an idea, but would it be possible to have a small SysV area as an
"advisory lock" (using the existing semantics) to protect the POSIX segment.

Best Regards
Michael Paesold

Re: Feature: POSIX Shared memory support

From

Chris Marcellino

Date:

06 February 2007, 17:27:19

Tom, that is a definitely valid point and thanks for the feedback. I
assume that the 'more modern' string segment naming gave the POSIX
methods an edge in avoiding collision between other apps.
As far as detecting a) whether anyone else is currently attached to
that segment and b) whether an earlier existence of the current
backend was still attached to a segment, I presumed that checking the
pid's of the backend that owns the shared memory segment and checking
the data directory (both of which the SysV code already does) would
suffice?
What am I forgetting?

Michael, that is an interesting idea. That might be an avenue to
explore if there isn't a simpler way.

Thanks,
Chris Marcellino

On Feb 6, 2007, at 7:51 AM, Michael Paesold wrote:

> Tom Lane wrote:
>> Chris Marcellino <maps@levelview.com> writes:
>>> To this end, I have "ported" the sysv_shmem.c layer to use the
>>> POSIX  calls (which are some ways more robust w.r.t reducing
>>> collision by  using strings as shared memory id's, instead of ints).
>> This has been suggested before, and rejected before, on the
>> grounds that
>> the POSIX API provides no way to detect whether anyone else is
>> attached
>> to the segment.  Not being able to tell that is a tremendous
>> robustness
>> hit for us.  We are not going to risk destroying someone's database
>> (or in the alternative, failing to restart after most crashes, which
>> it looks like your patch would do) in order to make installation
>> fractionally easier.
>> I read through your patch in the hopes that you had a solution for
>> this,
>> but all I find is a copied-and-pasted comment
>>>     /*
>>>      * We detect whether a shared memory segment is in use by seeing
>>> whether
>>>      * it (a) exists and (b) has any processes are attached to it.
>>>      */
>> followed by code that does no such thing.
>
> Just an idea, but would it be possible to have a small SysV area as
> an "advisory lock" (using the existing semantics) to protect the
> POSIX segment.
>
> Best Regards
> Michael Paesold

Re: Feature: POSIX Shared memory support

From

Alvaro Herrera

Date:

06 February 2007, 17:32:30

Chris Marcellino wrote:
> Tom, that is a definitely valid point and thanks for the feedback. I
> assume that the 'more modern' string segment naming gave the POSIX
> methods an edge in avoiding collision between other apps.
> As far as detecting a) whether anyone else is currently attached to
> that segment and b) whether an earlier existence of the current
> backend was still attached to a segment, I presumed that checking the
> pid's of the backend that owns the shared memory segment and checking
> the data directory (both which the SysV code already does) would
> suffice?

Is there an API call to list all PIDs that are connected to a particular
segment?

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Feature: POSIX Shared memory support

From

Chris Marcellino

Date:

06 February 2007, 17:37:30

To my knowledge there is unfortunately not a portable call that does
that.
I was actually referring to the check that the current SysV code does
on the pid that is stored in the shmem header. I presume that if the
backend is dead, the kill(hdr->creatorPID, 0) returning zero would
suffice for confirming the existence of the other backend process.
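
The check being referred to can be sketched as follows. It mirrors the kill(hdr->creatorPID, 0) test in the attached file; the function name process_may_exist is hypothetical. It only says something about the one PID that is probed.

```c
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>

/*
 * Probe whether a process with the given PID may still exist.
 * Signal 0 performs the permission and existence checks without
 * delivering a signal.  Hypothetical helper, for illustration only.
 */
static bool
process_may_exist(pid_t pid)
{
    if (kill(pid, 0) == 0)
        return true;            /* process exists and we may signal it */

    /*
     * ESRCH means no such process.  Any other error (e.g. EPERM) means the
     * PID is in use, possibly by an unrelated process of another user, so
     * the conservative answer is "yes".
     */
    return errno != ESRCH;
}
```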

Chris Marcellino

On Feb 6, 2007, at 10:32 AM, Alvaro Herrera wrote:

> Chris Marcellino wrote:
>> Tom, that is a definitely valid point and thanks for the feedback. I
>> assume that the 'more modern' string segment naming gave the POSIX
>> methods an edge in avoiding collision between other apps.
>> As far as detecting a) whether anyone else is currently attached to
>> that segment and b) whether an earlier existence of the current
>> backend was still attached to a segment, I presumed that checking the
>> pid's of the backend that owns the shared memory segment and checking
>> the data directory (both which the SysV code already does) would
>> suffice?
>
> Is there an API call to list all PIDs that are connected to a
> particular
> segment?
>
> --
> Alvaro Herrera                                http://www.CommandPrompt.com/
> PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Feature: POSIX Shared memory support

From

Tom Lane

Date:

06 February 2007, 17:48:14

Chris Marcellino <maps@levelview.com> writes:
> I was actually referring to the check that the current SysV code does
> on the pid that is stored in the shmem header. I presume that if the
> backend is dead, the kill(hdr->creatorPID, 0) returning zero would
> suffice for confirming the existence of the other backend process.

No, that's not relevant, because only the postmaster's PID will be there
--- that test is actually more or less redundant with the existing
postmaster.pid lockfile checks.  The thing that the SysV attachment
count is useful for is detecting whether there are orphaned backends
still alive in the database (and potentially changing it, hence the
danger).

We've speculated on occasion about using file locking in some form as a
substitute mechanism for detecting this, but that seems to just bring
its own set of not-too-portable assumptions.
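
As a rough illustration of what a file-locking substitute might look like, here is a sketch using POSIX fcntl() advisory locks. Everything in it is an assumption made for illustration (the helper names hold_shared_lock and other_lock_holder, and the idea of locking a file in the data directory); it is not a proposal from this thread. Each process would hold a shared lock for its lifetime, and a starting postmaster would ask whether anyone else still holds one.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Take a shared (read) lock on the whole file.  The kernel releases it
 * when the process exits.  Returns 0 on success, -1 on failure.
 * Hypothetical helper, for illustration only.
 */
static int
hold_shared_lock(int fd)
{
    struct flock lk = {0};

    lk.l_type = F_RDLCK;
    lk.l_whence = SEEK_SET;
    lk.l_start = 0;
    lk.l_len = 0;               /* zero length means "to end of file" */
    return fcntl(fd, F_SETLK, &lk);
}

/*
 * Return the PID of some other process holding a lock on the file,
 * 0 if there is none, or -1 on error.
 */
static pid_t
other_lock_holder(int fd)
{
    struct flock lk = {0};

    lk.l_type = F_WRLCK;        /* ask: would an exclusive lock conflict? */
    lk.l_whence = SEEK_SET;
    lk.l_start = 0;
    lk.l_len = 0;
    if (fcntl(fd, F_GETLK, &lk) < 0)
        return -1;
    return (lk.l_type == F_UNLCK) ? 0 : lk.l_pid;
}
```

Some of the assumptions this brings along: fcntl() locks are not inherited across fork(), so every backend would have to take its own lock; a process loses its lock as soon as it closes any descriptor for the file; and behavior on network file systems varies.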

            regards, tom lane

Re: Feature: POSIX Shared memory support

From

"Takayuki Tsunakawa"

Date:

07 February 2007, 02:04:59

From: "Chris Marcellino" <maps@levelview.com>
> To this end, I have "ported" the sysv_shmem.c layer to use the POSIX
> calls (which are some ways more robust w.r.t reducing collision by
> using strings as shared memory id's, instead of ints).

I hope your work will be accepted.  Setting IPC parameters is tedious
for normal users, and they sometimes miss the manual article and hit
the IPC resource shortage problem, particularly when the system
developers run multiple instances on a single machine at the same
time.
Then, how about semaphores?  When I just do configure, PostgreSQL
seems to use SysV semaphores.  But POSIX semaphore implementation is
prepared in src/backend/port/posix_sema.c.  Why isn't it used by
default?  Does it have any problem?
# Windows is good in this point, isn't it?

I'm sorry to ask you a question even though I've not read your patch
well.  Does mmap(MAP_SHARED) need msync() to make the change by one
process visible to other processes?  I found the following in the
manual page of mmap on Linux:

------------------------------------------------------------
MAP_SHARED Share this mapping with all other processes that map this
           object.  Storing to the region is equivalent to writing to
           the file.  The file may not actually be updated until
           msync(2) or munmap(2) are called.
------------------------------------------------------------

BTW, is the number of semaphores for dummy backends (eg bgwriter,
autovacuum) counted in PostgreSQL manual?

From: "Tom Lane" <tgl@sss.pgh.pa.us>
> the POSIX API provides no way to detect whether anyone else is attached
> to the segment.  Not being able to tell that is a tremendous robustness
> hit for us.  We are not going to risk destroying someone's database
> (or in the alternative, failing to restart after most crashes, which
> it looks like your patch would do) in order to make installation
> fractionally easier.

How is this done on Windows?  Is it possible to count the number of
processes that attach a shared memory?

Re: Feature: POSIX Shared memory support

From

Chris Marcellino

Date:

07 February 2007, 03:07:19

Responses inline.

On Feb 6, 2007, at 7:05 PM, Takayuki Tsunakawa wrote:

> From: "Chris Marcellino" <maps@levelview.com>
>> To this end, I have "ported" the sysv_shmem.c layer to use the POSIX
>> calls (which are some ways more robust w.r.t reducing collision by
>> using strings as shared memory id's, instead of ints).
>
> I hope your work will be accepted.  Setting IPC parameters is tedious
> for normal users, and they sometimes miss the manual article and hit
> the IPC resource shortage problem, particularly when the system
> developers run multiple instances on a single machine at the same
> time.

As Tom pointed out, the code I posted yesterday is not robust enough
for general consumption. I'm working on a better solution, which will
likely involve using a very small SysV shmem segment as a mutex of
sorts (as Michael Paesold suggested).

> Then, how about semaphores?  When I just do configure, PostgreSQL
> seems to use SysV semaphores.  But POSIX semaphore implementation is
> prepared in src/backend/port/posix_sema.c.  Why isn't it used by
> default?  Does it have any problem?
>

In this case, semaphore usage is unrelated to shared memory
shortages. Also, on many platforms the posix_sema's code is used.
Either way, no one is running out of shared memory due
to semaphores.

> # Windows is good in this point, isn't it?

From what I can tell, if you look at the Windows SysV shmem
emulation code in src/backend/port/win32/shmem.c, you will see in the
shmctl() function that the 'other process detection' code is not
implemented, since there is no corresponding Win32 API to implement
this. There is only so much you can do in that case.

As far as the other platforms go, any replacement for the SysV shmem
code should be as reliable as what preceded it.


>
> I'm sorry to ask you a question even though I've not read your patch
> well.  Does mmap(MAP_SHARED) need msync() to make the change by one
> process visible to other processes?  I found the following in the
> manual page of mmap on Linux:
>
> ------------------------------------------------------------
> MAP_SHARED Share this mapping with all other processes that map this
>            object.  Storing to the region is equivalent to writing to
>            the file.  The file may not actually be updated until
>            msync(2) or munmap(2) are called.
> ------------------------------------------------------------
>
> BTW, is the number of semaphores for dummy backends (eg bgwriter,
> autovacuum) counted in PostgreSQL manual?
>
> From: "Tom Lane" <tgl@sss.pgh.pa.us>
>> the POSIX API provides no way to detect whether anyone else is attached
>> to the segment.  Not being able to tell that is a tremendous robustness
>> hit for us.  We are not going to risk destroying someone's database
>> (or in the alternative, failing to restart after most crashes, which
>> it looks like your patch would do) in order to make installation
>> fractionally easier.
>
> How is this done on Windows?  Is it possible to count the number of
> processes that attach a shared memory?

Re: Feature: POSIX Shared memory support

From

Tom Lane

Date:

07 February 2007, 03:09:03

"Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com> writes:
> From: "Tom Lane" <tgl@sss.pgh.pa.us>
>> the POSIX API provides no way to detect whether anyone else is
>> attached to the segment.  Not being able to tell that is a tremendous
>> robustness hit for us.

> How is this done on Windows?  Is it possible to count the number of
> processes that attach a shared memory?

AFAIK the Windows port is simply wrong/insecure on this point --- it's
one of the reasons you'll never see me recommending Windows as the OS
for a production Postgres server.

            regards, tom lane

Re: Feature: POSIX Shared memory support

From

Tom Lane

Date:

07 February 2007, 03:28:01

Chris Marcellino <maps@levelview.com> writes:
> As Tom pointed out, the code I posted yesterday is not robust enough
> for general consumption. I'm working on a better solution, which will
> likely involve using a very small SysV shmem segment as a mutex of
> sorts (as Michael Paesold suggested).

One problem with Michael's idea is that it gives up one of the better
arguments for having a POSIX option, namely to allow us to run on
platforms where SysV shmem support is not there at all.

I'm not sure whether the idea can be implemented without creating new
failure modes; that will have to wait on seeing a patch.  But the
strength of the coupling between the SysV and POSIX segments is
certainly going to be a red-flag item to look at.

>> Then, how about semaphores?  When I just do configure, PostgreSQL
>> seems to use SysV semaphores.  But POSIX semaphore implementation is
>> prepared in src/backend/port/posix_sema.c.  Why isn't it used by
>> default?  Does it have any problem?

> In this case, semaphore usage is unrelated to shared memory
> shortages. Also, on many platforms the posix_sema's code is used.
> Either way, essentially no one is running out of shared memory due
> to semaphores.

AFAIK the only platform where the POSIX sema code is really used is
Darwin (OS X), and it is not something I'd use there if I had a choice.
The problem with it is that *every* semaphore corresponds to an open
file handle in the postmaster that has to be inherited by *every* forked
child.  So N backend slots cost you O(N^2) in kernel filehandles and
process fork overhead, plus if N is big you're taking a serious hit in
the number of disk files any one backend can have open.  This problem
may be specific to Darwin's implementation of the POSIX spec, but it's
real enough there.  If you trawl the archives you'll probably notice a
lack of people running big Postgres installations on Darwin, and this is
why.
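
As a back-of-the-envelope version of that arithmetic, assuming one
semaphore per backend slot and every slot occupied by a child that has
inherited all of the postmaster's handles (sema_handle_refs() is a
made-up name, not something in the tree):

```c
/*
 * Rough count of kernel file-handle references consumed by semaphores
 * when each semaphore is an open file handle: the postmaster holds
 * nslots of them, and each of nslots forked children inherits all
 * nslots again.
 */
static long
sema_handle_refs(long nslots)
{
    return nslots + nslots * nslots;
}
```

Under those assumptions 100 slots come to 10100 handle references and
1000 slots to just over a million, which is the quadratic growth being
described.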

            regards, tom lane

Re: Feature: POSIX Shared memory support

From

"Takayuki Tsunakawa"

Date:

07 February 2007, 03:31:51

>> Then, how about semaphores?  When I just do configure, PostgreSQL
>> seems to use SysV semaphores.  But POSIX semaphore implementation is
>> prepared in src/backend/port/posix_sema.c.  Why isn't it used by
>> default?  Does it have any problem?
>>
>
> Either way, essentially no one is running out of shared memory due
> to semaphores.
> In this case, semaphore usage is unrelated to shared memory
> shortages.

Yes, of course, shared memory is not related to semaphores.

> Also, on many platforms the posix_sema's code is used.

Really?  When I run 'configure' without any parameter on Red Hat
Enterprise Linux 4.0 (kernel 2.6.x), PostgreSQL uses SysV semaphores.
I confirmed that by seeing the result of 'ipcs -u'.  On what platforms
does PostgreSQL use POSIX semaphores by default?

Re: Feature: POSIX Shared memory support

From

Chris Marcellino

Date:

07 February 2007, 03:44:40

Yes, as Tom pointed out. Sorry, I misread the autoconf file. I've
gotten quite used to Darwin == BSD.
I've added a note to my todo list to look into the posix semaphore
performance on the Darwin side.

--Chris

On Feb 6, 2007, at 8:32 PM, Takayuki Tsunakawa wrote:

>>> Then, how about semaphores?  When I just do configure, PostgreSQL
>>> seems to use SysV semaphores.  But POSIX semaphore implementation is
>>> prepared in src/backend/port/posix_sema.c.  Why isn't it used by
>>> default?  Does it have any problem?
>>>
>>
>> Either way, essentially no one is running out of shared memory due
>> to semaphores.
>> In this case, semaphore usage is unrelated to shared memory
>> shortages.
>
> Yes, of course, shared memory is not related to semaphores.
>
>> Also, on many platforms the posix_sema's code is used.
>
> Really?  When I run 'configure' without any parameter on Red Hat
> Enterprise Linux 4.0 (kernel 2.6.x), PostgreSQL uses SysV semaphores.
> I confirmed that by seeing the result of 'ipcs -u'.  On what platforms
> does PostgreSQL use POSIX semaphores by default?

Re: Feature: POSIX Shared memory support

From

Chris Marcellino

Date:

07 February 2007, 04:46:40

Attached is a beta of the POSIX shared memory layer. It is 75% the
original sysv_shmem.c code. I'm looking for ways to refactor it down
a bit, while changing as little of the tried-and-tested code as
possible. I thought I'd put it out there for comments.

Unfortunately, it is more complicated than the original because it
uses both sets of APIs.  Also, I haven't tested the crash recovery
thoroughly.  The POSIX code could be used Windows-style (i.e. no
crash recovery) by anyone who needs to run Postgres on a POSIX-only
platform, provided the SysV calls are ifdef'd out properly.

Using both APIs is certainly not ideal. You mentioned,

> We've speculated on occasion about using file locking in some form as a
> substitute mechanism for detecting this, but that seems to just bring
> its own set of not-too-portable assumptions

What sort of file locking did you have in mind? Do you think this
might be worth me trying?

Thanks for your help,
Chris Marcellino

Attachment

posix_shmem.c

Re: Feature: POSIX Shared memory support

From

"Andrew Dunstan"

Date:

07 February 2007, 06:27:17

Tom Lane wrote:
>
> We've speculated on occasion about using file locking in some form as a
> substitute mechanism for detecting this, but that seems to just bring
> its own set of not-too-portable assumptions.
>

Maybe we should look some more at that. Use of file locking was one
thought I had today after I saw Tom's earlier comments.

Perl provides a moderately portable flock(), which we in fact use in
the buildfarm to stop more than one run at a time on a given repo
copy.

The Perl description starts thus:

   Calls flock(2), or an emulation of it, on FILEHANDLE.  Returns
   true for success, false on failure.  Produces a fatal error if
   used on a machine that doesn't implement flock(2), fcntl(2)
   locking, or lockf(3).  "flock" is Perl's portable file locking
   interface, although it locks only entire files, not records.

Note that this means it works on every platform that has ever reported on
buildfarm.
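
For reference, the "only one run at a time" pattern looks roughly like
this when written directly against flock(2) in C. It is only a sketch:
try_exclusive_lock() is a hypothetical helper, not buildfarm or Postgres
code, and it assumes a platform that actually has flock(2) rather than
one of the emulations the Perl documentation mentions.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Try to take an exclusive, non-blocking flock() on path.
 * On success returns the open descriptor; the lock is held until that
 * descriptor is closed or the process exits.  Returns -1 on failure,
 * with errno == EWOULDBLOCK when someone else already holds the lock.
 */
static int
try_exclusive_lock(const char *path)
{
    int         fd = open(path, O_RDWR | O_CREAT, 0600);

    if (fd < 0)
        return -1;
    if (flock(fd, LOCK_EX | LOCK_NB) != 0)
    {
        int         saved_errno = errno;

        close(fd);
        errno = saved_errno;    /* report flock()'s error, not close()'s */
        return -1;
    }
    return fd;
}
```

A second caller that opens the same file while the first descriptor is
still open should get -1 with EWOULDBLOCK, and should succeed once the
first descriptor is closed.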

Maybe we can borrow some code.

cheers

andrew

Re: Feature: POSIX Shared memory support

From

Magnus Hagander

Date:

07 February 2007, 12:33:08

On Tue, Feb 06, 2007 at 11:08:51PM -0500, Tom Lane wrote:
> "Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com> writes:
> > From: "Tom Lane" <tgl@sss.pgh.pa.us>
> >> the POSIX API provides no way to detect whether anyone else is
> >> attached to the segment.  Not being able to tell that is a tremendous
> >> robustness hit for us.
>
> > How is this done on Windows?  Is it possible to count the number of
> > processes that attach a shared memory?
>
> AFAIK the Windows port is simply wrong/insecure on this point --- it's
> one of the reasons you'll never see me recommending Windows as the OS
> for a production Postgres server.

What exactly is the failure case? Might be able to figure out a way to
do what we want on win32 even if it's not possible to do it exactly with
the sysv semantics.

//Magnus

Re: Feature: POSIX Shared memory support

From

Tom Lane

Date:

07 February 2007, 13:41:02

Magnus Hagander <magnus@hagander.net> writes:
> On Tue, Feb 06, 2007 at 11:08:51PM -0500, Tom Lane wrote:
>> AFAIK the Windows port is simply wrong/insecure on this point --- it's
>> one of the reasons you'll never see me recommending Windows as the OS
>> for a production Postgres server.

> What exactly is the failure case? Might be able to figure out a way to
> do what we want on win32 even if it's not possible to do it exactly with
> the sysv semantics.

kill -9 postmaster (only), then try to start new postmaster.  This
should succeed if and only if there are no live orphaned backends.
An implementation that hasn't got a direct test for the presence of
backends can only get one of the two cases correct.
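
The kind of direct test meant here can be sketched with the SysV calls
as follows. This is an illustration only, not the code in sysv_shmem.c;
attached_count() and segment_is_orphan_free() are hypothetical helper
names. It relies on shmctl(IPC_STAT), which reports the number of
current attachments in shm_nattch.

```c
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>

/*
 * Return the number of processes currently attached to the SysV
 * segment shmid, or -1 if the segment cannot be examined.
 */
static long
attached_count(int shmid)
{
    struct shmid_ds ds;

    if (shmctl(shmid, IPC_STAT, &ds) < 0)
        return -1;
    return (long) ds.shm_nattch;
}

/*
 * A starting postmaster could treat a leftover segment as safe to
 * recycle only when no process at all is attached to it.
 */
static int
segment_is_orphan_free(int shmid)
{
    return attached_count(shmid) == 0;
}
```

With this, a leftover segment with live orphaned backends reports a
nonzero count and the new postmaster can refuse to start, while a
segment left behind by a clean kill reports zero and can be reused.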

On Windows (or really any EXEC_BACKEND platform) there's an additional
problem, which is that even with an attach count you have a race
condition: what if the postmaster launched a new backend just before
dying, and that process has not yet re-attached to shared memory?
I don't think this is a big problem in practice, because most people
don't feel a need for an automated postmaster-restarting monitor, and
so the time scale for human intervention is too long to hit the race
condition.  But it's annoying from a theoretical perspective.

It's probably possible to replace the attach-count test with some sort
of file locking convention --- eg if all the backends hold some type of
shared lock on postmaster.pid.  This seems unlikely to be much more
portable than the attach-count solution as far as Unixen go, but if
we're looking for a Windows-specific solution that's where I'd look.
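
On the Unix side, such a convention might look roughly like the sketch
below, using fcntl() record locks. All function names and the file path
are made up for the illustration, and it assumes a local filesystem
where fcntl() locking works. Note that fcntl() locks belong to a
process and are not inherited across fork(), so each backend would have
to take its own shared lock.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Take a shared (read) lock on the whole file; each backend would do this. */
static int
hold_shared_lock(int fd)
{
    struct flock fl = {0};

    fl.l_type = F_RDLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;               /* 0 means "to end of file" */
    return fcntl(fd, F_SETLK, &fl);
}

/* 1 if another process holds a conflicting lock, 0 if none, -1 on error. */
static int
other_process_holds_lock(int fd)
{
    struct flock fl = {0};

    fl.l_type = F_WRLCK;        /* ask: could we lock exclusively? */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;
    if (fcntl(fd, F_GETLK, &fl) < 0)
        return -1;
    return fl.l_type != F_UNLCK;
}

/* Demo: fork a stand-in "backend" that holds the lock, then probe. */
static int
probe_while_child_holds_lock(const char *path)
{
    int         ready[2], done[2], fd, result;
    char        c = 'x';
    pid_t       pid;

    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0 || pipe(ready) != 0 || pipe(done) != 0)
        return -1;
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
    {
        close(ready[0]);
        close(done[1]);
        if (hold_shared_lock(fd) != 0 || write(ready[1], &c, 1) != 1)
            _exit(1);
        if (read(done[0], &c, 1) < 0)   /* block until the parent is done */
            _exit(1);
        _exit(0);
    }
    close(ready[1]);
    close(done[0]);
    result = (read(ready[0], &c, 1) == 1) ? other_process_holds_lock(fd) : -1;
    close(done[1]);             /* EOF on the pipe releases the child */
    waitpid(pid, NULL, 0);
    close(ready[0]);
    close(fd);
    return result;
}
```

The probe reports a conflict while the child is alive and none after it
exits, since the kernel drops a process's fcntl() locks when it dies.
Whether this behaves the same on network filesystems is exactly the
portability question.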

            regards, tom lane

Re: Feature: POSIX Shared memory support

From

Alvaro Herrera

Date:

07 February 2007, 14:04:11

Andrew Dunstan wrote:

> Maybe we should look some more at that. Use of file locking was one
> thought I had today after I saw Tom's earlier comments.
>
> Perl provides a moderately portable flock(), which we use in fact in
> buildfarm to stop it from running more than one at a time on a given repo
> copy.

But does it work over NFS?  On my system, the flock manpage claims it
doesn't, lockf doesn't say and fcntl also doesn't say, but the flock
manpage says fcntl does.  A lot of people run servers on NFS, even
though we recommend they don't.  And there are those strange hybrids
like SANs, NASes or what have you.

One serious problem is that if the lock doesn't work for some reason
like NFSness, it will fail silently, which is not acceptable.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: Feature: POSIX Shared memory support

From

Andrew Dunstan

Date:

07 February 2007, 14:37:08

Alvaro Herrera wrote:
> Andrew Dunstan wrote:
>
>
>> Maybe we should look some more at that. Use of file locking was one
>> thought I had today after I saw Tom's earlier comments.
>>
>> Perl provides a moderately portable flock(), which we use in fact in
>> buildfarm to stop it from running more than one at a time on a given repo
>> copy.
>>
>
> But does it work over NFS?  On my system, the flock manpage claims it
> doesn't, lockf doesn't say and fcntl also doesn't say, but the flock
> manpage says fcntl does.  A lot of people run servers on NFS, even
> though we recommend they don't.  And there are those strange hybrids
> like SANs, NASes or what have you.
>
> One serious problem is that if the lock doesn't work for some reason
> like NFSness, it will fail silently, which is not acceptable.
>
>

Fair point. Perl in fact uses whatever it can from the underlying
system, preferring (I think) flock, then fcntl, then lockf. So its
flock is quite possibly not NFS safe in many cases.

cheers

andrew

Re: Feature: POSIX Shared memory support

From

Alvaro Herrera

Date:

07 February 2007, 14:42:01

Andrew Dunstan wrote:

> Perl provides a moderately portable flock(), which we use in fact in
> buildfarm to stop it from running more than one at a time on a given repo
> copy.
>
[...]

> Maybe we can borrow some code.

Probably not, because it's GPL/Artistic; but we could borrow some ideas
instead.

The relevant code is here
http://public.activestate.com/cgi-bin/perlbrowse/f/pp_sys.c

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: Feature: POSIX Shared memory support

From

Magnus Hagander

Date:

08 February 2007, 10:33:03

On Wed, Feb 07, 2007 at 09:40:16AM -0500, Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
> > On Tue, Feb 06, 2007 at 11:08:51PM -0500, Tom Lane wrote:
> >> AFAIK the Windows port is simply wrong/insecure on this point --- it's
> >> one of the reasons you'll never see me recommending Windows as the OS
> >> for a production Postgres server.
>
> > What exactly is the failure case? Might be able to figure out a way to
> > do what we want on win32 even if it's not possible to do it exactly with
> > the sysv semantics.
>
> kill -9 postmaster (only), then try to start new postmaster.  This
> should succeed if and only if there are no live orphaned backends.
> An implementation that hasn't got a direct test for the presence of
> backends can only get one of the two cases correct.
>
> On Windows (or really any EXEC_BACKEND platform) there's an additional
> problem, which is that even with an attach count you have a race
> condition: what if the postmaster launched a new backend just before
> dying, and that process has not yet re-attached to shared memory?
> I don't think this is a big problem in practice, because most people
> don't feel a need for an automated postmaster-restarting monitor, and
> so the time scale for human intervention is too long to hit the race
> condition.  But it's annoying from a theoretical perspective.
>
> It's probably possible to replace the attach-count test with some sort
> of file locking convention --- eg if all the backends hold some type of
> shared lock on postmaster.pid.  This seems unlikely to be much more
> portable than the attach-count solution as far as Unixen go, but if
> we're looking for a Windows-specific solution that's where I'd look.

Ok. From what I can tell, we create a shared mem segment named
PostgreSQL.5432001. If I kill postmaster with something active, and
start a new one, it gets named PostgreSQL.5432002.

If we just didn't add the serial number at the end, then it would be
impossible to create a shared memory segment for the same port again.
That protects the port and not the datadir. But what if we change the
name of the shared memory segment to be that of the data directory
instead of the port?

On win32 we do not have the problem of "orphaned segments", because once
the last process that holds a segment dies, the segment always goes
away. An anonymous region cannot exist if there are no handles open to
it.

As for the EXEC_BACKEND case you mentioned, I don't think it's an issue
on win32. If the postmaster dies before the backend re-attaches, the
backend will fail to re-attach. I think?

Thoughts?

//Magnus

Re: Feature: POSIX Shared memory support

From

Tom Lane

Date:

08 February 2007, 13:46:50

Magnus Hagander <magnus@hagander.net> writes:
> If we just didn't add the serial number at the end, then it would be
> impossible to create a shared memory segment for the same port again.
> That protects the port and not the datadir. But what if we change the
> name of the shared memory segment to be that of the data directory
> instead of the port?

That would help if there's only one possible spelling of the data
directory path ... otherwise not so much ...

            regards, tom lane

Re: Feature: POSIX Shared memory support

From

Magnus Hagander

Date:

08 February 2007, 19:53:18

Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> If we just didn't add the serial number at the end, then it would be
>> impossible to create a shared memory segment for the same port again.
>> That protects the port and not the datadir. But what if we change the
>> name of the shared memory segment to be that of the data directory
>> instead of the port?
>
> That would help if there's only one possible spelling of the data
> directory path ... otherwise not so much ...

Well, we could run GetFullPathName() on it
(http://msdn2.microsoft.com/en-us/library/aa364963.aspx). I think that
should work - takes out the "relative vs absolute path" part at least.

It won't take care of somebody having a junction pointing at the data
directory and starting it against that one, but that's really someone
*trying* to break the system. You wouldn't do that by mistake...

Seems worthwhile to you? If so I can take a look at doing it when I get
some spare time.

//Magnus

Re: Feature: POSIX Shared memory support

From

Tom Lane

Date:

08 February 2007, 20:08:55

Magnus Hagander <magnus@hagander.net> writes:
> Tom Lane wrote:
>> Magnus Hagander <magnus@hagander.net> writes:
>>> If we just didn't add the serial number at the end, then it would be
>>> impossible to create a shared memory segment for the same port again.
>>> That protects the port and not the datadir. But what if we change the
>>> name of the shared memory segment to be that of the data directory
>>> instead of the port?
>>
>> That would help if there's only one possible spelling of the data
>> directory path ... otherwise not so much ...

> Well, we could run GetFullPathName() on it
> (http://msdn2.microsoft.com/en-us/library/aa364963.aspx). I think that
> should work - takes out the "relative vs absolute path" part at least.

> It won't take care of somebody having a junction pointing at the data
> directory and starting it against that one, but that's really someone
> *trying* to break the system. You wouldn't do that by mistake...

> Seems worthwhile to you? If so I can take a look at doing it when I get
> some spare time.

Sounds reasonable --- certainly it'd be better than the current
situation.  I assume that we can have long enough shared memory segment
names that the data directory path length isn't unduly constrained?

            regards, tom lane

Re: Feature: POSIX Shared memory support

From

Magnus Hagander

Date:

08 February 2007, 20:09:45

Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> Tom Lane wrote:
>>> Magnus Hagander <magnus@hagander.net> writes:
>>>> If we just didn't add the serial number at the end, then it would be
>>>> impossible to create a shared memory segment for the same port again.
>>>> That protects the port and not the datadir. But what if we change the
>>>> name of the shared memory segment to be that of the data directory
>>>> instead of the port?
>>> That would help if there's only one possible spelling of the data
>>> directory path ... otherwise not so much ...
>
>> Well, we could run GetFullPathName() on it
>> (http://msdn2.microsoft.com/en-us/library/aa364963.aspx). I think that
>> should work - takes out the "relative vs absolute path" part at least.
>
>> It won't take care of somebody having a junction pointing at the data
>> directory and starting it against that one, but that's really someone
>> *trying* to break the system. You wouldn't do that by mistake...
>
>> Seems worthwhile to you? If so I can take a look at doing it when I get
>> some spare time.
>
> Sounds reasonable --- certainly it'd be better than the current
> situation.  I assume that we can have long enough shared memory segment
> names that the data directory path length isn't unduly constrained?

From what I can see, we can have a shared memory segment name that is
just as long as any path name. Will run some tests on that to make
absolutely sure.

//Magnus