Re: [HACKERS] mmap and MAP_ANON - Mailing list pgsql-hackers

From ocie@paracel.com
Subject Re: [HACKERS] mmap and MAP_ANON
Date
Msg-id 9805131838.AA05684@dolomite.paracel.com
In response to Re: [HACKERS] mmap and MAP_ANON  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Tom Lane wrote:
>
> "Göran Thyni" <goran@bildbasen.se> writes:
> > Linux can only MAP_SHARED if the file is a *real* file,
> > devices or trick like MAP_ANON does only work with MAP_PRIVATE.
>
> Well, this makes some sense: MAP_SHARED implies that the shared memory
> will also be accessible to independently started processes, and
> to do that you have to have an openable filename to refer to the
> data segment by.
>
> MAP_PRIVATE will *not* work for our purposes: according to my copy
> of mmap(2):
>
> :     If MAP_PRIVATE is set in flags:
> :          o    Modification to the mapped region by the calling process is
> :               not visible to other processes which have mapped the same
> :               region using either MAP_PRIVATE or MAP_SHARED.
> :               Modifications are not visible to descendant processes that
> :               have inherited the mapped region across a fork().
>
> so privately mapped segments are useless for interprocess communication,
> even after we get rid of exec().
>
> mmaping /dev/zero, as has been suggested earlier in this thread,
> seems like a really bad idea to me.  Would that not imply that
> any process anywhere in the system that also decides to mmap /dev/zero
> would get its hands on the Postgres shared memory segment?  You
> can't restrict permissions on /dev/zero to prevent it.

On some systems, a MAP_SHARED mapping of /dev/zero is shared with child
processes across fork(), as in this example:

#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>   /* exit */
#include <string.h>   /* memset */

int main()
{
  int fd;
  caddr_t ma;
  int i;
  int pagesize = sysconf(_SC_PAGESIZE);

  fd=open("/dev/zero",O_RDWR);
  if (fd==-1) {
    perror("open");
    exit(1);
  }

  ma=mmap((caddr_t) 0,
      pagesize,
      (PROT_READ|PROT_WRITE),
      MAP_SHARED,
      fd,
      0);

  if (ma == MAP_FAILED) {
    perror("mmap");
    exit(1);
  }

  memset(ma,0,pagesize);

  i=fork();

  if (i==-1) {
    perror("fork");
    exit(1);
  }

  if (i==0) { /* child */
    ((char*)ma)[0]=1;
    sleep(1);
    printf("child %d %d\n",((char*)ma)[0],((char*)ma)[1]);
    sleep(1);
    return 0;
  } else { /* parent */
    ((char*)ma)[1]=1;
    sleep(1);
    printf("parent %d %d\n",((char*)ma)[0],((char*)ma)[1]);
  }

  wait(NULL);
  munmap(ma,pagesize);

  return 0;
}


This works on Solaris and, as expected, both the parent and the child are
able to write into the memory and each sees the other's changes (the
memory is truly shared between processes).  We can certainly map a
real file instead, and this might even give us some interesting crash
recovery options.  The nice thing about doing away with the exec is that
the memory mapped in the parent process is available at the same address
in every child process, so we don't have to do funky pointer tricks.
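
A minimal sketch of the real-file variant, assuming a scratch path chosen
by the caller.  The struct layout and the function name are hypothetical
and only illustrate that a raw pointer stored in the region before fork()
is still usable in the child.  It waits for the child instead of sleeping.

```c
#define _POSIX_C_SOURCE 200809L  /* expose POSIX declarations under strict -std= modes */
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical layout: a header holding a raw pointer into the same region. */
struct region {
    char *slot;      /* absolute address, valid in parent and child alike */
    char  data[16];
};

/*
 * Map one page of `path` with MAP_SHARED, fork, and let the child store
 * `value` through the raw pointer the parent set up before the fork.
 * Returns the byte the parent reads back, or -1 on failure.
 */
int child_write_visible(const char *path, char value)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    assert(pagesize >= (long) sizeof(struct region));

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd == -1)
        return -1;
    /* A new file is empty; extend it so the mapped page is backed by the file. */
    if (ftruncate(fd, (off_t) pagesize) == -1) {
        close(fd);
        unlink(path);
        return -1;
    }

    struct region *r = mmap(NULL, (size_t) pagesize, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
    close(fd);                          /* the mapping outlives the descriptor */
    if (r == MAP_FAILED) {
        unlink(path);
        return -1;
    }

    r->slot = &r->data[3];              /* plain pointer, no offset arithmetic */

    pid_t pid = fork();
    if (pid == -1) {
        munmap(r, (size_t) pagesize);
        unlink(path);
        return -1;
    }
    if (pid == 0) {                     /* child: same address, same memory */
        *r->slot = value;
        _exit(0);
    }

    int seen = -1;
    if (waitpid(pid, NULL, 0) == pid)   /* child has exited, so its store is done */
        seen = r->data[3];

    munmap(r, (size_t) pagesize);
    unlink(path);
    return seen;
}
```

Calling child_write_visible("/tmp/some_scratch_file", 42) should return 42
on a system where MAP_SHARED file mappings behave as described above.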

The only problem I see with mmap is that we don't know exactly when a
page will be written to disk.  I.e., if you make two writes to a page,
the page might get synced between them, thus storing an inconsistent
intermediate state on disk.  Perhaps with proper transaction
control, this is not a problem.
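
For illustration, msync() with MS_SYNC gives one point after which the
kernel has been asked to write the page out.  It does not stop the kernel
from writing the page earlier, so the window between the two stores
remains.  This is only a sketch: the record layout, the function name, and
the scratch path are assumptions.

```c
#define _POSIX_C_SOURCE 200809L  /* expose POSIX declarations under strict -std= modes */
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

struct pair { int a; int b; };   /* hypothetical two-field record */

/*
 * Store a and b into a MAP_SHARED file mapping, then force the page out
 * with msync(MS_SYNC).  Returns 0 if the file reads back with both
 * values, -1 on any failure.
 */
int write_pair_and_sync(const char *path, int a, int b)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    assert(pagesize >= (long) sizeof(struct pair));

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd == -1)
        return -1;
    if (ftruncate(fd, (off_t) pagesize) == -1) {
        close(fd);
        unlink(path);
        return -1;
    }

    struct pair *p = mmap(NULL, (size_t) pagesize, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        unlink(path);
        return -1;
    }

    p->a = a;   /* the kernel is free to write the page back right here, */
    p->b = b;   /* so the file may briefly hold the new a with the old b */

    /* Blocks until the write-out of this range has been completed. */
    int rc = msync(p, (size_t) pagesize, MS_SYNC);
    munmap(p, (size_t) pagesize);

    /*
     * Sanity check through the descriptor.  This read is served from the
     * page cache and does not by itself prove the data reached the disk.
     */
    struct pair seen = { 0, 0 };
    if (rc == 0 && pread(fd, &seen, sizeof seen, 0) != (ssize_t) sizeof seen)
        rc = -1;

    close(fd);
    unlink(path);
    if (rc != 0)
        return -1;
    return (seen.a == a && seen.b == b) ? 0 : -1;
}
```

Closing the window between the two stores would need something on top of
this, such as the transaction control mentioned above.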

The question is whether the individual database files should be mapped
into memory, or whether one "pgmem" file should be mapped, with pages
from different files read into it.  The first option would allow different
backend processes to map different pages of different files as they
are needed.  The postmaster could "pre-map" pages on behalf of the
backend processes as a sort of intelligent read-ahead mechanism.
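
A rough sketch of the first option, mapping a single page of a file on
demand by passing a page-aligned offset to mmap().  The helper names, the
page-number parameter, and the demo file contents are all hypothetical.
It does not attempt the postmaster "pre-map" idea.

```c
#define _POSIX_C_SOURCE 200809L  /* expose POSIX declarations under strict -std= modes */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: map page number `pageno` of an open file, or NULL. */
void *map_file_page(int fd, long pageno)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    assert(pagesize > 0 && pageno >= 0);

    /* The offset passed to mmap() must be a multiple of the page size. */
    void *p = mmap(NULL, (size_t) pagesize, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, (off_t) pageno * (off_t) pagesize);
    return p == MAP_FAILED ? NULL : p;
}

/*
 * Demo: write `npages` pages where page i is filled with the byte 'A' + i,
 * then map only page `want` and return its first byte (-1 on failure).
 */
int first_byte_of_page(const char *path, int npages, int want)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    if (pagesize <= 0 || want < 0 || want >= npages)
        return -1;

    char *buf = malloc((size_t) pagesize);
    if (buf == NULL)
        return -1;

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd == -1) {
        free(buf);
        return -1;
    }

    int ok = 1;
    for (int i = 0; i < npages && ok; i++) {
        memset(buf, 'A' + i, (size_t) pagesize);
        ok = write(fd, buf, (size_t) pagesize) == (ssize_t) pagesize;
    }
    free(buf);

    int result = -1;
    if (ok) {
        char *page = map_file_page(fd, want);   /* only this page is mapped */
        if (page != NULL) {
            result = page[0];
            munmap(page, (size_t) pagesize);
        }
    }

    close(fd);
    unlink(path);
    return result;
}
```

With four pages written, asking for page 2 should return 'C'.  Each backend
could call such a helper for just the pages it needs.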

I'll try to write this separately from Postgres just to see how it works.

Ocie
