Thread: mmap and MAP_ANON

mmap and MAP_ANON

From
Bruce Momjian
Date:
Would people tell me what platforms do NOT support the MAP_ANON flag to
the mmap() system call?  You should find it in the mmap() manual page.

*BSD has it, but I am not sure of the others.  I am researching cache
size issues and the use of mmap vs. SYSV shared memory.

--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Re: [HACKERS] mmap and MAP_ANON

From
dj@pelf.harvard.edu (Diab Jerius)
Date:
I can't find MAP_ANON on Solaris 2.5.1 or 2.5.6.  The man
page claims the following options are available:

          MAP_SHARED               Share changes.
          MAP_PRIVATE              Changes are private.
          MAP_FIXED                Interpret addr exactly.
          MAP_NORESERVE            Don't reserve swap space.


If you'd like, I can send along the whole man page.

---------  Received message begins Here  ---------

>
> Would people tell me what platforms do NOT support the MAP_ANON flag to
> the mmap() system call?  You should find it in the mmap() manual page.
>
> *BSD has it, but I am not sure of the others.  I am researching cache
> size issues and the use of mmap vs. SYSV shared memory.

-------------
Diab Jerius                       Harvard-Smithsonian Center for Astrophysics
                                  60 Garden St, MS 70, Cambridge MA 02138 USA
djerius@cfa.harvard.edu           vox: 617 496 7575         fax: 617 495 7356

Re: [HACKERS] mmap and MAP_ANON

From
"Göran Thyni"
Date:
Bruce Momjian wrote:
>
> Would people tell me what platforms do NOT support the MAP_ANON flag to
> the mmap() system call?  You should find it in the mmap() manual page.
>
> *BSD has it, but I am not sure of the others.  I am researching cache
> size issues and the use of mmap vs. SYSV shared memory.

SVR4 (at least older versions) does not support MAP_ANON,
but the approach recommended in W. Richard Stevens'
"Advanced Programming in the UNIX Environment" (aka the Bible part 2)
is to use /dev/zero.

This should be configurable with autoconf:

<PSEUDO CODE>

if (exists MAP_ANON) use it; else use /dev/zero

------------

flags = MAP_SHARED;
#ifdef HAS_MMAP_ANON
fd = -1;
flags |= MAP_ANON;
#else
fd = open("/dev/zero", O_RDWR);
#endif
area = mmap(0, size, PROT_READ|PROT_WRITE, flags, fd, 0);

</PSEUDO CODE>


    regards,
--
---------------------------------------------
Göran Thyni, sysadm, JMS Bildbasen, Kiruna

Re: [HACKERS] mmap and MAP_ANON

From
Tom Lane
Date:
Bruce Momjian <maillist@candle.pha.pa.us> writes:
> Would people tell me what platforms do NOT support the MAP_ANON flag to
> the mmap() system call?  You should find it in the mmap() manual page.

On HPUX it seems to be spelled MAP_ANONYMOUS.  At least if this means
the same thing as what you are talking about.  The HP man page says

:     The MAP_FILE and MAP_ANONYMOUS flags control whether the region to be
:     mapped is a mapped file region or an anonymous shared memory region.
:     Exactly one of these flags must be selected.

            regards, tom lane
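
Since the flag is spelled MAP_ANON on some systems and MAP_ANONYMOUS on others (HPUX, per the man page quoted above), the two spellings can be folded into one at compile time. This is a minimal sketch, assuming the platform defines at least one of them. map_anon_region() is a hypothetical helper name, not anything in the Postgres tree, and the _DEFAULT_SOURCE guard is a glibc-specific assumption about the build.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* glibc feature-test macro (assumption about the build) */
#endif
#include <stddef.h>
#include <sys/mman.h>

/* Alias the HPUX-style spelling to the BSD-style one when only the
 * former is available. */
#if !defined(MAP_ANON) && defined(MAP_ANONYMOUS)
#define MAP_ANON MAP_ANONYMOUS
#endif

/* Hypothetical helper: returns a zero-filled anonymous mapping of `size`
 * bytes, or NULL if the platform has no anonymous flag or mmap() fails.
 * `sharing` is MAP_SHARED or MAP_PRIVATE. */
void *map_anon_region(size_t size, int sharing)
{
#ifdef MAP_ANON
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   sharing | MAP_ANON, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
#else
    (void) size;
    (void) sharing;
    return NULL;            /* caller must fall back to something else */
#endif
}
```

A caller would test the result for NULL and fall back to another mechanism, for example SysV shared memory, when no anonymous mapping is available.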

Re: [HACKERS] mmap and MAP_ANON

From
ocie@paracel.com
Date:
Bruce Momjian wrote:
>
> Would people tell me what platforms do NOT support the MAP_ANON flag to
> the mmap() system call?  You should find it in the mmap() manual page.

Doesn't seem to appear in Linux (2.0.30 kernel).  As another poster
commented, /dev/zero can be mapped for anonymous memory.

Ocie Mitchell

Re: [HACKERS] mmap and MAP_ANON

From
"Göran Thyni"
Date:
Göran Thyni wrote:
>
> Bruce Momjian wrote:
> >
> > Would people tell me what platforms do NOT support the MAP_ANON flag to
> > the mmap() system call?  You should find it in the mmap() manual page.
> >
> > *BSD has it, but I am not sure of the others.  I am researching cache
> > size issues and the use of mmap vs. SYSV shared memory.
>
> SVR4 (at least older versions) does not support MAP_ANON,
> but the approach recommended in W. Richard Stevens'
> "Advanced Programming in the UNIX Environment" (aka the Bible part 2)
> is to use /dev/zero.
>
> This should be configurable with autoconf:
>
> <PSEUDO CODE>
>
> if (exists MAP_ANON) use it; else use /dev/zero
>
> ------------
>
> flags = MAP_SHARED;
> #ifdef HAS_MMAP_ANON
> fd = -1;
> flags |= MAP_ANON;
> #else
> fd = open("/dev/zero", O_RDWR);
> #endif
> area = mmap(0, size, PROT_READ|PROT_WRITE, flags, fd, 0);
>
> </PSEUDO CODE>

Ouch, hate to say this but:
I played around with this last night and
I can't get either of the above techniques to work with Linux 2.0.33.

I will try it with the upcoming 2.2,
but for now, we can't lose shmem without losing
a large part of the users (including some developers).

<PSEUDO CODE>
#ifdef HAS_WORKING_MMAP
flags = MAP_SHARED;
#ifdef HAS_MMAP_ANON
fd = -1;
flags |= MAP_ANON;
#else
fd = open("/dev/zero", O_RDWR);
#endif
area = mmap(0, size, PROT_READ|PROT_WRITE, flags, fd, 0);
#else
id = shmget(...);
area = shmat(...);
#endif
</PSEUDO CODE>

    not happy,
--
---------------------------------------------
Göran Thyni, sysadm, JMS Bildbasen, Kiruna
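
The three-way fallback in the pseudocode above can also be decided at run time instead of by configure. The following is only a sketch under stated assumptions: alloc_shared() is a hypothetical name, the segment is created with IPC_PRIVATE so that only forked children can use it, and the compile-time HAS_WORKING_MMAP test is swapped for a check of mmap()'s return value.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* glibc feature-test macro (assumption about the build) */
#endif
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/mman.h>
#include <sys/shm.h>

#if !defined(MAP_ANON) && defined(MAP_ANONYMOUS)
#define MAP_ANON MAP_ANONYMOUS
#endif

/* Hypothetical helper: try a shared anonymous mapping first and fall back
 * to a SysV segment. *used_sysv is set to 1 if the fallback path ran.
 * Returns NULL if both methods fail. */
void *alloc_shared(size_t size, int *used_sysv)
{
    *used_sysv = 0;
#ifdef MAP_ANON
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANON, -1, 0);
    if (p != MAP_FAILED)
        return p;
#endif
    /* Fallback: SysV shared memory. */
    int id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
    if (id == -1)
        return NULL;
    void *q = shmat(id, NULL, 0);
    /* Mark the segment for removal now; it is destroyed once the last
     * attached process detaches or exits. */
    shmctl(id, IPC_RMID, NULL);
    if (q == (void *) -1)
        return NULL;
    *used_sysv = 1;
    return q;
}
```

Either way the region has to be created before fork(), so that the children inherit it at the same address.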

Re: [HACKERS] mmap and MAP_ANON

From
Bruce Momjian
Date:
>
> Bruce Momjian wrote:
> >
> > Would people tell me what platforms do NOT support the MAP_ANON flag to
> > the mmap() system call?  You should find it in the mmap() manual page.
>
> Doesn't seem to appear in Linux (2.0.30 kernel).  As another poster
> commented, /dev/zero can be mapped for anonymous memory.
>

OK, who doesn't have /dev/zero?



Re: [HACKERS] mmap and MAP_ANON

From
Bruce Momjian
Date:
>
> Göran Thyni wrote:
> >
> > Bruce Momjian wrote:
> > >
> > > Would people tell me what platforms do NOT support the MAP_ANON flag to
> > > the mmap() system call?  You should find it in the mmap() manual page.
> > >
> > > *BSD has it, but I am not sure of the others.  I am researching cache
> > > size issues and the use of mmap vs. SYSV shared memory.
> >
> > SVR4 (at least older versions) does not support MAP_ANON,
> > but the approach recommended in W. Richard Stevens'
> > "Advanced Programming in the UNIX Environment" (aka the Bible part 2)
> > is to use /dev/zero.
> >
> > This should be configurable with autoconf:
> >
> > <PSEUDO CODE>
> >
> > if (exists MAP_ANON) use it; else use /dev/zero
> >
> > ------------
> >
> > flags = MAP_SHARED;
> > #ifdef HAS_MMAP_ANON
> > fd = -1;
> > flags |= MAP_ANON;
> > #else
> > fd = open("/dev/zero", O_RDWR);
> > #endif
> > area = mmap(0, size, PROT_READ|PROT_WRITE, flags, fd, 0);
> >
> > </PSEUDO CODE>
>
> Ouch, hate to say this but:
> I played around with this last night and
> I can't get either of the above techniques to work with Linux 2.0.33.
>
> I will try it with the upcoming 2.2,
> but for now, we can't lose shmem without losing
> a large part of the users (including some developers).
>
> <PSEUDO CODE>
> #ifdef HAS_WORKING_MMAP
> flags = MAP_SHARED;
> #ifdef HAS_MMAP_ANON
> fd = -1;
> flags |= MAP_ANON;
> #else
> fd = open("/dev/zero", O_RDWR);
> #endif
> area = mmap(0, size, PROT_READ|PROT_WRITE, flags, fd, 0);
> #else
> id = shmget(...);
> area = shmat(...);
> #endif
> </PSEUDO CODE>
>

What exactly did not work?


Re: [HACKERS] mmap and MAP_ANON

From
ocie@paracel.com
Date:
Bruce Momjian wrote:
>
> >
> > Bruce Momjian wrote:
> > >
> > > Would people tell me what platforms do NOT support the MAP_ANON flag to
> > > the mmap() system call?  You should find it in the mmap() manual page.
> >
> > Doesn't seem to appear in Linux (2.0.30 kernel).  As another poster
> > commented, /dev/zero can be mapped for anonymous memory.
> >
>
> OK, who doesn't have /dev/zero?

I have been playing around with mmap on Linux.  I have been unable to
mmap /dev/zero or to use MAP_ANON in conjunction with MAP_SHARED.
There is no problem sharing memory when a real file is used.
Solaris-sparc seems to have no trouble sharing memory mapped from
/dev/zero.  Very strange.

Ocie

Re: [HACKERS] mmap and MAP_ANON

From
Bruce Momjian
Date:
>
> Bruce Momjian wrote:
> >
> > >
> > > Bruce Momjian wrote:
> > > >
> > > > Would people tell me what platforms do NOT support the MAP_ANON flag to
> > > > the mmap() system call?  You should find it in the mmap() manual page.
> > >
> > > Doesn't seem to appear in Linux (2.0.30 kernel).  As another poster
> > > commented, /dev/zero can be mapped for anonymous memory.
> > >
> >
> > OK, who doesn't have /dev/zero?
>
> I have been playing around with mmap on Linux.  I have been unable to
> mmap /dev/zero or to use MAP_ANON in conjunction with MAP_SHARED.
> There is no problem sharing memory when a real file is used.
> Solaris-sparc seems to have no trouble sharing memory mapped from
> /dev/zero.  Very strange.

And very bad.  We have to have a 100% usable solution, or else have code
that uses MAP_ANON where it works and falls back to shared memory where
it does not.


Re: [HACKERS] mmap and MAP_ANON

From
"Göran Thyni"
Date:
Bruce Momjian wrote:
> > Göran Thyni wrote:
> >
> > Ouch, hate to say this but:
> > I played around with this last night and
> > I can't get either of the above techniques to work with Linux 2.0.33.
> >
> > I will try it with the upcoming 2.2,
> > but for now, we can't lose shmem without losing
> > a large part of the users (including some developers).
> >
> > <PSEUDO CODE>
> > #ifdef HAS_WORKING_MMAP
> > flags = MAP_SHARED;
> > #ifdef HAS_MMAP_ANON
> > fd = -1;
> > flags |= MAP_ANON;
> > #else
> > fd = open("/dev/zero", O_RDWR);
> > #endif
> > area = mmap(0, size, PROT_READ|PROT_WRITE, flags, fd, 0);
> > #else
> > id = shmget(...);
> > area = shmat(...);
> > #endif
> > </PSEUDO CODE>
> >
>
> What exactly did not work?

OK, here's the story:

Linux can only MAP_SHARED if the file is a *real* file;
devices, or tricks like MAP_ANON, only work with MAP_PRIVATE.

2.1.101 does not work either, which means 2.2 will probably not
implement this feature (a feature freeze is in effect for 2.2).

*But*,
(I was thinking about this,)
we should IMHO take a step back to get a better view
of the whole memory subsystem.
- Why and for what is shared memory used in the first place?
- Could we use mmap:ing of files at a higher level than
  src/backend/storage/ipc/ipc.c to get even better performance
  and cleanliness?

I will, time permitting, look into cleaning up the shmem-init/exit
routines to work in a "no-exec" environment. I also have a hack to use
mmap-shared/private, which of course is untested, since it does not
work on my linux-boxen.

    regards,
--
---------------------------------------------
Göran Thyni, sysadm, JMS Bildbasen, Kiruna

Re: [HACKERS] mmap and MAP_ANON

From
Bruce Momjian
Date:
> *But*,
> (I was thinking about this,)
> we should IMHO take a step back to get a better view
> of the whole memory subsystem.
> - Why and for what is shared memory used in the first place?
> - Could we use mmap:ing of files at a higher level than
>   src/backend/storage/ipc/ipc.c to get even better performance
>   and cleanliness?

Yes, we could use mmap() to map the actual files.  I will post
timings on this soon.

The shared memory acts as a cache for us that can be locked, and that
does not have to be copied in and out of the address space on each
access, as happens when we rely on the OS buffer cache.

>
> I will, time permitting, look into cleaning up the shmem-init/exit
> routines to work in a "no-exec" environment. I also have a hack to use
> mmap-shared/private, which of course is untested, since it does not
> work on my linux-boxen.

Re: [HACKERS] mmap and MAP_ANON

From
Tom Lane
Date:
"Göran Thyni" <goran@bildbasen.se> writes:
> Linux can only MAP_SHARED if the file is a *real* file;
> devices, or tricks like MAP_ANON, only work with MAP_PRIVATE.

Well, this makes some sense: MAP_SHARED implies that the shared memory
will also be accessible to independently started processes, and
to do that you have to have an openable filename to refer to the
data segment by.

MAP_PRIVATE will *not* work for our purposes: according to my copy
of mmap(2):

:     If MAP_PRIVATE is set in flags:
:          o    Modification to the mapped region by the calling process is
:               not visible to other processes which have mapped the same
:               region using either MAP_PRIVATE or MAP_SHARED.
:               Modifications are not visible to descendant processes that
:               have inherited the mapped region across a fork().

so privately mapped segments are useless for interprocess communication,
even after we get rid of exec().
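
The man page text above can be checked with a small fork() experiment. This is a sketch, assuming a system where anonymous mappings (shared and private) work at all; child_write_seen_by_parent() is a hypothetical name.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* glibc feature-test macro (assumption about the build) */
#endif
#include <stddef.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#if !defined(MAP_ANON) && defined(MAP_ANONYMOUS)
#define MAP_ANON MAP_ANONYMOUS
#endif

/* Hypothetical helper: map an anonymous region with the given sharing flag
 * (MAP_SHARED or MAP_PRIVATE), let a forked child store 1 in its first
 * byte, and return what the parent reads afterwards (-1 on error). */
int child_write_seen_by_parent(int sharing)
{
    const size_t len = 4096;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   sharing | MAP_ANON, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    int seen = -1;
    pid_t pid = fork();
    if (pid == 0) {
        p[0] = 1;               /* child: modify the inherited mapping */
        _exit(0);
    }
    if (pid > 0 && waitpid(pid, NULL, 0) == pid)
        seen = p[0];            /* parent: read after the child has exited */
    munmap(p, len);
    return seen;
}
```

With MAP_PRIVATE the child's store lands in its own copy-on-write page, so the parent still reads 0. With MAP_SHARED, on a system that supports shared anonymous mappings, the parent reads 1.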

mmaping /dev/zero, as has been suggested earlier in this thread,
seems like a really bad idea to me.  Would that not imply that
any process anywhere in the system that also decides to mmap /dev/zero
would get its hands on the Postgres shared memory segment?  You
can't restrict permissions on /dev/zero to prevent it.

Am I right in thinking that the contents of the shared memory segment
do not need to outlive a particular postmaster run?  (If they do, then
we have to mmap a real file anyway.)  If so, then MAP_ANON(YMOUS) is
a reasonable solution on systems that support it.  On those that
don't support it, we will have to mmap a real file owned by (and only
readable/writable by) the postgres user.  Time for another configure
test.

BTW, /dev/zero doesn't exist anyway on HPUX 9.

            regards, tom lane
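
For the systems without MAP_ANON, the "real file owned by and only readable/writable by the postgres user" option could look roughly like this. It is a sketch only: map_owner_only_file() and any path passed to it are hypothetical, and nothing here addresses the cost of the kernel flushing dirty pages to disk.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* glibc feature-test macro (assumption about the build) */
#endif
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: create (or reuse) `path` with owner-only
 * permissions, size it to `size` bytes, and map it MAP_SHARED.
 * Returns NULL on any failure. */
void *map_owner_only_file(const char *path, size_t size)
{
    int fd = open(path, O_RDWR | O_CREAT, S_IRUSR | S_IWUSR);
    if (fd == -1)
        return NULL;

    /* Tighten the mode in case the file already existed with wider
     * permissions, then give the file its full length. */
    if (fchmod(fd, S_IRUSR | S_IWUSR) == -1 ||
        ftruncate(fd, (off_t) size) == -1) {
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping stays valid after close() */
    return (p == MAP_FAILED) ? NULL : p;
}
```

Unlike /dev/zero, access to this region is governed by ordinary file permissions, which is the point of the objection above.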

Re: [HACKERS] mmap and MAP_ANON

From
Bruce Momjian
Date:
>
> "Göran Thyni" <goran@bildbasen.se> writes:
> > Linux can only MAP_SHARED if the file is a *real* file;
> > devices, or tricks like MAP_ANON, only work with MAP_PRIVATE.
>
> Well, this makes some sense: MAP_SHARED implies that the shared memory
> will also be accessible to independently started processes, and
> to do that you have to have an openable filename to refer to the
> data segment by.
>
> MAP_PRIVATE will *not* work for our purposes: according to my copy
> of mmap(2):

Right.
> so privately mapped segments are useless for interprocess communication,
> even after we get rid of exec().

Yep.

>
> mmaping /dev/zero, as has been suggested earlier in this thread,
> seems like a really bad idea to me.  Would that not imply that
> any process anywhere in the system that also decides to mmap /dev/zero
> would get its hands on the Postgres shared memory segment?  You
> can't restrict permissions on /dev/zero to prevent it.

Good point.

>
> Am I right in thinking that the contents of the shared memory segment
> do not need to outlive a particular postmaster run?  (If they do, then
> we have to mmap a real file anyway.)  If so, then MAP_ANON(YMOUS) is
> a reasonable solution on systems that support it.  On those that
> don't support it, we will have to mmap a real file owned by (and only
> readable/writable by) the postgres user.  Time for another configure
> test.

MAP_ANON is the best, because it can be restricted to only postmaster
children.

The problem with using a real file is that the filesystem is going to be
flushing those dirty pages to disk, and that could really hurt
performance.

Actually, when I install Informix, I always have to modify the kernel to
allow a larger amount of SYSV shared memory.  Maybe we just need to give
people per-OS instructions on how to do that.  Under BSD/OS, I now have
32MB of shared memory, or 3900 8k shared buffers.



Re: [HACKERS] mmap and MAP_ANON

From
ocie@paracel.com
Date:
Tom Lane wrote:
>
> "Göran Thyni" <goran@bildbasen.se> writes:
> > Linux can only MAP_SHARED if the file is a *real* file;
> > devices, or tricks like MAP_ANON, only work with MAP_PRIVATE.
>
> Well, this makes some sense: MAP_SHARED implies that the shared memory
> will also be accessible to independently started processes, and
> to do that you have to have an openable filename to refer to the
> data segment by.
>
> MAP_PRIVATE will *not* work for our purposes: according to my copy
> of mmap(2):
>
> :     If MAP_PRIVATE is set in flags:
> :          o    Modification to the mapped region by the calling process is
> :               not visible to other processes which have mapped the same
> :               region using either MAP_PRIVATE or MAP_SHARED.
> :               Modifications are not visible to descendant processes that
> :               have inherited the mapped region across a fork().
>
> so privately mapped segments are useless for interprocess communication,
> even after we get rid of exec().
>
> mmaping /dev/zero, as has been suggested earlier in this thread,
> seems like a really bad idea to me.  Would that not imply that
> any process anywhere in the system that also decides to mmap /dev/zero
> would get its hands on the Postgres shared memory segment?  You
> can't restrict permissions on /dev/zero to prevent it.

On some systems, mmaping /dev/zero can be shared with child processes
as in this example:

#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>     /* exit() */
#include <string.h>     /* memset() */

int main(void)
{
  int fd;
  char *ma;
  pid_t pid;
  long pagesize = sysconf(_SC_PAGESIZE);

  fd=open("/dev/zero",O_RDWR);
  if (fd==-1) {
    perror("open");
    exit(1);
  }

  /* one shared, zero-filled page backed by /dev/zero */
  ma=mmap(NULL,
      pagesize,
      (PROT_READ|PROT_WRITE),
      MAP_SHARED,
      fd,
      0);

  if (ma == MAP_FAILED) {
    perror("mmap");
    exit(1);
  }

  memset(ma,0,pagesize);

  pid=fork();

  if (pid==-1) {
    perror("fork");
    exit(1);
  }

  if (pid==0) { /* child */
    ma[0]=1;
    sleep(1);   /* give the parent time to write its byte */
    printf("child %d %d\n",ma[0],ma[1]);
    sleep(1);
    return 0;
  } else { /* parent */
    ma[1]=1;
    sleep(1);   /* give the child time to write its byte */
    printf("parent %d %d\n",ma[0],ma[1]);
  }

  wait(NULL);
  munmap(ma,pagesize);   /* unmap exactly what was mapped */

  return 0;
}


This works on Solaris and, as expected, both the parent and child are
able to write into the memory and their changes are honored (the
memory is truly shared between processes).  We can certainly map a
real file, and this might even give us some interesting crash recovery
options.  The nice thing about doing away with the exec is that the
memory mapped in the parent process is available at the same address
region in every process, so we don't have to do funky pointer tricks.

The only problem I see with mmap is that we don't know exactly when a
page will be written to disk.  I.e., if you make two writes, the page
might get sync'ed between them, thus storing an inconsistent
intermediate state to the disk.  Perhaps with proper transaction
control, this is not a problem.

The question is whether the individual database files should be mapped
into memory, or whether one "pgmem" file should be mapped, with pages
from different files read into it.  The first option would allow different
backend processes to map different pages of different files as they
are needed.  The postmaster could "pre-map" pages on behalf of the
backend processes as a sort of intelligent read-ahead mechanism.

I'll try to write this separate from Postgres just to see how it works.

Ocie
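
On the question of when a mapped page reaches disk: msync() with MS_SYNC gives a point after which the data has been written, but it does not stop the kernel from writing the page earlier, so the intermediate-state concern above still stands. This is a minimal sketch, assuming a one-page file; flush_record() and the path passed to it are hypothetical.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* glibc feature-test macro (assumption about the build) */
#endif
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: copy `len` bytes of `rec` to the start of a one-page
 * file through a MAP_SHARED mapping, then block until the page is written.
 * Returns 0 on success, -1 on failure. */
int flush_record(const char *path, const char *rec, size_t len)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    if (pagesize <= 0 || len > (size_t) pagesize)
        return -1;

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd == -1)
        return -1;
    if (ftruncate(fd, pagesize) == -1) {
        close(fd);
        return -1;
    }

    char *page = mmap(NULL, (size_t) pagesize, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    close(fd);
    if (page == MAP_FAILED)
        return -1;

    memcpy(page, rec, len);     /* the kernel may write this page back at
                                 * any moment from here on ... */
    int rc = msync(page, (size_t) pagesize, MS_SYNC);
                                /* ... but once msync() returns 0 it has
                                 * been written */
    munmap(page, (size_t) pagesize);
    return rc;
}
```

msync() sets a latest point for the write but not an earliest one, so ordering guarantees would still have to come from transaction control.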

Re: [HACKERS] mmap and MAP_ANON

From
Michal Mosiewicz
Date:
Bruce Momjian wrote:
>
> Would people tell me what platforms do NOT support the MAP_ANON flag to
> the mmap() system call?  You should find it in the mmap() manual page.
>
> *BSD has it, but I am not sure of the others.  I am researching cache
> size issues and the use of mmap vs. SYSV shared memory.

Well, I haven't noticed this discussion. However, I can't understand one
thing:

Why a lot of people investigate how to replace shared memory with
mmapping anonymously but there is no discussion on replacing
reads/writes with memory mapping of heap files.

This way we would save not only on having better system cache
utilisation but also we would have less memory copying. For me it seems
like a more robust solution. I suggested it few months ago.

If it's a bad idea, I wonder why?
Are there any systems that cannot do mmaps at all?

Mike

--
WWW: http://www.lodz.pdi.net/~mimo  tel: Int. Acc. Code + 48 42 148340
add: Michal Mosiewicz  *  Bugaj 66 m.54 *  95-200 Pabianice  *  POLAND

Re: [HACKERS] mmap and MAP_ANON

From
Bruce Momjian
Date:
>
> Bruce Momjian wrote:
> >
> > Would people tell me what platforms do NOT support the MAP_ANON flag to
> > the mmap() system call?  You should find it in the mmap() manual page.
> >
> > *BSD has it, but I am not sure of the others.  I am researching cache
> > size issues and the use of mmap vs. SYSV shared memory.
>
> Well, I haven't noticed this discussion. However, I can't understand one
> thing:
>
> Why a lot of people investigate how to replace shared memory with
> mmapping anonymously but there is no discussion on replacing
> reads/writes with memory mapping of heap files.
>
> This way we would save not only on having better system cache
> utilisation but also we would have less memory copying. For me it seems
> like a more robust solution. I suggested it few months ago.
>
> If it's a bad idea, I wonder why?
> Are there any systems that cannot do mmaps at all?

mmap'ing a file is not necessarily faster.  I will post timings soon
that show this.


Re: [HACKERS] mmap and MAP_ANON

From
dg@illustra.com (David Gould)
Date:
Michal Mosiewicz asks:
> Why a lot of people investigate how to replace shared memory with
> mmapping anonymously but there is no discussion on replacing
> reads/writes with memory mapping of heap files.
>
> This way we would save not only on having better system cache
> utilisation but also we would have less memory copying. For me it seems
> like a more robust solution. I suggested it few months ago.
>
> If it's a bad idea, I wonder why?

Unfortunately, it is probably a bad idea.

The postgres buffer cache is a shared pool of pages containing an assortment
of blocks from all the different tables in use by all the different backends.

That is, if backend 'a' is reading table 'ta', and backend 'b' is reading
table 'tb' then the buffer cache will have blocks from both table 'ta'
and table 'tb' in it.

The benefit occurs when backend 'x' starts reading either table 'ta' or 'tb'.
Rather than have to go to disk, it finds the pages already loaded in the
shared buffer cache. Likewise, if backend 'a' should modify a page in table
'ta', the change is then visible to all the other backends (ignoring locks
for this discussion) without any explicit communication between the backends.

If we started creating a separate mmapped region for each table several
problems occur:

 - each time a backend wants to use a table it will have to somehow find out
   if it is already mapped, and then either map it (for the first time), or
   attach to an existing mapping created by another backend. This implies
   that the backends need to communicate with all the other backends to let
   them know what mappings they are using.

 - if two backends are using the same table, and the table is too big to
   map the whole thing, then each backend needs a "window" into the table.
   This becomes difficult if the two backends are using different parts of
   the table (ie, the first page and the last page).

 - there is a finite amount of memory available on the system for postgres
   to use. This will have to be split among all the open tables used by
   all the backends. If you have 50 backends each using 10 tables, each with 3
   indexes, you now need 2,000 mappings in the system. Assuming that there
   are 2001 pages available for mapping, how do you decide which table gets
   to map 2 pages? How do you get all the backends to agree about this?

Essentially, mapping tables separately creates a requirement for a huge
amount of communication and synchronization among the backends. And, even
if this were not prohibitive, it ends up fragmenting the available memory
for buffers so badly that the caching becomes ineffective.

So, unless you are going to map whole tables and those tables are needed by
_all_ the active backends the idea of mmapping separate tables is unworkable.

That said, there are tables that meet these criteria, for instance the
transaction logs and anchors. Here mmapping might indeed be useful but even
so it would take some thought and a fair amount of work to gain any benefit.

-dg

David Gould            dg@illustra.com           510.628.3783 or 510.305.9468
Informix Software  (No, really)         300 Lakeside Drive  Oakland, CA 94612
"Of course, someone who knows more about this will correct me if I'm wrong,
 and someone who knows less will correct me if I'm right."
               --David Palmer (palmer@tybalt.caltech.edu)

Re: [HACKERS] mmap and MAP_ANON

From
Michal Mosiewicz
Date:
David Gould wrote:

>  - each time a backend wants to use a table it will have to somehow find out
>    if it is already mapped, and then either map it (for the first time), or
>    attach to an existing mapping created by another backend. This implies
>    that the backends need to communicate with all the other backends to let
>    them know what mappings they are using.

Why backend has to check if it's already mapped? Let's say that backend
A maps first page from file X using MAP_SHARED, then backend B maps
first page using MAP_SHARED. So, at this moment they are pointing to the
same memory area without any communication. (at least that's the way it
works on Linux, in Linux even MAP_PRIVATE is the same memory region when
you mmap it twice until you write a byte in there - then it's copied).
So, why would we check what other backends map. We use MAP_SHARED to not
have to check it.

>  - if two backends are using the same table, and the table is too big to
>    map the whole thing, then each backend needs a "window" into the table.
>    This becomes difficult if the two backends are using different parts of
>    the table (ie, the first page and the last page).

Well I wasn't even thinking on mapping anything more than just one page
that is needed.

>  - there is a finite amount of memory available on the system for postgres
>    to use. This will have to be split among all the open tables used by
>    all the backends. If you have 50 backends each using 10 tables, each with 3
>    indexes, you now need 2,000 mappings in the system. Assuming that there
>    are 2001 pages available for mapping, how do you decide which table gets
>    to map 2 pages? How do you get all the backends to agree about this?

IMHO, this is also not that much problem as it looks like. When the
system is running out of virtual memory, the occupied pages are
paged-out. The system does what actually buffer manager does - it writes
down the pages that are dirty, and simply frees memory from those that
are not modified on a least recently used basis. So the only thing that
costs are the memory structures that describe the bindings between disk
blocks and memory. And of course it's sometimes bad to use LRU
algorithm. Sometimes backend knows better which pages are best to
page-out.

I have to admit that this point seems to be potential source of
performance drop-downs and all the backends have to communicate to
prevent it. But I don't think that this communication is huge. Note that
currently all backends use quite large communication channel (256 pages
large by default?) which is hardly used for communication purposes but
rather for storage.

Mike

--
WWW: http://www.lodz.pdi.net/~mimo  tel: Int. Acc. Code + 48 42 148340
add: Michal Mosiewicz  *  Bugaj 66 m.54 *  95-200 Pabianice  *  POLAND
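
The claim that two MAP_SHARED mappings of the same file page refer to the same memory without any coordination can be checked directly. In this sketch, two separate open()/mmap() pairs in one process stand in for backends A and B; two_mappings_share_page() and the path passed to it are hypothetical.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* glibc feature-test macro (assumption about the build) */
#endif
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map the first page of `path` twice through two
 * independent descriptors, write through one mapping and read through the
 * other. Returns 1 if the write is visible, 0 if not, -1 on error. */
int two_mappings_share_page(const char *path)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    int result = -1;
    int fd_a = open(path, O_RDWR | O_CREAT, 0600);   /* "backend A" */
    int fd_b = open(path, O_RDWR);                   /* "backend B" */

    if (pagesize > 0 && fd_a != -1 && fd_b != -1 &&
        ftruncate(fd_a, pagesize) == 0) {
        char *a = mmap(NULL, (size_t) pagesize, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd_a, 0);
        char *b = mmap(NULL, (size_t) pagesize, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd_b, 0);
        if (a != MAP_FAILED && b != MAP_FAILED) {
            a[0] = 'A';                         /* write via mapping A */
            result = (a != b && b[0] == 'A');   /* read via mapping B */
        }
        if (a != MAP_FAILED)
            munmap(a, (size_t) pagesize);
        if (b != MAP_FAILED)
            munmap(b, (size_t) pagesize);
    }
    if (fd_a != -1)
        close(fd_a);
    if (fd_b != -1)
        close(fd_b);
    return result;
}
```

The two mappings live at different virtual addresses but are backed by the same page, which is why no explicit communication is needed for the data itself. Locking is a separate matter.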

Re: [HACKERS] mmap and MAP_ANON

From
dg@illustra.com (David Gould)
Date:
This is all old news, but I am trying to catch up on my hackers mail. This
particular post caught my eye as one to think carefully about before replying.

Michal Mosiewicz <mimo@interdata.com.pl> writes:
> David Gould wrote:
>
> >  - each time a backend wants to use a table it will have to somehow find out
> >    if it is already mapped, and then either map it (for the first time), or
> >    attach to an existing mapping created by another backend. This implies
> >    that the backends need to communicate with all the other backends to let
> >    them know what mappings they are using.
>
> Why backend has to check if it's already mapped? Let's say that backend
> A maps first page from file X using MAP_SHARED, then backend B maps
> first page using MAP_SHARED. So, at this moment they are pointing to the
> same memory area without any communication. (at least that's the way it
> works on Linux, in Linux even MAP_PRIVATE is the same memory region when
> you mmap it twice until you write a byte in there - then it's copied).
> So, why would we check what other backends map. We use MAP_SHARED to not
> have to check it.
>
> >  - if two backends are using the same table, and the table is too big to
> >    map the whole thing, then each backend needs a "window" into the table.
> >    This becomes difficult if the two backends are using different parts of
> >    the table (ie, the first page and the last page).
>
> Well I wasn't even thinking on mapping anything more than just one page
> that is needed.

Your statement about not checking if a file was mapped struck me as a problem,
but on second thought, I was thinking about a typical dbms buffer cache, while
you are proposing eliminating the dbms buffer cache and using mmap() to read
file pages directly, relying on the OS cache. I agree that this could work.

And, at least some OSes have pretty good buffer management and quick
mmap() calls. Linux 2.1.101 seems to be able to do a mmap() in 25 usec on
a P166 according to lmbench; BSD and Solaris are quite a bit slower, and
at the really slow end, IRIX and HPUX take hundreds of usec for mmap().

But even given good OS mmap() and buffer management, there may still be
a performance justification for a separate DBMS buffer cache.

Suppose many backends are sharing a small table eg a lookup table with a
few dozen rows, perhaps three pages worth. Suppose that most queries
scan this table several times (eg multiple joins and subqueries). And
suppose most backends run several queries before being restarted.

This gives the situation where all the backends refer to same two or three
pages hundreds or thousands of times each.

In the traditional dbms buffer cache, the first backend to scan the table
does, say, three read()s, and each backend does one mmap() at startup time
to map the buffer cache. This means that a very few system calls suffice
for thousands of accesses to the shared table.

Your proposal, if I have understood it, has one page mmap()ed for the table
by each backend. To get the next page another mmap() has to be done. This
results in three mmap()s per scan for each backend. So, even though the
table is fully cached by the OS, thousands of system calls are needed to
service all the scans. Even on systems with very fast mmap() I think this
may be a significant overhead.

That is, there may be a reason all the highend dbms's use their own buffer
caches.
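
The system-call comparison in the lookup-table example can be put into numbers. The two functions below only encode the counting argument made above (one attach per backend plus one read() per page, versus one mmap() per page per scan per backend). Their names and the figures in the usage note are illustrative assumptions, and munmap() calls are ignored.

```c
/* Shared buffer cache: each backend attaches once at startup, and each
 * table page is read from disk once by whichever backend gets there first. */
long syscalls_buffer_cache(long backends, long table_pages)
{
    return backends + table_pages;
}

/* Page-at-a-time mmap(): every scan by every backend maps each page of
 * the table again. */
long syscalls_per_page_mmap(long backends, long scans_per_backend,
                            long table_pages)
{
    return backends * scans_per_backend * table_pages;
}
```

With, say, 50 backends, 100 scans per backend and a 3-page table, the first model needs 53 calls and the second 15,000. Whether that difference matters in practice is what the trace replay suggested below would measure.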

If you are interested, this could be tested with not too much work. Simply
instrument the buffer manager to trace buffer lookups, and read()s, and
write()s and log this to a file. Then write a simple program to run the
trace file performing the same operations only using mmap(). Try to get
a trace from a busy web site or other heavy duty application using postgres.
I think that this will show that the buffer cache has its place in life.
But, I am prepared to hear otherwise.

> >  - there is a finite amount of memory available on the system for postgres
> >    to use. This will have to be split among all the open tables used by
> >    all the backends. If you have 50 backends each using 10 tables, each with 3
> >    indexes, you now need 2,000 mappings in the system. Assuming that there
> >    are 2001 pages available for mapping, how do you decide which table gets
> >    to map 2 pages? How do you get all the backends to agree about this?
>
> IMHO, this is also not that much problem as it looks like. When the
> system is running out of virtual memory, the occupied pages are
> paged-out. The system does what actually buffer manager does - it writes
> down the pages that are dirty, and simply frees memory from those that
> are not modified on a least recently used basis. So the only thing that
> costs are the memory structures that describe the bindings between disk
> blocks and memory. And of course it's sometimes bad to use LRU
> algorithm. Sometimes backend knows better which pages are best to
> page-out.
>
> I have to admit that this point seems to be potential source of
> performance drop-downs and all the backends have to communicate to
> prevent it. But I don't think that this communication is huge. Note that
> currently all backends use quite large communication channel (256 pages
> large by default?) which is hardly used for communication purposes but
> rather for storage.

Perhaps. Still, to implement this would be a major task. I would prefer to
spend that effort on adding page or row level locking for instance.

-dg

David Gould           dg@illustra.com            510.628.3783 or 510.305.9468
Informix Software                      300 Lakeside Drive   Oakland, CA 94612
 - A child of five could understand this!  Fetch me a child of five.