Thread: mmap and MAP_ANON
Would people tell me what platforms do NOT support the MAP_ANON flag to the mmap() system call? You should find it in the mmap() manual page. *BSD has it, but I am not sure of the others. I am researching cache size issues and the use of mmap vs. SYSV shared memory. -- Bruce Momjian | 830 Blythe Avenue maillist@candle.pha.pa.us | Drexel Hill, Pennsylvania 19026 + If your life is a hard drive, | (610) 353-9879(w) + Christ can be your backup. | (610) 853-3000(h)
I can't find MAP_ANON on Solaris 2.5.1 or 2.5.6. The man page claims the following options are available:

    MAP_SHARED      Share changes.
    MAP_PRIVATE     Changes are private.
    MAP_FIXED       Interpret addr exactly.
    MAP_NORESERVE   Don't reserve swap space.

If you'd like, I can send along the whole man page.

--------- Received message begins Here ---------
>
> Would people tell me what platforms do NOT support the MAP_ANON flag to
> the mmap() system call? You should find it in the mmap() manual page.
>
> *BSD has it, but I am not sure of the others. I am researching cache
> size issues and the use of mmap vs. SYSV shared memory.
>
> --
> Bruce Momjian | 830 Blythe Avenue
> maillist@candle.pha.pa.us | Drexel Hill, Pennsylvania 19026
> + If your life is a hard drive, | (610) 353-9879(w)
> + Christ can be your backup. | (610) 853-3000(h)
>

-------------
Diab Jerius Harvard-Smithsonian Center for Astrophysics
60 Garden St, MS 70, Cambridge MA 02138 USA
djerius@cfa.harvard.edu vox: 617 496 7575 fax: 617 495 7356
Bruce Momjian wrote:
>
> Would people tell me what platforms do NOT support the MAP_ANON flag to
> the mmap() system call? You should find it in the mmap() manual page.
>
> *BSD has it, but I am not sure of the others. I am researching cache
> size issues and the use of mmap vs. SYSV shared memory.

SVR4 (at least older versions) does not support MAP_ANON, but the approach recommended in W. Richard Stevens' "Advanced Programming in the UNIX Environment" (aka the Bible part 2) is to map /dev/zero instead.

This should be configurable with autoconf:

<PSEUDO CODE>

if (exists MAP_ANON) use it; else use /dev/zero

------------

flags = MAP_SHARED;
#ifdef HAS_MMAP_ANON
fd = -1;
flags |= MAP_ANON;
#else
fd = open("/dev/zero", O_RDWR);
#endif
area = mmap(0, size, PROT_READ|PROT_WRITE, flags, fd, 0);

</PSEUDO CODE>

regards,
--
---------------------------------------------
Göran Thyni, sysadm, JMS Bildbasen, Kiruna
Bruce Momjian <maillist@candle.pha.pa.us> writes: > Would people tell me what platforms do NOT support the MAP_ANON flag to > the mmap() system call? You should find it in the mmap() manual page. On HPUX it seems to be spelled MAP_ANONYMOUS. At least if this means the same thing as what you are talking about. The HP man page says : The MAP_FILE and MAP_ANONYMOUS flags control whether the region to be : mapped is a mapped file region or an anonymous shared memory region. : Exactly one of these flags must be selected. regards, tom lane
Bruce Momjian wrote: > > Would people tell me what platforms do NOT support the MAP_ANON flag to > the mmap() system call? You should find it in the mmap() manual page. Doesn't seem to appear in Linux (2.0.30 kernel). As another poster commented, /dev/zero can be mapped for anonymous memory. Ocie Mitchell
Göran Thyni wrote:
>
> Bruce Momjian wrote:
> >
> > Would people tell me what platforms do NOT support the MAP_ANON flag to
> > the mmap() system call? You should find it in the mmap() manual page.
> >
> > *BSD has it, but I am not sure of the others. I am researching cache
> > size issues and the use of mmap vs. SYSV shared memory.
>
> SVR4 (at least older ones) does not support MMAP_ANON,
> but the recommended in W. Richards Stevens'
> "Advanced programming in the Unix environment" (aka the Bible part 2)
> is to use /dev/zero.
>
> This should be configurable with autoconf:
>
> <PSEUDO CODE>
>
> if (exists MAP_ANON) use it; else use /dev/zero
>
> ------------
>
> flags = MAP_SHARED;
> #ifdef HAS_MMAP_ANON
> fd = -1;
> flags |= MAP_ANON;
> #else
> fd = open('/dev/zero, O_RDWR);
> #endif
> area = mmap(0, size, PROT_READ|PROT_WRITE, flags, fd, 0);
>
> </PSEUDO CODE>

Ouch, hate to say this, but: I played around with this last night and I can't get either of the above techniques to work with Linux 2.0.33.

I will try it with the upcoming 2.2, but for now we can't lose shmem without losing a large part of the users (including some developers).

<PSEUDO CODE>
#ifdef HAS_WORKING_MMAP
flags = MAP_SHARED;
#ifdef HAS_MMAP_ANON
fd = -1;
flags |= MAP_ANON;
#else
fd = open("/dev/zero", O_RDWR);
#endif
area = mmap(0, size, PROT_READ|PROT_WRITE, flags, fd, 0);
#else
id = shmget(...);
area = shmat(...);
#endif
</PSEUDO CODE>

not happy,
--
---------------------------------------------
Göran Thyni, sysadm, JMS Bildbasen, Kiruna
> > Bruce Momjian wrote: > > > > Would people tell me what platforms do NOT support the MAP_ANON flag to > > the mmap() system call? You should find it in the mmap() manual page. > > Doesn't seem to appear in Linux (2.0.30 kernel). As another poster > commented, /dev/zero can be mapped for anonymous memory. > OK, who doesn't have /dev/zero? -- Bruce Momjian | 830 Blythe Avenue maillist@candle.pha.pa.us | Drexel Hill, Pennsylvania 19026 + If your life is a hard drive, | (610) 353-9879(w) + Christ can be your backup. | (610) 853-3000(h)
>
> Göran Thyni wrote:
> >
> > Bruce Momjian wrote:
> > >
> > > Would people tell me what platforms do NOT support the MAP_ANON flag to
> > > the mmap() system call? You should find it in the mmap() manual page.
> > >
> > > *BSD has it, but I am not sure of the others. I am researching cache
> > > size issues and the use of mmap vs. SYSV shared memory.
> >
> > SVR4 (at least older ones) does not support MMAP_ANON,
> > but the recommended in W. Richards Stevens'
> > "Advanced programming in the Unix environment" (aka the Bible part 2)
> > is to use /dev/zero.
> >
> > This should be configurable with autoconf:
> >
> > <PSEUDO CODE>
> >
> > if (exists MAP_ANON) use it; else use /dev/zero
> >
> > ------------
> >
> > flags = MAP_SHARED;
> > #ifdef HAS_MMAP_ANON
> > fd = -1;
> > flags |= MAP_ANON;
> > #else
> > fd = open('/dev/zero, O_RDWR);
> > #endif
> > area = mmap(0, size, PROT_READ|PROT_WRITE, flags, fd, 0);
> >
> > </PSEUDO CODE>
>
> Ouch, hate to say this but:
> I played around with this last night and
> I can't get either of the above technics to work with Linux 2.0.33
>
> I will try it with the upcoming 2.2,
> but for now, we can't loose shmem without loosing
> a large part of the users (including some developers).
> flags = MAP_SHARED;
>
> <PSEUDO CODE>
> #ifdef HAS_WORKING_MMAP
> #ifdef HAS_MMAP_ANON
> fd = -1;
> flags |= MAP_ANON;
> #else
> fd = open('/dev/zero, O_RDWR);
> #endif
> area = mmap(0, size, PROT_READ|PROT_WRITE, flags, fd, 0);
> #else
> id = shget(...);
> area = shmat(...);
> #endif
> </PSEUDO CODE>
>

What exactly did not work?

--
Bruce Momjian | 830 Blythe Avenue
maillist@candle.pha.pa.us | Drexel Hill, Pennsylvania 19026
+ If your life is a hard drive, | (610) 353-9879(w)
+ Christ can be your backup. | (610) 853-3000(h)
Bruce Momjian wrote: > > > > > Bruce Momjian wrote: > > > > > > Would people tell me what platforms do NOT support the MAP_ANON flag to > > > the mmap() system call? You should find it in the mmap() manual page. > > > > Doesn't seem to appear in Linux (2.0.30 kernel). As another poster > > commented, /dev/zero can be mapped for anonymous memory. > > > > OK, who doesn't have /dev/zero? I have been playing around with mmap on Linux. I have been unable to mmap /dev/zero or to use MAP_ANON in conjunction with MAP_SHARED. There is no problem sharing memory when a real file is used. Solaris-sparc seems to have no trouble sharing memory mapped from /dev/zero. Very strange. Ocie
> > Bruce Momjian wrote: > > > > > > > > Bruce Momjian wrote: > > > > > > > > Would people tell me what platforms do NOT support the MAP_ANON flag to > > > > the mmap() system call? You should find it in the mmap() manual page. > > > > > > Doesn't seem to appear in Linux (2.0.30 kernel). As another poster > > > commented, /dev/zero can be mapped for anonymous memory. > > > > > > > OK, who doesn't have /dev/zero? > > I have been playing around with mmap on Linux. I have been unable to > mmap /dev/zero or to use MAP_ANON in conjunction with MAP_SHARED. > There is no problem sharing memory when a real file is used. > Solaris-sparc seems to have no trouble sharing memory mapped from > /dev/zero. Very strange. And very bad. We have to have a 100% usable solution, or have some if ANON code, else shared memory. -- Bruce Momjian | 830 Blythe Avenue maillist@candle.pha.pa.us | Drexel Hill, Pennsylvania 19026 + If your life is a hard drive, | (610) 353-9879(w) + Christ can be your backup. | (610) 853-3000(h)
Bruce Momjian wrote:
>
> Göran Thyni wrote:
> >
> > Ouch, hate to say this but:
> > I played around with this last night and
> > I can't get either of the above technics to work with Linux 2.0.33
> >
> > I will try it with the upcoming 2.2,
> > but for now, we can't loose shmem without loosing
> > a large part of the users (including some developers).
> >
> > <PSEUDO CODE>
> > #ifdef HAS_WORKING_MMAP
> > flags = MAP_SHARED;
> > #ifdef HAS_MMAP_ANON
> > fd = -1;
> > flags |= MAP_ANON;
> > #else
> > fd = open('/dev/zero, O_RDWR);
> > #endif
> > area = mmap(0, size, PROT_READ|PROT_WRITE, flags, fd, 0);
> > #else
> > id = shget(...);
> > area = shmat(...);
> > #endif
> > </PSEUDO CODE>
> >
>
> What exactly did not work?

OK, here's the story: Linux can only MAP_SHARED if the file is a *real* file; devices, or tricks like MAP_ANON, only work with MAP_PRIVATE.

2.1.101 does not work either, which means 2.2 will probably not implement this feature (a feature freeze is in effect for 2.2).

*But* (I was thinking about this), we should IMHO take a step back to get a better view of the whole memory subsystem.
- Why, and for what, is shared memory used in the first place?
- Could we use mmap'ing of files at a higher level than src/backend/storage/ipc/ipc.c to get even better performance and cleanliness?

I will, time permitting, look into cleaning up the shmem-init/exit routines to work in a "no-exec" environment. I also have a hack to use mmap-shared/private, which of course is untested, since it does not work on my Linux boxen.

regards,
--
---------------------------------------------
Göran Thyni, sysadm, JMS Bildbasen, Kiruna
> *But*,
> (I was thinking about this,)
> we should IMHO take a step backwards to get a better view
> over the whole memory subsystem.
> - Why and for what is shared memory used in the first place?
> - Could we use mmap:ing of files at a higher level then
> src/backend/strorage/ipc/ipc.c to get even better performance
> and cleaness?

Yes, we could use mmap() to map the actual files. I will post timings on this soon. The shared memory acts as a cache for us: it can be locked, and its pages do not have to be moved in and out of each backend's address space on every access, as they are when we go through the OS buffer cache.

>
> I will, time permitting, look into cleaning up the shmem-init/exit
> routines
> to work in a "no-exec" environment. I also has a hack to use
> mmap-shared/private,
> which of course is untested, since it does not work on my linux-boxen.
>
> regards,
> --
> ---------------------------------------------
> Göran Thyni, sysadm, JMS Bildbasen, Kiruna
>

--
Bruce Momjian | 830 Blythe Avenue
maillist@candle.pha.pa.us | Drexel Hill, Pennsylvania 19026
+ If your life is a hard drive, | (610) 353-9879(w)
+ Christ can be your backup. | (610) 853-3000(h)

"Göran Thyni" <goran@bildbasen.se> writes:
> Linux can only MAP_SHARED if the file is a *real* file,
> devices or trick like MAP_ANON does only work with MAP_PRIVATE.

Well, this makes some sense: MAP_SHARED implies that the shared memory will also be accessible to independently started processes, and to do that you have to have an openable filename to refer to the data segment by.

MAP_PRIVATE will *not* work for our purposes: according to my copy of mmap(2):

: If MAP_PRIVATE is set in flags:
:   o Modification to the mapped region by the calling process is
:     not visible to other processes which have mapped the same
:     region using either MAP_PRIVATE or MAP_SHARED.
:     Modifications are not visible to descendant processes that
:     have inherited the mapped region across a fork().

so privately mapped segments are useless for interprocess communication, even after we get rid of exec().

mmaping /dev/zero, as has been suggested earlier in this thread, seems like a really bad idea to me. Would that not imply that any process anywhere in the system that also decides to mmap /dev/zero would get its hands on the Postgres shared memory segment? You can't restrict permissions on /dev/zero to prevent it.

Am I right in thinking that the contents of the shared memory segment do not need to outlive a particular postmaster run? (If they do, then we have to mmap a real file anyway.) If so, then MAP_ANON(YMOUS) is a reasonable solution on systems that support it. On those that don't support it, we will have to mmap a real file owned by (and only readable/writable by) the postgres user. Time for another configure test.

BTW, /dev/zero doesn't exist anyway on HPUX 9.

regards, tom lane
>
> "Göran Thyni" <goran@bildbasen.se> writes:
> > Linux can only MAP_SHARED if the file is a *real* file,
> > devices or trick like MAP_ANON does only work with MAP_PRIVATE.
>
> Well, this makes some sense: MAP_SHARED implies that the shared memory
> will also be accessible to independently started processes, and
> to do that you have to have an openable filename to refer to the
> data segment by.
>
> MAP_PRIVATE will *not* work for our purposes: according to my copy
> of mmap(2):

Right.

> so privately mapped segments are useless for interprocess communication,
> even after we get rid of exec().

Yep.

>
> mmaping /dev/zero, as has been suggested earlier in this thread,
> seems like a really bad idea to me. Would that not imply that
> any process anywhere in the system that also decides to mmap /dev/zero
> would get its hands on the Postgres shared memory segment? You
> can't restrict permissions on /dev/zero to prevent it.

Good point.

>
> Am I right in thinking that the contents of the shared memory segment
> do not need to outlive a particular postmaster run? (If they do, then
> we have to mmap a real file anyway.) If so, then MAP_ANON(YMOUS) is
> a reasonable solution on systems that support it. On those that
> don't support it, we will have to mmap a real file owned by (and only
> readable/writable by) the postgres user. Time for another configure
> test.

MAP_ANON is the best, because it can be restricted to only postmaster children.

The problem with using a real file is that the filesystem is going to be flushing those dirty pages to disk, and that could really hurt performance.

Actually, when I install Informix, I always have to modify the kernel to allow a larger amount of SYSV shared memory. Maybe we just need to give people per-OS instructions on how to do that. Under BSD/OS, I now have 32MB of shared memory, or 3900 8k shared buffers.
-- Bruce Momjian | 830 Blythe Avenue maillist@candle.pha.pa.us | Drexel Hill, Pennsylvania 19026 + If your life is a hard drive, | (610) 353-9879(w) + Christ can be your backup. | (610) 853-3000(h)
Tom Lane wrote: > > "Göran Thyni" <goran@bildbasen.se> writes: > > Linux can only MAP_SHARED if the file is a *real* file, > > devices or trick like MAP_ANON does only work with MAP_PRIVATE. > > Well, this makes some sense: MAP_SHARED implies that the shared memory > will also be accessible to independently started processes, and > to do that you have to have an openable filename to refer to the > data segment by. > > MAP_PRIVATE will *not* work for our purposes: according to my copy > of mmap(2): > > : If MAP_PRIVATE is set in flags: > : o Modification to the mapped region by the calling process is > : not visible to other processes which have mapped the same > : region using either MAP_PRIVATE or MAP_SHARED. > : Modifications are not visible to descendant processes that > : have inherited the mapped region across a fork(). > > so privately mapped segments are useless for interprocess communication, > even after we get rid of exec(). > > mmaping /dev/zero, as has been suggested earlier in this thread, > seems like a really bad idea to me. Would that not imply that > any process anywhere in the system that also decides to mmap /dev/zero > would get its hands on the Postgres shared memory segment? You > can't restrict permissions on /dev/zero to prevent it. 
On some systems, a mapping of /dev/zero can be shared with child processes, as in this example:

    #include <sys/types.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <sys/wait.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>
    #include <stdlib.h>     /* exit() */
    #include <string.h>     /* memset() */

    int main(void)
    {
        int fd;
        char *ma;
        pid_t pid;
        long pagesize = sysconf(_SC_PAGESIZE);

        fd = open("/dev/zero", O_RDWR);
        if (fd == -1) { perror("open"); exit(1); }

        /* map one page, shared with any children created by fork() */
        ma = mmap(NULL, pagesize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (ma == MAP_FAILED) { perror("mmap"); exit(1); }
        memset(ma, 0, pagesize);

        pid = fork();
        if (pid == -1) { perror("fork"); exit(1); }

        if (pid == 0) {
            /* child: set byte 0, then look for the parent's byte 1 */
            ma[0] = 1;
            sleep(1);
            printf("child %d %d\n", ma[0], ma[1]);
            sleep(1);
            return 0;
        } else {
            /* parent: set byte 1, then look for the child's byte 0 */
            ma[1] = 1;
            sleep(1);
            printf("parent %d %d\n", ma[0], ma[1]);
        }
        wait(NULL);
        munmap(ma, pagesize);   /* unmap exactly what was mapped */
        return 0;
    }

This works on Solaris and, as expected, both the parent and the child are able to write into the memory and their changes are honored (the memory is truly shared between processes).

We can certainly map a real file, and this might even give us some interesting crash recovery options. The nice thing about doing away with the exec is that the memory mapped in the parent process is available at the same address in every process, so we don't have to do funky pointer tricks.

The only problem I see with mmap is that we don't know exactly when a page will be written to disk. I.e., if you make two writes, the page might get synced between them, thus storing an inconsistent intermediate state on disk. Perhaps with proper transaction control, this is not a problem.

The question is whether the individual database files should be mapped into memory, or whether one "pgmem" file should be mapped, with pages from different files read into it. The first option would allow different backend processes to map different pages of different files as they are needed.
The postmaster could "pre-map" pages on behalf of the backend processes as a sort of intelligent read-ahead mechanism. I'll try to write this separately from Postgres just to see how it works. Ocie
Bruce Momjian wrote: > > Would people tell me what platforms do NOT support the MAP_ANON flag to > the mmap() system call? You should find it in the mmap() manual page. > > *BSD has it, but I am not sure of the others. I am researching cache > size issues and the use of mmap vs. SYSV shared memory. Well, I haven't noticed this discussion. However, I can't understand one thing: Why a lot of people investigate how to replace shared memory with mmapping anonymously but there is no discussion on replacing reads/writes with memory mapping of heap files. This way we would save not only on having better system cache utilisation but also we would have less memory copying. For me it seems like a more robust solution. I suggested it few months ago. If it's a bad idea, I wonder why? Are there any systems that cannot do mmaps at all? Mike -- WWW: http://www.lodz.pdi.net/~mimo tel: Int. Acc. Code + 48 42 148340 add: Michal Mosiewicz * Bugaj 66 m.54 * 95-200 Pabianice * POLAND
>
> Bruce Momjian wrote:
> >
> > Would people tell me what platforms do NOT support the MAP_ANON flag to
> > the mmap() system call? You should find it in the mmap() manual page.
> >
> > *BSD has it, but I am not sure of the others. I am researching cache
> > size issues and the use of mmap vs. SYSV shared memory.
>
> Well, I haven't noticed this discussion. However, I can't understand one
> thing:
>
> Why a lot of people investigate how to replace shared memory with
> mmapping anonymously but there is no discussion on replacing
> reads/writes with memory mapping of heap files.
>
> This way we would save not only on having better system cache
> utilisation but also we would have less memory copying. For me it seems
> like a more robust solution. I suggested it few months ago.
>
> If it's a bad idea, I wonder why?
> Are there any systems that cannot do mmaps at all?

mmap'ing a file is not necessarily faster. I will post timings soon that show this.

--
Bruce Momjian | 830 Blythe Avenue
maillist@candle.pha.pa.us | Drexel Hill, Pennsylvania 19026
+ If your life is a hard drive, | (610) 353-9879(w)
+ Christ can be your backup. | (610) 853-3000(h)
Michal Mosiewicz asks:
> Why a lot of people investigate how to replace shared memory with
> mmapping anonymously but there is no discussion on replacing
> reads/writes with memory mapping of heap files.
>
> This way we would save not only on having better system cache
> utilisation but also we would have less memory copying. For me it seems
> like a more robust solution. I suggested it few months ago.
>
> If it's a bad idea, I wonder why?

Unfortunately, it is probably a bad idea.

The postgres buffer cache is a shared pool of pages containing an assortment of blocks from all the different tables in use by all the different backends.

That is, if backend 'a' is reading table 'ta', and backend 'b' is reading table 'tb', then the buffer cache will have blocks from both table 'ta' and table 'tb' in it.

The benefit occurs when backend 'x' starts reading either table 'ta' or 'tb'. Rather than having to go to disk, it finds the pages already loaded in the shared buffer cache. Likewise, if backend 'a' should modify a page in table 'ta', the change is then visible to all the other backends (ignoring locks for this discussion) without any explicit communication between the backends.

If we started creating a separate mmapped region for each table, several problems would occur:

- each time a backend wants to use a table it will have to somehow find out if it is already mapped, and then either map it (for the first time), or attach to an existing mapping created by another backend. This implies that the backends need to communicate with all the other backends to let them know what mappings they are using.

- if two backends are using the same table, and the table is too big to map the whole thing, then each backend needs a "window" into the table. This becomes difficult if the two backends are using different parts of the table (i.e., the first page and the last page).

- there is a finite amount of memory available on the system for postgres to use. This will have to be split among all the open tables used by all the backends. If you have 50 backends each using 10 tables, each with 3 indexes, you now need 2,000 mappings in the system (50 x 10 x (1 table + 3 indexes)). Assuming that there are 2,001 pages available for mapping, how do you decide which table gets to map 2 pages? How do you get all the backends to agree about this?

Essentially, mapping tables separately creates a requirement for a huge amount of communication and synchronization among the backends. And, even if this were not prohibitive, it ends up fragmenting the available memory for buffers so badly that the caching becomes ineffective.

So, unless you are going to map whole tables and those tables are needed by _all_ the active backends, the idea of mmapping separate tables is unworkable.

That said, there are tables that meet these criteria, for instance the transaction logs and anchors. Here mmapping might indeed be useful, but even so it would take some thought and a fair amount of work to gain any benefit.

-dg

David Gould dg@illustra.com 510.628.3783 or 510.305.9468
Informix Software (No, really) 300 Lakeside Drive Oakland, CA 94612
"Of course, someone who knows more about this will correct me if I'm wrong, and someone who knows less will correct me if I'm right." --David Palmer (palmer@tybalt.caltech.edu)
David Gould wrote:
> - each time a backend wants to use a table it will have to somehow find out
>   if it is already mapped, and then either map it (for the first time), or
>   attach to an existing mapping created by another backend. This implies
>   that the backends need to communicate with all the other backends to let
>   them know what mappings they are using.

Why does a backend have to check whether it's already mapped? Let's say that backend A maps the first page of file X using MAP_SHARED, then backend B maps the same page using MAP_SHARED. At this moment they are pointing to the same memory area without any communication. (At least that's the way it works on Linux; in Linux even MAP_PRIVATE mappings refer to the same memory when you mmap a page twice, until you write a byte in there, at which point it's copied.) So why would we check what other backends map? We use MAP_SHARED precisely so that we don't have to check.

> - if two backends are using the same table, and the table is too big to
>   map the whole thing, then each backend needs a "window" into the table.
>   This becomes difficult if the two backends are using different parts of
>   the table (ie, the first page and the last page).

Well, I wasn't even thinking of mapping anything more than just the one page that is needed.

> - there is a finite amount of memory available on the system for postgres
>   to use. This will have to be split amoung all the open tables used by
>   all the backends. If you have 50 backends each using 10 each with 3
>   indexes, you now need 2,000 mappings in the system. Assuming that there
>   are 2001 pages available for mapping, how do you decide with table gets
>   to map 2 pages? How do you get all the backends to agree about this?

IMHO, this is also not as much of a problem as it looks. When the system is running out of virtual memory, the occupied pages are paged out. The system does what the buffer manager actually does: it writes out the pages that are dirty, and simply frees the memory of those that are not modified, on a least-recently-used basis. So the only cost is the memory structures that describe the bindings between disk blocks and memory. And of course it's sometimes bad to use an LRU algorithm; sometimes the backend knows better which pages are best to page out.

I have to admit that this point seems to be a potential source of performance drops, and all the backends have to communicate to prevent it. But I don't think that this communication is huge. Note that currently all backends use a quite large communication channel (256 pages by default?), which is hardly used for communication but rather for storage.

Mike

--
WWW: http://www.lodz.pdi.net/~mimo tel: Int. Acc. Code + 48 42 148340
add: Michal Mosiewicz * Bugaj 66 m.54 * 95-200 Pabianice * POLAND
This is all old news, but I am trying to catch up on my hackers mail. This particular post caught my eye to think carefully about before replying. Michal Mosiewicz <mimo@interdata.com.pl> writes: > David Gould wrote: > > > - each time a backend wants to use a table it will have to somehow find out > > if it is already mapped, and then either map it (for the first time), or > > attach to an existing mapping created by another backend. This implies > > that the backends need to communicate with all the other backends to let > > them know what mappings they are using. > > Why backend has to check if it's already mapped? Let's say that backend > A maps first page from file X using MAP_SHARED, then backend B maps > first page using MAP_SHARED. So, at this moment they are pointing to the > same memory area without any communication. (at least that's the way it > works on Linux, in Linux even MAP_PRIVATE is the same memory region when > you mmap it twice until you write a byte in there - then it's copied). > So, why would we check what other backends map. We use MAP_SHARED to not > have to check it. > > > - if two backends are using the same table, and the table is too big to > > map the whole thing, then each backend needs a "window" into the table. > > This becomes difficult if the two backends are using different parts of > > the table (ie, the first page and the last page). > > Well I wasn't even thinking on mapping anything more than just one page > that is needed. Your statement about not checking if a file was mapped struck me as a problem but on second thought, I was thinking about a typical dbms buffer cache, you are proposing eliminating the dbms buffer cache and using mmap() to read file pages directly relying on the OS cache. I agree that this could work. And, at least some OSes have pretty good buffer management and quick mmap() calls. 
(Linux 2.1.101 seems to be able to do an mmap() in 25 usec on a P166 according to lmbench; BSD and Solaris are quite a bit slower; and at the really slow end, IRIX and HPUX take hundreds of usec for mmap().)

But even given good OS mmap() and buffer management, there may still be a performance justification for a separate DBMS buffer cache.

Suppose many backends are sharing a small table, e.g. a lookup table with a few dozen rows, perhaps three pages' worth. Suppose that most queries scan this table several times (e.g. multiple joins and subqueries). And suppose most backends run several queries before being restarted.

This gives the situation where all the backends refer to the same two or three pages hundreds or thousands of times each.

In the traditional dbms buffer cache, the first backend to scan the table does, say, three read()s, and each backend does one mmap() at startup time to map the buffer cache. This means that a very few system calls suffice for thousands of accesses to the shared table.

Your proposal, if I have understood it, has one page mmap()ed for the table by each backend. To get the next page another mmap() has to be done. This results in three mmap()s per scan for each backend. So, even though the table is fully cached by the OS, thousands of system calls are needed to service all the scans. Even on systems with very fast mmap() I think this may be a significant overhead.

That is, there may be a reason all the high-end dbms's use their own buffer caches.

If you are interested, this could be tested with not too much work. Simply instrument the buffer manager to trace buffer lookups, read()s, and write()s, and log this to a file. Then write a simple program to run the trace file, performing the same operations only using mmap(). Try to get a trace from a busy web site or other heavy-duty application using postgres.

I think that this will show that the buffer cache has its place in life. But, I am prepared to hear otherwise.
> > - there is a finite amount of memory available on the system for postgres > > to use. This will have to be split amoung all the open tables used by > > all the backends. If you have 50 backends each using 10 each with 3 > > indexes, you now need 2,000 mappings in the system. Assuming that there > > are 2001 pages available for mapping, how do you decide with table gets > > to map 2 pages? How do you get all the backends to agree about this? > > IMHO, this is also not that much problem as it looks like. When the > system is running out of virtual memory, the occupied pages are > paged-out. The system does what actually buffer manager does - it writes > down the pages that are dirty, and simply frees memory from those that > are not modified on a last recently used basis. So the only thing that > costs are the memory structures that describe the bindings between disk > blocks and memory. And of course it's sometimes bad to use LRU > algorithm. Sometimes backend knows better which pages are best to > page-out. > > I have to admit that this point seems to be potential source of > performance drop-downs and all the backends have to communicate to > prevent it. But I don't think that this communication is huge. Note that > currently all backends use quite large communication channel (256 pages > large by default?) which is hardly used for communication purposes but > rather for storage. Perhaps. Still, to implement this would be a major task. I would prefer to spend that effort on adding page or row level locking for instance. -dg David Gould dg@illustra.com 510.628.3783 or 510.305.9468 Informix Software 300 Lakeside Drive Oakland, CA 94612 - A child of five could understand this! Fetch me a child of five.