Re: [HACKERS] mmap and MAP_ANON - Mailing list pgsql-hackers
From: dg@illustra.com (David Gould)
Subject: Re: [HACKERS] mmap and MAP_ANON
Msg-id: 9805310449.AA26366@hawk.illustra.com
In response to: Re: [HACKERS] mmap and MAP_ANON (Michal Mosiewicz <mimo@interdata.com.pl>)
List: pgsql-hackers
This is all old news, but I am trying to catch up on my hackers mail.
This particular post caught my eye as one to think carefully about
before replying.

Michal Mosiewicz <mimo@interdata.com.pl> writes:
> David Gould wrote:
> > - each time a backend wants to use a table it will have to somehow
> >   find out if it is already mapped, and then either map it (for the
> >   first time), or attach to an existing mapping created by another
> >   backend. This implies that the backends need to communicate with
> >   all the other backends to let them know what mappings they are
> >   using.
>
> Why does a backend have to check if it's already mapped? Let's say
> that backend A maps the first page from file X using MAP_SHARED, then
> backend B maps the first page using MAP_SHARED. At that moment they
> are pointing to the same memory area without any communication. (At
> least that's the way it works on Linux; in Linux even MAP_PRIVATE is
> the same memory region when you mmap it twice, until you write a byte
> in there, at which point it's copied.) So why would we check what
> other backends map? We use MAP_SHARED so that we don't have to.
>
> > - if two backends are using the same table, and the table is too
> >   big to map the whole thing, then each backend needs a "window"
> >   into the table. This becomes difficult if the two backends are
> >   using different parts of the table (ie, the first page and the
> >   last page).
>
> Well, I wasn't even thinking of mapping anything more than just the
> one page that is needed.

Your statement about not checking whether a file was mapped struck me
as a problem, but on second thought I realized I was thinking of a
typical dbms buffer cache, while you are proposing to eliminate the
dbms buffer cache and use mmap() to read file pages directly, relying
on the OS cache. I agree that this could work. And at least some OSes
have pretty good buffer management and quick mmap() calls (Linux
2.1.101 seems to be able to do an mmap() in 25 usec on a P166
according to lmbench; BSD and Solaris are quite a bit slower, and at
the really slow end, IRIX and HPUX take hundreds of usec per mmap()).

But even given good OS mmap() and buffer management, there may still
be a performance justification for a separate DBMS buffer cache.
Suppose many backends are sharing a small table, e.g. a lookup table
with a few dozen rows, perhaps three pages worth. Suppose that most
queries scan this table several times (e.g. multiple joins and
subqueries). And suppose most backends run several queries before
being restarted. This gives a situation where all the backends refer
to the same two or three pages hundreds or thousands of times each.

With the traditional dbms buffer cache, the first backend to scan the
table does, say, three read()s, and each backend does one mmap() at
startup time to map the buffer cache. A very few system calls thus
suffice for thousands of accesses to the shared table.

Your proposal, if I have understood it, has each backend mmap() one
page of the table at a time; to get the next page, another mmap() has
to be done. This results in three mmap()s per scan for each backend,
so even though the table is fully cached by the OS, thousands of
system calls are needed to service all the scans. Even on systems
with very fast mmap(), I think this may be a significant overhead.
That is, there may be a reason all the high-end dbms's use their own
buffer caches.

If you are interested, this could be tested without too much work.
Simply instrument the buffer manager to trace buffer lookups, read()s,
and write()s, and log these to a file. Then write a simple program
that replays the trace file, performing the same operations but using
mmap().
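A minimal sketch of such a replay program, just to make the idea
concrete. The one-line trace format ("R <file> <blockno>" for a read,
"W <file> <blockno>" for a write) and the fixed 8K block size are
assumptions for illustration, not anything the instrumented buffer
manager would necessarily produce:

/*
 * Sketch of the trace replayer described above (assumed trace format:
 * "R|W <file> <blockno>", one record per line on stdin).  Each record
 * is serviced with a single-page MAP_SHARED mmap(), the way the
 * per-page proposal would work.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define BLCKSZ 8192                     /* postgres disk block size */

int
main(void)
{
    char        op;                     /* 'R' or 'W' */
    char        path[1024];
    long        blkno;

    while (scanf(" %c %1023s %ld", &op, path, &blkno) == 3)
    {
        int         fd = open(path, O_RDWR);
        char       *page;
        volatile char *vpage;

        if (fd < 0)
        {
            perror(path);
            continue;
        }
        /* map just the one page touched, as in the proposal */
        page = mmap(NULL, BLCKSZ, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, (off_t) blkno * BLCKSZ);
        if (page == MAP_FAILED)
        {
            perror("mmap");
            close(fd);
            exit(1);
        }
        vpage = (volatile char *) page;
        if (op == 'W')
            vpage[0] = vpage[0];        /* dirty it; MAP_SHARED writes back */
        else
            (void) vpage[0];            /* just fault the page in */
        munmap(page, BLCKSZ);
        close(fd);
    }
    return 0;
}

Each record costs an open()/mmap()/munmap()/close() round trip, which
is exactly the per-access overhead in question; timing this against a
read()-into-buffer-cache replay of the same trace would make the
difference measurable.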
Try to get a trace from a busy web site or other heavy-duty
application using postgres. I think this will show that the buffer
cache has its place in life. But I am prepared to hear otherwise.

> > - there is a finite amount of memory available on the system for
> >   postgres to use. This will have to be split among all the open
> >   tables used by all the backends. If you have 50 backends each
> >   using 10 tables, each with 3 indexes, you now need 2,000 mappings
> >   in the system. Assuming that there are 2001 pages available for
> >   mapping, how do you decide which table gets to map 2 pages? How
> >   do you get all the backends to agree about this?
>
> IMHO, this is also not as much of a problem as it looks. When the
> system is running out of virtual memory, the occupied pages are paged
> out. The system does what the buffer manager actually does: it writes
> out the pages that are dirty, and simply frees the memory of those
> that are not modified, on a least recently used basis. So the only
> cost is the memory structures that describe the bindings between disk
> blocks and memory. And of course it's sometimes bad to use an LRU
> algorithm; sometimes the backend knows better which pages are best to
> page out.
>
> I have to admit that this point seems to be a potential source of
> performance drops, and that all the backends would have to
> communicate to prevent it. But I don't think that this communication
> is huge. Note that currently all backends use a quite large
> communication channel (256 pages by default?) which is hardly used
> for communication purposes but rather for storage.

Perhaps. Still, to implement this would be a major task. I would
prefer to spend that effort on adding page or row level locking, for
instance.

-dg

David Gould           dg@illustra.com          510.628.3783 or 510.305.9468
Informix Software     300 Lakeside Drive       Oakland, CA 94612
 - A child of five could understand this! Fetch me a child of five.