Re: mmap (was First set of OSDL Shared Mem scalability results, some wierdness ... - Mailing list pgsql-performance
From: Tom Lane
Subject: Re: mmap (was First set of OSDL Shared Mem scalability results, some wierdness ...
Msg-id: 26390.1097875346@sss.pgh.pa.us
In response to: Re: mmap (was First set of OSDL Shared Mem scalability results, some wierdness ... (Sean Chittenden <sean@chittenden.org>)
List: pgsql-performance
Sean Chittenden <sean@chittenden.org> writes:
> Coordination of data isn't necessary if you mmap(2) data as a private
> block, which takes a snapshot of the page at the time you make the
> mmap(2) call and gets copied only when the page is written to. More on
> that later.

We cannot move to a model where different backends have different views
of the same page, which seems to me to be inherent in the idea of using
MAP_PRIVATE for anything. To take just one example, a backend that had
mapped one btree index page some time ago could get completely confused
if that page splits, because it might see the effects of the split in
nearby index pages but not in the one that was split. Or it could follow
an index link to a heap entry that isn't there anymore, or miss an entry
it should have seen. MVCC doesn't save you from this because btree
adjustments happen below the level of transactions. (The first sketch
below shows the divergence in miniature.)

However, the really major difficulty with using mmap is that it breaks
the scheme we are currently using for WAL, because you don't have any
way to restrict how soon a change in an mmap'd page will go to disk.
(No, I don't believe that mlock guarantees this. It says that the page
will not be removed from main memory; it does not specify that, say,
the syncer won't write the contents out anyway.) The second sketch
below illustrates the ordering we would lose.

> Let's look at what happens with a read(2) call. To read(2) data you
> have to have a block of memory to copy data into. Assume your OS of
> choice has a good malloc(3) implementation and it only needs to call
> brk(2) once to extend the process's memory address after the first
> malloc(3) call. There's your first system call, which guarantees one
> context switch.

Wrong. Our reads occur into shared memory allocated at postmaster
startup, remember?

> mmap(2) is a totally different animal in that you don't ever need to
> make calls to read(2): mmap(2) is used in place of those calls (With
> #ifdef and a good abstraction, the rest of PostgreSQL wouldn't know it
> was working with a page of mmap(2)'ed data or need to know that it is).

Instead, you have to worry about address space management and keeping a
consistent view of the data.

> ... If a write(2) system call is issued on a page of mmap(2)'ed data
> (and your operating system supports it, I know FreeBSD does, but don't
> think Linux does), then the page of data is DMA'ed by the network
> controller and sent out without the data needing to be copied into the
> network controller's buffer.

Perfectly irrelevant to Postgres, since there is no situation where we'd
ever write directly from a disk buffer to a socket; in the present
implementation there are at least two levels of copy needed in between
(datatype-specific output function and protocol message assembly). And
that's not even counting the fact that any data item large enough to
make the savings interesting would have been sliced, diced, and
compressed by TOAST.

> ... If you're doing a write(2) or are directly scribbling on an
> mmap(2)'ed page[1], you need to grab some kind of an exclusive lock on
> the page/file (mlock(2) is going to be no more expensive than a
> semaphore, but probably less expensive).

More incorrect information. The locking involved here is done by
LWLockAcquire, which is significantly *less* expensive than a kernel
call in the case where there is no need to block. (If you have to block,
any kernel call to do so is probably about as bad as any other.)
Switching over to mlock would likely make things considerably slower;
the third sketch below shows why the uncontended path matters. In any
case, you didn't actually mean to say mlock did you? It doesn't lock
pages against writes by other processes AFAICS.
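First, to make the MAP_PRIVATE divergence concrete, a minimal
standalone sketch (the file name and page size are hypothetical, error
checking is omitted, and this is not PostgreSQL code):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int   fd = open("testfile", O_RDWR);   /* pre-existing file, >= 1 page */
    char *priv = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE, fd, 0); /* copy-on-write view */
    char *shared = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0); /* coherent view of the file */

    priv[0] = 'X';    /* store forces a private copy of the page */
    shared[0] = 'Y';  /* modifies the file; every MAP_SHARED mapper sees it */

    /* The private mapping is now frozen in the past: prints X, then Y. */
    printf("private sees '%c', shared sees '%c'\n", priv[0], shared[0]);
    return 0;
}

A backend holding the private view is in exactly the position described
above: the rest of the index can move on while its snapshot does not.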
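Second, a sketch of the write-ahead ordering that buffered I/O lets us
enforce (the function and descriptor names here are hypothetical, not
the actual bufmgr/xlog interfaces):

#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192

/* Write out a dirty data page while honoring the write-ahead rule. */
void flush_dirty_page(int wal_fd, int data_fd, const char *page, off_t offset)
{
    /*
     * The WAL describing this page's changes must be durable before the
     * page itself may reach disk.  Because the page sits in our own
     * buffer, nothing touches the data file until we say so.
     */
    fsync(wal_fd);                          /* 1: force the WAL out       */
    pwrite(data_fd, page, BLCKSZ, offset);  /* 2: only now write the page */
}

/*
 * With an mmap'd data file there is no such control point: an ordinary
 * store instruction dirties the page, and the kernel's syncer is free
 * to write it back before step 1 has happened.  mlock(2) only pins the
 * page in RAM; it does not promise the contents won't be written out.
 */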
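Third, a sketch of why a user-space lock wins when uncontended: the fast
path is one atomic instruction and no kernel entry. (This is a
simplified stand-in for LWLockAcquire, not its actual implementation;
PostgreSQL sleeps on a semaphore in the contended path, for which
sched_yield stands in here.)

#include <sched.h>
#include <stdatomic.h>

typedef struct
{
    atomic_flag locked;     /* initialize with ATOMIC_FLAG_INIT */
} lw_lock;

void lw_acquire(lw_lock *lock)
{
    /* Fast path: a single atomic test-and-set, entirely in user space. */
    while (atomic_flag_test_and_set_explicit(&lock->locked,
                                             memory_order_acquire))
    {
        /* Slow path: only under contention do we involve the kernel. */
        sched_yield();
    }
}

void lw_release(lw_lock *lock)
{
    atomic_flag_clear_explicit(&lock->locked, memory_order_release);
}

An mlock-per-page scheme pays the kernel-call price on every
acquisition, contended or not.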
> shared mem is a bastardized subsystem that works, but isn't integral
> to any performance areas in the kernel so it gets neglected.

What performance issues do you think shared memory needs to have fixed?
We don't issue any shmem kernel calls after the initial shmget, so
comparing the level of kernel tenseness about shmget to the level of
tenseness about mmap is simply irrelevant. Perhaps the reason you don't
see any traffic about this on the kernel lists is that shared memory
already works fine and doesn't need any fixing.

> Please ask questions if you have them.

Do you have any arguments that are actually convincing? What I just read
was a proposal to essentially throw away not only the entire low-level
data access model, but the entire low-level locking model, and start
from scratch. There is no possible way we could support both this
approach and the current one, which means that we'd be permanently
dropping support for all platforms without high-quality mmap
implementations; and despite your enthusiasm, I don't think that
category includes every interesting platform. Furthermore, you didn't
give any really convincing reasons to think that the enormous effort
involved would be repaid. The oprofile reports Josh just put up showed
3% of the CPU time going into userspace/kernelspace copying. Even
assuming that number consists entirely of reads and writes of shared
buffers (and of course no other kernel call ever transfers any data
across that boundary ;-)), there's no way we are going to buy into this
sort of project in hopes of a 3% win.

regards, tom lane