Re: mmap (was First set of OSDL Shared Mem scalability results, some wierdness ... - Mailing list pgsql-performance
| From | Tom Lane |
|---|---|
| Subject | Re: mmap (was First set of OSDL Shared Mem scalability results, some wierdness ... |
| Date | |
| Msg-id | 2823.1098418364@sss.pgh.pa.us |
| In response to | Re: mmap (was First set of OSDL Shared Mem scalability results, some wierdness ... (Sean Chittenden <sean@chittenden.org>) |
| List | pgsql-performance |
Sean Chittenden <sean@chittenden.org> writes:

> When a backend wishes to write a page, the following steps are taken:
> ...
> 2) Backend mmap(2)'s a second copy of the page(s) being written to,
> this time with the MAP_PRIVATE flag set.
> ...
> 5) Once the WAL logging is complete and it has hit the disk, the
> backend msync(2)'s its private copy of the pages to disk (ASYNC or
> SYNC, it doesn't really matter too much to me).

My man page for mmap says that changes in a MAP_PRIVATE region are private; they do not affect the file at all, msync or no. So I don't think the above actually works.

In any case, this scheme still forces you to flush WAL records to disk before making the changed page visible to other backends, so I don't see how it improves the situation. In the existing scheme we only have to fsync WAL at (1) transaction commit, (2) when we are forced to write a page out from shared buffers because we are short of buffers, or (3) checkpoint. Anything that implies an fsync per atomic action is going to be a loser. It does not matter how great your kernel API is if you only get to perform one atomic action per disk rotation :-(

The important point here is that you can't postpone making changes at the page level visible to other backends; there's no MVCC at this level. Consider for example two backends wanting to insert a new row. If they both MAP_PRIVATE the same page, they'll probably choose the same tuple slot on the page to insert into (certainly there is nothing to stop that from happening). Now you have conflicting definitions for the same CTID, not to mention probably conflicting uses of the page's physical free space; disaster ensues.

So "atomic action" really means "lock page, make changes, add WAL record to in-memory WAL buffers, unlock page", with the understanding that as soon as you unlock the page the changes you've made in it are visible to all other backends. You *can't* afford to put a WAL fsync in this sequence.

You could possibly buy back most of the lossage in this scenario if there were some efficient way for a backend to hold the low-level lock on a page just until some other backend wanted to modify the page; whereupon the previous owner would have to do what's needed to make his changes visible before releasing the lock. Given the right access patterns you don't have to fsync very often (though given the wrong access patterns you're still in deep trouble). But we don't have any such mechanism, and I think the communication costs of one would be forbidding.

> [ much snipped ]
> 4) Not having shared pages get lost when the backend dies (mmap(2) uses
> refcounts and cleans itself up, no need for ipcs/ipcrm/ipcclean).

Actually, that is not a bug, that's a feature. One of the things that scares me about mmap is that a crashing backend is able to scribble all over live disk buffers before it finally SEGV's (think about memcpy gone wrong and similar cases). In our existing scheme there's a pretty good chance that we will be able to commit hara-kiri before any of the trashed data gets written out. In an mmap scheme, it's time to dig out your backup tapes, because there simply is no distinction between transient and permanent data --- the kernel has no way to know that you didn't mean it.

In short, I remain entirely unconvinced that mmap is of any interest to us.

regards, tom lane
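To make the MAP_PRIVATE point concrete, here is a minimal, self-contained sketch (the file name and sizes are arbitrary, and the behavior shown is Linux's): writes through a MAP_PRIVATE mapping are copy-on-write into private memory, and msync(2) does not carry them back to the file.

```c
/* Demonstrates that MAP_PRIVATE changes never reach the file,
 * msync or no.  Compile and run; the file keeps its old contents. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("demo.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0 || write(fd, "old data", 8) != 8)
        return 1;

    char *p = mmap(NULL, 8, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    memcpy(p, "new data", 8);     /* copy-on-write: private pages only */
    msync(p, 8, MS_SYNC);         /* accepted, but the file is untouched */

    char buf[9] = {0};
    pread(fd, buf, 8, 0);
    printf("file still contains: %s\n", buf);   /* prints "old data" */

    munmap(p, 8);
    close(fd);
    return 0;
}
```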
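The arithmetic behind "one atomic action per disk rotation" hinges on where the existing scheme actually pays for an fsync. This schematic sketch models the three flush points listed above; XLogFlush is the real routine's name, but everything else here is a simplified stand-in, not PostgreSQL's actual code:

```c
/* Simplified model of the write-ahead rule: WAL is fsync'd only up to
 * the LSN a given action requires, and only at three places. */
#include <stdio.h>

typedef unsigned long long XLogRecPtr;   /* WAL position (LSN) */

static XLogRecPtr wal_flushed;           /* how far WAL is known on disk */

static void XLogFlush(XLogRecPtr upto)
{
    if (upto > wal_flushed) {
        printf("fsync WAL through %llu\n", upto);  /* the expensive part */
        wal_flushed = upto;
    }
}

/* (1) transaction commit: flush through the commit record */
static void CommitTransaction(XLogRecPtr commit_lsn)
{
    XLogFlush(commit_lsn);
}

/* (2) evicting a dirty shared buffer when short of buffers:
 * the page's WAL must hit disk before the page itself does */
static void EvictDirtyBuffer(XLogRecPtr page_lsn)
{
    XLogFlush(page_lsn);
    /* ... then write() the data page ... */
}

/* (3) checkpoint: flush all WAL, then all dirty pages */
static void CreateCheckPoint(XLogRecPtr insert_lsn)
{
    XLogFlush(insert_lsn);
    /* ... write and fsync every dirty buffer ... */
}

int main(void)
{
    EvictDirtyBuffer(100);    /* one WAL fsync */
    CommitTransaction(250);   /* another */
    CommitTransaction(240);   /* already flushed: free */
    CreateCheckPoint(300);
    return 0;
}
```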
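The two-backends collision is easy to reproduce mechanically. In this toy program (the "page" layout is invented and nothing like PostgreSQL's real page format), two processes each take a MAP_PRIVATE view of the same page, read the same free-slot pointer, and both claim slot 1 for different tuples, i.e. two conflicting definitions of the same CTID:

```c
/* Two processes, two MAP_PRIVATE views of one page: both pick the
 * same "free" slot, so their private copies are irreconcilable. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fd = open("page.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);
    char page[64] = {0};
    page[0] = 1;                        /* toy header: next free slot */
    write(fd, page, sizeof page);

    pid_t pid = fork();                 /* two "backends" */
    char *p = mmap(NULL, sizeof page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE, fd, 0);

    int slot = p[0];                    /* both read "next free = 1" */
    snprintf(p + slot * 8, 8, "%s", pid ? "parent" : "child");
    p[0] = (char) (slot + 1);
    printf("%s inserted into slot %d\n", pid ? "parent" : "child", slot);

    /* Each private copy now claims slot 1 for a different tuple; if
     * both were somehow written back, one insert would vanish or the
     * page would be corrupt. */
    if (pid)
        wait(NULL);
    return 0;
}
```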
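The "atomic action" sequence itself fits in a few lines. Everything below is an illustrative placeholder (a pthread mutex standing in for a content lock that really lives in shared memory), not PostgreSQL's buffer-manager API; the point is simply that nothing inside the critical section is allowed to touch the disk:

```c
/* lock page, make changes, add WAL record to in-memory buffers,
 * unlock page -- and not an fsync anywhere in the sequence. */
#include <pthread.h>
#include <string.h>

typedef struct {
    pthread_mutex_t lock;   /* stand-in for a shared-memory content lock */
    char data[8192];        /* the 8K page image */
} Buffer;

/* appends the record to WAL buffers in memory; does NOT fsync */
static void wal_insert_in_memory(const void *rec, size_t len)
{
    (void) rec;
    (void) len;
}

static void atomic_heap_insert(Buffer *buf, const char *tuple,
                               size_t len, size_t slot_offset)
{
    pthread_mutex_lock(&buf->lock);               /* lock page */
    memcpy(buf->data + slot_offset, tuple, len);  /* make changes */
    wal_insert_in_memory(tuple, len);             /* WAL to memory only */
    pthread_mutex_unlock(&buf->lock);             /* unlock: changes are
                                                   * now visible to all;
                                                   * an fsync here would
                                                   * be once per action */
}

int main(void)
{
    static Buffer buf = { PTHREAD_MUTEX_INITIALIZER, {0} };
    atomic_heap_insert(&buf, "a tuple", 7, 0);
    return 0;
}
```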
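Finally, the scribbling hazard is also demonstrable. In this sketch the "backend" aborts immediately after a stray memcpy through a MAP_SHARED mapping, yet the kernel writes the trashed page back to the file on its own schedule; garbage in System V shared memory, by contrast, dies with the memory unless some process explicitly writes it out:

```c
/* A "crash" right after a wild memcpy: with MAP_SHARED the damage is
 * already permanent -- no msync, no write(), no chance at hara-kiri. */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("table.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0 || write(fd, "precious data", 13) != 13)
        return 1;

    char *p = mmap(NULL, 13, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    memcpy(p, "XXXXXXXXXXXXX", 13);   /* the memcpy gone wrong */
    abort();                          /* dies before any cleanup, but the
                                       * kernel flushes the dirty page to
                                       * table.dat anyway */
}
```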