Re: mmap (was First set of OSDL Shared Mem scalability results, some weirdness ...) - Mailing list pgsql-performance

From:           Tom Lane
Subject:        Re: mmap (was First set of OSDL Shared Mem scalability results, some weirdness ...)
Msg-id:         26390.1097875346@sss.pgh.pa.us
In response to: Re: mmap (was First set of OSDL Shared Mem scalability results, some weirdness ...)  (Sean Chittenden <sean@chittenden.org>)
List:           pgsql-performance

Sean Chittenden <sean@chittenden.org> writes:
> Coordination of data isn't
> necessary if you mmap(2) data as a private block, which takes a
> snapshot of the page at the time you make the mmap(2) call and gets
> copied only when the page is written to.  More on that later.

We cannot move to a model where different backends have different
views of the same page, which seems to me to be inherent in the idea of
using MAP_PRIVATE for anything.  To take just one example, a backend
that had mapped one btree index page some time ago could get completely
confused if that page splits, because it might see the effects of the
split in nearby index pages but not in the one that was split.  Or it
could follow an index link to a heap entry that isn't there anymore,
or miss an entry it should have seen.  MVCC doesn't save you from this
because btree adjustments happen below the level of transactions.
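
To make the hazard concrete: a minimal sketch of MAP_PRIVATE's
copy-on-write behavior (file name and sizes invented, error handling
omitted).  POSIX does not even guarantee that an *unmodified* private
page tracks later changes to the underlying file, and once the process
writes to the page it certainly does not:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int
    main(void)
    {
        int   fd = open("btree_page.dat", O_RDWR);  /* hypothetical file */
        char *priv = mmap(NULL, 8192, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE, fd, 0);

        priv[0] = 'X';          /* copy-on-write fires: this page is now
                                 * frozen at its old contents for us */

        /* Another process (or this one, via the fd) changes the file: */
        pwrite(fd, "split!", 6, 100);

        /* We still see the pre-change bytes here, while a page mapped
         * afresh elsewhere in the index would show the new state --
         * exactly the inconsistent view described above. */
        printf("%.6s\n", priv + 100);

        munmap(priv, 8192);
        close(fd);
        return 0;
    }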

However, the really major difficulty with using mmap is that it breaks
the scheme we are currently using for WAL, because you don't have any
way to restrict how soon a change in an mmap'd page will go to disk.
(No, I don't believe that mlock guarantees this.  It says that the
page will not be removed from main memory; it does not specify that,
say, the syncer won't write the contents out anyway.)
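
Concretely, the invariant looks something like this (a sketch with
invented names; in PostgreSQL it is roughly XLogFlush() followed by the
buffer manager's write of the page).  The whole point is that step 2
must never precede step 1, and with an mmap'd buffer the kernel can
perform the equivalent of step 2 whenever it pleases:

    #include <sys/types.h>
    #include <unistd.h>

    /* Invented signature, for illustration only. */
    void
    flush_data_page(int walfd, int datafd, const char *page,
                    off_t offset, int wal_needs_flush)
    {
        /*
         * 1. WAL first: the log record describing the page's latest
         *    change must be durable before the page image hits disk.
         */
        if (wal_needs_flush)
            fsync(walfd);

        /*
         * 2. Only now may the page leave our control.  With write(2)
         *    we choose this moment; with an mmap'd buffer the syncer
         *    could have done the equivalent of this line at any time
         *    since the page was dirtied.
         */
        pwrite(datafd, page, 8192, offset);
    }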

> Let's look at what happens with a read(2) call.  To read(2) data you
> have to have a block of memory to copy data into.  Assume your OS of
> choice has a good malloc(3) implementation and it only needs to call
> brk(2) once to extend the process's address space after the first
> malloc(3) call.  There's your first system call, which guarantees one
> context switch.

Wrong.  Our reads occur into shared memory allocated at postmaster
startup, remember?
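
Schematically (names, sizes, and error handling elided or invented),
the pattern is one shmget/shmat in the postmaster and zero allocations
per read:

    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define BLCKSZ 8192         /* PostgreSQL's default block size */

    static char *buffer_pool;   /* same segment in every backend */

    /* Called once, in the postmaster, before any backend forks. */
    void
    create_shared_buffers(size_t nbuffers)
    {
        int shmid = shmget(IPC_PRIVATE, nbuffers * BLCKSZ,
                           IPC_CREAT | 0600);

        buffer_pool = shmat(shmid, NULL, 0);
    }

    /* Called per read: no malloc, no brk -- the target memory exists
     * already and is visible to every backend at the same address. */
    void
    read_block(int fd, off_t offset, int bufno)
    {
        pread(fd, buffer_pool + (size_t) bufno * BLCKSZ, BLCKSZ, offset);
    }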

> mmap(2) is a totally different animal in that you don't ever need to
> make calls to read(2): mmap(2) is used in place of those calls (With
> #ifdef and a good abstraction, the rest of PostgreSQL wouldn't know it
> was working with a page of mmap(2)'ed data or need to know that it is).

Instead, you have to worry about address space management and keeping a
consistent view of the data.
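
A sketch of the per-backend bookkeeping such a scheme implies (all
names invented):

    #include <stddef.h>
    #include <sys/mman.h>
    #include <sys/types.h>

    typedef struct MappedPage
    {
        int    fd;              /* which relation file */
        off_t  offset;          /* which block */
        char  *addr;            /* where *this backend* mapped it */
    } MappedPage;

    static MappedPage map_table[1024];  /* per backend, not shared */

    char *
    page_access(int slot, int fd, off_t offset)
    {
        MappedPage *mp = &map_table[slot];

        if (mp->addr == NULL || mp->fd != fd || mp->offset != offset)
        {
            if (mp->addr != NULL)
                munmap(mp->addr, 8192); /* tear down the old window */
            mp->addr = mmap(NULL, 8192, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, offset);
            mp->fd = fd;
            mp->offset = offset;
        }
        /* The address can differ from backend to backend and from
         * call to call -- nothing like the stable pointers the
         * current buffer manager hands out. */
        return mp->addr;
    }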

> ... If a write(2) system call is issued on a page of
> mmap(2)'ed data (and your operating system supports it, I know FreeBSD
> does, but don't think Linux does), then the page of data is DMA'ed by
> the network controller and sent out without the data needing to be
> copied into the network controller's buffer.

Perfectly irrelevant to Postgres, since there is no situation where we'd
ever write directly from a disk buffer to a socket; in the present
implementation there are at least two levels of copy needed in between
(datatype-specific output function and protocol message assembly).  And
that's not even counting the fact that any data item large enough to
make the savings interesting would have been sliced, diced, and
compressed by TOAST.

> ... If you're doing a write(2) or are directly
> scribbling on an mmap(2)'ed page[1], you need to grab some kind of an
> exclusive lock on the page/file (mlock(2) is going to be no more
> expensive than a semaphore, but probably less expensive).

More incorrect information.  The locking involved here is done by
LWLockAcquire, which is significantly *less* expensive than a kernel
call in the case where there is no need to block.  (If you have to
block, any kernel call to do so is probably about as bad as any other.)
Switching over to mlock would likely make things considerably slower.
In any case, you didn't actually mean to say mlock did you?  It doesn't
lock pages against writes by other processes AFAICS.
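
For illustration, the shape of that fast path (heavily simplified: the
real LWLock uses a spinlock-guarded state word and a proper wait queue,
and C11 atomics here stand in for the platform TAS macros):

    #include <sched.h>
    #include <stdatomic.h>

    typedef struct LWLockSketch
    {
        atomic_flag locked;     /* initialize with ATOMIC_FLAG_INIT */
    } LWLockSketch;

    void
    lwlock_acquire(LWLockSketch *lock)
    {
        /* Fast path: one atomic test-and-set in userspace; if the
         * lock is free, no kernel call and no context switch. */
        while (atomic_flag_test_and_set(&lock->locked))
        {
            /* Slow path only: the real code queues itself and sleeps
             * on a semaphore; only here does a kernel call happen. */
            sched_yield();      /* stand-in for the semaphore wait */
        }
    }

    void
    lwlock_release(LWLockSketch *lock)
    {
        atomic_flag_clear(&lock->locked);
    }

An mlock(2) call, by contrast, is a kernel round-trip every time -- and
again, it pins pages in memory rather than locking out other writers.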

> shared mem is a bastardized subsystem that works, but isn't integral to
> any performance areas in the kernel so it gets neglected.

What performance issues do you think shared memory needs to have fixed?
We don't issue any shmem kernel calls after the initial shmget, so
comparing the level of kernel tenseness about shmget to the level of
tenseness about mmap is simply irrelevant.  Perhaps the reason you don't
see any traffic about this on the kernel lists is that shared memory
already works fine and doesn't need any fixing.

> Please ask questions if you have them.

Do you have any arguments that are actually convincing?  What I just
read was a proposal to essentially throw away not only the entire
low-level data access model, but the entire low-level locking model,
and start from scratch.  There is no possible way we could support both
this approach and the current one, which means that we'd be permanently
dropping support for all platforms without high-quality mmap
implementations; and despite your enthusiasm I don't think that that
category includes every interesting platform.  Furthermore, you didn't
give any really convincing reasons to think that the enormous effort
involved would be repaid.  Those oprofile reports Josh just put up
showed 3% of the CPU time going into userspace/kernelspace copying.
Even assuming that that number consists entirely of reads and writes of
shared buffers (and of course no other kernel call ever transfers any
data across that boundary ;-)), there's no way we are going to buy into
this sort of project in hopes of a 3% win.

            regards, tom lane
