Re: mmap (was First set of OSDL Shared Mem scalability results, some wierdness ... - Mailing list pgsql-performance

From Sean Chittenden
Subject Re: mmap (was First set of OSDL Shared Mem scalability results, some wierdness ...
Msg-id EBC24157-239F-11D9-9171-000A95C705DC@chittenden.org
In response to Re: mmap (was First set of OSDL Shared Mem scalability results, some wierdness ...  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: mmap (was First set of OSDL Shared Mem scalability results, some wierdness ...  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-performance
> However the really major difficulty with using mmap is that it breaks
> the scheme we are currently using for WAL, because you don't have any
> way to restrict how soon a change in an mmap'd page will go to disk.
> (No, I don't believe that mlock guarantees this.  It says that the
> page will not be removed from main memory; it does not specify that,
> say, the syncer won't write the contents out anyway.)

I had to think about this for a minute (now nearly a week) and reread
the docs on WAL before I grokked what could happen here.  You're
absolutely right that WAL needs to be taken into account first.  How
does this execution path sound to you?

In this scheme, all mmap(2)'ed pages are MAP_SHARED by default.  There
are no complications with regard to reads.

When a backend wishes to write a page, the following steps are taken:

1) Backend grabs a lock from the lockmgr to write to the page (exactly
as it does now)

2) Backend mmap(2)'s a second copy of the page(s) being written to,
this time with the MAP_PRIVATE flag set.  Mapping a copy of the page
again is wasteful in terms of address space, but does not require any
more memory than our current scheme.  Re-mapping the page with
MAP_PRIVATE keeps the writer's changes from affecting the data that
other backends are viewing.

3) The writing backend can then scribble on its private copy of the
page(s) as it sees fit.

4) Once it has finished making changes and the transaction is to be
committed, the backend WAL-logs its changes.

5) Once the WAL logging is complete and has hit the disk, the backend
msync(2)'s its private copy of the pages to disk (MS_ASYNC or MS_SYNC,
it doesn't matter too much to me).

6) Optional(?).  I'm not sure whether or not the backend would also
need to issue an msync(2) with MS_INVALIDATE, but I suspect it would
not need to on systems with unified buffer caches such as FreeBSD or
OS-X.  On HPUX, or other older *NIX'es, it may be necessary.  *shrug*
I could be trying to be overly protective here.

7) Backend munmap(2)'s its private copy of the written on page(s).

8) Backend releases its lock from the lockmgr.

At this point, the remaining backends are able to see the updated
pages of data.  (A freestanding sketch of steps 2 through 7 follows.)
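
Roughly, in C (a freestanding sketch only: the lockmgr and WAL calls
are marked as comments, the file name and page size are made up, and
whether msync(2) on a MAP_PRIVATE region actually pushes the private
copy back to the file is precisely the part that would need testing,
since POSIX doesn't promise it):

    #include <sys/mman.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
        const char   *path = "/tmp/mmap_page_test";  /* stand-in for a relation file */
        const size_t  pagesz = 8192;                 /* pretend BLCKSZ */
        char          buf[8192];
        int           fd;
        char         *shared, *priv;

        /* Build a one-"page" file full of 'A's. */
        memset(buf, 'A', sizeof(buf));
        fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
        if (fd < 0 || write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf))
        {
            perror("open/write");
            return 1;
        }

        /* Readers would all share this MAP_SHARED view. */
        shared = mmap(NULL, pagesz, PROT_READ, MAP_SHARED, fd, 0);

        /* Step 1 would grab the lockmgr lock here. */

        /* Step 2: map a second, private copy of the same page. */
        priv = mmap(NULL, pagesz, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
        if (shared == MAP_FAILED || priv == MAP_FAILED)
        {
            perror("mmap");
            return 1;
        }

        /* Step 3: scribble on the private copy; the shared view is untouched. */
        memset(priv, 'B', pagesz);
        printf("shared view sees '%c', private copy holds '%c'\n",
               shared[0], priv[0]);

        /* Step 4 would WAL-log the change and fsync the WAL here. */

        /* Steps 5-7: msync the private copy, then tear it down. */
        if (msync(priv, pagesz, MS_SYNC) != 0)
            perror("msync");    /* may fail or be a no-op; this is the open question */
        munmap(priv, pagesz);

        printf("after munmap, shared view sees '%c'\n", shared[0]);

        /* Step 8 would release the lockmgr lock here. */

        munmap(shared, pagesz);
        close(fd);
        unlink(path);
        return 0;
    }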

>> Let's look at what happens with a read(2) call.  To read(2) data you
>> have to have a block of memory to copy data into.  Assume your OS of
>> choice has a good malloc(3) implementation and it only needs to call
>> brk(2) once to extend the process's memory address after the first
>> malloc(3) call.  There's your first system call, which guarantees one
>> context switch.
>
> Wrong.  Our reads occur into shared memory allocated at postmaster
> startup, remember?

Doh.  Fair enough.  In most programs that involve read(2), a call to
malloc(3) needs to be made.

>> mmap(2) is a totally different animal in that you don't ever need to
>> make calls to read(2): mmap(2) is used in place of those calls (With
>> #ifdef and a good abstraction, the rest of PostgreSQL wouldn't know it
>> was working with a page of mmap(2)'ed data or need to know that it
>> is).
>
> Instead, you have to worry about address space management and keeping a
> consistent view of the data.

Which is largely handled by mmap() and the VM.

>> ... If a write(2) system call is issued on a page of
>> mmap(2)'ed data (and your operating system supports it, I know FreeBSD
>> does, but don't think Linux does), then the page of data is DMA'ed by
>> the network controller and sent out without the data needing to be
>> copied into the network controller's buffer.
>
> Perfectly irrelevant to Postgres, since there is no situation where
> we'd
> ever write directly from a disk buffer to a socket; in the present
> implementation there are at least two levels of copy needed in between
> (datatype-specific output function and protocol message assembly).  And
> that's not even counting the fact that any data item large enough to
> make the savings interesting would have been sliced, diced, and
> compressed by TOAST.

The biggest winners would be columns whose storage type is PLAIN or
EXTERNAL.  writev(2) that gathers from mmap(2)'ed pages and
non-mmap(2)'ed pages would be a nice perk too (not sure whether
PostgreSQL uses this or not).  Since compression isn't happening on
most tuples under 1K in size, and most tuples in a database are going
to be under that, most tuples are going to be uncompressed.  The total
number of pages for the database, however, is likely a different
story.  For tuples that are uncompressed and larger than a page, it is
probably beneficial to use sendfile(2) instead of mmap(2) +
write(2)'ing the page/file.

If a large tuple is compressed, it'd be interesting to see whether
it'd be worthwhile to have the data decompressed onto anonymously
mmap(2)'ed page(s) so that the benefits of zero-socket-copies could
still be had.
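
For what it's worth, here is roughly what gathering a made-up protocol
header from ordinary memory and tuple bytes straight out of an
mmap(2)'ed page into a single writev(2) would look like.  It writes to
stdout so it runs standalone; a real backend would be writing to the
client's socket, and an anonymously mmap(2)'ed decompression buffer
could sit in the iovec the same way:

    #include <sys/mman.h>
    #include <sys/uio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
        int           fd = open("/etc/services", O_RDONLY);
        size_t        len = 512;            /* pretend tuple length */
        char          header[64];
        char         *page;
        struct iovec  iov[2];

        if (fd < 0)
        {
            perror("open");
            return 1;
        }

        /* The "tuple" lives in an mmap(2)'ed page of the file. */
        page = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
        if (page == MAP_FAILED)
        {
            perror("mmap");
            return 1;
        }

        /* The invented protocol header lives in ordinary memory. */
        snprintf(header, sizeof(header), "tuple of %zu bytes follows:\n", len);

        iov[0].iov_base = header;           /* non-mmap(2)'ed piece */
        iov[0].iov_len  = strlen(header);
        iov[1].iov_base = page;             /* mmap(2)'ed piece, never copied */
        iov[1].iov_len  = len;

        if (writev(STDOUT_FILENO, iov, 2) < 0)
            perror("writev");

        munmap(page, len);
        close(fd);
        return 0;
    }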

>> shared mem is a bastardized subsystem that works, but isn't integral
>> to
>> any performance areas in the kernel so it gets neglected.
>
> What performance issues do you think shared memory needs to have fixed?
> We don't issue any shmem kernel calls after the initial shmget, so
> comparing the level of kernel tenseness about shmget to the level of
> tenseness about mmap is simply irrelevant.  Perhaps the reason you
> don't
> see any traffic about this on the kernel lists is that shared memory
> already works fine and doesn't need any fixing.

I'm gunna get flamed for this, but I think it's improperly used as a
second-level cache on top of the operating system's cache.  mmap(2)
would consolidate all caching into the kernel.

>> Please ask questions if you have them.
>
> Do you have any arguments that are actually convincing?

Three things come to mind.

1) A single cache for pages
2) The ability to give the kernel access hints regarding future IO
(see the madvise(2) sketch after point 4 below)
3) On-the-fly memory use for a cache; there would be no need to
preallocate slabs of shared memory on startup.

And a more minor point would be:

4) Not having shared pages get lost when a backend dies (mmap(2) uses
refcounts and cleans itself up; no need for ipcs/ipcrm/ipcclean).
This isn't much of a concern in production, but it sucks doing
PostgreSQL development on OS-X because there is no ipcs/ipcrm command.
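
Point 2, for instance, is just madvise(2) on the mapped region.  A
rough standalone illustration, with the file and the advice values
picked arbitrarily:

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
        struct stat  st;
        int          fd = open("/etc/services", O_RDONLY);
        char        *p;

        if (fd < 0 || fstat(fd, &st) != 0)
        {
            perror("open/fstat");
            return 1;
        }

        p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
        {
            perror("mmap");
            return 1;
        }

        /* Hint: we're about to walk this mapping front to back. */
        if (madvise(p, st.st_size, MADV_SEQUENTIAL) != 0)
            perror("madvise(MADV_SEQUENTIAL)");

        /* ... a sequential scan of the pages would go here ... */

        /* Hint: we're done with these pages; the kernel may drop them. */
        if (madvise(p, st.st_size, MADV_DONTNEED) != 0)
            perror("madvise(MADV_DONTNEED)");

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }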

> What I just read was a proposal to essentially throw away not only the
> entire
> low-level data access model, but the entire low-level locking model,
> and start from scratch.

From the above list, steps 2, 3, 5, 6, and 7 would be different from
our current approach, all of which could be safely handled with some
#ifdef's on platforms that don't have mmap(2).

> There is no possible way we could support both
> this approach and the current one, which means that we'd be permanently
> dropping support for all platforms without high-quality mmap
> implementations;

Architecturally, I don't see any differences or incompatibilities
that aren't solved with an #ifdef USE_MMAP/#else/#endif.
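
Something with roughly this shape is what I have in mind.  The names
are made up purely for illustration, and the non-mmap branch uses a
local buffer as a stand-in for a shared buffer:

    /* Compiles with or without -DUSE_MMAP; main() doesn't know which. */
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    #ifdef USE_MMAP
    /* mmap(2) path: hand back a mapping of the block in the file itself. */
    static char *
    buffer_read(int fd, off_t blkno)
    {
        void *p = mmap(NULL, BLCKSZ, PROT_READ, MAP_SHARED,
                       fd, blkno * BLCKSZ);
        return (p == MAP_FAILED) ? NULL : (char *) p;
    }

    static void
    buffer_release(char *page)
    {
        munmap(page, BLCKSZ);
    }
    #else
    /* read(2) path: copy the block into a buffer standing in for shared mem. */
    static char page_pool[BLCKSZ];

    static char *
    buffer_read(int fd, off_t blkno)
    {
        return (pread(fd, page_pool, BLCKSZ, blkno * BLCKSZ) > 0)
               ? page_pool : NULL;
    }

    static void
    buffer_release(char *page)
    {
        (void) page;            /* nothing to unmap in this path */
    }
    #endif

    int
    main(void)
    {
        int   fd = open("/etc/services", O_RDONLY);
        char *page = (fd >= 0) ? buffer_read(fd, 0) : NULL;

        if (page != NULL)
        {
            printf("first byte of block 0: '%c'\n", page[0]);
            buffer_release(page);
        }
        if (fd >= 0)
            close(fd);
        return 0;
    }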

> Furthermore, you didn't
> give any really convincing reasons to think that the enormous effort
> involved would be repaid.

Stevens has a great reimplementation of cat(1) that uses mmap(2) and
benchmarks the two approaches.  I did my own version of that here:

http://people.freebsd.org/~seanc/mmap_test/

When read(2)'ing/write(2)'ing /etc/services 100,000 times without
mmap(2), it takes 82 seconds.  With mmap(2), it takes anywhere from 1.1
to 18 seconds.  Worst case scenario with mmap(2) yields a speedup by a
factor of four.  Best case scenario...  *shrug* something better than
4x.  I doubt PostgreSQL would see 4x speedups in the IO department, but
I do think it would be vastly greater than the 3% suggested.
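
The comparison boils down to roughly the following (a freehand sketch,
not the code at the URL above; it measures CPU seconds with clock(3)
rather than wall-clock time):

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    #define ITERATIONS 100000

    /* Touch every byte so the kernel really has to produce the data. */
    static unsigned long
    checksum(const char *buf, size_t len)
    {
        unsigned long sum = 0;
        for (size_t i = 0; i < len; i++)
            sum += (unsigned char) buf[i];
        return sum;
    }

    int
    main(void)
    {
        const char   *path = "/etc/services";
        struct stat   st;
        char         *buf;
        clock_t       t0;
        unsigned long sum = 0;

        if (stat(path, &st) != 0)
        {
            perror("stat");
            return 1;
        }
        buf = malloc(st.st_size);
        if (buf == NULL)
        {
            perror("malloc");
            return 1;
        }

        /* read(2) loop: the data is copied into userspace every time. */
        t0 = clock();
        for (int i = 0; i < ITERATIONS; i++)
        {
            int fd = open(path, O_RDONLY);
            if (fd < 0 || read(fd, buf, st.st_size) < 0)
            {
                perror("open/read");
                return 1;
            }
            sum += checksum(buf, st.st_size);
            close(fd);
        }
        printf("read(2): %.1f sec\n", (double) (clock() - t0) / CLOCKS_PER_SEC);

        /* mmap(2) loop: no copy into userspace, just page mappings. */
        t0 = clock();
        for (int i = 0; i < ITERATIONS; i++)
        {
            int   fd = open(path, O_RDONLY);
            char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);

            if (p == MAP_FAILED)
            {
                perror("mmap");
                return 1;
            }
            sum += checksum(p, st.st_size);
            munmap(p, st.st_size);
            close(fd);
        }
        printf("mmap(2): %.1f sec\n", (double) (clock() - t0) / CLOCKS_PER_SEC);

        printf("(checksum %lu, just to keep the compiler honest)\n", sum);
        free(buf);
        return 0;
    }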

> Those oprofile reports Josh just put up
> showed 3% of the CPU time going into userspace/kernelspace copying.
> Even assuming that that number consists entirely of reads and writes of
> shared buffers (and of course no other kernel call ever transfers any
> data across that boundary ;-)), there's no way we are going to buy into
> this sort of project in hopes of a 3% win.

Would it be helpful if I created a test program that demonstrated the
execution path for writing mmap(2)'ed pages as outlined above?

-sc

--
Sean Chittenden

