Re: mmap (was First set of OSDL Shared Mem scalability results, some wierdness ... - Mailing list pgsql-performance

From: Sean Chittenden
Subject: Re: mmap (was First set of OSDL Shared Mem scalability results, some wierdness ...
Msg-id: 0E3BB9CB-1EE6-11D9-A0BB-000A95C705DC@chittenden.org
In response to: Re: mmap (was First set of OSDL Shared Mem scalability results, some wierdness ...  (Kevin Brown <kevin@sysexperts.com>)
List: pgsql-performance
>> pg to my mind is unique in not trying to avoid OS buffering. Other
>> dbmses spend a substantial effort to create a virtual OS (task
>> management, I/O drivers, etc.) both in code and support. Choosing mmap
>> seems such a limiting an option - it adds OS dependency and limits
>> kernel developer options (2G limits, global mlock serializations,
>> porting problems, inability to schedule or parallelize I/O, still
>> having to coordinate writers and readers).

2G limits?  That must be a Linux limitation, not a limitation with
mmap(2).  On OS-X and FreeBSD it's anywhere from 4GB to ... well,
whatever the 64bit limit is (which is bigger than any data file in
$PGDATA).  An mlock(2) serialization problem is going to be cheaper
than hitting the disk in nearly all cases and should be no worse than a
context switch or semaphore (what we use for the current locking
scheme), of which PostgreSQL causes plenty because it's
multi-process, not multi-threaded.  Coordination of data isn't
necessary if you mmap(2) data as a private block, which takes a
snapshot of the page at the time you make the mmap(2) call and gets
copied only when the page is written to.  More on that later.

> I'm not sure I entirely agree with this.  Whether you access a file
> via mmap() or via read(), the end result is that you still have to
> access it, and since PG has significant chunks of system-dependent
> code that it heavily relies on as it is (e.g., locking mechanisms,
> shared memory), writing the I/O subsystem in a similar way doesn't
> seem to me to be that much of a stretch (especially since PG already
> has the storage manager), though it might involve quite a bit of work.

Obviously you have to access the file on the hard drive, but you're
forgetting an enormous advantage of mmap(2).  With a read(2) system
call, the program has to allocate space for the read(2), then the
kernel copies the data into that newly allocated userland memory.
With mmap(2) there is no second copy.

Let's look at what happens with a read(2) call.  To read(2) data you
have to have a block of memory to copy data into.  Assume your OS of
choice has a good malloc(3) implementation and it only needs to call
brk(2) once to extend the process's address space after the first
malloc(3) call.  There's your first system call, which guarantees one
context switch.  The second hit, a much larger hit, is the actual
read(2) call itself, wherein the kernel has to copy the data twice:
once into a kernel buffer, then from the kernel buffer into the
userland's memory space.  Yuk.  Web servers figured out long ago that
read(2) is slow and evil in terms of performance.  Apache uses mmap(2)
to send static files at performance levels that don't suck, and it's
actually quite fast (in terms of responsiveness; I'm not talking
about Apache's parallelism/concurrency performance levels... which in
1.X aren't great).
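
Just to make the double copy concrete, here's a minimal sketch of that
read(2) path (the file name and block size are made up for
illustration):

/* Sketch of the read(2) path: malloc(3) a buffer, then read(2) copies the
 * data out of the kernel's buffer into that userland buffer (the second
 * copy). */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/example.dat", O_RDONLY);    /* hypothetical file */
    if (fd < 0) { perror("open"); return 1; }

    size_t len = 8192;                  /* one 8K block, for illustration */
    char *buf = malloc(len);            /* may trigger brk(2) under the hood */
    if (buf == NULL) { close(fd); return 1; }

    ssize_t n = read(fd, buf, len);     /* kernel -> userland copy happens here */
    if (n < 0) perror("read");
    else printf("read %zd bytes\n", n);

    free(buf);
    close(fd);
    return 0;
}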

mmap(2) is a totally different animal in that you don't ever need to
make calls to read(2): mmap(2) is used in place of those calls (With
#ifdef and a good abstraction, the rest of PostgreSQL wouldn't know it
was working with a page of mmap(2)'ed data or need to know that it is).
Instead you mmap(2) a file descriptor and the kernel does some heavy
lifting/optimized magic in its VM.  The kernel reads the file
descriptor and places the data it reads into its buffer (exactly the
same as what happens with read(2)), but, instead of copying the data to
the userspace, mmap(2) adjusts the process's address space and maps the
address of the kernel buffer into the process's address space.  No
copying necessary.  The savings here are *huge*!
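
The same access through mmap(2), as a minimal sketch (again, the file
name is made up); note that no userland buffer is ever allocated:

/* Sketch of the mmap(2) path: the file's pages are mapped into the
 * process's address space, so there's no second copy into a malloc(3)'ed
 * buffer. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/example.dat", O_RDONLY);    /* hypothetical file */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return 1; }

    /* Map the whole file read-only; the kernel faults pages in as needed. */
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    printf("first byte: %d\n", p[0]);   /* touching the page faults it in */

    munmap(p, st.st_size);
    close(fd);
    return 0;
}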

Depending on the mmap(2) implementation, the VM may not even get a page
from disk until it's actually needed.  So, let's say you mmap(2) a 16MB
file.  The address space picks up an extra 16MB that the process *can*
use, but doesn't necessarily use.  So if a user reads only ten pages
out of that 16MB file, only those ten pages (10 * getpagesize(), usually
40,960 bytes) get faulted in, which is about 0.24% of the file
(((4096 * 10) / (16 * 1024 * 1024)) * 100).  Did I forget to mention
that if the file
is already in the kernel's buffers, there's no need for the kernel to
access the hard drive?  Another big win for data that's hot/frequently
accessed.
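
To put numbers on that, here's a sketch (hypothetical file name) that
maps a 16MB file but touches only ten pages, so only about
10 * getpagesize() bytes ever get faulted in from disk:

/* Map a 16MB file but touch only ten pages: roughly 10 * getpagesize()
 * (40,960 bytes with 4K pages) gets faulted in, not the whole 16MB. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/sixteen_meg.dat", O_RDONLY);  /* hypothetical 16MB file */
    if (fd < 0) { perror("open"); return 1; }

    size_t len = 16 * 1024 * 1024;
    char *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    long pagesize = sysconf(_SC_PAGESIZE);
    int sum = 0;
    for (int i = 0; i < 10; i++)
        sum += p[i * pagesize];         /* each access faults in one page */

    printf("checksum %d; touched %ld of %zu bytes (%.2f%%)\n",
           sum, 10 * pagesize, len, 100.0 * (10 * pagesize) / len);

    munmap(p, len);
    close(fd);
    return 0;
}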

There's another large savings if the machine is doing network IO too...

> As for parallelization of I/O, the use of mmap() for reads should
> signficantly improve parallelization -- now instead of issuing read()
> system calls, possibly for the same set of blocks, all the backends
> would essentially be examining the same data directly.  The
> performance improvements as a result of accessing the kernel's cache
> pages directly instead of having it do buffer copies to process-local
> memory should increase as concurrency goes up.  But see below.

That's kinda true... though not quite correct.  The improvement in IO
concurrency comes from zero-socket-copy operations from the disk to the
network controller.  If a write(2) system call is issued on a page of
mmap(2)'ed data (and your operating system supports it, I know FreeBSD
does, but don't think Linux does), then the page of data is DMA'ed by
the network controller and sent out without the data needing to be
copied into the network controller's buffer.  So, instead of the CPU
copying data from the OS's buffer to a kernel buffer, the network card
grabs the chunk of data in one interrupt because of the DMA (direct
memory access).  This is a pretty big deal for web serving, but if
you've got a database sending large sets of data over the network,
assuming the network isn't the bottleneck, this results in a hefty
performance boost (that won't be noticed by most until they're running
huge, very busy installations).  This optimization comes for free,
without needing to add a single line of code, once mmap(2) has been
added to an application.
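
Here's roughly what that send path looks like from the application's
side.  The connected socket descriptor is assumed to already exist, and
whether the kernel actually does the zero-copy DMA underneath is
OS-dependent, as noted above:

/* Send a file over an already-connected socket by write(2)'ing straight
 * out of an mmap(2)'ed region; no userland buffer is allocated or copied
 * into.  Whether this becomes a zero-copy DMA is up to the OS. */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int send_mapped_file(int sockfd, const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return -1; }

    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { close(fd); return -1; }

    /* A real caller would loop here to handle short writes. */
    ssize_t sent = write(sockfd, p, st.st_size);

    munmap(p, st.st_size);
    close(fd);
    return sent < 0 ? -1 : 0;
}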

>> More to the point, I think it is very hard to effectively coordinate
>> multithreaded I/O, and mmap seems used mostly to manage relatively
>> simple scenarios.
>
> PG already manages and coordinates multithreaded I/O.  The mechanisms
> used to coordinate writes needn't change at all.  But the way reads
> are done relative to writes might have to be rethought, since an
> mmap()ed buffer always reflects what's actually in kernel space at the
> time the buffer is accessed, while a buffer retrieved via read()
> reflects the state of the file at the time of the read().  If it's
> necessary for the state of the buffers to be fixed at examination
> time, then mmap() will be at best a draw, not a win.

Here's where things can get interesting from a transaction standpoint.
Your statement is correct up until you make the assertion that a page
needs to be fixed.  If you're doing a read-only transaction, mmap(2) a
region and set the MAP_PRIVATE flag so the ground won't change
underneath you.  No copying of this page is done by the kernel unless
it gets written to.  If you're doing a write(2) or are directly
scribbling on an mmap(2)'ed page[1], you need to grab some kind of an
exclusive lock on the page/file (mlock(2) is going to be no more
expensive than a semaphore, but probably less expensive).  We already
do that with semaphores, however.  So for databases that don't have
high contention for the same page/file of data, there are no additional
copies made.  When a piece of data is written, a page is duplicated
before it gets scribbled on, but the application never knows this
happens.  The next time a process mmap(2)'s a region of memory that's
been written to, it'll get the updated data without any need to flush a
cache or mark pages as dirty: the operating system does all of this for
us (and probably faster too).  mmap(2) implementations are, IMHO, more
optimized than shared memory implementations (mmap(2) is a VM function,
which gets many eyes to look it over and is always being tuned, whereas
shared mem is a bastardized subsystem that works, but isn't integral to
any performance areas in the kernel so it gets neglected.  Just my
observations from the *BSD commit lists; on Linux it may be different).

[1] I forgot to mention earlier, you don't have to write(2) data to a
file if it's mmap(2)'ed, you can change the contents of an mmap(2)'ed
region, then msync(2) it back to disk (to ensure it gets written out)
or let the last munmap(2) call do that for you (which would be just as
dangerous as running without fsync... but would result in an additional
performance boost).
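
A sketch of both access patterns (hypothetical file name): a
MAP_PRIVATE mapping for a reader that doesn't want the ground shifting
underneath it, and a MAP_SHARED mapping that gets scribbled on and then
msync(2)'ed out.  (Strictly speaking, how well a MAP_PRIVATE view is
isolated from later changes to the underlying file varies by
implementation.)

/* Reader: MAP_PRIVATE is copy-on-write, so this process's view can't leak
 * changes back to the file.  Writer: MAP_SHARED plus msync(2) pushes the
 * modified page back out to disk. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/example.dat", O_RDWR);      /* hypothetical file */
    if (fd < 0) return 1;

    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return 1; }

    /* Private, read-only view for the duration of a read-only transaction. */
    char *snap = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

    /* Shared, writable view for a writer holding the appropriate lock. */
    char *shared = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    if (snap == MAP_FAILED || shared == MAP_FAILED) { close(fd); return 1; }

    memcpy(shared, "hello", 5);             /* scribble on the first page */
    msync(shared, st.st_size, MS_SYNC);     /* force the change out to disk */

    munmap(snap, st.st_size);
    munmap(shared, st.st_size);
    close(fd);
    return 0;
}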

>> mmap doesn't look that promising.
>
> This ultimately depends on two things: how much time is spent copying
> buffers around in kernel memory, and how much advantage can be gained
> by freeing up the memory used by the backends to store the
> backend-local copies of the disk pages they use (and thus making that
> memory available to the kernel to use for additional disk buffering).

Someone on IRC pointed me to some OSDL benchmarks, which broke down
where time is being spent.  Want to know what the most expensive part
of PostgreSQL is?  *drum roll*

http://khack.osdl.org/stp/297960/profile/DBT_2_Profile-tick.sort

3967393 total                                      1.7735
2331284 default_idle                             36426.3125
825716 do_sigaction                             1290.1813
133126 __copy_from_user_ll                      1040.0469
  97780 __copy_to_user_ll                        763.9062
  43135 finish_task_switch                       269.5938
  30973 do_anonymous_page                         62.4456
  24175 scsi_request_fn                           22.2197
  23355 __do_softirq                             121.6406
  17039 __wake_up                                133.1172
  16527 __make_request                            10.8730
   9823 try_to_wake_up                            13.6431
   9525 generic_unplug_device                     66.1458
   8799 find_get_page                             78.5625
   7878 scsi_end_request                          30.7734

Copying data to/from userspace and signal handling!!!!  Let's hear it
for the need for mmap(2)!!!  *crowd goes wild*

> The gains from the former are likely small.  The gains from the latter
> are probably also small, but harder to estimate.

I disagree.

> The use of mmap() is probably one of those optimizations that should
> be done when there's little else left to optimize, because the
> potential gains are possibly (if not probably) relatively small and
> the amount of work involved may be quite large.

If system/kernel time is where most of your database spends its time,
then mmap(2) is a huge optimization that is very much worth pursuing.
It's stable (nearly all web servers use it, notably Apache), widely
deployed, POSIX specified (granted not all implementations are 100%
consistent, but that's an OS bug and mmap(2) doesn't have to be turned
on for those platforms: it's no worse than where we are now), and well
optimized by operating system hackers.  I guarantee that your operating
system of choice has a faster VM and disk cache than PostgreSQL's
userland cache; never mind that using the OS's buffers leads to many
performance boosts, as the OS can short-circuit common pathways that
would otherwise require data copying (ex: zero-socket-copy operations and copying
data to/from userland).

mmap(2) isn't a panacea or replacement for good software design, but it
certainly does make IO operations vastly faster, which is what
PostgreSQL does a lot of (hence its need for a userland cache).
Remember, back when PostgreSQL had its architecture thunk up, mmap(2)
hardly existed in anyone's eyes, never mind being widely used or a
POSIX function.  It wasn't until Apache started using it that operating
system vendors felt the need to implement it or make it work well.  Now
it's integral to nearly all virtual memory implementations and a modern
OS can't live without it or have it broken in any way.  It would be
largely beneficial to PostgreSQL to heavily utilize mmap(2).

A few places it should be used include:

*) Storage.  It is a good idea to mmap(2) all files instead of
read(2)'ing files.  mmap(2) doesn't fetch a page from disk until it's
actually needed, which is a nifty savings.  Sure it causes a fault in
the kernel, but it won't fault the second time that page is accessed.
Changes are necessary to src/backend/storage/file/, possibly
src/backend/storage/freespace/ (why is it using fread(3) and not
read(2)?), src/backend/storage/large_object/ can remain gimpy since
people should use BYTEA instead (IMHO), src/backend/storage/page/
doesn't need changes (I don't think), src/backend/storage/smgr/
shouldn't need any modifications either.
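(A rough sketch of this appears after the ARC point below.)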

*) ARC.  Why munmap(2) data if you don't need to?  With ARC, it's
possible for the database to coach the operating system in what pages
should be persistent.  ARC's a smart algorithm for handling the needs
of a database.  Instead of having a cache of pages in userland,
PostgreSQL would have a cache of mmap(2)'ed pages.  It's shared between
processes, the changes are public to external programs read(2)'ing
data, and it's quick.  The need for shared memory drops to nearly
nothing.  The need for mmap(2)'able space in the kernel, however, goes
up.  Unlike SysV shared mem, this can normally be
changed on the fly.  The end result would be, if a page is needed, it
checks to see if it's in the cache.  If it is, the mmap(2)'ed page is
returned.  If it isn't, the page gets read(2)/mmap(2) like it currently
is loaded (except in the mmap(2) case where after the data has been
loaded, the page gets munmap(2)'ed).  If ARC decides to keep the page,
the page doesn't get munmap(2)'ed.  I don't think any changes need to
be made though to take advantage of mmap(2) if the changes are made in
the places mentioned above in the Storage point.
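
To make the Storage and ARC points concrete, here's a rough, purely
hypothetical sketch: none of these names exist in PostgreSQL, and the
block size, cache size, and eviction handling are invented for
illustration.  The idea is simply to hand back a pointer to an
mmap(2)'ed block, consulting a small cache of existing mappings first,
instead of read(2)'ing the block into a userland buffer pool:

/* Hypothetical sketch only.  Return a pointer to an mmap(2)'ed 8K block,
 * checking a tiny cache of existing mappings first.  A real ARC would
 * decide which mappings to keep and which to munmap(2); here we just
 * fill free slots and never evict. */
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>

#define BLCKSZ      8192        /* invented block size */
#define CACHE_SLOTS 64          /* invented cache size */

typedef struct
{
    int    fd;                  /* file the block came from */
    off_t  offset;              /* byte offset of the block (page-aligned) */
    void  *addr;                /* address of the mapping, NULL if slot free */
} MappedBlock;

static MappedBlock cache[CACHE_SLOTS];

void *read_block_mmap(int fd, off_t offset)
{
    int free_slot = -1;

    for (int i = 0; i < CACHE_SLOTS; i++)
    {
        if (cache[i].addr == NULL)
            free_slot = i;
        else if (cache[i].fd == fd && cache[i].offset == offset)
            return cache[i].addr;           /* cache hit: already mapped */
    }

    /* Cache miss: map the block straight out of the file. */
    void *addr = mmap(NULL, BLCKSZ, PROT_READ, MAP_SHARED, fd, offset);
    if (addr == MAP_FAILED)
        return NULL;

    if (free_slot >= 0)                     /* remember it if there's room */
    {
        cache[free_slot].fd = fd;
        cache[free_slot].offset = offset;
        cache[free_slot].addr = addr;
    }
    /* else: a real implementation would munmap(2) or evict something here */

    return addr;
}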


A few other perks:

*) DIRECTIO can be used without much of a cache coherency headache
since the cache of data is in the kernel, not userland.

*) NFS.  I'm not suggesting multiple clients use the same data
directory via NFS (unless read only), but if there were a single client
accessing a data directory over NFS, performance would be much better
than it is today because data consistency is handled by the kernel so
in-flight packets for writes that get dropped or lost won't cause a
slowdown (mmap(2) behaves differently with NFS pages) or corruption.

*) mmap(2) is conditional on the operating system's abilities, but
doesn't require any architectural changes.  It does change the location
of the cache, from being in the userland, down into the kernel.  This
is a change for database administrators, but a good one, IMHO.
Previously, the operating system would be split 25% kernel, 75% user
because PostgreSQL would need the available RAM for its cache.  Now,
that can be moved closer to the opposite, 75% kernel, 25% user because
most of the memory is mmap(2)'ed pages instead of actual memory in the
userland.

*) Pages can be protected via PROT_(EXEC|READ|WRITE).  For backends
that aren't making changes to the DDL or system catalogs (permissions,
etc.), pages that are loaded from the catalogs could be loaded with the
protection PROT_READ, which would prevent changes to the catalogs.  All
DDL and permission altering commands (anything that touches the system
catalogs) would then load the page with the PROT_WRITE bit set, make
their changes, then PROT_READ the page again.  This would provide a
first line of defense against buggy programs or exploits.
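(A sketch of this appears below.)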

*) Eliminates the double caching done currently (caching in PostgreSQL
and the kernel) by pushing the cache into the kernel... but without
PostgreSQL knowing it's working on a page that's in the kernel.
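
On the PROT_(EXEC|READ|WRITE) point above, here's a sketch of that
protection dance using mprotect(2); the page pointer is assumed to come
from an existing, page-aligned mapping, and the function name is made
up:

/* Keep catalog pages mapped PROT_READ; flip a page to PROT_WRITE only for
 * the duration of a DDL/permissions change, then flip it back.  A stray
 * write while the page is read-only dies with SIGSEGV instead of silently
 * corrupting the catalog. */
#include <string.h>
#include <sys/mman.h>

int update_catalog_page(void *page, size_t pagesize,
                        const void *newdata, size_t len)
{
    if (mprotect(page, pagesize, PROT_READ | PROT_WRITE) != 0)
        return -1;                          /* couldn't open the page up */

    memcpy(page, newdata, len);             /* apply the catalog change */

    if (mprotect(page, pagesize, PROT_READ) != 0)
        return -1;                          /* re-arm the write protection */

    return 0;
}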

Please ask questions if you have them.

-sc

--
Sean Chittenden

