Thread: mmap (was First set of OSDL Shared Mem scalability results, some wierdness ...

mmap (was First set of OSDL Shared Mem scalability results, some wierdness ...

From
Aaron Werman
Date:
pg to my mind is unique in not trying to avoid OS buffering. Other
dbmses spend a substantial effort to create a virtual OS (task
management, I/O drivers, etc.) both in code and support. Choosing mmap
seems such a limiting option - it adds OS dependency and limits
kernel developer options (2G limits, global mlock serializations,
porting problems, inability to schedule or parallelize I/O, still
having to coordinate writers and readers).

More to the point, I think it is very hard to effectively coordinate
multithreaded I/O, and mmap seems used mostly to manage relatively
simple scenarios. If the I/O options are:
- OS (which has enormous investment and is stable, but is general
purpose with overhead)
- pg (direct I/O would be costly and potentially destabilizing, but
with big possible performance rewards)
- mmap (a feature mostly used to reduce buffer copies in less
concurrent apps such as image processing that has major architectural
risk including an order of magnitude more semaphores, but can reduce
some extra block copies)
mmap doesn't look that promising.

/Aaron

----- Original Message -----
From: "Kevin Brown" <kevin@sysexperts.com>
To: <pgsql-performance@postgresql.org>
Sent: Thursday, October 14, 2004 4:25 PM
Subject: Re: [PERFORM] First set of OSDL Shared Mem scalability
results, some wierdness ...


> Tom Lane wrote:
> > Kevin Brown <kevin@sysexperts.com> writes:
> > > Tom Lane wrote:
> > >> mmap() is Right Out because it does not afford us sufficient control
> > >> over when changes to the in-memory data will propagate to disk.
> >
> > > ... that's especially true if we simply cannot
> > > have the page written to disk in a partially-modified state (something
> > > I can easily see being an issue for the WAL -- would the same hold
> > > true of the index/data files?).
> >
> > You're almost there.  Remember the fundamental WAL rule: log entries
> > must hit disk before the data changes they describe.  That means that we
> > need not only a way of forcing changes to disk (fsync) but a way of
> > being sure that changes have *not* gone to disk yet.  In the existing
> > implementation we get that by just not issuing write() for a given page
> > until we know that the relevant WAL log entries are fsync'd down to
> > disk.  (BTW, this is what the LSN field on every page is for: it tells
> > the buffer manager the latest WAL offset that has to be flushed before
> > it can safely write the page.)
> >
> > mmap provides msync which is comparable to fsync, but AFAICS it
> > provides no way to prevent an in-memory change from reaching disk too
> > soon.  This would mean that WAL entries would have to be written *and
> > flushed* before we could make the data change at all, which would
> > convert multiple updates of a single page into a series of write-and-
> > wait-for-WAL-fsync steps.  Not good.  fsync'ing WAL once per transaction
> > is bad enough, once per atomic action is intolerable.
>
> Hmm...something just occurred to me about this.
>
> Would a hybrid approach be possible?  That is, use mmap() to handle
> reads, and use write() to handle writes?
>
> Any code that wishes to write to a page would have to recognize that
> it's doing so and fetch a copy from the storage manager (or
> something), which would look to see if the page already exists as a
> writeable buffer.  If it doesn't, it creates it by allocating the
> memory and then copying the page from the mmap()ed area to the new
> buffer, and returning it.  If it does, it just returns a pointer to
> the buffer.  There would obviously have to be some bookkeeping
> involved: the storage manager would have to know how to map a mmap()ed
> page back to a writeable buffer and vice-versa, so that once it
> decides to write the buffer it can determine which page in the
> original file the buffer corresponds to (so it can do the appropriate
> seek()).
>
> In a write-heavy database, you'll end up with a lot of memory copy
> operations, but with the scheme we currently use you get that anyway
> (it just happens in kernel code instead of user code), so I don't see
> that as much of a loss, if any.  Where you win is in a read-heavy
> database: you end up being able to read directly from the pages in the
> kernel's page cache and thus save a memory copy from kernel space to
> user space, not to mention the context switch that happens due to
> issuing the read().
>
>
> Obviously you'd want to mmap() the file read-only in order to prevent
> the issues you mention regarding an errant backend, and then reopen
> the file read-write for the purpose of writing to it.  In fact, you
> could decouple the two: mmap() the file, then close the file -- the
> mmap()ed region will remain mapped.  Then, as long as the file remains
> mapped, you need to open the file again only when you want to write to
> it.
>
>
> --
> Kevin Brown       kevin@sysexperts.com
>
--

Regards,
/Aaron
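
As a rough illustration of the hybrid scheme Kevin Brown proposes in the quoted text above (mmap() for reads, write() for writes), here is a minimal sketch. It is in Python rather than PostgreSQL's C, and nothing in it is PostgreSQL code: the `HybridFile` class, its method names, and the 8192-byte page size are assumptions made for the example. It also uses os.pwrite(), so it assumes a Unix-like system.

```python
import mmap
import os
import tempfile

PAGE_SIZE = 8192  # assumed page size for this sketch


class HybridFile:
    """Hypothetical stand-in for a storage manager: mmap() for reads, write() for writes."""

    def __init__(self, path):
        fd = os.open(path, os.O_RDONLY)
        try:
            # Map read-only; the mapping stays valid after the descriptor is closed.
            self.view = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
        finally:
            os.close(fd)
        self.path = path
        self.dirty = {}  # page number -> writable bytearray copy

    def read_page(self, pageno):
        # Prefer the writable copy if one exists, so readers see pending changes.
        if pageno in self.dirty:
            return bytes(self.dirty[pageno])
        start = pageno * PAGE_SIZE
        return self.view[start:start + PAGE_SIZE]

    def writable_page(self, pageno):
        # The first write to a page copies it out of the mapped area.
        if pageno not in self.dirty:
            start = pageno * PAGE_SIZE
            self.dirty[pageno] = bytearray(self.view[start:start + PAGE_SIZE])
        return self.dirty[pageno]

    def flush_page(self, pageno):
        # Reopen the file for writing only when a page actually has to go out.
        buf = self.dirty.pop(pageno)
        fd = os.open(self.path, os.O_WRONLY)
        try:
            os.pwrite(fd, bytes(buf), pageno * PAGE_SIZE)
            os.fsync(fd)
        finally:
            os.close(fd)

    def close(self):
        self.view.close()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "relation")
        with open(path, "wb") as f:
            f.write(b"a" * PAGE_SIZE + b"b" * PAGE_SIZE)
        hf = HybridFile(path)
        before = hf.read_page(1)[:4]
        hf.writable_page(1)[:4] = b"XXXX"
        pending = hf.read_page(1)[:4]
        hf.flush_page(1)
        with open(path, "rb") as f:
            f.seek(PAGE_SIZE)
            on_disk = f.read(4)
        hf.close()
        return before, pending, on_disk
```

The bookkeeping here is just a dictionary keyed by page number; a real implementation would also have to share the writable copies between backends, which this sketch does not attempt.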

Aaron Werman wrote:
> pg to my mind is unique in not trying to avoid OS buffering. Other
> dbmses spend a substantial effort to create a virtual OS (task
> management, I/O drivers, etc.) both in code and support. Choosing mmap
> seems such a limiting option - it adds OS dependency and limits
> kernel developer options (2G limits, global mlock serializations,
> porting problems, inability to schedule or parallelize I/O, still
> having to coordinate writers and readers).

I'm not sure I entirely agree with this.  Whether you access a file
via mmap() or via read(), the end result is that you still have to
access it, and since PG has significant chunks of system-dependent
code that it heavily relies on as it is (e.g., locking mechanisms,
shared memory), writing the I/O subsystem in a similar way doesn't
seem to me to be that much of a stretch (especially since PG already
has the storage manager), though it might involve quite a bit of work.

As for parallelization of I/O, the use of mmap() for reads should
significantly improve parallelization -- now instead of issuing read()
system calls, possibly for the same set of blocks, all the backends
would essentially be examining the same data directly.  The
performance improvements as a result of accessing the kernel's cache
pages directly instead of having it do buffer copies to process-local
memory should increase as concurrency goes up.  But see below.

> More to the point, I think it is very hard to effectively coordinate
> multithreaded I/O, and mmap seems used mostly to manage relatively
> simple scenarios.

PG already manages and coordinates multithreaded I/O.  The mechanisms
used to coordinate writes needn't change at all.  But the way reads
are done relative to writes might have to be rethought, since an
mmap()ed buffer always reflects what's actually in kernel space at the
time the buffer is accessed, while a buffer retrieved via read()
reflects the state of the file at the time of the read().  If it's
necessary for the state of the buffers to be fixed at examination
time, then mmap() will be at best a draw, not a win.
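
The difference between the two kinds of buffer can be seen in a small experiment. This is only a sketch in Python, with a hypothetical helper name (`compare_views`); it assumes a Unix-like system whose page cache keeps shared file mappings coherent with write(), as Linux and the BSDs do.

```python
import mmap
import os
import tempfile


def compare_views():
    """Return (bytes obtained via read, bytes seen through mmap) after a later write."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "page")
        with open(path, "wb") as f:
            f.write(b"old!" * 1024)

        fd = os.open(path, os.O_RDWR)
        try:
            # A buffer filled by read() is a copy taken at this moment.
            copied = os.pread(fd, 4, 0)
            with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as view:
                # Some other writer now changes the file.
                os.pwrite(fd, b"new!", 0)
                # The mapped view shows whatever the kernel holds right now.
                mapped = view[:4]
        finally:
            os.close(fd)
        return copied, mapped
```

On such a system the read() copy still holds the old bytes while the mapped view shows the new ones, which is exactly the property that would force the read path to be rethought.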

> mmap doesn't look that promising.

This ultimately depends on two things: how much time is spent copying
buffers around in kernel memory, and how much advantage can be gained
by freeing up the memory used by the backends to store the
backend-local copies of the disk pages they use (and thus making that
memory available to the kernel to use for additional disk buffering).
The gains from the former are likely small.  The gains from the latter
are probably also small, but harder to estimate.

The use of mmap() is probably one of those optimizations that should
be done when there's little else left to optimize, because the
potential gains are possibly (if not probably) relatively small and
the amount of work involved may be quite large.


So I agree -- compared with other, much lower-hanging fruit, mmap()
doesn't look promising.



--
Kevin Brown                          kevin@sysexperts.com

Re: mmap (was First set of OSDL Shared Mem scalability results, some wierdness ...

From
Sean Chittenden
Date:
>> pg to my mind is unique in not trying to avoid OS buffering. Other
>> dbmses spend a substantial effort to create a virtual OS (task
>> management, I/O drivers, etc.) both in code and support. Choosing mmap
>> seems such a limiting option - it adds OS dependency and limits
>> kernel developer options (2G limits, global mlock serializations,
>> porting problems, inability to schedule or parallelize I/O, still
>> having to coordinate writers and readers).

2G limits?  That must be a Linux limitation, not a limitation with
mmap(2).  On OS-X and FreeBSD it's anywhere from 4GB to ... well,
whatever the 64bit limit is (which is bigger than any data file in
$PGDATA).  An mlock(2) serialization problem is going to be cheaper
than hitting the disk in nearly all cases and should be no worse than a
context switch or semaphore (what we use for the current locking
scheme); PostgreSQL causes plenty of those because it's
multi-process, not multi-threaded.  Coordination of data isn't
necessary if you mmap(2) data as a private block, which takes a
snapshot of the page at the time you make the mmap(2) call and gets
copied only when the page is written to.  More on that later.

> I'm not sure I entirely agree with this.  Whether you access a file
> via mmap() or via read(), the end result is that you still have to
> access it, and since PG has significant chunks of system-dependent
> code that it heavily relies on as it is (e.g., locking mechanisms,
> shared memory), writing the I/O subsystem in a similar way doesn't
> seem to me to be that much of a stretch (especially since PG already
> has the storage manager), though it might involve quite a bit of work.

Obviously you have to access the file on the hard drive, but you're
forgetting an enormous advantage of mmap(2).  With a read(2) system
call, the program has to allocate space for the read(2), and then the
kernel copies the data into that newly allocated userland memory.
With mmap(2) there is no second copy.

Let's look at what happens with a read(2) call.  To read(2) data you
have to have a block of memory to copy data into.  Assume your OS of
choice has a good malloc(3) implementation and it only needs to call
brk(2) once to extend the process's memory address after the first
malloc(3) call.  There's your first system call, which guarantees one
context switch.  The second hit, a much larger hit, is the actual
read(2) call itself, wherein the kernel has to copy the data twice:
once into a kernel buffer, then from the kernel buffer into the
userland's memory space.  Yuk.  Web servers figured out long ago
that read(2) is slow and evil in terms of performance.  Apache uses
mmap(2) to send static files at performance levels that don't suck and
is actually quite fast (in terms of responsiveness, I'm not talking
about Apache's parallelism/concurrency performance levels... which in
1.X aren't great).

mmap(2) is a totally different animal in that you don't ever need to
make calls to read(2): mmap(2) is used in place of those calls (with
#ifdef and a good abstraction, the rest of PostgreSQL wouldn't know, or
need to know, that it was working with a page of mmap(2)'ed data).
Instead you mmap(2) a file descriptor and the kernel does some heavy
lifting/optimized magic in its VM.  The kernel reads the file
descriptor and places the data it reads into its buffer (exactly the
same as what happens with read(2)), but, instead of copying the data to
the userspace, mmap(2) adjusts the process's address space and maps the
address of the kernel buffer into the process's address space.  No
copying necessary.  The savings here are *huge*!

Depending on the mmap(2) implementation, the VM may not even get a page
from disk until it's actually needed.  So, let's say you mmap(2) a 16MB
file.  The address space picks up an extra 16MB that the process
*can* use, but doesn't necessarily use.  So if a user reads only ten
pages out of a 16MB file, only 10 pages (10 * getpagesize(), usually
40,960 bytes) are read, which is 0.24% of the disk access (((4096 * 10)
/ (16 * 1024 * 1024)) * 100).  Did I forget to mention that if the file
is already in the kernel's buffers, there's no need for the kernel to
access the hard drive?  Another big win for data that's hot/frequently
accessed.
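
The arithmetic above can be reproduced with a short script. This is a sketch in Python with a hypothetical function name (`touched_fraction`); it maps a sparse 16MB temporary file and touches one byte in each of ten pages. It cannot observe which pages the kernel really faulted in, and kernel readahead may fetch more than the pages touched, so it only illustrates the calculation.

```python
import mmap
import tempfile

FILE_SIZE = 16 * 1024 * 1024  # the 16MB file from the example
PAGES_TOUCHED = 10


def touched_fraction():
    """Return (bytes in the touched pages, percentage of the whole file)."""
    page = mmap.PAGESIZE  # same value getpagesize() reports, typically 4096
    with tempfile.TemporaryFile() as f:
        f.truncate(FILE_SIZE)  # sparse file, no data is written
        with mmap.mmap(f.fileno(), FILE_SIZE, access=mmap.ACCESS_READ) as m:
            # Mapping the file reads nothing yet; each access below may fault
            # in the page that holds the byte.
            for i in range(PAGES_TOUCHED):
                _ = m[i * page]
    touched = PAGES_TOUCHED * page
    return touched, 100.0 * touched / FILE_SIZE
```

With 4096-byte pages this gives 40,960 bytes, or about 0.24% of the file.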

There's another large savings if the machine is doing network IO too...

> As for parallelization of I/O, the use of mmap() for reads should
> significantly improve parallelization -- now instead of issuing read()
> system calls, possibly for the same set of blocks, all the backends
> would essentially be examining the same data directly.  The
> performance improvements as a result of accessing the kernel's cache
> pages directly instead of having it do buffer copies to process-local
> memory should increase as concurrency goes up.  But see below.

That's kinda true... though not quite correct.  The improvement in IO
concurrency comes from zero-socket-copy operations from the disk to the
network controller.  If a write(2) system call is issued on a page of
mmap(2)'ed data (and your operating system supports it, I know FreeBSD
does, but don't think Linux does), then the page of data is DMA'ed by
the network controller and sent out without the data needing to be
copied into the network controller's buffer.  So, instead of the CPU
copying data from the OS's buffer to a kernel buffer, the network card
grabs the chunk of data in one interrupt because of the DMA (direct
memory access).  This is a pretty big deal for web serving, but if
you've got a database sending large sets of data over the network,
assuming the network isn't the bottleneck, this results in a hefty
performance boost (that won't be noticed by most until they're running
huge, very busy installations).  This optimization comes for free and
without needing to add one line of code to an application once mmap(2)
has been added to an application.

>> More to the point, I think it is very hard to effectively coordinate
>> multithreaded I/O, and mmap seems used mostly to manage relatively
>> simple scenarios.
>
> PG already manages and coordinates multithreaded I/O.  The mechanisms
> used to coordinate writes needn't change at all.  But the way reads
> are done relative to writes might have to be rethought, since an
> mmap()ed buffer always reflects what's actually in kernel space at the
> time the buffer is accessed, while a buffer retrieved via read()
> reflects the state of the file at the time of the read().  If it's
> necessary for the state of the buffers to be fixed at examination
> time, then mmap() will be at best a draw, not a win.

Here's where things can get interesting from a transaction standpoint.
Your statement is correct up until you make the assertion that a page
needs to be fixed.  If you're doing a read(2) transaction, mmap(2) a
region and set the MAP_PRIVATE flag so the ground won't change
underneath you.  No copying of this page is done by the kernel unless
it gets written to.  If you're doing a write(2) or are directly
scribbling on an mmap(2)'ed page[1], you need to grab some kind of an
exclusive lock on the page/file (mlock(2) is going to be no more
expensive than a semaphore, but probably less expensive).  We already
do that with semaphores, however.  So for databases that don't have
high contention for the same page/file of data, there are no additional
copies made.  When a piece of data is written, a page is duplicated
before it gets scribbled on, but the application never knows this
happens.  The next time a process mmap(2)'s a region of memory that's
been written to, it'll get the updated data without any need to flush a
cache or mark pages as dirty: the operating system does all of this for
us (and probably faster too).  mmap(2) implementations are, IMHO, more
optimized than shared memory implementations (mmap(2) is a VM function,
which gets many eyes to look it over and is always being tuned, whereas
shared mem is a bastardized subsystem that works, but isn't integral to
any performance areas in the kernel so it gets neglected.  Just my
observations from the *BSD commit lists.  On Linux it may be different).

[1] I forgot to mention earlier, you don't have to write(2) data to a
file if it's mmap(2)'ed, you can change the contents of an mmap(2)'ed
region, then msync(2) it back to disk (to ensure it gets written out)
or let the last munmap(2) call do that for you (which would be just as
dangerous as running without fsync... but would result in an additional
performance boost).
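
A small sketch of the two mapping modes discussed above, in Python with a hypothetical function name (`private_vs_shared`). Python's `ACCESS_COPY` gives a copy-on-write mapping (MAP_PRIVATE on Unix), and `flush()` on a writable shared mapping corresponds to msync(2). The sketch only relies on behaviour that is well defined: stores to a private mapping never reach the file, and stores to a shared mapping do once flushed. Whether a private mapping sees later changes made to the file by others is left unspecified by POSIX, so that is not tested here.

```python
import mmap
import os
import tempfile


def private_vs_shared():
    """Return (private view after store, file after private store, file after shared store)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "page")
        with open(path, "wb") as f:
            f.write(b"0000")

        with open(path, "r+b") as f:
            # Copy-on-write mapping: the store stays in this process.
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY) as private:
                private[:4] = b"priv"
                private_view = private[:4]
            with open(path, "rb") as check:
                after_private = check.read()

            # Shared mapping: the store changes the file; flush() forces it out.
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as shared:
                shared[:4] = b"shar"
                shared.flush()
            with open(path, "rb") as check:
                after_shared = check.read()
        return private_view, after_private, after_shared
```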

>> mmap doesn't look that promising.
>
> This ultimately depends on two things: how much time is spent copying
> buffers around in kernel memory, and how much advantage can be gained
> by freeing up the memory used by the backends to store the
> backend-local copies of the disk pages they use (and thus making that
> memory available to the kernel to use for additional disk buffering).

Someone on IRC pointed me to some OSDL benchmarks, which broke down
where time is being spent.  Want to know what the most expensive part
of PostgreSQL is?  *drum roll*

http://khack.osdl.org/stp/297960/profile/DBT_2_Profile-tick.sort

3967393 total                                      1.7735
2331284 default_idle                             36426.3125
 825716 do_sigaction                             1290.1813
 133126 __copy_from_user_ll                      1040.0469
  97780 __copy_to_user_ll                        763.9062
  43135 finish_task_switch                       269.5938
  30973 do_anonymous_page                         62.4456
  24175 scsi_request_fn                           22.2197
  23355 __do_softirq                             121.6406
  17039 __wake_up                                133.1172
  16527 __make_request                            10.8730
   9823 try_to_wake_up                            13.6431
   9525 generic_unplug_device                     66.1458
   8799 find_get_page                             78.5625
   7878 scsi_end_request                          30.7734

Copying data to/from userspace and signal handling!!!!  Let's hear it
for the need for mmap(2)!!!  *crowd goes wild*
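
For scale, the tick counts in the listing above can be turned into percentages. This is plain arithmetic on the quoted figures, sketched in Python with hypothetical names (`TICKS`, `summarize`); which denominator is the meaningful one (all ticks, or only non-idle ticks) is a judgment call that the script does not settle.

```python
# Tick counts copied from the profile listing above.
TICKS = {
    "total": 3967393,
    "default_idle": 2331284,
    "do_sigaction": 825716,
    "__copy_from_user_ll": 133126,
    "__copy_to_user_ll": 97780,
}


def pct(part, whole):
    return 100.0 * part / whole


def summarize():
    total = TICKS["total"]
    busy = total - TICKS["default_idle"]  # ticks not spent idle
    copies = TICKS["__copy_from_user_ll"] + TICKS["__copy_to_user_ll"]
    return {
        "sigaction_pct_of_total": pct(TICKS["do_sigaction"], total),
        "copy_pct_of_total": pct(copies, total),
        "sigaction_pct_of_busy": pct(TICKS["do_sigaction"], busy),
        "copy_pct_of_busy": pct(copies, busy),
    }
```

On these numbers, do_sigaction is about 20.8% of all ticks (50.5% of non-idle ticks) and the two copy routines together are about 5.8% of all ticks (14.1% of non-idle ticks).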

> The gains from the former are likely small.  The gains from the latter
> are probably also small, but harder to estimate.

I disagree.

> The use of mmap() is probably one of those optimizations that should
> be done when there's little else left to optimize, because the
> potential gains are possibly (if not probably) relatively small and
> the amount of work involved may be quite large.

If system/kernel time is where most of your database spends its time,
then mmap(2) is a huge optimization that is very much worth pursuing.
It's stable (nearly all webservers use it, notably Apache), widely
deployed, POSIX specified (granted not all implementations are 100%
consistent, but that's an OS bug and mmap(2) doesn't have to be turned
on for those platforms: it's no worse than where we are now), and well
optimized by operating system hackers.  I guarantee that your operating
system of choice has a faster VM and disk cache than PostgreSQL's
userland cache, never mind that using the OS's buffers leads to many
performance boosts as the OS can short-circuit common pathways that
would require data copying (ex: zero-socket-copy operations and copying
data to/from userland).

mmap(2) isn't a panacea or replacement for good software design, but it
certainly does make IO operations vastly faster, which is what
PostgreSQL does a lot of (hence its need for a userland cache).
Remember, back when PostgreSQL had its architecture thunk up, mmap(2)
hardly existed in anyone's eyes, nevermind it being widely used or a
POSIX function.  It wasn't until Apache started using it that Operating
System vendors felt the need to implement it or make it work well.  Now
it's integral to nearly all virtual memory implementations and a modern
OS can't live without it or have it broken in any way.  It would be
largely beneficial to PostgreSQL to heavily utilize mmap(2).

A few places it should be used include:

*) Storage.  It is a good idea to mmap(2) all files instead of
read(2)'ing files.  mmap(2) doesn't fetch a page from disk until it's
actually needed, which is a nifty savings.  Sure it causes a fault in
the kernel, but it won't the second time that page is accessed.
Changes are necessary to src/backend/storage/file/, possibly
src/backend/storage/freespace/ (why is it using fread(3) and not
read(2)?), src/backend/storage/large_object/ can remain gimpy since
people should use BYTEA instead (IMHO), src/backend/storage/page/
doesn't need changes (I don't think), src/backend/storage/smgr/
shouldn't need any modifications either.

*) ARC.  Why munmap(2) data if you don't need to?  With ARC, it's
possible for the database to coach the operating system in what pages
should be persistent.  ARC's a smart algorithm for handling the needs
of a database.  Instead of having a cache of pages in userland,
PostgreSQL would have a cache of mmap(2)'ed pages.  It's shared between
processes, the changes are public to external programs read(2)'ing
data, and it's quick.  The need for shared memory by the kernel drops
to nearly nothing.  The need for mmap(2)'able space in the kernel,
however, does go up.  Unlike SysV shared mem, this can normally be
changed on the fly.  The end result would be, if a page is needed, it
checks to see if it's in the cache.  If it is, the mmap(2)'ed page is
returned.  If it isn't, the page gets read(2)/mmap(2) like it currently
is loaded (except in the mmap(2) case where after the data has been
loaded, the page gets munmap(2)'ed).  If ARC decides to keep the page,
the page doesn't get munmap(2)'ed.  I don't think any changes need to
be made though to take advantage of mmap(2) if the changes are made in
the places mentioned above in the Storage point.


A few other perks:

*) DIRECTIO can be used without much of a cache coherency headache
since the cache of data is in the kernel, not userland.

*) NFS.  I'm not suggesting multiple clients use the same data
directory via NFS (unless read only), but if there were a single client
accessing a data directory over NFS, performance would be much better
than it is today because data consistency is handled by the kernel so
in flight packets for writes that get dropped or lost won't cause a
slow down (mmap(2) behaves differently with NFS pages) or corruption.

*) mmap(2) is conditional on the operating system's abilities, but
doesn't require any architectural changes.  It does change the location
of the cache, from being in the userland, down into the kernel.  This
is a change for database administrators, but a good one, IMHO.
Previously, the operating system would be split 25% kernel, 75% user
because PostgreSQL would need the available RAM for its cache.  Now,
that can be moved closer to the opposite, 75% kernel, 25% user because
most of the memory is mmap(2)'ed pages instead of actual memory in the
userland.

*) Pages can be protected via PROT_(EXEC|READ|WRITE).  For backends
that aren't making changes to the DDL or system catalogs (permissions,
etc.), pages that are loaded from the catalogs could be loaded with the
protection PROT_READ, which would prevent changes to the catalogs.  All
DDL and permission altering commands (anything that touches the system
catalogs) would then load the page with the PROT_WRITE bit set, make
their changes, then PROT_READ the page again.  This would provide a
first line of defense against buggy programs or exploits.

*) Eliminates the double caching done currently (caching in PostgreSQL
and the kernel) by pushing the cache into the kernel... but without
PostgreSQL knowing it's working on a page that's in the kernel.

Please ask questions if you have them.

-sc

--
Sean Chittenden


On Fri, Oct 15, 2004 at 01:09:01PM -0700, Sean Chittenden wrote:
[snip]
> >
> > This ultimately depends on two things: how much time is spent copying
> > buffers around in kernel memory, and how much advantage can be gained
> > by freeing up the memory used by the backends to store the
> > backend-local copies of the disk pages they use (and thus making that
> > memory available to the kernel to use for additional disk buffering).
>
> Someone on IRC pointed me to some OSDL benchmarks, which broke down
> where time is being spent.  Want to know what the most expensive part
> of PostgreSQL is?  *drum roll*
>
> http://khack.osdl.org/stp/297960/profile/DBT_2_Profile-tick.sort
>
> 3967393 total                                      1.7735
> 2331284 default_idle                             36426.3125
>  825716 do_sigaction                             1290.1813
>  133126 __copy_from_user_ll                      1040.0469
>   97780 __copy_to_user_ll                        763.9062
>   43135 finish_task_switch                       269.5938
>   30973 do_anonymous_page                         62.4456
>   24175 scsi_request_fn                           22.2197
>   23355 __do_softirq                             121.6406
>   17039 __wake_up                                133.1172
>   16527 __make_request                            10.8730
>    9823 try_to_wake_up                            13.6431
>    9525 generic_unplug_device                     66.1458
>    8799 find_get_page                             78.5625
>    7878 scsi_end_request                          30.7734
>
> Copying data to/from userspace and signal handling!!!!  Let's hear it
> for the need for mmap(2)!!!  *crowd goes wild*
>
[snip]

I know where the do_sigaction is coming from in this particular case.
Manfred Spraul tracked it to a pair of pgsignal calls in libpq.
Commenting out those two calls virtually eliminates do_sigaction from
the kernel profile for this workload.  I've lost track of the discussion
over the past year, but I heard a rumor that it was finally addressed to
some degree.  I did understand it touched on a lot of other things, but
can anyone summarize where that discussion has gone?

Mark

Sean Chittenden <sean@chittenden.org> writes:
> Coordination of data isn't
> necessary if you mmap(2) data as a private block, which takes a
> snapshot of the page at the time you make the mmap(2) call and gets
> copied only when the page is written to.  More on that later.

We cannot move to a model where different backends have different
views of the same page, which seems to me to be inherent in the idea of
using MAP_PRIVATE for anything.  To take just one example, a backend
that had mapped one btree index page some time ago could get completely
confused if that page splits, because it might see the effects of the
split in nearby index pages but not in the one that was split.  Or it
could follow an index link to a heap entry that isn't there anymore,
or miss an entry it should have seen.  MVCC doesn't save you from this
because btree adjustments happen below the level of transactions.

However the really major difficulty with using mmap is that it breaks
the scheme we are currently using for WAL, because you don't have any
way to restrict how soon a change in an mmap'd page will go to disk.
(No, I don't believe that mlock guarantees this.  It says that the
page will not be removed from main memory; it does not specify that,
say, the syncer won't write the contents out anyway.)
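
The WAL rule at stake can be shown with a toy model. This is a sketch in Python under stated assumptions, not PostgreSQL's buffer manager: the classes `WalLog`, `Page`, and `BufferManager` and their methods are hypothetical, and "disk" is just a dictionary. It shows how holding back the page write lets several updates to one page share a single WAL flush, using the page's LSN to decide how far the log must be flushed first.

```python
class WalLog:
    """Toy write-ahead log: records get increasing LSNs; flush() makes a prefix durable."""

    def __init__(self):
        self.next_lsn = 1
        self.flushed_lsn = 0
        self.fsync_calls = 0

    def append(self, record):
        lsn = self.next_lsn
        self.next_lsn += 1
        return lsn

    def flush(self, upto):
        # Only pay for an fsync if something up to this LSN is still unflushed.
        if upto > self.flushed_lsn:
            self.flushed_lsn = upto
            self.fsync_calls += 1


class Page:
    def __init__(self):
        self.data = bytearray(8)
        self.lsn = 0  # latest WAL position that must be durable before this page is written


class BufferManager:
    def __init__(self, wal):
        self.wal = wal
        self.disk = {}

    def update(self, page, offset, value):
        page.lsn = self.wal.append((offset, value))  # log the change first
        page.data[offset] = value                    # then change memory only

    def write_page(self, pageno, page):
        self.wal.flush(page.lsn)  # the log must hit disk before the data
        assert self.wal.flushed_lsn >= page.lsn
        self.disk[pageno] = bytes(page.data)


def demo():
    wal = WalLog()
    bufmgr = BufferManager(wal)
    page = Page()
    for i in range(5):
        bufmgr.update(page, i, i + 1)
    fsyncs_before_write = wal.fsync_calls
    bufmgr.write_page(0, page)
    return fsyncs_before_write, wal.fsync_calls, page.lsn
```

In this model five updates cost one WAL flush, because nothing reaches "disk" until write_page() is called. If the kernel were free to write the page at any moment, as with a shared writable mapping, the flush would have to happen before each in-memory change instead.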

> Let's look at what happens with a read(2) call.  To read(2) data you
> have to have a block of memory to copy data into.  Assume your OS of
> choice has a good malloc(3) implementation and it only needs to call
> brk(2) once to extend the process's memory address after the first
> malloc(3) call.  There's your first system call, which guarantees one
> context switch.

Wrong.  Our reads occur into shared memory allocated at postmaster
startup, remember?

> mmap(2) is a totally different animal in that you don't ever need to
> make calls to read(2): mmap(2) is used in place of those calls (With
> #ifdef and a good abstraction, the rest of PostgreSQL wouldn't know it
> was working with a page of mmap(2)'ed data or need to know that it is).

Instead, you have to worry about address space management and keeping a
consistent view of the data.

> ... If a write(2) system call is issued on a page of
> mmap(2)'ed data (and your operating system supports it, I know FreeBSD
> does, but don't think Linux does), then the page of data is DMA'ed by
> the network controller and sent out without the data needing to be
> copied into the network controller's buffer.

Perfectly irrelevant to Postgres, since there is no situation where we'd
ever write directly from a disk buffer to a socket; in the present
implementation there are at least two levels of copy needed in between
(datatype-specific output function and protocol message assembly).  And
that's not even counting the fact that any data item large enough to
make the savings interesting would have been sliced, diced, and
compressed by TOAST.

> ... If you're doing a write(2) or are directly
> scribbling on an mmap(2)'ed page[1], you need to grab some kind of an
> exclusive lock on the page/file (mlock(2) is going to be no more
> expensive than a semaphore, but probably less expensive).

More incorrect information.  The locking involved here is done by
LWLockAcquire, which is significantly *less* expensive than a kernel
call in the case where there is no need to block.  (If you have to
block, any kernel call to do so is probably about as bad as any other.)
Switching over to mlock would likely make things considerably slower.
In any case, you didn't actually mean to say mlock did you?  It doesn't
lock pages against writes by other processes AFAICS.

> shared mem is a bastardized subsystem that works, but isn't integral to
> any performance areas in the kernel so it gets neglected.

What performance issues do you think shared memory needs to have fixed?
We don't issue any shmem kernel calls after the initial shmget, so
comparing the level of kernel tenseness about shmget to the level of
tenseness about mmap is simply irrelevant.  Perhaps the reason you don't
see any traffic about this on the kernel lists is that shared memory
already works fine and doesn't need any fixing.

> Please ask questions if you have them.

Do you have any arguments that are actually convincing?  What I just
read was a proposal to essentially throw away not only the entire
low-level data access model, but the entire low-level locking model,
and start from scratch.  There is no possible way we could support both
this approach and the current one, which means that we'd be permanently
dropping support for all platforms without high-quality mmap
implementations; and despite your enthusiasm I don't think that that
category includes every interesting platform.  Furthermore, you didn't
give any really convincing reasons to think that the enormous effort
involved would be repaid.  Those oprofile reports Josh just put up
showed 3% of the CPU time going into userspace/kernelspace copying.
Even assuming that that number consists entirely of reads and writes of
shared buffers (and of course no other kernel call ever transfers any
data across that boundary ;-)), there's no way we are going to buy into
this sort of project in hopes of a 3% win.

            regards, tom lane

Mark Wong <markw@osdl.org> writes:
> I know where the do_sigaction is coming from in this particular case.
> Manfred Spraul tracked it to a pair of pgsignal calls in libpq.
> Commenting out those two calls virtually eliminates do_sigaction from
> the kernel profile for this workload.

Hmm, I suppose those are the ones associated with suppressing SIGPIPE
during send().  It looks to me like those should go away in 8.0 if you
have compiled with ENABLE_THREAD_SAFETY ... exactly how is PG being
built in the current round of tests?
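
For readers unfamiliar with the pattern, the following is a minimal sketch of per-call SIGPIPE suppression around send(). It is not libpq's actual code; `send_ignoring_sigpipe` and `demo_broken_pipe` are hypothetical names chosen for illustration. It only aims to show why every send() costs two extra sigaction(2) system calls, which is roughly the kind of overhead that would surface as do_sigaction in a kernel profile.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not libpq's code: ignore SIGPIPE only for the
 * duration of one send(), then restore the caller's disposition.
 * Every call pays for two extra sigaction(2) system calls. */
ssize_t send_ignoring_sigpipe(int fd, const void *buf, size_t len)
{
    struct sigaction ignore, saved;
    ssize_t n;
    int saved_errno;

    memset(&ignore, 0, sizeof ignore);
    ignore.sa_handler = SIG_IGN;
    sigemptyset(&ignore.sa_mask);

    if (sigaction(SIGPIPE, &ignore, &saved) != 0)   /* extra syscall #1 */
        return -1;
    n = send(fd, buf, len, 0);
    saved_errno = errno;
    sigaction(SIGPIPE, &saved, NULL);               /* extra syscall #2 */
    errno = saved_errno;
    return n;
}

/* Demo: send on a socket whose peer is already closed.  Returns the
 * errno reported by send(), 0 if the send unexpectedly succeeded,
 * or -1 if the socket pair could not be created. */
int demo_broken_pipe(void)
{
    int sv[2];
    ssize_t n;
    int err;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    close(sv[1]);                       /* peer goes away */
    n = send_ignoring_sigpipe(sv[0], "x", 1);
    err = (n < 0) ? errno : 0;
    close(sv[0]);
    return err;
}
```

With the signal ignored, the failed send() reports EPIPE instead of killing the process. A thread-safe build would need a different approach, since the signal disposition is process-wide.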

            regards, tom lane

On Fri, Oct 15, 2004 at 05:37:50PM -0400, Tom Lane wrote:
> Mark Wong <markw@osdl.org> writes:
> > I know where the do_sigaction is coming from in this particular case.
> > Manfred Spraul tracked it to a pair of pgsignal calls in libpq.
> > Commenting out those two calls virtually eliminates do_sigaction from
> > the kernel profile for this workload.
>
> Hmm, I suppose those are the ones associated with suppressing SIGPIPE
> during send().  It looks to me like those should go away in 8.0 if you
> have compiled with ENABLE_THREAD_SAFETY ... exactly how is PG being
> built in the current round of tests?
>

Ah, yes.  Ok.  It's not being configured with any options.  That'll be easy to
remedy though.  I'll get that change made and we can try again.

Mark

Re: mmap (was First set of OSDL Shared Mem scalability results,

From
Bruce Momjian
Date:
Tom Lane wrote:
> Mark Wong <markw@osdl.org> writes:
> > I know where the do_sigaction is coming from in this particular case.
> > Manfred Spraul tracked it to a pair of pgsignal calls in libpq.
> > Commenting out those two calls virtually eliminates do_sigaction from
> > the kernel profile for this workload.
>
> Hmm, I suppose those are the ones associated with suppressing SIGPIPE
> during send().  It looks to me like those should go away in 8.0 if you
> have compiled with ENABLE_THREAD_SAFETY ... exactly how is PG being
> built in the current round of tests?

Yes, those calls are gone in 8.0 with --enable-thread-safety and were
added specifically because of Manfred's reports.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

On Fri, Oct 15, 2004 at 09:22:03PM -0400, Bruce Momjian wrote:
> Tom Lane wrote:
> > Mark Wong <markw@osdl.org> writes:
> > > I know where the do_sigaction is coming from in this particular case.
> > > Manfred Spraul tracked it to a pair of pgsignal calls in libpq.
> > > Commenting out those two calls virtually eliminates do_sigaction from
> > > the kernel profile for this workload.
> >
> > Hmm, I suppose those are the ones associated with suppressing SIGPIPE
> > during send().  It looks to me like those should go away in 8.0 if you
> > have compiled with ENABLE_THREAD_SAFETY ... exactly how is PG being
> > built in the current round of tests?
>
> Yes, those calls are gone in 8.0 with --enable-thread-safety and were
> added specifically because of Manfred's reports.
>

Ok, I had the build commands changed for installing PostgreSQL in STP.
The do_sigaction call isn't at the top of the profile anymore.  Here's
a reference for those who are interested; it should have the same test
parameters as the one Tom referenced a little earlier:
    http://khack.osdl.org/stp/298230/

Mark

Re: mmap (was First set of OSDL Shared Mem scalability results, some wierdness ...

From
Sean Chittenden
Date:
> However the really major difficulty with using mmap is that it breaks
> the scheme we are currently using for WAL, because you don't have any
> way to restrict how soon a change in an mmap'd page will go to disk.
> (No, I don't believe that mlock guarantees this.  It says that the
> page will not be removed from main memory; it does not specify that,
> say, the syncer won't write the contents out anyway.)

I had to think about this for a minute (now nearly a week) and reread
the docs on WAL before I grokked what could happen here.  You're
absolutely right in that WAL needs to be taken into account first.  How
does this execution path sound to you?

By default, all mmap(2)'ed pages are MAP_SHARED.  There are no
complications with regards to reads.

When a backend wishes to write a page, the following steps are taken:

1) Backend grabs a lock from the lockmgr to write to the page (exactly
as it does now)

2) Backend mmap(2)'s a second copy of the page(s) being written to,
this time with the MAP_PRIVATE flag set.  Mapping a copy of the page
again is wasteful in terms of address space, but does not require any
more memory than our current scheme.  The re-mapping of the page with
MAP_PRIVATE prevents changes to the data that other backends are
viewing.

3) The writing backend can then scribble on its private copy of the
page(s) as it sees fit.

4) Once it has finished making changes and the transaction is to be
committed, the backend WAL logs its changes.

5) Once the WAL logging is complete and it has hit the disk, the
backend msync(2)'s its private copy of the pages to disk (ASYNC or
SYNC, it doesn't really matter too much to me).

6) Optional(?).  I'm not sure whether or not the backend would also
need to issue an msync(2) MS_INVALIDATE, but I suspect it would not need
to on systems with unified buffer caches such as FreeBSD or OS-X.  On
HPUX, or other older *NIX'es, it may be necessary.  *shrug*  I could be
trying to be overly protective here.

7) Backend munmap(2)'s its private copy of the written on page(s).

8) Backend releases its lock from the lockmgr.

At this point, the remaining backends now are able to see the updated
pages of data.

>> Let's look at what happens with a read(2) call.  To read(2) data you
>> have to have a block of memory to copy data into.  Assume your OS of
>> choice has a good malloc(3) implementation and it only needs to call
>> brk(2) once to extend the process's memory address after the first
>> malloc(3) call.  There's your first system call, which guarantees one
>> context switch.
>
> Wrong.  Our reads occur into shared memory allocated at postmaster
> startup, remember?

Doh.  Fair enough.  In most programs that involve read(2), a call to
malloc(3) needs to be made.

>> mmap(2) is a totally different animal in that you don't ever need to
>> make calls to read(2): mmap(2) is used in place of those calls (With
>> #ifdef and a good abstraction, the rest of PostgreSQL wouldn't know it
>> was working with a page of mmap(2)'ed data or need to know that it
>> is).
>
> Instead, you have to worry about address space management and keeping a
> consistent view of the data.

Which is largely handled by mmap() and the VM.

>> ... If a write(2) system call is issued on a page of
>> mmap(2)'ed data (and your operating system supports it, I know FreeBSD
>> does, but don't think Linux does), then the page of data is DMA'ed by
>> the network controller and sent out without the data needing to be
>> copied into the network controller's buffer.
>
> Perfectly irrelevant to Postgres, since there is no situation where we'd
> ever write directly from a disk buffer to a socket; in the present
> implementation there are at least two levels of copy needed in between
> (datatype-specific output function and protocol message assembly).  And
> that's not even counting the fact that any data item large enough to
> make the savings interesting would have been sliced, diced, and
> compressed by TOAST.

The biggest winners will be columns whose storage type is PLAIN or
EXTERNAL.  writev(2) from mmap(2)'ed pages and non-mmap(2)'ed pages
would be a nice perk too (not sure if PostgreSQL uses this or not).
Since compression isn't happening on most tuples under 1K in size and
most tuples in a database are going to be under that, most tuples are
going to be uncompressed.  Total pages for the database, however, is
likely a different story.  For large tuples that are uncompressed and
larger than a page, it is probably beneficial to use sendfile(2)
instead of mmap(2) + write(2)'ing the page/file.

If a large tuple is compressed, it'd be interesting to see if it'd be
worthwhile to have the data uncompressed onto anonymously mmap(2)'ed
page(s), so that the benefits of zero-socket-copies could be used.

>> shared mem is a bastardized subsystem that works, but isn't integral to
>> any performance areas in the kernel so it gets neglected.
>
> What performance issues do you think shared memory needs to have fixed?
> We don't issue any shmem kernel calls after the initial shmget, so
> comparing the level of kernel tenseness about shmget to the level of
> tenseness about mmap is simply irrelevant.  Perhaps the reason you don't
> see any traffic about this on the kernel lists is that shared memory
> already works fine and doesn't need any fixing.

I'm gonna get flamed for this, but I think it's improperly used as a
second level cache on top of the operating system's cache.  mmap(2)
would consolidate all caching into the kernel.

>> Please ask questions if you have them.
>
> Do you have any arguments that are actually convincing?

Three things come to mind.

1) A single cache for pages
2) Ability to give access hints to the kernel regarding future IO
3) On the fly memory use for a cache.  There would be no need to
preallocate slabs of shared memory on startup.

And a more minor point would be:

4) Not having shared pages get lost when the backend dies (mmap(2) uses
refcounts and cleans itself up, no need for ipcs/ipcrm/ipcclean).  This
isn't too practical in production, but it sucks doing PostgreSQL
development on OS-X because there is no ipcs/ipcrm command.
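
Point 2 in the list above (access hints) can be illustrated with posix_madvise(3). This is only a sketch under stated assumptions: `sum_mapped_sequential` and `demo_hint` are hypothetical names, the temp-file path is arbitrary, and the advice is purely advisory, so the kernel is free to ignore it.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map a file read-only and MAP_SHARED, hint that it
 * will be scanned front to back, then sum its bytes.  Returns -1 on
 * error or for an empty file. */
long sum_mapped_sequential(int fd)
{
    struct stat st;
    unsigned char *p;
    long sum = 0;
    off_t i;

    if (fstat(fd, &st) != 0 || st.st_size == 0)
        return -1;
    p = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return -1;

    /* Advisory only: a failure here is not treated as an error. */
    (void) posix_madvise(p, (size_t) st.st_size, POSIX_MADV_SEQUENTIAL);

    for (i = 0; i < st.st_size; i++)
        sum += p[i];
    munmap(p, (size_t) st.st_size);
    return sum;
}

/* Demo: run the helper over a three-byte temp file containing "abc". */
long demo_hint(void)
{
    char path[] = "/tmp/mmap_hint_XXXXXX";
    int fd = mkstemp(path);
    long sum;

    if (fd < 0)
        return -1;
    unlink(path);                       /* file vanishes once closed */
    if (write(fd, "abc", 3) != 3) {
        close(fd);
        return -1;
    }
    sum = sum_mapped_sequential(fd);
    close(fd);
    return sum;
}
```

Other advice values such as POSIX_MADV_WILLNEED or POSIX_MADV_RANDOM follow the same calling pattern.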

> What I just read was a proposal to essentially throw away not only the entire
> low-level data access model, but the entire low-level locking model,
> and start from scratch.

From the above list, steps 2, 3, 5, 6, and 7 would be different than
our current approach, all of which could be safely handled with some
#ifdef's on platforms that don't have mmap(2).

> There is no possible way we could support both
> this approach and the current one, which means that we'd be permanently
> dropping support for all platforms without high-quality mmap
> implementations;

Architecturally, I don't see any differences or incompatibilities
that aren't solved with an #ifdef USE_MMAP/#else/#endif.

> Furthermore, you didn't
> give any really convincing reasons to think that the enormous effort
> involved would be repaid.

Stevens has a great reimplementation of cat(1) that uses mmap(2) and
benchmarks the two.  I did my own version of that here:

http://people.freebsd.org/~seanc/mmap_test/

When read(2)'ing/write(2)'ing /etc/services 100,000 times without
mmap(2), it takes 82 seconds.  With mmap(2), it takes anywhere from 1.1
to 18 seconds.  Worst case scenario with mmap(2) yields a speedup by a
factor of four.  Best case scenario...  *shrug* something better than
4x.  I doubt PostgreSQL would see 4x speedups in the IO department, but
I do think it would be vastly greater than the 3% suggested.
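
To make the comparison concrete, here is a minimal sketch of the two copy strategies being benchmarked. It is not the code at the URL above; `write_all`, `copy_read_write`, `copy_mmap` and `demo_copies_match` are hypothetical names. The read(2) variant copies each block into a user-space buffer before writing it out, while the mmap(2) variant hands the mapped pages straight to write(2).

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Write all len bytes, retrying after short writes. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0)
            return -1;
        buf += n;
        len -= (size_t) n;
    }
    return 0;
}

/* Classic loop: kernel -> user buffer -> kernel. */
int copy_read_write(int in_fd, int out_fd)
{
    char buf[8192];
    ssize_t n;

    while ((n = read(in_fd, buf, sizeof buf)) > 0)
        if (write_all(out_fd, buf, (size_t) n) != 0)
            return -1;
    return n < 0 ? -1 : 0;
}

/* mmap variant: no read-side copy into a private buffer. */
int copy_mmap(int in_fd, int out_fd)
{
    struct stat st;
    void *p;
    int rc;

    if (fstat(in_fd, &st) != 0)
        return -1;
    if (st.st_size == 0)
        return 0;
    p = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_SHARED, in_fd, 0);
    if (p == MAP_FAILED)
        return -1;
    rc = write_all(out_fd, p, (size_t) st.st_size);
    munmap(p, (size_t) st.st_size);
    return rc;
}

/* Demo: copy a small temp file with both strategies into one output
 * file; returns 1 if both copies match the input byte for byte. */
int demo_copies_match(void)
{
    static const char data[] = "ftp 21/tcp\nssh 22/tcp\n";
    char in_path[] = "/tmp/cat_in_XXXXXX", out_path[] = "/tmp/cat_out_XXXXXX";
    char got[2 * sizeof data];
    size_t len = sizeof data - 1;
    int in_fd = mkstemp(in_path), out_fd = mkstemp(out_path);
    int ok = 0;

    if (in_fd >= 0) unlink(in_path);
    if (out_fd >= 0) unlink(out_path);
    if (in_fd >= 0 && out_fd >= 0 &&
        write_all(in_fd, data, len) == 0 &&
        lseek(in_fd, 0, SEEK_SET) == 0 &&
        copy_read_write(in_fd, out_fd) == 0 &&
        copy_mmap(in_fd, out_fd) == 0 &&
        pread(out_fd, got, sizeof got, 0) == (ssize_t) (2 * len) &&
        memcmp(got, data, len) == 0 &&
        memcmp(got + len, data, len) == 0)
        ok = 1;
    if (in_fd >= 0) close(in_fd);
    if (out_fd >= 0) close(out_fd);
    return ok;
}
```

Timing a loop over either function would be the obvious next step; the sketch itself makes no claim about the resulting numbers.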

> Those oprofile reports Josh just put up
> showed 3% of the CPU time going into userspace/kernelspace copying.
> Even assuming that that number consists entirely of reads and writes of
> shared buffers (and of course no other kernel call ever transfers any
> data across that boundary ;-)), there's no way we are going to buy into
> this sort of project in hopes of a 3% win.

Would it be helpful if I created a test program that demonstrated
the execution path for writing mmap(2)'ed pages as outlined above?

-sc

--
Sean Chittenden


Sean Chittenden <sean@chittenden.org> writes:
> When a backend wishes to write a page, the following steps are taken:
> ...
> 2) Backend mmap(2)'s a second copy of the page(s) being written to,
> this time with the MAP_PRIVATE flag set.
> ...
> 5) Once the WAL logging is complete and it has hit the disk, the
> backend msync(2)'s its private copy of the pages to disk (ASYNC or
> SYNC, it doesn't really matter too much to me).

My man page for mmap says that changes in a MAP_PRIVATE region are
private; they do not affect the file at all, msync or no.  So I don't
think the above actually works.
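
The MAP_PRIVATE behaviour described here can be checked with a small test. The sketch below is an illustration only; `private_store_leaks` is a hypothetical name and the temp-file path is arbitrary. It stores into a MAP_PRIVATE mapping, calls msync(2), and then looks at both the file and a MAP_SHARED mapping of the same page.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define PAGE_BYTES 4096

/* Returns 1 if a store made through a MAP_PRIVATE mapping became visible
 * in the file or in a MAP_SHARED mapping of it, 0 if it stayed private,
 * and -1 if the setup failed. */
int private_store_leaks(void)
{
    char path[] = "/tmp/map_private_XXXXXX";
    char page[PAGE_BYTES];
    char on_disk = 0;
    char *shared, *priv;
    int leaked = -1;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);

    /* The "data file": one page filled with 'A'. */
    memset(page, 'A', sizeof page);
    if (write(fd, page, sizeof page) != (ssize_t) sizeof page) {
        close(fd);
        return -1;
    }

    shared = mmap(NULL, PAGE_BYTES, PROT_READ, MAP_SHARED, fd, 0);
    priv = mmap(NULL, PAGE_BYTES, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (shared != MAP_FAILED && priv != MAP_FAILED) {
        priv[0] = 'B';                              /* copy-on-write store */
        (void) msync(priv, PAGE_BYTES, MS_SYNC);    /* result ignored on purpose */
        if (pread(fd, &on_disk, 1, 0) == 1)
            leaked = (on_disk == 'B') || (shared[0] == 'B');
    }
    if (shared != MAP_FAILED) munmap(shared, PAGE_BYTES);
    if (priv != MAP_FAILED) munmap(priv, PAGE_BYTES);
    close(fd);
    return leaked;
}
```

POSIX specifies that MAP_PRIVATE modifications do not change the underlying object, so the function is expected to return 0: the file and the shared view both still hold 'A'.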

In any case, this scheme still forces you to flush WAL records to disk
before making the changed page visible to other backends, so I don't
see how it improves the situation.  In the existing scheme we only have
to fsync WAL at (1) transaction commit, (2) when we are forced to write
a page out from shared buffers because we are short of buffers, or (3)
checkpoint.  Anything that implies an fsync per atomic action is going
to be a loser.  It does not matter how great your kernel API is if you
only get to perform one atomic action per disk rotation :-(

The important point here is that you can't postpone making changes at
the page level visible to other backends; there's no MVCC at this level.
Consider for example two backends wanting to insert a new row.  If they
both MAP_PRIVATE the same page, they'll probably choose the same tuple
slot on the page to insert into (certainly there is nothing to stop that
from happening).  Now you have conflicting definitions for the same
CTID, not to mention probably conflicting uses of the page's physical
free space; disaster ensues.  So "atomic action" really means "lock
page, make changes, add WAL record to in-memory WAL buffers, unlock
page" with the understanding that as soon as you unlock the page the
changes you've made in it are visible to all other backends.  You
*can't* afford to put a WAL fsync in this sequence.
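
The "atomic action" sequence can be sketched with toy structures. Everything below is hypothetical (`ToyPage`, `ToyWalBuffer`, `toy_insert`, `demo_distinct_slots`) and bears no relation to PostgreSQL's real buffer manager or WAL code; it only illustrates lock page, make changes, append a WAL record to an in-memory buffer, unlock page, with no fsync anywhere in the sequence.

```c
#include <pthread.h>
#include <string.h>

#define SLOTS_PER_PAGE 8
#define WAL_BUF_SIZE   64

/* Toy stand-in for a shared page: a lock plus a few tuple slots. */
typedef struct {
    pthread_mutex_t lock;
    int next_free;
    int slots[SLOTS_PER_PAGE];
} ToyPage;

/* Toy stand-in for the in-memory WAL buffers (nothing is fsync'd). */
typedef struct {
    pthread_mutex_t lock;
    int nrecords;
    int records[WAL_BUF_SIZE];
} ToyWalBuffer;

/* One atomic action.  Returns the slot used, or -1 if the page is full. */
int toy_insert(ToyPage *page, ToyWalBuffer *wal, int value)
{
    int slot = -1;

    pthread_mutex_lock(&page->lock);            /* lock page */
    if (page->next_free < SLOTS_PER_PAGE) {
        slot = page->next_free++;               /* make changes */
        page->slots[slot] = value;

        pthread_mutex_lock(&wal->lock);         /* add WAL record in memory */
        if (wal->nrecords < WAL_BUF_SIZE)
            wal->records[wal->nrecords++] = slot;
        pthread_mutex_unlock(&wal->lock);
    }
    pthread_mutex_unlock(&page->lock);          /* unlock: change is visible */
    return slot;
}

/* Demo: two inserts against the same shared page, standing in for two
 * backends.  Returns 1 if they received different slots and both were
 * logged. */
int demo_distinct_slots(void)
{
    ToyPage page;
    ToyWalBuffer wal;
    int a, b;

    memset(&page, 0, sizeof page);
    memset(&wal, 0, sizeof wal);
    pthread_mutex_init(&page.lock, NULL);
    pthread_mutex_init(&wal.lock, NULL);

    a = toy_insert(&page, &wal, 100);
    b = toy_insert(&page, &wal, 200);

    pthread_mutex_destroy(&page.lock);
    pthread_mutex_destroy(&wal.lock);
    return a >= 0 && b >= 0 && a != b && wal.nrecords == 2;
}
```

Because both inserters see the same `next_free` under the page lock, they cannot pick the same slot. With two private copies of the page, each would start from the same `next_free` and collide.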

You could possibly buy back most of the lossage in this scenario if
there were some efficient way for a backend to hold the low-level lock
on a page just until some other backend wanted to modify the page;
whereupon the previous owner would have to do what's needed to make his
changes visible before releasing the lock.  Given the right access
patterns you don't have to fsync very often (though given the wrong
access patterns you're still in deep trouble).  But we don't have any
such mechanism and I think the communication costs of one would be
forbidding.

> [ much snipped ]
> 4) Not having shared pages get lost when the backend dies (mmap(2) uses
> refcounts and cleans itself up, no need for ipcs/ipcrm/ipcclean).

Actually, that is not a bug, that's a feature.  One of the things that
scares me about mmap is that a crashing backend is able to scribble all
over live disk buffers before it finally SEGV's (think about memcpy gone
wrong and similar cases).  In our existing scheme there's a pretty good
chance that we will be able to commit hara-kiri before any of the
trashed data gets written out.  In an mmap scheme, it's time to dig out
your backup tapes, because there simply is no distinction between
transient and permanent data --- the kernel has no way to know that you
didn't mean it.

In short, I remain entirely unconvinced that mmap is of any interest to us.

            regards, tom lane