[PATCH] PostgreSQL 9.4 mmap(2) performance regression on FreeBSD... - Mailing list pgsql-hackers

From Sean Chittenden
Subject [PATCH] PostgreSQL 9.4 mmap(2) performance regression on FreeBSD...
Msg-id sig.030123e089.53EA43F6.2040108@chittenden.org
Responses Re: [PATCH] PostgreSQL 9.4 mmap(2) performance regression on FreeBSD...
List pgsql-hackers
One of the patches that I've been sitting on, and have been derelict in sending upstream, is the attached mmap(2) flags patch for the BSDs. Is there any chance this can be squeezed into the PostgreSQL 9.4 release?

The patch is trivial in size: it adds one flag to the mmap(2) calls in dsm_impl.c. Alan Cox (FreeBSD's alc@, not the Linux Alan Cox) and I went back and forth regarding PostgreSQL's use of mmap(2) and determined that this change is correct and will prevent a likely performance regression in PostgreSQL 9.4. In PostgreSQL 9.3, every mmap(2) call was made with the flags MAP_ANON | MAP_SHARED; in PostgreSQL 9.4 that is no longer the case.
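For readers without the attachment handy, here is a minimal sketch of the kind of change being described, not the patch itself. The macro name DSM_MMAP_FLAGS and the helper dsm_map_file() are hypothetical names made up for illustration; the only assumption about platforms is that MAP_NOSYNC is undefined outside the BSDs, where it is defined to 0 so that it becomes a no-op.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * MAP_NOSYNC is not portable.  Where the platform does not provide it,
 * define it to 0 so that OR-ing it into the flags changes nothing.
 */
#ifndef MAP_NOSYNC
#define MAP_NOSYNC 0
#endif

/* Hypothetical name for illustration; not taken from the attached patch. */
#define DSM_MMAP_FLAGS (MAP_SHARED | MAP_NOSYNC)

/*
 * Map 'size' bytes of an already-open, already-sized file so that it can
 * be shared between processes.  Returns MAP_FAILED on error, exactly as
 * mmap(2) does.
 */
static void *
dsm_map_file(int fd, size_t size)
{
    assert(size > 0);
    return mmap(NULL, size, PROT_READ | PROT_WRITE, DSM_MMAP_FLAGS, fd, 0);
}
```

Keeping the flag behind a macro means the call sites stay identical on every platform, and only FreeBSD (or any other system that defines MAP_NOSYNC) sees different behavior.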

Digging into the patch: from reviewing src/backend/storage/ipc/dsm_impl.c, it's clear that rhaas@ understood the consequences of mmap(2), including the possibility of having dirty pages gratuitously flushed to disk:

src/backend/storage/ipc/dsm_impl.c:781
 * Operating system primitives to support mmap-based shared memory.
 *
 * Calling this "shared memory" is somewhat of a misnomer, because what
 * we're really doing is creating a bunch of files and mapping them into
 * our address space.  The operating system may feel obliged to
 * synchronize the contents to disk even if nothing is being paged out,
 * which will not serve us well.  The user can relocate the pg_dynshmem
 * directory to a ramdisk to avoid this problem, if available.

In order for the above comment to be true for FreeBSD, an extra flag needs to be passed to mmap(2). From FreeBSD 10's mmap(2) page[2]:

     MAP_NOSYNC         Causes data dirtied via this VM map to be flushed to
                        physical media only when necessary (usually by the
                        pager) rather than gratuitously.  Typically this pre-
                        vents the update daemons from flushing pages dirtied
                        through such maps and thus allows efficient sharing of
                        memory across unassociated processes using a file-
                        backed shared memory map.  Without this option any VM
                        pages you dirty may be flushed to disk every so often
                        (every 30-60 seconds usually) which can create perfor-
                        mance problems if you do not need that to occur (such
                        as when you are using shared file-backed mmap regions
                        for IPC purposes).  Note that VM/file system coherency
                        is maintained whether you use MAP_NOSYNC or not.  This
                        option is not portable across UNIX platforms (yet),
                        though some may implement the same behavior by
                        default.

                        WARNING!  Extending a file with ftruncate(2), thus
                        creating a big hole, and then filling the hole by mod-
                        ifying a shared mmap() can lead to severe file frag-
                        mentation.  In order to avoid such fragmentation you
                        should always pre-allocate the file's backing store by
                        write()ing zero's into the newly extended area prior
                        to modifying the area via your mmap().  The fragmenta-
                        tion problem is especially sensitive to MAP_NOSYNC
                        pages, because pages may be flushed to disk in a
                        totally random order.

                        The same applies when using MAP_NOSYNC to implement a
                        file-based shared memory store.  It is recommended
                        that you create the backing store by write()ing zero's
                        to the backing file rather than ftruncate()ing it.
                        You can test file fragmentation by observing the KB/t
                        (kilobytes per transfer) results from an ``iostat 1''
                        while reading a large file sequentially, e.g. using
                        ``dd if=filename of=/dev/null bs=32k''.

                        The fsync(2) system call will flush all dirty data and
                        metadata associated with a file, including dirty
                        NOSYNC VM data, to physical media.  The sync(8) com-
                        mand and sync(2) system call generally do not flush
                        dirty NOSYNC VM data.  The msync(2) system call is
                        usually not needed since BSD implements a coherent
                        file system buffer cache.  However, it may be used to
                        associate dirty VM pages with file system buffers and
                        thus cause them to be flushed to physical media sooner
                        rather than later.
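To make the man page's advice concrete, here is a rough sketch of creating a file-backed segment the way it recommends: write() zeros to allocate the backing store up front, then map the file with MAP_SHARED | MAP_NOSYNC. The function names zero_fill() and create_nosync_segment() are hypothetical, the 8 kB buffer size is an arbitrary choice, and MAP_NOSYNC is again defined to 0 where the platform lacks it. This illustrates the technique only and makes no claim about what dsm_impl.c does.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef MAP_NOSYNC
#define MAP_NOSYNC 0            /* no-op where the flag does not exist */
#endif

/*
 * Fill the first 'size' bytes of fd with zeros using write(), so that the
 * file's blocks are allocated before anything is dirtied through a mapping.
 * Returns 0 on success, -1 on error with errno set.
 */
static int
zero_fill(int fd, size_t size)
{
    char    zbuf[8192];
    size_t  remaining = size;

    memset(zbuf, 0, sizeof(zbuf));
    if (lseek(fd, 0, SEEK_SET) == (off_t) -1)
        return -1;

    while (remaining > 0)
    {
        size_t  chunk = remaining < sizeof(zbuf) ? remaining : sizeof(zbuf);
        ssize_t n = write(fd, zbuf, chunk);

        if (n < 0)
        {
            if (errno == EINTR)
                continue;       /* interrupted before writing; retry */
            return -1;
        }
        if (n == 0)
        {
            errno = ENOSPC;     /* no progress; treat as out of space */
            return -1;
        }
        remaining -= (size_t) n;
    }
    return 0;
}

/*
 * Create (or reuse) the file at 'path', preallocate 'size' bytes, and map
 * it shared.  Returns MAP_FAILED on any error.
 */
static void *
create_nosync_segment(const char *path, size_t size)
{
    int     fd;
    int     saved_errno;
    void   *addr;

    assert(path != NULL);
    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return MAP_FAILED;

    if (zero_fill(fd, size) != 0)
    {
        saved_errno = errno;
        close(fd);
        errno = saved_errno;
        return MAP_FAILED;
    }

    addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_NOSYNC, fd, 0);
    saved_errno = errno;
    close(fd);                  /* the mapping remains valid after close */
    errno = saved_errno;
    return addr;
}
```

A caller would check the result against MAP_FAILED, use the memory, and eventually munmap() it and unlink() the file.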

The man page for madvise(2) has more pointed advice[3]:

     MADV_NOSYNC      Request that the system not flush the data associated
                      with this map to physical backing store unless it needs
                      to.  Typically this prevents the file system update dae-
                      mon from gratuitously writing pages dirtied by the VM
                      system to physical disk.  Note that VM/file system
                      coherency is always maintained, this feature simply
                      ensures that the mapped data is only flush when it needs
                      to be, usually by the system pager.

                      This feature is typically used when you want to use a
                      file-backed shared memory area to communicate between
                      processes (IPC) and do not particularly need the data
                      being stored in that area to be physically written to
                      disk.  With this feature you get the equivalent perfor-
                      mance with mmap that you would expect to get with SysV
                      shared memory calls, but in a more controllable and less
                      restrictive manner.  However, note that this feature is
                      not portable across UNIX platforms (though some may do
                      the right thing by default).  For more information see
                      the MAP_NOSYNC section of mmap(2)

Anyway, could you give this a quick review and apply the patch in time so the build farm can get a full build completed before the release?

Thanks in advance. -sc


[1] https://kib.kiev.ua/kib/pgsql_perf.pdf
[2] http://www.freebsd.org/cgi/man.cgi?query=mmap&apropos=0&sektion=0&manpath=FreeBSD+10.0-stable&arch=default&format=html
[3] http://www.freebsd.org/cgi/man.cgi?query=madvise&sektion=2&apropos=0&manpath=FreeBSD+10.0-stable



--
Sean Chittenden
