Thread: EXPERIMENTAL: mmap-based memory context / allocator
Hi!

While playing with the memory context internals some time ago, I've been
wondering if there are better ways to allocate memory from the kernel -
either by tweaking libc somehow, or by interacting with the kernel
directly. I mostly forgot about that topic, but after the local
conference last week we went to a pub, and one of the things we
discussed over a beer was how complex and unintuitive the memory stuff
is, because of the libc heuristics, 'sbrk' properties [1] and behavior
in the presence of holes, OOM etc. The virtual memory system should
handle this to a large degree, but I've repeatedly run into problems
when that was not the case (for example the memory is still part of the
virtual address space, and thus counted by the OOM killer).

One of the wilder ideas (I mentioned beer was involved!) was a memory
allocator based on mmap [2], bypassing the libc malloc implementation
altogether. mmap() has some nice features (e.g. no issues with returning
memory back to the kernel, which may be a problem with sbrk). So I
hacked a bit and switched the AllocSet implementation to mmap(). It
works surprisingly well, so here is an experimental patch, for comments
on whether this really is a good idea etc. Some parts of the patch are a
bit dirty and I've only tested it on x86.

Notes
-----

(1) The main changes are mostly trivial - rewriting malloc()/free() to
    mmap()/munmap(), plus related changes (e.g. mmap() returns
    (void *) -1, aka MAP_FAILED, instead of NULL in case of failure).

(2) A significant difference is that mmap() can't allocate blocks
    smaller than the page size, which is 4kB on x86. This means the
    smallest context is 4kB (instead of 1kB), and it also affects the
    growth of block sizes (the first step can only grow the block to
    8kB). So this changes the AllocSet internal behavior a bit.

(3) As this bypasses libc, it can't use the libc freelists (which are
    used by malloc). To compensate for this, there's a simple
    process-level freelist of blocks, shared by all memory contexts.
    This cache has a limited capacity (roughly 4MB per process).
(4) Some of the comments are obsolete, still referencing malloc/free.

Benchmarks
----------

I've done extensive testing and also benchmarking, and it seems to be no
slower than the current implementation, and in some cases it's actually
a bit faster.

a) time pgbench -i -s 300

   - pgbench initialisation, measuring the COPY and the total duration
   - averages of 3 runs (negligible variations between runs)

               COPY      total
   ---------------------------------
    master     0:26.22   1:22
    mmap       0:26.35   1:22

   Pretty much no difference.

b) pgbench -S -c 8 -j 8 -T 60

   - short read-only runs (1 minute) after a warmup
   - min, max, average of 8 runs

               average   min      max
   -------------------------------------
    master     96785     95329    97981
    mmap       98279     97259    99671

   That's a rather consistent 1-2% improvement.

c) REINDEX pgbench_accounts_pkey

   - large maintenance_work_mem so that it's an in-memory sort
   - average, min, max of 8 runs (duration in seconds)

               average   min      max
   -------------------------------------
    master     10.35     9.64     13.56
    mmap        9.85     9.81      9.90

   Again, mostly an improvement, except for the minimum, where the
   current memory context was a bit faster. But overall the mmap-based
   one is much more consistent.

Some of the differences may be due to allocating 4kB blocks from the
very start (while the current allocator starts with 1kB, then 2kB and
finally 4kB).

Ideas, opinions?

[1] http://linux.die.net/man/2/sbrk
[2] http://linux.die.net/man/2/mmap

--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 02/15/2015 08:57 PM, Tomas Vondra wrote:
> One of the wilder ideas (I mentioned beer was involved!) was a memory
> allocator based on mmap [2], bypassing the libc malloc implementation
> altogether. mmap() has some nice features (e.g. no issues with returning
> memory back to the kernel, which may be a problem with sbrk). So I hacked
> a bit and switched the AllocSet implementation to mmap().

glibc's malloc() also uses mmap() for larger allocations. Precisely
because those allocations can then be handed back to the OS. I don't
think we'd want to use mmap() for small allocations either. Let's not
re-invent malloc()..

- Heikki
On 15.2.2015 20:56, Heikki Linnakangas wrote:
> On 02/15/2015 08:57 PM, Tomas Vondra wrote:
>> One of the wilder ideas (I mentioned beer was involved!) was a memory
>> allocator based on mmap [2], bypassing the libc malloc implementation
>> altogether. mmap() has some nice features (e.g. no issues with returning
>> memory back to the kernel, which may be a problem with sbrk). So I hacked
>> a bit and switched the AllocSet implementation to mmap().
>
> glibc's malloc() also uses mmap() for larger allocations. Precisely
> because those allocations can then be handed back to the OS. I don't
> think we'd want to use mmap() for small allocations either. Let's not
> re-invent malloc()..

malloc() does that only for allocations over M_MMAP_THRESHOLD, which is
128kB by default. The vast majority of blocks we allocate are <= 8kB, so
mmap() almost never happens. At least that's my understanding, I may be
wrong of course.

--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
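(For reference, the threshold is tunable - a glibc-only sketch;
mallopt() and M_MMAP_THRESHOLD are glibc extensions, and glibc's
dynamic-threshold heuristics may override the setting in detail, so
treat this as an illustration rather than a guarantee.)

```c
#include <malloc.h>   /* glibc-specific: mallopt(), M_MMAP_THRESHOLD */
#include <stdlib.h>

/* Allocate one block on each side of a lowered mmap threshold.
 * Returns 1 if everything succeeded. */
static int
demo_mmap_threshold(void)
{
    void *small;
    void *large;
    int   ok;

    /* default is 128kB; lower it so 256kB is clearly above it */
    if (mallopt(M_MMAP_THRESHOLD, 32 * 1024) == 0)
        return 0;

    small = malloc(8 * 1024);    /* typical PostgreSQL block: sbrk() heap */
    large = malloc(256 * 1024);  /* above the threshold: mmap()ed */

    ok = (small != NULL && large != NULL);

    free(large);    /* an mmap()ed chunk is munmap()ed back to the OS */
    free(small);    /* a heap chunk returns to the sbrk() arena */

    return ok;
}
```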
On 2015-02-15 21:07:13 +0100, Tomas Vondra wrote:
> malloc() does that only for allocations over M_MMAP_THRESHOLD, which is
> 128kB by default. The vast majority of blocks we allocate are <= 8kB,
> so mmap() almost never happens.

The problem is that mmap() is, to my knowledge, noticeably more
expensive than sbrk(). Especially with concurrent workloads. Which is
why the malloc/libc authors chose to use sbrk...

IIRC glibc malloc also batches several allocations into mmap()ed areas
after some time.

Greetings,

Andres Freund

--
Andres Freund                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 15.2.2015 21:13, Andres Freund wrote:
> The problem is that mmap() is, to my knowledge, noticeably more
> expensive than sbrk(). Especially with concurrent workloads. Which is
> why the malloc/libc authors chose to use sbrk...

Any ideas how to simulate such workloads? None of the tests I've done
suggests such an issue exists.

> IIRC glibc malloc also batches several allocations into mmap()ed
> areas after some time.

Maybe - there's certainly a lot of such optimizations in libc. But how
do you return memory to the system in that case?

--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> On 2015-02-15 21:07:13 +0100, Tomas Vondra wrote:
>> malloc() does that only for allocations over M_MMAP_THRESHOLD, which is
>> 128kB by default. The vast majority of blocks we allocate are <= 8kB,
>> so mmap() almost never happens.

> The problem is that mmap() is, to my knowledge, noticeably more
> expensive than sbrk(). Especially with concurrent workloads. Which is
> why the malloc/libc authors chose to use sbrk...

> IIRC glibc malloc also batches several allocations into mmap()ed areas
> after some time.

Keep in mind also that aset.c doubles the request size every time it
goes back to malloc() for some more space for a given context. So you
get up to 128kB pretty quickly.

There will be a population of 8K-to-64K chunks that don't ever get
returned to the OS but float back and forth between different
MemoryContexts as those are created and deleted. I'm inclined to think
this is fine and we don't need to improve on it.

Part of the reason for my optimism is that on glibc-based platforms, IME
PG backends do pretty well at reducing their memory consumption back
down to a minimal value after each query. (On other platforms, not so
much, but arguably that's libc's fault not ours.) So I'm not really
seeing a problem that needs fixed, and definitely not one that a
platform-specific fix will do much for.

			regards, tom lane
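(A back-of-the-envelope sketch of the doubling Tom describes -
next_block_size is illustrative, not the actual aset.c code: starting
from 8kB, the fifth block a context allocates already reaches 128kB,
right at glibc's default mmap threshold.)

```c
#include <stddef.h>

/* Each new block doubles the previous request, capped at the context's
 * maximum block size - the growth rule AllocSet uses for its blocks. */
static size_t
next_block_size(size_t current, size_t max_block_size)
{
    size_t next = current * 2;

    return (next > max_block_size) ? max_block_size : next;
}
```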
On 15.2.2015 21:38, Tom Lane wrote:
> Keep in mind also that aset.c doubles the request size every time it
> goes back to malloc() for some more space for a given context. So you
> get up to 128kB pretty quickly.

That's true, so for sufficiently large contexts we're already using
mmap() indirectly, through libc. Some contexts use just 8kB
(ALLOCSET_SMALL_MAXSIZE), but that's just a minority.

> There will be a population of 8K-to-64K chunks that don't ever get
> returned to the OS but float back and forth between different
> MemoryContexts as those are created and deleted. I'm inclined to
> think this is fine and we don't need to improve on it.

Sure, but there are scenarios where that can't happen, because the
contexts are created 'concurrently', so the blocks can't float between
the contexts. An example that comes to mind is array_agg() with many
groups, which is made worse by allocating the MemoryContext data in
TopMemoryContext, creating 'islands' and making it impossible to
release the memory.
  http://www.postgresql.org/message-id/e010519fbe83b1331ee0dfcb122a616a@fuzzy.cz

> Part of the reason for my optimism is that on glibc-based platforms,
> IME PG backends do pretty well at reducing their memory consumption
> back down to a minimal value after each query. (On other platforms,
> not so much, but arguably that's libc's fault not ours.) So I'm not
> really seeing a problem that needs fixed, and definitely not one
> that a platform-specific fix will do much for.

I certainly agree this is not something we need to fix ASAP, and that
bypassing libc may not be the right remedy. That's why I posted it just
here (and not to the CF), and marked it as experimental.

That however does not mean we can't improve this somehow - from time to
time I have to deal with machines where the minimum amount of memory
assigned to a process grew over time, gradually increasing memory
pressure and eventually causing trouble. There are ways to work around
this (e.g. by reopening the connections, thus creating new backends).

--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services