Thread: EXPERIMENTAL: mmap-based memory context / allocator
Hi!

While playing with the memory context internals some time ago, I've been
wondering if there are better ways to allocate memory from the kernel -
either by tweaking libc somehow, or by interacting with the kernel
directly. I mostly forgot about that topic, but after the local
conference last week we went to a pub, and one of the things we
discussed over a beer was how complex and unintuitive the memory stuff
is, because of the libc heuristics, 'sbrk' properties [1] and behavior
in the presence of holes, OOM etc. The virtual memory system should
handle this to a large degree, but I've repeatedly run into problems
when that was not the case (for example the memory is still part of the
virtual address space, and thus counted by the OOM killer).

One of the wilder ideas (I mentioned beer was involved!) was a memory
allocator based on mmap [2], bypassing the libc malloc implementation
altogether. mmap() has some nice features (e.g. no issues with returning
memory back to the kernel, which may be a problem with sbrk). So I
hacked a bit and switched the AllocSet implementation to mmap(). It
works surprisingly well, so here is an experimental patch, for comments
on whether this really is a good idea etc. Some parts of the patch are a
bit dirty and I've only tested it on x86.

Notes
-----

(1) The main changes are mostly trivial - rewriting malloc()/free() to
    mmap()/munmap(), plus related changes (e.g. mmap() returns
    (void *) -1, aka MAP_FAILED, instead of NULL in case of failure).

(2) A significant difference is that mmap() can't allocate blocks
    smaller than the page size, which is 4kB on x86. This means the
    smallest context is 4kB (instead of 1kB), and it also affects the
    growth of block sizes (the first step can only grow the block to
    8kB). So this changes the AllocSet internal behavior a bit.

(3) As this bypasses libc, it can't use the libc freelists (which are
    used by malloc). To compensate for this, there's a simple
    process-level freelist of blocks, shared by all memory contexts.
    This cache has a limited capacity (roughly 4MB per process).
(4) Some of the comments are obsolete, still referencing malloc/free.

Benchmarks
----------

I've done extensive testing and also benchmarking, and it seems to be no
slower than the current implementation, and in some cases it's actually
a bit faster.

a) time pgbench -i -s 300

   - pgbench initialisation, measuring the COPY and the total duration
   - averages of 3 runs (negligible variations between runs)

               COPY      total
   ---------------------------------
    master     0:26.22   1:22
    mmap       0:26.35   1:22

   Pretty much no difference.

b) pgbench -S -c 8 -j 8 -T 60

   - short read-only runs (1 minute) after a warmup
   - min, max, average of 8 runs

               average   min      max
   -------------------------------------
    master     96785     95329    97981
    mmap       98279     97259    99671

   That's a rather consistent 1-2% improvement.

c) REINDEX pgbench_accounts_pkey

   - large maintenance_work_mem so that it's an in-memory sort
   - average, min, max of 8 runs (duration in seconds)

               average   min      max
   -------------------------------------
    master     10.35     9.64     13.56
    mmap        9.85     9.81      9.90

   Again, mostly an improvement, except for the minimum, where the
   current memory context was a bit faster. But overall the mmap-based
   one is much more consistent.

Some of the differences may be due to allocating 4kB blocks from the
very start (while the current allocator starts with 1kB, then 2kB and
finally 4kB).

Ideas, opinions?

[1] http://linux.die.net/man/2/sbrk
[2] http://linux.die.net/man/2/mmap

--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 02/15/2015 08:57 PM, Tomas Vondra wrote:
> One of the wilder ideas (I mentioned beer was involved!) was a memory
> allocator based on mmap [2], bypassing the libc malloc implementation
> altogether. mmap() has some nice features (e.g. no issues with returning
> memory back to the kernel, which may be a problem with sbrk). So I hacked
> a bit and switched the AllocSet implementation to mmap().

glibc's malloc() also uses mmap() for larger allocations. Precisely
because those allocations can then be handed back to the OS. I don't
think we'd want to use mmap() for small allocations either. Let's not
re-invent malloc()..

- Heikki
On 15.2.2015 20:56, Heikki Linnakangas wrote:
> On 02/15/2015 08:57 PM, Tomas Vondra wrote:
>> One of the wilder ideas (I mentioned beer was involved!) was a memory
>> allocator based on mmap [2], bypassing the libc malloc implementation
>> altogether. mmap() has some nice features (e.g. no issues with returning
>> memory back to the kernel, which may be a problem with sbrk). So I hacked
>> a bit and switched the AllocSet implementation to mmap().
>
> glibc's malloc() also uses mmap() for larger allocations. Precisely
> because those allocations can then be handed back to the OS. I don't
> think we'd want to use mmap() for small allocations either. Let's not
> re-invent malloc()..

malloc() does that only for allocations over M_MMAP_THRESHOLD, which is
128kB by default. The vast majority of blocks we allocate are <= 8kB, so
mmap() almost never happens. At least that's my understanding, I may be
wrong of course.

--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
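(For reference, the threshold is tunable - a glibc-only sketch;
mallopt() and M_MMAP_THRESHOLD are glibc extensions, and glibc's
dynamic-threshold heuristics may override the setting in detail, so
treat this as an illustration rather than a guarantee.)

```c
#include <malloc.h>   /* glibc-specific: mallopt(), M_MMAP_THRESHOLD */
#include <stdlib.h>

/* Allocate one block on each side of a lowered mmap threshold.
 * Returns 1 if everything succeeded. */
static int
demo_mmap_threshold(void)
{
    void *small;
    void *large;
    int   ok;

    /* default is 128kB; lower it so 256kB is clearly above it */
    if (mallopt(M_MMAP_THRESHOLD, 32 * 1024) == 0)
        return 0;

    small = malloc(8 * 1024);    /* typical PostgreSQL block: sbrk() heap */
    large = malloc(256 * 1024);  /* above the threshold: mmap()ed */

    ok = (small != NULL && large != NULL);

    free(large);    /* an mmap()ed chunk is munmap()ed back to the OS */
    free(small);    /* a heap chunk returns to the sbrk() arena */

    return ok;
}
```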
On 2015-02-15 21:07:13 +0100, Tomas Vondra wrote:
> malloc() does that only for allocations over M_MMAP_THRESHOLD, which is
> 128kB by default. The vast majority of blocks we allocate are <= 8kB,
> so mmap() almost never happens.

The problem is that mmap() is, to my knowledge, noticeably more
expensive than sbrk(). Especially with concurrent workloads. Which is
why the malloc/libc authors chose to use sbrk...

IIRC glibc malloc also batches several allocations into mmap()ed areas
after some time.

Greetings,

Andres Freund

--
Andres Freund                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 15.2.2015 21:13, Andres Freund wrote:
> The problem is that mmap() is, to my knowledge, noticeably more
> expensive than sbrk(). Especially with concurrent workloads. Which is
> why the malloc/libc authors chose to use sbrk...

Any ideas how to simulate such workloads? None of the tests I've done
suggests such an issue exists.

> IIRC glibc malloc also batches several allocations into mmap()ed
> areas after some time.

Maybe - there's certainly a lot of such optimizations in libc. But how
do you return memory to the system in that case?

--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> On 2015-02-15 21:07:13 +0100, Tomas Vondra wrote:
>> malloc() does that only for allocations over M_MMAP_THRESHOLD, which is
>> 128kB by default. The vast majority of blocks we allocate are <= 8kB,
>> so mmap() almost never happens.

> The problem is that mmap() is, to my knowledge, noticeably more
> expensive than sbrk(). Especially with concurrent workloads. Which is
> why the malloc/libc authors chose to use sbrk...

> IIRC glibc malloc also batches several allocations into mmap()ed areas
> after some time.

Keep in mind also that aset.c doubles the request size every time it
goes back to malloc() for some more space for a given context. So you
get up to 128kB pretty quickly.

There will be a population of 8K-to-64K chunks that don't ever get
returned to the OS but float back and forth between different
MemoryContexts as those are created and deleted. I'm inclined to think
this is fine and we don't need to improve on it.

Part of the reason for my optimism is that on glibc-based platforms, IME
PG backends do pretty well at reducing their memory consumption back
down to a minimal value after each query. (On other platforms, not so
much, but arguably that's libc's fault not ours.) So I'm not really
seeing a problem that needs fixed, and definitely not one that a
platform-specific fix will do much for.

			regards, tom lane
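(A back-of-the-envelope sketch of the doubling Tom describes -
next_block_size is illustrative, not the actual aset.c code: starting
from 8kB, the fifth block a context allocates already reaches 128kB,
right at glibc's default mmap threshold.)

```c
#include <stddef.h>

/* Each new block doubles the previous request, capped at the context's
 * maximum block size - the growth rule AllocSet uses for its blocks. */
static size_t
next_block_size(size_t current, size_t max_block_size)
{
    size_t next = current * 2;

    return (next > max_block_size) ? max_block_size : next;
}
```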
On 15.2.2015 21:38, Tom Lane wrote:
> Keep in mind also that aset.c doubles the request size every time it
> goes back to malloc() for some more space for a given context. So you
> get up to 128kB pretty quickly.

That's true, so for sufficiently large contexts we're already using
mmap() indirectly, through libc. Some contexts use just 8kB
(ALLOCSET_SMALL_MAXSIZE), but that's just a minority.

> There will be a population of 8K-to-64K chunks that don't ever get
> returned to the OS but float back and forth between different
> MemoryContexts as those are created and deleted. I'm inclined to
> think this is fine and we don't need to improve on it.

Sure, but there are scenarios where that can't happen, because the
contexts are created 'concurrently', so the blocks can't float between
the contexts. An example that comes to mind is array_agg() with many
groups, which is made worse by allocating the MemoryContext data in
TopMemoryContext, creating 'islands' and making it impossible to
release the memory.
  http://www.postgresql.org/message-id/e010519fbe83b1331ee0dfcb122a616a@fuzzy.cz

> Part of the reason for my optimism is that on glibc-based platforms,
> IME PG backends do pretty well at reducing their memory consumption
> back down to a minimal value after each query. (On other platforms,
> not so much, but arguably that's libc's fault not ours.) So I'm not
> really seeing a problem that needs fixed, and definitely not one
> that a platform-specific fix will do much for.

I certainly agree this is not something we need to fix ASAP, and that
bypassing libc may not be the right remedy. That's why I posted it just
here (and not to the CF), and marked it as experimental.

That however does not mean we can't improve this somehow - from time to
time I have to deal with machines where the minimum amount of memory
assigned to a process grew over time, gradually increasing memory
pressure and eventually causing trouble. There are ways to work around
this (e.g. by reopening the connections, thus creating new backends).

--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services