Re: [HACKERS] PATCH: two slab-like memory allocators - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: [HACKERS] PATCH: two slab-like memory allocators
Date
Msg-id b72dfa3d-c241-8a3e-97ad-bf214b754aca@2ndquadrant.com
Whole thread Raw
In response to Re: [HACKERS] PATCH: two slab-like memory allocators  (Andres Freund <andres@anarazel.de>)
Responses Re: [HACKERS] PATCH: two slab-like memory allocators  (Andres Freund <andres@anarazel.de>)
Re: [HACKERS] PATCH: two slab-like memory allocators  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On 02/09/2017 10:37 PM, Andres Freund wrote:
> Hi,
>
> On 2016-12-13 01:45:13 +0100, Tomas Vondra wrote:
>>  src/backend/utils/mmgr/Makefile   |   2 +-
>>  src/backend/utils/mmgr/aset.c     | 115 +--------------------------------
>>  src/backend/utils/mmgr/memdebug.c | 131 ++++++++++++++++++++++++++++++++++++++
>>  src/include/utils/memdebug.h      |  22 +++++++
>>  4 files changed, 156 insertions(+), 114 deletions(-)
>>  create mode 100644 src/backend/utils/mmgr/memdebug.c
>
> I'm a bit loathe to move these to a .c file - won't this likely make
> these debugging tools even slower?  Seems better to put some of them
> them in a header as static inlines (not randomize, but the rest).
>

Do you have any numbers to support that? AFAICS compilers got really 
good in inlining stuff on their own.

>
>> From 43aaabf70b979b172fd659ef4d0ef129fd78d72d Mon Sep 17 00:00:00 2001
>> From: Tomas Vondra <tomas@2ndquadrant.com>
>> Date: Wed, 30 Nov 2016 15:36:23 +0100
>> Subject: [PATCH 2/3] slab allocator
>>
>> ---
>>  src/backend/replication/logical/reorderbuffer.c |  82 +--
>>  src/backend/utils/mmgr/Makefile                 |   2 +-
>>  src/backend/utils/mmgr/slab.c                   | 803 ++++++++++++++++++++++++
>>  src/include/nodes/memnodes.h                    |   2 +-
>>  src/include/nodes/nodes.h                       |   1 +
>>  src/include/replication/reorderbuffer.h         |  15 +-
>>  src/include/utils/memutils.h                    |   9 +
>
> I'd like to see the reorderbuffer changes split into a separate commit
> from the slab allocator introduction.
>

I rather dislike patches that only add a bunch of code, without actually 
using it anywhere. But if needed, this is trivial to do at commit time - 
just don't commit the reorderbuffer bits.

>
>
>
>> +/*-------------------------------------------------------------------------
>> + *
>> + * slab.c
>> + *      SLAB allocator definitions.
>> + *
>> + * SLAB is a custom MemoryContext implementation designed for cases of
>> + * equally-sized objects.
>> + *
>> + *
>> + * Portions Copyright (c) 2016, PostgreSQL Global Development Group
>
> Bump, before a committer forgets it.
>

OK.

>
>> + * IDENTIFICATION
>> + *      src/backend/utils/mmgr/slab.c
>> + *
>> + *
>> + *    The constant allocation size allows significant simplification and various
>> + *    optimizations. Firstly, we can get rid of the doubling and carve the blocks
>> + *    into chunks of exactly the right size (plus alignment), not wasting memory.
>
> Getting rid of it relative to what? I'd try to phrase it so these
> comments stand on their own.
>

OK, rill reword.

>
>> + *    Each block includes a simple bitmap tracking which chunks are used/free.
>> + *    This makes it trivial to check if all chunks on the block are free, and
>> + *    eventually free the whole block (which is almost impossible with a global
>> + *    freelist of chunks, storing chunks from all blocks).
>
> Why is checking a potentially somewhat long-ish bitmap better than a
> simple counter, or a "linked list" of "next free chunk-number" or such
> (where free chunks simply contain the id of the subsequent chunk)?
> Using a list instead of a bitmap would also make it possible to get
> 'lifo' behaviour, which is good for cache efficiency.  A simple
> chunk-number based singly linked list would only imply a minimum
> allocation size of 4 - that seems perfectly reasonable?
>

A block-level counter would be enough to decide if all chunks on the 
block are free, but it's not sufficient to identify which chunks are 
free / available for reuse.

The bitmap only has a single bit per chunk, so I find "potentially 
long-ish" is a bit misleading. Any linked list implementation will 
require much more per-chunk overhead - as the chunks are fixed-legth, 
it's possible to use chunk index (instead of 64-bit pointers), to save 
some space. But with large blocks / small chunks that's still at least 2 
or 4 bytes per index, and you'll need two (to implement doubly-linked 
list, to make add/remove efficient).

For example assume 8kB block and 64B chunks, i.e. 128 chunks. With 
bitmap that's 16B to track all free space on the block. Doubly linked 
list would require 1B per chunk index, 2 indexes per chunk. That's 128*2 
= 256B.

I have a hard time believing this the cache efficiency of linked lists 
(which may or may not be real in this case) out-weights this, but if you 
want to try, be my guest.

>
>> + *    At the context level, we use 'freelist' to track blocks ordered by number
>> + *    of free chunks, starting with blocks having a single allocated chunk, and
>> + *    with completely full blocks on the tail.
>
> Why that way round? Filling chunks up as much as possible is good for
> cache and TLB efficiency, and allows for earlier de-allocation of
> partially used blocks?  Oh, I see you do that in the next comment,
> but it still leaves me wondering.
>
> Also, is this actually a list? It's more an array of lists, right?
> I.e. it should be named freelists?
>

Possibly. Naming things is hard.
>
> Thirdly, isn't that approach going to result in a quite long freelists
> array, when you have small items and a decent blocksize? That seems like
> a fairly reasonable thing to do?
>

I'm confused. Why wouldn't that be reasonable. Or rather, what would be 
a more reasonable way?

>
>> + *    This also allows various optimizations - for example when searching for
>> + *    free chunk, we the allocator reuses space from the most full blocks first,
>> + *    in the hope that some of the less full blocks will get completely empty
>> + *    (and returned back to the OS).
>
> Might be worth mentioning tlb/cache efficiency too.
>

I haven't really considered tlb/cache very much. The main goal of this 
design was to free blocks (instead of keeping many partially-used blocks 
around). If you have comments on this, feel free to add them.

>
>> + *    For each block, we maintain pointer to the first free chunk - this is quite
>> + *    cheap and allows us to skip all the preceding used chunks, eliminating
>> + *    a significant number of lookups in many common usage patters. In the worst
>> + *    case this performs as if the pointer was not maintained.
>
> Hm, so that'd be eliminated if we maintained a linked list of chunks (by
> chunk number) and a free_chunk_cnt or such.
>

As I explained above, I don't think linked list is a good solution. IIRC 
correctly I've initially done that, and ended using the bitmap. If you 
have idea how to do that, feel free to implement that and then we can do 
some measurements and compare the patches.

>
>> +
>> +#include "postgres.h"
>> +
>> +#include "utils/memdebug.h"
>> +#include "utils/memutils.h"
>> +#include "lib/ilist.h"
>
> Move ilist up, above memdebug, so the list is alphabetically ordered.
>

OK

>
>> +/*
>> + * SlabPointer
>> + *        Aligned pointer which may be a member of an allocation set.
>> + */
>> +typedef void *SlabPointer;
>> +typedef SlabContext *Slab;
>
> I personally wont commit this whith pointer hiding typedefs.  If
> somebody else does, I can live with it, but for me it's bad enough taste
> that I wont.
>

Meh.

>
>> +/*
>> + * SlabContext is a specialized implementation of MemoryContext.
>> + */
>> +typedef struct SlabContext
>> +{
>> +    MemoryContextData header;    /* Standard memory-context fields */
>> +    /* Allocation parameters for this context: */
>> +    Size        chunkSize;        /* chunk size */
>> +    Size        fullChunkSize;    /* chunk size including header and alignment */
>> +    Size        blockSize;        /* block size */
>> +    int            chunksPerBlock; /* number of chunks per block */
>> +    int            minFreeChunks;    /* min number of free chunks in any block */
>> +    int            nblocks;        /* number of blocks allocated */
>> +    /* blocks with free space, grouped by number of free chunks: */
>> +    dlist_head    freelist[FLEXIBLE_ARRAY_MEMBER];
>> +}    SlabContext;
>> +
>
> Why aren't these ints something unsigned?
>

Yeah, some of those could be unsigned. Will check.

>
>> +/*
>> + * SlabIsValid
>> + *        True iff set is valid allocation set.
>> + */
>> +#define SlabIsValid(set) PointerIsValid(set)
>
> It's not your fault, but this "iff" is obviously a lot stronger than the
> actual test ;).  I seriously doubt this macro is worth anything...
>

Yeah.

>
>> +/*
>> + * SlabReset
>> + *        Frees all memory which is allocated in the given set.
>> + *
>> + * The code simply frees all the blocks in the context - we don't keep any
>> + * keeper blocks or anything like that.
>> + */
>
> Why don't we?  Seems quite worthwhile?   Thinking about this, won't this
> result in a drastic increase of system malloc/mmap/brk traffic when
> there's lot of short transactions in reorderbuffer?
>

I haven't seen any significant impact of that during the tests I've done 
(even with many tiny transactions), but it adding keeper blocks should 
be trivial.

>
>> +static void
>> +SlabReset(MemoryContext context)
>> +{
>> +    /* walk over freelists and free the blocks */
>> +    for (i = 0; i <= set->chunksPerBlock; i++)
>> +    {
>> +        dlist_mutable_iter miter;
>> +
>> +        dlist_foreach_modify(miter, &set->freelist[i])
>> +        {
>> +            SlabBlock    block = dlist_container(SlabBlockData, node, miter.cur);
>> +
>> +            dlist_delete(miter.cur);
>> +
>> +            /* Normal case, release the block */
>
> What does "normal case" refer to here? Given that there's no alternative
> case...
>

Mem, bogus comment.

>
>> +    /*
>> +     * We need to update index of the next free chunk on the block. If we used
>> +     * the last free chunk on this block, set it to chunksPerBlock (which is
>> +     * not a valid chunk index). Otherwise look for the next chunk - we know
>> +     * that it has to be above the current firstFreeChunk value, thanks to how
>> +     * we maintain firstFreeChunk here and in SlabFree().
>> +     */
>> +    if (block->nfree == 0)
>> +        block->firstFreeChunk = set->chunksPerBlock;
>> +    else
>> +    {
>> +        /* look for the next free chunk in the block, after the first one */
>> +        while ((++block->firstFreeChunk) < set->chunksPerBlock)
>> +        {
>> +            int            byte = block->firstFreeChunk / 8;
>> +            int            bit = block->firstFreeChunk % 8;
>> +
>> +            /* stop when you find 0 (unused chunk) */
>> +            if (!(block->bitmapptr[byte] & (0x01 << bit)))
>> +                break;
>> +        }
>> +
>> +        /* must have found the free chunk */
>> +        Assert(block->firstFreeChunk != set->chunksPerBlock);
>> +    }
>
> This and previous code just re-affirms my opinion that a bitmap is not
> the best structure here.
>

It'd be great if you could explain why, instead of just making such 
claims ...

>
>> +    /* move the whole block to the right place in the freelist */
>> +    dlist_delete(&block->node);
>> +    dlist_push_head(&set->freelist[block->nfree], &block->node);
>
> Hm.  What if we, instead of the array of doubly linked lists, just kept
> a single linked list of blocks, and keep that list sorted by number of
> free chunks?  Given that freeing / allocation never changes the number
> of allocated chunks by more than 1, we'll never have to move an entry
> far in that list to keep it sorted.
>

Only assuming that there'll be only few blocks with the same number of 
free chunks. If that's not the case, you'll have to walk many blocks to 
move the block to the right place in the list. The array of lists 
handles such cases way more efficiently, and I think we should keep it.

>
>> +/*
>> + * SlabRealloc
>> + *        As Slab is designed for allocating equally-sized chunks of memory, it
>> + *        can't really do an actual realloc.
>> + *
>> + * We try to be gentle and allow calls with exactly the same size as in that
>> + * case we can simply return the same chunk. When the size differs, we fail
>> + * with assert failure or return NULL.
>> + *
>> + * We might be even support cases with (size < chunkSize). That however seems
>> + * rather pointless - Slab is meant for chunks of constant size, and moreover
>> + * realloc is usually used to enlarge the chunk.
>> + *
>> + * XXX Perhaps we should not be gentle at all and simply fails in all cases,
>> + * to eliminate the (mostly pointless) uncertainty.
>
> I think I'm in favor of that.  This seems more likely to hide a bug than
> actually helpful.
>

OK.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: [HACKERS] Checksums by default?
Next
From: Tomas Vondra
Date:
Subject: Re: [HACKERS] multivariate statistics (v19)