things I learned from working on memory allocation - Mailing list pgsql-hackers

From: Robert Haas
Subject: things I learned from working on memory allocation
Date:
Msg-id: CA+TgmoYx0d5aMCQ0TRpNie=kP_30vRf+yMdMSHbxA0H1KvgDNg@mail.gmail.com
Responses: Re: things I learned from working on memory allocation  (Peter Geoghegan <pg@heroku.com>)
Re: things I learned from working on memory allocation  (Amit Kapila <amit.kapila16@gmail.com>)
Re: things I learned from working on memory allocation  (Andres Freund <andres@2ndquadrant.com>)
List: pgsql-hackers
I wrote previously about my efforts to create a new memory allocator
(then called sb_alloc, now renamed to BlockAllocatorContext in my git
branch for consistency with existing identifiers) for PostgreSQL.  To
make a long story short, many of the basic concepts behind that patch
still seem sound to me, but significant problems remain, which account
for it not having been submitted to the current CommitFest nor
attached to this email.  Working in this area has led me to a few
conclusions - comments welcome.

1. I tried to write a single allocator which would be better than
aset.c in two ways: first, by allowing allocation from dynamic shared
memory; and second, by being more memory-efficient than aset.c.
Heikki told me at PGCon that he thought that was a bad idea, and I now
think he was right.  Allocating from dynamic shared memory requires
the use of relative rather than absolute pointers (because the DSM
could be mapped at different addresses in different processes) and, if
concurrent allocating and freeing is to be supported, locking.
Although neither of these things requires very many extra
instructions, the common path for aset.c is extremely short, and we
haven't got those instructions to spare.  Code like "if (lock != NULL)
LWLockAcquire(lock, LW_EXCLUSIVE)" is quite expensive even if the
branch is never taken, apparently because being prepared to call a
function at that point requires storing more stuff in our stack frame,
and extra instructions to save and restore registers are painfully
expensive on allocation-heavy workloads.  On PPC64, to match aset.c's
performance, a palloc/pfree cycle must run in ~120 instructions, ~80
cycles, which doesn't leave much room for fluff.
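
To make the relative-pointer cost concrete: the idea is to store
offsets from the segment base rather than raw addresses, along these
lines (a sketch only; the names here are invented, not from my
branch):

    typedef Size relptr;        /* offset from segment base; 0 means NULL */

    static inline void *
    relptr_access(char *base, relptr rp)
    {
        return rp == 0 ? NULL : (void *) (base + rp);
    }

    static inline relptr
    relptr_store(char *base, void *ptr)
    {
        return ptr == NULL ? 0 : (relptr) ((char *) ptr - base);
    }

Every dereference then pays for an extra addition, and the segment
base has to be available at each call site, which is part of why the
common path gets longer.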

2. After ripping the relative-pointer and locking stuff out of my
allocator, I came up with something which performs about like aset.c
on allocation, but with significantly better memory efficiency on
small allocations.  REINDEX pgbench_accounts_pkey uses about 22% less
memory: the SortTuple array gets no smaller, but the individual tuples
take only half as much space, because they avoid the
StandardChunkHeader overhead.  This seems like a savings worth
pursuing, if we can figure
out how to get it.  Unfortunately, there's some incremental overhead
in pfree, amounting to a single test of a global variable, even when the
new allocator is never used, which is measurable on a microbenchmark;
I don't remember the exact numbers off-hand.  When the new allocator
*is* used, pfree gets a lot slower - mostly because one of the data
structures used by the new allocator suffers occasional cache misses
during free operations.  I think there might be an argument that some
hit on pfree would be worth taking in exchange for better
space-efficiency, because we mostly free contexts by resetting them
anyway; but I haven't yet managed to make that hit small enough to
feel especially good about it.
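
For reference, the per-chunk overhead in question is the header that
aset.c stores in front of every allocation (as in memutils.h, with the
memory-checking fields elided):

    typedef struct StandardChunkHeader
    {
        MemoryContext context;      /* owning context */
        Size        size;           /* size of data space allocated in chunk */
    } StandardChunkHeader;

On a 64-bit build that's 16 bytes per chunk, before aset.c's
power-of-two rounding, which is why small tuples can shrink by about
half when it goes away.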

3. The current MemoryContext abstraction layer is inadequate for
almost everything one might want to do.  The idea of a context that
allows only allocation, and ignores requests to free memory, has been
discussed more than once.  Such an allocator could skip the
power-of-two rounding that aset.c does, but it couldn't omit the chunk
header, which means that in practice it would save no memory at all
for cases like the one mentioned above (REINDEX
pgbench_accounts_pkey).  While there might be some benefit in other
cases, without getting rid of or at least reducing the size of the
chunk-header, I expect that the benefits will be too narrow to make
this approach worth pursuing.  The chunk-header requirement is also a
killer for DSM, again because segments can be mapped at different
addresses in different backends.  I'm inclined to think that if we
can't find a way to get rid of the chunk-header requirement, we ought
to consider ripping out the whole abstraction layer as a
performance-wasting exercise in futility.
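
To spell out why the chunk header is load-bearing: pfree() receives
only a pointer, so it recovers the owning context from the header
sitting immediately in front of the chunk, roughly like this
(simplified from mcxt.c, assertions omitted):

    void
    pfree(void *pointer)
    {
        StandardChunkHeader *header = (StandardChunkHeader *)
            ((char *) pointer - STANDARDCHUNKHEADERSIZE);

        (*header->context->methods->free_p) (header->context, pointer);
    }

Since header->context is an absolute pointer, a chunk living in a DSM
segment mapped at different addresses in different backends has no way
to store one that is valid everywhere.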

4. The MemoryContext layer embeds assumptions not only about the
layout of memory, but the performance of various operations.  For
example, GetMemoryChunkSpace() is assumed to be cheap, so there's no
interface like void *MemoryContextAlloc(Size request_size, Size
*chunk_size) or Size MemoryContextFree(void *ptr), but those
would be significantly more efficient for my new allocator.  Possibly
better still would be to have the context itself track utilization and
soft-fail when we run out of space; even for aset.c, it makes little
sense to refuse to accept another tuple when the AllocSet has a
suitable object on the freelist.  That's probably not a critical flaw:
since tuples being sorted are likely all about the same size, the
waste may be little or nothing.  Other allocators might have different
space-management strategies, though, where this matters more.
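
The cheapness assumption is baked into tuplesort.c, which does its
memory accounting with a separate call per chunk, roughly:

    tuple = ExecCopySlotMinimalTuple(slot);
    USEMEM(state, GetMemoryChunkSpace(tuple));
    /* ... sort runs; tuple is eventually released ... */
    FREEMEM(state, GetMemoryChunkSpace(tuple));
    pfree(tuple);

An interface that reported the chunk size from the allocator itself
would let those extra calls - and, for my allocator, the cache misses
they imply - disappear.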

5. It's tempting to look at other ways of solving the parallel sort
problem that don't need an allocator - perhaps by simply packing all
the tuples into a DSM one after the next.  But this is not easy to do,
or at least it's not easy to do efficiently.  Tuples enter tuplesort.c
through one of the tuplesort_put*() functions, most of which end up
calling one of the copytup_*() functions.  But you can't easily just
change those functions to stick the tuple someplace else, because they
don't necessarily get the address of the space to be used from palloc
directly.  In particular, copytup_heap() calls
ExecCopySlotMinimalTuple(), and copytup_cluster() calls
heap_copytuple().  heap_copytuple() is simple enough that you might be
able to finagle a new API that would make it work, but I'm not sure
what we could really do about ExecCopySlotMinimalTuple() except call
it and then copy the result.  Perhaps that'd be OK for a first
version.
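
Concretely, the call-and-copy approach for a first version might look
like this sketch, where shm_bump_alloc() stands in for some
hypothetical append-only allocation over the DSM segment:

    MinimalTuple mtup = ExecCopySlotMinimalTuple(slot);
    Size        len = mtup->t_len;
    char       *dest = shm_bump_alloc(shared, MAXALIGN(len));   /* hypothetical */

    memcpy(dest, mtup, len);
    pfree(mtup);        /* the extra copy and free we'd rather avoid */

The extra memcpy and pfree per tuple are exactly the kind of overhead
worth measuring before settling on this.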

6. In general, I'm worried that it's going to be hard to keep the
overhead of parallel sort from leaking into the non-parallel case.
With the no-allocator approach, every place that uses
GetMemoryChunkSpace() or repalloc() or pfree() will have to handle the
DSM and non-DSM cases differently, which isn't great for either
performance or maintainability.  And even with an allocator, the
SortTuple array will need to use relative pointers in a DSM; that
might burden the non-DSM case.
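
For instance, if SortTuple's tuple field becomes an offset when
sorting into a DSM, every access grows something like this (field
names hypothetical):

    tuple = state->shared != NULL
        ? (void *) (state->dsm_base + stup->tuple_off)
        : stup->tuple;

and that branch, or an unconditional base-plus-offset computation,
lands on the non-parallel path too unless it can be specialized away.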

Thoughts, comments, suggestions?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


