Thread: things I learned from working on memory allocation

things I learned from working on memory allocation

From
Robert Haas
Date:
I wrote previously about my efforts to create a new memory allocator
(then called sb_alloc, now renamed to BlockAllocatorContext in my git
branch for consistency with existing identifiers) for PostgreSQL.  To
make a long story short, many of the basic concepts behind that patch
still seem sound to me, but significant problems remain, which account
for it not having been submitted to the current CommitFest nor
attached to this email.  Working in this area has led me to a few
conclusions - comments welcome.

1. I tried to write a single allocator which would be better than
aset.c in two ways: first, by allowing allocation from dynamic shared
memory; and second, by being more memory-efficient than aset.c.
Heikki told me at PGCon that he thought that was a bad idea, and I now
think he was right.  Allocating from dynamic shared memory requires
the use of relative rather than absolute pointers (because the DSM
could be mapped at different addresses in different processes) and, if
concurrent allocating and freeing is to be supported, locking.
Although neither of these things requires very many extra
instructions, the common path for aset.c is extremely short, and we
haven't got those instructions to spare.  Code like "if (lock != NULL)
LWLockAcquire(lock, LW_EXCLUSIVE)" is quite expensive even if the
branch is never taken, apparently because being prepared to call a
function at that point requires storing more stuff in our stack frame,
and extra instructions to save and restore registers are a painfully
expensive on allocation-heavy workloads.  On PPC64, to match aset.c's
performance, a palloc/pfree cycle must run in ~120 instructions, ~80
cycles, which doesn't leave much room for fluff.

2. After ripping the relative-pointer and locking stuff out of my
allocator, I came up with something which performs about like aset.c
on allocation, but with significantly better memory efficiency on
small allocations.  REINDEX pgbench_accounts_pkey saves about 22%,
because the SortTuple array saves nothing but the individual tuples
take only half as much space, by avoiding the StandardChunkHeader
overhead.  This seems like a savings worth pursuing, if we can figure
out how to get it.  Unfortunately, there's some incremental overhead
in pfree, amounting to a single test of global variable, even when the
new allocator is never used, which is measurable on a microbenchmark;
I don't remember the exact numbers off-hand.  When the new allocator
*is* used, pfree gets a lot slower - mostly because one of the data
structures used by the new allocator suffer occasional cache misses
during free operations.  I think there might be an argument that some
hit on pfree would be worth taking in exchange for better
space-efficiency, because we mostly free contexts by resetting them
anyway; but I haven't yet managed to make that hit small enough to
feel especially good about it.

3. The current MemoryContext abstraction layer is inadequate for
almost everything one might want to do.  The idea of a context that
allows only allocation, and ignores requests to free memory, has been
discussed more than once.  Such an allocator could skip the
power-of-two rounding that aset.c does, but it couldn't omit the chunk
header, which means that in practice it would save no memory at all
for cases like the one mentioned above (REINDEX
pgbench_accounts_pkey).  While there might be some benefit in other
cases, without getting rid of or at least reducing the size of the
chunk-header, I expect that the benefits will be too narrow to make
this approach worth pursuing.  The chunk-header requirement is also a
killer for DSM, again because segments can be mapped at different
addresses in different backends.  I'm inclined to think that if we
can't find a way to get rid of the chunk-header requirement, we ought
to consider ripping out the whole abstraction layer as a
performance-wasting exercise in futility.

4. The MemoryContext layer embeds assumptions not only about the
layout of memory, but the performance of various operations.  For
example, GetMemoryContextSpace() is assumed to be cheap, so there's no
interface like void *MemoryContextAlloc(Size request_size, Size
*chunk_size) or Size MemoryContextFree(Size chunk_size), but those
would be significantly more efficient for my new allocator.  Possibly
better still would be to have the context itself track utilization and
soft-fail when we run out of space; even for aset.c, it makes little
sense to refuse to accept another tuple when the AllocSet has a
suitable object on the freelist.  That's probably not a critical flaw
because, since tuples being sorted are likely all about the same size,
the waste may be little or nothing.  Other allocators might have
different space-management strategies, though, where this matters
more.

5. It's tempting to look at other ways of solving the parallel sort
problem that don't need an allocator - perhaps by simply packing all
the tuples into a DSM one after the next.  But this is not easy to do,
or at least it's not easy to do efficiently.  Tuples enter tuplesort.c
through one of the tuplesort_put*() functions, most of which end up
calling one of the copytup_*() functions.  But you can't easily just
change those functions to stick the tuple someplace else, because they
don't necessarily get the address of the space to be used from palloc
directly.  In particular, copytup_heap() calls
ExecCopySlotMinimalTuple(), and copytup_cluster() calls
heap_copytuple().  heap_copytuple() is simple enough that you might be
able to finagle a new API that would make it work, but I'm not sure
what we could really do about ExecCopySlotMinimalTuple() except call
it and then copy the result.  Perhaps that'd be OK for a first
version.

6. In general, I'm worried that it's going to be hard to keep the
overhead of parallel sort from leaking into the non-parallel case.
With the no-allocator approach, every place that uses
GetMemoryChunkSpace() or repalloc() or pfree() will have to handle the
DSM and non-DSM cases differently, which isn't great for either
performance or maintainability.  And even with an allocator, the
SortTuple array will need to use relative pointers in a DSM; that
might burden the non-DSM case.

Thoughts, comments, suggestions?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: things I learned from working on memory allocation

From
Peter Geoghegan
Date:
On Mon, Jul 7, 2014 at 12:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> 5. It's tempting to look at other ways of solving the parallel sort
> problem that don't need an allocator - perhaps by simply packing all
> the tuples into a DSM one after the next.  But this is not easy to do,
> or at least it's not easy to do efficiently.  Tuples enter tuplesort.c
> through one of the tuplesort_put*() functions, most of which end up
> calling one of the copytup_*() functions.  But you can't easily just
> change those functions to stick the tuple someplace else, because they
> don't necessarily get the address of the space to be used from palloc
> directly.  In particular, copytup_heap() calls
> ExecCopySlotMinimalTuple(), and copytup_cluster() calls
> heap_copytuple().  heap_copytuple() is simple enough that you might be
> able to finagle a new API that would make it work, but I'm not sure
> what we could really do about ExecCopySlotMinimalTuple() except call
> it and then copy the result.  Perhaps that'd be OK for a first
> version.

Perhaps. If my experience with sort support for text is anything to go
on, the cost of copying over the tuples isn't really that high
relative to the cost of performing the sort, particularly when there
are many tuples to sort. OTOH, I'd be concerned about the cost of main
memory accesses for pass-by-reference types during parallel tuple
sorts in particular. The same concern applies to cases where the tuple
proper must be accessed, although presumably that occurs less
frequently.

The LLVM project's new libc++ library passes remark on a similar issue
in their rationale for yet another C++ standard library: "For example,
it is generally accepted that building std::string using the "short
string optimization" instead of using Copy On Write (COW) is a
superior approach for multicore machines...". It seems likely that
they're mostly referencing the apparent trend of memory bandwidth per
core stagnation [1], [2]. To get the real benefit of a parallel sort,
we must do anything we can to avoid having cores/workers remain idle
during the sort while waiting on memory access. I believe that that's
the bigger problem.

[1] http://www.admin-magazine.com/HPC/Articles/Finding-Memory-Bottlenecks-with-Stream
[2] https://software.intel.com/en-us/articles/detecting-memory-bandwidth-saturation-in-threaded-applications

-- 
Peter Geoghegan



Re: things I learned from working on memory allocation

From
Robert Haas
Date:
On Mon, Jul 7, 2014 at 5:37 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Mon, Jul 7, 2014 at 12:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> 5. It's tempting to look at other ways of solving the parallel sort
>> problem that don't need an allocator - perhaps by simply packing all
>> the tuples into a DSM one after the next.  But this is not easy to do,
>> or at least it's not easy to do efficiently.  Tuples enter tuplesort.c
>> through one of the tuplesort_put*() functions, most of which end up
>> calling one of the copytup_*() functions.  But you can't easily just
>> change those functions to stick the tuple someplace else, because they
>> don't necessarily get the address of the space to be used from palloc
>> directly.  In particular, copytup_heap() calls
>> ExecCopySlotMinimalTuple(), and copytup_cluster() calls
>> heap_copytuple().  heap_copytuple() is simple enough that you might be
>> able to finagle a new API that would make it work, but I'm not sure
>> what we could really do about ExecCopySlotMinimalTuple() except call
>> it and then copy the result.  Perhaps that'd be OK for a first
>> version.
>
> Perhaps. If my experience with sort support for text is anything to go
> on, the cost of copying over the tuples isn't really that high
> relative to the cost of performing the sort, particularly when there
> are many tuples to sort.

The testing I did showed about a 5% overhead on REINDEX INDEX
pgbench_accounts_pkey from one extra tuple copy (cf.
9f03ca915196dfc871804a1f8aad26207f601fd6).  Of course that could vary
by circumstance for a variety of reasons.

> OTOH, I'd be concerned about the cost of main
> memory accesses for pass-by-reference types during parallel tuple
> sorts in particular. The same concern applies to cases where the tuple
> proper must be accessed, although presumably that occurs less
> frequently.

I do think that's a problem with our sort implementation, but it's not
clear to me whether it's *more* of a problem for parallel sort than it
is for single-backend sort.

> The LLVM project's new libc++ library passes remark on a similar issue
> in their rationale for yet another C++ standard library: "For example,
> it is generally accepted that building std::string using the "short
> string optimization" instead of using Copy On Write (COW) is a
> superior approach for multicore machines...". It seems likely that
> they're mostly referencing the apparent trend of memory bandwidth per
> core stagnation [1], [2]. To get the real benefit of a parallel sort,
> we must do anything we can to avoid having cores/workers remain idle
> during the sort while waiting on memory access. I believe that that's
> the bigger problem.

Yes, I agree that's a problem.  There are algorithms for parallel
quicksort in the literature which seem to offer promising solutions,
but unfortunately these infrastructure issues have prevented me from
reaching them.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: things I learned from working on memory allocation

From
Peter Geoghegan
Date:
On Mon, Jul 7, 2014 at 7:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> The testing I did showed about a 5% overhead on REINDEX INDEX
> pgbench_accounts_pkey from one extra tuple copy (cf.
> 9f03ca915196dfc871804a1f8aad26207f601fd6).  Of course that could vary
> by circumstance for a variety of reasons.

Be careful of the check for pre-sorted input within qsort() giving an
overly optimistic picture of things, thereby exaggerating the
proportion of time spent copying if taken as generally representative.

>> OTOH, I'd be concerned about the cost of main
>> memory accesses for pass-by-reference types during parallel tuple
>> sorts in particular. The same concern applies to cases where the tuple
>> proper must be accessed, although presumably that occurs less
>> frequently.
>
> I do think that's a problem with our sort implementation, but it's not
> clear to me whether it's *more* of a problem for parallel sort than it
> is for single-backend sort.

If you'll forgive me for going on about my patch on this thread, I
think the pgbench "-c 4" and "-c 1" cases that I tested suggest it is
a particular problem for parallel sorts, as there is a much bigger
both absolute and proportional difference in transaction throughput
between those two with the patch applied. It seems reasonable to
suppose the difference would be larger still if we were considering a
single parallel sort, as opposed to multiple independent sorts (of the
same data) that happen to occur in parallel.

-- 
Peter Geoghegan



Re: things I learned from working on memory allocation

From
Peter Geoghegan
Date:
On Mon, Jul 7, 2014 at 7:29 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> I do think that's a problem with our sort implementation, but it's not
>> clear to me whether it's *more* of a problem for parallel sort than it
>> is for single-backend sort.
>
> If you'll forgive me for going on about my patch on this thread, I
> think the pgbench "-c 4" and "-c 1" cases that I tested suggest it is
> a particular problem for parallel sorts, as there is a much bigger
> both absolute and proportional difference in transaction throughput
> between those two with the patch applied. It seems reasonable to
> suppose the difference would be larger still if we were considering a
> single parallel sort, as opposed to multiple independent sorts (of the
> same data) that happen to occur in parallel.

I think that I may have been too optimistic when I said that there was
an apparent trend of memory bandwidth per core merely stagnating:

http://users.ece.cmu.edu/~omutlu/pub/mutlu_memory-scaling_imw13_invited-talk.pdf

As slide 8 indicates, memory capacity per core is expected to go down
30% every two years, while the trend for memory bandwidth per core is
even worse.

-- 
Peter Geoghegan



Re: things I learned from working on memory allocation

From
Amit Kapila
Date:
On Tue, Jul 8, 2014 at 1:27 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> 6. In general, I'm worried that it's going to be hard to keep the
> overhead of parallel sort from leaking into the non-parallel case.
> With the no-allocator approach, every place that uses
> GetMemoryChunkSpace() or repalloc() or pfree() will have to handle the
> DSM and non-DSM cases differently, which isn't great for either
> performance or maintainability.  And even with an allocator, the
> SortTuple array will need to use relative pointers in a DSM; that
> might burden the non-DSM case.

I think you are right in saying that there can be a performance impact
for non-parallel case due to pfree() (or other similar calls) and/or due
to usage of relative pointers, but can it have measurable impact during
actual sort operation?  Changes for any of them doesn't seem to cause
much impact for overall sort operation, although I am not sure what kind
of impact usage of relative pointers can cause, do you any specific point
in mind due to which you think that it can cause noticeable performance
impact. 

If there is an noticeable impact, then do you think having separate
file/infrastructure for parallel sort can help, basically non-parallel
and parallel sort will have some common and some specific parts.
The above layer will call the parallel/non-parallel functions based on
operation need.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: things I learned from working on memory allocation

From
Robert Haas
Date:
On Thu, Jul 10, 2014 at 1:05 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Jul 8, 2014 at 1:27 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> 6. In general, I'm worried that it's going to be hard to keep the
>> overhead of parallel sort from leaking into the non-parallel case.
>> With the no-allocator approach, every place that uses
>> GetMemoryChunkSpace() or repalloc() or pfree() will have to handle the
>> DSM and non-DSM cases differently, which isn't great for either
>> performance or maintainability.  And even with an allocator, the
>> SortTuple array will need to use relative pointers in a DSM; that
>> might burden the non-DSM case.
>
> I think you are right in saying that there can be a performance impact
> for non-parallel case due to pfree() (or other similar calls) and/or due
> to usage of relative pointers, but can it have measurable impact during
> actual sort operation?

Benchmarking seems to indicate that it indeed can.

> If there is an noticeable impact, then do you think having separate
> file/infrastructure for parallel sort can help, basically non-parallel
> and parallel sort will have some common and some specific parts.
> The above layer will call the parallel/non-parallel functions based on
> operation need.

The tuplesort.c already handles so many cases already that adding yet
another axis of variability is intimidating.  But it may work out that
there's no better option.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: things I learned from working on memory allocation

From
Amit Kapila
Date:
On Fri, Jul 11, 2014 at 11:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jul 10, 2014 at 1:05 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > If there is an noticeable impact, then do you think having separate
> > file/infrastructure for parallel sort can help, basically non-parallel
> > and parallel sort will have some common and some specific parts.
> > The above layer will call the parallel/non-parallel functions based on
> > operation need.
>
> The tuplesort.c already handles so many cases already that adding yet
> another axis of variability is intimidating.  But it may work out that
> there's no better option.

For using new allocator, one idea is that it works seemlesly with current
memory allocation routines (palloc, pfree, repalloc, ..), another could be
that have separate routines (shm_palloc, shm_pfree, ..) for allocation
from new allocator.  I think having separate routines means all the call
sites for allocation/deallocation needs to be aware of operation type
(parallel or non-parallel), however I think this can avoid the overhead
for non-parallel cases.

I think it is bit difficult to say which approach ("with allocator" or
"directly use dsm") will turn out to be better as both have its pros and
cons as listed by you, however I feel that if we "directly using dsm", we
might need to change more core logic than "with allocator".

I believe we will face the same question again if somebody wants to
parallelize the join, so one parameter to decide could be based on
which approach will lead to less change in core logic/design.
 

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: things I learned from working on memory allocation

From
Andres Freund
Date:
Hi Robert,

On 2014-07-07 15:57:00 -0400, Robert Haas wrote:
> 1. I tried to write a single allocator which would be better than
> aset.c in two ways: first, by allowing allocation from dynamic shared
> memory; and second, by being more memory-efficient than aset.c.
> Heikki told me at PGCon that he thought that was a bad idea, and I now
> think he was right.

I agree with that as well. I am wondering whether it's possible to use
most of the same code and compile it once as a per process allocator and
once as the allocator for shared memory.

> Allocating from dynamic shared memory requires
> the use of relative rather than absolute pointers (because the DSM
> could be mapped at different addresses in different processes) and, if
> concurrent allocating and freeing is to be supported, locking.
> Although neither of these things requires very many extra
> instructions, the common path for aset.c is extremely short, and we
> haven't got those instructions to spare.  Code like "if (lock != NULL)
> LWLockAcquire(lock, LW_EXCLUSIVE)" is quite expensive even if the
> branch is never taken, apparently because being prepared to call a
> function at that point requires storing more stuff in our stack frame,
> and extra instructions to save and restore registers are a painfully
> expensive on allocation-heavy workloads.  On PPC64, to match aset.c's
> performance, a palloc/pfree cycle must run in ~120 instructions, ~80
> cycles, which doesn't leave much room for fluff.

The actual if (lock != NULL) bit costs significant amounts of cycles?
I'd have assumed that branch prediction takes care of that. Or is it
actually the icache not keeping up? Did you measure icache vs. dcache
misses?
Have you played with unlikely()/likely() type of macros?

I don't think that any such fixes changes will make enough of a
difference to negate the point that these simply are two different
allocators with different requirements, but it's still interesting.

> 2. After ripping the relative-pointer and locking stuff out of my
> allocator, I came up with something which performs about like aset.c
> on allocation, but with significantly better memory efficiency on
> small allocations.  REINDEX pgbench_accounts_pkey saves about 22%,
> because the SortTuple array saves nothing but the individual tuples
> take only half as much space, by avoiding the StandardChunkHeader
> overhead.  This seems like a savings worth pursuing, if we can figure
> out how to get it.

Yes, that'd certainly be nice.

> Unfortunately, there's some incremental overhead
> in pfree, amounting to a single test of global variable, even when the
> new allocator is never used, which is measurable on a microbenchmark;
> I don't remember the exact numbers off-hand.

Hm. Why's that needed? Because you're searching for your allocator's
superblocks in pfree()? I guess you don't store information about the
size/context for every allocatation like aset.c does?

> When the new allocator
> *is* used, pfree gets a lot slower - mostly because one of the data
> structures used by the new allocator suffer occasional cache misses
> during free operations.  I think there might be an argument that some
> hit on pfree would be worth taking in exchange for better
> space-efficiency, because we mostly free contexts by resetting them
> anyway; but I haven't yet managed to make that hit small enough to
> feel especially good about it.

The speed bump is hit when pfree()ing a allocation that's been allocated
with the new allocator or also when there's any concurrent activity for
that allocator?

> 3. The current MemoryContext abstraction layer is inadequate for
> almost everything one might want to do.  The idea of a context that
> allows only allocation, and ignores requests to free memory, has been
> discussed more than once.  Such an allocator could skip the
> power-of-two rounding that aset.c does, but it couldn't omit the chunk
> header, which means that in practice it would save no memory at all
> for cases like the one mentioned above (REINDEX
> pgbench_accounts_pkey).

I personally think that the performance benefit of not doing the
rounding, not accessing the freelist, and such is more interesting for
such an allocator than the memory savings. I'm pretty sure such a
'allocation only' allocator for e.g. the parser and parts of the
executor would be a rather noticeable performance benefit.

But even for the cases where the space savings are the important bit: To
me it sounds feasible to declare that some allocators don't allow
reallocations and freeing. Those then would be allowed to omit the chunk
header.  To make that debuggable we would need to add a chunk header
when assertions are enabled to error out when such an operation is
performed - but that's a acceptable price imo.

Btw, am I the only one that wondered if it  wouldn't be rather
beneficial to make aset.c add the chunk header before rounding?

> While there might be some benefit in other
> cases, without getting rid of or at least reducing the size of the
> chunk-header, I expect that the benefits will be too narrow to make
> this approach worth pursuing.  The chunk-header requirement is also a
> killer for DSM, again because segments can be mapped at different
> addresses in different backends.  I'm inclined to think that if we
> can't find a way to get rid of the chunk-header requirement, we ought
> to consider ripping out the whole abstraction layer as a
> performance-wasting exercise in futility.

FWIW, I have done that in an experimental patch and at least on x86-64
the performance benefits were neglegible. Didn't evaluate memory
efficiency though.

> 4. The MemoryContext layer embeds assumptions not only about the
> layout of memory, but the performance of various operations.  For
> example, GetMemoryContextSpace() is assumed to be cheap, so there's no
> interface like void *MemoryContextAlloc(Size request_size, Size
> *chunk_size) or Size MemoryContextFree(Size chunk_size), but those
> would be significantly more efficient for my new allocator.

I think mcxct.c is at least partially right in that. An memory
allocation interface that requires keeping track of the current
allocation size in userland *sucks*. Sure there are some cases where
that's not too much of a bother, but in the majority of cases it'll made
code harder to read and quite possibly just move the overhead in lots
more places.
I don't really see a problem with tweaking the interface so it allows
MemoryContextAlloc to immediately return the actually allocated size -
if we're doing that in a way that doesn't break existing users - why not?

> Possibly
> better still would be to have the context itself track utilization and
> soft-fail when we run out of space; even for aset.c, it makes little
> sense to refuse to accept another tuple when the AllocSet has a
> suitable object on the freelist.  That's probably not a critical flaw
> because, since tuples being sorted are likely all about the same size,
> the waste may be little or nothing.  Other allocators might have
> different space-management strategies, though, where this matters
> more.

I can't really figure out what you're talking about here.

> 5. It's tempting to look at other ways of solving the parallel sort
> problem that don't need an allocator - perhaps by simply packing all
> the tuples into a DSM one after the next.  But this is not easy to do,
> or at least it's not easy to do efficiently.  Tuples enter tuplesort.c
> through one of the tuplesort_put*() functions, most of which end up
> calling one of the copytup_*() functions.  But you can't easily just
> change those functions to stick the tuple someplace else, because they
> don't necessarily get the address of the space to be used from palloc
> directly.  In particular, copytup_heap() calls
> ExecCopySlotMinimalTuple(), and copytup_cluster() calls
> heap_copytuple().  heap_copytuple() is simple enough that you might be
> able to finagle a new API that would make it work, but I'm not sure
> what we could really do about ExecCopySlotMinimalTuple() except call
> it and then copy the result.  Perhaps that'd be OK for a first
> version.

This sounds like a problem that can be solved in a relatively localized
fashion. I think it'd be fair enough to provide one function that does
it efficiently and just fall back to ocpying for everything else.

> 6. In general, I'm worried that it's going to be hard to keep the
> overhead of parallel sort from leaking into the non-parallel case.
> With the no-allocator approach, every place that uses
> GetMemoryChunkSpace() or repalloc() or pfree() will have to handle the
> DSM and non-DSM cases differently, which isn't great for either
> performance or maintainability.  And even with an allocator, the
> SortTuple array will need to use relative pointers in a DSM; that
> might burden the non-DSM case.

Yes, I share that concern.

I somewhat doubt that tuplesort.c really is the right layer to do the
parallel part of parallel sort including the chunked storage. Why not
pack the tuples outside it and teach tuplesort.c to not copy tuples for
some _put* routine? Then tuplesort can entirely happen inside a single
process and doesn't have to worry about indirect pointers and anything?
Yes, it'll need slightly more memory, but that doesn't sound too bad.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: things I learned from working on memory allocation

From
Robert Haas
Date:
On Sun, Jul 13, 2014 at 2:23 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> The actual if (lock != NULL) bit costs significant amounts of cycles?
> I'd have assumed that branch prediction takes care of that. Or is it
> actually the icache not keeping up? Did you measure icache vs. dcache
> misses?
> Have you played with unlikely()/likely() type of macros?

I have not.  I think it's really got more to do with how much stuff
needs to be saved in the stack frame, but I might be wrong about that.

>> Unfortunately, there's some incremental overhead
>> in pfree, amounting to a single test of global variable, even when the
>> new allocator is never used, which is measurable on a microbenchmark;
>> I don't remember the exact numbers off-hand.
>
> Hm. Why's that needed? Because you're searching for your allocator's
> superblocks in pfree()? I guess you don't store information about the
> size/context for every allocatation like aset.c does?

Since the chunks don't have a StandardChunkHeader, pfree has got to
decide whether it's safe to look at the 16 bytes preceding the chunk
before doing so.  If the new allocator's not in use (single global
variable test) that's pretty cheap, but not so cheap as you might
hope.  I found that it was possible to buy back most of the cost by
replacing (*context->methods->free_p) (context, pointer); with a
hard-coded AllocSetFree(context, pointer), so that gives you some idea
what order of magnitude we're talking about here.

>> When the new allocator
>> *is* used, pfree gets a lot slower - mostly because one of the data
>> structures used by the new allocator suffer occasional cache misses
>> during free operations.  I think there might be an argument that some
>> hit on pfree would be worth taking in exchange for better
>> space-efficiency, because we mostly free contexts by resetting them
>> anyway; but I haven't yet managed to make that hit small enough to
>> feel especially good about it.
>
> The speed bump is hit when pfree()ing a allocation that's been allocated
> with the new allocator or also when there's any concurrent activity for
> that allocator?

The latter.

>> 3. The current MemoryContext abstraction layer is inadequate for
>> almost everything one might want to do.  The idea of a context that
>> allows only allocation, and ignores requests to free memory, has been
>> discussed more than once.  Such an allocator could skip the
>> power-of-two rounding that aset.c does, but it couldn't omit the chunk
>> header, which means that in practice it would save no memory at all
>> for cases like the one mentioned above (REINDEX
>> pgbench_accounts_pkey).
>
> I personally think that the performance benefit of not doing the
> rounding, not accessing the freelist, and such is more interesting for
> such an allocator than the memory savings. I'm pretty sure such a
> 'allocation only' allocator for e.g. the parser and parts of the
> executor would be a rather noticeable performance benefit.

I don't have any reason to believe that.  All palloc() does in the
common case is bump the boundary pointer and stick the chunk header in
there.  You're not going to be able to make that much faster.

> But even for the cases where the space savings are the important bit: To
> me it sounds feasible to declare that some allocators don't allow
> reallocations and freeing. Those then would be allowed to omit the chunk
> header.  To make that debuggable we would need to add a chunk header
> when assertions are enabled to error out when such an operation is
> performed - but that's a acceptable price imo.

Hmm, yeah, that could be done.  However, it does have the disadvantage
of requiring you to affirmatively avoid pfreeing anything that might
have come from such a context, rather than simply making pfree a noop.

> Btw, am I the only one that wondered if it  wouldn't be rather
> beneficial to make aset.c add the chunk header before rounding?

Well, it would help in some cases, hurt in others, and make no
difference at all in still others.  I mean, it just has to do with how
well your size classes fit your actual memory allocation pattern;
you'd be effectively changing the size classes from 32, 64, 128, 256
to 16, 48, 112, 240 and that could be a win or a loss depending on the
actual allocation pattern.   I'm sure it's pretty trivial to construct
a test case favoring either outcome.

>> 5. It's tempting to look at other ways of solving the parallel sort
>> problem that don't need an allocator - perhaps by simply packing all
>> the tuples into a DSM one after the next.  But this is not easy to do,
>> or at least it's not easy to do efficiently.  Tuples enter tuplesort.c
>> through one of the tuplesort_put*() functions, most of which end up
>> calling one of the copytup_*() functions.  But you can't easily just
>> change those functions to stick the tuple someplace else, because they
>> don't necessarily get the address of the space to be used from palloc
>> directly.  In particular, copytup_heap() calls
>> ExecCopySlotMinimalTuple(), and copytup_cluster() calls
>> heap_copytuple().  heap_copytuple() is simple enough that you might be
>> able to finagle a new API that would make it work, but I'm not sure
>> what we could really do about ExecCopySlotMinimalTuple() except call
>> it and then copy the result.  Perhaps that'd be OK for a first
>> version.
>
> This sounds like a problem that can be solved in a relatively localized
> fashion. I think it'd be fair enough to provide one function that does
> it efficiently and just fall back to ocpying for everything else.

OK.

>> 6. In general, I'm worried that it's going to be hard to keep the
>> overhead of parallel sort from leaking into the non-parallel case.
>> With the no-allocator approach, every place that uses
>> GetMemoryChunkSpace() or repalloc() or pfree() will have to handle the
>> DSM and non-DSM cases differently, which isn't great for either
>> performance or maintainability.  And even with an allocator, the
>> SortTuple array will need to use relative pointers in a DSM; that
>> might burden the non-DSM case.
>
> Yes, I share that concern.
>
> I somewhat doubt that tuplesort.c really is the right layer to do the
> parallel part of parallel sort including the chunked storage. Why not
> pack the tuples outside it and teach tuplesort.c to not copy tuples for
> some _put* routine? Then tuplesort can entirely happen inside a single
> process and doesn't have to worry about indirect pointers and anything?
> Yes, it'll need slightly more memory, but that doesn't sound too bad.

I'm not sure I understand what you're describing here:

- the _put* routine has to do with how tuples get into the tuplesort;
but parallel sort has to do with how we get them in order once they're
in there
- having tuplesort happen inside a single process seems like the exact
opposite of parallel sort
- why would we need more memory?

But having expressed my confusion, I'd certainly welcome further
comment on this point, because I think figuring out how to set up the
abstraction is in some ways the hardest part of this problem - and it
is certainly made harder by the complexity of the existing
abstraction.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: things I learned from working on memory allocation

From
Andres Freund
Date:
On 2014-07-14 11:24:26 -0400, Robert Haas wrote:
> On Sun, Jul 13, 2014 at 2:23 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > The actual if (lock != NULL) bit costs significant amounts of cycles?
> > I'd have assumed that branch prediction takes care of that. Or is it
> > actually the icache not keeping up? Did you measure icache vs. dcache
> > misses?
> > Have you played with unlikely()/likely() type of macros?
> 
> I have not.  I think it's really got more to do with how much stuff
> needs to be saved in the stack frame, but I might be wrong about that.

I don't really see why that'd play such a big role. The register
pressure on ppc/amd64 shouldn't be high enough that the (lock != NULL)
alone requires to push anything on the stack. Sure, the call to
LWLockAcquire() will, but if that's done in the separate branch...

> >> Unfortunately, there's some incremental overhead
> >> in pfree, amounting to a single test of global variable, even when the
> >> new allocator is never used, which is measurable on a microbenchmark;
> >> I don't remember the exact numbers off-hand.
> >
> > Hm. Why's that needed? Because you're searching for your allocator's
> > superblocks in pfree()? I guess you don't store information about the
> > size/context for every allocatation like aset.c does?
> 
> Since the chunks don't have a StandardChunkHeader, pfree has got to
> decide whether it's safe to look at the 16 bytes preceding the chunk
> before doing so.  If the new allocator's not in use (single global
> variable test) that's pretty cheap, but not so cheap as you might
> hope.

That's what I was afraid of :/

> I found that it was possible to buy back most of the cost by
> replacing (*context->methods->free_p) (context, pointer); with a
> hard-coded AllocSetFree(context, pointer), so that gives you some idea
> what order of magnitude we're talking about here.

That's measured with a microbenchmark or actual postgres workloads?
Because in my measurements there wasn't consistent benefit in doing so
even when benchmarking workloads where allocation is a bottleneck.

> >> 3. The current MemoryContext abstraction layer is inadequate for
> >> almost everything one might want to do.  The idea of a context that
> >> allows only allocation, and ignores requests to free memory, has been
> >> discussed more than once.  Such an allocator could skip the
> >> power-of-two rounding that aset.c does, but it couldn't omit the chunk
> >> header, which means that in practice it would save no memory at all
> >> for cases like the one mentioned above (REINDEX
> >> pgbench_accounts_pkey).
> >
> > I personally think that the performance benefit of not doing the
> > rounding, not accessing the freelist, and such is more interesting for
> > such an allocator than the memory savings. I'm pretty sure such a
> > 'allocation only' allocator for e.g. the parser and parts of the
> > executor would be a rather noticeable performance benefit.
> 
> I don't have any reason to believe that.  All palloc() does in the
> common case is bump the boundary pointer and stick the chunk header in
> there.  You're not going to be able to make that much faster.

In my profiling the access to the freelist (to test whether there's free
chunks in there) actually is noticeable stall. Removing it makes things
measurably faster. Also adds memory overhead :) If you look at it, a
simple AllocSetAlloc() will currently access three cachelines inside
AllocSetContext - with some care that can be one.

> > But even for the cases where the space savings are the important bit: To
> > me it sounds feasible to declare that some allocators don't allow
> > reallocations and freeing. Those then would be allowed to omit the chunk
> > header.  To make that debuggable we would need to add a chunk header
> > when assertions are enabled to error out when such an operation is
> > performed - but that's a acceptable price imo.
> 
> Hmm, yeah, that could be done.  However, it does have the disadvantage
> of requiring you to affirmatively avoid pfreeing anything that might
> have come from such a context, rather than simply making pfree a noop.

Yep. Sounds feasible for a number of cases imo. I have serious doubts
that adding a measurable cost to pfree()s for allocations from all
contests is going to be fun. There's some operators that do a lot of
those - especially inside btree opclasses which have to do so. Since
those are the ones you'll often be calling from parallel sort...

> > Btw, am I the only one that wondered if it  wouldn't be rather
> > beneficial to make aset.c add the chunk header before rounding?
> 
> Well, it would help in some cases, hurt in others, and make no
> difference at all in still others.  I mean, it just has to do with how
> well your size classes fit your actual memory allocation pattern;
> you'd be effectively changing the size classes from 32, 64, 128, 256
> to 16, 48, 112, 240 and that could be a win or a loss depending on the
> actual allocation pattern.   I'm sure it's pretty trivial to construct
> a test case favoring either outcome.

I was actually thinking about it more from the CPU than from the memory
packing angle. There's a fair amount of +-ALLOC_CHUNKHDRSZ and it
doesn't look like the compiler removes it all.

> >> 6. In general, I'm worried that it's going to be hard to keep the
> >> overhead of parallel sort from leaking into the non-parallel case.
> >> With the no-allocator approach, every place that uses
> >> GetMemoryChunkSpace() or repalloc() or pfree() will have to handle the
> >> DSM and non-DSM cases differently, which isn't great for either
> >> performance or maintainability.  And even with an allocator, the
> >> SortTuple array will need to use relative pointers in a DSM; that
> >> might burden the non-DSM case.
> >
> > Yes, I share that concern.
> >
> > I somewhat doubt that tuplesort.c really is the right layer to do the
> > parallel part of parallel sort including the chunked storage. Why not
> > pack the tuples outside it and teach tuplesort.c to not copy tuples for
> > some _put* routine? Then tuplesort can entirely happen inside a single
> > process and doesn't have to worry about indirect pointers and anything?
> > Yes, it'll need slightly more memory, but that doesn't sound too bad.
> 
> I'm not sure I understand what you're describing here:
> 
> - the _put* routine has to do with how tuples get into the tuplesort;
> but parallel sort has to do with how we get them in order once they're
> in there
> - having tuplesort happen inside a single process seems like the exact
> opposite of parallel sort
> - why would we need more memory?
> 
> But having expressed my confusion, I'd certainly welcome further
> comment on this point, because I think figuring out how to set up the
> abstraction is in some ways the hardest part of this problem - and it
> is certainly made harder by the complexity of the existing
> abstraction.

I think we're imagining quite different approaches to the problem. In my
mind the parallelism part is essentially a layer *above* tuplesort.c,
but I think you're thinking about a much more integrated approach.
Could you maybe describe what the layering, distribution and locking
look like in your mind (or possibly even prototype?).

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: things I learned from working on memory allocation

From
Robert Haas
Date:
On Mon, Jul 14, 2014 at 12:19 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-07-14 11:24:26 -0400, Robert Haas wrote:
>> On Sun, Jul 13, 2014 at 2:23 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > The actual if (lock != NULL) bit costs significant amounts of cycles?
>> > I'd have assumed that branch prediction takes care of that. Or is it
>> > actually the icache not keeping up? Did you measure icache vs. dcache
>> > misses?
>> > Have you played with unlikely()/likely() type of macros?
>>
>> I have not.  I think it's really got more to do with how much stuff
>> needs to be saved in the stack frame, but I might be wrong about that.
>
> I don't really see why that'd play such a big role. The register
> pressure on ppc/amd64 shouldn't be high enough that the (lock != NULL)
> alone requires to push anything on the stack. Sure, the call to
> LWLockAcquire() will, but if that's done in the separate branch...

I can't tell you for sure what the cause is; I just saw that the
difference was very significant.

>> I found that it was possible to buy back most of the cost by
>> replacing (*context->methods->free_p) (context, pointer); with a
>> hard-coded AllocSetFree(context, pointer), so that gives you some idea
>> what order of magnitude we're talking about here.
>
> That's measured with a microbenchmark or actual postgres workloads?
> Because in my measurements there wasn't consistent benefit in doing so
> even when benchmarking workloads where allocation is a bottleneck.

Microbenchmark.

>> >> 6. In general, I'm worried that it's going to be hard to keep the
>> >> overhead of parallel sort from leaking into the non-parallel case.
>> >> With the no-allocator approach, every place that uses
>> >> GetMemoryChunkSpace() or repalloc() or pfree() will have to handle the
>> >> DSM and non-DSM cases differently, which isn't great for either
>> >> performance or maintainability.  And even with an allocator, the
>> >> SortTuple array will need to use relative pointers in a DSM; that
>> >> might burden the non-DSM case.
>> >
>> > Yes, I share that concern.
>> >
>> > I somewhat doubt that tuplesort.c really is the right layer to do the
>> > parallel part of parallel sort including the chunked storage. Why not
>> > pack the tuples outside it and teach tuplesort.c to not copy tuples for
>> > some _put* routine? Then tuplesort can entirely happen inside a single
>> > process and doesn't have to worry about indirect pointers and anything?
>> > Yes, it'll need slightly more memory, but that doesn't sound too bad.
>>
>> I'm not sure I understand what you're describing here:
>>
>> - the _put* routine has to do with how tuples get into the tuplesort;
>> but parallel sort has to do with how we get them in order once they're
>> in there
>> - having tuplesort happen inside a single process seems like the exact
>> opposite of parallel sort
>> - why would we need more memory?
>>
>> But having expressed my confusion, I'd certainly welcome further
>> comment on this point, because I think figuring out how to set up the
>> abstraction is in some ways the hardest part of this problem - and it
>> is certainly made harder by the complexity of the existing
>> abstraction.
>
> I think we're imagining quite different approaches to the problem. In my
> mind the parallelism part is essentially a layer *above* tuplesort.c,
> but I think you're thinking about a much more integrated approach.

So, I don't see how that can work.  tuplesort.c is *the* entrypoint
for sorting throughout the backend.  If you didn't use those same
entrypoints, you'd have to update every caller, which seems tedious
and rather pointless.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company