Re: things I learned from working on memory allocation - Mailing list pgsql-hackers

From: Andres Freund
Subject: Re: things I learned from working on memory allocation
Date:
Msg-id: 20140714161942.GA22721@alap3.anarazel.de
In response to: Re: things I learned from working on memory allocation  (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: things I learned from working on memory allocation  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On 2014-07-14 11:24:26 -0400, Robert Haas wrote:
> On Sun, Jul 13, 2014 at 2:23 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > The actual if (lock != NULL) bit costs significant amounts of cycles?
> > I'd have assumed that branch prediction takes care of that. Or is it
> > actually the icache not keeping up? Did you measure icache vs. dcache
> > misses?
> > Have you played with unlikely()/likely() type of macros?
> 
> I have not.  I think it's really got more to do with how much stuff
> needs to be saved in the stack frame, but I might be wrong about that.

I don't really see why that'd play such a big role. The register
pressure on ppc/amd64 shouldn't be high enough that the (lock != NULL)
test alone requires pushing anything onto the stack. Sure, the call to
LWLockAcquire() will, but if that's done in a separate branch...
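
For reference, by likely()/unlikely() I just mean the usual
__builtin_expect() wrappers, roughly like this - a sketch only, all
function names made up:

#if defined(__GNUC__)
#define likely(x)   __builtin_expect((x) != 0, 1)
#define unlikely(x) __builtin_expect((x) != 0, 0)
#else
#define likely(x)   (x)
#define unlikely(x) (x)
#endif

typedef struct sketch_lwlock sketch_lwlock;

static void
sketch_lock_acquire(sketch_lwlock *lock)
{
    (void) lock;                /* stand-in for LWLockAcquire() */
}

static void
sketch_alloc_prologue(sketch_lwlock *lock)
{
    /*
     * Hint that the locked (shared memory) case is the uncommon one, so
     * the compiler keeps the call out of the hot path.
     */
    if (unlikely(lock != NULL))
        sketch_lock_acquire(lock);

    /* ... allocation fast path continues here ... */
}

That way the fast path shouldn't have to set anything up for the call.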

> >> Unfortunately, there's some incremental overhead
> >> in pfree, amounting to a single test of global variable, even when the
> >> new allocator is never used, which is measurable on a microbenchmark;
> >> I don't remember the exact numbers off-hand.
> >
> > Hm. Why's that needed? Because you're searching for your allocator's
> > superblocks in pfree()? I guess you don't store information about the
> > size/context for every allocation like aset.c does?
> 
> Since the chunks don't have a StandardChunkHeader, pfree has got to
> decide whether it's safe to look at the 16 bytes preceding the chunk
> before doing so.  If the new allocator's not in use (single global
> variable test) that's pretty cheap, but not so cheap as you might
> hope.

That's what I was afraid of :/
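
Just so we're talking about the same thing, I imagine the check looks
roughly like this (invented names, obviously not your actual patch):

#include <stdbool.h>
#include <stdlib.h>

static bool sb_allocator_used = false;  /* the single global variable */

static bool sb_is_superblock_chunk(void *ptr) { (void) ptr; return false; }
static void sb_free(void *ptr)                { (void) ptr; }
static void aset_free_with_header(void *ptr)  { free(ptr); }

static void
sketch_pfree(void *pointer)
{
    /*
     * Without a StandardChunkHeader on every chunk, pfree() first has to
     * decide whether it may look at the 16 bytes in front of the pointer
     * at all.  If the new allocator was never used this is a single
     * global-variable test - cheap, but not free.
     */
    if (sb_allocator_used && sb_is_superblock_chunk(pointer))
        sb_free(pointer);               /* headerless chunk */
    else
        aset_free_with_header(pointer); /* header is known to be there */
}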

> I found that it was possible to buy back most of the cost by
> replacing (*context->methods->free_p) (context, pointer); with a
> hard-coded AllocSetFree(context, pointer), so that gives you some idea
> what order of magnitude we're talking about here.

Was that measured with a microbenchmark or with actual postgres
workloads? In my measurements there wasn't a consistent benefit in doing
so, even when benchmarking workloads where allocation is a bottleneck.
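
Just to spell out what's being compared - essentially an indirect call
through the context's method table versus a direct call (simplified
stand-ins, not the real structs):

#include <stdlib.h>

typedef struct MemoryContextData *MemoryContext;

typedef struct MemoryContextMethods
{
    void        (*free_p) (MemoryContext context, void *pointer);
} MemoryContextMethods;

typedef struct MemoryContextData
{
    const MemoryContextMethods *methods;
} MemoryContextData;

static void
AllocSetFree_sketch(MemoryContext context, void *pointer)
{
    (void) context;
    free(pointer);
}

static void
pfree_indirect(MemoryContext context, void *pointer)
{
    /* what pfree() does today: dispatch through the method table */
    (*context->methods->free_p) (context, pointer);
}

static void
pfree_direct(MemoryContext context, void *pointer)
{
    /* the experiment: hard-code the aset.c implementation */
    AllocSetFree_sketch(context, pointer);
}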

> >> 3. The current MemoryContext abstraction layer is inadequate for
> >> almost everything one might want to do.  The idea of a context that
> >> allows only allocation, and ignores requests to free memory, has been
> >> discussed more than once.  Such an allocator could skip the
> >> power-of-two rounding that aset.c does, but it couldn't omit the chunk
> >> header, which means that in practice it would save no memory at all
> >> for cases like the one mentioned above (REINDEX
> >> pgbench_accounts_pkey).
> >
> > I personally think that the performance benefit of not doing the
> > rounding, not accessing the freelist, and such is more interesting for
> > such an allocator than the memory savings. I'm pretty sure such a
> > 'allocation only' allocator for e.g. the parser and parts of the
> > executor would be a rather noticeable performance benefit.
> 
> I don't have any reason to believe that.  All palloc() does in the
> common case is bump the boundary pointer and stick the chunk header in
> there.  You're not going to be able to make that much faster.

In my profiling the access to the freelist (to test whether there are
free chunks in there) actually is a noticeable stall. Removing it makes
things measurably faster - though it also adds memory overhead :). If
you look at it, a simple AllocSetAlloc() will currently access three
cachelines inside AllocSetContext - with some care that can be one.
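
Roughly, the two fast paths I'm comparing look like this (grossly
simplified, field names only loosely after aset.c):

#include <stddef.h>

typedef struct SketchSet
{
    void       *freelist[11];   /* one freelist per power-of-two size class */
    char       *block_free;     /* start of free space in the current block */
    char       *block_end;      /* end of the current block */
} SketchSet;

/* aset.c-style fast path: check the size class' freelist, then bump */
static void *
alloc_with_freelist(SketchSet *set, int fidx, size_t chunk_size)
{
    if (set->freelist[fidx] != NULL)    /* the load I see stalling */
    {
        void       *chunk = set->freelist[fidx];

        set->freelist[fidx] = *(void **) chunk;
        return chunk;
    }

    if ((size_t) (set->block_end - set->block_free) >= chunk_size)
    {
        void       *chunk = set->block_free;

        set->block_free += chunk_size;
        return chunk;
    }
    return NULL;                /* would have to allocate a new block */
}

/* allocation-only sketch: no rounding, no freelist, just bump the pointer */
static void *
alloc_bump_only(SketchSet *set, size_t size)
{
    if ((size_t) (set->block_end - set->block_free) >= size)
    {
        void       *chunk = set->block_free;

        set->block_free += size;
        return chunk;
    }
    return NULL;                /* would have to allocate a new block */
}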

> > But even for the cases where the space savings are the important bit: To
> > me it sounds feasible to declare that some allocators don't allow
> > reallocations and freeing. Those then would be allowed to omit the chunk
> > header.  To make that debuggable we would need to add a chunk header
> > when assertions are enabled to error out when such an operation is
> performed - but that's an acceptable price imo.
> 
> Hmm, yeah, that could be done.  However, it does have the disadvantage
> of requiring you to affirmatively avoid pfreeing anything that might
> have come from such a context, rather than simply making pfree a noop.

Yep. Sounds feasible for a number of cases imo. I have serious doubts
that adding a measurable cost to pfree()s for allocations from all
contexts is going to be fun. There are some operators that do a lot of
those - especially inside btree opclasses, which have to. And those are
the ones you'll often be calling from parallel sort...
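
What I have in mind for the assert-enabled build is roughly this -
names and layout entirely invented:

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#ifdef USE_ASSERT_CHECKING
typedef struct NoFreeChunkHeader
{
    uint32_t    magic;
} NoFreeChunkHeader;

#define NOFREE_MAGIC    0x4e4f4652      /* "NOFR" */
#define NOFREE_HDRSZ    sizeof(NoFreeChunkHeader)
#else
#define NOFREE_HDRSZ    0
#endif

/* allocation from a context that doesn't support pfree()/repalloc() */
static void *
nofree_alloc(size_t size)
{
    char       *raw = malloc(NOFREE_HDRSZ + size);

#ifdef USE_ASSERT_CHECKING
    ((NoFreeChunkHeader *) raw)->magic = NOFREE_MAGIC;
#endif
    return raw + NOFREE_HDRSZ;
}

static void
nofree_pfree_check(void *pointer)
{
#ifdef USE_ASSERT_CHECKING
    NoFreeChunkHeader *hdr =
        (NoFreeChunkHeader *) ((char *) pointer - NOFREE_HDRSZ);

    /* freeing a chunk from a no-free context is a programming error */
    assert(hdr->magic != NOFREE_MAGIC);
#endif
    /* ... the normal pfree() path would continue here ... */
    (void) pointer;
}

In a production build the header and the check vanish completely, so
only assert-enabled builds pay for the debuggability.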

> > Btw, am I the only one that wondered if it  wouldn't be rather
> > beneficial to make aset.c add the chunk header before rounding?
> 
> Well, it would help in some cases, hurt in others, and make no
> difference at all in still others.  I mean, it just has to do with how
> well your size classes fit your actual memory allocation pattern;
> you'd be effectively changing the size classes from 32, 64, 128, 256
> to 16, 48, 112, 240 and that could be a win or a loss depending on the
> actual allocation pattern.   I'm sure it's pretty trivial to construct
> a test case favoring either outcome.

I was actually thinking about it more from the CPU angle than from the
memory packing angle. There's a fair amount of +/- ALLOC_CHUNKHDRSZ
arithmetic, and it doesn't look like the compiler removes all of it.
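
For the archives, Robert's size-class shift in concrete numbers
(illustrative header size, standing in for ALLOC_CHUNKHDRSZ):

#include <stdio.h>
#include <stddef.h>

#define SKETCH_CHUNKHDRSZ   16

/* round a request up to the next power-of-two size class, 16 at minimum */
static size_t
round_up_pow2(size_t size)
{
    size_t      result = 16;

    while (result < size)
        result <<= 1;
    return result;
}

int
main(void)
{
    size_t      request = 40;

    /* today: round the request first, then add the header on top */
    size_t      rounded_then_header =
        round_up_pow2(request) + SKETCH_CHUNKHDRSZ;     /* 64 + 16 = 80 */

    /* alternative: fold the header in before rounding */
    size_t      header_then_rounded =
        round_up_pow2(request + SKETCH_CHUNKHDRSZ);     /* 56 -> 64 */

    printf("request=%zu today=%zu alternative=%zu\n",
           request, rounded_then_header, header_then_rounded);
    return 0;
}

I.e. whether that's a win really does depend on where the request sizes
fall relative to the class boundaries.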

> >> 6. In general, I'm worried that it's going to be hard to keep the
> >> overhead of parallel sort from leaking into the non-parallel case.
> >> With the no-allocator approach, every place that uses
> >> GetMemoryChunkSpace() or repalloc() or pfree() will have to handle the
> >> DSM and non-DSM cases differently, which isn't great for either
> >> performance or maintainability.  And even with an allocator, the
> >> SortTuple array will need to use relative pointers in a DSM; that
> >> might burden the non-DSM case.
> >
> > Yes, I share that concern.
> >
> > I somewhat doubt that tuplesort.c really is the right layer to do the
> > parallel part of parallel sort including the chunked storage. Why not
> > pack the tuples outside it and teach tuplesort.c to not copy tuples for
> > some _put* routine? Then tuplesort can entirely happen inside a single
> > process and doesn't have to worry about indirect pointers and anything?
> > Yes, it'll need slightly more memory, but that doesn't sound too bad.
> 
> I'm not sure I understand what you're describing here:
> 
> - the _put* routine has to do with how tuples get into the tuplesort;
> but parallel sort has to do with how we get them in order once they're
> in there
> - having tuplesort happen inside a single process seems like the exact
> opposite of parallel sort
> - why would we need more memory?
> 
> But having expressed my confusion, I'd certainly welcome further
> comment on this point, because I think figuring out how to set up the
> abstraction is in some ways the hardest part of this problem - and it
> is certainly made harder by the complexity of the existing
> abstraction.

I think we're imagining quite different approaches to the problem. In my
mind the parallelism part is essentially a layer *above* tuplesort.c,
but I think you're thinking about a much more integrated approach.
Could you maybe describe what the layering, distribution and locking
would look like in your mind (or possibly even prototype it)?
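
To give a slightly more concrete idea of the layering I'm imagining:
the parallel machinery packs the tuples into memory it manages itself
(possibly a DSM segment), and the sort only ever sees plain pointers
into that memory. As a pure API sketch, with every name invented:

#include <stddef.h>

typedef struct PackedTupleStore PackedTupleStore;   /* owned by the caller */
typedef struct SimpleSortState SimpleSortState;

/* the caller packs a tuple into its own (possibly shared) memory */
extern void *packedstore_add_tuple(PackedTupleStore *store,
                                   const void *tuple, size_t len);

/* copy-free insertion: the sort just remembers the pointer */
extern void simplesort_put_pointer(SimpleSortState *state, void *tuple);

/* sort the accumulated pointers with the usual comparator machinery */
extern void simplesort_perform(SimpleSortState *state);

/* retrieval hands back the same pointers, in sorted order */
extern void *simplesort_get_next(SimpleSortState *state);

That way the sort itself can stay single-process and pointer-based.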

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


