Re: things I learned from working on memory allocation

From Robert Haas
Subject Re: things I learned from working on memory allocation
Date
Msg-id CA+TgmoYDowuJ21_HTzLggJkfTGCz0vUe1LLd9zmYJNOFNzfXtw@mail.gmail.com
In response to Re: things I learned from working on memory allocation  (Andres Freund <andres@2ndquadrant.com>)
Responses Re: things I learned from working on memory allocation  (Andres Freund <andres@2ndquadrant.com>)
List pgsql-hackers
On Sun, Jul 13, 2014 at 2:23 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> The actual if (lock != NULL) bit costs significant amounts of cycles?
> I'd have assumed that branch prediction takes care of that. Or is it
> actually the icache not keeping up? Did you measure icache vs. dcache
> misses?
> Have you played with unlikely()/likely() type of macros?

I have not.  I think it's really got more to do with how much stuff
needs to be saved in the stack frame, but I might be wrong about that.
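
For reference, likely()/unlikely() are usually just thin wrappers around
GCC's __builtin_expect; a minimal sketch, not something aset.c uses today:

#if defined(__GNUC__)
#define likely(x)    __builtin_expect((x) != 0, 1)
#define unlikely(x)  __builtin_expect((x) != 0, 0)
#else
#define likely(x)    (x)
#define unlikely(x)  (x)
#endif

/* hypothetical use in the case above, hinting that no lock is the
 * common case: if (unlikely(lock != NULL)) ... take the locked path ... */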

>> Unfortunately, there's some incremental overhead
>> in pfree, amounting to a single test of a global variable, even when the
>> new allocator is never used, which is measurable on a microbenchmark;
>> I don't remember the exact numbers off-hand.
>
> Hm. Why's that needed? Because you're searching for your allocator's
> superblocks in pfree()? I guess you don't store information about the
>> size/context for every allocation like aset.c does?

Since the chunks don't have a StandardChunkHeader, pfree has got to
decide whether it's safe to look at the 16 bytes preceding the chunk
before doing so.  If the new allocator's not in use (single global
variable test) that's pretty cheap, but not so cheap as you might
hope.  I found that it was possible to buy back most of the cost by
replacing (*context->methods->free_p) (context, pointer); with a
hard-coded AllocSetFree(context, pointer), so that gives you some idea
what order of magnitude we're talking about here.
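
To make the shape of that test concrete, pfree() ends up looking roughly
like this; the flag and the chunk-lookup routine are placeholders rather
than the actual names in the patch:

void
pfree(void *pointer)
{
    MemoryContext context;

    /* hypothetical global flag: has the new allocator ever been used? */
    if (new_allocator_used && new_allocator_owns_chunk(pointer))
    {
        /* no StandardChunkHeader to consult; hand it to the new code */
        new_allocator_free(pointer);
        return;
    }

    /* existing path: the 16-byte header precedes the chunk */
    context = ((StandardChunkHeader *)
               ((char *) pointer - STANDARDCHUNKHEADERSIZE))->context;
    (*context->methods->free_p) (context, pointer);
}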

>> When the new allocator
>> *is* used, pfree gets a lot slower - mostly because one of the data
>> structures used by the new allocator suffers occasional cache misses
>> during free operations.  I think there might be an argument that some
>> hit on pfree would be worth taking in exchange for better
>> space-efficiency, because we mostly free contexts by resetting them
>> anyway; but I haven't yet managed to make that hit small enough to
>> feel especially good about it.
>
> The speed bump is hit when pfree()ing an allocation that's been allocated
> with the new allocator or also when there's any concurrent activity for
> that allocator?

The latter.

>> 3. The current MemoryContext abstraction layer is inadequate for
>> almost everything one might want to do.  The idea of a context that
>> allows only allocation, and ignores requests to free memory, has been
>> discussed more than once.  Such an allocator could skip the
>> power-of-two rounding that aset.c does, but it couldn't omit the chunk
>> header, which means that in practice it would save no memory at all
>> for cases like the one mentioned above (REINDEX
>> pgbench_accounts_pkey).
>
> I personally think that the performance benefit of not doing the
> rounding, not accessing the freelist, and such is more interesting for
> such an allocator than the memory savings. I'm pretty sure such an
> 'allocation only' allocator for e.g. the parser and parts of the
> executor would be a rather noticeable performance benefit.

I don't have any reason to believe that.  All palloc() does in the
common case is bump the boundary pointer and stick the chunk header in
there.  You're not going to be able to make that much faster.
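
(Paraphrasing the interior of AllocSetAlloc() rather than quoting it, the
common case is roughly the following; declarations, freelists, assertions,
and the separate-block path for large requests are omitted:)

/* round the request up to its power-of-two size class */
chunk_size = (1 << ALLOC_MINBITS) << AllocSetFreeIndex(size);

block = set->blocks;                /* the active block */
if (block != NULL &&
    (Size) (block->endptr - block->freeptr) >= chunk_size + ALLOC_CHUNKHDRSZ)
{
    chunk = (AllocChunk) block->freeptr;
    block->freeptr += ALLOC_CHUNKHDRSZ + chunk_size;    /* bump the boundary */
    chunk->aset = (void *) set;                         /* fill in the header */
    chunk->size = chunk_size;
    return AllocChunkGetPointer(chunk);
}
/* else: look at the freelists or start a new block (not shown) */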

> But even for the cases where the space savings are the important bit: To
> me it sounds feasible to declare that some allocators don't allow
> reallocations and freeing. Those then would be allowed to omit the chunk
> header.  To make that debuggable we would need to add a chunk header
> when assertions are enabled to error out when such an operation is
> performed - but that's an acceptable price imo.

Hmm, yeah, that could be done.  However, it does have the disadvantage
of requiring you to affirmatively avoid pfreeing anything that might
have come from such a context, rather than simply making pfree a noop.
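
A sketch of the assertions-only guard Andres describes, with invented
names: when assertions are enabled the context keeps the header, and its
free method just refuses the operation.

#ifdef USE_ASSERT_CHECKING
/* installed as the context's free_p method; invented name */
static void
AllocOnlyFree(MemoryContext context, void *pointer)
{
    elog(ERROR, "pfree is not supported in allocation-only context \"%s\"",
         context->name);
}
#endif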

> Btw, am I the only one that wondered if it wouldn't be rather
> beneficial to make aset.c add the chunk header before rounding?

Well, it would help in some cases, hurt in others, and make no
difference at all in still others.  I mean, it just has to do with how
well your size classes fit your actual memory allocation pattern;
you'd be effectively changing the size classes from 32, 64, 128, 256
to 16, 48, 112, 240 and that could be a win or a loss depending on the
actual allocation pattern.  I'm sure it's pretty trivial to construct
a test case favoring either outcome.
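
(For concreteness: a chunk whose usable size is 40 bytes today rounds up
to 64 and then gets its 16-byte header, 80 bytes in all, whereas with the
header added before rounding, 40 + 16 = 56 rounds up to 64 total with 48
bytes still usable.  A 56-byte chunk goes the other way: 64 + 16 = 80
today, but 56 + 16 = 72 rounds up to 128.)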

>> 5. It's tempting to look at other ways of solving the parallel sort
>> problem that don't need an allocator - perhaps by simply packing all
>> the tuples into a DSM one after the next.  But this is not easy to do,
>> or at least it's not easy to do efficiently.  Tuples enter tuplesort.c
>> through one of the tuplesort_put*() functions, most of which end up
>> calling one of the copytup_*() functions.  But you can't easily just
>> change those functions to stick the tuple someplace else, because they
>> don't necessarily get the address of the space to be used from palloc
>> directly.  In particular, copytup_heap() calls
>> ExecCopySlotMinimalTuple(), and copytup_cluster() calls
>> heap_copytuple().  heap_copytuple() is simple enough that you might be
>> able to finagle a new API that would make it work, but I'm not sure
>> what we could really do about ExecCopySlotMinimalTuple() except call
>> it and then copy the result.  Perhaps that'd be OK for a first
>> version.
>
> This sounds like a problem that can be solved in a relatively localized
> fashion. I think it'd be fair enough to provide one function that does
> it efficiently and just fall back to copying for everything else.

OK.
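
Concretely, the fallback for the slot case would be something like the
following sketch; the function name and the dsm_space/dsm_used bookkeeping
are invented, standing in for however the shared segment actually gets
carved up, and no overflow check is shown:

static MinimalTuple
copytup_slot_into_shared(TupleTableSlot *slot, char *dsm_space, Size *dsm_used)
{
    /* ExecCopySlotMinimalTuple() hands back a palloc'd copy ... */
    MinimalTuple tuple = ExecCopySlotMinimalTuple(slot);
    MinimalTuple result = (MinimalTuple) (dsm_space + *dsm_used);

    /* ... which we then copy a second time, into the shared segment */
    memcpy(result, tuple, tuple->t_len);
    *dsm_used += MAXALIGN(tuple->t_len);
    pfree(tuple);
    return result;
}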

>> 6. In general, I'm worried that it's going to be hard to keep the
>> overhead of parallel sort from leaking into the non-parallel case.
>> With the no-allocator approach, every place that uses
>> GetMemoryChunkSpace() or repalloc() or pfree() will have to handle the
>> DSM and non-DSM cases differently, which isn't great for either
>> performance or maintainability.  And even with an allocator, the
>> SortTuple array will need to use relative pointers in a DSM; that
>> might burden the non-DSM case.
>
> Yes, I share that concern.
>
> I somewhat doubt that tuplesort.c really is the right layer to do the
> parallel part of parallel sort including the chunked storage. Why not
> pack the tuples outside it and teach tuplesort.c to not copy tuples for
> some _put* routine? Then tuplesort can entirely happen inside a single
> process and doesn't have to worry about indirect pointers or anything?
> Yes, it'll need slightly more memory, but that doesn't sound too bad.

I'm not sure I understand what you're describing here:

- the _put* routine has to do with how tuples get into the tuplesort;
but parallel sort has to do with how we get them in order once they're
in there
- having tuplesort happen inside a single process seems like the exact
opposite of parallel sort
- why would we need more memory?

But having expressed my confusion, I'd certainly welcome further
comment on this point, because I think figuring out how to set up the
abstraction is in some ways the hardest part of this problem - and it
is certainly made harder by the complexity of the existing
abstraction.
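
As to the relative-pointer worry in point 6, what the SortTuple array
would have to do in a DSM is roughly this: store offsets from the segment
base rather than raw pointers, because the segment maps at a different
address in every process.  Names below are invented; presumably the burden
on the non-DSM case is the extra addition on every access, or else keeping
two code paths.

/* invented names; just the idea of base-relative addressing */
typedef struct
{
    Size    tupoffset;      /* replaces the "void *tuple" in SortTuple */
} RelSortTuple;

static inline void *
relptr_decode(char *segment_base, Size offset)
{
    return segment_base + offset;
}

static inline Size
relptr_encode(char *segment_base, void *ptr)
{
    return (Size) ((char *) ptr - segment_base);
}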

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


