Thread: Avoid stack frame setup in performance critical routines using tail calls

From: Andres Freund

Hi,

We have a few routines that take up a meaningful share of nearly all
workloads. They are worth micro-optimizing, even though they are rarely the
most expensive parts of a profile. The memory allocation infrastructure is an
example of that.

When looking at a profile one can often see that a measurable percentage of
the time is spent on stack frame setup in functions like palloc() and
AllocSetAlloc(). E.g. here's a perf profile annotating palloc(), the first
column showing the percentage of time the relevant instruction was
sampled:

       │      void *
       │      palloc(Size size)
       │      {
 11.62 │        push  %rbp
  5.89 │        mov   %rsp,%rbp
 11.79 │        push  %r12
       │        mov   %rdi,%r12
  6.07 │        push  %rbx
       │      /* duplicates MemoryContextAlloc to avoid increased overhead */
       │      void       *ret;
       │      MemoryContext context = CurrentMemoryContext;
       │        mov   CurrentMemoryContext,%rbx
       │
       │      AssertArg(MemoryContextIsValid(context));
       │      AssertNotInCriticalSection(context);
       │
       │      if (!AllocSizeIsValid(size))
  5.86 │        cmp   $0x3fffffff,%rdi
       │      → ja    14fa5b <palloc.cold>
       │      elog(ERROR, "invalid memory alloc request size %zu", size);
       │
       │      context->isReset = false;
 17.71 │        movb  $0x0,0x4(%rbx)
       │
       │      ret = context->methods->alloc(context, size);
  5.98 │        mov   0x10(%rbx),%rax
       │        mov   %rdi,%rsi
       │        mov   %rbx,%rdi
 35.08 │      → callq *(%rax)


The stack frame setup is the sequence of push/mov instructions at the top.

At least on x86-64 unixoid systems, that overhead can be avoided in certain
circumstances! The simplest case is a function that doesn't make any function
calls of its own. If it is simple enough (i.e. no register spilling), the
compiler will just not set up a stack frame - nobody could need it.

The slightly more complicated case is that of a function that only does a
"tail call", i.e. the only function call happens just before returning (there
can be multiple such paths though). If the function is simple enough,
gcc/clang will then not use the "call" instruction to invoke the inner
function (which would require a proper stack frame to be set up), but instead
just jump to it, which basically ends up reusing the current function's stack
frame. When that called function returns using 'ret', it reuses the return
address pushed onto the call stack by the caller of the "original" function,
and returns directly to that caller. That optimizes away the need to maintain
one stack frame level, and one indirection when returning from the inner
function (which otherwise would just do its own ret).

For that to work, there obviously cannot be any instructions in the function
after the call to the inner function, other than returning its result. Which
brings us back to the palloc example from above.
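
As a minimal, self-contained illustration (not PostgreSQL code; worker() and
wrapper() are invented names for the example), a function whose only work is
a tail call is compiled by gcc/clang at -O2 into a single jmp, with no frame
setup:

    #include <stddef.h>

    extern void *worker(size_t size);

    void *
    wrapper(size_t size)
    {
        /* tail call: emitted as "jmp worker" rather than */
        /* "call worker" followed by "ret" */
        return worker(size);
    }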

As an experiment, if I change the code for palloc() to omit the if (ret == NULL)
check, the assembly (omitting source for brevity) changes from:

  61c9a0:       55                      push   %rbp
  61c9a1:       48 89 e5                mov    %rsp,%rbp
  61c9a4:       41 54                   push   %r12
  61c9a6:       49 89 fc                mov    %rdi,%r12
  61c9a9:       53                      push   %rbx
  61c9aa:       48 8b 1d 2f f2 2a 00    mov    0x2af22f(%rip),%rbx        # 8cbbe0 <CurrentMemoryContext>
  61c9b1:       48 81 ff ff ff ff 3f    cmp    $0x3fffffff,%rdi
  61c9b8:       0f 87 9d 30 b3 ff       ja     14fa5b <palloc.cold>
  61c9be:       c6 43 04 00             movb   $0x0,0x4(%rbx)
  61c9c2:       48 8b 43 10             mov    0x10(%rbx),%rax
  61c9c6:       48 89 fe                mov    %rdi,%rsi
  61c9c9:       48 89 df                mov    %rbx,%rdi
  61c9cc:       ff 10                   callq  *(%rax)
  61c9ce:       48 85 c0                test   %rax,%rax
  61c9d1:       0f 84 b9 30 b3 ff       je     14fa90 <palloc.cold+0x35>
  61c9d7:       5b                      pop    %rbx
  61c9d8:       41 5c                   pop    %r12
  61c9da:       5d                      pop    %rbp
  61c9db:       c3                      retq

to

  61c8c0:       48 89 fe                mov    %rdi,%rsi
  61c8c3:       48 8b 3d 16 f3 2a 00    mov    0x2af316(%rip),%rdi        # 8cbbe0 <CurrentMemoryContext>
  61c8ca:       48 81 fe ff ff ff 3f    cmp    $0x3fffffff,%rsi
  61c8d1:       0f 87 c3 31 b3 ff       ja     14fa9a <palloc.cold>
  61c8d7:       c6 47 04 00             movb   $0x0,0x4(%rdi)
  61c8db:       48 8b 47 10             mov    0x10(%rdi),%rax
  61c8df:       ff 20                   jmpq   *(%rax)

It's not hard to see why that would be faster, I think.


Of course, we cannot just omit that check. But I think this is an argument for
why it is not a great idea to have such a check in palloc() - it prevents the
use of the above optimization, and it adds a branch to a performance critical
function, even though there are already branches in aset.c etc. that
specifically know about this case.

The code in palloc() has done this check after context->methods->alloc() since
commit 3d6d1b585524: Move out-of-memory error checks from aset.c to mcxt.c

Of course, that commit changed things for a reason: It allows
palloc_extended() to exist.

However, it seems that the above optimization, as well as the desire to avoid
redundant branches (checking for allocation failures in AllocSetAlloc() and
then again in palloc() etc.) in critical paths, suggests pushing the handling
of MCXT_ALLOC_NO_OOM (and perhaps others) a layer down, into the memory
context implementations. Which of course means that we would need to pass down
MCXT_ALLOC_NO_OOM into at least MemoryContextMethods->{alloc,realloc}. But
that seems like a good idea to me anyway. That way we could pass down further
information as well, e.g. about required alignment.
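
Concretely, the method table could grow a flags argument, along these lines
(a rough sketch in terms of the existing mcxt.c types, not a worked-out
patch):

    typedef struct MemoryContextMethods
    {
        /* flags (e.g. MCXT_ALLOC_NO_OOM) now reach the implementation */
        void       *(*alloc) (MemoryContext context, Size size, int flags);
        void       *(*realloc) (MemoryContext context, void *pointer,
                                Size size, int flags);
        /* ... remaining methods unchanged ... */
    } MemoryContextMethods;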

Of course it'd make sense to avoid duplicating the same error message across
all contexts, but that could be addressed using a mcxt.c helper function to
deal with the allocation failure case.

E.g. the existing cases like

    block = (AllocBlock) malloc(blksize);
    if (block == NULL)
        return NULL;

could become something like:
    block = (AllocBlock) malloc(blksize);
    if (unlikely(block == NULL))
        return MemoryContextAllocationFailure(context, size, flags);


The trick of avoiding stack frame setup does not just apply to wrapper
functions like palloc(). It can even apply to AllocSetAlloc() itself! If one
separates out the "slow paths" from the "fast paths" of AllocSetAlloc(), the
fast path can avoid needing the stack frame, at the price of the slow paths
being a tiny bit slower. Often the generated code turns out to be better,
because the register allocation pressure is lower in the fast path.

For that to work, the paths of AllocSetAlloc() that call malloc() need to be
separated out. As we obviously need to process malloc()'s result, the call to
malloc cannot be a tail call. So we need to split out two paths:
1) handling of large allocations
2) running out of space in the current block / having no block

To actually benefit from the optimization, those paths need to return the
allocated memory themselves. And they need to be marked pg_noinline, otherwise
the compiler won't get the message...
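
As a generic, self-contained sketch of the pattern (this is not the aset.c
code - the bump-allocator "Arena" and its function names are invented for
the example):

    #include <stddef.h>
    #include <stdlib.h>

    #define pg_noinline __attribute__((noinline))
    #define unlikely(x) __builtin_expect((x) != 0, 0)

    typedef struct Arena
    {
        char       *free_ptr;       /* next free byte in current block */
        char       *end_ptr;        /* end of current block */
    } Arena;                        /* zero-initialized Arena starts empty */

    /*
     * Slow path: has to call malloc() and inspect its result, so it needs
     * a stack frame - but it returns the allocation itself, which lets the
     * fast path reach it via a tail call.
     */
    static pg_noinline void *
    arena_alloc_slow(Arena *arena, size_t size)
    {
        size_t      blksize = size > 8192 ? size : 8192;
        char       *block = malloc(blksize);

        if (block == NULL)
            return NULL;
        /* for brevity, the remainder of any previous block is abandoned */
        arena->free_ptr = block + size;
        arena->end_ptr = block + blksize;
        return block;
    }

    /*
     * Fast path: the only call is the tail call to the slow path, so
     * gcc/clang can compile this without setting up a stack frame.
     */
    void *
    arena_alloc(Arena *arena, size_t size)
    {
        char       *ret = arena->free_ptr;

        if (unlikely((size_t) (arena->end_ptr - ret) < size))
            return arena_alloc_slow(arena, size);
        arena->free_ptr = ret + size;
        return ret;
    }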

I think this actually makes the aset.c code a good bit more readable, and
highlights where in AllocSetAlloc() adding instructions hurts, and where it's
fine.

I have *not* carefully benchmarked this, but a quick implementation of this
does seem to increase readonly pgbench tps at a small scale by 2-3% (both
-Mprepared/simple), despite that not being an all that allocator-bound
workload.


Rough prototype patch for the above attached.

Comments?


A slightly tangential improvement would be to move the memset() in palloc0()
et al. into a static inline. There are two benefits to that:

1) compilers can generate much better code for memset() if the length is known
   - instead of a function call with length dispatch, it becomes a handful of
   instructions doing the zeroing for the precise length.

2) compilers can often optimize away [part of] the overhead of needing to do
   the memset, as many callers will go on to overwrite a good portion of the
   zeroed data.
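
A minimal sketch of the idea (not the actual header code; palloc() is just
declared extern here to keep the example self-contained):

    #include <stddef.h>
    #include <string.h>

    extern void *palloc(size_t size);

    static inline void *
    palloc0_inline(size_t size)
    {
        void       *ret = palloc(size);

        /*
         * With a compile-time constant size - e.g.
         * palloc0_inline(sizeof(SomeNode)) - the compiler can replace
         * this memset() call with a handful of direct stores.
         */
        memset(ret, 0, size);
        return ret;
    }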


Greetings,

Andres Freund

Attachment
On Tue, 20 Jul 2021 at 08:00, Andres Freund <andres@anarazel.de> wrote:
> I have *not* carefully benchmarked this, but a quick implementation of this
> does seem to increase readonly pgbench tps at a small scale by 2-3% (both

Interesting.

I've not taken the time to study the patch but I was running some
other benchmarks today on a small scale pgbench readonly test and I
took this patch for a spin to see if I could see the same performance
gains.

This is an AMD 3990x machine that seems to get the most throughput
from pgbench with 132 processes.

I did: pgbench -T 240 -P 10 -c 132 -j 132 -S -M prepared
--random-seed=12345 postgres

master = dd498998a

Master: 3816959.53 tps
Patched: 3820723.252 tps

I didn't quite get the same 2-3% as you did, but it did come out
faster than on master.

David

Attachment
Hi,

On 2021-07-20 16:50:09 +1200, David Rowley wrote:
> I've not taken the time to study the patch but I was running some
> other benchmarks today on a small scale pgbench readonly test and I
> took this patch for a spin to see if I could see the same performance
> gains.

Thanks!


> This is an AMD 3990x machine that seems to get the most throughput
> from pgbench with 132 processes.
> 
> I did: pgbench -T 240 -P 10 -c 132 -j 132 -S -M prepared
> --random-seed=12345 postgres
> 
> master = dd498998a
> 
> Master: 3816959.53 tps
> Patched: 3820723.252 tps
> 
> I didn't quite get the same 2-3% as you did, but it did come out
> faster than on master.

It would not at all be surprising to me if AMD in recent microarchitectures did
a better job at removing stack management overhead (e.g. by better register
renaming, or by resolving dependencies on %rsp in a smarter way) than Intel
has. This was on a Cascade Lake CPU (Xeon 5215), which, despite being released
in 2019, effectively is a moderately polished (or maybe shoehorned)
microarchitecture from 2015 due to all the Intel troubles. Whereas Zen 2 is
from 2019.

It's also possible that my attempts at avoiding the stack management just
didn't work on your compiler. Either due to vendor (I know that gcc is better
at it than clang), version, or compiler flags (e.g. -fno-omit-frame-pointer
could make it harder, -fno-optimize-sibling-calls would disable it).

A third plausible explanation for the difference is that at a client count of
132, the bottlenecks are sufficiently elsewhere to just not show a meaningful
gain from memory management efficiency improvements.


Any chance you could show a `perf annotate AllocSetAlloc` and `perf annotate
palloc` from a patched run? And perhaps how high their percentages of the
total work are. E.g. using something like
perf report -g none|grep -E 'AllocSetAlloc|palloc|MemoryContextAlloc|pfree'

It'd be interesting to know where the bottlenecks on a zen2 machine are.

Greetings,

Andres Freund



On Tue, 20 Jul 2021 at 18:17, Andres Freund <andres@anarazel.de> wrote:
> Any chance you could show a `perf annotate AllocSetAlloc` and `perf annotate
> palloc` from a patched run? And perhaps how high their percentages of the
> total work are. E.g. using something like
> perf report -g none|grep -E 'AllocSetAlloc|palloc|MemoryContextAlloc|pfree'

Sure. See attached.

David

Attachment

Re: Avoid stack frame setup in performance critical routines using tail calls

From: "Andres Freund"

Hi,

On Mon, Jul 19, 2021, at 23:53, David Rowley wrote:
> On Tue, 20 Jul 2021 at 18:17, Andres Freund <andres@anarazel.de> wrote:
> > Any chance you could show a `perf annotate AllocSetAlloc` and `perf annotate
> > palloc` from a patched run? And perhaps how high their percentages of the
> > total work are. E.g. using something like
> > perf report -g none|grep -E 'AllocSetAlloc|palloc|MemoryContextAlloc|pfree'
> 
> Sure. See attached.
> 
> David
> 
> Attachments:
> * AllocateSetAlloc.txt
> * palloc.txt
> * percent.txt

Huh, that's interesting. You have some control flow enforcement stuff turned
on (the endbr64). And it looks like it has a non-zero cost (or maybe it's just
skid). Did you enable that intentionally? If not, what compiler/version/distro
is it? I think at least on GCC that's -fcf-protection=...

Andres



On Tue, 20 Jul 2021 at 19:04, Andres Freund <andres@anarazel.de> wrote:
> > * AllocateSetAlloc.txt
> > * palloc.txt
> > * percent.txt
>
> Huh, that's interesting. You have some control flow enforcement stuff turned
> on (the endbr64). And it looks like it has a non-zero cost (or maybe it's
> just skid). Did you enable that intentionally? If not, what
> compiler/version/distro is it? I think at least on GCC that's
> -fcf-protection=...

It's Ubuntu 21.04 with gcc 10.3 (specifically gcc version 10.3.0
(Ubuntu 10.3.0-1ubuntu1)).

I've attached the same results from compiling with clang 12
(12.0.0-3ubuntu1~21.04.1)

David

Attachment
Hi,

On 2021-07-20 19:37:46 +1200, David Rowley wrote:
> On Tue, 20 Jul 2021 at 19:04, Andres Freund <andres@anarazel.de> wrote:
> > > * AllocateSetAlloc.txt
> > > * palloc.txt
> > > * percent.txt
> >
> > Huh, that's interesting. You have some control flow enforcement stuff
> > turned on (the endbr64). And it looks like it has a non-zero cost (or
> > maybe it's just skid). Did you enable that intentionally? If not, what
> > compiler/version/distro is it? I think at least on GCC that's
> > -fcf-protection=...
>
> It's Ubuntu 21.04 with gcc 10.3 (specifically gcc version 10.3.0
> (Ubuntu 10.3.0-1ubuntu1)).
>
> I've attached the same results from compiling with clang 12
> (12.0.0-3ubuntu1~21.04.1)

It looks like the ubuntu folks have changed the default for CET to on.


andres@ubuntu2020:~$ echo 'int foo(void) { return 17;}' > test.c && gcc -O2  -c -o test.o test.c && objdump -S test.o

test.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <foo>:
   0:    f3 0f 1e fa              endbr64
   4:    b8 11 00 00 00           mov    $0x11,%eax
   9:    c3                       retq
andres@ubuntu2020:~$ echo 'int foo(void) { return 17;}' > test.c && gcc -O2 -fcf-protection=none -c -o test.o test.c && objdump -S test.o

test.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <foo>:
   0:    b8 11 00 00 00           mov    $0x11,%eax
   5:    c3                       retq


Independent of this patch, it might be worth running a benchmark with
the default options, and one with -fcf-protection=none. None of my
machines support it...

$ cpuid -1|grep CET
      CET_SS: CET shadow stack                 = false
      CET_IBT: CET indirect branch tracking    = false
         XCR0 supported: CET_U state          = false
         XCR0 supported: CET_S state          = false

Here it adds about 40kB of .text, but I can't measure the CET
overhead...

Greetings,

Andres Freund



Hi,

David and I were chatting about this patch, in the context of his bump
allocator patch.  Attached is a rebased version that is also split up into two
steps, and a bit more polished.

I wasn't sure what a good test was. I ended up measuring
  COPY pgbench_accounts TO '/dev/null' WITH (FORMAT 'binary');
of a scale 1 database with pgbench:

c=1; pgbench -q -i -s1 && pgbench -n -c$c -j$c -t100 -f <(echo "COPY pgbench_accounts TO '/dev/null' WITH (FORMAT 'binary');")

    average latency
HEAD:   33.865 ms
01:     32.820 ms
02:     29.934 ms

The server was pinned to one core, turbo mode disabled.  That's a pretty
nice win, I'd say.  And I don't think this is actually the most allocator
bound workload, I just tried something fairly random...


Greetings,

Andres Freund

Attachment
On Wed, 19 Jul 2023 at 20:52, Andres Freund <andres@anarazel.de> wrote:
> David and I were chatting about this patch, in the context of his bump
> allocator patch.  Attached is a rebased version that is also split up into two
> steps, and a bit more polished.

I've only just briefly read through the updated patch, but I did take
it for a spin to see what sort of improvements I can get from it.

The attached graph shows the time in seconds that it took for each
allocator to allocate 10GB of memory, resetting the context each time 1MB
has been allocated.  The data point for aset with 32-byte chunks takes
master 1.697 seconds and with both patches, it goes down to 1.264,
which is a 34% increase in performance.

It's pretty nice that we can hide the AllocSizeIsValid tests inside
the allocChunkLimit path and pretty good that we can skip the NULL
checks in most cases since we're not having to check for malloc
failure unless we malloc a new block.

I'll reply back with a more detailed review next week.

David

Attachment

On Wed, Jul 19, 2023 at 3:53 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> David and I were chatting about this patch, in the context of his bump
> allocator patch.  Attached is a rebased version that is also split up into two
> steps, and a bit more polished.

Here is a quick test -- something similar was used to measure the slab improvements last cycle. With radix tree v37 0001-0011 from [1],

create extension bench_radix_tree;
select avg(load_ms) from generate_series(1,100) x(x), lateral (select * from bench_load_random_int(100 * 1000 * (1+x-x))) a;

The backend was pinned and turbo off. Perf runs were separate from timed runs. I included 0002 for completeness.

v37
         avg        
---------------------
 27.0400000000000000

  32.42%  postgres  bench_radix_tree.so  [.] rt_recursive_set
  21.60%  postgres  postgres             [.] SlabAlloc
  11.06%  postgres  [unknown]            [k] 0xffffffff930018f7
  10.49%  postgres  bench_radix_tree.so  [.] rt_extend_down
   7.07%  postgres  postgres             [.] MemoryContextAlloc
   4.83%  postgres  bench_radix_tree.so  [.] rt_node_insert_inner
   2.19%  postgres  bench_radix_tree.so  [.] rt_grow_node_48
   2.16%  postgres  bench_radix_tree.so  [.] rt_set.isra.0
   1.50%  postgres  bench_radix_tree.so  [.] MemoryContextAlloc@plt

v37 + palloc sibling calls
         avg        
---------------------
 26.0700000000000000

v37 + palloc sibling calls + opt aset
         avg        
---------------------
 26.0900000000000000

  33.78%  postgres  bench_radix_tree.so  [.] rt_recursive_set
  23.04%  postgres  postgres             [.] SlabAlloc
  11.43%  postgres  [unknown]            [k] 0xffffffff930018f7
  11.05%  postgres  bench_radix_tree.so  [.] rt_extend_down
   5.52%  postgres  bench_radix_tree.so  [.] rt_node_insert_inner
   2.47%  postgres  bench_radix_tree.so  [.] rt_set.isra.0
   2.30%  postgres  bench_radix_tree.so  [.] rt_grow_node_48
   1.88%  postgres  postgres             [.] MemoryContextAlloc
   1.44%  postgres  bench_radix_tree.so  [.] MemoryContextAlloc@plt

It's nice to see MemoryContextAlloc go down in the profile.

Re: Avoid stack frame setup in performance critical routines using tail calls

From: David Rowley

On Fri, 21 Jul 2023 at 14:03, David Rowley <dgrowleyml@gmail.com> wrote:
> I'll reply back with a more detailed review next week.

Here's a review of v2-0001:

1.

/*
 * XXX: Should this also be moved into alloc()? We could possibly avoid
 * zeroing in some cases (e.g. if we used mmap() ourselves).
 */
MemSetAligned(ret, 0, size);

Maybe this should be moved to the alloc function.  It would allow us
to get rid of this:

#define palloc0fast(sz) \
    ( MemSetTest(0, sz) ? \
        MemoryContextAllocZeroAligned(CurrentMemoryContext, sz) : \
        MemoryContextAllocZero(CurrentMemoryContext, sz) )

If we do the zeroing inside the alloc function then it can always use
the MemoryContextAllocZeroAligned version, provided we zero before
setting the sentinel byte.
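
For illustration, a generic sketch of that ordering constraint (everything
here is an invented stand-in, not mcxt.c code):

    #include <stdlib.h>
    #include <string.h>

    #define ALLOC_ZERO  0x01        /* stand-in for MCXT_ALLOC_ZERO */
    #define SENTINEL    0x7E        /* stand-in sentinel value */

    static void *
    alloc_sketch(size_t size, int flags)
    {
        size_t      aligned = (size + 7) & ~(size_t) 7; /* MAXALIGN-ish */
        char       *ret = malloc(aligned + 1);          /* +1 for sentinel */

        if (ret == NULL)
            return NULL;

        /*
         * The aligned-size zeroing may write past the requested size, so
         * the sentinel byte must be written afterwards, or it would be
         * wiped out again.
         */
        if (flags & ALLOC_ZERO)
            memset(ret, 0, aligned);
        ret[size] = SENTINEL;
        return ret;
    }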

It would allow the tail call in the palloc0() case, but the drawback
would be having to check for the MCXT_ALLOC_ZERO flag in the alloc
function. I wonder if that branch would be predictable in most cases,
e.g. the parser will be making lots of nodes and want to zero all
allocations, but the executor won't be doing much of that. There will
be a mix of zeroing and not zeroing in the planner, mostly not, I
think.

2. Why do you need to add the NULL check here?

 #ifdef USE_VALGRIND
- if (method != MCTX_ALIGNED_REDIRECT_ID)
+ if (ret != NULL && method != MCTX_ALIGNED_REDIRECT_ID)
  VALGRIND_MEMPOOL_CHANGE(context, pointer, ret, size);
 #endif

I know it's just valgrind code and performance does not matter, but
the realloc flags are being passed as 0, so allocation failures won't
return.

3.

/*
 * XXX: Probably no need to check for huge allocations, we only support
 * one size? Which could theoretically be huge, but that'd not make
 * sense...
 */

They can't be huge per Assert(fullChunkSize <= MEMORYCHUNK_MAX_VALUE)
in SlabContextCreate().

4. It would be good to see some API documentation in the
MemoryContextMethods struct.  This adds a lot of responsibility onto
the context implementation without any extra documentation to explain
what, for example, palloc is responsible for and what the alloc
function needs to do itself.

David



I've rebased the 0001 patch and gone over it again and made a few
additional changes besides what I mentioned in my review.

On Wed, 9 Aug 2023 at 20:44, David Rowley <dgrowleyml@gmail.com> wrote:
> Here's a review of v2-0001:
> 2. Why do you need to add the NULL check here?
>
>  #ifdef USE_VALGRIND
> - if (method != MCTX_ALIGNED_REDIRECT_ID)
> + if (ret != NULL && method != MCTX_ALIGNED_REDIRECT_ID)
>   VALGRIND_MEMPOOL_CHANGE(context, pointer, ret, size);
>  #endif

I removed this NULL check as we're calling the realloc function with
no flags, so it shouldn't return NULL as it'll error out from any OOM
errors.

> 3.
>
> /*
>  * XXX: Probably no need to check for huge allocations, we only support
>  * one size? Which could theoretically be huge, but that'd not make
>  * sense...
>  */
>
> They can't be huge per Assert(fullChunkSize <= MEMORYCHUNK_MAX_VALUE)
> in SlabContextCreate().

I removed this comment and adjusted the comment just below that which
checks the 'size' matches the expected slab chunk size. i.e.

/*
 * Make sure we only allow correct request size.  This doubles as the
 * MemoryContextCheckSize check.
 */
if (unlikely(size != slab->chunkSize))


> 4. It would be good to see some API documentation in the
> MemoryContextMethods struct.  This adds a lot of responsibility onto
> the context implementation without any extra documentation to explain
> what, for example, palloc is responsible for and what the alloc
> function needs to do itself.

I've done that too.

I also added header comments for MemoryContextAllocationFailure and
MemoryContextSizeFailure and added some comments to explain in places
like palloc() to warn people not to add checks after the 'alloc' call.

The rebased patch is 0001 and all of my changes are in 0002.  I will
rebase your original 0002 patch later.  I think 0001 is much more
important, as evident by the reported benchmarks on this thread.

In the absence of anyone else looking at this, I think it's ready to go.
If anyone is following along and wants to review or test it, please do
so soon.

David

Attachment
Hi,

On 2024-02-23 00:46:26 +1300, David Rowley wrote:
> I've rebased the 0001 patch and gone over it again and made a few
> additional changes besides what I mentioned in my review.
>
> On Wed, 9 Aug 2023 at 20:44, David Rowley <dgrowleyml@gmail.com> wrote:
> > Here's a review of v2-0001:
> > 2. Why do you need to add the NULL check here?
> >
> >  #ifdef USE_VALGRIND
> > - if (method != MCTX_ALIGNED_REDIRECT_ID)
> > + if (ret != NULL && method != MCTX_ALIGNED_REDIRECT_ID)
> >   VALGRIND_MEMPOOL_CHANGE(context, pointer, ret, size);
> >  #endif
>
> I removed this NULL check as we're calling the realloc function with
> no flags, so it shouldn't return NULL as it'll error out from any OOM
> errors.

That was probably a copy-paste issue...


> > 4. It would be good to see some API documentation in the
> > MemoryContextMethods struct.  This adds a lot of responsibility onto
> > the context implementation without any extra documentation to explain
> > what, for example, palloc is responsible for and what the alloc
> > function needs to do itself.
>
> I've done that too.
>
> I also added header comments for MemoryContextAllocationFailure and
> MemoryContextSizeFailure and added some comments to explain in places
> like palloc() to warn people not to add checks after the 'alloc' call.
>
> The rebased patch is 0001 and all of my changes are in 0002.  I will
> rebase your original 0002 patch later.

Thanks!


> I think 0001 is much more important, as evident by the reported benchmarks
> on this thread.

I agree that it's good to tackle 0001 first.

I don't understand the benchmark point though. Your benchmark seems to suggest
that 0002 improves aset performance by *more* than 0001: for 8 byte aset
allocs:

        time
master: 8.86
0001:   8.12
0002:   7.02

So 0001 reduces time by 0.92x and 0002 by 0.86x.


John's test shows basically no change for 0002 - which is unsurprising:
0002 only touches aset.c, while the test seems to solely exercise slab, as
only SlabAlloc() shows up in the profile, so 0002 couldn't really have
affected that test.


> In absence of anyone else looking at this, I think it's ready to go.
> If anyone is following along and wants to review or test it, please do
> so soon.

Makes sense!



> @@ -1061,6 +1072,16 @@ MemoryContextAlloc(MemoryContext context, Size size)
>
>      context->isReset = false;
>

For a moment this made me wonder if we could move the isReset handling into
the allocator slow paths as well - it's annoying to write that bit (and thus
dirty the cacheline) over and over. But it'd be somewhat awkward due to
pre-allocated blocks. So that'd be a larger change better done separately.

Greetings,

Andres Freund



On Fri, 23 Feb 2024 at 11:53, Andres Freund <andres@anarazel.de> wrote:
> > @@ -1061,6 +1072,16 @@ MemoryContextAlloc(MemoryContext context, Size size)
> >
> >       context->isReset = false;
> >
>
> For a moment this made me wonder if we could move the isReset handling into
> the allocator slow paths as well - it's annoying to write that bit (and thus
> dirty the cacheline) over and ove. But it'd be somewhat awkward due to
> pre-allocated blocks. So that'd be a larger change better done separately.

It makes sense to do this, but on looking closer at aset.c, it seems
like the only time we can avoid un-setting the isReset flag is when
allocating from the freelist.  We must unset it for large allocations,
for allocations that don't fit onto the existing block (the existing
block could be the keeper block), and for allocations that require a
new block.

With the current arrangement of code in generation.c, I didn't see any
path where we could skip doing context->isReset = false.

For slab.c, it's very easy and we can skip setting the isReset in most cases.

I've attached the patches I benchmarked against 449e798c7 and also the
patch I used to add a function to exercise palloc.

The query I ran was:

select chksz,mtype,pg_allocate_memory_test_reset(chksz, 1024*1024,
1024*1024*1024, mtype)
from (values(8),(16),(32),(64)) sizes(chksz),
(values('aset'),('generation'),('slab')) cxt(mtype)
order by mtype,chksz;

David

Attachment
On Fri, 23 Feb 2024 at 11:53, Andres Freund <andres@anarazel.de> wrote:
>
> On 2024-02-23 00:46:26 +1300, David Rowley wrote:
> > In absence of anyone else looking at this, I think it's ready to go.
> > If anyone is following along and wants to review or test it, please do
> > so soon.
>
> Makes sense!

I pushed the 0001 and 0002 patches today.

I switched over to working on doing what you did in 0002 for
generation.c and slab.c.

See the attached patch which runs the same test as in [1] (aset.c is
just there for comparisons between slab and generation)

The attached includes some additional tuning to generation.c:

1) Changed GenerationFree() to not free() the current block when it
becomes empty. The code now just marks it as empty and reuses it.
This saves a free()/malloc() cycle. It also means we can get rid of a
NULL check in GenerationAlloc().

2) Removed code in GenerationAlloc() which I felt was trying too hard
to fill the keeper, free and current block.  The changes I made here
do mean that once the keeper block becomes empty, it won't be used
again until the context is reset and gets a new allocation.  I don't
see this as a big issue as the keeper block is small anyway.

generation.c is now ~30% faster for the 8-byte test.

David

[1] https://postgr.es/m/CAApHDvqss7-a9c51nj+f9xyAr15wjLB6teHsxPe-NwLCNqiJbg@mail.gmail.com

Attachment
On Thu, 29 Feb 2024 at 00:29, David Rowley <dgrowleyml@gmail.com> wrote:
> I switched over to working on doing what you did in 0002 for
> generation.c and slab.c.
>
> See the attached patch which runs the same test as in [1] (aset.c is
> just there for comparisons between slab and generation)
>
> The attached includes some additional tuning to generation.c:

I've now pushed this.

David

> [1] https://postgr.es/m/CAApHDvqss7-a9c51nj+f9xyAr15wjLB6teHsxPe-NwLCNqiJbg@mail.gmail.com



Hi,

On 2024-03-04 17:43:50 +1300, David Rowley wrote:
> On Thu, 29 Feb 2024 at 00:29, David Rowley <dgrowleyml@gmail.com> wrote:
> > I switched over to working on doing what you did in 0002 for
> > generation.c and slab.c.
> >
> > See the attached patch which runs the same test as in [1] (aset.c is
> > just there for comparisons between slab and generation)
> >
> > The attached includes some additional tuning to generation.c:
> 
> I've now pushed this.

Thanks for working on all these, much appreciated!

Greetings,

Andres Freund