Re: slab allocator performance issues - Mailing list pgsql-hackers

From David Rowley
Subject Re: slab allocator performance issues
Msg-id CAApHDvrnpKZrhJaz6TF0LM0Of85=eKAuE3x8STxHZ-fJBi1XMQ@mail.gmail.com
In response to Re: slab allocator performance issues (John Naylor <john.naylor@enterprisedb.com>)

Thanks for testing the patch.

On Mon, 12 Dec 2022 at 20:14, John Naylor <john.naylor@enterprisedb.com> wrote:
> v13-0001 to 0005:

>  2.60%  postgres  postgres             [.] SlabFree

> + v4 slab:

>    4.98%  postgres  postgres              [.] SlabFree
>
> While allocation is markedly improved, freeing looks worse here. The proportion is surprising because only about 2%
> of nodes are freed during the load, but doing that takes up 10-40% of the time compared to allocating.

I've tried to reproduce this with the v13 patches applied, but I'm not
really seeing the same results as you are. To run the function 100
times I used:

select x, a.* from generate_series(1,100) x(x),
  lateral (select * from bench_load_random_int(500 * 1000 * (1+x-x))) a;

(I had to add the * (1+x-x) to create a lateral dependency, so that
the function isn't just executed once.)

v13-0001 - 0005 gives me:

  37.71%  postgres             [.] rt_set
  19.24%  postgres             [.] SlabAlloc
   8.73%  [kernel]             [k] clear_page_rep
   5.21%  postgres             [.] rt_node_insert_inner.isra.0
   2.63%  [kernel]             [k] asm_exc_page_fault
   2.24%  postgres             [.] SlabFree

and fairly consistently 122 ms runtime per call.

Applying the v4 slab patch, I get:

  41.06%  postgres             [.] rt_set
  10.84%  postgres             [.] SlabAlloc
   9.01%  [kernel]             [k] clear_page_rep
   6.49%  postgres             [.] rt_node_insert_inner.isra.0
   2.76%  postgres             [.] SlabFree

and fairly consistently 112 ms per call.

I wonder if you consistently get the same result with another
compiler, or with the patches applied on top of something like
master~50 or master~100. Maybe it's just a code alignment thing.
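
(One way to test the alignment theory, if anyone feels like it, might
be to rebuild both versions with something like gcc's
-falign-functions=32 and -falign-loops=32 and see whether the
difference moves around. Just a thought; I haven't tried it.)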

Looking at the perf report annotation for SlabFree in the patched
version, I see:

      │
      │     /* push this chunk onto the head of the free list */
      │     *(MemoryChunk **) pointer = block->freehead;
 0.09 │       mov     0x10(%r8),%rax
      │     slab = block->slab;
59.15 │       mov     (%r8),%rbp
      │     *(MemoryChunk **) pointer = block->freehead;
 9.43 │       mov     %rax,(%rdi)
      │     block->freehead = chunk;
      │
      │     block->nfree++;

I think what that's telling me is that dereferencing the block's
memory to fetch the slab pointer is slow, likely because that
particular cache line is no longer cached. I tried running the test
with 10,000 ints instead of 500,000 so that there would be less CPU
cache pressure. I see:

 29.76 │       mov     (%r8),%rbp
       │     *(MemoryChunk **) pointer = block->freehead;
 12.72 │       mov     %rax,(%rdi)
       │     block->freehead = chunk;
       │
       │     block->nfree++;
       │       mov     0x8(%r8),%eax
       │     block->freehead = chunk;
  4.27 │       mov     %rdx,0x10(%r8)
       │     SlabBlocklistIndex():
       │     index = (nfree + (1 << blocklist_shift) - 1) >> blocklist_shift;
       │       mov     $0x1,%edx
       │     SlabFree():
       │     block->nfree++;
       │       lea     0x1(%rax),%edi
       │       mov     %edi,0x8(%r8)
       │     SlabBlocklistIndex():
       │     int32           blocklist_shift = slab->blocklist_shift;
       │       mov     0x70(%rbp),%ecx
       │     index = (nfree + (1 << blocklist_shift) - 1) >> blocklist_shift;
  8.46 │       shl     %cl,%edx

Various other instructions in SlabFree are now taking proportionally
longer. For example, the bit shift at the end was insignificant
previously. That indicates to me that this is due to caching effects:
we must fetch the block in SlabFree() in both versions. It's possible
that something going on in SlabAlloc() is causing more useful cache
lines to be evicted, but (I think) reducing that was one of the
primary design goals Andres was aiming for. For example, not having
to write out the freelist for an entire block when the block is first
allocated means no longer having to load possibly every cache line of
the block.
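
To make that concrete, here is a rough C sketch of the SlabFree() hot
path as it appears in the annotation above. The structure layouts and
function names are simplified and partly invented for illustration;
only the field names and the index formula come from the annotated
source lines, so don't read this as the patch's actual code:

#include <stdint.h>

typedef struct MemoryChunk MemoryChunk;    /* opaque for this sketch */

typedef struct SlabContext
{
    int32_t     blocklist_shift;    /* log2 of the blocklist bucket size */
    /* ... */
} SlabContext;

typedef struct SlabBlock
{
    SlabContext *slab;          /* owning context; the 59% load above */
    int32_t     nfree;          /* free chunks remaining on this block */
    MemoryChunk *freehead;      /* head of single-linked free list */
} SlabBlock;

/* round nfree up to the nearest blocklist bucket, as in the annotation */
static inline int32_t
slab_blocklist_index(SlabContext *slab, int32_t nfree)
{
    int32_t     blocklist_shift = slab->blocklist_shift;

    return (nfree + (1 << blocklist_shift) - 1) >> blocklist_shift;
}

static void
slab_free_sketch(SlabBlock *block, void *pointer, MemoryChunk *chunk)
{
    SlabContext *slab = block->slab;    /* the load that stalls */

    /*
     * Push this chunk onto the head of the block's free list; the link
     * is stored in the freed chunk's memory itself, so no per-block
     * freelist array needs to be written out at block allocation time.
     */
    *(MemoryChunk **) pointer = block->freehead;
    block->freehead = chunk;
    block->nfree++;

    /* the block is then moved between blocklists based on this index */
    (void) slab_blocklist_index(slab, block->nfree);
}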

I tried looking at perf stat during the run.

Without slab changes:

drowley@amd3990x:~$ sudo perf stat --pid=74922 sleep 2

 Performance counter stats for process id '74922':

          2,000.74 msec task-clock                #    1.000 CPUs utilized
                 4      context-switches          #    1.999 /sec
                 0      cpu-migrations            #    0.000 /sec
           578,139      page-faults               #  288.963 K/sec
     8,614,687,392      cycles                    #    4.306 GHz                      (83.21%)
       682,574,688      stalled-cycles-frontend   #    7.92% frontend cycles idle     (83.33%)
     4,822,904,271      stalled-cycles-backend    #   55.98% backend cycles idle      (83.41%)
    11,447,124,105      instructions              #    1.33  insn per cycle
                                                  #    0.42  stalled cycles per insn  (83.41%)
     1,947,647,575      branches                  #  973.464 M/sec                    (83.41%)
        13,914,897      branch-misses             #    0.71% of all branches          (83.24%)

       2.000924020 seconds time elapsed

With slab changes:

drowley@amd3990x:~$ sudo perf stat --pid=75967 sleep 2

 Performance counter stats for process id '75967':

          2,000.89 msec task-clock                #    1.000 CPUs utilized
                 1      context-switches          #    0.500 /sec
                 0      cpu-migrations            #    0.000 /sec
           607,423      page-faults               #  303.576 K/sec
     8,566,091,176      cycles                    #    4.281 GHz                      (83.21%)
       737,839,390      stalled-cycles-frontend   #    8.61% frontend cycles idle     (83.32%)
     4,454,357,725      stalled-cycles-backend    #   52.00% backend cycles idle      (83.41%)
    10,760,559,837      instructions              #    1.26  insn per cycle
                                                  #    0.41  stalled cycles per insn  (83.41%)
     1,872,047,962      branches                  #  935.606 M/sec                    (83.41%)
        14,928,953      branch-misses             #    0.80% of all branches          (83.25%)

       2.000960610 seconds time elapsed
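
Assuming the whole 2-second window was spent inside the benchmark
loop in both cases, the patched run is getting through more calls (at
112 ms vs 122 ms each) while retiring about 6% fewer instructions in
total, i.e. roughly 14% fewer instructions per call, albeit at a
slightly lower IPC (1.26 vs 1.33) and a slightly higher branch miss
rate.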

It would be interesting to see whether your perf stat output shows
something significantly different with and without the slab changes.

It does not seem impossible that, because the slab changes mean
SlabAlloc() has to look at less memory, SlabFree() now has to fetch
cache lines that in the unpatched version would already have been
available.  If that is the case, then I think we shouldn't worry
about it unless we can find some workload that demonstrates an
overall performance regression with the patch. I just don't quite
have enough perf experience to know how I might go about proving
that.
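
Perhaps comparing cache-miss counters directly would be a starting
point, e.g. something like:

sudo perf stat -e cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses --pid=<backend pid> sleep 2

(Those are perf's generic event names; which of them are actually
supported varies by CPU, so treat that as a sketch rather than a
recipe.)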

David


