Re: slab allocator performance issues - Mailing list pgsql-hackers

From: John Naylor
Subject: Re: slab allocator performance issues
Msg-id: CAFBsxsEby=vzxX31Rc5-XjkgXFs2UygY7OAHr-Az600NcgSR9A@mail.gmail.com
In response to: Re: slab allocator performance issues (David Rowley <dgrowleyml@gmail.com>)
Responses: Re: slab allocator performance issues
List: pgsql-hackers
On Tue, Dec 13, 2022 at 7:50 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> Thanks for testing the patch.
>
> On Mon, 12 Dec 2022 at 20:14, John Naylor <john.naylor@enterprisedb.com> wrote:
> > While allocation is markedly improved, freeing looks worse here. The proportion is surprising because only about 2% of nodes are freed during the load, but doing that takes up 10-40% of the time compared to allocating.
>
> I've tried to reproduce this with the v13 patches applied and I'm not
> really getting the same as you are. To run the function 100 times I
> used:
>
> select x, a.* from generate_series(1,100) x(x), lateral (select * from
> bench_load_random_int(500 * 1000 * (1+x-x))) a;
Simply running for a longer period like this brings the SlabFree difference much closer to your results, so it no longer seems out of line. Here SlabAlloc takes maybe 2/3 of the time of the current slab code, with a 5% reduction in total time:
500k ints:
v13-0001-0005
average of 30: 217ms
47.61% postgres postgres [.] rt_set
20.99% postgres postgres [.] SlabAlloc
10.00% postgres postgres [.] rt_node_insert_inner.isra.0
6.87% postgres [unknown] [k] 0xffffffffbce011b7
3.53% postgres postgres [.] MemoryContextAlloc
2.82% postgres postgres [.] SlabFree
+slab v4
average of 30: 206ms
51.13% postgres postgres [.] rt_set
14.08% postgres postgres [.] SlabAlloc
11.41% postgres postgres [.] rt_node_insert_inner.isra.0
7.44% postgres [unknown] [k] 0xffffffffbce011b7
3.89% postgres postgres [.] MemoryContextAlloc
3.39% postgres postgres [.] SlabFree
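For what it's worth, the "2/3" and "5%" figures can be derived directly from the profile shares and average runtimes above (a quick back-of-the-envelope check, not part of the benchmark itself):

```python
# Approximate absolute time in SlabAlloc = profile share * average runtime,
# using the numbers from the two profiles above.
v13_total_ms = 217
v4_total_ms = 206

v13_slaballoc_ms = 0.2099 * v13_total_ms  # ~45.5 ms
v4_slaballoc_ms = 0.1408 * v4_total_ms    # ~29.0 ms

print(round(v4_slaballoc_ms / v13_slaballoc_ms, 2))   # ~0.64, i.e. about 2/3
print(round(100 * (1 - v4_total_ms / v13_total_ms)))  # ~5% total reduction
```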
It doesn't look mysterious anymore, but I went ahead and took some more perf measurements, including for cache misses. My naive impression is that we're spending a bit more time waiting for data, but having to do less work with it once we get it, which is consistent with your earlier comments:
perf stat -p $pid sleep 2
v13:
2,001.55 msec task-clock:u # 1.000 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
311,690 page-faults:u # 155.724 K/sec
3,128,740,701 cycles:u # 1.563 GHz
4,739,333,861 instructions:u # 1.51 insn per cycle
820,014,588 branches:u # 409.690 M/sec
7,385,923 branch-misses:u # 0.90% of all branches
+slab v4:
2,001.09 msec task-clock:u # 1.000 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
326,017 page-faults:u # 162.920 K/sec
3,016,668,818 cycles:u # 1.508 GHz
4,324,863,908 instructions:u # 1.43 insn per cycle
761,839,927 branches:u # 380.712 M/sec
7,718,366 branch-misses:u # 1.01% of all branches
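(The "insn per cycle" figures perf prints are just instructions divided by cycles; the lower IPC in the +slab v4 run is what I mean by "more time waiting for data":)

```python
# IPC = retired instructions / CPU cycles, from the counters above.
v13_ipc = 4_739_333_861 / 3_128_740_701
v4_ipc = 4_324_863_908 / 3_016_668_818

print(round(v13_ipc, 2))  # 1.51
print(round(v4_ipc, 2))   # 1.43
```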
perf stat -e LLC-loads,LLC-load-misses -p $pid sleep 2
min/max of 3 runs:
v13: LL cache misses: 25.08% - 25.41%
+slab v4: LL cache misses: 25.74% - 26.01%
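(For reference, those percentages are LLC-load-misses as a share of LLC-loads; the raw counter values below are made up for illustration, since only the derived percentages are quoted above:)

```python
# Hypothetical counter values -- perf prints the actual counts,
# then derives the miss percentage the same way.
llc_loads = 1_000_000
llc_load_misses = 251_000

miss_pct = 100 * llc_load_misses / llc_loads
print(f"{miss_pct:.2f}%")  # 25.10%
```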
--
John Naylor
EDB: http://www.enterprisedb.com