Re: Adding basic NUMA awareness - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Adding basic NUMA awareness
Date
Msg-id 3223cdcd-6d16-4e90-a3a6-b957f762dc5a@vondra.me
In response to Re: Adding basic NUMA awareness  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Hi,

Here's a v2 of the patch series, with a couple changes:

* I simplified the freelist partitioning by keeping only the "node"
partitioning (so the cpu/pid strategies are gone). Those were meant for
experimenting, but they made the code more complicated, so I ditched
them.


* I changed the freelist partitioning scheme a little bit, based on the
discussion in this thread. Instead of having a single "partition" per
NUMA node, there's now a minimum number of partitions (set to 4). So
even if your system is not NUMA, you'll have 4 of them. If you have 2
nodes, you'll still have 4, and each node will get 2. With 3 nodes we
get 6 partitions (we need 2 per node, and we want to keep the per-node
count equal to keep things simple). Once the number of nodes reaches 4,
the heuristic switches to one partition per node.
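The heuristic described above could be sketched roughly like this (a
minimal illustration in C; the constant and function name are mine, not
from the patch):

```c
#include <assert.h>

/* Illustrative minimum partition count, per the description above. */
#define MIN_PARTITIONS 4

/*
 * Hypothetical sketch: compute the number of freelist partitions for a
 * given number of NUMA nodes. Each node gets the same number of
 * partitions, enough to reach the minimum; with >= 4 nodes this
 * degenerates to one partition per node.
 */
static int
freelist_partition_count(int num_nodes)
{
	/* partitions per node, rounded up so we reach the minimum */
	int		per_node = (MIN_PARTITIONS + num_nodes - 1) / num_nodes;

	return per_node * num_nodes;
}
```

This reproduces the examples from the text: 1 node -> 4, 2 nodes -> 4,
3 nodes -> 6, and 4+ nodes -> one per node.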

I'm aware there's a discussion about maybe simply removing freelists
entirely. If that happens, this becomes mostly irrelevant, of course.

The code should also make sure the freelists "agree" with how the
earlier patch mapped the buffers to NUMA nodes, i.e. a freelist should
only contain buffers from the "correct" NUMA node, etc. I haven't paid
much attention to this yet - I believe it works for "nice" values of
shared buffers (when they divide evenly between nodes), but I'm sure
it's possible to confuse it (which won't cause crashes, just
inefficiency).
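For the "nice" case where shared buffers divide evenly between nodes,
the buffer-to-node mapping the freelists would need to agree with could
look something like this (my own sketch, not code from the patch; the
clamping of the remainder to the last node is an assumption):

```c
#include <assert.h>

/*
 * Hypothetical sketch: map a buffer ID to a NUMA node, assuming buffers
 * are assigned to nodes in contiguous, (mostly) equal-sized ranges.
 */
static int
buffer_to_node(int buf_id, int nbuffers, int num_nodes)
{
	int		per_node = nbuffers / num_nodes;	/* even division assumed */
	int		node = buf_id / per_node;

	/* if nbuffers doesn't divide evenly, the remainder goes to the last node */
	return (node < num_nodes) ? node : (num_nodes - 1);
}
```

A partitioned freelist for node N would then only be allowed to hold
buffers for which this mapping returns N.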


* There's now a patch partitioning the clocksweep, using the same scheme
as the freelists. I came to the conclusion it doesn't make much sense to
partition these things differently - I can't think of a reason why that
would be advantageous, and using a single scheme makes it easier to
reason about.

The clocksweep partitioning is somewhat harder, because it affects
BgBufferSync() and related code. With the partitioning we now have
multiple "clock hands" for different ranges of buffers, and the clock
sweep needs to consider that. I modified BgBufferSync to simply loop
through the ClockSweep partitions, and do a small cleanup for each.
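The "loop through the partitions, do a small cleanup for each" structure
might look roughly like the following (purely illustrative; the names
and the stubbed-out per-partition cleanup are mine, and the real pacing
logic in BgBufferSync() is considerably more involved):

```c
#include <assert.h>

/*
 * Stub standing in for the per-partition cleanup step; returns the
 * number of buffers cleaned. Here it pretends to clean exactly its share.
 */
static int
clean_partition(int partition, int target)
{
	(void) partition;
	return target;
}

/*
 * Hypothetical sketch: split the total cleanup target proportionally
 * across clock-sweep partitions, spreading any remainder over the
 * first few partitions, and run the cleanup for each.
 */
static int
sync_all_partitions(int npartitions, int total_target)
{
	int		cleaned = 0;
	int		i;

	for (i = 0; i < npartitions; i++)
	{
		int		share = total_target / npartitions;

		if (i < total_target % npartitions)
			share++;

		cleaned += clean_partition(i, share);
	}

	return cleaned;
}
```

The shares always sum to the original target, so the overall pacing
budget is preserved regardless of the partition count.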

It does work, as in "it doesn't crash". But this part definitely needs
review to make sure I got the changes to the "pacing" right.


* This new freelist/clocksweep partitioning scheme is however harder to
disable. I now realize the GUC may not quite do the trick, and there
isn't even a GUC for the clocksweep. I need to think about this, but I'm
not even sure how feasible it'd be to have two separate GUCs (because of
how these two pieces are intertwined). For now, if you want to test
without the partitioning, you need to skip the patch.


I did some quick perf testing on my old Xeon machine (2 NUMA nodes), and
the results are encouraging. For a read-only pgbench (data set 2x shared
buffers, but within RAM), I saw an increase from 1.1M tps to 1.3M. Not
crazy, but not bad considering the patch is more about consistency than
raw throughput.

For a read-write pgbench I however saw some strange drops/increases of
throughput. I suspect this might be due to some thinko in the clocksweep
partitioning, but I'll need to take a closer look.


regards

-- 
Tomas Vondra

