Thread: NUMA shared memory interleaving
Thanks to having pg_numa.c, we can now simply address problem #2 of
NUMA imbalance from [1] (pages 11-14) by interleaving shm memory in
PG19 - patch attached. We do not need to call numa_set_localalloc(),
as we only interleave shm segments; local allocations stay the same
(well, "local" means relative to the CPU asking for private memory).

Below is a result from a legacy 4s32t64 Sandy Bridge EP box with low
NUMA (QPI) interconnect bandwidth, to better illustrate the problem
(it's a bit of an edge case, but someone may hit it):

Testcase: small s_b (here it was 4GB*) that fully fits the NUMA
hugepage zone, as this was tested with hugepages=on.

$ cat seqconcurrscans.pgb
\set num (:client_id % 8) + 1
select sum(octet_length(filler)) from pgbench_accounts_:num;

/usr/local/pgsql/bin/pg_ctl -D /db/data -l logfile restart
/usr/local/pgsql/bin/psql -c "select pg_prewarm('pgbench_accounts_'||s) from generate_series(1, 8) s;" # load all using current policy
/usr/local/pgsql/bin/psql -c "select * from pg_shmem_allocations_numa where name = 'Buffer Blocks';"
/usr/local/pgsql/bin/pgbench -c 64 -j 8 -P 1 -T 60 -f seqconcurrscans.pgb

On master with numa=off (the default), and in previous versions:

     name      | numa_node |    size
---------------+-----------+------------
 Buffer Blocks |         0 |          0
 Buffer Blocks |         1 |          0
 Buffer Blocks |         2 | 4297064448
 Buffer Blocks |         3 |          0

latency average = 1826.324 ms
latency stddev = 665.567 ms
tps = 34.708151 (without initial connection time)

On master with numa=on:

     name      | numa_node |    size
---------------+-----------+------------
 Buffer Blocks |         0 | 1073741824
 Buffer Blocks |         1 | 1073741824
 Buffer Blocks |         2 | 1075838976
 Buffer Blocks |         3 | 1073741824

latency average = 1002.288 ms
latency stddev = 214.392 ms
tps = 63.344814 (without initial connection time)

Normal pgbench workloads tend not to be affected, as each backend
tends to touch just a small partition of shm (thanks to BAS
strategies). Some remaining questions are:

1. How to name this GUC (numa or numa_shm_interleave)? I prefer the
first option, as we could potentially add more optimizations behind
that GUC in the future.
2. Should we also interleave DSA/DSM for Parallel Query? (I'm not an
expert on DSA/DSM at all.)
3. Should we fail to start if numa=on is set on an unsupported
platform?

* Interesting tidbit for getting a reliable measurement: one needs to
double-check that s_b (the hugepage allocation) is smaller than the
per-NUMA-zone free hugepages, i.e. that the static hugepage allocation
for s_b fits within a single zone. This shouldn't be a problem on 2
sockets (most of the time s_b is < 50% of RAM there anyway; usually
26-30% once max_connections adds its overhead, so higher than 25%,
while people usually sysctl nr_hugepages=25% of RAM), but with >= 4
NUMA nodes (4 sockets or some modern MCMs) the kernel might start
spilling the s_b (> 25%) to another NUMA node on its own, so it's best
to verify it using pg_shmem_allocations_numa...

-J.

[1] - https://anarazel.de/talks/2024-10-23-pgconf-eu-numa-vs-postgresql/numa-vs-postgresql.pdf
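
For readers who want to picture what the interleaving amounts to, here
is a minimal sketch using libnuma (illustrative only, not the attached
patch; the function name is made up):

/*
 * Minimal sketch (not the attached patch): apply an interleave policy to
 * an already-mapped shared memory segment using libnuma.  Only the shm
 * mapping gets the policy; per-backend private allocations keep the
 * kernel's default local policy, which is why numa_set_localalloc() is
 * not needed.  Build with -lnuma.
 */
#include <stddef.h>
#include <numa.h>

static void
interleave_shmem_sketch(void *shmem_base, size_t shmem_size)
{
    if (numa_available() == -1)
        return;                 /* no kernel/libnuma NUMA support */

    /* Interleave the shm segment across all nodes allowed by the process. */
    numa_interleave_memory(shmem_base, shmem_size, numa_all_nodes_ptr);
}

In the actual patch such a call would presumably happen right after the
segment is mapped and before first touch, since an interleave policy
only affects pages faulted in after it is set.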
On Wed, Apr 16, 2025 at 9:14 PM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
> 2. Should we also interleave DSA/DSM for Parallel Query? (I'm not an
> expert on DSA/DSM at all)

I have no answers but I have speculated for years about a very
specific case (without any idea where to begin due to lack of ... I
guess all this sort of stuff): in ExecParallelHashJoinNewBatch(),
workers split up and try to work on different batches on their own to
minimise contention, and when that's not possible (more workers than
batches, or finishing their existing work at different times and going
to help others), they just proceed in round-robin order.

A beginner thought is: if you're going to help someone working on a
hash table, it would surely be best to have the CPUs and all the data
on the same NUMA node. During loading, cache line ping pong would be
cheaper, and during probing, it *might* be easier to tune explicit
memory prefetch timing that way, as it would look more like a single
node system with a fixed latency, IDK (I've shared patches for
prefetching before that showed pretty decent speedups, and the lack of
that feature is probably a bigger problem than any of this stuff, who
knows...).

Another beginner thought is that the DSA allocator is a source of
contention during loading: the dumbest problem is that the chunks are
just too small, but it might also be interesting to look into per-node
pools. Or something. IDK, just some thoughts...
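
To make that concrete, here is a very rough, purely illustrative sketch
of "prefer a batch on my own node" using libnuma and get_mempolicy();
batch_mem, nbatches and the helper names are hypothetical, and this is
not executor code:

#define _GNU_SOURCE             /* for sched_getcpu() */
#include <sched.h>
#include <stddef.h>
#include <numa.h>
#include <numaif.h>

/* Which NUMA node backs this (already-faulted-in) address? */
static int
node_of_address(void *addr)
{
    int     node = -1;

    if (get_mempolicy(&node, NULL, 0, addr, MPOL_F_NODE | MPOL_F_ADDR) != 0)
        return -1;
    return node;
}

/*
 * Pick a batch to help with, preferring one whose (hypothetical) hash
 * table memory lives on this worker's NUMA node; otherwise fall back
 * to the usual round-robin choice.
 */
static int
pick_batch_sketch(void **batch_mem, int nbatches, int start)
{
    int     my_node = numa_node_of_cpu(sched_getcpu());

    for (int i = 0; i < nbatches; i++)
    {
        int     b = (start + i) % nbatches;

        if (node_of_address(batch_mem[b]) == my_node)
            return b;
    }
    return start % nbatches;
}

Whether the extra get_mempolicy() calls would pay for themselves is, of
course, exactly the open question above.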
On Thu, Apr 17, 2025 at 1:58 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> I have no answers but I have speculated for years about a very
> specific case (without any idea where to begin due to lack of ... I
> guess all this sort of stuff): in ExecParallelHashJoinNewBatch(),
> workers split up and try to work on different batches on their own to
> minimise contention, [...]

And BTW there are papers about that (but they mostly just remind me
that I have to reboot the prefetching patch long before that...), for
example:

https://15721.courses.cs.cmu.edu/spring2023/papers/11-hashjoins/lang-imdm2013.pdf
On Wed, Apr 16, 2025 at 5:14 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
> Normal pgbench workloads tend to be not affected, as each backend
> tends to touch just a small partition of shm (thanks to BAS
> strategies). Some remaining questions are:
> 1. How to name this GUC (numa or numa_shm_interleave) ? I prefer the
> first option, as we could potentially in future add more optimizations
> behind that GUC.

I wonder whether the GUC needs to support interleaving between a
designated set of nodes rather than only being able to do all nodes.
For example, suppose someone is pinning the processes to a certain set
of NUMA nodes; perhaps then they wouldn't want to use memory from
other nodes.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
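
For what it's worth, libnuma already has the pieces for a designated
node set; a sketch (assuming a hypothetical GUC value like "0,1", not
something in the posted patch):

#include <stddef.h>
#include <numa.h>

/*
 * Sketch: interleave a shm segment over a designated node set rather
 * than all nodes.  "node_list" stands in for a hypothetical GUC value
 * such as "0,1" or "0-2".
 */
static void
interleave_on_nodes_sketch(void *shmem_base, size_t shmem_size,
                           const char *node_list)
{
    struct bitmask *nodes;

    if (numa_available() == -1)
        return;

    nodes = numa_parse_nodestring(node_list);
    if (nodes == NULL)
        return;                 /* invalid node list */

    numa_interleave_memory(shmem_base, shmem_size, nodes);
    numa_bitmask_free(nodes);
}

numa_parse_nodestring() understands the same node-list syntax as
numactl, so a GUC accepting "all" or an explicit list could probably
map onto it fairly directly.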
Hi,

On Wed, Apr 16, 2025 at 10:05:04AM -0400, Robert Haas wrote:
> On Wed, Apr 16, 2025 at 5:14 AM Jakub Wartak
> <jakub.wartak@enterprisedb.com> wrote:
> > Normal pgbench workloads tend to be not affected, as each backend
> > tends to touch just a small partition of shm (thanks to BAS
> > strategies). Some remaining questions are:
> > 1. How to name this GUC (numa or numa_shm_interleave) ? I prefer the
> > first option, as we could potentially in future add more optimizations
> > behind that GUC.
>
> I wonder whether the GUC needs to support interleaving between a
> designated set of nodes rather than only being able to do all nodes.
> For example, suppose someone is pinning the processes to a certain set
> of NUMA nodes; perhaps then they wouldn't want to use memory from
> other nodes.

+1. That could be used for consolidating instances on the same host.
One could ensure that NUMA nodes are not shared across instances (CPU
and memory resource isolation per instance). Bonus point: adding
Direct I/O into the game would also ensure that the OS page cache is
not shared.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,

On Thu, Apr 17, 2025 at 01:58:44AM +1200, Thomas Munro wrote:
> On Wed, Apr 16, 2025 at 9:14 PM Jakub Wartak
> <jakub.wartak@enterprisedb.com> wrote:
> > 2. Should we also interleave DSA/DSM for Parallel Query? (I'm not an
> > expert on DSA/DSM at all)
>
> I have no answers but I have speculated for years about a very
> specific case (without any idea where to begin due to lack of ... I
> guess all this sort of stuff): in ExecParallelHashJoinNewBatch(),
> workers split up and try to work on different batches on their own to
> minimise contention, [...]

I'm also thinking that could be beneficial for parallel workers. I
think the ideal scenario would be to have the parallel workers spread
across NUMA nodes, accessing their "local" memory first (and helping
with "remote" memory access if there is still more work to do
"remotely").

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com