Re: Adding basic NUMA awareness - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: Adding basic NUMA awareness
Msg-id: 71a46484-053c-4b81-ba32-ddac050a8b5d@vondra.me
In response to: Re: Adding basic NUMA awareness (Tomas Vondra <tomas@vondra.me>)
Responses: Re: Adding basic NUMA awareness
List: pgsql-hackers
Hi,

Here's a somewhat cleaned up v3 of this patch series, with various
improvements and a lot of cleanup. Still WIP, but I hope it resolves the
various crashes reported for v2. It still requires --with-libnuma (it
won't build without it).

I'm aware there's an ongoing discussion about removing the freelists and
changing the clocksweep in some way. If that happens, the relevant parts
of this series will need some adjustment, of course. I haven't looked
into that yet, I plan to review those patches soon.


main changes in v3
------------------

1) I've introduced a "registry" of the buffer partitions (imagine a small
array of structs), serving as a source of truth for places that need info
about the partitions (range of buffers, ...).

With v2 there was no "shared definition" - the shared buffers, freelist
and clocksweep each did their own thing. But per the discussion it
doesn't really make much sense to partition buffers in different ways. So
in v3 the 0001 patch defines the partitions, records them in shared
memory (in a small array), and the later parts just reuse this.

I also added pg_buffercache_partitions(), listing the partitions with
first/last buffer, etc. The freelist/clocksweep patches add additional
information.

2) The PGPROC part introduces a similar registry, even though there are
no other patches building on this. But it seemed useful to have a clear
place recording this info. There's also a view pg_buffercache_pgproc. The
pg_buffercache location is a bit bogus - it has nothing to do with
buffers, but it was good enough for now.

3) The PGPROC partitioning is reworked and should fix the crash with the
GUC set to "off".

4) This still doesn't do anything about "balancing" the clocksweep. I
have some ideas how to do that, I'll work on that next.


simple benchmark
----------------

I did a simple benchmark, measuring pgbench throughput with a scale still
fitting into RAM, but much larger (~2x) than shared buffers. See the
attached test script, testing builds with more and more of the patches
applied.

I'm attaching results from two different machines (the "usual" 2P xeon
and also a much larger cloud instance with EPYC/Genoa) - both the raw CSV
files, with average tps and percentiles, and PDFs. The PDFs also have a
comparison either to the "preceding" build (right side), or to master
(below the table).

There are results for the three "pgbench pinning" strategies, and that
can have a pretty significant impact (colocated generally performs much
better than either "none" or "random").

For the "bigger" machine (with 176 cores) the incremental results look
like this (for pinning=none, i.e. regular pgbench):

  mode      s_b   buffers  localalloc  no-tail  freelist  sweep  pgproc  pinning
  ==============================================================================
  prepared  16GB    99%       101%       100%     103%     111%    99%    102%
            32GB    98%       102%        99%     103%     107%   101%    112%
             8GB    97%       102%       100%     102%     101%   101%    106%
  ------------------------------------------------------------------------------
  simple    16GB   100%       100%        99%     105%     108%    99%    108%
            32GB    98%       101%       100%     103%     100%   101%     97%
             8GB   100%       100%       101%      99%     100%   104%    104%

The way I read this is that the first three patches have about no impact
on throughput. Then freelist partitioning and (especially) clocksweep
partitioning can help quite a bit. pgproc is again close to ~0%, and
PGPROC pinning can help again (but this part is merely experimental).

For the xeon the differences (in either direction) are much smaller, so
I'm not going to post them here. They're in the PDF, though. I think this
looks reasonable.
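As an aside, to make the partition "registry" from (1) a bit more
concrete - conceptually it's just a small array in shared memory with
entries along these lines (simplified sketch, not the exact structs from
the patch):

    /*
     * One entry per buffer partition, kept in shared memory. Anything
     * that needs to know which buffers belong to which partition
     * (freelist, clocksweep, monitoring views) looks it up here.
     */
    typedef struct BufferPartitionEntry
    {
        int     node;           /* NUMA node the partition is mapped to */
        int     first_buffer;   /* first buffer in the partition */
        int     num_buffers;    /* number of buffers in the partition */
    } BufferPartitionEntry;

    typedef struct BufferPartitionRegistry
    {
        int                     npartitions;
        BufferPartitionEntry    partitions[FLEXIBLE_ARRAY_MEMBER];
    } BufferPartitionRegistry;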
The way I see this patch series is not about improving peak throughput,
but more about reducing imbalance and making the behavior more
consistent. The results are more a confirmation there's not some sort of
massive overhead somewhere. But I'll get to this in a minute.

To quantify this kind of improvement, I think we'll need tests that
intentionally cause (or try to cause) imbalance. If you have ideas for
such tests, let me know.


overhead of partitioning calculation
------------------------------------

Regarding the "overhead", while the results look mostly OK, I think we'll
need to rethink the partitioning scheme - particularly how the partition
size is calculated. The current scheme has to use %, which can be
somewhat expensive.

The 0001 patch calculates a "chunk size", which is the smallest number of
buffers it can "assign" to a NUMA node. This depends on how many buffer
descriptors fit onto a single memory page, and it's either 512KB (with
4KB pages) or 256MB (with 2MB huge pages). Each NUMA node then gets
multiple chunks, to cover shared_buffers/num_nodes. But this can be an
arbitrary number - it minimizes the imbalance, but it also forces the use
of % and / in the formulas.

AFAIK if we required the partitions to be 2^k multiples of the chunk
size, we could switch to using shifts and masking, which is supposed to
be much faster. But I haven't measured this, and the cost is that some of
the nodes could get much less memory. Maybe that's fine.


reserving number of huge pages
------------------------------

The other thing I realized is that partitioning buffers with huge pages
is quite tricky, and can easily lead to SIGBUS when accessing the memory
later. The crashes I saw happen like this:

1) figure out the number of huge pages needed (using
   shared_memory_size_in_huge_pages)

   This can be 16828 for shared_buffers=32GB.

2) make sure there's enough huge pages

   echo 16828 > /proc/sys/vm/nr_hugepages

3) start postgres - everything seems to work just fine

4) query pg_buffercache_numa - triggers SIGBUS accessing memory for a
   valid buffer (usually ~2GB from the end)

It took me ages to realize what's happening, but it's very simple. The
nr_hugepages is a global limit, but it's also translated into limits for
each NUMA node. So when you write 16828 to it, in a 4-node system each
node gets 1/4 of that. See

   $ numastat -cm

Then we do the mmap(), and everything looks great, because there really
are enough huge pages and the system can allocate memory from any NUMA
node it needs.

And then we come around and do the numa_tonode_memory(). And that's where
the issues start, because AFAIK this does not check the per-node limit of
huge pages in any way. It just appears to work. And then later, when we
finally touch the buffer, it tries to actually allocate the memory on the
node, realizes there aren't enough huge pages, and triggers the SIGBUS.

You may ask why the per-node limit is too low. We still need just
shared_memory_size_in_huge_pages, right? If we were partitioning the
whole memory segment, that'd be true. But we only do that for shared
buffers, and there's a lot of other shared memory - could be 1-2GB or so,
depending on the configuration. And this gets placed on one of the nodes,
and it counts against the limit on that particular node. And so that node
doesn't have enough huge pages left to back its partition of shared
buffers.

The only way around this I found is by inflating the number of huge
pages, significantly above the shared_memory_size_in_huge_pages value,
just to make sure the nodes get enough huge pages.
I don't know what to do about this. It's quite annoying. If we only used
huge pages for the partitioned parts, this wouldn't be a problem.

I also realize this can be used to make sure the memory is balanced on
NUMA systems. Because if you set nr_hugepages, the kernel will ensure the
shared memory is distributed on all the nodes. It won't have the benefits
of "coordinating" the buffers and buffer descriptors, and so on. But it
will be balanced.


regards

-- 
Tomas Vondra
Attachment
- v3-0007-NUMA-pin-backends-to-NUMA-nodes.patch
- v3-0006-NUMA-interleave-PGPROC-entries.patch
- v3-0005-NUMA-clockweep-partitioning.patch
- v3-0004-NUMA-partition-buffer-freelist.patch
- v3-0003-freelist-Don-t-track-tail-of-a-freelist.patch
- v3-0002-NUMA-localalloc.patch
- v3-0001-NUMA-interleaving-buffers.patch
- numa-hb176.csv
- run-huge-pages.sh
- numa-xeon.csv
- numa-xeon-e5-2699.pdf
- numa-hb176.pdf