Re: Adding basic NUMA awareness - Mailing list pgsql-hackers
| From | Tomas Vondra |
|---|---|
| Subject | Re: Adding basic NUMA awareness |
| Date | |
| Msg-id | 403c813e-7d63-4ee8-a95f-4e5c1e310f4d@vondra.me |
| In response to | Re: Adding basic NUMA awareness (Jakub Wartak <jakub.wartak@enterprisedb.com>) |
| List | pgsql-hackers |
Hi,

I've spent the last couple of days considering what to do about the patch series in this thread. The approach assumes that partitioning shared memory in a NUMA-aware way has enough benefits to justify the extra complexity, and I'm getting somewhat skeptical about that being a good trade-off.

We've been able to demonstrate some significant benefits of the patches. See for example the results from about a month ago [1], showing that in some cases the throughput almost doubles. But I'm skeptical about this for two reasons:

* Most of the benefit comes from patches unrelated to NUMA. The initial patches partition clocksweep in a NUMA-oblivious way. In fact, applying the NUMA patches often *reduces* the throughput. So if we're concerned about contention on the clocksweep hand, we could apply just those first patches. That way we wouldn't have to deal with huge pages.

* Furthermore, I'm not quite sure clocksweep really is a bottleneck in realistic cases. The benchmark used in this thread does many concurrent sequential scans, on data that exceeds shared buffers but fits into RAM. Perhaps that happens, but I doubt it's all that common. I've been unable to demonstrate any benefits on other workloads, even with a lot of buffer misses / reads into shared buffers. As soon as the query starts doing something else, the clocksweep contention becomes a non-issue. Consider for example read-only pgbench with a database much larger than shared buffers (but still within RAM). The cost of the index scans (and other nodes) seems to reduce the pressure on clocksweep.

So I'm skeptical about clocksweep pressure being a serious issue, except for some very narrow benchmarks (like the concurrent seqscan test). And even if this happened for some realistic cases, partitioning the buffers in a NUMA-oblivious way seems to do the trick.

When discussing this stuff off list, it was suggested this might help with the scenario Andres presented in [3], where the throughput improves a lot with multiple databases. I've not observed that in practice, and I don't think these patches really can help with that. That scenario is about buffer lock contention, not clocksweep contention. With a single database there's nothing to do - there's one contended page, located on a single node, and there'll be contention no matter which node it ends up on. With multiple databases (or multiple root pages), it either happens to work by chance (if the buffers happen to be from different nodes), or it would require figuring out which buffers are busy and placing them on different nodes. The patches did not even try to do anything like that, so it was still a matter of chance.

That does not mean we don't need to worry about NUMA. There's still the issue of imbalanced allocation - with memory coming from just one node, etc. - which is a problem because it "overloads" the memory subsystem on that particular node. But that can be solved simply by interleaving the shared memory segment. That sidesteps most of the complexity - it does not need to figure out how to partition buffers, does not need to worry about huge pages, etc.

This is not a new idea - it's more or less what Jakub Wartak initially proposed in [2], before I hijacked the thread into the more complex approach. Attached is a tiny patch doing mostly what Jakub did, except that it does two things. First, it allows interleaving the shared memory on all relevant NUMA nodes (per numa_get_mems_allowed). Second, it allows populating all the memory by setting MAP_POPULATE in mmap(). There's a new GUC to enable each of these.
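To illustrate the idea, here's a minimal standalone sketch (not the attached patch, just an illustration under assumed names and sizes) of interleaving an anonymous shared mapping over the nodes from numa_get_mems_allowed() and then prefaulting it. It assumes libnuma (build with -lnuma); the sketch prefaults with a plain memset rather than MAP_POPULATE, because the interleave policy only affects pages that have not been faulted in yet:

```c
/*
 * Sketch only: interleave an anonymous shared mapping across all
 * NUMA nodes the process may allocate from, then prefault it.
 * Build: gcc numa_interleave_demo.c -lnuma
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <numa.h>

int
main(void)
{
	size_t		size = (size_t) 1024 * 1024 * 1024;	/* 1GB, for example */
	void	   *seg;
	struct bitmask *allowed;

	if (numa_available() < 0)
	{
		fprintf(stderr, "NUMA not available, nothing to do\n");
		return 1;
	}

	/* anonymous shared mapping, standing in for the shared memory segment */
	seg = mmap(NULL, size, PROT_READ | PROT_WRITE,
			   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (seg == MAP_FAILED)
	{
		perror("mmap");
		return 1;
	}

	/*
	 * Interleave the (not yet faulted-in) range across all nodes the
	 * process is allowed to allocate memory from.
	 */
	allowed = numa_get_mems_allowed();
	numa_interleave_memory(seg, size, allowed);
	numa_free_nodemask(allowed);

	/*
	 * Prefault everything up front, so placement is decided now by the
	 * interleave policy. (numa_interleave_memory only applies to memory
	 * that has not been touched yet, so the policy has to be set before
	 * the pages are faulted in.)
	 */
	memset(seg, 0, size);

	/* ... use the segment ... */
	munmap(seg, size);
	return 0;
}
```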
I think we should try this (much simpler) approach first, or something close to it. Sorry for dragging everyone into a much more complex approach, which now seems to be a dead end.

regards

[1] https://www.postgresql.org/message-id/e4d7e6fc-b5c5-4288-991c-56219db2edd5@vondra.me
[2] https://www.postgresql.org/message-id/CAKZiRmw6i1W1AwXxa-Asrn8wrVcVH3TO715g_MCoowTS9rkGyw@mail.gmail.com
[3] https://youtu.be/V75KpACdl6E?t=2120

--
Tomas Vondra