Re: Adding basic NUMA awareness - Preliminary feedback and outline for an extensible approach - Mailing list pgsql-hackers
From | Cédric Villemain |
---|---|
Subject | Re: Adding basic NUMA awareness - Preliminary feedback and outline for an extensible approach |
Date | |
Msg-id | a7ce37b4-da5e-4565-a9c3-4f92d7098128@data-bene.io |
In response to | Re: Adding basic NUMA awareness - Preliminary feedback and outline for an extensible approach (Tomas Vondra <tomas@vondra.me>) |
Responses | Re: Adding basic NUMA awareness - Preliminary feedback and outline for an extensible approach |
List | pgsql-hackers |
> On 7/8/25 18:06, Cédric Villemain wrote:
>>> On 7/8/25 03:55, Cédric Villemain wrote:
>>>> Hi Andres,
>>>>
>>>>> Hi,
>>>>>
>>>>> On 2025-07-05 07:09:00 +0000, Cédric Villemain wrote:
>>>>>> In my work on more careful PostgreSQL resource management, I've come
>>>>>> to the conclusion that we should avoid pushing policy too deeply into
>>>>>> the PostgreSQL core itself. Therefore, I'm quite skeptical about
>>>>>> integrating NUMA-specific management directly into core PostgreSQL in
>>>>>> such a way.
>>>>>
>>>>> I think it's actually the opposite - whenever we pushed stuff like this
>>>>> outside of core it has hurt postgres substantially. Not having
>>>>> replication in core was a huge mistake. Not having HA management in core
>>>>> is probably the biggest current adoption hurdle for postgres.
>>>>>
>>>>> To deal better with NUMA we need to improve memory placement and various
>>>>> algorithms, in an interrelated way - that's pretty much impossible to do
>>>>> outside of core.
>>>>
>>>> Except the backend pinning which is easy to achieve, thus my comment on
>>>> the related patch.
>>>> I'm not claiming NUMA memory and all should be managed outside of core
>>>> (though I didn't read other patches yet).
>>>>
>>>
>>> But an "optimal backend placement" seems to very much depend on where we
>>> placed the various pieces of shared memory. Which the external module
>>> will have trouble following, I suspect.
>>>
>>> I still don't have any idea what exactly would the external module do,
>>> how would it decide where to place the backend. Can you describe some
>>> use case with an example?
>>>
>>> Assuming we want to actually pin tasks from within Postgres, what I
>>> think might work is allowing modules to "advise" on where to place the
>>> task. But the decision would still be done by core.
>>
>> Possibly exactly what you're doing in proc.c when managing allocation of
>> process, but not hardcoded in postgresql (patches 02, 05 and 06 are good
>> candidates), I didn't get that they require information not available to
>> any process executing code from a module.
>>
>
> Well, it needs to understand how some other stuff (especially PGPROC
> entries) is distributed between nodes. I'm not sure how much of this
> internal information we want to expose outside core ...
>
>> Parts of your code where you assign/define policy could be in one or
>> more relevant routines of a "numa profile manager", like in an
>> initProcessRoutine(), and registered in pmroutine struct:
>>
>>     pmroutine = GetPmRoutineForInitProcess();
>>     if (pmroutine != NULL &&
>>         pmroutine->init_process != NULL)
>>         pmroutine->init_process(MyProc);
>>
>> This way it's easier to manage alternative policies, and also to be able
>> to adjust when hardware and linux kernel changes.
>>
>
> I'm not against making this extensible, in some way. But I still
> struggle to imagine a reasonable alternative policy, where the external
> module gets the same information and ends up with a different decision.
>
> So what would the alternate policy look like? What use case would the
> module be supporting?

That's the whole point: there are very distinct usages of PostgreSQL in the field. And maybe not all of them will require the policy defined by PostgreSQL core.

May I ask the reverse: what prevents external modules from taking those decisions? There are already a lot of areas where external code can take over PostgreSQL processing, as Neon is doing.

There is some very early processing for memory setup that I can see as a current blocker, and here I'd prefer a more compliant NUMA API as proposed by Jakub, so that it's possible to arrange things based on workload, hardware configuration or other matters. Reworking to get distinct segments and all, as you do, is great, and a combination of both approaches is probably of great interest.
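To make the registration idea from the quoted snippet concrete, here is a minimal, standalone sketch in C of a routine table that a module could override, with core falling back to its own policy when nothing is registered. It is only an illustration under stated assumptions: `GetPmRoutineForInitProcess()` and `init_process` come from the proposal above and are not an existing PostgreSQL API, while `DemoProc`, `RegisterPmRoutine()`, `InitProcessPlacement()`, `DEMO_NUM_NODES` and both policies are hypothetical names invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for PGPROC; holds only what this sketch needs. */
typedef struct DemoProc
{
    int procno;     /* slot number of the process */
    int numa_node;  /* node picked by the policy, -1 means "not pinned" */
} DemoProc;

/* Hypothetical routine table, modeled on the "pmroutine" idea quoted above. */
typedef struct PmRoutine
{
    void (*init_process)(DemoProc *proc);
} PmRoutine;

/* Currently registered routine table; NULL means no module installed one. */
static const PmRoutine *registered_routine = NULL;

/* A module would call this at load time to install its own policy. */
void RegisterPmRoutine(const PmRoutine *routine)
{
    registered_routine = routine;
}

/* Name taken from the proposal; returns NULL when nothing is registered. */
const PmRoutine *GetPmRoutineForInitProcess(void)
{
    return registered_routine;
}

/* Assumed built-in fallback: leave placement to the operating system. */
static void core_default_init_process(DemoProc *proc)
{
    proc->numa_node = -1;
}

/* Assumed node count, for illustration only. */
#define DEMO_NUM_NODES 4

/* Example module policy: spread processes round-robin across nodes. */
static void module_round_robin_init_process(DemoProc *proc)
{
    proc->numa_node = proc->procno % DEMO_NUM_NODES;
}

/* Routine table a module would pass to RegisterPmRoutine(). */
static const PmRoutine module_routine = { module_round_robin_init_process };

/* What core might do when a process starts: ask the module, else fall back. */
void InitProcessPlacement(DemoProc *proc)
{
    const PmRoutine *pmroutine = GetPmRoutineForInitProcess();

    assert(proc != NULL);

    if (pmroutine != NULL && pmroutine->init_process != NULL)
        pmroutine->init_process(proc);
    else
        core_default_init_process(proc);
}
```

With nothing registered, `InitProcessPlacement()` leaves `numa_node` at -1. After `RegisterPmRoutine(&module_routine)`, the same call assigns a node from the module's policy. Whether core should still have the final say, as suggested in the quoted "advise" model, is a separate design choice that this sketch does not settle.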
There is also the weighted interleave that was discussed, and probably much more to come in this area in Linux.

I think some points were already raised about possible distinct policies. I am precisely claiming that it is hard to come up with one good policy given limited setup options, hence the requirement to keep this flexible enough (hooks, API, 100 GUCs?).

There is an EPYC story here also: since the NUMA setup can vary depending on the BIOS setup, the associated NUMA policy probably has to take that into account (L3 can be either a real cache or 4 extra "local" NUMA nodes, with access costs highly distinct from a RAM module). Does that change how PostgreSQL will place memory and processes? Is it important or of interest?

--
Cédric Villemain +33 6 20 30 22 52
https://www.Data-Bene.io
PostgreSQL Support, Expertise, Training, R&D