Re: Adding basic NUMA awareness - Preliminary feedback and outline for an extensible approach - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Adding basic NUMA awareness - Preliminary feedback and outline for an extensible approach
Date
Msg-id 20537e21-3a32-4941-91eb-20bdfdb96e26@vondra.me
In response to Re: Adding basic NUMA awareness - Preliminary feedback and outline for an extensible approach  (Cédric Villemain <cedric.villemain@data-bene.io>)
List pgsql-hackers

On 7/9/25 08:40, Cédric Villemain wrote:
>> On 7/8/25 18:06, Cédric Villemain wrote:
>>>
>>>> On 7/8/25 03:55, Cédric Villemain wrote:
>>>>> Hi Andres,
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On 2025-07-05 07:09:00 +0000, Cédric Villemain wrote:
>>>>>>> In my work on more careful PostgreSQL resource management, I've come
>>>>>>> to the
>>>>>>> conclusion that we should avoid pushing policy too deeply into the
>>>>>>> PostgreSQL core itself. Therefore, I'm quite skeptical about
>>>>>>> integrating
>>>>>>> NUMA-specific management directly into core PostgreSQL in such a
>>>>>>> way.
>>>>>>
>>>>>> I think it's actually the opposite - whenever we pushed stuff like
>>>>>> this
>>>>>> outside of core it has hurt postgres substantially. Not having
>>>>>> replication in
>>>>>> core was a huge mistake. Not having HA management in core is
>>>>>> probably the
>>>>>> biggest current adoption hurdle for postgres.
>>>>>>
>>>>>> To deal better with NUMA we need to improve memory placement and
>>>>>> various
>>>>>> algorithms, in an interrelated way - that's pretty much impossible
>>>>>> to do
>>>>>> outside of core.
>>>>>
>>>>> Except the backend pinning which is easy to achieve, thus my
>>>>> comment on
>>>>> the related patch.
>>>>> I'm not claiming NUMA memory and all should be managed outside of core
>>>>> (though I didn't read other patches yet).
>>>>>
>>>>
>>>> But an "optimal backend placement" seems to very much depend on
>>>> where we
>>>> placed the various pieces of shared memory. Which the external module
>>>> will have trouble following, I suspect.
>>>>
>>>> I still don't have any idea what exactly would the external module do,
>>>> how would it decide where to place the backend. Can you describe some
>>>> use case with an example?
>>>>
>>>> Assuming we want to actually pin tasks from within Postgres, what I
>>>> think might work is allowing modules to "advise" on where to place the
>>>> task. But the decision would still be done by core.
>>>
>>> Possibly exactly what you're doing in proc.c when managing allocation of
>>> process, but not hardcoded in postgresql (patches 02, 05 and 06 are good
>>> candidates), I didn't get that they require information not available to
>>> any process executing code from a module.
>>>
>>
>> Well, it needs to understand how some other stuff (especially PGPROC
>> entries) is distributed between nodes. I'm not sure how much of this
>> internal information we want to expose outside core ...
>>
>>> Parts of your code where you assign/define policy could be in one or
>>> more relevant routines of a "numa profile manager", like in an
>>> initProcessRoutine(), and registered in pmroutine struct:
>>>
>>> pmroutine = GetPmRoutineForInitProcess();
>>> if (pmroutine != NULL &&
>>>      pmroutine->init_process != NULL)
>>>      pmroutine->init_process(MyProc);
>>>
>>> This way it's easier to manage alternative policies, and also to be able
>>> to adjust when hardware and linux kernel changes.
>>>
>>
>> I'm not against making this extensible, in some way. But I still
>> struggle to imagine a reasonable alternative policy, where the external
>> module gets the same information and ends up with a different decision.
>>
>> So what would the alternate policy look like? What use case would the
>> module be supporting?
> 
> 
> That's the whole point: there are very distinct usages of PostgreSQL in
> the field. And maybe not all of them will require the policy defined by
> PostgreSQL core.
> 
> May I ask the reverse: what prevents external modules from taking those
> decisions? There are already a lot of areas where external code can take
> over PostgreSQL processing, like Neon is doing.
> 

The complexity of making everything extensible in an arbitrary way. To
make it extensible in a useful way, we need to have a reasonably clear idea
of what aspects need to be extensible, and what the goal is.
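
As a rough illustration of what a narrowly scoped extension point could
look like, here is a minimal sketch of the "advise" idea quoted above: a
module suggests a node, and core validates the suggestion and makes the
final decision. Every name here (numa_advise_hook, choose_numa_node,
core_default_node, pin_to_node_zero) is hypothetical and does not exist in
PostgreSQL or in the patches; the round-robin default is only a placeholder
for whatever policy core would actually implement.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical advisor signature: a module returns its preferred NUMA node
 * for a backend, or -1 for "no opinion". Not an existing PostgreSQL hook.
 */
typedef int (*numa_advise_hook_type) (int backend_id, int num_nodes);

/* Hypothetical hook variable; NULL means no module has registered. */
static numa_advise_hook_type numa_advise_hook = NULL;

/* Placeholder for core's built-in policy: round-robin by backend id. */
static int
core_default_node(int backend_id, int num_nodes)
{
    return backend_id % num_nodes;
}

/*
 * Core keeps the final decision: the module's advice is used only when it
 * names a valid node, otherwise the built-in policy applies.
 */
static int
choose_numa_node(int backend_id, int num_nodes)
{
    int         node;

    assert(backend_id >= 0 && num_nodes > 0);

    node = core_default_node(backend_id, num_nodes);

    if (numa_advise_hook != NULL)
    {
        int         advice = numa_advise_hook(backend_id, num_nodes);

        if (advice >= 0 && advice < num_nodes)
            node = advice;
    }

    return node;
}

/* Example module policy (assumption): keep every backend on node 0. */
static int
pin_to_node_zero(int backend_id, int num_nodes)
{
    (void) backend_id;
    (void) num_nodes;
    return 0;
}
```

A module would set numa_advise_hook = pin_to_node_zero at load time. The
sketch says nothing about which information (for example the placement of
PGPROC entries) the advisor would need from core, which is the open
question in this thread.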

> There is some very early processing for memory setup that I can see as
> a current blocker, and here I'd refer to a more compliant NUMA API as
> proposed by Jakub so it's possible to arrange based on workload,
> hardware configuration or other matters. Reworking to get distinct
> segments and all as you do is great, and a combination of both approaches
> is probably of great interest. There is also the weighted interleave being
> discussed, and probably much more to come in this area in Linux.
> 
> I think some points were already raised about possible distinct policies.
> I am precisely claiming that it is hard to come up with one good policy
> with limited setup options, thus the requirement to keep that flexible
> enough (hooks, API, 100 GUCs?).
> 

I'm sorry, I don't want to sound too negative, but "I want arbitrary
extensibility" is not very useful feedback. I've asked you to give
some examples of policies that'd customize some of the NUMA stuff.

> There is an EPYC story here also, given the NUMA setup can vary
> depending on BIOS setup, associated NUMA policy must probably take that
> into account (L3 can be either real cache or 4 extra "local" NUMA nodes
> - with highly distinct access cost from a RAM module).
> Does that change how PostgreSQL will place memory and process? Is it
> important or of interest ?
> 

So how exactly would the policy handle this? Right now we're entirely
oblivious to L3, or on-CPU caches in general. We don't even consider the
size of L3 when sizing hash tables in a hashjoin etc.


regards

-- 
Tomas Vondra



