Re: Add the ability to limit the amount of memory that can be allocated to backends. - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Add the ability to limit the amount of memory that can be allocated to backends.
Msg-id 25e68736-00ff-4346-b432-4cda836743f3@vondra.me
In response to Re: Add the ability to limit the amount of memory that can be allocated to backends.  (James Hunter <james.hunter.pg@gmail.com>)
List pgsql-hackers

On 12/27/24 20:14, James Hunter wrote:
> Reviving this thread, because I am thinking about something related --
> please ignore the "On Fri, Dec 27, 2024" date; it seems to be an
> artifact of me re-sending the message from the list archive. The
> original message was from January 28, 2024.
> 
> On Fri, Dec 27, 2024 at 11:02 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> Firstly, I agree with the goal of having a way to account for memory
>> used by the backends, and also the ability to enforce some sort of limit.
>> It's difficult to track the memory at the OS level (interpreting RSS
>> values is not trivial), and work_mem is not sufficient to enforce a
>> backend-level limit, let alone a global one.
>>
>> But as I said earlier, it seems quite strange to start by introducing
>> some sort of global limit, combining memory for all backends. I do
>> understand that the intent is to have such a global limit in order to
>> prevent issues with the OOM killer and/or not to interfere with other
>> stuff running on the same machine. And while I'm not saying we should
>> not have such a limit, every time I wished for a memory limit it was a
>> backend-level one. Ideally a work_mem-like limit that would "adjust" the
>> work_mem values used by the optimizer (but that's not what this patch
>> aims to do), or at least a backstop in case something goes wrong (say, a
>> memory leak, OLTP application issuing complex queries, etc.).
> 
> I think what Tomas suggests is the right strategy. I am thinking of
> something like:
> 
> 1. Say we have a backend_work_mem limit. Then the total amount of
> memory available on the entire system, for all queries, as work_mem,
> would be backend_work_mem * max_connections.
> 
> 2. We use this backend_work_mem to "adjust" work_mem values used by
> the executor. (I don't care about the optimizer right now -- the optimizer
> just does its best to predict what will happen at runtime.)
> 
> At runtime, every node that uses work_mem currently checks its memory
> usage against the session work_mem (and possibly hash_mem_multiplier)
> GUC(s). Instead, now, every node would check against its node-specific
> "adjusted" work_mem. If it exceeds this limit, it spills, using
> existing logic.
> 
> In other words -- existing logic spills based on comparison to global
> work_mem GUC. Let's make it spill, instead, based on comparison to an
> operator-local "PlanState.work_mem" field.
> 
> And then let's set that "PlanState.work_mem" field based on a new
> "backend_work_mem" GUC. Then no queries will run OOM (at least, not
> due to work_mem -- we'll address other memory usages separately), so
> they won't need to be canceled. Instead, they'll spill.
> 
> This strategy solves the ongoing problem of how to set work_mem, if
> some queries have lots of operators and others don't -- now we just
> set backend_work_mem, as a limit on the entire query's total work_mem.
> And a bit of integration with the optimizer will allow us to
> distribute the total backend_work_mem to individual execution nodes,
> with the goal of minimizing spilling, without exceeding the
> backend_work_mem limit.
> 
> Anyway, I revived this thread to see if there's interest in this sort
> of strategy --
> 

I think there's still interest in having a memory limit of this kind,
but it's not very clear to me what you mean by "adjusted work_mem". Can
you explain how that would actually work in practice?
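
The executor-side mechanics seem clear enough. To make sure we're
talking about the same thing, here is a standalone toy sketch of the
check I'm assuming (all the names are made up, this is not actual
PostgreSQL code): spill once the node exceeds its own budget, instead
of comparing against the global work_mem GUC.

#include <stdbool.h>
#include <stdio.h>
#include <stddef.h>

typedef struct ToyPlanState
{
    const char *name;
    size_t      work_mem_bytes;     /* per-node budget ("adjusted work_mem") */
} ToyPlanState;

/* current behavior, roughly: every node checks the same session-wide GUC */
static bool
spills_today(size_t mem_used, size_t work_mem_guc)
{
    return mem_used > work_mem_guc;
}

/* proposed behavior, as I read it: every node checks its own budget */
static bool
spills_proposed(const ToyPlanState *ps, size_t mem_used)
{
    return mem_used > ps->work_mem_bytes;
}

int
main(void)
{
    ToyPlanState hj = {"Hash Join #1", 2 * 1024 * 1024};    /* 2MB budget */
    size_t       used = 3 * 1024 * 1024;                    /* 3MB needed */

    printf("today:    %s\n",
           spills_today(used, 4 * 1024 * 1024) ? "spill" : "in memory");
    printf("proposed: %s\n",
           spills_proposed(&hj, used) ? "spill" : "in memory");
    return 0;
}

What I can't picture is how the work_mem_bytes budgets would get filled
in, which is what the questions below are about.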

Whenever I've thought about this in the past, it wasn't clear to me how
we would know when to start adjusting work_mem, because we don't
know which nodes will actually use work_mem concurrently.

We certainly don't know that during planning - e.g. because we don't
actually know what nodes will be "above" the operation we're planning.
And AFAIK we don't have any concept of which parts of the plan may be
active at the same time ...

Let's say we have limits work_mem=4MB and query_mem=8MB (to keep it
simple). And we have a query plan with 3 separate hash joins, each
needing 4MB. I believe these things can happen:

1) The hash joins are in distant parts of the query plan, and will not
actually run concurrently - we'll run each hash join to completion
before starting the next one. So, no problem with query_mem.

2) The hash joins are active at the same time, initialized one by one.
But we don't know that in advance, so we'll init HJ #1 and it'll get 4MB
of memory. Then we'll init HJ #2, and it'll also get 4MB. And now we
want to init HJ #3, but we've already exhausted query_mem, and the
memory already handed out can't be reclaimed (see the toy sketch below).
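
To spell out case (2) as a toy program (grant_work_mem is entirely made
up, just to make the arithmetic concrete): if budgets are handed out at
init time, first come first served, the third hash join is left with
nothing.

#include <stdio.h>
#include <stddef.h>

#define MB  (1024 * 1024)

static size_t query_mem_left = 8 * MB;      /* query_mem = 8MB */

/* hand out as much of the request as the remaining budget allows */
static size_t
grant_work_mem(size_t requested)
{
    size_t  granted = (requested <= query_mem_left) ? requested : query_mem_left;

    query_mem_left -= granted;
    return granted;
}

int
main(void)
{
    for (int hj = 1; hj <= 3; hj++)
        printf("HJ #%d: wants 4MB, gets %zuMB\n",
               hj, grant_work_mem(4 * MB) / MB);

    /*
     * HJ #1 and #2 get 4MB each, HJ #3 gets 0MB -- the budget is gone,
     * and the memory already handed out can't be reclaimed.
     */
    return 0;
}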

How would the work_mem get adjusted in these cases? Ideally, we'd not
adjust work_mem in (1), because it'd force hash joins to spill data to
disk, affecting performance when there's enough memory to just run the
query plan (but maybe that's a reasonable price for the memory limit?).
In (2), on the other hand, we'd need to start adjusting the memory from
the beginning, but I don't think we know that early enough.

Can you explain / walk me through the proposal for these cases?

regards

-- 
Tomas Vondra



