Re: Add the ability to limit the amount of memory that can be allocated to backends. - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: Add the ability to limit the amount of memory that can be allocated to backends.
Date:
Msg-id: 1c5f1856-817d-45e5-8e1a-acd95c6dd335@enterprisedb.com
In response to: Re: Add the ability to limit the amount of memory that can be allocated to backends. ("Anton A. Melnikov" <a.melnikov@postgrespro.ru>)
Responses: Re: Add the ability to limit the amount of memory that can be allocated to backends.
List: pgsql-hackers
Hi,

I took a closer look at this patch over the last couple of days, and I did a bunch of benchmarks to see how expensive the accounting is. The good news is that I haven't observed any overhead - I did two simple tests that I think would be particularly sensitive to this:

1) regular "pgbench -S" with scale 10, so tiny and not hitting I/O

2) "select count(*) from generate_series(1,N)" where N is either 10k or 1M, which should be enough to cause a fair number of allocations

And I tested this on three branches - master (no patches applied), patched (but limit=0, so just accounting) and patched-limit (with the limit set to 4.5GB, so high enough not to be hit).

Attached are the script and raw results for both benchmarks from two machines, i5 (small, 4C) and xeon (bigger, 16C/32C), and a PDF showing the results as candlestick charts (with 1 and 2 sigma intervals). AFAICS there's no measurable difference between master and patched builds. Which is good - clearly the local allowance makes the overhead negligible (at least in these benchmarks).

Now, for the patch itself - I already said some of this about a month ago [1], but I'll repeat some of it for the sake of completeness before I get to comments about the code etc.

Firstly, I agree with the goal of having a way to account for memory used by the backends, and also with the ability to enforce some sort of limit. It's difficult to track the memory at the OS level (interpreting RSS values is not trivial), and work_mem is not sufficient to enforce a backend-level limit, let alone a global one.

But as I said earlier, it seems quite strange to start by introducing some sort of global limit, combining memory for all backends. I do understand that the intent is to have such a global limit in order to prevent issues with the OOM killer and/or not to interfere with other stuff running on the same machine. And while I'm not saying we should not have such a limit, every time I wished to have a memory limit it was a backend-level one. Ideally a workmem-like limit that would "adjust" the work_mem values used by the optimizer (but that's not what this patch aims to do), or at least a backstop in case something goes wrong (say, a memory leak, an OLTP application issuing complex queries, etc.).

The accounting and infrastructure introduced by the patch seem suitable for both types of limits (global and per-backend), but the patch then goes for the global limit first. That may seem simpler to implement, but in practice it runs into a bunch of problems that were mentioned earlier and then ignored. For example, I really don't believe it's OK to just report the error in the first backend that happens to hit the limit. Bertrand mentioned that [2], asking about a situation where a runaway process allocates 99% of memory. The response to that was

> Initially, we believe that punishing the detector is reasonable if we
> can help administrators avoid the OOM killer/resource starvation. But
> we can and should expand on this idea.

which I find rather unsatisfactory. Belief is not an argument, and it relies on the assumption that this helps the administrator to avoid the OOM killer etc. ISTM it can easily mean the administrator can't even connect to the database, run a query (to inspect the new system views), etc., because any of that would hit the memory limit. That seems more like a built-in DoS facility ...

If this was up to me, I'd probably start with the per-backend limit. But that's my personal opinion.
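To make that a bit more concrete, the kind of backend-level backstop I have in mind is roughly this (just a sketch - the GUC and the function name are invented, this is not something the patch implements; my_allocated_bytes is the per-process counter the patch already maintains):

    #include "postgres.h"

    /* hypothetical per-backend limit in kB, 0 = disabled (invented GUC) */
    static int  backend_memory_limit = 0;

    /* per-process counter maintained by the 0001 accounting patch */
    extern uint64 *my_allocated_bytes;

    /*
     * Hypothetical check in the allocation path: fail just this backend's
     * query when the backend alone exceeds its limit, instead of punishing
     * whichever process happens to hit a global limit first.
     */
    static inline void
    check_backend_memory_limit(Size requested)
    {
        if (backend_memory_limit > 0 &&
            *my_allocated_bytes + requested > (uint64) backend_memory_limit * 1024)
            ereport(ERROR,
                    (errcode(ERRCODE_OUT_OF_MEMORY),
                     errmsg("backend memory limit exceeded")));
    }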
Now, some comments about the code etc.

Stephen was promising a new patch version in October [6], but that hasn't happened yet, so I looked at the patches I rebased in December.

0001
----

1) I don't see why we should separate memory by context type - without knowing which exact backends are using the memory, this seems pretty irrelevant/useless. Either it's private or shared memory; I don't think the accounting should break this into more counters. Also, while I'm not aware of anyone proposing a new memory context type, I doubt we'd want to add more and more counters if that happens.

suggestion: only two counters, for local and shared memory

If we absolutely want to have per-type counters, we should not mix the private memory and shared memory ones.

2) I have my doubts about the tracking of shared memory. It seems weird to count it only into the backend that allocated it, and to transfer it to "global" on process exit. Surely we should know which shared memory is meant to be long-lived from the start, no? For example, it's not uncommon for an extension to allocate a chunk of shared memory to store some sort of global state (e.g. BDR/pglogical does that, I'm sure other extensions do too). But the process that allocates the shared memory keeps running. AFAICS this would be tracked as backend DSM memory, which I'm not sure is what we want ... And I'm not alone in thinking this should work differently - [3], [4].

3) We limit the floor of allocation counters to zero. This seems really weird. Why would it be necessary? I mean, how could we get a negative amount of memory? Seems like it might easily mask issues with incorrect accounting, or something.

4) pg_stat_memory_allocation

Maybe pg_stat_memory_usage would be a better name ...

5) The comments/docs repeatedly talk about "dynamic nature" and that it makes the values not exact:

> Due to the dynamic nature of memory allocations the allocated bytes
> values may not be exact but should be sufficient for the intended
> purposes.

I don't understand what "dynamic nature" refers to, and why it would make the values not exact. Presumably this refers to the "allowance", but how is that "dynamic nature"? This definitely needs to be explained better, with some basic estimate of how accurate the values are expected to be.

6) The SGML docs keep recommending to use pg_size_pretty(). I find that unnecessary; no other docs reporting values in bytes do that.

7) I see no reason to include shared_memory_size_mb in the view, and the same for shared_memory_size_in_huge_pages. It has little to do with "allocated" memory, IMO. And a value in "MB" goes directly against the suggestion to use pg_size_pretty().

suggestion: drop this from the view

8) In a lot of places we do

    context->mem_allocated += blksize;
    pgstat_report_allocated_bytes_increase(blksize, PG_ALLOC_ASET);

Maybe we should integrate the two types of accounting, wrap them into a single function call, or something? This makes it very easy to forget to update one of those places. AFAIC the patch tries to minimize the number of updates of the new shared counters, but with the allowance that should not be an issue, I think.
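Something like this, perhaps (the helper name is invented, and I'm guessing the exact prototype of pgstat_report_allocated_bytes_increase() from how the patch calls it):

    #include "postgres.h"
    #include "utils/memutils.h"

    /* from the patch -- prototype assumed, second argument is the allocator type */
    extern void pgstat_report_allocated_bytes_increase(uint64 bytes, int alloc_type);

    /*
     * Single entry point for block-level accounting: bump the context-level
     * counter and report to the backend-level counters in one place, so the
     * two can't get out of sync.
     */
    static inline void
    MemoryContextNoteBlockAlloc(MemoryContext context, Size blksize, int alloc_type)
    {
        context->mem_allocated += blksize;
        pgstat_report_allocated_bytes_increase((uint64) blksize, alloc_type);
    }

Then aset.c/generation.c/slab.c would call that one helper instead of the two separate calls.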
9) This seems wrong. Why would it be OK to ever overflow? Isn't that a sign of incorrect accounting?

    /* Avoid allocated_bytes unsigned integer overflow on decrease */
    if (pg_sub_u64_overflow(*my_allocated_bytes, proc_allocated_bytes, &temp))
    {
        /* On overflow, set allocated bytes and allocator type bytes to zero */
        *my_allocated_bytes = 0;
        *my_aset_allocated_bytes = 0;
        *my_dsm_allocated_bytes = 0;
        *my_generation_allocated_bytes = 0;
        *my_slab_allocated_bytes = 0;
    }

9) How could any of these values be negative? It's all capped to 0 and also stored in uint64. Seems pretty useless.

    +SELECT
    +  datid > 0, pg_size_bytes(shared_memory_size) >= 0, shared_memory_size_in_huge_pages >= -1, global_dsm_allocated_bytes >= 0
    +FROM
    +  pg_stat_global_memory_allocation;
    + ?column? | ?column? | ?column? | ?column?
    +----------+----------+----------+----------
    + t        | t        | t        | t
    +(1 row)
    +
    +-- ensure that pg_stat_memory_allocation view exists
    +SELECT
    +  pid > 0, allocated_bytes >= 0, aset_allocated_bytes >= 0, dsm_allocated_bytes >= 0, generation_allocated_bytes >= 0, slab_allocated_bytes >= 0

0003
----

1) The commit message says:

> Further requests will not be allocated until dropping below the limit.
> Keep this in mind when setting this value.

The SGML docs have a similar "keep this in mind" suggestion in relation to the 1MB local allowance. I find this pretty useless, as it doesn't really say what to do with the information / what it means. I mean, should I set the value higher, or what am I supposed to do? This needs to be understandable for an average user reading the SGML docs, who is unlikely to know the implementation details.

2)

> This limit does not affect auxiliary backend processes.

This seems pretty unfortunate, because in the cases where I actually saw the OOM killer intervene, it was often because of auxiliary processes allocating a lot of memory (say, autovacuum with maintenance_work_mem set very high, etc.). In a way, not including these auxiliary processes in the limit seems to go against the idea of preventing the OOM killer.

3) I don't think the patch ever explains what the "local allowance" is, how the refill works, why it's done this way, etc. I do think I understand that now, but I had to go through the thread. That's not really how it should be. There should be an explanation of how this works somewhere (comment in mcxt.c? separate README?).

4)

> doc/src/sgml/monitoring.sgml

Not sure why this removes the part about DSM being included only in the backend that created it.

5)

> total_bkend_mem_bytes_available

The "bkend" name is strange (to save 2 characters), and the way it combines local and shared memory seems confusing too.

6)

    + /*
    +  * Local counter to manage shared memory allocations. At backend startup, set to
    +  * initial_allocation_allowance via pgstat_init_allocated_bytes(). Decrease as
    +  * memory is malloc'd. When exhausted, atomically refill if available from
    +  * ProcGlobal->max_total_bkend_mem via exceeds_max_total_bkend_mem().
    +  */
    +uint64 allocation_allowance = 0;

But this allowance is for shared memory too, and shared memory is not allocated using malloc.

7) There's a lot of new global variables. Maybe it'd be better to group them into a struct, or something?

8) Why does pgstat_set_allocated_bytes_storage need the new return?

9) Unnecessary changes in src/backend/utils/hash/dynahash.c (whitespace, a new comment that seems not very useful).

10) Shouldn't we have a new malloc wrapper that does the check? That way we wouldn't need to have the call in every memory context place calling malloc.
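For example (the wrapper name is invented; allocation_allowance and exceeds_max_total_bkend_mem() come from the patch, and I'm assuming from the comment quoted above that the function returns true when the global limit would be exceeded and otherwise refills the local allowance):

    #include "postgres.h"

    /* from the patch -- prototypes/semantics assumed */
    extern uint64 allocation_allowance;
    extern bool exceeds_max_total_bkend_mem(uint64 requested);

    /*
     * Single malloc wrapper doing the limit check, so the memory contexts
     * don't need to repeat the check next to every malloc() call.
     */
    static void *
    pg_malloc_limited(Size size)
    {
        /*
         * If the request doesn't fit into the remaining local allowance, try
         * to refill it from the global pool; bail out if the global limit
         * would be exceeded and let the caller raise the usual error.
         */
        if (size > allocation_allowance &&
            exceeds_max_total_bkend_mem((uint64) size))
            return NULL;

        return malloc(size);
    }

That would also be a natural place to document how the allowance/refill mechanism works.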
11) I'm not sure about the ereport() calls after hitting the limit. Andres thinks it might lead to recursion [4], but Stephen [5] seems to think this does not really make the situation worse. I'm not sure about that, though - I agree that ENOMEM can already happen, but maybe having a limit (which clearly needs to be stricter than the limit used by the kernel for OOM) would make it more likely to hit?

12) In general, I agree with Andres [4] that we'd be better off focusing on the accounting part, seeing how it works in practice, and then adding some ability to limit memory after a while.

regards

[1] https://www.postgresql.org/message-id/4fb99fb7-8a6a-2828-dd77-e2f1d75c7dd0%40enterprisedb.com
[2] https://www.postgresql.org/message-id/3b9b90c6-f4ae-a7df-6519-847ea9d5fe1e%40amazon.com
[3] https://www.postgresql.org/message-id/20230110023118.qqbjbtyecofh3uvd%40awork3.anarazel.de
[4] https://www.postgresql.org/message-id/20230113180411.rdqbrivz5ano2uat%40awork3.anarazel.de
[5] https://www.postgresql.org/message-id/ZTArWsctGn5fEVPR%40tamriel.snowman.net
[6] https://www.postgresql.org/message-id/ZTGycSYuFrsixv6q%40tamriel.snowman.net

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company