Thread: Limiting memory allocation

Limiting memory allocation

From
Stephen Frost
Date:
Greetings,

An ongoing issue in container environments where Kubernetes is being
used is that setting the overcommit parameters on the base system will
impact all of the processes on that system and not all of them handle
malloc failing as gracefully as PG does and may allocate more than what
they really need with the expectation that it'll work.  Folks have
already started hacking around this issue by using things like
LD_PRELOAD'd libraries to put code between palloc and malloc, but that
could cause any malloc to fail and it strikes me that, in an ideal
world, we could give users a way to constrain the total amount of memory
allocated by regular backends (or those where a user is able to control
how much memory is allocated, more generally) while not impacting the
processes which keep the server running.

I wanted to bring this general idea up for discussion here to get
feedback on the concept before going off to write code for it.  Seems
unlikely that it would be a huge amount of code while there's likely to
need to be discussion about how one would configure this and how we
might handle parallel query and such.

Empirically, the LD_PRELOAD hack does, in fact, seem to work, so even a
simple approach would be helpful, but certainly there is an angle to
this where we might eventually allow certain backends (perhaps certain
roles, etc) to allocate more and others to not be allowed to allocate as
much, etc.  Ideally we would look to go with the simple approach first
and then we can contemplate making it more complicated in the future and
not try to accomplish it all in the first pass.  Of course, we should
keep in mind such ideas and try to avoid anything which would preclude
us from adding that flexibility in the future.

Thoughts?

Thanks,

Stephen


Re: Limiting memory allocation

From
Jan Wieck
Date:
On 5/17/22 15:42, Stephen Frost wrote:
> Thoughts?

Yes.

The main and foremost problem is a server that is used for multiple 
services and they behave differently when it comes to memory allocation. 
One service just allocates like we have petabytes of RAM, then uses 
little of it, while another one is doing precise accounting and uses all 
of that. These two types of services don't coexist well on one system 
without intervention.

Unfortunately swap space has been shunned as the ugly stepchild of 
memory in recent years. It could help in this regard to bring swap 
space back while not really intending to use it.

Using cgroups one can actually force a certain process (or user, or 
service) to use swap if and when that service is using more memory than 
it was "expected" to use. So I have a server with 64G of RAM. I give 16G 
to Postgres as shared buffers and another 16G to work with. I assume 
another 16G of OS buffers, so I restrict the Apache-Tomcat stuff running 
on it to something like 8-12G. After that, it has to swap. Of course, my 
Postgres processes also will have to swap if they need more than 16G of 
overall work_mem ... but that is what I actually intended. I may have to 
reduce work_mem, or max_connections, or something else.


Regards, Jan



Re: Limiting memory allocation

From
Tom Lane
Date:
Jan Wieck <jan@wi3ck.info> writes:
> On 5/17/22 15:42, Stephen Frost wrote:
>> Thoughts?

> Using cgroups one can actually force a certain process (or user, or 
> service) to use swap if and when that service is using more memory than 
> it was "expected" to use.

I wonder if we shouldn't just provide documentation pointing to OS-level
facilities like that one.  The kernel has a pretty trivial way to check
the total memory used by a process.  We don't: it'd require tracking total
space used in all our memory contexts, and then extracting some number out
of our rear ends for allocations made directly from malloc.  In short,
anything we do here will be slow and unreliable, unless you want to depend
on platform-specific things like looking at /proc/self/maps.

ulimit might be interesting to check into as well.  The last time I
looked, it wasn't too helpful for this on Linux, but that was years ago.

            regards, tom lane



Re: Limiting memory allocation

From
Stephen Frost
Date:
Greetings,

On Tue, May 17, 2022 at 18:12 Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jan Wieck <jan@wi3ck.info> writes:
> On 5/17/22 15:42, Stephen Frost wrote:
>> Thoughts?

> Using cgroups one can actually force a certain process (or user, or
> service) to use swap if and when that service is using more memory than
> it was "expected" to use.

I wonder if we shouldn't just provide documentation pointing to OS-level
facilities like that one.  The kernel has a pretty trivial way to check
the total memory used by a process.  We don't: it'd require tracking total
space used in all our memory contexts, and then extracting some number out
of our rear ends for allocations made directly from malloc.  In short,
anything we do here will be slow and unreliable, unless you want to depend
on platform-specific things like looking at /proc/self/maps.

This isn’t actually a solution though and that’s the problem- you end up using swap but if you use more than “expected” the OOM killer comes in and happily blows you up anyway. Cgroups are containers and exactly what kube is doing.

I agree with the general statement that it would be better for the kernel to do this, and a patch was written for it but then rejected by the kernel folks. I’m hoping to push on that with the kernel developers but they seemed pretty against this and that’s quite unfortunate.

As for the performance concern and other mallocs: for the former, thanks to our memory contexts, I don’t expect it to be all that much of an issue, as the actual allocations we do aren’t all that frequent, and apparently a relatively trivial implementation was done, performance was tested, and the impact was claimed to be basically negligible. Sadly that code isn’t open (yet… this is under discussion, supposedly) but my understanding was that they just used a simple bit of shared memory to keep the count.

As for the latter, we could at least review the difference between our count and actual memory allocated and see how big that difference is in some testing (which might be enlightening anyway..) and review our direct mallocs and see if there’s a real concern there. Naturally this approach would necessitate some amount less than the total amount of memory available being used by PG anyway, but that could certainly be desirable in some scenarios where there are other processes running and to ensure not all of the filesystem cache is ejected.

ulimit might be interesting to check into as well.  The last time I
looked, it wasn't too helpful for this on Linux, but that was years ago.

Unfortunately I really don’t think anything here has materially changed in a way which would help us.  This would also apply across all of PG’s processes and I would think it’d be nice to differentiate between user backends running away and sucking up a ton of memory vs backend processes that shouldn’t be constrained in this way.

Thanks,

Stephen

Re: Limiting memory allocation

From
Tom Lane
Date:
Stephen Frost <sfrost@snowman.net> writes:
> On Tue, May 17, 2022 at 18:12 Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> ulimit might be interesting to check into as well.  The last time I
>> looked, it wasn't too helpful for this on Linux, but that was years ago.

> Unfortunately I really don’t think anything here has materially changed in
> a way which would help us.  This would also apply across all of PG’s
> processes and I would think it’d be nice to differentiate between user
> backends running away and sucking up a ton of memory vs backend processes
> that shouldn’t be constrained in this way.

It may well be that they've not fixed its shortcomings, but the claim
that it couldn't be applied selectively is nonsense.  See setrlimit(2),
which we already use successfully (AFAIK) to set stack space on a
per-process basis.

            regards, tom lane



Re: Limiting memory allocation

From
Jan Wieck
Date:
On 5/17/22 18:30, Stephen Frost wrote:
> Greetings,
> 
> On Tue, May 17, 2022 at 18:12 Tom Lane <tgl@sss.pgh.pa.us 
> <mailto:tgl@sss.pgh.pa.us>> wrote:
> 
>     Jan Wieck <jan@wi3ck.info <mailto:jan@wi3ck.info>> writes:
>      > On 5/17/22 15:42, Stephen Frost wrote:
>      >> Thoughts?
> 
>      > Using cgroups one can actually force a certain process (or user, or
>      > service) to use swap if and when that service is using more
>     memory than
>      > it was "expected" to use.
> 
>     I wonder if we shouldn't just provide documentation pointing to OS-level
>     facilities like that one.  The kernel has a pretty trivial way to check
>     the total memory used by a process.  We don't: it'd require tracking
>     total
>     space used in all our memory contexts, and then extracting some
>     number out
>     of our rear ends for allocations made directly from malloc.  In short,
>     anything we do here will be slow and unreliable, unless you want to
>     depend
>     on platform-specific things like looking at /proc/self/maps.
> 
> 
> This isn’t actually a solution though and that’s the problem- you end up 
> using swap but if you use more than “expected” the OOM killer comes in 
> and happily blows you up anyway. Cgroups are containers and exactly what 
> kube is doing.

Maybe I'm missing something, but what is it that you would actually 
consider a solution? Knowing your current memory consumption doesn't 
make the need for allocating some right now go away. What do you 
envision the response of PostgreSQL to be if we had that information 
about resource pressure? I don't see us using mallopt(3) or 
malloc_trim(3) anywhere in the code, so I don't think any of our 
processes give back unused heap at this point (please correct me if I'm 
wrong). This means that even if we knew about the memory pressure of the 
system, adjusting things like work_mem on the fly may not do much at 
all, unless there is a constant turnover of backends.

So what do you propose PostgreSQL's response to high memory pressure to be?


Regards, Jan



Re: Limiting memory allocation

From
Ronan Dunklau
Date:
On Wednesday, May 18, 2022, at 16:23:34 CEST, Jan Wieck wrote:
> On 5/17/22 18:30, Stephen Frost wrote:
> > Greetings,
> >
> > On Tue, May 17, 2022 at 18:12 Tom Lane <tgl@sss.pgh.pa.us
> >
> > <mailto:tgl@sss.pgh.pa.us>> wrote:
> >     Jan Wieck <jan@wi3ck.info <mailto:jan@wi3ck.info>> writes:
> >      > On 5/17/22 15:42, Stephen Frost wrote:
> >      >> Thoughts?
> >      >
> >      > Using cgroups one can actually force a certain process (or user, or
> >      > service) to use swap if and when that service is using more
> >
> >     memory than
> >
> >      > it was "expected" to use.
> >
> >     I wonder if we shouldn't just provide documentation pointing to
> >     OS-level
> >     facilities like that one.  The kernel has a pretty trivial way to
> >     check
> >     the total memory used by a process.  We don't: it'd require tracking
> >     total
> >     space used in all our memory contexts, and then extracting some
> >     number out
> >     of our rear ends for allocations made directly from malloc.  In short,
> >     anything we do here will be slow and unreliable, unless you want to
> >     depend
> >     on platform-specific things like looking at /proc/self/maps.
> >
> > This isn’t actually a solution though and that’s the problem- you end up
> > using swap but if you use more than “expected” the OOM killer comes in
> > and happily blows you up anyway. Cgroups are containers and exactly what
> > kube is doing.
>
> Maybe I'm missing something, but what is it that you would actually
> consider a solution? Knowing your current memory consumption doesn't
> make the need for allocating some right now go away. What do you
> envision the response of PostgreSQL to be if we had that information
> about resource pressure? I don't see us using mallopt(3) or
> malloc_trim(3) anywhere in the code, so I don't think any of our
> processes give back unused heap at this point (please correct me if I'm
> wrong). This means that even if we knew about the memory pressure of the
> system, adjusting things like work_mem on the fly may not do much at
> all, unless there is a constant turnover of backends.

I'm not sure I understand your point: when we free() a pointer, malloc is
allowed to release the corresponding memory to the kernel. In the case of
glibc, it doesn't necessarily do so, but it trims the top of the heap if it is
in excess of M_TRIM_THRESHOLD. In the default glibc configuration, this
parameter is dynamically adjusted by malloc itself, to a maximum value of 64MB
IIRC. So any memory freed on the top of the heap totalling more than that
threshold ends up actually freed.

In another thread, I proposed to take control over this tuning instead of
letting malloc do it itself, as we may have better knowledge of the memory
allocations pattern than what malloc empirically discovers: in particular, we
could lower work_mem, adjust the threshold and maybe even call malloc_trim
ourselves when work_mem is lowered, to reduce the padding we may keep.

>
> So what do you propose PostgreSQL's response to high memory pressure to be?
>
>
> Regards, Jan


--
Ronan Dunklau





Re: Limiting memory allocation

From
Alvaro Herrera
Date:
On 2022-May-18, Jan Wieck wrote:

> Maybe I'm missing something, but what is it that you would actually consider
> a solution? Knowing your current memory consumption doesn't make the need
> for allocating some right now go away. What do you envision the response of
> PostgreSQL to be if we had that information about resource pressure?

What was mentioned in the talk where this issue was presented, is that
people would like malloc() to return NULL when there's memory pressure,
even if Linux has been configured indicating that memory overcommit is
OK.  The reason they can't set overcommit off is that it prevents other
services in the same system from running properly.

As I understand, setrlimit() sets the memory limit for any single
process.  But that isn't useful -- the limit needed is for the whole set
of processes under postmaster.  Limiting any individual process does no
good.

Now that's where cgroup's memory limiting features would prove useful,
if they weren't totally braindead:
https://www.kernel.org/doc/Documentation/cgroup-v2.txt
Apparently, if the cgroup goes over the "high" limit, the processes are
*throttled*.  Then if the group goes over the "max" limit, OOM-killer is
invoked.

(I can't see any way to make this even more counterproductive to the
database use case.  Making the database work more slowly doesn't fix
anything.)

So ditch cgroups.


What they (Timescale) do, is have a LD_PRELOAD library that checks
status of memory pressure, and return NULL from malloc().  This then
leads to clean abort of transactions and all is well.  There's nothing
that Postgres needs to do different than today.

I suppose that what they would like, is a way to inquire into the memory
pressure status at MemoryContextAlloc() time and return NULL if it is
too high.  How exactly this would work is unclear to me; maybe one
process keeps an eye on it in an OS-specific manner, and if it does get
near the maximum, set a bit in shared memory that other processes can
examine when MemoryContextAlloc is called.  It doesn't have to be
exactly accurate; an approximation is probably okay.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/



Re: Limiting memory allocation

From
Stephen Frost
Date:
Greetings,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > On Tue, May 17, 2022 at 18:12 Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> ulimit might be interesting to check into as well.  The last time I
> >> looked, it wasn't too helpful for this on Linux, but that was years ago.
>
> > Unfortunately I really don’t think anything here has materially changed in
> > a way which would help us.  This would also apply across all of PG’s
> > processes and I would think it’d be nice to differentiate between user
> > backends running away and sucking up a ton of memory vs backend processes
> > that shouldn’t be constrained in this way.
>
> It may well be that they've not fixed its shortcomings, but the claim
> that it couldn't be applied selectively is nonsense.  See setrlimit(2),
> which we already use successfully (AFAIK) to set stack space on a
> per-process basis.

Yeah, that thought wasn't quite properly formed, sorry for the confusion.

That it's per-process is actually the issue, unless we were to split
up what we're given evenly across max_connections or such, which might
work but would surely end up wasting an unfortunate amount of memory.

Consider:

shared_buffers = 8G
max_memory = 8G
max_connections = 1000 (for easy math)

With setrlimit(2), we could at process start of all user backends set
RLIMIT_AS to 8G + 8G/1000 (8M) + some fudge for code, stack, etc,
meaning each process would only be allowed about 8M of memory for
work space, even though there's perhaps only 10 processes running,
resulting in over 7G of memory that PG should be able to use, but isn't.

Maybe we could do some tracking of per-process actual memory usage of
already running processes and consider that when starting new ones and
even allow processes to change their limit if they hit it, depending on
what else is going on in the system, but I'm really not sure that all of
this would end up being that much more efficient than just directly
tracking allocations and failing when we hit them ourselves, and it sure
seems like it'd be a lot more complicated.

Thanks,

Stephen


Re: Limiting memory allocation

From
Jan Wieck
Date:
On 5/18/22 11:11, Alvaro Herrera wrote:
> On 2022-May-18, Jan Wieck wrote:
> 
>> Maybe I'm missing something, but what is it that you would actually consider
>> a solution? Knowing your current memory consumption doesn't make the need
>> for allocating some right now go away. What do you envision the response of
>> PostgreSQL to be if we had that information about resource pressure?
> 
> What was mentioned in the talk where this issue was presented, is that
> people would like malloc() to return NULL when there's memory pressure,
> even if Linux has been configured indicating that memory overcommit is
> OK.  The reason they can't set overcommit off is that it prevents other
> services in the same system from running properly.

Thank you Alvaro, that was the missing piece. Now I understand what we 
are trying to do.

> As I understand, setrlimit() sets the memory limit for any single
> process.  But that isn't useful -- the limit needed is for the whole set
> of processes under postmaster.  Limiting any individual process does no
> good.
> 
> Now that's where cgroup's memory limiting features would prove useful,
> if they weren't totally braindead:
> https://www.kernel.org/doc/Documentation/cgroup-v2.txt
> Apparently, if the cgroup goes over the "high" limit, the processes are
> *throttled*.  Then if the group goes over the "max" limit, OOM-killer is
> invoked.
> 
> (I can't see any way to make this even more counterproductive to the
> database use case.  Making the database work more slowly doesn't fix
> anything.)
> 
> So ditch cgroups.

Agreed.

> What they (Timescale) do, is have a LD_PRELOAD library that checks
> status of memory pressure, and return NULL from malloc().  This then
> leads to clean abort of transactions and all is well.  There's nothing
> that Postgres needs to do different than today.
> 
> I suppose that what they would like, is a way to inquire into the memory
> pressure status at MemoryContextAlloc() time and return NULL if it is
> too high.  How exactly this would work is unclear to me; maybe one
> process keeps an eye on it in an OS-specific manner, and if it does get
> near the maximum, set a bit in shared memory that other processes can
> examine when MemoryContextAlloc is called.  It doesn't have to be
> exactly accurate; an approximation is probably okay.

Correct, it doesn't have to be accurate. Something /proc based setting a 
flag in shared memory WOULD be good enough, IF MemoryContextAlloc() had 
some way of figuring out that its process is actually the right one to 
abort.

On a high transaction throughput system, having such a background 
process being the only one setting and clearing a flag in shared memory 
could prove disastrous. Let it check and set/clear the flag every second 
... the whole system would throw malloc(3) failures for a whole second 
on every session. Not the system I would like to benchmark ... although 
the result charts would look hilarious.

However, once we are under memory pressure to the point of aborting 
transactions, it may be reasonable to have MemoryContextAlloc() calls 
work through a queue and return NULL one by one until the pressure is 
low enough again.

I'll roll this problem around in my head for a little longer. There 
certainly is a way to do this a bit more intelligently.


Thanks again, Jan



Re: Limiting memory allocation

From
Stephen Frost
Date:
Greetings,

* Jan Wieck (jan@wi3ck.info) wrote:
> On 5/17/22 18:30, Stephen Frost wrote:
> >This isn’t actually a solution though and that’s the problem- you end up
> >using swap but if you use more than “expected” the OOM killer comes in and
> >happily blows you up anyway. Cgroups are containers and exactly what kube
> >is doing.
>
> Maybe I'm missing something, but what is it that you would actually consider
> a solution? Knowing your current memory consumption doesn't make the need
> for allocating some right now go away. What do you envision the response of
> PostgreSQL to be if we had that information about resource pressure? I don't
> see us using mallopt(3) or malloc_trim(3) anywhere in the code, so I don't
> think any of our processes give back unused heap at this point (please
> correct me if I'm wrong). This means that even if we knew about the memory
> pressure of the system, adjusting things like work_mem on the fly may not do
> much at all, unless there is a constant turnover of backends.
>
> So what do you propose PostgreSQL's response to high memory pressure to be?

Fail the allocation, just how most PG systems are set up to do.  In such
a case, PG will almost always be able to fail the transaction, free up
the memory used, and continue running *without* ending up with a crash.

Thanks,

Stephen


Re: Limiting memory allocation

From
Joe Conway
Date:
On 5/18/22 11:11, Alvaro Herrera wrote:
> Now that's where cgroup's memory limiting features would prove useful,
> if they weren't totally braindead:
> https://www.kernel.org/doc/Documentation/cgroup-v2.txt
> Apparently, if the cgroup goes over the "high" limit, the processes are
> *throttled*.  Then if the group goes over the "max" limit, OOM-killer is
> invoked.
> 
> (I can't see any way to make this even more counterproductive to the
> database use case.  Making the database work more slowly doesn't fix
> anything.)

You may be misinterpreting "throttle" in this context. From [1]:

   The memory.high boundary on the other hand can be set
   much more conservatively. When hit, it throttles
   allocations by forcing them into direct reclaim to
   work off the excess, but it never invokes the OOM
   killer.

> So ditch cgroups.

You cannot ditch cgroups if you are running in a container. And in fact 
most non-container installations these days are also running in a cgroup 
under systemd.

The only difference is that you are more likely to see a memory limit 
set in a container than under systemd.

[1] 
https://github.com/torvalds/linux/blob/master/Documentation/admin-guide/cgroup-v2.rst

-- 
Joe Conway
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Limiting memory allocation

From
Alvaro Herrera
Date:
On 2022-May-18, Joe Conway wrote:

> On 5/18/22 11:11, Alvaro Herrera wrote:

> > Apparently, if the cgroup goes over the "high" limit, the processes are
> > *throttled*.  Then if the group goes over the "max" limit, OOM-killer is
> > invoked.

> You may be misinterpreting "throttle" in this context. From [1]:
> 
>   The memory.high boundary on the other hand can be set
>   much more conservatively. When hit, it throttles
>   allocations by forcing them into direct reclaim to
>   work off the excess, but it never invokes the OOM
>   killer.

Well, that means the backend processes don't do their expected task
(process some query) but instead they have to do "direct reclaim".  I
don't know what that is, but it sounds like we'd need to add
Linux-specific code in order for this to fix anything.  And what would
we do in such a situation anyway?  Seems like our best hope would be to
get malloc() to return NULL and have the resulting transaction abort
free enough memory that things in other backends can continue to run.

*If* there is a way to have cgroups make Postgres do that, then that
would be useful enough.

> > So ditch cgroups.
> 
> You cannot ditch cgroups if you are running in a container. And in fact most
> non-container installations these days are also running in a cgroup under
> systemd.

I just meant that the cgroup abstraction doesn't offer any interfaces
that we can use to improve this, not that we would be running without
them.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"How strange it is to find the words "Perl" and "saner" in such close
proximity, with no apparent sense of irony. I doubt that Larry himself
could have managed it."         (ncm, http://lwn.net/Articles/174769/)



Re: Limiting memory allocation

From
Joe Conway
Date:
On 5/18/22 16:20, Alvaro Herrera wrote:
> On 2022-May-18, Joe Conway wrote:
> 
>> On 5/18/22 11:11, Alvaro Herrera wrote:
> 
>> > Apparently, if the cgroup goes over the "high" limit, the processes are
>> > *throttled*.  Then if the group goes over the "max" limit, OOM-killer is
>> > invoked.
> 
>> You may be misinterpreting "throttle" in this context. From [1]:
>> 
>>   The memory.high boundary on the other hand can be set
>>   much more conservatively. When hit, it throttles
>>   allocations by forcing them into direct reclaim to
>>   work off the excess, but it never invokes the OOM
>>   killer.
> 
> Well, that means the backend processes don't do their expected task
> (process some query) but instead they have to do "direct reclaim".  I
> don't know what that is, but it sounds like we'd need to add
> Linux-specific code in order for this to fix anything. 

Postgres does not need to do anything. The kernel just does its thing 
(e.g. clearing page cache or swapping out anon memory) more aggressively 
than normal to clear up some space for the impending allocation.

> And what would we do in such a situation anyway?  Seems like our
> best hope would be to get malloc() to return NULL and have the
> resulting transaction abort free enough memory that things in other
> backends can continue to run.

With the right hooks an extension could detect the memory pressure in an 
OS-specific way and return NULL.

> *If* there is a way to have cgroups make Postgres do that, then that
> would be useful enough.

Memory accounting under cgroups (particularly v2) can provide the signal 
needed for a Linux specific extension to do that.

>> > So ditch cgroups.
>> 
>> You cannot ditch cgroups if you are running in a container. And in fact most
>> non-container installations these days are also running in a cgroup under
>> systemd.
> 
> I just meant that the cgroup abstraction doesn't offer any interfaces
> that we can use to improve this, not that we would be running without
> them.

I agree that cgroups is very Linux specific, so likely we would not want 
such code in core.


-- 
Joe Conway
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Limiting memory allocation

From
Dmitry Dolgov
Date:
> On Wed, May 18, 2022 at 04:49:24PM -0400, Joe Conway wrote:
> On 5/18/22 16:20, Alvaro Herrera wrote:
> > On 2022-May-18, Joe Conway wrote:
> >
> > > On 5/18/22 11:11, Alvaro Herrera wrote:
> >
> > > > Apparently, if the cgroup goes over the "high" limit, the processes are
> > > > *throttled*.  Then if the group goes over the "max" limit, OOM-killer is
> > > > invoked.
> >
> > > You may be misinterpreting "throttle" in this context. From [1]:
> > >
> > >   The memory.high boundary on the other hand can be set
> > >   much more conservatively. When hit, it throttles
> > >   allocations by forcing them into direct reclaim to
> > >   work off the excess, but it never invokes the OOM
> > >   killer.
> >
> > Well, that means the backend processes don't do their expected task
> > (process some query) but instead they have to do "direct reclaim".  I
> > don't know what that is, but it sounds like we'd need to add
> > Linux-specific code in order for this to fix anything.
>
> Postgres does not need to do anything. The kernel just does its thing (e.g.
> clearing page cache or swapping out anon memory) more aggressively than
> normal to clear up some space for the impending allocation.
>
> > And what would we do in such a situation anyway?  Seems like our
> > best hope would be to get malloc() to return NULL and have the
> > resulting transaction abort free enough memory that things in other
> > backends can continue to run.
>
> With the right hooks an extension could detect the memory pressure in an OS
> specific way and return null.
>
> > *If* there is a way to have cgroups make Postgres do that, then that
> > would be useful enough.
>
> Memory accounting under cgroups (particularly v2) can provide the signal
> needed for a Linux specific extension to do that.

To elaborate a bit on this, Linux PSI feature (in the context of
containers, cgroups v2 only) [1] would allow a userspace application to
register a trigger on memory pressure exceeding some threshold. The
pressure here is not exactly how much memory is allocated, but rather
memory stall, and the whole machinery would involve polling -- but still
sounds interesting in the context of this thread.

[1]: https://www.kernel.org/doc/Documentation/accounting/psi.rst



Re: Limiting memory allocation

From
Oleksii Kliukin
Date:
Hi,

> On 18. May 2022, at 17:11, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2022-May-18, Jan Wieck wrote:
>
>> Maybe I'm missing something, but what is it that you would actually consider
>> a solution? Knowing your current memory consumption doesn't make the need
>> for allocating some right now go away. What do you envision the response of
>> PostgreSQL to be if we had that information about resource pressure?
>
>
> What they (Timescale) do, is have a LD_PRELOAD library that checks
> status of memory pressure, and return NULL from malloc().  This then
> leads to clean abort of transactions and all is well.  There's nothing
> that Postgres needs to do different than today.

Correct. The library we have reads a limit supplied in an environment variable
and stores per-process and total memory usage values in shared memory counters,
updated after each call to malloc/free/calloc/realloc by the process making the
call.  When updating totals, a process picks one of N counters to update
atomically with the difference between its old and new memory usage, avoiding
congested ones; those are summed to determine current allocations for all
processes and to compare against the limit.

>
> I suppose that what they would like, is a way to inquire into the memory
> pressure status at MemoryContextAlloc() time and return NULL if it is
> too high.

If we call user code just before malloc (and, presumably free and realloc), the
code would have to do just as much work as when it is called from the
malloc/free/realloc wrappers inside a preloaded library. Furthermore, I don’t
see why the user would want to customize that logic: a single Linux-specific
implementation would solve the problem for everyone.


>  How exactly this would work is unclear to me; maybe one
> process keeps an eye on it in an OS-specific manner,

We don’t need to do anything for non-Linux systems, as cgroups and the
OOM killer don’t exist there.


> and if it does get
> near the maximum, set a bit in shared memory that other processes can
> examine when MemoryContextAlloc is called.  It doesn't have to be
> exactly accurate; an approximation is probably okay.

What would be the purpose of setting a bit in shared memory when the maximum is
about to be reached?

What would be useful is a way for Postgres to count the amount of memory
allocated by each backend.  This could be advantageous for giving per-backend
memory usage to the user, as well as for enforcing a limit on the total amount
of memory allocated by the backends.

—
Oleksii Kliukin


Re: Limiting memory allocation

From
Stephen Frost
Date:
Greetings,

* Oleksii Kliukin (alexk@hintbits.com) wrote:
> > On 18. May 2022, at 17:11, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > On 2022-May-18, Jan Wieck wrote:
> >> Maybe I'm missing something, but what is it that you would actually consider
> >> a solution? Knowing your current memory consumption doesn't make the need
> >> for allocating some right now go away. What do you envision the response of
> >> PostgreSQL to be if we had that information about resource pressure?
> >
> > What they (Timescale) do, is have a LD_PRELOAD library that checks
> > status of memory pressure, and return NULL from malloc().  This then
> > leads to clean abort of transactions and all is well.  There's nothing
> > that Postgres needs to do different than today.

I'm not super fixated on exactly what this one implementation does, but
rather that the kernel is evidently not interested in trying to solve
this problem and therefore it's something which we need to address.  I
agree in general that we don't need to do much different except to have
a way to effectively have a limit where we treat an allocation attempt
as failing and then the rest of our existing machinery will handle
failing the transaction and doing cleanup and such just fine.

> Correct. The library we have reads a limit supplied in an environment variable
> and stores per-process and total memory usage values in shared memory counters,
> updated after each call to malloc/free/calloc/realloc by the process making the
> call.  When updating totals, a process picks one of N counters to update
> atomically with the difference between its old and new memory usage, avoiding
> congested ones; those are summed  to determine current allocations for all
> processes and to compare against the limit.

Would be interesting to know just how many of these counters are used
and how 'congested' ones are avoided.  Though, would certainly be easier
if one could simply review this library.

> > I suppose that what they would like, is a way to inquire into the memory
> > pressure status at MemoryContextAlloc() time and return NULL if it is
> > too high.

Not really concerned with what one specific implementation that's been
done would like but rather with solving the larger issue that exists,
which is that we aren't able to cap our memory usage today and that can
lead to the OOM killer coming into play, or excessive swap usage, or
causing issues for other processes running.  While I started this with
the crash case as the main concern, and I do feel it's still a big case
to consider, there are other valuable use-cases to consider where this
would help.

> If we call user code just before malloc (and, presumably free and realloc), the
> code would have to do just as much work as when it is called from the
> malloc/free/realloc wrappers inside a preloaded library. Furthermore, I don’t
> see why the user would want to customize that logic: a single Linux-specific
> implementation would solve the problem for everyone.

If the problem is explicitly defined as "deal with the Linux OOM killer"
then, yes, a Linux-specific fix would address that.  I do think that's
certainly an important, and perhaps the most important, issue that this
solves, but there's other cases where this would be really helpful.

> >  How exactly this would work is unclear to me; maybe one
> > process keeps an eye on it in an OS-specific manner,

There seems to be a lot of focus on trying to implement this as "get the
amount of free memory from the OS and make sure we don't go over that
limit" and that adds a lot of OS-specific logic which complicates things
and also ignores the use-cases where an admin wishes to limit PG's
memory usage due to other processes running on the same system.  I'll
point out that the LD_PRELOAD library doesn't even attempt to do this,
even though it's explicitly for Linux, but uses an environment variable
instead.

In PG, we'd have that be a GUC that an admin is able to set and then we
track the memory usage (perhaps per-process, perhaps using some set of
buckets, perhaps locally per-process and then in a smaller number of
buckets in shared memory, or something else) and fail an allocation when
it would go over that limit, perhaps only when it's a regular user
backend or with other conditions around it.
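(To make the shape of that concrete, a rough sketch follows; every name here is a stand-in invented for illustration, where a real patch would use the GUC machinery and shared-memory counters rather than these file-scope variables. The point is only that exceeding the limit is treated exactly like malloc() returning NULL, so the existing out-of-memory error path cleans up the transaction.)

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical GUC, in bytes; 0 means "no limit". */
static size_t total_backend_memory_limit = 0;

/* Stand-in for the shared-memory total discussed in the thread. */
static size_t current_total = 0;

/*
 * Limit-aware allocation wrapper: an allocation that would push the
 * total past the limit is reported as a failure, as if malloc()
 * itself had returned NULL.
 */
static void *
limited_malloc(size_t size)
{
    void       *p;

    if (total_backend_memory_limit != 0 &&
        current_total + size > total_backend_memory_limit)
        return NULL;            /* caller raises "out of memory" */

    p = malloc(size);
    if (p != NULL)
        current_total += size;  /* the free() path would subtract */
    return p;
}
```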

> What would be useful is a way for Postgres to count the amount of memory
> allocated by each backend.  This could be advantageous for giving per-backend
> memory usage to the user, as well as for enforcing a limit on the total amount
> of memory allocated by the backends.

I agree that this would be independently useful.

Thanks,

Stephen


Re: Limiting memory allocation

From
Tomas Vondra
Date:
On 5/20/22 21:50, Stephen Frost wrote:
> Greetings,
> 
> ...
>
>>>  How exactly this would work is unclear to me; maybe one
>>> process keeps an eye on it in an OS-specific manner,
> 
> There seems to be a lot of focus on trying to implement this as "get the
> amount of free memory from the OS and make sure we don't go over that
> limit" and that adds a lot of OS-specific logic which complicates things
> and also ignores the use-cases where an admin wishes to limit PG's
> memory usage due to other processes running on the same system.  I'll
> point out that the LD_PRELOAD library doesn't even attempt to do this,
> even though it's explicitly for Linux, but uses an environment variable
> instead.
> 
> In PG, we'd have that be a GUC that an admin is able to set and then we
> track the memory usage (perhaps per-process, perhaps using some set of
> buckets, perhaps locally per-process and then in a smaller number of
> buckets in shared memory, or something else) and fail an allocation when
> it would go over that limit, perhaps only when it's a regular user
> backend or with other conditions around it.
> 

I agree a GUC setting a memory target is a sensible starting point.

I wonder if we might eventually use this to define memory budgets. One
of the common questions I get is how do you restrict the user from
setting work_mem too high or doing too many memory-hungry things.
Currently there's no way to do that, because we have no way to limit
work_mem values, and even if we had the user could construct a more
complex query with more memory-hungry operations.

But I think it's also that we weren't sure what to do after hitting a
limit - should we try replanning the query with lower work_mem value, or
what?

However, if just failing the malloc() is acceptable, maybe we could use
this to achieve something like this?

>> What would be useful is a way for Postgres to count the amount of memory
>> allocated by each backend.  This could be advantageous for giving per-backend
>> memory usage to the user, as well as for enforcing a limit on the total amount
>> of memory allocated by the backends.
> 
> I agree that this would be independently useful.
> 

Well, we already have the memory-accounting built into the memory
context infrastructure. It kinda does the same thing as the malloc()
wrapper, except that it does not publish the information anywhere and
it's per-context (so we have to walk the context recursively).

So maybe we could make this on-request somehow? Say, we'd send a signal to
the process, and it'd run MemoryContextMemAllocated() on the top memory
context and store the result somewhere.
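(A minimal sketch of that request/publish dance, with stand-ins for the Postgres pieces: the real code would install the handler via sigaction(), check the flag from CHECK_FOR_INTERRUPTS(), call MemoryContextMemAllocated() on TopMemoryContext, and publish into a shared-memory slot rather than a process-local atomic.)

```c
#include <signal.h>
#include <stdatomic.h>

/* Set from the signal handler; the real work happens later at a safe
 * point, since almost nothing is async-signal-safe. */
static volatile sig_atomic_t report_requested = 0;

/* Where the result gets published for others to read. */
static _Atomic long published_allocated = 0;

/* Stand-in for MemoryContextMemAllocated(TopMemoryContext, true). */
static long fake_mem_allocated = 0;

static void
request_report(int signo)
{
    (void) signo;
    report_requested = 1;
}

/* Called from a safe point, e.g. CHECK_FOR_INTERRUPTS(). */
static void
maybe_publish_report(void)
{
    if (report_requested)
    {
        report_requested = 0;
        atomic_store(&published_allocated, fake_mem_allocated);
    }
}
```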


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Limiting memory allocation

From
Jan Wieck
Date:
On 5/20/22 19:08, Tomas Vondra wrote:
> Well, we already have the memory-accounting built into the memory
> context infrastructure. It kinda does the same thing as the malloc()
> wrapper, except that it does not publish the information anywhere and
> it's per-context (so we have to walk the context recursively).
> 
> So maybe we could make this on-request somehow? Say, we'd a signal to
> the process, and it'd run MemoryContextMemAllocated() on the top memory
> context and store the result somewhere.

One remaining problem with all this is that we don't get any feedback 
from calling free() telling if any memory has been returned to the OS or 
not.

How portable would using sbrk() with a zero size be? If that is an 
option then I could envision counting actual calls to malloc() and 
whenever a GUC-configurable amount is reached, use sbrk() to find out, 
do the accounting in shared memory and react accordingly.

This way we are not creating another highly contended lock and use 
authoritative information.
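(For what it's worth, a probe along those lines might look like the sketch below. Two caveats on portability and accuracy: POSIX withdrew sbrk() in POSIX.1-2001, and modern allocators such as glibc's satisfy large requests via mmap(), which the program break never reflects, so this can undercount substantially.)

```c
#define _DEFAULT_SOURCE
#include <unistd.h>

/*
 * sbrk(0) reports the current program break without moving it, so the
 * difference between two probes approximates conventional heap growth
 * since the baseline was taken.
 */
static long
heap_break_delta(const void *baseline)
{
    return (long) ((char *) sbrk(0) - (const char *) baseline);
}
```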


Regards, Jan



Re: Limiting memory allocation

From
Robert Haas
Date:
On Fri, May 20, 2022 at 7:09 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> I wonder if we might eventually use this to define memory budgets. One
> of the common questions I get is how do you restrict the user from
> setting work_mem too high or doing too much memory-hungry things.
> Currently there's no way to do that, because we have no way to limit
> work_mem values, and even if we had the user could construct a more
> complex query with more memory-hungry operations.
>
> But I think it's also that we weren't sure what to do after hitting a
> limit - should we try replanning the query with lower work_mem value, or
> what?

It's always seemed to me that the principled thing to do would be to
make work_mem a per-query budget rather than a per-node budget, and
have add_path() treat memory usage as an independent figure of merit
-- and also discard any paths that went over the memory budget. Thus
we might keep more expensive paths if they use less memory to produce
the result. For this to work well, memory-hungry nodes would probably
need to add multiple paths - especially nodes that do hashing, which
is likely to have breakpoints where the estimated cost changes sharply
(and the actual cost does too, if the row counts are accurate).

I've also wondered whether we could maybe do something unprincipled
instead, because that all sounds not only complicated but also
potentially expensive, if it results in us keeping extra paths around
compared to what we keep today. It might be worth it, though.
Generating query plans infinitely fast is no good if the plans suck,
and running the machine out of RAM definitely counts as sucking.

My general feeling about this topic is that, in cases where PostgreSQL
today uses more memory than is desirable, it's probably only
moderately difficult to make it fail with a nice error message
instead. Making it succeed by altering its behavior to use less memory
seems likely to be a lot harder -- which is not to say that we
shouldn't try to do it. It's an important problem. Just not an easy
one.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Limiting memory allocation

From
Bruce Momjian
Date:
On Tue, May 24, 2022 at 11:49:27AM -0400, Robert Haas wrote:
> It's always seemed to me that the principled thing to do would be to
> make work_mem a per-query budget rather than a per-node budget, and
> have add_path() treat memory usage as an independent figure of merit
> -- and also discard any paths that went over the memory budget. Thus
> we might keep more expensive paths if they use less memory to produce
> the result. For this to work well, memory-hungry nodes would probably
> need to add multiple paths - especially nodes that do hashing, which
> is likely to have breakpoints where the estimated cost changes sharply
> (and the actual cost does too, if the row counts are accurate).
> 
> I've also wondered whether we could maybe do something unprincipled
> instead, because that all sounds not only complicated but also
> potentially expensive, if it results in us keeping extra paths around
> compared to what we keep today. It might be worth it, though.
> Generating query plans infinitely fast is no good if the plans suck,
> and running the machine out of RAM definitely counts as sucking.

If the plan output is independent of work_mem, I always wondered why we
didn't just determine the number of simultaneous memory requests in the
plan and just allocate accordingly, e.g. if there are four simultaneous
memory requests in the plan, each gets work_mem/4.


-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson




Re: Limiting memory allocation

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> If the plan output is independent of work_mem,

... it isn't ...

> I always wondered why we
> didn't just determine the number of simultaneous memory requests in the
> plan and just allocate accordingly, e.g. if there are four simultaneous
> memory requests in the plan, each gets work_mem/4.

(1) There are not a predetermined number of allocations.  For example,
if we do a given join as nestloop+inner index scan, that doesn't require
any large amount of memory; but if we do it as merge or hash join then
it will consume memory.

(2) They may not all need the same amount of memory, eg joins might
be working on different amounts of data.

If this were an easy problem to solve, we'd have solved it decades
ago.

            regards, tom lane



Re: Limiting memory allocation

From
Bruce Momjian
Date:
On Tue, May 24, 2022 at 07:40:45PM -0400, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > If the plan output is independent of work_mem,
> 
> ... it isn't ...

Good.

> > I always wondered why we
> > didn't just determine the number of simultaneous memory requests in the
> > plan and just allocate accordingly, e.g. if there are four simultaneous
> > memory requests in the plan, each gets work_mem/4.
> 
> (1) There are not a predetermined number of allocations.  For example,
> if we do a given join as nestloop+inner index scan, that doesn't require
> any large amount of memory; but if we do it as merge or hash join then
> it will consume memory.

Uh, we know from the plan whether we are doing a nestloop+inner or merge
or hash join, right?  I was suggesting we look at the plan before
execution and set the proper percentage of work_mem for each node.

> (2) They may not all need the same amount of memory, eg joins might
> be working on different amounts of data.

True, but we could cap it like we do now for work_mem, as a
percentage of a GUC work_mem total.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson




Re: Limiting memory allocation

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> On Tue, May 24, 2022 at 07:40:45PM -0400, Tom Lane wrote:
>> (1) There are not a predetermined number of allocations.  For example,
>> if we do a given join as nestloop+inner index scan, that doesn't require
>> any large amount of memory; but if we do it as merge or hash join then
>> it will consume memory.

> Uh, we know from the plan whether we are doing a nestloop+inner or merge
> or hash join, right?  I was suggesting we look at the plan before
> execution and set the proper percentage of work_mem for each node.

Then you just rendered all the planner's estimates fantasies.

            regards, tom lane



Re: Limiting memory allocation

From
Bruce Momjian
Date:
On Tue, May 24, 2022 at 09:20:43PM -0400, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > On Tue, May 24, 2022 at 07:40:45PM -0400, Tom Lane wrote:
> >> (1) There are not a predetermined number of allocations.  For example,
> >> if we do a given join as nestloop+inner index scan, that doesn't require
> >> any large amount of memory; but if we do it as merge or hash join then
> >> it will consume memory.
> 
> > Uh, we know from the plan whether we are doing a nestloop+inner or merge
> > or hash join, right?  I was suggesting we look at the plan before
> > execution and set the proper percentage of work_mem for each node.
> 
> Then you just rendered all the planner's estimates fantasies.

That's what I was asking --- if the planner's estimates are based on the
size of work_mem --- I thought you said it is not.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson




Re: Limiting memory allocation

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> On Tue, May 24, 2022 at 09:20:43PM -0400, Tom Lane wrote:
>> Then you just rendered all the planner's estimates fantasies.

> That's what I was asking --- if the planner's estimates are based on the
> size of work_mem --- I thought you said it is not.

The planner's estimates certainly vary with work_mem ... I was responding
to something you said that seemed to be asserting they didn't.

            regards, tom lane



Re: Limiting memory allocation

From
Bruce Momjian
Date:
On Tue, May 24, 2022 at 09:55:16PM -0400, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > On Tue, May 24, 2022 at 09:20:43PM -0400, Tom Lane wrote:
> >> Then you just rendered all the planner's estimates fantasies.
> 
> > That's what I was asking --- if the planner's estimates are based on the
> > size of work_mem --- I thought you said it is not.
> 
> The planner's estimates certainly vary with work_mem ... I was responding
> to something you said that seemed to be asserting they didn't.

I see where I got confused:

    > If the plan output is independent of work_mem,
    ... it isn't ...

Okay, then my idea will not work.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson