Thread: Some interesting news about Linux 3.12 OOM
I'm not sure how many of you have been tracking this but courtesy of lwn.net I have learned that it seems that the OOM killer behavior in Linux 3.12 will be significantly different. And by description, it sounds like an improvement. I thought some people reading -hackers might be interested. Based on the description at lwn, excerpted below, it sounds like the news might be that systems with overcommit on might return OOM when a non-outlandish request for memory is made from the kernel. """ Johannes Weiner has posted a set of patches aimed at improving this situation. Following a bunch of cleanup work, these patches make two fundamental changes to how OOM conditions are handled in the kernel. The first of those is perhaps the most visible: it causes the kernel to avoid calling the OOM killer altogether for most memory allocation failures. In particular, if the allocation is being made in response to a system call, the kernel will just cause the system call to fail with an ENOMEM error rather than trying to find a process to kill. That may cause system call failures to happen more often and in different contexts than they used to. But, naturally, that will not be a problem since all user-space code diligently checks the return status of every system call and responds with well-tested error-handling code when things go wrong. """ Subject to experiment, this may be some good news, as many programs, libraries, and runtime environments that may run parallel to Postgres on a machine are pretty lackadaisical about limiting the amount of virtual memory charged to them, and overcommit off is somewhat punishing in those situations if one really needed a large hash table from Postgres or whatever. I've seen some cases here where a good amount of VM has been reserved and caused apparent memory pressure that cut throughput short of what ought to be possible.
On Wed, Sep 18, 2013 at 10:09 PM, Daniel Farina <daniel@heroku.com> wrote: > I'm not sure how many of you have been tracking this but courtesy of > lwn.net I have learned that it seems that the OOM killer behavior in > Linux 3.12 will be significantly different. And by description, it > sounds like an improvement. I thought some people reading -hackers > might be interested. > > Based on the description at lwn, excerpted below, it sounds like the > news might be that systems with overcommit on might return OOM when a > non-outlandish request for memory is made from the kernel. > > """ > Johannes Weiner has posted a set of patches aimed at improving this > situation. Following a bunch of cleanup work, these patches make two > fundamental changes to how OOM conditions are handled in the kernel. > The first of those is perhaps the most visible: it causes the kernel > to avoid calling the OOM killer altogether for most memory allocation > failures. In particular, if the allocation is being made in response > to a system call, the kernel will just cause the system call to fail > with an ENOMEM error rather than trying to find a process to kill. That > may cause system call failures to happen more often and in different > contexts than they used to. But, naturally, that will not be a problem > since all user-space code diligently checks the return status of every > system call and responds with well-tested error-handling code when > things go wrong. > """ > > Subject to experiment, this may be some good news, as many programs, > libraries, and runtime environments that may run parallel to Postgres > on a machine are pretty lackadaisical about limiting the amount of > virtual memory charged to them, and overcommit off is somewhat > punishing in those situations if one really needed a large hash table > from Postgres or whatever. I've seen some cases here where a good > amount of VM has been reserved and caused apparent memory pressure > that cut throughput short of what ought to be possible. Yes, that does sound good. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Sep 19, 2013 at 9:12 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> But, naturally, that will not be a problem >> since all user-space code diligently checks the return status of every >> system call and responds with well-tested error-handling code when >> things go wrong. That just short circuited my sarcasm detector. merlin
On Thu, Sep 19, 2013 at 11:30 AM, Merlin Moncure <mmoncure@gmail.com> wrote: > On Thu, Sep 19, 2013 at 9:12 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> But, naturally, that will not be a problem >>> since all user-space code diligently checks the return status of every >>> system call and responds with well-tested error-handling code when >>> things go wrong. > > That just short circuited my sarcasm detector. I laughed, too, but the reality is that at least as far as PG is concerned it's probably a truthful statement, and if it isn't, nobody here is likely to complain about having to fix it. Yeah, there's a lot of other code out there not as well written or maintained as PG, but using SIGKILL as a substitute for ENOMEM because people might not be checking the return value for malloc() is extremely heavy-handed nannyism. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-09-19 11:49:05 -0400, Robert Haas wrote: > On Thu, Sep 19, 2013 at 11:30 AM, Merlin Moncure <mmoncure@gmail.com> wrote: > > On Thu, Sep 19, 2013 at 9:12 AM, Robert Haas <robertmhaas@gmail.com> wrote: > >>> But, naturally, that will not be a problem > >>> since all user-space code diligently checks the return status of every > >>> system call and responds with well-tested error-handling code when > >>> things go wrong. > > > > That just short circuited my sarcasm detector. > > I laughed, too, but the reality is that at least as far as PG is > concerned it's probably a truthful statement, and if it isn't, nobody > here is likely to complain about having to fix it. Yeah, there's a > lot of other code out there not as well written or maintained as PG, > but using SIGKILL as a substitute for ENOMEM because people might not > be checking the return value for malloc() is extremely heavy-handed > nannyism. The "problem" is that it's not just about malloc() (aka brk() and mmap()) and friends. It's about many of the other system calls, e.g. send(), to name one of the more likely ones. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Sep 19, 2013 at 12:02 PM, Andres Freund <andres@2ndquadrant.com> wrote: > The "problem" is that it's not just about malloc() (aka brk() and > mmap()) and friends. It's about many of the other system calls, > e.g. send(), to name one of the more likely ones. *shrug* If you're using send() and not testing for a -1 return value, you're writing amazingly bad code anyway. And if you ARE testing for -1, you'll probably do something at least mildly sensible with a not-specifically-foreseen errno value, like print a message that includes %m. That's about what we'd probably do, and I have to imagine it's what most people would do. I'm not saying it won't break anything to return a proper error code; I'm just saying that sending SIGKILL is worse. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
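As an illustration of the point above about doing "something at least mildly sensible" with an errno nobody planned for, here is a minimal Python sketch. It is not PostgreSQL code: `send_or_report` is a hypothetical helper name, and the message format only approximates what a `%m`-style message would print.

```python
import errno
import os


def send_or_report(sock, payload):
    """Try to send payload; return (bytes_sent, None) or (None, message).

    Hypothetical helper for illustration only.
    """
    try:
        return sock.send(payload), None
    except OSError as exc:
        # Handles any errno, including ones the caller never anticipated
        # (such as ENOMEM): report the symbolic name plus the system's
        # error text instead of assuming a fixed set of failures.
        name = errno.errorcode.get(exc.errno, "unknown errno")
        detail = exc.strerror or str(exc)
        return None, f"could not send data: {name}: {detail}"
```

The caller still decides what to do after the failure (retry, close the connection, abort the transaction); the sketch only shows that an unexpected errno can be reported without special-casing it.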
Robert Haas <robertmhaas@gmail.com> writes: > I laughed, too, but the reality is that at least as far as PG is > concerned it's probably a truthful statement, and if it isn't, nobody > here is likely to complain about having to fix it. Yeah, there's a > lot of other code out there not as well written or maintained as PG, > but using SIGKILL as a substitute for ENOMEM because people might not > be checking the return value for malloc() is extremely heavy-handed > nannyism. I've been told on several occasions that this was done for the JVM and other such programs that want to allocate huge amounts of memory even if they don't really intend to use it. Back in the day that amount could well be greater than the actual amount of physical memory available. So the only way to allow Java applications on Linux was, as I've been told, to implement OOM. And as the target was the desktop, well, have it turned on by default. Now, I liked that story enough to never actually try and check it, so if someone knows for real why the Linux kernel appears so stupid in its choice of implementing OOM and turning it on by default… Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On 2013-09-19 18:23:07 +0200, Dimitri Fontaine wrote: > I've been told on several occasions that this was done for the JVM > and other such programs that want to allocate huge amounts of memory even > if they don't really intend to use it. That's not really related - what you describe is memory overcommitting (which has lots of uses besides JVMs). That's not removed by the changes referenced upthread. What has changed is how to react to situations where memory has been overcommitted but is now actually needed. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > What has changed is how to react to situations where memory has been > overcommitted but is now actually needed. Sure. You either have a failure at malloc() or at usage. Overcommit is all about never failing at malloc(), but then you have to deal with OOM conditions in creative ways, like with the OOM Killer. Anyways, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
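For readers who want to check which side of the "fail at malloc() or fail at usage" trade-off a given machine is on, the setting lives in `/proc/sys/vm/overcommit_memory` on Linux. The following is a rough Python sketch, not anything from the thread: the function names are made up for illustration and the descriptions paraphrase the kernel documentation.

```python
from pathlib import Path

# Paraphrased meanings of the Linux vm.overcommit_memory sysctl values.
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (the default): only obviously excessive requests fail",
    1: "always overcommit: requests are not refused for lack of memory",
    2: "strict accounting: requests beyond the commit limit fail at allocation time",
}


def describe_overcommit(mode):
    """Return a short description of an overcommit mode (hypothetical helper)."""
    return OVERCOMMIT_MODES.get(mode, f"unrecognized mode {mode}")


def current_overcommit_mode(path="/proc/sys/vm/overcommit_memory"):
    """Return the configured mode as an int, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # Not Linux, or the file is missing or unreadable.
        return None
```

In modes 0 and 1 an allocation can succeed and the shortage only shows up when the memory is touched, which is the situation the OOM killer exists for; in mode 2 the failure is reported up front as ENOMEM.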
All, I've sent kernel.org a message that we're keen on seeing these changes become committed. BTW, in the future if anyone sees kernel.org contemplating a patch which helps or hurts Postgres, don't hesitate to speak up to them. They don't get nearly enough feedback from DB developers. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Sep 24, 2013 10:12 AM, "Josh Berkus" <josh@agliodbs.com> wrote: > > All, > > I've sent kernel.org a message that we're keen on seeing these changes > become committed. I thought it was merged already in 3.12. There are a few related patches, but here's one: commit 519e52473ebe9db5cdef44670d5a97f1fd53d721 Author: Johannes Weiner <hannes@cmpxchg.org> Date: Thu Sep 12 15:13:42 2013 -0700 mm: memcg: enable memcg OOM killer only for user faults System calls and kernel faults (uaccess, gup) can handle an out of memory situation gracefully and just return -ENOMEM. Enable the memcg OOM killer only for user faults, where it's really the only option available. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: azurIt <azurit@pobox.sk> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> $ git tag --contains 519e52473ebe9db5cdef44670d5a97f1fd53d721 v3.12-rc1 v3.12-rc2 Searching for recent work by Johannes Weiner shows the pertinent stuff more exhaustively. > BTW, in the future if anyone sees kernel.org contemplating a patch which > helps or hurts Postgres, don't hesitate to speak up to them. They don't > get nearly enough feedback from DB developers. I don't hesitate; most of the time I simply don't know.
On Wed, Sep 25, 2013 at 12:15 AM, Daniel Farina <daniel@heroku.com> wrote: > Enable the memcg OOM killer only for user faults, where it's really the > only option available. Is this really a big deal? I would expect most faults to be user faults. It's certainly a big deal that we need to ensure we can handle ENOMEM from syscalls and library functions we weren't expecting to return it. But I don't expect it to actually reduce the OOM killing sprees by much. -- greg
On Wed, Sep 25, 2013 at 8:00 AM, Greg Stark <stark@mit.edu> wrote: > > On Wed, Sep 25, 2013 at 12:15 AM, Daniel Farina <daniel@heroku.com> wrote: >> >> Enable the memcg OOM killer only for user faults, where it's really the >> only option available. > > > Is this really a big deal? I would expect most faults to be user faults. > > It's certainly a big deal that we need to ensure we can handle ENOMEM from > syscalls and library functions we weren't expecting to return it. But I > don't expect it to actually reduce the OOM killing sprees by much. Hmm, I see what you mean. I have been reading through the mechanism: I got too excited about 'allocations by system calls', because I thought that might mean brk and friends, except that's not much of an allocation at all, just reservation. I think. There is some interesting stuff coming in along with these patches in bringing the user-space memcg OOM handlers up to snuff that may make it profitable to issue SIGTERM to backends when a safety margin is crossed (too bad the error messages will be confusing in that case). I was rather hoping that a regular ENOMEM could be injected by this mechanism the next time a syscall is touched (unknown), but I'm not confident whether these patches make that any easier. One could imagine the kernel injecting such a fault when the amount of memory being consumed starts to look hairy, but I surmise part of the impetus for userspace handling of that is to avoid getting into that particular heuristics game. Anyway, I did do some extensive study of cgroups and memcg's implementation in particular and found it not really practical for Postgres use unless one was happy with lots and lots of database restarts, but this work still gives me some hope to try again, even if smaller modifications still seem necessary.
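The "SIGTERM when a safety margin is crossed" idea can be sketched roughly as below. This is only an illustration under stated assumptions, not the mechanism from the kernel patches: it polls the cgroup v1 memory controller files `memory.usage_in_bytes` and `memory.limit_in_bytes`, and the function names, the 10% default margin, and the caller-supplied list of PIDs are all hypothetical. Choosing which backends to signal is a policy question the sketch does not address.

```python
import os
import signal
from pathlib import Path


def margin_crossed(usage_bytes, limit_bytes, margin=0.10):
    """True when less than `margin` (a fraction of the limit) remains free."""
    return usage_bytes >= limit_bytes * (1.0 - margin)


def read_counter(cgroup_dir, name):
    """Read one integer counter file from a cgroup v1 memory directory."""
    return int(Path(cgroup_dir, name).read_text().strip())


def check_and_signal(cgroup_dir, pids, margin=0.10):
    """Send SIGTERM to `pids` if the cgroup is within `margin` of its limit.

    Returns the list of PIDs that were signalled (empty if none).
    """
    usage = read_counter(cgroup_dir, "memory.usage_in_bytes")
    limit = read_counter(cgroup_dir, "memory.limit_in_bytes")
    if not margin_crossed(usage, limit, margin):
        return []
    signalled = []
    for pid in pids:
        os.kill(pid, signal.SIGTERM)  # polite request, unlike the OOM killer's SIGKILL
        signalled.append(pid)
    return signalled
```

A polling loop like this is racy by nature, since memory use can jump past the margin between checks, so it would complement kernel-side handling and not replace it.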