Thread: Setting oom_adj on linux?
I realize this is a very platform-specific thing, but should we consider setting the value of /proc/<pid>/oom_adj when running on linux? See: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/proc.txt;h=220cc6376ef80e0c9bcfec162d45552e729cdf5a;hb=45d28b097280a78893ce25a5d0db41e6a2717853 section 3.1. To get the best benefit, I think it needs to be done in cooperation between the startup scripts and PostgreSQL. We'd want -17 ("never oom kill") on the postmaster, but some different value on regular backends (since if it has to kill someone, it's better to pick a regular backend so we can do a controlled restart). Only root can drop the value, so the startup script needs to be part of it. But if we then want to increase it for sub-processes, that'd require something in the backend itself... -- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/
Magnus Hagander wrote: > I realize this is a very platform-specific thing, but should we > consider setting the value of /proc/<pid>/oom_adj when running on > linux? See: http://archives.postgresql.org/message-id/20080201223336.GC24780%40alvh.no-ip.org -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
On Mon, Jan 4, 2010 at 16:45, Alvaro Herrera <alvherre@commandprompt.com> wrote: > Magnus Hagander wrote: >> I realize this is a very platform-specific thing, but should we >> consider setting the value of /proc/<pid>/oom_adj when running on >> linux? See: > > http://archives.postgresql.org/message-id/20080201223336.GC24780%40alvh.no-ip.org Grr. I had zero recollectoin of that :S Can't find a useful consensus though? -- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/
On Mon, 4 Jan 2010 16:57:28 +0100, Magnus Hagander <magnus@hagander.net> wrote: > On Mon, Jan 4, 2010 at 16:45, Alvaro Herrera <alvherre@commandprompt.com> > wrote: >> Magnus Hagander wrote: >>> I realize this is a very platform-specific thing, but should we >>> consider setting the value of /proc/<pid>/oom_adj when running on >>> linux? See: >> >> http://archives.postgresql.org/message-id/20080201223336.GC24780%40alvh.no-ip.org > > Grr. I had zero recollectoin of that :S > > Can't find a useful consensus though? I don't think we should set a setting like that automatically. Perhaps a warning on startup? Joshua D. Drake -- PostgreSQL - XMPP: jdrake(at)jabber(dot)postgresql(dot)org Consulting, Development, Support, Training 503-667-4564 - http://www.commandprompt.com/ The PostgreSQL Company, serving since 1997
Magnus Hagander wrote: > On Mon, Jan 4, 2010 at 16:45, Alvaro Herrera <alvherre@commandprompt.com> wrote: > >> Magnus Hagander wrote: >> >>> I realize this is a very platform-specific thing, but should we >>> consider setting the value of /proc/<pid>/oom_adj when running on >>> linux? See: >>> >> http://archives.postgresql.org/message-id/20080201223336.GC24780%40alvh.no-ip.org >> > > Grr. I had zero recollectoin of that :S > > Can't find a useful consensus though? > > It is probably worth trying to protect the postmaster in the init script. Beyond that things probably start to get fairly difficult. cheers andrew
On Mon, Jan 4, 2010 at 17:07, Andrew Dunstan <andrew@dunslane.net> wrote: > > > Magnus Hagander wrote: >> >> On Mon, Jan 4, 2010 at 16:45, Alvaro Herrera <alvherre@commandprompt.com> >> wrote: >> >>> >>> Magnus Hagander wrote: >>> >>>> >>>> I realize this is a very platform-specific thing, but should we >>>> consider setting the value of /proc/<pid>/oom_adj when running on >>>> linux? See: >>>> >>> >>> >>> http://archives.postgresql.org/message-id/20080201223336.GC24780%40alvh.no-ip.org >>> >> >> Grr. I had zero recollectoin of that :S >> >> Can't find a useful consensus though? >> >> > > It is probably worth trying to protect the postmaster in the init script. > Beyond that things probably start to get fairly difficult. Right. But AFAICS (though I haven't tested with -17), it will become inherited to children, which is something we'd want to *undo*, no? -- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/
Magnus Hagander wrote: <blockquote cite="mid:9837222c1001040757r3a3edc85hc192f52a1f8e0fba@mail.gmail.com" type="cite"><prewrap="">On Mon, Jan 4, 2010 at 16:45, Alvaro Herrera <a class="moz-txt-link-rfc2396E" href="mailto:alvherre@commandprompt.com"><alvherre@commandprompt.com></a>wrote: </pre><blockquote type="cite"><prewrap="">Magnus Hagander wrote: </pre><blockquote type="cite"><pre wrap="">I realize this is a very platform-specificthing, but should we consider setting the value of /proc/<pid>/oom_adj when running on linux? See: </pre></blockquote><pre wrap=""><a class="moz-txt-link-freetext" href="http://archives.postgresql.org/message-id/20080201223336.GC24780%40alvh.no-ip.org">http://archives.postgresql.org/message-id/20080201223336.GC24780%40alvh.no-ip.org</a> </pre></blockquote><pre wrap=""> Can't find a useful consensus though? </pre></blockquote><br /> In <a class="moz-txt-link-freetext" href="http://archives.postgresql.org/pgsql-hackers/2008-02/msg00049.php">http://archives.postgresql.org/pgsql-hackers/2008-02/msg00049.php</a> Tompoints out that while you could make this adjustment in the init scripts for PostgreSQL, actually doing so is quite questionableas a packaging decision. That's where that thread ended as far as I was concerned. The best I think anyonecould do here is to add such a capability into some of the init scripts, but it would probably need to be disabledby default. Since that sort of defeats the purpose of the change, I'm not sure what the benefit there is--if youhave to turn it on, you might as well do something at a higher level instead.<br /><br /><pre class="moz-signature" cols="72">-- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support <a class="moz-txt-link-abbreviated" href="mailto:greg@2ndQuadrant.com">greg@2ndQuadrant.com</a> <a class="moz-txt-link-abbreviated"href="http://www.2ndQuadrant.com">www.2ndQuadrant.com</a> </pre>
Magnus Hagander wrote: > On Mon, Jan 4, 2010 at 17:07, Andrew Dunstan <andrew@dunslane.net> wrote: > >> Magnus Hagander wrote: >> >>> On Mon, Jan 4, 2010 at 16:45, Alvaro Herrera <alvherre@commandprompt.com> >>> wrote: >>> >>> >>>> Magnus Hagander wrote: >>>> >>>> >>>>> I realize this is a very platform-specific thing, but should we >>>>> consider setting the value of /proc/<pid>/oom_adj when running on >>>>> linux? See: >>>>> >>>>> >>>> http://archives.postgresql.org/message-id/20080201223336.GC24780%40alvh.no-ip.org >>>> >>>> >>> Grr. I had zero recollectoin of that :S >>> >>> Can't find a useful consensus though? >>> >>> >>> >> It is probably worth trying to protect the postmaster in the init script. >> Beyond that things probably start to get fairly difficult. >> > > Right. But AFAICS (though I haven't tested with -17), it will become > inherited to children, which is something we'd want to *undo*, no? > [experiments] Yes, darnit, you're right. But it looks like the oom_adj file can be set to the default by the process owner: [andrew@sophia ~]$ ls -l /proc/6520/oom_adj -rw-r--r-- 1 andrew andrew 0 2010-01-04 12:37 /proc/6520/oom_adj [andrew@sophia~]$ cat /proc/6520/oom_adj 0 [andrew@sophia ~]$ id uid=500(andrew) gid=500(andrew) groups=10(wheel),500(andrew) [andrew@sophia ~]$ echo -17 > /proc/6520/oom_adj -bash: echo: write error: Permission denied [andrew@sophia ~]$ echo 0 > /proc/6520/oom_adj [andrew@sophia ~]$ echo -17 > /proc/6520/oom_adj -bash: echo:write error: Permission denied [andrew@sophia ~]$ But that would be a pain to have to do. OTOH, disabling the OOM killer is not always an option. I recently tried it on one system and had to revert it rapidly because the system stopped working in minutes. Some software just doesn't live well in such environments, sadly. cheers andrew
On Mon, Jan 4, 2010 at 17:40, Andrew Dunstan <andrew@dunslane.net> wrote: > > > Magnus Hagander wrote: >> >> On Mon, Jan 4, 2010 at 17:07, Andrew Dunstan <andrew@dunslane.net> wrote: >> >>> >>> Magnus Hagander wrote: >>> >>>> >>>> On Mon, Jan 4, 2010 at 16:45, Alvaro Herrera >>>> <alvherre@commandprompt.com> >>>> wrote: >>>> >>>> >>>>> >>>>> Magnus Hagander wrote: >>>>> >>>>> >>>>>> >>>>>> I realize this is a very platform-specific thing, but should we >>>>>> consider setting the value of /proc/<pid>/oom_adj when running on >>>>>> linux? See: >>>>>> >>>>>> >>>>> >>>>> >>>>> http://archives.postgresql.org/message-id/20080201223336.GC24780%40alvh.no-ip.org >>>>> >>>>> >>>> >>>> Grr. I had zero recollectoin of that :S >>>> >>>> Can't find a useful consensus though? >>>> >>>> >>>> >>> >>> It is probably worth trying to protect the postmaster in the init script. >>> Beyond that things probably start to get fairly difficult. >>> >> >> Right. But AFAICS (though I haven't tested with -17), it will become >> inherited to children, which is something we'd want to *undo*, no? >> > > > [experiments] > > Yes, darnit, you're right. But it looks like the oom_adj file can be set to > the default by the process owner: > > [andrew@sophia ~]$ ls -l /proc/6520/oom_adj > -rw-r--r-- 1 andrew andrew 0 2010-01-04 12:37 /proc/6520/oom_adj > [andrew@sophia ~]$ cat /proc/6520/oom_adj > 0 > [andrew@sophia ~]$ id > uid=500(andrew) gid=500(andrew) groups=10(wheel),500(andrew) > [andrew@sophia ~]$ echo -17 > /proc/6520/oom_adj > -bash: echo: write error: Permission denied > [andrew@sophia ~]$ echo 0 > /proc/6520/oom_adj > [andrew@sophia ~]$ echo -17 > /proc/6520/oom_adj > -bash: echo: write error: Permission denied > [andrew@sophia ~]$ > > But that would be a pain to have to do. > > OTOH, disabling the OOM killer is not always an option. I recently tried it > on one system and had to revert it rapidly because the system stopped > working in minutes. Some software just doesn't live well in such > environments, sadly. Right. Which is why I like the idea of disabling the OOM killer for the *postmaster*, but not the regular backends. Gives it a chance to recover. It's not nice, but it's better than nothing. -- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/
Greg Smith <greg@2ndquadrant.com> writes: > In http://archives.postgresql.org/pgsql-hackers/2008-02/msg00049.php Tom > points out that while you could make this adjustment in the init scripts > for PostgreSQL, actually doing so is quite questionable as a packaging > decision. I just wondered if it would be questioned, I didn't say there was a problem. However, the long and the short of this is that we can't do anything without the close cooperation of an init script. I think that moves it out of the realm of what Postgres as a project should be doing. It seems more like a patch that the Linux-based packagers should be carrying. Memo to self: get off duff and prepare such a patch for the Red Hat/Fedora packages. regards, tom lane
Magnus Hagander wrote: > Right. Which is why I like the idea of disabling the OOM killer for > the *postmaster*, but not the regular backends. Gives it a chance to > recover. It's not nice, but it's better than nothing. It doesn't sound like the init script can reenable the killer for the child processes though. So, if there's anything that the core code ought to do, is re-enable OOM-killer for postmaster children, after being disabled by the initscript. BTW, is it possible for pg_ctl to disable OOM-killer? I guess not, since it's not run by root ... Is there a way to disable memory overcommit for particular processes? That would be very useful -- just disable overcommit for all Postgres processes, and there shouldn't be much need to enable the killer for backends. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
On Mon, Jan 4, 2010 at 17:55, Alvaro Herrera <alvherre@commandprompt.com> wrote: > Magnus Hagander wrote: > >> Right. Which is why I like the idea of disabling the OOM killer for >> the *postmaster*, but not the regular backends. Gives it a chance to >> recover. It's not nice, but it's better than nothing. > > It doesn't sound like the init script can reenable the killer for the > child processes though. So, if there's anything that the core code > ought to do, is re-enable OOM-killer for postmaster children, after > being disabled by the initscript. Yeah, that's why the backend code would need to be involved. > BTW, is it possible for pg_ctl to disable OOM-killer? I guess not, > since it's not run by root ... No, it has to run as root. > Is there a way to disable memory overcommit for particular processes? > That would be very useful -- just disable overcommit for all Postgres > processes, and there shouldn't be much need to enable the killer for > backends. Not that I've been able to find. -- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/
Alvaro Herrera <alvherre@commandprompt.com> writes: > Is there a way to disable memory overcommit for particular processes? I would think not --- the very essence of overcommit is that you're promising more total memory than the system has got, and that's inherently a global proposition. regards, tom lane
Magnus Hagander wrote: > On Mon, Jan 4, 2010 at 17:55, Alvaro Herrera <alvherre@commandprompt.com> wrote: > > BTW, is it possible for pg_ctl to disable OOM-killer? I guess not, > > since it's not run by root ... > > No, it has to run as root. We could at least make it work on Windows, since it is often run as Administrator and drops privileges afterwards ... ... oh, wait ... -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Magnus Hagander <magnus@hagander.net> writes: > I realize this is a very platform-specific thing, but should we > consider setting the value of /proc/<pid>/oom_adj when running on > linux? See: > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/proc.txt;h=220cc6376ef80e0c9bcfec162d45552e729cdf5a;hb=45d28b097280a78893ce25a5d0db41e6a2717853 One interesting thing I read there is: Swapped out tasks are killed first. Half of each child's memory size is added to the parent's score if they do not sharethe same memory. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This suggests that PG's shared memory ought not be counted in the postmaster's OOM score, which would mean that the problem shouldn't be quite as bad as we've believed. I wonder if that is a recent change? Or maybe it's supposed to be that way and is not implemented correctly? BTW, the given link shows only chapter 1, see http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob_plain;f=Documentation/filesystems/proc.txt;hb=45d28b097280a78893ce25a5d0db41e6a2717853 for the whole file. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote: > This suggests that PG's shared memory ought not be counted in the > postmaster's OOM score, which would mean that the problem > shouldn't be quite as bad as we've believed. I wonder if that is > a recent change? Or maybe it's supposed to be that way and is not > implemented correctly? I've wondered about that based on my experience. When I found that memory leak back in 8.2devel, running on a SLES 9 SP 3 system, the OOM killer killed the offending backend rather than the postmaster, although it took out a couple Java middle tier processes before starting in on PostgreSQL. -Kevin
Tom Lane wrote: > Magnus Hagander <magnus@hagander.net> writes: >> ...oom_adj... > > One interesting thing I read there is: > Swapped out tasks are killed first. Half of each child's memory size is > added to the parent's score if they do not share the same memory. > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > This suggests that PG's shared memory ought not be counted in the > postmaster's OOM score, which would mean that the problem shouldn't be > quite as bad as we've believed. I wonder if that is a recent change? > Or maybe it's supposed to be that way and is not implemented correctly? The code for oom_kill.c looks fairly readable (link below [1]): 96 points = mm->total_vm; .... 117 list_for_each_entry(child, &p->children, sibling) { 118 task_lock(child); 119 if (child->mm != mm && child->mm) 120 points += child->mm->total_vm/2 + 1; 121 task_unlock(child); 122 } Which seems to add points for each child who doesn't share the same mm structure as the parent. Which I think is a quite a bit stricter interpretation of "if they do not share the same memory". [1] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=mm/oom_kill.c;h=f52481b1c1e5442c9a5b16b06b22221b75b9bb7c;hb=HEAD
On Mon, Jan 4, 2010 at 09:55, Alvaro Herrera <alvherre@commandprompt.com> wrote: > Magnus Hagander wrote: > >> Right. Which is why I like the idea of disabling the OOM killer for >> the *postmaster*, but not the regular backends. Gives it a chance to >> recover. It's not nice, but it's better than nothing. > > It doesn't sound like the init script can reenable the killer for the > child processes though. So, if there's anything that the core code > ought to do, is re-enable OOM-killer for postmaster children, after > being disabled by the initscript. Exactly. FWIW here is the patch I run. Stupid as the patch may be, count it as a +1 for people in the field doing this. Hence a reason to think about doing something in core. maybe. This has some oddities like it does not reset oom to 0 for the (wal) writer process. Plus assuming you do oom, the stats collector has a good chance of being hit. Although normal backends will probably have a higher score. [ oom_adj gets set to -17 in the startup script. I run this on top of disabling overcommit, color me paranoid ] *** a/src/backend/postmaster/autovacuum.c --- b/src/backend/postmaster/autovacuum.c *************** *** 362,367 **** StartAutoVacLauncher(void) --- 362,370 ---- #ifndef EXEC_BACKEND case 0: /* in postmaster child ... */ + + oom_adjust(); + /* Close the postmaster's sockets */ ClosePostmasterPorts(false); *** a/src/backend/postmaster/fork_process.c --- b/src/backend/postmaster/fork_process.c *************** *** 65,68 **** fork_process(void) --- 65,84 ---- return result; } + void + oom_adjust(void) + { + /* adjust oom */ + FILE *oom = fopen("/proc/self/oom_adj", "w"); + + /* + * ignore errors we dont really care + */ + if (oom) + { + fprintf(oom, "0\n"); + fclose(oom); + } + } + #endif /* ! WIN32 */ *** a/src/backend/postmaster/pgarch.c --- b/src/backend/postmaster/pgarch.c *************** *** 161,166 **** pgarch_start(void) --- 161,169 ---- #ifndef EXEC_BACKEND case 0: /* in postmaster child ... */ + + oom_adjust(); + /* Close the postmaster's sockets */ ClosePostmasterPorts(false); *** a/src/backend/postmaster/pgstat.c --- b/src/backend/postmaster/pgstat.c *************** *** 622,627 **** pgstat_start(void) --- 622,630 ---- #ifndef EXEC_BACKEND case 0: /* in postmaster child ... */ + + oom_adjust(); + /* Close the postmaster's sockets */ ClosePostmasterPorts(false); *** a/src/backend/postmaster/postmaster.c --- b/src/backend/postmaster/postmaster.c *************** *** 3056,3061 **** BackendStartup(Port *port) --- 3056,3063 ---- { free(bn); + oom_adjust(); + /* * Let's clean up ourselves as the postmaster child, and close the * postmaster's listen sockets. (In EXEC_BACKEND case this is all *** a/src/backend/postmaster/syslogger.c --- b/src/backend/postmaster/syslogger.c *************** *** 530,535 **** SysLogger_Start(void) --- 530,538 ---- #ifndef EXEC_BACKEND case 0: /* in postmaster child ... */ + + oom_adjust(); + /* Close the postmaster's sockets */ ClosePostmasterPorts(true); *** a/src/include/postmaster/fork_process.h --- b/src/include/postmaster/fork_process.h *************** *** 13,17 **** --- 13,18 ---- #define FORK_PROCESS_H extern pid_t fork_process(void); + extern void oom_adjust(void); #endif /* FORK_PROCESS_H */
Attachment
Alex Hunsaker <badalex@gmail.com> writes: > FWIW here is the patch I run. Stupid as the patch may be, count it as > a +1 for people in the field doing this. Hence a reason to think > about doing something in core. maybe. Thanks for the patch --- it's certainly a fine starting point. We can either drop this in core (with a lot of #ifdef LINUX added) or expect Linux packagers to carry it as a patch. Given that the packagers would also have to modify their init scripts to go with, the patch route is not unreasonable. Comments? > This has some oddities like it does not reset oom to 0 for the (wal) > writer process. FWIW, I think that's probably a feature --- I'd vote for only resetting in regular backends and possibly autovac workers. regards, tom lane
On Thu, Jan 7, 2010 at 20:26, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Alex Hunsaker <badalex@gmail.com> writes: > We can either drop this in core (with a lot of #ifdef LINUX added) Any thoughts on doing something like (in fork_process.c) #ifdef LINUX void oom_adjust() { ... } #else void oom_adjust() {} #endif So there is only one #ifdef? It still leaves the ugly calls to the function... > or expect Linux packagers to carry it as a patch. Given that the > packagers would also have to modify their init scripts to go with, > the patch route is not unreasonable. Comments? Id plus +1 for core. The problem certainly does not look to be going away soon (if ever). >> This has some oddities like it does not reset oom to 0 for the (wal) >> writer process. > > FWIW, I think that's probably a feature --- I'd vote for only resetting > in regular backends and possibly autovac workers. I think that makes sense +1. In-fact thats why the patch has it as a separate function instead of hacked into fork_process(). However its mainly odd because IIRC I greped for all instances of fork_process() and added the oom_adjusting to the callers. Given that it seems the wall writer procs should also be set to 0. My guess is its a race with my startup script launching postgres and then setting oom_adj. Or maybe I missed a caller? Maybe they don't use fork_process()? Ill check it out.
On Fri, Jan 8, 2010 at 04:46, Alex Hunsaker <badalex@gmail.com> wrote: > On Thu, Jan 7, 2010 at 20:26, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Alex Hunsaker <badalex@gmail.com> writes: > >> We can either drop this in core (with a lot of #ifdef LINUX added) > > Any thoughts on doing something like (in fork_process.c) > > #ifdef LINUX > void oom_adjust() > { > ... > } > #else > void oom_adjust() {} > #endif > > So there is only one #ifdef? It still leaves the ugly calls to the function... Seems like a much better way, yes. Especially if we in the future want to do this for more than one platform (if it becomes necessary). >> or expect Linux packagers to carry it as a patch. Given that the >> packagers would also have to modify their init scripts to go with, >> the patch route is not unreasonable. Comments? > > Id plus +1 for core. The problem certainly does not look to be going > away soon (if ever). Yeah, I think core is better. It's not like it's enough code to cause a huge maintenance problem, I think. Do we need to make the value configurable? I'd certainly find it interesting to set backends to say 5 or something like that, that makes them less likely to be killed than any old "oops opened too big file in an editor"-process, but still possible to kill if the system is *really* running out of memory. -- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/
Magnus Hagander <magnus@hagander.net> writes: > Do we need to make the value configurable? I'd certainly find it > interesting to set backends to say 5 or something like that, that > makes them less likely to be killed than any old "oops opened too big > file in an editor"-process, but still possible to kill if the system > is *really* running out of memory. I don't want to go to the trouble of creating (and documenting) a configure option for this. Much less a GUC ;-) What I suggest is that we do something like #ifdef LINUX_OOM_ADJ...fprintf(oom, "%d\n", LINUX_OOM_ADJ);...#endif Then, somebody who wants the feature would build with, say,-DLINUX_OOM_ADJ=0 or another value if they want that. regards, tom lane
Alex Hunsaker wrote: > On Thu, Jan 7, 2010 at 20:26, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Alex Hunsaker <badalex@gmail.com> writes: > > > We can either drop this in core (with a lot of #ifdef LINUX added) > > Any thoughts on doing something like (in fork_process.c) > > #ifdef LINUX > void oom_adjust() > { > ... > } > #else > void oom_adjust() {} > #endif > > So there is only one #ifdef? It still leaves the ugly calls to the function... The usual solution for this kind of thing is: #ifdef LINUX#define OOM_ADJUST oom_adjust()#else#define OOM_ADJUST do {} while (0)#endif so there is no call or dummy function and you reference it in the code as: OOM_ADJUST; -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Fri, Jan 8, 2010 at 07:53, Bruce Momjian <bruce@momjian.us> wrote: > Alex Hunsaker wrote: >> On Thu, Jan 7, 2010 at 20:26, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> > Alex Hunsaker <badalex@gmail.com> writes: > The usual solution for this kind of thing is: > > #ifdef LINUX > #define OOM_ADJUST oom_adjust() > #else > #define OOM_ADJUST do {} while (0) > #endif > > so there is no call or dummy function and you reference it in the code > as: Surely any compiler worth its salt would turn a call to an empty void function into a noop? Then again maybe I just hate macros :)
On Fri, Jan 8, 2010 at 07:27, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Then, somebody who wants the feature would build with, say, > -DLINUX_OOM_ADJ=0 > or another value if they want that. Here is a stab at that. It sets oom_adj for: autovacuum workers archivers (pgarch.c) regular backends Also it updates the contrib linux starup script to start under an oom_adj of -17 Comments? *** a/contrib/start-scripts/linux --- b/contrib/start-scripts/linux *************** *** 53,58 **** DAEMON="$prefix/bin/postmaster" --- 53,63 ---- # What to use to shut down the postmaster PGCTL="$prefix/bin/pg_ctl" + # Adjust oom_adj on linux to avoid the postmaster from be killed + # note you probably want to compile postgres with -DLINUX_OOM_ADJ=0 + # so that regular backends will be killed on oom + OOM_ADJ=-17 + set -e # Only start if we can find the postmaster. *************** *** 62,67 **** test -x $DAEMON || exit 0 --- 67,73 ---- case $1 in start) echo -n "Starting PostgreSQL: " + echo $OOM_ADJ > /proc/self/oom_adj su - $PGUSER -c "$DAEMON -D '$PGDATA' &" >>$PGLOG 2>&1 echo "ok" ;; *** a/src/backend/postmaster/autovacuum.c --- b/src/backend/postmaster/autovacuum.c *************** *** 1403,1408 **** StartAutoVacWorker(void) --- 1403,1411 ---- /* Lose the postmaster's on-exit routines */ on_exit_reset(); + /* allow us to be killed on oom */ + oom_adjust(); + AutoVacWorkerMain(0, NULL); break; #endif *** a/src/backend/postmaster/fork_process.c --- b/src/backend/postmaster/fork_process.c *************** *** 66,68 **** fork_process(void) --- 66,97 ---- } #endif /* ! WIN32 */ + + #if defined(__linux__) && defined(LINUX_OOM_ADJ) + /* + * By default linux really likes to kill the postmaster on oom. + * Because linux does not take SYSV shared mem into account it + * (almost) always will SIGKILL the postmaster on an oom event. + * + * In the event you started the postmaster under a low (negative) + * oom_adj. This will adjust regular backends, autovac and archivers + * to LINUX_OOM_ADJ on fork(). Allowing them to be killed in an oom + * event. + * + * Later we might add other OS oom type stuff here. + */ + void + oom_adjust(void) + { + FILE *oom = fopen("/proc/self/oom_adj", "w"); + + /* ignore errors we dont really care */ + if (oom) + { + fprintf(oom, "%d\n", LINUX_OOM_ADJ); + fclose(oom); + } + } + #else + void oom_adjust(void) { } + #endif *** a/src/backend/postmaster/pgarch.c --- b/src/backend/postmaster/pgarch.c *************** *** 170,175 **** pgarch_start(void) --- 170,178 ---- /* Drop our connection to postmaster's shared memory, as well */ PGSharedMemoryDetach(); + /* allow us to be killed on oom */ + oom_adjust(); + PgArchiverMain(0, NULL); break; #endif *** a/src/backend/postmaster/postmaster.c --- b/src/backend/postmaster/postmaster.c *************** *** 3076,3081 **** BackendStartup(Port *port) --- 3076,3084 ---- /* Perform additional initialization and collect startup packet */ BackendInitialize(port); + /* allow us to be killed on oom */ + oom_adjust(); + /* And run the backend */ proc_exit(BackendRun(port)); } *** a/src/include/postmaster/fork_process.h --- b/src/include/postmaster/fork_process.h *************** *** 13,17 **** --- 13,18 ---- #define FORK_PROCESS_H extern pid_t fork_process(void); + extern void oom_adjust(void); #endif /* FORK_PROCESS_H */
Attachment
Alex Hunsaker <badalex@gmail.com> writes: > On Fri, Jan 8, 2010 at 07:27, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Then, somebody who wants the feature would build with, say, >> -DLINUX_OOM_ADJ=0 >> or another value if they want that. > Here is a stab at that. Anybody have an objection to this basic approach? I'm in a bit of a hurry to get something like this into the Fedora RPMs, so barring objections I'm going to review this, commit it into HEAD, and then make a back-ported patch I can use with 8.4 in Fedora. regards, tom lane
Tom Lane wrote: > Alex Hunsaker <badalex@gmail.com> writes: > > On Fri, Jan 8, 2010 at 07:27, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> Then, somebody who wants the feature would build with, say, > >> -DLINUX_OOM_ADJ=0 > >> or another value if they want that. > > > Here is a stab at that. > > Anybody have an objection to this basic approach? I'm in a bit of a > hurry to get something like this into the Fedora RPMs, so barring > objections I'm going to review this, commit it into HEAD, and then > make a back-ported patch I can use with 8.4 in Fedora. Go for it, but FYI, we need to udpate the our OOM documentation mention to reflect this change. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
* Magnus Hagander (magnus@hagander.net) wrote: > Do we need to make the value configurable? I'd certainly find it > interesting to set backends to say 5 or something like that, that > makes them less likely to be killed than any old "oops opened too big > file in an editor"-process, but still possible to kill if the system > is *really* running out of memory. We do need to make it configurable, in at least the sense of being able to control if it's done or not. There are some environments where you won't be able to set it. Perhaps just handling failure gracefully would work, but I'd be happier if you could just disable it in the config file. Thanks, Stephen
* Tom Lane (tgl@sss.pgh.pa.us) wrote: > I don't want to go to the trouble of creating (and documenting) a > configure option for this. Much less a GUC ;-) Requiring a custom build to disable it would be horrible, in my view. Or, at best, just means that the packagers won't enable it, which obviously would be less than ideal. Sorry if it's a pain, but I think it needs to either be configurable or not done. As I said before, it definitely needs to handle failure gracefully, but I worry that even that won't be sufficient in some cases. Just thinking about how we run PG under VServers and Linux Containers and whatnot, we ran into some issues with OpenSSH trying to monkey with proc values and I'd really hate to run into the same issues with PG. Thanks, Stephen
On Fri, Jan 8, 2010 at 10:07, Stephen Frost <sfrost@snowman.net> wrote: > * Tom Lane (tgl@sss.pgh.pa.us) wrote: >> I don't want to go to the trouble of creating (and documenting) a >> configure option for this. Much less a GUC ;-) > > Requiring a custom build to disable it would be horrible, in my view. > Or, at best, just means that the packagers won't enable it, which > obviously would be less than ideal. FWIW I agree. > Sorry if it's a pain, but I think it needs to either be configurable or > not done. As I said before, it definitely needs to handle failure > gracefully, but I worry that even that won't be sufficient in some > cases. Just thinking about how we run PG under VServers and Linux > Containers and whatnot, we ran into some issues with OpenSSH trying to > monkey with proc values and I'd really hate to run into the same issues > with PG. As long as the VM/container you are running under wont kill postmaster for trying to access proc-- the patch I posted should work fine. It just ignores any error (I assumed you might be running in a chroot without proc or some such).
* Tom Lane (tgl@sss.pgh.pa.us) wrote: > Alex Hunsaker <badalex@gmail.com> writes: > > On Fri, Jan 8, 2010 at 07:27, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> Then, somebody who wants the feature would build with, say, > >> -DLINUX_OOM_ADJ=0 > >> or another value if they want that. > > > Here is a stab at that. > > Anybody have an objection to this basic approach? I'm in a bit of a > hurry to get something like this into the Fedora RPMs, so barring > objections I'm going to review this, commit it into HEAD, and then > make a back-ported patch I can use with 8.4 in Fedora. Whoah, I would caution against doing that without being very confident it won't break when installed under things like VServer, Linux Containers, SELinux configurations, etc, when back-porting it. I don't expect people would be too pleased to discover their "nice, simple, no-expected-issues" upgrade of a minor point release to all of a sudden mean their database doesn't start anymore.. Sorry I havn't got time right now to run down the issue I had with OpenSSH doing something similar, and it might have just been poor coding in OpenSSH, but I wanted to voice my concern. Thanks, Stephen
Stephen Frost <sfrost@snowman.net> writes: > * Tom Lane (tgl@sss.pgh.pa.us) wrote: >> I don't want to go to the trouble of creating (and documenting) a >> configure option for this. Much less a GUC ;-) > Requiring a custom build to disable it would be horrible, in my view. > Or, at best, just means that the packagers won't enable it, which > obviously would be less than ideal. I'm a packager, and I think that this approach is perfectly fine. The place where the rubber meets the road is in the init script, which is the packager's responsibility. If the packager is going to provide an init script that sets oom_adj in the first place, he can turn on the compensation code inside the binary. If not, the compensation code has no purpose anyhow. There are no moving parts in this as far as the end user is concerned. > Sorry if it's a pain, but I think it needs to either be configurable or > not done. As I said before, it definitely needs to handle failure > gracefully, We just ignore any error from the attempt to write to /proc. > but I worry that even that won't be sufficient in some > cases. Just thinking about how we run PG under VServers and Linux > Containers and whatnot, I think you are missing the point that the code won't even be compiled except on platforms where the packager has determined that it's sensible to have it. regards, tom lane
Alex, * Alex Hunsaker (badalex@gmail.com) wrote: > As long as the VM/container you are running under wont kill postmaster > for trying to access proc-- the patch I posted should work fine. It > just ignores any error (I assumed you might be running in a chroot > without proc or some such). As I recall, oom_adj wasn't visible in the container because you explicitly set what proc entries processes can have access to when using VServers, and OpenSSH didn't handle that cleanly. Guess what I'm just saying is "don't just assume everything is as it would be on a 'normal' system when dealing with /proc and friends", and, of course, test, test, test when you're talking about back-porting things. Thanks, Stephen
Tom Lane wrote: > I don't want to go to the trouble of creating (and documenting) a > configure option for this. Much less a GUC ;-) > > What I suggest is that we do something like > > #ifdef LINUX_OOM_ADJ > ... > fprintf(oom, "%d\n", LINUX_OOM_ADJ); > ... > #endif > > Then, somebody who wants the feature would build with, say, > -DLINUX_OOM_ADJ=0 > or another value if they want that. > > +1 for this. Looks like a sound approach. cheers andrew
On Fri, Jan 8, 2010 at 12:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Stephen Frost <sfrost@snowman.net> writes: >> * Tom Lane (tgl@sss.pgh.pa.us) wrote: >>> I don't want to go to the trouble of creating (and documenting) a >>> configure option for this. Much less a GUC ;-) > >> Requiring a custom build to disable it would be horrible, in my view. >> Or, at best, just means that the packagers won't enable it, which >> obviously would be less than ideal. > > I'm a packager, and I think that this approach is perfectly fine. > The place where the rubber meets the road is in the init script, > which is the packager's responsibility. If the packager is going > to provide an init script that sets oom_adj in the first place, > he can turn on the compensation code inside the binary. If not, > the compensation code has no purpose anyhow. There are no moving > parts in this as far as the end user is concerned. There could well be moving parts if the user wants to adjust the value being written to oom_adj, and can't because it's compiled in. I don't see why we can't just add a GUC for this and be done with it. ...Robert
Stephen Frost <sfrost@snowman.net> writes: > * Tom Lane (tgl@sss.pgh.pa.us) wrote: >> I don't want to go to the trouble of creating (and documenting) a >> configure option for this. Much less a GUC ;-) > Requiring a custom build to disable it would be horrible, in my view. BTW, maybe you're confused about the intention here? You'd need a custom build to *enable* it, not to disable it. regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes: > There could well be moving parts if the user wants to adjust the value > being written to oom_adj, and can't because it's compiled in. I don't > see why we can't just add a GUC for this and be done with it. The number of users who will want to do that might be different from epsilon, but not by enough to justify a GUC. Furthermore, we don't have any reasonable infrastructure for supporting platform-specific GUCs, which means that the amount of effort you're proposing is extremely large. regards, tom lane
On Fri, Jan 8, 2010 at 10:24, Stephen Frost <sfrost@snowman.net> wrote: > As I recall, oom_adj wasn't visible in the container because you > explicitly set what proc entries processes can have access to when using > VServers, and OpenSSH didn't handle that cleanly. Guess what I'm just > saying is "don't just assume everything is as it would be on a 'normal' > system when dealing with /proc and friends", and, of course, test, test, > test when you're talking about back-porting things. Sure this was openssh? I just looked through the entire cvs history for opensshp and found 0 references to 'oom' let alone 'oom_adj'. Maybe something distro specific? [ for the curious here is what I tried ] $ git clone git://git.infradead.org/openssh.git $ cd openssh $ git grep oom_adj $ git grep 'oom' $ git grep oom | grep -vi loomis | grep -v room $ git log -p | grep oom | grep -vi loomis | grep -v room | grep -v tsoome
Alex Hunsaker <badalex@gmail.com> writes: > Sure this was openssh? I just looked through the entire cvs history > for opensshp and found 0 references to 'oom' let alone 'oom_adj'. > Maybe something distro specific? FWIW, I see no evidence that sshd on Fedora does anything to change its oom score --- the oom_adj file reads as zero for both the parent daemon and its children. Kinda scary to realize the OOM killer could easily lock me out of boxes I run headless. regards, tom lane
On Fri, Jan 8, 2010 at 12:45, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Alex Hunsaker <badalex@gmail.com> writes: >> Sure this was openssh? I just looked through the entire cvs history >> for opensshp and found 0 references to 'oom' let alone 'oom_adj'. >> Maybe something distro specific? > > FWIW, I see no evidence that sshd on Fedora does anything to change its > oom score --- the oom_adj file reads as zero for both the parent daemon > and its children. Kinda scary to realize the OOM killer could easily > lock me out of boxes I run headless. [ OT, CC trimmed] Yeah, for me sshd has a score of 24 and has the 13th lowest oom_score on my box. While postgres with 0 connections and shared_buffers = 28MB has a score of 26558 and has the 5th highest oom_score. Only chromium and xmonad are above it. With 5 connections it just about doubles its score (to 47238). Had I been headless postgres would certainly die on oom. But even something like alsamixer or bash is higher than sshd *shrug*. For the curious here below is the raw data and how i generated it [ yes its far from perfect... ]: (for file in /proc/*/; do echo `cat $file/oom_score` `cat $file/cmdline` $file; done) | sort -n 6 /sbin/agetty-838400tty1linux /proc/2204/ 6 /sbin/agetty-838400tty2linux /proc/1977/ 6 /sbin/agetty-838400tty3linux /proc/1978/ 6 /usr/sbin/crond /proc/1954/ 6 /usr/sbin/uptimed /proc/1971/ 10 /usr/sbin/irqbalance /proc/1951/ 12 /usr/sbin/ntpd-s /proc/1967/ 12 /usr/sbin/smartd /proc/2128/ 13 hald-addon-input: Listening on /dev/input/event2 /dev/input/event1 /proc/2044/ 13 hald-addon-storage: polling /dev/sr0 (every 2 sec) /proc/2088/ 13 /usr/lib/hal/hald-addon-cpufreq /proc/2092/ 19 /usr/sbin/syslog-ng /proc/1639/ 24 /usr/sbin/sshd /proc/1943/ 26 /usr/lib/sa/sadc-F-L-SDISK6006- /proc/2716/ 27 supervising syslog-ng /proc/1638/ 38 hald-runner /proc/1992/ 53 cat/proc/self//cmdline /proc/self/ 64 /usr/lib/postfix/master /proc/2089/ 90 /usr/bin/X-nolistentcp /proc/2173/ 102 -bash /proc/10997/ 108 /home/alex/.cabal/bin/xmobar-x1 /proc/2190/ 140 /usr/bin/dbus-daemon--fork--print-pid5--print-address7--session /proc/2256/ 140 /usr/bin/dbus-daemon--system /proc/1983/ 191 /usr/lib/hal/hald-addon-acpi /proc/2093/ 194 dbus-launch--autolaunch004d7f457c373938f22d796a4ae05b60--binary-syntax--close-stderr /proc/2255/ 199 /usr/sbin/ntpd-s /proc/1960/ 206 ssh-agentxmonad /proc/2188/ 287 tail-f/var/log/httpd/error.log /proc/8688/ 310 firefox /proc/18571/ 339 xbindkeys /proc/2192/ 341 -bash /proc/10589/ 354 /usr/local/bin/cmus /proc/2205/ 400 /bin/sh/usr/bin/startx /proc/2155/ 456 sort-n /proc/10998/ 487 /usr/sbin/hald /proc/1991/ 525 qmgr-l-tfifo-u /proc/2097/ 639 /bin/sh/home/alex/.xinitrc /proc/2180/ 736 /usr/lib/GConf/gconfd-2 /proc/18578/ 861 tail-f/var/log/httpd/error.log /proc/1621/ 1122 urxvtd-q-f-o /proc/2189/ 1151 /usr/lib/chromium/chromium /proc/2220/ 1172 -bash /proc/22276/ 1351 mutt-y /proc/27637/ 1497 alsamixer /proc/3528/ 1528 ssh192.168.0.15 /proc/1523/ 1534 -bash /proc/2863/ 1534 -bash /proc/2881/ 1534 -bash /proc/2891/ 1793 /usr/lib/chromium/chromium /proc/2219/ 1828 ssh70.98.186.4 /proc/2860/ 2066 pickup-l-tfifo-u /proc/26254/ 2195 postgres: stats collector process /proc/10602/ 2281 -bash /proc/3521/ 2299 -bash /proc/1516/ 2447 -bash /proc/23990/ 2573 -bash /proc/1538/ 3082 /usr/lib/chromium/chromium --channel=2219.aa9eb00.196295212 --type=renderer --lang=en-US --force-fieldtest=AsyncSlowStart/_AsyncSlowStart_off/CacheSize/CacheSizeGroup_6/DnsImpact/_max_500ms_queue_prefetch/GlobalSdch/_global_disable_sdch/SocketLateBinding/_enable_late_binding/ /proc/2392/ 3305 -bash /proc/20490/ 3796 /usr/sbin/httpd-kstart /proc/1553/ 4697 /usr/bin/knotify4 /proc/31084/ 6117 /usr/lib/chromium/chromium --channel=2219.afcb450.251437212 --type=renderer --lang=en-US --force-fieldtest=AsyncSlowStart/_AsyncSlowStart_off/CacheSize/CacheSizeGroup_6/DnsImpact/_max_500ms_queue_prefetch/GlobalSdch/_global_disable_sdch/SocketLateBinding/_enable_late_binding/ /proc/18311/ 6564 /proc/self/exe--channel=2219.a684ce8.115295409--type=extension--lang=en-US--force-fieldtest=AsyncSlowStart/_AsyncSlowStart_off/DnsImpact/_max_500ms_queue_prefetch/GlobalSdch/_global_disable_sdch/SocketLateBinding/_enable_late_binding/ /proc/2242/ 6921 /proc/self/exe--channel=2219.a6862e0.261342415--type=extension--lang=en-US--force-fieldtest=AsyncSlowStart/_AsyncSlowStart_off/DnsImpact/_max_500ms_queue_prefetch/GlobalSdch/_global_disable_sdch/SocketLateBinding/_enable_late_binding/ /proc/2245/ 7453 /proc/self/exe--type=plugin--plugin-path=/usr/lib/mozilla/plugins/libflashplayer.so--lang=en-US--plugin-data-dir=/home/alex/.config/chromium/Default--channel=2219.afd16a58.340779570 /proc/18316/ 10132 -bash /proc/2195/ 10154 postgres: wal writer process /proc/10600/ 10154 postgres: writer process /proc/10599/ 10299 postgres: autovacuum launcher process /proc/10601/ 11019 init [3] /proc/1/ 17332 /usr/sbin/httpd-kstart /proc/1616/ 17365 /usr/sbin/httpd-kstart /proc/1611/ 17366 /usr/sbin/httpd-kstart /proc/1612/ 17398 /usr/sbin/httpd-kstart /proc/1613/ 17567 /usr/sbin/httpd-kstart /proc/1614/ 19954 xinit/home/alex/.xinitrc--/etc/X11/xinit/xserverrc:0-auth/tmp/serverauth.NHLLZS74xg /proc/2172/ 26558 bin/postgres-Dblah /proc/10597/ 26882 /proc/self/exe--channel=2219.b1692d18.1483224477--type=extension--lang=en-US--force-fieldtest=AsyncSlowStart/_AsyncSlowStart_off/CacheSize/CacheSizeGroup_6/DnsImpact/_max_500ms_queue_prefetch/GlobalSdch/_global_disable_sdch/SocketLateBinding/_enable_late_binding/ /proc/1597/ 52201 /usr/lib/chromium/chromium--type=zygote /proc/2221/ 53839 /home/alex/.xmonad/xmonad-i386-linux /proc/2187/ 61530 /usr/lib/chromium/chromium --channel=2219.aa993d0.1197730980 --type=renderer --lang=en-US --force-fieldtest=AsyncSlowStart/_AsyncSlowStart_off/CacheSize/CacheSizeGroup_6/DnsImpact/_max_500ms_queue_prefetch/GlobalSdch/_global_disable_sdch/SocketLateBinding/_enable_late_binding/ /proc/1527/
* Tom Lane (tgl@sss.pgh.pa.us) wrote: > Alex Hunsaker <badalex@gmail.com> writes: > > Sure this was openssh? I just looked through the entire cvs history > > for opensshp and found 0 references to 'oom' let alone 'oom_adj'. > > Maybe something distro specific? > > FWIW, I see no evidence that sshd on Fedora does anything to change its > oom score --- the oom_adj file reads as zero for both the parent daemon > and its children. Kinda scary to realize the OOM killer could easily > lock me out of boxes I run headless. There were a few issues, as it turns out, the particularly annoying one was in the init script which caused upgrades to fail due to sshd not being restarted, bug report here: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=473573 The other issue was with a Debian-specific patch which was applied to OpenSSH which basically just created noise in the log file, bug report here: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=487325 In the end, the problem was with errors being returned from attempts to modify oom_adj. As long as we can just ignore those hopefully there won't be any issues. Thanks, Stephen
Stephen Frost <sfrost@snowman.net> writes: > The other issue was with a Debian-specific patch which was applied to > OpenSSH which basically just created noise in the log file, bug report > here: > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=487325 Hmm, that's pretty interesting, specifically this: : After some discussion with the Linux-Vserver folks, they found some : interesting information I thought it worth adding. First EPERM is not : the error that they expected, and that inside a vserver guest its really : strict about what options you open it with, both O_CREAT and O_TRUNC are : forbidden, and O_WRONLY lets you write 0\n to it. That suggests it might be worth our trouble to use open/write rather than fopen, so that we can ensure the flags are correct to avoid this type of restriction. But in any case the main takeaway seems to be to not insist on the operation succeeding ;-) regards, tom lane
On Fri, Jan 8, 2010 at 15:24, Stephen Frost <sfrost@snowman.net> wrote: > There were a few issues, as it turns out, the particularly annoying one > was in the init script which caused upgrades to fail due to sshd not > being restarted, bug report here: Thanks for the pointers! > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=473573 The changes I proposed to the example linux startup script wont suffer from that. > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=487325 > > In the end, the problem was with errors being returned from attempts to > modify oom_adj. As long as we can just ignore those hopefully there > won't be any issues. Yep sounds good. Thanks again! Tom, sounds like you got busy with other stuff :) Should I submit a new patch that uses open and O_WRONLY?
Alex Hunsaker <badalex@gmail.com> writes: > Tom, sounds like you got busy with other stuff :) Should I submit a > new patch that uses open and O_WRONLY? No, I was just waiting to see if there were more comments. I can take it from here. regards, tom lane
On fre, 2010-01-08 at 11:37 -0500, Tom Lane wrote: > Alex Hunsaker <badalex@gmail.com> writes: > > On Fri, Jan 8, 2010 at 07:27, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> Then, somebody who wants the feature would build with, say, > >> -DLINUX_OOM_ADJ=0 > >> or another value if they want that. > > > Here is a stab at that. > > Anybody have an objection to this basic approach? I'm in a bit of a > hurry to get something like this into the Fedora RPMs, so barring > objections I'm going to review this, commit it into HEAD, and then > make a back-ported patch I can use with 8.4 in Fedora. I find this whole approach a bit evil. If word of this gets out, every server process on Linux will excuse itself from the OOM killer. And then the kernel guys will add another setting to override the process preference. It's an arms race, but maybe that's what's needed.
On Sat, Jan 9, 2010 at 14:06, Peter Eisentraut <peter_e@gmx.net> wrote: > I find this whole approach a bit evil. I would tend to agree but this type of thing has been known since about 2004... See http://thoughts.j-davis.com/2009/11/29/linux-oom-killer/, particularly the comment from Greg Smith. > If word of this gets out, every > server process on Linux will excuse itself from the OOM killer. And > then the kernel guys will add another setting to override the process > preference. Yes, and note debian is already doing that with things like ssh. Who knows what else. (Id be curious to know) Plus maybe it will convince them its time to fix the damn thing. Although postgres really is kind of special in this regard. All the other daemons on my system include X had way lower oom scores. Alsamixer was 3 times more likely to get killed than the first daemon with the highest score (hald) while postgres was 55 times more likely.Yes its the kernel being stupid, but its been knownfor more than 6 years... (oom scores: alsamxier: 1497, hald: 487, postgres: 26558) > It's an arms race, but maybe that's what's needed. Well *shrug* regardless of what core does... Ill certainly be doing it on my postgres linux builds :) Maybe it would convince them more if we could get distros to accept patches that fix the kernel to do correct/better shared mem accounting? May I add good luck? :)
Peter Eisentraut wrote: > On fre, 2010-01-08 at 11:37 -0500, Tom Lane wrote: > >> Alex Hunsaker <badalex@gmail.com> writes: >> >>> On Fri, Jan 8, 2010 at 07:27, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> >>>> Then, somebody who wants the feature would build with, say, >>>> -DLINUX_OOM_ADJ=0 >>>> or another value if they want that. >>>> >>> Here is a stab at that. >>> >> Anybody have an objection to this basic approach? I'm in a bit of a >> hurry to get something like this into the Fedora RPMs, so barring >> objections I'm going to review this, commit it into HEAD, and then >> make a back-ported patch I can use with 8.4 in Fedora. >> > > I find this whole approach a bit evil. If word of this gets out, every > server process on Linux will excuse itself from the OOM killer. And > then the kernel guys will add another setting to override the process > preference. It's an arms race, but maybe that's what's needed. > The trouble is that the OOM heuristics are pretty bad, and many Linux hackers aren't interested in improving them. One of the most prominent told me some years ago "Just turn it off." And the point of this patch is to allow the postmaster to *remove* OOM protection from normal postgres backends. We at least would be playing nice, and not engaging in an arms race. cheers andrew
Alex Hunsaker <badalex@gmail.com> writes: > On Sat, Jan 9, 2010 at 14:06, Peter Eisentraut <peter_e@gmx.net> wrote: >> If word of this gets out, every >> server process on Linux will excuse itself from the OOM killer. And >> then the kernel guys will add another setting to override the process >> preference. > ... maybe it will convince them its time to fix the damn thing. > Although postgres really is kind of special in this regard. Yeah. If they had saner handling of shared-memory accounting, maybe there wouldn't be a need for us to kluge around the OOM logic. regards, tom lane
Alex Hunsaker <badalex@gmail.com> writes: > On Fri, Jan 8, 2010 at 07:27, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Then, somebody who wants the feature would build with, say, >> -DLINUX_OOM_ADJ=0 >> or another value if they want that. > Here is a stab at that. Applied with some editorialization. I concluded that it'd be better to put the oom_adj reset right into fork_process, rather than scattering the support across several different files. The latter seems vulnerable to errors of omission in future versions, and there's no really strong reason to not have all the child processes behave the same. Also, a single-file patch is a lot easier for packagers to borrow and apply to existing releases, should they choose to. (Already done and tested in Fedora packages ...) regards, tom lane