Thread: Limit allocated memory per session
I'd like to propose adding a new GUC to limit the amount of memory a backend can allocate for its own use. The problem this addresses is that sometimes one needs to set work_mem fairly high to get good query plans for large joins. However, some complex queries will then use huge amounts of memory, so that one or a few of them will consume all the memory on the host and run it deep into swap, trigger the OOM killer, or worse.

I've attached a patch based on 8.4.1. It works by keeping track of the total memory allocated via malloc() to AllocBlocks (aset.c). If this is not shot down/up too badly I will rebase it on CVS and submit it for the next commit fest.

I would also like to propose a similar limit on temp space use. It is quite easy for an unintended Cartesian product to use hundreds of gigabytes of scratch space and cause other processes to fail due to lack of disk space. If this is not objectionable, I'll work on it too.

-dg

--
David Gould       daveg@sonic.net      510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.
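To make the mechanism concrete, here is a minimal sketch of the accounting idea, with the caveat that this is not the attached patch: all names and structure below are invented for illustration. The point is that aset.c calls malloc() only when it needs a fresh AllocBlock, so the bookkeeping is one addition and one comparison per block rather than per palloc():

    /*
     * Hypothetical sketch (invented names; not the attached patch).
     * A per-backend counter tracks bytes malloc'd for AllocBlocks and is
     * checked against a GUC-style limit before each new block is grabbed.
     */
    #include <stdio.h>
    #include <stdlib.h>

    static size_t total_allocated = 0;   /* bytes malloc'd for AllocBlocks */
    static size_t backend_mem_limit = 0; /* limit in bytes; 0 = unlimited */

    static void *
    alloc_block_malloc(size_t blksize)
    {
        void *block;

        if (backend_mem_limit > 0 &&
            total_allocated + blksize > backend_mem_limit)
            return NULL;        /* caller raises an out-of-memory error */

        block = malloc(blksize);
        if (block != NULL)
            total_allocated += blksize;
        return block;
    }

    static void
    alloc_block_free(void *block, size_t blksize)
    {
        free(block);
        total_allocated -= blksize;
    }

    int
    main(void)
    {
        void *blk = alloc_block_malloc(8192);

        if (blk == NULL)
            printf("block allocation rejected by limit\n");
        else
            alloc_block_free(blk, 8192);
        return 0;
    }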
daveg <daveg@sonic.net> writes:
> I'd like to propose adding a new GUC to limit the amount of memory a backend
> can allocate for its own use.

Use ulimit.

regards, tom lane
daveg wrote:
> I'd like to propose adding a new GUC to limit the amount of memory a backend
> can allocate for its own use. The problem this addresses is that sometimes
> one needs to set work_mem fairly high to get good query plans for large
> joins. However, some complex queries will then use huge amounts of memory,
> so that one or a few of them will consume all the memory on the host and
> run it deep into swap, trigger the OOM killer, or worse.

Oh, BTW, did anyone get interested in adding the bits to disable the OOM killer for postmaster on the various Linux initscripts? It needs some games with /proc/<pid>/oom_adj and requires root privileges, but I think an initscript is in an excellent position to do it.

--
Alvaro Herrera                          http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Tom Lane wrote:
> daveg <daveg@sonic.net> writes:
>> I'd like to propose adding a new GUC to limit the amount of memory a backend
>> can allocate for its own use.
>
> Use ulimit.

What about platforms (Windows) that don't have ulimit?

--
Euler Taveira de Oliveira
http://www.timbira.com/
In response to Euler Taveira de Oliveira <euler@timbira.com>:
> Tom Lane wrote:
>> daveg <daveg@sonic.net> writes:
>>> I'd like to propose adding a new GUC to limit the amount of memory a backend
>>> can allocate for its own use.
>>
>> Use ulimit.
>
> What about platforms (Windows) that don't have ulimit?

I have a hard time believing that Windows doesn't have a ulimit equivalent. I don't want to degrade this thread into MS-bashing, but if that were the case, it would make Windows a pretty crippled OS.

--
Bill Moran
http://www.potentialtech.com
http://people.collaborativefusion.com/~wmoran/
Euler Taveira de Oliveira <euler@timbira.com> writes:
> Tom Lane wrote:
>> daveg <daveg@sonic.net> writes:
>>> I'd like to propose adding a new GUC to limit the amount of memory a backend
>>> can allocate for its own use.
>>
>> Use ulimit.
>
> What about platforms (Windows) that don't have ulimit?

Get a real operating system ;-)

Seriously, the proposed patch introduces overhead into a place that is already a known hot spot, in return for not much of anything. It will *not* bound backend memory use very accurately, because there is no way to track raw malloc() calls. And I think that 99% of users will not find it useful.

regards, tom lane
On Thu, Oct 1, 2009 at 11:47 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Euler Taveira de Oliveira <euler@timbira.com> writes:
>> Tom Lane wrote:
>>> daveg <daveg@sonic.net> writes:
>>>> I'd like to propose adding a new GUC to limit the amount of memory a backend
>>>> can allocate for its own use.
>>>
>>> Use ulimit.
>>
>> What about platforms (Windows) that don't have ulimit?
>
> Get a real operating system ;-)
>
> Seriously, the proposed patch introduces overhead into a place that is
> already a known hot spot, in return for not much of anything. It will
> *not* bound backend memory use very accurately, because there is no way
> to track raw malloc() calls. And I think that 99% of users will not
> find it useful.

What WOULD be useful is to find a way to configure work_mem per backend rather than per executor node. But that's a much harder problem.

...Robert
Robert Haas <robertmhaas@gmail.com> writes:
> What WOULD be useful is to find a way to configure work_mem per
> backend rather than per executor node. But that's a much harder problem.

I think it's mostly a planner problem: how do you deal with the fact that that would make cost estimates for different sub-problems interrelated? The cost of a hash, for instance, depends a lot on how much memory you assume it can use.

regards, tom lane
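As a rough illustration of that dependence (a simplification, not the actual arithmetic in costsize.c): when the inner relation of a hash join does not fit in work_mem, the join is split into batches, and both inputs must then be written to and re-read from temp files, so the estimated cost jumps sharply right at the fit boundary:

    /*
     * Simplified sketch, not the real costsize.c model: once the inner
     * side of a hash join needs more than one batch, every page of both
     * inputs is written out and read back once, producing a cost cliff
     * at the point where the hash table no longer fits in memory.
     */
    #include <math.h>
    #include <stdio.h>

    static double
    rough_hashjoin_io_cost(double inner_bytes, double outer_bytes,
                           double work_mem_bytes, double seq_page_cost)
    {
        double pages = (inner_bytes + outer_bytes) / 8192.0;
        int    nbatch = (int) ceil(inner_bytes / work_mem_bytes);

        if (nbatch <= 1)
            return 0.0;                     /* everything stays in memory */
        return 2.0 * pages * seq_page_cost; /* write once, read once */
    }

    int
    main(void)
    {
        /* Same join (1GB inner, 4GB outer), two work_mem settings. */
        printf("work_mem 64MB: extra I/O cost %.0f\n",
               rough_hashjoin_io_cost(1e9, 4e9, 64e6, 1.0));
        printf("work_mem 2GB:  extra I/O cost %.0f\n",
               rough_hashjoin_io_cost(1e9, 4e9, 2e9, 1.0));
        return 0;
    }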
Robert Haas wrote:
> On Thu, Oct 1, 2009 at 11:47 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Euler Taveira de Oliveira <euler@timbira.com> writes:
>>> Tom Lane wrote:
>>>> daveg <daveg@sonic.net> writes:
>>>>> I'd like to propose adding a new GUC to limit the amount of memory a backend
>>>>> can allocate for its own use.
>>>> Use ulimit.
>>>>
>>> What about platforms (Windows) that don't have ulimit?
>> Get a real operating system ;-)
>>
>> Seriously, the proposed patch introduces overhead into a place that is
>> already a known hot spot, in return for not much of anything. It will
>> *not* bound backend memory use very accurately, because there is no way
>> to track raw malloc() calls. And I think that 99% of users will not
>> find it useful.
>
> What WOULD be useful is to find a way to configure work_mem per
> backend rather than per executor node. But that's a much harder problem.

I see. The tough problem is: how do we get per-backend memory usage accurately? Is relying on an OS-specific API the only way?

--
Euler Taveira de Oliveira
http://www.timbira.com/
Euler Taveira de Oliveira <euler@timbira.com> writes:
> I see. The tough problem is: how do we get per-backend memory usage
> accurately? Is relying on an OS-specific API the only way?

Given all the third-party libraries (perl, python, libxml2, yadda yadda) that can be in use and won't go through palloc, I think that this would have to be done at the OS level to be very meaningful.

The other problem is the one Robert touched on: what you actually *want* is something entirely different, namely for the backend to actively try to meet an overall target for its memory usage, rather than having queries fail ungracefully when they hit an arbitrary limit that the planner didn't even know about.

regards, tom lane
Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> What WOULD be useful is to find a way to configure work_mem per
>> backend rather than per executor node. But that's a much harder problem.
>
> I think it's mostly a planner problem: how do you deal with the fact
> that that would make cost estimates for different sub-problems
> interrelated? The cost of a hash, for instance, depends a lot on how
> much memory you assume it can use.

It could introduce some complexity, but you could track (subtract) the memory usage as you're walking up the tree. Also, you need to decide what to do when you have more than one node per level. :( How do you deal with priority in that case?

--
Euler Taveira de Oliveira
http://www.timbira.com/
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Oh, BTW, did anyone get interested in adding the bits to disable the OOM
> killer for postmaster on the various Linux initscripts? It needs some
> games with /proc/<pid>/oom_adj and requires root privileges, but I think
> an initscript is in an excellent position to do it.

I was imagining that this would be something for individual distros to tackle. It's probably not portable enough to go into the contrib/start-scripts examples. On the other hand, it'd make lots of sense to have the Fedora or Debian or whatever scripts do this, since they know what kernel version they're targeting. (If anyone wants to send me the fixes to make Fedora's script do this ...)

regards, tom lane
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> I was imagining that this would be something for individual distros
> to tackle. It's probably not portable enough to go into the
> contrib/start-scripts examples. On the other hand, it'd make lots
> of sense to have the Fedora or Debian or whatever scripts do this,
> since they know what kernel version they're targeting. (If anyone
> wants to send me the fixes to make Fedora's script do this ...)

I'm not exactly keen on Debian init scripts hacking kernel settings. Should it hack up the shared memory numbers too? This is not what I would consider 'init script' material for specific applications.

Thanks,

Stephen
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> Oh, BTW, did anyone get interested in adding the bits to disable the OOM
>> killer for postmaster on the various Linux initscripts? It needs some
>> games with /proc/<pid>/oom_adj and requires root privileges, but I think
>> an initscript is in an excellent position to do it.
>
> I was imagining that this would be something for individual distros
> to tackle. It's probably not portable enough to go into the
> contrib/start-scripts examples.

Hmm? I think it should be just (as root)

    if [ -f /proc/$pid_of_postmaster/oom_adj ]; then
        echo -17 > /proc/$pid_of_postmaster/oom_adj
    fi

This is supported from 2.6.11 onwards AFAIK. If the kernel is older than that, the file would not exist and this would be a no-op.

--
Alvaro Herrera                          http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Stephen Frost <sfrost@snowman.net> writes:
> I'm not exactly keen on Debian init scripts hacking kernel settings.
> Should it hack up the shared memory numbers too? This is not what I
> would consider 'init script' material for specific applications.

What was suggested was tweaking the oom_adj setting for the postmaster process only, not messing with any system-wide settings. Do you really find that unreasonable? The default OOM killer behavior is just about as unreasonable as can be :-(

regards, tom lane
On Thu, Oct 01, 2009 at 10:35:55AM -0400, Tom Lane wrote:
> daveg <daveg@sonic.net> writes:
>> I'd like to propose adding a new GUC to limit the amount of memory a backend
>> can allocate for its own use.
>
> Use ulimit.

That was my initial thought too. However, ulimit() is documented as superseded by setrlimit(), which has the option RLIMIT_DATA to limit the size of the data segment. Perfect!

Except, RLIMIT_DATA does not appear to work on Linux. The call succeeds and the new value can even be read back with getrlimit(), but it does not seem to do anything to actually limit the memory allocated. I tested this on SuSE 11 (kernel 2.6.25) and Ubuntu Intrepid (kernel 2.6.28).

Setting RLIMIT_AS to limit the total address space for a process works as expected. However, this seems undesirable for postgresql as it can also cause stack expansion to fail, which would then force a general restart. Also, this limit would interact with the buffer cache size setting, as it includes the shared address space as well.

-dg

--
David Gould       daveg@sonic.net      510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.
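For reference, a minimal version of the test described above might look like the sketch below (illustrative values; not part of the proposed patch). The likely explanation for RLIMIT_DATA being a no-op here is that on Linux it historically constrained only brk()/sbrk() growth, while glibc malloc() serves large requests via mmap(), which that limit did not cover; RLIMIT_AS caps the whole address space, so it does bite:

    /*
     * Illustrative sketch: RLIMIT_AS bounds total address space, so a
     * malloc() larger than the cap should fail with ENOMEM. The same
     * test with RLIMIT_DATA typically changed nothing on these kernels,
     * since glibc satisfies large allocations with mmap().
     */
    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/resource.h>

    int
    main(void)
    {
        struct rlimit rl;
        void   *p;

        /* Cap the address space at 512 MB (arbitrary test value). */
        rl.rlim_cur = 512UL * 1024 * 1024;
        rl.rlim_max = 512UL * 1024 * 1024;
        if (setrlimit(RLIMIT_AS, &rl) != 0)
        {
            perror("setrlimit");
            return 1;
        }

        /* 1 GB exceeds the cap, so this allocation should fail. */
        p = malloc(1024UL * 1024 * 1024);
        if (p == NULL)
            printf("malloc failed as expected: %s\n", strerror(errno));
        else
            printf("malloc unexpectedly succeeded\n");
        return 0;
    }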
On Thu, Oct 01, 2009 at 11:47:43AM -0400, Tom Lane wrote:
> Euler Taveira de Oliveira <euler@timbira.com> writes:
>> Tom Lane wrote:
>>> daveg <daveg@sonic.net> writes:
>>>> I'd like to propose adding a new GUC to limit the amount of memory a backend
>>>> can allocate for its own use.
>>>
>>> Use ulimit.
>
> Seriously, the proposed patch introduces overhead into a place that is
> already a known hot spot, in return for not much of anything.

The overhead is simply an integer addition and compare, with values that are likely already in processor caches. And this only occurs when we actually call malloc() to get a new block, not on every palloc(). So I suspect it will not be noticeable. However, I welcome any suggestion on how to test this and actually measure the overhead, if any. pgbench? Something else?

> It will *not* bound backend memory use very accurately, because there
> is no way to track raw malloc() calls. And I think that 99% of users
> will not find it useful.

The use case that motivated this is a client that runs many postgresql instances with a mostly batch/large-query workload. Some of the queries are code generated by an application and can be very complex. A few times a month one of these will run through 64GB of memory and OOM the host, so it seriously hurts production. Setting work_mem low enough to prevent this results in poor query performance. This client does not use any outside libraries that call malloc() directly.

-dg

--
David Gould       daveg@sonic.net      510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.
On Thu, Oct 1, 2009 at 12:15 PM, Euler Taveira de Oliveira <euler@timbira.com> wrote:
> Robert Haas wrote:
>> On Thu, Oct 1, 2009 at 11:47 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Euler Taveira de Oliveira <euler@timbira.com> writes:
>>>> Tom Lane wrote:
>>>>> daveg <daveg@sonic.net> writes:
>>>>>> I'd like to propose adding a new GUC to limit the amount of memory a backend
>>>>>> can allocate for its own use.
>>>>> Use ulimit.
>>>>>
>>>> What about platforms (Windows) that don't have ulimit?
>>> Get a real operating system ;-)
>>>
>>> Seriously, the proposed patch introduces overhead into a place that is
>>> already a known hot spot, in return for not much of anything. It will
>>> *not* bound backend memory use very accurately, because there is no way
>>> to track raw malloc() calls. And I think that 99% of users will not
>>> find it useful.
>>
>> What WOULD be useful is to find a way to configure work_mem per
>> backend rather than per executor node. But that's a much harder problem.
>>
> I see. The tough problem is: how do we get per-backend memory usage
> accurately? Is relying on an OS-specific API the only way?

As I see it, this is really a planning problem, not an executor problem, so measuring ACTUAL memory usage is not really important: the problem is taking memory usage into account during planning. The difficulty with adjusting work_mem right now is that the correct value depends not only on the number of queries that are concurrently executing (which isn't a constant) but also on the number of sort/hash operations being performed per query (which is also not a constant). So if your queries become more complex, a value of work_mem that was previously OK may start to cause swapping, which encourages setting work_mem conservatively. But setting it conservatively can cause the planner to pick plans that save memory at a LARGE performance cost.

Fixing this isn't simple. Right now, when planning a particular joinrel, we only keep track of the best plans for each possible set of path keys, regardless of how much or how little memory they use. So if we do something naive, like just track the total amount of memory that each candidate path is forecast to use and avoid letting it go above some ceiling, query planning might fail altogether, because the lower-level joinrels use as much memory as they want while the higher-level nodes, which for some reason can't be done without memory, can't be planned. Or we might just end up with a badly suboptimal plan, because we pick a slightly cheaper plan lower down in the tree that uses a LOT more memory over a slightly more expensive one that uses much less. Later we'll wish we hadn't, but by that point it's too late.

Another possible angle of attack is to try to give the planner a range for work_mem rather than a hard limit. The planner would ordinarily construct paths as though the lower end of the range were the limit, but for a sufficiently large cost savings it would be willing to adopt a path that used more memory. Potentially this willingness could also be conditioned on the amount of memory used by the path so far, although that has the same problems described above in kind if not in degree. I'm not really sure whether something like this can be made to work; I'm not sure there's really enough information available when constructing paths for any sort of local decision-making to prove fruitful.
The other idea I have is to adopt a strategy where each plan node has upper and lower bounds on cost, as I previously suggested here with respect to index-only scans:

http://archives.postgresql.org/pgsql-hackers/2009-09/msg01379.php

The idea would basically be to estimate the lower-bound cost of a sort based on the idea that we'll have the maximum possible amount of memory to work with (say, the budget for the whole query), and the upper-bound cost based on the idea that we'll have the minimum possible amount of memory (zero, or whatever the minimal amount is). We can also estimate the most memory we think we can usefully use (for example, a hash join with a smaller inner rel doesn't benefit from more memory than the amount required to hold the entire hash table in memory). After we complete the first round of planning, we look at the resulting paths and decide which sorts or hashes will get funded with how much memory.

I'm hand-waving a little bit here, because there may be a knapsack problem in here (which is NP-complete), since the cost as a function of memory probably has sharp cliffs with not much change in between them - certainly for hashing, and I suspect for sorting as well. But it might be that in practice N is small enough not to matter, or we might be able to find an approximation that is good enough that we can live with it.

Even if we can get past that hurdle, though, there are still all the caveats from the original email, principally that it's unclear that the necessary computations can be done without blowing planning time out of the water. Plus, if we used this strategy for multiple purposes, like positioning of heap fetch nodes and also allocation of work memory, there could be interdependencies that would turn the whole thing into a giant mess.

So to reiterate my first comment: a MUCH harder problem.

...Robert
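To make the hand-waving slightly more concrete, here is a toy greedy version of that "which sorts or hashes get funded" step. All structures and names are invented for illustration; the real planner works nothing like this. Each memory-hungry node advertises the cost it saves if granted its useful maximum of memory, and the query's budget goes to the best savings-per-byte first. A greedy pass is only an approximation of the knapsack problem mentioned above, but it shows the shape of the decision:

    /*
     * Toy sketch of greedy work-memory assignment (invented structures).
     * extra_cost is the node's upper-bound cost minus its lower-bound
     * cost; mem_wanted is the memory needed to reach the lower bound.
     * Nodes are funded in order of cost saved per byte of memory.
     */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct
    {
        const char *label;
        double      extra_cost;
        size_t      mem_wanted;
        size_t      mem_granted;
    } MemNode;

    static int
    by_savings_per_byte(const void *a, const void *b)
    {
        const MemNode *na = a;
        const MemNode *nb = b;
        double ra = na->extra_cost / (double) na->mem_wanted;
        double rb = nb->extra_cost / (double) nb->mem_wanted;

        return (ra < rb) - (ra > rb);   /* sort descending */
    }

    static void
    assign_query_memory(MemNode *nodes, int n, size_t budget)
    {
        int i;

        qsort(nodes, n, sizeof(MemNode), by_savings_per_byte);
        for (i = 0; i < n; i++)
        {
            if (nodes[i].mem_wanted <= budget)
            {
                nodes[i].mem_granted = nodes[i].mem_wanted;
                budget -= nodes[i].mem_wanted;
            }
            /* else the node runs at its minimal-memory (spilling) cost */
        }
    }

    int
    main(void)
    {
        MemNode nodes[] = {
            {"HashJoin", 500.0, (size_t) 64 << 20, 0},
            {"Sort",     200.0, (size_t) 48 << 20, 0},
            {"HashAgg",  400.0, (size_t) 32 << 20, 0},
        };
        int i;

        assign_query_memory(nodes, 3, (size_t) 96 << 20);
        for (i = 0; i < 3; i++)
            printf("%-8s granted %zu MB\n", nodes[i].label,
                   nodes[i].mem_granted >> 20);
        return 0;
    }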