Thread: excessive amounts of consumed memory (RSS), triggering OOM killer
Hi all,

while working on the patch decreasing the amount of memory consumed by array_agg [1], I've run into some strange OOM issues. Reproducing them using the attached SQL script is rather simple.

[1] https://commitfest.postgresql.org/action/patch_view?id=1652

At first I thought there's some rare hidden memory leak, but I'm pretty sure that's not the case. I've even put explicit counters into aset.c, counting allocated/freed blocks, and it all seems to be working just fine (and matching the context tree perfectly). So no memory leak.

The only thing I can think of is some interaction (fragmentation?) with the system allocator (I'm using glibc-2.19 and kernel 3.16.1, BTW), making the RSS values even less useful than I thought. Sadly, it seems to trigger the OOM killer :-(

It's entirely possible that this is known behavior of the allocator and I've simply been unaware of it. It's also true that the work_mem values used here are really excessive; realistic values would be much lower, which would make the issues much less serious.

To demonstrate the problem I'll use the attached SQL script - a rather simple script generating sample tables and then executing trivial array_agg() queries. The script sets work_mem to two different values:

    work_mem = '1GB'    ==> limits the INSERT (generating data)
    work_mem = '1024GB' ==> bogus value, forcing a hash aggregate
                            (assuming the hash table fits into memory)

The size of the sample tables (and the amount of memory needed for the hash aggregate) is determined by the first parameter set in the script. With this value

    \set size 10000

I get an OOM crash on the first execution of the script (on a machine with 8GB of RAM and 512MB shared buffers), but YMMV. The problem is that even with a much smaller dataset (say, size 7500) you'll get an OOM error after several executions of the script; how many executions are needed seems to be inversely proportional to the size of the data set.
The "problem" is that the RSS amount keeps increasing over time for some reason. For example with "size = 5000", the memory stats for the process look like this over the first few minutes:

       VIRT     RES     SHR  %CPU  %MEM    TIME+  COMMAND
    5045508  2,818g  187220  51,9  36,6  0:15.39  postgres: INSERT
    5045508  3,600g  214868  62,8  46,8  3:11.58  postgres: INSERT
    5045508  3,771g  214868  50,9  49,0  3:40.03  postgres: INSERT
    5045508  3,840g  214868  48,5  49,9  4:00.71  postgres: INSERT
    5045508  3,978g  214880  51,5  51,7  4:40.73  postgres: INSERT
    5045508  4,107g  214892  53,2  53,4  5:22.04  postgres: INSERT
    5045508  4,297g  215276  53,9  55,9  6:22.63  postgres: INSERT
    ...

Those are rows for the backend process, captured from "top" over time; the TIME column shows how long the backend has been running. Each iteration takes ~30 seconds, so those lines represent approximately iterations 1, 6, 7, 8, 11, etc.

Notice how the RSS value grows over time, and also notice that this is the INSERT, restricted by work_mem = '1GB'. So the memory consumption should be ~1.5GB, and MemoryContextStats(TopMemoryContext) collected at this point is consistent with that (see the mem-ctx.log). Eventually the value stabilizes at ~4,430g and stops growing.

With size 7500 it however takes only ~20 iterations to reach the OOM issue, with a crash log like this:

    [10560.843547] Killed process 15862 (postgres) total-vm:7198260kB,
    anon-rss:6494136kB, file-rss:300436kB

So, any ideas what might be the culprit here? As I said, this is clearly made worse by inappropriately high work_mem values, but I'm not sure it's completely harmless. Imagine for example long-running backends executing complex queries with inaccurate estimates. That may easily result in using much more memory than the work_mem limit, with the RSS value increasing over time.

regards
Tomas
On 12/01/2014 05:39 PM, Tomas Vondra wrote:
> At first I thought there's some rare hidden memory leak, but I'm pretty
> sure that's not the case. I've even put some explicit counters into
> aset.c counting allocated/freed blocks, and it seems to be working just
> fine (and matching the context tree perfectly). So no memory leak.

Doesn't this line:

    TopMemoryContext: 136614192 total in 16678 blocks; 136005936 free
    (500017 chunks); 608256 used

look pretty suspicious?

cheers

andrew
On 2.12.2014 00:31, Andrew Dunstan wrote:
> Doesn't this line:
>
>     TopMemoryContext: 136614192 total in 16678 blocks; 136005936 free
>     (500017 chunks); 608256 used
>
> look pretty suspicious?

It certainly looks a bit large, no doubt about that, especially considering this is a fresh cluster with a single database containing a single user-created table. For comparison, this is a new backend:

    TopMemoryContext: 78128 total in 11 blocks; 8528 free (16 chunks);
    69600 used

OTOH, it's "just" 130MB, and the RSS values are ~6GB when the OOM hits.

For the record, I only tweaked these settings in the config file:

    test=# select name, setting from pg_settings
           where source like '%configuration file%';
                name            |      setting
    ----------------------------+-------------------
     checkpoint_segments        | 32
     DateStyle                  | ISO, DMY
     default_text_search_config | pg_catalog.simple
     dynamic_shared_memory_type | posix
     lc_messages                | cs_CZ.UTF-8
     lc_monetary                | cs_CZ.UTF-8
     lc_numeric                 | cs_CZ.UTF-8
     lc_time                    | cs_CZ.UTF-8
     log_timezone               | Europe/Prague
     maintenance_work_mem       | 524288
     max_connections            | 100
     shared_buffers             | 65536
     TimeZone                   | Europe/Prague
     work_mem                   | 1048576
    (14 rows)

Some of those are set by initdb, of course.

regards
Tomas
Tomas Vondra <tv@fuzzy.cz> writes:
> On 2.12.2014 00:31, Andrew Dunstan wrote:
>> Doesn't this line:
>> TopMemoryContext: 136614192 total in 16678 blocks; 136005936 free
>> (500017 chunks); 608256 used
>> look pretty suspicious?

> It certainly looks a bit large, especially considering this is a fresh
> cluster with a single DB, containing a single table (created by user),
> no doubt about that.
> OTOH, it's "just" 130MB, and the RSS values are ~6GB when the OOM hits.

Yeah, but what was that 130MB being used for? It's not generally considered good practice to put large or random stuff in TopMemoryContext --- arguably, any code doing so is broken already, because almost by definition such allocations won't be reclaimed on error.

What I suspect you're looking at here is the detritus of creation of a huge number of memory contexts. mcxt.c keeps its own state about existing contexts in TopMemoryContext. So, if we posit that those contexts weren't real small, there's certainly room to believe that there was some major memory bloat going on recently.

			regards, tom lane
On 2.12.2014 01:33, Tom Lane wrote:
> What I suspect you're looking at here is the detritus of creation of
> a huge number of memory contexts. mcxt.c keeps its own state about
> existing contexts in TopMemoryContext. So, if we posit that those
> contexts weren't real small, there's certainly room to believe that
> there was some major memory bloat going on recently.

Aha! MemoryContextCreate allocates the memory for the new context from TopMemoryContext explicitly, so that it survives resets of the parent context. Is that what you had in mind by keeping state about existing contexts?

That'd probably explain the TopMemoryContext size, because array_agg() creates a separate context for each group. So if you have 1M groups, you have 1M contexts. Although I don't see how the size of those contexts would matter?

Maybe we could move this info (the list of child contexts) to the parent context somehow, so that it's freed when the context is destroyed - aside from the regular blocks, of course. But maybe it's not worth the effort, because you can only have so many memory contexts created concurrently. For AllocSet it's at most RAM/1kB (because that's the minimum block size) before the OOM intervenes.
And in this case the contexts have an 8kB initial size, so the number of contexts is even lower. So maybe it's not worth the effort.

Also, this explains the TopMemoryContext size, but not the RSS size (or am I missing something)?

regards
Tomas
Tomas Vondra <tv@fuzzy.cz> writes:
> On 2.12.2014 01:33, Tom Lane wrote:
>> What I suspect you're looking at here is the detritus of creation of
>> a huge number of memory contexts. mcxt.c keeps its own state about
>> existing contexts in TopMemoryContext. So, if we posit that those
>> contexts weren't real small, there's certainly room to believe that
>> there was some major memory bloat going on recently.

> Aha! MemoryContextCreate allocates the memory for the new context from
> TopMemoryContext explicitly, so that it survives resets of the parent
> context. Is that what you had in mind by keeping state about existing
> contexts?

Right.

> That'd probably explain the TopMemoryContext size, because array_agg()
> creates separate context for each group. So if you have 1M groups, you
> have 1M contexts. Although I don't see how the size of those contexts
> would matter?

Well, if they're each 6K, that's your 6GB right there.

> Maybe we could move this info (list of child contexts) to the parent
> context somehow, so that it's freed when the context is destroyed?

We intentionally didn't do that, because in many cases it'd result in parent contexts becoming nonempty when otherwise they'd never have anything actually in them. The idea was that such "shell" parent contexts should be cheap, requiring only a control block in TopMemoryContext and not an actual allocation arena. This idea of a million separate child contexts was never part of the design of course; we might need to rethink whether that's a good idea. Or maybe there need to be two different policies about where to put child control blocks.

> Also, this explains the TopMemoryContext size, but not the RSS size (or
> am I missing something)?

Very possibly you're left with "islands" that prevent reclaiming very much of the peak RAM usage. It'd be hard to be sure without some sort of memory map, of course.

			regards, tom lane
On 2014-12-02 02:52, Tom Lane wrote:
> Tomas Vondra <tv@fuzzy.cz> writes:
>> That'd probably explain the TopMemoryContext size, because array_agg()
>> creates separate context for each group. So if you have 1M groups, you
>> have 1M contexts. Although I don't see how the size of those contexts
>> would matter?
>
> Well, if they're each 6K, that's your 6GB right there.

Yeah, but this memory should be freed after the query finishes, no?

>> Maybe we could move this info (list of child contexts) to the parent
>> context somehow, so that it's freed when the context is destroyed?
>
> We intentionally didn't do that, because in many cases it'd result in
> parent contexts becoming nonempty when otherwise they'd never have
> anything actually in them. The idea was that such "shell" parent
> contexts should be cheap, requiring only a control block in
> TopMemoryContext and not an actual allocation arena. This idea of a
> million separate child contexts was never part of the design of course;
> we might need to rethink whether that's a good idea. Or maybe there
> need to be two different policies about where to put child control
> blocks.

Maybe. For me, the 130MB is not really a big deal, because for this to happen there really need to be many child contexts at the same time, consuming much more memory.
With 6.5GB consumed in total, 130MB amounts to ~2%, which is negligible - unless we can fix the RSS bloat.

>> Also, this explains the TopMemoryContext size, but not the RSS size
>> (or am I missing something)?
>
> Very possibly you're left with "islands" that prevent reclaiming very
> much of the peak RAM usage. It'd be hard to be sure without some sort
> of memory map, of course.

Yes, that's something I was thinking about too - I believe what happens is that the allocations of info in TopMemoryContext and the actual contexts are interleaved, and at the end only the memory contexts are deleted. The blocks allocated in TopMemoryContext are kept, creating the "islands".

If that's the case, allocating the child context info within the parent context would solve this, because those pieces would be reclaimed along with the rest of the parent's memory. But then again, there are probably other ways to create such islands (e.g. allocating an additional block in a long-lived context while the child contexts exist).

regards
Tomas
On 2 December 2014, 10:59, Tomas Vondra wrote:
> On 2014-12-02 02:52, Tom Lane wrote:
>> Very possibly you're left with "islands" that prevent reclaiming very
>> much of the peak RAM usage. It'd be hard to be sure without some sort
>> of memory map, of course.
>
> Yes, that's something I was thinking about too - I believe what happens
> is that allocations of info in TopMemoryContext and the actual contexts
> are interleaved, and at the end only the memory contexts are deleted.
> The blocks allocated in TopMemoryContext are kept, creating the
> "islands".

On second thought, I'm not sure this explains the continuous increase of consumed memory. When the first iteration consumes 2,818g of memory, why should the following iterations consume significantly more? The allocation patterns should be (almost) exactly the same, reusing the already allocated memory (either at the system level or at the TopMemoryContext level).

regards
Tomas
On 2.12.2014 02:52, Tom Lane wrote:
> Tomas Vondra <tv@fuzzy.cz> writes:
>> That'd probably explain the TopMemoryContext size, because
>> array_agg() creates separate context for each group. So if you have
>> 1M groups, you have 1M contexts. Although I don't see how the size
>> of those contexts would matter?
>
> Well, if they're each 6K, that's your 6GB right there.

FWIW, I tried to repeat the test with the array_agg() patch submitted to the next CF:

    http://www.postgresql.org/message-id/5479FAEA.3070501@fuzzy.cz

and the memory context stats after the first iteration look like this:

    master (unpatched)
    ------------------
    TopMemoryContext: 204878128 total in 25011 blocks; 204010192 free
    (750034 chunks); 867936 used

    patched with array_agg
    ----------------------
    TopMemoryContext: 78128 total in 11 blocks; 10144 free (34 chunks);
    67984 used

Also, the RSS issue (which was the initial subject of this thread) pretty much disappeared. So indeed, this seems to be caused by the "islands" referenced from TopMemoryContext.

I wonder whether it'd be worth looking for other places doing something similar to array_agg(), potentially creating many contexts in parallel, and maybe modifying them as in the patch.

regards
Tomas