Re: Memory leak from ExecutorState context? - Mailing list pgsql-hackers
From:            Jehan-Guillaume de Rorthais
Subject:         Re: Memory leak from ExecutorState context?
Date:
Msg-id:          20230301184840.0a897a80@karst
In response to:  Re: Memory leak from ExecutorState context? (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses:       Re: Memory leak from ExecutorState context?
                 Re: Memory leak from ExecutorState context?
List:            pgsql-hackers
Hi Tomas,

On Tue, 28 Feb 2023 20:51:02 +0100 Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

> On 2/28/23 19:06, Jehan-Guillaume de Rorthais wrote:
> > * HashBatchContext goes up to 1441MB after 240s then stays flat until the end
> >   (400s as the last record)
>
> That's interesting. We're using HashBatchContext for very few things, so how
> could it consume so much memory? But e.g. the number of buckets should be
> limited by work_mem, so how could it get to 1.4GB?
>
> Can you break at ExecHashIncreaseNumBatches/ExecHashIncreaseNumBuckets
> and print how many batches/buckets are there?

I did this test this morning.

Batches and buckets increased really quickly to 1048576/1048576.

ExecHashIncreaseNumBatches was really chatty, with hundreds of thousands of
calls, always short-circuited to 1048576, I guess because of the conditional
block «/* safety check to avoid overflow */» appearing early in this function.
I disabled the breakpoint on ExecHashIncreaseNumBatches a few times to make
the query run faster. After re-enabling it at 19.1GB of memory consumption, it
stayed silent until memory exhaustion (around 21 or 22GB, I don't remember
exactly).

The breakpoint on ExecHashIncreaseNumBuckets triggered a few times at the
beginning, and one last time close to the end of the query execution.

> > Any idea? How could I help to have a better idea if a leak is actually
> > occurring and where exactly?
>
> Investigating memory leaks is tough, especially for generic memory
> contexts like ExecutorState :-( Even more so when you can't reproduce it
> on a machine with custom builds.
>
> What I'd try is this:
>
> 1) attach breakpoints to all returns in AllocSetAlloc(), printing the
>    pointer and size for ExecutorState context, so something like
>
>      break aset.c:783 if strcmp("ExecutorState",context->header.name) == 0
>      commands
>      print MemoryChunkGetPointer(chunk) size
>      cont
>      end
>
> 2) do the same for AllocSetFree()
>
> 3) Match the palloc/pfree calls (using the pointer address), to
>    determine which ones are not freed and do some stats on the size.
>    Usually there's only a couple distinct sizes that account for most of
>    the leaked memory.

So here is what I ended up with this afternoon, using files, lines and macros
from REL_11_18:

  set logging on
  set pagination off

  break aset.c:781 if strcmp("ExecutorState",context.name) == 0
  commands 1
  print (((char *)(chunk)) + sizeof(struct AllocChunkData))
  print chunk->size
  cont
  end

  break aset.c:820 if strcmp("ExecutorState",context.name) == 0
  commands 2
  print (((char *)(chunk)) + sizeof(struct AllocChunkData))
  print chunk->size
  cont
  end

  break aset.c:979 if strcmp("ExecutorState",context.name) == 0
  commands 3
  print (((char *)(chunk)) + sizeof(struct AllocChunkData))
  print chunk->size
  cont
  end

  break AllocSetFree if strcmp("ExecutorState",context.name) == 0
  commands 4
  print pointer
  cont
  end

So far, gdb has had more than 3h of CPU time and is eating 2.4GB of memory.
The backend has had only 3'32" of CPU time:

     VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  2727284   2.4g  17840 R  99.0  7.7 181:25.07 gdb
  9054688 220648 103056 t   1.3  0.7   3:32.05 postmaster

Interestingly, the RES memory of the backend has not exploded yet, but VIRT
is already high. I suppose the query will run for a few more hours; hopefully,
gdb will not exhaust the memory in the meantime...
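For the matching itself, I post-process the gdb log with a small script. It is
along the lines of the sketch below (only a sketch: it assumes the log
alternates "Breakpoint N," header lines with the $-values printed by the
breakpoint commands above, so the regexes will probably need adjusting to the
real output):

  import re
  import sys
  from collections import OrderedDict

  bp_re   = re.compile(r'^Breakpoint (\d+),')
  ptr_re  = re.compile(r'^\$\d+ = (?:\(void \*\) )?(0x[0-9a-f]+)')
  size_re = re.compile(r'^\$\d+ = (\d+)\s*$')

  live = OrderedDict()   # pointer -> requested size, for chunks not freed yet
  cur_bp = None          # which breakpoint produced the next $-values
  cur_ptr = None

  with open(sys.argv[1]) as log:
      for line in log:
          m = bp_re.match(line)
          if m:
              cur_bp, cur_ptr = int(m.group(1)), None
              continue
          m = ptr_re.match(line)
          if m and cur_ptr is None:
              cur_ptr = m.group(1)
              if cur_bp == 4:                   # AllocSetFree: drop the chunk
                  live.pop(cur_ptr, None)
              continue
          m = size_re.match(line)
          if m and cur_bp in (1, 2, 3) and cur_ptr is not None:
              live[cur_ptr] = int(m.group(1))   # AllocSetAlloc: remember it

  print(len(live), "chunks never freed,", sum(live.values()), "bytes in total")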
You'll find some intermediate stats I already collected in the attachment:

* break 1, 2 and 3 are from AllocSetAlloc, break 4 is from AllocSetFree
* most of the non-freed chunks were allocated at the very beginning, before
  the 5000th allocation call, out of almost 1M calls so far
* 3754 of them have a chunk->size of 0
* it seems there are some buggy stats or data:

    # this one actually really comes from the gdb log
    0x38a77b8: break=3 num=191 sz=4711441762604810240 (weird sz)

    # this one might be a bug in my script
    0x2: break=2 num=945346 sz=2 (weird address)

* ignoring the weird size requested during the 191st call, the total amount
  of non-freed memory is currently 5488MB

I couldn't print "size" as it is optimized away; that's why I tracked
chunk->size instead... Is there anything wrong with my current run and gdb log?

The gdb log is 5MB compressed. I'll keep it off-list, but I can provide it if
needed.

Stay tuned...

Thank you!
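PS: for the "stats on the size" part of your point 3, the aggregation I have
in mind is roughly the following (again just a sketch; "live" is the
pointer -> size mapping built while matching the log, and the 1GB cut-off is
only an arbitrary way to discard corrupted entries like the two shown above):

  from collections import Counter

  def leak_stats(live, max_sane_size=1 << 30):
      # Keep only plausible entries: positive sizes below an arbitrary 1GB
      # cap, to filter out corrupted records such as sz=4711441762604810240.
      sane = {ptr: sz for ptr, sz in live.items() if 0 < sz <= max_sane_size}
      by_size = Counter(sane.values())
      total = sum(sz * n for sz, n in by_size.items())
      print("total non-freed: %.0f MB" % (total / 1024.0 / 1024.0))
      for sz, n in by_size.most_common(10):
          print("  size %10d: %d chunks" % (sz, n))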