Re: Memory leak from ExecutorState context? - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: Memory leak from ExecutorState context?
Date:
Msg-id: 77a96d42-00cb-2448-465a-aa1e92d00cac@enterprisedb.com
In response to: Re: Memory leak from ExecutorState context? (Jehan-Guillaume de Rorthais <jgdr@dalibo.com>)
Responses: Re: Memory leak from ExecutorState context?
List: pgsql-hackers
On 3/2/23 13:08, Jehan-Guillaume de Rorthais wrote:
> ...
> [...]
>> But I have another idea - put a breakpoint on makeBufFile() which is the
>> bit that allocates the temp files including the 8kB buffer, and print in
>> what context we allocate that. I have a hunch we may be allocating it in
>> the ExecutorState. That'd explain all the symptoms.
>
> That's what I was wondering as well yesterday night.
>
> So, on your advice, I set a breakpoint on makeBufFile:
>
>   (gdb) info br
>   Num     Type           Disp Enb Address            What
>   1       breakpoint     keep y   0x00000000007229df in makeBufFile
>           bt 10
>           p CurrentMemoryContext.name
>
> Then, I disabled it and ran the query up to this mem usage:
>
>   VIRT    RES     SHR  S  %CPU  %MEM
>   20.1g   7.0g  88504  t   0.0  22.5
>
> Then, I enabled the breakpoint and looked at around 600 bt and context names
> before getting bored. They **all** looked like that:
>
>   Breakpoint 1, BufFileCreateTemp (...) at buffile.c:201
>   201     in buffile.c
>   #0  BufFileCreateTemp (...) buffile.c:201
>   #1  ExecHashJoinSaveTuple (tuple=0x1952c180, ...) nodeHashjoin.c:1238
>   #2  ExecHashJoinImpl (parallel=false, pstate=0x31a6418) nodeHashjoin.c:398
>   #3  ExecHashJoin (pstate=0x31a6418) nodeHashjoin.c:584
>   #4  ExecProcNodeInstr (node=<optimized out>) execProcnode.c:462
>   #5  ExecProcNode (node=0x31a6418)
>   #6  ExecSort (pstate=0x31a6308)
>   #7  ExecProcNodeInstr (node=<optimized out>)
>   #8  ExecProcNode (node=0x31a6308)
>   #9  fetch_input_tuple (aggstate=aggstate@entry=0x31a5ea0)
>
>   $421643 = 0x99d7f7 "ExecutorState"
>
> These 600-ish 8kB buffers were all allocated in "ExecutorState". I could
> probably log much more of them if more checks/stats need to be collected, but
> it really slows down the query a lot, granting it only 1-5% of CPU time
> instead of the usual 100%.

Bingo!

> So it's not exactly a leakage, as memory would be released at the end of the
> query, but I suppose they should be allocated in a shorter-lived context,
> to avoid this memory bloat, am I right?

Well, yeah and no. In principle we could/should have allocated the BufFiles
in a different context (possibly hashCxt). But in practice it probably won't
make any difference, because the query will probably run all the hashjoins
at the same time. Imagine a sequence of joins - we build all the hashes, and
then tuples from the outer side bubble up through the plans. And then you
process the last tuple and release all the hashes.

So this would not fix the issue. It'd be helpful for accounting purposes
(we'd know it's the buffiles, and perhaps for which hashjoin node), but we'd
still have to allocate the memory etc. (so still OOM).

There's only one thing I think could help - increase work_mem enough not to
trigger the explosive growth in the number of batches. Imagine there's one
very common value, accounting for ~65MB of tuples. With work_mem=64MB this
leads to exactly the explosive growth you're observing here. With 128MB it'd
probably run just fine.

The problem is we don't know how large the work_mem would need to be :-( So
you'll have to try and experiment a bit.
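For example, something like this - just a sketch of the experiment, where the
256MB value is only a guess for illustration (not a recommendation) and the
SELECT stands in for the actual query:

    SET work_mem = '256MB';    -- try successively larger values

    -- re-run the offending query and check whether the Hash node in
    -- EXPLAIN ANALYZE still reports an exploding "Batches" count
    EXPLAIN (ANALYZE, BUFFERS) SELECT ...;

    RESET work_mem;

Once some value keeps the number of batches reasonable, the 8kB-per-batch
BufFile overhead should stop being the dominant memory consumer.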
I remembered there was a thread [1] about *exactly* this issue in 2019.

[1] https://www.postgresql.org/message-id/flat/bc138e9f-c89e-9147-5395-61d51a757b3b%40gusw.net

I even posted a couple of patches that try to address this by accounting for
the BufFile memory, and increasing work_mem a bit instead of just blindly
increasing the number of batches (ignoring the fact that this means more
memory will be used for the BufFile stuff).

I don't recall why it went nowhere, TBH. But I recall there were discussions
about maybe doing something like "block nestloop" at the time, or something.
Or maybe the thread just went cold.

>> BTW with how many batches does the hash join start?
>
> * batches went from 32 to 1048576 before being growEnabled=false, as suspected
> * original and current nbuckets were set to 1048576 immediately
> * allowed space is set to the work_mem, but current space usage is 1.3GB, as
>   measured previously, close before the system refused more memory allocation.

Yeah, I think this is pretty expected. We start with multiple batches, so we
pick optimal buckets for the whole work_mem (so no growth here). But then the
number of batches explodes, in the futile hope of keeping things within
work_mem. Once that growth gets disabled, we end up with a 1.3GB hash table.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company