Re: Memory leak from ExecutorState context? - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: Memory leak from ExecutorState context?
Msg-id: 77a96d42-00cb-2448-465a-aa1e92d00cac@enterprisedb.com
In response to: Re: Memory leak from ExecutorState context? (Jehan-Guillaume de Rorthais <jgdr@dalibo.com>)
List: pgsql-hackers
On 3/2/23 13:08, Jehan-Guillaume de Rorthais wrote:
> ...
> [...]
>> But I have another idea - put a breakpoint on makeBufFile() which is the
>> bit that allocates the temp files including the 8kB buffer, and print in
>> what context we allocate that. I have a hunch we may be allocating it in
>> the ExecutorState. That'd explain all the symptoms.
> 
> That's what I was wondering last night as well.
> 
> So, on your advice, I set a breakpoint on makeBufFile:
> 
>   (gdb) info br
>   Num     Type           Disp Enb Address            What
>   1       breakpoint     keep y   0x00000000007229df in makeBufFile
>           bt 10
>           p CurrentMemoryContext.name
> 
> 
> Then, I disabled it and ran the query up to this mem usage:
> 
>    VIRT    RES    SHR S  %CPU %MEM
>   20.1g   7.0g  88504 t   0.0 22.5
> 
> Then, I enabled the breakpoint and looked at around 600 backtraces and
> context names before getting bored. They **all** looked like this:
> 
>   Breakpoint 1, BufFileCreateTemp (...)                at buffile.c:201
>   201     in buffile.c
>   #0  BufFileCreateTemp (...)                             buffile.c:201
>   #1  ExecHashJoinSaveTuple (tuple=0x1952c180, ...)       nodeHashjoin.c:1238
>   #2  ExecHashJoinImpl (parallel=false, pstate=0x31a6418) nodeHashjoin.c:398
>   #3  ExecHashJoin (pstate=0x31a6418)                     nodeHashjoin.c:584
>   #4  ExecProcNodeInstr (node=<optimized out>)            execProcnode.c:462
>   #5  ExecProcNode (node=0x31a6418)
>   #6  ExecSort (pstate=0x31a6308)
>   #7  ExecProcNodeInstr (node=<optimized out>)
>   #8  ExecProcNode (node=0x31a6308)
>   #9  fetch_input_tuple (aggstate=aggstate@entry=0x31a5ea0)
>   
>   $421643 = 0x99d7f7 "ExecutorState"
> 
> These 600-ish 8kB buffers were all allocated in "ExecutorState". I could
> probably log many more of them if more checks/stats need to be collected, but
> it really slows down the query a lot, leaving it only 1-5% of CPU time instead
> of the usual 100%.
> 

Bingo!

> So it's not exactly a leak, as the memory would be released at the end of the
> query, but I suppose they should be allocated in a shorter-lived context,
> to avoid this memory bloat, am I right?
> 

Well, yes and no.

In principle we could/should have allocated the BufFiles in a different
context (possibly hashCxt). But in practice it probably won't make any
difference, because the query will probably run all the hashjoins at the
same time anyway. Imagine a sequence of joins - we build all the hashes,
then tuples from the outer side bubble up through the plans, and only
after processing the last tuple do we release all the hashes.

So allocating them elsewhere would not fix the issue. It'd be helpful for
accounting purposes (we'd know the memory belongs to the BufFiles, and
perhaps to which hashjoin node), but we'd still have to allocate the same
amount of memory (so still OOM).
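To illustrate what I mean by a different context, here's a rough sketch
(not an actual patch) of ExecHashJoinSaveTuple in nodeHashjoin.c creating
the batch file in hashCxt instead of whatever context happens to be
current. Note the extra hashtable parameter is hypothetical - the current
function doesn't get the hashtable passed in, so the callers would have to
change too:

    /*
     * Sketch only: create the per-batch BufFile in the hash table's
     * hashCxt, so the ~8kB buffer is at least accounted to the hashjoin
     * rather than to ExecutorState.
     */
    void
    ExecHashJoinSaveTuple(HashJoinTable hashtable,  /* hypothetical param */
                          MinimalTuple tuple, uint32 hashvalue,
                          BufFile **fileptr)
    {
        BufFile    *file = *fileptr;

        if (file == NULL)
        {
            /* first write to this batch file, so create it in hashCxt */
            MemoryContext oldcxt = MemoryContextSwitchTo(hashtable->hashCxt);

            file = BufFileCreateTemp(false);
            *fileptr = file;

            MemoryContextSwitchTo(oldcxt);
        }

        BufFileWrite(file, (void *) &hashvalue, sizeof(uint32));
        BufFileWrite(file, (void *) tuple, tuple->t_len);
    }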

There's only one thing I think could help - increase work_mem enough not
to trigger the explosive growth in the number of batches. Imagine there's
one very common value, accounting for ~65MB of tuples. All those tuples
hash into the same batch, so doubling the number of batches never gets
that batch under the limit - with work_mem=64MB that's exactly the
explosive growth you're observing here. With 128MB it'd probably run just
fine.

The problem is we don't know how large the work_mem would need to be :-(
So you'll have to try and experiment a bit.

I remembered there was a thread [1] about *exactly* this issue in 2019.

[1]
https://www.postgresql.org/message-id/flat/bc138e9f-c89e-9147-5395-61d51a757b3b%40gusw.net

I even posted a couple of patches that try to address this by accounting
for the BufFile memory, and increasing work_mem a bit instead of just
blindly increasing the number of batches (which ignores the fact that more
batches means more memory used for the BufFile stuff).
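Very roughly, the idea was something like this - a from-memory sketch, not
the actual code from that thread, and the names and the 8kB constant are
illustrative only:

    /*
     * Sketch: before doubling nbatch, compare the extra BufFile buffers
     * that doubling would cost against the tuple memory it might save.
     * If the buffers cost more, bump the allowed space instead of adding
     * batches.
     */
    static bool
    worth_doubling_nbatch(HashJoinTable hashtable)
    {
        /* each new batch needs an inner and an outer BufFile, ~8kB each */
        Size        extra_buffile_mem = (Size) hashtable->nbatch * 2 * 8192;

        /* at best, doubling nbatch pushes half of the tuples to disk */
        Size        saved_tuple_mem = hashtable->spaceUsed / 2;

        if (extra_buffile_mem > saved_tuple_mem)
        {
            /* doubling would make memory use worse - grow the limit */
            hashtable->spaceAllowed += extra_buffile_mem;
            return false;
        }

        return true;
    }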

I don't recall why it went nowhere, TBH. But I recall there were
discussions about maybe doing something like "block nestloop" at the
time, or something. Or maybe the thread just went cold.

>> BTW with how many batches does the hash join start?
> 
> * batches went from 32 to 1048576 before growEnabled was set to false, as
>   suspected
> * original and current nbuckets were set to 1048576 immediately
> * allowed space is set to work_mem, but current space usage is 1.3GB, as
>   measured previously, right before the system refused further memory
>   allocations.
> 

Yeah, I think this is pretty expected. We start with multiple batches, so
we pick optimal buckets for the whole work_mem (so there's no bucket
growth later).

But then the number of batches explodes, in the futile hope of keeping
this within work_mem. Once that growth gets disabled, we end up with a
1.3GB hash table.
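
To put a rough number on why the batch explosion hurts: each batch needs
an inner and an outer BufFile, and each BufFile carries an ~8kB buffer.
With 1048576 batches that's about

  2 * 1048576 * 8kB ~= 16GB

of buffers alone - back-of-the-envelope only (not every batch file
necessarily gets created), but it's in the same ballpark as the ~20GB VIRT
you reported.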



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


