Re: Memory leak from ExecutorState context? - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Memory leak from ExecutorState context?
Date
Msg-id f532d33d-e4c1-ed00-e047-94f3aafa49db@enterprisedb.com
In response to Re: Memory leak from ExecutorState context?  (Jehan-Guillaume de Rorthais <jgdr@dalibo.com>)
Responses Re: Memory leak from ExecutorState context?  (Jehan-Guillaume de Rorthais <jgdr@dalibo.com>)
List pgsql-hackers

On 3/28/23 15:17, Jehan-Guillaume de Rorthais wrote:
> On Tue, 28 Mar 2023 00:43:34 +0200
> Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> 
>> On 3/27/23 23:13, Jehan-Guillaume de Rorthais wrote:
>>> Please, find in attachment a patch to allocate bufFiles in a dedicated
>>> context. I picked up your patch, backpatched it, went through it and did
>>> some minor changes to it. I have some comments/questions though.
>>>
>>>   1. I'm not sure why we must allocate the "HashBatchFiles" new context
>>>   under ExecutorState and not under hashtable->hashCxt?
>>>
>>>   The only references I could find was in hashjoin.h:30:
>>>
>>>    /* [...]
>>>     * [...] (Exception: data associated with the temp files lives in the
>>>     * per-query context too, since we always call buffile.c in that
>>> context.)
>>>
>>>   And in nodeHashjoin.c:1243:ExecHashJoinSaveTuple() (I reworded this
>>>   original comment in the patch):
>>>
>>>    /* [...]
>>>     * Note: it is important always to call this in the regular executor
>>>     * context, not in a shorter-lived context; else the temp file buffers
>>>     * will get messed up.
>>>
>>>
>>>   But these don't explain why BufFile-related allocations must be
>>> under a per-query context. 
>>>   
>>
>> Doesn't that simply describe the current (unpatched) behavior where
>> BufFile is allocated in the per-query context? 
> 
> I wasn't sure. The first quote from hashjoin.h seems to describe a stronger
> rule about «**always** call buffile.c in the per-query context». But maybe it
> ought to be «always call buffile.c from one of the sub-query contexts»? I
> assume the aim is to enforce the temp files' removal on query end/error?
> 

I don't think we need this info for tempfile cleanup - CleanupTempFiles
relies on the VfdCache, which does malloc/realloc (so it's not using
memory contexts at all).

I'm not very familiar with this part of the code, so I might be missing
something. But you can try that - just stick an elog(ERROR) somewhere
into the hashjoin code, and see if that breaks the cleanup.
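
For instance, something like this (purely a throwaway debugging hack, the
exact spot is up to you):

```c
/* e.g. somewhere in ExecHashIncreaseNumBatches() in nodeHash.c, test only */
elog(ERROR, "forced error to verify temp file cleanup");
```

and then check that the files under base/pgsql_tmp are gone after the query
aborts.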

Not an explicit proof, but if there was some hard-wired requirement in
which memory context to allocate BufFile stuff, I'd expect it to be
documented in buffile.c. But that actually says this:

 * Note that BufFile structs are allocated with palloc(), and therefore
 * will go away automatically at query/transaction end.  Since the
 * underlying virtual Files are made with OpenTemporaryFile, all resources
 * for the file are certain to be cleaned up even if processing is aborted
 * by ereport(ERROR).  The data structures required are made in the
 * palloc context that was current when the BufFile was created, and
 * any external resources such as temp files are owned by the ResourceOwner
 * that was current at that time.

which I take as confirmation that it's legal to allocate BufFile in any
memory context, and that cleanup is handled by the cache in fd.c.
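
In other words, I'd expect something like this to be perfectly legal (just a
sketch, using the fileCxt field from your patch):

```c
MemoryContext oldcxt;

/* allocate the BufFile struct in the dedicated context ... */
oldcxt = MemoryContextSwitchTo(hashtable->fileCxt);
file = BufFileCreateTemp(false);
MemoryContextSwitchTo(oldcxt);

/*
 * ... while cleanup of the temp file on ERROR is still handled by the
 * VfdCache / ResourceOwner machinery in fd.c, not by memory contexts.
 */
```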


>> I mean, the current code calls BufFileCreateTemp() without switching the
>> context, so it's in the ExecutorState. But with the patch it very clearly is
>> not.
>>
>> And I'm pretty sure the patch should do
>>
>>     hashtable->fileCxt = AllocSetContextCreate(hashtable->hashCxt,
>>                                          "HashBatchFiles",
>>                                          ALLOCSET_DEFAULT_SIZES);
>>
>> and it'd still work. Or why do you think we *must* allocate it under
>> ExecutorState?
> 
> That was actually my very first patch and it indeed worked. But I was confused
> by the previously quoted code comments. That's why I kept your original code
> and decided to raise the discussion here.
> 

IIRC I was just lazy when writing the experimental patch, there was not
much thought about stuff like this.

> Fixed in new patch in attachment.
> 
>> FWIW The comment in hashjoin.h needs updating to reflect the change.
> 
> Done in the last patch. Is my rewording accurate?
> 
>>>   2. Wrapping each call of ExecHashJoinSaveTuple() with a memory context
>>> switch seems fragile, as it could be forgotten in future code paths/changes.
>>> So I added an Assert() in the function to make sure the current memory
>>> context is "HashBatchFiles" as expected.
>>>   Another way to tie this up might be to pass the memory context as an
>>> argument to the function.
>>>   ... Or maybe I'm being overly cautious.
>>>   
>>
>> I'm not sure I'd call that fragile, we have plenty of other code that
>> expects the memory context to be set correctly. Not sure about the
>> assert, but we don't have similar asserts anywhere else.
> 
> I mostly stuck it there to stimulate the discussion around this, as I needed
> to scratch that itch.
> 
>> But I think it's just ugly and overly verbose
> 
> +1
> 
> Your patch was just a demo/debug patch at the time. It needed some cleanup now
> :)
> 
>> it'd be much nicer to e.g. pass the memory context as a parameter, and do
>> the switch inside.
> 
> That was a proposition in my previous mail, so I did it in the new patch. Let's
> see what other reviewers think.
> 

+1
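
Roughly like this, I mean (just a sketch of the signature change, not the
actual patch):

```c
void
ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,
                      BufFile **fileptr, MemoryContext filecxt)
{
    BufFile    *file = *fileptr;

    if (file == NULL)
    {
        /* First write to this batch file, so create it in filecxt. */
        MemoryContext oldcxt = MemoryContextSwitchTo(filecxt);

        file = BufFileCreateTemp(false);
        *fileptr = file;
        MemoryContextSwitchTo(oldcxt);
    }

    BufFileWrite(file, (void *) &hashvalue, sizeof(uint32));
    BufFileWrite(file, (void *) tuple, tuple->t_len);
}
```

so the callers don't need to worry about the current memory context at all.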

>>>   3. You wrote:
>>>   
>>>>> A separate BufFile memory context helps, although people won't see it
>>>>> unless they attach a debugger, I think. Better than nothing, but I was
>>>>> wondering if we could maybe print some warnings when the number of batch
>>>>> files gets too high ...  
>>>
>>>   So I added a WARNING for when the batch files exhaust the allowed
>>>   memory size.
>>>
>>>    +   if (hashtable->fileCxt->mem_allocated > hashtable->spaceAllowed)
>>>    +       elog(WARNING, "Growing number of hash batch is exhausting
>>> memory");
>>>
>>>   This is repeated on each call of ExecHashIncreaseNumBatches when the
>>>   BufFiles overflow the memory budget. I realize now I should probably add
>>>   the memory limit, the current number of batches and their memory
>>>   consumption. The message is probably too cryptic for a user. It could
>>>   probably be reworded, but some doc or an additional hint around this
>>>   message might help.
>>>   
>>
>> Hmmm, not sure if WARNING is a good approach, but I don't have a better
>> idea at the moment.
> 
> I stepped it down to NOTICE and added some more info.
> 
> Here is the output of the last patch with a 1MB work_mem:
> 
>   =# explain analyze select * from small join large using (id);
>   WARNING:  increasing number of batches from 1 to 2
>   WARNING:  increasing number of batches from 2 to 4
>   WARNING:  increasing number of batches from 4 to 8
>   WARNING:  increasing number of batches from 8 to 16
>   WARNING:  increasing number of batches from 16 to 32
>   WARNING:  increasing number of batches from 32 to 64
>   WARNING:  increasing number of batches from 64 to 128
>   WARNING:  increasing number of batches from 128 to 256
>   WARNING:  increasing number of batches from 256 to 512
>   NOTICE:  Growing number of hash batch to 512 is exhausting allowed memory
>            (2164736 > 2097152)
>   WARNING:  increasing number of batches from 512 to 1024
>   NOTICE:  Growing number of hash batch to 1024 is exhausting allowed memory
>            (4329472 > 2097152)
>   WARNING:  increasing number of batches from 1024 to 2048
>   NOTICE:  Growing number of hash batch to 2048 is exhausting allowed memory
>            (8626304 > 2097152)
>   WARNING:  increasing number of batches from 2048 to 4096
>   NOTICE:  Growing number of hash batch to 4096 is exhausting allowed memory
>            (17252480 > 2097152)
>   WARNING:  increasing number of batches from 4096 to 8192
>   NOTICE:  Growing number of hash batch to 8192 is exhausting allowed memory
>            (34504832 > 2097152)
>   WARNING:  increasing number of batches from 8192 to 16384
>   NOTICE:  Growing number of hash batch to 16384 is exhausting allowed memory
>            (68747392 > 2097152)
>   WARNING:  increasing number of batches from 16384 to 32768
>   NOTICE:  Growing number of hash batch to 32768 is exhausting allowed memory
>            (137494656 > 2097152)
> 
>                                 QUERY PLAN
>   --------------------------------------------------------------------------
>   Hash Join  (cost=6542057.16..7834651.23 rows=7 width=74)
>              (actual time=558502.127..724007.708 rows=7040 loops=1)
>              Hash Cond: (small.id = large.id)
>   ->  Seq Scan on small (cost=0.00..940094.00 rows=94000000 width=41)
>                         (actual time=0.035..3.666 rows=10000 loops=1)
>   ->  Hash (cost=6542057.07..6542057.07 rows=7 width=41)
>            (actual time=558184.152..558184.153 rows=700000000 loops=1) 
>            Buckets: 32768 (originally 1024)
>            Batches: 32768 (originally 1)
>            Memory Usage: 1921kB
>     -> Seq Scan on large (cost=0.00..6542057.07 rows=7 width=41)
>                          (actual time=0.324..193750.567 rows=700000000 loops=1)
>   Planning Time: 1.588 ms
>   Execution Time: 724011.074 ms (8 rows)
> 
> Regards,

OK, although NOTICE may actually make it less useful - the default log
level is WARNING, and regular users are unable to change it. So very few
people will actually see these messages.
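
If it stays at WARNING, maybe an ereport() with the details moved into
errdetail would carry enough information - something along these lines (a
sketch only, reusing the fields from your patch):

```c
if (hashtable->fileCxt->mem_allocated > hashtable->spaceAllowed)
    ereport(WARNING,
            (errmsg("hash batch files are exhausting allowed memory"),
             errdetail("Batches: %d, batch file memory: %lld bytes, limit: %lld bytes.",
                       hashtable->nbatch,
                       (long long) hashtable->fileCxt->mem_allocated,
                       (long long) hashtable->spaceAllowed)));
```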


thanks

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


