Re: Memory leak from ExecutorState context? - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Memory leak from ExecutorState context?
Date
Msg-id 41c5766d-ed71-b70c-bbbc-d3396c462d62@enterprisedb.com
In response to Re: Memory leak from ExecutorState context?  (Jehan-Guillaume de Rorthais <jgdr@dalibo.com>)
Responses Re: Memory leak from ExecutorState context?  (Jehan-Guillaume de Rorthais <jgdr@dalibo.com>)
List pgsql-hackers
On 3/2/23 00:18, Jehan-Guillaume de Rorthais wrote:
> Hi,
> 
> On Wed, 1 Mar 2023 20:29:11 +0100
> Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>> On 3/1/23 18:48, Jehan-Guillaume de Rorthais wrote:
>>> On Tue, 28 Feb 2023 20:51:02 +0100
>>> Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:  
>>>> On 2/28/23 19:06, Jehan-Guillaume de Rorthais wrote:  
>>>>> * HashBatchContext goes up to 1441MB after 240s, then stays flat until the
>>>>> end (400s as the last record)
>>>>
>>>> That's interesting. We're using HashBatchContext for very few things, so
>>>> how could it consume so much memory? But e.g. the number of buckets
>>>> should be limited by work_mem, so how could it get to 1.4GB?
>>>>
>>>> Can you break at ExecHashIncreaseNumBatches/ExecHashIncreaseNumBuckets
>>>> and print how many batches/buckets there are?
>>>
>>> I did this test this morning.
>>>
>>> Batches and buckets increased really quickly to 1048576/1048576.
>>
>> OK. I think 1M buckets is mostly expected for work_mem=64MB. It means
>> buckets will use 8MB, which leaves ~56B per tuple (we're aiming for
>> fillfactor 1.0).
>>
>> But 1M batches? I guess that could be problematic. It doesn't seem like
>> much, but we need 1M files on each side - 1M for the hash table, 1M for
>> the outer relation. That's 16MB of pointers, but the files are BufFile
>> and we keep an 8kB buffer for each of them. That's ~16GB right there :-(
>>
>> In practice it probably won't be that bad, because not all files will be
>> allocated/opened concurrently (especially if this is due to many tuples
>> having the same value). Assuming that's what's happening here, ofc.
> 
> And I suppose they are closed/freed concurrently as well?
> 

Yeah. There can be different subsets of the files used, depending on
when the number of batches starts to explode, etc.
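
For the record, the worst case quoted above works out roughly like this
(assuming 1M batches on each side, 8B pointers, and the default 8kB
BufFile buffer):

    pointers: 2 * 1048576 * 8B   ~=  16MB
    buffers:  2 * 1048576 * 8kB  ~=  16GB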

>>> ExecHashIncreaseNumBatches was really chatty, with hundreds of thousands
>>> of calls, always short-circuited at 1048576, I guess because of the
>>> conditional block «/* safety check to avoid overflow */» appearing early in
>>> this function.
>>
>> Hmmm, that's a bit weird, no? I mean, the check is
>>
>>   /* safety check to avoid overflow */
>>   if (oldnbatch > Min(INT_MAX / 2, MaxAllocSize / (sizeof(void *) * 2)))
>>      return;
>>
>> Why would it stop at 1048576? It certainly is not higher than INT_MAX/2,
>> and with MaxAllocSize = ~1GB the second value should be ~67M. So what's
>> happening here?
> 
> Indeed, not the right suspect. But what about this other short-cut then?
> 
>     /* do nothing if we've decided to shut off growth */
>     if (!hashtable->growEnabled)
>         return;
> 
>     [...]
> 
>     /*
>      * If we dumped out either all or none of the tuples in the table, disable
>      * further expansion of nbatch.  This situation implies that we have
>      * enough tuples of identical hashvalues to overflow spaceAllowed.
>      * Increasing nbatch will not fix it since there's no way to subdivide the
>      * group any more finely. We have to just gut it out and hope the server
>      * has enough RAM.
>      */
>     if (nfreed == 0 || nfreed == ninmemory)
>     {
>         hashtable->growEnabled = false;
>   #ifdef HJDEBUG
>         printf("Hashjoin %p: disabling further increase of nbatch\n",
>                hashtable);
>   #endif
>     }
> 
> If my guess is correct, the function is not able to split the current batch,
> so it sits and hopes. This is a much better suspect and I can certainly track
> it from gdb.
> 

Yes, this would make much more sense - it'd be consistent with the
hypothesis that this is due to the number of batches exploding (it's a
protection against exactly that).
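
Something like this in gdb should show when that protection trips (a
rough sketch, assuming debug symbols; "hashtable" is the function's only
argument, per nodeHash.c):

    (gdb) break ExecHashIncreaseNumBatches
    (gdb) commands
    > silent
    > printf "nbatch=%d growEnabled=%d\n", hashtable->nbatch, hashtable->growEnabled
    > continue
    > end
    (gdb) continue

Once growEnabled flips to 0 it stays there, and nbatch should stop
growing at the same moment.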

You specifically mentioned the other check earlier, but now I realize
you were just speculating that it might be the cause.

> Being able to find which fields are involved in the join could help as well,
> to check or gather some stats about them, but I haven't had time to dig into
> this yet...
> 

It's going to be tricky, because all parts of the plan may be doing
something, and there may be multiple hash joins. So you won't know if
you're executing the part of the plan that's causing issues :-(
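
One partial workaround, as a sketch: when a breakpoint in
ExecHashIncreaseNumBatches fires, walk up to the ExecHashJoin frame and
dump its plan node with the pprint() debug helper - that prints the hash
clauses, i.e. which fields the join hashes on. (The frame number depends
on your backtrace, and pprint() writes to the backend's stdout, not to
gdb.)

    (gdb) bt                         # locate the ExecHashJoin frame
    (gdb) frame 3                    # frame number will vary
    (gdb) call pprint(pstate->plan)  # dumps the HashJoin node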


But I have another idea - put a breakpoint on makeBufFile(), which is the
bit that allocates the temp files (including the 8kB buffer), and print
which memory context we allocate that in. I have a hunch we may be
allocating it in the ExecutorState. That'd explain all the symptoms.
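
Roughly like this (a sketch - makeBufFile() is static in buffile.c, so
this needs debug symbols):

    (gdb) break makeBufFile
    (gdb) commands
    > silent
    > printf "BufFile allocated in: %s\n", CurrentMemoryContext->name
    > continue
    > end
    (gdb) continue

If that keeps printing "ExecutorState", we have our culprit.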


BTW with how many batches does the hash join start?
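
(A sketch of one way to check - nbatch_original in HashJoinTableData
records the value at creation time:)

    (gdb) break ExecHashTableCreate
    (gdb) continue
    (gdb) finish                     # "Value returned is $1 = (HashJoinTable) ..."
    (gdb) p $1->nbatch_original
    (gdb) p $1->nbatch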


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


