From: Melanie Plageman
Subject: Re: Memory leak from ExecutorState context?
Msg-id: 20230508155648.lzddwywo6emote2b@liskov
In response to: Re: Memory leak from ExecutorState context? (Jehan-Guillaume de Rorthais <jgdr@dalibo.com>)
List: pgsql-hackers

Thanks for continuing to work on this!

On Thu, May 04, 2023 at 07:30:06PM +0200, Jehan-Guillaume de Rorthais wrote:
> On Fri, 21 Apr 2023 16:44:48 -0400 Melanie Plageman <melanieplageman@gmail.com> wrote:
...
> > I think the biggest change that is needed is to implement this memory
> > context usage for parallel hash join.
> 
> Indeed. 

...

> 4. accessor->read_buffer is currently allocated in accessor->context as well.
> 
>    This buffer holds tuples read from the fileset. This is still a buffer, but
>    not related to any file anymore...
> 
> Because of 3 and 4, and the gross context granularity of SharedTuplestore
> related-code, I'm now wondering if "fileCxt" shouldn't be renamed "bufCxt".

I think bufCxt is a potentially confusing name. The context contains
pointers to the batch files, and saying those are related to buffers is
misleading. It also sounds like it could include any kind of buffer
related to the hashtable or hash join.

Perhaps we could call this memory context the "spillCxt"--indicating it
is for the memory required for spilling to permanent storage while
executing hash joins.
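
For example, at creation time (hypothetical naming, and assuming the
new context is, like the others, a child of hashCxt):

    hashtable->spillCxt = AllocSetContextCreate(hashtable->hashCxt,
                                                "HashSpillContext",
                                                ALLOCSET_DEFAULT_SIZES);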

I discuss this more in my code review below.

> From c5ed2ae2c2749af4f5058b012dc5e8a9e1529127 Mon Sep 17 00:00:00 2001
> From: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
> Date: Mon, 27 Mar 2023 15:54:39 +0200
> Subject: [PATCH 2/3] Allocate hash batches related BufFile in a dedicated
>  context

> diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
> index 5fd1c5553b..a4fbf29301 100644
> --- a/src/backend/executor/nodeHash.c
> +++ b/src/backend/executor/nodeHash.c
> @@ -570,15 +574,21 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
>  
>      if (nbatch > 1 && hashtable->parallel_state == NULL)
>      {
> +        MemoryContext oldctx;
> +
>          /*
>           * allocate and initialize the file arrays in hashCxt (not needed for
>           * parallel case which uses shared tuplestores instead of raw files)
>           */
> +        oldctx = MemoryContextSwitchTo(hashtable->fileCxt);
> +
>          hashtable->innerBatchFile = palloc0_array(BufFile *, nbatch);
>          hashtable->outerBatchFile = palloc0_array(BufFile *, nbatch);
>          /* The files will not be opened until needed... */
>          /* ... but make sure we have temp tablespaces established for them */
>          PrepareTempTablespaces();
> +
> +        MemoryContextSwitchTo(oldctx);
>      }

I haven't studied it closely, but I wonder if we shouldn't switch back
into the oldctx before calling PrepareTempTablespaces().
PrepareTempTablespaces() is explicit about which memory context it
allocates in; however, I'm not sure it makes sense to call it in the
fileCxt. If you have a reason, you should add a comment and do the same
in ExecHashIncreaseNumBatches().
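
That is, something like this (untested, just to illustrate the ordering
I have in mind):

        oldctx = MemoryContextSwitchTo(hashtable->fileCxt);

        hashtable->innerBatchFile = palloc0_array(BufFile *, nbatch);
        hashtable->outerBatchFile = palloc0_array(BufFile *, nbatch);

        /* switch back before establishing the temp tablespaces */
        MemoryContextSwitchTo(oldctx);

        /* The files will not be opened until needed... */
        /* ... but make sure we have temp tablespaces established for them */
        PrepareTempTablespaces();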

> @@ -934,13 +943,16 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
>             hashtable, nbatch, hashtable->spaceUsed);
>  #endif
>  
> -    oldcxt = MemoryContextSwitchTo(hashtable->hashCxt);
> -
>      if (hashtable->innerBatchFile == NULL)
>      {
> +        MemoryContext oldcxt = MemoryContextSwitchTo(hashtable->fileCxt);
> +
>          /* we had no file arrays before */
>          hashtable->innerBatchFile = palloc0_array(BufFile *, nbatch);
>          hashtable->outerBatchFile = palloc0_array(BufFile *, nbatch);
> +
> +        MemoryContextSwitchTo(oldcxt);
> +
>          /* time to establish the temp tablespaces, too */
>          PrepareTempTablespaces();
>      }

As mentioned above, you should likely make ExecHashTableCreate()
consistent with this.

> @@ -951,8 +963,6 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
>          hashtable->outerBatchFile = repalloc0_array(hashtable->outerBatchFile, BufFile *, oldnbatch, nbatch);
>      }
>  
> -    MemoryContextSwitchTo(oldcxt);
> -
>      hashtable->nbatch = nbatch;
>  
>      /*

I don't see a reason to call repalloc0_array() in a different context
than the initial palloc0_array().
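
That is, the enlarging branch could do the same switch, e.g.
(untested):

    else
    {
        MemoryContext oldcxt = MemoryContextSwitchTo(hashtable->fileCxt);

        /* enlarge arrays and zero out added entries */
        hashtable->innerBatchFile = repalloc0_array(hashtable->innerBatchFile,
                                                    BufFile *, oldnbatch, nbatch);
        hashtable->outerBatchFile = repalloc0_array(hashtable->outerBatchFile,
                                                    BufFile *, oldnbatch, nbatch);

        MemoryContextSwitchTo(oldcxt);
    }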

> diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
> index 920d1831c2..ac72fbfbb6 100644
> --- a/src/backend/executor/nodeHashjoin.c
> +++ b/src/backend/executor/nodeHashjoin.c
> @@ -485,8 +485,10 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
>                       */
>                      Assert(parallel_state == NULL);
>                      Assert(batchno > hashtable->curbatch);
> +
>                      ExecHashJoinSaveTuple(mintuple, hashvalue,
> -                                          &hashtable->outerBatchFile[batchno]);
> +                                          &hashtable->outerBatchFile[batchno],
> +                                          hashtable->fileCxt);
>  
>                      if (shouldFree)
>                          heap_free_minimal_tuple(mintuple);
> @@ -1308,21 +1310,27 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
>   * The data recorded in the file for each tuple is its hash value,
>   * then the tuple in MinimalTuple format.
>   *
> - * Note: it is important always to call this in the regular executor
> - * context, not in a shorter-lived context; else the temp file buffers
> - * will get messed up.
> + * Note: it is important always to call this in the HashBatchFiles context,
> + * not in a shorter-lived context; else the temp file buffers will get messed
> + * up.
>   */
>  void
>  ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,
> -                      BufFile **fileptr)
> +                      BufFile **fileptr, MemoryContext filecxt)
>  {
>      BufFile    *file = *fileptr;
>  
>      if (file == NULL)
>      {
> +        MemoryContext oldctx;
> +
> +        oldctx = MemoryContextSwitchTo(filecxt);
> +
>          /* First write to this batch file, so open it. */
>          file = BufFileCreateTemp(false);
>          *fileptr = file;
> +
> +        MemoryContextSwitchTo(oldctx);
>      }

It doesn't sound accurate to me to say that it should be called *in* the
HashBatchFiles context. We now switch into that context, so you can
probably remove this comment and instead add a comment above the switch
into the filecxt explaining that the temp file buffers should be
allocated there (both because they need a sufficiently long-lived
context and to increase the visibility of their memory overhead).
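
Perhaps something like this (exact wording up for debate):

    if (file == NULL)
    {
        MemoryContext oldctx;

        /*
         * The temp file buffers must outlive the current batch, so
         * allocate the file in the long-lived filecxt; charging the
         * buffers to that context also makes their memory overhead
         * visible.
         */
        oldctx = MemoryContextSwitchTo(filecxt);

        /* First write to this batch file, so open it. */
        file = BufFileCreateTemp(false);
        *fileptr = file;

        MemoryContextSwitchTo(oldctx);
    }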

> diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
> index 8ee59d2c71..74867c3e40 100644
> --- a/src/include/executor/hashjoin.h
> +++ b/src/include/executor/hashjoin.h
> @@ -25,10 +25,14 @@
>   *
>   * Each active hashjoin has a HashJoinTable control block, which is
>   * palloc'd in the executor's per-query context.  All other storage needed
> - * for the hashjoin is kept in private memory contexts, two for each hashjoin.
> + * for the hashjoin is kept in private memory contexts, three for each
> + * hashjoin:

Maybe "hash table control block". I know the phrase "control block" is
used elsewhere in the comments, but it is a bit confusing. Also, I wish
there was a way to make it clear this is for the hashtable but relevant
to all batches.

> + * - HashTableContext (hashCxt): the control block associated to the hash table

I might say "per-batch storage for serial hash join".

> + * - HashBatchContext (batchCxt): storages for batches

So, if we are going to allocate the array of pointers to the spill files
in the fileCxt, we should revise this comment. As I mentioned above, I
agree with you that if the SharedTupleStore-related buffers are also
allocated in this context, perhaps it shouldn't be called the fileCxt.

One idea I had is calling it the spillCxt. Almost all memory allocated in this
context is related to needing to spill to permanent storage during execution.
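
To make that concrete, the comment block could then read something like
this (assuming the spillCxt renaming; the wording is just a sketch):

 * - HashTableContext (hashCxt): the hash table control block and any
 *   other storage that lives for the duration of the entire hash join
 * - HashBatchContext (batchCxt): per-batch storage for a serial hash join
 * - HashSpillContext (spillCxt): the batch file arrays, BufFiles, and
 *   any other state needed to spill tuples to disk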

The one potentially confusing part of this is batch 0 for parallel hash
join. I would have to dig back into it again, but from a cursory look
at ExecParallelHashJoinSetUpBatches(), it seems like batch 0 also gets
a shared tuplestore with associated accessors and files, even if it is
a single-batch parallel hash join.

Are the parallel hash join read_buffer and write_chunk also used for a
single-batch parallel hash join?

Though, perhaps spillCxt still makes sense? It doesn't specify
multi-batch...

> --- a/src/include/executor/nodeHashjoin.h
> +++ b/src/include/executor/nodeHashjoin.h
> @@ -29,6 +29,6 @@ extern void ExecHashJoinInitializeWorker(HashJoinState *state,
>                                           ParallelWorkerContext *pwcxt);
>  
>  extern void ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,
> -                                  BufFile **fileptr);
> +                                  BufFile **fileptr, MemoryContext filecxt);

I would add a comment explaining why ExecHashJoinSaveTuple() takes the
fileCxt as a parameter.
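
Maybe something as simple as this (wording is just a suggestion):

    /*
     * filecxt is the long-lived memory context in which the batch file
     * and its buffers are allocated on first write to the file.
     */
    extern void ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,
                                      BufFile **fileptr, MemoryContext filecxt);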

- Melanie


