Re: Memory leak from ExecutorState context? - Mailing list pgsql-hackers

From Melanie Plageman
Subject Re: Memory leak from ExecutorState context?
Date
Msg-id 20230512213606.isxoy2s57majc4ji@liskov
In response to Re: Memory leak from ExecutorState context?  (Jehan-Guillaume de Rorthais <jgdr@dalibo.com>)
Responses Re: Memory leak from ExecutorState context?
List pgsql-hackers
Thanks for continuing to work on this.

Are you planning to modify what is displayed for memory usage in
EXPLAIN ANALYZE?

Also, since that won't help a user who OOMs, I wondered whether the spillCxt
is helpful on its own or whether we need some kind of logging message for
users to discover that this is what led them to run out of memory.
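
Purely as an illustration of the kind of message I'm imagining (nothing like
this is in the patches; the wording, the callsite, and reusing the
oldnbatch/nbatch variables from ExecHashIncreaseNumBatches() are assumptions
on my part):

    /*
     * Hypothetical sketch only: log when a hash join doubles its number of
     * batches, so someone digging into an OOM can tell that spill file
     * buffers were piling up.
     */
    ereport(LOG,
            (errmsg("hash join increasing number of batches from %d to %d",
                    oldnbatch, nbatch),
             errdetail("Each batch file keeps its own buffer, allocated in the spill context.")));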

On Wed, May 10, 2023 at 02:24:19PM +0200, Jehan-Guillaume de Rorthais wrote:
> On Mon, 8 May 2023 11:56:48 -0400 Melanie Plageman <melanieplageman@gmail.com> wrote:
> 
> > > @@ -934,13 +943,16 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
> > >             hashtable, nbatch, hashtable->spaceUsed);
> > >  #endif
> > >  
> > > -    oldcxt = MemoryContextSwitchTo(hashtable->hashCxt);
> > > -
> > >      if (hashtable->innerBatchFile == NULL)
> > >      {
> > > +        MemoryContext oldcxt = MemoryContextSwitchTo(hashtable->fileCxt);
> > > +
> > >          /* we had no file arrays before */
> > >          hashtable->innerBatchFile = palloc0_array(BufFile *, nbatch);
> > >          hashtable->outerBatchFile = palloc0_array(BufFile *, nbatch);
> > > +
> > > +        MemoryContextSwitchTo(oldcxt);
> > > +
> > >          /* time to establish the temp tablespaces, too */
> > >          PrepareTempTablespaces();
> > >      }
> > > @@ -951,8 +963,6 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)  
> > 
> > I don't see a reason to call repalloc0_array() in a different context
> > than the initial palloc0_array().
> 
> Unless I'm wrong, wrapping the whole if/else block in the hashtable->fileCxt
> memory context seemed useless, as repalloc() actually reallocates in the
> context where the original allocation occurred. So I only wrapped the
> palloc() calls.

My objection is less about correctness and more about the diff. The
existing memory context switch is around the whole if/else block. If you
want to move it to only wrap the if statement, I would do that in a
separate commit with a message describing the rationale. It doesn't seem
to save us much and it makes the diff a bit more confusing. I don't feel
strongly enough about this to protest much more, though.
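
To spell out the shape I had in mind, roughly (a simplified sketch rather
than the exact patch text; PrepareTempTablespaces() and the rest of the
function are elided, and oldnbatch/nbatch are as in
ExecHashIncreaseNumBatches()):

    /* keep one context switch around the whole if/else, just targeting fileCxt */
    MemoryContext oldcxt = MemoryContextSwitchTo(hashtable->fileCxt);

    if (hashtable->innerBatchFile == NULL)
    {
        /* we had no file arrays before */
        hashtable->innerBatchFile = palloc0_array(BufFile *, nbatch);
        hashtable->outerBatchFile = palloc0_array(BufFile *, nbatch);
    }
    else
    {
        /*
         * repalloc keeps each chunk in the context it was originally
         * allocated in, so which context is current here doesn't matter.
         */
        hashtable->innerBatchFile = repalloc0_array(hashtable->innerBatchFile,
                                                    BufFile *, oldnbatch, nbatch);
        hashtable->outerBatchFile = repalloc0_array(hashtable->outerBatchFile,
                                                    BufFile *, oldnbatch, nbatch);
    }

    MemoryContextSwitchTo(oldcxt);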

> > > diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
> > > index 8ee59d2c71..74867c3e40 100644
> > > --- a/src/include/executor/hashjoin.h
> > > +++ b/src/include/executor/hashjoin.h
> > > @@ -25,10 +25,14 @@
> > >   *
> > >   * Each active hashjoin has a HashJoinTable control block, which is
> > >   * palloc'd in the executor's per-query context.  All other storage needed
> > > - * for the hashjoin is kept in private memory contexts, two for each hashjoin.
> > > + * for the hashjoin is kept in private memory contexts, three for each
> > > + * hashjoin:
> > 
> > Maybe "hash table control block". I know the phrase "control block" is
> > used elsewhere in the comments, but it is a bit confusing. Also, I wish
> > there was a way to make it clear this is for the hashtable but relevant
> > to all batches.
> 
> I tried to reword the comment with this additional info in mind in v6. Does it
> match what you had in mind?

Review of that below.

> > So, if we are going to allocate the array of pointers to the spill files
> > in the fileCxt, we should revise this comment. As I mentioned above, I
> > agree with you that if the SharedTupleStore-related buffers are also
> > allocated in this context, perhaps it shouldn't be called the fileCxt.
> > 
> > One idea I had is calling it the spillCxt. Almost all memory allocated in this
> > context is related to needing to spill to permanent storage during execution.
> 
> Agree
> 
> > The one potential confusing part of this is batch 0 for parallel hash
> > join. I would have to dig back into it again, but from a cursory look at
> > ExecParallelHashJoinSetUpBatches() it seems like batch 0 also gets a
> > shared tuplestore with associated accessors and files even if it is a
> > single batch parallel hashjoin.
> > 
> > Are the parallel hash join read_buffer and write_chunk also used for a
> > single batch parallel hash join?
> 
> I don't think so.
> 
> For the inner side, there are various Assert()s around the batchno==0 special
> case. Plus, it always has its own code path when inserting into a batch,
> writing directly to shared memory by calling ExecParallelHashPushTuple().
> 
> The outer side of the join actually creates all batches using the shared
> tuple storage mechanism, including batch 0, **only** if the number of batches
> is greater than 1. See ExecParallelHashJoinOuterGetTuple:
> 
>   /*
>    * In the Parallel Hash case we only run the outer plan directly for
>    * single-batch hash joins. Otherwise we have to go to batch files, even
>    * for batch 0.
>    */
>   if (curbatch == 0 && hashtable->nbatch == 1)
>   {
>       slot = ExecProcNode(outerNode);
> 
> So, for a single batch PHJ, it seems there's no temp files involved.

spill context seems appropriate, then.

> 
> > Though, perhaps spillCxt still makes sense? It doesn't specify
> > multi-batch...
> 
> I'm not sure I see where the confusing part would be here. Is it that some
> STS machinery is allocated but never used? When the number of batches is 1,
> it doesn't really matter much, I suppose, as the context's consumption stays
> really low. Plus, there are some other unused STS structures/contexts around
> there (i.e. inner batch 0 and the batch context in PHJ). I'm not sure it's
> worth trying to optimize this compared to the cost of the added code
> complexity.
> 
> Or am I off-topic and missing something obvious?

My concern was that the shared tuplestore is used even for single-batch
parallel hash joins. If memory for the shared tuplestore (like the
accessors or some buffers) were allocated in the spill context while no
temp files were actually used, it could be confusing (more for the
developer than the user) to see memory allocated in a "spill" context,
which implies spilling. But I think there is no point in belaboring this
any further.

> > > --- a/src/include/executor/nodeHashjoin.h
> > > +++ b/src/include/executor/nodeHashjoin.h
> > > @@ -29,6 +29,6 @@ extern void ExecHashJoinInitializeWorker(HashJoinState *state,
> > >                                          ParallelWorkerContext *pwcxt);
> > >    
> > 
> > I would add a comment explaining why ExecHashJoinSaveTuple() is passed
> > with the fileCxt as a parameter.
> > 
> > >  extern void ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,
> 
> Isn't the comment added in the function itself, in v6, enough? It seems
> uncommon to comment on function parameters in headers.

The comment seems good to me.
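
For what it's worth, a short note at the prototype would also have worked;
something roughly like this (the wording and the exact shape of the extra
parameter are a sketch on my end, not the patch text):

    /*
     * ExecHashJoinSaveTuple is called from nodeHash.c as well as
     * nodeHashjoin.c, so the caller passes the context in which the spill
     * file and its buffer should be allocated, rather than this function
     * guessing.
     */
    extern void ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,
                                      BufFile **fileptr, MemoryContext filecxt);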

> Last, about your TODO in the 0001 patch, do you mean that we should document
> that after splitting a batch N, its rows can only be redispatched to N0 or N1?

I mean that if we split batch 5 during execution, tuples can only be
spilled to batches 6+ (since we will have already processed batches
0-4). I think this is how it works, though I haven't reviewed this part
of the code in some time.
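
Concretely, I'm thinking of the guard in ExecHashTableInsert() (paraphrased,
and showing the pre-patch parameter list of ExecHashJoinSaveTuple()):

    if (batchno != hashtable->curbatch)
    {
        /*
         * The tuple belongs to a batch other than the one currently being
         * built; since earlier batches are already finished, it can only be
         * a later one, so it is written out to that batch's temp file.
         */
        Assert(batchno > hashtable->curbatch);
        ExecHashJoinSaveTuple(tuple, hashvalue,
                              &hashtable->innerBatchFile[batchno]);
    }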

> From 6c9056979eb4d638f0555a05453686e01b1d1d11 Mon Sep 17 00:00:00 2001
> From: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
> Date: Mon, 27 Mar 2023 15:54:39 +0200
> Subject: [PATCH 2/3] Allocate hash batches related BufFile in a dedicated
>  context

This patch mainly looks good now. I had some suggested rewording below.
 
> diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h

I've suggested a few edits to your block comment in hashjoin.h below:

>  /* ----------------------------------------------------------------
>   *                hash-join hash table structures
>   *

[...]
 * Each active hashjoin has a HashJoinTable structure, which is
 * palloc'd in the executor's per-query context.  Other storage needed for
 * each hashjoin is split amongst three child contexts:
 *   - HashTableContext (hashCxt): the top hash table storage context
 *     - HashSpillContext (spillCxt): storage for temp file buffers
 *     - HashBatchContext (batchCxt): storage for a batch in serial hash join
 *
 [...]
 *
 * Data is allocated in the "hashCxt" when it should live throughout the
 * lifetime of the join. This mainly consists of hashtable metadata.
 *
 * Data associated with temporary files needed when the hash join must
 * spill to disk is allocated in the "spillCxt" context. This context
 * lives for the duration of the join, as spill files concerning multiple
 * batches coexist. These files are explicitly destroyed by calling
 * BufFileClose() when the hash join is done with them. The aim of this
 * context is to help account for the memory dedicated to temp files and
 * their buffers.
 *
 * Finally, storage that is only wanted for the current batch is
 * allocated in the "batchCxt".  By resetting the batchCxt at the end of
 * each batch, we free all the per-batch storage reliably and without
 * tedium.
 *
 [...]
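
And, just to make the three contexts concrete, roughly how they would hang
together in ExecHashTableCreate() (the hashCxt and batchCxt calls mirror the
existing code; making spillCxt a child of hashCxt is my assumption about the
patch):

    /* top-level private storage for the whole hash join */
    hashtable->hashCxt = AllocSetContextCreate(CurrentMemoryContext,
                                               "HashTableContext",
                                               ALLOCSET_DEFAULT_SIZES);

    /* per-batch storage, reset at the end of each batch */
    hashtable->batchCxt = AllocSetContextCreate(hashtable->hashCxt,
                                                "HashBatchContext",
                                                ALLOCSET_DEFAULT_SIZES);

    /* spill-file storage, kept until the join is done */
    hashtable->spillCxt = AllocSetContextCreate(hashtable->hashCxt,
                                                "HashSpillContext",
                                                ALLOCSET_DEFAULT_SIZES);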


- Melanie


