Re: Memory leak from ExecutorState context? - Mailing list pgsql-hackers

From Melanie Plageman
Subject Re: Memory leak from ExecutorState context?
Date
Msg-id 20230516200052.sbg6z4ghcmsas3wv@liskov
In response to Re: Memory leak from ExecutorState context?  (Jehan-Guillaume de Rorthais <jgdr@dalibo.com>)
List pgsql-hackers
On Tue, May 16, 2023 at 04:00:51PM +0200, Jehan-Guillaume de Rorthais wrote:

> From e5ecd466172b7bae2f1be294c1a5e70ce2b43ed8 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Thu, 30 Apr 2020 07:16:28 -0700
> Subject: [PATCH v8 1/3] Describe hash join implementation
> 
> This is just a draft to spark conversation on what a good comment might
> be like in this file on how the hybrid hash join algorithm is
> implemented in Postgres. I'm pretty sure this is the accepted term for
> this algorithm https://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join

I recommend changing the commit message to something like this:

    Describe hash join implementation

    Add details to the block comment in nodeHashjoin.c describing the
    hybrid hash join algorithm at a high level.

    Author: Melanie Plageman <melanieplageman@gmail.com>
    Author: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
    Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
    Discussion: https://postgr.es/m/20230516160051.4267a800%40karst

> diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
> index 0a3f32f731..93a78f6f74 100644
> --- a/src/backend/executor/nodeHashjoin.c
> +++ b/src/backend/executor/nodeHashjoin.c
> @@ -10,6 +10,47 @@
>   * IDENTIFICATION
>   *      src/backend/executor/nodeHashjoin.c
>   *
> + * HASH JOIN
> + *
> + * This is based on the "hybrid hash join" algorithm described shortly in the
> + * following page, and in length in the pdf in page's notes:
> + *
> + *   https://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join
> + *
> + * If the inner side tuples of a hash join do not fit in memory, the hash join
> + * can be executed in multiple batches.
> + *
> + * If the statistics on the inner side relation are accurate, planner chooses a
> + * multi-batch strategy and estimates the number of batches.
> + *
> + * The query executor measures the real size of the hashtable and increases the
> + * number of batches if the hashtable grows too large.
> + *
> + * The number of batches is always a power of two, so an increase in the number
> + * of batches doubles it.
> + *
> + * Serial hash join measures batch size lazily -- waiting until it is loading a
> + * batch to determine if it will fit in memory. While inserting tuples into the
> + * hashtable, serial hash join will, if that tuple were to exceed work_mem,
> + * dump out the hashtable and reassign them either to other batch files or the
> + * current batch resident in the hashtable.
> + *
> + * Parallel hash join, on the other hand, completes all changes to the number
> + * of batches during the build phase. If it increases the number of batches, it
> + * dumps out all the tuples from all batches and reassigns them to entirely new
> + * batch files. Then it checks every batch to ensure it will fit in the space
> + * budget for the query.
> + *
> + * In both parallel and serial hash join, the executor currently makes a best
> + * effort. If a particular batch will not fit in memory, it tries doubling the
> + * number of batches. If after a batch increase, there is a batch which
> + * retained all or none of its tuples, the executor disables growth in the
> + * number of batches globally. After growth is disabled, all batches that would
> + * have previously triggered an increase in the number of batches instead
> + * exceed the space allowed.
> + *
> + * TODO: should we discuss that tuples can only spill forward?

I would just cut this for now since we haven't started on an agreed-upon
wording.
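
As an aside, it might help readers to see why growth is always a
doubling: the batch number is taken from the hash value with a mask, so
doubling nbatch just exposes one more bit. A simplified sketch of what
ExecHashGetBucketAndBatch() in nodeHash.c does (illustration only, not
the exact source):

    #include <stdint.h>

    /*
     * The low log2_nbuckets bits of the hash pick the bucket; the bits
     * above them pick the batch.  When nbatch doubles, the mask gains
     * one bit, so a tuple either stays in its batch or moves to a
     * higher-numbered one -- it can only "spill forward".
     */
    static inline uint32_t
    batch_for_hash(uint32_t hashvalue, int log2_nbuckets, uint32_t nbatch)
    {
        return (hashvalue >> log2_nbuckets) & (nbatch - 1);
    }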

> From 309ad354b7a9e4dfa01b2985bd883829f5e0eba0 Mon Sep 17 00:00:00 2001
> From: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
> Date: Tue, 16 May 2023 15:42:14 +0200
> Subject: [PATCH v8 2/3] Allocate hash batches related BufFile in a dedicated
>  context

Here is a draft commit message for the second patch:

    Dedicated memory context for hash join spill metadata
    
    A hash join's hashtable may be split up into multiple batches if it
    would otherwise exceed work_mem. The number of batches is doubled each
    time a given batch is determined not to fit in memory. Each batch file
    is allocated with a block-sized buffer for buffering tuples (parallel
    hash join has additional sharedtuplestore accessor buffers).
    
    In some cases, often with skewed data, bad stats, or very large
    datasets, users can run out of memory while attempting to fit an
    oversized batch in memory solely from the memory overhead of all the
    batch files' buffers.
    
    Batch files were allocated in the ExecutorState memory context, making
    it very hard to identify when this batch explosion was the source of an
    OOM. By allocating the batch files in a dedicated memory context, it
    should be easier for users to identify the cause of an OOM and work to
    avoid it.

I recommend editing and adding links to the various discussions on this
topic from your research.
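
For anyone skimming the thread, the heart of the patch is just a new
child context of hashCxt, created in ExecHashTableCreate() (a sketch
based on the v8 patch):

    /*
     * Dedicated context for batch file buffers, so spill-related
     * allocations show up under their own line in MemoryContextStats()
     * output instead of being lumped into ExecutorState.
     */
    hashtable->spillCxt = AllocSetContextCreate(hashtable->hashCxt,
                                                "HashSpillContext",
                                                ALLOCSET_DEFAULT_SIZES);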

As for the patch itself, I noticed a few things that need pgindenting.

I usually do the following to run pgindent (in case you haven't done
this recently).

Change the pg_bsd_indent meson file "install" value to true (like this
diff):

diff --git a/src/tools/pg_bsd_indent/meson.build b/src/tools/pg_bsd_indent/meson.build
index 5545c097bf..85bedf13f6 100644
--- a/src/tools/pg_bsd_indent/meson.build
+++ b/src/tools/pg_bsd_indent/meson.build
@@ -21,7 +21,7 @@ pg_bsd_indent = executable('pg_bsd_indent',
   dependencies: [frontend_code],
   include_directories: include_directories('.'),
   kwargs: default_bin_args + {
-    'install': false,
+    'install': true,
 # possibly at some point do this:
 #   'install_dir': dir_pgxs / 'src/tools/pg_bsd_indent',
   },

Install pg_bsd_indent, then run pgindent like so:

src/tools/pgindent/pgindent \
    --indent $INSTALL_PATH/pg_bsd_indent \
    --typedef src/tools/pgindent/typedefs.list \
    -- $(git diff origin/master --name-only '*.c' '*.h')

There are some existing indentation issues in these files, but you can
leave those or put them in a separate commit.

> @@ -3093,8 +3107,11 @@ ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
>      pstate->nbatch = nbatch;
>      batches = dsa_get_address(hashtable->area, pstate->batches);
>  
> -    /* Use hash join memory context. */
> -    oldcxt = MemoryContextSwitchTo(hashtable->hashCxt);

Add a period at the end of this comment.

> +    /*
> +     * Use hash join spill memory context to allocate accessors and their
> +     * buffers
> +     */
> +    oldcxt = MemoryContextSwitchTo(hashtable->spillCxt);
>  
>      /* Allocate this backend's accessor array. */
>      hashtable->nbatch = nbatch;
> @@ -3196,8 +3213,8 @@ ExecParallelHashEnsureBatchAccessors(HashJoinTable hashtable)
>       */
>      Assert(DsaPointerIsValid(pstate->batches));
>  
> -    /* Use hash join memory context. */
> -    oldcxt = MemoryContextSwitchTo(hashtable->hashCxt);

Add a period at the end of this comment.

> +    /* Use hash join spill memory context to allocate accessors */
> +    oldcxt = MemoryContextSwitchTo(hashtable->spillCxt);
>  
>      /* Allocate this backend's accessor array. */
>      hashtable->nbatch = pstate->nbatch;

> diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
> @@ -1313,21 +1314,30 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
>   * The data recorded in the file for each tuple is its hash value,
>   * then the tuple in MinimalTuple format.
>   *
> - * Note: it is important always to call this in the regular executor
> - * context, not in a shorter-lived context; else the temp file buffers
> - * will get messed up.
> + * If this is the first write to the batch file, this function first
> + * create it.  The associated BufFile buffer is allocated in the given
> + * context.  It is important to always give the HashSpillContext
> + * context.  First to avoid a shorter-lived context, else the temp file
> + * buffers will get messed up.  Second, for a better accounting of the
> + * spilling memory consumption.
> + *
>   */

Here is my suggested wording for this block comment:

The batch file is lazily created. If this is the first tuple written to
this batch, the batch file is created and its buffer is allocated in the
given context. It is important to pass in a memory context which will
live for the entire lifespan of the batch file.

Since we went to the trouble of naming the context something generic,
perhaps move the comment about accounting for the memory consumption to
the call site.
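
For reference, the lazy-creation pattern the comment describes would
look roughly like this in the function body (a simplified sketch of the
v8 patch, with error handling elided):

    if (*fileptr == NULL)
    {
        /*
         * First write to this batch: create the file, making sure its
         * buffer is allocated in the long-lived context passed in.
         */
        MemoryContext oldctx = MemoryContextSwitchTo(cxt);

        *fileptr = BufFileCreateTemp(false);
        MemoryContextSwitchTo(oldctx);
    }

    BufFileWrite(*fileptr, &hashvalue, sizeof(uint32));
    BufFileWrite(*fileptr, tuple, tuple->t_len);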

>  void
>  ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,
> -                      BufFile **fileptr)
> +                      BufFile **fileptr, MemoryContext cxt)
>  {

> diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
> index 8ee59d2c71..ac27222d18 100644
> --- a/src/include/executor/hashjoin.h
> +++ b/src/include/executor/hashjoin.h
> @@ -23,12 +23,12 @@
>  /* ----------------------------------------------------------------
>   *                hash-join hash table structures
>   *
> - * Each active hashjoin has a HashJoinTable control block, which is
> - * palloc'd in the executor's per-query context.  All other storage needed
> - * for the hashjoin is kept in private memory contexts, two for each hashjoin.
> - * This makes it easy and fast to release the storage when we don't need it
> - * anymore.  (Exception: data associated with the temp files lives in the
> - * per-query context too, since we always call buffile.c in that context.)
> + * Each active hashjoin has a HashJoinTable structure, which is

"Other storages" should be "Other storage"

> + * palloc'd in the executor's per-query context.  Other storages needed for
> + * each hashjoin is kept in child contexts, three for each hashjoin:
> + *   - HashTableContext (hashCxt): the parent hash table storage context
> + *   - HashSpillContext (spillCxt): storage for temp files buffers
> + *   - HashBatchContext (batchCxt): storage for a batch in serial hash join
>   *
>   * The hashtable contexts are made children of the per-query context, ensuring
>   * that they will be discarded at end of statement even if the join is
> @@ -36,9 +36,19 @@
>   * be cleaned up by the virtual file manager in event of an error.)
>   *
>   * Storage that should live through the entire join is allocated from the
> - * "hashCxt", while storage that is only wanted for the current batch is
> - * allocated in the "batchCxt".  By resetting the batchCxt at the end of
> - * each batch, we free all the per-batch storage reliably and without tedium.

"mainly hash's meta datas" -> "mainly the hashtable's metadata"

> + * "hashCxt" (mainly hash's meta datas). Also, the "hashCxt" context is the
> + * parent of "spillCxt" and "batchCxt". It makes it easy and fast to release
> + * the storage when we don't need it anymore.
> + *

Suggested alternative wording for the below:

* Data associated with temp files is allocated in the "spillCxt" context
* which lives for the duration of the entire join as batch files'
* creation and usage may span batch execution. These files are
* explicitly destroyed by calling BufFileClose() when the code is done
* with them. The aim of this context is to help account for the
* memory allocated for temp files and their buffers.

> + * Data associated with temp files lives in the "spillCxt" context which lives
> + * during the entire join as temp files might need to survives batches. These
> + * files are explicitly destroyed by calling BufFileClose() when the code is
> + * done with them. The aim of this context is to help accounting the memory
> + * allocations dedicated to temp files and their buffers.
> + *

Suggested alternative wording for the below:

* Finally, data used only during a single batch's execution is allocated
* in the "batchCxt". By resetting the batchCxt at the end of each batch,
* we free all the per-batch storage reliably and without tedium.

> + * Finaly, storage that is only wanted for the current batch is allocated in
> + * the "batchCxt".  By resetting the batchCxt at the end of each batch, we free
> + * all the per-batch storage reliably and without tedium.
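
(For completeness, the reset this refers to happens in
ExecHashTableReset(); roughly, and simplified:)

    /*
     * Throw away all per-batch storage and rebuild the bucket array
     * from scratch inside the freshly reset context.
     */
    MemoryContextReset(hashtable->batchCxt);
    oldcxt = MemoryContextSwitchTo(hashtable->batchCxt);
    hashtable->buckets.unshared = (HashJoinTuple *)
        palloc0(hashtable->nbuckets * sizeof(HashJoinTuple));
    MemoryContextSwitchTo(oldcxt);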

- Melanie


