Re: Memory leak from ExecutorState context? - Mailing list pgsql-hackers

From Jehan-Guillaume de Rorthais
Subject Re: Memory leak from ExecutorState context?
Date
Msg-id 20230328151745.0f6061f8@karst
Whole thread Raw
In response to Re: Memory leak from ExecutorState context?  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: Memory leak from ExecutorState context?
List pgsql-hackers
On Tue, 28 Mar 2023 00:43:34 +0200
Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

> On 3/27/23 23:13, Jehan-Guillaume de Rorthais wrote:
> > Please, find in attachment a patch to allocate bufFiles in a dedicated
> > context. I picked up your patch, backpatch'd it, went through it and did
> > some minor changes to it. I have some comment/questions thought.
> >
> >   1. I'm not sure why we must allocate the "HashBatchFiles" new context
> >   under ExecutorState and not under hashtable->hashCxt?
> >
> >   The only references I could find was in hashjoin.h:30:
> >
> >    /* [...]
> >     * [...] (Exception: data associated with the temp files lives in the
> >     * per-query context too, since we always call buffile.c in that
> > context.)
> >
> >   And in nodeHashjoin.c:1243:ExecHashJoinSaveTuple() (I reworded this
> >   original comment in the patch):
> >
> >    /* [...]
> >     * Note: it is important always to call this in the regular executor
> >     * context, not in a shorter-lived context; else the temp file buffers
> >     * will get messed up.
> >
> >
> >   But these are not explanation of why BufFile related allocations must be
> > under a per-query context.
> >
>
> Doesn't that simply describe the current (unpatched) behavior where
> BufFile is allocated in the per-query context?

I wasn't sure. The first quote from hashjoin.h seems to describe a stronger
rule about «**always** call buffile.c in per-query context». But maybe it ought
to be «always call buffile.c from one of the sub-query context»? I assume the
aim is to enforce the tmp files removal on query end/error?

> I mean, the current code calls BufFileCreateTemp() without switching the
> context, so it's in the ExecutorState. But with the patch it very clearly is
> not.
>
> And I'm pretty sure the patch should do
>
>     hashtable->fileCxt = AllocSetContextCreate(hashtable->hashCxt,
>                                          "HashBatchFiles",
>                                          ALLOCSET_DEFAULT_SIZES);
>
> and it'd still work. Or why do you think we *must* allocate it under
> ExecutorState?

That was actually my very first patch and it indeed worked. But I was confused
about the previous quoted code comments. That's why I kept your original code
and decided to rise the discussion here.

Fixed in new patch in attachment.

> FWIW The comment in hashjoin.h needs updating to reflect the change.

Done in the last patch. Is my rewording accurate?

> >   2. Wrapping each call of ExecHashJoinSaveTuple() with a memory context
> > switch seems fragile as it could be forgotten in futur code path/changes.
> > So I added an Assert() in the function to make sure the current memory
> > context is "HashBatchFiles" as expected.
> >   Another way to tie this up might be to pass the memory context as
> > argument to the function.
> >   ... Or maybe I'm over precautionary.
> >
>
> I'm not sure I'd call that fragile, we have plenty other code that
> expects the memory context to be set correctly. Not sure about the
> assert, but we don't have similar asserts anywhere else.

I mostly sticked it there to stimulate the discussion around this as I needed
to scratch that itch.

> But I think it's just ugly and overly verbose

+1

Your patch was just a demo/debug patch by the time. It needed some cleanup now
:)

> it'd be much nicer to e.g. pass the memory context as a parameter, and do
> the switch inside.

That was a proposition in my previous mail, so I did it in the new patch. Let's
see what other reviewers think.

> >   3. You wrote:
> >
> >>> A separate BufFile memory context helps, although people won't see it
> >>> unless they attach a debugger, I think. Better than nothing, but I was
> >>> wondering if we could maybe print some warnings when the number of batch
> >>> files gets too high ...
> >
> >   So I added a WARNING when batches memory are exhausting the memory size
> >   allowed.
> >
> >    +   if (hashtable->fileCxt->mem_allocated > hashtable->spaceAllowed)
> >    +       elog(WARNING, "Growing number of hash batch is exhausting
> > memory");
> >
> >   This is repeated on each call of ExecHashIncreaseNumBatches when BufFile
> >   overflows the memory budget. I realize now I should probably add the
> > memory limit, the number of current batch and their memory consumption.
> >   The message is probably too cryptic for a user. It could probably be
> >   reworded, but some doc or additionnal hint around this message might help.
> >
>
> Hmmm, not sure is WARNING is a good approach, but I don't have a better
> idea at the moment.

I stepped it down to NOTICE and added some more infos.

Here is the output of the last patch with a 1MB work_mem:

  =# explain analyze select * from small join large using (id);
  WARNING:  increasing number of batches from 1 to 2
  WARNING:  increasing number of batches from 2 to 4
  WARNING:  increasing number of batches from 4 to 8
  WARNING:  increasing number of batches from 8 to 16
  WARNING:  increasing number of batches from 16 to 32
  WARNING:  increasing number of batches from 32 to 64
  WARNING:  increasing number of batches from 64 to 128
  WARNING:  increasing number of batches from 128 to 256
  WARNING:  increasing number of batches from 256 to 512
  NOTICE:  Growing number of hash batch to 512 is exhausting allowed memory
           (2164736 > 2097152)
  WARNING:  increasing number of batches from 512 to 1024
  NOTICE:  Growing number of hash batch to 1024 is exhausting allowed memory
           (4329472 > 2097152)
  WARNING:  increasing number of batches from 1024 to 2048
  NOTICE:  Growing number of hash batch to 2048 is exhausting allowed memory
           (8626304 > 2097152)
  WARNING:  increasing number of batches from 2048 to 4096
  NOTICE:  Growing number of hash batch to 4096 is exhausting allowed memory
           (17252480 > 2097152)
  WARNING:  increasing number of batches from 4096 to 8192
  NOTICE:  Growing number of hash batch to 8192 is exhausting allowed memory
           (34504832 > 2097152)
  WARNING:  increasing number of batches from 8192 to 16384
  NOTICE:  Growing number of hash batch to 16384 is exhausting allowed memory
           (68747392 > 2097152)
  WARNING:  increasing number of batches from 16384 to 32768
  NOTICE:  Growing number of hash batch to 32768 is exhausting allowed memory
           (137494656 > 2097152)

                                QUERY PLAN
  --------------------------------------------------------------------------
  Hash Join  (cost=6542057.16..7834651.23 rows=7 width=74)
             (actual time=558502.127..724007.708 rows=7040 loops=1)
             Hash Cond: (small.id = large.id)
  ->  Seq Scan on small (cost=0.00..940094.00 rows=94000000 width=41)
                        (actual time=0.035..3.666 rows=10000 loops=1)
  ->  Hash (cost=6542057.07..6542057.07 rows=7 width=41)
           (actual time=558184.152..558184.153 rows=700000000 loops=1)
           Buckets: 32768 (originally 1024)
           Batches: 32768 (originally 1)
           Memory Usage: 1921kB
    -> Seq Scan on large (cost=0.00..6542057.07 rows=7 width=41)
                         (actual time=0.324..193750.567 rows=700000000 loops=1)
  Planning Time: 1.588 ms
  Execution Time: 724011.074 ms (8 rows)

Regards,

Attachment

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: "variable not found in subplan target list"
Next
From: Daniel Gustafsson
Date:
Subject: Re: TAP output format in pg_regress