Re: Memory leak from ExecutorState context? - Mailing list pgsql-hackers

From Jehan-Guillaume de Rorthais
Subject Re: Memory leak from ExecutorState context?
Date
Msg-id 20230301184840.0a897a80@karst
Whole thread Raw
In response to Re: Memory leak from ExecutorState context?  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: Memory leak from ExecutorState context?
Re: Memory leak from ExecutorState context?
List pgsql-hackers
Hi Tomas,

On Tue, 28 Feb 2023 20:51:02 +0100
Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> On 2/28/23 19:06, Jehan-Guillaume de Rorthais wrote:
> > * HashBatchContext goes up to 1441MB after 240s then stay flat until the end
> >   (400s as the last record)
>
> That's interesting. We're using HashBatchContext for very few things, so
> what could it consume so much memory? But e.g. the number of buckets
> should be limited by work_mem, so how could it get to 1.4GB?
>
> Can you break at ExecHashIncreaseNumBatches/ExecHashIncreaseNumBuckets
> and print how many batches/butches are there?

I did this test this morning.

Batches and buckets increased really quickly to 1048576/1048576.

ExecHashIncreaseNumBatches was really chatty, having hundreds of thousands of
calls, always short-cut'ed to 1048576, I guess because of the conditional block
«/* safety check to avoid overflow */» appearing early in this function.

I disabled the breakpoint on ExecHashIncreaseNumBatches a few time to make the
query run faster. Enabling it at 19.1GB of memory consumption, it stayed
silent till the memory exhaustion (around 21 or 22GB, I don't remember exactly).

The breakpoint on ExecHashIncreaseNumBuckets triggered some times at beginning,
and a last time close to the end of the query execution.

> > Any idea? How could I help to have a better idea if a leak is actually
> > occurring and where exactly?
>
> Investigating memory leaks is tough, especially for generic memory
> contexts like ExecutorState :-( Even more so when you can't reproduce it
> on a machine with custom builds.
>
> What I'd try is this:
>
> 1) attach breakpoints to all returns in AllocSetAlloc(), printing the
> pointer and size for ExecutorState context, so something like
>
>   break aset.c:783 if strcmp("ExecutorState",context->header.name) == 0
>   commands
>   print MemoryChunkGetPointer(chunk) size
>   cont
>   end
>
> 2) do the same for AllocSetFree()
>
> 3) Match the palloc/pfree calls (using the pointer address), to
> determine which ones are not freed and do some stats on the size.
> Usually there's only a couple distinct sizes that account for most of
> the leaked memory.

So here is what I end up with this afternoon, using file, lines and macro from
REL_11_18:

  set logging  on
  set pagination off

  break aset.c:781 if strcmp("ExecutorState",context.name) == 0
  commands 1
  print (((char *)(chunk)) + sizeof(struct AllocChunkData))
  print chunk->size
  cont
  end

  break aset.c:820 if strcmp("ExecutorState",context.name) == 0
  commands 2
  print (((char *)(chunk)) + sizeof(struct AllocChunkData))
  print chunk->size
  cont
  end

  break aset.c:979 if strcmp("ExecutorState",context.name) == 0
  commands 3
  print (((char *)(chunk)) + sizeof(struct AllocChunkData))
  print chunk->size
  cont
  end

  break AllocSetFree if strcmp("ExecutorState",context.name) == 0
  commands 4
  print pointer
  cont
  end

So far, gdb had more than 3h of CPU time and is eating 2.4GB of memory. The
backend had only 3'32" of CPU time:

   VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  2727284   2.4g  17840 R  99.0  7.7 181:25.07 gdb
  9054688 220648 103056 t   1.3  0.7   3:32.05 postmaster

Interestingly, the RES memory of the backend did not explode yet, but VIRT is
already high.

I suppose the query will run for some more hours, hopefully, gdb will not
exhaust the memory in the meantime...

You'll find some intermediate stats I already collected in attachment:

* break 1, 2 and 3 are from AllocSetAlloc, break 4 is from AllocSetFree.
* most of the non-free'd chunk are allocated since the very beginning, before
  the 5000's allocation call for almost 1M call so far.
* 3754 of them have a chunk->size of 0
* it seems there's some buggy stats or data:
       # this one actually really comes from the gdb log
       0x38a77b8: break=3 num=191       sz=4711441762604810240 (weird sz)
       # this one might be a bug in my script
             0x2: break=2 num=945346    sz=2                   (weird address)
* ignoring the weird size requested during the 191st call, the total amount
  of non free'd memory is currently 5488MB

I couldn't print "size" as it is optimzed away, that's why I tracked
chunk->size... Is there anything wrong with my current run and gdb log?

The gdb log is 5MB compressed. I'll keep it off-list, but I can provide it if
needed.

Stay tuned...

Thank you!

Attachment

pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Weird failure with latches in curculio on v15
Next
From: Melanie Plageman
Date:
Subject: Add pg_walinspect function with block info columns