Re: Memory leak from ExecutorState context? - Mailing list pgsql-hackers

From Konstantin Knizhnik
Subject Re: Memory leak from ExecutorState context?
Date
Msg-id 8338f4ee-47f0-f622-e37f-b1ce1dbfdf38@garret.ru
In response to Re: Memory leak from ExecutorState context?  (Melanie Plageman <melanieplageman@gmail.com>)
List pgsql-hackers

On 21.04.2023 1:51 AM, Melanie Plageman wrote:
> On Thu, Apr 20, 2023 at 12:42 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote:
>> On 11.04.2023 8:14 PM, Jehan-Guillaume de Rorthais wrote:
>>> On Sat, 8 Apr 2023 02:01:19 +0200
>>> Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
>>>
>>>> On Fri, 31 Mar 2023 14:06:11 +0200
>>>> Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
>>>>
>>>>    [...]
>>>>
>>>> After rebasing Tomas' memory balancing patch, I did some memory measures
>>>> to answer some of my questions. Please, find in attachment the resulting
>>>> charts "HJ-HEAD.png" and "balancing-v3.png" to compare memory consumption
>>>> between HEAD and Tomas' patch. They show the alternation of numbers
>>>> before/after calling ExecHashIncreaseNumBatches (see the debug patch). I
>>>> didn't try to find the exact last total peak of memory consumption during the
>>>> join phase and before all the BufFiles are destroyed. So the last number
>>>> might be underestimated.
>>> I did some more analysis about the total memory consumption in filecxt of HEAD,
>>> v3 and v4 patches. My previous debug numbers only print memory metrics during
>>> batch increments or hash table destruction. That means:
>>>
>>> * for HEAD: we miss the batches consumed during the outer scan
>>> * for v3: adds twice nbatch in spaceUsed, which is a rough estimation
>>> * for v4: batches are tracked in spaceUsed, so they are reflected in spacePeak
>>>
>>> Using a breakpoint in ExecHashJoinSaveTuple to print "filecxt->mem_allocated"
>>> from there, here is the maximum memory allocated in the bufFile context for each
>>> branch:
>>>
>>>           batches   max bufFiles   total  spaceAllowed rise
>>>     HEAD    16384      199966960  ~194MB
>>>     v3       4096       65419456   ~78MB
>>>     v4(*3)   2048       34273280    48MB  nbatch*sizeof(PGAlignedBlock)*3
>>>     v4(*4)   1024       17170160  60.6MB  nbatch*sizeof(PGAlignedBlock)*4
>>>     v4(*5)   2048       34273280  42.5MB  nbatch*sizeof(PGAlignedBlock)*5
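(As an aside: the same number could also be logged from the backend instead of via a gdb breakpoint. A hypothetical debug helper, not part of the posted patches, with the "fileCxt" field name assumed from the 0002 patch:)

#include "postgres.h"
#include "executor/hashjoin.h"
#include "utils/memutils.h"

/*
 * Hypothetical debug helper (not part of the posted patches): report how
 * much memory the hash join's BufFile memory context currently has
 * allocated, i.e. the same number printed at the breakpoint above.
 * The "fileCxt" field name is an assumption taken from the 0002 patch.
 */
static void
ReportBatchFileMemory(HashJoinTable hashtable)
{
    elog(DEBUG1, "hashjoin batch files: %zu bytes allocated",
         MemoryContextMemAllocated(hashtable->fileCxt, true));
}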
>>>
>>> It seems accounting for bufFile in spaceUsed allows better memory balancing and
>>> management. The precise factor by which to raise spaceAllowed is yet to be defined. *3 or
>>> *4 looks good, but this is based on a single artificial test case.
>>>
>>> Also, note that HEAD is currently reporting ~4MB of memory usage, which is far
>>> from reality. So even if we don't commit the memory balancing patch in v16,
>>> maybe we could account for filecxt in spaceUsed as a bugfix?
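(For concreteness, the kind of accounting described above might look roughly like this; a hypothetical sketch only, not the actual v4 patch:)

#include "postgres.h"
#include "executor/hashjoin.h"

/*
 * Hypothetical sketch, not the actual v4 patch: when the number of
 * batches grows (e.g. in ExecHashIncreaseNumBatches), charge the new
 * per-batch file buffers against spaceUsed so they are reflected in
 * spacePeak and thus in the reported "Memory Usage".  Each new batch
 * needs one inner and one outer BufFile, each buffering one
 * PGAlignedBlock (BLCKSZ bytes).
 */
static void
AccountForBatchFileBuffers(HashJoinTable hashtable, int oldnbatch)
{
    Size        added = (Size) (hashtable->nbatch - oldnbatch) *
        2 * sizeof(PGAlignedBlock);

    hashtable->spaceUsed += added;
    if (hashtable->spaceUsed > hashtable->spacePeak)
        hashtable->spacePeak = hashtable->spaceUsed;
}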
>>>
>>> Regards,
>> Thank you for the patch.
>> I faced the same problem (OOM caused by a hash join).
>> I tried to create the simplest test reproducing the problem:
>>
>> create table t(pk int, val int);
>> insert into t values (generate_series(1,100000000),0);
>> set work_mem='64kB';
>> explain (analyze,buffers) select count(*) from t t1 join t t2 on
>> (t1.pk=t2.pk);
>>
>>
>> There are three workers, and the size of each exceeds 1.3GB.
>>
>> The plan is the following:
>>
>>    Finalize Aggregate  (cost=355905977972.87..355905977972.88 rows=1 width=8) (actual time=212961.033..226097.513 rows=1 loops=1)
>>      Buffers: shared hit=32644 read=852474 dirtied=437947 written=426374, temp read=944407 written=1130380
>>      ->  Gather  (cost=355905977972.65..355905977972.86 rows=2 width=8) (actual time=212943.505..226097.497 rows=3 loops=1)
>>            Workers Planned: 2
>>            Workers Launched: 2
>>            Buffers: shared hit=32644 read=852474 dirtied=437947 written=426374, temp read=944407 written=1130380
>>            ->  Partial Aggregate  (cost=355905976972.65..355905976972.66 rows=1 width=8) (actual time=212938.410..212940.035 rows=1 loops=3)
>>                  Buffers: shared hit=32644 read=852474 dirtied=437947 written=426374, temp read=944407 written=1130380
>>                  ->  Parallel Hash Join  (cost=1542739.26..303822614472.65 rows=20833345000002 width=0) (actual time=163268.274..207829.524 rows=33333333 loops=3)
>>                        Hash Cond: (t1.pk = t2.pk)
>>                        Buffers: shared hit=32644 read=852474 dirtied=437947 written=426374, temp read=944407 written=1130380
>>                        ->  Parallel Seq Scan on t t1  (cost=0.00..859144.78 rows=41666678 width=4) (actual time=0.045..30828.051 rows=33333333 loops=3)
>>                              Buffers: shared hit=16389 read=426089 written=87
>>                        ->  Parallel Hash  (cost=859144.78..859144.78 rows=41666678 width=4) (actual time=82202.445..82202.447 rows=33333333 loops=3)
>>                              Buckets: 4096 (originally 4096)  Batches: 32768 (originally 8192)  Memory Usage: 192kB
>>                              Buffers: shared hit=16095 read=426383 dirtied=437947 written=426287, temp read=267898 written=737164
>>                              ->  Parallel Seq Scan on t t2  (cost=0.00..859144.78 rows=41666678 width=4) (actual time=0.054..12647.534 rows=33333333 loops=3)
>>                                    Buffers: shared hit=16095 read=426383 dirtied=437947 written=426287
>>    Planning:
>>      Buffers: shared hit=69 read=38
>>    Planning Time: 2.819 ms
>>    Execution Time: 226113.292 ms
>> (22 rows)
>>
>>
>>
>> -----------------------------
>>
>> So the number of batches has increased to 32k.
>> I applied your patches 0001-0004, but unfortunately they have not reduced
>> memory consumption - the size of each backend is still more than 1.3GB.
> Is this EXPLAIN ANALYZE run on an instance with Jehan-Guillaume's
> patchset applied or without?
>
> I'm asking because the fourth patch in the series updates spaceUsed with
> the size of the BufFile->buffer, but I notice in your EXPLAIN ANALYZE,
> Memory Usage for the hashtable is reported as 192 kB, which, while
> larger than the 64kB work_mem you set, isn't as large as I might expect.
>
> - Melanie
Yes, this is the explain analyze output for the Postgres version with the 4
patches applied:

0001-v4-Describe-hybrid-hash-join-implementation.patch
0002-v4-Allocate-hash-batches-related-BufFile-in-a-dedicated.patch
0003-v4-Add-some-debug-and-metrics.patch
0004-v4-Limit-BufFile-memory-explosion-with-bad-HashJoin.patch

Just as a workaround I tried the attached patch - it prevents the growth of
the backends' memory footprint by limiting the number of created batches.
I am not sure that it is the right solution, because in any case we allocate
more memory than specified by work_mem. The alternative is to prohibit the
hash join plan in this case, but that is also not a good solution, because
merge join is usually much slower.
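For reference, the general shape of the idea is roughly the following. This is a simplified, hypothetical sketch for the serial hash join path, not the attached patch itself; the factor 4 is an arbitrary placeholder:

#include "postgres.h"
#include "executor/hashjoin.h"

/*
 * Hypothetical sketch, not the attached patch: refuse to double nbatch
 * once the batch-file buffers alone would exceed some multiple of
 * spaceAllowed (work_mem), accepting that the in-memory hash table then
 * overflows work_mem instead of the file buffers exploding.
 */
#define BATCH_FILE_MEM_FACTOR   4       /* arbitrary placeholder */

static bool
BatchIncreaseAllowed(HashJoinTable hashtable)
{
    /* buffer memory the inner and outer batch files would need after doubling */
    Size        file_mem = (Size) hashtable->nbatch * 2 /* doubled */ *
        2 /* inner + outer */ * sizeof(PGAlignedBlock);

    return file_mem < hashtable->spaceAllowed * BATCH_FILE_MEM_FACTOR;
}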

Attachment
