Thread: Memory leak on subquery as scalar operand

Memory leak on subquery as scalar operand

From
Lauri Laanmets
Date:
There seems to be a memory leak in PostgreSQL for Linux (doesn't happen on Windows). At least in versions 12 to 14 (and not in 9.5). It can be reproduced with the following SQL script on an empty database.

* When repeated, Linux process starts using more and more RAM until OOM-killer will kill it.
* 'pg_backend_memory_contexts' doesn't show any increase.
* (gdb) p MemoryContextStats(TopMemoryContext) - doesn't show any increase.

-/
CREATE TABLE IF NOT EXISTS public.leak_test
(
   id integer NOT NULL,
   CONSTRAINT leak_test_pkey PRIMARY KEY (id)
);

INSERT INTO leak_test(id)
   SELECT id
   FROM generate_series(1,100000) id
ON CONFLICT DO NOTHING;

SELECT
    id,
    (SELECT count(*) FROM leak_test x WHERE x.id=l.id) as x_result,
    (SELECT count(*) FROM leak_test y WHERE y.id=l.id) as y_result
    /* Leaks memory around 80 kB on each query, but only if two sub-queries are used. */
FROM leak_test l;
/-

Regards
Lauri

Re: Memory leak on subquery as scalar operand

From
Tom Lane
Date:
Lauri Laanmets <pcspets@gmail.com> writes:
> There seems to be a memory leak in PostgreSQL for Linux (doesn't happen on
> Windows). At least in versions 12 to 14 (and not in 9.5). It can be
> reproduced with the following SQL script on an empty database.

Thanks for the test case!  Unfortunately, I can't see any ill effects
with it in currently released versions (I tried 12, 14, 15, and HEAD).
Which minor releases did you try?  Do you have any non-default
settings or build options?

            regards, tom lane



Re: Memory leak on subquery as scalar operand

From
Lauri Laanmets
Date:
I have tested it on version 12 that comes with Ubuntu 20.04.5 LTS. Then upgraded to "PostgreSQL 14.5 (Ubuntu 14.5-2.pgdg20.04+2) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, 64-bit" from "deb http://apt.postgresql.org/pub/repos/apt focal-pgdg main" but the behaviour is the same - if my example query is repeatedly executed on an open session, the postgres process will start using 1 MB more RAM after around every 14 times of execution. Until OOM-killer kills it.

I haven't changed any default settings and it can be reproduced also by pgAdmin Query Tool that executes the query under 'postgres' user. Although it's a bit tiresome to make so many executions manually.

It is interesting that it doesn't happen on Windows. Might be that it's caused by the OS or system in general but I have no idea what to try or test next.

The machine is: Linux 5.4.0-131-generic #147-Ubuntu SMP Fri Oct 14 17:07:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Regards
Lauri

Kontakt Tom Lane (<tgl@sss.pgh.pa.us>) kirjutas kuupäeval E, 31. oktoober 2022 kell 15:49:
Lauri Laanmets <pcspets@gmail.com> writes:
> There seems to be a memory leak in PostgreSQL for Linux (doesn't happen on
> Windows). At least in versions 12 to 14 (and not in 9.5). It can be
> reproduced with the following SQL script on an empty database.

Thanks for the test case!  Unfortunately, I can't see any ill effects
with it in currently released versions (I tried 12, 14, 15, and HEAD).
Which minor releases did you try?  Do you have any non-default
settings or build options?

                        regards, tom lane

Re: Memory leak on subquery as scalar operand

From
David Rowley
Date:
On Mon, 31 Oct 2022 at 23:03, Lauri Laanmets <pcspets@gmail.com> wrote:
> There seems to be a memory leak in PostgreSQL for Linux (doesn't happen on Windows). At least in versions 12 to 14
(andnot in 9.5). It can be reproduced with the following SQL script on an empty database.
 
>
> * When repeated, Linux process starts using more and more RAM until OOM-killer will kill it.
> * 'pg_backend_memory_contexts' doesn't show any increase.
> * (gdb) p MemoryContextStats(TopMemoryContext) - doesn't show any increase.

Does it still OOM if you do:

SET jit TO off; ?

David



Re: Memory leak on subquery as scalar operand

From
Tom Lane
Date:
David Rowley <dgrowleyml@gmail.com> writes:
> Does it still OOM if you do:
> SET jit TO off; ?

Ah, bingo --- I can reproduce the behavior if I use a JIT-enabled
build.  So this is related to the known problems with leakage of
compiled objects.  It's pretty odd though that it requires two
sub-selects to cause the problem.

            regards, tom lane



Re: Memory leak on subquery as scalar operand

From
David Rowley
Date:
On Tue, 1 Nov 2022 at 13:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It's pretty odd though that it requires two
> sub-selects to cause the problem.

Perhaps that's just what it takes to bump the costs above the JIT threshold.

David



Re: Memory leak on subquery as scalar operand

From
Tom Lane
Date:
David Rowley <dgrowleyml@gmail.com> writes:
> On Tue, 1 Nov 2022 at 13:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> It's pretty odd though that it requires two
>> sub-selects to cause the problem.

> Perhaps that's just what it takes to bump the costs above the JIT threshold.

I see JIT being invoked either way:

regression=# explain verbose SELECT
    id,
    (SELECT count(*) FROM leak_test x WHERE x.id=l.id) as x_result,
    (SELECT count(*) FROM leak_test y WHERE y.id=l.id) as y_result
FROM leak_test l;
                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Seq Scan on public.leak_test l  (cost=0.00..865943.00 rows=100000 width=20)
   Output: l.id, (SubPlan 1), (SubPlan 2)
   SubPlan 1
     ->  Aggregate  (cost=4.31..4.32 rows=1 width=8)
           Output: count(*)
           ->  Index Only Scan using leak_test_pkey on public.leak_test x  (cost=0.29..4.31 rows=1 width=0)
                 Output: x.id
                 Index Cond: (x.id = l.id)
   SubPlan 2
     ->  Aggregate  (cost=4.31..4.32 rows=1 width=8)
           Output: count(*)
           ->  Index Only Scan using leak_test_pkey on public.leak_test y  (cost=0.29..4.31 rows=1 width=0)
                 Output: y.id
                 Index Cond: (y.id = l.id)
 JIT:
   Functions: 12
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(17 rows)

regression=# explain verbose SELECT
    id,
    (SELECT count(*) FROM leak_test x WHERE x.id=l.id) as x_result FROM leak_test l;
                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Seq Scan on public.leak_test l  (cost=0.00..433693.00 rows=100000 width=12)
   Output: l.id, (SubPlan 1)
   SubPlan 1
     ->  Aggregate  (cost=4.31..4.32 rows=1 width=8)
           Output: count(*)
           ->  Index Only Scan using leak_test_pkey on public.leak_test x  (cost=0.29..4.31 rows=1 width=0)
                 Output: x.id
                 Index Cond: (x.id = l.id)
 JIT:
   Functions: 7
   Options: Inlining false, Optimization false, Expressions true, Deforming true
(11 rows)

Maybe the different "inlining" choice makes a difference?

            regards, tom lane



Re: Memory leak on subquery as scalar operand

From
David Rowley
Date:
On Tue, 1 Nov 2022 at 17:10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Maybe the different "inlining" choice makes a difference?

The single subquery version also crashes for me, so perhaps it's just
the amount of memory that's being used and when the OOM killer is
triggering.

It crashes even when I set jit_inline_above_cost and
jit_optimize_above_cost above the query's cost.

postgres=# set jit_optimize_above_cost=10000000;
SET
postgres=# set jit_inline_above_cost=10000000;
SET
postgres=# set jit=0;
SET
postgres=# explain verbose SELECT
    id,
    (SELECT count(*) FROM leak_test y WHERE y.id=l.id) as y_result
FROM leak_test l;
                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Seq Scan on public.leak_test l  (cost=0.00..941723.86 rows=112965 width=12)
   Output: l.id, (SubPlan 1)
   SubPlan 1
     ->  Aggregate  (cost=8.31..8.32 rows=1 width=8)
           Output: count(*)
           ->  Index Only Scan using leak_test_pkey on
public.leak_test y  (cost=0.29..8.31 rows=1 width=0)
                 Output: y.id
                 Index Cond: (y.id = l.id)
(8 rows)

postgres=# set jit=1;
SET
postgres=# explain verbose SELECT
    id,
    (SELECT count(*) FROM leak_test y WHERE y.id=l.id) as y_result
FROM leak_test l;
server closed the connection unexpectedly

David



Re: Memory leak on subquery as scalar operand

From
Tom Lane
Date:
David Rowley <dgrowleyml@gmail.com> writes:
> The single subquery version also crashes for me, so perhaps it's just
> the amount of memory that's being used and when the OOM killer is
> triggering.
> It crashes even when I set jit_inline_above_cost and
> jit_optimize_above_cost above the query's cost.

Hmm, maybe we're not seeing the same thing?  For me, the behavior
seems similar to what the OP reported: there's a per-query leakage
but it's less than 100kB per query.  It'd take more than a handful
of repetitions to get to an OOM failure.  This is with LLVM 13.0.1
on RHEL 8.6.

Also, as far as I can see there is no leak with the single-subquery
version.  The process's reported VIRT consumption bounces around a
fair amount, but it doesn't go above 300MB, even after ~10min in
a tight plpgsql DO loop.  VIRT bounces around a lot with the
two-subquery version too, actually, but there does seem to be a
general uptrend there.  I did not have the patience to wait for
actual OOM; it looked like it'd take a good long while, tens
of minutes at least.

If the behavior varies across LLVM versions, as is now seeming
a bit likely, it might be their bug not ours.

            regards, tom lane



Re: Memory leak on subquery as scalar operand

From
Andres Freund
Date:
Hi,

On 2022-11-01 00:52:16 -0400, Tom Lane wrote:
> David Rowley <dgrowleyml@gmail.com> writes:
> > The single subquery version also crashes for me, so perhaps it's just
> > the amount of memory that's being used and when the OOM killer is
> > triggering.
> > It crashes even when I set jit_inline_above_cost and
> > jit_optimize_above_cost above the query's cost.
> 
> Hmm, maybe we're not seeing the same thing?

I can't reproduce that either.


> For me, the behavior seems similar to what the OP reported: there's a
> per-query leakage but it's less than 100kB per query.  It'd take more than a
> handful of repetitions to get to an OOM failure.  This is with LLVM 13.0.1
> on RHEL 8.6.

This I can reproduce. Here's an updated patchset addressing this. This query,
for some reason, leaks a lot more aggressively than what I've seen in the
past, so I needed to reduce the amount of time until an llvm context is
recycled substantially. A bit of benchmarking showed no negative consequences
of going to 100 uses till recycling, even with absurd settings
(i.e. jit_*_cost = 0), but did show impact on lower values.

Greetings,

Andres Freund

Attachment

Re: Memory leak on subquery as scalar operand

From
Lauri Laanmets
Date:
> Does it still OOM if you do:
> SET jit TO off; ?

Indeed, it doesn't leak if JIT is OFF. Thank you for a workaround hint!

Regards
Lauri



Kontakt Tom Lane (<tgl@sss.pgh.pa.us>) kirjutas kuupäeval T, 1. november 2022 kell 02:44:
David Rowley <dgrowleyml@gmail.com> writes:
> Does it still OOM if you do:
> SET jit TO off; ?

Ah, bingo --- I can reproduce the behavior if I use a JIT-enabled
build.  So this is related to the known problems with leakage of
compiled objects.  It's pretty odd though that it requires two
sub-selects to cause the problem.

                        regards, tom lane

Re: Memory leak on subquery as scalar operand

From
Justin Pryzby
Date:
On Mon, Oct 31, 2022 at 10:51:32PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2022-11-01 00:52:16 -0400, Tom Lane wrote:
> > David Rowley <dgrowleyml@gmail.com> writes:
> > > The single subquery version also crashes for me, so perhaps it's just
> > > the amount of memory that's being used and when the OOM killer is
> > > triggering.
> > > It crashes even when I set jit_inline_above_cost and
> > > jit_optimize_above_cost above the query's cost.
> > 
> > Hmm, maybe we're not seeing the same thing?
> 
> I can't reproduce that either.

@David: does the query really crash as you described - with neither jit
inlining nor optimization ?

Is it due to OOM or something else ?
Does it crash with/without Andres' patches ?
What version were you testing ?

-- 
Justin



Re: Memory leak on subquery as scalar operand

From
Daniel Gustafsson
Date:
> On 1 Nov 2022, at 06:51, Andres Freund <andres@anarazel.de> wrote:
>> David Rowley <dgrowleyml@gmail.com> writes:

>> For me, the behavior seems similar to what the OP reported: there's a
>> per-query leakage but it's less than 100kB per query.  It'd take more than a
>> handful of repetitions to get to an OOM failure.  This is with LLVM 13.0.1
>> on RHEL 8.6.
>
> This I can reproduce. Here's an updated patchset addressing this. This query,
> for some reason, leaks a lot more aggressively than what I've seen in the
> past, so I needed to reduce the amount of time until an llvm context is
> recycled substantially. A bit of benchmarking showed no negative consequences
> of going to 100 uses till recycling, even with absurd settings
> (i.e. jit_*_cost = 0), but did show impact on lower values.

I had a look at this and I concur with the findings in this thread.  I didn't
do any benchmarking but running various tests I was unable to trigger an OOM.
Will do more testing and stressing of it.  A few small comments on the
patchset:

From reading it seems that patch 0002 and 0003 can be committed regardless of
the other patches in this series.  Were they included because they were found
while looking at this, or is there a deeper connection I'm missing?

0004:

The commit message states: "That incurs some overhead, so only do so after
10000 JITed queries.", but I fail to see how that's implemented.  There is
currently recreation after 100 reuses, was the intention to have a different
number here or is this just a leftover from an earlier patch-version?


+    /*
+     * The LLVM Context used by this JIT context. An LLVM context is reused
+     * across many compilations, but occasionally reset to prevent it using
+     * too much memory due to more and more types accumulating.
+     */
+    LLVMContextRef llvm_context;

llvm_context is added as a member in LLVMJitContext but is never set or read,
the static llvmjit.c:llvm_context is still used for everything.  Is this a
lefover or was the plan to move this to LLVMJitContext?


+     * FIXME: should split the handling of llvm_triple / llvm_layout out
+     * of llvm_create_types() - that doesn't need to be redone.

Agreed, that seems wasteful.  AFAICT there would not be any reason to recreate
this once set?


+    /*
+     * Consider as cleaned up even if we skip doing so below, that way we can
+     * verify the tracking is correct (see llvm_shutdown()).
+     */
+    llvm_jit_context_in_use_count--;

Since this doesn't actually release even if llvm_jit_context_in_use_count goes
to zero here, this hunk in llvm_release_context is seemingly a bit at odds with
the following from jit/README: "If it is desirable to release resources
earlier, jit_release_context() can be used".  If we want to verify the usage
tracking in llvm_shutdown it's hard to see how we could much else, but maybe a
note in the README that not all resources are guaranteed to be released could
be in order?

Regarding llvm_release_context we currently have this:

static void
llvm_release_context(JitContext *context)
{
    LLVMJitContext *llvm_context = (LLVMJitContext *) context;

..which shadows the "static LLVMContextRef llvm_context;" declared in this
patchset, we should probably rename the local var in llvm_release_context.
(this is referred to in the 0004 commit message.)

The attached 0005 is a WIP attempt to address a few of the FIXME's in this
patchset on top of your the 0001-0004.

--
Daniel Gustafsson


Attachment

Re: Memory leak on subquery as scalar operand

From
Daniel Gustafsson
Date:
> On 2 May 2023, at 15:33, Daniel Gustafsson <daniel@yesql.se> wrote:

> The attached 0005 is a WIP attempt to address a few of the FIXME's in this
> patchset on top of your the 0001-0004.

I've attached a rebased v5 which has the above fixes as well as a few smaller
ones missed in v4.  Testing through the reports linked to from the open item
it's clear that while there is still a continuous rise in resident size, this
slows it down dramatically with the size levelling out which unpatched master
doesn't.  Reducing LLVMJIT_LLVM_CONTEXT_REUSE_MAX back even further pushes the
levelling off further as well, making me wonder if this is something worth
using a GUC for given that we're stuck with the destructive LLVM linker for the
foreseeable future?

--
Daniel Gustafsson


Attachment

Re: Memory leak on subquery as scalar operand

From
Daniel Gustafsson
Date:
Attached is a v6 rebase of this patchset which curbs a memory leak in llvmjit
by explicitly using an LLVMContextRef for types which is dropped and recreated
at intervals to free unused types.  The attached graph plots the memory usage
of a backend continuously running the query from the OP in this thread.  The
patched version (running with all JIT costs at zero to get all inlining etc)
goes to a levelled off memory usage where master just continues to grow until
terminated.  There is more to do on llvmjit memory usage, but this is clearly a
win for queries which otherwise accumulte type leaks potentially ending with an
OOM.

0001 and 0002 are tangentially related, but are mainly of cleanup character.
0003 contains the LLVMContextRef work which is the meat of the patchset.

I think it would be good to get this in early in the v17 cycle such that we
have time to revisit the herustic if need be.

Thoughts?

--
Daniel Gustafsson






Attachment

Re: Memory leak on subquery as scalar operand

From
Daniel Gustafsson
Date:
> On 13 Sep 2023, at 15:24, Daniel Gustafsson <daniel@yesql.se> wrote:

> I think it would be good to get this in early in the v17 cycle such that we
> have time to revisit the herustic if need be.

With an offlist +1 from Andres, and another couple of rounds of read-throughs
and jit_above_cost=0 check-worlds I went ahead and applied this to master.  The
buildfarm has green builds for LLVM enabled builds on Linux, FreeBSD and macOS.

--
Daniel Gustafsson




Re: Memory leak on subquery as scalar operand

From
Andres Freund
Date:
On 2023-09-27 14:07:16 +0200, Daniel Gustafsson wrote:
> > On 13 Sep 2023, at 15:24, Daniel Gustafsson <daniel@yesql.se> wrote:
> 
> > I think it would be good to get this in early in the v17 cycle such that we
> > have time to revisit the herustic if need be.
> 
> With an offlist +1 from Andres, and another couple of rounds of read-throughs
> and jit_above_cost=0 check-worlds I went ahead and applied this to master.  The
> buildfarm has green builds for LLVM enabled builds on Linux, FreeBSD and macOS.

Thanks a lot or working on this!



Re: Memory leak on subquery as scalar operand

From
Daniel Gustafsson
Date:
> On 27 Sep 2023, at 14:07, Daniel Gustafsson <daniel@yesql.se> wrote:
>
>> On 13 Sep 2023, at 15:24, Daniel Gustafsson <daniel@yesql.se> wrote:
>
>> I think it would be good to get this in early in the v17 cycle such that we
>> have time to revisit the herustic if need be.
>
> With an offlist +1 from Andres, and another couple of rounds of read-throughs
> and jit_above_cost=0 check-worlds I went ahead and applied this to master.  The
> buildfarm has green builds for LLVM enabled builds on Linux, FreeBSD and macOS.

This, as well as the prerequisite backpatch of bab150045bd9 to 12 and 13 has
now been backpatched to all supported branches.  By now there has been green
builds by LLVM enabled animals in the buildfarm for all branches.

--
Daniel Gustafsson