Re: terminate called after throwing an instance of 'std::bad_alloc' - Mailing list pgsql-hackers

From: Justin Pryzby
Subject: Re: terminate called after throwing an instance of 'std::bad_alloc'
Msg-id: 20210418001324.GP3315@telsasoft.com
In response to: Re: terminate called after throwing an instance of 'std::bad_alloc' (Justin Pryzby <pryzby@telsasoft.com>)
List: pgsql-hackers
On Fri, Apr 16, 2021 at 10:18:37PM -0500, Justin Pryzby wrote:
> On Fri, Apr 16, 2021 at 09:48:54PM -0500, Justin Pryzby wrote:
> > On Fri, Apr 16, 2021 at 07:17:55PM -0700, Andres Freund wrote:
> > > Hi,
> > > 
> > > On 2020-12-18 17:56:07 -0600, Justin Pryzby wrote:
> > > > I'd be happy to run with a prototype fix for the leak to see if the other issue
> > > > does (not) recur.
> > > 
> > > I just posted a prototype fix to
https://www.postgresql.org/message-id/20210417021602.7dilihkdc7oblrf7%40alap3.anarazel.de
> > > (just because that was the first thread I re-found). It'd be cool if you
> > > could have a look!
> > 
> > This doesn't seem to address the problem triggered by the reproducer at
> > https://www.postgresql.org/message-id/20210331040751.GU4431@telsasoft.com
> > (sorry I didn't CC you)
> 
> I take that back - I forgot that this doesn't release RAM until hitting a
> threshold.

I tried this on the query that was causing the original c++ exception.

It still grows to 2GB within 5min.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
23084 postgres  20   0 2514364   1.6g  29484 R  99.7 18.2   3:40.87 postgres: telsasoft ts 192.168.122.11(50892) SELECT

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
23084 postgres  20   0 3046960   2.1g  29484 R 100.0 24.1   4:30.64 postgres: telsasoft ts 192.168.122.11(50892) SELECT

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
23084 postgres  20   0 4323500   3.3g  29488 R  99.7 38.4   8:20.63 postgres: telsasoft ts 192.168.122.11(50892) SELECT

When I first reported this issue, the affected process was a long-running,
single-threaded python tool.  We have since updated it (partly to avoid issues
like this) to use multiprocessing, and therefore separate postgres backends.

I'm now realizing that this is RAM use for a single query, not a cumulative
leak across multiple queries.  This is still true with the patch, even if I
#define LLVMJIT_LLVM_CONTEXT_REUSE_MAX 1

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
28438 postgres  20   0 3854264   2.8g  29428 R  98.7 33.2   8:56.79 postgres: telsasoft ts 192.168.122.11(53614) BIND

python3 ./jitleak.py # runs telsasoft reports
INFO:  recreating LLVM context after 2 uses
INFO:  recreating LLVM context after 2 uses
INFO:  recreating LLVM context after 2 uses
INFO:  recreating LLVM context after 2 uses
INFO:  recreating LLVM context after 2 uses
PID 27742 finished running report; est=None rows=40745; cols=34; ... duration:538
INFO:  recreating LLVM context after 81492 uses

I did:

-               llvm_llvm_context_reuse_count = 0;
                Assert(llvm_context != NULL);
+               elog(INFO, "recreating LLVM context after %zu uses", llvm_llvm_context_reuse_count);
+               llvm_llvm_context_reuse_count = 0;

Maybe we're missing this condition somehow?
        if (llvm_jit_context_in_use_count == 0 &&

Also, I just hit this assertion by cancelling the query with ^C / SIGINT.  But
I don't have a reproducer for it.

< 2021-04-17 19:14:23.509 ADT telsasoft >PANIC:  LLVMJitContext in use count not 0 at exit (is 1)

-- 
Justin


