Thread: unconstrained memory growth in long running procedure stored procedure after upgrading 11-12

Hello all,

We just upgraded from postgres 11 to 12.6 and our server is running
out of memory and rebooting about 1-2 times a day.    Application
architecture is a single threaded stored procedure, executed with CALL
that loops and never terminates. With postgres 11 we had no memory
issues.  Ultimately the crash looks like this:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
2021-03-29 04:34:31.262 CDT [1413] LOG:  server process (PID 9792) was
terminated by signal 6: Aborted
2021-03-29 04:34:31.262 CDT [1413] DETAIL:  Failed process was
running: CALL Main()
2021-03-29 04:34:31.262 CDT [1413] LOG:  terminating any other active
server processes
2021-03-29 04:34:31.264 CDT [9741] WARNING:  terminating connection
because of crash of another server process
2021-03-29 04:34:31.264 CDT [9741] DETAIL:  The postmaster has
commanded this server process to roll back the current transaction and
exit, because another server process exited abnormally and possibly
corrupted shared memory.
2021-03-29 04:34:31.264 CDT [9741] HINT:  In a moment you should be
able to reconnect to the database and repeat your command.
2021-03-29 04:34:31.267 CDT [1413] LOG:  archiver process (PID 9742)
exited with exit code 1
2021-03-29 04:34:31.267 CDT [1413] LOG:  all server processes
terminated; reinitializing

Attached is a self contained test case which reproduces the problem.

Instructions:
1. run the attached script in psql, pgtask_test.sql. It will create a
database, initialize it, and run the main procedure. dblink must be
available
2. in another window, run SELECT CreateTaskChain('test', 'DEV');

In the console that ran main(), you should see output that the
procedure began to do work. Once it does, a 'top' should show resident
memory growth immediately.   It's about a gigabyte an hour in my test.
Sorry for the large-ish attachment.

merlin

Attachment
On Tue, Mar 30, 2021 at 04:17:03PM -0500, Merlin Moncure wrote:
> Hello all,
> 
> We just upgraded from postgres 11 to 12.6 and our server is running
> out of memory and rebooting about 1-2 times a day.    Application
> architecture is a single threaded stored procedure, executed with CALL
> that loops and never terminates. With postgres 11 we had no memory
> issues.  Ultimately the crash looks like this:
> 
> terminate called after throwing an instance of 'std::bad_alloc'
>   what():  std::bad_alloc
> 2021-03-29 04:34:31.262 CDT [1413] LOG:  server process (PID 9792) was
> terminated by signal 6: Aborted

I haven't tried your test, but this sounds a lot like the issue I reported with
JIT, which is enabled by default in v12.

https://www.postgresql.org/docs/12/release-12.html
Enable Just-in-Time (JIT) compilation by default, if the server has been built with support for it (Andres Freund)
Note that this support is not built by default, but has to be selected explicitly while configuring the build.

https://www.postgresql.org/message-id/20201001021609.GC8476%40telsasoft.com
terminate called after throwing an instance of 'std::bad_alloc'

I suggest to try ALTER SYSTEM SET jit_inline_above_cost=-1; SELECT pg_reload_conf();

> memory growth immediately.   It's about a gigabyte an hour in my test.
> Sorry for the large-ish attachment.

Your reproducer is probably much better than mine was.

-- 
Justin



Justin Pryzby <pryzby@telsasoft.com> writes:
> On Tue, Mar 30, 2021 at 04:17:03PM -0500, Merlin Moncure wrote:
>> We just upgraded from postgres 11 to 12.6 and our server is running
>> out of memory and rebooting about 1-2 times a day.

> I haven't tried your test, but this sounds a lot like the issue I reported with
> JIT, which is enabled by default in v12.

FWIW, I just finished failing to reproduce any problem with that
test case ... but I was using a non-JIT-enabled build.

            regards, tom lane



On Tue, Mar 30, 2021 at 04:17:03PM -0500, Merlin Moncure wrote:
> Instructions:
> 1. run the attached script in psql, pgtask_test.sql. It will create a
> database, initialize it, and run the main procedure. dblink must be
> available
> 2. in another window, run SELECT CreateTaskChain('test', 'DEV');

For your reproducer, I needed to:  
1.1) comment this:
|INSERT INTO Task SELECT
|  -- 'test',
1.2) then run: CALL MAIN();

Anyway I reproduced this without an extension this time:

CREATE OR REPLACE FUNCTION cfn() RETURNS void LANGUAGE PLPGSQL AS $$ declare a record; begin FOR a IN SELECT
generate_series(1,99)LOOP PERFORM format('select 1'); END LOOP; END $$;
 
$ yes 'SET jit_above_cost=0; SET jit_inline_above_cost=0; SET jit=on; SET client_min_messages=debug; SET
log_executor_stats=on;SELECT cfn();' |head -11 |psql 2>&1 |grep 'max resident'
 
!       33708 kB max resident size
!       35956 kB max resident size
!       37800 kB max resident size
!       40300 kB max resident size
!       41928 kB max resident size
!       43928 kB max resident size
!       48496 kB max resident size
!       48964 kB max resident size
!       50460 kB max resident size
!       52272 kB max resident size
!       53740 kB max resident size

There's also a relatively microscopic leak even if inline is off.  It may be
that this is what I reproduced last time - I couldn't see how a few hundred kB
leak was causing a our process to be GB sized.  It may or may not be a separate
issue, though.

-- 
Justin



On Tue, Mar 30, 2021 at 7:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Justin Pryzby <pryzby@telsasoft.com> writes:
> > On Tue, Mar 30, 2021 at 04:17:03PM -0500, Merlin Moncure wrote:
> >> We just upgraded from postgres 11 to 12.6 and our server is running
> >> out of memory and rebooting about 1-2 times a day.
>
> > I haven't tried your test, but this sounds a lot like the issue I reported with
> > JIT, which is enabled by default in v12.
>
> FWIW, I just finished failing to reproduce any problem with that
> test case ... but I was using a non-JIT-enabled build.

Yep.  Disabling jit (fully, fia jit=off, not what was suggested
upthread) eliminated the issue, or at least highly mitigated the leak.
I was using pgdg rpm packaging, which enables jit by default.  Thanks
everyone for looking at this, and the workaround is quick and easy.

merlin