Thread: 12.2: Howto check memory-leak in worker?
Hi all,
I have something that looks a bit insane:

# ps axl | grep 6145
UID   PID  PPID CPU PRI NI     VSZ    RSS MWCHAN STAT TT      TIME COMMAND
770  6145     1   0  20  0  241756    868 select SsJ  -    0:24.62 /usr/local/bin/postgres -D
770  6147  6145   0  23  0  243804 109784 select IsJ  -    3:18.52 postgres: checkpointer (
770  6148  6145   0  20  0  241756  21348 select SsJ  -    2:02.83 postgres: background writer
770  6149  6145   0  20  0  241756   7240 select SsJ  -   16:36.80 postgres: walwriter (pos
770  6150  6145   0  20  0   21980    876 select SsJ  -    0:13.92 postgres: archiver last w
770  6151  6145   0  20  0   21980    980 select SsJ  -    0:58.45 postgres: stats collector
770  6152  6145   0  20  0  241756   1268 select IsJ  -    0:02.07 postgres: logical replicati
770 43315  6145   0  21  0  251844   7520 select IsJ  -    1:07.74 postgres: admin postgres 19
770 43317  6145   0  25  0  251764   8684 select IsJ  -    1:28.89 postgres: admin bareos 192.
770 43596  6145   0  20  0  245620   4476 select IsJ  -    0:00.12 postgres: admin bareos 192.
770 43761  6145   0  20  0  245620   4476 select IsJ  -    0:00.15 postgres: admin bareos 192.
770 90206  6145   0  52  0 1331256 219720 racct  DsJ  -  563:45.41 postgres: bareos bareos 192

PID 90206 is continuously growing. It is the unspecific, all-purpose
worker for the www.bareos.com backup tool, so it is a bit difficult to
figure out what precisely it does - but it tends to issue rather simple,
straightforward queries, so it is unlikely to have dozens of "internal
sort operations and hash tables".

What I can say is that at times this worker is completely idle in
ClientRead, yet it does not shrink in memory. Is this normal behaviour?
Here is a more dynamic picture: it continues to add 2048 kB chunks (and
does not do noticeable paging):

UID   PID  PPID CPU PRI NI     VSZ    RSS MWCHAN STAT TT      TIME COMMAND
Mon May  4 13:33:09 CEST 2020
770 90206  6145   0  91  0 1335352 226900 -      RsJ  -  569:09.19 postgres: bareos bareos SELECT (postgres)
Mon May  4 13:33:39 CEST 2020
770 90206  6145   0  93  0 1335352 227696 -      RsJ  -  569:28.48 postgres: bareos bareos idle (postgres)
Mon May  4 13:34:09 CEST 2020
770 90206  6145   0  92  0 1337400 228116 -      RsJ  -  569:47.46 postgres: bareos bareos SELECT (postgres)
Mon May  4 13:34:39 CEST 2020
770 90206  6145   0  92  0 1337400 228596 -      RsJ  -  570:06.56 postgres: bareos bareos UPDATE (postgres)
Mon May  4 13:35:09 CEST 2020
770 90206  6145   0  92  0 1337400 228944 -      RsJ  -  570:25.62 postgres: bareos bareos SELECT (postgres)
Mon May  4 13:35:40 CEST 2020
770 90206  6145   0  52  0 1337400 229288 racct  DsJ  -  570:44.33 postgres: bareos bareos UPDATE (postgres)
Mon May  4 13:36:10 CEST 2020
770 90206  6145   0  91  0 1337400 229952 -      RsJ  -  571:03.20 postgres: bareos bareos SELECT (postgres)
Mon May  4 13:36:40 CEST 2020
770 90206  6145   0  52  0 1337400 223772 racct  DsJ  -  571:21.50 postgres: bareos bareos SELECT (postgres)
Mon May  4 13:37:10 CEST 2020
770 90206  6145   0  91  0 1337400 224448 -      RsJ  -  571:40.63 postgres: bareos bareos idle (postgres)
Mon May  4 13:37:40 CEST 2020
770 90206  6145   0  91  0 1339448 225464 -      RsJ  -  571:58.36 postgres: bareos bareos SELECT (postgres)
Mon May  4 13:38:10 CEST 2020
770 90206  6145   0  52  0 1339448 215620 select SsJ  -  572:14.24 postgres: bareos bareos idle (postgres)
Mon May  4 13:38:40 CEST 2020
770 90206  6145   0  81  0 1339448 215320 -      RsJ  -  572:21.09 postgres: bareos bareos idle (postgres)
Mon May  4 13:39:10 CEST 2020

OS is FreeBSD 11.3-RELEASE-p8 r360175M i386
PostgreSQL 12.2 on i386-portbld-freebsd11.3, compiled by gcc9 (FreeBSD Ports Collection) 9.3.0, 32-bit

autovacuum is Disabled.
The memory-specific config is:

> shared_buffers = 200MB
> temp_buffers = 40MB
> work_mem = 80MB
> maintenance_work_mem = 250MB
> dynamic_shared_memory_type = posix
> random_page_cost = 2.0
> effective_cache_size = 1GB

(others are left at default)

I remember vaguely that there are means to have a closer look at what is
using the memory, but I do not recall the specifics. Some pointers or
ideas on how to proceed would be greatly appreciated (DTrace should
work). Processes will usually fail with OOM at this size, due to the
machine configuration - I'm waiting for that now (it is a very, very old
Pentium 3 machine ;) ).

cheerio,
PMc
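For reference, a listing like the one above can be produced with a small sampling loop along these lines (a sketch; the PID and the 30-second interval are illustrative, and the exact `ps -o` keyword list differs slightly between FreeBSD and Linux):

```shell
# Sample a backend's virtual size and resident set every 30 seconds
# until it exits. PID 90206 is the suspect worker from the listing
# above (illustrative).
PID=90206
while ps -p "$PID" > /dev/null 2>&1; do
    date
    ps -o pid,vsz,rss,command -p "$PID"
    sleep 30
done
```

VSZ is the column to watch for a leak: it only grows when the process actually maps more memory, whereas RSS also moves up and down with paging.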
On Mon, May 4, 2020 at 5:43 PM Peter <pmc@citylink.dinoex.sub.org> wrote:
>
> Hi all,
> I have something that looks a bit insane:
>
> # ps axl | grep 6145
> UID   PID  PPID CPU PRI NI     VSZ    RSS MWCHAN STAT TT      TIME COMMAND
> 770  6145     1   0  20  0  241756    868 select SsJ  -    0:24.62 /usr/local/bin/postgres -D
> 770  6147  6145   0  23  0  243804 109784 select IsJ  -    3:18.52 postgres: checkpointer (
> 770  6148  6145   0  20  0  241756  21348 select SsJ  -    2:02.83 postgres: background writer
> 770  6149  6145   0  20  0  241756   7240 select SsJ  -   16:36.80 postgres: walwriter (pos
> 770  6150  6145   0  20  0   21980    876 select SsJ  -    0:13.92 postgres: archiver last w
> 770  6151  6145   0  20  0   21980    980 select SsJ  -    0:58.45 postgres: stats collector
> 770  6152  6145   0  20  0  241756   1268 select IsJ  -    0:02.07 postgres: logical replicati
> 770 43315  6145   0  21  0  251844   7520 select IsJ  -    1:07.74 postgres: admin postgres 19
> 770 43317  6145   0  25  0  251764   8684 select IsJ  -    1:28.89 postgres: admin bareos 192.
> 770 43596  6145   0  20  0  245620   4476 select IsJ  -    0:00.12 postgres: admin bareos 192.
> 770 43761  6145   0  20  0  245620   4476 select IsJ  -    0:00.15 postgres: admin bareos 192.
> 770 90206  6145   0  52  0 1331256 219720 racct  DsJ  -  563:45.41 postgres: bareos bareos 192
>
> The 90206 is continuously growing. It is the unspecific, all-purpose
> worker for the www.bareos.com backup tool, so it is a bit difficult to
> figure what precisely it does - but it tends to be rather simple
> straight-forward queries, so it is unlikely to have dozens of "internal sort
> operations and hash tables".
>
> What I can say that at times this worker is completely idle in
> ClientRead, but does not shrink in memory. Is this a normal behaviour?
>
> Here is a more dynamic picture: it continues to add 2048kB chunks (and
> does not do noticeable paging):
>
> UID   PID  PPID CPU PRI NI     VSZ    RSS MWCHAN STAT TT      TIME COMMAND
> Mon May  4 13:33:09 CEST 2020
> 770 90206  6145   0  91  0 1335352 226900 -      RsJ  -  569:09.19 postgres: bareos bareos SELECT (postgres)
> Mon May  4 13:33:39 CEST 2020
> 770 90206  6145   0  93  0 1335352 227696 -      RsJ  -  569:28.48 postgres: bareos bareos idle (postgres)
> Mon May  4 13:34:09 CEST 2020
> 770 90206  6145   0  92  0 1337400 228116 -      RsJ  -  569:47.46 postgres: bareos bareos SELECT (postgres)
> Mon May  4 13:34:39 CEST 2020
> 770 90206  6145   0  92  0 1337400 228596 -      RsJ  -  570:06.56 postgres: bareos bareos UPDATE (postgres)
> Mon May  4 13:35:09 CEST 2020
> 770 90206  6145   0  92  0 1337400 228944 -      RsJ  -  570:25.62 postgres: bareos bareos SELECT (postgres)
> Mon May  4 13:35:40 CEST 2020
> 770 90206  6145   0  52  0 1337400 229288 racct  DsJ  -  570:44.33 postgres: bareos bareos UPDATE (postgres)
> Mon May  4 13:36:10 CEST 2020
> 770 90206  6145   0  91  0 1337400 229952 -      RsJ  -  571:03.20 postgres: bareos bareos SELECT (postgres)
> Mon May  4 13:36:40 CEST 2020
> 770 90206  6145   0  52  0 1337400 223772 racct  DsJ  -  571:21.50 postgres: bareos bareos SELECT (postgres)
> Mon May  4 13:37:10 CEST 2020
> 770 90206  6145   0  91  0 1337400 224448 -      RsJ  -  571:40.63 postgres: bareos bareos idle (postgres)
> Mon May  4 13:37:40 CEST 2020
> 770 90206  6145   0  91  0 1339448 225464 -      RsJ  -  571:58.36 postgres: bareos bareos SELECT (postgres)
> Mon May  4 13:38:10 CEST 2020
> 770 90206  6145   0  52  0 1339448 215620 select SsJ  -  572:14.24 postgres: bareos bareos idle (postgres)
> Mon May  4 13:38:40 CEST 2020
> 770 90206  6145   0  81  0 1339448 215320 -      RsJ  -  572:21.09 postgres: bareos bareos idle (postgres)
> Mon May  4 13:39:10 CEST 2020
>
>
> OS is FreeBSD 11.3-RELEASE-p8 r360175M i386
> PostgreSQL 12.2 on i386-portbld-freebsd11.3, compiled by gcc9 (FreeBSD Ports Collection) 9.3.0, 32-bit
>
> autovacuum is Disabled.
>
> The memory-specific config is:
> > shared_buffers = 200MB
> > temp_buffers = 40MB
> > work_mem = 80MB
> > maintenance_work_mem = 250MB
> > dynamic_shared_memory_type = posix
> > random_page_cost = 2.0
> > effective_cache_size = 1GB
> (others are left at default)
>
> I remember vaguely that there are means to have a closer look into
> what is using the memory, but do not recall the specifics. Some
> pointers or ideas to proceed would be gladly appreciated (Dtrace
> should work) - processes will usually fail with OOM at this size, due
> to machine configuration - I'm waiting for that now (it is a very very
> old pentium3 machine ;) ).

One idea is that you can attach to your process with gdb and call
MemoryContextStats(TopMemoryContext). This will show which context is
using how much memory. So basically you can call this function two or
three times at some interval and see in which context the memory is
continuously increasing.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
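Spelled out, the suggested gdb session might look like this (the PID is the growing worker from the earlier listing; MemoryContextStats() prints its report to the backend's stderr, which normally ends up in the server log):

```
$ gdb -p 90206
(gdb) call MemoryContextStats(TopMemoryContext)
(gdb) detach
(gdb) quit
```

Repeating the call a few minutes apart and diffing the two reports shows which context keeps growing. One caveat: memory obtained by libraries through raw malloc() - as in the GSSAPI leak identified later in this thread - lives outside PostgreSQL's memory contexts and would likely not appear in this report at all.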
On 5/4/20 4:56 AM, Peter wrote:
> Hi all,
> I have something that looks a bit insane:
>
> # ps axl | grep 6145
> UID   PID  PPID CPU PRI NI     VSZ    RSS MWCHAN STAT TT      TIME COMMAND
> 770  6145     1   0  20  0  241756    868 select SsJ  -    0:24.62 /usr/local/bin/postgres -D
> 770  6147  6145   0  23  0  243804 109784 select IsJ  -    3:18.52 postgres: checkpointer (
> 770  6148  6145   0  20  0  241756  21348 select SsJ  -    2:02.83 postgres: background writer
> 770  6149  6145   0  20  0  241756   7240 select SsJ  -   16:36.80 postgres: walwriter (pos
> 770  6150  6145   0  20  0   21980    876 select SsJ  -    0:13.92 postgres: archiver last w
> 770  6151  6145   0  20  0   21980    980 select SsJ  -    0:58.45 postgres: stats collector
> 770  6152  6145   0  20  0  241756   1268 select IsJ  -    0:02.07 postgres: logical replicati
> 770 43315  6145   0  21  0  251844   7520 select IsJ  -    1:07.74 postgres: admin postgres 19
> 770 43317  6145   0  25  0  251764   8684 select IsJ  -    1:28.89 postgres: admin bareos 192.
> 770 43596  6145   0  20  0  245620   4476 select IsJ  -    0:00.12 postgres: admin bareos 192.
> 770 43761  6145   0  20  0  245620   4476 select IsJ  -    0:00.15 postgres: admin bareos 192.
> 770 90206  6145   0  52  0 1331256 219720 racct  DsJ  -  563:45.41 postgres: bareos bareos 192
>
> The 90206 is continuously growing. It is the unspecific, all-purpose
> worker for the www.bareos.com backup tool, so it is a bit difficult to
> figure what precisely it does - but it tends to be rather simple
> straight-forward queries, so it is unlikely to have dozens of "internal sort
> operations and hash tables".
>

Is there any relevant information in the bareos or Postgres logs?

> OS is FreeBSD 11.3-RELEASE-p8 r360175M i386
> PostgreSQL 12.2 on i386-portbld-freebsd11.3, compiled by gcc9 (FreeBSD Ports Collection) 9.3.0, 32-bit
>
> autovacuum is Disabled.

Any particular reason for the above?

> cheerio,
> PMc
>

--
Adrian Klaver
adrian.klaver@aklaver.com
On Mon, May 04, 2020 at 12:55:38PM -0700, Adrian Klaver wrote:
! > The 90206 is continuously growing. It is the unspecific, all-purpose
! > worker for the www.bareos.com backup tool, so it is a bit difficult to
! > figure what precisely it does - but it tends to be rather simple
! > straight-forward queries, so it is unlikely to have dozens of "internal sort
! > operations and hash tables".
! >
!
! Is there any relevant information in the bareos or Postgres logs?

No. Only the final exitus-in-tabula.

! > autovacuum is Disabled.
!
! Any particular reason for above?

Yes. Disk spindown doesn't work otherwise, and I see no reason to keep a
backup zone that is used only a few times a day spinning 24 hours.

BTW, I would greatly appreciate it if we would reconsider the need for
the server to read the postmaster.pid file every few seconds (probably
needed for something, I don't know).
That makes it necessary to set atime=off to get a spindown, and I
usually prefer to have atime=on so I can see what my stuff is currently
doing.

cheerio,
PMc
On Tue, May 5, 2020 at 10:13 AM Peter <pmc@citylink.dinoex.sub.org> wrote:
> BTW, I would greatly appreciate if we would reconsider the need for
> the server to read the postmaster.pid file every few seconds (probably
> needed for something, I don't know).
> That makes it necessary to set atime=off to get a spindown, and I
> usually prefer to have atime=on so I can see what my stuff is
> currently doing.

That'd be this:

        /*
         * Once a minute, verify that postmaster.pid hasn't been removed or
         * overwritten.  If it has, we force a shutdown.  This avoids having
         * postmasters and child processes hanging around after their database
         * is gone, and maybe causing problems if a new database cluster is
         * created in the same place.  It also provides some protection
         * against a DBA foolishly removing postmaster.pid and manually
         * starting a new postmaster.  Data corruption is likely to ensue from
         * that anyway, but we can minimize the damage by aborting ASAP.
         */
        if (now - last_lockfile_recheck_time >= 1 * SECS_PER_MINUTE)
        {
            if (!RecheckDataDirLockFile())
On Tue, May 05, 2020 at 10:57:04AM +1200, Thomas Munro wrote:
! On Tue, May 5, 2020 at 10:13 AM Peter <pmc@citylink.dinoex.sub.org> wrote:
! > BTW, I would greatly appreciate if we would reconsider the need for
! > the server to read the postmaster.pid file every few seconds (probably
! > needed for something, I don't know).
! > That makes it necessary to set atime=off to get a spindown, and I
! > usually prefer to have atime=on so I can see what my stuff is
! > currently doing.
!
! That'd be this:
!
!         /*
!          * Once a minute, verify that postmaster.pid hasn't been removed or
!          * overwritten.  If it has, we force a shutdown.  This avoids having
!          * postmasters and child processes hanging around after their database
!          * is gone, and maybe causing problems if a new database cluster is
!          * created in the same place.  It also provides some protection
!          * against a DBA foolishly removing postmaster.pid and manually
!          * starting a new postmaster.  Data corruption is likely to ensue from
!          * that anyway, but we can minimize the damage by aborting ASAP.
!          */
!         if (now - last_lockfile_recheck_time >= 1 * SECS_PER_MINUTE)
!         {
!             if (!RecheckDataDirLockFile())

Ah. Yes. Thanks for identifying this issue.

It is really hard to comment on this in a way that might not be
considered offensive, so let's put it this way: you have all seen the
Apollo 13 movie, so you know that a need to power down as much as
possible may appear en route, and in a situation where you have lots of
other issues, so the thing you need least is something like this getting
in your way.

I usually take this a bit further and consider only two relevant quality
levels for software: interplanetary and interstellar - where
interplanetary is the kind of stuff on which we will trust our lives
during the manned Jupiter mission. PG might well qualify for this,
except for things like the quoted; so I think such checks should be
guarded by appropriate compile-time switches.
Now concerning the memory leak: it was introduced with the work done on
GSSAPI encryption; it goes away when setting 'hostnogssenc' in pg_hba.

(And, whoever has editor rights: there is also a little fixup needed for
that switch in the pg_hba.conf documentation; it should probably say
"eleven formats".)

cheerio,
PMc
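For anyone wanting to apply the same workaround, a pg_hba.conf sketch (database, user, address range, and auth method here are illustrative, not from the thread):

```
# TYPE          DATABASE  USER    ADDRESS         METHOD
hostnogssenc    bareos    bareos  192.0.2.0/24    md5
```

A "hostnogssenc" record matches only TCP connections that do NOT use GSSAPI encryption; if no other record matches a GSS-encrypted attempt, the server rejects it, and libpq (with the default gssencmode=prefer) should retry without encryption, avoiding the leaking code path.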
Peter <pmc@citylink.dinoex.sub.org> writes:
> On Tue, May 05, 2020 at 10:57:04AM +1200, Thomas Munro wrote:
> ! * Once a minute, verify that postmaster.pid hasn't been removed or
> ! * overwritten.  If it has, we force a shutdown.

> It is really hard to comment on this in a way that might not be
> considered offensive, so lets put it that way: You all have seen
> the Apollo-13 movie, so You know that a need to power-down as much
> as possible may appear en-route, and in a situation where you have
> lots of other issues, so what you need the least is things like
> this getting in your way.

Well, the choice we face is preventing somebody's disk from spinning
down, versus preventing somebody else from completely corrupting their
database.  From where I sit that's not a difficult choice, nor one
that I feel a need to let users second-guess.

> Now concerning the memory leak:
> That one was introduced with the work done on the GSSAPI encryption;
> it goes away when setting 'hostnogssenc' in pg_hba.

Oooh ... it looks like some of the encryption code paths have neglected
to call gss_release_buffer.  Will fix, thanks for the report!

			regards, tom lane
On Tue, May 05, 2020 at 11:59:27AM -0400, Tom Lane wrote:
! Well, the choice we face is preventing somebody's disk from spinning
! down, versus preventing somebody else from completely corrupting their
! database.  From where I sit that's not a difficult choice, nor one
! that I feel a need to let users second-guess.

Then maybe you see a scenario where that feature would actually prevent
db corruption, while I have not yet found a realistic one.

Instead, what gave me headaches is that ZFS might take a single
tablespace (= pool) offline while the cluster continues to run - and I
am not sure whether the db is supposed to survive that (mine did, after
I had hit the power button in horror), nor is it easy to protect
against. Anyway, I can now switch that feature off with a local patch,
which is the second-best solution.

! Oooh ... it looks like some of the encryption code paths have neglected
! to call gss_release_buffer.  Will fix, thanks for the report!

Great! So I assume I don't need to send the bug report I had prepared in
the interim. Feel free to ask me for anything that might still be
needed.

cheerio,
PMc