Thread: 12.2: Howto check memory-leak in worker?
Hi all,
I have something that looks a bit insane:

# ps axl | grep 6145
UID   PID  PPID CPU PRI NI     VSZ    RSS MWCHAN STAT TT      TIME COMMAND
770  6145     1   0  20  0  241756    868 select SsJ  -    0:24.62 /usr/local/bin/postgres -D
770  6147  6145   0  23  0  243804 109784 select IsJ  -    3:18.52 postgres: checkpointer (
770  6148  6145   0  20  0  241756  21348 select SsJ  -    2:02.83 postgres: background writer
770  6149  6145   0  20  0  241756   7240 select SsJ  -   16:36.80 postgres: walwriter (pos
770  6150  6145   0  20  0   21980    876 select SsJ  -    0:13.92 postgres: archiver last w
770  6151  6145   0  20  0   21980    980 select SsJ  -    0:58.45 postgres: stats collector
770  6152  6145   0  20  0  241756   1268 select IsJ  -    0:02.07 postgres: logical replicati
770 43315  6145   0  21  0  251844   7520 select IsJ  -    1:07.74 postgres: admin postgres 19
770 43317  6145   0  25  0  251764   8684 select IsJ  -    1:28.89 postgres: admin bareos 192.
770 43596  6145   0  20  0  245620   4476 select IsJ  -    0:00.12 postgres: admin bareos 192.
770 43761  6145   0  20  0  245620   4476 select IsJ  -    0:00.15 postgres: admin bareos 192.
770 90206  6145   0  52  0 1331256 219720 racct  DsJ  -  563:45.41 postgres: bareos bareos 192

PID 90206 is continuously growing. It is the unspecific, all-purpose
worker for the www.bareos.com backup tool, so it is a bit difficult to
figure out what precisely it does - but it tends to issue rather simple,
straightforward queries, so it is unlikely to have dozens of "internal
sort operations and hash tables".

What I can say is that at times this worker is completely idle in
ClientRead, yet it does not shrink in memory. Is this normal behaviour?
Here is a more dynamic picture: it continues to add 2048 kB chunks (and
does not do noticeable paging):

UID   PID  PPID CPU PRI NI     VSZ    RSS MWCHAN STAT TT      TIME COMMAND
Mon May  4 13:33:09 CEST 2020
770 90206  6145   0  91  0 1335352 226900 -      RsJ  -  569:09.19 postgres: bareos bareos SELECT (postgres)
Mon May  4 13:33:39 CEST 2020
770 90206  6145   0  93  0 1335352 227696 -      RsJ  -  569:28.48 postgres: bareos bareos idle (postgres)
Mon May  4 13:34:09 CEST 2020
770 90206  6145   0  92  0 1337400 228116 -      RsJ  -  569:47.46 postgres: bareos bareos SELECT (postgres)
Mon May  4 13:34:39 CEST 2020
770 90206  6145   0  92  0 1337400 228596 -      RsJ  -  570:06.56 postgres: bareos bareos UPDATE (postgres)
Mon May  4 13:35:09 CEST 2020
770 90206  6145   0  92  0 1337400 228944 -      RsJ  -  570:25.62 postgres: bareos bareos SELECT (postgres)
Mon May  4 13:35:40 CEST 2020
770 90206  6145   0  52  0 1337400 229288 racct  DsJ  -  570:44.33 postgres: bareos bareos UPDATE (postgres)
Mon May  4 13:36:10 CEST 2020
770 90206  6145   0  91  0 1337400 229952 -      RsJ  -  571:03.20 postgres: bareos bareos SELECT (postgres)
Mon May  4 13:36:40 CEST 2020
770 90206  6145   0  52  0 1337400 223772 racct  DsJ  -  571:21.50 postgres: bareos bareos SELECT (postgres)
Mon May  4 13:37:10 CEST 2020
770 90206  6145   0  91  0 1337400 224448 -      RsJ  -  571:40.63 postgres: bareos bareos idle (postgres)
Mon May  4 13:37:40 CEST 2020
770 90206  6145   0  91  0 1339448 225464 -      RsJ  -  571:58.36 postgres: bareos bareos SELECT (postgres)
Mon May  4 13:38:10 CEST 2020
770 90206  6145   0  52  0 1339448 215620 select SsJ  -  572:14.24 postgres: bareos bareos idle (postgres)
Mon May  4 13:38:40 CEST 2020
770 90206  6145   0  81  0 1339448 215320 -      RsJ  -  572:21.09 postgres: bareos bareos idle (postgres)
Mon May  4 13:39:10 CEST 2020

OS is FreeBSD 11.3-RELEASE-p8 r360175M i386
PostgreSQL 12.2 on i386-portbld-freebsd11.3, compiled by gcc9 (FreeBSD Ports Collection) 9.3.0, 32-bit

autovacuum is Disabled.
The memory-specific config is:

> shared_buffers = 200MB
> temp_buffers = 40MB
> work_mem = 80MB
> maintenance_work_mem = 250MB
> dynamic_shared_memory_type = posix
> random_page_cost = 2.0
> effective_cache_size = 1GB

(others are left at default)

I remember vaguely that there are means to have a closer look at what is
using the memory, but I do not recall the specifics. Some pointers or
ideas on how to proceed would be greatly appreciated (DTrace should
work). Processes will usually fail with OOM at this size, due to the
machine configuration - I'm waiting for that now (it is a very, very old
Pentium 3 machine ;) ).

cheerio,
PMc
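For reference, a listing like the one above can be produced with a small sampling loop along these lines (a sketch; the PID and the 30-second interval are illustrative, and the exact `ps -o` keyword list differs slightly between FreeBSD and Linux):

```shell
# Sample a backend's virtual size and resident set every 30 seconds
# until it exits. PID 90206 is the suspect worker from the listing
# above (illustrative).
PID=90206
while ps -p "$PID" > /dev/null 2>&1; do
    date
    ps -o pid,vsz,rss,command -p "$PID"
    sleep 30
done
```

VSZ is the column to watch for a leak: it only grows when the process actually maps more memory, whereas RSS also moves up and down with paging.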
On Mon, May 4, 2020 at 5:43 PM Peter <pmc@citylink.dinoex.sub.org> wrote:
>
> Hi all,
> I have something that looks a bit insane:
>
> # ps axl | grep 6145
> UID   PID  PPID CPU PRI NI     VSZ    RSS MWCHAN STAT TT      TIME COMMAND
> 770  6145     1   0  20  0  241756    868 select SsJ  -    0:24.62 /usr/local/bin/postgres -D
> 770  6147  6145   0  23  0  243804 109784 select IsJ  -    3:18.52 postgres: checkpointer (
> 770  6148  6145   0  20  0  241756  21348 select SsJ  -    2:02.83 postgres: background writer
> 770  6149  6145   0  20  0  241756   7240 select SsJ  -   16:36.80 postgres: walwriter (pos
> 770  6150  6145   0  20  0   21980    876 select SsJ  -    0:13.92 postgres: archiver last w
> 770  6151  6145   0  20  0   21980    980 select SsJ  -    0:58.45 postgres: stats collector
> 770  6152  6145   0  20  0  241756   1268 select IsJ  -    0:02.07 postgres: logical replicati
> 770 43315  6145   0  21  0  251844   7520 select IsJ  -    1:07.74 postgres: admin postgres 19
> 770 43317  6145   0  25  0  251764   8684 select IsJ  -    1:28.89 postgres: admin bareos 192.
> 770 43596  6145   0  20  0  245620   4476 select IsJ  -    0:00.12 postgres: admin bareos 192.
> 770 43761  6145   0  20  0  245620   4476 select IsJ  -    0:00.15 postgres: admin bareos 192.
> 770 90206  6145   0  52  0 1331256 219720 racct  DsJ  -  563:45.41 postgres: bareos bareos 192
>
> The 90206 is continuously growing. It is the unspecific, all-purpose
> worker for the www.bareos.com backup tool, so it is a bit difficult to
> figure what precisely it does - but it tends to be rather simple
> straight-forward queries, so it is unlikely to have dozens of "internal sort
> operations and hash tables".
>
> What I can say that at times this worker is completely idle in
> ClientRead, but does not shrink in memory. Is this a normal behaviour?
>
> Here is a more dynamic picture: it continues to add 2048kB chunks (and
> does not do noticeable paging):
>
> UID   PID  PPID CPU PRI NI     VSZ    RSS MWCHAN STAT TT      TIME COMMAND
> Mon May  4 13:33:09 CEST 2020
> 770 90206  6145   0  91  0 1335352 226900 -      RsJ  -  569:09.19 postgres: bareos bareos SELECT (postgres)
> Mon May  4 13:33:39 CEST 2020
> 770 90206  6145   0  93  0 1335352 227696 -      RsJ  -  569:28.48 postgres: bareos bareos idle (postgres)
> Mon May  4 13:34:09 CEST 2020
> 770 90206  6145   0  92  0 1337400 228116 -      RsJ  -  569:47.46 postgres: bareos bareos SELECT (postgres)
> Mon May  4 13:34:39 CEST 2020
> 770 90206  6145   0  92  0 1337400 228596 -      RsJ  -  570:06.56 postgres: bareos bareos UPDATE (postgres)
> Mon May  4 13:35:09 CEST 2020
> 770 90206  6145   0  92  0 1337400 228944 -      RsJ  -  570:25.62 postgres: bareos bareos SELECT (postgres)
> Mon May  4 13:35:40 CEST 2020
> 770 90206  6145   0  52  0 1337400 229288 racct  DsJ  -  570:44.33 postgres: bareos bareos UPDATE (postgres)
> Mon May  4 13:36:10 CEST 2020
> 770 90206  6145   0  91  0 1337400 229952 -      RsJ  -  571:03.20 postgres: bareos bareos SELECT (postgres)
> Mon May  4 13:36:40 CEST 2020
> 770 90206  6145   0  52  0 1337400 223772 racct  DsJ  -  571:21.50 postgres: bareos bareos SELECT (postgres)
> Mon May  4 13:37:10 CEST 2020
> 770 90206  6145   0  91  0 1337400 224448 -      RsJ  -  571:40.63 postgres: bareos bareos idle (postgres)
> Mon May  4 13:37:40 CEST 2020
> 770 90206  6145   0  91  0 1339448 225464 -      RsJ  -  571:58.36 postgres: bareos bareos SELECT (postgres)
> Mon May  4 13:38:10 CEST 2020
> 770 90206  6145   0  52  0 1339448 215620 select SsJ  -  572:14.24 postgres: bareos bareos idle (postgres)
> Mon May  4 13:38:40 CEST 2020
> 770 90206  6145   0  81  0 1339448 215320 -      RsJ  -  572:21.09 postgres: bareos bareos idle (postgres)
> Mon May  4 13:39:10 CEST 2020
>
>
> OS is FreeBSD 11.3-RELEASE-p8 r360175M i386
> PostgreSQL 12.2 on i386-portbld-freebsd11.3, compiled by gcc9 (FreeBSD Ports Collection) 9.3.0, 32-bit
>
> autovacuum is Disabled.
>
> The memory-specific config is:
> > shared_buffers = 200MB
> > temp_buffers = 40MB
> > work_mem = 80MB
> > maintenance_work_mem = 250MB
> > dynamic_shared_memory_type = posix
> > random_page_cost = 2.0
> > effective_cache_size = 1GB
> (others are left at default)
>
> I remember vaguely that there are means to have a closer look into
> what is using the memory, but do not recall the specifics. Some
> pointers or ideas to proceed would be gladly appreciated (Dtrace
> should work) - processes will usually fail with OOM at this size, due
> to machine configuration - I'm waiting for that now (it is a very very
> old pentium3 machine ;) ).

One idea is that you can attach to your process with gdb and call
MemoryContextStats(TopMemoryContext). This will show which context is
using how much memory. So basically you can call this function two or
three times at some interval and see in which context the memory is
continuously increasing.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
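Spelled out, the suggested gdb session might look like this (the PID is the growing worker from the earlier listing; MemoryContextStats() prints its report to the backend's stderr, which normally ends up in the server log):

```
$ gdb -p 90206
(gdb) call MemoryContextStats(TopMemoryContext)
(gdb) detach
(gdb) quit
```

Repeating the call a few minutes apart and diffing the two reports shows which context keeps growing. One caveat: memory obtained by libraries through raw malloc() - as in the GSSAPI leak identified later in this thread - lives outside PostgreSQL's memory contexts and would likely not appear in this report at all.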
On 5/4/20 4:56 AM, Peter wrote:
> Hi all,
> I have something that looks a bit insane:
>
> # ps axl | grep 6145
> UID   PID  PPID CPU PRI NI     VSZ    RSS MWCHAN STAT TT      TIME COMMAND
> 770  6145     1   0  20  0  241756    868 select SsJ  -    0:24.62 /usr/local/bin/postgres -D
> 770  6147  6145   0  23  0  243804 109784 select IsJ  -    3:18.52 postgres: checkpointer (
> 770  6148  6145   0  20  0  241756  21348 select SsJ  -    2:02.83 postgres: background writer
> 770  6149  6145   0  20  0  241756   7240 select SsJ  -   16:36.80 postgres: walwriter (pos
> 770  6150  6145   0  20  0   21980    876 select SsJ  -    0:13.92 postgres: archiver last w
> 770  6151  6145   0  20  0   21980    980 select SsJ  -    0:58.45 postgres: stats collector
> 770  6152  6145   0  20  0  241756   1268 select IsJ  -    0:02.07 postgres: logical replicati
> 770 43315  6145   0  21  0  251844   7520 select IsJ  -    1:07.74 postgres: admin postgres 19
> 770 43317  6145   0  25  0  251764   8684 select IsJ  -    1:28.89 postgres: admin bareos 192.
> 770 43596  6145   0  20  0  245620   4476 select IsJ  -    0:00.12 postgres: admin bareos 192.
> 770 43761  6145   0  20  0  245620   4476 select IsJ  -    0:00.15 postgres: admin bareos 192.
> 770 90206  6145   0  52  0 1331256 219720 racct  DsJ  -  563:45.41 postgres: bareos bareos 192
>
> The 90206 is continuously growing. It is the unspecific, all-purpose
> worker for the www.bareos.com backup tool, so it is a bit difficult to
> figure what precisely it does - but it tends to be rather simple
> straight-forward queries, so it is unlikely to have dozens of "internal sort
> operations and hash tables".
>

Is there any relevant information in the bareos or Postgres logs?

> OS is FreeBSD 11.3-RELEASE-p8 r360175M i386
> PostgreSQL 12.2 on i386-portbld-freebsd11.3, compiled by gcc9 (FreeBSD Ports Collection) 9.3.0, 32-bit
>
> autovacuum is Disabled.

Any particular reason for the above?

> cheerio,
> PMc
>

--
Adrian Klaver
adrian.klaver@aklaver.com
On Mon, May 04, 2020 at 12:55:38PM -0700, Adrian Klaver wrote:
! > The 90206 is continuously growing. It is the unspecific, all-purpose
! > worker for the www.bareos.com backup tool, so it is a bit difficult to
! > figure what precisely it does - but it tends to be rather simple
! > straight-forward queries, so it is unlikely to have dozens of "internal sort
! > operations and hash tables".
! >
!
! Is there any relevant information in the bareos or Postgres logs?

No. Only the final exitus-in-tabula.

! > autovacuum is Disabled.
!
! Any particular reason for above?

Yes. Disk spindown doesn't work otherwise, and I see no reason to keep a
backup zone that is used only a few times a day spinning 24 hours.

BTW, I would greatly appreciate it if we would reconsider the need for
the server to read the postmaster.pid file every few seconds (probably
needed for something, I don't know).
That makes it necessary to set atime=off to get a spindown, and I
usually prefer to have atime=on so I can see what my stuff is currently
doing.

cheerio,
PMc
On Tue, May 5, 2020 at 10:13 AM Peter <pmc@citylink.dinoex.sub.org> wrote:
> BTW, I would greatly appreciate if we would reconsider the need for
> the server to read the postmaster.pid file every few seconds (probably
> needed for something, I don't know).
> That makes it necessary to set atime=off to get a spindown, and I
> usually prefer to have atime=on so I can see what my stuff is
> currently doing.

That'd be this:

        /*
         * Once a minute, verify that postmaster.pid hasn't been removed or
         * overwritten.  If it has, we force a shutdown.  This avoids having
         * postmasters and child processes hanging around after their database
         * is gone, and maybe causing problems if a new database cluster is
         * created in the same place.  It also provides some protection
         * against a DBA foolishly removing postmaster.pid and manually
         * starting a new postmaster.  Data corruption is likely to ensue from
         * that anyway, but we can minimize the damage by aborting ASAP.
         */
        if (now - last_lockfile_recheck_time >= 1 * SECS_PER_MINUTE)
        {
            if (!RecheckDataDirLockFile())
On Tue, May 05, 2020 at 10:57:04AM +1200, Thomas Munro wrote:
! On Tue, May 5, 2020 at 10:13 AM Peter <pmc@citylink.dinoex.sub.org> wrote:
! > BTW, I would greatly appreciate if we would reconsider the need for
! > the server to read the postmaster.pid file every few seconds (probably
! > needed for something, I don't know).
! > That makes it necessary to set atime=off to get a spindown, and I
! > usually prefer to have atime=on so I can see what my stuff is
! > currently doing.
!
! That'd be this:
!
!         /*
!          * Once a minute, verify that postmaster.pid hasn't been removed or
!          * overwritten.  If it has, we force a shutdown.  This avoids having
!          * postmasters and child processes hanging around after their database
!          * is gone, and maybe causing problems if a new database cluster is
!          * created in the same place.  It also provides some protection
!          * against a DBA foolishly removing postmaster.pid and manually
!          * starting a new postmaster.  Data corruption is likely to ensue from
!          * that anyway, but we can minimize the damage by aborting ASAP.
!          */
!         if (now - last_lockfile_recheck_time >= 1 * SECS_PER_MINUTE)
!         {
!             if (!RecheckDataDirLockFile())

Ah. Yes. Thanks for identifying this issue.

It is really hard to comment on this in a way that might not be
considered offensive, so let's put it this way: you have all seen the
Apollo 13 movie, so you know that a need to power down as much as
possible may appear en route, and in a situation where you have lots of
other issues, so the thing you need least is something like this getting
in your way.

I usually take this a bit further and consider only two relevant quality
levels for software: interplanetary and interstellar - where
interplanetary is the kind of stuff on which we will trust our lives
during the manned Jupiter mission. PG might well qualify for this,
except for things like the quoted; so I think such checks should be
guarded by appropriate compile-time switches.
Now concerning the memory leak: it was introduced with the work done on
GSSAPI encryption; it goes away when setting 'hostnogssenc' in pg_hba.

(And, whoever has editor rights: there is also a little fixup needed for
that switch in the pg_hba.conf documentation; it should probably say
"eleven formats".)

cheerio,
PMc
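For anyone wanting to apply the same workaround, a pg_hba.conf sketch (database, user, address range, and auth method here are illustrative, not from the thread):

```
# TYPE          DATABASE  USER    ADDRESS         METHOD
hostnogssenc    bareos    bareos  192.0.2.0/24    md5
```

A "hostnogssenc" record matches only TCP connections that do NOT use GSSAPI encryption; if no other record matches a GSS-encrypted attempt, the server rejects it, and libpq (with the default gssencmode=prefer) should retry without encryption, avoiding the leaking code path.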
Peter <pmc@citylink.dinoex.sub.org> writes:
> On Tue, May 05, 2020 at 10:57:04AM +1200, Thomas Munro wrote:
> ! * Once a minute, verify that postmaster.pid hasn't been removed or
> ! * overwritten.  If it has, we force a shutdown.

> It is really hard to comment on this in a way that might not be
> considered offensive, so lets put it that way: You all have seen
> the Apollo-13 movie, so You know that a need to power-down as much
> as possible may appear en-route, and in a situation where you have
> lots of other issues, so what you need the least is things like
> this getting in your way.

Well, the choice we face is preventing somebody's disk from spinning
down, versus preventing somebody else from completely corrupting their
database.  From where I sit that's not a difficult choice, nor one
that I feel a need to let users second-guess.

> Now concerning the memory leak:
> That one was introduced with the work done on the GSSAPI encryption;
> it goes away when setting 'hostnogssenc' in pg_hba.

Oooh ... it looks like some of the encryption code paths have neglected
to call gss_release_buffer.  Will fix, thanks for the report!

			regards, tom lane
On Tue, May 05, 2020 at 11:59:27AM -0400, Tom Lane wrote:
! Well, the choice we face is preventing somebody's disk from spinning
! down, versus preventing somebody else from completely corrupting their
! database.  From where I sit that's not a difficult choice, nor one
! that I feel a need to let users second-guess.

Then maybe you see a scenario where that feature would actually prevent
db corruption, while I have not yet found a realistic one.

Instead, what gave me headaches is that ZFS might take a single
tablespace (= pool) offline while the cluster continues to run - and I
am not sure whether the db is supposed to survive that (mine did, after
I had hit the power button in horror), nor is it easy to protect
against. Anyway, I can now switch that feature off with a local patch,
which is the second-best solution.

! Oooh ... it looks like some of the encryption code paths have neglected
! to call gss_release_buffer.  Will fix, thanks for the report!

Great! So I assume I don't need to send the bug report I had prepared in
the interim. Feel free to ask me for anything that might still be
needed.

cheerio,
PMc