Thread: Debugging leaking memory in PostgreSQL 13.2 / PostGIS 3.1
Hello,

I have a setup in which I constantly have 8 connections to the database open, which query data for map rendering. The queries involve only a handful of tables, but they can be complex and use PostGIS functions. The connections stay open, and only read requests happen; there are no updates, inserts, or deletes. I observe available memory going down at a rate of roughly 2 GB per hour.

I am running the database inside a Docker container, using the latest postgis/postgis image, which in turn uses the latest postgres image as its base:

PostgreSQL 13.2 (Debian 13.2-1.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
POSTGIS="3.1.1 aaf4c79" [EXTENSION] PGSQL="130" GEOS="3.7.1-CAPI-1.11.1 27a5e771" PROJ="Rel. 5.2.0, September 15th, 2018" LIBXML="2.9.4" LIBJSON="0.12.1" LIBPROTOBUF="1.3.1" WAGYU="0.5.0 (Internal)"

The RssAnon memory of the backends keeps increasing until the OOM killer kicks in. The wiki suggested to dump MemoryContext states for more details, but something strange happens when attaching gdb. It seems that the process is immediately killed and I can no longer dump such details. Is this related to running within Docker? I added the capabilities and seccomp settings needed for gdb:

cap_add:
  - SYS_PTRACE
security_opt:
  - seccomp:unconfined

With count(*) on pg_stat_activity I see 20 connections. Besides the 8 expected client backends it also lists some parallel workers and other postgres maintenance tasks. work_mem is set to 50MB and shared_buffers to 3500MB. Looking at /proc/meminfo, I think all of this shared memory is statically allocated as HugePages; I see no increase there. The only increase is in RssAnon.

Any ideas on how to debug deeper into this issue? Any idea why gdb is not working as expected? More logs are below. It seems that the backend terminates while gdb is attaching; sending a "cont" immediately does not change that. Any ideas what configuration parameter might help?

I had a similar setup working with older versions of PostgreSQL and PostGIS, and it did not show this constant loss of memory. I can work around the problem by periodically sending a SIGHUP to my clients, which then reconnect to PostgreSQL, and this frees the memory of the backends. So is something kept in memory longer than needed? Others seem to have similar issues using slightly older versions as well:
https://github.com/openstreetmap/mod_tile/issues/181#issuecomment-738303445

I would appreciate some ideas/pointers on how to track this down further.

Logs of OOM (before using hugepages, but this did not change the situation):

postgres invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0 CPU: 9 PID: 3322620 Comm: postgres Not tainted 5.4.0-67-generic #75-Ubuntu Hardware name: Gigabyte Technology Co., Ltd. B360 HD3P-LM/B360HD3PLM-CF, BIOS F7 HZ 07/24/2020 Call Trace: dump_stack+0x6d/0x8b dump_header+0x4f/0x1eb oom_kill_process.cold+0xb/0x10 out_of_memory.part.0+0x1df/0x3d0 out_of_memory+0x6d/0xd0 __alloc_pages_slowpath+0xd5e/0xe50 __alloc_pages_nodemask+0x2d0/0x320 alloc_pages_current+0x87/0xe0 __page_cache_alloc+0x72/0x90 pagecache_get_page+0xbf/0x300 filemap_fault+0x6b2/0xa50 ? unlock_page_memcg+0x12/0x20 ? page_add_file_rmap+0xff/0x1a0 ? xas_load+0xd/0x80 ? xas_find+0x17f/0x1c0 ? filemap_map_pages+0x24c/0x380 ext4_filemap_fault+0x32/0x50 __do_fault+0x3c/0x130 do_fault+0x24b/0x640 ?
__switch_to_asm+0x34/0x70 __handle_mm_fault+0x4c5/0x7a0 handle_mm_fault+0xca/0x200 do_user_addr_fault+0x1f9/0x450 __do_page_fault+0x58/0x90 do_page_fault+0x2c/0xe0 page_fault+0x34/0x40 RIP: 0033:0x558ddf0beded Code: Bad RIP value. RSP: 002b:00007ffe5214a020 EFLAGS: 00010202 RAX: 00007fea26b16b28 RBX: 0000000000000028 RCX: 00007fea26b16b68 RDX: 0000000000000028 RSI: 0000000000000000 RDI: 00007fea26b16b28 RBP: 0000000000000010 R08: 00007fea26b16b28 R09: 0000000000000019 R10: 0000000000000001 R11: 0000000000000001 R12: 00000000ffffffff R13: 00007fea26af76d8 R14: 00007fea26af7728 R15: 0000000000000000 Mem-Info: active_anon:29797121 inactive_anon:2721653 isolated_anon:32 active_file:323 inactive_file:83 isolated_file:0 unevictable:16 dirty:14 writeback:0 unstable:0 slab_reclaimable:85925 slab_unreclaimable:106003 mapped:1108591 shmem:14943591 pagetables:69567 bounce:0 free:148637 free_pcp:1619 free_cma:0 Node 0 active_anon:119188484kB inactive_anon:10886612kB active_file:1292kB inactive_file:332kB unevictable:64kB isolated(anon):128kB isolated(file):0kB mapped:4434364kB dirty:56kB writeback:0kB shmem:59774364kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no Node 0 DMA free:15904kB min:8kB low:20kB high:32kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB lowmem_reserve[]: 0 809 128670 128670 128670 Node 0 DMA32 free:511716kB min:424kB low:1252kB high:2080kB active_anon:328544kB inactive_anon:8540kB active_file:204kB inactive_file:0kB unevictable:0kB writepending:0kB present:947448kB managed:881912kB mlocked:0kB kernel_stack:0kB pagetables:416kB bounce:0kB free_pcp:1412kB local_pcp:288kB free_cma:0kB lowmem_reserve[]: 0 0 127860 127860 127860 Node 0 Normal free:66928kB min:67148kB low:198076kB high:329004kB active_anon:118859812kB inactive_anon:10877988kB active_file:1812kB inactive_file:1480kB unevictable:64kB writepending:56kB present:133160960kB managed:130937320kB mlocked:64kB kernel_stack:18336kB pagetables:277852kB bounce:0kB free_pcp:5064kB local_pcp:384kB free_cma:0kB lowmem_reserve[]: 0 0 0 0 0 Node 0 DMA: 0*4kB 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15904kB Node 0 DMA32: 1388*4kB (UME) 1274*8kB (UME) 1772*16kB (UE) 1564*32kB (UME) 1082*64kB (UE) 514*128kB (UME) 160*256kB (UE) 46*512kB (UME) 23*1024kB (UE) 11*2048kB (UME) 42*4096kB (UME) = 511808kB Node 0 Normal: 715*4kB (UEH) 108*8kB (UEH) 1400*16kB (UMEH) 1156*32kB (UMEH) 58*64kB (UMEH) 7*128kB (UMEH) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 67980kB Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB 14982418 total pagecache pages 38157 pages in swap cache Swap cache stats: add 14086008, delete 14047525, find 102734363/105748939 Free swap = 0kB Total swap = 4189180kB 33531099 pages RAM 0 pages HighMem/MovableOnly 572315 pages reserved 0 pages cma reserved 0 pages hwpoisoned oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=fae7f79d710f4449fd87c58f38eb164a470e3f837b33630c6c10a9fbca10a82b,mems_allowed=0,global_oom,task_memcg=/docker/e88b65d20ef39588c6bf9c00e7aa2946f134a61a6195c210f7081d7ed4d9a5fa,task=postgres,pid=2453393,uid=999 Out of memory: Killed process 2453393 (postgres) total-vm:10625568kB, 
anon-rss:6768188kB, file-rss:4kB, shmem-rss:3592772kB, UID:999 pgtables:20756kB oom_score_adj:0 oom_reaper: reaped process 2453393 (postgres), now anon-rss:0kB, file-rss:0kB, shmem-rss:3592772kB Logs of gdb attach # gdb -p 1296 -ex cont GNU gdb (Debian 8.2.1-2+b3) 8.2.1 Copyright (C) 2018 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-linux-gnu". Type "show configuration" for configuration details. For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>. Find the GDB manual and other documentation resources online at: <http://www.gnu.org/software/gdb/documentation/>. For help, type "help". Type "apropos word" to search for commands related to "word". Attaching to process 1296 Reading symbols from /usr/lib/postgresql/13/bin/postgres...Reading symbols from /usr/lib/debug/.build-id/31/ae2853776500091d313e76cf679017e697884b.debug...done. done. Reading symbols from /lib/x86_64-linux-gnu/libpthread.so.0...Reading symbols from /usr/lib/debug/.build-id/e9/1114987a0147bd050addbd591eb8994b29f4b3.debug...done. done. [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Reading symbols from /usr/lib/x86_64-linux-gnu/libxml2.so.2...(no debugging symbols found)...done. Reading symbols from /lib/x86_64-linux-gnu/libpam.so.0...(no debugging symbols found)...done. Reading symbols from /usr/lib/x86_64-linux-gnu/libssl.so.1.1...(no debugging symbols found)...done. Reading symbols from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1...(no debugging symbols found)...done. Reading symbols from /usr/lib/x86_64-linux-gnu/libgssapi_krb5.so.2...(no debugging symbols found)...done. Reading symbols from /lib/x86_64-linux-gnu/librt.so.1...Reading symbols from /usr/lib/debug/.build-id/5d/cf98ad684962be494af28a1051793fd39e4ebc.debug...done. done. Reading symbols from /lib/x86_64-linux-gnu/libdl.so.2...Reading symbols from /usr/lib/debug/.build-id/d3/583c742dd47aaa860c5ae0c0c5bdbcd2d54f61.debug...done. done. Reading symbols from /lib/x86_64-linux-gnu/libm.so.6...Reading symbols from /usr/lib/debug/.build-id/88/5dda4b4a5cea600e7b5b98c1ad86996c8d2299.debug...done. done. Reading symbols from /usr/lib/x86_64-linux-gnu/libldap_r-2.4.so.2...(no debugging symbols found)...done. Reading symbols from /usr/lib/x86_64-linux-gnu/libicui18n.so.63...(no debugging symbols found)...done. Reading symbols from /usr/lib/x86_64-linux-gnu/libicuuc.so.63...(no debugging symbols found)...done. Reading symbols from /lib/x86_64-linux-gnu/libsystemd.so.0...(no debugging symbols found)...done. Reading symbols from /lib/x86_64-linux-gnu/libc.so.6...Reading symbols from /usr/lib/debug/.build-id/18/b9a9a8c523e5cfe5b5d946d605d09242f09798.debug...done. done. Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols from /usr/lib/debug/.build-id/f2/5dfd7b95be4ba386fd71080accae8c0732b711.debug...done. done. Reading symbols from /usr/lib/x86_64-linux-gnu/libicudata.so.63...(no debugging symbols found)...done. Reading symbols from /lib/x86_64-linux-gnu/libz.so.1...(no debugging symbols found)...done. Reading symbols from /lib/x86_64-linux-gnu/liblzma.so.5...(no debugging symbols found)...done. Reading symbols from /lib/x86_64-linux-gnu/libaudit.so.1...(no debugging symbols found)...done. 
Reading symbols from /usr/lib/x86_64-linux-gnu/libkrb5.so.3...(no debugging symbols found)...done. Reading symbols from /usr/lib/x86_64-linux-gnu/libk5crypto.so.3...(no debugging symbols found)...done. Reading symbols from /lib/x86_64-linux-gnu/libcom_err.so.2...(no debugging symbols found)...done. Reading symbols from /usr/lib/x86_64-linux-gnu/libkrb5support.so.0...(no debugging symbols found)...done. Reading symbols from /lib/x86_64-linux-gnu/libkeyutils.so.1...(no debugging symbols found)...done. Reading symbols from /lib/x86_64-linux-gnu/libresolv.so.2...Reading symbols from /usr/lib/debug/.build-id/02/6c3ba167f64f631eb8781fca2269fbc2ee7ca5.debug...done. done. Reading symbols from /usr/lib/x86_64-linux-gnu/liblber-2.4.so.2...(no debugging symbols found)...done. Reading symbols from /usr/lib/x86_64-linux-gnu/libsasl2.so.2...(no debugging symbols found)...done. Reading symbols from /usr/lib/x86_64-linux-gnu/libgnutls.so.30...(no debugging symbols found)...done. Reading symbols from /usr/lib/x86_64-linux-gnu/libstdc++.so.6...(no debugging symbols found)...done. Reading symbols from /lib/x86_64-linux-gnu/libgcc_s.so.1...(no debugging symbols found)...done. Reading symbols from /usr/lib/x86_64-linux-gnu/liblz4.so.1...(no debugging symbols found)...done. Reading symbols from /lib/x86_64-linux-gnu/libgcrypt.so.20...(no debugging symbols found)...done. Reading symbols from /lib/x86_64-linux-gnu/libcap-ng.so.0...(no debugging symbols found)...done. Reading symbols from /usr/lib/x86_64-linux-gnu/libp11-kit.so.0...(no debugging symbols found)...done. Reading symbols from /usr/lib/x86_64-linux-gnu/libidn2.so.0...(no debugging symbols found)...done. Reading symbols from /usr/lib/x86_64-linux-gnu/libunistring.so.2...(no debugging symbols found)...done. Reading symbols from /usr/lib/x86_64-linux-gnu/libtasn1.so.6...(no debugging symbols found)...done. Reading symbols from /usr/lib/x86_64-linux-gnu/libnettle.so.6...(no debugging symbols found)...done. Reading symbols from /usr/lib/x86_64-linux-gnu/libhogweed.so.4...(no debugging symbols found)...done. Reading symbols from /usr/lib/x86_64-linux-gnu/libgmp.so.10...(no debugging symbols found)...done. Reading symbols from /lib/x86_64-linux-gnu/libgpg-error.so.0...(no debugging symbols found)...done. Reading symbols from /usr/lib/x86_64-linux-gnu/libffi.so.6...(no debugging symbols found)...done. Reading symbols from /lib/x86_64-linux-gnu/libnss_files.so.2...Reading symbols from /usr/lib/debug/.build-id/4b/ff8b782e1602c596e856bdef06e642e50e7fa7.debug...done. done. Reading symbols from /usr/lib/postgresql/13/lib/postgis-3.so...(no debugging symbols found)...done. Reading symbols from /usr/lib/x86_64-linux-gnu/libgeos_c.so.1...(no debugging symbols found)...done. Reading symbols from /usr/lib/x86_64-linux-gnu/libproj.so.13...(no debugging symbols found)...done. Reading symbols from /usr/lib/x86_64-linux-gnu/libjson-c.so.3...(no debugging symbols found)...done. Reading symbols from /usr/lib/x86_64-linux-gnu/libprotobuf-c.so.1...(no debugging symbols found)...done. Reading symbols from /usr/lib/x86_64-linux-gnu/libSFCGAL.so.1...(no debugging symbols found)...done. Reading symbols from /usr/lib/x86_64-linux-gnu/libgeos-3.7.1.so...(no debugging symbols found)...done. Reading symbols from /usr/lib/x86_64-linux-gnu/libCGAL_Core.so.13...(no debugging symbols found)...done. Reading symbols from /usr/lib/x86_64-linux-gnu/libboost_program_options.so.1.67.0...(no debugging symbols found)...done. 
Reading symbols from /usr/lib/x86_64-linux-gnu/libboost_chrono.so.1.67.0...(no debugging symbols found)...done. Reading symbols from /usr/lib/x86_64-linux-gnu/libboost_filesystem.so.1.67.0...(no debugging symbols found)...done. Reading symbols from /usr/lib/x86_64-linux-gnu/libboost_timer.so.1.67.0...(no debugging symbols found)...done. Reading symbols from /usr/lib/x86_64-linux-gnu/libboost_unit_test_framework.so.1.67.0...(no debugging symbols found)...done. Reading symbols from /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.67.0...(no debugging symbols found)...done. Reading symbols from /usr/lib/x86_64-linux-gnu/libboost_system.so.1.67.0...(no debugging symbols found)...done. Reading symbols from /usr/lib/x86_64-linux-gnu/libboost_serialization.so.1.67.0...(no debugging symbols found)...done. Reading symbols from /usr/lib/x86_64-linux-gnu/libboost_date_time.so.1.67.0...(no debugging symbols found)...done. Reading symbols from /usr/lib/x86_64-linux-gnu/libboost_atomic.so.1.67.0...(no debugging symbols found)...done. Reading symbols from /usr/lib/x86_64-linux-gnu/libCGAL.so.13...(no debugging symbols found)...done. Reading symbols from /usr/lib/x86_64-linux-gnu/libgmpxx.so.4...(no debugging symbols found)...done. Reading symbols from /usr/lib/x86_64-linux-gnu/libmpfr.so.6...(no debugging symbols found)...done. Reading symbols from /usr/lib/postgresql/13/lib/hstore.so...Reading symbols from /usr/lib/debug/.build-id/ae/d0b2a75765cea927097e4d89b4e2f39a5534be.debug...done. done. Reading symbols from /usr/lib/postgresql/13/lib/llvmjit.so...Reading symbols from /usr/lib/debug/.build-id/d3/f52dce996ffe1929204774abcd4fa75fbb317e.debug...done. done. Reading symbols from /usr/lib/x86_64-linux-gnu/libLLVM-7.so.1...(no debugging symbols found)...done. Reading symbols from /usr/lib/x86_64-linux-gnu/libedit.so.2...(no debugging symbols found)...done. Reading symbols from /lib/x86_64-linux-gnu/libtinfo.so.6...(no debugging symbols found)...done. Reading symbols from /usr/lib/x86_64-linux-gnu/libbsd.so.0...(no debugging symbols found)...done. 0x00007fe4bb39a7b7 in epoll_wait (epfd=4, events=0x55d1517d4990, maxevents=maxevents@entry=1, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 30 ../sysdeps/unix/sysv/linux/epoll_wait.c: No such file or directory. Continuing. Program received signal SIGQUIT, Quit. 0x00007fe4bb39a7b7 in epoll_wait (epfd=4, events=0x55d1517d4990, maxevents=maxevents@entry=1, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 30 in ../sysdeps/unix/sysv/linux/epoll_wait.c (gdb) p MemoryContextStats(TopMemoryContext) Couldn't get extended state status: No such process. 
(gdb) bt
#0  0x00007fe4bb39a7b7 in epoll_wait (epfd=4, events=0x55d1517d4990, maxevents=1, maxevents@entry=<error reading variable: Cannot access memory at address 0x7fff7515518c>, timeout=-1, timeout@entry=<error reading variable: Cannot access memory at address 0x7fff75155178>) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x000055d14fe9d089 in WaitEventSetWaitBlock (nevents=<error reading variable: Cannot access memory at address 0x7fff7515518c>, occurred_events=<error reading variable: Cannot access memory at address 0x7fff75155168>, cur_timeout=<error reading variable: Cannot access memory at address 0x7fff75155178>, set=0x55d1517d4918) at ./build/../src/backend/storage/ipc/latch.c:1295
#2  WaitEventSetWait (set=0x55d1517d4918, timeout=<error reading variable: Cannot access memory at address 0x7fff75155170>, occurred_events=<error reading variable: Cannot access memory at address 0x7fff75155168>, nevents=<error reading variable: Cannot access memory at address 0x7fff7515518c>, wait_event_info=<optimized out>) at ./build/../src/backend/storage/ipc/latch.c:1247
Backtrace stopped: Cannot access memory at address 0x7fff75155208
(gdb) cont
Continuing.
Program terminated with signal SIGKILL, Killed.
The program no longer exists.
(gdb) detach
The program is not being run.
(gdb) quit
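For tracking which backend is actually growing between OOM events, a minimal shell sketch along these lines can log RssAnon per backend over time. The pgrep pattern and the one-minute interval are only placeholders, and it assumes the backends are visible to the shell it runs in (on the host or inside the container):

    #!/bin/bash
    # Log RssAnon (the counter that keeps growing in /proc/<pid>/status) for
    # every postgres backend once per minute. Purely illustrative; adjust the
    # pattern and interval as needed.
    while true; do
        date
        for pid in $(pgrep -f 'postgres:'); do
            printf '%7d %s\n' "$pid" "$(awk '/^RssAnon/ {print $2, $3}' /proc/$pid/status)"
        done
        sleep 60
    done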
Stephan Knauss <pgsql@stephans-server.de> writes:
> The wiki suggested to dump MemoryContext states for more details, but
> something strange happens when attaching gdb. It seems that the process
> is immediately killed and I can no longer dump such details.

You might try running the postmaster under a restrictive ulimit (I think the -v option is the one that matters on Linux, not -d as you might guess). The idea here is that the backends would get an actual ENOMEM failure from malloc() before reaching the point where the kernel's OOM-kill behavior takes over. Given that, they'd dump memory maps to stderr of their own accord, and you could maybe get some insight as to what's leaking. This'd also reduce the severity of the problem when it does happen.

regards, tom lane
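In a Docker setup, one way to try this is a wrapper entrypoint that lowers the address-space limit before handing off to the stock image entrypoint. A rough sketch, assuming the official image's docker-entrypoint.sh and an arbitrary 6 GiB cap (not a tuned value); the limit is inherited by every backend the postmaster forks:

    #!/bin/bash
    # Hypothetical wrapper entrypoint: cap per-process virtual memory so that
    # malloc() fails with ENOMEM (and PostgreSQL dumps its memory context
    # statistics to stderr) before the kernel OOM killer gets involved.
    ulimit -v $((6 * 1024 * 1024))      # ulimit -v takes kilobytes; 6 GiB here
    exec docker-entrypoint.sh postgres "$@"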
On 30.03.2021 20:46, Tom Lane wrote:
> Stephan Knauss <pgsql@stephans-server.de> writes:
>> The wiki suggested to dump MemoryContext states for more details, but
>> something strange happens when attaching gdb. It seems that the process
>> is immediately killed and I can no longer dump such details.
> (I think the -v option is the one that matters on Linux, not -d
> as you might guess). The idea here is that the backends would
> get an actual ENOMEM failure from malloc() before reaching the
> point where the kernel's OOM-kill behavior takes over. Given
> that, they'd dump memory maps to stderr of their own accord,
> and you could maybe get some insight as to what's leaking.
> This'd also reduce the severity of the problem when it does
> happen.

Hello Tom,

the output below looks similar to the OOM output you expected. Can you give a hint how to interpret the results?

I had a backend which had already allocated a larger amount of memory, so I gave "gcore -a" a try. In contrast to the advertised behavior, the process did not continue to run, but I got a core file at least. This is probably related to gcore just calling gdb attach, which somehow triggers a SIGKILL of all backends. At 4.2 GB in size, the core file hopefully contains most of the relevant memory structures.

Without a running process I still cannot call MemoryContextStats(), but I found a macro which claims to decode the memory structure post mortem:
https://www.cybertec-postgresql.com/en/checking-per-memory-context-memory-consumption/

This gave me the memory structure below. How should it be interpreted? It looks like the sizes are in bytes, since the macro calculates with pointers, but the numbers look a bit small given that I had a backend with roughly 6 GB of RSS memory. I thought it might print the overall size of a context and then indent and print the memory of its children, but the numbers indicate this is not the case, since a higher level can be smaller than its children:

CachedPlanSource: 67840
unnamed prepared statement: 261920

So how should it be read, and is there any indication why I have a constantly increasing memory footprint? Is there any indication where multiple gigabytes are allocated?

root@0ec98d20bda2:/# gdb /usr/lib/postgresql/13/bin/postgres core.154218 <gdb-context
GNU gdb (Debian 8.2.1-2+b3) 8.2.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/lib/postgresql/13/bin/postgres...Reading symbols from /usr/lib/debug/.build-id/31/ae2853776500091d313e76cf679017e697884b.debug...done.
done.
warning: core file may not match specified executable file.
[New LWP 154218]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: osm gis 172.20.0.3(51894) idle'.
#0 0x00007fc01cfa07b7 in epoll_wait (epfd=4, events=0x55f403584080, maxevents=maxevents@entry=1, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 30 ../sysdeps/unix/sysv/linux/epoll_wait.c: No such file or directory. (gdb) >>>> > > >>>(gdb) (gdb) >>>> > > >>>>> > > >>(gdb) (gdb) TopMemoryContext: 109528 dynahash: 7968 HandleParallelMessages: 7968 dynahash: 7968 dynahash: 7968 dynahash: 7968 dynahash: 24392 dynahash: 24352 RowDescriptionContext: 24352 MessageContext: 7968 dynahash: 7968 dynahash: 32544 TransactionAbortContext: 32544 dynahash: 7968 TopPortalContext: 7968 dynahash: 16160 CacheMemoryContext: 1302944 CachedPlan: 138016 CachedPlanSource: 67840 unnamed prepared statement: 261920 index info: 1824 index info: 1824 index info: 3872 index info: 1824 index info: 1824 index info: 3872 index info: 3872 index info: 3872 index info: 1824 index info: 3872 relation rules: 32544 index info: 1824 index info: 1824 index info: 1824 index info: 3872 relation rules: 24352 index info: 3872 index info: 3872 index info: 1824 index info: 3872 index info: 3872 index info: 3872 index info: 1824 index info: 3872 index info: 1824 index info: 3872 relation rules: 32544 index info: 1824 index info: 2848 index info: 1824 index info: 3872 index info: 3872 index info: 3872 index info: 3872 index info: 3872 index info: 3872 index info: 3872 index info: 1824 index info: 3872 index info: 1824 index info: 1824 relation rules: 32544 index info: 1824 index info: 2848 index info: 1824 index info: 800 index info: 1824 index info: 800 index info: 800 index info: 2848 index info: 1824 index info: 800 index info: 800 index info: 800 index info: 2848 index info: 1824 index info: 1824 --Type <RET> for more, q to quit, c to continue without paging-- index info: 2848 index info: 1824 index info: 1824 index info: 800 index info: 1824 index info: 800 index info: 800 index info: 800 index info: 2848 index info: 2848 index info: 1824 index info: 1824 index info: 800 index info: 800 index info: 2848 index info: 800 index info: 1824 index info: 1824 index info: 800 index info: 1824 index info: 1824 index info: 1824 index info: 800 index info: 1824 index info: 1824 index info: 1824 index info: 800 index info: 2848 index info: 2848 index info: 2848 index info: 800 index info: 800 index info: 1824 index info: 1824 index info: 1824 index info: 800 index info: 1824 index info: 1824 index info: 2848 index info: 1824 index info: 1824 index info: 1824 index info: 1824 index info: 800 index info: 1824 index info: 2848 index info: 800 index info: 1824 index info: 800 index info: 1824 index info: 1824 index info: 800 index info: 1824 index info: 1824 index info: 1824 index info: 800 index info: 1824 index info: 2848 index info: 1824 index info: 1824 index info: 1824 index info: 1824 index info: 1824 index info: 1824 index info: 1824 WAL record construction: 49544 dynahash: 7968 MdSmgr: 7968 dynahash: 16160 dynahash: 103896 ErrorContext: 7968 (gdb) quit root@0ec98d20bda2:/# cat gdb-context define sum_context_blocks set $context = $arg0 set $block = ((AllocSet) $context)->blocks set $size = 0 while ($block) set $size = $size + (((AllocBlock) $block)->endptr - ((char *) $block)) set $block = ((AllocBlock) $block)->next end printf "%s: %d\n",((MemoryContext)$context)->name, $size end define walk_contexts set $parent_$arg0 = ($arg1) set $indent_$arg0 = ($arg0) set $i_$arg0 = $indent_$arg0 while ($i_$arg0) printf " " set $i_$arg0 = $i_$arg0 - 1 end sum_context_blocks $parent_$arg0 set $child_$arg0 = ((MemoryContext) 
$parent_$arg0)->firstchild set $indent_$arg0 = $indent_$arg0 + 1 while ($child_$arg0) walk_contexts $indent_$arg0 $child_$arg0 set $child_$arg0 = ((MemoryContext) $child_$arg0)->nextchild end end walk_contexts 0 TopMemoryContext
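For reference, instead of redirecting the macro file into an interactive gdb session as above, it can also be sourced non-interactively. A small sketch, assuming the file shown above (which ends with the "walk_contexts 0 TopMemoryContext" invocation) is saved as ./gdb-context next to the core file:

    # Walk the memory contexts of the core dump without an interactive session.
    gdb -batch -x gdb-context /usr/lib/postgresql/13/bin/postgres core.154218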
Stephan Knauss <pgsql@stephans-server.de> writes:
> Hello Tom, the output below looks similar to the OOM output you
> expected. Can you give a hint how to interpret the results?

Looks like the answer is that wherever the leak is, it's not accounted for by this info; none of those contexts are particularly large.

Based on nearby threads, it occurs to me to ask whether you have JIT enabled, and if so whether turning it off helps. There seems to be a known leak of the code fragments generated by that in some cases.

If that's not it, then the leak must be accumulating through plain old malloc calls. There's not much of that in the core backend (although if you use ispell text search dictionaries, maybe [1] is relevant), so my suspicions would next fall on any extensions you might be using.

regards, tom lane

[1] https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=eba939551
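A quick way to act on that suggestion is to check the current setting and then disable JIT cluster-wide for a test; a sketch (the database name gis is taken from the session shown earlier in the thread):

    # Check the current setting:
    psql -d gis -c "SHOW jit;"

    # Turn it off cluster-wide for a test (needs superuser); running backends
    # pick up the new default on reload unless they have SET jit themselves:
    psql -d gis -c "ALTER SYSTEM SET jit = off;"
    psql -d gis -c "SELECT pg_reload_conf();"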
> On Mar 31, 2021, at 11:24 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Stephan Knauss <pgsql@stephans-server.de> writes:
>> Hello Tom, the output below looks similar to the OOM output you
>> expected. Can you give a hint how to interpret the results?
>
> Looks like the answer is that wherever the leak is, it's not accounted
> for by this info; none of those contexts are particularly large.
>
> Based on nearby threads, it occurs to me to ask whether you have JIT
> enabled, and if so whether turning it off helps. There seems to be
> a known leak of the code fragments generated by that in some cases.
>
> If that's not it, then the leak must be accumulating through plain
> old malloc calls. There's not much of that in the core backend
> (although if you use ispell text search dictionaries, maybe [1] is
> relevant), so my suspicions would next fall on any extensions you
> might be using.

I would be interested in the queries being run. We have a reproducible leak in the <-> geography operator that we have been unable to track down.

P

> regards, tom lane
>
> [1] https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=eba939551
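For readers unfamiliar with it, <-> on geography is the KNN distance operator; a purely hypothetical query shape exercising it (the table and column names are invented, not from this thread) would look like:

    # Hypothetical shape of a geography KNN query, only to illustrate the
    # <-> operator Paul refers to; "places" and "geog" are made-up names.
    psql -d gis -c "
      SELECT name
      FROM places
      ORDER BY geog <-> 'SRID=4326;POINT(13.4 52.5)'::geography
      LIMIT 10;"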
On 31.03.2021 20:27, Paul Ramsey wrote: > On Mar 31, 2021, at 11:24 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> If that's not it, then the leak must be accumulating through plain >> old malloc calls. There's not much of that in the core backend >> (although if you use ispell text search dictionaries, maybe [1] is >> relevant), so my suspicions would next fall on any extensions you >> might be using. > Would be interested in the queries being run. We have a reproduceable leak in <-> geography operator that we have beenunable to track down. > I tried to get any idea where the memory is by feeding my core file into chap. I ended up not knowing how to proceed further. I found a std::vector which was roughly 70 MB in size and a lot of allocations chap could not identify where they belonged to: chap> summarize used Unrecognized allocations have 23098683 instances taking 0xe6973688(3,868,669,576) bytes. Unrecognized allocations of size 0x68 have 8214926 instances taking 0x32ec61b0(854,352,304) bytes. Unrecognized allocations of size 0x28 have 4890673 instances taking 0xba907a8(195,626,920) bytes. Unrecognized allocations of size 0x38 have 3255995 instances taking 0xade38e8(182,335,720) bytes. Unrecognized allocations of size 0x48 have 2096221 instances taking 0x8fefa28(150,927,912) bytes. Unrecognized allocations of size 0x78 have 1792042 instances taking 0xcd153b0(215,045,040) bytes. Unrecognized allocations of size 0x88 have 1393038 instances taking 0xb4ad370(189,453,168) bytes. Unrecognized allocations of size 0x58 have 657193 instances taking 0x3727618(57,832,984) bytes. Unrecognized allocations of size 0xa8 have 360742 instances taking 0x39cc0f0(60,604,656) bytes. Unrecognized allocations of size 0x18 have 270834 instances taking 0x632eb0(6,500,016) bytes. Unrecognized allocations of size 0x98 have 60372 instances taking 0x8c05e0(9,176,544) bytes. Unrecognized allocations of size 0xb8 have 37559 instances taking 0x697388(6,910,856) bytes. Unrecognized allocations of size 0xd8 have 19033 instances taking 0x3ebb18(4,111,128) bytes. Unrecognized allocations of size 0x128 have 19010 instances taking 0x55dc50(5,626,960) bytes. Unrecognized allocations of size 0xe8 have 9526 instances taking 0x21b8f0(2,210,032) bytes. Unrecognized allocations of size 0x118 have 9513 instances taking 0x28a4d8(2,663,640) bytes. Unrecognized allocations of size 0x1f8 have 9499 instances taking 0x490d28(4,787,496) bytes. Unrecognized allocations of size 0x408 have 292 instances taking 0x49920(301,344) bytes. Unrecognized allocations of size 0x2008 have 194 instances taking 0x184610(1,590,800) bytes. Unrecognized allocations of size 0x1008 have 155 instances taking 0x9b4d8(636,120) bytes. Unrecognized allocations of size 0x4008 have 137 instances taking 0x224448(2,245,704) bytes. Unrecognized allocations of size 0x8008 have 133 instances taking 0x428428(4,359,208) bytes. Unrecognized allocations of size 0x10008 have 131 instances taking 0x830418(8,586,264) bytes. Unrecognized allocations of size 0x40008 have 131 instances taking 0x20c0418(34,341,912) bytes. Unrecognized allocations of size 0x100008 have 130 instances taking 0x8200410(136,315,920) bytes. Unrecognized allocations of size 0x20008 have 129 instances taking 0x1020408(16,909,320) bytes. Unrecognized allocations of size 0x80008 have 129 instances taking 0x4080408(67,634,184) bytes. Unrecognized allocations of size 0x200008 have 129 instances taking 0x10200408(270,533,640) bytes. 
Unrecognized allocations of size 0x400008 have 128 instances taking 0x20000400(536,871,936) bytes.
Unrecognized allocations of size 0x418 have 53 instances taking 0xd8f8(55,544) bytes.
[...]
Pattern %VectorBody has 57401 instances taking 0x42eaaf0(70,167,280) bytes.
[...]
Pattern %MapOrSetNode has 51362 instances taking 0x3f2f40(4,140,864) bytes.

Let me know if it would be useful for you to have this (huge) core file with the memory dump available somewhere.

I have executed a few minutes of queries against a single-backend setup. RssAnon grew by about 10 MB during this time. Full query logs are here:
https://downloads.osm-tools.org/postgresql-2021-04-03_183913.csv.gz

I have the following extensions installed. hstore is heavily used, and postgis a little bit:

gis=# SELECT extname FROM pg_extension;
  extname
-----------
 plpgsql
 adminpack
 postgis
 hstore
(4 rows)

As per other suggestions I enabled auto_explain as well. I hope you can get a hint about what is allocating the memory. I could offer to share some Docker images to reproduce this, but it would require a few commands to set up and a few hundred megabytes of download.

Stephan
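Since auto_explain was mentioned: a minimal way to capture plans for a single test session looks roughly like this (the threshold and the stand-in query are placeholders, not the settings actually used in this setup):

    # Load auto_explain for one session and log every plan to the server log;
    # the final SELECT is only a stand-in for a real rendering query.
    psql -d gis -c "LOAD 'auto_explain';
                    SET auto_explain.log_min_duration = 0;
                    SET auto_explain.log_analyze = on;
                    SELECT count(*) FROM pg_class;"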
Hello Tom,

On 31.03.2021 20:24, Tom Lane wrote:
> Based on nearby threads, it occurs to me to ask whether you have JIT
> enabled, and if so whether turning it off helps. There seems to be
> a known leak of the code fragments generated by that in some cases.

That's it! I am quite surprised that functionality which is on by default generates such a massive leak and goes more or less undetected. A single backend was leaking 250 MB/hour; with my multiple connections it was 2 GB. But exactly that happened: doing a "set jit=off" immediately stopped the leak.

You mentioned that this seems to be known. Do you have pointers to the relevant bug-tracker/thread? I would like to follow up on this. I have not measured the impact of JIT, but in theory it could bring larger performance benefits, so having it enabled sounds like a good idea once it stops leaking.

I tried running Valgrind on postgres, but I had not much success with it; processes seemed to terminate quite frequently. My last use of Valgrind was a while ago, and my use case back then was probably much simpler.

Is it known which queries lead to the leak? I still have my recording, including explain output. Would it help to narrow it further down to single queries which leak? Or does the JIT re-create optimized code for each slightly modified query without freeing the old code, so that re-running the same query would not leak?

https://downloads.osm-tools.org/postgresql-2021-04-03_183913.csv.gz

Stephan
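One way to narrow this down to single queries, as asked above, is to check which of the recorded statements actually trigger JIT: with jit on, EXPLAIN (ANALYZE) prints a "JIT:" section whenever code was generated. A sketch with a stand-in query (forcing JIT via jit_above_cost only for the test):

    # Look for the "JIT:" section in the EXPLAIN output; the query here is
    # only a stand-in for one of the recorded rendering queries.
    psql -d gis -c "SET jit = on; SET jit_above_cost = 0;
                    EXPLAIN (ANALYZE) SELECT count(*) FROM pg_class;"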
Stephan Knauss <pgsql@stephans-server.de> writes:
> On 31.03.2021 20:24, Tom Lane wrote:
>> Based on nearby threads, it occurs to me to ask whether you have JIT
>> enabled, and if so whether turning it off helps. There seems to be
>> a known leak of the code fragments generated by that in some cases.
> That's it!
> You mentioned that this seems to be known. Do you have pointers to the
> relevant bug-tracker/thread? I would like to follow up on this.

According to the v14 open issues page [1], there are a couple of older threads. I just added this one.

regards, tom lane

[1] https://wiki.postgresql.org/wiki/PostgreSQL_14_Open_Items