Thread: terminate called after throwing an instance of 'std::bad_alloc'
A VM crashed which is now running PG13.0 on centos7:

Sep 30 19:40:08 database7 abrt-hook-ccpp: Process 17905 (postgres) of user 26 killed by SIGABRT - dumping core
Core was generated by `postgres: telsasoft ts 192.168.122.11(34608) SELECT '.

Unfortunately, the filesystem wasn't large enough and the corefile is
truncated.

The first badness in our logfiles looks like this; this is the very head of
the logfile:

|[pryzbyj@database7 ~]$ sudo gzip -dc /var/log/postgresql/crash-postgresql-2020-09-30_194000.log.gz |head
|[sudo] password for pryzbyj:
|terminate called after throwing an instance of 'std::bad_alloc'
|  what():  std::bad_alloc
|< 2020-09-30 19:40:09.653 ADT >LOG:  checkpoint starting: time
|< 2020-09-30 19:40:17.002 ADT >LOG:  checkpoint complete: wrote 74 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=7.331 s, sync=0.006 s, total=7.348 s; sync files=25, longest=0.002 s, average=0.000 s; distance=295 kB, estimate=4183 kB
|< 2020-09-30 19:40:22.642 ADT >LOG:  server process (PID 17905) was terminated by signal 6: Aborted
|< 2020-09-30 19:40:22.642 ADT >DETAIL:  Failed process was running: --BEGIN SQL
|	SELECT * FROM

I was able to grep the filesystem to find what looks like the preceding
logfile (which our script had immediately rotated in its attempt to be
helpful).

|< 2020-09-30 19:39:09.096 ADT >LOG:  checkpoint starting: time
|< 2020-09-30 19:39:12.640 ADT >LOG:  checkpoint complete: wrote 35 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=3.523 s, sync=0.006 s, total=3.544 s; sync files=14, longest=0.002 s, average=0.000 s; distance=103 kB, estimate=4615 kB

This seemed familiar, and I found it had happened 18 months ago on a different
server:

|terminate called after throwing an instance of 'std::bad_alloc'
|  what():  std::bad_alloc
|< 2019-02-01 15:36:54.434 CST >LOG:  server process (PID 13557) was terminated by signal 6: Aborted
|< 2019-02-01 15:36:54.434 CST >DETAIL:  Failed process was running:
|	SELECT *, '' as n
|	FROM a, b WHERE
|	a.id = b.id AND
|	b.c IS NOT NULL
|	ORDER BY b.c LIMIT 9

I have a log of pg_settings, which shows that this server ran v11.1 until
2019-02-14, when it was upgraded to v11.2.  Since v11.2 included a fix for a
bug I reported involving JIT and wide tables, I probably saw this crash and
dismissed it, even though the tables named "a" and "b" here have only ~30
columns combined.

The query that crashed in 2019 is actually processing a small queue, and runs
every 5sec.  The query that crashed today is a "day"-level query which runs
every 15min, so it ran 70-some times today with no issue.

Our DBs use postgis, and today's crash JOINs to the table with geometry
columns, but does not use them at all.  But the 2019 query doesn't even
include the geometry table.  I'm not sure these are even the same crash, but
if they are, I think it's maybe a JIT issue and not postgis (??)

I've had JIT disabled since 2019, due to no performance benefit for us, but
I've been re-enabling it during upgrades and transitions, and instead
disabling jit_tuple_deforming (since this performs badly for columns with
high attnum).  So maybe this will recur before too long.
Justin Pryzby <pryzby@telsasoft.com> writes:
> A VM crashed which is now running PG13.0 on centos7:
> Sep 30 19:40:08 database7 abrt-hook-ccpp: Process 17905 (postgres) of user 26 killed by SIGABRT - dumping core
> Core was generated by `postgres: telsasoft ts 192.168.122.11(34608) SELECT '.
> Unfortunately, the filesystem wasn't large enough and the corefile is
> truncated.

Hm, can you fix that?  A stack trace would be mighty handy here.

			regards, tom lane
Hi,

On 2020-09-30 21:16:10 -0500, Justin Pryzby wrote:
> Sep 30 19:40:08 database7 abrt-hook-ccpp: Process 17905 (postgres) of user 26 killed by SIGABRT - dumping core
> Core was generated by `postgres: telsasoft ts 192.168.122.11(34608) SELECT '.
>
> Unfortunately, the filesystem wasn't large enough and the corefile is
> truncated.

Too bad.

> |[pryzbyj@database7 ~]$ sudo gzip -dc /var/log/postgresql/crash-postgresql-2020-09-30_194000.log.gz |head
> |[sudo] password for pryzbyj:
> |terminate called after throwing an instance of 'std::bad_alloc'
> |  what():  std::bad_alloc

I think this shouldn't happen while we're inside LLVM.  To detect / handle
allocation failures in the middle of LLVM code in a halfway sensible way, we
install an allocation error handler just while in LLVM code
(c.f. llvm_enter_fatal_on_oom/llvm_leave_fatal_on_oom).

It's possible of course that the JIT somehow corrupted data that is passed
somewhere to postgis, which then caused postgis to run out of memory.  But it
seems like that'd need a fair bit of bad luck to not cause another error
before that.

> But the 2019 query doesn't even include the geometry table.  I'm not sure
> these are even the same crash, but if they are, I think it's maybe a JIT
> issue and not postgis (??)

It'd be good to get a backtrace - but this doesn't immediately scream out loud
"I am a JIT bug" to me.

Greetings,

Andres Freund
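[The enter/leave pattern referenced above looks roughly like the sketch
below.  This is illustrative only: the real implementation is C++
(src/backend/jit/llvm/llvmjit_error.cpp) and additionally hooks
std::set_new_handler and LLVM's bad_alloc handler; here the LLVM C API's
fatal-error hook stands in for both, and the handler body is an assumption.]

/*
 * Minimal sketch of the enter/leave OOM-handler pattern.  Names mirror
 * the real functions; bodies are illustrative.
 */
#include "postgres.h"
#include <llvm-c/ErrorHandling.h>

static int	fatal_on_oom_depth = 0;

static void
fatal_llvm_error_handler(const char *reason)
{
	/* turn an LLVM-internal failure into a normal backend FATAL */
	ereport(FATAL,
			(errcode(ERRCODE_OUT_OF_MEMORY),
			 errmsg("fatal llvm error: %s", reason)));
}

void
llvm_enter_fatal_on_oom(void)
{
	/* the handler is process-global, so only install it at depth 0 */
	if (fatal_on_oom_depth++ == 0)
		LLVMInstallFatalErrorHandler(fatal_llvm_error_handler);
}

void
llvm_leave_fatal_on_oom(void)
{
	if (--fatal_on_oom_depth == 0)
		LLVMResetFatalErrorHandler();
}

Code that allocates while no such handler is active gets the default C++
behaviour instead: an uncaught std::bad_alloc, terminate(), and SIGABRT -
which matches the log excerpt above.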
On Wed, Sep 30, 2020 at 09:16:09PM -0500, Justin Pryzby wrote:
> Our DBs use postgis, and today's crash JOINs to the table with geometry
> columns, but does not use them at all.

I'm wrong here - it does return the geometry columns.

We didn't use JIT with v12, but I've been setting jit=on along with v13
upgrades, so I was trying to convince myself that there's a change in v13,
and not just our config..

I've seen repeated OOMs on a small number of reports.  It looks like JIT
optimization/inlining used a good deal of RAM in v12, and even more on v13.
Maybe that's what's expected; it looks like this scales with the number of
partitions and number of functions, so it could be arbitrarily large, and not
respectful of work_mem (below, at the 4MB default).

CREATE TABLE t(a int) PARTITION BY RANGE(a);
\pset pager off
SELECT format('CREATE TABLE t%s PARTITION OF t FOR VALUES FROM (%s) TO (%s) WITH (parallel_workers=2)', i, 100*i, 100*i+100) FROM generate_series(0,666)i;
\gexec
INSERT INTO t SELECT i%10000 FROM generate_series(1,99999)i;
CREATE INDEX ON t(left(a::text,1));
VACUUM ANALYZE t;

\o /dev/null \\
SET client_min_messages=debug; SET log_executor_stats=on;
SET jit=on; SET jit_expressions=on; SET jit_above_cost=0; SET jit_optimize_above_cost=1; SET jit_inline_above_cost=1;
explain (analyze) SELECT * FROM t WHERE
  substr(a::text,length(a::text)-6,6)='1' OR substr(a::text,length(a::text)-6,6)='2' OR
  substr(a::text,length(a::text)-6,6)='3' OR substr(a::text,length(a::text)-6,6)='4' OR
  substr(a::text,length(a::text)-6,6)='5' OR substr(a::text,length(a::text)-6,6)='6';

In v13:
With optimize_above=-1, inline_above=-1: 333472 kB max resident size
With optimize_above=1, inline_above=1: 654036 kB max resident size

In v12:
With optimize_above=-1, inline_above=-1: 330988 kB max resident size
With optimize_above=1, inline_above=1: 556844 kB max resident size

These tests are on the same centos7 server, which has:
llvm5.0-libs-5.0.1-7.el7.x86_64
This appears to be similar under centos8 with LLVM9.

The OOM itself could be due to our JIT settings and maybe not a bug.  My
thinking is that when we hit OOM, that sometimes manifests itself as the C++
exception, which somehow trickles up into a SIGABRT in postgres.

I don't know if there's a way to capture useful debugging info on OOM, but
the live processes do seem to be leaking memory, or otherwise not reusing it.

For example, look at this huge backend, which is connected to the process
which has been experiencing OOM.  The backend is 4.2GB while idle, so I'd
have hoped for it to have released everything it could.  I'm aware that
free() usually cannot return small allocations to the kernel, but that should
not lead to continued growth over time.
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
18000 postgres  20   0 4975152   4.2g   4684 S   0.0 55.6  81:29.87 postgres: main main 10.124.252.14(58296) idle

main=# SELECT * FROM pg_stat_activity WHERE pid=18000;
datid            | 151510
datname          | main
pid              | 18000
leader_pid       |
usesysid         | 16714
usename          | main
application_name | run_thresholds --daemon
client_addr      | 10.124.252.14
client_hostname  |
client_port      | 58296
backend_start    | 2020-10-02 22:26:16.676936-06
xact_start       |
query_start      | 2020-10-03 12:37:37.639971-06
state_change     | 2020-10-03 12:37:37.640183-06
wait_event_type  | Client
wait_event       | ClientRead
state            | idle
backend_xid      |
backend_xmin     |
query            | ROLLBACK
backend_type     | client backend

This is mostly just in "heap":

$ sudo cat /proc/18000/maps |awk -Wnon-decimal -F '[- ]' '{a="0x"$1; b="0x"$2; print (b-a)/1024**2,$0}' |sort -gr |head
3428.65 010d6000-d757c000 rw-p 00000000 00:00 0                          [heap]
512.004 7f3ec96e8000-7f3ee96e9000 rw-p 00000000 00:00 0
384.004 7f3e9e6e1000-7f3eb66e2000 rw-p 00000000 00:00 0
148 2aaaaac00000-2aaab4000000 rw-s 00000000 00:0e 44184914                /anon_hugepage (deleted)

The postgres memory contexts don't show anything nearly that large:

MessageContext: 25236472 total in 20 blocks; 2990656 free (4 chunks); 22245816 used
CacheMemoryContext: 25892160 total in 42 blocks; 22770984 free (28285 chunks); 3121176 used
Grand total: 70879896 bytes in 1466 blocks; 29518440 free (29097 chunks); 41361456 used

I was able to make a small, apparent leak like so.  This applies to
postgis3.0/postgres12/LLVM5/centos7, and
postgis3.1alpha/postgres13/LLVM9/centos8.

Inlining is essential to see the leak, optimization is not.  Showing every
9th query:

$ for a in `seq 1 99`; do echo "SET jit_inline_above_cost=1; SET jit_optimize_above_cost=-1; SET jit_above_cost=0; SET jit_expressions=on; SET log_executor_stats=on; SET client_min_messages=debug; SET jit=on; explain analyze SELECT st_x(site_geo) FROM sites WHERE site_geo IS NOT NULL;"; done |psql ts 2>&1 |grep 'resident' |sed -n '1~9p'
!       40552 kB max resident size
!       41188 kB max resident size
!       41820 kB max resident size
!       42576 kB max resident size
!       43080 kB max resident size
!       43508 kB max resident size
!       44572 kB max resident size
!       45040 kB max resident size
!       45588 kB max resident size
!       46024 kB max resident size
!       46492 kB max resident size

This would lead to an OOM in our long-lived process.

-- 
Justin
Hi,

On 2020-10-03 17:02:36 -0500, Justin Pryzby wrote:
> We didn't use JIT with v12, but I've been setting jit=on along with v13
> upgrades, so I was trying to convince myself that there's a change in v13,
> and not just our config..
>
> I've seen repeated OOMs on a small number of reports.  It looks like JIT
> optimization/inlining used a good deal of RAM in v12, and even more on v13.
> Maybe that's what's expected; it looks like this scales with the number of
> partitions and number of functions, so it could be arbitrarily large, and
> not respectful of work_mem (below, at the 4MB default).

> In v13:
> With optimize_above=-1, inline_above=-1: 333472 kB max resident size
> With optimize_above=1, inline_above=1: 654036 kB max resident size
>
> In v12:
> With optimize_above=-1, inline_above=-1: 330988 kB max resident size
> With optimize_above=1, inline_above=1: 556844 kB max resident size

That's probably to some degree just the shared library being loaded / faulted
in (see below for more detail).  I don't think there's any difference between
12/13 in what inlining / optimization does, so I don't quite understand the
difference.

> The OOM itself could be due to our JIT settings and maybe not a bug.  My
> thinking is that when we hit OOM, that sometimes manifests itself as the C++
> exception, which somehow trickles up into a SIGABRT in postgres.

C++ code throws an exception when memory allocation fails, unless a different
error handler is set.  While in LLVM for JITing we set a different error
handler (which doesn't throw, but FATALs), but that doesn't appear to be done
for all the postgis code.  Nor does it - as it really should - catch the
exception.  An unhandled exception is treated as an abort(), which in turn
does a SIGABRT.

> For example, look at this huge backend, which is connected to the process
> which has been experiencing OOM.  The backend is 4.2GB while idle, so I'd
> have hoped for it to have released everything it could.  I'm aware that
> free() usually cannot return small allocations to the kernel, but that
> should not lead to continued growth over time.

The below turned out to be somewhat obsoleted by what you wrote below, which
I unfortunately hadn't yet read - but I think it's still good information, so
I'm not gonna delete it:

RES includes all the shared buffers that a backend touched, unless you use
huge pages.  At least on linux RES is unfortunately pretty much useless for
postgres.  You can gain a bit more information by looking at
/proc/$pid/status and /proc/$pid/smaps_rollup

E.g. here is a backend without having run any JIT (pruned all of these):

andres@awork3:~$ grep ^Rss /proc/2810440/status
RssAnon:            2324 kB
RssFile:            6932 kB
RssShmem:           4868 kB
andres@awork3:~$ grep ^Pss /proc/2810440/smaps_rollup
Pss:                3912 kB
Pss_Anon:           1177 kB
Pss_File:           1004 kB
Pss_Shmem:          1730 kB

and after loading the llvmjit library (LOAD 'llvmjit'), but not actually
jitting anything:

andres@awork3:~$ grep ^Rss /proc/2810440/status
RssAnon:            7528 kB
RssFile:          104844 kB
RssShmem:           5204 kB
andres@awork3:~$ grep ^Pss /proc/2810440/smaps_rollup
Pss:               57773 kB
Pss_Anon:           6436 kB
Pss_File:          49606 kB
Pss_Shmem:          1730 kB

Note how RSS has increased by ~100MB - but all of that is file-backed.  It's
just the shared library being loaded (not even all required to be in memory).
And if you look at the 'Pss' (p stands for proportional) values you can see
how it's lower, because it's shared between all the processes that have JIT
loaded - in my case here that was just two sessions, hence halved.
The actually allocated memory differs between the two, but it's ~2.3MB vs
7.5MB.  (You also can see how a postgres backend doesn't actually weigh 2MB,
because a lot of the memory allocations are shared between postmaster and
backends.)

And if you additionally utilize buffers in shared buffers, RSS will also
include that.  E.g.:

postgres[2810440][1]=# SHOW shared_buffers;
┌────────────────┐
│ shared_buffers │
├────────────────┤
│ 2GB            │
└────────────────┘
(1 row)

postgres[2810325][1]=# \dt+ large
                                      List of relations
┌────────┬───────┬───────┬────────┬─────────────┬───────────────┬────────┬─────────────┐
│ Schema │ Name  │ Type  │ Owner  │ Persistence │ Access Method │  Size  │ Description │
├────────┼───────┼───────┼────────┼─────────────┼───────────────┼────────┼─────────────┤
│ public │ large │ table │ andres │ permanent   │ heap          │ 501 MB │             │
└────────┴───────┴───────┴────────┴─────────────┴───────────────┴────────┴─────────────┘
(1 row)

andres@awork3:~$ grep ^Rss /proc/2810440/status
RssAnon:            8284 kB
RssFile:          107236 kB
RssShmem:          19184 kB
andres@awork3:~$ grep ^Pss /proc/2810440/smaps_rollup
Pss:               63282 kB
Pss_Anon:           7413 kB
Pss_File:          50034 kB
Pss_Shmem:          5834 kB

postgres[2810440][1]=# SELECT count(generate_series), count(random) FROM large;
...

andres@awork3:~$ grep ^Rss /proc/2810440/status
RssAnon:            8592 kB
RssFile:          107464 kB
RssShmem:         390328 kB
andres@awork3:~$ grep ^Pss /proc/2810440/smaps_rollup
Pss:              248889 kB
Pss_Anon:           7560 kB
Pss_File:          50069 kB
Pss_Shmem:        191259 kB

So now, despite only ~8.5MB of allocations, ps's RSS will show a considerably
larger memory usage:

andres@awork3:~$ ps -u --pid 2810440
USER         PID %CPU %MEM    VSZ    RSS TTY      STAT START   TIME COMMAND
andres   2810440  0.2  0.2 2420584 506384 ?       Ss   15:19   0:02 postgres: dev assert: andres postgres [local] idle

> The postgres memory contexts don't show anything nearly that large:
> MessageContext: 25236472 total in 20 blocks; 2990656 free (4 chunks); 22245816 used
> CacheMemoryContext: 25892160 total in 42 blocks; 22770984 free (28285 chunks); 3121176 used
> Grand total: 70879896 bytes in 1466 blocks; 29518440 free (29097 chunks); 41361456 used

I wonder if it'd be worth trying to redirect all LLVM allocations to be done
into a memory context.  Not to be able to free it, but to be able to see it
more clearly.  But it looks like it'd be too fragile to do so.

> I was able to make a small, apparent leak like so.  This applies to
> postgis3.0/postgres12/LLVM5/centos7, and
> postgis3.1alpha/postgres13/LLVM9/centos8.
>
> Inlining is essential to see the leak, optimization is not.  Showing every
> 9th query:
>
> $ for a in `seq 1 99`; do echo "SET jit_inline_above_cost=1; SET jit_optimize_above_cost=-1; SET jit_above_cost=0; SET jit_expressions=on; SET log_executor_stats=on; SET client_min_messages=debug; SET jit=on; explain analyze SELECT st_x(site_geo) FROM sites WHERE site_geo IS NOT NULL;"; done |psql ts 2>&1 |grep 'resident' |sed -n '1~9p'
> !       46024 kB max resident size
> !       46492 kB max resident size

Huh, that's certainly not good.  Looking.

Greetings,

Andres Freund
On Sat, Oct 03, 2020 at 04:01:49PM -0700, Andres Freund wrote:
> The below turned out to be somewhat obsoleted by what you wrote below,
> which I unfortunately hadn't yet read - but I think it's still good
> information, so I'm not gonna delete it:

Right, I understand that RES RAM might include shared buffers and
demand-loaded file-backed and mmap stuff.  But on our servers, I tentatively
wouldn't expect it to be more than a handful of GB.  It may be relevant that
most of our report processes run in forked children, specifically to avoid
leaks, and the forked children would have their own, shortlived backend,
which would hide any postgres leak (in addition to our having tried to use
jit since v11).

> > I was able to make a small, apparent leak like so.  This applies to
> > postgis3.0/postgres12/LLVM5/centos7, and
> > postgis3.1alpha/postgres13/LLVM9/centos8.
> >
> > Inlining is essential to see the leak, optimization is not.  Showing every
> > 9th query:
> >
> > $ for a in `seq 1 99`; do echo "SET jit_inline_above_cost=1; SET jit_optimize_above_cost=-1; SET jit_above_cost=0; SET jit_expressions=on; SET log_executor_stats=on; SET client_min_messages=debug; SET jit=on; explain analyze SELECT st_x(site_geo) FROM sites WHERE site_geo IS NOT NULL;"; done |psql ts 2>&1 |grep 'resident' |sed -n '1~9p'
> > !       46024 kB max resident size
> > !       46492 kB max resident size
>
> Huh, that's certainly not good.  Looking.

Reproduce like:

postgres=# CREATE EXTENSION postgis;
postgres=# CREATE TABLE geo(g geometry(point));
postgres=# INSERT INTO geo SELECT st_GeometryFromText('POINT(11 12)',0) FROM generate_series(1,99999);
postgres=# SET jit_above_cost=0; SET jit_inline_above_cost=0; SET client_min_messages=debug; SET log_executor_stats=on; explain analyze SELECT st_x(g) FROM geo;
Hi,

On 2020-10-03 18:30:46 -0500, Justin Pryzby wrote:
> On Sat, Oct 03, 2020 at 04:01:49PM -0700, Andres Freund wrote:
> > > I was able to make a small, apparent leak like so.  This applies to
> > > postgis3.0/postgres12/LLVM5/centos7, and
> > > postgis3.1alpha/postgres13/LLVM9/centos8.
> > >
> > > Inlining is essential to see the leak, optimization is not.  Showing
> > > every 9th query:
> > >
> > > $ for a in `seq 1 99`; do echo "SET jit_inline_above_cost=1; SET jit_optimize_above_cost=-1; SET jit_above_cost=0; SET jit_expressions=on; SET log_executor_stats=on; SET client_min_messages=debug; SET jit=on; explain analyze SELECT st_x(site_geo) FROM sites WHERE site_geo IS NOT NULL;"; done |psql ts 2>&1 |grep 'resident' |sed -n '1~9p'
> > > !       46024 kB max resident size
> > > !       46492 kB max resident size
> >
> > Huh, that's certainly not good.  Looking.
>
> Reproduce like:
> postgres=# CREATE EXTENSION postgis;
> postgres=# CREATE TABLE geo(g geometry(point));
> postgres=# INSERT INTO geo SELECT st_GeometryFromText('POINT(11 12)',0) FROM generate_series(1,99999);
> postgres=# SET jit_above_cost=0; SET jit_inline_above_cost=0; SET client_min_messages=debug; SET log_executor_stats=on; explain analyze SELECT st_x(g) FROM geo;

It's an issue inside LLVM.  I'm not yet sure how to avoid it.  Basically
whenever we load an IR module to inline from, it renames its type names to
make them not conflict with already known type names in the current
LLVMContext.  That, over time, leads to a large number of type names in
memory.  We can avoid the re-loading of the module by instead cloning it, but
that still leads to new types being created for some internal reasons.  It's
possible to create/drop separate LLVMContext instances, but we'd have to
figure out when to do so - it's too expensive to always do that.

Greetings,

Andres Freund
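[To see why disposing the context is the only way to reclaim that memory,
consider this standalone sketch using the LLVM C API.  The file path and
error handling are simplified assumptions; the point is that named types are
interned per-context, not per-module.]

#include <stdio.h>
#include <llvm-c/Core.h>
#include <llvm-c/IRReader.h>

/* Parse the same bitcode file repeatedly into one context. */
static void
parse_repeatedly(const char *path, int times)
{
	LLVMContextRef ctx = LLVMContextCreate();

	for (int i = 0; i < times; i++)
	{
		LLVMMemoryBufferRef buf;
		LLVMModuleRef mod;
		char	   *err = NULL;

		if (LLVMCreateMemoryBufferWithContentsOfFile(path, &buf, &err))
		{
			fprintf(stderr, "read failed: %s\n", err);
			break;
		}
		/*
		 * Each parse interns fresh renamed types (%struct.Foo.123, ...)
		 * in ctx; the buffer is consumed by the parser.
		 */
		if (LLVMParseIRInContext(ctx, buf, &mod, &err))
		{
			fprintf(stderr, "parse failed: %s\n", err);
			break;
		}
		LLVMDisposeModule(mod);	/* frees the module... */
	}
	/* ...but the interned type table only goes away with the context. */
	LLVMContextDispose(ctx);
}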
Hi,

On 2020-10-03 19:01:27 -0700, Andres Freund wrote:
> On 2020-10-03 18:30:46 -0500, Justin Pryzby wrote:
> > On Sat, Oct 03, 2020 at 04:01:49PM -0700, Andres Freund wrote:
> > > > I was able to make a small, apparent leak like so.  This applies to
> > > > postgis3.0/postgres12/LLVM5/centos7, and
> > > > postgis3.1alpha/postgres13/LLVM9/centos8.
> > > >
> > > > Inlining is essential to see the leak, optimization is not.  Showing
> > > > every 9th query:
> > > >
> > > > $ for a in `seq 1 99`; do echo "SET jit_inline_above_cost=1; SET jit_optimize_above_cost=-1; SET jit_above_cost=0; SET jit_expressions=on; SET log_executor_stats=on; SET client_min_messages=debug; SET jit=on; explain analyze SELECT st_x(site_geo) FROM sites WHERE site_geo IS NOT NULL;"; done |psql ts 2>&1 |grep 'resident' |sed -n '1~9p'
> > > > !       46024 kB max resident size
> > > > !       46492 kB max resident size
> > >
> > > Huh, that's certainly not good.  Looking.
> >
> > Reproduce like:
> > postgres=# CREATE EXTENSION postgis;
> > postgres=# CREATE TABLE geo(g geometry(point));
> > postgres=# INSERT INTO geo SELECT st_GeometryFromText('POINT(11 12)',0) FROM generate_series(1,99999);
> > postgres=# SET jit_above_cost=0; SET jit_inline_above_cost=0; SET client_min_messages=debug; SET log_executor_stats=on; explain analyze SELECT st_x(g) FROM geo;
>
> It's an issue inside LLVM.  I'm not yet sure how to avoid it.  Basically
> whenever we load an IR module to inline from, it renames its type names
> to make them not conflict with already known type names in the current
> LLVMContext.  That, over time, leads to a large number of type names in
> memory.  We can avoid the re-loading of the module by instead cloning it,
> but that still leads to new types being created for some internal
> reasons.  It's possible to create/drop separate LLVMContext instances,
> but we'd have to figure out when to do so - it's too expensive to always
> do that.

I spent a lot of time (probably too much time) trying to find a good way out
of that.  I think I mostly figured it out, but it's gonna take me a bit
longer to nail down the loose ends.  The nice part is that it actually speeds
up inlining a lot.  The bad part is that it's not a small change, so I don't
yet know how to deal with existing releases.

Details:

The issue with the leaks is basically that when LLVM loads a bitcode module
it doesn't "unify" named types with types in other bitcode modules - they
could after all be entirely independent.  That means that we end up with a
lot of types after a while of running.

That ought to be addressable by not re-loading modules we inline from, from
disk, every time we inline, but unfortunately that doesn't work well either:
The "cross module import" logic unfortunately modifies the types used in the
"source module", leading to subsequent imports not working well.  That last
issue is why I had moved to re-loading modules from "disk" during import.

The partial patch I've written doesn't use the existing cross module import
code anymore, but moves to the lower level functionality it uses.  For
traditional compilers, which is what that linker is mainly written for, it
makes sense to reduce memory usage by "destructively" moving code from the
source into the target - it'll only be used once.  But for the JIT case it's
better to do so non-destructively.  That requires us to implement some "type
resolution" between the source and target modules, but then gains not needing
to reload / copy the source module anymore, saving a lot of the costs.
It's possible that for existing releases we ought to go for a simpler
approach, even if it's not as good.  The only really feasible way I see is to
basically count the number of times inlining has been done, and then just
re-initialize after a certain time.  But that's pretty darn yucky.

Greetings,

Andres Freund
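[The destructive move described above is visible in the LLVM C API as well:
LLVMLinkModules2() consumes its source module, so keeping a cached module
usable across inlining passes forces a clone per pass - the copy cost the
prototype patch avoids with its own non-destructive import.  A simplified
sketch; the function name is hypothetical:]

#include <llvm-c/Core.h>
#include <llvm-c/Linker.h>

static LLVMBool
inline_from_cached_module(LLVMModuleRef target, LLVMModuleRef cached)
{
	/* clone first, because linking destroys the source module */
	LLVMModuleRef copy = LLVMCloneModule(cached);

	/* returns true on error; "copy" is consumed either way */
	return LLVMLinkModules2(target, copy);
}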
Hi,

Sorry for the second send of this email, but somehow I managed to mangle the
headers in the last version I sent.

On 2020-10-03 19:01:27 -0700, Andres Freund wrote:
> On 2020-10-03 18:30:46 -0500, Justin Pryzby wrote:
> > On Sat, Oct 03, 2020 at 04:01:49PM -0700, Andres Freund wrote:
> > > > I was able to make a small, apparent leak like so.  This applies to
> > > > postgis3.0/postgres12/LLVM5/centos7, and
> > > > postgis3.1alpha/postgres13/LLVM9/centos8.
> > > >
> > > > Inlining is essential to see the leak, optimization is not.  Showing
> > > > every 9th query:
> > > >
> > > > $ for a in `seq 1 99`; do echo "SET jit_inline_above_cost=1; SET jit_optimize_above_cost=-1; SET jit_above_cost=0; SET jit_expressions=on; SET log_executor_stats=on; SET client_min_messages=debug; SET jit=on; explain analyze SELECT st_x(site_geo) FROM sites WHERE site_geo IS NOT NULL;"; done |psql ts 2>&1 |grep 'resident' |sed -n '1~9p'
> > > > !       46024 kB max resident size
> > > > !       46492 kB max resident size
> > >
> > > Huh, that's certainly not good.  Looking.
> >
> > Reproduce like:
> > postgres=# CREATE EXTENSION postgis;
> > postgres=# CREATE TABLE geo(g geometry(point));
> > postgres=# INSERT INTO geo SELECT st_GeometryFromText('POINT(11 12)',0) FROM generate_series(1,99999);
> > postgres=# SET jit_above_cost=0; SET jit_inline_above_cost=0; SET client_min_messages=debug; SET log_executor_stats=on; explain analyze SELECT st_x(g) FROM geo;
>
> It's an issue inside LLVM.  I'm not yet sure how to avoid it.  Basically
> whenever we load an IR module to inline from, it renames its type names
> to make them not conflict with already known type names in the current
> LLVMContext.  That, over time, leads to a large number of type names in
> memory.  We can avoid the re-loading of the module by instead cloning it,
> but that still leads to new types being created for some internal
> reasons.  It's possible to create/drop separate LLVMContext instances,
> but we'd have to figure out when to do so - it's too expensive to always
> do that.

I spent a lot of time (probably too much time) trying to find a good way out
of that.  I think I mostly figured it out, but it's gonna take me a bit
longer to nail down the loose ends.  The nice part is that it actually speeds
up inlining a lot.  The bad part is that it's not a small change, so I don't
yet know how to deal with existing releases.

Details:

The issue with the leaks is basically that when LLVM loads a bitcode module
it doesn't "unify" named types with types in other bitcode modules - they
could after all be entirely independent.  That means that we end up with a
lot of types after a while of running.

That ought to be addressable by not re-loading modules we inline from, from
disk, every time we inline, but unfortunately that doesn't work well either:
The "cross module import" logic unfortunately modifies the types used in the
"source module", leading to subsequent imports not working well.  That last
issue is why I had moved to re-loading modules from "disk" during import.

The partial patch I've written doesn't use the existing cross module import
code anymore, but moves to the lower level functionality it uses.  For
traditional compilers, which is what that linker is mainly written for, it
makes sense to reduce memory usage by "destructively" moving code from the
source into the target - it'll only be used once.  But for the JIT case it's
better to do so non-destructively.
That requires us to implement some "type resolution" between the source and
target modules, but then gains not needing to reload / copy the source module
anymore, saving a lot of the costs.

It's possible that for existing releases we ought to go for a simpler
approach, even if it's not as good.  The only really feasible way I see is to
basically count the number of times inlining has been done, and then just
re-initialize after a certain time.  But that's pretty darn yucky.

Greetings,

Andres Freund
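[That counter-based fallback is roughly what the prototype quoted later in
this thread does.  A sketch follows; the variable names match symbols visible
in the patch excerpts below (llvm_llvm_context_reuse_count etc.), but the
bodies and the threshold value are illustrative assumptions:]

#include <llvm-c/Core.h>

#define LLVMJIT_LLVM_CONTEXT_REUSE_MAX 100

static LLVMContextRef llvm_context = NULL;
static size_t llvm_llvm_context_reuse_count = 0;
static int	llvm_jit_context_in_use_count = 0;

static void
llvm_recreate_llvm_context(void)
{
	/* unsafe while any JITed query still references the context */
	if (llvm_jit_context_in_use_count != 0)
		return;
	if (llvm_llvm_context_reuse_count <= LLVMJIT_LLVM_CONTEXT_REUSE_MAX)
		return;

	/* Dropping the context releases all the accumulated renamed types... */
	LLVMContextDispose(llvm_context);
	/* ...at the price of re-parsing inlinable modules into the new one. */
	llvm_context = LLVMContextCreate();
	llvm_llvm_context_reuse_count = 0;
}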
I'm re-raising this issue (since I needed to track down why I'd reflexively
saved a 512g backup of the filesystem where this first happened).

I'm 99% sure the "bad_alloc" is from LLVM.  It happened multiple times on
different servers (running a similar report) after setting jit=on during pg13
upgrade, and never happened since re-setting jit=off.

I reproduced a memory leak, but I don't think you publicized a fix for it.
I *think* the memory leak probably explains the bad_alloc.  We were seeing
postgres killed by OOM every ~16 hours, although I don't see how that's
possible from such a tiny leak.  Our query also uses postgis (at first I said
it didn't), so postgis may be involved.

I'd be happy to run with a prototype fix for the leak to see if the other
issue does (not) recur.

On Wed, Oct 07, 2020 at 04:02:42PM -0700, Andres Freund wrote:
> Hi,
>
> Sorry for the second send of this email, but somehow I managed to mangle
> the headers in the last version I sent.
>
> On 2020-10-03 19:01:27 -0700, Andres Freund wrote:
> > On 2020-10-03 18:30:46 -0500, Justin Pryzby wrote:
> > > On Sat, Oct 03, 2020 at 04:01:49PM -0700, Andres Freund wrote:
> > > > > I was able to make a small, apparent leak like so.  This applies to
> > > > > postgis3.0/postgres12/LLVM5/centos7, and
> > > > > postgis3.1alpha/postgres13/LLVM9/centos8.
> > > > >
> > > > > Inlining is essential to see the leak, optimization is not.
> > > > > Showing every 9th query:
> > > > >
> > > > > $ for a in `seq 1 99`; do echo "SET jit_inline_above_cost=1; SET jit_optimize_above_cost=-1; SET jit_above_cost=0; SET jit_expressions=on; SET log_executor_stats=on; SET client_min_messages=debug; SET jit=on; explain analyze SELECT st_x(site_geo) FROM sites WHERE site_geo IS NOT NULL;"; done |psql ts 2>&1 |grep 'resident' |sed -n '1~9p'
> > > > > !       46024 kB max resident size
> > > > > !       46492 kB max resident size
> > > >
> > > > Huh, that's certainly not good.  Looking.
> > >
> > > Reproduce like:
> > > postgres=# CREATE EXTENSION postgis;
> > > postgres=# CREATE TABLE geo(g geometry(point));
> > > postgres=# INSERT INTO geo SELECT st_GeometryFromText('POINT(11 12)',0) FROM generate_series(1,99999);
> > > postgres=# SET jit_above_cost=0; SET jit_inline_above_cost=0; SET client_min_messages=debug; SET log_executor_stats=on; explain analyze SELECT st_x(g) FROM geo;
> >
> > It's an issue inside LLVM.  I'm not yet sure how to avoid it.  Basically
> > whenever we load an IR module to inline from, it renames its type names
> > to make them not conflict with already known type names in the current
> > LLVMContext.  That, over time, leads to a large number of type names in
> > memory.  We can avoid the re-loading of the module by instead cloning it,
> > but that still leads to new types being created for some internal
> > reasons.  It's possible to create/drop separate LLVMContext instances,
> > but we'd have to figure out when to do so - it's too expensive to always
> > do that.
>
> I spent a lot of time (probably too much time) trying to find a good way
> out of that.  I think I mostly figured it out, but it's gonna take me a
> bit longer to nail down the loose ends.  The nice part is that it
> actually speeds up inlining a lot.  The bad part is that it's not a small
> change, so I don't yet know how to deal with existing releases.
>
> Details:
>
> The issue with the leaks is basically that when LLVM loads a bitcode
> module it doesn't "unify" named types with types in other bitcode modules
> - they could after all be entirely independent.
> That means that we end up with a lot of types after a while of running.
>
> That ought to be addressable by not re-loading modules we inline from,
> from disk, every time we inline, but unfortunately that doesn't work well
> either: The "cross module import" logic unfortunately modifies the types
> used in the "source module", leading to subsequent imports not working
> well.  That last issue is why I had moved to re-loading modules from
> "disk" during import.
>
> The partial patch I've written doesn't use the existing cross module
> import code anymore, but moves to the lower level functionality it uses.
> For traditional compilers, which is what that linker is mainly written
> for, it makes sense to reduce memory usage by "destructively" moving code
> from the source into the target - it'll only be used once.  But for the
> JIT case it's better to do so non-destructively.  That requires us to
> implement some "type resolution" between the source and target modules,
> but then gains not needing to reload / copy the source module anymore,
> saving a lot of the costs.
>
> It's possible that for existing releases we ought to go for a simpler
> approach, even if it's not as good.  The only really feasible way I see
> is to basically count the number of times inlining has been done, and
> then just re-initialize after a certain time.  But that's pretty darn
> yucky.
>
> Greetings,
>
> Andres Freund
Hi,

On 2020-12-18 17:56:07 -0600, Justin Pryzby wrote:
> I'd be happy to run with a prototype fix for the leak to see if the other
> issue does (not) recur.

I just posted a prototype fix to
https://www.postgresql.org/message-id/20210417021602.7dilihkdc7oblrf7%40alap3.anarazel.de
(just because that was the first thread I re-found).  It'd be cool if you
could have a look!

Greetings,

Andres Freund
On Fri, Apr 16, 2021 at 07:17:55PM -0700, Andres Freund wrote:
> Hi,
>
> On 2020-12-18 17:56:07 -0600, Justin Pryzby wrote:
> > I'd be happy to run with a prototype fix for the leak to see if the other
> > issue does (not) recur.
>
> I just posted a prototype fix to
> https://www.postgresql.org/message-id/20210417021602.7dilihkdc7oblrf7%40alap3.anarazel.de
> (just because that was the first thread I re-found).  It'd be cool if you
> could have a look!

This doesn't seem to address the problem triggered by the reproducer at
https://www.postgresql.org/message-id/20210331040751.GU4431@telsasoft.com
(sorry I didn't CC you)

CREATE OR REPLACE FUNCTION cfn() RETURNS void LANGUAGE PLPGSQL AS $$
declare a record;
begin
  FOR a IN SELECT generate_series(1,99) LOOP
    PERFORM format('select 1');
  END LOOP;
END $$;

$ yes 'SET jit_above_cost=0; SET jit_inline_above_cost=0; SET jit=on; SET client_min_messages=debug; SET log_executor_stats=on; SELECT cfn();' |head -11 |psql 2>&1 |grep 'max resident'

-- 
Justin
On Fri, Apr 16, 2021 at 09:48:54PM -0500, Justin Pryzby wrote:
> On Fri, Apr 16, 2021 at 07:17:55PM -0700, Andres Freund wrote:
> > Hi,
> >
> > On 2020-12-18 17:56:07 -0600, Justin Pryzby wrote:
> > > I'd be happy to run with a prototype fix for the leak to see if the
> > > other issue does (not) recur.
> >
> > I just posted a prototype fix to
> > https://www.postgresql.org/message-id/20210417021602.7dilihkdc7oblrf7%40alap3.anarazel.de
> > (just because that was the first thread I re-found).  It'd be cool if you
> > could have a look!
>
> This doesn't seem to address the problem triggered by the reproducer at
> https://www.postgresql.org/message-id/20210331040751.GU4431@telsasoft.com
> (sorry I didn't CC you)

I take that back - I forgot that this doesn't release RAM until hitting a
threshold.

Without the patch, it looks like:

$ yes 'SET jit_above_cost=0; SET jit_inline_above_cost=0; SET jit=on; SET client_min_messages=debug; SET log_executor_stats=on; SELECT cfn();' |head -11 |psql -h /tmp postgres 2>&1 |grep 'max resident'
!       61820 kB max resident size
!       65020 kB max resident size
!       68812 kB max resident size
!       71152 kB max resident size
!       76820 kB max resident size
!       78760 kB max resident size
!       81140 kB max resident size
!       83520 kB max resident size
!       93084 kB max resident size
!       94756 kB max resident size
!       96416 kB max resident size

With the patch and #define LLVMJIT_LLVM_CONTEXT_REUSE_MAX 1, it looks like
this:

!       61436 kB max resident size
!       61572 kB max resident size
!       63236 kB max resident size
!       63236 kB max resident size
!       63556 kB max resident size
!       63556 kB max resident size
!       63880 kB max resident size
!       65416 kB max resident size
!       65416 kB max resident size
!       65416 kB max resident size
!       65416 kB max resident size

-- 
Justin
Hi,

On Fri, Apr 16, 2021, at 20:18, Justin Pryzby wrote:
> On Fri, Apr 16, 2021 at 09:48:54PM -0500, Justin Pryzby wrote:
> > On Fri, Apr 16, 2021 at 07:17:55PM -0700, Andres Freund wrote:
> > > Hi,
> > >
> > > On 2020-12-18 17:56:07 -0600, Justin Pryzby wrote:
> > > > I'd be happy to run with a prototype fix for the leak to see if the
> > > > other issue does (not) recur.
> > >
> > > I just posted a prototype fix to
> > > https://www.postgresql.org/message-id/20210417021602.7dilihkdc7oblrf7%40alap3.anarazel.de
> > > (just because that was the first thread I re-found).  It'd be cool if
> > > you could have a look!
> >
> > This doesn't seem to address the problem triggered by the reproducer at
> > https://www.postgresql.org/message-id/20210331040751.GU4431@telsasoft.com
> > (sorry I didn't CC you)
>
> I take that back - I forgot that this doesn't release RAM until hitting a
> threshold.

Phew.

> Without the patch, it looks like:
>
> $ yes 'SET jit_above_cost=0; SET jit_inline_above_cost=0; SET jit=on;
> SET client_min_messages=debug; SET log_executor_stats=on; SELECT
> cfn();' |head -11 |psql -h /tmp postgres 2>&1 |grep 'max resident'
> !       61820 kB max resident size
> !       65020 kB max resident size
> !       68812 kB max resident size
> !       71152 kB max resident size
> !       76820 kB max resident size
> !       78760 kB max resident size
> !       81140 kB max resident size
> !       83520 kB max resident size
> !       93084 kB max resident size
> !       94756 kB max resident size
> !       96416 kB max resident size
>
> With the patch and #define LLVMJIT_LLVM_CONTEXT_REUSE_MAX 1, it looks like
> this:
>
> !       61436 kB max resident size
> !       61572 kB max resident size
> !       63236 kB max resident size
> !       63236 kB max resident size
> !       63556 kB max resident size
> !       63556 kB max resident size
> !       63880 kB max resident size
> !       65416 kB max resident size
> !       65416 kB max resident size
> !       65416 kB max resident size
> !       65416 kB max resident size

If you set the define to a much lower value, how low does this get?

I didn't see an easy way to make the recreation of the context memory-usage
dependent, but that'd of course be much better than an #iterations based
approach.  I'll play around with your test tomorrow or so, if you don't, but
now the next step is risotto...

Regards,

Andres
On Fri, Apr 16, 2021 at 10:18:37PM -0500, Justin Pryzby wrote:
> I take that back - I forgot that this doesn't release RAM until hitting a
> threshold.

I tried this on the query that was causing the original c++ exception.
It still grows to 2GB within 5min.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
23084 postgres  20   0 2514364   1.6g  29484 R  99.7 18.2   3:40.87 postgres: telsasoft ts 192.168.122.11(50892) SELECT

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
23084 postgres  20   0 3046960   2.1g  29484 R 100.0 24.1   4:30.64 postgres: telsasoft ts 192.168.122.11(50892) SELECT

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
23084 postgres  20   0 4323500   3.3g  29488 R  99.7 38.4   8:20.63 postgres: telsasoft ts 192.168.122.11(50892) SELECT

When I first reported this issue, the affected process was a long-running,
single-threaded python tool.  We since updated it (partially to avoid issues
like this) to use multiprocessing, therefore separate postgres backends.
I'm now realizing that that's RAM use for a single query, not from continuous
leaks across multiple queries.

This is still true with the patch, even if I
#define LLVMJIT_LLVM_CONTEXT_REUSE_MAX 1

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
28438 postgres  20   0 3854264   2.8g  29428 R  98.7 33.2   8:56.79 postgres: telsasoft ts 192.168.122.11(53614) BIND

python3 ./jitleak.py # runs telsasoft reports
INFO:  recreating LLVM context after 2 uses
INFO:  recreating LLVM context after 2 uses
INFO:  recreating LLVM context after 2 uses
INFO:  recreating LLVM context after 2 uses
INFO:  recreating LLVM context after 2 uses
PID 27742 finished running report; est=None rows=40745; cols=34; ... duration:538
INFO:  recreating LLVM context after 81492 uses

I did:

-	llvm_llvm_context_reuse_count = 0;
 	Assert(llvm_context != NULL);
+	elog(INFO, "recreating LLVM context after %zu uses", llvm_llvm_context_reuse_count);
+	llvm_llvm_context_reuse_count = 0;

Maybe we're missing this condition somehow?

	if (llvm_jit_context_in_use_count == 0 &&

Also, I just hit this assertion by cancelling the query with ^C / sigint.
But I don't have a reproducer for it.

< 2021-04-17 19:14:23.509 ADT telsasoft >PANIC:  LLVMJitContext in use count not 0 at exit (is 1)

-- 
Justin
Hi,

On 2021-04-17 19:13:24 -0500, Justin Pryzby wrote:
> I'm now realizing that that's RAM use for a single query, not from
> continuous leaks across multiple queries.

What's the memory usage with inlining disabled, and what's the usage without
jit?

> This is still true with the patch, even if I
> #define LLVMJIT_LLVM_CONTEXT_REUSE_MAX 1
>
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
> 28438 postgres  20   0 3854264   2.8g  29428 R  98.7 33.2   8:56.79 postgres: telsasoft ts 192.168.122.11(53614) BIND
>
> python3 ./jitleak.py # runs telsasoft reports
> INFO:  recreating LLVM context after 2 uses
> INFO:  recreating LLVM context after 2 uses
> INFO:  recreating LLVM context after 2 uses
> INFO:  recreating LLVM context after 2 uses
> INFO:  recreating LLVM context after 2 uses
> PID 27742 finished running report; est=None rows=40745; cols=34; ... duration:538
> INFO:  recreating LLVM context after 81492 uses
>
> I did:
>
> -	llvm_llvm_context_reuse_count = 0;
>  	Assert(llvm_context != NULL);
> +	elog(INFO, "recreating LLVM context after %zu uses", llvm_llvm_context_reuse_count);
> +	llvm_llvm_context_reuse_count = 0;
>
> Maybe we're missing this condition somehow?
> 	if (llvm_jit_context_in_use_count == 0 &&

Do you utilize cursors or such?

> Also, I just hit this assertion by cancelling the query with ^C / sigint.
> But I don't have a reproducer for it.
>
> < 2021-04-17 19:14:23.509 ADT telsasoft >PANIC:  LLVMJitContext in use count not 0 at exit (is 1)

Hm, I'll try to see how that happens - if not, do you have a backtrace?

Greetings,

Andres Freund
On Mon, Apr 19, 2021 at 09:41:30AM -0700, Andres Freund wrote:
> On 2021-04-17 19:13:24 -0500, Justin Pryzby wrote:
> > I'm now realizing that that's RAM use for a single query, not from
> > continuous leaks across multiple queries.
>
> What's the memory usage with inlining disabled, and what's the usage
> without jit?

PGOPTIONS="-c jit=off -c jit_inline_above_cost=0 -c jit_above_cost=0 -c client_min_messages=debug -c log_executor_stats=on" python3 `pwd`/jitleak.py
!       [5.272604 s user, 0.329912 s system total]
!       120948 kB max resident size

PGOPTIONS="-c jit=on -c jit_inline_above_cost=0 -c jit_above_cost=0 -c client_min_messages=debug -c log_executor_stats=on" python3 `pwd`/jitleak.py
DEBUG:  time to inline: 0.003s, opt: 0.000s, emit: 0.003s
... repeated many times ...
!       [585.090087 s user, 31.835228 s system total]
!       4639556 kB max resident size

PGOPTIONS="-c jit=on -c jit_inline_above_cost=-1 -c jit_above_cost=0 -c client_min_messages=debug -c log_executor_stats=on" python3 `pwd`/jitleak.py
DEBUG:  time to inline: 0.000s, opt: 0.000s, emit: 0.003s
... repeated many times ... which is confusing, since inlining is disabled ...
!       [318.556489 s user, 8.413007 s system total]
!       159436 kB max resident size

> > Maybe we're missing this condition somehow?
> > 	if (llvm_jit_context_in_use_count == 0 &&
>
> Do you utilize cursors or such?

Not in this backend/process.  I believe postgis/mapserver uses cursors in a
separate process, though...

Our report runs like:

SELECT large_object_oid FROM ... WHERE name=$1 AND user=$2 ... FOR UPDATE -- our reports are stored as LOs and read by readlo
begin;
SET enable_nestloop=off; -- I'm sorry to say, but nothing else avoids occasional pathological query plans
SELECT report text...;
rollback;

> > Also, I just hit this assertion by cancelling the query with ^C / sigint.
> > But I don't have a reproducer for it.
> >
> > < 2021-04-17 19:14:23.509 ADT telsasoft >PANIC:  LLVMJitContext in use count not 0 at exit (is 1)
>
> Hm, I'll try to see how that happens - if not, do you have a backtrace?

This may be because I backpatched to v13, which had one conflict (0a2bc5d61).
Maybe this is relevant, too: bab150045

If you could provide a patch for v13, it's a good bet there's no issue that I
didn't cause.

I'm able to reproduce the crash like this, with my patch to v13, but not your
patch on master.

python3 -c "import pg; db=pg.DB('dbname=postgres'); db.query('SET jit_above_cost=0; SET jit_inline_above_cost=0; SET jit=on; SET client_min_messages=debug'); db.query('begin'); db.query_formatted('SELECT 1 FROM generate_series(1,99)a WHERE a=%s', [1], inline=False);"

-- 
Justin
I don't know if this is related to the other issues, but this seems leaky.

create or replace function fn() returns void language plpgsql as $$
declare rec int;
begin
  SELECT relpages INTO rec FROM pg_class LIMIT 1;
end $$;

explain analyze SELECT fn() FROM generate_series(1,999);

PGOPTIONS='-cclient_min_messages=debug -clog_executor_stats=on -clog_min_duration_statement=-1 -cjit=on -cjit_above_cost=0 -cjit_inline_above_cost=0' psql ts -f jitleak.sql
!       70172 kB max resident size

PGOPTIONS='-cclient_min_messages=debug -clog_executor_stats=on -clog_min_duration_statement=-1 -cjit=on -cjit_above_cost=0 -cjit_inline_above_cost=-1' psql ts -f jitleak.sql
!       25672 kB max resident size
On Tue, Apr 20, 2021 at 12:38:26AM -0500, Justin Pryzby wrote:
> I don't know if this is related to the other issues, but this seems leaky.

And it explains how the context use counter can exceed its threshold.

create or replace function fn() returns void language plpgsql as $$
declare rec int;
begin
  SELECT relpages INTO rec FROM pg_class LIMIT 1;
end $$;

explain analyze
SELECT fn()
FROM generate_series(1,99);

SELECT SUM(a) FROM (VALUES(1))a(a);

time PGOPTIONS='-cclient_min_messages=debug -clog_executor_stats=off -clog_min_duration_statement=-1 -cjit=on -cjit_above_cost=0 -cjit_inline_above_cost=0' psql ts -f jitleak.sql
...
psql:jitleak.sql:6: DEBUG:  recreating LLVM context after 100 uses

Question: does the jit context need to be recreated only when inlining is
enabled?  Or maybe it's better if it's not conditionalized like that..
Hi,

On 2021-04-20 00:58:21 -0500, Justin Pryzby wrote:
> On Tue, Apr 20, 2021 at 12:38:26AM -0500, Justin Pryzby wrote:
> > I don't know if this is related to the other issues, but this seems
> > leaky.
>
> And it explains how the context use counter can exceed its threshold.
>
> create or replace function fn() returns void language plpgsql as $$
> declare rec int;
> begin
>   SELECT relpages INTO rec FROM pg_class LIMIT 1;
> end $$;
>
> explain analyze
> SELECT fn()
> FROM generate_series(1,99);
>
> SELECT SUM(a) FROM (VALUES(1))a(a);
>
> time PGOPTIONS='-cclient_min_messages=debug -clog_executor_stats=off -clog_min_duration_statement=-1 -cjit=on -cjit_above_cost=0 -cjit_inline_above_cost=0' psql ts -f jitleak.sql
> ...
> psql:jitleak.sql:6: DEBUG:  recreating LLVM context after 100 uses

Right - at the moment the context can only be recreated when there are no
JITed queries ongoing.  That's why I asked whether your "real" query contains
function calls, and whether those functions are expensive enough to be JITed.

> Question: does the jit context need to be recreated only when inlining is
> enabled?  Or maybe it's better if it's not conditionalized like that..

It'd be sufficient to do it when doing inlining, but currently that's not
tracked separately.

Greetings,

Andres Freund
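[Roughly, with hypothetical helper names building on the earlier counter
sketch: each JITed query pins the shared LLVMContext, and recreation is only
attempted when the last pin drops.]

static void
llvm_jit_context_pin(void)
{
	llvm_jit_context_in_use_count++;
	llvm_llvm_context_reuse_count++;
}

static void
llvm_jit_context_unpin(void)
{
	Assert(llvm_jit_context_in_use_count > 0);
	llvm_jit_context_in_use_count--;

	/*
	 * While an outer JITed query is running, nested queries (e.g. from
	 * plpgsql functions) keep bumping the reuse count, so it can exceed
	 * the threshold long before this ever succeeds - matching the
	 * "recreating LLVM context after 100 uses" DEBUG message above.
	 */
	if (llvm_jit_context_in_use_count == 0)
		llvm_recreate_llvm_context();
}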
On Tue, Apr 20, 2021 at 05:20:56PM -0700, Andres Freund wrote:
> On 2021-04-20 00:58:21 -0500, Justin Pryzby wrote:
> > On Tue, Apr 20, 2021 at 12:38:26AM -0500, Justin Pryzby wrote:
> > > I don't know if this is related to the other issues, but this seems
> > > leaky.
> >
> > And it explains how the context use counter can exceed its threshold.
> >
> > create or replace function fn() returns void language plpgsql as $$
> > declare rec int;
> > begin
> >   SELECT relpages INTO rec FROM pg_class LIMIT 1;
> > end $$;
>
> Right - at the moment the context can only be recreated when there are no
> JITed queries ongoing.  That's why I asked whether your "real" query
> contains function calls,

The real query has a bunch of function calls, including aggregates and custom
plpgsql functions.  I reduced the whole thing to the above.

ts=# explain analyze SELECT fn() FROM generate_series(1,999);
!       71692 kB max resident size
ts=# explain analyze SELECT fn() FROM generate_series(1,9999);
!       332852 kB max resident size

This doesn't seem to leak if I "SELECT 1 INTO rec" instead of "SELECT
relpages INTO rec".

> and whether those functions are expensive enough to be JITed.

The real query has:

| Append  (cost=44.54..192946.35 rows=79 width=741)

That's a query over a 2-day period (midnight to midnight+2), so it's not hard
for me to believe it sometimes exceeds 100k cost units
(jit_inline_above_cost), depending on where we are in that interval, and on
planner statistics.  It might've been different when I first reported the
problem, too.

I'm not sure what you mean about "expensive enough to be JITed"?
I think our custom functions don't have a COST set.

explain analyze is showing:

JIT:
  Functions: 201

-- 
Justin
Hi,

On 2021-04-20 20:03:13 -0500, Justin Pryzby wrote:
> That's a query over a 2-day period (midnight to midnight+2), so it's not
> hard for me to believe it sometimes exceeds 100k cost units
> (jit_inline_above_cost), depending on where we are in that interval, and
> on planner statistics.  It might've been different when I first reported
> the problem, too.
>
> I'm not sure what you mean about "expensive enough to be JITed"?
> I think our custom functions don't have a COST set.

What gets JITed is queries.  Unless your functions are fully inlinable SQL
functions, that means that query executions in functions are separately
JITed, and need to separately pass the cost minimums.

> explain analyze is showing:
>
> JIT:
>   Functions: 201

Note that this isn't about how many plpgsql functions or such are called, but
about how many JITed functions are generated.  Typically there's one function
per expression, and one for each table accessed.

Greetings,

Andres Freund
On Tue, Apr 20, 2021 at 06:16:28PM -0700, Andres Freund wrote:
> On 2021-04-20 20:03:13 -0500, Justin Pryzby wrote:
> > That's a query over a 2-day period (midnight to midnight+2), so it's not
> > hard for me to believe it sometimes exceeds 100k cost units
> > (jit_inline_above_cost), depending on where we are in that interval, and
> > on planner statistics.  It might've been different when I first reported
> > the problem, too.
> >
> > I'm not sure what you mean about "expensive enough to be JITed"?
> > I think our custom functions don't have a COST set.
>
> What gets JITed is queries.  Unless your functions are fully inlinable
> SQL functions, that means that query executions in functions are
> separately JITed, and need to separately pass the cost minimums.

I think the queries our plpgsql functions make wouldn't be expensive - mostly
lookups by primary key.  And any functions called by the queries within the
plpgsql function would not have COST set.

What do you think of the test case I sent?
| SELECT fn() FROM generate_series
If the JIT contexts can't be reset for this query, I assume the same reason
applies to our more complicated query.

> Note that this isn't about how many plpgsql functions or such are called,
> but about how many JITed functions are generated.  Typically there's one
> function per expression, and one for each table accessed.

Thanks for the explanations.  Note that in our config I have set
jit_tuple_deforming=off, so I think the table accesses don't use functions,
right?  Oh... that appears to be relevant somehow:

$ PGOPTIONS='-c client_min_messages=debug -c log_executor_stats=on -c jit=on -c jit_above_cost=0 -c jit_inline_above_cost=0' psql ts -c "explain analyze SELECT fn() FROM generate_series(1,999)"
!       69872 kB max resident size

$ PGOPTIONS='-c client_min_messages=debug -c log_executor_stats=on -c jit=on -c jit_above_cost=0 -c jit_inline_above_cost=0 -c jit_tuple_deforming=on' psql ts -c "explain analyze SELECT fn() FROM generate_series(1,999)"
!       36140 kB max resident size

-- 
Justin
I think you were thinking that jit_inline_above_cost=0 for queries within
plpgsql functions is pathological and not a useful test case, since the
function can be called an arbitrarily large number of times, but the jit
context can't be reset mid-query due to its nonzero reference count.

I think you were wondering whether, in our normal environment, the plpgsql
functions run queries that are expensive enough to be JITed.  I think they're
not expensive enough to be JITed - but it makes debugging/diagnosing/testing
much harder if it somehow "doesn't count" to set inline_above_cost=0.

I think that means 1) it's still a memory leak, since it's O(ntuples), which
is not generally okay... but 2) it's certainly better than nothing.  It'd be
good enough for back branches to have a mitigation rather than be killed by
OOM.

I'd forgotten doing it, but I had run your patch on one of the servers which
first crashed.  I remembered after upgrading to v13.3 and having the process
killed by OOM twice per day for the last 3 days (on a very consistent
schedule).  Could you provide a patch to apply to v13?  My attempted
backpatch introduced some issues.

On Mon, Apr 19, 2021 at 12:41:36PM -0500, Justin Pryzby wrote:
> > > Also, I just hit this assertion by cancelling the query with ^C /
> > > sigint.  But I don't have a reproducer for it.
> > >
> > > < 2021-04-17 19:14:23.509 ADT telsasoft >PANIC:  LLVMJitContext in use count not 0 at exit (is 1)
> >
> > Hm, I'll try to see how that happens - if not, do you have a backtrace?
>
> This may be because I backpatched to v13, which had one conflict
> (0a2bc5d61).  Maybe this is relevant, too: bab150045
>
> If you could provide a patch for v13, it's a good bet there's no issue
> that I didn't cause.
>
> I'm able to reproduce the crash like this, with my patch to v13, but not
> your patch on master.
>
> python3 -c "import pg; db=pg.DB('dbname=postgres'); db.query('SET jit_above_cost=0; SET jit_inline_above_cost=0; SET jit=on; SET client_min_messages=debug'); db.query('begin'); db.query_formatted('SELECT 1 FROM generate_series(1,99)a WHERE a=%s', [1], inline=False);"
There's a memory leak affecting JIT expressions, even when inlining,
optimization, and tuple deforming are disabled.  The server that got OOM
after installing PG13.3 (because your patch wasn't applied) still gets OOM
even with inline_above_cost=-1, optimize_above_cost=-1, and deforming=off.

Actually, I think this is an even more severe leak, even though it seems to
be infrequent.  The inlining leak is slow enough that I don't think it,
alone, would cause a crash.  It's possible there's a race condition or
something that makes this hard to notice or reproduce, or the inlining leak
is just easier to find.

I guess your patch avoids this issue, too, since it didn't OOM with your
patch.

-- 
Justin
Thread starting here:
https://www.postgresql.org/message-id/20201001021609.GC8476%40telsasoft.com

On Fri, Dec 18, 2020 at 05:56:07PM -0600, Justin Pryzby wrote:
> I'm 99% sure the "bad_alloc" is from LLVM.  It happened multiple times on
> different servers (running a similar report) after setting jit=on during
> pg13 upgrade, and never happened since re-setting jit=off.

Since this recurred a few times recently (now running pg14.0), and I finally
managed to get a non-truncated corefile...

Core was generated by `postgres: telsasoft ts 192.168.122.13(40808) SELECT '.
Program terminated with signal 6, Aborted.
#0  0x00007f2497880337 in raise () from /lib64/libc.so.6
#1  0x00007f2497881a28 in abort () from /lib64/libc.so.6
#2  0x00007f2487cbf265 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/llvm5.0/lib/libLLVM-5.0.so
#3  0x00007f2487c66696 in __cxxabiv1::__terminate(void (*)()) () from /usr/lib64/llvm5.0/lib/libLLVM-5.0.so
#4  0x00007f2487c666c3 in std::terminate() () from /usr/lib64/llvm5.0/lib/libLLVM-5.0.so
#5  0x00007f2487c687d3 in __cxa_throw () from /usr/lib64/llvm5.0/lib/libLLVM-5.0.so
#6  0x00007f2487c686cd in operator new(unsigned long) () from /usr/lib64/llvm5.0/lib/libLLVM-5.0.so
#7  0x00007f2486477b9c in llvm::DenseMap<llvm::APInt, std::unique_ptr<llvm::ConstantInt, std::default_delete<llvm::ConstantInt> >, llvm::DenseMapAPIntKeyInfo, llvm::detail::DenseMapPair<llvm::APInt, std::unique_ptr<llvm::ConstantInt, std::default_delete<llvm::ConstantInt> > > >::grow(unsigned int) () from /usr/lib64/llvm5.0/lib/libLLVM-5.0.so
#8  0x00007f2486477f29 in llvm::ConstantInt::get(llvm::LLVMContext&, llvm::APInt const&) () from /usr/lib64/llvm5.0/lib/libLLVM-5.0.so
#9  0x00007f2486478263 in llvm::ConstantInt::get(llvm::IntegerType*, unsigned long, bool) () from /usr/lib64/llvm5.0/lib/libLLVM-5.0.so
#10 0x00007f2488f66c18 in l_ptr_const (type=0x3000650, ptr=<optimized out>) at ../../../../src/include/jit/llvmjit_emit.h:29
#11 llvm_compile_expr (state=<optimized out>) at llvmjit_expr.c:246
#12 0x0000000000635499 in ExecReadyExpr (state=state@entry=0x1a5317178) at execExpr.c:885
#13 0x00000000006388a5 in ExecInitQual (qual=0x1a6d07de8, parent=parent@entry=0x1a4c4ade0) at execExpr.c:280
#14 0x0000000000661d9d in ExecInitHashJoin (node=node@entry=0x1a6c935b0, estate=estate@entry=0x1a4da87b8, eflags=eflags@entry=16) at nodeHashjoin.c:733
#15 0x0000000000648f6e in ExecInitNode (node=<optimized out>, estate=<optimized out>, eflags=<optimized out>) at execProcnode.c:307
#16 0x000000000065e8fd in ExecInitHash (node=<optimized out>, estate=<optimized out>, eflags=<optimized out>) at nodeHash.c:381
#17 0x0000000000648e87 in ExecInitNode (node=<optimized out>, estate=<optimized out>, eflags=<optimized out>) at execProcnode.c:365
#18 0x0000000000661cd5 in ExecInitHashJoin (node=node@entry=0x1a6c93ce0, estate=estate@entry=0x1a4da87b8, eflags=eflags@entry=16) at nodeHashjoin.c:663
#19 0x0000000000648f6e in ExecInitNode (node=<optimized out>, estate=<optimized out>, eflags=<optimized out>) at execProcnode.c:307
#20 0x00000000006723e4 in ExecInitSort (node=node@entry=0x1a6c94a50, estate=estate@entry=0x1a4da87b8, eflags=eflags@entry=16) at nodeSort.c:210
#21 0x0000000000648f2f in ExecInitNode (node=<optimized out>, estate=<optimized out>, eflags=<optimized out>) at execProcnode.c:320
#22 0x0000000000655ef9 in ExecInitAgg (node=node@entry=0x1a6c95100, estate=estate@entry=0x1a4da87b8, eflags=eflags@entry=16) at nodeAgg.c:3382
#23 0x0000000000648ef0 in ExecInitNode (node=<optimized out>, estate=<optimized out>,
eflags=<optimized out>) at execProcnode.c:340 #24 0x0000000000674628 in ExecInitSubqueryScan (node=node@entry=0x1a6c94020, estate=estate@entry=0x1a4da87b8, eflags=eflags@entry=16)at nodeSubqueryscan.c:126 #25 0x0000000000649055 in ExecInitNode (node=<optimized out>, estate=<optimized out>, eflags=<optimized out>) at execProcnode.c:249 #26 0x0000000000658e61 in ExecInitAppend (node=node@entry=0x1a6c60008, estate=estate@entry=0x1a4da87b8, eflags=eflags@entry=16)at nodeAppend.c:232 #27 0x0000000000649166 in ExecInitNode (node=<optimized out>, estate=<optimized out>, eflags=<optimized out>) at execProcnode.c:181 #28 0x0000000000670e1a in ExecInitResult (node=node@entry=0x1a6c95c40, estate=estate@entry=0x1a4da87b8, eflags=eflags@entry=16)at nodeResult.c:210 #29 0x00000000006491a5 in ExecInitNode (node=<optimized out>, estate=<optimized out>, eflags=<optimized out>) at execProcnode.c:166 #30 0x0000000000658e61 in ExecInitAppend (node=node@entry=0x1a6c5d548, estate=estate@entry=0x1a4da87b8, eflags=eflags@entry=16)at nodeAppend.c:232 #31 0x0000000000649166 in ExecInitNode (node=<optimized out>, estate=<optimized out>, eflags=<optimized out>) at execProcnode.c:181 #32 0x0000000000674628 in ExecInitSubqueryScan (node=node@entry=0x1a6ca5368, estate=estate@entry=0x1a4da87b8, eflags=eflags@entry=16)at nodeSubqueryscan.c:126 #33 0x0000000000649055 in ExecInitNode (node=<optimized out>, estate=<optimized out>, eflags=<optimized out>) at execProcnode.c:249 #34 0x00000000006723e4 in ExecInitSort (node=node@entry=0x1a6ca5a28, estate=estate@entry=0x1a4da87b8, eflags=eflags@entry=16)at nodeSort.c:210 #35 0x0000000000648f2f in ExecInitNode (node=<optimized out>, estate=<optimized out>, eflags=<optimized out>) at execProcnode.c:320 #36 0x00000000006793eb in ExecInitWindowAgg (node=node@entry=0x1a6ca6ab0, estate=estate@entry=0x1a4da87b8, eflags=eflags@entry=16)at nodeWindowAgg.c:2314 #37 0x0000000000648edb in ExecInitNode (node=<optimized out>, estate=<optimized out>, eflags=<optimized out>) at execProcnode.c:345 #38 0x0000000000674628 in ExecInitSubqueryScan (node=node@entry=0x1a6ca5628, estate=estate@entry=0x1a4da87b8, eflags=eflags@entry=16)at nodeSubqueryscan.c:126 #39 0x0000000000649055 in ExecInitNode (node=<optimized out>, estate=<optimized out>, eflags=<optimized out>) at execProcnode.c:249 #40 0x0000000000642ab1 in InitPlan (eflags=16, queryDesc=0x344dc88) at execMain.c:936 #41 standard_ExecutorStart (queryDesc=0x344dc88, eflags=16) at execMain.c:263 #42 0x00007f248b45656d in pgss_ExecutorStart (queryDesc=0x344dc88, eflags=<optimized out>) at pg_stat_statements.c:965 #43 0x00007f248b03a861 in explain_ExecutorStart (queryDesc=0x344dc88, eflags=<optimized out>) at auto_explain.c:301 #44 0x00000000007b8d1b in PortalStart (portal=0x1be2168, params=0x0, eflags=0, snapshot=0x0) at pquery.c:512 -- Justin
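The SIGABRT rather than an ordinary Postgres error is expected C++ behavior here. A minimal standalone sketch of the mechanism (my illustration, not PostgreSQL or LLVM code): operator new inside libLLVM throws std::bad_alloc (frame #6), nothing in the intervening C call chain can catch a C++ exception, so the runtime calls std::terminate(), whose default handler prints the "terminate called ..." message and abort()s (frames #0..#4):

    /* bad_alloc_sketch.cpp -- build: g++ bad_alloc_sketch.cpp && ./a.out */
    #include <new>        /* operator new throws std::bad_alloc on failure */
    #include <cstdint>    /* SIZE_MAX */

    int main()
    {
        /* Request an allocation far beyond any plausible RAM: operator new
         * throws std::bad_alloc.  With no try/catch anywhere on the stack,
         * the C++ runtime calls std::terminate() -> abort(): signal 6. */
        void *p = ::operator new(SIZE_MAX / 2);
        ::operator delete(p);   /* never reached */
        return 0;
    }

Running it prints the same two lines as the logfile ("terminate called after throwing an instance of 'std::bad_alloc'" / "what(): std::bad_alloc") and then dies with SIGABRT, dumping core.

-- Justin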
Re: terminate called after throwing an instance of 'std::bad_alloc' (llvmjit)
From: Justin Pryzby
On Wed, Nov 10, 2021 at 09:56:44AM -0600, Justin Pryzby wrote:
> Thread starting here:
> https://www.postgresql.org/message-id/20201001021609.GC8476%40telsasoft.com
>
> On Fri, Dec 18, 2020 at 05:56:07PM -0600, Justin Pryzby wrote:
> > I'm 99% sure the "bad_alloc" is from LLVM. It happened multiple times on
> > different servers (running a similar report) after setting jit=on during pg13
> > upgrade, and never happened since re-setting jit=off.
>
> Since this recurred a few times recently (now running pg14.0), and I finally
> managed to get a non-truncated corefile...

I think the reason this recurred is that, since upgrading to pg14, I no longer
had your memleak patches applied. I'd forgotten about it, but before the
upgrade I was probably running a locally compiled postgres with your patches
applied.

I should've mentioned that this crash was associated with the message from the
original problem report:

|terminate called after throwing an instance of 'std::bad_alloc'
| what(): std::bad_alloc

The leak discussed on other threads seems fixed by your patches - I compiled
v14 and have been running it with no visible leaks since last week.
https://www.postgresql.org/message-id/flat/20210417021602.7dilihkdc7oblrf7@alap3.anarazel.de

As I understand it, there's still an issue with an allocation failure causing
SIGABRT rather than FATAL.

It took me several tries to get the corefile, since the process was huge due
to the leak (and abrtd wanted to truncate it, nullifying its utility).

-rw-------. 1 postgres postgres 8.4G Nov 10 08:57 /var/lib/pgsql/14/data/core.31345

I installed more debug packages to get a fuller stacktrace.

#0  0x00007f2497880337 in raise () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007f2497881a28 in abort () from /lib64/libc.so.6
No symbol table info available.
#2  0x00007f2487cbf265 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/llvm5.0/lib/libLLVM-5.0.so
No symbol table info available.
#3  0x00007f2487c66696 in __cxxabiv1::__terminate(void (*)()) () from /usr/lib64/llvm5.0/lib/libLLVM-5.0.so
No symbol table info available.
#4  0x00007f2487c666c3 in std::terminate() () from /usr/lib64/llvm5.0/lib/libLLVM-5.0.so
No symbol table info available.
#5  0x00007f2487c687d3 in __cxa_throw () from /usr/lib64/llvm5.0/lib/libLLVM-5.0.so
No symbol table info available.
#6  0x00007f2487c686cd in operator new(unsigned long) () from /usr/lib64/llvm5.0/lib/libLLVM-5.0.so
No symbol table info available.
#7  0x00007f2486477b9c in allocateBuckets (this=0x2ff7f38, this=0x2ff7f38, Num=<optimized out>) at /usr/src/debug/llvm-5.0.1.src/include/llvm/ADT/DenseMap.h:753
No locals.
#8  llvm::DenseMap<llvm::APInt, std::unique_ptr<llvm::ConstantInt, std::default_delete<llvm::ConstantInt> >, llvm::DenseMapAPIntKeyInfo, llvm::detail::DenseMapPair<llvm::APInt, std::unique_ptr<llvm::ConstantInt, std::default_delete<llvm::ConstantInt> > > >::grow (this=this@entry=0x2ff7f38, AtLeast=<optimized out>) at /usr/src/debug/llvm-5.0.1.src/include/llvm/ADT/DenseMap.h:691
        OldNumBuckets = 33554432
        OldBuckets = 0x7f23f3e42010
#9  0x00007f2486477f29 in grow (AtLeast=<optimized out>, this=0x2ff7f38) at /usr/src/debug/llvm-5.0.1.src/include/llvm/ADT/DenseMap.h:461
No locals.
#10 InsertIntoBucketImpl<llvm::APInt> (TheBucket=<optimized out>, Lookup=..., Key=..., this=0x2ff7f38) at /usr/src/debug/llvm-5.0.1.src/include/llvm/ADT/DenseMap.h:510
        NewNumEntries = <optimized out>
        EmptyKey = <optimized out>
#11 InsertIntoBucket<llvm::APInt const&> (Key=..., TheBucket=<optimized out>, this=0x2ff7f38) at /usr/src/debug/llvm-5.0.1.src/include/llvm/ADT/DenseMap.h:471
No locals.
#12 FindAndConstruct (Key=..., this=0x2ff7f38) at /usr/src/debug/llvm-5.0.1.src/include/llvm/ADT/DenseMap.h:271
        TheBucket = <optimized out>
#13 operator[] (Key=..., this=0x2ff7f38) at /usr/src/debug/llvm-5.0.1.src/include/llvm/ADT/DenseMap.h:275
No locals.
#14 llvm::ConstantInt::get (Context=..., V=...) at /usr/src/debug/llvm-5.0.1.src/lib/IR/Constants.cpp:550
        pImpl = 0x2ff7eb0
#15 0x00007f2486478263 in llvm::ConstantInt::get (Ty=0x2ff85a8, V=<optimized out>, isSigned=isSigned@entry=false) at /usr/src/debug/llvm-5.0.1.src/lib/IR/Constants.cpp:571
No locals.
#16 0x00007f248648673d in LLVMConstInt (IntTy=<optimized out>, N=<optimized out>, SignExtend=SignExtend@entry=0) at /usr/src/debug/llvm-5.0.1.src/lib/IR/Core.cpp:952
No locals.
#17 0x00007f2488f66c18 in l_ptr_const (type=0x3000650, ptr=<optimized out>) at ../../../../src/include/jit/llvmjit_emit.h:29
        c = <optimized out>
#18 llvm_compile_expr (state=<optimized out>) at llvmjit_expr.c:246
        op = 0x1a5317690
        opcode = EEOP_OUTER_VAR
        opno = 5
        parent = <optimized out>
        funcname = 0x1a53184e8 "evalexpr_4827_151"
        context = 0x1ba79b8
        b = <optimized out>
        mod = 0x1a5513d30
        eval_fn = <optimized out>
        entry = <optimized out>
        v_state = 0x1a5ce09e0
        v_econtext = 0x1a5ce0a08
        v_isnullp = 0x1a5ce0a30
        v_tmpvaluep = 0x1a5ce0aa8
        v_tmpisnullp = 0x1a5ce0b48
        starttime = {tv_sec = 10799172, tv_nsec = 781670770}
        endtime = {tv_sec = 7077194792, tv_nsec = 0}
        __func__ = "llvm_compile_expr"
[...]
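The locals make the growth mechanism visible: llvm_compile_expr() emits a constant through l_ptr_const() -> LLVMConstInt() (frames #16-#17), and LLVM interns every distinct ConstantInt in a per-LLVMContext DenseMap -- the map that is growing in frame #8, already at 33554432 buckets. Interned constants are owned by the context and released only when the context itself is destroyed; since pointer-valued constants differ from one compiled expression to the next, a long-lived context accumulates entries without bound. A toy sketch of just that LLVM behavior (my illustration via the LLVM C API, not the postgres code path):

    /* interning_sketch.cpp
     * build: g++ interning_sketch.cpp $(llvm-config --cflags --ldflags --libs core) */
    #include <llvm-c/Core.h>

    int main(void)
    {
        LLVMContextRef ctx = LLVMContextCreate();
        LLVMTypeRef    i64 = LLVMInt64TypeInContext(ctx);

        /* Every distinct value interns a new ConstantInt owned by ctx; RSS
         * climbs steadily (watch with ps -O vsize,rss) until the dispose. */
        for (unsigned long long i = 0; i < 100000000ULL; i++)
            (void) LLVMConstInt(i64, i, /* SignExtend = */ 0);

        LLVMContextDispose(ctx);   /* the only point at which they're freed */
        return 0;
    }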
Re: terminate called after throwing an instance of 'std::bad_alloc' (llvmjit)
From: Justin Pryzby
There's no leak after running for ~5 weeks.

$ ps -O lstart,vsize,rss 17930
  PID                 STARTED     VSZ    RSS S TTY        TIME COMMAND
17930 Tue Nov 30 15:35:26 2021 1019464 117424 S ?    7-04:54:03 postgres: telsasoft ts 192.168.122.13(57640) idle

Unless you suggest otherwise, I'm planning to restart the DB soon and go back
to running the PGDG rpm binaries with jit=off, rather than what I compiled and
patched locally.

On Thu, Nov 18, 2021 at 03:20:39PM -0600, Justin Pryzby wrote:
> On Wed, Nov 10, 2021 at 09:56:44AM -0600, Justin Pryzby wrote:
> > Thread starting here:
> > https://www.postgresql.org/message-id/20201001021609.GC8476%40telsasoft.com
> >
> > On Fri, Dec 18, 2020 at 05:56:07PM -0600, Justin Pryzby wrote:
> > > I'm 99% sure the "bad_alloc" is from LLVM. It happened multiple times on
> > > different servers (running a similar report) after setting jit=on during pg13
> > > upgrade, and never happened since re-setting jit=off.
> >
> > Since this recurred a few times recently (now running pg14.0), and I finally
> > managed to get a non-truncated corefile...
>
> I think the reason this recurred is that, since upgrading to pg14, I no longer
> had your memleak patches applied. I'd forgotten about it, but before the
> upgrade I was probably running a locally compiled postgres with your patches
> applied.
>
> I should've mentioned that this crash was associated with the message from the
> original problem report:
>
> |terminate called after throwing an instance of 'std::bad_alloc'
> | what(): std::bad_alloc
>
> The leak discussed on other threads seems fixed by your patches - I compiled
> v14 and have been running it with no visible leaks since last week.
> https://www.postgresql.org/message-id/flat/20210417021602.7dilihkdc7oblrf7@alap3.anarazel.de
>
> As I understand it, there's still an issue with an allocation failure causing
> SIGABRT rather than FATAL.
>
> It took me several tries to get the corefile, since the process was huge due
> to the leak (and abrtd wanted to truncate it, nullifying its utility).
>
> -rw-------. 1 postgres postgres 8.4G Nov 10 08:57 /var/lib/pgsql/14/data/core.31345
>
> I installed more debug packages to get a fuller stacktrace.
>
> [...]
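The revert itself is just one setting (a sketch, assuming it's set cluster-wide in postgresql.conf as before):

    jit = off    # back to interpreted expression evaluation

-- Justin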