JIT causes core dump during error recovery - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | JIT causes core dump during error recovery |
Date | |
Msg-id | 1565654.1719425368@sss.pgh.pa.us |
Responses | Re: JIT causes core dump during error recovery |
List | pgsql-hackers |
I initially ran into this while trying to reproduce the recent reports of trouble with LLVM 14 on ARM. However, it also reproduces with LLVM 17 on x86_64, and I see no reason to think it's at all arch-specific. I also reproduced it in back branches (only tried v14, but it's definitely not new in HEAD).

To reproduce:

1. Build with --with-llvm

2. Create a config file containing

    $ cat $HOME/tmp/temp_config
    # enable jit at max
    jit_above_cost = 1
    jit_inline_above_cost = 1
    jit_optimize_above_cost = 1

and do

    export TEMP_CONFIG=$HOME/tmp/temp_config

3. cd to .../src/pl/plpgsql/src/, and do "make check".

It gets a SIGSEGV in plpgsql_transaction.sql's cursor_fail_during_commit test. The stack trace looks like

    (gdb) bt
    #0  __strlen_evex () at ../sysdeps/x86_64/multiarch/strlen-evex.S:77
    #1  0x0000000000735c58 in pq_sendstring (buf=0x7ffd80f8eeb0, str=0x7f77cffdf000 <error: Cannot access memory at address 0x7f77cffdf000>) at pqformat.c:197
    #2  0x00000000009ca09c in err_sendstring (buf=0x7ffd80f8eeb0, str=0x7f77cffdf000 <error: Cannot access memory at address 0x7f77cffdf000>) at elog.c:3449
    #3  0x00000000009ca4ba in send_message_to_frontend (edata=0xf786a0 <errordata>) at elog.c:3568
    #4  0x00000000009c73a3 in EmitErrorReport () at elog.c:1715
    #5  0x00000000008987e7 in PostgresMain (dbname=<optimized out>, username=0x29fdb00 "postgres") at postgres.c:4378
    #6  0x0000000000893c5d in BackendMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at backend_startup.c:105

The errordata structure it's trying to print out contains

    (gdb) p *edata
    $1 = {elevel = 21, output_to_server = true, output_to_client = true,
      hide_stmt = false, hide_ctx = false,
      filename = 0x7f77cffdf000 <error: Cannot access memory at address 0x7f77cffdf000>,
      lineno = 843,
      funcname = 0x7f77cffdf033 <error: Cannot access memory at address 0x7f77cffdf033>,
      domain = 0xbd3baa "postgres-17",
      context_domain = 0x7f77c3343320 "plpgsql-17", sqlerrcode = 33816706,
      message = 0x29fdc20 "division by zero", detail = 0x0,
      detail_log = 0x0, hint = 0x0,
      context = 0x29fdc50 "PL/pgSQL function cursor_fail_during_commit() line 6 at COMMIT",
      backtrace = 0x0,
      message_id = 0x7f77cffdf022 <error: Cannot access memory at address 0x7f77cffdf022>,
      schema_name = 0x0, table_name = 0x0, column_name = 0x0,
      datatype_name = 0x0, constraint_name = 0x0, cursorpos = 0,
      internalpos = 0, internalquery = 0x0, saved_errno = 2,
      assoc_context = 0x29fdb20}

lineno = 843 matches the expected error location in int4_div(). The three string fields containing obviously-garbage pointers are ones that elog.c expects to point at compile-time constants, so it just records the caller's pointers without strdup'ing them.

Perhaps somebody else will know better, but what I think is happening here is

A. Thanks to the low jit cost settings, we choose to jit-compile the "1/(x-1000)" expression inside cursor_fail_during_commit().

B. When x reaches 1000, the division-by-zero error that the test intends to provoke is thrown from jit-compiled code.

C. Somewhere between there and EmitErrorReport(), something decided it could unmap the jit-compiled code.

D. Now filename/funcname are pointing into the void, and send_message_to_frontend() dumps core while trying to send them.

One way to fix this could be to pstrdup those strings even though they should be constants. I don't especially like the amount of overhead that'd add though.

What I think is the right solution is to fix things so that seemingly-no-longer-used jit compilations are not thrown away until transaction cleanup. I don't know the JIT code nearly well enough to take point on fixing it like that, though.

regards, tom lane
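The second fix proposed above (not throwing away jit compilations until transaction cleanup) can be sketched in isolation. This is only a toy model, not PostgreSQL or LLVM code: JitModule, JitContext, ErrorReport, and all of the functions below are hypothetical names invented for illustration, and heap allocations stand in for mapped jit code. The idea it shows is that a "release" request only queues the module, so pointers borrowed by an error report stay valid until cleanup runs.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a chunk of jit-compiled code.  Its strings play
 * the role of the file/function name constants embedded in that code. */
typedef struct JitModule
{
    char       *filename;
    char       *funcname;
    struct JitModule *next;     /* link in the deferred-release list */
} JitModule;

/* Hypothetical per-transaction state: modules waiting to be released. */
typedef struct JitContext
{
    JitModule  *deferred;
} JitContext;

/* Like the errordata struct above, this only borrows the caller's pointers. */
typedef struct ErrorReport
{
    const char *filename;
    const char *funcname;
    int         lineno;
} ErrorReport;

/* Allocate a private copy of a string (stands in for code-embedded data). */
char *
copy_string(const char *s)
{
    size_t      len = strlen(s) + 1;
    char       *copy = malloc(len);

    assert(copy != NULL);
    memcpy(copy, s, len);
    return copy;
}

JitModule *
jit_module_create(const char *filename, const char *funcname)
{
    JitModule  *mod = malloc(sizeof(JitModule));

    assert(mod != NULL);
    mod->filename = copy_string(filename);
    mod->funcname = copy_string(funcname);
    mod->next = NULL;
    return mod;
}

/* Called when a module looks unused: queue it instead of freeing it now. */
void
jit_module_release(JitContext *ctx, JitModule *mod)
{
    mod->next = ctx->deferred;
    ctx->deferred = mod;
}

/* Called at transaction cleanup, after any error report has been sent.
 * Frees every queued module and returns how many were freed. */
int
jit_transaction_cleanup(JitContext *ctx)
{
    int         freed = 0;

    while (ctx->deferred != NULL)
    {
        JitModule  *mod = ctx->deferred;

        ctx->deferred = mod->next;
        free(mod->filename);
        free(mod->funcname);
        free(mod);
        freed++;
    }
    return freed;
}
```

In this model, an ErrorReport filled in from a module's filename and funcname can still be read after jit_module_release(), because nothing is freed until jit_transaction_cleanup() runs. The alternative fix (copying the strings at error-recording time) would instead pay an allocation per error, which is the overhead the message above is wary of.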