Thread: JIT causes core dump during error recovery
I initially ran into this while trying to reproduce the recent reports of trouble with LLVM 14 on ARM. However, it also reproduces with LLVM 17 on x86_64, and I see no reason to think it's at all arch-specific. I also reproduced it in back branches (only tried v14, but it's definitely not new in HEAD). To reproduce: 1. Build with --with-llvm 2. Create a config file containing $ cat $HOME/tmp/temp_config # enable jit at max jit_above_cost = 1 jit_inline_above_cost = 1 jit_optimize_above_cost = 1 and do export TEMP_CONFIG=$HOME/tmp/temp_config 3. cd to .../src/pl/plpgsql/src/, and do "make check". It gets a SIGSEGV in plpgsql_transaction.sql's cursor_fail_during_commit test. The stack trace looks like (gdb) bt #0 __strlen_evex () at ../sysdeps/x86_64/multiarch/strlen-evex.S:77 #1 0x0000000000735c58 in pq_sendstring (buf=0x7ffd80f8eeb0, str=0x7f77cffdf000 <error: Cannot access memory at address 0x7f77cffdf000>) at pqformat.c:197 #2 0x00000000009ca09c in err_sendstring (buf=0x7ffd80f8eeb0, str=0x7f77cffdf000 <error: Cannot access memory at address 0x7f77cffdf000>) at elog.c:3449 #3 0x00000000009ca4ba in send_message_to_frontend (edata=0xf786a0 <errordata>) at elog.c:3568 #4 0x00000000009c73a3 in EmitErrorReport () at elog.c:1715 #5 0x00000000008987e7 in PostgresMain (dbname=<optimized out>, username=0x29fdb00 "postgres") at postgres.c:4378 #6 0x0000000000893c5d in BackendMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at backend_startup.c:105 The errordata structure it's trying to print out contains (gdb) p *edata $1 = {elevel = 21, output_to_server = true, output_to_client = true, hide_stmt = false, hide_ctx = false, filename = 0x7f77cffdf000 <error: Cannot access memory at address 0x7f77cffdf000>, lineno = 843, funcname = 0x7f77cffdf033 <error: Cannot access memory at address 0x7f77cffdf033>, domain = 0xbd3baa "postgres-17", context_domain = 0x7f77c3343320 "plpgsql-17", sqlerrcode = 33816706, message = 0x29fdc20 "division by zero", detail = 0x0, detail_log = 0x0, hint = 0x0, context = 0x29fdc50 "PL/pgSQL function cursor_fail_during_commit() line 6 at COMMIT", backtrace = 0x0, message_id = 0x7f77cffdf022 <error: Cannot access memory at address 0x7f77cffdf022>, schema_name = 0x0, table_name = 0x0,column_name = 0x0, datatype_name = 0x0, constraint_name = 0x0, cursorpos = 0, internalpos = 0, internalquery = 0x0, saved_errno = 2, assoc_context = 0x29fdb20} lineno = 843 matches the expected error location in int4_div(). The three string fields containing obviously-garbage pointers are ones that elog.c expects to point at compile-time constants, so it just records the caller's pointers without strdup'ing them. Perhaps somebody else will know better, but what I think is happening here is A. Thanks to the low jit cost settings, we choose to jit-compile the "1/(x-1000)" expression inside cursor_fail_during_commit(). B. When x reaches 1000, the division-by-zero error that the test intends to provoke is thrown from jit-compiled code. C. Somewhere between there and EmitErrorReport(), something decided it could unmap the jit-compiled code. D. Now filename/funcname are pointing into the void, and send_message_to_frontend() dumps core while trying to send them. One way to fix this could be to pstrdup those strings even though they should be constants. I don't especially like the amount of overhead that'd add though. What I think is the right solution is to fix things so that seemingly-no-longer-used jit compilations are not thrown away until transaction cleanup. I don't know the JIT code nearly well enough to take point on fixing it like that, though. regards, tom lane
Em qua., 26 de jun. de 2024 às 15:09, Tom Lane <tgl@sss.pgh.pa.us> escreveu:
I initially ran into this while trying to reproduce the recent
reports of trouble with LLVM 14 on ARM. However, it also reproduces
with LLVM 17 on x86_64, and I see no reason to think it's at all
arch-specific. I also reproduced it in back branches (only tried
v14, but it's definitely not new in HEAD).
To reproduce:
1. Build with --with-llvm
2. Create a config file containing
$ cat $HOME/tmp/temp_config
# enable jit at max
jit_above_cost = 1
jit_inline_above_cost = 1
jit_optimize_above_cost = 1
and do
export TEMP_CONFIG=$HOME/tmp/temp_config
3. cd to .../src/pl/plpgsql/src/, and do "make check".
It gets a SIGSEGV in plpgsql_transaction.sql's
cursor_fail_during_commit test. The stack trace looks like
(gdb) bt
#0 __strlen_evex () at ../sysdeps/x86_64/multiarch/strlen-evex.S:77
#1 0x0000000000735c58 in pq_sendstring (buf=0x7ffd80f8eeb0,
str=0x7f77cffdf000 <error: Cannot access memory at address 0x7f77cffdf000>)
at pqformat.c:197
#2 0x00000000009ca09c in err_sendstring (buf=0x7ffd80f8eeb0,
str=0x7f77cffdf000 <error: Cannot access memory at address 0x7f77cffdf000>)
at elog.c:3449
#3 0x00000000009ca4ba in send_message_to_frontend (edata=0xf786a0 <errordata>)
at elog.c:3568
#4 0x00000000009c73a3 in EmitErrorReport () at elog.c:1715
#5 0x00000000008987e7 in PostgresMain (dbname=<optimized out>,
username=0x29fdb00 "postgres") at postgres.c:4378
#6 0x0000000000893c5d in BackendMain (startup_data=<optimized out>,
startup_data_len=<optimized out>) at backend_startup.c:105
The errordata structure it's trying to print out contains
(gdb) p *edata
$1 = {elevel = 21, output_to_server = true, output_to_client = true,
hide_stmt = false, hide_ctx = false,
filename = 0x7f77cffdf000 <error: Cannot access memory at address 0x7f77cffdf000>, lineno = 843,
funcname = 0x7f77cffdf033 <error: Cannot access memory at address 0x7f77cffdf033>, domain = 0xbd3baa "postgres-17",
context_domain = 0x7f77c3343320 "plpgsql-17", sqlerrcode = 33816706,
message = 0x29fdc20 "division by zero", detail = 0x0, detail_log = 0x0,
hint = 0x0,
context = 0x29fdc50 "PL/pgSQL function cursor_fail_during_commit() line 6 at COMMIT", backtrace = 0x0,
message_id = 0x7f77cffdf022 <error: Cannot access memory at address 0x7f77cffdf022>, schema_name = 0x0, table_name = 0x0, column_name = 0x0,
datatype_name = 0x0, constraint_name = 0x0, cursorpos = 0, internalpos = 0,
internalquery = 0x0, saved_errno = 2, assoc_context = 0x29fdb20}
lineno = 843 matches the expected error location in int4_div().
Did you mean *int4div*, right?
Since there is no reference to int4_div in *.c or *.h
best regards,
Ranier Vilela
Ranier Vilela <ranier.vf@gmail.com> writes: > Em qua., 26 de jun. de 2024 às 15:09, Tom Lane <tgl@sss.pgh.pa.us> escreveu: >> lineno = 843 matches the expected error location in int4_div(). > Did you mean *int4div*, right? Right, typo. regards, tom lane
I wrote: > It gets a SIGSEGV in plpgsql_transaction.sql's > cursor_fail_during_commit test. Here's a simpler way to reproduce: just run the attached script in a --with-llvm build. (This is merely extracting the troublesome regression case for convenience.) Interesting, if you take out any one of the three "set" commands, it doesn't crash. This probably explains why, for example, buildfarm member urutu hasn't shown this --- it's only reducing one of the three costs to zero. I don't have any idea what to make of that result, except that it suggests the problem might be at least partly LLVM's fault. Surely, if we are prematurely unmapping a compiled code segment, that behavior wouldn't depend on whether we had asked for inlining? regards, tom lane drop table if exists test1; create table test1 (x int); CREATE OR REPLACE PROCEDURE cursor_fail_during_commit() LANGUAGE plpgsql AS $$ DECLARE id int; BEGIN FOR id IN SELECT 1/(x-1000) FROM generate_series(1,1000) x LOOP INSERT INTO test1 VALUES(id); COMMIT; END LOOP; END; $$; set jit_above_cost=0; set jit_optimize_above_cost=0; set jit_inline_above_cost=0; CALL cursor_fail_during_commit();
I wrote: > What I think is the right solution is to fix things so that > seemingly-no-longer-used jit compilations are not thrown away > until transaction cleanup. I don't know the JIT code nearly > well enough to take point on fixing it like that, though. Or maybe not. I found by bisecting that it doesn't fail before 2e517818f (Fix SPI's handling of errors during transaction commit). A salient part of that commit message: Having made that API redefinition, we can fix this mess by having SPI_commit[_and_chain] trap errors and start a new, clean transaction before re-throwing the error. Likewise for SPI_rollback[_and_chain]. So delaying removal of the jit-created code segment until transaction cleanup wouldn't be enough to prevent this crash, if I'm reading things right. The extra-pstrdup solution may be the only viable one. I could use confirmation from someone who knows the JIT code about when jit-created code is unloaded. It also remains very unclear why there is no crash if we don't force both jit_optimize_above_cost and jit_inline_above_cost to small values. regards, tom lane
I wrote: > So delaying removal of the jit-created code segment until transaction > cleanup wouldn't be enough to prevent this crash, if I'm reading > things right. The extra-pstrdup solution may be the only viable one. > I could use confirmation from someone who knows the JIT code about > when jit-created code is unloaded. It also remains very unclear > why there is no crash if we don't force both jit_optimize_above_cost > and jit_inline_above_cost to small values. I found where the unload happens: ResOwnerReleaseJitContext, which is executed during the resource owner BEFORE_LOCKS phase. (Which seems like a pretty dubious choice from here; why can't we leave it till the less time-critical phase after we've let go of locks?) But anyway, we definitely do drop this stuff during xact cleanup. Also, it seems that the reason that both jit_optimize_above_cost and jit_inline_above_cost must be small is that otherwise int4div is simply called from the JIT-generated code, not inlined into it. This gives me very considerable fear about how well that behavior has been tested: if throwing an error from inlined code doesn't work, and we hadn't noticed that, how much can it really have been exercised? I have also got an itchy feeling that we have code that will be broken by this behavior of "if it happens to get inlined then string constants aren't so constant". In any case, I found that adding some copying logic to CopyErrorData() is enough to solve this problem, since the SPI infrastructure applies that before executing xact cleanup. I had feared that we'd have to add copying to every single elog/ereport sequence, which would have been an annoying amount of overhead; but the attached seems acceptable. We do get through check-world with this patch and the JIT parameters all set to small values. regards, tom lane diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c index d91a85cb2d..d1d1632bdd 100644 --- a/src/backend/utils/error/elog.c +++ b/src/backend/utils/error/elog.c @@ -1743,7 +1743,21 @@ CopyErrorData(void) newedata = (ErrorData *) palloc(sizeof(ErrorData)); memcpy(newedata, edata, sizeof(ErrorData)); - /* Make copies of separately-allocated fields */ + /* + * Make copies of separately-allocated strings. Note that we copy even + * theoretically-constant strings such as filename. This is because those + * could point into JIT-created code segments that might get unloaded at + * transaction cleanup. In some cases we need the copied ErrorData to + * survive transaction boundaries, so we'd better copy those strings too. + */ + if (newedata->filename) + newedata->filename = pstrdup(newedata->filename); + if (newedata->funcname) + newedata->funcname = pstrdup(newedata->funcname); + if (newedata->domain) + newedata->domain = pstrdup(newedata->domain); + if (newedata->context_domain) + newedata->context_domain = pstrdup(newedata->context_domain); if (newedata->message) newedata->message = pstrdup(newedata->message); if (newedata->detail) @@ -1756,6 +1770,8 @@ CopyErrorData(void) newedata->context = pstrdup(newedata->context); if (newedata->backtrace) newedata->backtrace = pstrdup(newedata->backtrace); + if (newedata->message_id) + newedata->message_id = pstrdup(newedata->message_id); if (newedata->schema_name) newedata->schema_name = pstrdup(newedata->schema_name); if (newedata->table_name)
Em qui., 27 de jun. de 2024 às 13:18, Tom Lane <tgl@sss.pgh.pa.us> escreveu:
I wrote:
> So delaying removal of the jit-created code segment until transaction
> cleanup wouldn't be enough to prevent this crash, if I'm reading
> things right. The extra-pstrdup solution may be the only viable one.
> I could use confirmation from someone who knows the JIT code about
> when jit-created code is unloaded. It also remains very unclear
> why there is no crash if we don't force both jit_optimize_above_cost
> and jit_inline_above_cost to small values.
I found where the unload happens: ResOwnerReleaseJitContext, which
is executed during the resource owner BEFORE_LOCKS phase. (Which
seems like a pretty dubious choice from here; why can't we leave
it till the less time-critical phase after we've let go of locks?)
But anyway, we definitely do drop this stuff during xact cleanup.
Also, it seems that the reason that both jit_optimize_above_cost
and jit_inline_above_cost must be small is that otherwise int4div
is simply called from the JIT-generated code, not inlined into it.
This gives me very considerable fear about how well that behavior
has been tested: if throwing an error from inlined code doesn't
work, and we hadn't noticed that, how much can it really have been
exercised? I have also got an itchy feeling that we have code that
will be broken by this behavior of "if it happens to get inlined
then string constants aren't so constant".
In any case, I found that adding some copying logic to CopyErrorData()
is enough to solve this problem, since the SPI infrastructure applies
that before executing xact cleanup.
In this case, I think that these fields, in struct definition struct ErrorData (src/include/utils/elog.h)
should be changed too?
from const char * to char*
best regards,
Ranier Vilela
Ranier Vilela <ranier.vf@gmail.com> writes: > Em qui., 27 de jun. de 2024 às 13:18, Tom Lane <tgl@sss.pgh.pa.us> escreveu: >> In any case, I found that adding some copying logic to CopyErrorData() >> is enough to solve this problem, since the SPI infrastructure applies >> that before executing xact cleanup. > In this case, I think that these fields, in struct definition struct > ErrorData (src/include/utils/elog.h) > should be changed too? > from const char * to char* No, that would imply casting away const in errstart() etc. We're still mostly expecting those things to be pointers to constant strings. I'm about half tempted to file this as an LLVM bug. When it inlines a function, it should still reference the same string constants that the original code did, otherwise it's failing to be a transparent conversion. But they'll probably cite some standards-ese that claims this is undefined behavior: const char * foo(void) { return "foo"; } void bar(void) { Assert( foo() == foo() ); } on which I call BS, but it's probably in there somewhere. regards, tom lane