Thread: gs_group_1 crashing on 13beta2/s390x
On the Debian s390x buildd, the 13beta2 build is crashing:

2020-07-15 01:19:59.149 UTC [859] LOG:  server process (PID 1415) was terminated by signal 11: Segmentation fault
2020-07-15 01:19:59.149 UTC [859] DETAIL:  Failed process was running: create table gs_group_1 as
	select g100, g10, sum(g::numeric), count(*), max(g::text)
	from gs_data_1 group by cube (g1000, g100,g10);

Full build log at
https://buildd.debian.org/status/fetch.php?pkg=postgresql-13&arch=s390x&ver=13%7Ebeta2-1&stamp=1594776007&raw=0

The failure is reproducible there:
https://buildd.debian.org/status/logs.php?pkg=postgresql-13&ver=13%7Ebeta2-1&arch=s390x

I tried a manual build on an s390x machine, but that one went through
fine, so I can't provide a backtrace at the moment.

Christoph
Re: To PostgreSQL Hackers

> On the Debian s390x buildd, the 13beta2 build is crashing:
>
> 2020-07-15 01:19:59.149 UTC [859] LOG:  server process (PID 1415) was terminated by signal 11: Segmentation fault
> 2020-07-15 01:19:59.149 UTC [859] DETAIL:  Failed process was running: create table gs_group_1 as
> 	select g100, g10, sum(g::numeric), count(*), max(g::text)
> 	from gs_data_1 group by cube (g1000, g100,g10);

I wired gdb into the build process and got this backtrace:

2020-07-15 16:03:38.310 UTC [21073] LOG:  server process (PID 21575) was terminated by signal 11: Segmentation fault
2020-07-15 16:03:38.310 UTC [21073] DETAIL:  Failed process was running: create table gs_group_1 as
	select g100, g10, sum(g::numeric), count(*), max(g::text)
	from gs_data_1 group by cube (g1000, g100,g10);

******** build/src/bin/pg_upgrade/tmp_check/data.old/core ********

[New LWP 21575]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/s390x-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: buildd regression [local] CREATE TABLE AS '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  datumCopy (typByVal=false, typLen=-1, value=0) at ./build/../src/backend/utils/adt/datum.c:142
142		if (VARATT_IS_EXTERNAL_EXPANDED(vl))
#0  datumCopy (typByVal=false, typLen=-1, value=0) at ./build/../src/backend/utils/adt/datum.c:142
        vl = 0x0
        res = <optimized out>
        res = <optimized out>
        vl = <optimized out>
        eoh = <optimized out>
        resultsize = <optimized out>
        resultptr = <optimized out>
        realSize = <optimized out>
        resultptr = <optimized out>
        realSize = <optimized out>
        resultptr = <optimized out>
#1  datumCopy (value=0, typByVal=false, typLen=-1) at ./build/../src/backend/utils/adt/datum.c:131
        res = <optimized out>
        vl = <optimized out>
        eoh = <optimized out>
        resultsize = <optimized out>
        resultptr = <optimized out>
        realSize = <optimized out>
        resultptr = <optimized out>
#2  0x000002aa04423af8 in finalize_aggregate (aggstate=aggstate@entry=0x2aa05775920, peragg=peragg@entry=0x2aa056e02f0, resultVal=0x2aa056e0208, resultIsNull=0x2aa056e022a, pergroupstate=<optimized out>, pergroupstate=<optimized out>) at ./build/../src/backend/executor/nodeAgg.c:1128
        fcinfodata = {fcinfo = {flinfo = 0x2aa056e0250, context = 0x2aa05775920, resultinfo = 0x0, fncollation = 0, isnull = false, nargs = 1, args = 0x3fff6a7b578}, fcinfo_data = "\000\000\002\252\005n\002P\000\000\002\252\005wY ", '\000' <repeats 13 times>, "\247\000\001\000\000\002\252\005t\265\250\000\000\003\377\211\341\207F\000\000\003\377\000\000\002\001\000\000\000\000\000\000\003\376\000\000\000\000\000\000\017\370\000\000\000\000\000\000\001\377\000\000\000\000\000\000\000\260\000\000\000k\000\000\000k\000\000\000\000\000\000 \000\000\000\003\377\213\016J \000\000\000p\000\000\000k\000\000\000\000\000\000\000\200\000\000\000\000\000\000\000\020", '\000' <repeats 11 times>, "w\000\000\000|\000\000\000\000\000\000\000\002\000\000\002\252\006&9\250\000\000\002\252\005wZh\000\000\002\252\005wZH\000\000\003\377\213\n_\210\000\000\000\000\000\000\000\000\000\000"...}
        fcinfo = 0x3fff6a7b558
        anynull = <optimized out>
        oldContext = <optimized out>
        i = <optimized out>
        lc = <optimized out>
        pertrans = <error reading variable pertrans (value has been optimized out)>
#3  0x000002aa04423ff4 in finalize_aggregates (aggstate=aggstate@entry=0x2aa05775920, peraggs=peraggs@entry=0x2aa056e0240, pergroup=0x2aa056c8ed8) at ./build/../src/backend/executor/nodeAgg.c:1345
        peragg = 0x2aa056e02f0
        transno = <optimized out>
        pergroupstate = 0x2aa056c8ef8
        econtext = <optimized out>
        aggvalues = 0x2aa056e01f8
        aggnulls = 0x2aa056e0228
        aggno = 2
        transno = <optimized out>
#4  0x000002aa04424f5c in agg_retrieve_direct (aggstate=0x2aa05775920) at ./build/../src/backend/executor/nodeAgg.c:2480
        econtext = 0x2aa05776080
        firstSlot = 0x2aa062639a8
        numGroupingSets = <optimized out>
        node = <optimized out>
        tmpcontext = 0x2aa05775d60
        peragg = 0x2aa056e0240
        outerslot = <optimized out>
        nextSetSize = <optimized out>
        pergroups = 0x2aa056c8ea8
        result = <optimized out>
        hasGroupingSets = <optimized out>
        currentSet = <optimized out>
        numReset = <optimized out>
        i = <optimized out>
        node = <optimized out>
        econtext = <optimized out>
        tmpcontext = <optimized out>
        peragg = <optimized out>
        pergroups = <optimized out>
        outerslot = <optimized out>
        firstSlot = <optimized out>
        result = <optimized out>
        hasGroupingSets = <optimized out>
        numGroupingSets = <optimized out>
        currentSet = <optimized out>
        nextSetSize = <optimized out>
        numReset = <optimized out>
        i = <optimized out>
#5  ExecAgg (pstate=0x2aa05775920) at ./build/../src/backend/executor/nodeAgg.c:2140
        node = 0x2aa05775920
        result = 0x0
#6  0x000002aa0441001a in ExecProcNode (node=0x2aa05775920) at ./build/../src/include/executor/executor.h:245
No locals.
#7  ExecutePlan (execute_once=<optimized out>, dest=0x2aa0565fa58, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x2aa05775920, estate=0x2aa057756f8) at ./build/../src/backend/executor/execMain.c:1646
        slot = <optimized out>
        current_tuple_count = 0
        slot = <optimized out>
        current_tuple_count = <optimized out>
#8  standard_ExecutorRun (queryDesc=0x2aa062df508, direction=<optimized out>, count=0, execute_once=<optimized out>) at ./build/../src/backend/executor/execMain.c:364
        estate = 0x2aa057756f8
        operation = CMD_SELECT
        dest = 0x2aa0565fa58
        sendTuples = <optimized out>
        oldcontext = 0x2aa0565f830
        __func__ = "standard_ExecutorRun"
#9  0x000002aa043933fa in ExecCreateTableAs (pstate=pstate@entry=0x2aa0565f948, stmt=stmt@entry=0x2aa055bea28, params=params@entry=0x0, queryEnv=queryEnv@entry=0x0, qc=0x3fff6a7ca30) at ./build/../src/backend/commands/createas.c:354
        query = <optimized out>
        into = <optimized out>
        is_matview = <optimized out>
        dest = 0x2aa0565fa58
        save_userid = 0
        save_sec_context = 0
        save_nestlevel = 0
        address = {classId = 70885274, objectId = 1023, objectSubId = -156778696}
        rewritten = <optimized out>
        plan = 0x2aa062df3f8
        queryDesc = 0x2aa062df508
        __func__ = "ExecCreateTableAs"
#10 0x000002aa0459f378 in ProcessUtilitySlow (pstate=pstate@entry=0x2aa0565f948, pstmt=pstmt@entry=0x2aa055bead8, queryString=queryString@entry=0x2aa055bcfb8 "create table gs_group_1 as\nselect g100, g10, sum(g::numeric), count(*), max(g::text)\nfrom gs_data_1 group by cube (g1000, g100,g10);", context=context@entry=PROCESS_UTILITY_TOPLEVEL, params=params@entry=0x0, queryEnv=0x0, qc=0x3fff6a7ca30, dest=<error reading variable: value has been optimized out>) at ./build/../src/backend/tcop/utility.c:1600
        _save_exception_stack = 0x3fff6a7c820
        _save_context_stack = 0x0
        _local_sigjmp_buf = {{__jmpbuf = {{__gregs = {0, 2929258264904, 2929257605848, 0, 2929257605672, 2929257598904, 4396084256648, 1, 18739560736704083, 18738779339296963}, __fpregs = {2929259797352, 2929259797352, 2929257598904, 2929258351376, 0, 4397889735684, 4397889735216, 0}}}, __mask_was_saved = 0, __saved_mask = {__val = {4397889733480, 10, 2929257605400, 0, 0, 2929257606120, 4096, 1, 0, 4396084256648, 2929242951088, 2929239505962, 4397889733120, 18736975094525723, 2929257598904, 18446744069489843728}}}}
        _do_rethrow = false
        parsetree = 0x2aa055bea28
        isTopLevel = true
        isCompleteQuery = true
        needCleanup = false
        commandCollected = false
        address = {classId = 1023, objectId = 4138189632, objectSubId = 682}
        secondaryObject = {classId = 0, objectId = 0, objectSubId = 0}
        __func__ = "ProcessUtilitySlow"
#11 0x000002aa0459dd36 in standard_ProcessUtility (pstmt=0x2aa055bead8, queryString=0x2aa055bcfb8 "create table gs_group_1 as\nselect g100, g10, sum(g::numeric), count(*), max(g::text)\nfrom gs_data_1 group by cube (g1000, g100,g10);", context=<optimized out>, params=0x0, queryEnv=<optimized out>, dest=0x2aa057d5b68, qc=0x3fff6a7ca30) at ./build/../src/backend/tcop/utility.c:1069
        parsetree = 0x2aa055bea28
        isTopLevel = <optimized out>
        isAtomicContext = <optimized out>
        pstate = 0x2aa0565f948
        readonly_flags = <optimized out>
        __func__ = "standard_ProcessUtility"
#12 0x000002aa0459e874 in ProcessUtility (pstmt=pstmt@entry=0x2aa055bead8, queryString=<optimized out>, context=context@entry=PROCESS_UTILITY_TOPLEVEL, params=<optimized out>, queryEnv=queryEnv@entry=0x0, dest=0x2aa057d5b68, qc=0x3fff6a7ca30) at ./build/../src/backend/tcop/utility.c:524
No locals.
#13 0x000002aa0459b210 in PortalRunUtility (portal=0x2aa05620008, pstmt=0x2aa055bead8, isTopLevel=<optimized out>, setHoldSnapshot=<optimized out>, dest=<optimized out>, qc=0x3fff6a7ca30) at ./build/../src/backend/tcop/pquery.c:1157
        utilityStmt = <optimized out>
        snapshot = 0x2aa05674c58
#14 0x000002aa0459bca0 in PortalRunMulti (portal=portal@entry=0x2aa05620008, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=<optimized out>, dest@entry=0x2aa057d5b68, altdest=<optimized out>, altdest@entry=0x2aa057d5b68, qc=0x3fff6a7ca30) at ./build/../src/backend/tcop/pquery.c:1303
        pstmt = <optimized out>
        stmtlist_item__state = {l = 0x2aa057d5b18, i = 0}
        active_snapshot_set = false
        stmtlist_item = 0x2aa057d5b30
#15 0x000002aa0459c9f4 in PortalRun (portal=portal@entry=0x2aa05620008, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x2aa057d5b68, altdest=0x2aa057d5b68, qc=0x3fff6a7ca30) at ./build/../src/backend/tcop/pquery.c:779
        _save_exception_stack = 0x3fff6a7ccc0
        _save_context_stack = 0x0
        _local_sigjmp_buf = {{__jmpbuf = {{__gregs = {2929258004488, 2, 2929258004488, 2929167695872, 2929257605768, 1, 4396084256648, 4397889736544, 18739560736712267, 18738779339319315}, __fpregs = {89, 0, 2929257598904, 2929258351376, 4397889735216, 4397889735684, 4397889735214, 0}}}, __mask_was_saved = 0, __saved_mask = {__val = {0, 8192, 2, 1, 1, 2929243650206, 2929258004488, 2929257605768, 2, 4396084256648, 4397889736544, 2929237235946, 4397889734816, 4397889736544, 2929239411110, 4397889734656}}}}
        _do_rethrow = <optimized out>
        result = <optimized out>
        nprocessed = <optimized out>
        saveTopTransactionResourceOwner = 0x2aa055e9bf0
        saveTopTransactionContext = 0x2aa05674b10
        saveActivePortal = 0x0
        saveResourceOwner = 0x2aa055e9bf0
        savePortalContext = 0x0
        saveMemoryContext = 0x2aa05674b10
        __func__ = "PortalRun"
#16 0x000002aa0459830a in exec_simple_query (query_string=<optimized out>) at ./build/../src/backend/tcop/postgres.c:1239
        snapshot_set = <optimized out>
        per_parsetree_context = <optimized out>
        plantree_list = <optimized out>
        parsetree = 0x2aa055bea58
        commandTag = <optimized out>
        qc = {commandTag = CMDTAG_UNKNOWN, nprocessed = 0}
        querytree_list = <optimized out>
        portal = 0x2aa05620008
        receiver = 0x2aa057d5b68
        format = 0
        parsetree_item__state = {l = 0x2aa055bea88, i = 0}
        dest = DestRemote
        oldcontext = <optimized out>
        parsetree_list = 0x2aa055bea88
        parsetree_item = 0x2aa055beaa0
        save_log_statement_stats = false
        was_logged = false
        use_implicit_block = false
        msec_str = "\000\000\002\252\000\000\000Q\000\000\002\252\005[ϸ\000\000\002\252\000\000\000\000\000\000\003\377\213\n_\210"
        __func__ = "exec_simple_query"
#17 0x000002aa0459a05e in PostgresMain (argc=<optimized out>, argv=argv@entry=0x2aa055e8190, dbname=<optimized out>, username=<optimized out>) at ./build/../src/backend/tcop/postgres.c:4315
        query_string = 0x2aa055bcfb8 "create table gs_group_1 as\nselect g100, g10, sum(g::numeric), count(*), max(g::text)\nfrom gs_data_1 group by cube (g1000, g100,g10);"
        firstchar = 81
        input_message = {data = 0x2aa055bcfb8 "create table gs_group_1 as\nselect g100, g10, sum(g::numeric), count(*), max(g::text)\nfrom gs_data_1 group by cube (g1000, g100,g10);", len = 133, maxlen = 1024, cursor = 133}
        local_sigjmp_buf = {{__jmpbuf = {{__gregs = {8388608, 64, 2929244860962, 2929257583240, 2929244863032, 2929244863024, 4396084256648, 4397889736544, 18739560736695925, 18738779339318659}, __fpregs = {4397889736544, 4397889736532, 6, 4397419976768, 2929690573168, 2930182562592, 4397889736696, 0}}}, __mask_was_saved = 1, __saved_mask = {__val = {0, 2929239505782, 2929257748144, 4397889736288, 4396084256648, 4397889736544, 2929242274994, 4397889735960, 1024, 4396084256648, 4397889736544, 2929242135450, 4, 0, 4397889736320, 4397889736160}}}}
        send_ready_for_query = false
        disable_idle_in_transaction_timeout = false
        __func__ = "PostgresMain"
#18 0x000002aa04512066 in BackendRun (port=0x2aa055e16b0, port=0x2aa055e16b0) at ./build/../src/backend/postmaster/postmaster.c:4523
        av = 0x2aa055e8190
        maxac = <optimized out>
        ac = 1
        i = 1
        av = <optimized out>
        maxac = <optimized out>
        ac = <optimized out>
        i = <optimized out>
        __func__ = "BackendRun"
        __errno_location = <optimized out>
        __errno_location = <optimized out>
        __errno_location = <optimized out>
#19 BackendStartup (port=0x2aa055e16b0) at ./build/../src/backend/postmaster/postmaster.c:4215
        bn = <optimized out>
        pid = <optimized out>
        bn = <optimized out>
        pid = <optimized out>
        __func__ = "BackendStartup"
        __errno_location = <optimized out>
        __errno_location = <optimized out>
        save_errno = <optimized out>
        __errno_location = <optimized out>
        __errno_location = <optimized out>
#20 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1727
        port = 0x2aa055e16b0
        i = <optimized out>
        rmask = {fds_bits = {32, 0 <repeats 15 times>}}
        selres = <optimized out>
        now = <optimized out>
        readmask = {fds_bits = {32, 0 <repeats 15 times>}}
        nSockets = 0
        last_lockfile_recheck_time = 1594829011
        last_touch_time = 1594829011
        __func__ = "ServerLoop"
#21 0x000002aa04513128 in PostmasterMain (argc=<optimized out>, argv=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:1400
        opt = <optimized out>
        status = <optimized out>
        userDoption = <optimized out>
        listen_addr_saved = false
        i = <optimized out>
        output_config_variable = <optimized out>
        __func__ = "PostmasterMain"
#22 0x000002aa04243fb4 in main (argc=<optimized out>, argv=0x2aa055b71b0) at ./build/../src/backend/main/main.c:210
        do_check_root = <optimized out>

Christoph
Christoph Berg <myon@debian.org> writes:
>> On the Debian s390x buildd, the 13beta2 build is crashing:

> I wired gdb into the build process and got this backtrace:

> #0  datumCopy (typByVal=false, typLen=-1, value=0) at ./build/../src/backend/utils/adt/datum.c:142
>         vl = 0x0
>         res = <optimized out>
>         res = <optimized out>
>         vl = <optimized out>
>         eoh = <optimized out>
>         resultsize = <optimized out>
>         resultptr = <optimized out>
>         realSize = <optimized out>
>         resultptr = <optimized out>
>         realSize = <optimized out>
>         resultptr = <optimized out>
> #1  datumCopy (value=0, typByVal=false, typLen=-1) at ./build/../src/backend/utils/adt/datum.c:131
>         res = <optimized out>
>         vl = <optimized out>
>         eoh = <optimized out>
>         resultsize = <optimized out>
>         resultptr = <optimized out>
>         realSize = <optimized out>
>         resultptr = <optimized out>
> #2  0x000002aa04423af8 in finalize_aggregate (aggstate=aggstate@entry=0x2aa05775920, peragg=peragg@entry=0x2aa056e02f0, resultVal=0x2aa056e0208, resultIsNull=0x2aa056e022a, pergroupstate=<optimized out>, pergroupstate=<optimized out>) at ./build/../src/backend/executor/nodeAgg.c:1128

Hmm.  If gdb isn't lying to us, that has to be coming from here:

	/*
	 * If result is pass-by-ref, make sure it is in the right context.
	 */
	if (!peragg->resulttypeByVal && !*resultIsNull &&
		!MemoryContextContains(CurrentMemoryContext,
							   DatumGetPointer(*resultVal)))
		*resultVal = datumCopy(*resultVal,
							   peragg->resulttypeByVal,
							   peragg->resulttypeLen);

The line numbers in HEAD are a bit different, but that's the only call
of datumCopy() in finalize_aggregate().

It's hardly surprising that datumCopy would segfault when given a null
"value" and told it is pass-by-reference.  However, to get to the
datumCopy call, we must have passed the MemoryContextContains check on
that very same pointer value, and that would surely have segfaulted as
well, one would think.

Given the apparently-can't-happen situation at the call site, and the
fact that we're not seeing similar failures reported elsewhere (and
note that every line shown above is at least five years old), I'm kind
of forced to the conclusion that this is a compiler bug.  Does
adjusting the -O level make it go away?

			regards, tom lane
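For reference, the crash site in datum.c looks roughly like this -- a
paraphrase of the 13-branch sources with the copying details elided,
not a verbatim excerpt:

	Datum
	datumCopy(Datum value, bool typByVal, int typLen)
	{
		Datum		res;

		if (typByVal)
			res = value;
		else if (typLen == -1)
		{
			/* It is a varlena datatype */
			struct varlena *vl = (struct varlena *) DatumGetPointer(value);

			if (VARATT_IS_EXTERNAL_EXPANDED(vl))	/* <-- line 142 */
			{
				/* flatten the expanded object into a new palloc'd copy */
			}
			else
			{
				/* plain varlena: palloc VARSIZE_ANY(vl) bytes, memcpy */
			}
		}
		else
		{
			/* fixed-length pass-by-ref type: palloc typLen bytes, memcpy */
		}
		return res;
	}

VARATT_IS_EXTERNAL_EXPANDED() examines the header bytes of the
pointed-to datum, so with value=0 the very first dereference of vl
faults -- matching the reported line 142.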
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes: Tom> It's hardly surprising that datumCopy would segfault when given a Tom> null "value" and told it is pass-by-reference. However, to get to Tom> the datumCopy call, we must have passed the MemoryContextContains Tom> check on that very same pointer value, and that would surely have Tom> segfaulted as well, one would think. Nope, because MemoryContextContains just returns "false" if passed a NULL pointer. -- Andrew (irc:RhodiumToad)
Re: Tom Lane
> Given the apparently-can't-happen situation at the call site,
> and the fact that we're not seeing similar failures reported
> elsewhere (and note that every line shown above is at least
> five years old), I'm kind of forced to the conclusion that this
> is a compiler bug.  Does adjusting the -O level make it go away?

The problem is that a manual build doesn't crash, and I'm somewhat
reluctant to do a full new package upload (which will keep buildds for
all architectures busy) just for a -O0 test unless we are sure it
helps.

I'd rather play more with the manual build artifacts (which should be
using the same compiler and everything), if anyone has ideas what I
should be trying.

Christoph
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
> Tom> It's hardly surprising that datumCopy would segfault when given a
> Tom> null "value" and told it is pass-by-reference. However, to get to
> Tom> the datumCopy call, we must have passed the MemoryContextContains
> Tom> check on that very same pointer value, and that would surely have
> Tom> segfaulted as well, one would think.

> Nope, because MemoryContextContains just returns "false" if passed a
> NULL pointer.

Ah, right.  So you could imagine getting here if the finalfn had
returned PointerGetDatum(NULL) with isnull = false.  We have some
aggregate transfns that are capable of doing that for internal-type
transvalues, I think, but the finalfn never should do it.

In any case we still have the fact that this isn't being seen in our
buildfarm; and that's not for lack of s390 machines.  So I still think
the most likely explanation is a compiler bug in bleeding-edge gcc.
Probably what Christoph should be trying to figure out is why he can't
reproduce it manually.  There must be some discrepancy between his
environment and the build machines; but what?

			regards, tom lane
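To make the hypothesized state concrete: a final function shaped like
this -- purely hypothetical, no in-core finalfn actually does it --
would produce exactly what the backtrace shows:

	Datum
	broken_finalfn(PG_FUNCTION_ARGS)
	{
		/* claims the result is valid ... */
		fcinfo->isnull = false;
		/* ... but hands back a NULL pointer as the pass-by-ref Datum */
		PG_RETURN_POINTER(NULL);
	}

finalize_aggregate() would then see *resultVal == 0 with *resultIsNull
== false, pass the MemoryContextContains() test (which just returns
false for NULL), and crash inside datumCopy().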
Re: Tom Lane
> > Tom> It's hardly surprising that datumCopy would segfault when given a
> > Tom> null "value" and told it is pass-by-reference. However, to get to
> > Tom> the datumCopy call, we must have passed the MemoryContextContains
> > Tom> check on that very same pointer value, and that would surely have
> > Tom> segfaulted as well, one would think.
>
> > Nope, because MemoryContextContains just returns "false" if passed a
> > NULL pointer.
>
> Ah, right.  So you could imagine getting here if the finalfn had returned
> PointerGetDatum(NULL) with isnull = false.  We have some aggregate
> transfns that are capable of doing that for internal-type transvalues,
> I think, but the finalfn never should do it.

So I had another stab at this. As expected, the 13.0 upload to
Debian/unstable crashed again on the buildd, while a manual
everything-should-be-the-same build succeeded. I don't know why I
didn't try this before, but this time I took this manual build and
started a PG instance from it. Pasting the gs_group_1 queries made it
segfault instantly.

So here we are:

#0  datumCopy (value=0, typLen=-1, typByVal=false) at ./build/../src/backend/utils/adt/datum.c:142
#1  0x000002aa3bf6322e in datumCopy (value=<optimized out>, typByVal=<optimized out>, typLen=<optimized out>)
    at ./build/../src/backend/utils/adt/datum.c:178
#2  0x000002aa3bda6dd6 in finalize_aggregate (aggstate=aggstate@entry=0x2aa3defbfd0, peragg=peragg@entry=0x2aa3e0671f0,
    pergroupstate=pergroupstate@entry=0x2aa3e026b78, resultVal=resultVal@entry=0x2aa3e067108, resultIsNull=0x2aa3e06712a)
    at ./build/../src/backend/executor/nodeAgg.c:1153

(gdb) p *resultVal
$3 = 0
(gdb) p *resultIsNull
$6 = false
(gdb) p *peragg
$7 = {aggref = 0x2aa3deef218, transno = 2, finalfn_oid = 0, finalfn = {fn_addr = 0x0, fn_oid = 0, fn_nargs = 0,
  fn_strict = false, fn_retset = false, fn_stats = 0 '\000', fn_extra = 0x0, fn_mcxt = 0x0, fn_expr = 0x0},
  numFinalArgs = 1, aggdirectargs = 0x0, resulttypeLen = -1, resulttypeByVal = false, shareable = true}

Since finalfn_oid is 0, resultVal/resultIsNull were set by the `else`
branch of the if (OidIsValid) in finalize_aggregate():

	else
	{
		/* Don't need MakeExpandedObjectReadOnly; datumCopy will copy it */
		*resultVal = pergroupstate->transValue;
		*resultIsNull = pergroupstate->transValueIsNull;
	}

(gdb) p *pergroupstate
$12 = {transValue = 0, transValueIsNull = false, noTransValue = false}

That comes from finalize_aggregates:

#3  0x000002aa3bda7e10 in finalize_aggregates (aggstate=aggstate@entry=0x2aa3defbfd0, peraggs=peraggs@entry=0x2aa3e067140,
    pergroup=0x2aa3e026b58) at ./build/../src/backend/executor/nodeAgg.c:1369

	/*
	 * Run the final functions.
	 */
	for (aggno = 0; aggno < aggstate->numaggs; aggno++)
	{
		AggStatePerAgg peragg = &peraggs[aggno];
		int			transno = peragg->transno;
		AggStatePerGroup pergroupstate;

		pergroupstate = &pergroup[transno];

		if (DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit))
			finalize_partialaggregate(aggstate, peragg, pergroupstate,
									  &aggvalues[aggno], &aggnulls[aggno]);
		else
			finalize_aggregate(aggstate, peragg, pergroupstate,
							   &aggvalues[aggno], &aggnulls[aggno]);
	}

... but at that point I'm lost. Maybe "aggno" and "transno" got mixed
up here?

(I'll leave the gdb session open for further suggestions.)

Christoph
I poked around with the SET in the offending tests, and the crash is
only present if `set jit_above_cost = 0;` is present. Removing that
makes it pass. Removing work_mem or enable_hashagg does not make a
difference. llvm version is 10.0.1.

Test file:

--
-- Compare results between plans using sorting and plans using hash
-- aggregation. Force spilling in both cases by setting work_mem low
-- and altering the statistics.
--

create table gs_data_1 as
select g%1000 as g1000, g%100 as g100, g%10 as g10, g
   from generate_series(0,1999) g;

analyze gs_data_1;
alter table gs_data_1 set (autovacuum_enabled = 'false');
update pg_class set reltuples = 10 where relname='gs_data_1';

SET work_mem='64kB';

-- Produce results with sorting.
set enable_hashagg = false;
set jit_above_cost = 0; -- remove this to remove crash

explain (costs off)
select g100, g10, sum(g::numeric), count(*), max(g::text)
from gs_data_1 group by cube (g1000, g100,g10);

create table gs_group_1 as
select g100, g10, sum(g::numeric), count(*), max(g::text)
from gs_data_1 group by cube (g1000, g100,g10);

Christoph
Re: To Tom Lane
> I poked around with the SET in the offending tests, and the crash is
> only present if `set jit_above_cost = 0;` is present. Removing that
> makes it pass. Removing work_mem or enable_hashagg does not make a
> difference. llvm version is 10.0.1.

I put jit_above_cost=0 into postgresql.conf and ran "make installcheck"
again. There are more crashes:

From src/test/regress/sql/interval.sql:

2020-09-25 17:00:25.130 CEST [8135] LOG:  server process (PID 8369) was terminated by signal 11: Segmentation fault
2020-09-25 17:00:25.130 CEST [8135] DETAIL:  Failed process was running: select avg(f1) from interval_tbl;

From src/test/regress/sql/tid.sql:

2020-09-25 17:01:20.593 CEST [8135] LOG:  server process (PID 9015) was terminated by signal 11: Segmentation fault
2020-09-25 17:01:20.593 CEST [8135] DETAIL:  Failed process was running: SELECT max(ctid) FROM tid_tab;

From src/test/regress/sql/collate*.sql:

2020-09-25 17:07:17.852 CEST [8135] LOG:  server process (PID 12232) was terminated by signal 11: Segmentation fault
2020-09-25 17:07:17.852 CEST [8135] DETAIL:  Failed process was running: SELECT min(b), max(b) FROM collate_test1;

From src/test/regress/sql/plpgsql.sql:

2020-09-25 17:11:56.495 CEST [8135] LOG:  server process (PID 15562) was terminated by signal 11: Segmentation fault
2020-09-25 17:11:56.495 CEST [8135] DETAIL:  Failed process was running: select * from returnqueryf();

Contrary to the others, this one is not related to aggregates:

-- test RETURN QUERY with dropped columns
create table tabwithcols(a int, b int, c int, d int);
insert into tabwithcols values(10,20,30,40),(50,60,70,80);

create or replace function returnqueryf()
returns setof tabwithcols as $$
begin
  return query select * from tabwithcols;
  return query execute 'select * from tabwithcols';
end;
$$ language plpgsql;

select * from returnqueryf();

alter table tabwithcols drop column b;
select * from returnqueryf();

alter table tabwithcols drop column d;
select * from returnqueryf();

alter table tabwithcols add column d int;
select * from returnqueryf();

drop function returnqueryf();
drop table tabwithcols;

From src/test/regress/sql/rangefuncs.sql:

2020-09-25 17:16:04.209 CEST [8135] LOG:  server process (PID 17372) was terminated by signal 11: Segmentation fault
2020-09-25 17:16:04.209 CEST [8135] DETAIL:  Failed process was running: select * from usersview;

From src/test/regress/sql/alter_table.sql:

2020-09-25 17:21:36.187 CEST [8135] LOG:  server process (PID 19217) was terminated by signal 11: Segmentation fault
2020-09-25 17:21:36.187 CEST [8135] DETAIL:  Failed process was running: update atacc3 set test2 = 4 where test2 is null;

From src/test/regress/sql/polymorphism.sql:

2020-09-25 17:23:55.509 CEST [8135] LOG:  server process (PID 21010) was terminated by signal 11: Segmentation fault
2020-09-25 17:23:55.509 CEST [8135] DETAIL:  Failed process was running: select myleast(1.1, 0.22, 0.55);

2020-09-25 17:28:26.222 CEST [8135] LOG:  server process (PID 22325) was terminated by signal 11: Segmentation fault
2020-09-25 17:28:26.222 CEST [8135] DETAIL:  Failed process was running: select f.longname from fullname f;

(stopping here)

There are also a lot of these log lines (without prefix):

ORC error: No callback manager available for s390x-ibm-linux-gnu

Is that worrying? I'm not sure, but I think I've seen these on other
architectures as well.
I guess that suggests two things:

* jit is not ready for prime time on s390x and I should disable it
* jit is not exercised enough by "make installcheck"

Christoph
Hi,

On 2020-09-25 17:29:07 +0200, Christoph Berg wrote:
> I guess that suggests two things:
> * jit is not ready for prime time on s390x and I should disable it

I don't know how good LLVM's support for s390x JITing is, and given
that it's unrealistic for people to get access to s390x...

> * jit is not exercised enough by "make installcheck"

So far we've exercised it more widely by setting up machines that use
it for all queries (by setting the config option). I'm doubtful it's
worth doing differently.

Greetings,

Andres Freund
On 25 September 2020 18:42:04 CEST, Andres Freund <andres@anarazel.de> wrote:
>> * jit is not exercised enough by "make installcheck"

> So far we've exercised it more widely by setting up machines that use
> it for all queries (by setting the config option). I'm doubtful it's
> worth doing differently.

Ok, but given that Debian is currently targeting 22 architectures, I
doubt the PostgreSQL buildfarm covers all of them with the extra JIT
option, so I should probably make sure to do that here when running
tests.
On Fri, Sep 25, 2020 at 11:30 AM Christoph Berg <myon@debian.org> wrote:
> Re: Tom Lane
> > > Tom> It's hardly surprising that datumCopy would segfault when given a
> > > Tom> null "value" and told it is pass-by-reference. However, to get to
> > > Tom> the datumCopy call, we must have passed the MemoryContextContains
> > > Tom> check on that very same pointer value, and that would surely have
> > > Tom> segfaulted as well, one would think.
> >
> > > Nope, because MemoryContextContains just returns "false" if passed a
> > > NULL pointer.
> >
> > Ah, right. So you could imagine getting here if the finalfn had returned
> > PointerGetDatum(NULL) with isnull = false. We have some aggregate
> > transfns that are capable of doing that for internal-type transvalues,
> > I think, but the finalfn never should do it.
>
> So I had another stab at this. As expected, the 13.0 upload to
> Debian/unstable crashed again on the buildd, while a manual
> everything-should-be-the-same build succeeded. I don't know why I
> didn't try this before, but this time I took this manual build and
> started a PG instance from it. Pasting the gs_group_1 queries made it
> segfault instantly.
>
> So here we are:
>
> #0  datumCopy (value=0, typLen=-1, typByVal=false) at ./build/../src/backend/utils/adt/datum.c:142
> #1  0x000002aa3bf6322e in datumCopy (value=<optimized out>, typByVal=<optimized out>, typLen=<optimized out>)
>     at ./build/../src/backend/utils/adt/datum.c:178
> #2  0x000002aa3bda6dd6 in finalize_aggregate (aggstate=aggstate@entry=0x2aa3defbfd0, peragg=peragg@entry=0x2aa3e0671f0,
>     pergroupstate=pergroupstate@entry=0x2aa3e026b78, resultVal=resultVal@entry=0x2aa3e067108, resultIsNull=0x2aa3e06712a)
>     at ./build/../src/backend/executor/nodeAgg.c:1153
>
> (gdb) p *resultVal
> $3 = 0
> (gdb) p *resultIsNull
> $6 = false
> (gdb) p *peragg
> $7 = {aggref = 0x2aa3deef218, transno = 2, finalfn_oid = 0, finalfn = {fn_addr = 0x0, fn_oid = 0, fn_nargs = 0, fn_strict = false,
>   fn_retset = false, fn_stats = 0 '\000', fn_extra = 0x0, fn_mcxt = 0x0, fn_expr = 0x0}, numFinalArgs = 1, aggdirectargs = 0x0,
>   resulttypeLen = -1, resulttypeByVal = false, shareable = true}
>
> Since finalfn_oid is 0, resultVal/resultIsNull were set by the `else`
> branch of the if (OidIsValid) in finalize_aggregate():
>
> 	else
> 	{
> 		/* Don't need MakeExpandedObjectReadOnly; datumCopy will copy it */
> 		*resultVal = pergroupstate->transValue;
> 		*resultIsNull = pergroupstate->transValueIsNull;
> 	}
>
> (gdb) p *pergroupstate
> $12 = {transValue = 0, transValueIsNull = false, noTransValue = false}
Shouldn't transValueIsNull be "true" here? Then datumCopy would be
protected by this test: "!*resultIsNull".
Ranier Vilela
On Fri, Sep 25, 2020 at 2:36 PM Ranier Vilela <ranier.vf@gmail.com> wrote:
> > (gdb) p *pergroupstate
> > $12 = {transValue = 0, transValueIsNull = false, noTransValue = false}
>
> Shouldn't transValueIsNull be "true" here? Then datumCopy would be
> protected by this test: "!*resultIsNull".
Observe this excerpt (line 1129):

		/* don't call a strict function with NULL inputs */
		*resultVal = (Datum) 0;
		*resultIsNull = true;

It follows the principle: when the value is null, the null flag is
filled with true (1), not 0 (false).

Line 4711, in ExecReScanAgg, does the same:

	MemSet(econtext->ecxt_aggvalues, 0, sizeof(Datum) * node->numaggs);
	MemSet(econtext->ecxt_aggnulls, true, sizeof(bool) * node->numaggs);

The zero here means "false" -- a zeroed aggvalues entry does not by
itself mean the value is null; it is the aggnulls flag set to true that
says so.

regards,
Ranier Vilela
Christoph Berg <cb@df7cb.de> writes:
> Ok, but given that Debian is currently targeting 22 architectures, I
> doubt the PostgreSQL buildfarm covers all of them with the extra JIT
> option, so I should probably make sure to do that here when running
> tests.

+1.  I rather doubt our farm is running this type of test on anything
but x86_64.  Of course, we can't actually *fix* any LLVM bugs, but
it'd be nice to know whether they're there.

			regards, tom lane
Hi,

On 2020-09-25 19:05:52 +0200, Christoph Berg wrote:
> On 25 September 2020 18:42:04 CEST, Andres Freund <andres@anarazel.de> wrote:
> >> * jit is not exercised enough by "make installcheck"
> >
> > So far we've exercised it more widely by setting up machines that use
> > it for all queries (by setting the config option). I'm doubtful it's
> > worth doing differently.
>
> Ok, but given that Debian is currently targeting 22 architectures, I
> doubt the PostgreSQL buildfarm covers all of them with the extra JIT
> option, so I should probably make sure to do that here when running
> tests.

Forcing JIT for a lot of queries that are otherwise really fast
unfortunately has a significant time cost. Doing that on slow
architectures might be prohibitively slow.

Kinda wonder if we shouldn't just restrict JIT to a few architectures
that we have a bit more regular access to (x86, arm, maybe also ppc?).
It's not like anybody would run large analytics queries on mips.

Greetings,

Andres Freund
Hi,

On 2020-09-25 14:11:46 -0400, Tom Lane wrote:
> Christoph Berg <cb@df7cb.de> writes:
> > Ok, but given that Debian is currently targeting 22 architectures, I
> > doubt the PostgreSQL buildfarm covers all of them with the extra JIT
> > option, so I should probably make sure to do that here when running
> > tests.
>
> +1.  I rather doubt our farm is running this type of test on anything
> but x86_64.

There's quite a few arm animals and at least one mips animal that do
some minimal coverage of JITing (i.e. the queries that are actually
somewhat expensive). I pinged two owners asking whether one of the arm
animals could be changed to force JITing.

Greetings,

Andres Freund
Re: Andres Freund
> > > Ok, but given that Debian is currently targeting 22 architectures, I
> > > doubt the PostgreSQL buildfarm covers all of them with the extra JIT
> > > option, so I should probably make sure to do that here when running
> > > tests.
> >
> > +1.  I rather doubt our farm is running this type of test on anything
> > but x86_64.
>
> There's quite a few arm animals and at least one mips animal that do
> some minimal coverage of JITing (i.e. the queries that are actually
> somewhat expensive). I pinged two owners asking whether one of the arm
> animals could be changed to force JITing.

I pushed a change that should enable LLVM-10-JIT-testing everywhere [*]
and (admittedly to my surprise) all other architectures passed just
fine:

https://buildd.debian.org/status/logs.php?pkg=postgresql-13&ver=13.0-2

For the record, the architectures with llvm disabled are these:

clang-10 [!alpha !hppa !hurd-i386 !ia64 !kfreebsd-amd64 !kfreebsd-i386 !m68k !powerpc !riscv64 !s390x !sh4 !sparc64 !x32],

After the tests I realized that LLVM 11 is also already packaged, but
s390x still segfaults with that version.

Christoph

[*] apparently pgbench --temp-config=/no/such/file is not an error,
which makes verifying this change a bit harder
Hi,

On 2020-09-28 14:22:01 +0200, Christoph Berg wrote:
> Re: Andres Freund
> > > > Ok, but given that Debian is currently targeting 22 architectures, I
> > > > doubt the PostgreSQL buildfarm covers all of them with the extra JIT
> > > > option, so I should probably make sure to do that here when running
> > > > tests.
> > >
> > > +1.  I rather doubt our farm is running this type of test on anything
> > > but x86_64.
> >
> > There's quite a few arm animals and at least one mips animal that do
> > some minimal coverage of JITing (i.e. the queries that are actually
> > somewhat expensive). I pinged two owners asking whether one of the arm
> > animals could be changed to force JITing.
>
> I pushed a change that should enable LLVM-10-JIT-testing everywhere [*]
> and (admittedly to my surprise) all other architectures passed just
> fine:
>
> https://buildd.debian.org/status/logs.php?pkg=postgresql-13&ver=13.0-2

Thanks!

> For the record, the architectures with llvm disabled are these:
>
> clang-10 [!alpha !hppa !hurd-i386 !ia64 !kfreebsd-amd64 !kfreebsd-i386 !m68k !powerpc !riscv64 !s390x !sh4 !sparc64 !x32],

!powerpc doesn't exclude ppc64, I assume?

> After the tests I realized that LLVM 11 is also already packaged, but
> s390x still segfaults with that version.
>
> Christoph
>
> [*] apparently pgbench --temp-config=/no/such/file is not an error,
> which makes verifying this change a bit harder

pgbench? I assume you mean pg_regress?

FWIW, an easy way to enable JIT for just about all tests, including tap
tests, is to set
  PGOPTIONS='-c jit=1 -c jit_above_cost=0 ...'
in the environment before starting the tests.

Can a non-debian dev get access to an s390x? It'd be nice to isolate
this enough to report a bug to LLVM - and that's probably a lot easier
for me than you... My guess would be that some relocation processing
or such is off.

Greetings,

Andres Freund
Re: Andres Freund
> > clang-10 [!alpha !hppa !hurd-i386 !ia64 !kfreebsd-amd64 !kfreebsd-i386 !m68k !powerpc !riscv64 !s390x !sh4 !sparc64 !x32],
>
> !powerpc doesn't exclude ppc64, I assume?

That's direct matches only, there's no architecture-family logic in
there.

> > [*] apparently pgbench --temp-config=/no/such/file is not an error,
> > which makes verifying this change a bit harder
>
> pgbench? I assume you mean pg_regress?

Err yes of course.

> FWIW, an easy way to enable JIT for just about all tests, including tap
> tests, is to set
>   PGOPTIONS='-c jit=1 -c jit_above_cost=0 ...'
> in the environment before starting the tests.

Ok, that might simplify the setup a bit.

> Can a non-debian dev get access to an s390x? It'd be nice to isolate
> this enough to report a bug to LLVM - and that's probably a lot easier
> for me than you... My guess would be that some relocation processing
> or such is off.

You already had an account there in the past I think. I'll see if I
can get that reactivated. Thanks for the offer!

Christoph
Hi,

Christoph helped me to get access to an s390x machine - I wasn't able
to reproduce exactly the error he hit. Initially all tests passed, but
after recompiling with build flags more similar to Christoph's I was
able to hit another instance of what I assume to be the same bug.

I am fairly sure that I see the problem. Before a post-LLVM-10 change,
the "runtime linker" for JITed code only asserted that relocations
that need to be performed are of a known type. Since the debian build
- correctly - uses a release version of LLVM, this results in
unhandled relocations basically being resolved to 0.

I suspect that building with LDFLAGS="-Wl,-z,relro -Wl,-z,now" - which
is what I think the debian package does - creates the types of
relocations that LLVM doesn't handle for elf + s390.

10 release branch:

void RuntimeDyldELF::resolveSystemZRelocation(const SectionEntry &Section,
                                              uint64_t Offset, uint64_t Value,
                                              uint32_t Type, int64_t Addend) {
  uint8_t *LocalAddress = Section.getAddressWithOffset(Offset);
  switch (Type) {
  default:
    llvm_unreachable("Relocation type not implemented yet!");
    break;

11/master:

void RuntimeDyldELF::resolveSystemZRelocation(const SectionEntry &Section,
                                              uint64_t Offset, uint64_t Value,
                                              uint32_t Type, int64_t Addend) {
  uint8_t *LocalAddress = Section.getAddressWithOffset(Offset);
  switch (Type) {
  default:
    report_fatal_error("Relocation type not implemented yet!");
    break;

Verifying that that's the case by rebuilding against 11 (and then an
LLVM debug build, which will take a day or two).

Greetings,

Andres Freund
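The distinction matters because of how the two calls behave in a
release build. Roughly - this is a simplification of LLVM's error
handling macros, not a verbatim copy:

  // With assertions enabled, llvm_unreachable() aborts with a message.
  // In a release (NDEBUG) build it typically expands to an
  // "unreachable" hint, so the unhandled default: case silently falls
  // out of the switch and the relocation is simply never applied.
  #ifndef NDEBUG
  #define llvm_unreachable(msg) \
    ::llvm::llvm_unreachable_internal(msg, __FILE__, __LINE__)
  #else
  #define llvm_unreachable(msg) __builtin_unreachable()
  #endif

  // report_fatal_error(), in contrast, aborts in every build type,
  // which is why 11/master fails loudly instead of mis-linking.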
Hi,

On 2020-10-14 14:58:35 -0700, Andres Freund wrote:
> I suspect that building with LDFLAGS="-Wl,-z,relro -Wl,-z,now" - which
> is what I think the debian package does - creates the types of
> relocations that LLVM doesn't handle for elf + s390.
>
> 10 release branch:
>
> void RuntimeDyldELF::resolveSystemZRelocation(const SectionEntry &Section,
>                                               uint64_t Offset, uint64_t Value,
>                                               uint32_t Type, int64_t Addend) {
>   uint8_t *LocalAddress = Section.getAddressWithOffset(Offset);
>   switch (Type) {
>   default:
>     llvm_unreachable("Relocation type not implemented yet!");
>     break;
>
> 11/master:
>
> void RuntimeDyldELF::resolveSystemZRelocation(const SectionEntry &Section,
>                                               uint64_t Offset, uint64_t Value,
>                                               uint32_t Type, int64_t Addend) {
>   uint8_t *LocalAddress = Section.getAddressWithOffset(Offset);
>   switch (Type) {
>   default:
>     report_fatal_error("Relocation type not implemented yet!");
>     break;
>
> Verifying that that's the case by rebuilding against 11 (and then an
> LLVM debug build, which will take a day or two).

Oh dear. It's not as simple as that. The issue indeed is relocations,
but we don't hit those errors. The issue rather is that the
SystemZ-specific relative-redirection code thought that the only
relative symbols are functions. So it creates a stub function to
redirect them. Which turns out to not work well with variables like
CurrentMemoryContext...

Example debug output:

This is a SystemZ indirect relocation. Create a new stub function
RelType: 20 Addend: 2 TargetName: ExecAggInitGroup SectionID: 0 Offset: 624
This is a SystemZ indirect relocation. Create a new stub function
RelType: 26 Addend: 2 TargetName: CurrentMemoryContext SectionID: 0 Offset: 712

Opening a bug report...

Greetings,

Andres Freund
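A rough picture of why stubbing a data symbol goes wrong (hypothetical
C for illustration; "stub" and "load_through_stub" are invented names,
and this is not PostgreSQL or LLVM code):

  #include <stdint.h>

  extern void stub(void);       /* what the runtime linker wired the
                                 * symbol to: a few branch instructions,
                                 * fine for redirecting a *call* */

  void *
  load_through_stub(void)
  {
      /*
       * The JITed code expects the relocation to yield the address of a
       * global variable (e.g. CurrentMemoryContext) and dereferences it
       * to get the variable's current value.  With a function stub
       * substituted, this instead reads the stub's instruction bytes --
       * not a valid pointer at all.
       */
      return *(void **) (uintptr_t) &stub;
  }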
Hi,

On 2020-10-14 17:56:16 -0700, Andres Freund wrote:
> Oh dear. It's not as simple as that. The issue indeed is relocations,
> but we don't hit those errors. The issue rather is that the
> SystemZ-specific relative-redirection code thought that the only
> relative symbols are functions. So it creates a stub function to
> redirect them. Which turns out to not work well with variables like
> CurrentMemoryContext...

That might be a problem - but the main problem causing the crash at
hand is likely something else.

The prototypes we create for ExecAggTransReparent() were missing the
'zeroext' attribute for the 'isnull' parameters, because the code for
copying the attributes from llvmjit_types.bc didn't go deep enough
(i.e. I didn't quite grok the pretty weird API).

On s390x that led to the newValueIsNull argument of
ExecAggTransReparent() having a 0 lower byte, but set higher bytes -
which then *sometimes* fooled the if (!newValueIsNull) check, which
assumed that the higher bits were unset.

I have a fix for this, but I've just stared at s390 assembly code for
~10h, never having done so before. So that'll have to wait for
tomorrow.

It's quite possible that that fix would also help on other
architectures...

Greetings,

Andres Freund
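A sketch of that failure mode in plain C (illustrative only -- the real
caller is LLVM-generated IR, and "callee" is an invented name):

  #include <stdbool.h>

  /*
   * On s390x a bool argument travels in a 64-bit register, and the ABI
   * expects the *caller* to zero-extend it; LLVM expresses that
   * contract with the 'zeroext' parameter attribute.  A callee built
   * with that assumption may test the whole register:
   */
  void
  callee(bool newValueIsNull)
  {
      if (!newValueIsNull)      /* may compare the full register to 0 */
      {
          /* taken only if all 64 bits are zero */
      }
  }

  /*
   * If the JITed caller's prototype lacks 'zeroext', the register's low
   * byte can be 0 (false) while its upper bytes hold leftover garbage,
   * so the callee *sometimes* takes the wrong branch -- exactly the
   * intermittent misbehavior described above.
   */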
Hi,

On 2020-10-15 01:32:46 -0700, Andres Freund wrote:
> I have a fix for this, but I've just stared at s390 assembly code for
> ~10h, never having done so before. So that'll have to wait for tomorrow.
>
> It's quite possible that that fix would also help on other
> architectures...

Pushed now to 11-master.

Author: Andres Freund <andres@anarazel.de>
Branch: master [72559438f] 2020-10-15 14:29:53 -0700
Branch: REL_13_STABLE [ae3e75aba] 2020-10-15 14:30:40 -0700
Branch: REL_12_STABLE [c8a2bb0f1] 2020-10-15 14:31:32 -0700
Branch: REL_11_STABLE [f3dee5b9a] 2020-10-15 15:06:16 -0700

    llvmjit: Also copy parameter / return value attributes from template functions.

    Previously we only copied the function attributes. That caused problems
    at least on s390x: Because we didn't copy the 'zeroext' attribute for
    ExecAggTransReparent()'s *IsNull parameters, expressions invoking it
    didn't ensure that the upper bytes of the registers were zeroed. In
    the - relatively rare - cases where not, ExecAggTransReparent() wrongly
    ended up in the newValueIsNull branch due to the register not being
    zero. Subsequently causing a crash.

    It's quite possible that this would cause problems on other platforms,
    and in other places than just ExecAggTransReparent() on s390x.

    Thanks to Christoph (and the Debian project) for providing me with
    access to a s390x machine, allowing me to debug this.

    Reported-By: Christoph Berg
    Author: Andres Freund
    Discussion: https://postgr.es/m/20201015083246.kie5726xerdt3ael@alap3.anarazel.de
    Backpatch: 11-, where JIT was added

I had a successful check-world run with maximum jittery on s390x. But I
did hit the issue in different places than you did, so it'd be cool if
you could re-enable JIT for s390x - I think you have a package tracking
HEAD?

Thanks again Christoph!

Greetings,

Andres Freund
Hi,

On 2020-10-15 15:29:24 -0700, Andres Freund wrote:
> Pushed now to 11-master.

Ugh - there's a failure with an old LLVM version (3.9):
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2020-10-15%2022%3A24%3A04

Need to rebuild that locally to reproduce.

Greetings,

Andres Freund
Hi,

On 2020-10-15 15:37:01 -0700, Andres Freund wrote:
> On 2020-10-15 15:29:24 -0700, Andres Freund wrote:
> > Pushed now to 11-master.
>
> Ugh - there's a failure with an old LLVM version (3.9):
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2020-10-15%2022%3A24%3A04
>
> Need to rebuild that locally to reproduce.

It's a bug that was fixed in LLVM 4, but too late to be backported to
3.9.

The easiest fix seems to be to just use a wrapper function that does
the necessary pre-checks. Something like the below (in
llvmjit_wrap.cpp).

Since the wrapper still needs to call into
LLVMGetAttributeCountAtIndex(), it seems easier to just use the
separate function name, rather than #define'ing
LLVMGetAttributeCountAtIndex() to the PG version?

/*
 * Like LLVM's LLVMGetAttributeCountAtIndex(), works around a bug in LLVM 3.9.
 *
 * In LLVM <= 3.9, LLVMGetAttributeCountAtIndex() segfaults if there are no
 * attributes at an index (fixed in LLVM commit ce9bb1097dc2).
 */
unsigned
LLVMGetAttributeCountAtIndexPG(LLVMValueRef F, uint32 Idx)
{
	/*
	 * This check is more expensive, so only do it when using a problematic
	 * LLVM version.
	 */
#if LLVM_VERSION_MAJOR < 4
	if (!llvm::unwrap<llvm::Function>(F)->getAttributes().hasAttributes(Idx))
		return 0;
#endif

	/*
	 * There is no nicer public API to determine the count, so just always
	 * fall back to LLVM's C API.
	 */
	return LLVMGetAttributeCountAtIndex(F, Idx);
}

Greetings,

Andres Freund
Hi,

On 2020-10-15 17:12:54 -0700, Andres Freund wrote:
> On 2020-10-15 15:37:01 -0700, Andres Freund wrote:
> > It's a bug that was fixed in LLVM 4, but too late to be backported to
> > 3.9.
> >
> > The easiest fix seems to be to just use a wrapper function that does
> > the necessary pre-checks. Something like the below (in
> > llvmjit_wrap.cpp).
> >
> > Since the wrapper still needs to call into
> > LLVMGetAttributeCountAtIndex(), it seems easier to just use the
> > separate function name, rather than #define'ing
> > LLVMGetAttributeCountAtIndex() to the PG version?
> > [...]

Seems to have calmed the buildfarm, without negative consequences so
far.

Greetings,

Andres Freund
Re: Andres Freund
> I had a successful check-world run with maximum jittery on s390x. But I
> did hit the issue in different places than you did, so it'd be cool if
> you could re-enable JIT for s390x - I think you have a package tracking
> HEAD?

Cool, thanks!

I'm tracking PG14 head with apt.postgresql.org, but that doesn't have
s390x. I'll pull the patches for PG13, re-enable JIT on some more
architectures, and use the opportunity to bump the LLVM version used
to 11.

Christoph