Thread: gs_group_1 crashing on 13beta2/s390x

gs_group_1 crashing on 13beta2/s390x

From
Christoph Berg
Date:
On the Debian s390x buildd, the 13beta2 build is crashing:

2020-07-15 01:19:59.149 UTC [859] LOG:  server process (PID 1415) was terminated by signal 11: Segmentation fault
2020-07-15 01:19:59.149 UTC [859] DETAIL:  Failed process was running: create table gs_group_1 as
    select g100, g10, sum(g::numeric), count(*), max(g::text)
    from gs_data_1 group by cube (g1000, g100,g10);

Full build log at
https://buildd.debian.org/status/fetch.php?pkg=postgresql-13&arch=s390x&ver=13%7Ebeta2-1&stamp=1594776007&raw=0

The failure is reproducible there:
https://buildd.debian.org/status/logs.php?pkg=postgresql-13&ver=13%7Ebeta2-1&arch=s390x

I tried a manual build on a s390x machine, but that one went through
fine, so I can't provide a backtrace at the moment.

Christoph



Re: gs_group_1 crashing on 13beta2/s390x

From
Christoph Berg
Date:
Re: To PostgreSQL Hackers
> On the Debian s390x buildd, the 13beta2 build is crashing:
> 
> 2020-07-15 01:19:59.149 UTC [859] LOG:  server process (PID 1415) was terminated by signal 11: Segmentation fault
> 2020-07-15 01:19:59.149 UTC [859] DETAIL:  Failed process was running: create table gs_group_1 as
>     select g100, g10, sum(g::numeric), count(*), max(g::text)
>     from gs_data_1 group by cube (g1000, g100,g10);

I wired gdb into the build process and got this backtrace:

2020-07-15 16:03:38.310 UTC [21073] LOG:  server process (PID 21575) was terminated by signal 11: Segmentation fault
2020-07-15 16:03:38.310 UTC [21073] DETAIL:  Failed process was running: create table gs_group_1 as
    select g100, g10, sum(g::numeric), count(*), max(g::text)
    from gs_data_1 group by cube (g1000, g100,g10);

******** build/src/bin/pg_upgrade/tmp_check/data.old/core ********
[New LWP 21575]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/s390x-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: buildd regression [local] CREATE TABLE AS                           '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  datumCopy (typByVal=false, typLen=-1, value=0) at ./build/../src/backend/utils/adt/datum.c:142
142            if (VARATT_IS_EXTERNAL_EXPANDED(vl))
#0  datumCopy (typByVal=false, typLen=-1, value=0) at ./build/../src/backend/utils/adt/datum.c:142
        vl = 0x0
        res = <optimized out>
        res = <optimized out>
        vl = <optimized out>
        eoh = <optimized out>
        resultsize = <optimized out>
        resultptr = <optimized out>
        realSize = <optimized out>
        resultptr = <optimized out>
        realSize = <optimized out>
        resultptr = <optimized out>
#1  datumCopy (value=0, typByVal=false, typLen=-1) at ./build/../src/backend/utils/adt/datum.c:131
        res = <optimized out>
        vl = <optimized out>
        eoh = <optimized out>
        resultsize = <optimized out>
        resultptr = <optimized out>
        realSize = <optimized out>
        resultptr = <optimized out>
#2  0x000002aa04423af8 in finalize_aggregate (aggstate=aggstate@entry=0x2aa05775920, peragg=peragg@entry=0x2aa056e02f0,
resultVal=0x2aa056e0208, resultIsNull=0x2aa056e022a, pergroupstate=<optimized out>, pergroupstate=<optimized out>) at
./build/../src/backend/executor/nodeAgg.c:1128
        fcinfodata = {fcinfo = {flinfo = 0x2aa056e0250, context = 0x2aa05775920, resultinfo = 0x0, fncollation = 0,
isnull = false, nargs = 1, args = 0x3fff6a7b578}, fcinfo_data = "\000\000\002\252\005n\002P\000\000\002\252\005wY ",
'\000' <repeats 13 times>,
"\247\000\001\000\000\002\252\005t\265\250\000\000\003\377\211\341\207F\000\000\003\377\000\000\002\001\000\000\000\000\000\000\003\376\000\000\000\000\000\000\017\370\000\000\000\000\000\000\001\377\000\000\000\000\000\000\000\260\000\000\000k\000\000\000k\000\000\000\000\000\000
\000\000\000\003\377\213\016J
\000\000\000p\000\000\000k\000\000\000\000\000\000\000\200\000\000\000\000\000\000\000\020", '\000' <repeats 11 times>,
"w\000\000\000|\000\000\000\000\000\000\000\002\000\000\002\252\006&9\250\000\000\002\252\005wZh\000\000\002\252\005wZH\000\000\003\377\213\n_\210\000\000\000\000\000\000\000\000\000\000"...}
        fcinfo = 0x3fff6a7b558
        anynull = <optimized out>
        oldContext = <optimized out>
        i = <optimized out>
        lc = <optimized out>
        pertrans = <error reading variable pertrans (value has been optimized out)>
#3  0x000002aa04423ff4 in finalize_aggregates (aggstate=aggstate@entry=0x2aa05775920,
peraggs=peraggs@entry=0x2aa056e0240, pergroup=0x2aa056c8ed8) at ./build/../src/backend/executor/nodeAgg.c:1345
        peragg = 0x2aa056e02f0
        transno = <optimized out>
        pergroupstate = 0x2aa056c8ef8
        econtext = <optimized out>
        aggvalues = 0x2aa056e01f8
        aggnulls = 0x2aa056e0228
        aggno = 2
        transno = <optimized out>
#4  0x000002aa04424f5c in agg_retrieve_direct (aggstate=0x2aa05775920) at
./build/../src/backend/executor/nodeAgg.c:2480
        econtext = 0x2aa05776080
        firstSlot = 0x2aa062639a8
        numGroupingSets = <optimized out>
        node = <optimized out>
        tmpcontext = 0x2aa05775d60
        peragg = 0x2aa056e0240
        outerslot = <optimized out>
        nextSetSize = <optimized out>
        pergroups = 0x2aa056c8ea8
        result = <optimized out>
        hasGroupingSets = <optimized out>
        currentSet = <optimized out>
        numReset = <optimized out>
        i = <optimized out>
        node = <optimized out>
        econtext = <optimized out>
        tmpcontext = <optimized out>
        peragg = <optimized out>
        pergroups = <optimized out>
        outerslot = <optimized out>
        firstSlot = <optimized out>
        result = <optimized out>
        hasGroupingSets = <optimized out>
        numGroupingSets = <optimized out>
        currentSet = <optimized out>
        nextSetSize = <optimized out>
        numReset = <optimized out>
        i = <optimized out>
#5  ExecAgg (pstate=0x2aa05775920) at ./build/../src/backend/executor/nodeAgg.c:2140
        node = 0x2aa05775920
        result = 0x0
#6  0x000002aa0441001a in ExecProcNode (node=0x2aa05775920) at ./build/../src/include/executor/executor.h:245
No locals.
#7  ExecutePlan (execute_once=<optimized out>, dest=0x2aa0565fa58, direction=<optimized out>, numberTuples=0,
sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x2aa05775920,
estate=0x2aa057756f8) at ./build/../src/backend/executor/execMain.c:1646
        slot = <optimized out>
        current_tuple_count = 0
        slot = <optimized out>
        current_tuple_count = <optimized out>
#8  standard_ExecutorRun (queryDesc=0x2aa062df508, direction=<optimized out>, count=0, execute_once=<optimized out>) at
./build/../src/backend/executor/execMain.c:364
        estate = 0x2aa057756f8
        operation = CMD_SELECT
        dest = 0x2aa0565fa58
        sendTuples = <optimized out>
        oldcontext = 0x2aa0565f830
        __func__ = "standard_ExecutorRun"
#9  0x000002aa043933fa in ExecCreateTableAs (pstate=pstate@entry=0x2aa0565f948, stmt=stmt@entry=0x2aa055bea28,
params=params@entry=0x0, queryEnv=queryEnv@entry=0x0, qc=0x3fff6a7ca30) at
./build/../src/backend/commands/createas.c:354
        query = <optimized out>
        into = <optimized out>
        is_matview = <optimized out>
        dest = 0x2aa0565fa58
        save_userid = 0
        save_sec_context = 0
        save_nestlevel = 0
        address = {classId = 70885274, objectId = 1023, objectSubId = -156778696}
        rewritten = <optimized out>
        plan = 0x2aa062df3f8
        queryDesc = 0x2aa062df508
        __func__ = "ExecCreateTableAs"
#10 0x000002aa0459f378 in ProcessUtilitySlow (pstate=pstate@entry=0x2aa0565f948, pstmt=pstmt@entry=0x2aa055bead8,
queryString=queryString@entry=0x2aa055bcfb8 "create table gs_group_1 as\nselect g100, g10, sum(g::numeric), count(*),
max(g::text)\nfrom gs_data_1 group by cube (g1000, g100,g10);", context=context@entry=PROCESS_UTILITY_TOPLEVEL,
params=params@entry=0x0, queryEnv=0x0, qc=0x3fff6a7ca30, dest=<error reading variable: value has been optimized out>) at
./build/../src/backend/tcop/utility.c:1600
        _save_exception_stack = 0x3fff6a7c820
        _save_context_stack = 0x0
        _local_sigjmp_buf = {{__jmpbuf = {{__gregs = {0, 2929258264904, 2929257605848, 0, 2929257605672, 2929257598904,
4396084256648, 1, 18739560736704083, 18738779339296963}, __fpregs = {2929259797352, 2929259797352, 2929257598904,
2929258351376, 0, 4397889735684, 4397889735216, 0}}}, __mask_was_saved = 0, __saved_mask = {__val = {4397889733480, 10,
2929257605400, 0, 0, 2929257606120, 4096, 1, 0, 4396084256648, 2929242951088, 2929239505962, 4397889733120,
18736975094525723, 2929257598904, 18446744069489843728}}}}
        _do_rethrow = false
        parsetree = 0x2aa055bea28
        isTopLevel = true
        isCompleteQuery = true
        needCleanup = false
        commandCollected = false
        address = {classId = 1023, objectId = 4138189632, objectSubId = 682}
        secondaryObject = {classId = 0, objectId = 0, objectSubId = 0}
        __func__ = "ProcessUtilitySlow"
#11 0x000002aa0459dd36 in standard_ProcessUtility (pstmt=0x2aa055bead8, queryString=0x2aa055bcfb8 "create table
gs_group_1 as\nselect g100, g10, sum(g::numeric), count(*), max(g::text)\nfrom gs_data_1 group by cube (g1000,
g100,g10);", context=<optimized out>, params=0x0, queryEnv=<optimized out>, dest=0x2aa057d5b68, qc=0x3fff6a7ca30) at
./build/../src/backend/tcop/utility.c:1069
        parsetree = 0x2aa055bea28
        isTopLevel = <optimized out>
        isAtomicContext = <optimized out>
        pstate = 0x2aa0565f948
        readonly_flags = <optimized out>
        __func__ = "standard_ProcessUtility"
#12 0x000002aa0459e874 in ProcessUtility (pstmt=pstmt@entry=0x2aa055bead8, queryString=<optimized out>,
context=context@entry=PROCESS_UTILITY_TOPLEVEL, params=<optimized out>, queryEnv=queryEnv@entry=0x0, dest=0x2aa057d5b68,
qc=0x3fff6a7ca30) at ./build/../src/backend/tcop/utility.c:524
No locals.
#13 0x000002aa0459b210 in PortalRunUtility (portal=0x2aa05620008, pstmt=0x2aa055bead8, isTopLevel=<optimized out>,
setHoldSnapshot=<optimized out>, dest=<optimized out>, qc=0x3fff6a7ca30) at ./build/../src/backend/tcop/pquery.c:1157
        utilityStmt = <optimized out>
        snapshot = 0x2aa05674c58
#14 0x000002aa0459bca0 in PortalRunMulti (portal=portal@entry=0x2aa05620008, isTopLevel=isTopLevel@entry=true,
setHoldSnapshot=setHoldSnapshot@entry=false, dest=<optimized out>, dest@entry=0x2aa057d5b68, altdest=<optimized out>,
altdest@entry=0x2aa057d5b68, qc=0x3fff6a7ca30) at ./build/../src/backend/tcop/pquery.c:1303
        pstmt = <optimized out>
        stmtlist_item__state = {l = 0x2aa057d5b18, i = 0}
        active_snapshot_set = false
        stmtlist_item = 0x2aa057d5b30
#15 0x000002aa0459c9f4 in PortalRun (portal=portal@entry=0x2aa05620008, count=count@entry=9223372036854775807,
isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x2aa057d5b68, altdest=0x2aa057d5b68,
qc=0x3fff6a7ca30) at ./build/../src/backend/tcop/pquery.c:779
        _save_exception_stack = 0x3fff6a7ccc0
        _save_context_stack = 0x0
        _local_sigjmp_buf = {{__jmpbuf = {{__gregs = {2929258004488, 2, 2929258004488, 2929167695872, 2929257605768, 1,
4396084256648, 4397889736544, 18739560736712267, 18738779339319315}, __fpregs = {89, 0, 2929257598904, 2929258351376,
4397889735216, 4397889735684, 4397889735214, 0}}}, __mask_was_saved = 0, __saved_mask = {__val = {0, 8192, 2, 1, 1,
2929243650206, 2929258004488, 2929257605768, 2, 4396084256648, 4397889736544, 2929237235946, 4397889734816,
4397889736544, 2929239411110, 4397889734656}}}}
        _do_rethrow = <optimized out>
        result = <optimized out>
        nprocessed = <optimized out>
        saveTopTransactionResourceOwner = 0x2aa055e9bf0
        saveTopTransactionContext = 0x2aa05674b10
        saveActivePortal = 0x0
        saveResourceOwner = 0x2aa055e9bf0
        savePortalContext = 0x0
        saveMemoryContext = 0x2aa05674b10
        __func__ = "PortalRun"
#16 0x000002aa0459830a in exec_simple_query (query_string=<optimized out>) at
./build/../src/backend/tcop/postgres.c:1239
        snapshot_set = <optimized out>
        per_parsetree_context = <optimized out>
        plantree_list = <optimized out>
        parsetree = 0x2aa055bea58
        commandTag = <optimized out>
        qc = {commandTag = CMDTAG_UNKNOWN, nprocessed = 0}
        querytree_list = <optimized out>
        portal = 0x2aa05620008
        receiver = 0x2aa057d5b68
        format = 0
        parsetree_item__state = {l = 0x2aa055bea88, i = 0}
        dest = DestRemote
        oldcontext = <optimized out>
        parsetree_list = 0x2aa055bea88
        parsetree_item = 0x2aa055beaa0
        save_log_statement_stats = false
        was_logged = false
        use_implicit_block = false
        msec_str =
"\000\000\002\252\000\000\000Q\000\000\002\252\005[ϸ\000\000\002\252\000\000\000\000\000\000\003\377\213\n_\210"
        __func__ = "exec_simple_query"
#17 0x000002aa0459a05e in PostgresMain (argc=<optimized out>, argv=argv@entry=0x2aa055e8190, dbname=<optimized out>,
username=<optimized out>) at ./build/../src/backend/tcop/postgres.c:4315
        query_string = 0x2aa055bcfb8 "create table gs_group_1 as\nselect g100, g10, sum(g::numeric), count(*),
max(g::text)\nfrom gs_data_1 group by cube (g1000, g100,g10);"
        firstchar = 81
        input_message = {data = 0x2aa055bcfb8 "create table gs_group_1 as\nselect g100, g10, sum(g::numeric), count(*),
max(g::text)\nfrom gs_data_1 group by cube (g1000, g100,g10);", len = 133, maxlen = 1024, cursor = 133}
        local_sigjmp_buf = {{__jmpbuf = {{__gregs = {8388608, 64, 2929244860962, 2929257583240, 2929244863032,
2929244863024, 4396084256648, 4397889736544, 18739560736695925, 18738779339318659}, __fpregs = {4397889736544,
4397889736532, 6, 4397419976768, 2929690573168, 2930182562592, 4397889736696, 0}}}, __mask_was_saved = 1, __saved_mask =
{__val = {0, 2929239505782, 2929257748144, 4397889736288, 4396084256648, 4397889736544, 2929242274994, 4397889735960,
1024, 4396084256648, 4397889736544, 2929242135450, 4, 0, 4397889736320, 4397889736160}}}}
        send_ready_for_query = false
        disable_idle_in_transaction_timeout = false
        __func__ = "PostgresMain"
#18 0x000002aa04512066 in BackendRun (port=0x2aa055e16b0, port=0x2aa055e16b0) at
./build/../src/backend/postmaster/postmaster.c:4523
        av = 0x2aa055e8190
        maxac = <optimized out>
        ac = 1
        i = 1
        av = <optimized out>
        maxac = <optimized out>
        ac = <optimized out>
        i = <optimized out>
        __func__ = "BackendRun"
        __errno_location = <optimized out>
        __errno_location = <optimized out>
        __errno_location = <optimized out>
#19 BackendStartup (port=0x2aa055e16b0) at ./build/../src/backend/postmaster/postmaster.c:4215
        bn = <optimized out>
        pid = <optimized out>
        bn = <optimized out>
        pid = <optimized out>
        __func__ = "BackendStartup"
        __errno_location = <optimized out>
        __errno_location = <optimized out>
        save_errno = <optimized out>
        __errno_location = <optimized out>
        __errno_location = <optimized out>
#20 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1727
        port = 0x2aa055e16b0
        i = <optimized out>
        rmask = {fds_bits = {32, 0 <repeats 15 times>}}
        selres = <optimized out>
        now = <optimized out>
        readmask = {fds_bits = {32, 0 <repeats 15 times>}}
        nSockets = 0
        last_lockfile_recheck_time = 1594829011
        last_touch_time = 1594829011
        __func__ = "ServerLoop"
#21 0x000002aa04513128 in PostmasterMain (argc=<optimized out>, argv=<optimized out>) at
./build/../src/backend/postmaster/postmaster.c:1400
        opt = <optimized out>
        status = <optimized out>
        userDoption = <optimized out>
        listen_addr_saved = false
        i = <optimized out>
        output_config_variable = <optimized out>
        __func__ = "PostmasterMain"
#22 0x000002aa04243fb4 in main (argc=<optimized out>, argv=0x2aa055b71b0) at ./build/../src/backend/main/main.c:210
        do_check_root = <optimized out>

Christoph



Re: gs_group_1 crashing on 13beta2/s390x

From
Tom Lane
Date:
Christoph Berg <myon@debian.org> writes:
>> On the Debian s390x buildd, the 13beta2 build is crashing:

> I wired gdb into the build process and got this backtrace:

> #0  datumCopy (typByVal=false, typLen=-1, value=0) at ./build/../src/backend/utils/adt/datum.c:142
>         vl = 0x0
>         res = <optimized out>
>         res = <optimized out>
>         vl = <optimized out>
>         eoh = <optimized out>
>         resultsize = <optimized out>
>         resultptr = <optimized out>
>         realSize = <optimized out>
>         resultptr = <optimized out>
>         realSize = <optimized out>
>         resultptr = <optimized out>
> #1  datumCopy (value=0, typByVal=false, typLen=-1) at ./build/../src/backend/utils/adt/datum.c:131
>         res = <optimized out>
>         vl = <optimized out>
>         eoh = <optimized out>
>         resultsize = <optimized out>
>         resultptr = <optimized out>
>         realSize = <optimized out>
>         resultptr = <optimized out>
> #2  0x000002aa04423af8 in finalize_aggregate (aggstate=aggstate@entry=0x2aa05775920,
> peragg=peragg@entry=0x2aa056e02f0, resultVal=0x2aa056e0208, resultIsNull=0x2aa056e022a, pergroupstate=<optimized out>,
> pergroupstate=<optimized out>) at ./build/../src/backend/executor/nodeAgg.c:1128

Hmm.  If gdb isn't lying to us, that has to be coming from here:

    /*
     * If result is pass-by-ref, make sure it is in the right context.
     */
    if (!peragg->resulttypeByVal && !*resultIsNull &&
        !MemoryContextContains(CurrentMemoryContext,
                               DatumGetPointer(*resultVal)))
        *resultVal = datumCopy(*resultVal,
                               peragg->resulttypeByVal,
                               peragg->resulttypeLen);

The line numbers in HEAD are a bit different, but that's the only
call of datumCopy() in finalize_aggregate().

It's hardly surprising that datumCopy would segfault when given
a null "value" and told it is pass-by-reference.  However, to get to
the datumCopy call, we must have passed the MemoryContextContains
check on that very same pointer value, and that would surely have
segfaulted as well, one would think.
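
(For reference, the crash site is the varlena branch in datumCopy();
from memory, the code around datum.c line 142 looks roughly like

    else if (typLen == -1)
    {
        /* It is a varlena datatype */
        struct varlena *vl = (struct varlena *) DatumGetPointer(value);

        if (VARATT_IS_EXTERNAL_EXPANDED(vl))
            ...

so with value = 0, vl is NULL, and merely inspecting the varlena header
in VARATT_IS_EXTERNAL_EXPANDED() dereferences a null pointer.)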

Given the apparently-can't-happen situation at the call site,
and the fact that we're not seeing similar failures reported
elsewhere (and note that every line shown above is at least
five years old), I'm kind of forced to the conclusion that this
is a compiler bug.  Does adjusting the -O level make it go away?

            regards, tom lane



Re: gs_group_1 crashing on 13beta2/s390x

From
Andrew Gierth
Date:
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

 Tom> It's hardly surprising that datumCopy would segfault when given a
 Tom> null "value" and told it is pass-by-reference. However, to get to
 Tom> the datumCopy call, we must have passed the MemoryContextContains
 Tom> check on that very same pointer value, and that would surely have
 Tom> segfaulted as well, one would think.

Nope, because MemoryContextContains just returns "false" if passed a
NULL pointer.
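
From memory, the guard at the top of MemoryContextContains() in mcxt.c
is roughly

    if (pointer == NULL || pointer != (void *) MAXALIGN(pointer))
        return false;

so a NULL (or misaligned) pointer is reported as not contained without
ever being dereferenced.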

-- 
Andrew (irc:RhodiumToad)



Re: gs_group_1 crashing on 13beta2/s390x

From
Christoph Berg
Date:
Re: Tom Lane
> Given the apparently-can't-happen situation at the call site,
> and the fact that we're not seeing similar failures reported
> elsewhere (and note that every line shown above is at least
> five years old), I'm kind of forced to the conclusion that this
> is a compiler bug.  Does adjusting the -O level make it go away?

The problem is that a manual build doesn't crash, and I'm somewhat
reluctant to do a full new package upload (which will keep buildds for
all architectures busy) just for a -O0 test unless we are sure it
helps.

I'd rather play more with the manual build artifacts (which should be
using the same compiler and everything), if anyone has ideas what I
should be trying.

Christoph



Re: gs_group_1 crashing on 13beta2/s390x

From
Tom Lane
Date:
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
>  Tom> It's hardly surprising that datumCopy would segfault when given a
>  Tom> null "value" and told it is pass-by-reference. However, to get to
>  Tom> the datumCopy call, we must have passed the MemoryContextContains
>  Tom> check on that very same pointer value, and that would surely have
>  Tom> segfaulted as well, one would think.

> Nope, because MemoryContextContains just returns "false" if passed a
> NULL pointer.

Ah, right.  So you could imagine getting here if the finalfn had returned
PointerGetDatum(NULL) with isnull = false.  We have some aggregate
transfns that are capable of doing that for internal-type transvalues,
I think, but the finalfn never should do it.
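
(A hypothetical offender would look about like this - to be clear, not
any actual PostgreSQL function:

    Datum
    broken_finalfn(PG_FUNCTION_ARGS)
    {
        /* wrong: claims a non-null result but returns a NULL pointer */
        fcinfo->isnull = false;
        PG_RETURN_POINTER(NULL);
    }

For a pass-by-ref result type, that produces exactly the value = 0,
isnull = false state we're chasing.)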

In any case we still have the fact that this isn't being seen in our
buildfarm; and that's not for lack of s390 machines.  So I still think
the most likely explanation is a compiler bug in bleeding-edge gcc.

Probably what Christoph should be trying to figure out is why he can't
reproduce it manually.  There must be some discrepancy between his
environment and the build machines; but what?

            regards, tom lane



Re: gs_group_1 crashing on 13beta2/s390x

From
Christoph Berg
Date:
Re: Tom Lane
> >  Tom> It's hardly surprising that datumCopy would segfault when given a
> >  Tom> null "value" and told it is pass-by-reference. However, to get to
> >  Tom> the datumCopy call, we must have passed the MemoryContextContains
> >  Tom> check on that very same pointer value, and that would surely have
> >  Tom> segfaulted as well, one would think.
> 
> > Nope, because MemoryContextContains just returns "false" if passed a
> > NULL pointer.
> 
> Ah, right.  So you could imagine getting here if the finalfn had returned
> PointerGetDatum(NULL) with isnull = false.  We have some aggregate
> transfns that are capable of doing that for internal-type transvalues,
> I think, but the finalfn never should do it.

So I had another stab at this. As expected, the 13.0 upload to
Debian/unstable crashed again on the buildd, while a manual
everything-should-be-the-same build succeeded. I don't know why I
didn't try this before, but this time I took this manual build and
started a PG instance from it. Pasting the gs_group_1 queries made it
segfault instantly.

So here we are:

#0  datumCopy (value=0, typLen=-1, typByVal=false) at ./build/../src/backend/utils/adt/datum.c:142
#1  0x000002aa3bf6322e in datumCopy (value=<optimized out>, typByVal=<optimized out>, typLen=<optimized out>)
    at ./build/../src/backend/utils/adt/datum.c:178
#2  0x000002aa3bda6dd6 in finalize_aggregate (aggstate=aggstate@entry=0x2aa3defbfd0, peragg=peragg@entry=0x2aa3e0671f0,
    pergroupstate=pergroupstate@entry=0x2aa3e026b78, resultVal=resultVal@entry=0x2aa3e067108, resultIsNull=0x2aa3e06712a)
    at ./build/../src/backend/executor/nodeAgg.c:1153

(gdb) p *resultVal
$3 = 0
(gdb) p *resultIsNull
$6 = false

(gdb) p *peragg
$7 = {aggref = 0x2aa3deef218, transno = 2, finalfn_oid = 0, finalfn = {fn_addr = 0x0, fn_oid = 0, fn_nargs = 0, fn_strict = false,
    fn_retset = false, fn_stats = 0 '\000', fn_extra = 0x0, fn_mcxt = 0x0, fn_expr = 0x0}, numFinalArgs = 1, aggdirectargs = 0x0,
  resulttypeLen = -1, resulttypeByVal = false, shareable = true}

Since finalfn_oid is 0, resultVal/resultIsNull were set by the `else`
branch of the if (OidIsValid) in finalize_aggregate():

    else
    {
        /* Don't need MakeExpandedObjectReadOnly; datumCopy will copy it */
        *resultVal = pergroupstate->transValue;
        *resultIsNull = pergroupstate->transValueIsNull;
    }

(gdb) p *pergroupstate
$12 = {transValue = 0, transValueIsNull = false, noTransValue = false}

That comes from finalize_aggregates:

#3  0x000002aa3bda7e10 in finalize_aggregates (aggstate=aggstate@entry=0x2aa3defbfd0, peraggs=peraggs@entry=0x2aa3e067140,
    pergroup=0x2aa3e026b58) at ./build/../src/backend/executor/nodeAgg.c:1369

    /*
     * Run the final functions.
     */
    for (aggno = 0; aggno < aggstate->numaggs; aggno++)
    {
        AggStatePerAgg peragg = &peraggs[aggno];
        int         transno = peragg->transno;
        AggStatePerGroup pergroupstate;

        pergroupstate = &pergroup[transno];

        if (DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit))
            finalize_partialaggregate(aggstate, peragg, pergroupstate,
                                      &aggvalues[aggno], &aggnulls[aggno]);
        else
            finalize_aggregate(aggstate, peragg, pergroupstate,
                               &aggvalues[aggno], &aggnulls[aggno]);
    }

... but at that point I'm lost. Maybe "aggno" and "transno" got mixed
up here?

(I'll leave the gdb session open for further suggestions.)

Christoph



Re: gs_group_1 crashing on 13beta2/s390x

From
Christoph Berg
Date:
I poked around with the SET in the offending tests, and the crash is
only present if `set jit_above_cost = 0;` is present. Removing that
makes it pass. Removing work_mem or enable_hashagg does not make a
difference. llvm version is 10.0.1.


Test file:

--
-- Compare results between plans using sorting and plans using hash
-- aggregation. Force spilling in both cases by setting work_mem low
-- and altering the statistics.
--

create table gs_data_1 as
select g%1000 as g1000, g%100 as g100, g%10 as g10, g
   from generate_series(0,1999) g;

analyze gs_data_1;
alter table gs_data_1 set (autovacuum_enabled = 'false');
update pg_class set reltuples = 10 where relname='gs_data_1';

SET work_mem='64kB';

-- Produce results with sorting.

set enable_hashagg = false;
set jit_above_cost = 0; -- remove this to remove crash

explain (costs off)
select g100, g10, sum(g::numeric), count(*), max(g::text)
from gs_data_1 group by cube (g1000, g100,g10);

create table gs_group_1 as
select g100, g10, sum(g::numeric), count(*), max(g::text)
from gs_data_1 group by cube (g1000, g100,g10);



Christoph



Re: gs_group_1 crashing on 13beta2/s390x

From
Christoph Berg
Date:
Re: To Tom Lane
> I poked around with the SET in the offending tests, and the crash is
> only present if `set jit_above_cost = 0;` is present. Removing that
> makes it pass. Removing work_mem or enable_hashagg does not make a
> difference. llvm version is 10.0.1.

I put jit_above_cost=0 into postgresql.conf and ran "make installcheck"
again. There are more crashes:

From src/test/regress/sql/interval.sql:

2020-09-25 17:00:25.130 CEST [8135] LOG:  server process (PID 8369) was terminated by signal 11: Segmentation fault
2020-09-25 17:00:25.130 CEST [8135] DETAIL:  Failed process was running: select avg(f1) from interval_tbl;

From src/test/regress/sql/tid.sql:

2020-09-25 17:01:20.593 CEST [8135] LOG:  server process (PID 9015) was terminated by signal 11: Segmentation fault
2020-09-25 17:01:20.593 CEST [8135] DETAIL:  Failed process was running: SELECT max(ctid) FROM tid_tab;

From src/test/regress/sql/collate*.sql:

2020-09-25 17:07:17.852 CEST [8135] LOG:  server process (PID 12232) was terminated by signal 11: Segmentation fault
2020-09-25 17:07:17.852 CEST [8135] DETAIL:  Failed process was running: SELECT min(b), max(b) FROM collate_test1;

From src/test/regress/sql/plpgsql.sql:

2020-09-25 17:11:56.495 CEST [8135] LOG:  server process (PID 15562) was terminated by signal 11: Segmentation fault
2020-09-25 17:11:56.495 CEST [8135] DETAIL:  Failed process was running: select * from returnqueryf();

Contrary to the others this one is not related to aggregates:

   -- test RETURN QUERY with dropped columns
   
   create table tabwithcols(a int, b int, c int, d int);
   insert into tabwithcols values(10,20,30,40),(50,60,70,80);
   
   create or replace function returnqueryf()
   returns setof tabwithcols as $$
   begin
     return query select * from tabwithcols;
     return query execute 'select * from tabwithcols';
   end;
   $$ language plpgsql;
   
   select * from returnqueryf();
   
   alter table tabwithcols drop column b;
   
   select * from returnqueryf();
   
   alter table tabwithcols drop column d;
   
   select * from returnqueryf();
   
   alter table tabwithcols add column d int;
   
   select * from returnqueryf();
   
   drop function returnqueryf();
   drop table tabwithcols;

src/test/regress/sql/rangefuncs.sql:

2020-09-25 17:16:04.209 CEST [8135] LOG:  server process (PID 17372) was terminated by signal 11: Segmentation fault
2020-09-25 17:16:04.209 CEST [8135] DETAIL:  Failed process was running: select * from usersview;

src/test/regress/sql/alter_table.sql:

2020-09-25 17:21:36.187 CEST [8135] LOG:  server process (PID 19217) was terminated by signal 11: Segmentation fault
2020-09-25 17:21:36.187 CEST [8135] DETAIL:  Failed process was running: update atacc3 set test2 = 4 where test2 is null

src/test/regress/sql/polymorphism.sql:

2020-09-25 17:23:55.509 CEST [8135] LOG:  server process (PID 21010) was terminated by signal 11: Segmentation fault
2020-09-25 17:23:55.509 CEST [8135] DETAIL:  Failed process was running: select myleast(1.1, 0.22, 0.55);

2020-09-25 17:28:26.222 CEST [8135] LOG:  server process (PID 22325) was terminated by signal 11: Segmentation fault
2020-09-25 17:28:26.222 CEST [8135] DETAIL:  Failed process was running: select f.longname from fullname f;

(stopping here)


There are also a lot of these log lines (without prefix):

ORC error: No callback manager available for s390x-ibm-linux-gnu

Is that worrying? I'm not sure but I think I've seen these on other
architectures as well.


I guess that suggests two things:
* jit is not ready for prime time on s390x and I should disable it
* jit is not exercised enough by "make installcheck"

Christoph



Re: gs_group_1 crashing on 13beta2/s390x

From
Andres Freund
Date:
Hi,

On 2020-09-25 17:29:07 +0200, Christoph Berg wrote:
> I guess that suggests two things:
> * jit is not ready for prime time on s390x and I should disable it

I don't know how good LLVM's support for s390x JITing is, and given that
it's unrealistic for people to get access to s390x...


> * jit is not exercised enough by "make installcheck"

So far we've exercised it more widely by setting up machines that use it
for all queries (by setting the config option). I'm doubtful it's worth
doing differently.

Greetings,

Andres Freund



Re: gs_group_1 crashing on 13beta2/s390x

From
Christoph Berg
Date:
On September 25, 2020 18:42:04 CEST, Andres Freund <andres@anarazel.de> wrote:
>> * jit is not exercised enough by "make installcheck"
>
>So far we've exercised it more widely by setting up machines that use
>it
>for all queries (by setting the config option). I'm doubtful it's worth
>doing differently.

Ok, but given that Debian is currently targeting 22 architectures, I doubt the PostgreSQL buildfarm covers all of them
with the extra JIT option, so I should probably make sure to do that here when running tests.



Re: gs_group_1 crashing on 13beta2/s390x

From
Ranier Vilela
Date:
On Fri, Sep 25, 2020 at 11:30, Christoph Berg <myon@debian.org> wrote:
Re: Tom Lane
> >  Tom> It's hardly surprising that datumCopy would segfault when given a
> >  Tom> null "value" and told it is pass-by-reference. However, to get to
> >  Tom> the datumCopy call, we must have passed the MemoryContextContains
> >  Tom> check on that very same pointer value, and that would surely have
> >  Tom> segfaulted as well, one would think.
>
> > Nope, because MemoryContextContains just returns "false" if passed a
> > NULL pointer.
>
> Ah, right.  So you could imagine getting here if the finalfn had returned
> PointerGetDatum(NULL) with isnull = false.  We have some aggregate
> transfns that are capable of doing that for internal-type transvalues,
> I think, but the finalfn never should do it.

So I had another stab at this. As expected, the 13.0 upload to
Debian/unstable crashed again on the buildd, while a manual
everything-should-be-the-same build succeeded. I don't know why I
didn't try this before, but this time I took this manual build and
started a PG instance from it. Pasting the gs_group_1 queries made it
segfault instantly.

So here we are:

#0  datumCopy (value=0, typLen=-1, typByVal=false) at ./build/../src/backend/utils/adt/datum.c:142
#1  0x000002aa3bf6322e in datumCopy (value=<optimized out>, typByVal=<optimized out>, typLen=<optimized out>)
    at ./build/../src/backend/utils/adt/datum.c:178
#2  0x000002aa3bda6dd6 in finalize_aggregate (aggstate=aggstate@entry=0x2aa3defbfd0, peragg=peragg@entry=0x2aa3e0671f0,
    pergroupstate=pergroupstate@entry=0x2aa3e026b78, resultVal=resultVal@entry=0x2aa3e067108, resultIsNull=0x2aa3e06712a)
    at ./build/../src/backend/executor/nodeAgg.c:1153

(gdb) p *resultVal
$3 = 0
(gdb) p *resultIsNull
$6 = false

(gdb) p *peragg
$7 = {aggref = 0x2aa3deef218, transno = 2, finalfn_oid = 0, finalfn = {fn_addr = 0x0, fn_oid = 0, fn_nargs = 0, fn_strict = false,
    fn_retset = false, fn_stats = 0 '\000', fn_extra = 0x0, fn_mcxt = 0x0, fn_expr = 0x0}, numFinalArgs = 1, aggdirectargs = 0x0,
  resulttypeLen = -1, resulttypeByVal = false, shareable = true}

Since finalfn_oid is 0, resultVal/resultIsNull were set by the `else`
branch of the if (OidIsValid) in finalize_aggregate():

    else
    {
        /* Don't need MakeExpandedObjectReadOnly; datumCopy will copy it */
        *resultVal = pergroupstate->transValue;
        *resultIsNull = pergroupstate->transValueIsNull;
    }

(gdb) p *pergroupstate
$12 = {transValue = 0, transValueIsNull = false, noTransValue = false}

Shouldn't transValueIsNull be "true" here? Then datumCopy would be
protected by the "!*resultIsNull" test.

regards,
Ranier Vilela

Re: gs_group_1 crashing on 13beta2/s390x

From
Ranier Vilela
Date:
On Fri, Sep 25, 2020 at 14:36, Ranier Vilela <ranier.vf@gmail.com> wrote:
> (gdb) p *pergroupstate
> $12 = {transValue = 0, transValueIsNull = false, noTransValue = false}
> Shouldn't transValueIsNull be "true" here? Then datumCopy would be
> protected by the "!*resultIsNull" test.

Observe this excerpt (nodeAgg.c line 1129):
/* don't call a strict function with NULL inputs */
*resultVal = (Datum) 0;
*resultIsNull = true;

Now, doesn't that contradict this principle? If all the values are
null, shouldn't the null flags be filled with true (1), and not 0
(false)?

Line 4711, in function ExecReScanAgg:
MemSet(econtext->ecxt_aggvalues, 0, sizeof(Datum) * node->numaggs);
MemSet(econtext->ecxt_aggnulls, true, sizeof(bool) * node->numaggs);

Does zero, here, mean false, i.e. that aggvalues is null? No.

regards,
Ranier Vilela

Re: gs_group_1 crashing on 13beta2/s390x

From
Tom Lane
Date:
Christoph Berg <cb@df7cb.de> writes:
> Ok, but given that Debian is currently targeting 22 architectures, I doubt the PostgreSQL buildfarm covers all of
> them with the extra JIT option, so I should probably make sure to do that here when running tests.

+1.  I rather doubt our farm is running this type of test on anything
but x86_64.

Of course, we can't actually *fix* any LLVM bugs, but it'd be nice
to know whether they're there.

            regards, tom lane



Re: gs_group_1 crashing on 13beta2/s390x

From
Andres Freund
Date:
Hi,

On 2020-09-25 19:05:52 +0200, Christoph Berg wrote:
> On September 25, 2020 18:42:04 CEST, Andres Freund <andres@anarazel.de> wrote:
> >> * jit is not exercised enough by "make installcheck"
> >
> >So far we've exercised it more widely by setting up machines that use
> >it
> >for all queries (by setting the config option). I'm doubtful it's worth
> >doing differently.
> 
> Ok, but given that Debian is currently targeting 22 architectures, I
> doubt the PostgreSQL buildfarm covers all of them with the extra JIT
> option, so I should probably make sure to do that here when running
> tests.

Forcing a lot of queries that are otherwise really fast to be JITed
unfortunately has a significant time cost. Doing that on slow
architectures might be prohibitively slow.  Kinda wonder if we shouldn't
just restrict JIT to a few architectures that we have a bit more regular
access to (x86, arm, maybe also ppc?). It's not like anybody would run
large analytics queries on mips.

Greetings,

Andres Freund



Re: gs_group_1 crashing on 13beta2/s390x

From
Andres Freund
Date:
Hi,

On 2020-09-25 14:11:46 -0400, Tom Lane wrote:
> Christoph Berg <cb@df7cb.de> writes:
> > Ok, but given that Debian is currently targeting 22 architectures, I doubt the PostgreSQL buildfarm covers all of
> > them with the extra JIT option, so I should probably make sure to do that here when running tests.
> 
> +1.  I rather doubt our farm is running this type of test on anything
> but x86_64.

There's quite a few arm animals and at least one mips animal that do
some minimal coverage of JITing (i.e. the queries that are actually
somewhat expensive). I pinged two owners asking whether one of the arm
animals could be changed to force JITing.

Greetings,

Andres Freund



Re: gs_group_1 crashing on 13beta2/s390x

From
Christoph Berg
Date:
Re: Andres Freund
> > > Ok, but given that Debian is currently targeting 22 architectures, I doubt the PostgreSQL buildfarm covers all of
> > > them with the extra JIT option, so I should probably make sure to do that here when running tests.
> > 
> > +1.  I rather doubt our farm is running this type of test on anything
> > but x86_64.
> 
> There's quite a few arm animals and at least one mips animal that do
> some minimal coverage of JITing (i.e. the queries that are actually
> somewhat expensive). I pinged two owners asking whether one of the arm
> animals could be changed to force JITing.

I pushed a change that should enable LLVM-10-JIT-testing everywhere [*]
and (admittedly to my surprise) all other architectures passed just
fine:

https://buildd.debian.org/status/logs.php?pkg=postgresql-13&ver=13.0-2

For the record, the architectures with llvm disabled are these:

clang-10 [!alpha !hppa !hurd-i386 !ia64 !kfreebsd-amd64 !kfreebsd-i386 !m68k !powerpc !riscv64 !s390x !sh4 !sparc64
!x32],

After the tests I realized that LLVM 11 is also already packaged, but
s390x still segfaults with that version.

Christoph

[*] apparently pgbench --temp-config=/no/such/file is not an error,
which makes verifying this change a bit harder



Re: gs_group_1 crashing on 13beta2/s390x

From
Andres Freund
Date:
Hi,

On 2020-09-28 14:22:01 +0200, Christoph Berg wrote:
> Re: Andres Freund
> > > > Ok, but given that Debian is currently targeting 22 architectures, I doubt the PostgreSQL buildfarm covers all
> > > > of them with the extra JIT option, so I should probably make sure to do that here when running tests.
> > > 
> > > +1.  I rather doubt our farm is running this type of test on anything
> > > but x86_64.
> > 
> > There's quite a few arm animals and at least one mips animal that do
> > some minimal coverage of JITing (i.e. the queries that are actually
> > somewhat expensive). I pinged two owners asking whether one of the arm
> > animals could be changed to force JITing.
> 
> I pushed a change that should enable LLVM-10-JIT-testing everywhere [*]
> and (admittedly to my surprise) all other architectures passed just
> fine:
> 
> https://buildd.debian.org/status/logs.php?pkg=postgresql-13&ver=13.0-2

Thanks!


> For the record, the architectures with llvm disabled are these:
> 
> clang-10 [!alpha !hppa !hurd-i386 !ia64 !kfreebsd-amd64 !kfreebsd-i386 !m68k !powerpc !riscv64 !s390x !sh4 !sparc64
> !x32],

!powerpc doesn't exclude ppc64, I assume?


> After the tests I realized that LLVM 11 is also already packaged, but
> s390x still segfaults with that version.
> 
> Christoph
> 
> [*] apparently pgbench --temp-config=/no/such/file is not an error,
> which makes verifying this change a bit harder

pgbench? I assume you mean pg_regress?

FWIW, an easy way to enable JIT for just about all tests, including tap
tests, is to set
PGOPTIONS='-c jit=1 -c jit_above_cost=0 ...'
in the environment before starting the tests.
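
E.g., an untested sketch (the extra jit_*_cost settings just force
inlining and optimization as well):

    PGOPTIONS='-c jit=1 -c jit_above_cost=0 -c jit_inline_above_cost=0 -c jit_optimize_above_cost=0' \
        make installcheck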


Can a non-debian dev get access to a s390x? It'd be nice to isolate this
enough to report a bug to LLVM - and that's probably a lot easier for me
than you... My guess would be that some relocation processing or such is
off.

Greetings,

Andres Freund



Re: gs_group_1 crashing on 13beta2/s390x

From
Christoph Berg
Date:
Re: Andres Freund
> > clang-10 [!alpha !hppa !hurd-i386 !ia64 !kfreebsd-amd64 !kfreebsd-i386 !m68k !powerpc !riscv64 !s390x !sh4 !sparc64
> > !x32],
> 
> !powerpc doesn't exclude ppc64, I assume?

That's direct matches only, there's no architecture-family logic in
there.

> > [*] apparently pgbench --temp-config=/no/such/file is not an error,
> > which makes verifying this change a bit harder
> 
> pgbench? I assume you mean pg_regress?

Err yes of course.

> FWIW, an easy way to enable JIT for just about all tests, including tap
> tests, is to set
> PGOPTIONS='-c jit=1 -c jit_above_cost=0 ...'
> in the environment before starting the tests.

Ok, that might simplify the setup a bit.

> Can a non-debian dev get access to a s390x? It'd be nice to isolate this
> enough to report a bug to LLVM - and that's probably a lot easier for me
> than you... My guess would be that some relocation processing or such is
> off.

You already had an account there in the past I think. I'll see if I
can get that reactivated. Thanks for the offer!

Christoph



Re: gs_group_1 crashing on 13beta2/s390x

From
Andres Freund
Date:
Hi,

Christoph helped me to get access to a s390x machine - I wasn't able to
reproduce exactly the error he hit. Initially all tests passed, but
after recompiling with build flags more similar to Christoph's I was able
to hit another instance of what I assume to be the same bug.

I am fairly sure that I see the problem. Before a post-LLVM-10 change
the "runtime linker" for JITed code only asserted that relocations that
need to be performed are of a known type. Since the debian build -
correctly - uses a release version of LLVM, this results in unhandled
relocations basically being resolved to 0.

I suspect that building with LDFLAGS="-Wl,-z,relro -Wl,-z,now" - which
is what I think the debian package does - creates the types of
relocations that LLVM doesn't handle for elf + s390.

10 release branch:

void RuntimeDyldELF::resolveSystemZRelocation(const SectionEntry &Section,
                                              uint64_t Offset, uint64_t Value,
                                              uint32_t Type, int64_t Addend) {
  uint8_t *LocalAddress = Section.getAddressWithOffset(Offset);
  switch (Type) {
  default:
    llvm_unreachable("Relocation type not implemented yet!");
    break;

11/master:

void RuntimeDyldELF::resolveSystemZRelocation(const SectionEntry &Section,
                                              uint64_t Offset, uint64_t Value,
                                              uint32_t Type, int64_t Addend) {
  uint8_t *LocalAddress = Section.getAddressWithOffset(Offset);
  switch (Type) {
  default:
    report_fatal_error("Relocation type not implemented yet!");
    break;

Verifying that that's the case by rebuilding against 11 (and then an
LLVM debug build, which will take a day or two).

Greetings,

Andres Freund



Re: gs_group_1 crashing on 13beta2/s390x

From
Andres Freund
Date:
Hi,

On 2020-10-14 14:58:35 -0700, Andres Freund wrote:
> I suspect that building with LDFLAGS="-Wl,-z,relro -Wl,-z,now" - which
> is what I think the debian package does - creates the types of
> relocations that LLVM doesn't handle for elf + s390.
> 
> 10 release branch:
> 
> void RuntimeDyldELF::resolveSystemZRelocation(const SectionEntry &Section,
>                                               uint64_t Offset, uint64_t Value,
>                                               uint32_t Type, int64_t Addend) {
>   uint8_t *LocalAddress = Section.getAddressWithOffset(Offset);
>   switch (Type) {
>   default:
>     llvm_unreachable("Relocation type not implemented yet!");
>     break;
> 
> 11/master:
> 
> void RuntimeDyldELF::resolveSystemZRelocation(const SectionEntry &Section,
>                                               uint64_t Offset, uint64_t Value,
>                                               uint32_t Type, int64_t Addend) {
>   uint8_t *LocalAddress = Section.getAddressWithOffset(Offset);
>   switch (Type) {
>   default:
>     report_fatal_error("Relocation type not implemented yet!");
>     break;
> 
> Verifying that that's the case by rebuilding against 11 (and then an
> LLVM debug build, which will take a day or two).

Oh dear. It's not as simple as that. The issue indeed is relocations,
but we don't hit those errors. The issue rather is that the
SystemZ-specific relative redirection code thought that the only
relative symbols are functions. So it creates a stub function to
redirect them. Which turns out to not work well with variables like
CurrentMemoryContext...

Example debug output:
        This is a SystemZ indirect relocation. Create a new stub function
        RelType: 20 Addend: 2 TargetName: ExecAggInitGroup
        SectionID: 0 Offset: 624
        This is a SystemZ indirect relocation. Create a new stub function
        RelType: 26 Addend: 2 TargetName: CurrentMemoryContext
        SectionID: 0 Offset: 712

Opening a bug report...

Greetings,

Andres Freund



Re: gs_group_1 crashing on 13beta2/s390x

From
Andres Freund
Date:
Hi,

On 2020-10-14 17:56:16 -0700, Andres Freund wrote:
> Oh dear. It's not as simple as that. The issue indeed is relocations,
> but we don't hit those errors. The issue rather is that the
> SystemZ-specific relative redirection code thought that the only
> relative symbols are functions. So it creates a stub function to
> redirect them. Which turns out to not work well with variables like
> CurrentMemoryContext...

That might be a problem - but the main problem causing the crash at hand
is likely something else. The prototypes we create for
ExecAggTransReparent() were missing the 'zeroext' attribute for the
'isnull' parameters, because the code for copying the attributes from
llvmjit_types.bc didn't go deep enough (i.e. I didn't quite grok the
pretty weird API). On s390x that led to the newValueIsNull argument in
ExecAggTransReparent() having a 0 lower byte, but set higher bytes,
which then *sometimes* fooled the if (!newValueIsNull) check, which
assumed that the higher bits were unset.

I have a fix for this, but I've just stared at s390 assembly code for
~10h, never having done so before. So that'll have to wait for tomorrow.

It's quite possible that that fix would also help on other
architectures...
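
Roughly, the fix ought to copy attributes not just at the function
level, but also for the return value and each parameter when cloning a
declaration from llvmjit_types.bc. An untested sketch using the LLVM C
API, with v_src/v_dst standing in for the template function and the
generated declaration (not the committed patch):

    /* attribute index 0 is the return value, parameters start at 1 */
    for (unsigned idx = 0; idx <= LLVMCountParams(v_src); idx++)
    {
        unsigned num = LLVMGetAttributeCountAtIndex(v_src, idx);
        LLVMAttributeRef *attrs = palloc(sizeof(LLVMAttributeRef) * num);

        LLVMGetAttributesAtIndex(v_src, idx, attrs);
        for (unsigned i = 0; i < num; i++)
            LLVMAddAttributeAtIndex(v_dst, idx, attrs[i]);
        pfree(attrs);
    }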

Greetings,

Andres Freund



Re: gs_group_1 crashing on 13beta2/s390x

From
Andres Freund
Date:
Hi,

On 2020-10-15 01:32:46 -0700, Andres Freund wrote:
> I have a fix for this, but I've just stared at s390 assembly code for
> ~10h, never having done so before. So that'll have to wait for tomorrow.
>
> It's quite possible that that fix would also help on other
> architectures...

Pushed now to 11-master.

Author: Andres Freund <andres@anarazel.de>
Branch: master [72559438f] 2020-10-15 14:29:53 -0700
Branch: REL_13_STABLE [ae3e75aba] 2020-10-15 14:30:40 -0700
Branch: REL_12_STABLE [c8a2bb0f1] 2020-10-15 14:31:32 -0700
Branch: REL_11_STABLE [f3dee5b9a] 2020-10-15 15:06:16 -0700

    llvmjit: Also copy parameter / return value attributes from template functions.

    Previously we only copied the function attributes. That caused problems at
    least on s390x: Because we didn't copy the 'zeroext' attribute for
    ExecAggTransReparent()'s *IsNull parameters, expressions invoking it didn't
    ensure that the upper bytes of the registers were zeroed. In the - relatively
    rare - cases where not, ExecAggTransReparent() wrongly ended up in the
    newValueIsNull branch due to the register not being zero. Subsequently causing
    a crash.

    It's quite possible that this would cause problems on other platforms, and in
    other places than just ExecAggTransReparent() on s390x.

    Thanks to Christoph (and the Debian project) for providing me with access to a
    s390x machine, allowing me to debug this.

    Reported-By: Christoph Berg
    Author: Andres Freund
    Discussion: https://postgr.es/m/20201015083246.kie5726xerdt3ael@alap3.anarazel.de
    Backpatch: 11-, where JIT was added


I had a successful check-world run with maximum jittery on s390x. But I
did hit the issue in different places than you did, so it'd be cool if
you could re-enable JIT for s390x - I think you have a package tracking
HEAD?

Thanks again Christoph!

Greetings,

Andres Freund



Re: gs_group_1 crashing on 13beta2/s390x

From
Andres Freund
Date:
Hi,

On 2020-10-15 15:29:24 -0700, Andres Freund wrote:
> Pushed now to 11-master.

Ugh - there's a failure with an old LLVM version (3.9):
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2020-10-15%2022%3A24%3A04

Need to rebuild that locally to reproduce.

Greetings,

Andres Freund



Re: gs_group_1 crashing on 13beta2/s390x

From
Andres Freund
Date:
Hi,

On 2020-10-15 15:37:01 -0700, Andres Freund wrote:
> On 2020-10-15 15:29:24 -0700, Andres Freund wrote:
> > Pushed now to 11-master.
> 
> Ugh - there's a failure with an old LLVM version (3.9):
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2020-10-15%2022%3A24%3A04
> 
> Need to rebuild that locally to reproduce.

It's a bug that was fixed in LLVM 4, but too late to be backported to
3.9.

The easiest seems to be to just use a wrapper function that does the
necessary pre-checks. Something like the below (in llvmjit_wrap.cpp).

Since the wrapper still needs to call into
LLVMGetAttributeCountAtIndexPG, it seems easier to just use the seperate
function name, rather than #define'ing LLVMGetAttributeCountAtIndex() to
the PG version?

/*
 * Like LLVM's LLVMGetAttributeCountAtIndex(), works around a bug in LLVM 3.9.
 *
 * In LLVM <= 3.9, LLVMGetAttributeCountAtIndex() segfaults if there are no
 * attributes at an index (fixed in LLVM commit ce9bb1097dc2).
 */
unsigned
LLVMGetAttributeCountAtIndexPG(LLVMValueRef F, uint32 Idx)
{
    /*
     * This is more expensive, so only do when using a problematic LLVM
     * version.
     */
#if LLVM_VERSION_MAJOR < 4
    if (!llvm::unwrap<llvm::Function>(F)->getAttributes().hasAttributes(Idx))
        return 0;
#endif

    /*
     * There is no nice public API to determine the count nicely, so just
     * always fall back to LLVM's C API.
     */
    return LLVMGetAttributeCountAtIndex(F, Idx);
}


Greetings,

Andres Freund



Re: gs_group_1 crashing on 13beta2/s390x

From
Andres Freund
Date:
Hi,

On 2020-10-15 17:12:54 -0700, Andres Freund wrote:
> On 2020-10-15 15:37:01 -0700, Andres Freund wrote:
> It's a bug that was fixed in LLVM 4, but too late to be backported to
> 3.9.
> 
> The easiest seems to be to just use a wrapper function that does the
> necessary pre-checks. Something like the below (in llvmjit_wrap.cpp).
> 
> Since the wrapper still needs to call into
> LLVMGetAttributeCountAtIndexPG, it seems easier to just use the separate
> function name, rather than #define'ing LLVMGetAttributeCountAtIndex() to
> the PG version?
> 
> /*
>  * Like LLVM's LLVMGetAttributeCountAtIndex(), works around a bug in LLVM 3.9.
>  *
>  * In LLVM <= 3.9, LLVMGetAttributeCountAtIndex() segfaults if there are no
>  * attributes at an index (fixed in LLVM commit ce9bb1097dc2).
>  */
> unsigned
> LLVMGetAttributeCountAtIndexPG(LLVMValueRef F, uint32 Idx)
> {
>     /*
>      * This is more expensive, so only do when using a problematic LLVM
>      * version.
>      */
> #if LLVM_VERSION_MAJOR < 4
>     if (!llvm::unwrap<llvm::Function>(F)->getAttributes().hasAttributes(Idx))
>         return 0;
> #endif
> 
>     /*
>      * There is no nice public API to determine the count nicely, so just
>      * always fall back to LLVM's C API.
>      */
>     return LLVMGetAttributeCountAtIndex(F, Idx);
> }

Seems to have calmed the buildfarm, without negative consequences so far.

Greetings,

Andres Freund



Re: gs_group_1 crashing on 13beta2/s390x

From
Christoph Berg
Date:
Re: Andres Freund
> I had a successful check-world run with maximum jittery on s390x. But I
> did hit the issue in different places than you did, so it'd be cool if
> you could re-enable JIT for s390x - I think you have a package tracking
> HEAD?

Cool, thanks!

I'm tracking PG14 head with apt.postgresql.org, but that doesn't have
s390x.

I'll pull the patches for PG13, re-enable JIT on some more
architectures, and use the opportunity to bump the LLVM version used
to 11.

Christoph