Thread: [HACKERS] help to identify the reason that extension's C function returning an array gets a segmentation fault

I have written an extension to manage OpenStreetMap data. It includes a C function that performs a spatial top-k query on several tables and returns an array of int8 values as the result. The code skeleton of this function is as follows:

Datum vquery(PG_FUNCTION_ARGS) {

int array_len = PG_GETARG_INT32(0);
long * node_ids;

SPI_connect();

//some code to retrieve data from various tables 
// node_ids are allocated and filled up

ArrayType * retarr;
Datum * vals ;

vals = palloc0(array_len * sizeof(long));

// fill the vals up
for (i = 0 ; i < array_len ; i++) 
      vals[i] = Int64GetDatum((node_ids[i]));

retarr = construct_array(vals, retcnt, INT8OID, sizeof(long), true, 'i');

SPI_finish();

PG_RETURN_ARRAYTYPE_P(retarr);
}

The function runs smoothly when called with a relatively small parameter, such as select(unnest(vquery(1000))); but when called with a large parameter, such as select(unnest(vquery(50000))), it sometimes runs normally and sometimes fails with a "Segmentation fault" error. The larger the parameter, the more likely the segmentation fault.

The backtrace of the process is as follows:

Program received signal SIGSEGV, Segmentation fault.
pg_detoast_datum (datum=0x55d4e7e43bc0) at /build/postgresql-9.3-xceQkK/postgresql-9.3-9.3.16/build/../src/backend/utils/fmgr/fmgr.c:2241
2241 /build/postgresql-9.3-xceQkK/postgresql-9.3-9.3.16/build/../src/backend/utils/fmgr/fmgr.c: No such file or directory.
(gdb) backtrace full
#0  pg_detoast_datum (datum=0x55d4e7e43bc0) at /build/postgresql-9.3-xceQkK/postgresql-9.3-9.3.16/build/../src/backend/utils/fmgr/fmgr.c:2241
No locals.
#1  0x000055d4e485a29f in array_out (fcinfo=0x7ffd0fdb9f30) at /build/postgresql-9.3-xceQkK/postgresql-9.3-9.3.16/build/../src/backend/utils/adt/arrayfuncs.c:958
        v = <optimized out>
        element_type = <optimized out>
        typlen = <optimized out>
        typbyval = <optimized out>
        typalign = <optimized out>
        typdelim = <optimized out>
        p = <optimized out>
        tmp = <optimized out>
        retval = <optimized out>
        values = <optimized out>
        dims_str = "\346\263\346U\000\000\000\000f\021\352\312\342\177\000\000 %q\003\000\000\000\000\270\347\026\313\342\177\000\000\000\000\002\000\000\000\000\000\211O\343\312\342\177\000\000(?\a\000\000\000\000\000\371\336\342\312\342\177\000\000\001\000\000\000\000\000\000\000f\021\352\312\342\177\000\000\200\204\004\001\000\000\000\000\270\347\026\313\342\177\000\000\000\000\002\000\000\000\000\000\211O\343\312\342\177\000\000`1\354\352\324U\000\000\371\336\342\312\342\177\000\000\001\000\000\000\000\000\000\000\000\200\355\347\324U\000\000\300;\344\347\324U\000\000\200\373\350\346\324U\000\000\200\204\004\001\000\000\000\000`\347\026\313\342\177\000\000\200;\344\347\324U\000\000\200D\t\000\000\000\000\000\001\000\000\000\000\000\000"
        bitmap = <optimized out>
        bitmask = <optimized out>
        needquotes = <optimized out>
        needdims = <optimized out>
        nitems = <optimized out>
        overall_length = <optimized out>
        i = <optimized out>
        j = <optimized out>
        k = <optimized out>
        indx = {125, 0, 1638826752, 1007657037, 266051136, 32765}
        ndim = <optimized out>
        dims = <optimized out>
        lb = <optimized out>
        my_extra = <optimized out>
#2  0x000055d4e491bf77 in FunctionCall1Coll (flinfo=flinfo@entry=0x55d4e6281608, collation=collation@entry=0, arg1=arg1@entry=94372911922112)
    at /build/postgresql-9.3-xceQkK/postgresql-9.3-9.3.16/build/../src/backend/utils/fmgr/fmgr.c:1301
        fcinfo = {flinfo = 0x55d4e6281608, context = 0x0, resultinfo = 0x0, fncollation = 0, isnull = 0 '\000', nargs = 1, arg = {94372911922112, 16, 94372911922112, 
            94372881863664, 81000, 140612075225152, 140724869504928, 94372856290490, 20, 94372911922112, 140724869504960, 94372855395365, 59362, 59362, 
            140724869505552, 140611821448277, 140724869505184, 140724869505104, 140724869505056, 25309696688, 4630392398271114606, 4638776743690240000, 
            4628742354541509959, 4638236535588651008, 140612075225152, 140724869505184, 140724869505264, 140724869505344, 8192, 1340029796386, 94372903805648, 
            94372905445328, 4412211000755930201, 4295079941117417898, 4212081119735560672, 94372856202527, 2087976960, 94372882833728, 94372882836912, 1, 
            140724869505296, 4327854021138088704, 3599182594146, 1, 140724869505248, 94372856077425, 1016, 94372882392848, 94372860652912, 140612076116712, 
            140724869505680, 94372856082219, 94372882407728, 140724869505360, 140724869505343, 3886087214688, 140612076116712, 140724869505344, 140728326873992, 
            94372882407736, 94372882817344, 94372882817312, 1125891316908032, 0, 94372855675584, 281483566645432, 2, 0, 94372881959504, 0, 1016, 0, 0, 8192, 
            18446603348840046049, 513, 128, 176, 140724869505568, 16, 459561500672, 2, 0, 511101108336, 0, 140724869505567, 0, 0, 124, 0, 0, 0, 0, 0, 0, 
            140612046612320, 8192, 1024, 1024, 1072}, 
          argnull = "\000 \000\000\000\000\000\000\300&\343\312\342\177\000\000\220\017(\346\324U\000\000x\201'\346\324U\000\000\320\242\333\017\375\177\000\000\260R\223\344\324U\000\000\060\243\333\017\375\177\000\000\002\000\000\000\000\000\000\000\340\242\333\017\375\177\000\000?:x\344\324U\000\000\000\000\000\000\000\000\000\000\002\000\000\000\000\000\000\000\020\354'\346"}
        result = <optimized out>
        __func__ = "FunctionCall1Coll"
#3  0x000055d4e491d557 in OutputFunctionCall (flinfo=0x55d4e6281608, val=94372911922112)
    at /build/postgresql-9.3-xceQkK/postgresql-9.3-9.3.16/build/../src/backend/utils/fmgr/fmgr.c:1954
        result = <optimized out>
        pushed = 0 '\000'
#4  0x000055d4e4635179 in printtup (slot=0x55d4e6280410, self=0x55d4e627ec10)
    at /build/postgresql-9.3-xceQkK/postgresql-9.3-9.3.16/build/../src/backend/access/common/printtup.c:347
        outputstr = <optimized out>
        thisState = <optimized out>
        attr = <optimized out>
        typeinfo = <optimized out>
        myState = 0x55d4e627ec10
        buf = {data = 0x55d4e6288150 "", len = 2, maxlen = 1024, cursor = 68}
        natts = 1
        i = 0
#5  0x000055d4e475c8f7 in ExecutePlan (dest=0x55d4e627ec10, direction=<optimized out>, numberTuples=0, sendTuples=1 '\001', operation=CMD_SELECT, 
    planstate=0x55d4e6280220, estate=0x55d4e6280110) at /build/postgresql-9.3-xceQkK/postgresql-9.3-9.3.16/build/../src/backend/executor/execMain.c:1513
        slot = <optimized out>
        current_tuple_count = 0
#6  standard_ExecutorRun (queryDesc=0x55d4e61989b0, direction=<optimized out>, count=0)
    at /build/postgresql-9.3-xceQkK/postgresql-9.3-9.3.16/build/../src/backend/executor/execMain.c:318
        estate = 0x55d4e6280110
        operation = CMD_SELECT
        dest = 0x55d4e627ec10
        sendTuples = <optimized out>
#7  0x000055d4e484255f in PortalRunSelect (portal=portal@entry=0x55d4e61969a0, forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807, 
    dest=dest@entry=0x55d4e627ec10) at /build/postgresql-9.3-xceQkK/postgresql-9.3-9.3.16/build/../src/backend/tcop/pquery.c:942
        queryDesc = 0x55d4e61989b0
        direction = <optimized out>
        nprocessed = <optimized out>
        __func__ = "PortalRunSelect"
#8  0x000055d4e4843b2a in PortalRun (portal=portal@entry=0x55d4e61969a0, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', 
    dest=dest@entry=0x55d4e627ec10, altdest=altdest@entry=0x55d4e627ec10, completionTag=completionTag@entry=0x7ffd0fdba830 "")
    at /build/postgresql-9.3-xceQkK/postgresql-9.3-9.3.16/build/../src/backend/tcop/pquery.c:786
        save_exception_stack = 0x7ffd0fdba740
        save_context_stack = 0x0
        local_sigjmp_buf = {{__jmpbuf = {94372882585600, 6959097398727827151, 94372881852832, 94372882803728, 94372882585872, 2, 6959097398784450255, 
              3801170618011246287}, __mask_was_saved = 0, __saved_mask = {__val = {140724869506399, 4561020176, 94372856202256, 94372857528766, 64, 140724869506368, 
                88, 94372881852832, 94372857166446, 94372882585872, 2, 140724869506400, 94372856290490, 2, 94372881852832, 140724869506432}}}}
        result = <optimized out>
        nprocessed = <optimized out>
        saveTopTransactionResourceOwner = 0x55d4e61560c8
        saveTopTransactionContext = 0x55d4e61b0fa0
        saveActivePortal = 0x0
        saveResourceOwner = 0x55d4e61560c8
        savePortalContext = 0x0
        saveMemoryContext = 0x55d4e61b0fa0
        __func__ = "PortalRun"
#9  0x000055d4e48414e5 in exec_simple_query (
    query_string=0x55d4e6248930 "select vquery_c(121.3055419921875,28.596278622860407,129.9647216796875,36.91682310329166,81000,0.01);")
    at /build/postgresql-9.3-xceQkK/postgresql-9.3-9.3.16/build/../src/backend/tcop/postgres.c:1075
        parsetree = 0x55d4e6249800
        portal = 0x55d4e61969a0
        snapshot_set = <optimized out>
        commandTag = <optimized out>
        completionTag = "\000ELECT 1\000\000\000\000\000\000\000\000\260\250\333\017\375\177\000\000\357r\177\344\324U\000\000`\352\000\000\000\000\000\000@\355\325\344\324U\000\000\320D\025\346\324U\000\000\364y\031\346\324U\000"
        querytree_list = <optimized out>
        plantree_list = 0x55d4e627ebe0
        receiver = 0x55d4e627ec10
        format = 0
        dest = DestRemote
        parsetree_list = 0x55d4e6249930
        save_log_statement_stats = 0 '\000'
        was_logged = 0 '\000'
        msec_str = "\000ELECT 1\000\000\000\000\000\000\000\000\260\250\333\017\375\177\000\000\357r\177\344\324U\000"
        parsetree_item = 0x55d4e6249910
        isTopLevel = 1 '\001'
#10 PostgresMain (argc=<optimized out>, argv=argv@entry=0x55d4e6155878, dbname=0x55d4e6155720 "openstreetmap", username=<optimized out>)
    at /build/postgresql-9.3-xceQkK/postgresql-9.3-9.3.16/build/../src/backend/tcop/postgres.c:4087
        query_string = 0x55d4e6248930 "select vquery_c(121.3055419921875,28.596278622860407,129.9647216796875,36.91682310329166,81000,0.01);"
        firstchar = -434542176
        input_message = {data = 0x55d4e6248930 "select vquery_c(121.3055419921875,28.596278622860407,129.9647216796875,36.91682310329166,81000,0.01);", len = 102, 
          maxlen = 1024, cursor = 102}
        local_sigjmp_buf = {{__jmpbuf = {140724869506880, 6959097398365019855, 94372881585920, 1, 0, 94372881836048, 6959097398729924303, 3801170603473657551}, 
            __mask_was_saved = 1, __saved_mask = {__val = {0, 140724869507616, 0, 94372881836048, 140612043772104, 8192, 206158430256, 140724869507232, 
                140724869507024, 140724869507072, 94372881586032, 656, 94372881857028, 10, 94372881585920, 94372881586232}}}}
        send_ready_for_query = 0 '\000'
        __func__ = "PostgresMain"
#11 0x000055d4e46315e9 in BackendRun (port=0x55d4e6192810) at /build/postgresql-9.3-xceQkK/postgresql-9.3-9.3.16/build/../src/backend/postmaster/postmaster.c:4185
        ac = 1
        secs = 541521482
        usecs = 969410
        i = 1
        av = 0x55d4e6155878
        maxac = <optimized out>
#12 BackendStartup (port=0x55d4e6192810) at /build/postgresql-9.3-xceQkK/postgresql-9.3-9.3.16/build/../src/backend/postmaster/postmaster.c:3848
        bn = <optimized out>
        pid = <optimized out>
#13 ServerLoop () at /build/postgresql-9.3-xceQkK/postgresql-9.3-9.3.16/build/../src/backend/postmaster/postmaster.c:1698
        rmask = {fds_bits = {128, 0 <repeats 15 times>}}
        selres = <optimized out>
        readmask = {fds_bits = {200, 0 <repeats 15 times>}}
        now = <optimized out>
        last_lockfile_recheck_time = 1488206220
        last_touch_time = 1488205152
        __func__ = "ServerLoop"
#14 0x000055d4e47fa571 in PostmasterMain (argc=5, argv=<optimized out>)
    at /build/postgresql-9.3-xceQkK/postgresql-9.3-9.3.16/build/../src/backend/postmaster/postmaster.c:1322
        opt = <optimized out>
        status = <optimized out>
        userDoption = <optimized out>
        listen_addr_saved = 1 '\001'
        i = <optimized out>
        __func__ = "PostmasterMain"
#15 0x000055d4e463227d in main (argc=5, argv=0x55d4e6154190) at /build/postgresql-9.3-xceQkK/postgresql-9.3-9.3.16/build/../src/backend/main/main.c:234

Some background information worth mentioning:

I use some third-party C libraries in the extension. These libraries use the C standard memory allocation functions, such as malloc and calloc, rather than the Postgres memory management functions (palloc and pfree). They implement the vector, hash table, and priority queue data structures.
钱新林 <qianxinlin@gmail.com> writes:
> I have written an extension to manage openstreetmap data. There is a C
> function to perform spatial top k query on several  tables and return an
> array of int8 type as result. The code skeleton of this function is as
> follows:

There are a remarkable lot of bugs in this code fragment.  Many of them
would not bite you as long as you are running on 64-bit Intel hardware,
but that doesn't make them not bugs.

> Datum vquery(PG_FUNCTION_ARGS) {

> int array_len = PG_GETARG_INT32(0);
> long * node_ids;

> SPI_connect();

> //some code to retrieve data from various tables
> // node_ids are allocated and filled up

> ArrayType * retarr;
> Datum * vals ;

> vals = palloc0(array_len * sizeof(long));

Datum is not necessarily the same as "long".

> // fill the vals up
> for (i = 0 ; i < array_len ; i++)
>       vals[i] = Int64GetDatum((node_ids[i]));

int64 is not necessarily the same as "long", either.

> retarr = construct_array(vals, retcnt, INT8OID, sizeof(long), true, 'i');

Again, INT8 is not the same size as "long", and it's not necessarily
pass-by-val, and it's *certainly* not integer alignment.

> SPI_finish();

> PG_RETURN_ARRAYTYPE_P(retarr);

But I think what's really biting you, probably, is that construct_array()
made the array in CurrentMemoryContext which at that point was the SPI
execution context; which would be deleted by SPI_finish.  So you're
returning a dangling pointer.  You need to do something to either copy
the array value out to the caller's context, or build it there in the
first place.

BTW, this failure would be a lot less intermittent if you were testing
in a CLOBBER_FREED_MEMORY build.  I would go so far as to say you should
*never* develop or test C code for the Postgres backend without using
the --enable-cassert configure option for your build.  You're simply
tossing away a whole lot of debug support if you don't.
        regards, tom lane
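The "build it there in the first place" option from the reply above can be sketched as follows. This is only a minimal sketch under stated assumptions, not the extension's actual code: the function name vquery_sketch is hypothetical, the real queries are replaced by dummy values, and the int8 storage properties are looked up with get_typlenbyvalalign() instead of being hard-coded.

```c
#include "postgres.h"
#include "fmgr.h"
#include "catalog/pg_type.h"    /* INT8OID */
#include "executor/spi.h"
#include "utils/array.h"        /* construct_array */
#include "utils/lsyscache.h"    /* get_typlenbyvalalign */

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(vquery_sketch);

/* Hypothetical stand-in for the vquery function discussed in this thread. */
Datum
vquery_sketch(PG_FUNCTION_ARGS)
{
    int32         array_len = PG_GETARG_INT32(0);
    /* Remember the caller's context before SPI_connect() switches away. */
    MemoryContext caller_ctx = CurrentMemoryContext;
    MemoryContext spi_ctx;
    int64        *node_ids;
    Datum        *vals;
    ArrayType    *retarr;
    int16         typlen;
    bool          typbyval;
    char          typalign;
    int           i;

    if (array_len < 0)
        elog(ERROR, "array length must not be negative");

    if (SPI_connect() != SPI_OK_CONNECT)
        elog(ERROR, "SPI_connect failed");

    /*
     * Placeholder for the real queries. Scratch data such as node_ids can
     * live in the SPI context, because it is not needed after SPI_finish().
     */
    node_ids = (int64 *) palloc(array_len * sizeof(int64));
    for (i = 0; i < array_len; i++)
        node_ids[i] = (int64) i;        /* dummy values, assumption */

    /* Ask the catalog for int8's length, by-value flag and alignment. */
    get_typlenbyvalalign(INT8OID, &typlen, &typbyval, &typalign);

    /* Build the result in the caller's context so it outlives SPI_finish(). */
    spi_ctx = MemoryContextSwitchTo(caller_ctx);

    vals = (Datum *) palloc(array_len * sizeof(Datum));
    for (i = 0; i < array_len; i++)
        vals[i] = Int64GetDatum(node_ids[i]);

    retarr = construct_array(vals, array_len, INT8OID,
                             typlen, typbyval, typalign);

    MemoryContextSwitchTo(spi_ctx);

    SPI_finish();

    PG_RETURN_ARRAYTYPE_P(retarr);
}
```

The Datum array is filled after the context switch on purpose: on platforms where int8 is not pass-by-value, Int64GetDatum() itself allocates memory, and that memory also has to survive SPI_finish().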



Thanks for your clues. 

The system I use to debug the code is x86 64-bit, running Ubuntu 14.04 and Postgres 9.3.13. I have revised the code, and it now looks like this:

Datum vquery(PG_FUNCTION_ARGS) {

int array_len = PG_GETARG_INT32(0);
int64 * node_ids;
ArrayType * retarr;
Datum * vals ;

SPI_connect();

//some code to retrieve data from various tables 
// node_ids are allocated and filled up

vals = SPI_palloc(array_len * sizeof(Datum));
memset (vals, 0, array_len * sizeof(Datum));

// fill the vals up
for (i = 0 ; i < array_len ; i++) 
      vals[i] = Int64GetDatum((node_ids[i]));

retarr = construct_array(vals, retcnt, INT8OID, sizeof(int64), true, 'd');

SPI_finish();

PG_RETURN_ARRAYTYPE_P(retarr);
}

This seems to solve the problem: I have tested the code for a while and no more segmentation faults have been reported.

I have built PostgreSQL with --enable-debug and --enable-cassert, but when I run the binary under gdb, no symbol file is loaded. I will look into this further so that I can use it for debugging. Thanks.
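One point in the revision above may be worth double-checking. SPI_palloc() only decides where the vals array lives; as far as I understand it, construct_array() still allocates its result in whatever memory context is current, which between SPI_connect() and SPI_finish() is the SPI context. The "copy the array value out" option from the earlier reply could be sketched like this. The helper name copy_array_to_caller is hypothetical, and the sketch assumes the array was freshly built by construct_array(), so it is a plain, non-toasted varlena.

```c
#include "postgres.h"
#include "executor/spi.h"
#include "utils/array.h"

/*
 * Hypothetical helper: copy an array that was built inside the SPI context
 * into the upper executor context, using SPI_palloc(). It must be called
 * after SPI_connect() and before SPI_finish().
 */
static ArrayType *
copy_array_to_caller(ArrayType *arr)
{
    Size        len = VARSIZE(arr);     /* total size, header included */
    ArrayType  *copy = (ArrayType *) SPI_palloc(len);

    memcpy(copy, arr, len);
    return copy;
}
```

In the revised function this would be called as retarr = copy_array_to_caller(retarr); immediately before SPI_finish(). Testing against a build configured with --enable-cassert should make any remaining dangling pointer fail consistently rather than intermittently.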

On Tue, Feb 28, 2017 at 12:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> [snip]