Thread: segmentation fault postgres 9.3.5 core dump perlu related ?

segmentation fault postgres 9.3.5 core dump perlu related ?

From
"Day, David"
Date:

We are developing on and running Postgres 9.3.5 on FreeBSD 10.0-p12.

We have been experiencing an intermittent postgres core dump which seems primarily to be associated with the two functions below.

 

The area of interest is based on the content of the postgres log file, which often indicates:

 

2014-12-01T14:37:41.559725-05:00 puertorico local0 info postgres[30154]: [3-1] LOG:  server process (PID 30187) was terminated by signal 11: Segmentation fault

2014-12-01T14:37:41.559787-05:00 puertorico local0 info postgres[30154]: [3-2] DETAIL:  Failed process was running: SELECT * FROM cc.get_port_and_registration_data($1, $2, $3, $4, $5)

2014-12-01T14:37:41.559794-05:00 puertorico local0 info postgres[30154]: [4-1] LOG:  terminating any other active server processes

 

The core file backtrace also shows an association with the perl libraries. We have only two plperlu functions currently, and the logic below is the most relevant.

 

Given the onset of this problem, we suspect it has something to do with the addition of DNS lookup within our plperlu function cc.get_sip_id(…).

I would note that we have captured the arguments to cc.get_port_and_registration_data at the time of a core and can repeat
the same query after the core event without incident. Currently we are testing for an absence of the core event by commenting
out the DNS perl function logic, and we have rebuilt postgres with debugging symbols. An example backtrace from a core file is
below (taken prior to the function alteration).

 

I am more accustomed to debugging simpler program errors that do not have such a severe impact on the postgres server.

I would appreciate any comments on potential issues or bad practices in the suspect functions, and/or additional details that could be gathered from the core files that might assist in resolving this matter.

 

 

Thanks

 

Dave Day

 

 

 

CREATE OR REPLACE FUNCTION cc.get_port_and_registration_data(cca character varying, tgrp character varying, dhost character varying, usr character varying[], orig_flag boolean)

  RETURNS SETOF cc.port_type_tbl AS

$BODY$

--  The inputs to this overloaded function are sip parameters.

DECLARE pid INTEGER;

DECLARE uid INTEGER;

DECLARE mode    CHARACTER VARYING;

DECLARE sql_result record;

 

BEGIN

 

  SELECT * FROM cc.get_sip_id($1,$2,$3, $4) INTO pid LIMIT 1;   -- Perl invocation

 

  FOR sql_result IN

       SELECT cc.get_db_refhndl($5) AS db_ref_hndl, * FROM cc.port_info t1

             LEFT JOIN (SELECT translator_id, mgcp_digit_map FROM cc.translator_sys) t2 USING (translator_id)

             LEFT JOIN cc.register_port USING (port_id)

       WHERE port_id = pid AND op_mode = 'default'

       ORDER by expiration DESC

  LOOP

    RETURN NEXT sql_result;

  END LOOP;

  RETURN;

END;

$BODY$

  LANGUAGE plpgsql VOLATILE

  COST 100

  ROWS 1000;

ALTER FUNCTION cc.get_port_and_registration_data(character varying, character varying, character varying, character varying[], boolean)

  OWNER TO redcom;
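For reference, the application invokes this essentially as in the log line above; a concrete call looks something like the following (the argument values here are purely illustrative):

SELECT * FROM cc.get_port_and_registration_data(
    '',                                       -- cca  (call control agent)
    'tg1',                                    -- tgrp (destination group)
    '10.0.0.5',                               -- dhost
    ARRAY['sip:alice@10.0.0.5']::varchar[],   -- usr
    true);                                    -- orig_flag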

 

 

CREATE OR REPLACE FUNCTION cc.get_sip_id(cca character varying, tgrp character varying, dhost character varying, usr character varying[])

  RETURNS integer AS

$BODY$

 

use Socket qw(getaddrinfo getnameinfo

              PF_UNSPEC SOCK_STREAM AI_NUMERICHOST NI_NAMEREQD NIx_NOSERV);

use URI;

 

sub is_local {

    my $host = shift(@_);

    my $result = 0;

    open my $fh, "/sbin/route get $host |";

    while (<$fh>) {

        if (m/interface/) {

            chomp;

            my @fields = split /\s+/;

            if ($fields[2] eq "lo0") {

                $result = 1;

            }

            last;

        }

    }

    close $fh;

    return $result;

}

 

my ($cca, $tgrp, $dhost, $usr) = @_;

 

$do_dns_lookup = 1;

{

    my $query = qq{

        SELECT sip_dns_lookup_on_incoming_requests FROM admin.system_options;

    };

    my $rv = spi_exec_query($query, 1);

    if ($rv->{status} =~ /^SPI_OK/ && $rv->{processed} > 0) {

        $do_dns_lookup = $rv->{rows}[0]->{sip_dns_lookup_on_incoming_requests};

    }

}

 

if ($tgrp ne '') {

    my $query = qq{

        SELECT port_id FROM cc.port_info WHERE destination_group_id = '$tgrp';

    };

    my $rv = spi_exec_query($query, 1);

    if ($rv->{status} =~ /^SPI_OK/ && $rv->{processed} > 0) {

        return $rv->{rows}[0]->{port_id};

    }

}

 

if ($cca ne '') {

    my $query = qq{

        SELECT port_id FROM cc.port_info WHERE call_control_agent = '$cca';

    };

    my $rv = spi_exec_query($query, 1);

    if ($rv->{status} =~ /^SPI_OK/ && $rv->{processed} > 0) {

        return $rv->{rows}[0]->{port_id};

    }

}

 

for my $uristr (@$usr) {

    if ($uristr ne '') {

        my $uri = URI->new($uristr);

        if (is_local($uri->host)) {

            $dhost = '';

            my $name = $uri->user;

            if ($name ne '') {

                my $query = qq{

                    SELECT port_id FROM cc.port_info

                    WHERE registration_user = '$name';

                };

                my $rv = spi_exec_query($query, 1);

                if ($rv->{status} =~ /^SPI_OK/ && $rv->{processed} > 0) {

                    return $rv->{rows}[0]->{port_id};

                }

            }

        }

    }

}

 

if ($dhost ne '') {

    $pattern = "[:@]" . quotemeta($dhost) . "(?::\\\d+)?(?:;.*)?\$";

    my $query = qq{

        SELECT port_id FROM cc.port_info

        WHERE port_address ~* '$pattern';

    };

    my $rv = spi_exec_query($query, 1);

    if ($rv->{status} =~ /^SPI_OK/ && $rv->{processed} > 0) {

        return $rv->{rows}[0]->{port_id};

    }

    if ($do_dns_lookup) {

        my %hints = (family => PF_UNSPEC, socktype => SOCK_STREAM, flags => AI_NUMERICHOST);

        $dhost_trimmed = ($dhost =~ m/\[(.*)\]/) ? $1 : $dhost;

        my ($err, @result) = getaddrinfo($dhost_trimmed, "", \%hints);

        if ($err == 0) {

            for my $ai (@result) {

                my ($err, $hostname) = getnameinfo($ai->{addr}, NI_NAMEREQD, NIx_NOSERV);

                if ($err == 0 && $hostname ne '') {

                    $pattern = "[:@]" . quotemeta($hostname) . "(?::\\\d+)?(?:;.*)?\$";

                    my $query = qq{

                        SELECT port_id FROM cc.port_info

                        WHERE port_address ~* '$pattern';

                    };

                    my $rv = spi_exec_query($query , 1);

                    if ($rv->{status} =~ /^SPI_OK/ && $rv->{processed} > 0) {

                        return $rv->{rows}[0]->{port_id};

                    }               

                }

            }

        }

    } else {

        if ($dhost eq "127.0.0.1") {

            $pattern = "[:@]localhost(?::\\\d+)?(?:;.*)?\$";

            my $query = qq{

                SELECT port_id FROM cc.port_info

                WHERE port_address ~* '$pattern';

            };

            my $rv = spi_exec_query($query , 1);

            if ($rv->{status} =~ /^SPI_OK/ && $rv->{processed} > 0) {

                return $rv->{rows}[0]->{port_id};

            }               

        }

    }

}

 

for my $uristr (@$usr) {

    if ($uristr ne '') {

        my $uri = URI->new($uristr);

        my $name = $uri->user;

        if ($name ne '') {

            my $query = qq{

                SELECT port_id FROM cc.port_info

                WHERE registration_user = '$name';

            };

            my $rv = spi_exec_query($query, 1);

            if ($rv->{status} =~ /^SPI_OK/ && $rv->{processed} > 0) {

                return $rv->{rows}[0]->{port_id};

            }

        }

    }

}

 

return 0;

$BODY$

  LANGUAGE plperlu VOLATILE

  COST 100;

ALTER FUNCTION cc.get_sip_id(character varying, character varying, character varying, character varying[])

  OWNER TO redcom;

 

 


Attachment: georgia_core_3.txt (11,088 bytes, 2014-12-01 11:27):

(gdb) bt full
#0  0x00000000006f50b0 in spg_range_quad_picksplit (fcinfo=<optimized out>) at rangetypes_spgist.c:233
        empty = 0 '\000'
        empty = 0 '\000'
        in = 0xf93a1
        typcache = 0x0
        i = <error reading variable i (Cannot access memory at address 0x0)>
        nonEmptyCount = <optimized out>
        centroid = <optimized out>
#1  0x0000000000705681 in perform_relmap_update (updates=<optimized out>, shared=0 '\000') at relmapper.c:842
        newmap = {magic = 7296347, num_mappings = 0, mappings = {{mapoid = 4294948304, mapfilenode = 32767}, {mapoid = 2679032113, mapfilenode = 0}, {mapoid = 4294948336, mapfilenode = 32767}, {
              mapoid = 7296186, mapfilenode = 0}, {mapoid = 0, mapfilenode = 0}, {mapoid = 43286576, mapfilenode = 8}, {mapoid = 4294948400, mapfilenode = 32767}, {mapoid = 0, mapfilenode = 0}, {
              mapoid = 4294948736, mapfilenode = 32767}, {mapoid = 7295065, mapfilenode = 0}, {mapoid = 4294967294, mapfilenode = 0}, {mapoid = 43297248, mapfilenode = 8}, {mapoid = 640,
              mapfilenode = 0}, {mapoid = 43297240, mapfilenode = 8}, {mapoid = 4088945280, mapfilenode = 0}, {mapoid = 218805760, mapfilenode = 8}, {mapoid = 0, mapfilenode = 196609}, {
              mapoid = 0, mapfilenode = 0}, {mapoid = 6913104, mapfilenode = 0}, {mapoid = 184, mapfilenode = 65538}, {mapoid = 2, mapfilenode = 0}, {mapoid = 0, mapfilenode = 0}, {
              mapoid = 42547296, mapfilenode = 8}, {mapoid = 0, mapfilenode = 0}, {mapoid = 1020833, mapfilenode = 0}, {mapoid = 0, mapfilenode = 196610}, {mapoid = 0, mapfilenode = 0}, {
              mapoid = 6824176, mapfilenode = 0}, {mapoid = 63, mapfilenode = 65538}, {mapoid = 2, mapfilenode = 0}, {mapoid = 0, mapfilenode = 0}, {mapoid = 42547296, mapfilenode = 8}, {
              mapoid = 0, mapfilenode = 0}, {mapoid = 15, mapfilenode = 0}, {mapoid = 0, mapfilenode = 196611}, {mapoid = 0, mapfilenode = 0}, {mapoid = 6698992, mapfilenode = 0}, {mapoid = 60,
              mapfilenode = 65538}, {mapoid = 2, mapfilenode = 0}, {mapoid = 0, mapfilenode = 0}, {mapoid = 42547296, mapfilenode = 8}, {mapoid = 0, mapfilenode = 0} <repeats 11 times>, {
              mapoid = 4294967295, mapfilenode = 0}, {mapoid = 218817224, mapfilenode = 8}, {mapoid = 218805768, mapfilenode = 8}, {mapoid = 22, mapfilenode = 0}, {mapoid = 15, mapfilenode = 0}, {
              mapoid = 1020833, mapfilenode = 0}, {mapoid = 4294948784, mapfilenode = 32767}, {mapoid = 7362177, mapfilenode = 0}, {mapoid = 0, mapfilenode = 0}, {mapoid = 218817224,
              mapfilenode = 8}}, crc = 226, pad = 0}
#2  AtEOXact_RelationMap (isCommit=<optimized out>) at relmapper.c:423
No locals.
#3  0x00000000005c3fe0 in cost_merge_append (path=0x80d0ae318, root=0xda8e0956, pathkeys=0xe02190c4, n_streams=<optimized out>, input_startup_cost=6.9533558069622868e-310,
    input_total_cost=1.7084015800703071e-313, tuples=6.9533558069255283e-310) at costsize.c:1383
        run_cost = <optimized out>
        startup_cost = <optimized out>
        N = 1.6996756523143625e-313
        comparison_cost = <optimized out>
        logN = <optimized out>
#4  0x00000000005c3f30 in cost_sort (path=0xf, root=0xda8e0956, pathkeys=0xe02190c4, input_cost=<optimized out>, tuples=<optimized out>, width=<optimized out>, comparison_cost=<optimized out>,
    sort_mem=<optimized out>, limit_tuples=<optimized out>) at costsize.c:1303
        mergeorder = 0
        npageaccesses = <optimized out>
        nruns = <optimized out>
        startup_cost = 1.6996987484999217e-313
        run_cost = <optimized out>
        input_bytes = <optimized out>
        output_bytes = <optimized out>
        sort_mem_bytes = <optimized out>
#5  0x00000000005bf6d3 in remove_gene (root=<optimized out>, gene=<optimized out>, edge=..., edge_table=<optimized out>) at geqo_erx.c:257
        possess_edge = <optimized out>
        genes_remaining = -534671164
#6  gimme_tour (root=0x7fffffffbb00, edge_table=0x16, new_gene=0x100000001, num_gene=1020833) at geqo_erx.c:208
        edge_failures = <error reading variable edge_failures (Cannot access memory at address 0x0)>
#7  0x00000000005be515 in copyParamList (from=0x1) at params.c:69
        typLen = 0
        typLen = 0
        oprm = 0x80294a9d8
        typByVal = 127 '\177'
        typByVal = 127 '\177'
        nprm = 0x80d0903a8
        size = <optimized out>
        retval = 0x80d0903a8
        i = <error reading variable i (Cannot access memory at address 0x0)>
#8  0x00000000005d6d1e in create_lateral_join_info (root=0x80d0abd58) at initsplan.c:547
        brel = 0x7
        lateral_referencers = <optimized out>
        lc = <optimized out>
        rti = <optimized out>
#9  0x00000000005d8b83 in query_planner (root=0x0, tlist=0x802ba2120, tuple_fraction=3.0451286481687515e-317, limit_tuples=0, qp_callback=0x7fffffffbbd0, qp_extra=0x80d0ab2f0,
    cheapest_path=0x80d0ab390, sorted_path=0x80d0ab358, num_groups=<optimized out>) at planmain.c:128
        parse = 0x80d090718
        joinlist = <optimized out>
        total_pages = <optimized out>
        final_rel = <optimized out>
        cheapestpath = <optimized out>
        sortedpath = <optimized out>
#10 0x00000000005d7c54 in distribute_qual_to_rels (root=0x80d0903a8, clause=0x802ba2120, is_deduced=0 '\000', below_outer_join=8 '\b', jointype=8, qualscope=0x802ba2120, ojscope=0x0,
    outerjoin_nonnullable=<optimized out>, deduced_nullable_relids=<optimized out>, postponed_qual_list=<optimized out>) at initsplan.c:1379
        relids = 0x802a00a68
        nullable_relids = 0x60
        relids = 0x802a00a68
        relids = 0x802a00a68
        relids = 0x802a00a68
        relids = 0x802a00a68
        relids = 0x802a00a68
        nullable_relids = 0x60
        pseudoconstant = <error reading variable pseudoconstant (Cannot access memory at address 0x0)>
        maybe_equivalence = <optimized out>
        outerjoin_delayed = <optimized out>
        restrictinfo = <optimized out>
#11 0x00000000005d705b in deconstruct_recurse (root=0x802a00a68, jtnode=0x0, below_outer_join=8 '\b', qualscope=0x0, inner_join_rels=0x802a00a68, postponed_qual_list=0x60) at initsplan.c:744
        sub_qualscope = 0x802ba2120
        sub_qualscope = 0x802ba2120
        sub_joinlist = 0x802a00a68
        sub_members = <optimized out>
        child_postponed_quals = 0x802b96080
        l = <optimized out>
        joinlist = <optimized out>
#12 0x000000000063ebed in mdtruncate (reln=<optimized out>, forknum=32767, nblocks=0) at md.c:921
        lastsegblocks = <optimized out>
        ov = <optimized out>
        curnblk = <optimized out>
        v = <optimized out>
#13 0x000000000063ecab in mdpostckpt () at md.c:1287
        absorb_counter = <optimized out>
#14 0x00000000006f88b5 in SearchCatCache (cache=0x63ebed <mdtruncate+461>, v1=<optimized out>, v2=<optimized out>, v3=<optimized out>, v4=140737488338336) at catcache.c:1117
        __atp = <optimized out>
        __test = <optimized out>
        __isnull = 0 '\000'
        __isnull = 0 '\000'
        __isnull = 0 '\000'
        __cur_keys = 0x802ba2120
        __cur_nkeys = <optimized out>
        cur_skey = {{sk_flags = 45752608, sk_attno = 8, sk_strategy = 0, sk_subtype = 218892072, sk_collation = 8, sk_func = {fn_addr = 0x60, fn_oid = 96, fn_nargs = 0, fn_strict = 0 '\000',
              fn_retset = 0 '\000', fn_stats = 104 'h', fn_extra = 0x0, fn_mcxt = 0x802a00a68, fn_expr = 0x0}, sk_argument = 34403781224}, {sk_flags = 218694424, sk_attno = 8, sk_strategy = 0,
            sk_subtype = 0, sk_collation = 0, sk_func = {fn_addr = 0x0, fn_oid = 4294950464, fn_nargs = 32767, fn_strict = 0 '\000', fn_retset = 0 '\000', fn_stats = 91 '[',
              fn_extra = 0x802b96830, fn_mcxt = 0x802ba2120, fn_expr = 0x802b96080}, sk_argument = 0}, {sk_flags = 45752608, sk_attno = 8, sk_strategy = 0, sk_subtype = 0, sk_collation = 0,
            sk_func = {fn_addr = 0x802a00a68, fn_oid = 0, fn_nargs = 0, fn_strict = 0 '\000', fn_retset = 0 '\000', fn_stats = 112 'p', fn_extra = 0x63ebed <mdtruncate+461>,
              fn_mcxt = 0x802ba2030, fn_expr = 0x0}, sk_argument = 34403781224}, {sk_flags = 0, sk_attno = 0, sk_strategy = 0, sk_subtype = 4294950560, sk_collation = 32767, sk_func = {
              fn_addr = 0x63ecab <mdpostckpt+11>, fn_oid = 45752432, fn_nargs = 8, fn_strict = 0 '\000', fn_retset = 0 '\000', fn_stats = 0 '\000', fn_extra = 0x802a00a68, fn_mcxt = 0x802b94830,
              fn_expr = 0x7fffffffbee0}, sk_argument = 7309493}}
        hashValue = 4294950304
        hashIndex = 218892072
        bucket = 0x802b94830
        relation = <optimized out>
        scandesc = <optimized out>
        ntp = <optimized out>
        ct = <optimized out>
#15 0x00000000006f835f in InitCatCache (id=-19952, reloid=3666741590, indexoid=3760296132, nkeys=43297240, key=0x7ff7fd6b3590, nbuckets=218838980) at catcache.c:730
        oldcxt = 0x802a00a68
#16 0x000000000059329c in _SPI_execute_plan (plan=0x802b94800, paramLI=0x802b94830, snapshot=<optimized out>, crosscheck_snapshot=0x802a00a01, read_only=0 '\000', fire_triggers=<optimized out>,
    tcount=<optimized out>) at spi.c:2194
        completionTag = "\240\277\377\377\377\177\000\000\234\062Y\000\000\000\000\000\060l\271\002\b\000\000\000\060H\271\002\b\000\000\000x\v\240\002\000\000\000\000h\n\240\002\b\000\000\000p\302
\377\377\377\177\000\000@=Y\000\000\000\000"
        canSetTag = <optimized out>
        dest = <optimized out>
        stmt_list = <optimized out>
        lc2 = <optimized out>
        spierrcontext = {previous = 0x8029f7140, callback = 0x80bc09448, arg = 0x802a00a68}
        spierrcontext = {previous = 0x8029f7140, callback = 0x80bc09448, arg = 0x802a00a68}
        my_res = <error reading variable my_res (Cannot access memory at address 0xe)>
        my_processed = <error reading variable my_processed (Cannot access memory at address 0x0)>
        my_lastoid = <error reading variable my_lastoid (Cannot access memory at address 0x0)>
        res = <error reading variable res (Cannot access memory at address 0x0)>
        pushed_active_snap = <error reading variable pushed_active_snap (Cannot access memory at address 0x0)>
        my_tuptable = <optimized out>
        cplan = 0x0
        lc1 = <optimized out>
#17 0x000000080b9078c5 in ?? ()
No symbol table info available.
#18 0x000000080b9083d0 in ?? ()
No symbol table info available.
#19 0x000000080bc09a00 in ?? ()
No symbol table info available.
#20 0x00007fffffffc288 in ?? ()
No symbol table info available.
#21 0x000000080bc09a70 in ?? ()
No symbol table info available.
#22 0x000000080bc098f8 in ?? ()
No symbol table info available.
#23 0x00007fffffffc288 in ?? ()
No symbol table info available.
#24 0x00007fffffffc150 in ?? ()
No symbol table info available.
#25 0x000000080b9083e5 in ?? ()
No symbol table info available.
#26 0x0000000000000000 in ?? ()
No symbol table info available.

 


Re: segmentation fault postgres 9.3.5 core dump perlu related ?

From
Tom Lane
Date:
"Day, David" <dday@redcom.com> writes:
> We are developing on and running Postgres 9.3.5 on FreeBSD 10.0-p12.
> We have been experiencing an intermittent postgres core dump which
> seems primarily to be associated with the two functions below.

> Given the onset of this problem, we suspect it has something to do with the addition of
> DNS lookup within our plperlu function cc.get_sip_id(...).

So this bit is new?

>     open my $fh, "/sbin/route get $host |";

I wonder if your version of Perl thinks this is sufficient license to go
multithreaded or something like that.  That could be problematic.  You
might try looking to see if a backend process that's successfully executed
this code now contains multiple threads.
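(On FreeBSD, for example, something along these lines should show whether a backend has grown extra threads; the PID here is just illustrative:

$ procstat -t 30187          # one line per thread (LWP) in the process
$ ps auxwwH | grep 30187     # H lists threads as separate rows

A normal backend should show exactly one thread.)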

It's difficult to offer much help on the basis of the info provided.
One comment is that the stack trace you show is completely nonsensical:
functions by those names do exist in PG, but the calling order shown
is impossible.  So it seems there's some problem in how you rebuilt
with debug symbols --- maybe the symbols being used don't match the
executable?  I'm not a FreeBSD user so I have no useful speculation
to offer about how such a mixup might occur on that platform.

            regards, tom lane


Re: segmentation fault postgres 9.3.5 core dump perlu related ?

From
"Day, David"
Date:
Tom,

Thanks very much for the feedback.

It is very likely that the date of the core was 'touched' to make the rebuilt
Postgres binary with symbols play nice with gdb.
Apparently, that was not a great idea based on your comments.

In any case we are better prepared to analyze it on the next instance.
Unfortunately the issue has been in remission since the thanksgiving holiday.

The combination of FreeBSD and postgres had been remarkably stable and
dependable until very recently. The bit of logic that we suspect is related
to the event was originally written in plpgsql; it needed access to
system-level information for which plpgsql had no built-in support.
I suspect the 'getaddrinfo', 'getnameinfo' and 'open' related statements.
The 'open' was the last piece added, so it bears the best correlation to the
problem onset.

In checking the thread counts for the backend processes that have executed this logic successfully, I only
see one thread per backend.

Pondering and awaiting an AHA moment. I'll keep the list apprised of any progress on the matter.

Best Regards


Dave Day





Re: segmentation fault postgres 9.3.5 core dump perlu related ?

From
Tom Lane
Date:
"Day, David" <dday@redcom.com> writes:
> It is very likely that the date of the core was 'touched' to make the rebuilt
> Postgres binary with symbols play nice with gdb.
> Apparently, that was not a great idea based on your comments.

Oh, so you rebuilt with debug enabled and then retrospectively tried to
use that executable with a core file from a previous executable?  Yeah,
I'm unsurprised that that didn't work :-( ... perhaps it would in an
ideal world, but it's unreliable in the real world.  Make sure you have
the debug-enabled build installed as the running server so the next
corefile can be examined meaningfully.

            regards, tom lane


Re: segmentation fault postgres 9.3.5 core dump perlu related ?

From
"Day, David"
Date:
It has been some time since we have seen this problem.
See earlier  message on this subject/thread  for the suspect  plperl function executing
at the time of the core.

Someone on our development team suggested it might relate to some build options of perl,
in particular MULTIPLICITY or THREADS. We can have this perl function executing on
two different connections/sessions at the same time. I intend to write some test scripts
that will increase the likelihood of this occurrence, to see if it makes the problem
more reproducible.
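A rough sketch of the kind of test I have in mind, using pgbench with a custom script
(the database name and argument values below are purely illustrative):

-- stress.sql: hammer the suspect function from concurrent sessions
SELECT * FROM cc.get_port_and_registration_data(
    '', 'tg1', '10.0.0.5', ARRAY['sip:alice@10.0.0.5']::varchar[], true);

$ pgbench -n -c 2 -j 2 -T 600 -f stress.sql ourdb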

I'll update again after completing some testing. Meanwhile, other thoughts and/or
confirmation that these build options should be enabled are welcome.

Thanks

Dave Day

I believe below is a valid stack dump:

Core was generated by `postgres'.
Program terminated with signal 11, Segmentation fault.
(gdb) bt
#0  0x000000080bfa50a3 in Perl_fbm_instr () from /usr/local/lib/perl5/5.18/mach/CORE/libperl.so.5.18
#1  0x000000080c00ff93 in Perl_re_intuit_start () from /usr/local/lib/perl5/5.18/mach/CORE/libperl.so.5.18
#2  0x000000080bfc27a2 in Perl_pp_match () from /usr/local/lib/perl5/5.18/mach/CORE/libperl.so.5.18
#3  0x000000080bfbe6a3 in Perl_runops_standard () from /usr/local/lib/perl5/5.18/mach/CORE/libperl.so.5.18
#4  0x000000080bf57bd8 in Perl_call_sv () from /usr/local/lib/perl5/5.18/mach/CORE/libperl.so.5.18
#5  0x000000080bcfb7c7 in plperl_call_perl_func () from /usr/local/lib/postgresql/plperl.so
#6  0x000000080bcf83c2 in plperl_call_handler () from /usr/local/lib/postgresql/plperl.so
#7  0x000000000057611f in ExecMakeTableFunctionResult ()
#8  0x000000000058b6c7 in ExecFunctionScan ()
#9  0x000000000057bab2 in ExecScan ()
#10 0x00000000005756b8 in ExecProcNode ()
#11 0x00000000005876a8 in ExecLimit ()
#12 0x0000000000575771 in ExecProcNode ()
#13 0x0000000000573630 in standard_ExecutorRun ()
#14 0x0000000000593294 in SPI_execute ()
#15 0x000000000059379c in SPI_execute_plan_with_paramlist ()
#16 0x00000008024f19bc in plpgsql_subxact_cb () from /usr/local/lib/postgresql/plpgsql.so
#17 0x00000008024ee909 in plpgsql_subxact_cb () from /usr/local/lib/postgresql/plpgsql.so
#18 0x00000008024eaf3b in plpgsql_exec_function () from /usr/local/lib/postgresql/plpgsql.so
#19 0x00000008024ea243 in plpgsql_exec_function () from /usr/local/lib/postgresql/plpgsql.so
#20 0x00000008024e6551 in plpgsql_call_handler () from /usr/local/lib/postgresql/plpgsql.so
#21 0x000000000057611f in ExecMakeTableFunctionResult ()
#22 0x000000000058b6c7 in ExecFunctionScan ()
#23 0x000000000057bab2 in ExecScan ()
#24 0x00000000005756b8 in ExecProcNode ()
#25 0x0000000000573630 in standard_ExecutorRun ()
#26 0x0000000000645b0a in PortalRun ()
#27 0x0000000000645719 in PortalRun ()
#28 0x00000000006438ea in PostgresMain ()
#29 0x00000000005ff267 in PostmasterMain ()
#30 0x00000000005a31ba in main ()

pkg info perl5
perl5-5.18.4_11
Name           : perl5
Version        : 5.18.4_11
Installed on   : Mon Jan  5 09:28:05 EST 2015
Origin         : lang/perl5.18
Architecture   : freebsd:10:x86:64
Prefix         : /usr/local
Categories     : perl5 lang devel
Licenses       : GPLv1 or ART10
Maintainer     : perl@FreeBSD.org
WWW            : http://www.perl.org/
Comment        : Practical Extraction and Report Language
Options        :
    DEBUG          : off
    GDBM           : off
    MULTIPLICITY   : off
    PERL_64BITINT  : on
    PERL_MALLOC    : off
    PTHREAD        : on
    SITECUSTOMIZE  : off
    THREADS        : off
    USE_PERL       : on
Shared Libs provided:
    libperl.so.5.18
Annotations    :
    cpe            : cpe:2.3:a:perl:perl:5.18.4:::::freebsd10:x64:11
    repo_type      : binary
    repository     : redcom
Flat size      : 49.2MiB
Description    :
Perl is a language that combines some of the features of C, sed, awk and
shell.  See the manual page for more hype.  There are also many books
published by O'Reilly & Assoc.  See pod/perlbook.pod for more
information.



Re: segmentation fault postgres 9.3.5 core dump perlu related ?

From
Alex Hunsaker
Date:

On Wed, Jan 28, 2015 at 1:23 PM, Day, David <dday@redcom.com> wrote:
It has been some time since we have seen this problem.
See earlier  message on this subject/thread  for the suspect  plperl function executing
at the time of the core.

Someone on our development team  suggested it might relate to some build options of perl.
In particular MULTIPLICITY or THREADS . We can have this perl fx executing on
two different connections/sessions at the same time.

Hrm, I can't see how >1 connections/sessions could tickle the bug. Or THREADS/MULTIPLICITY, short of some perl bug. Each backend is its own process, and so each perl interpreter is isolated from the others at that level. IOW each new connection has its very own perl interpreter that has no shared state with any of the others (short of using $_SHARED). But hey, if your testing finds it is easier to trigger with more connections, it just makes the bug more interesting :).

open as you use it should just be the standard pipe(); fork(); exec() dance. And I'm fairly certain perl does not do anything magic like making a thread behind the scenes. In gdb you could also try "info threads", just to see if somehow a thread did get created.
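If you want the shell out of the picture entirely, the list form of open runs the command directly; a rough, untested sketch of is_local along those lines:

sub is_local {
    my $host = shift;
    my $result = 0;
    # list-form pipe open: execs route(8) directly, no /bin/sh involved
    open(my $fh, '-|', '/sbin/route', 'get', $host) or return 0;
    while (<$fh>) {
        if (/interface:\s+(\S+)/) {
            $result = 1 if $1 eq 'lo0';
            last;
        }
    }
    close $fh;
    return $result;
}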
 
Multiplicity should only come into play if you use plperl and plperlu in the same session (without it, it should error out with "Cannot allocate multiple Perl interpreters on this platform").



I believe below is an valid stack dump:

Core was generated by `postgres'.
Program terminated with signal 11, Segmentation fault.
(gdb) bt
#0  0x000000080bfa50a3 in Perl_fbm_instr () from /usr/local/lib/perl5/5.18/mach/CORE/libperl.so.5.18
#1  0x000000080c00ff93 in Perl_re_intuit_start () from /usr/local/lib/perl5/5.18/mach/CORE/libperl.so.5.18
#2  0x000000080bfc27a2 in Perl_pp_match () from /usr/local/lib/perl5/5.18/mach/CORE/libperl.so.5.18

This sure makes it look like it is segfaulting on some kind of regex /not/ open.

Any chance you could come up with a reproducible test case? I suspect the inputs to the function might help narrow it down to something reproducible. Maybe log the arguments at the start of the function? Or perhaps, in your middleware, log how the function was called whenever the call crashes?
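Something along these lines right after the argument unpacking in cc.get_sip_id() would do it (a sketch; plperl's elog() should work for this):

# log the call arguments so a crashing invocation can be replayed later
elog(NOTICE, sprintf("get_sip_id(cca=%s, tgrp=%s, dhost=%s, usr=[%s])",
                     $cca, $tgrp, $dhost, join(',', @{$usr || []})));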

Re: segmentation fault postgres 9.3.5 core dump perlu related ?

From
"Day, David"
Date:

Hi Alex,

 

Thanks for your  input.

 

My initial simplistic stress test (two connections calling the same suspect function in a loop) has failed to cause the problem, albeit I have not used any range of inputs for the possible parameters. Given your thoughts on the internal mechanics, it seems unlikely it is competing sessions. I'll see about varying and logging arguments in future testing. Reproducing is 90% of the battle, and
unfortunately we are losing on that front currently.

 

When I type (gdb) info threads  on the most recent core file I see:

* 1 Thread 802c06400 (LWP 101353/postgres)  0x00000000005756b8 in ExecProcNode ()

Not sure that fits with your expectations.

 

We only have two invoked perl functions in the database, both of which are plperlu. These functions are
both invoked at least once in a normal usage scenario, which makes the infrequency of the segmentation fault puzzling.

 

 

Regards

 

 

Dave

 

 

 

 


 

Re: segmentation fault postgres 9.3.5 core dump perlu related ?

From
"Day, David"
Date:

I am amending the 'info threads' information: there are two threads.

 

I was using the wrong instance of the gdb debugger.

Program terminated with signal SIGSEGV, Segmentation fault.

 

(gdb) bt

#0  0x000000080bfa50a3 in Perl_fbm_instr () from /usr/local/lib/perl5/5.18/mach/CORE/libperl.so.5.18

#1  0x000000080c00ff93 in Perl_re_intuit_start () from /usr/local/lib/perl5/5.18/mach/CORE/libperl.so.5.18

#2  0x000000080bfc27a2 in Perl_pp_match () from /usr/local/lib/perl5/5.18/mach/CORE/libperl.so.5.18

#3  0x000000080bfbe6a3 in Perl_runops_standard () from /usr/local/lib/perl5/5.18/mach/CORE/libperl.so.5.18

#4  0x000000080bf57bd8 in Perl_call_sv () from /usr/local/lib/perl5/5.18/mach/CORE/libperl.so.5.18

#5  0x000000080bcfb7c7 in plperl_call_perl_func () from /usr/local/lib/postgresql/plperl.so

#6  0x000000080bcf83c2 in plperl_call_handler () from /usr/local/lib/postgresql/plperl.so

#7  0x000000000057611f in ExecMakeTableFunctionResult ()

#8  0x000000000058b6c7 in ?? ()

#9  0x000000000057bab2 in ExecScan ()

#10 0x00000000005756b8 in ExecProcNode ()

#11 0x00000000005876a8 in ExecLimit ()

#12 0x0000000000575771 in ExecProcNode ()

#13 0x0000000000573630 in standard_ExecutorRun ()

#14 0x0000000000593294 in ?? ()

#15 0x000000000059379c in SPI_execute_plan_with_paramlist ()

#16 0x00000008024f19bc in ?? () from /usr/local/lib/postgresql/plpgsql.so

#17 0x00000008024ee909 in ?? () from /usr/local/lib/postgresql/plpgsql.so

#18 0x00000008024eaf3b in ?? () from /usr/local/lib/postgresql/plpgsql.so

#19 0x00000008024ea243 in plpgsql_exec_function () from /usr/local/lib/postgresql/plpgsql.so

#20 0x00000008024e6551 in plpgsql_call_handler () from /usr/local/lib/postgresql/plpgsql.so

#21 0x000000000057611f in ExecMakeTableFunctionResult ()

#22 0x000000000058b6c7 in ?? ()

#23 0x000000000057bab2 in ExecScan ()

#24 0x00000000005756b8 in ExecProcNode ()

#25 0x0000000000573630 in standard_ExecutorRun ()

#26 0x0000000000645b0a in ?? ()

#27 0x0000000000645719 in PortalRun ()

#28 0x00000000006438ea in PostgresMain ()

#29 0x00000000005ff267 in PostmasterMain ()

#30 0x00000000005a31ba in main ()

(gdb) info thread

  Id   Target Id         Frame

* 2    Thread 802c06400 (LWP 101353) 0x000000080bfa50a3 in Perl_fbm_instr ()

   from /usr/local/lib/perl5/5.18/mach/CORE/libperl.so.5.18

* 1    Thread 802c06400 (LWP 101353) 0x000000080bfa50a3 in Perl_fbm_instr ()

   from /usr/local/lib/perl5/5.18/mach/CORE/libperl.so.5.18

 

 


 

Re: segmentation fault postgres 9.3.5 core dump perlu related ?

From
Tom Lane
Date:
"Day, David" <dday@redcom.com> writes:
> I am amending the info threads info there are two threads.

Well, that's your problem right there.  There should never, ever be more
than one thread in a Postgres backend process: none of the code in the
backend is meant for a multithreaded situation, and so there are no
interlocks on global variable access etc.

Presumably what is happening is that your plperlu function is somehow
managing to spawn an additional execution thread and let that return
control as well as the original thread.  You need to prevent that.

            regards, tom lane


Re: segmentation fault postgres 9.3.5 core dump perlu related ?

From
Alex Hunsaker
Date:


On Thu, Jan 29, 2015 at 8:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Day, David" <dday@redcom.com> writes:
> I am amending the info threads info there are two threads.

Well, that's your problem right there.  There should never, ever be more
than one thread in a Postgres backend process: none of the code in the
backend is meant for a multithreaded situation, and so there are no
interlocks on global variable access etc.

One thing you might try is setting a breakpoint on pthread_create (or perhaps clone?) and see if that gives any clues as to what is spawning the thread. If that doesn't help, I would try commenting out large chunks of the plperlu function until the breakpoint is not tripped, trying to find what line causes it. It might also be interesting to see what happens if you try with a non-thread-enabled perl -- but AFAICT nothing in cc.get_sip_id() should cause threads to be used. A very quick grep of the perl source seems to confirm this. Maybe something in the URI module?
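Roughly like this (attach to whichever backend you are testing in; the PID is illustrative):

$ gdb -p 30187
(gdb) break pthread_create
(gdb) continue
... now call the plperlu function from that session ...
(gdb) bt          # if the breakpoint trips, this shows what is creating the thread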

Re: segmentation fault postgres 9.3.5 core dump perlu related ?

From
"Day, David"
Date:

Thanks for the inputs. I'll attempt to apply them and will update when I have some new information.

 

Thanks

 

 

Dave

 


Re: segmentation fault postgres 9.3.5 core dump perlu related ?

From
Alex Hunsaker
Date:


On Thu, Jan 29, 2015 at 1:54 PM, Day, David <dday@redcom.com> wrote:

Thanks for the inputs,  I’ll attempt to apply it and will update when I have some new information. 

 


BTW a quick check would be to attach with gdb right after you connect, check info threads (there should be none), run the plperlu procedure (with the right arguments/calls to hit all the execution paths), check info threads again. If info threads now reports a thread, we are likely looking at the right plperlu code. It should just be a matter of commenting stuff out to deduce what makes the thread. If not, it could be that plperlu is not at fault and it's something else like an extension or some other procedure/pl.
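Concretely, something like this (the arguments to the function are illustrative; use whatever exercises your code paths):

-- session 1 (psql)
SELECT pg_backend_pid();      -- say it returns 30187

# session 2
$ gdb -p 30187
(gdb) info threads            # expect a single thread here
(gdb) continue

-- back in session 1
SELECT cc.get_sip_id('', '', '10.0.0.5', ARRAY['sip:alice@10.0.0.5']::varchar[]);

# session 2: interrupt with Ctrl-C, then
(gdb) info threads            # see if anything new appeared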

Re: segmentation fault postgres 9.3.5 core dump perlu related ?

From
"Day, David"
Date:

Alex,

 

I tried as you suggested; I believe the gdb debugger is giving some false indication about threads.
Whether I attach to a newly launched backend or a backend that has been executing the suspect plperlu function,
the "info threads" result is two. Suspiciously, they are both at the same location.

 

e.g.

 

* 2    Thread 802c06400 (LWP 101353) 0x000000080bfa50a3 in Perl_fbm_instr ()

   from /usr/local/lib/perl5/5.18/mach/CORE/libperl.so.5.18

* 1    Thread 802c06400 (LWP 101353) 0x000000080bfa50a3 in Perl_fbm_instr ()

   from /usr/local/lib/perl5/5.18/mach/CORE/libperl.so.5.18

 

That seemed odd to me. If we use 'top' or 'ps axuwwH' to get a thread count for
a given process, the indication is only one thread for the same situations.

 

I am now pursuing a different causal hypothesis. There are instances of another
segmentation fault that do not involve this perl function. Rather, it is a function that
is also called regularly, even on a basically idle system. Therefore it is perhaps happenstance as
to which kind might happen. I believe this may relate to our update process.

 

Product developers are frequently updating (daily) environments/packages while postgres, and possibly our application, are running. I am thinking this update process is not properly coordinating with a running postgres and may result in occasional
shared library issues. This thought is consistent with the fact that our production testers, who use the same update script but update
at a much lower frequency, almost never see this segmentation fault problem.

 

I’ll attempt some scripts changes and meanwhile ask the developers to make observations that would support this idea.

 

I’ll update the thread with the future observations/outcome.

Possibly changing the subject to careless developers cause segmentation fault

 

 

Thanks for your assistance on this matter.

 

 

Dave

 

 


Re: segmentation fault postgres 9.3.5 core dump perlu related ?

From
Alex Hunsaker
Date:


On Fri, Jan 30, 2015 at 9:54 AM, Day, David <dday@redcom.com> wrote:

Alex,

 

I tried as you suggested,  I believe the gdb debugger is giving some false indication about threads.

Whether I attach to a newly launched  backend or a backend that has been executing the suspect perlu function.

The “info threads” result is two.  Suspiciously  they are both at the same location.



Curious, hrm, well, assuming gdb isn't lying about threads -- I think that would point to an extension or an external library (shared_preload_libraries or local_preload_libraries).
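A quick way to see what is configured there, from any session:

SHOW shared_preload_libraries;
SHOW local_preload_libraries;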

Does info threads on the postmaster also report threads?