Re: [HACKERS] logical replication launcher crash on buildfarm - Mailing list pgsql-hackers

From Petr Jelinek
Subject Re: [HACKERS] logical replication launcher crash on buildfarm
Date
Msg-id 368e64f6-ee9a-f09f-82b4-a33b61b28d36@2ndquadrant.com
In response to Re: [HACKERS] logical replication launcher crash on buildfarm  (Andres Freund <andres@anarazel.de>)
Responses Re: [HACKERS] logical replication launcher crash on buildfarm  (Robert Haas <robertmhaas@gmail.com>)
Re: logical replication launcher crash on buildfarm  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On 16/03/17 09:53, Andres Freund wrote:
> On 2017-03-16 09:40:48 +0100, Petr Jelinek wrote:
>> On 16/03/17 04:42, Andres Freund wrote:
>>> On 2017-03-15 20:28:33 -0700, Andres Freund wrote:
>>>> Hi,
>>>>
>>>> I just unstuck a bunch of my buildfarm animals.  That triggered some
>>>> spurious failures (on piculet, calliphoridae, mylodon), but also one
>>>> that doesn't really look like that:
>>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2017-03-16%2002%3A40%3A03
>>>>
>>>> with the pertinent point being:
>>>>
>>>> ================== stack trace: pgsql.build/src/test/regress/tmp_check/data/core ==================
>>>> [New LWP 1894]
>>>> [Thread debugging using libthread_db enabled]
>>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
>>>> Core was generated by `postgres: bgworker: logical replication launcher                '.
>>>> Program terminated with signal SIGSEGV, Segmentation fault.
>>>> #0  0x000055e265bff5e3 in ?? ()
>>>> #0  0x000055e265bff5e3 in ?? ()
>>>> #1  0x000055d3ccabed0d in StartBackgroundWorker () at /home/andres/build/buildfarm-culicidae/HEAD/pgsql.build/../pgsql/src/backend/postmaster/bgworker.c:792
>>>> #2  0x000055d3ccacf4fc in SubPostmasterMain (argc=3, argv=0x55d3cdbb71c0) at /home/andres/build/buildfarm-culicidae/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:4878
>>>> #3  0x000055d3cca443ea in main (argc=3, argv=0x55d3cdbb71c0) at /home/andres/build/buildfarm-culicidae/HEAD/pgsql.build/../pgsql/src/backend/main/main.c:205
>>>>
>>>> it's possible that me killing things and upgrading caused this, but
>>>> given this is a backend running EXEC_BACKEND, I'm a bit suspicious that
>>>> it's more than that.  The machine is a bit backed up at the moment, so
>>>> it'll probably be a while till it's at that animal/branch again,
>>>> otherwise I'd not have mentioned this.
>>>
>>> For some reason it ran again pretty soon. And I'm afraid it's indeed an
>>> issue:
>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2017-03-16%2003%3A30%3A02
>>>
>>
>> Hmm, I tried with EXEC_BACKEND (and with --disable-spinlocks) and it
>> seems to work fine on my two machines. I don't see anything else
>> different on culicidae though. Sadly the backtrace is not that
>> informative either. I'll try to investigate more but it will take time...
> 
> Worthwhile additional failure:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2017-03-16%2002%3A55%3A01
> 
> Same animal, also EXEC_BACKEND, but 9.6.
> 
> A quick look at the relevant line:
>     /*
>      * If bgw_main is set, we use that value as the initial entrypoint.
>      * However, if the library containing the entrypoint wasn't loaded at
>      * postmaster startup time, passing it as a direct function pointer is not
>      * possible.  To work around that, we allow callers for whom a function
>      * pointer is not available to pass a library name (which will be loaded,
>      * if necessary) and a function name (which will be looked up in the named
>      * library).
>      */
>     if (worker->bgw_main != NULL)
>         entrypt = worker->bgw_main;
> 
> makes the issue clear - we appear to be assuming that bgw_main is
> meaningful across processes.  Which it isn't in the EXEC_BACKEND case
> when ASLR is in use...
> 
> This kinda sounds familiar, but a quick google search doesn't find
> anything relevant.

Hmm, now that you mention it, I remember discussing something similar
with you last year in Dallas with regard to parallel query. IIRC Windows
should not have this problem, but other systems with EXEC_BACKEND do.
I don't remember the details though.
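
To make the failure mode concrete: a raw function pointer such as bgw_main
is only valid in the address space that produced it, so a child started
via exec() with ASLR may find the function at a different address. One way
around that, along the lines of the "library name plus function name"
mechanism described in the quoted comment, is to pass a name and resolve it
again in the child. Below is a minimal sketch of that idea, not PostgreSQL
code. Every identifier in it (worker_registration, lookup_entrypoint,
start_worker, launcher_main, apply_main) is hypothetical and chosen only
for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical entry point signature, loosely modeled on bgw_main. */
typedef void (*worker_main_fn)(int arg);

/* Records the last call so the effect of a dispatch can be observed. */
static int last_arg_seen = 0;

static void launcher_main(int arg) { last_arg_seen = arg; }
static void apply_main(int arg)    { last_arg_seen = -arg; }

/*
 * Table of known entry points.  It is built in each process from that
 * process's own symbol addresses, so it stays correct even if the binary
 * is mapped at a different base address after exec().
 */
struct entrypoint
{
    const char     *name;
    worker_main_fn  fn;
};

static const struct entrypoint known_entrypoints[] = {
    {"launcher_main", launcher_main},
    {"apply_main", apply_main},
};

/* Resolve a function name to an address valid in the current process. */
static worker_main_fn
lookup_entrypoint(const char *name)
{
    size_t i;
    size_t n = sizeof(known_entrypoints) / sizeof(known_entrypoints[0]);

    for (i = 0; i < n; i++)
    {
        if (strcmp(known_entrypoints[i].name, name) == 0)
            return known_entrypoints[i].fn;
    }
    return NULL;
}

/*
 * What the parent hands to the child (for example via shared memory):
 * a function name and an argument, never a raw pointer.
 */
struct worker_registration
{
    char function_name[64];
    int  arg;
};

/* Fill in a registration; the name must fit in the fixed-size buffer. */
static void
register_worker(struct worker_registration *reg, const char *name, int arg)
{
    assert(strlen(name) < sizeof(reg->function_name));
    strcpy(reg->function_name, name);
    reg->arg = arg;
}

/* Child side: look the name up locally, then call it.  Returns -1 if unknown. */
static int
start_worker(const struct worker_registration *reg)
{
    worker_main_fn fn = lookup_entrypoint(reg->function_name);

    if (fn == NULL)
        return -1;
    fn(reg->arg);
    return 0;
}
```

With this shape the parent would call register_worker() and the child
would call start_worker(); an unknown name is reported as an error instead
of jumping to a stale address.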

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


