Thread: [HACKERS] logical replication launcher crash on buildfarm

[HACKERS] logical replication launcher crash on buildfarm

From
Andres Freund
Date:
Hi,

I just unstuck a bunch of my buildfarm animals.  That triggered some
spurious failures (on piculet, calliphoridae, mylodon), but also one
that doesn't really look like that:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2017-03-16%2002%3A40%3A03

with the pertinent point being:

================== stack trace: pgsql.build/src/test/regress/tmp_check/data/core ==================
[New LWP 1894]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: bgworker: logical replication launcher                '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000055e265bff5e3 in ?? ()
#0  0x000055e265bff5e3 in ?? ()
#1  0x000055d3ccabed0d in StartBackgroundWorker () at
/home/andres/build/buildfarm-culicidae/HEAD/pgsql.build/../pgsql/src/backend/postmaster/bgworker.c:792
#2  0x000055d3ccacf4fc in SubPostmasterMain (argc=3, argv=0x55d3cdbb71c0) at
/home/andres/build/buildfarm-culicidae/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:4878
#3  0x000055d3cca443ea in main (argc=3, argv=0x55d3cdbb71c0) at
/home/andres/build/buildfarm-culicidae/HEAD/pgsql.build/../pgsql/src/backend/main/main.c:205

it's possible that me killing things and upgrading caused this, but
given this is a backend running EXEC_BACKEND, I'm a bit suspicous that
it's more than that.  The machine is a bit backed up at the moment, so
it'll probably be a while till it's at that animal/branch again,
otherwise I'd not have mentioned this.

Greetings,

Andres Freund



Re: [HACKERS] logical replication launcher crash on buildfarm

From
Andres Freund
Date:
On 2017-03-15 20:28:33 -0700, Andres Freund wrote:
> Hi,
> 
> I just unstuck a bunch of my buildfarm animals.  That triggered some
> spurious failures (on piculet, calliphoridae, mylodon), but also one
> that doesn't really look like that:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2017-03-16%2002%3A40%3A03
> 
> with the pertinent point being:
> 
> ================== stack trace: pgsql.build/src/test/regress/tmp_check/data/core ==================
> [New LWP 1894]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> Core was generated by `postgres: bgworker: logical replication launcher                '.
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0x000055e265bff5e3 in ?? ()
> #0  0x000055e265bff5e3 in ?? ()
> #1  0x000055d3ccabed0d in StartBackgroundWorker () at
/home/andres/build/buildfarm-culicidae/HEAD/pgsql.build/../pgsql/src/backend/postmaster/bgworker.c:792
> #2  0x000055d3ccacf4fc in SubPostmasterMain (argc=3, argv=0x55d3cdbb71c0) at
/home/andres/build/buildfarm-culicidae/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:4878
> #3  0x000055d3cca443ea in main (argc=3, argv=0x55d3cdbb71c0) at
/home/andres/build/buildfarm-culicidae/HEAD/pgsql.build/../pgsql/src/backend/main/main.c:205
> 
> it's possible that me killing things and upgrading caused this, but
> given this is a backend running EXEC_BACKEND, I'm a bit suspicous that
> it's more than that.  The machine is a bit backed up at the moment, so
> it'll probably be a while till it's at that animal/branch again,
> otherwise I'd not have mentioned this.

For some reason it ran again pretty soon. And I'm afraid it's indeed an
issue:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2017-03-16%2003%3A30%3A02

- Andres



Re: [HACKERS] logical replication launcher crash on buildfarm

From
Petr Jelinek
Date:
On 16/03/17 04:42, Andres Freund wrote:
> On 2017-03-15 20:28:33 -0700, Andres Freund wrote:
>> Hi,
>>
>> I just unstuck a bunch of my buildfarm animals.  That triggered some
>> spurious failures (on piculet, calliphoridae, mylodon), but also one
>> that doesn't really look like that:
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2017-03-16%2002%3A40%3A03
>>
>> with the pertinent point being:
>>
>> ================== stack trace: pgsql.build/src/test/regress/tmp_check/data/core ==================
>> [New LWP 1894]
>> [Thread debugging using libthread_db enabled]
>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
>> Core was generated by `postgres: bgworker: logical replication launcher                '.
>> Program terminated with signal SIGSEGV, Segmentation fault.
>> #0  0x000055e265bff5e3 in ?? ()
>> #0  0x000055e265bff5e3 in ?? ()
>> #1  0x000055d3ccabed0d in StartBackgroundWorker () at
/home/andres/build/buildfarm-culicidae/HEAD/pgsql.build/../pgsql/src/backend/postmaster/bgworker.c:792
>> #2  0x000055d3ccacf4fc in SubPostmasterMain (argc=3, argv=0x55d3cdbb71c0) at
/home/andres/build/buildfarm-culicidae/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:4878
>> #3  0x000055d3cca443ea in main (argc=3, argv=0x55d3cdbb71c0) at
/home/andres/build/buildfarm-culicidae/HEAD/pgsql.build/../pgsql/src/backend/main/main.c:205
>>
>> it's possible that me killing things and upgrading caused this, but
>> given this is a backend running EXEC_BACKEND, I'm a bit suspicous that
>> it's more than that.  The machine is a bit backed up at the moment, so
>> it'll probably be a while till it's at that animal/branch again,
>> otherwise I'd not have mentioned this.
> 
> For some reason it ran again pretty soon. And I'm afraid it's indeed an
> issue:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2017-03-16%2003%3A30%3A02
> 

Hmm, I tried with EXEC_BACKEND (and with --disable-spinlocks) and it
seems to work fine on my two machines. I don't see anything else
different on culicidae though. Sadly the backtrace is not that
informative either. I'll try to investigate more but it will take time...

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services




Re: [HACKERS] logical replication launcher crash on buildfarm

From
Andres Freund
Date:
On 2017-03-16 09:40:48 +0100, Petr Jelinek wrote:
> On 16/03/17 04:42, Andres Freund wrote:
> > On 2017-03-15 20:28:33 -0700, Andres Freund wrote:
> >> Hi,
> >>
> >> I just unstuck a bunch of my buildfarm animals.  That triggered some
> >> spurious failures (on piculet, calliphoridae, mylodon), but also one
> >> that doesn't really look like that:
> >> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2017-03-16%2002%3A40%3A03
> >>
> >> with the pertinent point being:
> >>
> >> ================== stack trace: pgsql.build/src/test/regress/tmp_check/data/core ==================
> >> [New LWP 1894]
> >> [Thread debugging using libthread_db enabled]
> >> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> >> Core was generated by `postgres: bgworker: logical replication launcher                '.
> >> Program terminated with signal SIGSEGV, Segmentation fault.
> >> #0  0x000055e265bff5e3 in ?? ()
> >> #0  0x000055e265bff5e3 in ?? ()
> >> #1  0x000055d3ccabed0d in StartBackgroundWorker () at
/home/andres/build/buildfarm-culicidae/HEAD/pgsql.build/../pgsql/src/backend/postmaster/bgworker.c:792
> >> #2  0x000055d3ccacf4fc in SubPostmasterMain (argc=3, argv=0x55d3cdbb71c0) at
/home/andres/build/buildfarm-culicidae/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:4878
> >> #3  0x000055d3cca443ea in main (argc=3, argv=0x55d3cdbb71c0) at
/home/andres/build/buildfarm-culicidae/HEAD/pgsql.build/../pgsql/src/backend/main/main.c:205
> >>
> >> it's possible that me killing things and upgrading caused this, but
> >> given this is a backend running EXEC_BACKEND, I'm a bit suspicous that
> >> it's more than that.  The machine is a bit backed up at the moment, so
> >> it'll probably be a while till it's at that animal/branch again,
> >> otherwise I'd not have mentioned this.
> > 
> > For some reason it ran again pretty soon. And I'm afraid it's indeed an
> > issue:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2017-03-16%2003%3A30%3A02
> > 
> 
> Hmm, I tried with EXEC_BACKEND (and with --disable-spinlocks) and it
> seems to work fine on my two machines. I don't see anything else
> different on culicidae though. Sadly the backtrace is not that
> informative either. I'll try to investigate more but it will take time...

I can give you a login to that machine, it doesn't do anything but run
buildfarm animals...  Will have to be my tomorrow however.

(Also need to fix config for older branches that don't work with
the upgraded ssl. This is a really bad situation :()

Greetings,

Andres Freund



Re: [HACKERS] logical replication launcher crash on buildfarm

From
Petr Jelinek
Date:
On 16/03/17 09:44, Andres Freund wrote:
> On 2017-03-16 09:40:48 +0100, Petr Jelinek wrote:
>> On 16/03/17 04:42, Andres Freund wrote:
>>> On 2017-03-15 20:28:33 -0700, Andres Freund wrote:
>>>> Hi,
>>>>
>>>> I just unstuck a bunch of my buildfarm animals.  That triggered some
>>>> spurious failures (on piculet, calliphoridae, mylodon), but also one
>>>> that doesn't really look like that:
>>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2017-03-16%2002%3A40%3A03
>>>>
>>>> with the pertinent point being:
>>>>
>>>> ================== stack trace: pgsql.build/src/test/regress/tmp_check/data/core ==================
>>>> [New LWP 1894]
>>>> [Thread debugging using libthread_db enabled]
>>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
>>>> Core was generated by `postgres: bgworker: logical replication launcher                '.
>>>> Program terminated with signal SIGSEGV, Segmentation fault.
>>>> #0  0x000055e265bff5e3 in ?? ()
>>>> #0  0x000055e265bff5e3 in ?? ()
>>>> #1  0x000055d3ccabed0d in StartBackgroundWorker () at
/home/andres/build/buildfarm-culicidae/HEAD/pgsql.build/../pgsql/src/backend/postmaster/bgworker.c:792
>>>> #2  0x000055d3ccacf4fc in SubPostmasterMain (argc=3, argv=0x55d3cdbb71c0) at
/home/andres/build/buildfarm-culicidae/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:4878
>>>> #3  0x000055d3cca443ea in main (argc=3, argv=0x55d3cdbb71c0) at
/home/andres/build/buildfarm-culicidae/HEAD/pgsql.build/../pgsql/src/backend/main/main.c:205
>>>>
>>>> it's possible that me killing things and upgrading caused this, but
>>>> given this is a backend running EXEC_BACKEND, I'm a bit suspicous that
>>>> it's more than that.  The machine is a bit backed up at the moment, so
>>>> it'll probably be a while till it's at that animal/branch again,
>>>> otherwise I'd not have mentioned this.
>>>
>>> For some reason it ran again pretty soon. And I'm afraid it's indeed an
>>> issue:
>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2017-03-16%2003%3A30%3A02
>>>
>>
>> Hmm, I tried with EXEC_BACKEND (and with --disable-spinlocks) and it
>> seems to work fine on my two machines. I don't see anything else
>> different on culicidae though. Sadly the backtrace is not that
>> informative either. I'll try to investigate more but it will take time...
> 
> I can give you a login to that machine, it doesn't do anything but run
> buildfarm animals...  Will have to be my tomorrow however.
> 

That would be helpful, thanks.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: [HACKERS] logical replication launcher crash on buildfarm

From
Andres Freund
Date:
On 2017-03-16 09:40:48 +0100, Petr Jelinek wrote:
> On 16/03/17 04:42, Andres Freund wrote:
> > On 2017-03-15 20:28:33 -0700, Andres Freund wrote:
> >> Hi,
> >>
> >> I just unstuck a bunch of my buildfarm animals.  That triggered some
> >> spurious failures (on piculet, calliphoridae, mylodon), but also one
> >> that doesn't really look like that:
> >> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2017-03-16%2002%3A40%3A03
> >>
> >> with the pertinent point being:
> >>
> >> ================== stack trace: pgsql.build/src/test/regress/tmp_check/data/core ==================
> >> [New LWP 1894]
> >> [Thread debugging using libthread_db enabled]
> >> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> >> Core was generated by `postgres: bgworker: logical replication launcher                '.
> >> Program terminated with signal SIGSEGV, Segmentation fault.
> >> #0  0x000055e265bff5e3 in ?? ()
> >> #0  0x000055e265bff5e3 in ?? ()
> >> #1  0x000055d3ccabed0d in StartBackgroundWorker () at
/home/andres/build/buildfarm-culicidae/HEAD/pgsql.build/../pgsql/src/backend/postmaster/bgworker.c:792
> >> #2  0x000055d3ccacf4fc in SubPostmasterMain (argc=3, argv=0x55d3cdbb71c0) at
/home/andres/build/buildfarm-culicidae/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:4878
> >> #3  0x000055d3cca443ea in main (argc=3, argv=0x55d3cdbb71c0) at
/home/andres/build/buildfarm-culicidae/HEAD/pgsql.build/../pgsql/src/backend/main/main.c:205
> >>
> >> it's possible that me killing things and upgrading caused this, but
> >> given this is a backend running EXEC_BACKEND, I'm a bit suspicous that
> >> it's more than that.  The machine is a bit backed up at the moment, so
> >> it'll probably be a while till it's at that animal/branch again,
> >> otherwise I'd not have mentioned this.
> > 
> > For some reason it ran again pretty soon. And I'm afraid it's indeed an
> > issue:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2017-03-16%2003%3A30%3A02
> > 
> 
> Hmm, I tried with EXEC_BACKEND (and with --disable-spinlocks) and it
> seems to work fine on my two machines. I don't see anything else
> different on culicidae though. Sadly the backtrace is not that
> informative either. I'll try to investigate more but it will take time...

Worthwhile additional failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2017-03-16%2002%3A55%3A01

Same animal, also EXEC_BACKEND, but 9.6.

A quick look at the relevant line:/* * If bgw_main is set, we use that value as the initial entrypoint. * However, if
thelibrary containing the entrypoint wasn't loaded at * postmaster startup time, passing it as a direct function
pointeris not * possible.  To work around that, we allow callers for whom a function * pointer is not available to pass
alibrary name (which will be loaded, * if necessary) and a function name (which will be looked up in the named *
library).*/if (worker->bgw_main != NULL)    entrypt = worker->bgw_main;
 

makes the issue clear - we appear to be assuming that bgw_main is
meaningful across processes.  Which it isn't in the EXEC_BACKEND case
when ASLR is in use...

This kinda sounds familiar, but a quick google search doesn't find
anything relevant.

Greetings,

Andres Freund



Re: [HACKERS] logical replication launcher crash on buildfarm

From
Petr Jelinek
Date:
On 16/03/17 09:53, Andres Freund wrote:
> On 2017-03-16 09:40:48 +0100, Petr Jelinek wrote:
>> On 16/03/17 04:42, Andres Freund wrote:
>>> On 2017-03-15 20:28:33 -0700, Andres Freund wrote:
>>>> Hi,
>>>>
>>>> I just unstuck a bunch of my buildfarm animals.  That triggered some
>>>> spurious failures (on piculet, calliphoridae, mylodon), but also one
>>>> that doesn't really look like that:
>>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2017-03-16%2002%3A40%3A03
>>>>
>>>> with the pertinent point being:
>>>>
>>>> ================== stack trace: pgsql.build/src/test/regress/tmp_check/data/core ==================
>>>> [New LWP 1894]
>>>> [Thread debugging using libthread_db enabled]
>>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
>>>> Core was generated by `postgres: bgworker: logical replication launcher                '.
>>>> Program terminated with signal SIGSEGV, Segmentation fault.
>>>> #0  0x000055e265bff5e3 in ?? ()
>>>> #0  0x000055e265bff5e3 in ?? ()
>>>> #1  0x000055d3ccabed0d in StartBackgroundWorker () at
/home/andres/build/buildfarm-culicidae/HEAD/pgsql.build/../pgsql/src/backend/postmaster/bgworker.c:792
>>>> #2  0x000055d3ccacf4fc in SubPostmasterMain (argc=3, argv=0x55d3cdbb71c0) at
/home/andres/build/buildfarm-culicidae/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:4878
>>>> #3  0x000055d3cca443ea in main (argc=3, argv=0x55d3cdbb71c0) at
/home/andres/build/buildfarm-culicidae/HEAD/pgsql.build/../pgsql/src/backend/main/main.c:205
>>>>
>>>> it's possible that me killing things and upgrading caused this, but
>>>> given this is a backend running EXEC_BACKEND, I'm a bit suspicous that
>>>> it's more than that.  The machine is a bit backed up at the moment, so
>>>> it'll probably be a while till it's at that animal/branch again,
>>>> otherwise I'd not have mentioned this.
>>>
>>> For some reason it ran again pretty soon. And I'm afraid it's indeed an
>>> issue:
>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2017-03-16%2003%3A30%3A02
>>>
>>
>> Hmm, I tried with EXEC_BACKEND (and with --disable-spinlocks) and it
>> seems to work fine on my two machines. I don't see anything else
>> different on culicidae though. Sadly the backtrace is not that
>> informative either. I'll try to investigate more but it will take time...
> 
> Worthwhile additional failure:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2017-03-16%2002%3A55%3A01
> 
> Same animal, also EXEC_BACKEND, but 9.6.
> 
> A quick look at the relevant line:
>     /*
>      * If bgw_main is set, we use that value as the initial entrypoint.
>      * However, if the library containing the entrypoint wasn't loaded at
>      * postmaster startup time, passing it as a direct function pointer is not
>      * possible.  To work around that, we allow callers for whom a function
>      * pointer is not available to pass a library name (which will be loaded,
>      * if necessary) and a function name (which will be looked up in the named
>      * library).
>      */
>     if (worker->bgw_main != NULL)
>         entrypt = worker->bgw_main;
> 
> makes the issue clear - we appear to be assuming that bgw_main is
> meaningful across processes.  Which it isn't in the EXEC_BACKEND case
> when ASLR is in use...
> 
> This kinda sounds familiar, but a quick google search doesn't find
> anything relevant.

Hmm now that you mention it, I remember discussing something similar
with you last year in Dallas in regards to parallel query. IIRC Windows
should not have this problem but other systems with EXEC_BACKEND do.
Don't remember the details though.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: [HACKERS] logical replication launcher crash on buildfarm

From
Robert Haas
Date:
On Thu, Mar 16, 2017 at 5:13 AM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:
> Hmm now that you mention it, I remember discussing something similar
> with you last year in Dallas in regards to parallel query. IIRC Windows
> should not have this problem but other systems with EXEC_BACKEND do.
> Don't remember the details though.

Generally, extension code can't use bgw_main safely, and must use
bgw_library_name and bgw_function_name.  But bgw_main is supposedly
safe for core code.  If it's not even safe there, then I guess we
should remove it entirely as a useless foot-gun.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] logical replication launcher crash on buildfarm

From
Andres Freund
Date:
On 2017-03-16 09:27:59 -0400, Robert Haas wrote:
> On Thu, Mar 16, 2017 at 5:13 AM, Petr Jelinek
> <petr.jelinek@2ndquadrant.com> wrote:
> > Hmm now that you mention it, I remember discussing something similar
> > with you last year in Dallas in regards to parallel query. IIRC Windows
> > should not have this problem but other systems with EXEC_BACKEND do.
> > Don't remember the details though.
> 
> Generally, extension code can't use bgw_main safely, and must use
> bgw_library_name and bgw_function_name.  But bgw_main is supposedly
> safe for core code.

I indeed think it's not safe, and it's going to get less and less safe
on windows (or EXEC_BACKEND).  I don't think we can afford to disable
ASLR in the long run (I indeed supect that'll just be disallowed at some
point), and that's the only thing making it safe-ish in combination with
EXEC_BACKEND.


> If it's not even safe there, then I guess we should remove it entirely
> as a useless foot-gun.

I indeed think that's the right consequence.  One question is what to
replace it with exactly - are we guaranteed we can dynamically lookup
symbols by name in the main binary on every platform?  Alternatively we
can just hardcode a bunch of bgw_function_name values that are matched
to specific functions if bgw_library_name is NULL - I suspect that'd be
the easiest / least worrysome portability-wise.

Greetings,

Andres Freund



Re: [HACKERS] logical replication launcher crash on buildfarm

From
Robert Haas
Date:
On Thu, Mar 16, 2017 at 2:55 PM, Andres Freund <andres@anarazel.de> wrote:
> I indeed think it's not safe, and it's going to get less and less safe
> on windows (or EXEC_BACKEND).  I don't think we can afford to disable
> ASLR in the long run (I indeed supect that'll just be disallowed at some
> point), and that's the only thing making it safe-ish in combination with
> EXEC_BACKEND.

Ugh.

>> If it's not even safe there, then I guess we should remove it entirely
>> as a useless foot-gun.
>
> I indeed think that's the right consequence.  One question is what to
> replace it with exactly - are we guaranteed we can dynamically lookup
> symbols by name in the main binary on every platform?

I don't know the answer to that question.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] logical replication launcher crash on buildfarm

From
Peter Eisentraut
Date:
On 3/16/17 14:55, Andres Freund wrote:
> I indeed think that's the right consequence.  One question is what to
> replace it with exactly - are we guaranteed we can dynamically lookup
> symbols by name in the main binary on every platform?

I think there is probably a way to do this on all platforms.  But it
seems that at least the Windows port of pg_dlopen would need to be
updated to support this.

> Alternatively we
> can just hardcode a bunch of bgw_function_name values that are matched
> to specific functions if bgw_library_name is NULL - I suspect that'd be
> the easiest / least worrysome portability-wise.

Basically a variant of fmgrtab, which addresses the same sort of problem.


-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: logical replication launcher crash on buildfarm

From
Andres Freund
Date:
On 2017-03-16 10:13:37 +0100, Petr Jelinek wrote:
> On 16/03/17 09:53, Andres Freund wrote:
> > On 2017-03-16 09:40:48 +0100, Petr Jelinek wrote:
> >> On 16/03/17 04:42, Andres Freund wrote:
> >>> On 2017-03-15 20:28:33 -0700, Andres Freund wrote:
> >>>> Hi,
> >>>>
> >>>> I just unstuck a bunch of my buildfarm animals.  That triggered some
> >>>> spurious failures (on piculet, calliphoridae, mylodon), but also one
> >>>> that doesn't really look like that:
> >>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2017-03-16%2002%3A40%3A03
> >>>>
> >>>> with the pertinent point being:
> >>>>
> >>>> ================== stack trace: pgsql.build/src/test/regress/tmp_check/data/core ==================
> >>>> [New LWP 1894]
> >>>> [Thread debugging using libthread_db enabled]
> >>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> >>>> Core was generated by `postgres: bgworker: logical replication launcher                '.
> >>>> Program terminated with signal SIGSEGV, Segmentation fault.
> >>>> #0  0x000055e265bff5e3 in ?? ()
> >>>> #0  0x000055e265bff5e3 in ?? ()
> >>>> #1  0x000055d3ccabed0d in StartBackgroundWorker () at
/home/andres/build/buildfarm-culicidae/HEAD/pgsql.build/../pgsql/src/backend/postmaster/bgworker.c:792
> >>>> #2  0x000055d3ccacf4fc in SubPostmasterMain (argc=3, argv=0x55d3cdbb71c0) at
/home/andres/build/buildfarm-culicidae/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:4878
> >>>> #3  0x000055d3cca443ea in main (argc=3, argv=0x55d3cdbb71c0) at
/home/andres/build/buildfarm-culicidae/HEAD/pgsql.build/../pgsql/src/backend/main/main.c:205
> >>>>
> >>>> it's possible that me killing things and upgrading caused this, but
> >>>> given this is a backend running EXEC_BACKEND, I'm a bit suspicous that
> >>>> it's more than that.  The machine is a bit backed up at the moment, so
> >>>> it'll probably be a while till it's at that animal/branch again,
> >>>> otherwise I'd not have mentioned this.
> >>>
> >>> For some reason it ran again pretty soon. And I'm afraid it's indeed an
> >>> issue:
> >>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2017-03-16%2003%3A30%3A02
> >>>
> >>
> >> Hmm, I tried with EXEC_BACKEND (and with --disable-spinlocks) and it
> >> seems to work fine on my two machines. I don't see anything else
> >> different on culicidae though. Sadly the backtrace is not that
> >> informative either. I'll try to investigate more but it will take time...
> > 
> > Worthwhile additional failure:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2017-03-16%2002%3A55%3A01
> > 
> > Same animal, also EXEC_BACKEND, but 9.6.
> > 
> > A quick look at the relevant line:
> >     /*
> >      * If bgw_main is set, we use that value as the initial entrypoint.
> >      * However, if the library containing the entrypoint wasn't loaded at
> >      * postmaster startup time, passing it as a direct function pointer is not
> >      * possible.  To work around that, we allow callers for whom a function
> >      * pointer is not available to pass a library name (which will be loaded,
> >      * if necessary) and a function name (which will be looked up in the named
> >      * library).
> >      */
> >     if (worker->bgw_main != NULL)
> >         entrypt = worker->bgw_main;
> > 
> > makes the issue clear - we appear to be assuming that bgw_main is
> > meaningful across processes.  Which it isn't in the EXEC_BACKEND case
> > when ASLR is in use...
> > 
> > This kinda sounds familiar, but a quick google search doesn't find
> > anything relevant.

Robert, Petr, either of you planning to fix this (as outlined elsewhere
in the thred)?


> Hmm now that you mention it, I remember discussing something similar
> with you last year in Dallas in regards to parallel query. IIRC Windows
> should not have this problem but other systems with EXEC_BACKEND do.
> Don't remember the details though.

Don't think that's reliable, only works as long as the binary is
compiled without position independent code.

Greetings,

Andres Freund



Re: logical replication launcher crash on buildfarm

From
Robert Haas
Date:
On Mon, Mar 27, 2017 at 12:50 PM, Andres Freund <andres@anarazel.de> wrote:
> Robert, Petr, either of you planning to fix this (as outlined elsewhere
> in the thred)?

Oh, I didn't realize anybody was looking to me to fix this.  I sort of
thought that it was fallout from the logical replication patch and
that Petr or Peter would deal with it.  If that's not the case, I'm
not totally unwilling to take a whack at it, but I don't have much
personal enthusiasm for trying to figure out how to make dynamic
loading on the postgres binary itself work everywhere, so if it falls
to me to fix, it's likely to get a hard-coded check for some
hard-coded name.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical replication launcher crash on buildfarm

From
Andres Freund
Date:
On 2017-03-27 13:01:11 -0400, Robert Haas wrote:
> On Mon, Mar 27, 2017 at 12:50 PM, Andres Freund <andres@anarazel.de> wrote:
> > Robert, Petr, either of you planning to fix this (as outlined elsewhere
> > in the thred)?
> 
> Oh, I didn't realize anybody was looking to me to fix this.

Well, it's borked in 9.6.  I'm starting to get annoyed by culicidae's
failures ;)


> I sort of thought that it was fallout from the logical replication
> patch and that Petr or Peter would deal with it.  If that's not the
> case, I'm not totally unwilling to take a whack at it, but I don't
> have much personal enthusiasm for trying to figure out how to make
> dynamic loading on the postgres binary itself work everywhere, so if
> it falls to me to fix, it's likely to get a hard-coded check for some
> hard-coded name.

I'm all for that approach - there seems very little upside in the
dynamic loading approach.  Just defining a bgw_entry_points[enum
BuiltinBGWorkerType] -> bgworker_main_type array seems to be simple
enough - it's not like we're going to add new types of builtin bgworkers
at runtime.

- Andres



Re: logical replication launcher crash on buildfarm

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> ... I don't have much
> personal enthusiasm for trying to figure out how to make dynamic
> loading on the postgres binary itself work everywhere, so if it falls
> to me to fix, it's likely to get a hard-coded check for some
> hard-coded name.

+1.  This seems like no time to be buying into brand new portability
requirements without a very pressing need to do so; and this patch
doesn't appear to create one.
        regards, tom lane



Re: logical replication launcher crash on buildfarm

From
Petr Jelinek
Date:
On 27/03/17 19:01, Robert Haas wrote:
> On Mon, Mar 27, 2017 at 12:50 PM, Andres Freund <andres@anarazel.de> wrote:
>> Robert, Petr, either of you planning to fix this (as outlined elsewhere
>> in the thred)?
> 
> Oh, I didn't realize anybody was looking to me to fix this.  I sort of
> thought that it was fallout from the logical replication patch and
> that Petr or Peter would deal with it.  If that's not the case, I'm
> not totally unwilling to take a whack at it, but I don't have much
> personal enthusiasm for trying to figure out how to make dynamic
> loading on the postgres binary itself work everywhere, so if it falls
> to me to fix, it's likely to get a hard-coded check for some
> hard-coded name.
> 

It affects parallel workers same way, I'll write patch for HEAD soon,
9.6 probably with some delay. I'll expand the InternalBgWorkers thing
that was added with logical replication to handle this in similar
fashion how we do fmgrtab.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: logical replication launcher crash on buildfarm

From
Petr Jelinek
Date:
On 28/03/17 03:31, Petr Jelinek wrote:
> On 27/03/17 19:01, Robert Haas wrote:
>> On Mon, Mar 27, 2017 at 12:50 PM, Andres Freund <andres@anarazel.de> wrote:
>>> Robert, Petr, either of you planning to fix this (as outlined elsewhere
>>> in the thred)?
>>
>> Oh, I didn't realize anybody was looking to me to fix this.  I sort of
>> thought that it was fallout from the logical replication patch and
>> that Petr or Peter would deal with it.  If that's not the case, I'm
>> not totally unwilling to take a whack at it, but I don't have much
>> personal enthusiasm for trying to figure out how to make dynamic
>> loading on the postgres binary itself work everywhere, so if it falls
>> to me to fix, it's likely to get a hard-coded check for some
>> hard-coded name.
>>
> 
> It affects parallel workers same way, I'll write patch for HEAD soon,
> 9.6 probably with some delay. I'll expand the InternalBgWorkers thing
> that was added with logical replication to handle this in similar
> fashion how we do fmgrtab.
> 

Btw now that I look at the code, I guess we'll want to get rid of
bgw_main completely in HEAD given that we can't guarantee it will be
valid even for shared_preload_library libraries. For older branches I
would leave things as they are in this regard as there don't seem to be
any immediate issue for standard binaries.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: logical replication launcher crash on buildfarm

From
Andres Freund
Date:
On 2017-03-28 03:47:50 +0200, Petr Jelinek wrote:
> On 28/03/17 03:31, Petr Jelinek wrote:
> > On 27/03/17 19:01, Robert Haas wrote:
> >> On Mon, Mar 27, 2017 at 12:50 PM, Andres Freund <andres@anarazel.de> wrote:
> >>> Robert, Petr, either of you planning to fix this (as outlined elsewhere
> >>> in the thred)?
> >>
> >> Oh, I didn't realize anybody was looking to me to fix this.  I sort of
> >> thought that it was fallout from the logical replication patch and
> >> that Petr or Peter would deal with it.  If that's not the case, I'm
> >> not totally unwilling to take a whack at it, but I don't have much
> >> personal enthusiasm for trying to figure out how to make dynamic
> >> loading on the postgres binary itself work everywhere, so if it falls
> >> to me to fix, it's likely to get a hard-coded check for some
> >> hard-coded name.
> >>
> > 
> > It affects parallel workers same way, I'll write patch for HEAD soon,
> > 9.6 probably with some delay. I'll expand the InternalBgWorkers thing
> > that was added with logical replication to handle this in similar
> > fashion how we do fmgrtab.
> > 
> 
> Btw now that I look at the code, I guess we'll want to get rid of
> bgw_main completely in HEAD given that we can't guarantee it will be
> valid even for shared_preload_library libraries. For older branches I
> would leave things as they are in this regard as there don't seem to be
> any immediate issue for standard binaries.

As long as you fix it so culicidae is happy (in 9.6) ;).  I think it's
fine to just introduce bgw_builtin_id or such, and leave the bgw_main
code in place in < HEAD.


Greetings,

Andres Freund



Re: logical replication launcher crash on buildfarm

From
Robert Haas
Date:
On Mon, Mar 27, 2017 at 10:04 PM, Andres Freund <andres@anarazel.de> wrote:
>> Btw now that I look at the code, I guess we'll want to get rid of
>> bgw_main completely in HEAD given that we can't guarantee it will be
>> valid even for shared_preload_library libraries. For older branches I
>> would leave things as they are in this regard as there don't seem to be
>> any immediate issue for standard binaries.
>
> As long as you fix it so culicidae is happy (in 9.6) ;).  I think it's
> fine to just introduce bgw_builtin_id or such, and leave the bgw_main
> code in place in < HEAD.

I wasn't thinking of introducing bgw_builtin_id.  My idea was just
along the lines of

if (bgw_library_name == NULL && bgw_function_name != NULL)
{   if (strcmp(bgw_function_name, "ParallelQueryMain") == 0)      ParallelQueryMain(blah);   else if
(strcmp(bgw_function_name,"LogicalReplicationMain") == 0)      LogicalReplicationMain(blah);
 
}

I think something like that is certainly better for the back-branches,
because it doesn't cause an ABI break.  But I think it would also be
fine for master.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical replication launcher crash on buildfarm

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I wasn't thinking of introducing bgw_builtin_id.  My idea was just
> along the lines of

> if (bgw_library_name == NULL && bgw_function_name != NULL)
> {
>     if (strcmp(bgw_function_name, "ParallelQueryMain") == 0)
>        ParallelQueryMain(blah);
>     else if (strcmp(bgw_function_name, "LogicalReplicationMain") == 0)
>        LogicalReplicationMain(blah);
> }

> I think something like that is certainly better for the back-branches,
> because it doesn't cause an ABI break.  But I think it would also be
> fine for master.

That seems perfectly reasonable from here: surely the cost of a couple
of strcmp's is trivial in comparison to a process launch.

We can redesign the API whenever this way starts getting unwieldy,
but that's likely to be quite some time away.
        regards, tom lane



Re: logical replication launcher crash on buildfarm

From
Petr Jelinek
Date:
On 28/03/17 04:46, Robert Haas wrote:
> On Mon, Mar 27, 2017 at 10:04 PM, Andres Freund <andres@anarazel.de> wrote:
>>> Btw now that I look at the code, I guess we'll want to get rid of
>>> bgw_main completely in HEAD given that we can't guarantee it will be
>>> valid even for shared_preload_library libraries. For older branches I
>>> would leave things as they are in this regard as there don't seem to be
>>> any immediate issue for standard binaries.
>>
>> As long as you fix it so culicidae is happy (in 9.6) ;).  I think it's
>> fine to just introduce bgw_builtin_id or such, and leave the bgw_main
>> code in place in < HEAD.
> 
> I wasn't thinking of introducing bgw_builtin_id.  My idea was just
> along the lines of
> 
> if (bgw_library_name == NULL && bgw_function_name != NULL)
> {
>     if (strcmp(bgw_function_name, "ParallelQueryMain") == 0)
>        ParallelQueryMain(blah);
>     else if (strcmp(bgw_function_name, "LogicalReplicationMain") == 0)
>        LogicalReplicationMain(blah);
> }
> 
> I think something like that is certainly better for the back-branches,
> because it doesn't cause an ABI break.  But I think it would also be
> fine for master.
> 

I had something slightly more complex like the attached in mind.

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services

Re: logical replication launcher crash on buildfarm

From
Robert Haas
Date:
On Mon, Mar 27, 2017 at 11:20 PM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:
> On 28/03/17 04:46, Robert Haas wrote:
>> On Mon, Mar 27, 2017 at 10:04 PM, Andres Freund <andres@anarazel.de> wrote:
>>>> Btw now that I look at the code, I guess we'll want to get rid of
>>>> bgw_main completely in HEAD given that we can't guarantee it will be
>>>> valid even for shared_preload_library libraries. For older branches I
>>>> would leave things as they are in this regard as there don't seem to be
>>>> any immediate issue for standard binaries.
>>>
>>> As long as you fix it so culicidae is happy (in 9.6) ;).  I think it's
>>> fine to just introduce bgw_builtin_id or such, and leave the bgw_main
>>> code in place in < HEAD.
>>
>> I wasn't thinking of introducing bgw_builtin_id.  My idea was just
>> along the lines of
>>
>> if (bgw_library_name == NULL && bgw_function_name != NULL)
>> {
>>     if (strcmp(bgw_function_name, "ParallelQueryMain") == 0)
>>        ParallelQueryMain(blah);
>>     else if (strcmp(bgw_function_name, "LogicalReplicationMain") == 0)
>>        LogicalReplicationMain(blah);
>> }
>>
>> I think something like that is certainly better for the back-branches,
>> because it doesn't cause an ABI break.  But I think it would also be
>> fine for master.
>>
>
> I had something slightly more complex like the attached in mind.

Seems broadly reasonable on a quick look, but I think we should leave
bgw_main intact in 9.6.  It may be working for fine for people who
don't care about Windows, and I'd rather not break it gratuitously.
Can we have two patches, one of which converts the internal stuff to
use the new mechanism and a second of which removes bgw_main?  The
second one, at least, also needs to update the documentation.  (A good
practice when removing an identifier is to run 'git grep
thing_i_am_removing' after removing it...)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical replication launcher crash on buildfarm

From
Petr Jelinek
Date:
On 28/03/17 17:55, Robert Haas wrote:
> On Mon, Mar 27, 2017 at 11:20 PM, Petr Jelinek
> <petr.jelinek@2ndquadrant.com> wrote:
>> On 28/03/17 04:46, Robert Haas wrote:
>>> On Mon, Mar 27, 2017 at 10:04 PM, Andres Freund <andres@anarazel.de> wrote:
>>>>> Btw now that I look at the code, I guess we'll want to get rid of
>>>>> bgw_main completely in HEAD given that we can't guarantee it will be
>>>>> valid even for shared_preload_library libraries. For older branches I
>>>>> would leave things as they are in this regard as there don't seem to be
>>>>> any immediate issue for standard binaries.
>>>>
>>>> As long as you fix it so culicidae is happy (in 9.6) ;).  I think it's
>>>> fine to just introduce bgw_builtin_id or such, and leave the bgw_main
>>>> code in place in < HEAD.
>>>
>>> I wasn't thinking of introducing bgw_builtin_id.  My idea was just
>>> along the lines of
>>>
>>> if (bgw_library_name == NULL && bgw_function_name != NULL)
>>> {
>>>     if (strcmp(bgw_function_name, "ParallelQueryMain") == 0)
>>>        ParallelQueryMain(blah);
>>>     else if (strcmp(bgw_function_name, "LogicalReplicationMain") == 0)
>>>        LogicalReplicationMain(blah);
>>> }
>>>
>>> I think something like that is certainly better for the back-branches,
>>> because it doesn't cause an ABI break.  But I think it would also be
>>> fine for master.
>>>
>>
>> I had something slightly more complex like the attached in mind.
> 
> Seems broadly reasonable on a quick look, but I think we should leave
> bgw_main intact in 9.6.  It may be working for fine for people who
> don't care about Windows, and I'd rather not break it gratuitously.
> Can we have two patches, one of which converts the internal stuff to
> use the new mechanism and a second of which removes bgw_main?  The
> second one, at least, also needs to update the documentation.  (A good
> practice when removing an identifier is to run 'git grep
> thing_i_am_removing' after removing it...)
> 

Yes I agree (and I said it in the thread) I plan to submit the 9.6 too,
I just sent what I had before having it all ready mainly because you
started talking about code already and I didn't want you to waste time
with it needlessly, but it was already 5AM for me ;).

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: logical replication launcher crash on buildfarm

From
Petr Jelinek
Date:
On 28/03/17 18:05, Petr Jelinek wrote:
> On 28/03/17 17:55, Robert Haas wrote:
>> On Mon, Mar 27, 2017 at 11:20 PM, Petr Jelinek
>> <petr.jelinek@2ndquadrant.com> wrote:
>>> On 28/03/17 04:46, Robert Haas wrote:
>>>> On Mon, Mar 27, 2017 at 10:04 PM, Andres Freund <andres@anarazel.de> wrote:
>>>>>> Btw now that I look at the code, I guess we'll want to get rid of
>>>>>> bgw_main completely in HEAD given that we can't guarantee it will be
>>>>>> valid even for shared_preload_library libraries. For older branches I
>>>>>> would leave things as they are in this regard as there don't seem to be
>>>>>> any immediate issue for standard binaries.
>>>>>
>>>>> As long as you fix it so culicidae is happy (in 9.6) ;).  I think it's
>>>>> fine to just introduce bgw_builtin_id or such, and leave the bgw_main
>>>>> code in place in < HEAD.
>>>>
>>>> I wasn't thinking of introducing bgw_builtin_id.  My idea was just
>>>> along the lines of
>>>>
>>>> if (bgw_library_name == NULL && bgw_function_name != NULL)
>>>> {
>>>>     if (strcmp(bgw_function_name, "ParallelQueryMain") == 0)
>>>>        ParallelQueryMain(blah);
>>>>     else if (strcmp(bgw_function_name, "LogicalReplicationMain") == 0)
>>>>        LogicalReplicationMain(blah);
>>>> }
>>>>
>>>> I think something like that is certainly better for the back-branches,
>>>> because it doesn't cause an ABI break.  But I think it would also be
>>>> fine for master.
>>>>
>>>
>>> I had something slightly more complex like the attached in mind.
>>
>> Seems broadly reasonable on a quick look, but I think we should leave
>> bgw_main intact in 9.6.  It may be working for fine for people who
>> don't care about Windows, and I'd rather not break it gratuitously.
>> Can we have two patches, one of which converts the internal stuff to
>> use the new mechanism and a second of which removes bgw_main?  The
>> second one, at least, also needs to update the documentation.  (A good
>> practice when removing an identifier is to run 'git grep
>> thing_i_am_removing' after removing it...)
>>
> 
> Yes I agree (and I said it in the thread) I plan to submit the 9.6 too,
> I just sent what I had before having it all ready mainly because you
> started talking about code already and I didn't want you to waste time
> with it needlessly, but it was already 5AM for me ;).
> 

Okay finally got to it, the 9.6 adds similar mechanism as HEAD (it could
be simpler there but it seems like same code is better) but does not
remove the bgw_main.

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: logical replication launcher crash on buildfarm

From
Petr Jelinek
Date:
On 29/03/17 01:29, Petr Jelinek wrote:
> On 28/03/17 18:05, Petr Jelinek wrote:
>> On 28/03/17 17:55, Robert Haas wrote:
>>> On Mon, Mar 27, 2017 at 11:20 PM, Petr Jelinek
>>> <petr.jelinek@2ndquadrant.com> wrote:
>>>> On 28/03/17 04:46, Robert Haas wrote:
>>>>> On Mon, Mar 27, 2017 at 10:04 PM, Andres Freund <andres@anarazel.de> wrote:
>>>>>>> Btw now that I look at the code, I guess we'll want to get rid of
>>>>>>> bgw_main completely in HEAD given that we can't guarantee it will be
>>>>>>> valid even for shared_preload_library libraries. For older branches I
>>>>>>> would leave things as they are in this regard as there don't seem to be
>>>>>>> any immediate issue for standard binaries.
>>>>>>
>>>>>> As long as you fix it so culicidae is happy (in 9.6) ;).  I think it's
>>>>>> fine to just introduce bgw_builtin_id or such, and leave the bgw_main
>>>>>> code in place in < HEAD.
>>>>>
>>>>> I wasn't thinking of introducing bgw_builtin_id.  My idea was just
>>>>> along the lines of
>>>>>
>>>>> if (bgw_library_name == NULL && bgw_function_name != NULL)
>>>>> {
>>>>>     if (strcmp(bgw_function_name, "ParallelQueryMain") == 0)
>>>>>        ParallelQueryMain(blah);
>>>>>     else if (strcmp(bgw_function_name, "LogicalReplicationMain") == 0)
>>>>>        LogicalReplicationMain(blah);
>>>>> }
>>>>>
>>>>> I think something like that is certainly better for the back-branches,
>>>>> because it doesn't cause an ABI break.  But I think it would also be
>>>>> fine for master.
>>>>>
>>>>
>>>> I had something slightly more complex like the attached in mind.
>>>
>>> Seems broadly reasonable on a quick look, but I think we should leave
>>> bgw_main intact in 9.6.  It may be working for fine for people who
>>> don't care about Windows, and I'd rather not break it gratuitously.
>>> Can we have two patches, one of which converts the internal stuff to
>>> use the new mechanism and a second of which removes bgw_main?  The
>>> second one, at least, also needs to update the documentation.  (A good
>>> practice when removing an identifier is to run 'git grep
>>> thing_i_am_removing' after removing it...)
>>>
>>
>> Yes I agree (and I said it in the thread) I plan to submit the 9.6 too,
>> I just sent what I had before having it all ready mainly because you
>> started talking about code already and I didn't want you to waste time
>> with it needlessly, but it was already 5AM for me ;).
>>
> 
> Okay finally got to it, the 9.6 adds similar mechanism as HEAD (it could
> be simpler there but it seems like same code is better) but does not
> remove the bgw_main.
> 

Sigh, forgot git add for the docs, so one more try...

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services

Re: logical replication launcher crash on buildfarm

From
Robert Haas
Date:
On Tue, Mar 28, 2017 at 7:39 PM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:
> Sigh, forgot git add for the docs, so one more try...

+    if (strncmp(worker->bgw_library_name, "postgres", 8) != 0)
+        return NULL;

I think that's not right.  You don't want to match postgresshdkgjsdglkjs.

Aside from that, these look good to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical replication launcher crash on buildfarm

From
Petr Jelinek
Date:
On 31/03/17 15:42, Robert Haas wrote:
> On Tue, Mar 28, 2017 at 7:39 PM, Petr Jelinek
> <petr.jelinek@2ndquadrant.com> wrote:
>> Sigh, forgot git add for the docs, so one more try...
> 
> +    if (strncmp(worker->bgw_library_name, "postgres", 8) != 0)
> +        return NULL;
> 
> I think that's not right.  You don't want to match postgresshdkgjsdglkjs.
>

Right, changed to BGW_MAXLEN.

> Aside from that, these look good to me.
> 

Cool.

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: logical replication launcher crash on buildfarm

From
Robert Haas
Date:
On Fri, Mar 31, 2017 at 8:32 PM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:
> Right, changed to BGW_MAXLEN.

Hmm, I don't know if there's any good reason not to just use strcmp(),
but sure, OK.  Committed and back-patched.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical replication launcher crash on buildfarm

From
Petr Jelinek
Date:
On 01/04/17 02:53, Robert Haas wrote:
> On Fri, Mar 31, 2017 at 8:32 PM, Petr Jelinek
> <petr.jelinek@2ndquadrant.com> wrote:
>> Right, changed to BGW_MAXLEN.
> 
> Hmm, I don't know if there's any good reason not to just use strcmp(),
> but sure, OK.  Committed and back-patched.
> 

Hmm culicidae still fails, this time only in parallel worker code. This
didn't happen on my machine which is strange. Looking at the code, we
are passing the fps->entrypoint as function pointer again so of course
it fails. We have some code to load libraries again but even that gets
initial entrypoint passed as function pointer
(ParallelExtensionTrampoline). I wonder if we'll have to generalize the
InternalBGWorkers even more to some kind of internal function name to
pointer map and add the parallel entry points there as well.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: logical replication launcher crash on buildfarm

From
Robert Haas
Date:
On Fri, Mar 31, 2017 at 9:26 PM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:
>> Hmm, I don't know if there's any good reason not to just use strcmp(),
>> but sure, OK.  Committed and back-patched.
>
> Hmm culicidae still fails, this time only in parallel worker code. This
> didn't happen on my machine which is strange. Looking at the code, we
> are passing the fps->entrypoint as function pointer again so of course
> it fails. We have some code to load libraries again but even that gets
> initial entrypoint passed as function pointer
> (ParallelExtensionTrampoline). I wonder if we'll have to generalize the
> InternalBGWorkers even more to some kind of internal function name to
> pointer map and add the parallel entry points there as well.

Argh, I forgot about that.  I think we need to use something like
ParallelExtensionTrampoline all the time, not just for libraries.
Since effectively, we've determined that postgres itself has the same
problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical replication launcher crash on buildfarm

From
Andres Freund
Date:
On 2017-04-01 03:26:01 +0200, Petr Jelinek wrote:
> On 01/04/17 02:53, Robert Haas wrote:
> > On Fri, Mar 31, 2017 at 8:32 PM, Petr Jelinek
> > <petr.jelinek@2ndquadrant.com> wrote:
> >> Right, changed to BGW_MAXLEN.
> > 
> > Hmm, I don't know if there's any good reason not to just use strcmp(),
> > but sure, OK.  Committed and back-patched.

Cool!


> Hmm culicidae still fails, this time only in parallel worker code.
> This  didn't happen on my machine which is strange.

Interesting.  Did you reproduce the failure before?  I wonder if the
issue is that you don't build a position independent executable?

- Andres



Re: logical replication launcher crash on buildfarm

From
Petr Jelinek
Date:
On 01/04/17 03:44, Andres Freund wrote:
> On 2017-04-01 03:26:01 +0200, Petr Jelinek wrote:
>> On 01/04/17 02:53, Robert Haas wrote:
>>> On Fri, Mar 31, 2017 at 8:32 PM, Petr Jelinek
>>> <petr.jelinek@2ndquadrant.com> wrote:
>>>> Right, changed to BGW_MAXLEN.
>>>
>>> Hmm, I don't know if there's any good reason not to just use strcmp(),
>>> but sure, OK.  Committed and back-patched.
> 
> Cool!
> 
> 
>> Hmm culicidae still fails, this time only in parallel worker code.
>> This  didn't happen on my machine which is strange.
> 
> Interesting.  Did you reproduce the failure before?  I wonder if the
> issue is that you don't build a position independent executable?
> 

I did reproduce the original issue yes, that's how I tested the patch. I
just copy pasted CFLAGS from the buildfarm configuration of culicidae TBH.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: logical replication launcher crash on buildfarm

From
Andres Freund
Date:
On 2017-04-01 03:57:05 +0200, Petr Jelinek wrote:
> On 01/04/17 03:44, Andres Freund wrote:
> > On 2017-04-01 03:26:01 +0200, Petr Jelinek wrote:
> >> On 01/04/17 02:53, Robert Haas wrote:
> >>> On Fri, Mar 31, 2017 at 8:32 PM, Petr Jelinek
> >>> <petr.jelinek@2ndquadrant.com> wrote:
> >>>> Right, changed to BGW_MAXLEN.
> >>>
> >>> Hmm, I don't know if there's any good reason not to just use strcmp(),
> >>> but sure, OK.  Committed and back-patched.
> > 
> > Cool!
> > 
> > 
> >> Hmm culicidae still fails, this time only in parallel worker code.
> >> This  didn't happen on my machine which is strange.
> > 
> > Interesting.  Did you reproduce the failure before?  I wonder if the
> > issue is that you don't build a position independent executable?
> > 
> 
> I did reproduce the original issue yes, that's how I tested the patch. I
> just copy pasted CFLAGS from the buildfarm configuration of culicidae TBH.

That's not necessarily sufficient, unless you also use a sufficiently
new debian - they changed gcc to default to PIE being enabled by
default.


Greetings,

Andres Freund



Re: logical replication launcher crash on buildfarm

From
Andres Freund
Date:
On 2017-03-31 21:30:17 -0400, Robert Haas wrote:
> On Fri, Mar 31, 2017 at 9:26 PM, Petr Jelinek
> <petr.jelinek@2ndquadrant.com> wrote:
> >> Hmm, I don't know if there's any good reason not to just use strcmp(),
> >> but sure, OK.  Committed and back-patched.
> >
> > Hmm culicidae still fails, this time only in parallel worker code. This
> > didn't happen on my machine which is strange. Looking at the code, we
> > are passing the fps->entrypoint as function pointer again so of course
> > it fails. We have some code to load libraries again but even that gets
> > initial entrypoint passed as function pointer
> > (ParallelExtensionTrampoline). I wonder if we'll have to generalize the
> > InternalBGWorkers even more to some kind of internal function name to
> > pointer map and add the parallel entry points there as well.
> 
> Argh, I forgot about that.  I think we need to use something like
> ParallelExtensionTrampoline all the time, not just for libraries.
> Since effectively, we've determined that postgres itself has the same
> problem.

I wonder if we shouldn't just introduce a KnownFunctionPointer[] map,
that can be used by various facilities (bgworkers themselves,
parallelism, and probably more coming).  Otherwise we'll introduce
mechanisms like 7d8f6986b8 in multiple places.

- Andres



Re: logical replication launcher crash on buildfarm

From
Petr Jelinek
Date:
On 01/04/17 04:19, Andres Freund wrote:
> On 2017-03-31 21:30:17 -0400, Robert Haas wrote:
>> On Fri, Mar 31, 2017 at 9:26 PM, Petr Jelinek
>> <petr.jelinek@2ndquadrant.com> wrote:
>>>> Hmm, I don't know if there's any good reason not to just use strcmp(),
>>>> but sure, OK.  Committed and back-patched.
>>>
>>> Hmm culicidae still fails, this time only in parallel worker code. This
>>> didn't happen on my machine which is strange. Looking at the code, we
>>> are passing the fps->entrypoint as function pointer again so of course
>>> it fails. We have some code to load libraries again but even that gets
>>> initial entrypoint passed as function pointer
>>> (ParallelExtensionTrampoline). I wonder if we'll have to generalize the
>>> InternalBGWorkers even more to some kind of internal function name to
>>> pointer map and add the parallel entry points there as well.
>>
>> Argh, I forgot about that.  I think we need to use something like
>> ParallelExtensionTrampoline all the time, not just for libraries.
>> Since effectively, we've determined that postgres itself has the same
>> problem.
> 
> I wonder if we shouldn't just introduce a KnownFunctionPointer[] map,
> that can be used by various facilities (bgworkers themselves,
> parallelism, and probably more coming).  Otherwise we'll introduce
> mechanisms like 7d8f6986b8 in multiple places.
> 

Yes that's what I meant with what I said above.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: logical replication launcher crash on buildfarm

From
Petr Jelinek
Date:
So this is what I came up with on plane. Generalized the loading
functionality into load_library_function which that can load either
known postgres functions or call load_external_function.

I am not quite sure if fmgr.c is best place to put it, but I didn't want
to include stuff from executor in bgworker.c. I was thinking about
putting it to dfmgr.c (as that's where load_external_function) but again
seemed weird to including executor there.

As with previous patch, 9.6 will need quite different patch as we need
to keep compatibility there, but since I am traveling I think it's
better to share what I have so far.

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: logical replication launcher crash on buildfarm

From
Robert Haas
Date:
On Tue, Apr 4, 2017 at 2:08 PM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:
> So this is what I came up with on plane. Generalized the loading
> functionality into load_library_function which that can load either
> known postgres functions or call load_external_function.
>
> I am not quite sure if fmgr.c is best place to put it, but I didn't want
> to include stuff from executor in bgworker.c. I was thinking about
> putting it to dfmgr.c (as that's where load_external_function) but again
> seemed weird to including executor there.
>
> As with previous patch, 9.6 will need quite different patch as we need
> to keep compatibility there, but since I am traveling I think it's
> better to share what I have so far.

- * Register a new background worker while processing shared_preload_libraries.
+ * Register a new static background worker. *
- * This can only be called in the _PG_init function of a module library
- * that's loaded by shared_preload_libraries; otherwise it has no effect.
+ * This can only be called directly from postmaster or in the _PG_init
+ * function of a module library that's loaded by shared_preload_libraries;
+ * otherwise it will have no effect.

This does seem to need updating, but that's a somewhat separate issue
from fixing the problem at issue.

+         * Serialize extension entrypoint information. It's unsafe to pass
+         * function pointers across parallel processes as the function pointer
+         * may be different in new process in EXEC_BACKEND builds so we
+         * always pass library and function name.

The word "extension" should be removed.

This basically looks OK otherwise, but we'll need something else for 9.6.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] logical replication launcher crash on buildfarm

From
Tom Lane
Date:
Petr Jelinek <petr.jelinek@2ndquadrant.com> writes:
> So this is what I came up with on plane. Generalized the loading
> functionality into load_library_function which that can load either
> known postgres functions or call load_external_function.

I've had more than enough of seeing buildfarm failures from culicidae,
so I whacked this around until I was happy with it and pushed it.
Further adjustments welcome of course.

> I am not quite sure if fmgr.c is best place to put it, but I didn't want
> to include stuff from executor in bgworker.c.

I really did not care for putting those references into fmgr.c.  Aside
from the layering violations involved, there was a type cheat, in that
you had functions with entirely different signatures in the same list.
I eventually decided that the least ugly solution was to create two
different pointer arrays and frontend functions, one for bgworkers and
one for parallel workers.  That ends up requiring only one extra
export/import, to make ParallelQueryMain accessible in parallel.c.
Maybe we need to rethink the division of labor between parallel.c
and execParallel.c, but that would depend on somebody explaining
what the difference is in the first place.

> As with previous patch, 9.6 will need quite different patch as we need
> to keep compatibility there,

I don't really understand why 9.6 needs a significantly different
patch.  AFAICS, it simply does not work (with any reliability)
for a loadable module to call CreateParallelContext directly in 9.6.
The only thing that actually works is to call the undocumented
CreateParallelContextForExternalFunction.  I suggest that we could
back-patch this as-is, if we add CreateParallelContextForExternalFunction
as a trivial wrapper around CreateParallelContext.
        regards, tom lane



Re: [HACKERS] logical replication launcher crash on buildfarm

From
Petr Jelinek
Date:
On 15/04/17 06:01, Tom Lane wrote:
> Petr Jelinek <petr.jelinek@2ndquadrant.com> writes:
>> So this is what I came up with on plane. Generalized the loading
>> functionality into load_library_function which that can load either
>> known postgres functions or call load_external_function.
> 
> I've had more than enough of seeing buildfarm failures from culicidae,
> so I whacked this around until I was happy with it and pushed it.
> Further adjustments welcome of course.

Thanks. Seems like culicidae is finally happy with master.

> 
>> I am not quite sure if fmgr.c is best place to put it, but I didn't want
>> to include stuff from executor in bgworker.c.
> 
> I really did not care for putting those references into fmgr.c.  Aside
> from the layering violations involved, there was a type cheat, in that
> you had functions with entirely different signatures in the same list.
> I eventually decided that the least ugly solution was to create two
> different pointer arrays and frontend functions, one for bgworkers and
> one for parallel workers.  That ends up requiring only one extra
> export/import, to make ParallelQueryMain accessible in parallel.c.
> Maybe we need to rethink the division of labor between parallel.c
> and execParallel.c, but that would depend on somebody explaining
> what the difference is in the first place.
> 

Sure, I did not really like the placement in fmgr.c either, just could
not find good place that would work well for both. I tend to err on the
less code duplication side of "less ugly", but the different function
signatures is reasonable argument for splitting it as we get additional
compiler check.

>> As with previous patch, 9.6 will need quite different patch as we need
>> to keep compatibility there,
> 
> I don't really understand why 9.6 needs a significantly different
> patch.  AFAICS, it simply does not work (with any reliability)
> for a loadable module to call CreateParallelContext directly in 9.6.
> The only thing that actually works is to call the undocumented
> CreateParallelContextForExternalFunction.  I suggest that we could
> back-patch this as-is, if we add CreateParallelContextForExternalFunction
> as a trivial wrapper around CreateParallelContext.
> 

I think the problem is that if somebody was using CreateParallelContext
it may have worked on unix when just normally forking if the extension
was loaded via shared_preload_libraries. And lot of extensions are linux
only. So we might break working setup for somebody if we change the
function signature.

We may need to keep CreateParallelContext as is in back branches and add
some CreateParallelContextInternal which would do what
CreateParallelContext does in master (and is used by postgres) and then
make CreateParallelContextForExternalFunction simple wrapper around
that. It's somewhat ugly though.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: [HACKERS] logical replication launcher crash on buildfarm

From
Tom Lane
Date:
Petr Jelinek <petr.jelinek@2ndquadrant.com> writes:
> On 15/04/17 06:01, Tom Lane wrote:
>> I've had more than enough of seeing buildfarm failures from culicidae,
>> so I whacked this around until I was happy with it and pushed it.
>> Further adjustments welcome of course.

> Thanks. Seems like culicidae is finally happy with master.

Well, only sort of happy, but I see how we can fix that.

>> I don't really understand why 9.6 needs a significantly different
>> patch.  AFAICS, it simply does not work (with any reliability)
>> for a loadable module to call CreateParallelContext directly in 9.6.

> I think the problem is that if somebody was using CreateParallelContext
> it may have worked on unix when just normally forking if the extension
> was loaded via shared_preload_libraries. And lot of extensions are linux
> only. So we might break working setup for somebody if we change the
> function signature.

Ah, I didn't think of shared_preload_libraries.  OK, we can't change
the signature in released branches.

> We may need to keep CreateParallelContext as is in back branches and add
> some CreateParallelContextInternal which would do what
> CreateParallelContext does in master (and is used by postgres) and then
> make CreateParallelContextForExternalFunction simple wrapper around
> that. It's somewhat ugly though.

Yeah, but ugly compatibility hacks in back branches are pretty common.
But rather than inventing "CreateParallelContextInternal", I suggest
we just have the core code call CreateParallelContextForExternalFunction.
        regards, tom lane



Re: [HACKERS] logical replication launcher crash on buildfarm

From
Robert Haas
Date:
On Sat, Apr 15, 2017 at 12:01 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Maybe we need to rethink the division of labor between parallel.c
> and execParallel.c, but that would depend on somebody explaining
> what the difference is in the first place.

parallel.c handles general concerns related to any sort of
parallelism, particularly those related to transaction management,
hence it's in src/backend/access/transam. execParallel.c handles
concerns that are specific to the executor, hence it's in
src/backend/executor.  The header comment for execParallel.c is fairly
informative on this point; now that I look at it, parallel.c has only
a stub comment, which really ought to be expanded.  Note that while
parallel *query* uses both, a parallel utility command will likely use
only the facilities in parallel.c, so the distinction is not academic.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company