Thread: Re: pgsql: Refactor dlopen() support

Re: pgsql: Refactor dlopen() support

From: Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> Refactor dlopen() support

Buildfarm member locust doesn't like this much.  I've been able to
reproduce the problem on an old Mac laptop running the same macOS release,
viz 10.5.8.  (Note that we're not seeing it on earlier or later releases,
which is odd in itself.)  According to my machine, the crash is happening
here:

#0  _PG_init () at plpy_main.c:98
98              *plpython_version_bitmask_ptr |= (1 << PY_MAJOR_VERSION);

and the reason is that the rendezvous variable sometimes contains garbage.
Most sessions correctly see it as initially zero, but sometimes it
contains

(gdb) p plpython_version_bitmask_ptr
$1 = (int *) 0x1d

and I've also seen

(gdb) p plpython_version_bitmask_ptr
$1 = (int *) 0x7f7f7f7f

It's mostly repeatable but not completely so: the 0x1d case seems
to come up every time through the plpython_do test, but I don't
always see the 0x7f7f7f7f case.  (Maybe that's a timing artifact?
It takes a variable amount of time to recover from the first crash
in plpython_do, so the rest of the plpython test run isn't exactly
operating in uniform conditions.)

No idea what's going on here, and I'm about out of steam for tonight.

            regards, tom lane


Re: pgsql: Refactor dlopen() support

From: Peter Eisentraut
Date:
On 07/09/2018 08:30, Tom Lane wrote:
> Peter Eisentraut <peter_e@gmx.net> writes:
>> Refactor dlopen() support
> 
> Buildfarm member locust doesn't like this much.  I've been able to
> reproduce the problem on an old Mac laptop running the same macOS release,
> viz 10.5.8.  (Note that we're not seeing it on earlier or later releases,
> which is odd in itself.)

Nothing should have changed on macOS except that the intermediate
functions pg_dl*() were replaced by direct calls to dl*().  Very strange.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: pgsql: Refactor dlopen() support

From: Tom Lane
Date:
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> On 07/09/2018 08:30, Tom Lane wrote:
>> Buildfarm member locust doesn't like this much.  I've been able to
>> reproduce the problem on an old Mac laptop running the same macOS release,
>> viz 10.5.8.  (Note that we're not seeing it on earlier or later releases,
>> which is odd in itself.)

> Nothing should have changed on macOS except that the intermediate
> functions pg_dl*() were replaced by direct calls to dl*().  Very strange.

Somehow or other, the changes you made in dfmgr.c's #include lines
have made it so that find_rendezvous_variable's local "bool found"
variable is actually of type _Bool (which is word-wide on these
machines).  However, hash_search thinks its output variable is
of type pointer to "typedef char bool".  The proximate cause of
the observed failure is that find_rendezvous_variable sees "found"
as true when it should not, and thus fails to zero out the variable's
value.

No time to look further right now, but there's something rotten
about the way we're handling bool.

            regards, tom lane


Re: pgsql: Refactor dlopen() support

From: Peter Eisentraut
Date:
On 07/09/2018 16:19, Tom Lane wrote:
> Somehow or other, the changes you made in dfmgr.c's #include lines
> have made it so that find_rendezvous_variable's local "bool found"
> variable is actually of type _Bool (which is word-wide on these
> machines).  However, hash_search thinks its output variable is
> of type pointer to "typedef char bool".  The proximate cause of
> the observed failure is that find_rendezvous_variable sees "found"
> as true when it should not, and thus fails to zero out the variable's
> value.

Ah, because dlfcn.h includes stdbool.h.  Hmm.



Re: pgsql: Refactor dlopen() support

From: Tom Lane
Date:
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> On 07/09/2018 16:19, Tom Lane wrote:
>> Somehow or other, the changes you made in dfmgr.c's #include lines
>> have made it so that find_rendezvous_variable's local "bool found"
>> variable is actually of type _Bool (which is word-wide on these
>> machines).

> Ah, because dlfcn.h includes stdbool.h.  Hmm.

Yeah, and that's still true as of current macOS, it seems.

I can make the problem go away with the attached patch (borrowed from
similar code in plperl.h).  It's kind of grotty but I'm not sure there's
a better way.

            regards, tom lane

diff --git a/src/backend/utils/fmgr/dfmgr.c b/src/backend/utils/fmgr/dfmgr.c
index c2a2572..4a5cc7c 100644
*** a/src/backend/utils/fmgr/dfmgr.c
--- b/src/backend/utils/fmgr/dfmgr.c
***************
*** 18,24 ****
--- 18,34 ----

  #ifdef HAVE_DLOPEN
  #include <dlfcn.h>
+
+ /*
+  * On macOS, <dlfcn.h> insists on including <stdbool.h>.  If we're not
+  * using stdbool, undef bool to undo the damage.
+  */
+ #ifndef USE_STDBOOL
+ #ifdef bool
+ #undef bool
  #endif
+ #endif
+ #endif                            /* HAVE_DLOPEN */

  #include "fmgr.h"
  #include "lib/stringinfo.h"