Thread: Weirdness using Executor Hooks

Weirdness using Executor Hooks

From
Eric Ridge
Date:
I've written an extension that hooks ExecutorStart_hook and ExecutorEnd_hook.  The hooks are assigned in _PG_init() (and the previous ones saved to static vars) and reset to the previous values in _PG_fini().  Maybe also of interest is the extension library is set in postgresql.conf as a local_preload_libraries.  This is with Postgres 9.3.4.

What happens is that rarely (and of course never on my development machine), the saved "prev_ExecutorXXXHook" gets set to the current value of ExecutorXXX_hook, so when my hook function is called:

static void my_executor_start_hook(QueryDesc *queryDesc, int eflags)
{
   executorDepth++;

   if (prev_ExecutorStartHook)   /* this ends up equal to my_executor_start_hook, so it recurses forever */
      prev_ExecutorStartHook(queryDesc, eflags);
   else
      standard_ExecutorStart(queryDesc, eflags);
}

it endless loops on itself (ie, prev_ExecutorStartHook == my_executor_start_hook).  Based on GDB backtraces, it looks like gcc compiles this into some form of tail recursion as the backtraces just sit on the line that calls prev_ExecutorXXXHook(...).  The backend has to be SIGKILL'd.

I've followed the patterns set forth in both the 'auto_explain' and 'pg_stat_statements' contrib extensions and I've been over my code about 3 dozen times now, and I just can't figure out what's going on.  Clearly the hooks are known-to-work, and I'm stumped.

One theory I have is that I've got a bug somewhere that's overwriting memory, but it's quite the coincidence that only the two saved prev hook pointers are being changed and being changed to very specific values.

Since it only happens rarely (and never for me during development), another theory is based on the fact that this extension is under pretty constant development/deployment and when we deploy a new binary (and run ALTER EXTENSION UPDATE) we don't restart Postgres and so maybe the already-active-and-initialized-with-the-previous-version backends are getting confused (maybe the kernel re-mmaps the .so or something, I dunno?).  I always seem to hear about the problem after a backend has been endlessly spinning for a few days.  :(

Have any of y'all seen anything like this and could I be on the right track with my second theory?

*scratching head*,

eric

Re: Weirdness using Executor Hooks

From
Andres Freund
Date:
Hi,

On 2015-06-18 16:36:17 -0400, Eric Ridge wrote:
> I've written an extension that hooks ExecutorStart_hook and
> ExecutorEnd_hook.  The hooks are assigned in _PG_init() (and the previous
> ones saved to static vars) and reset to the previous values in _PG_fini().
> Maybe also of interest is the extension library is set in postgresql.conf
> as a local_preload_libraries.  This is with Postgres 9.3.4.

How does your _PG_init() actually look like?

I'd suggest adding an error check that prev_ExecutorStartHook is
unequals to your function when you assign it.

> What happens is that rarely (and of course never on my development
> machine), the saved "prev_ExecutorXXXHook" gets set to the current value of
> ExecutorXXX_hook, so when my hook function is called:

Did you check whether it's compiler/compiler flag dependant?

Greetings,

Andres Freund



Re: Weirdness using Executor Hooks

From
Eric Ridge
Date:
On Thu, Jun 18, 2015 at 4:42 PM, Andres Freund <andres@anarazel.de> wrote:
>
>
> How does your _PG_init() actually look like?

static ExecutorStart_hook_type prev_ExecutorStartHook = NULL;
static ExecutorEnd_hook_type prev_ExecutorEndHook  = NULL;

void _PG_init(void)
{  prev_ExecutorStartHook = ExecutorStart_hook;  prev_ExecutorEndHook = ExecutorEnd_hook;
  ExecutorStart_hook = my_executor_start_hook;  ExecutorEnd_hook = my_executor_end_hook;
  RegisterXactCallback(my_xact_callback, NULL);
}

> I'd suggest adding an error check that prev_ExecutorStartHook is
> unequals to your function when you assign it.

As a data point, that might be interesting to know, but I'd still be
scratching my head about how it happened.  Postgres doesn't load an
extension library more than once per backend session, does it?

> Did you check whether it's compiler/compiler flag dependant?

I didn't.  As far as compiler flags go, the extension is just
inheriting Postgres' defaults.  Any suggestions around which flag(s)
to adjust?

I appreciate your time!

eric



Re: Weirdness using Executor Hooks

From
Tom Lane
Date:
Eric Ridge <eebbrr@gmail.com> writes:
> On Thu, Jun 18, 2015 at 4:42 PM, Andres Freund <andres@anarazel.de> wrote:
>> How does your _PG_init() actually look like?

> static ExecutorStart_hook_type prev_ExecutorStartHook = NULL;
> static ExecutorEnd_hook_type prev_ExecutorEndHook  = NULL;

> void _PG_init(void)
> {
>    prev_ExecutorStartHook = ExecutorStart_hook;
>    prev_ExecutorEndHook = ExecutorEnd_hook;

>    ExecutorStart_hook = my_executor_start_hook;
>    ExecutorEnd_hook = my_executor_end_hook;

>    RegisterXactCallback(my_xact_callback, NULL);
> }

If that's broken, then so are most of our contrib modules.
Certainly none of them have an extra check as suggested by Andres.

> As a data point, that might be interesting to know, but I'd still be
> scratching my head about how it happened.  Postgres doesn't load an
> extension library more than once per backend session, does it?

It's not supposed to, and AFAICS internal_load_library() will treat
either an exact pathname match or an inode-number match as being
"already loaded".  I wonder if you might be doing something that
confuses those checks.  It does not look like we try terribly hard
to canonicalize library pathnames --- might you have some references
under different relative paths, for instance?  The inode number
check would perhaps fail if you'd installed a new library version,
but it's unclear to me why the pathname check would fail.
        regards, tom lane



Re: Weirdness using Executor Hooks

From
Tom Lane
Date:
I wrote:
> Eric Ridge <eebbrr@gmail.com> writes:
>> As a data point, that might be interesting to know, but I'd still be
>> scratching my head about how it happened.  Postgres doesn't load an
>> extension library more than once per backend session, does it?

> It's not supposed to, and AFAICS internal_load_library() will treat
> either an exact pathname match or an inode-number match as being
> "already loaded".

BTW, if it didn't occur to you already --- next time you see this,
you could try examining dfmgr.c's DynamicFileList data structure
using gdb.  That might help illuminate what Postgres thinks it
loaded and why.  Also note that the hang presumably occurs just
*before* internal_load_library links the new item into the file_list,
so you'd want to inspect that as a local variable.
        regards, tom lane



Re: Weirdness using Executor Hooks

From
Eric Ridge
Date:
On Thu, Jun 18, 2015 at 5:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> If that's broken, then so are most of our contrib modules.
> Certainly none of them have an extra check as suggested by Andres.

I'm putting it in anyways, along with a guard in the actual hook
function to ERROR if prev_hook == my_hook.  At least that'll avoid the
endless loops until I can figure this out.

>
>> As a data point, that might be interesting to know, but I'd still be
>> scratching my head about how it happened.  Postgres doesn't load an
>> extension library more than once per backend session, does it?
>
> It's not supposed to, and AFAICS internal_load_library() will treat
> either an exact pathname match or an inode-number match as being
> "already loaded".  I wonder if you might be doing something that
> confuses those checks.  It does not look like we try terribly hard
> to canonicalize library pathnames --- might you have some references
> under different relative paths, for instance?  The inode number
> check would perhaps fail if you'd installed a new library version,
> but it's unclear to me why the pathname check would fail.

According to the docs, anything listed in 'local_preload_libraries'
has to be in $libdir/plugins (whereas extensions just go to $libdir),
so rather than duplicate the .so, the "copy" in $libdir/plugins is
just a symlink to ../my_extension.so.

Could that be confusing things?

eric



Re: Weirdness using Executor Hooks

From
Eric Ridge
Date:
On Thu, Jun 18, 2015 at 5:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> BTW, if it didn't occur to you already --- next time you see this,
> you could try examining dfmgr.c's DynamicFileList data structure
> using gdb.  That might help illuminate what Postgres thinks it
> loaded and why.

Hmm, okay.  I'll not check for the condition in the hook functions so
that when this does happen again I can do what you suggest.

thanks!

eric



Re: Weirdness using Executor Hooks

From
Tom Lane
Date:
Eric Ridge <eebbrr@gmail.com> writes:
> On Thu, Jun 18, 2015 at 5:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> As a data point, that might be interesting to know, but I'd still be
>>> scratching my head about how it happened.  Postgres doesn't load an
>>> extension library more than once per backend session, does it?

>> It's not supposed to, and AFAICS internal_load_library() will treat
>> either an exact pathname match or an inode-number match as being
>> "already loaded".  I wonder if you might be doing something that
>> confuses those checks.  It does not look like we try terribly hard
>> to canonicalize library pathnames --- might you have some references
>> under different relative paths, for instance?  The inode number
>> check would perhaps fail if you'd installed a new library version,
>> but it's unclear to me why the pathname check would fail.

> According to the docs, anything listed in 'local_preload_libraries'
> has to be in $libdir/plugins (whereas extensions just go to $libdir),
> so rather than duplicate the .so, the "copy" in $libdir/plugins is
> just a symlink to ../my_extension.so.

> Could that be confusing things?

Hm, maybe.  The scenario that had occurred to me is like this: suppose
the declaration of C-language function f1() refers to the shlib by
an absolute pathname, while function f2() refers to the same shlib by
a relative pathname, or maybe one with some extra ".." maneuvering.
Then what would happen if f1() is called first in a session is that
the shlib would get loaded under f1's name for it.  Each time f2()
is called after that, internal_load_library() doesn't see a match
on pathname, but it does see a match on inode, so it doesn't perform
an extra dlopen() and all is well.  Then by and by you replace the
library file, so now it has a new inode number.  The next f2() call
will see neither a pathname nor an inode match, so the new library
file gets dlopen'd, and kaboom.

However, based on your comment above, I'm betting that the initial
reference is via the $libdir/plugins symlink, but you have functions
that name the library via the shorter direct path, and it's a call
of one of those that fails.

What we were expecting I guess is that such functions would be declared
to reference the library underneath $libdir/plugins, not that you'd use
a symlink.

We could conceivably forestall such problems if we were willing to
chase/expand symlinks every time we generate a candidate library
pathname; but that would be extremely expensive so I'm disinclined
to add it to the critical path for calling extension functions.

Probably the best thing is to call this pilot error and add a
documentation note warning against doing it that way.
        regards, tom lane



Re: Weirdness using Executor Hooks

From
Eric Ridge
Date:
On Thu, Jun 18, 2015 at 6:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> What we were expecting I guess is that such functions would be declared
> to reference the library underneath $libdir/plugins, not that you'd use
> a symlink.

According to pg_proc.probin, all 32 of my C language functions point
to $libdir/my_extension_lib, which makes sense because as part of the
extension .sql I declare them as LANGUAGE c AS 'MODULE_PATHNAME', and
MODULE_PATHNAME gets substituted by make.

So are you saying that we should instead declare them AS
'$libdir/plugins/my_extension_lib' so that it matches what's happening
in local_preload_libraries?  And is it safe to directly update
pg_proc.probin (and ya know, terminate all existing backends)?

Thanks so much for your time!

eric

ps, I think if we just changed our deploy process to terminate
existing backends this would just disappear, yeah?



Re: Weirdness using Executor Hooks

From
Eric Ridge
Date:
On Thu, Jun 18, 2015 at 8:07 PM, Eric Ridge <eebbrr@gmail.com> wrote:
>
> According to pg_proc.probin, all 32 of my C language functions point
> to $libdir/my_extension_lib,

That's a damn lie.  The deployment server has a number of databases
that have this extension installed and some of them have probin set to
just 'my_extension_lib'

<short pause>

Of course I can't seem to trigger the problem, even by changing the
extension, re-compiling, and re-installing.  #dang

eric