BUG #9464: PANIC with 'failed to re-find shared lock object' - Mailing list pgsql-bugs

From petela@gocougs.wsu.edu
Subject BUG #9464: PANIC with 'failed to re-find shared lock object'
Date
Msg-id 20140307005623.1918.22473@wrigleys.postgresql.org
Whole thread Raw
Responses Re: BUG #9464: PANIC with 'failed to re-find shared lock object'  (Peter LaDow <petela@gocougs.wsu.edu>)
Re: BUG #9464: PANIC with 'failed to re-find shared lock object'  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      9464
Logged by:          Peter LaDow
Email address:      petela@gocougs.wsu.edu
PostgreSQL version: 9.3.3
Operating system:   Linux (Ubuntu 12.04, PPC)
Description:

We are upgrading from Postgresql 8.3.7 to 9.3.3, and we are getting a
database crash due to a failure in a C extension.  In this extension we are
trying to spawn an external program to run in the background.  We do this
with double fork(), with the first child exiting cleanly, and the second
child execing the program.  The database log outputs the following upon
failure (myfoo_broken() is an SQL function that calls a C extension function
called myfoo(); see the code below):

2014-03-07 00:23:00.764 GMT myfoodatabase psql_user 6753: NOTICE:  Running
myfoo_broken()
2014-03-07 00:23:00.764 GMT myfoodatabase psql_user 6753: STATEMENT:  SELECT
public.set_schema(); select myfoo_broken();
2014-03-07 00:23:00.784 GMT myfoodatabase psql_user 6753: PANIC:  failed to
re-find shared lock object
2014-03-07 00:23:00.784 GMT myfoodatabase psql_user 6753: STATEMENT:  SELECT
public.set_schema(); select myfoo_broken();
2014-03-07 00:23:05.600 GMT myfoodatabase sel_web 6557: LOG:  duration:
2.651 ms  statement: SELECT session_id FROM device.web_session;
2014-03-07 00:23:09.383 GMT   1100: LOG:  server process (PID 6753) was
terminated by signal 6: Aborted
2014-03-07 00:23:09.384 GMT   1100: LOG:  terminating any other active
server processes
2014-03-07 00:23:09.386 GMT   6534: WARNING:  terminating connection because
of crash of another server process
2014-03-07 00:23:09.386 GMT   6534: DETAIL:  The postmaster has commanded
this server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2014-03-07 00:23:09.386 GMT   6534: HINT:  In a moment you should be able to
reconnect to the database and repeat your command.
2014-03-07 00:23:09.387 GMT myfoodatabase sel_web 6557: WARNING:
terminating connection because of crash of another server process
2014-03-07 00:23:09.387 GMT myfoodatabase sel_web 6557: DETAIL:  The
postmaster has commanded this server process to roll back the current
transaction and exit, because another server process exited abnormally and
possibly corrupted shared memory.
2014-03-07 00:23:09.387 GMT myfoodatabase sel_web 6557: HINT:  In a moment
you should be able to reconnect to the database and repeat your command.
2014-03-07 00:23:09.384 GMT myfoodatabase psql_user 6556: WARNING:
terminating connection because of crash of another server process
2014-03-07 00:23:09.384 GMT myfoodatabase psql_user 6556: DETAIL:  The
postmaster has commanded this server process to roll back the current
transaction and exit, because another server process exited abnormally and
possibly corrupted shared memory.
2014-03-07 00:23:09.388 GMT myfoodatabase psql_user 6556: HINT:  In a moment
you should be able to reconnect to the database and repeat your command.
2014-03-07 00:23:09.388 GMT myfoodatabase psql_user 6556: DEBUG:
shmem_exit(-1): 0 callbacks to make
2014-03-07 00:23:09.390 GMT myfoodatabase psql_user 6556: DEBUG:
proc_exit(-1): 0 callbacks to make
2014-03-07 00:23:09.425 GMT   1100: LOG:  all server processes terminated;
reinitializing
2014-03-07 00:23:10.146 GMT myfoodatabase psql_user 6778: FATAL:  the
database system is in recovery mode
2014-03-07 00:23:10.156 GMT   6777: LOG:  database system was interrupted;
last known up at 2014-03-07 00:21:12 GMT
2014-03-07 00:23:10.162 GMT   6777: LOG:  database system was not properly
shut down; automatic recovery in progress
2014-03-07 00:23:10.218 GMT myfoodatabase psql_user 6781: FATAL:  the
database system is in recovery mode
2014-03-07 00:23:10.347 GMT myfoodatabase psql_user 6790: FATAL:  the
database system is in recovery mode
2014-03-07 00:23:10.464 GMT myfoodatabase psql_user 6798: FATAL:  the
database system is in recovery mode
2014-03-07 00:23:10.645 GMT myfoodatabase psql_user 6806: FATAL:  the
database system is in recovery mode
2014-03-07 00:23:10.722 GMT myfoodatabase psql_user 6809: FATAL:  the
database system is in recovery mode
2014-03-07 00:23:10.863 GMT myfoodatabase psql_user 6818: FATAL:  the
database system is in recovery mode
2014-03-07 00:23:10.985 GMT myfoodatabase psql_user 6826: FATAL:  the
database system is in recovery mode
2014-03-07 00:23:11.199 GMT myfoodatabase psql_user 6834: FATAL:  the
database system is in recovery mode
2014-03-07 00:23:11.253 GMT   6777: LOG:  record with zero length at
0/3BB9AE0
2014-03-07 00:23:11.253 GMT   6777: LOG:  redo is not required
2014-03-07 00:23:11.271 GMT myfoodatabase psql_user 6837: FATAL:  the
database system is in recovery mode
2014-03-07 00:23:11.377 GMT   6849: LOG:  autovacuum launcher started
2014-03-07 00:23:11.381 GMT   1100: LOG:  database system is ready to accept
connections

We fired up gdb and connected to the coredump, and get the following stack
trace:

# gdb -q -c /database/core.8073.sig6.1394152008s
/usr/lib/postgresql/9.3/bin/postgres

Reading symbols from /usr/lib/postgresql/9.3/bin/postgres...done.
[New LWP 8073]

warning: Can't read pathname for load map: Input/output error.
Core was generated by `postgres: psql_user myfoodatabase [local] SELECT
                                 '.
Program terminated with signal 6, Aborted.
#0  0x1fc3a1e0 in raise () from /lib/powerpc-linux-gnu/libc.so.6
(gdb) bt
#0  0x1fc3a1e0 in raise () from /lib/powerpc-linux-gnu/libc.so.6
#1  0x1fc3f004 in abort () from /lib/powerpc-linux-gnu/libc.so.6
#2  0x2081d578 in errfinish (dummy=0)
    at
/hg/sel/sel3500_pg93/thirdparty/opensource/linux/postgresql/./src/backend/utils/error/elog.c:554
#3  0x2081fe24 in elog_finish (elevel=22,
    fmt=0x2098230c "failed to re-find shared lock object")
    at
/hg/sel/sel3500_pg93/thirdparty/opensource/linux/postgresql/./src/backend/utils/error/elog.c:1318
#4  0x206725b4 in LockRefindAndRelease (lockMethodTable=0x209d0944,
    proc=0x50a86088, locktag=0x2101ccc8, lockmode=1,
    decrement_strong_lock_count=0 '\000')
    at
/hg/sel/sel3500_pg93/thirdparty/opensource/linux/postgresql/./src/backend/storage/lmgr/lock.c:2904
#5  0x206703ac in LockReleaseAll (lockmethodid=1, allLocks=0 '\000')
    at
/hg/sel/sel3500_pg93/thirdparty/opensource/linux/postgresql/./src/backend/storage/lmgr/lock.c:2077
#6  0x20676438 in ProcReleaseLocks (isCommit=1 '\001')
    at
/hg/sel/sel3500_pg93/thirdparty/opensource/linux/postgresql/./src/backend/storage/lmgr/proc.c:752
#7  0x2085b324 in ResourceOwnerReleaseInternal (owner=0x21026230,
    phase=RESOURCE_RELEASE_LOCKS, isCommit=1 '\001', isTopLevel=1 '\001')
    at
/hg/sel/sel3500_pg93/thirdparty/opensource/linux/postgresql/./src/backend/utils/resowner/resowner.c:286
#8  0x2085b07c in ResourceOwnerRelease (owner=0x21026230,
    phase=RESOURCE_RELEASE_LOCKS, isCommit=1 '\001', isTopLevel=1 '\001')
    at
/hg/sel/sel3500_pg93/thirdparty/opensource/linux/postgresql/./src/backend/utils/resowner/resowner.c:206
#9  0x2036d23c in CommitTransaction ()
    at
/hg/sel/sel3500_pg93/thirdparty/opensource/linux/postgresql/./src/backend/access/transam/xact.c:1951
#10 0x2036de24 in CommitTransactionCommand ()
    at
/hg/sel/sel3500_pg93/thirdparty/opensource/linux/postgresql/./src/backend/access/transam/xact.c:2547
#11 0x20691254 in finish_xact_command ()
    at
/hg/sel/sel3500_pg93/thirdparty/opensource/linux/postgresql/./src/backend/tcop/postgres.c:2411
#12 0x2068e6a8 in exec_simple_query (
    query_string=0x2108d8b0 "SELECT public.set_schema(); select
myfoo_broken();")
    at
/hg/sel/sel3500_pg93/thirdparty/opensource/linux/postgresql/./src/backend/tcop/postgres.c:1079
#13 0x20693d90 in PostgresMain (argc=1, argv=0x21024588,
    dbname=0x210244b8 "myfoodatabase", username=0x20ff32b8 "psql_user")
    at
/hg/sel/sel3500_pg93/thirdparty/opensource/linux/postgresql/./src/backend/tcop/postgres.c:3997
#14 0x2060f4c4 in BackendRun (port=0x2100b920)
    at
/hg/sel/sel3500_pg93/thirdparty/opensource/linux/postgresql/./src/backend/postmaster/postmaster.c:3996
#15 0x2060ea20 in BackendStartup (port=0x2100b920)
    at
/hg/sel/sel3500_pg93/thirdparty/opensource/linux/postgresql/./src/backend/postmaster/postmaster.c:3685
#16 0x2060a000 in ServerLoop ()
    at
/hg/sel/sel3500_pg93/thirdparty/opensource/linux/postgresql/./src/backend/postmaster/postmaster.c:1586
#17 0x2060950c in PostmasterMain (argc=4, argv=0x20ff11b8)
    at
/hg/sel/sel3500_pg93/thirdparty/opensource/linux/postgresql/./src/backend/postmaster/postmaster.c:1253
#18 0x20543e7c in main (argc=4, argv=0x20ff11b8)
    at
/hg/sel/sel3500_pg93/thirdparty/opensource/linux/postgresql/./src/backend/main/main.c:206
(gdb)

The offending elog() is in src/backend/storage/lmgr/lock.c, line 2904:


    /*
     * Re-find the lock object (it had better be there).
     */
    lock = (LOCK *) hash_search_with_hash_value(LockMethodLockHash,
                                                (void *) locktag,
                                                hashcode,
                                                HASH_FIND,
                                                NULL);
    if (!lock)
        elog(PANIC, "failed to re-find shared lock object");

So, apparently we are getting a case where (I think) we aren't finding a
lock in a lock hash table.

Below is the body of an extension (the full extension is available if
necessary) that exhibits the problem.  In our actual case

Datum run_foo(PG_FUNCTION_ARGS)
{
   pid_t pid;
   int rc;
   int status;

   /* If 'true' is passed, take the broken path. */
   bool broken = PG_GETARG_BOOL(0);

   pid = fork();
   if ( pid < 0 )
   {
      /* Failed to fork the first child (the child) process */
      ereport( ERROR, (errcode(ERRCODE_RAISE_EXCEPTION), errmsg("fork
error")) );
   }

   /* In the child */
   else if ( pid == 0 )
   {
      struct rlimit rlim;
      int i;

      /* Make us the controlling terminal */
      setsid();

      /* Get the number of file descriptors */
      if ( getrlimit(RLIMIT_NOFILE, &rlim) < 0 )
      {
         exit(EXIT_FAILURE);
      }

      /* Close all our file descriptors */
      for(i=0; i < rlim.rlim_cur; i++)
      {
        close(i);
      }

      /* Fork again to make sure that we are not a session leader. */
      pid = fork();
      if ( pid < 0 )
      {
         /* Failed to fork the second child process (the grand child) */
         exit(EXIT_FAILURE);
      }
      else if ( pid > 0 )
      {

        /* This is the broken case.  This causes the panic. */
        if ( broken )
        {
            /* The first child (the child) process is exiting. */
            exit(EXIT_SUCCESS);
        }

        /* This execlp() "fixes" the problem. We do not see the panic
         * when taking this path.
         */
        else
        {
            execlp("/bin/true", "/bin/true", NULL);

            /* Should never get here */
            exit(EXIT_FAILURE);
        }
      }

      /* Note:  This is where we normally execute the external process.
This example below
       *        is just for demonstration purposes.
       */
      execlp("/bin/false", "/bin/false", NULL);

      // We shouldn't get here
      exit(EXIT_FAILURE);
   }

   // Wait for the first child (the child) to exit
   rc = waitpid(pid, &status, 0);

   // Check the return codes (both the call and the process)
   if( (rc < 0 ) || (WIFEXITED(status) && WEXITSTATUS(status) != 0) )
   {
      ereport( ERROR, (errcode(ERRCODE_RAISE_EXCEPTION),errmsg("wait
error")) );
   }

   PG_RETURN_VOID();
}

Take particular note of the "if (broken)" portion.  If we call exit() from
the grandchild of postgres, we get the panic.  If we call exec(), things are
fine.  Note that calling this function alone does not cause a problem.  Only
when coupled with some other SQL that invokes this function do we get the
panic.

So, we have the following (contrived) example SQL to demonstrate the
problem:

  -- Create a dummy table.  The column 'schema_name' is the name of the
schema we want to add to the search path.
  DROP TABLE IF EXISTS public.dummy_schema;
  CREATE TABLE public.dummy_schema(schema_name TEXT NOT NULL);
  INSERT INTO public.dummy_schema VALUES ('schema_16');

  -- Add the schema to the search path.  Note that selecting from a table,
  -- even if the result is unused, exhibits the problem.  It seems to be
  -- a select coupled with a set_config() that causes problems (of course
  -- paired with our C extension).
  CREATE OR REPLACE FUNCTION public.set_schema( )
  RETURNS VOID AS $$
  DECLARE
    bar TEXT;
  BEGIN
    SELECT schema_name INTO bar FROM public.dummy_schema;
    PERFORM pg_catalog.set_config('search_path', 'schema_16,public',
false);
  END;
  $$ LANGUAGE plpgsql VOLATILE STRICT SECURITY DEFINER;

  -- Create the linkage to the C extension.
  CREATE OR REPLACE FUNCTION schema_16.myfoo(IN broken BOOL)
  RETURNS VOID AS '/usr/lib/postgresql/libmyfoo.so.1', 'run_foo'
  LANGUAGE C VOLATILE STRICT;

Here are some scenarios, and how they function.

  psql -c "select schema_16.myfoo(false);" # Works fine
  psql -c "select schema_16.myfoo(true);" # Works fine

  psql -c "select set_schema(); select myfoo(false);" # Works fine
  psql -c "select set_schema(); select myfoo(true);" # PANIC!

  # Note the examples below "inline" the function set_schema()
  psql -c "select pg_catalog.set_config('search_path', 'schema_16,public',
false); select myfoo(false);" # Works fine
  psql -c "select pg_catalog.set_config('search_path', 'schema_16,public',
false); select myfoo(true);" # Works fine

So for some strange reason, the exec() in the non-"broken" case does not
exhibit the problem.  Are we violating some assumptions or has there been a
change in behavior regarding fork/exec since 8.3?

Thanks,
Pete

pgsql-bugs by date:

Previous
From: martin@antibodymx.net
Date:
Subject: BUG #9455: Subtracting an IPv6 /64 network() from its broadcast() results in an out of range error
Next
From: Peter LaDow
Date:
Subject: Re: BUG #9464: PANIC with 'failed to re-find shared lock object'