Thread: shared memory corruption

shared memory corruption

From: Todd Nemanich
Date: 13 May 2003, 16:03:42

I filled out the template as asked. I'm not certain where the real bug
is, but if anyone has seen this before, or can offer some insight into
where to look, I would appreciate it.

============================================================================
                         POSTGRESQL BUG REPORT TEMPLATE
============================================================================


Your name               :       Todd Nemanich
Your email address      :       todd@twopunks.org


System Configuration
---------------------
   Architecture (example: Intel Pentium)         :       4x Intel Xeon

   Operating System (example: Linux 2.0.26 ELF)  :       Linux 2.4.19

   PostgreSQL version (example: PostgreSQL-7.3.1):   PostgreSQL-7.3.1

   Compiler used (example:  gcc 2.95.2)          :       ? (PGDG 7.3.1 RPMs on RH 7.3)


Please enter a FULL description of your problem:
------------------------------------------------
My postgresql DB dropped into recovery mode, but failed to restart.
Typically, 600-800 backends are running at any time.
Below are some excerpts from the postgres.log:

May 13 14:01:17 db3 postgres[2618]: [1] LOG:  server process (pid 14721)
was terminated by signal 6
May 13 14:01:17 db3 postgres[2618]: [2] LOG:  terminating any other
active server processes
May 13 14:01:17 db3 postgres[15044]: [1-1] WARNING:  Message from
PostgreSQL backend:
May 13 14:01:17 db3 postgres[15044]: [1-2] ^IThe Postmaster has informed
me that some other backend
May 13 14:01:17 db3 postgres[15044]: [1-3] ^Idied abnormally and
possibly corrupted shared memory.
May 13 14:01:17 db3 postgres[15044]: [1-4] ^II have rolled back the
current transaction and am
May 13 14:01:17 db3 postgres[15044]: [1-5] ^Igoing to terminate your
database system connection and exit.
May 13 14:01:17 db3 postgres[15044]: [1-6] ^IPlease reconnect to the
database system and repeat your query.
May 13 14:01:17 db3 postgres[15046]: [1-1] WARNING:  Message from
PostgreSQL backend:
May 13 14:01:17 db3 postgres[15046]: [1-2] ^IThe Postmaster has informed
me that some other backend
May 13 14:01:17 db3 postgres[15031]: [1-1] WARNING:  Message from
PostgreSQL backend:
May 13 14:01:17 db3 postgres[15046]: [1-3] ^Idied abnormally and
possibly corrupted shared memory.
May 13 14:01:17 db3 postgres[14650]: [1-1] WARNING:  Message from
PostgreSQL backend:
May 13 14:01:17 db3 postgres[15046]: [1-4] ^II have rolled back the
current transaction and am
May 13 14:01:17 db3 postgres[15042]: [1-1] WARNING:  Message from
PostgreSQL backend:
May 13 14:01:17 db3 postgres[15032]: [1-1] WARNING:  Message from
PostgreSQL backend:

<skip a couple thousand lines>

May 13 14:30:54 db3 postgres[2100]: [3] FATAL:  The database system is
in recovery mode
May 13 14:30:54 db3 postgres[2132]: [3] FATAL:  The database system is
in recovery mode
May 13 14:30:54 db3 postgres[2618]: [3] LOG:  fast shutdown request
May 13 14:30:54 db3 postgres[2618]: [4] LOG:  all server processes
terminated; reinitializing shared memory and semaphores
May 13 14:30:54 db3 postgres[2139]: [5] FATAL:  The database system is
shutting down
May 13 14:30:54 db3 postgres[2136]: [5] LOG:  database system was
interrupted at 2003-05-13 14:00:10 EDT
May 13 14:30:54 db3 postgres[2138]: [5] FATAL:  The database system is
shutting down
May 13 14:30:54 db3 postgres[2137]: [5] FATAL:  The database system is
shutting down
May 13 14:30:54 db3 postgres[2140]: [5] FATAL:  The database system is
shutting down
May 13 14:30:54 db3 postgres[2136]: [6] LOG:  checkpoint record is at
ED/A7D7CD08
May 13 14:30:54 db3 postgres[2141]: [5] FATAL:  The database system is
shutting down
May 13 14:30:54 db3 postgres[2136]: [7] LOG:  redo record is at
ED/A7BBEF88; undo record is at 0/0; shutdown FALSE
May 13 14:30:54 db3 postgres[2136]: [8] LOG:  next transaction id:
754449278; next oid: 33734849
May 13 14:30:54 db3 postgres[2136]: [9] LOG:  database system was not
properly shut down; automatic recovery in progress
May 13 14:30:54 db3 postgres[2136]: [10] LOG:  redo starts at ED/A7BBEF88
May 13 14:30:54 db3 postgres[2142]: [5] FATAL:  The database system is
shutting down

<skip the shutdown messages>

May 13 14:31:22 db3 postgres[2816]: [5] FATAL:  The database system is
shutting down
May 13 14:31:22 db3 postgres[2758]: [6] LOG:  recycled transaction log
file 000000ED000000A9
May 13 14:31:22 db3 postgres[2758]: [7] LOG:  recycled transaction log
file 000000ED000000AA
May 13 14:31:22 db3 postgres[2758]: [8] LOG:  recycled transaction log
file 000000ED000000AB
May 13 14:31:22 db3 postgres[2758]: [9] LOG:  recycled transaction log
file 000000ED000000A7
May 13 14:31:22 db3 postgres[2758]: [10] LOG:  recycled transaction log
file 000000ED000000A8
May 13 14:31:22 db3 postgres[2758]: [11] LOG:  database system is shut down
May 13 14:31:36 db3 postgres[2877]: [1] FATAL:  The database system is
starting up
May 13 14:31:36 db3 postgres[2876]: [1] LOG:  database system was shut
down at 2003-05-13 14:31:22 EDT
May 13 14:31:36 db3 postgres[2876]: [2] LOG:  checkpoint record is at
ED/ACB32480
May 13 14:31:36 db3 postgres[2876]: [3] LOG:  redo record is at
ED/ACB32480; undo record is at 0/0; shutdown TRUE
May 13 14:31:36 db3 postgres[2876]: [4] LOG:  next transaction id:
754504089; next oid: 33734849
May 13 14:31:36 db3 postgres[2878]: [1] FATAL:  The database system is
starting up
May 13 14:31:36 db3 postgres[2876]: [5] LOG:  database system is ready




Please describe a way to repeat the problem.   Please try to provide a
concise reproducible example, if at all possible:
----------------------------------------------------------------------
No idea. Any suggestions as to where to look for the cause would be
appreciated.




If you know how this problem might be fixed, list the solution below:
---------------------------------------------------------------------

Re: shared memory corruption

From: Tom Lane
Date: 13 May 2003, 20:24:15

Todd Nemanich <todd@twopunks.org> writes:
>    PostgreSQL version (example: PostgreSQL-7.3.1):   PostgreSQL-7.3.1
>    Compiler used (example:  gcc 2.95.2)          :       ? (PGDG 7.3.1
> rpms on RH 7.3)

> May 13 14:01:17 db3 postgres[2618]: [1] LOG:  server process (pid 14721)
> was terminated by signal 6

Hmm.  Signal 6 is SIGABRT, which suggests that that backend aborted
itself after detecting an Assert() failure.  But I didn't think that
the RPM version was compiled with assertions enabled.  Also, if it was
an assert then there should have been a complaint about it just before
the termination message.
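
A minimal Python sketch of the signal-number reading above, assuming
Linux signal numbering. It is not from the thread; the helper names
signal_name and run_aborting_child are illustrative. It looks up the
symbolic name for signal 6 and shows that a process calling abort() is
reported by its parent as killed by that signal:

```python
import signal
import subprocess
import sys


def signal_name(number: int) -> str:
    """Return the symbolic name for a signal number on this platform."""
    return signal.Signals(number).name


def run_aborting_child() -> int:
    """Run a child Python process that calls abort() and return its status.

    subprocess reports death by signal N as a return code of -N.
    """
    child = subprocess.run(
        [sys.executable, "-c", "import os; os.abort()"],
        stderr=subprocess.DEVNULL,
    )
    return child.returncode


if __name__ == "__main__":
    print(signal_name(6))
    print(run_aborting_child())
```

A negative return code here corresponds to the "was terminated by
signal 6" wording in the postmaster log.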

If this is repeatable, I'd suggest restarting the postmaster under
"ulimit -c unlimited" so that the abort will produce a core-dump file.
A debugger backtrace from the core file would provide useful info.

            regards, tom lane
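
The "ulimit -c unlimited" suggestion can also be approximated from a
launcher script. The following is a hedged sketch, not something from
the thread: raise_core_limit and start_with_core_dumps are hypothetical
helpers, and they raise the soft core-file limit only as far as the hard
limit, which is the most an unprivileged process is allowed to do (so it
is not identical to "unlimited" when the hard limit is lower):

```python
from __future__ import annotations

import resource
import subprocess
import sys


def raise_core_limit() -> None:
    """Raise the soft core-file size limit to the current hard limit."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))


def start_with_core_dumps(command: list[str]) -> subprocess.Popen:
    """Launch command with the core-file limit raised in the child only.

    preexec_fn runs in the child after fork and before exec, so the
    launched process and its descendants inherit the new limit while the
    calling process keeps its own.
    """
    return subprocess.Popen(command, preexec_fn=raise_core_limit)


if __name__ == "__main__":
    # Stand-in command; a real run would start the server here instead.
    proc = start_with_core_dumps([sys.executable, "-c", "pass"])
    print(proc.wait())
```

Whether a core file actually appears also depends on system settings
outside this sketch, such as the hard limit and the kernel's core file
naming pattern. Once a core file exists, opening it in a debugger
together with the server executable and requesting a backtrace gives
the information asked for above.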

Re: shared memory corruption

From: Todd Nemanich
Date: 14 May 2003, 09:55:38

This happened once last month as well, but we were not able to nail
it down then either. I'll restart the server when I get a chance, to see
if we can get a core dump the next time it happens. Thanks.

Tom Lane wrote:
> Todd Nemanich <todd@twopunks.org> writes:
>
>>   PostgreSQL version (example: PostgreSQL-7.3.1):   PostgreSQL-7.3.1
>>   Compiler used (example:  gcc 2.95.2)          :       ? (PGDG 7.3.1
>>rpms on RH 7.3)
>
>
>>May 13 14:01:17 db3 postgres[2618]: [1] LOG:  server process (pid 14721)
>>was terminated by signal 6
>
>
> Hmm.  Signal 6 is SIGABRT, which suggests that that backend aborted
> itself after detecting an Assert() failure.  But I didn't think that
> the RPM version was compiled with assertions enabled.  Also, if it was
> an assert then there should have been a complaint about it just before
> the termination message.
>
> If this is repeatable, I'd suggest restarting the postmaster under
> "ulimit -c unlimited" so that the abort will produce a core-dump file.
> A debugger backtrace from the core file would provide useful info.
>
>             regards, tom lane