Thread: newbie admin question: postgresql server crashes

newbie admin question: postgresql server crashes

From
Murthy Kambhampaty
Date:
Our postgresql server crashed this morning unexpectedly. Attached is our
server log from before the crash. What's perplexing is the sequence
"2002-08-01 10:17:30 DEBUG:  server process (pid 13715) was terminated by
signal 11
2002-08-01 10:17:30 DEBUG:  terminating any other active server processes
2002-08-01 10:17:30 DEBUG:  all server processes terminated; reinitializing
shared memory and semaphores
2002-08-01 10:17:30 FATAL 1:  The database system is starting up
2002-08-01 10:17:30 DEBUG:  database system was interrupted at 2002-08-01
05:30:41 EDT"

But there is nor report of any problems at 05:30:41 EDT.

I should note that we deployed this server last week with
checkpoint_segments=3 and fsync off. The server crashed repeatedly on file
imports when the checkpoing interval dropped below 5 secs. This would lead
to corrupted tables, and vacuuming those tables would also cause the server
to crash with the "pg_clog/nnnn no such file" message, which I gather means
the table has corrupted data. Since then I have increased the
checkpoint_segments to 64, and as the log shows the interval is back to five
minutes per the documentation, and with fsync on, it is my understanding
that we should not be seeing any corrupted tables. However, the server would
crash if we vacuumed any of the tables previously corrupted, and then crash
again with "signal 11" just about an hour later. So we started restarting
the server after every crash while vacuuming a corrupted table (we had
several we wanted to check through, and didn't want to just drop and
recreate). But we got through those, and now don't think we have any
corrupted tables (we can vacuum all that are left with success). So the
crashes on "signal 11" since this morning are totally perplexing.

The server is a quad PIII Xeon 550 MHz 2MB cache, with 4GB mem, mylex
acceleraid 352 attached raid5 array running 2.4.18-SGI_XFS-1.1 on an
extended partition (not LVM). Help?! (I'm not on the list, though in the
process of joining, so please copy me on replies.)

Thanks,
    Murthy


S. Murthy Kambhampaty
Glassman-Oliver Economic Consultants, Inc.
Washington, DC




2002-08-01 04:58:51 DEBUG:  recycled transaction log file 00000040000000A6
2002-08-01 04:58:51 DEBUG:  recycled transaction log file 00000040000000A7
2002-08-01 04:58:51 DEBUG:  recycled transaction log file 00000040000000A8
2002-08-01 04:58:51 DEBUG:  recycled transaction log file 0000004000000095
2002-08-01 04:58:51 DEBUG:  recycled transaction log file 0000004000000096
2002-08-01 04:58:51 DEBUG:  recycled transaction log file 0000004000000097
2002-08-01 04:58:51 DEBUG:  recycled transaction log file 0000004000000098
2002-08-01 04:58:51 DEBUG:  recycled transaction log file 0000004000000099
2002-08-01 04:58:51 DEBUG:  recycled transaction log file 000000400000009A
2002-08-01 04:58:51 DEBUG:  recycled transaction log file 00000040000000A9
2002-08-01 04:58:51 DEBUG:  recycled transaction log file 00000040000000AA
2002-08-01 04:58:51 DEBUG:  recycled transaction log file 00000040000000AB
2002-08-01 04:58:51 DEBUG:  recycled transaction log file 00000040000000AC
2002-08-01 04:58:51 DEBUG:  recycled transaction log file 00000040000000AD
2002-08-01 04:58:51 DEBUG:  recycled transaction log file 00000040000000AE
2002-08-01 04:58:51 DEBUG:  recycled transaction log file 00000040000000AF
2002-08-01 04:58:51 DEBUG:  recycled transaction log file 00000040000000B0
2002-08-01 04:58:51 DEBUG:  recycled transaction log file 000000400000009B
2002-08-01 04:58:51 DEBUG:  recycled transaction log file 000000400000009C
2002-08-01 04:58:51 DEBUG:  recycled transaction log file 000000400000009D
2002-08-01 04:58:51 DEBUG:  recycled transaction log file 000000400000009E
2002-08-01 04:58:51 DEBUG:  recycled transaction log file 000000400000009F
2002-08-01 04:58:51 DEBUG:  recycled transaction log file 00000040000000A0
2002-08-01 04:58:51 DEBUG:  recycled transaction log file 00000040000000A1
2002-08-01 04:58:51 DEBUG:  recycled transaction log file 00000040000000A2
2002-08-01 04:58:51 DEBUG:  recycled transaction log file 00000040000000A3
2002-08-01 04:58:51 DEBUG:  recycled transaction log file 00000040000000A4
2002-08-01 04:58:51 DEBUG:  recycled transaction log file 00000040000000A5
2002-08-01 05:00:55 DEBUG:  connection: host=[local] user=[xxxxx]
database=[xxxxx]
2002-08-01 05:00:55 NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create
implicit index 'level6_sales_fact_tot_dry_pkey' for table
'level6_sales_fact_tot_dry'
2002-08-01 05:04:17 DEBUG:  recycled transaction log file 00000040000000B1
2002-08-01 05:04:17 DEBUG:  recycled transaction log file 00000040000000B2
2002-08-01 05:04:17 DEBUG:  recycled transaction log file 00000040000000B3
2002-08-01 05:04:17 DEBUG:  recycled transaction log file 00000040000000B4
2002-08-01 05:04:17 DEBUG:  recycled transaction log file 00000040000000B5
2002-08-01 05:04:17 DEBUG:  recycled transaction log file 00000040000000B6
2002-08-01 05:04:17 DEBUG:  recycled transaction log file 00000040000000B7
2002-08-01 05:04:17 DEBUG:  recycled transaction log file 00000040000000B8
2002-08-01 05:04:17 DEBUG:  recycled transaction log file 00000040000000B9
2002-08-01 05:04:17 DEBUG:  recycled transaction log file 00000040000000BA
2002-08-01 05:04:17 DEBUG:  recycled transaction log file 00000040000000BB
2002-08-01 05:04:17 DEBUG:  recycled transaction log file 00000040000000BC
2002-08-01 05:05:10 DEBUG:  connection: host=[local] user=[xxxxx]
database=[xxxxx]
2002-08-01 05:05:10 NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create
implicit index 'level5_sales_dim_tot_dry_pkey' for table
'level5_sales_dim_tot_dry'
2002-08-01 05:15:00 DEBUG:  connection: host=[local] user=[xxxxx]
database=[xxxxx]
2002-08-01 05:15:00 NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create
implicit index 'level5_sales_fact_tot_dry_pkey' for table
'level5_sales_fact_tot_dry'
2002-08-01 09:06:12 DEBUG:  connection: host=[local] user=[xxxxx]
database=[xxxxx]
2002-08-01 09:06:12 DEBUG:  connection: host=[local] user=[xxxxx]
database=[xxxxx]
2002-08-01 10:17:30 DEBUG:  server process (pid 13715) was terminated by
signal 11
2002-08-01 10:17:30 DEBUG:  terminating any other active server processes
2002-08-01 10:17:30 DEBUG:  all server processes terminated; reinitializing
shared memory and semaphores
2002-08-01 10:17:30 FATAL 1:  The database system is starting up
2002-08-01 10:17:30 DEBUG:  database system was interrupted at 2002-08-01
05:30:41 EDT
2002-08-01 10:17:30 DEBUG:  checkpoint record is at 40/BDC984D0
2002-08-01 10:17:30 DEBUG:  redo record is at 40/BDC984D0; undo record is at
0/0; shutdown FALSE
2002-08-01 10:17:30 DEBUG:  next transaction id: 54583; next oid: 3618449
2002-08-01 10:17:30 DEBUG:  database system was not properly shut down;
automatic recovery in progress
2002-08-01 10:17:30 DEBUG:  ReadRecord: record with zero length at
40/BDC98510
2002-08-01 10:17:30 DEBUG:  redo is not required
2002-08-01 10:17:30 FATAL 1:  The database system is starting up
2002-08-01 10:17:30 FATAL 1:  The database system is starting up
2002-08-01 10:17:30 FATAL 1:  The database system is starting up
2002-08-01 10:17:32 DEBUG:  database system is ready
2002-08-01 11:40:01 DEBUG:  fast shutdown request
2002-08-01 11:40:01 DEBUG:  shutting down
2002-08-01 11:40:03 DEBUG:  database system is shut down
2002-08-01 11:40:12 DEBUG:  database system was shut down at 2002-08-01
11:40:03 EDT
2002-08-01 11:40:12 DEBUG:  checkpoint record is at 40/BDC98550
2002-08-01 11:40:12 DEBUG:  redo record is at 40/BDC98550; undo record is at
0/0; shutdown TRUE
2002-08-01 11:40:12 DEBUG:  next transaction id: 54583; next oid: 3618449
2002-08-01 11:40:12 DEBUG:  database system is ready
2002-08-01 11:40:20 DEBUG:  connection: host=[local] user=[xxxxx]
database=[xxxxx]
2002-08-01 11:40:21 DEBUG:  connection: host=[local] user=[xxxxx]
database=[xxxxx]
2002-08-01 12:51:44 DEBUG:  server process (pid 8695) was terminated by
signal 11
2002-08-01 12:51:44 DEBUG:  terminating any other active server processes
2002-08-01 12:51:44 DEBUG:  all server processes terminated; reinitializing
shared memory and semaphores
2002-08-01 12:51:44 FATAL 1:  The database system is starting up
2002-08-01 12:51:44 DEBUG:  database system was interrupted at 2002-08-01
11:40:12 EDT
2002-08-01 12:51:44 DEBUG:  checkpoint record is at 40/BDC98550
2002-08-01 12:51:44 DEBUG:  redo record is at 40/BDC98550; undo record is at
0/0; shutdown TRUE
2002-08-01 12:51:44 DEBUG:  next transaction id: 54583; next oid: 3618449
2002-08-01 12:51:44 DEBUG:  database system was not properly shut down;
automatic recovery in progress
2002-08-01 12:51:44 FATAL 1:  The database system is starting up
2002-08-01 12:51:44 FATAL 1:  The database system is starting up
2002-08-01 12:51:44 DEBUG:  ReadRecord: record with zero length at
40/BDC98590
2002-08-01 12:51:44 DEBUG:  redo is not required
2002-08-01 12:51:44 FATAL 1:  The database system is starting up
2002-08-01 12:51:46 DEBUG:  database system is ready

Re: newbie admin question: postgresql server crashes

From
Tom Lane
Date:
Murthy Kambhampaty <Murthy.Kambhampaty@goeci.com> writes:
> Our postgresql server crashed this morning unexpectedly. Attached is our
> server log from before the crash. What's perplexing is the sequence
> "2002-08-01 10:17:30 DEBUG:  server process (pid 13715) was terminated by
> signal 11

Core file?  Stack backtrace?

> 2002-08-01 10:17:30 DEBUG:  database system was interrupted at 2002-08-01
> 05:30:41 EDT"

> But there is nor report of any problems at 05:30:41 EDT.

I believe that's effectively the time of the last checkpoint.  If the
system was idle between 5:30 and 10:15 or so then there's nothing very
surprising there.  Tell us about the core dump.

> I should note that we deployed this server last week with
> checkpoint_segments=3 and fsync off. The server crashed repeatedly on file
> imports when the checkpoing interval dropped below 5 secs.

Define "crash".  Again, stack backtraces would be useful.

FWIW, a lot of the reports of instability we've seen recently have
traced to hardware problems.  How confident are you of this hardware?

            regards, tom lane