Thread: postgres server crashes unexpectedly
hi all,
we've had a db stop working for some reason and have searched google to its ends to no avail. we have reindexed the database (including the system catalogs) and tried to determine other things that would keep it from working... again, to no avail.
this just started occurring about 2 weeks ago, and below is a recent snippet of the error log that we have. any idea what would/could be causing these crashes? any response is much appreciated, as this is not a time-critical issue for us.
thanks!
chadwick
error log:
LOG: redo is not required
LOG: database system is ready
PANIC: corrupted item pointer: offset = 0, size = 0
LOG: autovacuum process (PID 3037) was terminated by signal 6
LOG: terminating any other active server processes
LOG: all server processes terminated; reinitializing
LOG: database system was interrupted at 2008-03-18 02:55:30 PDT
LOG: checkpoint record is at 25/6068AE30
LOG: redo record is at 25/6068AE30; undo record is at 0/0; shutdown TRUE
LOG: next transaction ID: 0/414366728; next OID: 240102
LOG: next MultiXactId: 1132; next MultiXactOffset: 2431
LOG: database system was not properly shut down; automatic recovery in progress
LOG: record with zero length at 25/6068AE80
LOG: redo is not required
LOG: database system is ready
PANIC: corrupted item pointer: offset = 0, size = 0
LOG: autovacuum process (PID 3045) was terminated by signal 6
LOG: terminating any other active server processes
LOG: all server processes terminated; reinitializing
LOG: database system was interrupted at 2008-03-18 02:56:31 PDT
LOG: checkpoint record is at 25/6068AE80
LOG: redo record is at 25/6068AE80; undo record is at 0/0; shutdown TRUE
LOG: next transaction ID: 0/414366728; next OID: 240102
LOG: next MultiXactId: 1132; next MultiXactOffset: 2431
LOG: database system was not properly shut down; automatic recovery in progress
LOG: record with zero length at 25/6068AED0
LOG: redo is not required
LOG: database system is ready
PANIC: corrupted item pointer: offset = 0, size = 0
LOG: autovacuum process (PID 3072) was terminated by signal 6
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
LOG: all server processes terminated; reinitializing
LOG: database system was interrupted at 2008-03-18 02:57:32 PDT
LOG: checkpoint record is at 25/6068AED0
LOG: redo record is at 25/6068AED0; undo record is at 0/0; shutdown TRUE
LOG: next transaction ID: 0/414366728; next OID: 240102
LOG: next MultiXactId: 1132; next MultiXactOffset: 2431
LOG: database system was not properly shut down; automatic recovery in progress
LOG: record with zero length at 25/6068AF20
LOG: redo is not required
LOG: database system is ready
"Chadwick Horn" <chadhorn@gmail.com> writes:
> PANIC: corrupted item pointer: offset = 0, size = 0
> LOG: autovacuum process (PID 3037) was terminated by signal 6

Hmm ... the only instances of that error text are in PageIndexTupleDelete and PageIndexMultiDelete, so we can fairly safely say that you have a partially zeroed-out page in some index somewhere. If that's the only damage then you're in luck: you can recover by reindexing.

What I'd do is turn off autovacuum and instead do a manual VACUUM VERBOSE to see where it crashes; then you could just reindex the one problem table instead of the whole database.

You ought to look into why this happened, too. Since you've provided precisely 0 context about PG version or platform, it's hard to speculate about that ...

regards, tom lane
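[Editor's note: the procedure suggested above might look like the following sketch; the placeholder table name is hypothetical, and autovacuum would need to be disabled in postgresql.conf before running it.]

```sql
-- Sketch of the suggested recovery procedure (assumes autovacuum has
-- already been turned off via "autovacuum = off" in postgresql.conf,
-- followed by a server reload/restart).

-- 1. Run a manual verbose vacuum; the PANIC will identify which
--    table's index it dies on:
VACUUM VERBOSE;

-- 2. Rebuild only that table's indexes rather than the whole database
--    (<problem_table> is a placeholder for whatever table step 1 names):
REINDEX TABLE <problem_table>;
```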
Hi there,

Sorry about the lack of information on the system. We're running Fedora Core (not sure which version, though) on a whitebox server.

I did as you said and this is the result:

DETAIL: 0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.01 sec.
INFO: "grp_member": moved 0 row versions, truncated 4 to 4 pages
DETAIL: CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: vacuuming "public.story_member"
INFO: "story_member": found 603570 removable, 9903 nonremovable row versions in 43011 pages
DETAIL: 0 dead row versions cannot be removed yet.
Nonremovable row versions range from 44 to 44 bytes long.
There were 6139208 unused item pointers.
Total free space (including removable row versions) is 323999824 bytes.
42732 pages are or will become empty, including 0 at the end of the table.
42958 pages containing 323999400 free bytes are potential move destinations.
CPU 0.52s/0.18u sec elapsed 5.91 sec.
INFO: index "fkx_story__story_member" now contains 9903 row versions in 17736 pages
DETAIL: 64 index row versions were removed.
15219 index pages have been deleted, 15219 are currently reusable.
CPU 0.29s/0.06u sec elapsed 26.88 sec.
PANIC: corrupted item pointer: offset = 0, size = 0
server closed the connection unexpectedly
        This probably means the server terminated abnormally before or while processing the request.
The connection to the server was lost. Attempting reset: WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
Failed.
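[Editor's note: the VERBOSE output above crashes right after finishing the index "fkx_story__story_member", which suggests the damage is in a later index of story_member. A catalog query along these lines would list all of that table's indexes so each can be rebuilt in turn; the table and schema names are taken from the output above.]

```sql
-- List every index on public.story_member (the table named in the
-- VACUUM VERBOSE output), so the corrupted one can be found and rebuilt.
SELECT indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'public'
  AND tablename  = 'story_member';
```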
I keep getting this error:

WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
        This probably means the server terminated abnormally before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

What could be doing this? It just started out of the blue... I reindexed the index it mentioned and it seems to error out more...

-Chadwick

----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
To: "Chadwick Horn" <chadhorn@gmail.com>
Cc: <pgsql-sql@postgresql.org>
Sent: Monday, March 17, 2008 7:32 PM
Subject: Re: [SQL] postgres server crashes unexpectedly

> "Chadwick Horn" <chadhorn@gmail.com> writes:
>> PANIC: corrupted item pointer: offset = 0, size = 0
>> LOG: autovacuum process (PID 3037) was terminated by signal 6
>
> Hmm ... the only instances of that error text are in PageIndexTupleDelete
> and PageIndexMultiDelete, so we can fairly safely say that you have a
> partially zeroed-out page in some index somewhere. If that's the only
> damage then you're in luck: you can recover by reindexing.
>
> What I'd do is turn off autovacuum and instead do a manual VACUUM
> VERBOSE to see where it crashes; then you could just reindex the one
> problem table instead of the whole database.
>
> You ought to look into why this happened, too. Since you've provided
> precisely 0 context about PG version or platform, it's hard to speculate
> about that ...
>
> regards, tom lane
On Tue, 18 Mar 2008, Chadwick Horn wrote:
> Sorry about the lack of information on the system. We're running fedora (not
> for sure what version though) core (whitebox).

This may not matter in the least bit, but have you tried running the DB on a real RHEL or CentOS box? The kernel and libs on such a box would most likely be more stable than those on Fedora-based boxen...

Cheers,
-Josh
In all honesty, we're fairly "trapped" on the box we have due to the depths of corporate approvals required to get something new online. I would most definitely prefer to be on anything BUT this...

----- Original Message -----
From: "Joshua Kramer" <josh@globalherald.net>
To: "Chadwick Horn" <chadhorn@gmail.com>
Cc: "Tom Lane" <tgl@sss.pgh.pa.us>; <pgsql-sql@postgresql.org>
Sent: Tuesday, March 18, 2008 8:37 AM
Subject: Re: [SQL] postgres server crashes unexpectedly

> On Tue, 18 Mar 2008, Chadwick Horn wrote:
>
>> Sorry about the lack of information on the system. We're running fedora
>> (not for sure what version though) core (whitebox).
>
> This may not matter in the least bit, but have you tried running the DB on
> a real RHEL, or CentOS box? The kernel and libs on such a box would most
> likely be more stable than those on Fedora-based boxen...
>
> Cheers,
> -Josh
"Chadwick Horn" <chadhorn@gmail.com> writes:
> I keep getting this error:

> Attempting reset: WARNING: terminating connection because of crash of
> another server process

It looks to me like psql is managing to start a new connection before the postmaster notices the crash of the prior backend and tells everybody to get out of town. Which is odd, but maybe not too implausible if your kernel is set up to favor interactive processes over background --- it'd likely think psql is interactive and the postmaster isn't.

> What could be doing this? It just started out of the blue... I reindexed the
> index it mentioned and it seems to error out more...

If you reindexed only the last-mentioned index, then you reindexed the wrong thing; it presumably died on the next index of story_member. I'd reindex the whole table rather than guess which that is.

You should also consider the not-zero probability that you have more than one corrupted index. Keep reindexing tables until you can get through a database-wide VACUUM.

regards, tom lane
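[Editor's note: rebuilding every index on the table at once, rather than guessing which single index the PANIC came from, might look like this; story_member is the table named earlier in the thread, and the sequence would be repeated per table until a database-wide VACUUM completes cleanly.]

```sql
-- Rebuild all indexes on the suspect table in one step, instead of
-- guessing which individual index holds the zeroed-out page:
REINDEX TABLE story_member;

-- Then verify the table can be vacuumed without a PANIC; repeat the
-- REINDEX/VACUUM pair on any other table that still crashes:
VACUUM VERBOSE story_member;
```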
> "Chadwick Horn" <chadhorn@gmail.com> writes:
>> I keep getting this error:
>
>> Attempting reset: WARNING: terminating connection because of crash of
>> another server process
>
> It looks to me like psql is managing to start a new connection before
> the postmaster notices the crash of the prior backend and tells
> everybody to get out of town. Which is odd, but maybe not too
> implausible if your kernel is set up to favor interactive processes over
> background --- it'd likely think psql is interactive and the postmaster
> isn't.

Is there a way to disable this, or to make both interactive and/or background?

>> What could be doing this? It just started out of the blue... I reindexed
>> the index it mentioned and it seems to error out more...
>
> If you reindexed only the last-mentioned index, then you reindexed the
> wrong thing; it presumably died on the next index of story_member.
> I'd reindex the whole table rather than guess which that is.
>
> You should also consider the not-zero probability that you have more
> than one corrupted index. Keep reindexing tables until you can get
> through a database-wide VACUUM.

I have VACUUM'd it until its fibers are coming out. It seems to crash at various places (which, most likely, would be resolved if question #1 above is possible) and holds no consistency. The error logs provide even fewer clues than the verbose output.

-chadwick
Chadwick Horn wrote:
>> It looks to me like psql is managing to start a new connection
>> before the postmaster notices the crash of the prior backend and
>> tells everybody to get out of town. Which is odd, but maybe not
>> too implausible if your kernel is set up to favor interactive
>> processes over background --- it'd likely think psql is
>> interactive and the postmaster isn't.
>
> Is there a way to disable this or to make both interactive and/or
> background?

I'm not sure how applications tell the kernel whether they are interactive or background (or even if they do, at all), but you can set the kernel's preference for this in the kernel configuration. If you're not comfortable recompiling a new kernel, though, then you're out of luck.

At any rate, you should look more thoroughly for problems with your database before blaming the kernel for something.

Colin