Thread: Notice and share memory corruption
I get the following on untuned Linux (Redhat 6.2) using stock 7.0.2 rpm-s

NOTICE: RegisterSharedInvalid: SI buffer overflow
NOTICE: InvalidateSharedInvalid: cache state reset

Actually I get many of them ;(

I'm running a script that does a bunch of mixed INSERTS, UPDATES, DELETES and SELECTS.

after getting that I'm unable to vacuum database until I reset the OS

Where/how should I start looking (or is it a known problem)

Are there any simple workarounds to stop it happening.

-----------
Hannu

Hannu Krosing <hannu@tm.ee> writes:
> I get the following on untuned Linux (Redhat 6.2) using stock 7.0.2
> rpm-s

> NOTICE: RegisterSharedInvalid: SI buffer overflow
> NOTICE: InvalidateSharedInvalid: cache state reset

> Actually I get many of them ;(

AFAIK, these are just noise in 7.0. The only reason you see them is
we haven't got round to removing the messages or downgrading them to
elog(DEBUG).

> I'm running a script that does a bunch of mixed INSERTS, UPDATES,
> DELETES and SELECTS.

I'll bet you also have some backends sitting idle with open
transactions? The combination of idle and active backends is what
usually provokes SI overruns.

> after getting that I'm unable to vacuum database until I reset the OS

Define your terms more carefully, please. What do you mean by
"unable to vacuum" --- what happens *exactly*?

In any case, surely it doesn't take an OS reboot to recover. I might
believe you need to restart the postmaster...

regards, tom lane

Tom Lane wrote:
>
> Hannu Krosing <hannu@tm.ee> writes:
> > I get the following on untuned Linux (Redhat 6.2) using stock 7.0.2
> > rpm-s
>
> > NOTICE: RegisterSharedInvalid: SI buffer overflow
> > NOTICE: InvalidateSharedInvalid: cache state reset
>
> > Actually I get many of them ;(
>
> AFAIK, these are just noise in 7.0. The only reason you see them is
> we haven't got round to removing the messages or downgrading them to
> elog(DEBUG).
>
> > I'm running a script that does a bunch of mixed INSERTS, UPDATES,
> > DELETES and SELECTS.
>
> I'll bet you also have some backends sitting idle with open
> transactions? The combination of idle and active backends is what
> usually provokes SI overruns.
>
> > after getting that I'm unable to vacuum database until I reset the OS
>
> Define your terms more carefully, please. What do you mean by
> "unable to vacuum" --- what happens *exactly*?

NOTICE: FlushRelationBuffers(access_right, 2009): block 1944 is referenced (private 0, global 2)
FATAL 1: VACUUM (vc_repair_frag): FlushRelationBuffers returned -2
pqReadData() -- backend closed the channel unexpectedly.
This probably means the backend terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

> In any case,
> surely it doesn't take an OS reboot to recover. I might believe
> you need to restart the postmaster...

on one machine a simple restart worked

Maybe i have to really restart it (instead of doing
/etc/rc.d/init.d/postgresql restart)
by running killall -9 /usr/bin/postgres

I was quite sure that just restarting it did not help, but maybe it
really did not restart, just claimed to .

On the other I still get

amphora2=# vacuum;
NOTICE: FlushRelationBuffers(item, 30): block 2 is referenced (private 0, global 1)
FATAL 1: VACUUM (vc_repair_frag): FlushRelationBuffers returned -2
pqReadData() -- backend closed the channel unexpectedly.
This probably means the backend terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

after stopping postmaster (and checking it is stopped)

I could do a vacuum after restarting the whole machine...

OTOH it _may_ be that someone started another backend right after
restart and did something, but must this be a FATAL error ?

-----------
Hannu

Hannu Krosing <hannu@tm.ee> writes:
>> Define your terms more carefully, please. What do you mean by
>> "unable to vacuum" --- what happens *exactly*?

> NOTICE: FlushRelationBuffers(access_right, 2009): block 1944 is
> referenced (private 0, global 2)
> FATAL 1: VACUUM (vc_repair_frag): FlushRelationBuffers returned -2

Oh, that's interesting. This error indicates that some prior
transaction neglected to release a reference count on a shared buffer.
We have seen sporadic reports of this problem in 7.0, but so far no one
has come up with a reproducible example. If you can boil down your
script to something that reproducibly causes the problem then that'd be
a great help in tracking it down.

If you have clients that sometimes disconnect in the middle of a
transaction, it might help to apply the attached patch.

> Maybe i have to really restart it (instead of doing
> /etc/rc.d/init.d/postgresql restart)
> by running killall -9 /usr/bin/postgres

Restarting the postmaster should clear the problem (by releasing and
reinitializing shared memory). I dunno where you got the idea that
kill -9 was a recommended way of shutting down the system, but I sure
wouldn't recommend it. A plain kill on the postmaster ought to do it
(see the pg_ctl script in release 7.0.*).

regards, tom lane

*** src/backend/tcop/postgres.c.orig Sat May 20 22:23:30 2000
--- src/backend/tcop/postgres.c Wed Aug 30 16:47:51 2000
***************
*** 1459,1465 ****
       * Initialize the deferred trigger manager
       */
      if (DeferredTriggerInit() != 0)
!         proc_exit(0);
  
      SetProcessingMode(NormalProcessing);
  
--- 1459,1465 ----
       * Initialize the deferred trigger manager
       */
      if (DeferredTriggerInit() != 0)
!         goto normalexit;
  
      SetProcessingMode(NormalProcessing);
  
***************
*** 1479,1490 ****
              TPRINTF(TRACE_VERBOSE, "AbortCurrentTransaction");
  
          AbortCurrentTransaction();
!         InError = false;
          if (ExitAfterAbort)
!         {
!             ProcReleaseLocks(); /* Just to be sure... */
!             proc_exit(0);
!         }
      }
  
      Warn_restart_ready = true; /* we can now handle elog(ERROR) */
--- 1479,1489 ----
              TPRINTF(TRACE_VERBOSE, "AbortCurrentTransaction");
  
          AbortCurrentTransaction();
! 
          if (ExitAfterAbort)
!             goto errorexit;
! 
!         InError = false;
      }
  
      Warn_restart_ready = true; /* we can now handle elog(ERROR) */
***************
*** 1553,1560 ****
                  if (HandleFunctionRequest() == EOF)
                  {
                      /* lost frontendconnection during F message input */
!                     pq_close();
!                     proc_exit(0);
                  }
                  break;
  
--- 1552,1558 ----
                  if (HandleFunctionRequest() == EOF)
                  {
                      /* lost frontendconnection during F message input */
!                     goto normalexit;
                  }
                  break;
  
***************
*** 1608,1618 ****
                   */
              case 'X':
              case EOF:
!                 if (!IsUnderPostmaster)
!                     ShutdownXLOG();
!                 pq_close();
!                 proc_exit(0);
!                 break;
  
              default:
                  elog(ERROR, "unknown frontend message was received");
--- 1606,1612 ----
                   */
              case 'X':
              case EOF:
!                 goto normalexit;
  
              default:
                  elog(ERROR, "unknown frontend message was received");
***************
*** 1642,1651 ****
              if (IsUnderPostmaster)
                  NullCommand(Remote);
          }
!     } /* infinite for-loop */
!     proc_exit(0); /* shouldn't get here... */
!     return 1;
  }
  
  #ifndef HAVE_GETRUSAGE
--- 1636,1655 ----
              if (IsUnderPostmaster)
                  NullCommand(Remote);
          }
!     } /* end of main loop */
! 
! normalexit:
!     ExitAfterAbort = true; /* ensure we will exit if elog during abort */
!     AbortOutOfAnyTransaction();
!     if (!IsUnderPostmaster)
!         ShutdownXLOG();
! 
! errorexit:
!     pq_close();
!     ProcReleaseLocks(); /* Just to be sure... */
!     proc_exit(0);
! 
!     return 1; /* keep compiler quiet */
  }
  
  #ifndef HAVE_GETRUSAGE