Thread: Notice and share memory corruption

Notice and share memory corruption

From: Hannu Krosing
I get the following on untuned Linux (Redhat 6.2) using stock 7.0.2
rpm-s

NOTICE:  RegisterSharedInvalid: SI buffer overflow
NOTICE:  InvalidateSharedInvalid: cache state reset

Actually I get many of them ;(

I'm running a script that does a bunch of mixed INSERTS, UPDATES,
DELETES and SELECTS.

After getting that, I'm unable to vacuum the database until I reset the OS.

Where/how should I start looking (or is it a known problem)?

Are there any simple workarounds to stop it happening?

-----------
Hannu


Re: Notice and share memory corruption

From: Tom Lane
Hannu Krosing <hannu@tm.ee> writes:
> I get the following on untuned Linux (Redhat 6.2) using stock 7.0.2
> rpm-s

> NOTICE:  RegisterSharedInvalid: SI buffer overflow
> NOTICE:  InvalidateSharedInvalid: cache state reset

> Actually I get many of them ;(

AFAIK, these are just noise in 7.0.  The only reason you see them is
we haven't got round to removing the messages or downgrading them to
elog(DEBUG).

> I'm running a script that does a bunch of mixed INSERTS, UPDATES,
> DELETES and SELECTS.

I'll bet you also have some backends sitting idle with open
transactions?  The combination of idle and active backends is what
usually provokes SI overruns.

> after getting that I'm unable to vacuum database until I reset the OS

Define your terms more carefully, please.  What do you mean by
"unable to vacuum" --- what happens *exactly*?  In any case,
surely it doesn't take an OS reboot to recover.  I might believe
you need to restart the postmaster...
        regards, tom lane
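
The interaction between idle and active backends described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the PostgreSQL 7.0 code: `SiQueue`, `si_insert`, `si_drain`, and the slot count are hypothetical names and values chosen for illustration. It shows a fixed-size message queue in which one reader that never drains eventually forces an overflow, after which every reader is told to reset its cache state.

```c
#include <assert.h>
#include <stdbool.h>

#define SI_SLOTS    4   /* hypothetical queue size, tiny on purpose */
#define MAX_READERS 2   /* hypothetical number of backends */

/* Toy shared invalidation queue; all names here are illustrative. */
typedef struct {
    int  msgs[SI_SLOTS];          /* ring buffer of message ids */
    long head;                    /* total messages ever written */
    long next_read[MAX_READERS];  /* messages consumed so far, per reader */
    bool reset[MAX_READERS];      /* reader must discard its cached state */
} SiQueue;

static void si_init(SiQueue *q)
{
    q->head = 0;
    for (int r = 0; r < MAX_READERS; r++) {
        q->next_read[r] = 0;
        q->reset[r] = false;
    }
}

/* Position of the slowest reader. */
static long si_min_read(const SiQueue *q)
{
    long min = q->next_read[0];
    for (int r = 1; r < MAX_READERS; r++)
        if (q->next_read[r] < min)
            min = q->next_read[r];
    return min;
}

/*
 * Append a message. Returns false on overflow, in which case the backlog
 * is dropped and every reader is flagged to rebuild its cache.
 */
static bool si_insert(SiQueue *q, int msg)
{
    if (q->head - si_min_read(q) >= SI_SLOTS) {
        for (int r = 0; r < MAX_READERS; r++) {
            q->reset[r] = true;
            q->next_read[r] = q->head;
        }
        return false;             /* "buffer overflow" */
    }
    q->msgs[q->head % SI_SLOTS] = msg;
    q->head++;
    return true;
}

/*
 * Consume pending messages for one reader. Returns the number consumed,
 * or -1 if the reader was flagged and had to reset its cache instead.
 */
static int si_drain(SiQueue *q, int reader)
{
    if (q->reset[reader]) {
        q->reset[reader] = false;
        q->next_read[reader] = q->head;
        return -1;                /* "cache state reset" */
    }
    int n = (int) (q->head - q->next_read[reader]);
    q->next_read[reader] = q->head;
    return n;
}
```

In this model, a reader that calls `si_drain` after every insert never causes trouble on its own. The overflow happens only when another reader sits idle while the writer keeps inserting, which matches the pattern of idle plus active backends described in the message.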


Re: Notice and share memory corruption

From: Hannu Krosing
Tom Lane wrote:
> 
> Hannu Krosing <hannu@tm.ee> writes:
> > I get the following on untuned Linux (Redhat 6.2) using stock 7.0.2
> > rpm-s
> 
> > NOTICE:  RegisterSharedInvalid: SI buffer overflow
> > NOTICE:  InvalidateSharedInvalid: cache state reset
> 
> > Actually I get many of them ;(
> 
> AFAIK, these are just noise in 7.0.  The only reason you see them is
> we haven't got round to removing the messages or downgrading them to
> elog(DEBUG).
> 
> > I'm running a script that does a bunch of mixed INSERTS, UPDATES,
> > DELETES and SELECTS.
> 
> I'll bet you also have some backends sitting idle with open
> transactions?  The combination of idle and active backends is what
> usually provokes SI overruns.
> 
> > after getting that I'm unable to vacuum database until I reset the OS
> 
> Define your terms more carefully, please.  What do you mean by
> "unable to vacuum" --- what happens *exactly*? 

NOTICE:  FlushRelationBuffers(access_right, 2009): block 1944 is referenced (private 0, global 2)
FATAL 1:  VACUUM (vc_repair_frag): FlushRelationBuffers returned -2
pqReadData() -- backend closed the channel unexpectedly.
        This probably means the backend terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

> In any case,
> surely it doesn't take an OS reboot to recover.  I might believe
> you need to restart the postmaster...

On one machine a simple restart worked.

Maybe I have to really restart it by running killall -9 /usr/bin/postgres
(instead of doing /etc/rc.d/init.d/postgresql restart).

I was quite sure that just restarting it did not help, but maybe
it did not really restart and only claimed to.

On the other machine I still get

amphora2=# vacuum;
NOTICE:  FlushRelationBuffers(item, 30): block 2 is referenced (private 0, global 1)
FATAL 1:  VACUUM (vc_repair_frag): FlushRelationBuffers returned -2
pqReadData() -- backend closed the channel unexpectedly.
        This probably means the backend terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

even after stopping the postmaster (and checking that it is stopped).

I could do a vacuum after restarting the whole machine...

OTOH it _may_ be that someone started another backend right after the
restart and did something. But must this be a FATAL error?

-----------
Hannu


Re: Notice and share memory corruption

From: Tom Lane
Hannu Krosing <hannu@tm.ee> writes:
>> Define your terms more carefully, please.  What do you mean by
>> "unable to vacuum" --- what happens *exactly*? 

> NOTICE:  FlushRelationBuffers(access_right, 2009): block 1944 is
> referenced (private 0, global 2)
> FATAL 1:  VACUUM (vc_repair_frag): FlushRelationBuffers returned -2

Oh, that's interesting.  This error indicates that some prior
transaction neglected to release a reference count on a shared buffer.
We have seen sporadic reports of this problem in 7.0, but so far no
one has come up with a reproducible example.  If you can boil down
your script to something that reproducibly causes the problem then
that'd be a great help in tracking it down.

If you have clients that sometimes disconnect in the middle of a
transaction, it might help to apply the attached patch.

> Maybe i have to really restart it (instead of doing
> /etc/rc.d/init.d/postgresql restart)
> by running killall -9  /usr/bin/postgres

Restarting the postmaster should clear the problem (by releasing and
reinitializing shared memory).  I dunno where you got the idea that
kill -9 was a recommended way of shutting down the system, but I sure
wouldn't recommend it.  A plain kill on the postmaster ought to do it
(see the pg_ctl script in release 7.0.*).
        regards, tom lane

*** src/backend/tcop/postgres.c.orig    Sat May 20 22:23:30 2000
--- src/backend/tcop/postgres.c    Wed Aug 30 16:47:51 2000
***************
*** 1459,1465 ****
       * Initialize the deferred trigger manager
       */
      if (DeferredTriggerInit() != 0)
!         proc_exit(0);
  
      SetProcessingMode(NormalProcessing);
  
--- 1459,1465 ----
       * Initialize the deferred trigger manager
       */
      if (DeferredTriggerInit() != 0)
!         goto normalexit;
  
      SetProcessingMode(NormalProcessing);
  
***************
*** 1479,1490 ****
              TPRINTF(TRACE_VERBOSE, "AbortCurrentTransaction");
  
          AbortCurrentTransaction();
!         InError = false;
          if (ExitAfterAbort)
!         {
!             ProcReleaseLocks(); /* Just to be sure... */
!             proc_exit(0);
!         }
      }
  
      Warn_restart_ready = true;    /* we can now handle elog(ERROR) */
--- 1479,1489 ----
              TPRINTF(TRACE_VERBOSE, "AbortCurrentTransaction");
  
          AbortCurrentTransaction();
! 
          if (ExitAfterAbort)
!             goto errorexit;
! 
!         InError = false;
      }
  
      Warn_restart_ready = true;    /* we can now handle elog(ERROR) */
***************
*** 1553,1560 ****
                  if (HandleFunctionRequest() == EOF)
                  {
                      /* lost frontend connection during F message input */
!                     pq_close();
!                     proc_exit(0);
                  }
                  break;
  
--- 1552,1558 ----
                  if (HandleFunctionRequest() == EOF)
                  {
                      /* lost frontend connection during F message input */
!                     goto normalexit;
                  }
                  break;
  
***************
*** 1608,1618 ****
                   */
              case 'X':
              case EOF:
!                 if (!IsUnderPostmaster)
!                     ShutdownXLOG();
!                 pq_close();
!                 proc_exit(0);
!                 break;
  
              default:
                  elog(ERROR, "unknown frontend message was received");
--- 1606,1612 ----
                   */
              case 'X':
              case EOF:
!                 goto normalexit;
  
              default:
                  elog(ERROR, "unknown frontend message was received");
***************
*** 1642,1651 ****
              if (IsUnderPostmaster)
                  NullCommand(Remote);
          }
!     }                            /* infinite for-loop */
  
!     proc_exit(0);                /* shouldn't get here... */
!     return 1;
  }
  
  #ifndef HAVE_GETRUSAGE
--- 1636,1655 ----
              if (IsUnderPostmaster)
                  NullCommand(Remote);
          }
!     }                            /* end of main loop */
! 
! normalexit:
!     ExitAfterAbort = true;        /* ensure we will exit if elog during abort */
!     AbortOutOfAnyTransaction();
!     if (!IsUnderPostmaster)
!         ShutdownXLOG();
! 
! errorexit:
!     pq_close();
!     ProcReleaseLocks();            /* Just to be sure... */
!     proc_exit(0);
  
!     return 1;                    /* keep compiler quiet */
  }
  
  #ifndef HAVE_GETRUSAGE
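
The idea behind the patch, routing every exit point through one cleanup label so that an open transaction is aborted before the process exits, can be sketched in isolation. The following is a toy model under stated assumptions, not PostgreSQL code: `Session`, `pin_buffer`, `session_end`, `run_session_leaky`, `run_session_clean`, and `flush_buffer_check` are hypothetical names. The return value -2 simply mirrors the value in the error message reported in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Models the "global" reference count on a single shared buffer. */
static int shared_pin_count = 0;

static void pin_buffer(void)   { shared_pin_count++; }
static void unpin_buffer(void) { if (shared_pin_count > 0) shared_pin_count--; }

/* Hypothetical per-connection state. */
typedef struct {
    int  pins_held;
    bool in_transaction;
} Session;

static void session_begin(Session *s)
{
    s->pins_held = 0;
    s->in_transaction = true;
}

static void session_read(Session *s)
{
    pin_buffer();
    s->pins_held++;
}

/* Commit or abort: either way, release everything the session holds. */
static void session_end(Session *s)
{
    while (s->pins_held > 0) {
        unpin_buffer();
        s->pins_held--;
    }
    s->in_transaction = false;
}

/* Old style: an early return on disconnect skips the cleanup. */
static int run_session_leaky(bool client_disconnects)
{
    Session s;

    session_begin(&s);
    session_read(&s);
    if (client_disconnects)
        return 0;               /* exits mid-transaction: the pin leaks */
    session_end(&s);
    return 0;
}

/* Patched style: every exit path funnels through one label. */
static int run_session_clean(bool client_disconnects)
{
    Session s;

    session_begin(&s);
    session_read(&s);
    if (client_disconnects)
        goto normalexit;
    session_end(&s);            /* normal end of transaction */

normalexit:
    if (s.in_transaction)
        session_end(&s);        /* abort whatever is still open */
    return 0;
}

/* Refuses to proceed while the buffer is still referenced. */
static int flush_buffer_check(void)
{
    return shared_pin_count > 0 ? -2 : 0;
}
```

Calling `run_session_leaky(true)` leaves `shared_pin_count` at 1, so a later `flush_buffer_check()` fails, whereas `run_session_clean(true)` releases the pin on the way out. In the toy, the stale count lives in a process-wide variable. In the real system it lives in shared memory, which is consistent with the advice above that restarting the postmaster clears the condition.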