Re: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741 - Mailing list pgsql-hackers

From Tom Lane
Subject Re: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741
Date
Msg-id 8799.1338472285@sss.pgh.pa.us
In response to Re: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741
List pgsql-hackers
Robert Haas <robertmhaas@gmail.com> writes:
> The one thing that still seems a little odd to me is that this caused
> a pin count to get orphaned.  It seems reasonable that ignoring the
> AccessExclusiveLock could result in not-found errors trying to open a
> missing relation, and even fsync requests on a missing relation.  But
> I don't see why that would cause the backend-local pin counts to get
> messed up, which makes me wonder if there really is another bug here
> somewhere.

According to Heikki's log, the Assert was in the startup process itself,
and it happened after an error:

> 2012-05-26 10:44:28.587 CEST 10270 FATAL:  could not open file "base/21268/32994": No such file or directory
> 2012-05-26 10:44:28.588 CEST 10270 CONTEXT:  writing block 2508 of relation base/21268/32994
>          xlog redo multi-insert (init): rel 1663/21268/33006; blk 3117; 58 tuples
> TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741)
> 2012-05-26 10:44:31.131 CEST 10269 LOG:  startup process (PID 10270) was terminated by signal 6: Aborted

I don't think that code is meant to recover from errors anyway, so
the fact that it fails with a pin count held isn't exactly surprising.
But it might be worth looking at exactly which on_proc_exit callbacks
are installed in the startup process and what assumptions they make.
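
To illustrate the point about failing with a pin held, here is a toy model in C. It is only a sketch and not PostgreSQL source: NBUFFERS, private_ref_count, flush_buffer(), write_block() and leaked_pins() are hypothetical names invented for this example. An error path that bails out before unpinning leaves a nonzero backend-local count, which an exit-time check then finds.

```c
#include <assert.h>
#include <stdbool.h>

#define NBUFFERS 4              /* toy pool size, not a PostgreSQL setting */

/* Toy stand-in for the backend-local pin counts, one slot per buffer. */
static int private_ref_count[NBUFFERS];

void pin_buffer(int buf)
{
    private_ref_count[buf]++;
}

void unpin_buffer(int buf)
{
    assert(private_ref_count[buf] > 0);
    private_ref_count[buf]--;
}

/* Hypothetical write step: fails when the underlying file is gone. */
bool write_block(int buf, bool file_exists)
{
    (void) buf;
    return file_exists;
}

/*
 * Flush one buffer.  On failure we return early WITHOUT unpinning, which
 * mimics an error that goes straight to process exit with no cleanup.
 */
bool flush_buffer(int buf, bool file_exists)
{
    pin_buffer(buf);
    if (!write_block(buf, file_exists))
        return false;           /* the pin is leaked here */
    unpin_buffer(buf);
    return true;
}

/* Exit-time check: count the buffers that are still pinned. */
int leaked_pins(void)
{
    int n = 0;

    for (int i = 0; i < NBUFFERS; i++)
        if (private_ref_count[i] != 0)
            n++;
    return n;
}
```

In this model a successful flush_buffer() leaves leaked_pins() at zero, while a failed one leaves exactly one buffer pinned. That is the condition an assertion of the form PrivateRefCount[i] == 0 would trip over at exit.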

As for where the error came from in the first place, it's easy to
imagine somebody who's not got the word about the AccessExclusiveLock
reading pages of the table into buffers that have already been scanned
by the DROP.  So you'd end up with orphaned buffers belonging to a
vanished table.  If somebody managed to dirty them by setting hint bits
(we do allow that in HS mode, no?) then later you'd have various processes
trying to write the buffer before recycling it, which seems to fit the
reported error.
        regards, tom lane
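
The orphaned-buffer scenario described in the message can be sketched as a small single-threaded simulation in C. This is an illustrative model only, not PostgreSQL code: ToyBuffer, pool, toy_scenario() and the other toy_* names are hypothetical, and the interleaving is forced by hand rather than produced by real concurrency. The argument slots_swept_before_read says how far the DROP's sweep has progressed when a reader that ignored the lock loads a page into slot 0.

```c
#include <stdbool.h>

#define POOL_SIZE 4
#define NO_REL    0             /* marks an empty (invalid) buffer slot */
#define MAX_RELS  8

typedef struct
{
    int  rel;                   /* toy relation id, NO_REL if slot is empty */
    int  blk;                   /* block number within the relation */
    bool dirty;                 /* must be written before the slot is reused */
} ToyBuffer;

static ToyBuffer pool[POOL_SIZE];
static bool rel_file_exists[MAX_RELS];

void toy_reset(void)
{
    for (int i = 0; i < POOL_SIZE; i++)
        pool[i] = (ToyBuffer) {NO_REL, 0, false};
    for (int r = 0; r < MAX_RELS; r++)
        rel_file_exists[r] = false;
}

/* A reader loads a page into a slot and dirties it by setting a hint bit. */
void toy_read_and_hint(int slot, int rel, int blk)
{
    pool[slot] = (ToyBuffer) {rel, blk, true};
}

/* One step of the DROP's sweep: discard the slot if it belongs to rel. */
void toy_drop_scan_slot(int slot, int rel)
{
    if (pool[slot].rel == rel)
        pool[slot] = (ToyBuffer) {NO_REL, 0, false};
}

/* Recycle a slot; a dirty page must be written first.  False = write failed. */
bool toy_evict(int slot)
{
    if (pool[slot].rel == NO_REL)
        return true;            /* nothing to write */
    if (pool[slot].dirty && !rel_file_exists[pool[slot].rel])
        return false;           /* "could not open file" in the real log */
    pool[slot] = (ToyBuffer) {NO_REL, 0, false};
    return true;
}

/* Run the DROP with the read injected after the given number of swept slots. */
bool toy_scenario(int slots_swept_before_read)
{
    const int rel = 1;

    toy_reset();
    rel_file_exists[rel] = true;
    for (int slot = 0; slot <= POOL_SIZE; slot++)
    {
        if (slot == slots_swept_before_read)
            toy_read_and_hint(0, rel, 2508);
        if (slot < POOL_SIZE)
            toy_drop_scan_slot(slot, rel);
    }
    rel_file_exists[rel] = false;       /* DROP removes the file */
    return toy_evict(0);
}
```

With slots_swept_before_read set to 0 the page is loaded before the sweep starts, the sweep discards it, and the later eviction has nothing to write. With a value of 2 the page lands in a slot the sweep has already passed, survives the DROP as a dirty orphan, and the eviction fails because the file is gone.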

