Robert Haas <robertmhaas@gmail.com> writes:
> The one thing that still seems a little odd to me is that this caused
> a pin count to get orphaned. It seems reasonable that ignoring the
> AccessExclusiveLock could result in not-found errors trying to open a
> missing relation, and even fsync requests on a missing relation. But
> I don't see why that would cause the backend-local pin counts to get
> messed up, which makes me wonder if there really is another bug here
> somewhere.
According to Heikki's log, the Assert was in the startup process itself,
and it happened after an error:
> 2012-05-26 10:44:28.587 CEST 10270 FATAL: could not open file "base/21268/32994": No such file or directory
> 2012-05-26 10:44:28.588 CEST 10270 CONTEXT: writing block 2508 of relation base/21268/32994
> xlog redo multi-insert (init): rel 1663/21268/33006; blk 3117; 58 tuples
> TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741)
> 2012-05-26 10:44:31.131 CEST 10269 LOG: startup process (PID 10270) was terminated by signal 6: Aborted
I don't think that code is meant to recover from errors anyway, so
the fact that it fails with a pin count held isn't exactly surprising.
But it might be worth looking at exactly which on_proc_exit callbacks
are installed in the startup process and what assumptions they make.
As for where the error came from in the first place, it's easy to
imagine somebody who's not got the word about the AccessExclusiveLock
reading pages of the table into buffers that have already been scanned
by the DROP. So you'd end up with orphaned buffers belonging to a
vanished table. If somebody managed to dirty them by setting hint bits
(we do allow that in HS mode, no?) then later you'd have various processes
trying to write the buffer before recycling it, which seems to fit the
reported error.
regards, tom lane