Thread: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741

FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741

From
"Erik Rijkers"
Date:
pg 9.2 git master
AMD 8120 (8-core) / 6 GB memory / Centos 6.2


I have experimented a bit with dropping a table from master, then querying that table from a
sync-rep slave.  It is a little worrying that this, the first test I tried, crashes the slave.

There are two instance on one machine, head1 (=master) and head2 (=sync-rep slave).

First, I generated a tab-separated file, a one off, to be used in the test:

echo "
    copy (
        select
              repeat('X',20)        as c1
            , repeat('X',20)        as c2
            , repeat('X',20)        as c3
            , repeat('X',20)        as c4
            , repeat('X',20)        as c5
        from generate_series(1, 200000)
    )
    to stdout
    csv delimiter E'\t';
" |  $HOME/pg_stuff/pg_installations/pgsql.head1/bin/psql -p 6564 -d testdb > dropload_copy.txt


That txt file is zipped, and the actual test consists of a bash while loop which
  1. drops the table
  2. loads the file into the table
  3. Either:
          a.  nothing
          b.  does a select count(*) on the table


So, it repeats the following:

zcat dropload_copy.txt.gz \
 | grep -v '^#' \
 | $HOME/pg_stuff/pg_installations/pgsql.head1/bin/psql -p 6564 -d testdb -c "
drop table if exists t;
create table t (
        c1 text,
        c2 text,
        c3 text,
        c4 text,
        c5 text
);
copy t from stdin csv delimiter E'\t';
analyze t;";

PAUSE_DURATION=0
PSQL=$HOME/pg_stuff/pg_installations/pgsql.head1/bin/psql

if [[ 0 -eq 1 ]];  #  ON-OFF switch
then

    echo "sleep $PAUSE_DURATION"
          sleep $PAUSE_DURATION;
    (
        echo "select current_setting('port') port, count(*) from $schema.$table" | $PSQL -qtXp 6564 -d
testdb  # master
        echo "select current_setting('port') port, count(*) from $schema.$table" | $PSQL -qtXp 6565 -d
testdb  # wal_receiver_01
    #echo "select current_setting('port') port, count(*) from $schema.$table" | $PSQL -qtXp 6566 -d
testdb  # wal_receiver_02
    ) | grep -v '^$'
fi



This runs fine for hours on end, as long as the ON-OFF switch is disabled.

But when that if-block is added the client crashes after a while (sometimes almost immediately; it
never survives longer then 20 minutes):

2012-05-26 10:44:22.617 CEST 10274 ERROR:  could not fsync file "base/21268/32807": No such file
or directory
2012-05-26 10:44:28.465 CEST 10274 ERROR:  could not fsync file "base/21268/32867": No such file
or directory
2012-05-26 10:44:28.587 CEST 10270 FATAL:  could not open file "base/21268/32994": No such file or
directory
2012-05-26 10:44:28.588 CEST 10270 CONTEXT:  writing block 2508 of relation base/21268/32994
        xlog redo multi-insert (init): rel 1663/21268/33006; blk 3117; 58 tuples
TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741)
2012-05-26 10:44:31.131 CEST 10269 LOG:  startup process (PID 10270) was terminated by signal 6:
Aborted
2012-05-26 10:44:31.131 CEST 10269 LOG:  terminating any other active server processes


Crazy scenario , I'll admit, but surely this shouldn't be able to crash the client?

I attach the logfiles of master(=head1) and slave (=head2).  It show how the above ran for an hour
without problems (while the ON/OFF switch was disabled), and how the crash came quickly when I
switched it on (to add the select count(*) statements).


Erik Rijkers

Attachment

Re: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741

From
Heikki Linnakangas
Date:
On 26.05.2012 12:21, Erik Rijkers wrote:
> But when that if-block is added the client crashes after a while (sometimes almost immediately; it
> never survives longer then 20 minutes):
>
> 2012-05-26 10:44:22.617 CEST 10274 ERROR:  could not fsync file "base/21268/32807": No such file
> or directory
> 2012-05-26 10:44:28.465 CEST 10274 ERROR:  could not fsync file "base/21268/32867": No such file
> or directory
> 2012-05-26 10:44:28.587 CEST 10270 FATAL:  could not open file "base/21268/32994": No such file or
> directory
> 2012-05-26 10:44:28.588 CEST 10270 CONTEXT:  writing block 2508 of relation base/21268/32994
>          xlog redo multi-insert (init): rel 1663/21268/33006; blk 3117; 58 tuples
> TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741)
> 2012-05-26 10:44:31.131 CEST 10269 LOG:  startup process (PID 10270) was terminated by signal 6:
> Aborted
> 2012-05-26 10:44:31.131 CEST 10269 LOG:  terminating any other active server processes
>
>
> Crazy scenario , I'll admit, but surely this shouldn't be able to crash the client?

Thanks for the report. I was able to reproduce this with that script, 
and I think I see what's going on now.

There's something wrong with the way AccessExclusiveLocks work on a 
standby. I did "begin; truncate foo; -- leave the xact open" in the 
master, and waited until the xlog records are shipped to the standby. 
Then I did this in the standby:

testdb=# begin;
BEGIN
testdb=# select * from foo; id
----
(0 rows)

testdb=# select locktype, database, relation, virtualtransaction, pid, 
mode, granted, fastpath from pg_locks where locktype='relation' and 
relation='foo'::regclass; locktype | database | relation | virtualtransaction |  pid  | 
mode         | granted | fastpath
----------+----------+----------+--------------------+-------+---------------------+---------+---------- relation |
16384|    27332 | 2/78               | 24984 | 
 
AccessShareLock     | t       | t relation |    16384 |    27332 | 1/0                | 24344 | 
AccessExclusiveLock | t       | f
(2 rows)

The "select * from foo" query should have blocked, because the 
transaction in the master is holding an AccessExclusiveLock on the table.

The process holding the AccessExclusiveLock is the startup process. It's 
holding the lock on behalf of the transaction in the master. But 
something's wrong, and the AccessExclusiveLock doesn't stop a regular 
backend from acquiring the AccessShareLock on the table. I suspect the 
fast-path locking patch, because this works on 9.1.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


On Wed, May 30, 2012 at 1:07 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> There's something wrong with the way AccessExclusiveLocks work on a standby.
> I did "begin; truncate foo; -- leave the xact open" in the master, and
> waited until the xlog records are shipped to the standby. Then I did this in
> the standby:
>
> testdb=# begin;
> BEGIN
> testdb=# select * from foo;
>  id
> ----
> (0 rows)
>
> testdb=# select locktype, database, relation, virtualtransaction, pid, mode,
> granted, fastpath from pg_locks where locktype='relation' and
> relation='foo'::regclass;
>  locktype | database | relation | virtualtransaction |  pid  | mode
> | granted | fastpath
> ----------+----------+----------+--------------------+-------+---------------------+---------+----------
>  relation |    16384 |    27332 | 2/78               | 24984 |
> AccessShareLock     | t       | t
>  relation |    16384 |    27332 | 1/0                | 24344 |
> AccessExclusiveLock | t       | f
> (2 rows)
>
> The "select * from foo" query should have blocked, because the transaction
> in the master is holding an AccessExclusiveLock on the table.
>
> The process holding the AccessExclusiveLock is the startup process. It's
> holding the lock on behalf of the transaction in the master. But something's
> wrong, and the AccessExclusiveLock doesn't stop a regular backend from
> acquiring the AccessShareLock on the table. I suspect the fast-path locking
> patch, because this works on 9.1.

Yeah, apparently so.  gdb says that FastPathStrongRelationLocks on the
standby is all-zeros even after that record has been replayed.  Not
sure how that's possible yet.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Wed, May 30, 2012 at 1:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> The process holding the AccessExclusiveLock is the startup process. It's
>> holding the lock on behalf of the transaction in the master. But something's
>> wrong, and the AccessExclusiveLock doesn't stop a regular backend from
>> acquiring the AccessShareLock on the table. I suspect the fast-path locking
>> patch, because this works on 9.1.
>
> Yeah, apparently so.  gdb says that FastPathStrongRelationLocks on the
> standby is all-zeros even after that record has been replayed.  Not
> sure how that's possible yet.

Ah.  The problem is that FastPathTag() expects that locks on database
objects will only be taken by backends with a non-zero value for
MyDatabaseId.  Apparently the can-i-use-the-fastpath test and the
do-i-need-to-force-other-people-out-of-the-fastpath test need to be a
bit more asymmetrical than they are at present.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Wed, May 30, 2012 at 2:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, May 30, 2012 at 1:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> The process holding the AccessExclusiveLock is the startup process. It's
>>> holding the lock on behalf of the transaction in the master. But something's
>>> wrong, and the AccessExclusiveLock doesn't stop a regular backend from
>>> acquiring the AccessShareLock on the table. I suspect the fast-path locking
>>> patch, because this works on 9.1.
>>
>> Yeah, apparently so.  gdb says that FastPathStrongRelationLocks on the
>> standby is all-zeros even after that record has been replayed.  Not
>> sure how that's possible yet.
>
> Ah.  The problem is that FastPathTag() expects that locks on database
> objects will only be taken by backends with a non-zero value for
> MyDatabaseId.  Apparently the can-i-use-the-fastpath test and the
> do-i-need-to-force-other-people-out-of-the-fastpath test need to be a
> bit more asymmetrical than they are at present.

I've fixed things so that Heikki's test case now behaves as expected.
Hopefully this fixes Erik's problem as well, but I haven't tested.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Wed, May 30, 2012 22:25, Robert Haas wrote:
> On Wed, May 30, 2012 at 2:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, May 30, 2012 at 1:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>> The process holding the AccessExclusiveLock is the startup process. It's
>>>> holding the lock on behalf of the transaction in the master. But something's
>>>> wrong, and the AccessExclusiveLock doesn't stop a regular backend from
>>>> acquiring the AccessShareLock on the table. I suspect the fast-path locking
>>>> patch, because this works on 9.1.
>>>
>>> Yeah, apparently so.  gdb says that FastPathStrongRelationLocks on the
>>> standby is all-zeros even after that record has been replayed.  Not
>>> sure how that's possible yet.
>>
>> Ah.  The problem is that FastPathTag() expects that locks on database
>> objects will only be taken by backends with a non-zero value for
>> MyDatabaseId.  Apparently the can-i-use-the-fastpath test and the
>> do-i-need-to-force-other-people-out-of-the-fastpath test need to be a
>> bit more asymmetrical than they are at present.
>
> I've fixed things so that Heikki's test case now behaves as expected.
> Hopefully this fixes Erik's problem as well, but I haven't tested.
>

(I double-checked that I got your latest commit in)

I'm afraid it's not yet resolved; the sync-slave still crashes almost immediately:

master logfile says:
2012-05-30 23:30:07.846 CEST 3918 LOG:  standby wal_receiver_01 is now the synchronous standby
with priority 1

sync-slave logfile:

[...]
2012-05-30 23:30:07.833 CEST 3908 LOG:  database system is ready to accept read only connections
cp: cannot stat `/home/aardvark/pg_stuff/archive_dir/000000010000000000000004': No such file or
directory
2012-05-30 23:30:07.845 CEST 3917 LOG:  streaming replication successfully connected to primary
2012-05-30 23:40:52.635 CEST 5287 ERROR:  could not open relation with OID 26563
2012-05-30 23:40:52.635 CEST 5287 STATEMENT:  select current_setting('port') port, count(*) from
public.t
2012-05-30 23:40:57.909 CEST 3909 FATAL:  could not open file "base/21268/26569": No such file or
directory
2012-05-30 23:40:57.909 CEST 3909 CONTEXT:  writing block 5152 of relation base/21268/26569       xlog redo
multi-insert(init): rel 1663/21268/26581; blk 3852; 35 tuples 
TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741)
2012-05-30 23:40:58.006 CEST 5331 FATAL:  could not open file "base/21268/26569": No such file or
directory
2012-05-30 23:40:58.006 CEST 5331 CONTEXT:  writing block 5153 of relation base/21268/26569
2012-05-30 23:40:59.661 CEST 3908 LOG:  startup process (PID 3909) was terminated by signal 6:
Aborted
2012-05-30 23:40:59.661 CEST 3908 LOG:  terminating any other active server processes




On Wed, May 30, 2012 at 6:00 PM, Erik Rijkers <er@xs4all.nl> wrote:
> (I double-checked that I got your latest commit in)
>
> I'm afraid it's not yet resolved; the sync-slave still crashes almost immediately:
>
> master logfile says:
> 2012-05-30 23:30:07.846 CEST 3918 LOG:  standby wal_receiver_01 is now the synchronous standby
> with priority 1
>
> sync-slave logfile:
>
> [...]
> 2012-05-30 23:30:07.833 CEST 3908 LOG:  database system is ready to accept read only connections
> cp: cannot stat `/home/aardvark/pg_stuff/archive_dir/000000010000000000000004': No such file or
> directory
> 2012-05-30 23:30:07.845 CEST 3917 LOG:  streaming replication successfully connected to primary
> 2012-05-30 23:40:52.635 CEST 5287 ERROR:  could not open relation with OID 26563
> 2012-05-30 23:40:52.635 CEST 5287 STATEMENT:  select current_setting('port') port, count(*) from
> public.t
> 2012-05-30 23:40:57.909 CEST 3909 FATAL:  could not open file "base/21268/26569": No such file or
> directory
> 2012-05-30 23:40:57.909 CEST 3909 CONTEXT:  writing block 5152 of relation base/21268/26569
>        xlog redo multi-insert (init): rel 1663/21268/26581; blk 3852; 35 tuples
> TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741)
> 2012-05-30 23:40:58.006 CEST 5331 FATAL:  could not open file "base/21268/26569": No such file or
> directory
> 2012-05-30 23:40:58.006 CEST 5331 CONTEXT:  writing block 5153 of relation base/21268/26569
> 2012-05-30 23:40:59.661 CEST 3908 LOG:  startup process (PID 3909) was terminated by signal 6:
> Aborted
> 2012-05-30 23:40:59.661 CEST 3908 LOG:  terminating any other active server processes

Hmm.  I set up what I believe to be the same test case you were using,
and it's not crashing for me.  In fact, with the latest code, I
haven't been able to produce any error at all.  When I reverted my
last commit, I managed to get this:

ERROR:  could not open relation with OID 18229
STATEMENT:  select current_setting('port') port, count(*) from t

...but just once, and with no other error messages.    So I'm either
missing a step somewhere, or something about your system just happens
to tickle this moreso than on mine.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Thu, May 31, 2012 03:30, Robert Haas wrote:
> On Wed, May 30, 2012 at 6:00 PM, Erik Rijkers <er@xs4all.nl> wrote:
>> directory
>> 2012-05-30 23:40:57.909 CEST 3909 CONTEXT:  writing block 5152 of relation base/21268/26569
>>        xlog redo multi-insert (init): rel 1663/21268/26581; blk 3852; 35 tuples
>> TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741)
>> 2012-05-30 23:40:58.006 CEST 5331 FATAL:  could not open file "base/21268/26569": No such file
>> or
>> directory
>> 2012-05-30 23:40:58.006 CEST 5331 CONTEXT:  writing block 5153 of relation base/21268/26569
>> 2012-05-30 23:40:59.661 CEST 3908 LOG:  startup process (PID 3909) was terminated by signal 6:
>> Aborted
>> 2012-05-30 23:40:59.661 CEST 3908 LOG:  terminating any other active server processes
>
> Hmm.  I set up what I believe to be the same test case you were using,
> and it's not crashing for me.  In fact, with the latest code, I
> haven't been able to produce any error at all.  When I reverted my
> last commit, I managed to get this:
>
> ERROR:  could not open relation with OID 18229
> STATEMENT:  select current_setting('port') port, count(*) from t
>
> ...but just once, and with no other error messages.    So I'm either
> missing a step somewhere, or something about your system just happens
> to tickle this moreso than on mine.
>

In my test, I run the bash code (the bits that I posted earlier) in a while loop so that the table
is CREATEd, COPYied into, and DROPped every few seconds -- perhaps that wasn't clear.  That loop
is necessary; a few iterations are often successful.

I can test it today on a few other systems to see if it is reproducible.

thanks,

Erik Rijkers



Re: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741

From
Heikki Linnakangas
Date:
On 31.05.2012 08:06, Erik Rijkers wrote:
> On Thu, May 31, 2012 03:30, Robert Haas wrote:
>> On Wed, May 30, 2012 at 6:00 PM, Erik Rijkers<er@xs4all.nl>  wrote:
>>> directory
>>> 2012-05-30 23:40:57.909 CEST 3909 CONTEXT:  writing block 5152 of relation base/21268/26569
>>>         xlog redo multi-insert (init): rel 1663/21268/26581; blk 3852; 35 tuples
>>> TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741)
>>> 2012-05-30 23:40:58.006 CEST 5331 FATAL:  could not open file "base/21268/26569": No such file
>>> or
>>> directory
>>> 2012-05-30 23:40:58.006 CEST 5331 CONTEXT:  writing block 5153 of relation base/21268/26569
>>> 2012-05-30 23:40:59.661 CEST 3908 LOG:  startup process (PID 3909) was terminated by signal 6:
>>> Aborted
>>> 2012-05-30 23:40:59.661 CEST 3908 LOG:  terminating any other active server processes
>>
>> Hmm.  I set up what I believe to be the same test case you were using,
>> and it's not crashing for me.  In fact, with the latest code, I
>> haven't been able to produce any error at all.  When I reverted my
>> last commit, I managed to get this:
>>
>> ERROR:  could not open relation with OID 18229
>> STATEMENT:  select current_setting('port') port, count(*) from t
>>
>> ...but just once, and with no other error messages.    So I'm either
>> missing a step somewhere, or something about your system just happens
>> to tickle this moreso than on mine.
>
> In my test, I run the bash code (the bits that I posted earlier) in a while loop so that the table
> is CREATEd, COPYied into, and DROPped every few seconds -- perhaps that wasn't clear.  That loop
> is necessary; a few iterations are often successful.
>
> I can test it today on a few other systems to see if it is reproducible.

I could no longer reproduce it after Robert's fix, the test case has 
been running for about an hour now. Please triple-check that you have it 
applied in the standby :-).

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


On Thu, May 31, 2012 at 1:06 AM, Erik Rijkers <er@xs4all.nl> wrote:
> In my test, I run the bash code (the bits that I posted earlier) in a while loop so that the table
> is CREATEd, COPYied into, and DROPped every few seconds -- perhaps that wasn't clear.  That loop
> is necessary; a few iterations are often successful.

Yes... I let it run all night on my laptop with no errors.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On 31 May 2012 12:14, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, May 31, 2012 at 1:06 AM, Erik Rijkers <er@xs4all.nl> wrote:
>> In my test, I run the bash code (the bits that I posted earlier) in a while loop so that the table
>> is CREATEd, COPYied into, and DROPped every few seconds -- perhaps that wasn't clear.  That loop
>> is necessary; a few iterations are often successful.
>
> Yes... I let it run all night on my laptop with no errors.

It looked to me like the correct fix to me as well, FWIW.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


On Thu, May 31, 2012 13:14, Robert Haas wrote:
> On Thu, May 31, 2012 at 1:06 AM, Erik Rijkers <er@xs4all.nl> wrote:
>> In my test, I run the bash code (the bits that I posted earlier) in a while loop so that the
>> table
>> is CREATEd, COPYied into, and DROPped every few seconds -- perhaps that wasn't clear.  That loop
>> is necessary; a few iterations are often successful.
>
> Yes... I let it run all night on my laptop with no errors.

My apologies to both of you.  It seems the problem was indeed solved with that commit from Robert. I managed to forget
thatI, uh... temporary, commented out the git-pull from my build script... 

Thanks,

Erik Rijkers




On Thu, May 31, 2012 at 7:49 AM, Erik Rijkers <er@xs4all.nl> wrote:
> On Thu, May 31, 2012 13:14, Robert Haas wrote:
>> On Thu, May 31, 2012 at 1:06 AM, Erik Rijkers <er@xs4all.nl> wrote:
>>> In my test, I run the bash code (the bits that I posted earlier) in a while loop so that the
>>> table
>>> is CREATEd, COPYied into, and DROPped every few seconds -- perhaps that wasn't clear.  That loop
>>> is necessary; a few iterations are often successful.
>>
>> Yes... I let it run all night on my laptop with no errors.
>
> My apologies to both of you.  It seems the problem was indeed solved with that commit from Robert.
>  I managed to forget that I, uh... temporary, commented out the git-pull from my build script...

No problem.

The one thing that still seems a little odd to me is that this caused
a pin count to get orphaned.  It seems reasonable that ignoring the
AccessExclusiveLock could result in not-found errors trying to open a
missing relation, and even fsync requests on a missing relation.  But
I don't see why that would cause the backend-local pin counts to get
messed up, which makes me wonder if there really is another bug here
somewhere.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Robert Haas <robertmhaas@gmail.com> writes:
> The one thing that still seems a little odd to me is that this caused
> a pin count to get orphaned.  It seems reasonable that ignoring the
> AccessExclusiveLock could result in not-found errors trying to open a
> missing relation, and even fsync requests on a missing relation.  But
> I don't see why that would cause the backend-local pin counts to get
> messed up, which makes me wonder if there really is another bug here
> somewhere.

According to Heikki's log, the Assert was in the startup process itself,
and it happened after an error:

> 2012-05-26 10:44:28.587 CEST 10270 FATAL:  could not open file "base/21268/32994": No such file or directory
> 2012-05-26 10:44:28.588 CEST 10270 CONTEXT:  writing block 2508 of relation base/21268/32994
>          xlog redo multi-insert (init): rel 1663/21268/33006; blk 3117; 58 tuples
> TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741)
> 2012-05-26 10:44:31.131 CEST 10269 LOG:  startup process (PID 10270) was terminated by signal 6: Aborted

I don't think that code is meant to recover from errors anyway, so
the fact that it fails with a pin count held isn't exactly surprising.
But it might be worth looking at exactly which on_proc_exit callbacks
are installed in the startup process and what assumptions they make.

As for where the error came from in the first place, it's easy to
imagine somebody who's not got the word about the AccessExclusiveLock
reading pages of the table into buffers that have already been scanned
by the DROP.  So you'd end up with orphaned buffers belonging to a
vanished table.  If somebody managed to dirty them by setting hint bits
(we do allow that in HS mode no?) then later you'd have various processes
trying to write the buffer before recycling it, which seems to fit the
reported error.
        regards, tom lane


On Thu, May 31, 2012 at 9:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> The one thing that still seems a little odd to me is that this caused
>> a pin count to get orphaned.  It seems reasonable that ignoring the
>> AccessExclusiveLock could result in not-found errors trying to open a
>> missing relation, and even fsync requests on a missing relation.  But
>> I don't see why that would cause the backend-local pin counts to get
>> messed up, which makes me wonder if there really is another bug here
>> somewhere.
>
> According to Heikki's log, the Assert was in the startup process itself,
> and it happened after an error:
>
>> 2012-05-26 10:44:28.587 CEST 10270 FATAL:  could not open file "base/21268/32994": No such file or directory
>> 2012-05-26 10:44:28.588 CEST 10270 CONTEXT:  writing block 2508 of relation base/21268/32994
>>          xlog redo multi-insert (init): rel 1663/21268/33006; blk 3117; 58 tuples
>> TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741)
>> 2012-05-26 10:44:31.131 CEST 10269 LOG:  startup process (PID 10270) was terminated by signal 6: Aborted
>
> I don't think that code is meant to recover from errors anyway, so
> the fact that it fails with a pin count held isn't exactly surprising.
> But it might be worth looking at exactly which on_proc_exit callbacks
> are installed in the startup process and what assumptions they make.

Which code isn't meant to recover from errors?

> As for where the error came from in the first place, it's easy to
> imagine somebody who's not got the word about the AccessExclusiveLock
> reading pages of the table into buffers that have already been scanned
> by the DROP.  So you'd end up with orphaned buffers belonging to a
> vanished table.  If somebody managed to dirty them by setting hint bits
> (we do allow that in HS mode no?) then later you'd have various processes
> trying to write the buffer before recycling it, which seems to fit the
> reported error.

Right, I understand the other errors.  It's just the pin count that I
am a bit confused about.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company