Thread: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741
FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741
From
"Erik Rijkers"
Date:
pg 9.2 git master AMD 8120 (8-core) / 6 GB memory / Centos 6.2 I have experimented a bit with dropping a table from master, then querying that table from a sync-rep slave. It is a little worrying that this, the first test I tried, crashes the slave. There are two instance on one machine, head1 (=master) and head2 (=sync-rep slave). First, I generated a tab-separated file, a one off, to be used in the test: echo " copy ( select repeat('X',20) as c1 , repeat('X',20) as c2 , repeat('X',20) as c3 , repeat('X',20) as c4 , repeat('X',20) as c5 from generate_series(1, 200000) ) to stdout csv delimiter E'\t'; " | $HOME/pg_stuff/pg_installations/pgsql.head1/bin/psql -p 6564 -d testdb > dropload_copy.txt That txt file is zipped, and the actual test consists of a bash while loop which 1. drops the table 2. loads the file into the table 3. Either: a. nothing b. does a select count(*) on the table So, it repeats the following: zcat dropload_copy.txt.gz \ | grep -v '^#' \ | $HOME/pg_stuff/pg_installations/pgsql.head1/bin/psql -p 6564 -d testdb -c " drop table if exists t; create table t ( c1 text, c2 text, c3 text, c4 text, c5 text ); copy t from stdin csv delimiter E'\t'; analyze t;"; PAUSE_DURATION=0 PSQL=$HOME/pg_stuff/pg_installations/pgsql.head1/bin/psql if [[ 0 -eq 1 ]]; # ON-OFF switch then echo "sleep $PAUSE_DURATION" sleep $PAUSE_DURATION; ( echo "select current_setting('port') port, count(*) from $schema.$table" | $PSQL -qtXp 6564 -d testdb # master echo "select current_setting('port') port, count(*) from $schema.$table" | $PSQL -qtXp 6565 -d testdb # wal_receiver_01 #echo "select current_setting('port') port, count(*) from $schema.$table" | $PSQL -qtXp 6566 -d testdb # wal_receiver_02 ) | grep -v '^$' fi This runs fine for hours on end, as long as the ON-OFF switch is disabled. But when that if-block is added the client crashes after a while (sometimes almost immediately; it never survives longer then 20 minutes): 2012-05-26 10:44:22.617 CEST 10274 ERROR: could not fsync file "base/21268/32807": No such file or directory 2012-05-26 10:44:28.465 CEST 10274 ERROR: could not fsync file "base/21268/32867": No such file or directory 2012-05-26 10:44:28.587 CEST 10270 FATAL: could not open file "base/21268/32994": No such file or directory 2012-05-26 10:44:28.588 CEST 10270 CONTEXT: writing block 2508 of relation base/21268/32994 xlog redo multi-insert (init): rel 1663/21268/33006; blk 3117; 58 tuples TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741) 2012-05-26 10:44:31.131 CEST 10269 LOG: startup process (PID 10270) was terminated by signal 6: Aborted 2012-05-26 10:44:31.131 CEST 10269 LOG: terminating any other active server processes Crazy scenario , I'll admit, but surely this shouldn't be able to crash the client? I attach the logfiles of master(=head1) and slave (=head2). It show how the above ran for an hour without problems (while the ON/OFF switch was disabled), and how the crash came quickly when I switched it on (to add the select count(*) statements). Erik Rijkers
Attachment
Re: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741
From
Heikki Linnakangas
Date:
On 26.05.2012 12:21, Erik Rijkers wrote: > But when that if-block is added the client crashes after a while (sometimes almost immediately; it > never survives longer then 20 minutes): > > 2012-05-26 10:44:22.617 CEST 10274 ERROR: could not fsync file "base/21268/32807": No such file > or directory > 2012-05-26 10:44:28.465 CEST 10274 ERROR: could not fsync file "base/21268/32867": No such file > or directory > 2012-05-26 10:44:28.587 CEST 10270 FATAL: could not open file "base/21268/32994": No such file or > directory > 2012-05-26 10:44:28.588 CEST 10270 CONTEXT: writing block 2508 of relation base/21268/32994 > xlog redo multi-insert (init): rel 1663/21268/33006; blk 3117; 58 tuples > TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741) > 2012-05-26 10:44:31.131 CEST 10269 LOG: startup process (PID 10270) was terminated by signal 6: > Aborted > 2012-05-26 10:44:31.131 CEST 10269 LOG: terminating any other active server processes > > > Crazy scenario , I'll admit, but surely this shouldn't be able to crash the client? Thanks for the report. I was able to reproduce this with that script, and I think I see what's going on now. There's something wrong with the way AccessExclusiveLocks work on a standby. I did "begin; truncate foo; -- leave the xact open" in the master, and waited until the xlog records are shipped to the standby. Then I did this in the standby: testdb=# begin; BEGIN testdb=# select * from foo; id ---- (0 rows) testdb=# select locktype, database, relation, virtualtransaction, pid, mode, granted, fastpath from pg_locks where locktype='relation' and relation='foo'::regclass; locktype | database | relation | virtualtransaction | pid | mode | granted | fastpath ----------+----------+----------+--------------------+-------+---------------------+---------+---------- relation | 16384| 27332 | 2/78 | 24984 | AccessShareLock | t | t relation | 16384 | 27332 | 1/0 | 24344 | AccessExclusiveLock | t | f (2 rows) The "select * from foo" query should have blocked, because the transaction in the master is holding an AccessExclusiveLock on the table. The process holding the AccessExclusiveLock is the startup process. It's holding the lock on behalf of the transaction in the master. But something's wrong, and the AccessExclusiveLock doesn't stop a regular backend from acquiring the AccessShareLock on the table. I suspect the fast-path locking patch, because this works on 9.1. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741
From
Robert Haas
Date:
On Wed, May 30, 2012 at 1:07 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > There's something wrong with the way AccessExclusiveLocks work on a standby. > I did "begin; truncate foo; -- leave the xact open" in the master, and > waited until the xlog records are shipped to the standby. Then I did this in > the standby: > > testdb=# begin; > BEGIN > testdb=# select * from foo; > id > ---- > (0 rows) > > testdb=# select locktype, database, relation, virtualtransaction, pid, mode, > granted, fastpath from pg_locks where locktype='relation' and > relation='foo'::regclass; > locktype | database | relation | virtualtransaction | pid | mode > | granted | fastpath > ----------+----------+----------+--------------------+-------+---------------------+---------+---------- > relation | 16384 | 27332 | 2/78 | 24984 | > AccessShareLock | t | t > relation | 16384 | 27332 | 1/0 | 24344 | > AccessExclusiveLock | t | f > (2 rows) > > The "select * from foo" query should have blocked, because the transaction > in the master is holding an AccessExclusiveLock on the table. > > The process holding the AccessExclusiveLock is the startup process. It's > holding the lock on behalf of the transaction in the master. But something's > wrong, and the AccessExclusiveLock doesn't stop a regular backend from > acquiring the AccessShareLock on the table. I suspect the fast-path locking > patch, because this works on 9.1. Yeah, apparently so. gdb says that FastPathStrongRelationLocks on the standby is all-zeros even after that record has been replayed. Not sure how that's possible yet. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741
From
Robert Haas
Date:
On Wed, May 30, 2012 at 1:47 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> The process holding the AccessExclusiveLock is the startup process. It's >> holding the lock on behalf of the transaction in the master. But something's >> wrong, and the AccessExclusiveLock doesn't stop a regular backend from >> acquiring the AccessShareLock on the table. I suspect the fast-path locking >> patch, because this works on 9.1. > > Yeah, apparently so. gdb says that FastPathStrongRelationLocks on the > standby is all-zeros even after that record has been replayed. Not > sure how that's possible yet. Ah. The problem is that FastPathTag() expects that locks on database objects will only be taken by backends with a non-zero value for MyDatabaseId. Apparently the can-i-use-the-fastpath test and the do-i-need-to-force-other-people-out-of-the-fastpath test need to be a bit more asymmetrical than they are at present. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741
From
Robert Haas
Date:
On Wed, May 30, 2012 at 2:52 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, May 30, 2012 at 1:47 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> The process holding the AccessExclusiveLock is the startup process. It's >>> holding the lock on behalf of the transaction in the master. But something's >>> wrong, and the AccessExclusiveLock doesn't stop a regular backend from >>> acquiring the AccessShareLock on the table. I suspect the fast-path locking >>> patch, because this works on 9.1. >> >> Yeah, apparently so. gdb says that FastPathStrongRelationLocks on the >> standby is all-zeros even after that record has been replayed. Not >> sure how that's possible yet. > > Ah. The problem is that FastPathTag() expects that locks on database > objects will only be taken by backends with a non-zero value for > MyDatabaseId. Apparently the can-i-use-the-fastpath test and the > do-i-need-to-force-other-people-out-of-the-fastpath test need to be a > bit more asymmetrical than they are at present. I've fixed things so that Heikki's test case now behaves as expected. Hopefully this fixes Erik's problem as well, but I haven't tested. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741
From
"Erik Rijkers"
Date:
On Wed, May 30, 2012 22:25, Robert Haas wrote: > On Wed, May 30, 2012 at 2:52 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Wed, May 30, 2012 at 1:47 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>>> The process holding the AccessExclusiveLock is the startup process. It's >>>> holding the lock on behalf of the transaction in the master. But something's >>>> wrong, and the AccessExclusiveLock doesn't stop a regular backend from >>>> acquiring the AccessShareLock on the table. I suspect the fast-path locking >>>> patch, because this works on 9.1. >>> >>> Yeah, apparently so. gdb says that FastPathStrongRelationLocks on the >>> standby is all-zeros even after that record has been replayed. Not >>> sure how that's possible yet. >> >> Ah. The problem is that FastPathTag() expects that locks on database >> objects will only be taken by backends with a non-zero value for >> MyDatabaseId. Apparently the can-i-use-the-fastpath test and the >> do-i-need-to-force-other-people-out-of-the-fastpath test need to be a >> bit more asymmetrical than they are at present. > > I've fixed things so that Heikki's test case now behaves as expected. > Hopefully this fixes Erik's problem as well, but I haven't tested. > (I double-checked that I got your latest commit in) I'm afraid it's not yet resolved; the sync-slave still crashes almost immediately: master logfile says: 2012-05-30 23:30:07.846 CEST 3918 LOG: standby wal_receiver_01 is now the synchronous standby with priority 1 sync-slave logfile: [...] 2012-05-30 23:30:07.833 CEST 3908 LOG: database system is ready to accept read only connections cp: cannot stat `/home/aardvark/pg_stuff/archive_dir/000000010000000000000004': No such file or directory 2012-05-30 23:30:07.845 CEST 3917 LOG: streaming replication successfully connected to primary 2012-05-30 23:40:52.635 CEST 5287 ERROR: could not open relation with OID 26563 2012-05-30 23:40:52.635 CEST 5287 STATEMENT: select current_setting('port') port, count(*) from public.t 2012-05-30 23:40:57.909 CEST 3909 FATAL: could not open file "base/21268/26569": No such file or directory 2012-05-30 23:40:57.909 CEST 3909 CONTEXT: writing block 5152 of relation base/21268/26569 xlog redo multi-insert(init): rel 1663/21268/26581; blk 3852; 35 tuples TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741) 2012-05-30 23:40:58.006 CEST 5331 FATAL: could not open file "base/21268/26569": No such file or directory 2012-05-30 23:40:58.006 CEST 5331 CONTEXT: writing block 5153 of relation base/21268/26569 2012-05-30 23:40:59.661 CEST 3908 LOG: startup process (PID 3909) was terminated by signal 6: Aborted 2012-05-30 23:40:59.661 CEST 3908 LOG: terminating any other active server processes
Re: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741
From
Robert Haas
Date:
On Wed, May 30, 2012 at 6:00 PM, Erik Rijkers <er@xs4all.nl> wrote: > (I double-checked that I got your latest commit in) > > I'm afraid it's not yet resolved; the sync-slave still crashes almost immediately: > > master logfile says: > 2012-05-30 23:30:07.846 CEST 3918 LOG: standby wal_receiver_01 is now the synchronous standby > with priority 1 > > sync-slave logfile: > > [...] > 2012-05-30 23:30:07.833 CEST 3908 LOG: database system is ready to accept read only connections > cp: cannot stat `/home/aardvark/pg_stuff/archive_dir/000000010000000000000004': No such file or > directory > 2012-05-30 23:30:07.845 CEST 3917 LOG: streaming replication successfully connected to primary > 2012-05-30 23:40:52.635 CEST 5287 ERROR: could not open relation with OID 26563 > 2012-05-30 23:40:52.635 CEST 5287 STATEMENT: select current_setting('port') port, count(*) from > public.t > 2012-05-30 23:40:57.909 CEST 3909 FATAL: could not open file "base/21268/26569": No such file or > directory > 2012-05-30 23:40:57.909 CEST 3909 CONTEXT: writing block 5152 of relation base/21268/26569 > xlog redo multi-insert (init): rel 1663/21268/26581; blk 3852; 35 tuples > TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741) > 2012-05-30 23:40:58.006 CEST 5331 FATAL: could not open file "base/21268/26569": No such file or > directory > 2012-05-30 23:40:58.006 CEST 5331 CONTEXT: writing block 5153 of relation base/21268/26569 > 2012-05-30 23:40:59.661 CEST 3908 LOG: startup process (PID 3909) was terminated by signal 6: > Aborted > 2012-05-30 23:40:59.661 CEST 3908 LOG: terminating any other active server processes Hmm. I set up what I believe to be the same test case you were using, and it's not crashing for me. In fact, with the latest code, I haven't been able to produce any error at all. When I reverted my last commit, I managed to get this: ERROR: could not open relation with OID 18229 STATEMENT: select current_setting('port') port, count(*) from t ...but just once, and with no other error messages. So I'm either missing a step somewhere, or something about your system just happens to tickle this moreso than on mine. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741
From
"Erik Rijkers"
Date:
On Thu, May 31, 2012 03:30, Robert Haas wrote: > On Wed, May 30, 2012 at 6:00 PM, Erik Rijkers <er@xs4all.nl> wrote: >> directory >> 2012-05-30 23:40:57.909 CEST 3909 CONTEXT: writing block 5152 of relation base/21268/26569 >> xlog redo multi-insert (init): rel 1663/21268/26581; blk 3852; 35 tuples >> TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741) >> 2012-05-30 23:40:58.006 CEST 5331 FATAL: could not open file "base/21268/26569": No such file >> or >> directory >> 2012-05-30 23:40:58.006 CEST 5331 CONTEXT: writing block 5153 of relation base/21268/26569 >> 2012-05-30 23:40:59.661 CEST 3908 LOG: startup process (PID 3909) was terminated by signal 6: >> Aborted >> 2012-05-30 23:40:59.661 CEST 3908 LOG: terminating any other active server processes > > Hmm. I set up what I believe to be the same test case you were using, > and it's not crashing for me. In fact, with the latest code, I > haven't been able to produce any error at all. When I reverted my > last commit, I managed to get this: > > ERROR: could not open relation with OID 18229 > STATEMENT: select current_setting('port') port, count(*) from t > > ...but just once, and with no other error messages. So I'm either > missing a step somewhere, or something about your system just happens > to tickle this moreso than on mine. > In my test, I run the bash code (the bits that I posted earlier) in a while loop so that the table is CREATEd, COPYied into, and DROPped every few seconds -- perhaps that wasn't clear. That loop is necessary; a few iterations are often successful. I can test it today on a few other systems to see if it is reproducible. thanks, Erik Rijkers
Re: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741
From
Heikki Linnakangas
Date:
On 31.05.2012 08:06, Erik Rijkers wrote: > On Thu, May 31, 2012 03:30, Robert Haas wrote: >> On Wed, May 30, 2012 at 6:00 PM, Erik Rijkers<er@xs4all.nl> wrote: >>> directory >>> 2012-05-30 23:40:57.909 CEST 3909 CONTEXT: writing block 5152 of relation base/21268/26569 >>> xlog redo multi-insert (init): rel 1663/21268/26581; blk 3852; 35 tuples >>> TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741) >>> 2012-05-30 23:40:58.006 CEST 5331 FATAL: could not open file "base/21268/26569": No such file >>> or >>> directory >>> 2012-05-30 23:40:58.006 CEST 5331 CONTEXT: writing block 5153 of relation base/21268/26569 >>> 2012-05-30 23:40:59.661 CEST 3908 LOG: startup process (PID 3909) was terminated by signal 6: >>> Aborted >>> 2012-05-30 23:40:59.661 CEST 3908 LOG: terminating any other active server processes >> >> Hmm. I set up what I believe to be the same test case you were using, >> and it's not crashing for me. In fact, with the latest code, I >> haven't been able to produce any error at all. When I reverted my >> last commit, I managed to get this: >> >> ERROR: could not open relation with OID 18229 >> STATEMENT: select current_setting('port') port, count(*) from t >> >> ...but just once, and with no other error messages. So I'm either >> missing a step somewhere, or something about your system just happens >> to tickle this moreso than on mine. > > In my test, I run the bash code (the bits that I posted earlier) in a while loop so that the table > is CREATEd, COPYied into, and DROPped every few seconds -- perhaps that wasn't clear. That loop > is necessary; a few iterations are often successful. > > I can test it today on a few other systems to see if it is reproducible. I could no longer reproduce it after Robert's fix, the test case has been running for about an hour now. Please triple-check that you have it applied in the standby :-). -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741
From
Robert Haas
Date:
On Thu, May 31, 2012 at 1:06 AM, Erik Rijkers <er@xs4all.nl> wrote: > In my test, I run the bash code (the bits that I posted earlier) in a while loop so that the table > is CREATEd, COPYied into, and DROPped every few seconds -- perhaps that wasn't clear. That loop > is necessary; a few iterations are often successful. Yes... I let it run all night on my laptop with no errors. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741
From
Simon Riggs
Date:
On 31 May 2012 12:14, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, May 31, 2012 at 1:06 AM, Erik Rijkers <er@xs4all.nl> wrote: >> In my test, I run the bash code (the bits that I posted earlier) in a while loop so that the table >> is CREATEd, COPYied into, and DROPped every few seconds -- perhaps that wasn't clear. That loop >> is necessary; a few iterations are often successful. > > Yes... I let it run all night on my laptop with no errors. It looked to me like the correct fix to me as well, FWIW. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741
From
"Erik Rijkers"
Date:
On Thu, May 31, 2012 13:14, Robert Haas wrote: > On Thu, May 31, 2012 at 1:06 AM, Erik Rijkers <er@xs4all.nl> wrote: >> In my test, I run the bash code (the bits that I posted earlier) in a while loop so that the >> table >> is CREATEd, COPYied into, and DROPped every few seconds -- perhaps that wasn't clear. That loop >> is necessary; a few iterations are often successful. > > Yes... I let it run all night on my laptop with no errors. My apologies to both of you. It seems the problem was indeed solved with that commit from Robert. I managed to forget thatI, uh... temporary, commented out the git-pull from my build script... Thanks, Erik Rijkers
Re: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741
From
Robert Haas
Date:
On Thu, May 31, 2012 at 7:49 AM, Erik Rijkers <er@xs4all.nl> wrote: > On Thu, May 31, 2012 13:14, Robert Haas wrote: >> On Thu, May 31, 2012 at 1:06 AM, Erik Rijkers <er@xs4all.nl> wrote: >>> In my test, I run the bash code (the bits that I posted earlier) in a while loop so that the >>> table >>> is CREATEd, COPYied into, and DROPped every few seconds -- perhaps that wasn't clear. That loop >>> is necessary; a few iterations are often successful. >> >> Yes... I let it run all night on my laptop with no errors. > > My apologies to both of you. It seems the problem was indeed solved with that commit from Robert. > I managed to forget that I, uh... temporary, commented out the git-pull from my build script... No problem. The one thing that still seems a little odd to me is that this caused a pin count to get orphaned. It seems reasonable that ignoring the AccessExclusiveLock could result in not-found errors trying to open a missing relation, and even fsync requests on a missing relation. But I don't see why that would cause the backend-local pin counts to get messed up, which makes me wonder if there really is another bug here somewhere. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > The one thing that still seems a little odd to me is that this caused > a pin count to get orphaned. It seems reasonable that ignoring the > AccessExclusiveLock could result in not-found errors trying to open a > missing relation, and even fsync requests on a missing relation. But > I don't see why that would cause the backend-local pin counts to get > messed up, which makes me wonder if there really is another bug here > somewhere. According to Heikki's log, the Assert was in the startup process itself, and it happened after an error: > 2012-05-26 10:44:28.587 CEST 10270 FATAL: could not open file "base/21268/32994": No such file or directory > 2012-05-26 10:44:28.588 CEST 10270 CONTEXT: writing block 2508 of relation base/21268/32994 > xlog redo multi-insert (init): rel 1663/21268/33006; blk 3117; 58 tuples > TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741) > 2012-05-26 10:44:31.131 CEST 10269 LOG: startup process (PID 10270) was terminated by signal 6: Aborted I don't think that code is meant to recover from errors anyway, so the fact that it fails with a pin count held isn't exactly surprising. But it might be worth looking at exactly which on_proc_exit callbacks are installed in the startup process and what assumptions they make. As for where the error came from in the first place, it's easy to imagine somebody who's not got the word about the AccessExclusiveLock reading pages of the table into buffers that have already been scanned by the DROP. So you'd end up with orphaned buffers belonging to a vanished table. If somebody managed to dirty them by setting hint bits (we do allow that in HS mode no?) then later you'd have various processes trying to write the buffer before recycling it, which seems to fit the reported error. regards, tom lane
Re: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741
From
Robert Haas
Date:
On Thu, May 31, 2012 at 9:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> The one thing that still seems a little odd to me is that this caused >> a pin count to get orphaned. It seems reasonable that ignoring the >> AccessExclusiveLock could result in not-found errors trying to open a >> missing relation, and even fsync requests on a missing relation. But >> I don't see why that would cause the backend-local pin counts to get >> messed up, which makes me wonder if there really is another bug here >> somewhere. > > According to Heikki's log, the Assert was in the startup process itself, > and it happened after an error: > >> 2012-05-26 10:44:28.587 CEST 10270 FATAL: could not open file "base/21268/32994": No such file or directory >> 2012-05-26 10:44:28.588 CEST 10270 CONTEXT: writing block 2508 of relation base/21268/32994 >> xlog redo multi-insert (init): rel 1663/21268/33006; blk 3117; 58 tuples >> TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741) >> 2012-05-26 10:44:31.131 CEST 10269 LOG: startup process (PID 10270) was terminated by signal 6: Aborted > > I don't think that code is meant to recover from errors anyway, so > the fact that it fails with a pin count held isn't exactly surprising. > But it might be worth looking at exactly which on_proc_exit callbacks > are installed in the startup process and what assumptions they make. Which code isn't meant to recover from errors? > As for where the error came from in the first place, it's easy to > imagine somebody who's not got the word about the AccessExclusiveLock > reading pages of the table into buffers that have already been scanned > by the DROP. So you'd end up with orphaned buffers belonging to a > vanished table. If somebody managed to dirty them by setting hint bits > (we do allow that in HS mode no?) then later you'd have various processes > trying to write the buffer before recycling it, which seems to fit the > reported error. Right, I understand the other errors. It's just the pin count that I am a bit confused about. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company