Thread: BUG #5412: Crash in production SIGSEGV, equalTupleDescs (tupdesc1=0x7f7f7f7f, tupdesc2=0x966508c4) at tupdesc.c
BUG #5412: Crash in production SIGSEGV, equalTupleDescs (tupdesc1=0x7f7f7f7f, tupdesc2=0x966508c4) at tupdesc.c
From
"Rusty Conover"
Date:
The following bug has been logged online:

Bug reference:      5412
Logged by:          Rusty Conover
Email address:      rconover@infogears.com
PostgreSQL version: 8.4.3
Operating system:   Fedora Core 10, 2.6.27.30-170.2.82.fc10.i686.PAE
Description:        Crash in production SIGSEGV, equalTupleDescs (tupdesc1=0x7f7f7f7f, tupdesc2=0x966508c4) at tupdesc.c

Details:

Hi,

8.4.3 is crashing for me under load, below is the stack trace:

#0  equalTupleDescs (tupdesc1=0x7f7f7f7f, tupdesc2=0x966508c4) at tupdesc.c:311
#1  0x0832451b in RelationClearRelation (relation=0x96654178, rebuild=<value optimized out>) at relcache.c:1901
#2  0x08324b9f in RelationFlushRelation () at relcache.c:1991
#3  RelationCacheInvalidateEntry (relationId=322615) at relcache.c:2042
#4  0x0831dd89 in LocalExecuteInvalidationMessage (msg=0xa218b54) at inval.c:510
#5  0x0831d341 in ProcessInvalidationMessages (hdr=0xa2010c4, func=0x831dc50 <LocalExecuteInvalidationMessage>) at inval.c:397
#6  0x0831d3dc in CommandEndInvalidationMessages () at inval.c:1006
#7  0x080c9035 in AtCommit_LocalCache () at xact.c:1009
#8  CommandCounterIncrement () at xact.c:634
#9  0x0826dcc9 in finish_xact_command () at postgres.c:2363
#10 0x0826ed4d in exec_simple_query (
    query_string=0xa1e88d4 "insert into product_refer_urls (item_id, destination_url) select product_info.oid, product_info_static.affiliate_url from real_products, product_parts, product_info left join product_info_static on ("...) at postgres.c:1022
#11 0x0827042c in PostgresMain (argc=4, argv=0xa17c5d0, username=0xa17c590 "gb") at postgres.c:3614
#12 0x0823a4d3 in BackendRun () at postmaster.c:3449
#13 BackendStartup () at postmaster.c:3063
#14 ServerLoop () at postmaster.c:1387
#15 0x0823b503 in PostmasterMain (argc=4, argv=0xa179678) at postmaster.c:1040
#16 0x081dc0a6 in main (argc=4, argv=0xa179678) at main.c:188

The bug is reproducible; I'm not sure how to proceed.
Here is the output from the log:

LOG:  server process (PID 8886) was terminated by signal 11: Segmentation fault
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process

I have also been seeing these messages:

LOG:  autovacuum: found orphan temp table "pg_temp_7"."found_sizes" in database "gb_render_footwear_db"
LOG:  autovacuum: found orphan temp table "pg_temp_7"."classification_tree" in database "gb_render_footwear_db"

Could they be the result of previous crashes? Any help is greatly appreciated. All of this code ran perfectly under 8.3.

Thanks,

Rusty
Re: BUG #5412: Crash in production SIGSEGV, equalTupleDescs (tupdesc1=0x7f7f7f7f, tupdesc2=0x966508c4) at tupdesc.c
From
Heikki Linnakangas
Date:
Rusty Conover wrote:
> The bug is reproducible, I'm not sure how to proceed.

How do you reproduce it? Can you make a self-contained test case?

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Re: BUG #5412: Crash in production SIGSEGV, equalTupleDescs (tupdesc1=0x7f7f7f7f, tupdesc2=0x966508c4) at tupdesc.c
From
Rusty Conover
Date:
On Apr 9, 2010, at 2:23 PM, Heikki Linnakangas wrote:
> Rusty Conover wrote:
>> The bug is reproducible, I'm not sure how to proceed.
>
> How do you reproduce it? Can you make a self-contained test case?

I can reproduce it by running my code, but I haven't found a good test case yet because the bug seems to depend on an interaction between postgres processes.

I have about 12 different db sessions creating temporary tables with the same name, then dropping them very quickly, all in the same database, while other processes are just running queries in other databases. The process that is crashing isn't one of the processes that is creating and dropping temporary tables; it's completely separate, in a different database.

Here is a new crash from today with the same trace:

#0  equalTupleDescs (tupdesc1=0x7f7f7f7f, tupdesc2=0x9661e430) at tupdesc.c:311
311         if (tupdesc1->natts != tupdesc2->natts)
(gdb) bt
#0  equalTupleDescs (tupdesc1=0x7f7f7f7f, tupdesc2=0x9661e430) at tupdesc.c:311
#1  0x0832451b in RelationClearRelation (relation=0x96617200, rebuild=<value optimized out>) at relcache.c:1901
#2  0x08324b9f in RelationFlushRelation () at relcache.c:1991
#3  RelationCacheInvalidateEntry (relationId=276951) at relcache.c:2042
#4  0x0831dd89 in LocalExecuteInvalidationMessage (msg=0xa21c328) at inval.c:510
#5  0x0831d341 in ProcessInvalidationMessages (hdr=0xa2075fc, func=0x831dc50 <LocalExecuteInvalidationMessage>) at inval.c:397
#6  0x0831d3dc in CommandEndInvalidationMessages () at inval.c:1006
#7  0x080c9035 in AtCommit_LocalCache () at xact.c:1009
#8  CommandCounterIncrement () at xact.c:634
#9  0x0826dcc9 in finish_xact_command () at postgres.c:2363
#10 0x0826ed4d in exec_simple_query (
    query_string=0xa1e89ec "insert into pch_oid_archive (url, product_name, oid) select product_combined_history_names.url, product_combined_history_names.product_name, product_info.oid from product_info, product_combined_histor"...)
    at postgres.c:1022
#11 0x0827042c in PostgresMain (argc=4, argv=0xa17c740, username=0xa17c700 "gb") at postgres.c:3614
#12 0x0823a4d3 in BackendRun () at postmaster.c:3449
#13 BackendStartup () at postmaster.c:3063
#14 ServerLoop () at postmaster.c:1387
#15 0x0823b503 in PostmasterMain (argc=4, argv=0xa179678) at postmaster.c:1040
#16 0x081dc0a6 in main (argc=4, argv=0xa179678) at main.c:188
(gdb)

Thanks,

Rusty

--
Rusty Conover
rconover@infogears.com
InfoGears Inc / GearBuyer.com / FootwearBuyer.com
http://www.infogears.com
http://www.gearbuyer.com
http://www.footwearbuyer.com
Hi Heikki and everybody else,

It seems like this is a race condition caused by the system catalog cache not being locked properly. I've included a perl script below that causes the crash on my box consistently.

The script forks two different types of processes:

#1 - begin transaction, create a few temp tables and analyze them in a transaction, commit (running in database foobar_1)
#2 - begin transaction, truncate table, insert records into table from select in a transaction, commit (running in database foobar_2)

I set up the script to have 10 instances of task #1 and 1 instance of task #2. Running this script causes the crash of postgres within seconds on my box. If you change the parameters to fewer than 6 instances of task #1, no crash happens, but with 7 or more the crash does happen.

The box that I'm running the script on has 8 cores, so CPU contention and some improper locking might cause some of the problem. The specs of the box are:

Fedora release 10 (Cambridge)
Intel(R) Xeon(R) CPU E5420 @ 2.50GHz
glibc-2.9-3.i686
Linux 2.6.27.30-170.2.82.fc10.i686.PAE #1 SMP Mon Aug 17 08:24:23 EDT 2009 i686 i686 i386 GNU/Linux
PostgreSQL: 8.4.3

I tried to reproduce it on one of my 16-core x64 boxes and the same crash doesn't occur. I also tried on a dual-core box and couldn't get a crash, but I haven't exhaustively tested the right number of instances for task #1 there.

If the script doesn't cause a crash for you, please try changing the variable $total_job_1_children to be a number greater than the number of CPU cores of the machine that you're running it on.
Any help would be appreciated, and if I can be of further assistance please let me know,

Rusty

--
Rusty Conover
rconover@infogears.com
InfoGears Inc / GearBuyer.com / FootwearBuyer.com
http://www.infogears.com
http://www.gearbuyer.com
http://www.footwearbuyer.com

#!/usr/bin/perl
use strict;
use warnings;
use DBI;
use POSIX ":sys_wait_h";

# Number of children for job1: create temp tables and analyze them.
# The number of jobs here matters (on my 8 core box you need to have
# some contention to get a failure):
# >11 = fail
#  10 = fail
#   9 = fail
#   8 = fail
#   7 = fail
#  <6 works
my $total_job_1_children = 11;

# Number of children for job2: run a truncate and insert query loop.
# We only need one of these jobs to be running really, because the truncate locks.
my $total_job_2_children = 1;

# Just need two databases on your machine; foobar_1 and foobar_2 are the defaults.
my $database_1_dsn = ['dbi:Pg:dbname=foobar_1', 'postgres'];
my $database_2_dsn = ['dbi:Pg:dbname=foobar_2', 'postgres'];

# Do some setup transactions.
if(1) {
  my $dbh = DBI->connect(@$database_2_dsn);
  $dbh->do("drop table foo_dest");
  $dbh->do("drop table foobar_source");
  $dbh->begin_work();
  eval {
    $dbh->do("create table foobar_source (id integer, name text, size integer)")
      || die("Failed to create foobar_source: " . $dbh->errstr());
    for(my $k = 0; $k < 3500; $k++) {
      $dbh->do("insert into foobar_source (id, name, size) values (?, 'test me', ?)",
               undef, $k, int(rand(400000)))
        || die("Failed to insert into foobar_source: " . $dbh->errstr());
    }
    $dbh->do("analyze foobar_source");
    $dbh->do("create table foo_dest (id integer, name text, size integer)");
  };
  if($@) {
    print "Error doing init of tables: " . $@ . "\n";
    $dbh->rollback();
    $dbh->disconnect();
    exit(0);
  }
  $dbh->commit();
  $dbh->disconnect();
}

my @child_pids;
for(my $i = 0; $i < $total_job_1_children; $i++) {
  print "Forking\n";
  my $pid = fork();
  if($pid == 0) {
    run_child('job1');
    exit(0);
  } else {
    push @child_pids, $pid;
  }
}

for(my $i = 0; $i < $total_job_2_children; $i++) {
  print "Forking\n";
  my $pid = fork();
  if($pid == 0) {
    run_child('job2');
    exit(0);
  } else {
    push @child_pids, $pid;
  }
}

foreach my $pid (@child_pids) {
  print "Waiting for $pid\n";
  waitpid($pid, 0);
  print "Got it\n";
}
exit(0);

sub run_child {
  my $job_type = shift;

  my $dsn;
  if($job_type eq 'job1') {
    $dsn = $database_1_dsn;
  } else {
    $dsn = $database_2_dsn;
  }

  my $dbh = DBI->connect(@$dsn);
  defined($dbh) || die("Failed to get connection to database");

  for(my $i = 0; $i < 400; $i++) {
    $dbh->begin_work();
    eval {
      if($job_type eq 'job1') {
        $dbh->{Warn} = 0;
        $dbh->do("create temp table c_products (id INTEGER NOT NULL, product_name_stemmed text, average_price numeric(12,2), cset_bitmap bit(437), gender text) WITHOUT OIDS ON COMMIT DROP");
        $dbh->do("create temp table c_products_oids (c_products_id INTEGER NOT NULL, oid INTEGER NOT NULL UNIQUE, price numeric(12,2) not null, product_name_stemmed text not null) WITHOUT OIDS ON COMMIT DROP");
        $dbh->{Warn} = 1;

        $dbh->do("analyze c_products");
        $dbh->do("analyze c_products_oids");
      } else {
        $dbh->do("truncate table foo_dest");
        $dbh->do("insert into foo_dest (id, name, size) select id, name, size from foobar_source");
      }
    };
    if($@) {
      print "Got error in job $job_type: $@\n";
      $dbh->rollback();
      $dbh->disconnect();
      exit(0);
    }
    $dbh->commit();
  }
  $dbh->disconnect();
  print "Child finished\n";
  return;
}
Rusty Conover wrote:
> It seems like this is a race condition caused by the system catalog cache not being locked properly. I've included a perl script below that causes the crash on my box consistently.
>
> The script forks two different types of processes:
>
> #1 - begin transaction, create a few temp tables and analyze them in a transaction, commit (running in database foobar_1)
> #2 - begin transaction, truncate table, insert records into table from select in a transaction, commit (running in database foobar_2)
>
> I setup the process to have 10 instances of task #1 and 1 instance of task #2.
>
> Running this script causes the crash of postgres within seconds on my box.

Thanks, that script crashes on my laptop too, with assertions enabled.

According to the comments above RelationClearRelation(), if it's called with 'rebuild=true', the caller should hold a lock on the relation, i.e. refcnt > 0. That's not the case in RelationFlushRelation() when it rebuilds a relcache entry.

Attached patch should fix that, by incrementing the reference count while the entry is rebuilt. It also adds an Assertion in RelationClearRelation() to check that the refcnt is indeed > 0.

Comments?

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com

Index: src/backend/utils/cache/relcache.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/utils/cache/relcache.c,v
retrieving revision 1.287.2.3
diff -c -r1.287.2.3 relcache.c
*** src/backend/utils/cache/relcache.c	13 Jan 2010 23:07:15 -0000	1.287.2.3
--- src/backend/utils/cache/relcache.c	14 Apr 2010 11:09:23 -0000
***************
*** 1773,1778 ****
--- 1773,1781 ----
  {
  	Oid			old_reltype = relation->rd_rel->reltype;
  
+ 	Assert((rebuild && relation->rd_refcnt > 0) ||
+ 		   (!rebuild && relation->rd_refcnt == 0));
+ 
  	/*
  	 * Make sure smgr and lower levels close the relation's files, if they
  	 * weren't closed already.  If the relation is not getting deleted, the
***************
*** 1968,1975 ****
  static void
  RelationFlushRelation(Relation relation)
  {
- 	bool		rebuild;
- 
  	if (relation->rd_createSubid != InvalidSubTransactionId ||
  		relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
  	{
--- 1971,1976 ----
***************
*** 1978,1994 ****
  		 * forget the "new" status of the relation, which is a useful
  		 * optimization to have.  Ditto for the new-relfilenode status.
  		 */
! 		rebuild = true;
  	}
  	else
  	{
  		/*
  		 * Pre-existing rels can be dropped from the relcache if not open.
  		 */
! 		rebuild = !RelationHasReferenceCountZero(relation);
  	}
- 
- 	RelationClearRelation(relation, rebuild);
  }
  
  /*
--- 1979,1996 ----
  		 * forget the "new" status of the relation, which is a useful
  		 * optimization to have.  Ditto for the new-relfilenode status.
  		 */
! 		RelationIncrementReferenceCount(relation);
! 		RelationClearRelation(relation, true);
! 		RelationDecrementReferenceCount(relation);
  	}
  	else
  	{
  		/*
  		 * Pre-existing rels can be dropped from the relcache if not open.
  		 */
! 		bool		rebuild = !RelationHasReferenceCountZero(relation);
! 		RelationClearRelation(relation, rebuild);
  	}
  }
  
  /*
On Apr 14, 2010, at 7:10 AM, Heikki Linnakangas wrote:
> Rusty Conover wrote:
>> It seems like this is a race condition caused by the system catalog cache not being locked properly. I've included a perl script below that causes the crash on my box consistently.
>>
>> The script forks two different types of processes:
>>
>> #1 - begin transaction, create a few temp tables and analyze them in a transaction, commit (running in database foobar_1)
>> #2 - begin transaction, truncate table, insert records into table from select in a transaction, commit (running in database foobar_2)
>>
>> I setup the process to have 10 instances of task #1 and 1 instance of task #2.
>>
>> Running this script causes the crash of postgres within seconds on my box.
>
> Thanks, that script crashes on my laptop too, with assertions enabled.
>
> According to the comments above RelationClearRelation(), if it's called
> with 'rebuild=true', the caller should hold a lock on the relation, i.e.
> refcnt > 0. That's not the case in RelationFlushRelation() when it
> rebuilds a relcache entry.
>
> Attached patch should fix that, by incrementing the reference count
> while the entry is rebuilt. It also adds an Assertion in
> RelationClearRelation() to check that the refcnt is indeed > 0.
>
> Comments?

Hi Heikki,

This patch allows my machine to run the script without crashing, so it's working for me as well.

Many thanks,

Rusty

--
Rusty Conover
rconover@infogears.com
InfoGears Inc / GearBuyer.com / FootwearBuyer.com
http://www.infogears.com
http://www.gearbuyer.com
http://www.footwearbuyer.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> According to the comments above RelationClearRelation(), if it's called
> with 'rebuild=true', the caller should hold a lock on the relation, i.e.
> refcnt > 0. That's not the case in RelationFlushRelation() when it
> rebuilds a new relcache entry.

> Attached patch should fix that, by incrementing the reference count
> while the entry is rebuilt. It also adds an Assertion in
> RelationClearRelation() to check that the refcnt is indeed > 0.
> Comments?

Why would this patch fix anything? It doesn't change the lock status.

			regards, tom lane
I wrote:
> Why would this patch fix anything? It doesn't change the lock status.

I have not been able to reproduce the crash using Rusty's script on my own machine, but after contemplating his stack trace for awhile I have a theory about what is happening.

I think that while we are building a new relation entry (the RelationBuildDesc call in RelationClearRelation) for a locally-created relation, we receive an sinval reset event caused by sinval queue overflow. (That could only happen with a lot of concurrent catalog update activity, which is why there's a significant number of concurrent "job1" clients needed to provoke the problem.) The sinval reset will be serviced by RelationCacheInvalidate, which will blow away any relcache entries with refcount zero, including the one that the outer instance of RelationClearRelation is trying to rebuild. So when control returns, the next thing that happens is we try to do the equalTupleDescs() comparison against a trashed pointer, as seen in the stack trace.

This behavior is new in 8.4.3; before that, RelationClearRelation temporarily unhooked the target rel from the relcache hash table, so it wouldn't be found by RelationCacheInvalidate. That explains why Rusty's app worked before.

In short, then, Heikki's fix is good, although it desperately needs some comment updates: there's effectively an API change happening here, because RelationClearRelation's contract with its caller is not the same as before. I'll clean it up a bit and apply. It will need to go into all the branches this patch went into:
http://archives.postgresql.org/pgsql-committers/2010-01/msg00186.php

			regards, tom lane