Thread: BUG #5412: Crash in production SIGSEGV, equalTupleDescs (tupdesc1=0x7f7f7f7f, tupdesc2=0x966508c4) at tupdesc.c

The following bug has been logged online:

Bug reference:      5412
Logged by:          Rusty Conover
Email address:      rconover@infogears.com
PostgreSQL version: 8.4.3
Operating system:   Fedora Core 10, 2.6.27.30-170.2.82.fc10.i686.PAE
Description:        Crash in production SIGSEGV, equalTupleDescs (tupdesc1=0x7f7f7f7f, tupdesc2=0x966508c4) at tupdesc.c
Details:

Hi,

8.4.3 is crashing for me under load; below is the stack trace:

#0  equalTupleDescs (tupdesc1=0x7f7f7f7f, tupdesc2=0x966508c4) at tupdesc.c:311
#1  0x0832451b in RelationClearRelation (relation=0x96654178, rebuild=<value optimized out>) at relcache.c:1901
#2  0x08324b9f in RelationFlushRelation () at relcache.c:1991
#3  RelationCacheInvalidateEntry (relationId=322615) at relcache.c:2042
#4  0x0831dd89 in LocalExecuteInvalidationMessage (msg=0xa218b54) at inval.c:510
#5  0x0831d341 in ProcessInvalidationMessages (hdr=0xa2010c4, func=0x831dc50 <LocalExecuteInvalidationMessage>) at inval.c:397
#6  0x0831d3dc in CommandEndInvalidationMessages () at inval.c:1006
#7  0x080c9035 in AtCommit_LocalCache () at xact.c:1009
#8  CommandCounterIncrement () at xact.c:634
#9  0x0826dcc9 in finish_xact_command () at postgres.c:2363
#10 0x0826ed4d in exec_simple_query (
    query_string=0xa1e88d4 "insert into product_refer_urls (item_id, destination_url) select product_info.oid,  product_info_static.affiliate_url from real_products, product_parts, product_info left join product_info_static on ("...) at postgres.c:1022
#11 0x0827042c in PostgresMain (argc=4, argv=0xa17c5d0, username=0xa17c590 "gb") at postgres.c:3614
#12 0x0823a4d3 in BackendRun () at postmaster.c:3449
#13 BackendStartup () at postmaster.c:3063
#14 ServerLoop () at postmaster.c:1387
#15 0x0823b503 in PostmasterMain (argc=4, argv=0xa179678) at postmaster.c:1040
#16 0x081dc0a6 in main (argc=4, argv=0xa179678) at main.c:188

The bug is reproducible, but I'm not sure how to proceed.  Here is the output
from the log:

LOG:  server process (PID 8886) was terminated by signal 11: Segmentation
fault
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process

I have been seeing these messages:

LOG:  autovacuum: found orphan temp table "pg_temp_7"."found_sizes" in
database "gb_render_footwear_db"
LOG:  autovacuum: found orphan temp table "pg_temp_7"."classification_tree"
in database "gb_render_footwear_db"

Could they be the result of previous crashes?  Any help is greatly
appreciated.
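
For reference, one way to see which temp schemas still hold tables is a
catalog query along these lines (a rough illustrative sketch):

-- list tables that live in per-backend temp schemas
SELECT n.nspname AS temp_schema, c.relname AS table_name
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname LIKE 'pg_temp_%'
  AND c.relkind = 'r';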

All of this code ran perfectly under 8.3.

Thanks,

Rusty

Rusty Conover wrote:
> The bug is reproducible, I'm not sure how to proceed.

How do you reproduce it? Can you make a self-contained test case?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

On Apr 9, 2010, at 2:23 PM, Heikki Linnakangas wrote:

> Rusty Conover wrote:
>> The bug is reproducible, I'm not sure how to proceed.
>
> How do you reproduce it? Can you make a self-contained test case?
>


I can reproduce it by running my code; I haven't found a good test case yet
because the bug seems to depend on an interaction between postgres processes.

I have about 12 different db sessions creating temporary tables with the same
name, then dropping them very quickly, in the same database, while other
processes are just running queries in other databases.  The process that is
crashing isn't one of the processes that is creating and dropping temporary
tables; it's completely separate, in a different database.

Here is a new crash from today with the same trace:

#0  equalTupleDescs (tupdesc1=0x7f7f7f7f, tupdesc2=0x9661e430) at tupdesc.c:311
311        if (tupdesc1->natts != tupdesc2->natts)
(gdb) bt
#0  equalTupleDescs (tupdesc1=0x7f7f7f7f, tupdesc2=0x9661e430) at tupdesc.c:311
#1  0x0832451b in RelationClearRelation (relation=0x96617200, rebuild=<value optimized out>) at relcache.c:1901
#2  0x08324b9f in RelationFlushRelation () at relcache.c:1991
#3  RelationCacheInvalidateEntry (relationId=276951) at relcache.c:2042
#4  0x0831dd89 in LocalExecuteInvalidationMessage (msg=0xa21c328) at inval.c:510
#5  0x0831d341 in ProcessInvalidationMessages (hdr=0xa2075fc, func=0x831dc50 <LocalExecuteInvalidationMessage>)
    at inval.c:397
#6  0x0831d3dc in CommandEndInvalidationMessages () at inval.c:1006
#7  0x080c9035 in AtCommit_LocalCache () at xact.c:1009
#8  CommandCounterIncrement () at xact.c:634
#9  0x0826dcc9 in finish_xact_command () at postgres.c:2363
#10 0x0826ed4d in exec_simple_query (
    query_string=0xa1e89ec "insert into pch_oid_archive (url, product_name, oid) select product_combined_history_names.url, product_combined_history_names.product_name, product_info.oid from product_info, product_combined_histor"...)
    at postgres.c:1022
#11 0x0827042c in PostgresMain (argc=4, argv=0xa17c740, username=0xa17c700 "gb") at postgres.c:3614
#12 0x0823a4d3 in BackendRun () at postmaster.c:3449
#13 BackendStartup () at postmaster.c:3063
#14 ServerLoop () at postmaster.c:1387
#15 0x0823b503 in PostmasterMain (argc=4, argv=0xa179678) at postmaster.c:1040
#16 0x081dc0a6 in main (argc=4, argv=0xa179678) at main.c:188
(gdb)

Thanks,

Rusty
-
Rusty Conover
rconover@infogears.com
InfoGears Inc / GearBuyer.com / FootwearBuyer.com
http://www.infogears.com
http://www.gearbuyer.com
http://www.footwearbuyer.com

Re: BUG #5412: test case produced, possible race condition.

From: Rusty Conover

Hi Heikki and everybody else,

It seems like this is a race condition caused by the system catalog cache not
being locked properly.  I've included a Perl script below that causes the
crash on my box consistently.

The script forks two different types of processes:

#1 - begin transaction, create a few temp tables and analyze them in a
transaction, commit (running in database foobar_1)
#2 - begin transaction, truncate table, insert records into table from select
in a transaction, commit (running in database foobar_2)
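
In SQL terms, one iteration of each job looks roughly like this (a condensed
sketch of what the script below executes; column lists abbreviated):

-- job #1, run in ~10 concurrent sessions against foobar_1
BEGIN;
CREATE TEMP TABLE c_products (id integer NOT NULL, gender text)
  WITHOUT OIDS ON COMMIT DROP;
CREATE TEMP TABLE c_products_oids (c_products_id integer NOT NULL)
  WITHOUT OIDS ON COMMIT DROP;
ANALYZE c_products;
ANALYZE c_products_oids;
COMMIT;

-- job #2, a single session against foobar_2
BEGIN;
TRUNCATE TABLE foo_dest;
INSERT INTO foo_dest (id, name, size)
  SELECT id, name, size FROM foobar_source;
COMMIT;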

I set up the process to have 10 instances of task #1 and 1 instance of task #2.

Running this script causes the crash of postgres within seconds on my box.

If you change the parameters to, say, <6 instances of task #1, no crash
happens, but if you have >7 the crash does happen.

The box that I'm running the script on has 8 cores, so CPU contention and
some improper locking might cause some of the problem.

The specs of the box are:

Fedora release 10 (Cambridge)
Intel(R) Xeon(R) CPU           E5420  @ 2.50GHz
glibc-2.9-3.i686
Linux 2.6.27.30-170.2.82.fc10.i686.PAE #1 SMP Mon Aug 17 08:24:23 EDT 2009 i686 i686 i386 GNU/Linux
PostgreSQL: 8.4.3

I tried to reproduce it on one of my 16-core x64 boxes and the same crash
doesn't occur; I also tried on a dual-core box and couldn't get a crash, but
I haven't exhaustively tested the right number of parameters for task #1.

If the script doesn't cause a crash for you, please try changing the variable
$total_job_1_children to a number greater than the number of CPU cores of the
machine that you're running it on.

Any help would be appreciated, and if I can be of further assistance please
let me know,

Rusty
--
Rusty Conover
rconover@infogears.com
InfoGears Inc / GearBuyer.com / FootwearBuyer.com
http://www.infogears.com
http://www.gearbuyer.com
http://www.footwearbuyer.com


#!/usr/bin/perl
use strict;
use warnings;
use DBI;
use POSIX ":sys_wait_h";


# Number of children for job1, create temp tables and analyze them

# The number of jobs here matters: (on my 8 core box you need to have some contention to get a failure)
# >11=fail
# 10=fail
# 9=fail
# 8=fail
# 7=fail
# <6 works,
my $total_job_1_children = 11;

# Number of children for job 2 run a truncate and insert query loop.
# we only need one of these jobs to be running really, because the truncate locks.
my $total_job_2_children = 1;


# Just need two databases on your machine, foobar_1 and foobar_2 are the defaults.
my $database_1_dsn = ['dbi:Pg:dbname=foobar_1', 'postgres'];
my $database_2_dsn = ['dbi:Pg:dbname=foobar_2', 'postgres'];

# Do some setup transactions.
if(1) {
    my $dbh = DBI->connect(@$database_2_dsn);

    $dbh->do("drop table foo_dest");
    $dbh->do("drop table foobar_source");
    $dbh->begin_work();
    eval {
        $dbh->do("create table foobar_source (id integer, name text, size integer)")
            || die("Failed to create foobar_source: " . $dbh->errstr());
        for(my $k = 0; $k < 3500; $k++) {
            $dbh->do("insert into foobar_source (id, name, size) values (?, 'test me', ?)",
                     undef, $k, int(rand(400000)))
                || die("Failed to insert into foobar_source: " . $dbh->errstr());
        }

        $dbh->do("analyze foobar_source");

        $dbh->do("create table foo_dest (id integer, name text, size integer)");
    };
    if($@) {
        print "Error doing init of tables: " . $@ . "\n";
        $dbh->rollback();
        $dbh->disconnect();
        exit(0);
    }

    $dbh->commit();
    $dbh->disconnect();
}



my @child_pids;

for(my $i = 0; $i < $total_job_1_children; $i++) {
    print "Forking\n";
    my $pid = fork();
    if($pid == 0) {
        run_child('job1');
        exit(0);
    } else {
        push @child_pids, $pid;
    }
}

for(my $i = 0; $i < $total_job_2_children; $i++) {
    print "Forking\n";
    my $pid = fork();
    if($pid == 0) {
        run_child('job2');
        exit(0);
    } else {
        push @child_pids, $pid;
    }
}


foreach my $pid (@child_pids) {
    print "Waiting for $pid\n";
    waitpid($pid, 0);
    print "Got it\n";
}
exit(0);


sub run_child {
    my $job_type = shift;
    my $dsn;
    if($job_type eq 'job1') {
        $dsn = $database_1_dsn;
    } else {
        $dsn = $database_2_dsn;
    }
    my $dbh = DBI->connect(@$dsn);
    defined($dbh) || die("Failed to get connection to database");

    for(my $i = 0; $i < 400; $i++) {

        $dbh->begin_work();
        eval {

            if($job_type eq 'job1') {
                $dbh->{Warn} = 0;
                $dbh->do("create temp table c_products (id INTEGER NOT NULL, product_name_stemmed text, average_price numeric(12,2), cset_bitmap bit(437), gender text) WITHOUT OIDS ON COMMIT DROP");
                $dbh->do("create temp table c_products_oids (c_products_id INTEGER NOT NULL, oid INTEGER NOT NULL UNIQUE, price numeric(12,2) not null, product_name_stemmed text not null) WITHOUT OIDS ON COMMIT DROP");
                $dbh->{Warn} = 1;

                $dbh->do("analyze c_products");
                $dbh->do("analyze c_products_oids");
            } else {
                $dbh->do("truncate table foo_dest");
                $dbh->do("insert into foo_dest (id, name, size) select id, name, size from foobar_source");
            }
        };
        if($@) {
            print "Got error in job $job_type: $@\n";
            $dbh->rollback();
            $dbh->disconnect();
            exit(0);
        }

        $dbh->commit();
    }
    $dbh->disconnect();
    print "Child finished\n";
    return;
}
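
(To run the script, the two default databases just need to exist beforehand;
a one-time setup along these lines, assuming a 'postgres' superuser, is
enough:)

CREATE DATABASE foobar_1;
CREATE DATABASE foobar_2;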

Re: BUG #5412: test case produced, possible race condition.

From: Heikki Linnakangas

Rusty Conover wrote:
> It seems like this is a race condition caused by the system catalog cache
> not being locked properly.  I've included a Perl script below that causes
> the crash on my box consistently.
>
> The script forks two different types of processes:
>
> #1 - begin transaction, create a few temp tables and analyze them in a
> transaction, commit (running in database foobar_1)
> #2 - begin transaction, truncate table, insert records into table from
> select in a transaction, commit (running in database foobar_2)
>
> I set up the process to have 10 instances of task #1 and 1 instance of
> task #2.
>
> Running this script causes the crash of postgres within seconds on my box.

Thanks, that script crashes on my laptop too, with assertions enabled.

According to the comments above RelationClearRelation(), if it's called
with 'rebuild=true', the caller should hold a lock on the relation, i.e.
refcnt > 0. That's not the case in RelationFlushRelation() when it
rebuilds a relcache entry.

The attached patch should fix that, by incrementing the reference count
while the entry is rebuilt. It also adds an assertion in
RelationClearRelation() to check that the refcnt is indeed > 0.
Comments?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Index: src/backend/utils/cache/relcache.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/utils/cache/relcache.c,v
retrieving revision 1.287.2.3
diff -c -r1.287.2.3 relcache.c
*** src/backend/utils/cache/relcache.c    13 Jan 2010 23:07:15 -0000    1.287.2.3
--- src/backend/utils/cache/relcache.c    14 Apr 2010 11:09:23 -0000
***************
*** 1773,1778 ****
--- 1773,1781 ----
  {
      Oid            old_reltype = relation->rd_rel->reltype;

+     Assert((rebuild && relation->rd_refcnt > 0) ||
+            (!rebuild && relation->rd_refcnt == 0));
+
      /*
       * Make sure smgr and lower levels close the relation's files, if they
       * weren't closed already.  If the relation is not getting deleted, the
***************
*** 1968,1975 ****
  static void
  RelationFlushRelation(Relation relation)
  {
-     bool        rebuild;
-
      if (relation->rd_createSubid != InvalidSubTransactionId ||
          relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
      {
--- 1971,1976 ----
***************
*** 1978,1994 ****
           * forget the "new" status of the relation, which is a useful
           * optimization to have.  Ditto for the new-relfilenode status.
           */
!         rebuild = true;
      }
      else
      {
          /*
           * Pre-existing rels can be dropped from the relcache if not open.
           */
!         rebuild = !RelationHasReferenceCountZero(relation);
      }
-
-     RelationClearRelation(relation, rebuild);
  }

  /*
--- 1979,1996 ----
           * forget the "new" status of the relation, which is a useful
           * optimization to have.  Ditto for the new-relfilenode status.
           */
!         RelationIncrementReferenceCount(relation);
!         RelationClearRelation(relation, true);
!         RelationDecrementReferenceCount(relation);
      }
      else
      {
          /*
           * Pre-existing rels can be dropped from the relcache if not open.
           */
!         bool rebuild = !RelationHasReferenceCountZero(relation);
!         RelationClearRelation(relation, rebuild);
      }
  }

  /*

Re: BUG #5412: test case produced, possible race condition.

From: Rusty Conover

On Apr 14, 2010, at 7:10 AM, Heikki Linnakangas wrote:

> Rusty Conover wrote:
>> It seems like this is a race condition caused by the system catalog cache
>> not being locked properly.  I've included a Perl script below that causes
>> the crash on my box consistently.
>>
>> The script forks two different types of processes:
>>
>> #1 - begin transaction, create a few temp tables and analyze them in a
>> transaction, commit (running in database foobar_1)
>> #2 - begin transaction, truncate table, insert records into table from
>> select in a transaction, commit (running in database foobar_2)
>>
>> I set up the process to have 10 instances of task #1 and 1 instance of
>> task #2.
>>
>> Running this script causes the crash of postgres within seconds on my box.
>
> Thanks, that script crashes on my laptop too, with assertions enabled.
>
> According to the comments above RelationClearRelation(), if it's called
> with 'rebuild=true', the caller should hold a lock on the relation, i.e.
> refcnt > 0. That's not the case in RelationFlushRelation() when it
> rebuilds a relcache entry.
>
> The attached patch should fix that, by incrementing the reference count
> while the entry is rebuilt. It also adds an assertion in
> RelationClearRelation() to check that the refcnt is indeed > 0.
> Comments?



Hi Heikki,

This patch allows my machine to run the script without crashing, so it's
working for me as well.

Many thanks,

Rusty
--
Rusty Conover
rconover@infogears.com
InfoGears Inc / GearBuyer.com / FootwearBuyer.com
http://www.infogears.com
http://www.gearbuyer.com
http://www.footwearbuyer.com

Re: BUG #5412: test case produced, possible race condition.

From: Tom Lane

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> According to the comments above RelationClearRelation(), if it's called
> with 'rebuild=true', the caller should hold a lock on the relation, i.e.
> refcnt > 0. That's not the case in RelationFlushRelation() when it
> rebuilds a relcache entry.

> The attached patch should fix that, by incrementing the reference count
> while the entry is rebuilt. It also adds an assertion in
> RelationClearRelation() to check that the refcnt is indeed > 0.
> Comments?

Why would this patch fix anything?  It doesn't change the lock status.

            regards, tom lane

Re: BUG #5412: test case produced, possible race condition.

From: Tom Lane

I wrote:
> Why would this patch fix anything?  It doesn't change the lock status.

I have not been able to reproduce the crash using Rusty's script on my
own machine, but after contemplating his stack trace for a while I have a
theory about what is happening.  I think that while we are building a
new relation entry (the RelationBuildDesc call in RelationClearRelation)
for a locally-created relation, we receive an sinval reset event caused
by sinval queue overflow.  (That could only happen with a lot of
concurrent catalog update activity, which is why a significant number of
concurrent "job1" clients is needed to provoke the problem.)
The sinval reset will be serviced by RelationCacheInvalidate, which will
blow away any relcache entries with refcount zero, including the one
that the outer instance of RelationClearRelation is trying to rebuild.
So when control returns, the next thing that happens is that we try to do
the equalTupleDescs() comparison against a trashed pointer, as seen in the
stack trace.

This behavior is new in 8.4.3; before that RelationClearRelation
temporarily unhooked the target rel from the relcache hash table,
so it wouldn't be found by RelationCacheInvalidate.  So that explains
why Rusty's app worked before.

In short, then, Heikki's fix is good, although it desperately needs
some comment updates: there's effectively an API change happening here,
because RelationClearRelation's contract with its caller is not the
same as before.  I'll clean it up a bit and apply.  It will need to
go into all the branches this patch went into:
http://archives.postgresql.org/pgsql-committers/2010-01/msg00186.php

            regards, tom lane