catalog corruption bug - Mailing list pgsql-hackers

From Jeremy Drake
Subject catalog corruption bug
Msg-id Pine.LNX.4.63.0512211651440.15097@garibaldi.apptechsys.com
List pgsql-hackers
We have encountered a very nasty but apparently rare bug which appears to
result in catalog corruption.  I have not been able to pin down an exact
sequence of events that causes the problem; it appears to be a race
condition of some sort.  This is what I have been able to figure out so
far.

* It appears to be related to temp table creation/deletion.
* It requires that at least two clients be connected simultaneously.
* It seems to be related to the autovacuum (not sure, just a theory).

I will attempt to explain the circumstances leading up to the problem, and
then show the symptoms.

We are working on a project which uses postgresql to store data which has
been decomposed into a large number of rows in stored procedures
(plpgsql/plperl).  The implementation we have been working with recently
has used temp tables to store intermediate stages of this decomposition so
that we can run multiple queries over it in the course of adding it to our
final tables without having to regenerate the set each time.  We were
running a timing test for a load of data which would result in tens of
millions of rows.  This load creates temp tables with "on commit drop" and
also explicitly drops them, at a rate of approximately 10 per second
(transactions are created and committed at that same rate).  This works
fine.  While this load was running we were
working on some testing code to determine whether it might be better to
create the temp table with "on commit delete rows" instead and use a
plpgsql function to create the temp table with an EXCEPTION
duplicate_table block to handle when the table has already been created
for this connection.  We wrote the function at first on a postgres 8.0 box
which was not running the load, but when we were attempting to determine
what the error code thrown was we noticed that the SQLSTATE variable was
not available in 8.0 and copied the function onto the 8.1 box (which was
running the load) to try it out.  We ran this function a couple times to
get the error, and then had it catch the duplicate_table exception.  We
got the function working, and when we looked at the status of our load we
found that it had died with a message saying

"ERROR:  pg_class entry for relid 7502381 vanished during vacuuming"

We found this interesting, figuring it was a bug in postgres.  Googling
the non-variable pieces of that message turned up nothing relevant, so we
set about trying to reproduce it.
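As an aside, this is roughly how we read the error code out on 8.1.  The
function below is just an illustration of the technique, not our exact test
code; on 8.1 the second CREATE should raise SQLSTATE 42P07
(duplicate_table), while 8.0's plpgsql has no SQLSTATE variable at all:

```sql
-- Illustration only: report the SQLSTATE raised by a duplicate
-- CREATE TEMP TABLE within the same connection.
CREATE OR REPLACE FUNCTION show_dup_sqlstate() RETURNS text AS $$
BEGIN
    CREATE TEMP TABLE foo (a integer, b integer) ON COMMIT DELETE ROWS;
    CREATE TEMP TABLE foo (a integer, b integer) ON COMMIT DELETE ROWS;
    RETURN 'no error';
EXCEPTION WHEN OTHERS THEN
    -- SQLSTATE is only available in the EXCEPTION block as of 8.1
    RETURN SQLSTATE;
END;
$$ LANGUAGE plpgsql;
```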

During the course of doing so, we restarted our load several times and
called the function.  We later put the call to the function into a bash
loop invoking psql (so we could disconnect and reconnect between calls)
to reproduce the problem faster.

These are some of the interesting errors which we got while doing this
(all from the server log):

ERROR:  duplicate key violates unique constraint "pg_class_relname_nsp_index"
CONTEXT:  SQL statement "CREATE TEMP TABLE foo (a integer, b integer) ON COMMIT DELETE ROWS"
        PL/pgSQL function "temp_table_test" line 2 at SQL statement
ERROR:  relation "foo" does not exist
ERROR:  duplicate key violates unique constraint "pg_class_relname_nsp_index"
CONTEXT:  SQL statement "CREATE TEMP TABLE foo (a integer, b integer) ON COMMIT DELETE ROWS"
        PL/pgSQL function "temp_table_test" line 2 at SQL statement



ERROR:  relation "foo" does not exist
FATAL:  cache lookup failed for relation 7600066
LOG:  server process (PID 20942) exited with exit code 1
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited abnormally
and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.


We also managed to get an error which was more bothersome than the
mysterious disappearing/reappearing temp tables.

ERROR:  relation "windowpos" does not exist
ERROR:  type "windowpos" already exists
ERROR:  cache lookup failed for relation 794218

Later:
ERROR:  relation "windowpos" already exists
ERROR:  catalog is missing 14 attribute(s) for relid 7577269
ERROR:  catalog is missing 14 attribute(s) for relid 7577269
ERROR:  catalog is missing 14 attribute(s) for relid 7577269
ERROR:  catalog is missing 14 attribute(s) for relid 7577269
ERROR:  catalog is missing 14 attribute(s) for relid 7577269
ERROR:  catalog is missing 14 attribute(s) for relid 7577269
ERROR:  catalog is missing 14 attribute(s) for relid 7577269
ERROR:  catalog is missing 14 attribute(s) for relid 7577269

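In case it helps anyone inspect the state we ended up in: a query along
these lines (our own diagnostic sketch, nothing official) should turn up
pg_class entries whose pg_attribute rows have gone missing, which seems to
be what the "catalog is missing N attribute(s)" errors are complaining
about:

```sql
-- Diagnostic sketch: find pg_class entries whose declared column count
-- (relnatts) disagrees with the user-column rows actually present in
-- pg_attribute (attnum > 0 excludes system columns).
SELECT c.oid, c.relname, c.relnatts,
       count(a.attnum) AS attrs_found
FROM pg_class c
LEFT JOIN pg_attribute a ON a.attrelid = c.oid AND a.attnum > 0
GROUP BY c.oid, c.relname, c.relnatts
HAVING count(a.attnum) <> c.relnatts;
```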



Here is the temp table function we were testing:

CREATE OR REPLACE FUNCTION temp_table_test() RETURNS boolean AS $$
BEGIN
    CREATE TEMP TABLE foo (a integer, b integer) ON COMMIT DELETE ROWS;
    RETURN true;
EXCEPTION WHEN duplicate_table THEN
    RETURN false;
END;
$$ LANGUAGE plpgsql;


And our bash command line for stressing:
for i in `seq 1 10000`; do
    echo -e 'select temp_table_test();\n insert into foo select a, b from generate_series(1,200) a, generate_series(1,200) b;\n' | psql -o /dev/null cruft
    if (($? != 0)); then
        break
    fi
done

We may have reproduced the issue using another bash script instead of our
timing test, but we are not sure whether this was due to the problem, or
due to the catalog already being corrupt in the database.

In this test, we did this initially in psql:

CREATE TABLE junk (a integer, b integer);
CREATE INDEX junk_a_idx ON junk(a);
CREATE INDEX junk_b_idx ON junk(b);
CREATE INDEX junk_a_b_idx ON junk(a, b);
INSERT INTO junk SELECT a, b FROM generate_series(1,200) a, generate_series(1,200) b;
TRUNCATE junk;
TRUNCATE junk;

Then immediately in bash we ran

for i in `seq 1 10000`; do
    echo -e 'insert into junk SELECT a, b FROM generate_series(1,200) a, generate_series(1,200) b;\nCREATE TEMP TABLE goo (a integer, b integer) ON COMMIT DROP;\n' | psql -o /dev/null cruft
    if (($? != 0)); then
        break
    fi
done


Here is my postgres version:

PostgreSQL 8.1.0 on x86_64-pc-linux-gnu, compiled by GCC
x86_64-pc-linux-gnu-gcc (GCC) 3.4.4 (Gentoo 3.4.4-r1, ssp-3.4.4-1.0,
pie-8.7.8)

And here is the CONFIGURE line from pg_config:
CONFIGURE = '--prefix=/usr' '--mandir=/usr/share/man'
'--host=x86_64-pc-linux-gnu'
'--with-docdir=/usr/share/doc/postgresql-8.1.0' '--libdir=/usr/lib64'
'--enable-depend' '--with-tcl' '--with-python' '--with-perl'
'--with-openssl' '--enable-nls' '--with-CXX' '--with-pam' 'CFLAGS=-O2
-pipe -march=nocona -mtune=nocona' 'host_alias=x86_64-pc-linux-gnu'


This was built from the gentoo ebuild version 8.1.0

My kernel version is
Linux samir 2.6.13-gentoo-r5 #1 SMP Thu Nov 17 13:26:09 PST 2005 x86_64 Intel(R) Xeon(TM) CPU 3.00GHz GenuineIntel
GNU/Linux

We also saw a similar error (a type not being found in the cache) on
another box with the same postgres version and platform, but an Opteron
rather than the EM64T Xeon, running a different project which also used
temp tables created with "on commit drop".

I am sorry I could not provide an exact reproduction case, but this is
just one of those things that is too timing-dependent to pin down that
precisely.

Let me know if you need any additional information and I will attempt to
provide it.


--
Half the world is composed of people who have something to say and can't,
and the other half who have nothing to say and keep on saying it.

