Thread: ERROR: Cannot insert a duplicate key into unique index pg_class_relname_nsp_index

ERROR: Cannot insert a duplicate key into unique index pg_class_relname_nsp_index

From
Kragen Sitaker
Date:
ERROR:  Cannot insert a duplicate key into unique index pg_class_relname_nsp_index

We've been getting this error in our application every once in a while
--- typically once an hour to once a day, although it varies over time.
The daemon that gets the error exits and restarts a few seconds later.
Usually it's fine then, but sometimes the error happens three or more
times in a row.

Occasionally, instead, we get "ERROR:  Cannot insert a duplicate key
into unique index pg_type_typname_nsp_index".

We started seeing this error on November 22, three days after we migrated
from Postgres 7.2.3 and 7.3.2 to 7.3.4.  We still see the error with
7.3.5, but not with 7.4.0.  We're not sure we're quite ready to migrate
all of our customers to 7.4.0 yet, though.

The daemon that gets this error does the following every 15 seconds
(sketched in SQL just below the list):
- start a transaction
- execute a complex and sometimes slow SELECT INTO query, creating a
  temporary table
- lock another table in exclusive mode
- lock the temporary table too, for no good reason; this is
  embarrassing but could conceivably relate to the bug
- delete the contents of the other table
- copy the contents of the temporary table into the other table
- drop the temporary table (again, embarrassing, sorry)
- commit
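
In SQL terms, each cycle looks roughly like this; the table and column
names here are invented (the real SELECT is far more complex), but the
shape is the same:

    BEGIN;

    -- the complex, sometimes slow query; creates the temporary table
    SELECT device_id, sum(bytes) AS total_bytes
      INTO TEMPORARY TABLE summary_tmp
      FROM raw_stats
     GROUP BY device_id;

    -- the locks described above (the second one is the pointless one)
    LOCK TABLE summary IN EXCLUSIVE MODE;
    LOCK TABLE summary_tmp IN EXCLUSIVE MODE;

    -- replace the contents of the real table with the fresh rows
    DELETE FROM summary;
    INSERT INTO summary SELECT * FROM summary_tmp;

    -- drop the temp table (it would also go away at backend exit)
    DROP TABLE summary_tmp;

    COMMIT;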

Reducing the interval from 15 seconds to 0.2 seconds makes the error
happen several times an hour.  We hoped it might make the error happen
several times a minute.

We're doing this temp-table dance to minimize the time the other table
is locked, and to keep anyone who's reading the other table from seeing
an incomplete list of records.

It does all of this in four queries.  The second query, the one that
merely creates and populates the temporary table, is the one that gets
the error --- understandably, since I wouldn't expect the other queries
to insert into pg_class or pg_type.

Creating and destroying a temporary table with a lot of attributes every
second is causing other problems, too; the indices on the pg_attribute
table grow to hundreds of megabytes in size, and for some reason,
reindexing our system tables corrupts our database.  These huge indices
slow down access to attributes of tables, and therefore our whole
application.
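
For what it's worth, the bloat is easy to watch from psql; relpages
counts 8 kB blocks and is only refreshed by VACUUM or ANALYZE:

    -- size, in 8 kB pages, of pg_attribute and its indexes
    SELECT relname, relkind, relpages
      FROM pg_class
     WHERE relname LIKE 'pg_attribute%'
     ORDER BY relpages DESC;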

We originally ran the slow SELECT INTO query in an Apache handler, because
we thought it would be fast.  Usually it took tens of milliseconds.
But every once in a while, on a loaded system, while other processes were
updating the tables it was selecting from, the query would take 2, 5,
10, 20, 30, even 60 seconds.  That's a long time for a web page to load.
We never did figure out why it was slow.  Now the slow query happens
in a daemon, and the Apache handler just queries the table populated by
the daemon; that simple query reliably runs fast.  Do you have any suggestions for
how to diagnose this unpredictable performance?
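
The obvious tool is EXPLAIN ANALYZE run by hand, which executes the
query and reports where the time actually goes; a sketch, with the same
invented names as above:

    -- EXPLAIN ANALYZE runs the query and shows actual time per plan node
    EXPLAIN ANALYZE
    SELECT device_id, sum(bytes) AS total_bytes
      FROM raw_stats
     GROUP BY device_id;

But that only helps if we happen to catch the query during one of its
slow episodes.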

Many thanks.

-Kragen

Kragen Sitaker <kragen+pgsql@airwave.com> writes:
> ERROR:  Cannot insert a duplicate key into unique index pg_class_relname_nsp_index
> We've been getting this error in our application every once in a while
> --- typically once an hour to once a day, although it varies over time.

It seems to me that this must indicate a collision on the name+schema of the
temp table.  Now that ought to be impossible :-( --- you should get
errors earlier than this if you were actually creating a duplicately
named temp table, and the design for selecting nonconflicting temp
schemas seems pretty bulletproof to me too.

> We started seeing this error on November 22, three days after we migrated
> from Postgres 7.2.3 and 7.3.2 to 7.3.4.  We still see the error with
> 7.3.5, but not with 7.4.0.

Hmm.  I'm not aware of any 7.4 bug fix that would affect such a thing,
so I wouldn't want to bet that 7.4 has really solved the issue.

Digging in the 7.3.2-to-7.3.4 change logs, I see one potentially
relevant change:

2003-02-06 20:33  tgl

    * src/: backend/catalog/dependency.c, backend/catalog/namespace.c,
    include/catalog/dependency.h (REL7_3_STABLE): Revise mechanism for
    getting rid of temp tables at backend shutdown.  Instead of
    grovelling through pg_class to find them, make use of the handy
    dandy dependency mechanism: just delete everything that depends on
    our temp schema.  Unlike the pg_class scan, the dependency
    mechanism is smart enough to delete things in an order that doesn't
    fall foul of any dependency restrictions.  Fixes problem reported
    by David Heggie: a temp table with a serial column may cause a
    backend FATAL exit at shutdown time, if it chances to try to delete
    the temp sequence first.

Now this change also exists in 7.4, but perhaps it is malfunctioning in
7.3.*.  Or maybe you just haven't stressed the 7.4 installation enough
to reproduce the problem there --- what do you think are the odds of
that?

Given that you're explicitly dropping the temp table before exit, it's
not clear how a problem in this code could cause you grief anyway.
But it's the only starting point I can see.  You might try adding some
monitoring code to see if you can detect temp tables being left around
by exited backends.

> Creating and destroying a temporary table with a lot of attributes every
> second is causing other problems, too; the indices on the pg_attribute
> table grow to hundreds of megabytes in size,

Right.  7.4 should fix that though.

> and for some reason,
> reindexing our system tables corrupts our database.

That seems suspicious as well.  What happens exactly?  How did you get
out of it??

> Do you have any suggestions for
> how to diagnose this unpredictable performance?

I think you have more urgent things to worry about.  Like finding why it
doesn't work reliably.

            regards, tom lane

Re: ERROR: Cannot insert a duplicate key into unique index pg_class_relname_nsp_index

From
Martijn van Oosterhout
Date:
On Fri, Jan 09, 2004 at 12:07:25PM -0800, Kragen Sitaker wrote:
<snip>

Not really related to your problem, but given you're in a transaction, why
do you need to lock anything? What's wrong with:

> The daemon that gets this error does the following every 15 seconds:
> - start a transaction
> - delete the contents of the other table
> - execute a complex and sometimes slow SELECT INTO query, creating a
>   temporary table
> - copy the contents of the temporary table into the other table
> - drop the temporary table (again, embarrassing, sorry)
> - commit

Maybe I'm missing something?
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> (... have gone from d-i being barely usable even by its developers
> anywhere, to being about 20% done. Sweet. And the last 80% usually takes
> 20% of the time, too, right?) -- Anthony Towns, debian-devel-announce

On Fri, Jan 09, 2004 at 06:19:00PM -0500, Tom Lane wrote:
> Kragen Sitaker <kragen+pgsql@airwave.com> writes:
> > ERROR:  Cannot insert a duplicate key into unique index pg_class_relname_nsp_index
> > We've been getting this error in our application every once in a while
> > --- typically once an hour to once a day, although it varies over time.
>
> It seems to me that this must indicate a collision on the name+schema of the
> temp table.  Now that ought to be impossible :-(

Those were my first thoughts too :)

> --- you should get errors earlier than this if you were actually
> creating a duplicately named temp table, and the design for selecting
> nonconflicting temp schemas seems pretty bulletproof to me too.

Sure.  We thought maybe we had two instances of the daemons running at
once, but we tried that, and couldn't make the error happen every time.

It's worth mentioning that the daemon will often run for hours before
dying with this error.  Then, when it comes back up a few seconds later,
it's likely to fail again immediately, but it's even more likely to run
without a problem for hours more.

> > We started seeing this error on November 22, three days after we migrated
> > from Postgres 7.2.3 and 7.3.2 to 7.3.4.  We still see the error with
> > 7.3.5, but not with 7.4.0.
>
> Hmm.  I'm not aware of any 7.4 bug fix that would affect such a thing,
> so I wouldn't want to bet that 7.4 has really solved the issue.

I'm glad to know that.

> Digging in the 7.3.2-to-7.3.4 change logs, I see one potentially
> relevant change:
>
> 2003-02-06 20:33  tgl
>
>     * src/: backend/catalog/dependency.c, backend/catalog/namespace.c,
>     include/catalog/dependency.h (REL7_3_STABLE): Revise mechanism for
>     getting rid of temp tables at backend shutdown.  Instead of
>     grovelling through pg_class to find them, make use of the handy
>     dandy dependency mechanism: just delete everything that depends on
>     our temp schema.  Unlike the pg_class scan, the dependency
>     mechanism is smart enough to delete things in an order that doesn't
>     fall foul of any dependency restrictions.  Fixes problem reported
>     by David Heggie: a temp table with a serial column may cause a
>     backend FATAL exit at shutdown time, if it chances to try to delete
>     the temp sequence first.
>
> Now this change also exists in 7.4, but perhaps it is malfunctioning in
> 7.3.*.  Or maybe you just haven't stressed the 7.4 installation enough
> to reproduce the problem there --- what do you think are the odds of
> that?

It's possible.  We've re-downgraded that development machine to 7.3.4
to experiment with other ways of solving the problem, and it looks like
our nightly backup script didn't work last night, so I can't inspect
the state of the database that didn't manifest the problems with 7.4.
It's possible it might have had less stuff in it :(

We'll run the experiment again.  Should we try 7.3.3 too?

> Given that you're explicitly dropping the temp table before exit, it's
> not clear how a problem in this code could cause you grief anyway.

Well, it's possible the daemon could have gotten killed while it was
inside the transaction, followed shortly by a shutdown of postgres ---
a dozen times or more --- and during development, we frequently kill
the daemon so that it will restart with new code.  For our application,
we shut down and restart Postgres every night because it seems to make
VACUUM FULL work better.

> But it's the only starting point I can see.  You might try adding some
> monitoring code to see if you can detect temp tables being left around
> by exited backends.

Something like this?
foobar=> select count(*), relnamespace from pg_class group by relnamespace;
 count | relnamespace
-------+--------------
   106 |           11
    70 |           99
   147 |         2200
(3 rows)
foobar=> select oid, * from pg_namespace;
   oid   |  nspname   | nspowner | nspacl
---------+------------+----------+--------
      11 | pg_catalog |        1 | {=U}
      99 | pg_toast   |        1 | {=}
    2200 | public     |        1 | {=UC}
   16765 | pg_temp_1  |        1 |
   17593 | pg_temp_2  |        1 |
   17647 | pg_temp_15 |        1 |
   20278 | pg_temp_16 |        1 |
 1570284 | pg_temp_32 |        1 |
(8 rows)

I wonder why those old namespaces are left around?  A new one shows up
whenever I kill and restart the daemon that creates the temporary tables.

We could run this code periodically to see when new namespaces pop up.
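
Or, more directly, list whatever relations are still sitting in those
temp schemas; anything that shows up while no daemon is in the middle of
a transaction would be leftover junk:

    -- list any relations that still belong to a pg_temp_* schema
    SELECT n.nspname, c.relname, c.relkind
      FROM pg_class c, pg_namespace n
     WHERE c.relnamespace = n.oid
       AND n.nspname LIKE 'pg_temp_%';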

> > Creating and destroying a temporary table with a lot of attributes every
> > second is causing other problems, too; the indices on the pg_attribute
> > table grow to hundreds of megabytes in size,
>
> Right.  7.4 should fix that though.

Great!

> > and for some reason,
> > reindexing our system tables corrupts our database.
>
> That seems suspicious as well.  What happens exactly?  How did you get
> out of it??

I don't remember what happens exactly.  One of us will try this again
this weekend or early next week to get more details.  All my coworkers
can remember is that PostgreSQL complained about "something about
the heap".

We never deployed that code to any customer sites, so we recovered from
it by wiping the data directory on the development machines that we had
that problem on.

> > Do you have any suggestions for how to diagnose this unpredictable
> > performance?
>
> I think you have more urgent things to worry about.  Like finding why it
> doesn't work reliably.

Well, if we didn't have the unpredictable performance, we wouldn't have
been creating the temporary table in the first place, which is the only
thing that exposed this problem for us.  PostgreSQL otherwise works great,
rock-solidly reliably; we've lost data to disk failure, flaky hardware,
filesystem corruption due to power failure on ext2fs, and human error,
in hundreds of thousands of hours of production operation, but never
yet to Postgres.

BTW, we're using the 7.3.4 PGDG RPMs with an extra patch to add
pg_autovacuum.

Thank you very much.

-Kragen
--
Very clever, young man.  But if you use turtles for RPC, you'll have a
very, very slow RPC system.  -- Jacqueline Arnold

On Fri, Jan 09, 2004 at 08:02:16PM -0500, Tom Lane wrote:
> Kragen Sitaker <kragen+pgsql@airwave.com> writes:
> > We'll run the experiment again.  Should we try 7.3.3 too?
>
> No, I don't think 7.3.3 is likely to behave differently from 7.3.4
> as far as this goes.  What would actually be interesting is whether
> you can make 7.4 fail.

We'll do our best.

> > Well, it's possible the daemon could have gotten killed while it was
> > inside the transaction, followed shortly by a shutdown of postgres ---
> > a dozen times or more --- and during development, we frequently kill
> > the daemon so that it will restart with new code.
>
> But you're seeing these errors in production, on a machine where you're
> not doing that, no?  In any case there is code in place to clean out
> a temp schema of any pre-existing junk when a new backend starts to use
> it ... perhaps there's a bug in that, but that code was not changed
> since 7.3.2 ...

I'm not sure what kind of pre-existing junk could cause this problem,
anyway.  Leftover ordinary rows in pg_class would cause the daemon to
fail as soon as it started, not after running for hours, and as you
pointed out, would normally give us a higher-level error message that
didn't mention relations from pg_catalog.

> Another question: are you fairly confident that if the same bug had been
> in 7.3.2, you would have found it?  Were there any changes in your usage
> patterns around the time you adopted 7.3.4?

Actually, yes and yes; we revamped the transaction handling in the rest
of the system, using autocommit most of the time and wrapping various
chunks of the system in transactions.  The particular query that's at
fault didn't change noticeably (we added another column to it, from
a table it was already selecting half a dozen or so columns from),
but it's running inside of a transaction all the time, same as before.
The only relevant difference in this daemon is that its transaction used
to start as soon as it had committed its previous changes, followed by
a 15-second sleep; now it starts after the end of the 15-second sleep,
just before we start doing database actions.

The other tables it's selecting from didn't change much, but all the
processes updating them changed their transactional style.

Of course, my ideas about what could affect this problem are pretty fuzzy.

I'll retest again against 7.3.2 to make sure it's not a change we
introduced into our code around the same time.

Oh, by the way, as far as we know, every one of our production machines
has this problem.

> > For our application, we shut down and restart Postgres every night
> > because it seems to make VACUUM FULL work better.
>
> [ itch... ]  Let's not discuss the wisdom of that just now, but ...

We started doing it on Postgres 7.2, where we determined empirically
that not doing it made the database a little slower every day.  We may
not need it anymore with pg_autovacuum and 7.3.

> If you're not planning to go to 7.4 soon, you might want to think about
> an update to 7.3.5, just on general principles.

We did test on 7.3.5, with the same database contents as on 7.3.4.
Didn't help.  But of course updating to 7.3.5 is probably a good idea
anyway.

Thank you very much for your help.

-Kragen

On Sat, Jan 10, 2004 at 11:20:11AM +1100, Martijn van Oosterhout wrote:
> Not really related to your problem, but given you're in a transaction, why
> do you need to lock anything? What's wrong with:
>
> > The daemon that gets this error does the following every 15 seconds:
> > - start a transaction
> > - delete the contents of the other table
> > - execute a complex and sometimes slow SELECT INTO query, creating a
> >   temporary table
> > - copy the contents of the temporary table into the other table
> > - drop the temporary table (again, embarrassing, sorry)
> > - commit
>
> Maybe I'm missing something?

We don't need to lock anything.  We just thought we did.  We'd observed
that reading a table inside a transaction (at the default READ COMMITTED
isolation level) could show us rows committed by other transactions after
our transaction started (i.e. READ COMMITTED doesn't guarantee repeatable
reads), even if we'd already read the table earlier in the same transaction.

So, lacking a thorough understanding of section 12.2 (or transaction
isolation levels in general), we thought we might have to lock the table
to keep someone else from accessing it while it was partly empty.
We were wrong, but I didn't know that until this afternoon.
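
So, as far as I can tell, Martijn's ordering needs no LOCK statements at
all (same invented names as in my first message):

    BEGIN;

    DELETE FROM summary;

    SELECT device_id, sum(bytes) AS total_bytes
      INTO TEMPORARY TABLE summary_tmp
      FROM raw_stats
     GROUP BY device_id;

    INSERT INTO summary SELECT * FROM summary_tmp;

    DROP TABLE summary_tmp;

    COMMIT;

    -- readers at READ COMMITTED keep seeing the old contents of summary
    -- until the COMMIT, so they never see it partly empty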

Thank you very much for your help!

-Kragen

Re: ERROR: Cannot insert a duplicate key into unique index

From
Bruce Momjian
Date:
Tom Lane wrote:
> Hmm.  I'm not aware of any 7.4 bug fix that would affect such a thing,
> so I wouldn't want to bet that 7.4 has really solved the issue.
>
> Digging in the 7.3.2-to-7.3.4 change logs, I see one potentially
> relevant change:

The only thing I can think of is the fix for splitting the first btree
page.  We fixed that in 7.4.  I remember it happened mostly on SMP
machines.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Tom Lane wrote:
>> Digging in the 7.3.2-to-7.3.4 change logs, I see one potentially
>> relevant change:

> The only thing I can think of is the fix for splitting the first btree
> page.

I paused on that too, but I don't see how it could apply, unless they
were dropping and rebuilding their database every few hours.  Besides,
that bug is fixed in 7.3.5, which is still showing the problem.

            regards, tom lane

Re: ERROR: Cannot insert a duplicate key into unique index

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Tom Lane wrote:
> >> Digging in the 7.3.2-to-7.3.4 change logs, I see one potentially
> >> relevant change:
>
> > The only thing I can think of is the fix for splitting the first btree
> > page.
>
> I paused on that too, but I don't see how it could apply, unless they
> were dropping and rebuilding their database every few hours.  Besides,
> that bug is fixed in 7.3.5, which is still showing the problem.

I didn't know we got that into 7.3.5, but now I remember: it wasn't
serious enough to force a new 7.3.X release, though it did go into 7.3.X CVS.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Kragen Sitaker <kragen+pgsql@airwave.com> writes:
> We'll run the experiment again.  Should we try 7.3.3 too?

No, I don't think 7.3.3 is likely to behave differently from 7.3.4
as far as this goes.  What would actually be interesting is whether
you can make 7.4 fail.

> Well, it's possible the daemon could have gotten killed while it was
> inside the transaction, followed shortly by a shutdown of postgres ---
> a dozen times or more --- and during development, we frequently kill
> the daemon so that it will restart with new code.

But you're seeing these errors in production, on a machine where you're
not doing that, no?  In any case there is code in place to clean out
a temp schema of any pre-existing junk when a new backend starts to use
it ... perhaps there's a bug in that, but that code was not changed
since 7.3.2 ...

Another question: are you fairly confident that if the same bug had been
in 7.3.2, you would have found it?  Were there any changes in your usage
patterns around the time you adopted 7.3.4?

> For our application, we shut down and restart Postgres every night
> because it seems to make VACUUM FULL work better.

[ itch... ]  Let's not discuss the wisdom of that just now, but ...

> I wonder why those old namespaces are left around?

They're supposed to be; there's no point in deleting the pg_namespace
entry only to recreate it the next time someone needs it.  The real
question is whether you see any tables belonging to those namespaces.
The count(*) query on pg_class looked like a fine way to watch that.

> BTW, we're using the 7.3.4 PGDG RPMs with an extra patch to add
> pg_autovacuum.

If you're not planning to go to 7.4 soon, you might want to think about
an update to 7.3.5, just on general principles.

            regards, tom lane