Thread: ERROR: Cannot insert a duplicate key into unique index pg_class_relname_nsp_index
ERROR: Cannot insert a duplicate key into unique index pg_class_relname_nsp_index
From
Kragen Sitaker
Date:
ERROR: Cannot insert a duplicate key into unique index pg_class_relname_nsp_index

We've been getting this error in our application every once in a while
--- typically once an hour to once a day, although it varies over time.
The daemon that gets the error exits and restarts a few seconds later.
Usually it's fine then, but sometimes the error happens three or more
times in a row. Occasionally, instead, we get "ERROR: Cannot insert a
duplicate key into unique index pg_type_typname_nsp_index".

We started seeing this error on November 22, three days after we
migrated from Postgres 7.2.3 and 7.3.2 to 7.3.4. We still see the error
with 7.3.5, but not with 7.4.0. We're not sure we're quite ready to
migrate all of our customers to 7.4.0 yet, though.

The daemon that gets this error does the following every 15 seconds:
- start a transaction
- execute a complex and sometimes slow SELECT INTO query, creating a
  temporary table
- lock another table in exclusive mode
- lock the temporary table too, for no good reason; this is
  embarrassing but could conceivably relate to the bug
- delete the contents of the other table
- copy the contents of the temporary table into the other table
- drop the temporary table (again, embarrassing, sorry)
- commit

Reducing the interval from 15 seconds to 0.2 seconds makes the error
happen several times an hour. We had hoped it would make the error
happen several times a minute.

We're doing this temp-table dance to minimize the time the other table
is locked, and to keep anyone who's reading the other table from seeing
an incomplete list of records. It does all of this in four queries. The
second query, the one that merely creates and populates the temporary
table, is the one that gets the error --- understandably, since I
wouldn't expect the other queries to insert into pg_class or pg_type.

Creating and destroying a temporary table with a lot of attributes
every second is causing other problems, too; the indices on the
pg_attribute table grow to hundreds of megabytes in size, and for some
reason, reindexing our system tables corrupts our database. These huge
indices slow down access to the attributes of tables, and therefore our
whole application.

We originally ran the slow SELECT INTO query in an Apache handler,
because we thought it would be fast. Usually it took tens of
milliseconds. But every once in a while, on a loaded system, while
other processes were updating the tables it was selecting from, the
query would take 2, 5, 10, 20, 30, even 60 seconds. That's a long time
for a web page to load. We never did figure out why it was slow. Now,
the slow query happens in a daemon, and the Apache handler just queries
the table populated by the daemon, which reliably runs fast.

Do you have any suggestions for how to diagnose this unpredictable
performance?

Many thanks.

-Kragen
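[For concreteness, the daemon's cycle might look roughly like the
following SQL. All table and column names here are invented for
illustration; the thread doesn't show the real schema.]

    BEGIN;

    -- Build the fresh snapshot in a temp table.  This SELECT INTO is
    -- the statement that draws the duplicate-key error, since it is
    -- the only one that inserts rows into pg_class and pg_type.
    SELECT ap.id, ap.name, st.last_seen
      INTO TEMP TABLE fresh_records
      FROM access_points ap
      JOIN status st ON st.ap_id = ap.id;

    -- Lock the published table (and, needlessly, the temp table).
    LOCK TABLE published_records IN EXCLUSIVE MODE;
    LOCK TABLE fresh_records IN EXCLUSIVE MODE;

    -- Replace the published contents within the transaction.
    DELETE FROM published_records;
    INSERT INTO published_records SELECT * FROM fresh_records;

    -- Unnecessary: a temp table is dropped at session end anyway.
    DROP TABLE fresh_records;

    COMMIT;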
Re: ERROR: Cannot insert a duplicate key into unique index pg_class_relname_nsp_index
From
Tom Lane
Date:
Kragen Sitaker <kragen+pgsql@airwave.com> writes:
> ERROR: Cannot insert a duplicate key into unique index pg_class_relname_nsp_index
> We've been getting this error in our application every once in a while
> --- typically once an hour to once a day, although it varies over time.

It seems to me that this must indicate a collision on name+schema of
the temp table. Now that ought to be impossible :-( --- you should get
errors earlier than this if you were actually creating a duplicately
named temp table, and the design for selecting nonconflicting temp
schemas seems pretty bulletproof to me too.

> We started seeing this error on November 22, three days after we migrated
> from Postgres 7.2.3 and 7.3.2 to 7.3.4. We still see the error with
> 7.3.5, but not with 7.4.0.

Hmm. I'm not aware of any 7.4 bug fix that would affect such a thing,
so I wouldn't want to bet that 7.4 has really solved the issue.

Digging in the 7.3.2-to-7.3.4 change logs, I see one potentially
relevant change:

    2003-02-06 20:33  tgl

        * src/: backend/catalog/dependency.c,
        backend/catalog/namespace.c, include/catalog/dependency.h
        (REL7_3_STABLE): Revise mechanism for getting rid of temp
        tables at backend shutdown.  Instead of grovelling through
        pg_class to find them, make use of the handy dandy dependency
        mechanism: just delete everything that depends on our temp
        schema.  Unlike the pg_class scan, the dependency mechanism is
        smart enough to delete things in an order that doesn't fall
        foul of any dependency restrictions.  Fixes problem reported
        by David Heggie: a temp table with a serial column may cause a
        backend FATAL exit at shutdown time, if it chances to try to
        delete the temp sequence first.

Now this change also exists in 7.4, but perhaps it is malfunctioning in
7.3.*. Or maybe you just haven't stressed the 7.4 installation enough
to reproduce the problem there --- what do you think are the odds of
that?

Given that you're explicitly dropping the temp table before exit, it's
not clear how a problem in this code could cause you grief anyway. But
it's the only starting point I can see. You might try adding some
monitoring code to see if you can detect temp tables being left around
by exited backends.

> Creating and destroying a temporary table with a lot of attributes every
> second is causing other problems, too; the indices on the pg_attribute
> table grow to hundreds of megabytes in size,

Right. 7.4 should fix that though.

> and for some reason,
> reindexing our system tables corrupts our database.

That seems suspicious as well. What happens exactly? How did you get
out of it??

> Do you have any suggestions for
> how to diagnose this unpredictable performance?

I think you have more urgent things to worry about. Like finding why it
doesn't work reliably.

			regards, tom lane
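[For readers unfamiliar with the catalogs: the index named in the error
enforces uniqueness of (relname, relnamespace) in pg_class, which is
why a duplicate-key violation there implies a name+schema collision.
One way to confirm the index definition; output formatting varies by
version:]

    SELECT indexdef
      FROM pg_indexes
     WHERE indexname = 'pg_class_relname_nsp_index';
    -- yields something like:
    --   CREATE UNIQUE INDEX pg_class_relname_nsp_index
    --     ON pg_class USING btree (relname, relnamespace)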
Re: ERROR: Cannot insert a duplicate key into unique index pg_class_relname_nsp_index
From
Martijn van Oosterhout
Date:
On Fri, Jan 09, 2004 at 12:07:25PM -0800, Kragen Sitaker wrote:
<snip>

Not really related to your problem, but given you're in a transaction,
why do you need to lock anything? What's wrong with:

> The daemon that gets this error does the following every 15 seconds:
> - start a transaction
> - delete the contents of the other table
> - execute a complex and sometimes slow SELECT INTO query, creating a
>   temporary table
> - copy the contents of the temporary table into the other table
> - drop the temporary table (again, embarrassing, sorry)
> - commit

Maybe I'm missing something?
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> (... have gone from d-i being barely usable even by its developers
> anywhere, to being about 20% done. Sweet. And the last 80% usually takes
> 20% of the time, too, right?) -- Anthony Towns, debian-devel-announce
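[Martijn's revised sequence, sketched with the same invented names as
the earlier example. At READ COMMITTED, other sessions keep seeing the
old contents of published_records until COMMIT, so no explicit LOCK is
needed --- assuming this daemon is the table's only writer.]

    BEGIN;

    -- Readers in other sessions continue to see the old rows until
    -- this transaction commits, so they never observe a half-empty
    -- table even though the DELETE runs first.
    DELETE FROM published_records;

    SELECT ap.id, ap.name, st.last_seen
      INTO TEMP TABLE fresh_records
      FROM access_points ap
      JOIN status st ON st.ap_id = ap.id;

    INSERT INTO published_records SELECT * FROM fresh_records;
    DROP TABLE fresh_records;

    COMMIT;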
Re: ERROR: Cannot insert a duplicate key into unique index pg_class_relname_nsp_index
From
Kragen Sitaker
Date:
On Fri, Jan 09, 2004 at 06:19:00PM -0500, Tom Lane wrote:
> Kragen Sitaker <kragen+pgsql@airwave.com> writes:
> > ERROR: Cannot insert a duplicate key into unique index pg_class_relname_nsp_index
> > We've been getting this error in our application every once in a while
> > --- typically once an hour to once a day, although it varies over time.
>
> It seems to me that this must indicate a collision on name+schema of
> the temp table. Now that ought to be impossible :-(

Those were my first thoughts too :)

> --- you should get
> errors earlier than this if you were actually creating a duplicately
> named temp table, and the design for selecting nonconflicting temp
> schemas seems pretty bulletproof to me too.

Sure. We thought maybe we had two instances of the daemon running at
once, but we tried that, and couldn't make the error happen every time.

It's worth mentioning that the daemon will often run for hours before
dying with this error. Then, when it comes back up a few seconds later,
it's likely to fail again immediately, but it's even more likely to run
without a problem for hours more.

> > We started seeing this error on November 22, three days after we migrated
> > from Postgres 7.2.3 and 7.3.2 to 7.3.4. We still see the error with
> > 7.3.5, but not with 7.4.0.
>
> Hmm. I'm not aware of any 7.4 bug fix that would affect such a thing,
> so I wouldn't want to bet that 7.4 has really solved the issue.

I'm glad to know that.

> Digging in the 7.3.2-to-7.3.4 change logs, I see one potentially
> relevant change:
>
>     2003-02-06 20:33  tgl
>
>         * src/: backend/catalog/dependency.c,
>         backend/catalog/namespace.c, include/catalog/dependency.h
>         (REL7_3_STABLE): Revise mechanism for getting rid of temp
>         tables at backend shutdown.  Instead of grovelling through
>         pg_class to find them, make use of the handy dandy dependency
>         mechanism: just delete everything that depends on our temp
>         schema.  Unlike the pg_class scan, the dependency mechanism is
>         smart enough to delete things in an order that doesn't fall
>         foul of any dependency restrictions.  Fixes problem reported
>         by David Heggie: a temp table with a serial column may cause a
>         backend FATAL exit at shutdown time, if it chances to try to
>         delete the temp sequence first.
>
> Now this change also exists in 7.4, but perhaps it is malfunctioning in
> 7.3.*. Or maybe you just haven't stressed the 7.4 installation enough
> to reproduce the problem there --- what do you think are the odds of
> that?

It's possible. We've re-downgraded that development machine to 7.3.4 to
experiment with other ways of solving the problem, and it looks like
our nightly backup script didn't work last night, so I can't inspect
the state of the database that didn't manifest the problems with 7.4.
It's possible it might have had less stuff in it :(

We'll run the experiment again. Should we try 7.3.3 too?

> Given that you're explicitly dropping the temp table before exit, it's
> not clear how a problem in this code could cause you grief anyway.

Well, it's possible the daemon could have gotten killed while it was
inside the transaction, followed shortly by a shutdown of postgres ---
a dozen times or more --- and during development, we frequently kill
the daemon so that it will restart with new code. For our application,
we shut down and restart Postgres every night because it seems to make
VACUUM FULL work better.

> But it's the only starting point I can see. You might try adding some
> monitoring code to see if you can detect temp tables being left around
> by exited backends.

Something like this?

foobar=> select count(*), relnamespace from pg_class group by relnamespace;
 count | relnamespace
-------+--------------
   106 |           11
    70 |           99
   147 |         2200
(3 rows)

foobar=> select oid, * from pg_namespace;
   oid   |  nspname   | nspowner | nspacl
---------+------------+----------+--------
      11 | pg_catalog |        1 | {=U}
      99 | pg_toast   |        1 | {=}
    2200 | public     |        1 | {=UC}
   16765 | pg_temp_1  |        1 |
   17593 | pg_temp_2  |        1 |
   17647 | pg_temp_15 |        1 |
   20278 | pg_temp_16 |        1 |
 1570284 | pg_temp_32 |        1 |
(8 rows)

I wonder why those old namespaces are left around? A new one shows up
whenever I kill and restart the daemon that creates the temporary
tables. We could run this code periodically to see when new namespaces
pop up.

> > Creating and destroying a temporary table with a lot of attributes every
> > second is causing other problems, too; the indices on the pg_attribute
> > table grow to hundreds of megabytes in size,
>
> Right. 7.4 should fix that though.

Great!

> > and for some reason,
> > reindexing our system tables corrupts our database.
>
> That seems suspicious as well. What happens exactly? How did you get
> out of it??

I don't remember what happens exactly. One of us will try this again
this weekend or early next week to get more details. All my coworkers
can remember is that PostgreSQL complained about "something about the
heap". We never deployed that code to any customer sites, so we
recovered from it by wiping the data directory on the development
machines that we had that problem on.

> > Do you have any suggestions for how to diagnose this unpredictable
> > performance?
>
> I think you have more urgent things to worry about. Like finding why it
> doesn't work reliably.

Well, if we didn't have the unpredictable performance, we wouldn't have
been creating the temporary table in the first place, which is the only
thing that exposed this problem for us. PostgreSQL otherwise works
great, rock-solidly reliably; we've lost data to disk failure, flaky
hardware, filesystem corruption due to power failure on ext2fs, and
human error, in hundreds of thousands of hours of production operation,
but never yet to Postgres.

BTW, we're using the 7.3.4 PGDG RPMs with an extra patch to add
pg_autovacuum.

Thank you very much.

-Kragen
--
Very clever, young man. But if you use turtles for RPC, you'll have a
very, very slow RPC system. -- Jacqueline Arnold
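[A periodic monitor along the lines Tom suggests might look like the
sketch below --- not from the thread, just one way to flag relations
lingering in temp namespaces after their backend has exited:]

    -- List relations living in any pg_temp_* namespace.  Apart from
    -- the brief window while the daemon's temp table actually exists,
    -- this should return no rows; persistent entries are leftovers.
    SELECT n.nspname, c.relname
      FROM pg_class c
      JOIN pg_namespace n ON n.oid = c.relnamespace
     WHERE n.nspname ~ '^pg_temp_';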
Re: ERROR: Cannot insert a duplicate key into unique index pg_class_relname_nsp_index
From
Kragen Sitaker
Date:
On Fri, Jan 09, 2004 at 08:02:16PM -0500, Tom Lane wrote:
> Kragen Sitaker <kragen+pgsql@airwave.com> writes:
> > We'll run the experiment again. Should we try 7.3.3 too?
>
> No, I don't think 7.3.3 is likely to behave differently from 7.3.4
> as far as this goes. What would actually be interesting is whether
> you can make 7.4 fail.

We'll do our best.

> > Well, it's possible the daemon could have gotten killed while it was
> > inside the transaction, followed shortly by a shutdown of postgres ---
> > a dozen times or more --- and during development, we frequently kill
> > the daemon so that it will restart with new code.
>
> But you're seeing these errors in production, on a machine where you're
> not doing that, no? In any case there is code in place to clean out
> a temp schema of any pre-existing junk when a new backend starts to use
> it ... perhaps there's a bug in that, but that code was not changed
> since 7.3.2 ...

I'm not sure what kind of pre-existing junk could cause this problem,
anyway. Leftover ordinary rows in pg_class would cause the daemon to
fail as soon as it started, not after running for hours, and as you
pointed out, would normally give us a higher-level error message that
didn't mention relations from pg_catalog.

> Another question: are you fairly confident that if the same bug had been
> in 7.3.2, you would have found it? Were there any changes in your usage
> patterns around the time you adopted 7.3.4?

Actually, yes and yes; we revamped the transaction handling in the rest
of the system, using autocommit most of the time and wrapping various
chunks of the system in transactions. The particular query that's at
fault didn't change noticeably (we added another column to it, from a
table it was already selecting half a dozen or so columns from), and
it's running inside of a transaction all the time, same as before. The
only relevant difference in this daemon is that its transaction used to
start as soon as it had committed its previous changes, followed by a
15-second sleep; now it starts after the end of the 15-second sleep,
just before we start doing database actions. The other tables it's
selecting from didn't change much, but all the processes updating them
changed their transactional style.

Of course, my ideas about what could affect this problem are pretty
fuzzy. I'll retest against 7.3.2 to make sure it's not a change we
introduced into our own code around the same time.

Oh, by the way, we aren't aware of any production machines that don't
have this problem.

> > For our application, we shut down and restart Postgres every night
> > because it seems to make VACUUM FULL work better.
>
> [ itch... ] Let's not discuss the wisdom of that just now, but ...

We started doing it on Postgres 7.2, where we determined empirically
that not doing it made the database a little slower every day. We may
not need it anymore with pg_autovacuum and 7.3.

> If you're not planning to go to 7.4 soon, you might want to think about
> an update to 7.3.5, just on general principles.

We did test on 7.3.5, with the same database contents as on 7.3.4.
Didn't help. But of course updating to 7.3.5 is probably a good idea
anyway.

Thank you very much for your help.

-Kragen
Re: ERROR: Cannot insert a duplicate key into unique index pg_class_relname_nsp_index
From
Kragen Sitaker
Date:
On Sat, Jan 10, 2004 at 11:20:11AM +1100, Martijn van Oosterhout wrote:
> Not really related to your problem, but given you're in a transaction, why
> do you need to lock anything? What's wrong with:
>
> > The daemon that gets this error does the following every 15 seconds:
> > - start a transaction
> > - delete the contents of the other table
> > - execute a complex and sometimes slow SELECT INTO query, creating a
> >   temporary table
> > - copy the contents of the temporary table into the other table
> > - drop the temporary table (again, embarrassing, sorry)
> > - commit
>
> Maybe I'm missing something?

We don't need to lock anything. We just thought we did. We'd observed
that accessing a table inside a transaction (at the default READ
COMMITTED isolation level) could show us records created by other
transactions since our transaction started (i.e. it doesn't guarantee
repeatable reads), even if we'd already accessed the table. So, lacking
a thorough understanding of section 12.2 (or transaction isolation
levels in general), we thought we might have to lock the table to keep
someone else from accessing it while it was partly empty. We were
wrong, but I didn't know that until this afternoon.

Thank you very much for your help!

-Kragen
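[The isolation behavior described above is easy to see with two
concurrent psql sessions. The following is a transcript-style sketch,
not a single runnable script; comments mark which session issues each
statement, and the table is the invented one from the earlier
examples.]

    -- Session A:
    BEGIN;
    SELECT count(*) FROM published_records;  -- say this returns 147

    -- Session B (separate connection, autocommit):
    INSERT INTO published_records VALUES (9999, 'example', now());

    -- Session A, same open transaction:
    SELECT count(*) FROM published_records;  -- returns 148: at READ
                                             -- COMMITTED, each statement
                                             -- takes a fresh snapshot,
                                             -- so reads within a
                                             -- transaction are not
                                             -- repeatable
    COMMIT;

    -- The protection the daemon actually needs still holds, though:
    -- while one transaction has an uncommitted DELETE in progress,
    -- other sessions keep seeing the old rows until it commits.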
Re: ERROR: Cannot insert a duplicate key into unique index pg_class_relname_nsp_index
From
Bruce Momjian
Date:
Tom Lane wrote:
> Hmm. I'm not aware of any 7.4 bug fix that would affect such a thing,
> so I wouldn't want to bet that 7.4 has really solved the issue.
>
> Digging in the 7.3.2-to-7.3.4 change logs, I see one potentially
> relevant change:

The only thing I can think of is the fix for splitting the first btree
page. We fixed that in 7.4. I remember it happened mostly on SMP
machines.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Re: ERROR: Cannot insert a duplicate key into unique index pg_class_relname_nsp_index
From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Tom Lane wrote:
>> Digging in the 7.3.2-to-7.3.4 change logs, I see one potentially
>> relevant change:

> The only thing I can think of is the fix for splitting the first btree
> page.

I paused on that too, but I don't see how it could apply, unless they
were dropping and rebuilding their database every few hours. Besides,
that bug is fixed in 7.3.5, which is still showing the problem.

			regards, tom lane
Re: ERROR: Cannot insert a duplicate key into unique index pg_class_relname_nsp_index
From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Tom Lane wrote:
> >> Digging in the 7.3.2-to-7.3.4 change logs, I see one potentially
> >> relevant change:
>
> > The only thing I can think of is the fix for splitting the first btree
> > page.
>
> I paused on that too, but I don't see how it could apply, unless they
> were dropping and rebuilding their database every few hours. Besides,
> that bug is fixed in 7.3.5, which is still showing the problem.

I didn't know we had gotten that into 7.3.5; now I remember it wasn't
serious enough to force a new 7.3.X release, but it was in 7.3.X CVS.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Re: ERROR: Cannot insert a duplicate key into unique index pg_class_relname_nsp_index
From
Tom Lane
Date:
Kragen Sitaker <kragen+pgsql@airwave.com> writes:
> We'll run the experiment again. Should we try 7.3.3 too?

No, I don't think 7.3.3 is likely to behave differently from 7.3.4
as far as this goes. What would actually be interesting is whether
you can make 7.4 fail.

> Well, it's possible the daemon could have gotten killed while it was
> inside the transaction, followed shortly by a shutdown of postgres ---
> a dozen times or more --- and during development, we frequently kill
> the daemon so that it will restart with new code.

But you're seeing these errors in production, on a machine where you're
not doing that, no? In any case there is code in place to clean out
a temp schema of any pre-existing junk when a new backend starts to use
it ... perhaps there's a bug in that, but that code was not changed
since 7.3.2 ...

Another question: are you fairly confident that if the same bug had
been in 7.3.2, you would have found it? Were there any changes in your
usage patterns around the time you adopted 7.3.4?

> For our application, we shut down and restart Postgres every night
> because it seems to make VACUUM FULL work better.

[ itch... ] Let's not discuss the wisdom of that just now, but ...

> I wonder why those old namespaces are left around?

They're supposed to be; there's no point in deleting the pg_namespace
entry only to recreate it the next time someone needs it. The real
question is whether you see any tables belonging to those namespaces.
The count(*) query on pg_class looked like a fine way to watch that.

> BTW, we're using the 7.3.4 PGDG RPMs with an extra patch to add
> pg_autovacuum.

If you're not planning to go to 7.4 soon, you might want to think about
an update to 7.3.5, just on general principles.

			regards, tom lane
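[One concrete way to watch that --- a sketch, not taken from the
thread: count relations per temp namespace, so that empty (expected,
harmless) namespaces and genuine leftovers are distinguishable at a
glance.]

    -- Empty pg_temp_* namespaces are normal; a nonzero count for a
    -- namespace whose backend has exited would indicate leftovers.
    SELECT n.nspname, count(c.oid) AS relations
      FROM pg_namespace n
      LEFT JOIN pg_class c ON c.relnamespace = n.oid
     WHERE n.nspname ~ '^pg_temp_'
     GROUP BY n.nspname
     ORDER BY n.nspname;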