Thread: does this look more like a possible bug or more like a possible hardware problem...? (long)
From: Frank van Vugt
Hi,

I'm using a 9500-line script to create an initial database. After applying a number of changes, I was repeatedly executing this script while hunting for typos. The script itself takes care of dropping the various objects, and every now and then (when delta t << autovacuum t) a manual vacuum full was performed to clean the system tables.

At one point, I arrived at the following situation:

db=# select version();
                                version
------------------------------------------------------------------------
 PostgreSQL 7.4.1 on i586-pc-linux-gnu, compiled by GCC gcc (GCC) 3.2.3

Message in the psql terminal:

psql:/home/data/megadump.sql:5169: NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "remainder_type_pkey" for table "remainder_type"
psql:/home/data/megadump.sql:5169: NOTICE:  CREATE TABLE / UNIQUE will create implicit index "remainder_type_abbreviation_key" for table "remainder_type"
psql:/home/data/megadump.sql:5169: WARNING:  specified item offset is too large
psql:/home/data/megadump.sql:5169: PANIC:  failed to add item to the page for "pg_attribute_relid_attnum_index"
server closed the connection unexpectedly
        This probably means the server terminated abnormally before or while processing the request.
psql:/home/data/megadump.sql:5169: connection to server was lost
vugtf@katrien:~>

(so psql did indeed quit)

The specific part of the script contains nothing strange:

CREATE TABLE "remainder_type" (
    "id"           integer primary key check ("id" > 0),
    "abbreviation" varchar(20) not null unique,
    "descr"        varchar(30) not null
);
GRANT SELECT on remainder_type to PUBLIC;
GRANT ALL on remainder_type to "postgres";

The server logs mention this:

2004-03-09 23:11:34 [6516] LOG:  statement: CREATE TABLE "remainder_type" ( "id" integer primary key check ("id" > 0), "abbreviation" varchar(20) not null unique, "descr" varchar(30) not null );
2004-03-09 23:11:34 [6516] NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "remainder_type_pkey" for table "remainder_type"
2004-03-09 23:11:34 [6516] NOTICE:  CREATE TABLE / UNIQUE will create implicit index "remainder_type_abbreviation_key" for table "remainder_type"
2004-03-09 23:11:34 [6516] WARNING:  specified item offset is too large
2004-03-09 23:11:34 [6516] PANIC:  failed to add item to the page for "pg_attribute_relid_attnum_index"
STATEMENT:  CREATE TABLE "remainder_type" ( "id" integer primary key check ("id" > 0), "abbreviation" varchar(20) not null unique, "descr" varchar(30) not null );
2004-03-09 23:11:34 [833] LOG:  server process (PID 6516) was terminated by signal 6
2004-03-09 23:11:34 [833] LOG:  terminating any other active server processes
2004-03-09 23:11:34 [8074] WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2004-03-09 23:11:34 [833] LOG:  all server processes terminated; reinitializing
2004-03-09 23:11:34 [8075] LOG:  database system was interrupted at 2004-03-09 23:11:31 CET
2004-03-09 23:11:35 [8075] LOG:  checkpoint record is at 1/9FAAB5C
2004-03-09 23:11:35 [8075] LOG:  redo record is at 1/9E803DC; undo record is at 0/0; shutdown FALSE
2004-03-09 23:11:35 [8075] LOG:  next transaction ID: 354187; next OID: 8355508
2004-03-09 23:11:35 [8075] LOG:  database system was not properly shut down; automatic recovery in progress
2004-03-09 23:11:35 [8075] LOG:  redo starts at 1/9E803DC
2004-03-09 23:11:46 [8075] LOG:  record with zero length at 1/B27335C
2004-03-09 23:11:46 [8075] LOG:  redo done at 1/B273338
2004-03-09 23:11:55 [8075] LOG:  database system is ready

After this, a vacuum full verbose analyse seems to end prematurely with:

INFO:  vacuuming "public.stock_delivery"
INFO:  "stock_delivery": found 0 removable, 0 nonremovable row versions in 0 pages
DETAIL:  0 dead row versions cannot be removed yet.
Nonremovable row versions range from 0 to 0 bytes long.
There were 0 unused item pointers.
Total free space (including removable row versions) is 0 bytes.
0 pages are or will become empty, including 0 at the end of the table.
0 pages containing 0 free bytes are potential move destinations.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index "stock_delivery_pkey" now contains 0 row versions in 1 pages
DETAIL:  0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.02 sec.
INFO:  index "stock_delivery_delivery_id_key" now contains 0 row versions in 1 pages
DETAIL:  0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index "stock_delivery_idx1" now contains 0 row versions in 1 pages
DETAIL:  0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  analyzing "public.stock_delivery"
INFO:  "stock_delivery": 0 pages, 0 rows sampled, 0 estimated total rows
ERROR:  catalog is missing 6 attribute(s) for relid 8349771

Looking at the pg_autovacuum log (see timestamps), it seems this could have clashed with the script:

[2004-03-09 11:11:08 PM] Performing: VACUUM ANALYZE "pg_catalog"."pg_depend"
[2004-03-09 11:11:29 PM] table name: megafox_trial."pg_catalog"."pg_depend"
[2004-03-09 11:11:29 PM] relid: 16599; relisshared: 0
[2004-03-09 11:11:29 PM] reltuples: 6124; relpages: 153
[2004-03-09 11:11:29 PM] curr_analyze_count: 668446; cur_delete_count: 332899
[2004-03-09 11:11:29 PM] ins_at_last_analyze: 668446; del_at_last_vacuum: 332899
[2004-03-09 11:11:29 PM] insert_threshold: 6624; delete_threshold 13248
[2004-03-09 11:11:29 PM] Performing: VACUUM ANALYZE "pg_catalog"."pg_attribute"
[2004-03-09 11:11:35 PM] Can not refresh statistics information from the database megafox_trial.
[2004-03-09 11:11:35 PM] The error is [server closed the connection unexpectedly
        This probably means the server terminated abnormally before or while processing the request.
]
[2004-03-09 11:11:35 PM] Fatal error occured while sending query (select oid,reltuples,relpages from pg_class where oid=1249) to database megafox_trial
[2004-03-09 11:11:35 PM] The error is []
[2004-03-09 11:11:35 PM] table name: megafox_trial."pg_catalog"."pg_attribute"
[2004-03-09 11:11:35 PM] relid: 1249; relisshared: 0
[2004-03-09 11:11:35 PM] reltuples: 3595; relpages: 214
[2004-03-09 11:11:35 PM] curr_analyze_count: 721978; cur_delete_count: 361920
[2004-03-09 11:11:35 PM] ins_at_last_analyze: 707819; del_at_last_vacuum: 346312
[2004-03-09 11:11:35 PM] insert_threshold: 4095; delete_threshold 11400
[2004-03-09 11:11:35 PM] Performing: VACUUM ANALYZE "pg_catalog"."pg_class"
[2004-03-09 11:11:35 PM] Fatal error occured while sending query (VACUUM ANALYZE "pg_catalog"."pg_class") to database megafox_trial
[2004-03-09 11:11:35 PM] The error is []
[2004-03-09 11:11:35 PM] Fatal error occured while sending query (select oid,reltuples,relpages from pg_class where oid=1259) to database megafox_trial
[2004-03-09 11:11:35 PM] The error is []
[2004-03-09 11:11:35 PM] table name: megafox_trial."pg_catalog"."pg_class"
[2004-03-09 11:11:35 PM] relid: 1259; relisshared: 0
[2004-03-09 11:11:35 PM] reltuples: 885; relpages: 117
[2004-03-09 11:11:35 PM] curr_analyze_count: 439206; cur_delete_count: 373969
[2004-03-09 11:11:35 PM] ins_at_last_analyze: 430423; del_at_last_vacuum: 366316
[2004-03-09 11:11:35 PM] insert_threshold: 1385; delete_threshold 2770
[2004-03-09 11:11:35 PM] Performing: VACUUM ANALYZE "pg_catalog"."pg_attrdef"
[2004-03-09 11:11:35 PM] Fatal error occured while sending query (VACUUM ANALYZE "pg_catalog"."pg_attrdef") to database megafox_trial
[2004-03-09 11:11:35 PM] The error is []
[2004-03-09 11:11:35 PM] Fatal error occured while sending query (select oid,reltuples,relpages from pg_class where oid=16384) to database megafox_trial
[2004-03-09 11:11:35 PM] The error is []

There's nothing critical here and I expect a drop/create database to solve it, but I'll wait for a reply just in case.

--
Best,

Frank.
Re: does this look more like a possible bug or more like a possible hardware problem...? (long)
From: Tom Lane
Frank van Vugt <ftm.van.vugt@foxi.nl> writes:
> At one point, I arrived at the following situation:
> psql:/home/data/megadump.sql:5169: WARNING:  specified item offset is too large
> psql:/home/data/megadump.sql:5169: PANIC:  failed to add item to the page for
> "pg_attribute_relid_attnum_index"

Looks like a bug to me.  Can you create a reproducible test case?

            regards, tom lane
Re: does this look more like a possible bug or more like a possible hardware problem...? (long)
From: Frank van Vugt
Hi Tom,

> > At one point, I arrived at the following situation:
> > psql:/home/data/megadump.sql:5169: WARNING:  specified item offset is too large
> > psql:/home/data/megadump.sql:5169: PANIC:  failed to add item to the page
> > for "pg_attribute_relid_attnum_index"
>
> Looks like a bug to me.  Can you create a reproducible test case?

Nope..... Trying the same script on a newly created database doesn't show the problem.

However, I do still have the database that came up with this message, which is now failing a vacuum full verbose analyse with:

(psql)
INFO:  vacuuming "public.user_pref"
INFO:  "user_pref": found 0 removable, 0 nonremovable row versions in 0 pages
DETAIL:  0 dead row versions cannot be removed yet.
Nonremovable row versions range from 0 to 0 bytes long.
There were 0 unused item pointers.
Total free space (including removable row versions) is 0 bytes.
0 pages are or will become empty, including 0 at the end of the table.
0 pages containing 0 free bytes are potential move destinations.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index "user_pref_pkey" now contains 0 row versions in 1 pages
DETAIL:  0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.03 sec.
INFO:  index "user_pref_idx1" now contains 0 row versions in 1 pages
DETAIL:  0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.01 sec.
INFO:  analyzing "public.user_pref"
INFO:  "user_pref": 0 pages, 0 rows sampled, 0 estimated total rows
ERROR:  catalog is missing 6 attribute(s) for relid 8349771

(log)
2004-03-10 10:10:00 [8207] LOG:  statement: vacuum full verbose analyse;
2004-03-10 10:11:43 [8207] ERROR:  catalog is missing 6 attribute(s) for relid 8349771

The server itself has no particular messages in its log that hint at what went on, except for the earlier-mentioned 'possible clash' with a running pg_autovacuum.

By the way, I have stopped pg_autovacuum by now to avoid any more changes to this particular database.

--
Best,

Frank.
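The "catalog is missing 6 attribute(s)" error means a catalog scan returned fewer pg_attribute rows for the relation than pg_class promises. As a generic diagnostic (not something run in this thread; 8349771 is the relid reported above, substitute your own), the two sides of that mismatch can be compared directly:

```sql
-- How many attribute rows does the catalog actually hold for the relation?
SELECT attnum, attname
FROM pg_attribute
WHERE attrelid = 8349771
ORDER BY attnum;

-- ...and how many user columns does pg_class claim the relation has?
SELECT relname, relnatts
FROM pg_class
WHERE oid = 8349771;
```

If the first query returns fewer rows than relnatts, and especially if the row count changes with SET enable_indexscan = off, the index used to fetch the attribute rows is the prime suspect.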
Re: does this look more like a possible bug or more like a possible hardware problem...? (long)
From: Frank van Vugt
Hi,

> > ERROR: catalog is missing 6 attribute(s) for relid 8349771
>
> This is consistent with the idea that pg_attribute_relid_attnum_index is
> corrupted. I would suggest saving a copy of that file for postmortem
> analysis and then trying to REINDEX pg_attribute.

Uhm...... looking for a file named 8349771 located in the directory base/<nr>, where nr corresponds to the proper db according to oid2name.....

We seem to have a small problem: there is no such file.....

I'll hold off on reindexing in case you first want to look elsewhere.

> If REINDEX makes the problem go away, would you send me the corrupted
> index file off-list?

I happily would, but you already know its content ;-)

Here's a log of what I did:

base/7847604# ls 82*
8240054  8240056  8240058  8240059  8240600  8240602  8261384  8261386  8261388  8261398

base/7847604# ls 83*
8351687  8352194  8352623  8352754  8353080  8353478  8353581  8353741
<cut>

The last few existing entries in the 82* range seem to be regular user tables and corresponding indexes; likewise for the first few in the 83* range. But when I try the particular oid:

# /usr/local/pgsql/bin/oid2name -d megafox_trial -U ... -P ... -o 8349771
Tablename of oid 8349771 from database "megafox_trial":
---------------------------------
8349771 = article_property_tree

and its reverse:

# /usr/local/pgsql/bin/oid2name -d megafox_trial -U ... -P ... -t article_property_tree
Oid of table article_property_tree from database "megafox_trial":
---------------------------------
VERY scary: more than one table with that name found!!

--
Best,

Frank.
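Since oid2name of that era was not schema-aware, a direct catalog query (a generic sketch, not something from the thread) shows exactly which pg_class rows carry the name, along with their schema and on-disk relfilenode:

```sql
-- List every table named article_property_tree with its schema, OID and
-- on-disk file; two rows in the SAME schema would indicate catalog corruption.
SELECT n.nspname AS schema, c.oid, c.relfilenode
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relname = 'article_property_tree';
```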
Re: does this look more like a possible bug or more like a possible hardware problem...? (long)
From: Frank van Vugt
> I'm sorry, I meant to say save a copy of
> pg_attribute_relid_attnum_index. The correct filename for it can be
> found via
>     select relfilenode from pg_class where relname =
>     'pg_attribute_relid_attnum_index';

Got it; I made a backup of the entire database as well. Since the db wasn't filled yet, both files are fairly small when bzipped (<0.5MB and <5MB), and both can be made available if you want.

Issuing 'reindex table pg_attribute' did not solve the problem, though....

> > Oid of table article_property_tree from database "megafox_trial":
> > ---------------------------------
> > VERY scary: more than one table with that name found!!
>
> Do you have more than one such table (presumably in different schemas)?
> If so this isn't scary at all. I don't believe oid2name has been fixed
> to be schema-aware :-(

Nope, all user tables are in public at the moment. Also:

megafox_trial=# select * from article_property_tree;
ERROR:  catalog is missing 6 attribute(s) for relid 8349771

--
Best,

Frank.
Re: does this look more like a possible bug or more like a possible hardware problem...? (long)
From: Tom Lane
Frank van Vugt <ftm.van.vugt@foxi.nl> writes:
> At one point, I arrived at the following situation:
> psql:/home/data/megadump.sql:5169: WARNING:  specified item offset is too large
> psql:/home/data/megadump.sql:5169: PANIC:  failed to add item to the page
> for "pg_attribute_relid_attnum_index"

> Trying the same script on a newly created database doesn't show the problem.
> However, I do still have the database that came up with this message, which
> is now failing a vacuum full verbose analyse with:
> ERROR:  catalog is missing 6 attribute(s) for relid 8349771

This is consistent with the idea that pg_attribute_relid_attnum_index is corrupted.  I would suggest saving a copy of that file for postmortem analysis and then trying to REINDEX pg_attribute.  (Depending on which PG version you are running, that may require running a standalone backend.  See the REINDEX man page.)

If REINDEX makes the problem go away, would you send me the corrupted index file off-list?

            regards, tom lane
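The save-then-reindex procedure described above could be sketched roughly as follows (the $PGDATA path placeholders are illustrative; the actual database OID and relfilenode come from the catalog queries):

```sql
-- 1. Find the on-disk filename of the suspect index:
SELECT relfilenode
FROM pg_class
WHERE relname = 'pg_attribute_relid_attnum_index';

-- 2. With the file identified (and ideally the server stopped), copy
--    $PGDATA/base/<database_oid>/<relfilenode> somewhere safe.

-- 3. Then rebuild the indexes on the system table:
REINDEX TABLE pg_attribute;
```

As Tom notes, on some PostgreSQL versions of that era reindexing a system catalog had to be done from a standalone backend rather than a normal session; the REINDEX man page for the version in use is authoritative.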
Re: does this look more like a possible bug or more like a possible hardware problem...? (long)
From: Tom Lane
Frank van Vugt <ftm.van.vugt@foxi.nl> writes:
>> This is consistent with the idea that pg_attribute_relid_attnum_index is
>> corrupted. I would suggest saving a copy of that file for postmortem
>> analysis and then trying to REINDEX pg_attribute.

> Uhm...... looking for a file named 8349771 located in the directory base/<nr>,
> where nr corresponds to the proper db, using oid2name.....
> We seem to have a small problem : there is no such file.....

I'm sorry, I meant to say save a copy of pg_attribute_relid_attnum_index.  The correct filename for it can be found via

    select relfilenode from pg_class
    where relname = 'pg_attribute_relid_attnum_index';

> Oid of table article_property_tree from database "megafox_trial":
> ---------------------------------
> VERY scary: more than one table with that name found!!

Do you have more than one such table (presumably in different schemas)?  If so this isn't scary at all.  I don't believe oid2name has been fixed to be schema-aware :-(

            regards, tom lane
Re: does this look more like a possible bug or more like a possible hardware problem...? (long)
From: Frank van Vugt
Some additional info:

# select * from pg_class where relname = 'article_property_tree';
-[ RECORD 1 ]--+-------------------------------------------------------------------
relname        | article_property_tree
relnamespace   | 2200
reltype        | 8349772
relowner       | 1000
relam          | 0
relfilenode    | 8349771
relpages       | 4
reltuples      | 299
reltoastrelid  | 8349775
reltoastidxid  | 0
relhasindex    | t
relisshared    | f
relkind        | r
relnatts       | 6
relchecks      | 1
reltriggers    | 3
relukeys       | 0
relfkeys       | 0
relrefs        | 0
relhasoids     | t
relhaspkey     | t
relhasrules    | f
relhassubclass | f
relacl         | {vugtf=a*r*w*d*R*x*t*/vugtf,=arwdRxt/vugtf,postgres=arwdRxt/vugtf}
-[ RECORD 2 ]--+-------------------------------------------------------------------
relname        | article_property_tree
relnamespace   | 2200
reltype        | 8354495
relowner       | 1000
relam          | 0
relfilenode    | 8354494
relpages       | 4
reltuples      | 299
reltoastrelid  | 8354498
reltoastidxid  | 0
relhasindex    | t
relisshared    | f
relkind        | r
relnatts       | 6
relchecks      | 1
reltriggers    | 5
relukeys       | 0
relfkeys       | 0
relrefs        | 0
relhasoids     | t
relhaspkey     | t
relhasrules    | f
relhassubclass | f
relacl         | {vugtf=a*r*w*d*R*x*t*/vugtf,=arwdRxt/vugtf,postgres=arwdRxt/vugtf}

--
Best,

Frank.
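Both records above share relnamespace 2200 (the public schema), which should be impossible for ordinary tables: the catalog's uniqueness constraints forbid two tables with the same name in the same schema. A generic duplicate check over pg_class, sketched here rather than taken from the thread, would flag any other affected names:

```sql
-- Duplicate (schema, name) pairs among ordinary tables; a healthy
-- catalog returns zero rows.
SELECT relnamespace, relname, count(*)
FROM pg_class
WHERE relkind = 'r'
GROUP BY relnamespace, relname
HAVING count(*) > 1;
```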
Tom Lane wrote:
> Frank van Vugt <ftm.van.vugt@foxi.nl> writes:
>> At one point, I arrived at the following situation:
>
>> psql:/home/data/megadump.sql:5169: WARNING:  specified item offset is too large
>> psql:/home/data/megadump.sql:5169: PANIC:  failed to add item to the page for
>> "pg_attribute_relid_attnum_index"
>
> Looks like a bug to me.  Can you create a reproducible test case?

I have seen one occurrence of that during a Slony test run, on the log table. I think it can be reproduced (not reliably, though) with a high frequency of inserts, selects and deletes of growing keys, combined with a very high frequency of vacuums at the same time. Concurrency seems to be an issue here.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
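Jan's recipe (rapid inserts and deletes of growing keys under very frequent concurrent vacuums) might be approximated with two loops along these lines; this is an untested sketch, and testdb/log_table are made-up names:

```shell
# Terminal 1: churn rows with ever-growing keys.
i=0
while true; do
    i=$((i + 1))
    psql -d testdb -c "INSERT INTO log_table VALUES ($i, 'payload')"
    psql -d testdb -c "DELETE FROM log_table WHERE id <= $i - 100"
done

# Terminal 2: vacuum the same table as often as possible.
while true; do
    psql -d testdb -c "VACUUM log_table"
done
```

Repeatedly creating and dropping tables in a third session (as the original 9500-line script effectively did) would add the pg_attribute churn that the PANIC occurred under.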