Thread: Corrupt index
I am experiencing frequent index corruptions (almost daily). The tables have about a million records. I solve the problem by using the REINDEX TABLE command and rerunning my programs. Does anyone know why I might be experiencing these problems? I am using batch mode to insert a few thousand records at a time in order to improve performance - could the index corruption be caused by a bug in the batch mode? Also, is there a way to determine whether an index is corrupt (other than seeing inconsistencies in the data)? I am running PostgreSQL 7.3.1 on Red Hat 8.0. I access and modify the database through Java (using the JDBC driver). __________________________________________________ Do you Yahoo!? Yahoo! Tax Center - File online, calculators, forms, and more http://tax.yahoo.com
Amir Becher <abecher@yahoo.com> writes: > I am experiencing frequent index corruptions (almost > daily). It's very likely to be a hardware problem, but try upgrading to the current 7.3 release (7.3.2 I think) to make sure it's not an already-fixed bug. -Doug
Amir Becher <abecher@yahoo.com> writes: > I am experiencing frequent index corruptions (almost > daily). Please define your problem: what misbehavior are you actually seeing? Give us facts, not interpretations. regards, tom lane
Here is an example of the misbehavior that I am seeing: A simple table with 3 colums (and approximately a million records): column_a INTEGER PRIMARY KEY column_b DATE NOT NULL column_c DOUBLE NOT NULL When the index (primary key) gets corrupted, I get the following behavior when I do a simple SELECT (using the psql program): SELECT * FROM example_table WHERE (column_a = 12345); The result looks like: column_a | column_b | column_c -------------------------------- 67890 | 2001-01-01 | 100 The main point is that "column_a" is the wrong number. When I use the EXPLAIN command, it tells me that it's doing an index scan using the primary key index. Issuing a "REINDEX TABLE example_table" command fixes the problem. I usually detect the problem when an INSERT fails (batch insert fails due to duplicate key). I also delete a few thousand records before the inserts. All of the inserts and deletes are done throught the JDBC3 driver. Please let me know what additional information you would like me to provide. --- Tom Lane <tgl@sss.pgh.pa.us> wrote: > Amir Becher <abecher@yahoo.com> writes: > > I am experiencing frequent index corruptions > (almost > > daily). > > Please define your problem: what misbehavior are you > actually seeing? > Give us facts, not interpretations. > > regards, tom lane __________________________________________________ Do you Yahoo!? Yahoo! Tax Center - File online, calculators, forms, and more http://tax.yahoo.com
exactly what are your indexes on this, as well as any triggers? Amir Becher wrote: > Here is an example of the misbehavior that I am > seeing: > > A simple table with 3 colums (and approximately a > million records): > column_a INTEGER PRIMARY KEY > column_b DATE NOT NULL > column_c DOUBLE NOT NULL > > When the index (primary key) gets corrupted, I get the > following behavior when I do a simple SELECT (using > the psql program): > > SELECT * FROM example_table WHERE (column_a = 12345); > > The result looks like: > > column_a | column_b | column_c > -------------------------------- > 67890 | 2001-01-01 | 100 > > The main point is that "column_a" is the wrong number. > > When I use the EXPLAIN command, it tells me that it's > doing an index scan using the primary key index. > Issuing a "REINDEX TABLE example_table" command fixes > the problem. > > I usually detect the problem when an INSERT fails > (batch insert fails due to duplicate key). I also > delete a few thousand records before the inserts. All > of the inserts and deletes are done throught the JDBC3 > driver. > > Please let me know what additional information you > would like me to provide. > > --- Tom Lane <tgl@sss.pgh.pa.us> wrote: > >>Amir Becher <abecher@yahoo.com> writes: >> >>>I am experiencing frequent index corruptions >> >>(almost >> >>>daily). >> >>Please define your problem: what misbehavior are you >>actually seeing? >>Give us facts, not interpretations. >> >> regards, tom lane > > > > __________________________________________________ > Do you Yahoo!? > Yahoo! Tax Center - File online, calculators, forms, and more > http://tax.yahoo.com > > > ---------------------------(end of broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org >
Amir Becher <abecher@yahoo.com> writes: > Here is an example of the misbehavior that I am > seeing: > [snip] > The main point is that "column_a" is the wrong number. Hmm. This looks like you must have an index entry pointing at the wrong heap tuple. Which is not something we hear of regularly, and certainly not an error that I'd expect to see appearing in isolation. You didn't say in so many words, but should I take it that (a) this pattern of misbehavior (wrong row returned) occurs repeatedly for you, and (b) there are no other signs of problems (backend unexpectedly quits, strange error messages, that sort of thing)? I can't really think of any plausible mechanism for this to happen, except for the index and table files getting out of sync somehow. Are you doing anything that operates directly on the database files (for example, doing backup/restore with cp or tar or some such)? regards, tom lane
This pattern of misbehavior does occur repeatedly for me. I don't know if this may have something to do with it, but we do backup the data every night using VERITAS Backup Exec. We are not restoring anything, though (the data is backed up to tape). The VERITAS software runs on Windows, but there is an agent that runs on our Linux box where the PostgreSQL data is stored. I should also mention that the backup is running while the database is being modified (we modify the database 24/7). There is another unexpected behavior that I noticed for the first time this morning (so I am not sure if it's recurring, related or relevant). The database "blinked" in the sense that all database connections were lost - but new connections could be obtained immediately after the "blink". The error message that I got said something about possible "corrupted shared memory" and I guess the shutting down of the connections was a precautionary measure. I hope this additional information helps. --- Tom Lane <tgl@sss.pgh.pa.us> wrote: > Amir Becher <abecher@yahoo.com> writes: > > Here is an example of the misbehavior that I am > > seeing: > > [snip] > > The main point is that "column_a" is the wrong > number. > > Hmm. This looks like you must have an index entry > pointing at the wrong > heap tuple. Which is not something we hear of > regularly, and certainly > not an error that I'd expect to see appearing in > isolation. > > You didn't say in so many words, but should I take > it that (a) this > pattern of misbehavior (wrong row returned) occurs > repeatedly for you, > and (b) there are no other signs of problems > (backend unexpectedly > quits, strange error messages, that sort of thing)? > > I can't really think of any plausible mechanism for > this to happen, > except for the index and table files getting out of > sync somehow. > Are you doing anything that operates directly on the > database files > (for example, doing backup/restore with cp or tar or > some such)? > > regards, tom lane __________________________________________________ Do you Yahoo!? Yahoo! Tax Center - File online, calculators, forms, and more http://tax.yahoo.com
Amir Becher <abecher@yahoo.com> writes: > I don't know if this may have something to do with it, > but we do backup the data every night using VERITAS > Backup Exec. We are not restoring anything, though > (the data is backed up to tape). The VERITAS software > runs on Windows, but there is an agent that runs on > our Linux box where the PostgreSQL data is stored. I > should also mention that the backup is running while > the database is being modified (we modify the database > 24/7). You're wasting your time making such a backup --- if you ever have to use it, it'll be corrupt, because the individual files in the database won't be in sync. But that's not the immediate problem. > There is another unexpected behavior that I noticed > for the first time this morning (so I am not sure if > it's recurring, related or relevant). The database > "blinked" in the sense that all database connections > were lost - but new connections could be obtained > immediately after the "blink". The error message that > I got said something about possible "corrupted shared > memory" and I guess the shutting down of the > connections was a precautionary measure. That sounds like a backend crash, all right. Given that, I'm thinking that you have more extensive problems than just this one symptom. The odds are good that it's a hardware issue, because we haven't heard any reports of comparable misbehavior from anyone else. I'd recommend running some hardware diagnostics --- memtest86 and badblocks seem to be the most widely used, although they aren't always able to find problems. It would also be a good idea to start taking some *real* backups, using pg_dump or pg_dumpall. You will be lucky if you don't find any more serious corruption in the database, if I'm right that there's hardware flakiness involved. You may find yourself forced to initdb and restore from a backup, so you'd better have one. regards, tom lane
I have noticed this misbehavior on three tables that are somewhat different than the example I gave. So here are the actual tables with trigger and index information: TABLE #1 -------- CREATE TABLE example1_table ( id INTEGER NOT NULL, price FLOAT NOT NULL, date DATE NOT NULL, source TEXT); CREATE UNIQUE INDEX example1_index ON example1_table (id, date); TABLE #2 -------- CREATE TABLE example2_table ( entry_number INTEGER PRIMARY KEY, s_id INTEGER NOT NULL REFERENCES example4_table ON UPDATE CASCADE ON DELETE NO ACTION, a_number INTEGER NOT NULL REFERENCES example5_table ON UPDATE CASCADE ON DELETE NO ACTION, type TEXT NOT NULL CHECK (type IN ('D','C')), a_id INTEGER NOT NULL, price FLOAT NOT NULL, quantity FLOAT NOT NULL, currency CHAR(3) NOT NULL, date DATE NOT NULL, date_created DATE NOT NULL, is_derived BOOLEAN NOT NULL, description TEXT NOT NULL); CREATE INDEX example2_index1 ON example2_table (is_derived, date_created); CREATE INDEX example2_index2 ON example2_table (s_id, date); CREATE INDEX example2_index3 ON example2_table (s_id, a_id); CREATE INDEX example2_index4 ON example2_table (description); TABLE #3 -------- CREATE TABLE example3_table ( a_id INTEGER NOT NULL, s_id INTEGER NOT NULL REFERENCES example4_table ON UPDATE CASCADE ON DELETE NO ACTION, a_number INTEGER NOT NULL REFERENCES example5_table ON UPDATE CASCADE ON DELETE NO ACTION, date DATE NOT NULL, is_alone BOOLEAN NOT NULL, price FLOAT NOT NULL, rate FLOAT NOT NULL, currency CHAR(3) NOT NULL, d_quantity FLOAT NOT NULL, d_price FLOAT NOT NULL, d_rate FLOAT NOT NULL, c_quantity FLOAT NOT NULL, c_price FLOAT NOT NULL, c_rate FLOAT NOT NULL, PRIMARY KEY (a_id, s_id, a_number, date, is_alone)); CREATE INDEX example3_table_index ON example3_table (s_id, date, is_alone); --- Dennis Gearon <gearond@cvc.net> wrote: > exactly what are your indexes on this, as well as > any triggers? > > Amir Becher wrote: > > Here is an example of the misbehavior that I am > > seeing: > > > > A simple table with 3 colums (and approximately a > > million records): > > column_a INTEGER PRIMARY KEY > > column_b DATE NOT NULL > > column_c DOUBLE NOT NULL __________________________________________________ Do you Yahoo!? Yahoo! Tax Center - File online, calculators, forms, and more http://tax.yahoo.com
On Thu, 10 Apr 2003, Tom Lane wrote: > Amir Becher <abecher@yahoo.com> writes: > > I don't know if this may have something to do with it, > > but we do backup the data every night using VERITAS > > Backup Exec. We are not restoring anything, though > > (the data is backed up to tape). The VERITAS software > > runs on Windows, but there is an agent that runs on > > our Linux box where the PostgreSQL data is stored. I > > should also mention that the backup is running while > > the database is being modified (we modify the database > > 24/7). > > You're wasting your time making such a backup --- if you ever have to > use it, it'll be corrupt, because the individual files in the database > won't be in sync. But that's not the immediate problem. Is there any chance that the veritas backup agent on linux is somehow locking the files as root against read or some other such insanity? If so, I'd imagine there being problems of some kind. But I have no idea if the veritas backup agent can or does do something like that. Either way, I'd turn off file system backups for that directory, as they aren't gonna do any good anyway. I'm thinking with my rusty old windows NT sysadmin hat here, by the way.
The source of the problem was the VERITAS Backup Exec. I was able to replicate the index corruption in under a minute by running a backup and updating the database at the same time. I have never been able to replicate the problem before because I was testing during the day, when the backup was not running. As far as backups are concerned, we will no longer backup the data directory itself - that was clearly a dumb thing to do in the first place. We actually have been backing up the data using pg_dumpall as well (so there is still hope for us). Thanks for all the help - I greatly appreciate it. --- Tom Lane <tgl@sss.pgh.pa.us> wrote: > Amir Becher <abecher@yahoo.com> writes: > > I don't know if this may have something to do with > it, > > but we do backup the data every night using > VERITAS > > Backup Exec. We are not restoring anything, though > > (the data is backed up to tape). The VERITAS > software > > runs on Windows, but there is an agent that runs > on > > our Linux box where the PostgreSQL data is stored. > I > > should also mention that the backup is running > while > > the database is being modified (we modify the > database > > 24/7). > > You're wasting your time making such a backup --- if > you ever have to > use it, it'll be corrupt, because the individual > files in the database > won't be in sync. But that's not the immediate > problem. > > > There is another unexpected behavior that I > noticed > > for the first time this morning (so I am not sure > if > > it's recurring, related or relevant). The database > > "blinked" in the sense that all database > connections > > were lost - but new connections could be obtained > > immediately after the "blink". The error message > that > > I got said something about possible "corrupted > shared > > memory" and I guess the shutting down of the > > connections was a precautionary measure. > > That sounds like a backend crash, all right. Given > that, I'm thinking > that you have more extensive problems than just this > one symptom. The > odds are good that it's a hardware issue, because we > haven't heard any > reports of comparable misbehavior from anyone else. > > I'd recommend running some hardware diagnostics --- > memtest86 and > badblocks seem to be the most widely used, although > they aren't always > able to find problems. > > It would also be a good idea to start taking some > *real* backups, using > pg_dump or pg_dumpall. You will be lucky if you > don't find any more > serious corruption in the database, if I'm right > that there's hardware > flakiness involved. You may find yourself forced to > initdb and restore > from a backup, so you'd better have one. > > regards, tom lane __________________________________________________ Do you Yahoo!? Yahoo! Tax Center - File online, calculators, forms, and more http://tax.yahoo.com
Amir Becher <abecher@yahoo.com> writes: > The source of the problem was the VERITAS Backup Exec. > I was able to replicate the index corruption in under > a minute by running a backup and updating the database > at the same time. Fascinating. It would be interesting to know exactly what Veritas is doing that causes the problem. Do any error messages show up in the postmaster log while you are doing the above? I'm still quite concerned about the health of your database, btw. I can't see any reason to think that whatever interference is going on would corrupt only indexes. There may be problems accumulating in the table files as well :-( regards, tom lane
Unfortuntely, I haven't enabled the postmaster logs, so I can't shed any more light on this problem. Thanks for the advice on the health of my database. I plan to do a pg_dumpall and rebuild it from scratch this weekend (since it is in production). Hopefully nothing else is damaged, and at least going forward the database will be healthy. --- Tom Lane <tgl@sss.pgh.pa.us> wrote: > Amir Becher <abecher@yahoo.com> writes: > > The source of the problem was the VERITAS Backup > Exec. > > I was able to replicate the index corruption in > under > > a minute by running a backup and updating the > database > > at the same time. > > Fascinating. It would be interesting to know > exactly what Veritas is > doing that causes the problem. Do any error > messages show up in the > postmaster log while you are doing the above? > > I'm still quite concerned about the health of your > database, btw. > I can't see any reason to think that whatever > interference is going > on would corrupt only indexes. There may be > problems accumulating > in the table files as well :-( > > regards, tom lane __________________________________________________ Do you Yahoo!? Yahoo! Tax Center - File online, calculators, forms, and more http://tax.yahoo.com