Thread: 7.0.3 database corruption
I know you guys want to focus on 7.1 or 7.2, and yes I am trying to move to 7.1, but it won't happen overnight. We have a serious problem with 7.0.3 and data corruption. We have a program that compares records in two tables. It creates a set of SQL scripts that either update, insert, or delelte records based on the comparison of the two tables. These scripts are then run against multiple database servers, which are slaves. Some of these scripts get pretty big, and take a while to run (10s of thousands of records are affected). After we run the scripts the database seems fine. Then we run two SQL scripts which create some summary tables. After we run the scripts, it looks like the database is corrupt. Oddly enough, if we run vacuum prior to running these scripts, the database does not seem to get corrupted. All I really need to know is if anyone has seen anything in the code which would explain this, and if so, do you know if is fixed in 7.1.x? The scripts look like: <<<<<<<<<<<<< stattbl.sql >>> drop table musicstat ; -- This crap is to find oldest new release create temp table lastnr ( name varchar, value date ); insert into lastnr select 'LASTNRUPDATE' as "name", min( to_date(released,'MM/DD/YYYY')) as "value" from new_ztitles where released != 'n/a' ; select name, released as "value" into musicstat from new_ztitles, lastnr where to_date(released, 'MM/DD/YYYY') = lastnr.value limit 1; -- Count number of songs insert into musicstat select 'tracks' as "name", count(*) as "value" from zsong; -- Count number of albums insert into musicstat select 'albums' as "name", count(*) as "value" from ztitles; -- Count number of artists insert into musicstat select 'artists' as "name", count(*) as "value" from dartists; insert into musicstat select 'dotclick_albums' as "name", count(*) as "value" from dotclicktitle ; insert into musicstat select 'dotclick_songs' as "name", count(*) as "value" from dotclicksong ; insert into musicstat select 'deleted songs' as "name", count(*) as "value" from old_zsong ; insert into musicstat select 'deleted titles' as "name", count(*) as "value" from old_ztitles ; insert into musicstat select 'new titles' as "name", count(*) as "value" from ztitles where acd = 'A'; insert into musicstat select 'new songs' as "name", count(*) as "value" from zsong, ztitles where ztitles.acd = 'A' and zsong.muzenbr = ztitles.muzenbr; insert into musicstat select 'updated albums' as "name", count( distinct( muzenbr)) as "value" from zsong where acd = 'U' ; insert into musicstat select 'new artists' as "name", count(*) as "value" from tartists, dartists where tartists.artist = dartists.artist ; insert into musicstat (name,value) values('CONCERTGO' , '1' ); <<<<<<<<< end >>>>>>>>>>>>>>>>> <<<<<<<< newrel >>>>>>>>>>>>>> drop table new_ztitles ; drop table new_releases ; create table new_ztitles as select * from ztitles T where T.origrel > (date_part('year','now'::abstime)-1) and to_date(T.released, 'MM/DD/YYYY') > ('now'::abstime - '30 days'::reltime) and T.datasrc = 1; create index newztitles_muzenbr on new_ztitles (muzenbr); create table new_releases as select T.origrel, T.released, T.title, T.muzenbr, T.artistid, T.performer2 as artist, S.song, S.trackid from new_ztitles T, zsong S where S.muzenbr = T.muzenbr and S.datasrc = 1 order by performer2; -- new releases indexes create index new_rel_artist_lower on new_releases (varchar_lower(artist) ); create index new_rel_artist_meta on new_releases (metatext(artist) ); create index new_rel_artist_strip on new_releases (strip(artist) ); create index new_rel_title_lower on new_releases (varchar_lower(title) ); create index new_rel_title_meta on new_releases (metatext(title) ); create index new_rel_title_strip on new_releases (strip(title) ); create index new_rel_song_lower on new_releases (varchar_lower(song) ); create index new_rel_song_meta on new_releases (metatext(song) ); create index new_rel_song_strip on new_releases (strip(song) ); -- new_ztitles indexes create index newztitles_artistid on new_ztitles (artistid); create index newztitles_cat3 on new_ztitles(cat3); <<<<<<<<end>>>>>>>>>>>>>
mlw wrote: > > I know you guys want to focus on 7.1 or 7.2, and yes I am trying to move to > 7.1, but it won't happen overnight. > > We have a serious problem with 7.0.3 and data corruption. We have a program > that compares records in two tables. It creates a set of SQL scripts that > either update, insert, or delelte records based on the comparison of the two > tables. These scripts are then run against multiple database servers, which are > slaves. > > Some of these scripts get pretty big, and take a while to run (10s of thousands > of records are affected). After we run the scripts the database seems fine. > Then we run two SQL scripts which create some summary tables. After we run the > scripts, it looks like the database is corrupt. > > Oddly enough, if we run vacuum prior to running these scripts, the database > does not seem to get corrupted. > > All I really need to know is if anyone has seen anything in the code which > would explain this, and if so, do you know if is fixed in 7.1.x? There certainly are bugs in 7.0.3 - I can describe at least two: 1. an index on varchar(8) (an user name) gets corrupted so that some names are no longer found when searching by index - they are still there when doing an unqualified select and come back after reindex for the qualified one. 1a. "FATAL: bits falling of the end of world" or something like it in logs and then broken db connection after that 2. Some kind of stuck locks - a single backend stuck in "INSERT waiting" or "DELETE waiting" state. This happens sporadically and requires a db system restart to go away ------------- Hannu
> mlw wrote: >> After we run the >> scripts, it looks like the database is corrupt. It's impossible to say anything useful with such an undescriptive description of the problem. Hannu Krosing <hannu@tm.ee> writes: > There certainly are bugs in 7.0.3 - I can describe at least two: I would really like to see a reproducible example of index corruption in 7.0.*. We've heard such reports often enough to know the problem is real, but without a test case in hand it's difficult to do much about it. > 2. Some kind of stuck locks - a single backend stuck in "INSERT waiting" 7.0.*'s deadlock detection algorithm is known to have some holes, but deadlock couldn't be the explanation for just a single stuck backend. Again, any chance of looking at an example? regards, tom lane
Tom Lane wrote: > > > mlw wrote: > >> After we run the > >> scripts, it looks like the database is corrupt. > > It's impossible to say anything useful with such an undescriptive > description of the problem. > > Hannu Krosing <hannu@tm.ee> writes: > > There certainly are bugs in 7.0.3 - I can describe at least two: > > I would really like to see a reproducible example of index corruption > in 7.0.*. We've heard such reports often enough to know the problem > is real, but without a test case in hand it's difficult to do much about > it. I know ;( Unfortunately this has happened only a few times on some quite busy servers receiving a workload of quite varied queries. > > 2. Some kind of stuck locks - a single backend stuck in "INSERT waiting" > > 7.0.*'s deadlock detection algorithm is known to have some holes, but > deadlock couldn't be the explanation for just a single stuck backend. that's what "ps ax| grep post" output looks like in my logs Sun Jun 10 06:31:00 EET 2001 828 ? S 0:02 /usr/bin/postmaster -i -o -F 26652 ? S 5:20 /usr/bin/postgres localhost gamer casino idle 30082 ? S 0:20 /usr/bin/postgres 127.0.0.1 nobody casino idle 30084 ? S 1:26 /usr/bin/postgres 127.0.0.1 nobody casino idle 31565 ? S 0:43 /usr/bin/postgres 127.0.0.1 nobody casino idle 31595 ? S 0:19 /usr/bin/postgres 127.0.0.1 nobody casino idle 31596 ? S 0:21 /usr/bin/postgres 127.0.0.1 nobody casino idle 31597 ? S 0:31 /usr/bin/postgres 127.0.0.1 nobody casino idle 31598 ? S 1:39 /usr/bin/postgres 127.0.0.1 nobody casino idle 31600 ? S 0:17 /usr/bin/postgres 127.0.0.1 nobody casino idle 31608 ? S 0:24 /usr/bin/postgres 127.0.0.1 nobody casino idle 31612 ? S 0:24 /usr/bin/postgres 127.0.0.1 nobody casino idle 32080 ? S 0:43 /usr/bin/postgres localhost gamer casino UPDATE waiti 32706 ? S 0:10 /usr/bin/postgres localhost gamer casino idle 302 ? S 0:00 /usr/bin/postgres 127.0.0.1nobody casino idle 361 ? S 0:00 sh -c date;ps ax|grep post 364 ? S 0:00 grep post CHECKING WAITING PIDS: ['32080'] Sun Jun 10 06:31:10 EET 2001 828 ? S 0:02 /usr/bin/postmaster -i -o -F 26652 ? S 5:20 /usr/bin/postgres localhost gamer casino idle 30082 ? S 0:20 /usr/bin/postgres 127.0.0.1 nobody casino idle 30084 ? S 1:26 /usr/bin/postgres 127.0.0.1 nobody casino idle 31565 ? S 0:43 /usr/bin/postgres 127.0.0.1 nobody casino idle 31595 ? S 0:19 /usr/bin/postgres 127.0.0.1 nobody casino idle 31596 ? S 0:21 /usr/bin/postgres 127.0.0.1 nobody casino idle 31597 ? S 0:31 /usr/bin/postgres 127.0.0.1 nobody casino idle 31598 ? S 1:39 /usr/bin/postgres 127.0.0.1 nobody casino idle 31600 ? S 0:17 /usr/bin/postgres 127.0.0.1 nobody casino idle 31608 ? S 0:24 /usr/bin/postgres 127.0.0.1 nobody casino idle 31612 ? S 0:24 /usr/bin/postgres 127.0.0.1 nobody casino idle 32080 ? S 0:43 /usr/bin/postgres localhost gamer casino UPDATE waiti 32706 ? S 0:10 /usr/bin/postgres localhost gamer casino idle 302 ? S 0:00 /usr/bin/postgres 127.0.0.1nobody casino idle 365 ? S 0:00 sh -c date;ps ax|grep post 368 ? S 0:00 grep post PROCESS 32080 STILL WAITING, RESTART TIME > Again, any chance of looking at an example? I could send you tails of postgres logfiles that are rotated on detecting the INSERT/UPDATE wait condition that does not go away in 10 sec. How long logfiles (time) would be enough ? There seems to be no general pattern that leads to it though ;( --------------- Hannu
Hannu Krosing <hannu@tm.ee> writes: > I could send you tails of postgres logfiles that are rotated on > detecting > the INSERT/UPDATE wait condition that does not go away in 10 sec. > How long logfiles (time) would be enough ? Do the logs show the queries being executed? The queries forming the current transaction of the stuck backend, and all the transactions that have occurred since that transaction started, would be useful to look at. regards, tom lane