Thread: 7.0.3 database corruption

7.0.3 database corruption

From
mlw
Date:
I know you guys want to focus on 7.1 or 7.2, and yes I am trying to move to
7.1, but it won't happen overnight.

We have a serious problem with 7.0.3 and data corruption. We have a program
that compares records in two tables. It creates a set of SQL scripts that
either update, insert, or delelte records based on the comparison of the two
tables. These scripts are then run against multiple database servers, which are
slaves.

Some of these scripts get pretty big, and take a while to run (10s of thousands
of records are affected). After we run the scripts the database seems fine.
Then we run two SQL scripts which create some summary tables. After we run the
scripts, it looks like the database is corrupt.

Oddly enough, if we run vacuum prior to running these scripts, the database
does not seem to get corrupted. 

All I really need to know is if anyone has seen anything in the code which
would explain this, and if so, do you know if is fixed in 7.1.x?

The scripts look like:
<<<<<<<<<<<<< stattbl.sql >>>

drop table musicstat ;
-- This crap is to find oldest new release
create temp table lastnr ( name varchar, value date );
insert into lastnr select 'LASTNRUPDATE' as "name", min(
to_date(released,'MM/DD/YYYY')) as "value" from new_ztitles where released !=
'n/a' ;
select name, released  as "value" into musicstat from new_ztitles, lastnr where
to_date(released, 'MM/DD/YYYY') = lastnr.value limit 1;
-- Count number of songs
insert into musicstat select 'tracks' as "name", count(*) as "value" from
zsong;
-- Count number of albums
insert into musicstat select 'albums' as "name", count(*) as "value" from
ztitles;
-- Count number of artists
insert into musicstat select 'artists' as "name", count(*) as "value" from
dartists;
insert into musicstat select 'dotclick_albums' as "name", count(*) as "value"
from dotclicktitle ;
insert into musicstat select 'dotclick_songs' as "name", count(*) as "value"
from dotclicksong ;
insert into musicstat select 'deleted songs' as "name", count(*) as "value"
from old_zsong ;
insert into musicstat select 'deleted titles' as "name", count(*) as "value"
from old_ztitles ;
insert into musicstat select 'new titles' as "name", count(*) as "value" from
ztitles where acd = 'A';
insert into musicstat select 'new songs' as "name", count(*) as "value" from
zsong, ztitles       where ztitles.acd = 'A' and zsong.muzenbr = ztitles.muzenbr;
insert into musicstat select 'updated albums' as "name", count( distinct(
muzenbr)) as "value" from zsong       where acd = 'U' ;
insert into musicstat select 'new artists' as "name", count(*) as "value" from
tartists, dartists       where tartists.artist = dartists.artist ;
insert into musicstat (name,value) values('CONCERTGO' , '1' );

<<<<<<<<< end >>>>>>>>>>>>>>>>>
<<<<<<<< newrel >>>>>>>>>>>>>>
drop table new_ztitles ;
drop table new_releases ;
create table new_ztitles as select * from ztitles T
where T.origrel > (date_part('year','now'::abstime)-1)
and to_date(T.released, 'MM/DD/YYYY') > ('now'::abstime - '30 days'::reltime)
and T.datasrc = 1;
create index newztitles_muzenbr       on new_ztitles (muzenbr);
create table new_releases as select T.origrel, T.released, T.title, T.muzenbr,
T.artistid,
T.performer2 as artist, S.song, S.trackid
from new_ztitles T, zsong S where S.muzenbr = T.muzenbr
and S.datasrc = 1
order by performer2;
-- new releases indexes
create index new_rel_artist_lower       on new_releases (varchar_lower(artist) );
create index new_rel_artist_meta       on new_releases (metatext(artist) );
create index new_rel_artist_strip       on new_releases (strip(artist) );
create index new_rel_title_lower       on new_releases (varchar_lower(title) );
create index new_rel_title_meta       on new_releases (metatext(title) );
create index new_rel_title_strip       on new_releases (strip(title) );
create index new_rel_song_lower       on new_releases (varchar_lower(song) );
create index new_rel_song_meta       on new_releases (metatext(song) );
create index new_rel_song_strip       on new_releases (strip(song) );
-- new_ztitles indexes
create index newztitles_artistid       on new_ztitles (artistid);
create index newztitles_cat3       on new_ztitles(cat3);
<<<<<<<<end>>>>>>>>>>>>>


Re: 7.0.3 database corruption

From
Hannu Krosing
Date:
mlw wrote:
> 
> I know you guys want to focus on 7.1 or 7.2, and yes I am trying to move to
> 7.1, but it won't happen overnight.
> 
> We have a serious problem with 7.0.3 and data corruption. We have a program
> that compares records in two tables. It creates a set of SQL scripts that
> either update, insert, or delelte records based on the comparison of the two
> tables. These scripts are then run against multiple database servers, which are
> slaves.
> 
> Some of these scripts get pretty big, and take a while to run (10s of thousands
> of records are affected). After we run the scripts the database seems fine.
> Then we run two SQL scripts which create some summary tables. After we run the
> scripts, it looks like the database is corrupt.
> 
> Oddly enough, if we run vacuum prior to running these scripts, the database
> does not seem to get corrupted.
> 
> All I really need to know is if anyone has seen anything in the code which
> would explain this, and if so, do you know if is fixed in 7.1.x?


There certainly are bugs in 7.0.3 - I can describe at least two:

1. an index on varchar(8) (an user name) gets corrupted so that some
names are   no longer found when searching by index - they are still there when
doing an  unqualified select and come back after reindex for the qualified one.

1a. "FATAL: bits falling of the end of world" or something like it in
logs and   then broken db connection after that

2. Some kind of stuck locks - a single backend stuck in "INSERT waiting"
or   "DELETE waiting" state. This happens sporadically and requires a db
system   restart to go away


-------------
Hannu


Re: 7.0.3 database corruption

From
Tom Lane
Date:
> mlw wrote:
>> After we run the
>> scripts, it looks like the database is corrupt.

It's impossible to say anything useful with such an undescriptive
description of the problem.

Hannu Krosing <hannu@tm.ee> writes:
> There certainly are bugs in 7.0.3 - I can describe at least two:

I would really like to see a reproducible example of index corruption
in 7.0.*.  We've heard such reports often enough to know the problem
is real, but without a test case in hand it's difficult to do much about
it.

> 2. Some kind of stuck locks - a single backend stuck in "INSERT waiting"

7.0.*'s deadlock detection algorithm is known to have some holes, but
deadlock couldn't be the explanation for just a single stuck backend.
Again, any chance of looking at an example?
        regards, tom lane


Re: 7.0.3 database corruption

From
Hannu Krosing
Date:
Tom Lane wrote:
> 
> > mlw wrote:
> >> After we run the
> >> scripts, it looks like the database is corrupt.
> 
> It's impossible to say anything useful with such an undescriptive
> description of the problem.
> 
> Hannu Krosing <hannu@tm.ee> writes:
> > There certainly are bugs in 7.0.3 - I can describe at least two:
> 
> I would really like to see a reproducible example of index corruption
> in 7.0.*.  We've heard such reports often enough to know the problem
> is real, but without a test case in hand it's difficult to do much about
> it.

I know ;( Unfortunately this has happened only a few times on some quite 
busy servers receiving a workload of quite varied queries.

> > 2. Some kind of stuck locks - a single backend stuck in "INSERT waiting"
> 
> 7.0.*'s deadlock detection algorithm is known to have some holes, but
> deadlock couldn't be the explanation for just a single stuck backend.

that's what "ps ax| grep post" output looks like in my logs

Sun Jun 10 06:31:00 EET 2001 828 ?        S      0:02 /usr/bin/postmaster -i -o -F
26652 ?        S      5:20 /usr/bin/postgres localhost gamer casino idle
30082 ?        S      0:20 /usr/bin/postgres 127.0.0.1 nobody casino
idle
30084 ?        S      1:26 /usr/bin/postgres 127.0.0.1 nobody casino
idle
31565 ?        S      0:43 /usr/bin/postgres 127.0.0.1 nobody casino
idle
31595 ?        S      0:19 /usr/bin/postgres 127.0.0.1 nobody casino
idle
31596 ?        S      0:21 /usr/bin/postgres 127.0.0.1 nobody casino
idle
31597 ?        S      0:31 /usr/bin/postgres 127.0.0.1 nobody casino
idle
31598 ?        S      1:39 /usr/bin/postgres 127.0.0.1 nobody casino
idle
31600 ?        S      0:17 /usr/bin/postgres 127.0.0.1 nobody casino
idle
31608 ?        S      0:24 /usr/bin/postgres 127.0.0.1 nobody casino
idle
31612 ?        S      0:24 /usr/bin/postgres 127.0.0.1 nobody casino
idle
32080 ?        S      0:43 /usr/bin/postgres localhost gamer casino
UPDATE waiti
32706 ?        S      0:10 /usr/bin/postgres localhost gamer casino idle 302 ?        S      0:00 /usr/bin/postgres
127.0.0.1nobody casino
 
idle 361 ?        S      0:00 sh -c date;ps ax|grep post 364 ?        S      0:00 grep post

CHECKING WAITING PIDS: ['32080']
Sun Jun 10 06:31:10 EET 2001 828 ?        S      0:02 /usr/bin/postmaster -i -o -F
26652 ?        S      5:20 /usr/bin/postgres localhost gamer casino idle
30082 ?        S      0:20 /usr/bin/postgres 127.0.0.1 nobody casino
idle
30084 ?        S      1:26 /usr/bin/postgres 127.0.0.1 nobody casino
idle
31565 ?        S      0:43 /usr/bin/postgres 127.0.0.1 nobody casino
idle
31595 ?        S      0:19 /usr/bin/postgres 127.0.0.1 nobody casino
idle
31596 ?        S      0:21 /usr/bin/postgres 127.0.0.1 nobody casino
idle
31597 ?        S      0:31 /usr/bin/postgres 127.0.0.1 nobody casino
idle
31598 ?        S      1:39 /usr/bin/postgres 127.0.0.1 nobody casino
idle
31600 ?        S      0:17 /usr/bin/postgres 127.0.0.1 nobody casino
idle
31608 ?        S      0:24 /usr/bin/postgres 127.0.0.1 nobody casino
idle
31612 ?        S      0:24 /usr/bin/postgres 127.0.0.1 nobody casino
idle
32080 ?        S      0:43 /usr/bin/postgres localhost gamer casino
UPDATE waiti
32706 ?        S      0:10 /usr/bin/postgres localhost gamer casino idle 302 ?        S      0:00 /usr/bin/postgres
127.0.0.1nobody casino
 
idle 365 ?        S      0:00 sh -c date;ps ax|grep post 368 ?        S      0:00 grep post

PROCESS 32080 STILL WAITING, RESTART TIME


> Again, any chance of looking at an example?

I could send you tails of postgres logfiles that are rotated on
detecting 
the INSERT/UPDATE wait condition that does not go away in 10 sec. 
How long logfiles (time) would be enough ?

There seems to be no general pattern that leads to it though ;(

---------------
Hannu


Re: 7.0.3 database corruption

From
Tom Lane
Date:
Hannu Krosing <hannu@tm.ee> writes:
> I could send you tails of postgres logfiles that are rotated on
> detecting 
> the INSERT/UPDATE wait condition that does not go away in 10 sec. 
> How long logfiles (time) would be enough ?

Do the logs show the queries being executed?  The queries forming
the current transaction of the stuck backend, and all the transactions
that have occurred since that transaction started, would be useful
to look at.
        regards, tom lane