Re: 8.4b2 tsearch2 strange error - Mailing list pgsql-hackers

From Tom Lane
Subject Re: 8.4b2 tsearch2 strange error
Date
Msg-id 13882.1244242148@sss.pgh.pa.us
Whole thread Raw
In response to Re: 8.4b2 tsearch2 strange error  (Tatsuo Ishii <ishii@postgresql.org>)
Responses Re: 8.4b2 tsearch2 strange error  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
I wrote:
>> I poked around in the dump for awhile.  I still can't reproduce the
>> failure from a standing start.

I've had a bit of success with reproducing the failure from scratch.
After loading the SQL dump that Tatsuo previously supplied into a UTF8,
C-locale database, generate a data-only dump of it, along the lines of

pg_dump -a ... >ishii-data.sql

Then run this psql script in the database (the function is designed to
exercise every entry in the GIN index):

create table msgstats as select * from ts_stat('select body_index from msginfo');

create language plpgsql;
create function testit () returns int as $$
declare r record;       ifail int := 0;
begin for r in select * from msgstats loop   begin   perform count(*) from msginfo where plainto_tsquery(r.word) @@
body_index;  exception when others then     raise notice 'failed for "%": %', r.word, sqlerrm;     ifail := ifail + 1;
end; end loop; return ifail;
 
end $$ language plpgsql;

delete from msginfo;
truncate msg_folderinfo;
\i ishii-data.sql
reset client_min_messages;

select testit();

The results are not very stable, but I fairly often get "tuple offset out
of range" errors, and my last try resulted in an Assert failure in
autovacuum.  What I think is happening is that there's some inadequately
locked page-changing logic in there, and if you can get autovacuum to
happen on msginfo concurrently with the data reload step, it'll mess up.
Given that we've not seen reports of this on 8.3, I'll bet lunch that
the bug is somewhere in the "fast insert" code.  I haven't found it
yet, though.

BTW, I also did more analysis of Tatsuo's binary dump using this
function.  Of some 38000 distinct lexemes in the body_index column,
about 1500 trigger an error in testit(); the errors are either
"tuple offset out of range" or something like "could not open segment 1
of relation base/16384/16497 (target block 955254869): No such file or
directory".  There are about 440 distinct messages that show up,
suggesting that approximately that many pages of the index contain
corrupted TIDs.  Manual examination of the index is consistent with
this, and also indicates that all of the corrupt pages are in the second
half of the table.  That's not too hard to square with the
concurrent-update theory --- autovac might have already done about half
the index before a large data load started.  However it means that
corruption is *very* probable if data insertion is happening
concurrently with a vacuum, else not such a large number of pages would
have gotten messed up.
        regards, tom lane


pgsql-hackers by date:

Previous
From: Josh Berkus
Date:
Subject: Re: blocking referencing system catalogs in 8.4 breaks my code
Next
From: Emmanuel Cecchet
Date:
Subject: Unique namespace per session for temp tables?