Corrupt data pages... - Mailing list pgsql-hackers
From | Kevin Brown |
---|---|
Subject | Corrupt data pages... |
Date | |
Msg-id | 20060417022942.GD11892@filer Whole thread Raw |
Responses |
Re: Corrupt data pages...
|
List | pgsql-hackers |
Apologies if my previous attempts to post this to the mailing list have actually succeeded, but I've seen no evidence of that... While doing some bugzilla testing, I ran into a data page corruption issue. The symptom was the usual "could not access status of transaction <bignum>". I tracked it down via the usual pg_filedump method and found what appears to my untrained eye to be data tuples from a different table residing in the (now corrupt) table in question. First, the relevant environmental details: System: - ASUS A8V motherboard - Athlon 64 3200 - 2G ECC memory (which is enabled. The kernel will panic in the face of an uncorrectable ECC error). - 2 250G Seagate ST3250823AS rev 3.03 in a mirrored configuration via Linuxsoftware RAID. OS: Debian Linux x86_64, 'unstable' branch. 'uname -a' yields: Linux kbrown-desktop 2.6.15-1-amd64-generic #2 Wed Jan4 06:22:42 CET 2006 x86_64 GNU/Linux Filesystem: XFS with the default mount options. PostgreSQL: - version: 8.1.3. - The package is postgresql-8.1 version 8.1.3-2. - Autovacuum is enabled. I'm attachingmy postgresql.conf. The context: I'm performing testing in preparation for migrating from a different bug tracking system to bugzilla. Most of this testing revolves around the migration tool, which reads directly from the other bug tracking system's database and inserts into the bugzilla database. The bugzilla database is hosted on PostgreSQL and has a lot of additional foreign keys that bugzilla does not have by default. All of the constraints are DEFERRABLE and INITIALLY DEFERRED. A test run consists of dropping all the constraints and all the tables, recreating all the tables and constraints, and importing a base set of data. The commands for doing all that were created via pg_dump. They are all executed in a single transaction and then committed. The migration run itself also executes in a single transaction and involves a savepoint prior to every insertion group (the name of the savepoint is based on the name of the operation, so the total number of outstanding savepoints at any one time is perhaps 4 or 5). Each migration run ends up executing perhaps 300,000 inserts. The meat of the problem: After the last run, a later select from one of the tables yielded the error. I narrowed it down to a single table (series_data) and tracked down the offending row via a binary search. Here's the end results: stagezilla=# select * from series_data; ERROR: could not access status of transaction 1685201011 DETAIL: could not open file "pg_clog/0647": No such file or directory stagezilla=# select * from series_data offset 9927 limit 1;series_id | series_date | series_value -----------+---------------------+-------------- 46 | 2006-01-06 00:00:00 | 0 (1 row) stagezilla=# select * from series_data offset 9928 limit 1; ERROR: could not access status of transaction 1685201011 DETAIL: could not open file "pg_clog/0647": No such file or directory stagezilla=# select ctid from series_data offset 9927 limit 1; ctid ----------(72,136) (1 row) stagezilla=# select oid from pg_database where datname = 'stagezilla'; oid -------20518 (1 row) stagezilla=# select oid from pg_class where relname = 'series_data'; oid -------53294 (1 row) stagezilla=# \d series_data Table "public.series_data" Column | Type | Modifiers --------------+--------------------------------+-----------series_id | integer | not nullseries_date | timestamp(0) without time zone | not nullseries_value | integer | not null Indexes: "series_data_series_id_idx" UNIQUE, btree (series_id, series_date) Foreign-key constraints: "series_data_series_id_series_series_id_fk" FOREIGN KEY (series_id) REFERENCES series(series_id)ON UPDATE CASCADE ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED The only file in pg_clog is 0000. After examining the output of pg_filedump, it became obvious this was a corrupt page issue. What bothers me is the way in which it's corrupt. The corrupt data looks supiciously like the data from different table, or perhaps from an index. In this case, from the short_desc field of the bugs table, which has a datatype of 'text' (I presume this means it's TOASTed?). The text data in the corrupt page exactly matches entries in that field of the bugs table. So: because the hardware is relatively new and is using components that are as reliable as I can make them given my somewhat limited budget (ECC memory, mirrored SATA disks), I have to consider the possibility that there's a bug here. Note that on the amd64 architecture, gcc (or some versions thereof) seems to have a code generation bug that can be eliminated by using the -fno-strict-aliasing switch. The version of gcc that seems to be in current use is 4.0.3. It's possible that this is causing problems, but to be honest this is the first time I've ever seen this happen. If you guys have any suggestions on what I might do to help track this one down, please let me know. I tarred up the entire 'stagezilla' database after shutting down PostgreSQL so that I can do additional examination of the datafiles if that proves useful. Additionally, if you'd like to see the output of pg_filedump, I'll be happy to post it here (or email it separately). I hesitate to attach it here because it's not clear the mailing list's spam filtering would let it pass. -- Kevin Brown kevin@sysexperts.com
pgsql-hackers by date: