Re: Row data corruption under 7.3.5 - Mailing list pgsql-admin
From | Marc Mitchell |
---|---|
Subject | Re: Row data corruption under 7.3.5 |
Date | |
Msg-id | 001f01c40c3a$b1a95e80$6701050a@MarcM8500 Whole thread Raw |
In response to | Re: Row data corruption under 7.3.5 (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: Row data corruption under 7.3.5
Re: Row data corruption under 7.3.5 Re: Row data corruption under 7.3.5 |
List | pgsql-admin |
This is follow-up to a problem first reported on 3/1/04. The problem has continued to occur intermittently and recently we experienced the first occurrence where the first column of a table was the column where the corrupted and thus we could not recover it. Google groups searching have found numerous hits for people reporting the same symptoms. While we've seen some instructions to get things back, we've seen nothing about correcting the root cause. This is becoming a major production problem and starting to cast doubt on the "Postgres in Production" decision. We've observed nothing that would lead us to believe there are any hardware problems. Initially we were using write-caching using battery-backed up cache but we turned that off and are using direct I/O and still experiencing the same problem. Furthermore, the fact that the problems seem isolated to 3 specific tables in a 50+ table database makes us weary of hardware-level issues. As far as matching up correct and corrupted rows, here's more detail on a recent occurrence: [root@cin1 backups]# /usr/local/pgsql/7.3.5/bin/pg_dump -Ft -p 5432 -U postgres solo > solo.dmp pg_dump: ERROR: MemoryContextAlloc: invalid request size 4294967293 pg_dump: lost synchronization with server, resetting connection pg_dump: SQL command to dump the contents of table "freight_track_detail" failed: PQendcopy() failed. pg_dump: Error message from server: pg_dump: The command was: COPY public.freight_track_detail (ftd_uid, ftm_uid, txl_uid, ref_nbr_1, ref_nbr_2, fg_tab_uid, fg_tab_alias, ftd_status_code, scan_timestamp, add_userid, add_timestamp, mod_userid, mod_timestamp) TO stdout; ..<snip>.. 118171 512 2159 00004300854908405208 46366 FGI 2004-01-21 12:25:00 postgres 2004-01-21 15:39:29 OSD 2004-01-22 01:04:40 118153 512 2159 00004304730000071106 46990 FGI 2004-01-21 12:20:00 post 2000-01-01 00:00:00 ..End of output. The second row shows the vchar userid getting lopped off after the first 4 characters. Note that we've experienced this problem with several different vchar-typed columns though, as mentioned before, we have recently seen corruption of integer typed columns. If we issued an update setting that column plus the subsequent 3 columns to "null", everything then was back to normal. This row was right in the middle of the table. Furthermore, we recently found problems reported in the same table from nightly vacuums. See the following cron-generated emails that contain error messages as well as datetimes to show the temporal relationship between these problems:
pgsql-admin by date: