Re: Partially corrupted table - Mailing list pgsql-bugs
From | Filip Hrbek |
---|---|
Subject | Re: Partially corrupted table |
Date | |
Msg-id | 002701c6cc0c$ea3ba890$1e03a8c0@fhrbek |
In response to | Partially corrupted table ("Filip Hrbek" <filip.hrbek@plz.comstar.cz>) |
Responses | Re: Partially corrupted table |
List | pgsql-bugs |
Tom,

thank you very much for your excellent and fast analysis (I mean it seriously, I am comparing your help to IBM Informix commercial support :-) ).

It is possible that the corruption was caused by a HW problem on the customer's server, and that the problem then also appeared in our development environment because the data was already corrupted. I will recommend that the customer run some memory tests.

We have been using PostgreSQL on 14 customer servers for almost 5 years and this is the first time it has crashed, and perhaps due to a HW problem. Great work!

Regards
Filip Hrbek

----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
To: "Filip Hrbek" <filip.hrbek@plz.comstar.cz>
Cc: <pgsql-bugs@postgreSQL.org>
Sent: Wednesday, August 30, 2006 1:33 AM
Subject: Re: [BUGS] Partially corrupted table

> Well, it's a corrupt-data problem all right. The tuple that's
> causing the problem is on page 1208, item 27:
>
> Item 27 -- Length: 240 Offset: 1400 (0x0578) Flags: USED
> XMIN: 5213 CMIN: 140502 XMAX: 0 CMAX|XVAC: 0
> Block Id: 1208 linp Index: 27 Attributes: 29 Size: 28
> infomask: 0x0902 (HASVARWIDTH|XMIN_COMMITTED|XMAX_INVALID)
>
> 0578: 5d140000 d6240200 00000000 00000000 ]....$..........
> 0588: 0000b804 1b001d00 02091c00 0e000000 ................
> 0598: 02000000 42020000 23040000 6b000000 ....B...#...k...
> 05a8: 02000000 6a010000 0d000000 42020000 ....j.......B...
> 05b8: 02000000 10000000 08000000 00000400 ................
> 05c8: 08000000 00000400 0a000000 ffff0400 ................
> 05d8: 78050000 0a000000 00000200 03000000 x...............
> 05e8: 08000000 00000300 08000000 00000400 ................
> 05f8: 08000000 00000400 08000000 00000400 ................
> 0608: 08000000 00000200 08000000 00000300 ................
> 0618: 08800000 00000400 08000000 00000400 ................
>       ^^^^^^^^
> 0628: 08000000 00000400 08000000 00000200 ................
> 0638: 08000000 00000300 08000000 00000400 ................
> 0648: 08000000 00000400 18000000 494e565f ............INV_
> 0658: 41534153 5f323030 36303130 31202020 ASAS_20060101
>
> The underlined word is a field length word that evidently should contain
> 8, but contains hex 8008. This causes the tuple-data decoder to step
> way past the end of the tuple and off into never-never land. Since the
> results will depend on which shared buffer the page happens to be in and
> what happens to be at the address the step lands at, the inconsistent
> results from try to try are not so surprising.
>
> The next question is how did it get that way. In my experience a
> single-bit flip like that is most likely to be due to flaky memory,
> though bad motherboards or cables are not out of the question either.
> I'd recommend some thorough hardware testing on the original machine.
>
> It seems there's only the one bad bit; I did
>
> dwhdb=# delete from dwhdata_salemc.fct where ctid = '(1208,27)';
> DELETE 1
>
> and then was able to copy the table repeatedly without crash. I'd
> suggest doing that and then reconstructing the deleted tuple from
> the above dump.
>
> regards, tom lane
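For readers who want to check the "single-bit flip" reading of the dump, the following is a minimal sketch, not part of the original exchange. It assumes the page dump comes from a little-endian machine (consistent with XMIN 5213 appearing as the bytes 5d140000) and that the intended value of the length word is 8, as stated in the quoted analysis. The variable names are illustrative only.

```python
import struct

# Bytes of the suspect word exactly as printed at offset 0x0618 in the dump.
raw = bytes.fromhex("08800000")

# Assumption: little-endian byte order, so the word is decoded as an
# unsigned 32-bit little-endian integer.
(found,) = struct.unpack("<I", raw)

# Assumption: the neighbouring length words all contain 8, so 8 is the
# value this word should have held.
expected = 8

# XOR leaves a 1 in every bit position where the two values differ.
diff = found ^ expected
flipped_bits = [bit for bit in range(32) if (diff >> bit) & 1]

print(hex(found))    # 0x8008
print(flipped_bits)  # [15]
```

Exactly one bit position differs (bit 15), which matches the description of a single flipped bit turning a length of 8 into hex 8008.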