Re: Partially corrupted table - Mailing list pgsql-bugs

From Filip Hrbek
Subject Re: Partially corrupted table
Msg-id 002701c6cc0c$ea3ba890$1e03a8c0@fhrbek
In response to Partially corrupted table  ("Filip Hrbek" <filip.hrbek@plz.comstar.cz>)
Responses Re: Partially corrupted table
List pgsql-bugs
Tom, thank you very much for your excellent and fast analysis (I mean it
seriously, I am comparing your help to IBM Informix commercial support
:-) ).

It is possible that the corruption was caused by a HW problem on the customer's
server, and that the problem then also appeared in our development environment
because the data was already corrupted. I will recommend that the customer
run some memory tests.

We have been using PostgreSQL on 14 customer servers for almost 5 years and this
is the first time it has crashed - and perhaps due to a HW problem. Great work!

Regards
  Filip Hrbek


----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
To: "Filip Hrbek" <filip.hrbek@plz.comstar.cz>
Cc: <pgsql-bugs@postgreSQL.org>
Sent: Wednesday, August 30, 2006 1:33 AM
Subject: Re: [BUGS] Partially corrupted table


> Well, it's a corrupt-data problem all right.  The tuple that's
> causing the problem is on page 1208, item 27:
>
> Item  27 -- Length:  240  Offset: 1400 (0x0578)  Flags: USED
>  XMIN: 5213  CMIN: 140502  XMAX: 0  CMAX|XVAC: 0
>  Block Id: 1208  linp Index: 27   Attributes: 29   Size: 28
>  infomask: 0x0902 (HASVARWIDTH|XMIN_COMMITTED|XMAX_INVALID)
>
>  0578: 5d140000 d6240200 00000000 00000000  ]....$..........
>  0588: 0000b804 1b001d00 02091c00 0e000000  ................
>  0598: 02000000 42020000 23040000 6b000000  ....B...#...k...
>  05a8: 02000000 6a010000 0d000000 42020000  ....j.......B...
>  05b8: 02000000 10000000 08000000 00000400  ................
>  05c8: 08000000 00000400 0a000000 ffff0400  ................
>  05d8: 78050000 0a000000 00000200 03000000  x...............
>  05e8: 08000000 00000300 08000000 00000400  ................
>  05f8: 08000000 00000400 08000000 00000400  ................
>  0608: 08000000 00000200 08000000 00000300  ................
>  0618: 08800000 00000400 08000000 00000400  ................
>        ^^^^^^^^
>  0628: 08000000 00000400 08000000 00000200  ................
>  0638: 08000000 00000300 08000000 00000400  ................
>  0648: 08000000 00000400 18000000 494e565f  ............INV_
>  0658: 41534153 5f323030 36303130 31202020  ASAS_20060101
>
> The underlined word is a field length word that evidently should contain
> 8, but contains hex 8008.  This causes the tuple-data decoder to step
> way past the end of the tuple and off into never-never land.  Since the
> results will depend on which shared buffer the page happens to be in and
> what happens to be at the address the step lands at, the inconsistent
> results from try to try are not so surprising.
>
> The next question is how did it get that way.  In my experience a
> single-bit flip like that is most likely to be due to flaky memory,
> though bad motherboards or cables are not out of the question either.
> I'd recommend some thorough hardware testing on the original machine.
>
> It seems there's only the one bad bit; I did
>
> dwhdb=# delete from dwhdata_salemc.fct where ctid = '(1208,27)';
> DELETE 1
>
> and then was able to copy the table repeatedly without crash.  I'd
> suggest doing that and then reconstructing the deleted tuple from
> the above dump.
>
> regards, tom lane
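
The single-bit claim in the quoted analysis can be checked with a short sketch.
It assumes the dump words are stored little-endian, which is consistent with
the first word of the tuple (5d140000) decoding to the reported XMIN of 5213.
The helper name decode_le32 is hypothetical and is used only for illustration;
the "expected" bytes are the value 8 that the analysis says the field should
contain.

```python
import struct

def decode_le32(raw: bytes) -> int:
    """Decode 4 raw bytes as an unsigned little-endian 32-bit integer."""
    return struct.unpack("<I", raw)[0]

# Sanity check of the byte-order assumption: first word of the tuple is XMIN.
xmin = decode_le32(bytes.fromhex("5d140000"))

# Underlined word at offset 0x0618 in the dump, as stored on the page.
corrupt = decode_le32(bytes.fromhex("08800000"))

# What the same word would hold if the field length were 8, as in the
# neighbouring fields (assumption taken from the quoted analysis).
expected = decode_le32(bytes.fromhex("08000000"))

# XOR leaves a 1 in every bit position where the two values differ.
diff = corrupt ^ expected
flipped_bits = [bit for bit in range(32) if (diff >> bit) & 1]

print(xmin)                         # 5213
print(hex(corrupt), hex(expected))  # 0x8008 0x8
print(flipped_bits)                 # [15]
```

Exactly one differing bit (bit 15) matches the description of a single-bit
flip; the sketch says nothing about what caused it.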
