Re: Partially corrupted table - Mailing list pgsql-bugs

From Tom Lane
Subject Re: Partially corrupted table
Msg-id 19402.1156894425@sss.pgh.pa.us
In response to Partially corrupted table  ("Filip Hrbek" <filip.hrbek@plz.comstar.cz>)
Responses Re: Partially corrupted table  (Alvaro Herrera <alvherre@commandprompt.com>)
List pgsql-bugs
Well, it's a corrupt-data problem all right.  The tuple that's
causing the problem is on page 1208, item 27:

 Item  27 -- Length:  240  Offset: 1400 (0x0578)  Flags: USED
  XMIN: 5213  CMIN: 140502  XMAX: 0  CMAX|XVAC: 0
  Block Id: 1208  linp Index: 27   Attributes: 29   Size: 28
  infomask: 0x0902 (HASVARWIDTH|XMIN_COMMITTED|XMAX_INVALID)

  0578: 5d140000 d6240200 00000000 00000000  ]....$..........
  0588: 0000b804 1b001d00 02091c00 0e000000  ................
  0598: 02000000 42020000 23040000 6b000000  ....B...#...k...
  05a8: 02000000 6a010000 0d000000 42020000  ....j.......B...
  05b8: 02000000 10000000 08000000 00000400  ................
  05c8: 08000000 00000400 0a000000 ffff0400  ................
  05d8: 78050000 0a000000 00000200 03000000  x...............
  05e8: 08000000 00000300 08000000 00000400  ................
  05f8: 08000000 00000400 08000000 00000400  ................
  0608: 08000000 00000200 08000000 00000300  ................
  0618: 08800000 00000400 08000000 00000400  ................
        ^^^^^^^^
  0628: 08000000 00000400 08000000 00000200  ................
  0638: 08000000 00000300 08000000 00000400  ................
  0648: 08000000 00000400 18000000 494e565f  ............INV_
  0658: 41534153 5f323030 36303130 31202020  ASAS_20060101

The underlined word is a field length word that evidently should contain
8, but contains hex 8008 (the dump shows the bytes in little-endian
order, 08 80 00 00).  This causes the tuple-data decoder to step
way past the end of the tuple and off into never-never land.  Since the
results will depend on which shared buffer the page happens to be in and
what happens to be at the address the step lands at, the inconsistent
results from one try to the next are not so surprising.
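
For anyone who wants to check the arithmetic, here is a minimal sketch
that decodes the underlined word and compares it with the expected
value.  It assumes the dump is from a little-endian machine, which is
the only reading under which the neighbouring 08000000 words come out
as 8; the variable names are just for illustration.

```python
import struct

# The four bytes at offset 0x0618, exactly as printed in the dump.
raw = bytes.fromhex("08800000")

# Assumption: little-endian 32-bit word (see the note above).
(observed,) = struct.unpack("<I", raw)
expected = 8  # what the neighbouring length words contain

# XOR leaves a 1 in every bit position where the two values differ.
diff = observed ^ expected
flipped_bits = bin(diff).count("1")
bit_position = diff.bit_length() - 1

print(hex(observed))   # the damaged length word
print(flipped_bits)    # how many bits differ from the expected value
print(bit_position)    # which bit differs (0 = least significant)
```

A length of 0x8008 is 32776 bytes, far more than the 240-byte tuple,
which is why the decoder ends up well outside the item.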

The next question is how did it get that way.  In my experience a
single-bit flip like that is most likely to be due to flaky memory,
though bad motherboards or cables are not out of the question either.
I'd recommend some thorough hardware testing on the original machine.

It seems there's only the one bad bit; I did

dwhdb=# delete from dwhdata_salemc.fct where ctid = '(1208,27)';
DELETE 1

and then was able to copy the table repeatedly without crash.  I'd
suggest doing that and then reconstructing the deleted tuple from
the above dump.
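
As a starting point for the reconstruction, the following sketch turns
one line of the dump back into raw bytes and clears the flipped bit.
The helper name parse_dump_line is hypothetical, and it assumes every
line carries exactly four hex words before the ASCII column, as all
lines of the dump above do.  Turning the repaired bytes back into
column values still requires the table definition, which is not shown
here.

```python
def parse_dump_line(line):
    """Return (offset, data) for one 'OFFS: w1 w2 w3 w4  ascii' dump line."""
    addr, rest = line.strip().split(":", 1)
    words = rest.split()[:4]  # the first four tokens are the hex words
    return int(addr, 16), bytes.fromhex("".join(words))


bad_line = "  0618: 08800000 00000400 08000000 00000400  ................"
offset, data = parse_dump_line(bad_line)

# The damaged byte is the second one on this line (0x80 where the
# neighbouring length words have 0x00); clear its high bit.
fixed = bytearray(data)
fixed[1] &= 0x7F

print(hex(offset))
print(fixed.hex())
```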

            regards, tom lane
