Thread: Partially corrupted table

Partially corrupted table

From: "Filip Hrbek"

Platform:
CentOS release 4.3 (Final) (Linux 2.6.9-34.EL)

Database version:
PostgreSQL 8.1.3 on i686-redhat-linux-gnu, compiled by GCC gcc (GCC) 3.4.5 20051201 (Red Hat 3.4.5-2)

Description:
One of about 100 tables is partially corrupted. An attempt to read or dump
the data from the table sometimes succeeds and sometimes crashes.
After the upgrade to 8.1.4 the behaviour remained unchanged (using the
cluster created with 8.1.3). Unfortunately, I am not able to reproduce the
error from scratch using ONLY 8.1.4.

Note:
After a successful dump/reload everything is OK. The problem is not in the
data content itself, but in the binary database cluster. This is why I
would like to send you the whole cluster instead of a database dump as an
attachment. The problem is the file size (30M). Please tell me where or how
to send it. For simplicity, I removed all other objects from the database.
There is only one table with several indexes; the table contains 56621 rows.

Here are some examples of the behaviour:

[root@devlin2 tmp]# pg_dumpall -p5447 -U postgres > pgdump.sql
pg_dump: ERROR:  invalid memory alloc request size 4294967290
pg_dump: SQL command to dump the contents of table "fct" failed: PQendcopy() failed.
pg_dump: Error message from server: ERROR:  invalid memory alloc request size 4294967290
pg_dump: The command was: COPY dwhdata_salemc.fct (time_id, company_id, customer_id, product_id, flagsprod_id, flagssale_id, account_id, accttime_id, invcustomer_id, salesperson_id, vendor_id, inv_cost_amt, inv_base_amt, inv_amt, inv_qty, inv_wght, ret_cost_amt, ret_base_amt, ret_amt, ret_qty, ret_wght, unret_cost_amt, unret_base_amt, unret_amt, unret_qty, unret_wght, bonus_forecast, bonus_final, stamp_code) TO stdout;
pg_dumpall: pg_dump failed on database "dwhdb", exiting

dwhdb=# create temp table t_fct as select * from dwhdata_salemc.fct;
SELECT

dwhdb=# create temp table t_fct as select * from dwhdata_salemc.fct;
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

dwhdb=# create temp table t_fct as select * from dwhdata_salemc.fct;
ERROR:  row is too big: size 119264, maximum size 8136

dwhdb=# create temp table t_fct as select * from dwhdata_salemc.fct;
ERROR:  row is too big: size 38788, maximum size 8136

AFTER UPGRADE TO 8.1.4:
dwhdb=# create temp table t_fct as select * from dwhdata_salemc.fct;
ERROR:  row is too big: size 52892, maximum size 8136


I noticed one more problem when executing vacuum:
dwhdb=# vacuum full;
WARNING:  relation "pg_attribute" page 113 is uninitialized --- fixing
VACUUM

The "vacuum" problem has happend only once.


Regards
  Filip Hrbek

Re: Partially corrupted table

From: Tom Lane

"Filip Hrbek" <filip.hrbek@plz.comstar.cz> writes:
> dwhdb=# create temp table t_fct as select * from dwhdata_salemc.fct;
> SELECT

> dwhdb=# create temp table t_fct as select * from dwhdata_salemc.fct;
> server closed the connection unexpectedly
>         This probably means the server terminated abnormally
>         before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
> !>

> dwhdb=# create temp table t_fct as select * from dwhdata_salemc.fct;
> ERROR:  row is too big: size 119264, maximum size 8136

> dwhdb=# create temp table t_fct as select * from dwhdata_salemc.fct;
> ERROR:  row is too big: size 38788, maximum size 8136

I think you've got hardware problems.  Getting different answers from
successive scans of the same table is really hard to explain any other
way.  memtest86 and badblocks are commonly suggested for testing memory
and disk respectively on Linux machines.

            regards, tom lane
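
For reference, checks along those lines might look like the following; the
device name is only an example, badblocks in its default read-only mode is
non-destructive, and memtest86 is normally booted from its own CD or floppy
rather than run from inside Linux:

# badblocks -sv /dev/hda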

Re: Partially corrupted table

From: "Filip Hrbek"

No, I'm sure it is not a HW problem. I tested the same DB cluster on two
different machines, and the error is exactly the same.
I can send you the cluster if you tell me how and where to send a 30M file.

Thanks for reply
  Filip Hrbek

Re: Partially corrupted table

From: Tom Lane

"Filip Hrbek" <filip.hrbek@plz.comstar.cz> writes:
> No, I'm sure it is not a HW problem. I tested the same DB cluster on two
> different machines, and the error is exactly the same.

Really?

> I can send you the cluster if you tell me how and where to send a 30M file.

You can email it to me personally if you want.  What's the exact
hardware/OS platform this is on, again?

            regards, tom lane

Re: Partially corrupted table

From: Tom Lane

Well, it's a corrupt-data problem all right.  The tuple that's
causing the problem is on page 1208, item 27:

 Item  27 -- Length:  240  Offset: 1400 (0x0578)  Flags: USED
  XMIN: 5213  CMIN: 140502  XMAX: 0  CMAX|XVAC: 0
  Block Id: 1208  linp Index: 27   Attributes: 29   Size: 28
  infomask: 0x0902 (HASVARWIDTH|XMIN_COMMITTED|XMAX_INVALID)

  0578: 5d140000 d6240200 00000000 00000000  ]....$..........
  0588: 0000b804 1b001d00 02091c00 0e000000  ................
  0598: 02000000 42020000 23040000 6b000000  ....B...#...k...
  05a8: 02000000 6a010000 0d000000 42020000  ....j.......B...
  05b8: 02000000 10000000 08000000 00000400  ................
  05c8: 08000000 00000400 0a000000 ffff0400  ................
  05d8: 78050000 0a000000 00000200 03000000  x...............
  05e8: 08000000 00000300 08000000 00000400  ................
  05f8: 08000000 00000400 08000000 00000400  ................
  0608: 08000000 00000200 08000000 00000300  ................
  0618: 08800000 00000400 08000000 00000400  ................
        ^^^^^^^^
  0628: 08000000 00000400 08000000 00000200  ................
  0638: 08000000 00000300 08000000 00000400  ................
  0648: 08000000 00000400 18000000 494e565f  ............INV_
  0658: 41534153 5f323030 36303130 31202020  ASAS_20060101

The underlined word is a field length word that evidently should contain
8, but contains hex 8008.  This causes the tuple-data decoder to step
way past the end of the tuple and off into never-never land.  Since the
results will depend on which shared buffer the page happens to be in and
what happens to be at the address the step lands at, the inconsistent
results from try to try are not so surprising.

The next question is how did it get that way.  In my experience a
single-bit flip like that is most likely to be due to flaky memory,
though bad motherboards or cables are not out of the question either.
I'd recommend some thorough hardware testing on the original machine.

It seems there's only the one bad bit; I did

dwhdb=# delete from dwhdata_salemc.fct where ctid = '(1208,27)';
DELETE 1

and then was able to copy the table repeatedly without crash.  I'd
suggest doing that and then reconstructing the deleted tuple from
the above dump.

            regards, tom lane
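
Two illustrative asides on the analysis above. First, hex 8008 versus the
expected 8 is indeed a single-bit difference; a one-line check in psql,
using the standard bit-string syntax and the integer XOR operator #,
confirms it:

dwhdb=# select x'8008'::int # x'0008'::int as flipped_bits;
 flipped_bits
--------------
        32768
(1 row)

32768 is 2^15, so the two words differ in exactly bit 15.

Second, without a page-dump tool, a slow but portable way to hunt for such
a tuple is to probe every ctid individually: a TID scan decodes only the
addressed tuple, so only the probe that hits the corrupted item fails. A
rough sketch, where the block and line-pointer ranges are guesses that must
be adjusted to the table's size, and which only helps when decoding the bad
tuple actually raises an error rather than silently returning garbage:

[root@devlin2 tmp]# for blk in `seq 0 1300`; do
>   for itm in `seq 1 100`; do
>     psql -p5447 -U postgres -d dwhdb \
>       -c "select * from dwhdata_salemc.fct where ctid = '($blk,$itm)'" \
>       >/dev/null 2>&1 || echo "bad tuple at ($blk,$itm)"
>   done
> done

Each probe runs in its own psql session, so the loop keeps going even if
one probe crashes the backend; in this case it should end up reporting
(1208,27).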

Re: Partially corrupted table

From: Alvaro Herrera

Tom Lane wrote:

> The underlined word is a field length word that evidently should contain
> 8, but contains hex 8008.  This causes the tuple-data decoder to step
> way past the end of the tuple and off into never-never land.  Since the
> results will depend on which shared buffer the page happens to be in and
> what happens to be at the address the step lands at, the inconsistent
> results from try to try are not so surprising.

Hmm, perhaps we could protect against "impossible" length words?  (i.e.
those that would cause us to try to read past the end of the current
buffer).

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Partially corrupted table

From: Tom Lane

Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> The underlined word is a field length word that evidently should contain
>> 8, but contains hex 8008.  This causes the tuple-data decoder to step
>> way past the end of the tuple and off into never-never land.

> Hmm, perhaps we could protect against "impossible" length words?

Perhaps.  I'm hesitant to add additional tests into the inner loop of
heap_deform_tuple and friends though.   It's not like it's going to do
anything to recover your data post-corruption :-(

            regards, tom lane

Re: Partially corrupted table

From: "Filip Hrbek"

Tom, thank you very much for your excellent and fast analysis (I mean it
seriously - I am comparing your help to IBM Informix commercial support
:-) ).

It is possible that the corruption was caused by a HW problem at the
customer's server, and that it then appeared in our development environment
as well because the data was already corrupted. I will recommend that the
customer run some memory tests.

We have been using PostgreSQL on 14 customer servers for almost 5 years,
and this is the first time it has crashed - and perhaps only due to a HW
problem. Great work!

Regards
  Filip Hrbek


Re: Partially corrupted table

From: "Filip Hrbek"

Well, it would be a good idea to check row validity. If a row is partially
corrupted, I may get an error (which is relatively good), but I may just as
well get bad results without any error, which is really bad.

After Tom worked out which row causes the problem, I tried to select the
data from the corrupted row:

dwhdb=# select * from dwhdata_salemc.fct
dwhdb-#  where time_id = 14 and company_id = 2 and customer_id = 578
dwhdb-#    and product_id = 1059 and flagsprod_id = 107 and flagssale_id = 2
dwhdb-#    and account_id = 362 and accttime_id = 13 and invcustomer_id = 578
dwhdb-#    and salesperson_id = 2 and vendor_id = 16;

 time_id        | 14
 company_id     | 2
 customer_id    | 578
 product_id     | 1059
 flagsprod_id   | 107
 flagssale_id   | 2
 account_id     | 362
 accttime_id    | 13
 invcustomer_id | 578
 salesperson_id | 2
 vendor_id      | 16
 inv_cost_amt   | 0.0000
 inv_base_amt   | 0.0000
 inv_amt        | 0.1400
 inv_qty        | 3.00
 inv_wght       | 0.000
 ret_cost_amt   | 0.0000
 ret_base_amt   | 0.0000
 ret_amt        | 0.0000
 ret_qty        | 0.00
 ret_wght       | 0.000
 unret_cost_amt | 8.0000
 unret_base_amt | 000040008000000000004000800000000
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 unret_amt      | 000040008000000000004000800000000
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 unret_qty      | 000040008000000000004000800000000
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 unret_wght     | 0.000
 bonus_forecast | 0.0000
 bonus_final    | 0.0000
 stamp_code     | INV_ASAS_20060601
                  ^^^^^^^^^^^^^^^^^
(1 row)

As you can see, the underlined data is bad, but I get no error report.
There should be all zeros in those number fields, and 'INV_ASAS_20060101'
instead of 'INV_ASAS_20060601' in the stamp field. All of this is caused by
a single inverted bit. I consider that very dangerous.

Regards
  Filip Hrbek
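
A sanity check that would at least catch the nondeterministic misreads seen
earlier in this thread is to fingerprint the whole table on two consecutive
scans and compare the digests. A sketch; the row-to-text cast and
string_agg used here come from releases newer than 8.1, and a digest that
stays stable can still hide a deterministic error like this one:

dwhdb=# select md5(string_agg(t::text, E'\n' order by t.ctid)) as table_digest
dwhdb-#   from dwhdata_salemc.fct t;

If two back-to-back runs return different digests, tuple decoding is
unstable and the data, or the hardware underneath it, is suspect.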


Re: Partially corrupted table

From: Bruno Wolff III

On Wed, Aug 30, 2006 at 10:18:46 +0200,
  Filip Hrbek <filip.hrbek@plz.comstar.cz> wrote:
>
> It is possible that the corruption was caused by a HW problem at the
> customer's server, and that it then appeared in our development environment
> as well because the data was already corrupted. I will recommend that the
> customer run some memory tests.

You might also ask them if they are using ECC memory. A single bit flip
could have easily happened with good memory that didn't have error correction.
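
On Linux, a quick way to check whether the installed memory reports error
correction at all is to inspect the DMI tables; this needs root, and the
exact output depends on the BIOS:

# dmidecode -t memory | grep -i 'error correction'
	Error Correction Type: None

"None" here would mean that a single flipped bit like the one found above
reaches RAM undetected.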