Thread: Partially corrupted table
Platform: CentOS release 4.3 (Final) (Linux 2.6.9-34.EL)
Database version: PostgreSQL 8.1.3 on i686-redhat-linux-gnu, compiled by GCC gcc (GCC) 3.4.5 20051201 (Red Hat 3.4.5-2)

Description: One of approximately 100 tables is partially corrupted. An attempt to read or dump the data from the table is sometimes successful and sometimes crashes. After upgrading to 8.1.4 the behaviour remained unchanged (using the cluster created with 8.1.3). Unfortunately, I am not able to reproduce the error from scratch using only 8.1.4.

Note: After a successful dump/reload everything is OK. The problem is not in the data content itself, but in the binary database cluster. This is why I would like to send you the whole cluster instead of the database dump as an attachment. The problem is the file size (30M). Please tell me where or how to send it. For simplicity, I removed all other objects from the database. There is only one table with several indexes, and the table contains 56621 rows.

Here are some examples of the behaviour:

[root@devlin2 tmp]# pg_dumpall -p5447 -U postgres > pgdump.sql
pg_dump: ERROR:  invalid memory alloc request size 4294967290
pg_dump: SQL command to dump the contents of table "fct" failed: PQendcopy() failed.
pg_dump: Error message from server: ERROR:  invalid memory alloc request size 4294967290
pg_dump: The command was: COPY dwhdata_salemc.fct (time_id, company_id, customer_id, product_id, flagsprod_id, flagssale_id, account_id, accttime_id, invcustomer_id, salesperson_id, vendor_id, inv_cost_amt, inv_base_amt, inv_amt, inv_qty, inv_wght, ret_cost_amt, ret_base_amt, ret_amt, ret_qty, ret_wght, unret_cost_amt, unret_base_amt, unret_amt, unret_qty, unret_wght, bonus_forecast, bonus_final, stamp_code) TO stdout;
pg_dumpall: pg_dump failed on database "dwhdb", exiting

dwhdb=# create temp table t_fct as select * from dwhdata_salemc.fct;
SELECT

dwhdb=# create temp table t_fct as select * from dwhdata_salemc.fct;
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

dwhdb=# create temp table t_fct as select * from dwhdata_salemc.fct;
ERROR:  row is too big: size 119264, maximum size 8136

dwhdb=# create temp table t_fct as select * from dwhdata_salemc.fct;
ERROR:  row is too big: size 38788, maximum size 8136

AFTER UPGRADE TO 8.1.4:

dwhdb=# create temp table t_fct as select * from dwhdata_salemc.fct;
ERROR:  row is too big: size 52892, maximum size 8136

I noticed one more problem when executing vacuum:

dwhdb=# vacuum full;
WARNING:  relation "pg_attribute" page 113 is uninitialized --- fixing
VACUUM

The "vacuum" problem has happened only once.

Regards
Filip Hrbek
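An aside for anyone triaging a similar "sometimes works, sometimes crashes" table: one low-tech way to narrow down which heap page holds a damaged tuple is to probe the table page by page with TID scans and note which page misbehaves. The sketch below is hypothetical and was not used in this thread; the function name, the page range, and the per-page item limit are all made up, and it only helps when the bad read surfaces as an ordinary, trappable error rather than a backend crash like the one shown above.

-- Hypothetical triage helper (a sketch, not from this thread): probe a table
-- one heap page at a time using TID scans and report pages whose tuples
-- cannot be read.  Errors such as "invalid memory alloc request size ..."
-- are caught per page; a backend crash obviously cannot be trapped this way.
CREATE OR REPLACE FUNCTION probe_fct_pages(max_page integer, max_item integer)
RETURNS void AS $$
BEGIN
    FOR pg IN 0 .. max_page LOOP
        BEGIN
            FOR it IN 1 .. max_item LOOP
                -- A TID scan on a single item; SELECT * forces every
                -- attribute to be decoded, so a bad length word shows up here.
                EXECUTE 'SELECT * FROM dwhdata_salemc.fct'
                     || ' WHERE ctid = ''(' || pg || ',' || it || ')''::tid';
            END LOOP;
        EXCEPTION WHEN OTHERS THEN
            RAISE NOTICE 'page % looks damaged: %', pg, SQLERRM;
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

-- Example call (both upper bounds are guesses):
-- SELECT probe_fct_pages(2000, 300);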
"Filip Hrbek" <filip.hrbek@plz.comstar.cz> writes: > dwhdb=# create temp table t_fct as select * from dwhdata_salemc.fct; > SELECT > dwhdb=# create temp table t_fct as select * from dwhdata_salemc.fct; > server closed the connection unexpectedly > This probably means the server terminated abnormally > before or while processing the request. > The connection to the server was lost. Attempting reset: Failed. > !> > dwhdb=# create temp table t_fct as select * from dwhdata_salemc.fct; > ERROR: row is too big: size 119264, maximum size 8136 > dwhdb=# create temp table t_fct as select * from dwhdata_salemc.fct; > ERROR: row is too big: size 38788, maximum size 8136 I think you've got hardware problems. Getting different answers from successive scans of the same table is really hard to explain any other way. memtest86 and badblocks are commonly suggested for testing memory and disk respectively on Linux machines. regards, tom lane
No, I'm sure it is not a HW problem. I tested the same DB cluster on two
different machines. The error is exactly the same.

I can send you the cluster if you tell me how and where to send a 30M file.

Thanks for the reply
Filip Hrbek

----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
To: "Filip Hrbek" <filip.hrbek@plz.comstar.cz>
Cc: <pgsql-bugs@postgresql.org>
Sent: Tuesday, August 29, 2006 4:01 PM
Subject: Re: [BUGS] Partially corrupted table

> "Filip Hrbek" <filip.hrbek@plz.comstar.cz> writes:
>> dwhdb=# create temp table t_fct as select * from dwhdata_salemc.fct;
>> SELECT
>
>> dwhdb=# create temp table t_fct as select * from dwhdata_salemc.fct;
>> server closed the connection unexpectedly
>>         This probably means the server terminated abnormally
>>         before or while processing the request.
>> The connection to the server was lost. Attempting reset: Failed.
>> !>
>
>> dwhdb=# create temp table t_fct as select * from dwhdata_salemc.fct;
>> ERROR:  row is too big: size 119264, maximum size 8136
>
>> dwhdb=# create temp table t_fct as select * from dwhdata_salemc.fct;
>> ERROR:  row is too big: size 38788, maximum size 8136
>
> I think you've got hardware problems.  Getting different answers from
> successive scans of the same table is really hard to explain any other
> way.  memtest86 and badblocks are commonly suggested for testing memory
> and disk respectively on Linux machines.
>
>                         regards, tom lane

"Filip Hrbek" <filip.hrbek@plz.comstar.cz> writes:
> No, I'm sure it is not a HW problem. I tested the same DB cluster on two
> different machines. The error is exactly the same.

Really?

> I can send you the cluster if you tell me how and where to send a 30M file.

You can email it to me personally if you want.  What's the exact
hardware/OS platform this is on, again?

                        regards, tom lane
Well, it's a corrupt-data problem all right.  The tuple that's
causing the problem is on page 1208, item 27:

Item 27 -- Length: 240  Offset: 1400 (0x0578)  Flags: USED
  XMIN: 5213  CMIN: 140502  XMAX: 0  CMAX|XVAC: 0
  Block Id: 1208  linp Index: 27  Attributes: 29  Size: 28
  infomask: 0x0902 (HASVARWIDTH|XMIN_COMMITTED|XMAX_INVALID)

0578: 5d140000 d6240200 00000000 00000000 ]....$..........
0588: 0000b804 1b001d00 02091c00 0e000000 ................
0598: 02000000 42020000 23040000 6b000000 ....B...#...k...
05a8: 02000000 6a010000 0d000000 42020000 ....j.......B...
05b8: 02000000 10000000 08000000 00000400 ................
05c8: 08000000 00000400 0a000000 ffff0400 ................
05d8: 78050000 0a000000 00000200 03000000 x...............
05e8: 08000000 00000300 08000000 00000400 ................
05f8: 08000000 00000400 08000000 00000400 ................
0608: 08000000 00000200 08000000 00000300 ................
0618: 08800000 00000400 08000000 00000400 ................
      ^^^^^^^^
0628: 08000000 00000400 08000000 00000200 ................
0638: 08000000 00000300 08000000 00000400 ................
0648: 08000000 00000400 18000000 494e565f ............INV_
0658: 41534153 5f323030 36303130 31202020 ASAS_20060101

The underlined word is a field length word that evidently should contain
8, but contains hex 8008.  This causes the tuple-data decoder to step
way past the end of the tuple and off into never-never land.  Since the
results will depend on which shared buffer the page happens to be in and
what happens to be at the address the step lands at, the inconsistent
results from try to try are not so surprising.

The next question is how did it get that way.  In my experience a
single-bit flip like that is most likely to be due to flaky memory,
though bad motherboards or cables are not out of the question either.
I'd recommend some thorough hardware testing on the original machine.

It seems there's only the one bad bit; I did

dwhdb=# delete from dwhdata_salemc.fct where ctid = '(1208,27)';
DELETE 1

and then was able to copy the table repeatedly without crash.  I'd
suggest doing that and then reconstructing the deleted tuple from
the above dump.

                        regards, tom lane
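A small aside on the numbers above: the expected length word 0x0008 and the observed 0x8008 differ in exactly one bit (bit 15 of the 32-bit word), which is easy to confirm with a bit of psql arithmetic; nothing here is specific to the damaged cluster:

-- The correct length word is 8; flipping bit 15 yields 0x8008 (32776),
-- which is why the decoder tries to read a roughly 32 kB field out of a
-- 240-byte tuple and walks far past its end.
SELECT to_hex(8)             AS expected,
       to_hex(8 # (1 << 15)) AS observed,
       8 # (1 << 15)         AS observed_as_integer;

This returns 8, 8008, and 32776 respectively.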
Tom Lane wrote:
> The underlined word is a field length word that evidently should contain
> 8, but contains hex 8008.  This causes the tuple-data decoder to step
> way past the end of the tuple and off into never-never land.  Since the
> results will depend on which shared buffer the page happens to be in and
> what happens to be at the address the step lands at, the inconsistent
> results from try to try are not so surprising.

Hmm, perhaps we could protect against "impossible" length words?  (i.e.
those that would cause us to try to read past the end of the current
buffer)

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> The underlined word is a field length word that evidently should contain
>> 8, but contains hex 8008.  This causes the tuple-data decoder to step
>> way past the end of the tuple and off into never-never land.

> Hmm, perhaps we could protect against "impossible" length words?

Perhaps.  I'm hesitant to add additional tests into the inner loop of
heap_deform_tuple and friends though.  It's not like it's going to do
anything to recover your data post-corruption :-(

                        regards, tom lane
Tom,

thank you very much for your excellent and fast analysis (I mean it
seriously, I am comparing your help to IBM Informix commercial support :-) ).

It is possible that the corruption was caused by a HW problem at the
customer's server, and that this problem then appeared in our development
environment as well because the data was already corrupted. I will
recommend that the customer run some memory tests.

We have been using PostgreSQL on 14 customer servers for almost 5 years and
this is the first time it has crashed - and perhaps due to a HW problem.

Great work!

Regards
Filip Hrbek

----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
To: "Filip Hrbek" <filip.hrbek@plz.comstar.cz>
Cc: <pgsql-bugs@postgreSQL.org>
Sent: Wednesday, August 30, 2006 1:33 AM
Subject: Re: [BUGS] Partially corrupted table

> Well, it's a corrupt-data problem all right.  The tuple that's
> causing the problem is on page 1208, item 27:
>
> Item 27 -- Length: 240  Offset: 1400 (0x0578)  Flags: USED
>   XMIN: 5213  CMIN: 140502  XMAX: 0  CMAX|XVAC: 0
>   Block Id: 1208  linp Index: 27  Attributes: 29  Size: 28
>   infomask: 0x0902 (HASVARWIDTH|XMIN_COMMITTED|XMAX_INVALID)
>
> 0578: 5d140000 d6240200 00000000 00000000 ]....$..........
> 0588: 0000b804 1b001d00 02091c00 0e000000 ................
> 0598: 02000000 42020000 23040000 6b000000 ....B...#...k...
> 05a8: 02000000 6a010000 0d000000 42020000 ....j.......B...
> 05b8: 02000000 10000000 08000000 00000400 ................
> 05c8: 08000000 00000400 0a000000 ffff0400 ................
> 05d8: 78050000 0a000000 00000200 03000000 x...............
> 05e8: 08000000 00000300 08000000 00000400 ................
> 05f8: 08000000 00000400 08000000 00000400 ................
> 0608: 08000000 00000200 08000000 00000300 ................
> 0618: 08800000 00000400 08000000 00000400 ................
>       ^^^^^^^^
> 0628: 08000000 00000400 08000000 00000200 ................
> 0638: 08000000 00000300 08000000 00000400 ................
> 0648: 08000000 00000400 18000000 494e565f ............INV_
> 0658: 41534153 5f323030 36303130 31202020 ASAS_20060101
>
> The underlined word is a field length word that evidently should contain
> 8, but contains hex 8008.  This causes the tuple-data decoder to step
> way past the end of the tuple and off into never-never land.  Since the
> results will depend on which shared buffer the page happens to be in and
> what happens to be at the address the step lands at, the inconsistent
> results from try to try are not so surprising.
>
> The next question is how did it get that way.  In my experience a
> single-bit flip like that is most likely to be due to flaky memory,
> though bad motherboards or cables are not out of the question either.
> I'd recommend some thorough hardware testing on the original machine.
>
> It seems there's only the one bad bit; I did
>
> dwhdb=# delete from dwhdata_salemc.fct where ctid = '(1208,27)';
> DELETE 1
>
> and then was able to copy the table repeatedly without crash.  I'd
> suggest doing that and then reconstructing the deleted tuple from
> the above dump.
>
>                         regards, tom lane
Well, it would be a good idea to check the row validity. If the row is
partially corrupted, I may get an error (which is relatively good), but I may
also get bad results without any error, which is really bad.

After Tom identified which row causes the problem, I tried to select the data
from the corrupted row (output shown one column per line because of the row
width):

dwhdb=# select * from dwhdata_salemc.fct
        where time_id = 14 and company_id = 2 and customer_id = 578
          and product_id = 1059 and flagsprod_id = 107 and flagssale_id = 2
          and account_id = 362 and accttime_id = 13 and invcustomer_id = 578
          and salesperson_id = 2 and vendor_id = 16;

-[ RECORD 1 ]--+----------------------------------
time_id        | 14
company_id     | 2
customer_id    | 578
product_id     | 1059
flagsprod_id   | 107
flagssale_id   | 2
account_id     | 362
accttime_id    | 13
invcustomer_id | 578
salesperson_id | 2
vendor_id      | 16
inv_cost_amt   | 0.0000
inv_base_amt   | 0.0000
inv_amt        | 0.1400
inv_qty        | 3.00
inv_wght       | 0.000
ret_cost_amt   | 0.0000
ret_base_amt   | 0.0000
ret_amt        | 0.0000
ret_qty        | 0.00
ret_wght       | 0.000
unret_cost_amt | 8.0000
unret_base_amt | 000040008000000000004000800000000
unret_amt      | 000040008000000000004000800000000
unret_qty      | 000040008000000000004000800000000
unret_wght     | 0.000
bonus_forecast | 0.0000
bonus_final    | 0.0000
stamp_code     | INV_ASAS_20060601

As you can see, the unret_* values and the stamp field are bad, but I get no
error report. There should be all zeros in the number fields, and
'INV_ASAS_20060101' instead of 'INV_ASAS_20060601' in the stamp field. All of
this is caused by a single bit being inverted. I consider this very dangerous.

Regards
Filip Hrbek

----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
To: "Alvaro Herrera" <alvherre@commandprompt.com>
Cc: "Filip Hrbek" <filip.hrbek@plz.comstar.cz>; <pgsql-bugs@postgreSQL.org>
Sent: Wednesday, August 30, 2006 4:10 AM
Subject: Re: [BUGS] Partially corrupted table

> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> Tom Lane wrote:
>>> The underlined word is a field length word that evidently should contain
>>> 8, but contains hex 8008.  This causes the tuple-data decoder to step
>>> way past the end of the tuple and off into never-never land.
>
>> Hmm, perhaps we could protect against "impossible" length words?
>
> Perhaps.  I'm hesitant to add additional tests into the inner loop of
> heap_deform_tuple and friends though.  It's not like it's going to do
> anything to recover your data post-corruption :-(
>
>                         regards, tom lane
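An aside on the "no error report" point: since nothing at the SQL level flags values like these, the only practical safety net is application-side validation. The query below is only a sketch with made-up plausibility bounds and a guessed stamp_code pattern (based on the sample values in this thread); it would probably flag the mangled row above via the numeric bounds, although reading a damaged row is of course not guaranteed to succeed at all.

-- Made-up plausibility checks for dwhdata_salemc.fct: flag rows whose
-- measures fall outside an assumed business-plausible range or whose
-- stamp_code does not follow the INV_<code>_<yyyymmdd> convention seen in
-- the sample data.  The bounds and the pattern are assumptions, not rules
-- taken from this thread.
SELECT ctid, unret_cost_amt, unret_base_amt, unret_amt, unret_qty, stamp_code
FROM dwhdata_salemc.fct
WHERE unret_base_amt NOT BETWEEN -1e12 AND 1e12
   OR unret_amt      NOT BETWEEN -1e12 AND 1e12
   OR unret_qty      NOT BETWEEN -1e12 AND 1e12
   OR trim(stamp_code) !~ '^INV_[A-Z0-9]+_[0-9]{8}$';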
On Wed, Aug 30, 2006 at 10:18:46 +0200,
  Filip Hrbek <filip.hrbek@plz.comstar.cz> wrote:
> It is possible that the corruption was caused by a HW problem at the
> customer's server, and that this problem then appeared in our development
> environment as well because the data was already corrupted. I will
> recommend that the customer run some memory tests.

You might also ask them if they are using ECC memory. A single bit flip
could have easily happened with good memory that didn't have error
correction.