Very bizarre bug with corrupted index - Mailing list pgsql-bugs

From Josh Berkus
Subject Very bizarre bug with corrupted index
Date
Msg-id 200305021108.22359.josh@agliodbs.com
Whole thread Raw
Responses Re: Very bizarre bug with corrupted index  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
Folks,

Here's the setup:

The machine: an x86 machine running generic Linux and about 15 server=20
applications other than PostgreSQL; it is short on both RAM and disk I/O,=
=20
resulting in frequent use of swap, which is also slow.

The database:  a medium-large (> 100mb)  multi-function database including=
=20
several large tables (> 200,000 records)

The index: the index in question is on an aggregates-caching table; that is=
,=20
when updates are made to one of the largest tables, a trigger (written in C=
=20
by Joe Conway) automatically updates category totals in the caching table.=
=20=20
the index is on the category column of the table, which serves as the prima=
ry=20
key.

The story:=20=20=20
-- At 2am, the machine had an unexpected power-out and restarted apparently=
=20
without problems (the logs show no significant errors at startup, anyway).
-- At 8pm, a user makes 3 updates to the table launching the triggers.  The=
y=20
get an error message: "ERROR:  Index pk_table_aggs is not a btree"  What's=
=20
really puzzling about the client's report is that the first two updates=20
appear to have succeeded, and only the 3rd update triggered the error and=
=20
failed (according to the client).
-- After trying a number of things and documenting the problem, the client=
=20
dropped and rebuilt the corrupted index and things worked normally.

Here are the relevant portions of the database log:

2003-04-21 02:10:17 [2857]   DEBUG:  pq_recvbuf: unexpected EOF on client=
=20
connection
2003-04-21 02:10:38 [2861]   DEBUG:  pq_recvbuf: unexpected EOF on client=
=20
connection
2003-04-21 02:10:38 [2860]   DEBUG:  pq_recvbuf: unexpected EOF on client=
=20
connection
2003-04-21 02:14:38 [1521]   DEBUG:  database system was interrupted at=20
2003-04-21 02:07:20 UTC
2003-04-21 02:14:38 [1521]   DEBUG:  checkpoint record is at 6/76D71AB4
2003-04-21 02:14:38 [1521]   DEBUG:  redo record is at 6/76D71AB4; undo rec=
ord=20
is at 0/0; shutdown FALSE
2003-04-21 02:14:38 [1521]   DEBUG:  next transaction id: 63994551; next oi=
d:=20
2377601
2003-04-21 02:14:38 [1521]   DEBUG:  database system was not properly shut=
=20
down; automatic recovery in progress
2003-04-21 02:14:38 [1521]   DEBUG:  redo starts at 6/76D71AF4
2003-04-21 02:14:40 [1521]   DEBUG:  ReadRecord: record with zero length at=
=20
6/7725D26C
2003-04-21 02:14:40 [1521]   DEBUG:  redo done at 6/7725D248
2003-04-21 02:14:44 [1521]   DEBUG:  database system is ready
2003-04-21 02:15:51 [2914]   DEBUG:  pq_recvbuf: unexpected EOF on client=
=20
connection
2003-04-21 02:15:51 [2913]   DEBUG:  pq_recvbuf: unexpected EOF on client=
=20
connection
2003-04-21 02:15:51 [2912]   DEBUG:  pq_recvbuf: unexpected EOF on client=
=20
connection
2003-04-21 02:15:51 [2911]   DEBUG:  pq_recvbuf: unexpected EOF on client=
=20
connection
2003-04-21 02:15:51 [2908]   DEBUG:  pq_recvbuf: unexpected EOF on client=
=20
connection
2003-04-21 02:19:47 [3086]   DEBUG:  recycled transaction log file=20
0000000600000076

***

2003-04-21 20:03:08 [2555]   ERROR:  Index pk_table_aggs is not a btree
2003-04-21 20:03:19 [2828]   DEBUG:  pq_recvbuf: unexpected EOF on client=
=20
connection
2003-04-21 20:03:19 [2825]   DEBUG:  pq_recvbuf: unexpected EOF on client=
=20
connection
2003-04-21 20:03:19 [2822]   DEBUG:  pq_recvbuf: unexpected EOF on client=
=20
connection
2003-04-21 20:03:19 [2818]   DEBUG:  pq_recvbuf: unexpected EOF on client=
=20
connection=20
2003-04-21 20:03:19 [2555]   DEBUG:  pq_recvbuf: unexpected EOF on client=
=20
connection
2003-04-21 20:03:20 [18330]  ERROR:  Index pk_table_aggs is not a btree

Copies of the actual corrupted index are available.   An examination of the=
m=20
shows something really bizarre; the entire index is zero-filled!   It all=
=20
looks like this:

*****************************************************************
* PostgreSQL File/Block Formatted Dump Utility
*
* File: 2369452
* Options used: -idx -S 8192=20
*
* Dump created on: Thu May  1 19:57:15 2003
*****************************************************************

Block    0 ******************************************************
  0000: 00000000 00000000 00000000 00000000  ................
  0010: 00000000 00000000 00000000 00000000  ................
  0020: 00000000 00000000 00000000 00000000  ................
  0030: 00000000 00000000 00000000 00000000  ................
  0040: 00000000 00000000 00000000 00000000  ................
  0050: 00000000 00000000 00000000 00000000  ................
  0060: 00000000 00000000 00000000 00000000  ................
  0070: 00000000 00000000 00000000 00000000  ................
  0080: 00000000 00000000 00000000 00000000  ................
  0090: 00000000 00000000 00000000 00000000  ................
  00a0: 00000000 00000000 00000000 00000000  ................
  00b0: 00000000 00000000 00000000 00000000  ................
  00c0: 00000000 00000000 00000000 00000000  ................
  00d0: 00000000 00000000 00000000 00000000  ................
  00e0: 00000000 00000000 00000000 00000000  ................
  00f0: 00000000 00000000 00000000 00000000  ................
  0100: 00000000 00000000 00000000 00000000  ................
  0110: 00000000 00000000 00000000 00000000  ................
  0120: 00000000 00000000 00000000 00000000  ................
  0130: 00000000 00000000 00000000 00000000  ................
  0140: 00000000 00000000 00000000 00000000  ................
  0150: 00000000 00000000 00000000 00000000  ................
  0160: 00000000 00000000 00000000 00000000  ................
  0170: 00000000 00000000 00000000 00000000  ...............
etc.

Questions:

1) Is this problem likely to be related to the unexpected power-out, or=20
unrelated?
2) What could cause even a corrupted index to be zero-filled?
3) Could this problem be due to a memory error since the client is using=20
non-ECC RAM?   If so, how likely is it and can the chances be reduced?
4) Most importantly:  Is this a Postgres bug that needs patching?


--=20
Josh Berkus
Aglio Database Solutions
San Francisco

pgsql-bugs by date:

Previous
From: pgsql-bugs@postgresql.org
Date:
Subject: Bug #963: Weird plpython trigger(?) behaviour in 7.3.2
Next
From: Tom Lane
Date:
Subject: Re: Very bizarre bug with corrupted index