Folks,
Here's the setup:
The machine: an x86 machine running generic Linux and about 15 server applications other than PostgreSQL; it is short on both RAM and disk I/O, resulting in frequent use of swap, which is also slow.
The database: a medium-large (> 100 MB) multi-function database including several large tables (> 200,000 records).
The index: the index in question is on an aggregates-caching table; that is, when updates are made to one of the largest tables, a trigger (written in C by Joe Conway) automatically updates category totals in the caching table. The index is on the category column of the caching table, which serves as its primary key.
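For concreteness, here's a minimal SQL sketch of that arrangement. All of the names below (table_detail, table_aggs, amount) are my guesses, and the real trigger is Joe's C function rather than PL/pgSQL; this only approximates the shape of the setup:

-- Hypothetical names throughout; the production trigger is C code.
CREATE TABLE table_aggs (
    category integer,
    total    numeric NOT NULL DEFAULT 0,
    CONSTRAINT pk_table_aggs PRIMARY KEY (category)  -- the index that got corrupted
);

CREATE FUNCTION maintain_aggs() RETURNS trigger AS '
BEGIN
    -- OLD is not defined on INSERT, hence the TG_OP branch.
    IF TG_OP = ''INSERT'' THEN
        UPDATE table_aggs SET total = total + NEW.amount
         WHERE category = NEW.category;
    ELSE
        UPDATE table_aggs SET total = total + NEW.amount - OLD.amount
         WHERE category = NEW.category;
    END IF;
    RETURN NEW;
END;
' LANGUAGE 'plpgsql';

CREATE TRIGGER trg_maintain_aggs
    AFTER INSERT OR UPDATE ON table_detail
    FOR EACH ROW EXECUTE PROCEDURE maintain_aggs();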
The story:
-- At 2 AM, the machine suffered an unexpected power outage and restarted apparently without problems (at least, the logs show no significant errors at startup).
-- At 8 PM, a user made three updates to the table, launching the triggers, and got an error message: "ERROR: Index pk_table_aggs is not a btree". What's really puzzling about the client's report is that the first two updates appear to have succeeded; only the third update triggered the error and failed (according to the client).
-- After trying a number of things and documenting the problem, the client dropped and rebuilt the corrupted index, and things worked normally.
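For completeness, the rebuild itself was the standard one; nothing is assumed here beyond the index name from the error message:

-- Rebuild the zero-filled index in place from the table's heap data:
REINDEX INDEX pk_table_aggs;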
Here are the relevant portions of the database log:
2003-04-21 02:10:17 [2857] DEBUG: pq_recvbuf: unexpected EOF on client connection
2003-04-21 02:10:38 [2861] DEBUG: pq_recvbuf: unexpected EOF on client connection
2003-04-21 02:10:38 [2860] DEBUG: pq_recvbuf: unexpected EOF on client connection
2003-04-21 02:14:38 [1521] DEBUG: database system was interrupted at 2003-04-21 02:07:20 UTC
2003-04-21 02:14:38 [1521] DEBUG: checkpoint record is at 6/76D71AB4
2003-04-21 02:14:38 [1521] DEBUG: redo record is at 6/76D71AB4; undo record is at 0/0; shutdown FALSE
2003-04-21 02:14:38 [1521] DEBUG: next transaction id: 63994551; next oid: 2377601
2003-04-21 02:14:38 [1521] DEBUG: database system was not properly shut down; automatic recovery in progress
2003-04-21 02:14:38 [1521] DEBUG: redo starts at 6/76D71AF4
2003-04-21 02:14:40 [1521] DEBUG: ReadRecord: record with zero length at 6/7725D26C
2003-04-21 02:14:40 [1521] DEBUG: redo done at 6/7725D248
2003-04-21 02:14:44 [1521] DEBUG: database system is ready
2003-04-21 02:15:51 [2914] DEBUG: pq_recvbuf: unexpected EOF on client connection
2003-04-21 02:15:51 [2913] DEBUG: pq_recvbuf: unexpected EOF on client connection
2003-04-21 02:15:51 [2912] DEBUG: pq_recvbuf: unexpected EOF on client connection
2003-04-21 02:15:51 [2911] DEBUG: pq_recvbuf: unexpected EOF on client connection
2003-04-21 02:15:51 [2908] DEBUG: pq_recvbuf: unexpected EOF on client connection
2003-04-21 02:19:47 [3086] DEBUG: recycled transaction log file 0000000600000076
***
2003-04-21 20:03:08 [2555] ERROR: Index pk_table_aggs is not a btree
2003-04-21 20:03:19 [2828] DEBUG: pq_recvbuf: unexpected EOF on client connection
2003-04-21 20:03:19 [2825] DEBUG: pq_recvbuf: unexpected EOF on client connection
2003-04-21 20:03:19 [2822] DEBUG: pq_recvbuf: unexpected EOF on client connection
2003-04-21 20:03:19 [2818] DEBUG: pq_recvbuf: unexpected EOF on client connection
2003-04-21 20:03:19 [2555] DEBUG: pq_recvbuf: unexpected EOF on client connection
2003-04-21 20:03:20 [18330] ERROR: Index pk_table_aggs is not a btree
Copies of the actual corrupted index are available. An examination of them shows something really bizarre: the entire index is zero-filled! It all looks like this:
*****************************************************************
* PostgreSQL File/Block Formatted Dump Utility
*
* File: 2369452
* Options used: -idx -S 8192
*
* Dump created on: Thu May 1 19:57:15 2003
*****************************************************************
Block 0 ******************************************************
0000: 00000000 00000000 00000000 00000000 ................
0010: 00000000 00000000 00000000 00000000 ................
0020: 00000000 00000000 00000000 00000000 ................
0030: 00000000 00000000 00000000 00000000 ................
0040: 00000000 00000000 00000000 00000000 ................
0050: 00000000 00000000 00000000 00000000 ................
0060: 00000000 00000000 00000000 00000000 ................
0070: 00000000 00000000 00000000 00000000 ................
0080: 00000000 00000000 00000000 00000000 ................
0090: 00000000 00000000 00000000 00000000 ................
00a0: 00000000 00000000 00000000 00000000 ................
00b0: 00000000 00000000 00000000 00000000 ................
00c0: 00000000 00000000 00000000 00000000 ................
00d0: 00000000 00000000 00000000 00000000 ................
00e0: 00000000 00000000 00000000 00000000 ................
00f0: 00000000 00000000 00000000 00000000 ................
0100: 00000000 00000000 00000000 00000000 ................
0110: 00000000 00000000 00000000 00000000 ................
0120: 00000000 00000000 00000000 00000000 ................
0130: 00000000 00000000 00000000 00000000 ................
0140: 00000000 00000000 00000000 00000000 ................
0150: 00000000 00000000 00000000 00000000 ................
0160: 00000000 00000000 00000000 00000000 ................
0170: 00000000 00000000 00000000 00000000 ................
etc.
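If anyone wants to check their own copy: the file number in the dump header is the index's relfilenode, so the name-to-file mapping can be confirmed from pg_class:

-- Confirm which on-disk file belongs to the index; the result should
-- match the "File: 2369452" line in the dump header above.
SELECT relfilenode FROM pg_class WHERE relname = 'pk_table_aggs';
-- The file itself lives at $PGDATA/base/<database-oid>/<relfilenode>.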
Questions:
1) Is this problem likely to be related to the unexpected power outage, or is it unrelated?
2) What could cause even a corrupted index to be zero-filled?
3) Could this problem be due to a memory error, since the client is using non-ECC RAM? If so, how likely is it, and can the chances be reduced?
4) Most importantly: Is this a Postgres bug that needs patching?
--
Josh Berkus
Aglio Database Solutions
San Francisco