Re: Tracking down log segment corruption - Mailing list pgsql-general

From Gordon Shannon
Subject Re: Tracking down log segment corruption
Date
Msg-id s2qb2dd93301005021055w93b80fb6q9ec27b826a550f7d@mail.gmail.com
Whole thread Raw
In response to Re: Tracking down log segment corruption  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Tracking down log segment corruption  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general


On Sun, May 2, 2010 at 11:02 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:


Hmm ... AFAICS the only way to get that message when the incoming TID's
offsetNumber is only 2 is for the index page to be completely empty
(not zeroes, else PageAddItem's sanity check would have triggered,
but valid and empty).  What that smells like is a software bug, like
failing to emit a WAL record in a case where it was necessary.  Can you
identify which index this was?  (Look for relfilenode 48778276 in the
database with OID 22362.)  If so, can you give us any hints about
unusual things that might have been done with that index?

Interesting. There is no pg_class entry for 22362.  There is, however, an entry for that filenode. It's an index I created Sat AM, about 6AM.

select oid, * from pg_class where relfilenode=48778276;
-[ RECORD 1 ]--+---------------------------
oid            | 48777488
relname        | cts_20100501_topic_date_nk
relnamespace   | 2200
reltype        | 0
relowner       | 16412
relam          | 403
relfilenode    | 48778276
reltablespace  | 48777166
relpages       | 2476
reltuples      | 58879
reltoastrelid  | 0
reltoastidxid  | 0
relhasindex    | f
relisshared    | f
relistemp      | f
relkind        | i
relnatts       | 2
relchecks      | 0
relhasoids     | f
relhaspkey     | f
relhasrules    | f
relhastriggers | f
relhassubclass | f
relfrozenxid   | 0
relacl         | null
reloptions     | null

Possibly relevant facts:

- The WSB server went active on Friday around 3:30PM
- On Friday evening, I added about 11 tablespaces.  I noted the new files on the WSB, no problems.
- On Sat morning, I created a partitioned table cts_20100501 (inherits from another table) and 4 indexes.
- This morning, I was doing some table maintenance on the master and discovered I had created this table and its indexes in the wrong tablespace.
I wanted the table in ts29, but had it in ts30.  Vice versa for the indexes.  So I moved them. This is from my command history:
   
alter index cts_20100501_natural_uk set tablespace ts30;
alter index cts_20100501_pkey set tablespace ts30;
alter index cts_20100501_topic_date_nk set tablespace ts30;
alter index cts_20100501_updated_nk set tablespace ts30;
alter table cts_20100501 set tablespace ts29;

These commands worked fine on the master, yet this seems suspiciously relevant.


> Any suggestions?

As far as recovering goes, there's probably not much you can do except
resync the standby from scratch.  But it would be nice to get to the
bottom of the problem, so that we can fix the bug.  Have you got an
archive of this xlog segment and the ones before it, and would you be
willing to let a developer look at them?


Before I received your reply, I had already started the re-sync, and unfortunately already deleted the wal logs in question.
If it happens again, I will certainly keep them, and would be happy to share them.

Regards,
Gordon


pgsql-general by date:

Previous
From: Tom Lane
Date:
Subject: Re: Tracking down log segment corruption
Next
From: Andy
Date:
Subject: Re: PostgreSQL vs. Microsoft SQL server