Re: corrupted item pointer in streaming based replication - Mailing list pgsql-general
From: Jigar Shah
Subject: Re: corrupted item pointer in streaming based replication
Msg-id: 1E737D138B89104D8A7853F7DD23177DB6358A@SF1-EXMBX-2.ad.savagebeast.com
In response to: Re: corrupted item pointer in streaming based replication (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-general
We had some disk issues on the primary, but RAID verification corrected those blocks. That may have caused the corruption on the primary. I have identified the objects; they are both indexes:

        relname         | relfilenode | relkind
------------------------+-------------+---------
 feedback_packed_pkey   |      114846 | i
 feedback_packed_id_idx |      115085 | i

We did start a pg_dump, but it then started throwing the error below. It happened while querying a different table and has nothing to do with the above indexes:

[d: u:customerservice p:28567 3] ERROR: could not access status of transaction 4074203375
[d: u:customerservice p:28567 4] DETAIL: Could not open file "pg_clog/0F2D": No such file or directory.
[d: u:customerservice p:28567 5] STATEMENT: select listener_id, station_id from station where date_created > '2013-03-26 00:23:17.249' and name != 'QuickMix';

We do have a backup from last night, but it's several hours old. The secondary is the most recent copy. If we could just tell the secondary to skip past that corrupt block and get the database started, we could divert traffic to it and run read-only until we isolate and fix the primary. But the secondary is stuck at this point and won't start. Is there a way to make the secondary do that? Is there a way to remove that block from the WAL file it is applying so it can get past that point?

2013-03-27 11:00:47.281 PDT LOG: recovery restart point at 161A/17108AA8
2013-03-27 11:00:47.281 PDT DETAIL: last completed transaction was at log time 2013-03-27 11:00:47.241236-07
2013-03-27 11:00:47.520 PDT LOG: restartpoint starting: xlog
2013-03-27 11:07:51.348 PDT FATAL: corrupted item pointer: offset = 0, size = 0
2013-03-27 11:07:51.348 PDT CONTEXT: xlog redo split_l: rel 1663/16384/115085 left 4256959, right 5861610, next 5044459, level 0, firstright 192
2013-03-27 11:07:51.716 PDT LOG: startup process (PID 5959) exited with exit code 1
2013-03-27 11:07:51.716 PDT LOG: terminating any other active server processes

Also, in case we are not able to fix our corrupt primary, we could promote our secondary, since it is the most recent copy; that would save us a lot of time compared to restoring an old backup. We could then rebuild the corrupt indexes on the secondary. All of this is possible only if we can get the secondary started, but it won't budge.

Any suggestions?

Thanks
Jigar
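For reference, the mapping from relfilenode to relation shown in the table above can be obtained with a query along these lines (a sketch; 16384 is the database OID from the "rel 1663/16384/115085" context line, and pg_database maps that OID to a database name):

-- Run in the database whose OID is 16384; maps the relfilenodes
-- seen in the error messages to their relations.
SELECT relname, relfilenode, relkind
  FROM pg_class
 WHERE relfilenode IN (114846, 115085);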
On 4/3/13 1:18 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

>Jigar Shah <jshah@pandora.com> writes:
>> Postgres version = 9.1.2
>
>Um, you do realize this is over a year out of date right?
>(Fortunately, you will have an excellent opportunity to update tomorrow.)
>
>> Few days ago we had a situation where our Primary started to throw
>> the error messages below indicating corruption in the database. It
>> crashed sometimes and showed a panic message in the logs
>
>> [d: u:radio p:31917 242] ERROR: could not open file
>> "base/16384/114846.39" (target block 360448000): No such file or
>> directory [d: u:radio p:31917 243]
>
>> 2013-03-27 11:07:51.348 PDT FATAL: corrupted item pointer: offset = 0,
>> size = 0
>> 2013-03-27 11:07:51.348 PDT CONTEXT: xlog redo split_l: rel
>> 1663/16384/115085 left 4256959, right 5861610, next 5044459, level 0,
>> firstright 192
>
>Look up relfilenodes 114846 and 115085 in pg_class of whichever database
>has OID 16384.  I'm guessing the latter is an index of the former.
>If that's true, then both of these messages suggest corruption in the index
>--- the latter pretty obviously, and the former because it looks like
>it's an attempt to fetch from a silly block number, which could have
>come out of a corrupted index entry.  So if you're really lucky and
>nothing but that index is corrupted, a REINDEX will fix it.  Personally
>I'd be wondering about what's the underlying cause and whether there is
>corruption elsewhere, though.  Try looking for evidence of flaky RAM or
>flaky disk drives on your primary.  See if you can pg_dump (not just
>for forensic reasons, but so you've got some kind of backup if things
>go downhill from here).
>
>			regards, tom lane
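If the corruption really is confined to those indexes, the REINDEX Tom suggests could be as simple as the following (a sketch using the index names from the table above; REINDEX blocks writes to the table while it runs, so it would have to be done on a writable server, i.e. the repaired primary or a promoted secondary, not the read-only standby):

-- Rebuild the two suspect indexes once a writable server is available.
REINDEX INDEX feedback_packed_id_idx;
REINDEX INDEX feedback_packed_pkey;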