Thread: corrupted item pointer in streaming based replication
Hi,
Postgres version = 9.1.2
OS = debian(6.0.7)
fsync = on
full_page_writes = on
Setup = Primary and streaming replication based secondary
A few days ago our primary started to throw the error messages below, indicating corruption in the database. It crashed a few times and showed a panic message in the logs:
2013-03-25 07:30:39.545 PDT PANIC: corrupted item pointer: offset = 0, size = 0
2013-03-25 07:30:39.704 PDT LOG: server process (PID 8715) was terminated by signal 6: Aborted
2013-03-25 07:30:39.704 PDT LOG: terminating any other active server processes
In the days before it started crashing, it logged the error messages below:
[d: u:postgres p:2498 7] ERROR: could not access status of transaction 837550133
DETAIL: Could not open file "pg_clog/031E": No such file or directory. [u:postgres p:2498 9]
[d: u:radio p:31917 242] ERROR: could not open file "base/16384/114846.39" (target block 360448000): No such file or directory [d: u:radio p:31917 243]
On top of that, our secondaries have now crashed, will not start up, and show the error messages below in the Postgres logs:
2013-03-27 11:00:47.281 PDT LOG: recovery restart point at 161A/17108AA8
2013-03-27 11:00:47.281 PDT DETAIL: last completed transaction was at log time 2013-03-27 11:00:47.241236-07
2013-03-27 11:00:47.520 PDT LOG: restartpoint starting: xlog
2013-03-27 11:07:51.348 PDT FATAL: corrupted item pointer: offset = 0, size = 0
2013-03-27 11:07:51.348 PDT CONTEXT: xlog redo split_l: rel 1663/16384/115085 left 4256959, right 5861610, next 5044459, level 0, firstright 192
2013-03-27 11:07:51.716 PDT LOG: startup process (PID 5959) exited with exit code 1
2013-03-27 11:07:51.716 PDT LOG: terminating any other active server processes
At this point we have a running but corrupt primary and a crashed secondary that won't start up.
I am wondering what our options are. Can we do something to fix this? How can we recover from the corruption?
Thanks in advance for your help.
Regards
Jigar
You should figure out what base/16384/114846.39 corresponds to inside the
database. If you're super lucky it's something unimportant and/or something
that can be recreated easily (like an index). If it's something important,
then your only option is to try to drop the object and restore it from the
last known good backup.

On Wed, Apr 3, 2013 at 1:02 PM, Jigar Shah <jshah@pandora.com> wrote:

-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
L. Friedman                                    netllama@gmail.com
LlamaLand                          https://netllama.linux-sxs.org
Jigar Shah <jshah@pandora.com> writes:
> Postgres version = 9.1.2

Um, you do realize this is over a year out of date, right?
(Fortunately, you will have an excellent opportunity to update tomorrow.)

> Few days ago we had a situation where our Primary started to through the
> error messages below indicating corruption in the database. It crashed
> sometimes and showed a panic message in the logs

> [d: u:radio p:31917 242] ERROR: could not open file "base/16384/114846.39"
> (target block 360448000): No such file or directory [d: u:radio p:31917 243]

> 2013-03-27 11:07:51.348 PDT FATAL: corrupted item pointer: offset = 0, size = 0
> 2013-03-27 11:07:51.348 PDT CONTEXT: xlog redo split_l: rel 1663/16384/115085
> left 4256959, right 5861610, next 5044459, level 0, firstright 192

Look up relfilenodes 114846 and 115085 in pg_class of whichever database
has OID 16384. I'm guessing the latter is an index of the former. If
that's true, then both of these messages suggest corruption in the index
--- the latter pretty obviously, and the former because it looks like it's
an attempt to fetch from a silly block number, which could have come out
of a corrupted index entry. So if you're really lucky and nothing but
that index is corrupted, a REINDEX will fix it. Personally I'd be
wondering about what's the underlying cause and whether there is
corruption elsewhere, though. Try looking for evidence of flaky RAM or
flaky disk drives on your primary. See if you can pg_dump (not just for
forensic reasons, but so you've got some kind of backup if things go
downhill from here).

			regards, tom lane
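For readers following along: the lookup Tom describes can be sketched as the query below. It is only a sketch; it assumes you are connected (e.g. via psql) to the database whose OID is 16384, and the relfilenode values are the ones from the error messages in this thread.

```sql
-- Map the relfilenodes from the error messages back to catalog objects.
-- relkind 'i' means the object is an index, 'r' a plain table.
SELECT relname, relkind, relfilenode
FROM pg_class
WHERE relfilenode IN (114846, 115085);
```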
We had some disk issues on the primary, but RAID verification corrected
those blocks. That may have caused the primary to become corrupt.

I have identified the objects; they are both indexes:

        relname         | relfilenode | relkind
------------------------+-------------+---------
 feedback_packed_pkey   |      114846 | i
 feedback_packed_id_idx |      115085 | i

We did start a pg_dump, but it then started throwing the error below. It
happened while querying a different table and has nothing to do with the
above indexes:

[d: u:customerservice p:28567 3] ERROR: could not access status of transaction 4074203375
[d: u:customerservice p:28567 4] DETAIL: Could not open file "pg_clog/0F2D": No such file or directory.
[d: u:customerservice p:28567 5] STATEMENT: select listener_id, station_id from station where date_created > '2013-03-26 00:23:17.249' and name != 'QuickMix';

We do have a backup from last night, but it's several hours old. The
secondary is the most recent copy. If we could just tell the secondary to
skip past that corrupt block and get the database started, we could divert
traffic to the secondary so our system can run read-only until we isolate
and fix our primary. But the secondary is stuck at this point and won't
start. Is there a way to make the secondary do that? Is there a way to
remove that record from the WAL file it's applying, so replay can get past
that point?
2013-03-27 11:00:47.281 PDT LOG: recovery restart point at 161A/17108AA8
2013-03-27 11:00:47.281 PDT DETAIL: last completed transaction was at log time 2013-03-27 11:00:47.241236-07
2013-03-27 11:00:47.520 PDT LOG: restartpoint starting: xlog
2013-03-27 11:07:51.348 PDT FATAL: corrupted item pointer: offset = 0, size = 0
2013-03-27 11:07:51.348 PDT CONTEXT: xlog redo split_l: rel 1663/16384/115085 left 4256959, right 5861610, next 5044459, level 0, firstright 192
2013-03-27 11:07:51.716 PDT LOG: startup process (PID 5959) exited with exit code 1
2013-03-27 11:07:51.716 PDT LOG: terminating any other active server processes

Also, in case we are not able to fix our corrupt primary, we could promote
our secondary, as it is the most recent copy and would save us a lot of
time over restoring an old backup. We could then rebuild the corrupt
indexes on the secondary. All this is possible only if we can get the
secondary started, but it won't budge.

Any suggestions?

Thanks
Jigar

On 4/3/13 1:18 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
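If the damage really is confined to those two indexes, the rebuild suggested upthread would look roughly like this. A sketch only, using the index names from the pg_class output above; note that plain REINDEX takes exclusive locks, so it should run in a window where blocking writes to the underlying table is acceptable.

```sql
-- Rebuild the two suspect indexes identified in pg_class.
REINDEX INDEX feedback_packed_pkey;
REINDEX INDEX feedback_packed_id_idx;
```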
Jigar Shah <jshah@pandora.com> writes:
> We had some disk issues on the primary, but raid verification corrected
> those blocks. That may have caused the primary to be corrupt.

"corrected" for small values of "corrected", I'm guessing :-(

> I have identified the objects, they both are indexes
> relname | relfilenode | relkind
> ------------------------+-------------+---------
> feedback_packed_pkey | 114846 | i
> feedback_packed_id_idx | 115085 | i

Hm, well, the good news is you could reindex both of those; the bad news
is that there are certainly more problems than this.

> The secondary is the most recent copy. If we could just tell the secondary
> to go passed beyond that corrupt block and get the database started, we
> can then divert traffic to the secondary so our system can run read-only
> until we can isolate and fix our primary. But the secondary is stuck at
> this point and wont start. Is there a way to make the secondary do that?
> Is there a way to remove that block from the wal file its applying so it
> can go passed that point?

No. You could probably make use of the PITR functionality to let the
secondary replay up to just short of the WAL record where corruption
becomes apparent, then stop and come up normally. The problem here is
that it seems unlikely that the detected-inconsistent WAL record is the
first bit of corruption that's been passed to the secondary. I don't have
a lot of faith in the idea that your troubles would be over if you could
only fire up the secondary.

It's particularly worrisome that you seem to be looking for ways to avoid
a dump/restore. That should be your zeroth-order priority at this point.

What I would do if I were you is to take a filesystem backup of the
secondary's entire current state (WAL and data directory) so that you can
get back to this point if you have to. Then try the PITR
stop-at-this-point trick. Be prepared to restore from the filesystem
backup and recover to some other stopping point, possibly a few times, to
get to the latest point that doesn't have clear corruption.

Meanwhile you could be trying to get the master into a better state.
It's not immediately obvious which path is going to lead to a better
outcome faster, but I wouldn't assume the secondary is in better shape
than the primary. On the master, again it seems like a filesystem dump
ought to be the first priority, mainly so that you still have the data if
the disks continue the downward arc that it sounds like they're on.

In short: you're in for a long day, but your first priority ought to be
to make sure things can't get even worse.

			regards, tom lane
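For the archives: on a 9.1 standby, the PITR stop-at-this-point trick would look roughly like the recovery.conf fragment below. This is a sketch under assumptions, not a tested recipe: the archive path is hypothetical, the target timestamp is illustrative (it should fall just before the WAL record that trips the "corrupted item pointer" failure), and standby_mode must not be on, since recovery targets apply to archive recovery rather than standby replay.

```
# recovery.conf (sketch) -- replay from the WAL archive, stopping just
# short of the record that causes the FATAL during redo.
restore_command = 'cp /path/to/wal_archive/%f %p'   # hypothetical archive location
recovery_target_time = '2013-03-27 11:00:00 PDT'    # illustrative; pick a point before the bad record
recovery_target_inclusive = 'false'
# standby_mode = 'on' must be removed/commented out for targeted recovery
```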