Thread: Invalid page header
I get the following error message when doing a select on a table:

ERROR: invalid page header in block 295 of relation "reported_titles"

I found some messages saying that this means a block of this table is corrupt. I found some suspicious lines in the server log just before:

ERROR: could not access status of transaction 3651584
DETAIL: could not open file "/usr/local/pgsql/data/pg_clog/0003": No such file or directory

How do I fix this corruption? I have dumped as much of the databases as I can, including about half of this table. Only this table is corrupt.

What could cause the corruption? We are using custom C code. Could bugs in it be causing this? Or are hardware problems more likely?

- Ian
Ian Burrell <imb@rentrak.com> writes:
> I get the following error message when doing a select on a table:
> ERROR: invalid page header in block 295 of relation "reported_titles"

> How do I fix this corruption?

You can zap just the failed block by turning on "zero_damaged_pages"; that will at least allow you to recover the rest of the table. If you want to try harder, you could look at the damaged page with pg_filedump (http://sources.redhat.com/rhdb/) or a similar tool and try to intuit how to fix it manually.

> What could cause the corruption? We are using custom C code. Could
> bugs in it be causing this? Or are hardware problems more likely?

Hmm. A scribble-on-memory kind of bug could cause this, but in my experience it's unusual for coding errors to trash the disk buffers --- that's a relatively small part of your address space, and usually a memory clobber will crash the backend elsewhere before it hits a disk buffer. (BTW, one reason we force a database restart after a backend crash is in hopes of not letting any such clobber make it to disk. The contents of shared disk buffers are simply thrown away in a restart.)

It would probably be worth your while to look at the damaged page with pg_filedump before you zap it. The symptoms of hardware misfeasance and software errors are enough different that you can often tell which theory to believe by examining the bits.

regards, tom lane
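Before zapping the block, it can help to keep a raw copy of it for later inspection. What follows is only a minimal Python sketch of that idea, not a replacement for pg_filedump. It assumes the default 8192-byte block size, that the damaged block lies in the first segment file of the relation, and that it is run against a copy of the data file rather than the live one. The function name `read_block` and any path passed to it are hypothetical.

```python
BLOCK_SIZE = 8192  # assumption: default PostgreSQL block size (BLCKSZ)


def read_block(path, block_no, block_size=BLOCK_SIZE):
    """Return the raw bytes of one block from a copy of a relation file.

    Assumes block_no falls inside this file, i.e. the relation has not
    been split across further segment files before this block.
    """
    with open(path, "rb") as f:
        # Blocks are fixed-size, so block N starts at byte N * block_size.
        f.seek(block_no * block_size)
        data = f.read(block_size)
    if len(data) != block_size:
        raise ValueError(f"block {block_no} is beyond the end of {path}")
    return data
```

Under these assumptions, block 295 starts at byte offset 295 * 8192 = 2416640, so `read_block(copy_of_relation_file, 295)` would return the 8192 bytes that the error message complains about, which can then be written to a separate file before "zero_damaged_pages" is turned on.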
Tom Lane wrote:
>
> You can zap just the failed block by turning on "zero_damaged_pages";
> that will at least allow you to recover the rest of the table. If you
> want to try harder, you could look at the damaged page with pg_filedump
> (http://sources.redhat.com/rhdb/) or a similar tool and try to intuit
> how to fix it manually.
>

I zapped the damaged block. It didn't seem to affect the rows in the table. My suspicion is that the page contained only deleted rows, since the table had many updates done recently.

> Hmm. A scribble-on-memory kind of bug could cause this, but in my
> experience it's unusual for coding errors to trash the disk buffers ---
> that's a relatively small part of your address space, and usually a
> memory clobber will crash the backend elsewhere before it hits a disk
> buffer. (BTW, one reason we force a database restart after a backend
> crash is in hopes of not letting any such clobber make it to disk. The
> contents of shared disk buffers are simply thrown away in a restart.)
>
> It would probably be worth your while to look at the damaged page with
> pg_filedump before you zap it. The symptoms of hardware misfeasance and
> software errors are enough different that you can often tell which
> theory to believe by examining the bits.
>

I used pg_filedump on a backup of the database files. The block looks like it is mostly zero bytes, with a few x02 bytes thrown in just to be confusing.

- Ian
Ian Burrell <imb@rentrak.com> writes:
> Tom Lane wrote:
>> It would probably be worth your while to look at the damaged page with
>> pg_filedump before you zap it. The symptoms of hardware misfeasance and
>> software errors are enough different that you can often tell which
>> theory to believe by examining the bits.

> I used pg_filedump on a backup of the database files. The block looks
> like it is mostly zero bytes, with a few x02 bytes thrown in just to be
> confusing.

My interpretation of that would be a hardware glitch. A software problem would be more likely to look like copying the wrong data into the block, or possibly zeroing out the block when it shouldn't --- but the sprinkling of x02's rules out a misaimed memset().

Time to break out the RAM and disk test programs ...

regards, tom lane
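The distinction drawn above (an all-zero block versus a mostly-zero block with stray bytes versus a block full of unrelated data) can be made visible with a simple byte count. The following Python sketch is only an illustration of that reasoning. The function names and the 0.9 "mostly zero" threshold are assumptions made for the example, and the result is a hint for a human reading the dump, not a diagnosis.

```python
from collections import Counter


def summarize_block(data):
    """Count how many bytes of a raw block are zero and which others occur."""
    counts = Counter(data)  # iterating over bytes yields integers 0..255
    nonzero = {f"0x{b:02x}": n for b, n in sorted(counts.items()) if b != 0}
    return {"size": len(data), "zero_bytes": counts.get(0, 0), "nonzero": nonzero}


def rough_guess(summary, mostly_zero=0.9):
    """Map a summary to one of the patterns discussed in the thread.

    mostly_zero is an arbitrary threshold chosen for this sketch.
    """
    size = summary["size"]
    if size == 0:
        return "empty input"
    zero_fraction = summary["zero_bytes"] / size
    if zero_fraction == 1.0:
        return "entirely zero: consistent with a zeroed block, e.g. a misaimed memset()"
    if zero_fraction >= mostly_zero:
        return "mostly zero with stray bytes: a plain memset() would not leave these behind"
    return "mostly nonzero: compare against valid pages, possibly wrong data copied in"
```

For a block like the one described here, the summary would show nearly all bytes as zero plus a small count under "0x02", which falls into the second case, the one read above as pointing toward hardware.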