Thread: PANIC: corrupted item pointer
Hi,

I am running postgresql-9.1 from the Debian backports package with

  fsync=on
  full_page_writes=off

I didn't have any power failures on this server. Now I got this:

1. Logfile PANIC

postgres[27352]: [4-1] PANIC: corrupted item pointer: offset = 21248, size = 16
postgres[27352]: [4-2] STATEMENT: insert into RankingEntry (rankingentry_mitglied_name, rankingentry_spieltagspunkte, rankingentry_gesamtpunkte, rankingentry_spieltagssiege, rankingentry_spieltagssieger, tippspieltag_id, mitglied_id) values ($1, $2, $3, $4, $5, $6, $7)
postgres[26286]: [2-1] LOG: server process (PID 27352) was terminated by signal 6: Aborted
postgres[26286]: [3-1] LOG: terminating any other active server processes

2. All my database connections are closed after this log entry.

3. My application is throwing lots of java.io.EOFException because of this.

Sometimes I get exactly the same behaviour but without no. 1: no PANIC is logged, but all connections are suddenly closed with an EOFException.

I searched the archives and found
http://archives.postgresql.org/pgsql-general/2007-06/msg01268.php

So I first rebuilt all indexes on table "rankingentry" concurrently and replaced the old ones. No errors. Then I ran "VACUUM rankingentry" and got:

kicktipp=# VACUUM rankingentry ;
WARNING: relation "rankingentry" page 424147 is uninitialized --- fixing
WARNING: relation "rankingentry" page 424154 is uninitialized --- fixing
WARNING: relation "rankingentry" page 424155 is uninitialized --- fixing
WARNING: relation "rankingentry" page 424166 is uninitialized --- fixing
WARNING: relation "rankingentry" page 424167 is uninitialized --- fixing
WARNING: relation "rankingentry" page 424180 is uninitialized --- fixing
VACUUM
Time: 138736.347 ms

Then I restarted the process which had issued the insert statement that caused the server panic. Everything runs fine now.

I am worried because I never had any error like this with PostgreSQL before. I just switched to 9.1 and started to run a hot standby server (WAL shipping).
Does this error have any relation to that? Should I check or exchange my hardware? Is it a hardware problem? Should I still worry about it?

regards
Janning

--
Kicktipp GmbH
Venloer Straße 8, 40477 Düsseldorf
Sitz der Gesellschaft: Düsseldorf
Geschäftsführung: Janning Vygen
Handelsregister Düsseldorf: HRB 55639
http://www.kicktipp.de/
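The rebuild-and-swap step Janning describes can be sketched as follows. PostgreSQL 9.1 has no REINDEX CONCURRENTLY, so the usual workaround is build-new / drop-old / rename. The index name and column here are made up for illustration, and the SQL is only printed (feed it to psql yourself once the names match your schema):

```shell
#!/bin/sh
# Hedged sketch of replacing a possibly-corrupt index without blocking writes.
# Index/column names are assumptions, not taken from Janning's schema.
cat <<'SQL'
CREATE INDEX CONCURRENTLY rankingentry_mitglied_idx_new
    ON rankingentry (mitglied_id);
DROP INDEX rankingentry_mitglied_idx;
ALTER INDEX rankingentry_mitglied_idx_new RENAME TO rankingentry_mitglied_idx;
VACUUM rankingentry;
SQL
```

Note that CREATE INDEX CONCURRENTLY cannot run inside a transaction block, so the swap is not atomic; the DROP/RENAME pair takes a brief exclusive lock.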
Hi,

First of all, shut down both servers (you indicated that you have a replica) and make a full copy of both data directories. At the first sign of corruption, that's always a good step as long as it's a practical amount of data (obviously this is more of a challenge if you have terabytes of data).

On Tue, 2012-03-27 at 11:47 +0200, Janning Vygen wrote:
> Hi,
>
> I am running postgresql-9.1 from the Debian backports package with
> fsync=on
> full_page_writes=off

That may be unsafe (and usually is) depending on your I/O system and filesystem. However, because you didn't have any power failures, I don't think this is the cause of the problem.

> I didn't have any power failures on this server.

These WARNINGs below could also be caused by a power failure. Can you verify that no power failure occurred? E.g. check uptime, and maybe look at a few logfiles?

> Now I got this:
>
> 1. Logfile PANIC
>
> postgres[27352]: [4-1] PANIC: corrupted item pointer: offset = 21248,
> size = 16
...
> Then I ran "VACUUM rankingentry" and got:
> kicktipp=# VACUUM rankingentry ;
> WARNING: relation "rankingentry" page 424147 is uninitialized --- fixing
> WARNING: relation "rankingentry" page 424154 is uninitialized --- fixing
> WARNING: relation "rankingentry" page 424155 is uninitialized --- fixing
> WARNING: relation "rankingentry" page 424166 is uninitialized --- fixing
> WARNING: relation "rankingentry" page 424167 is uninitialized --- fixing
> WARNING: relation "rankingentry" page 424180 is uninitialized --- fixing
> VACUUM
> Time: 138736.347 ms
...
> I am worried because I never had any error like this with PostgreSQL. I
> just switched to 9.1 and started to run a hot standby server (WAL
> shipping). Does this error have any relation to that?

Did you get the PANIC and WARNINGs on the primary or the replica? It might be worth doing some comparisons between the two systems. Again, make those copies first, so you have some room to explore to find out what happened.
It seems very unlikely that problems on the master would be caused by the presence of a replication slave.

> Should I check or exchange my hardware? Is it a hardware problem?

It could be.

> Should I still worry about it?

Yes. The WARNINGs might be harmless if it were a power failure, but you say you didn't have a power failure. The PANIC is pretty clearly indicating corruption.

Regards,
	Jeff Davis
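Jeff's "copy first" advice amounts to a cold copy of both clusters. A minimal sketch, assuming a Debian postgresql-9.1 layout (PGDATA path, cluster name, and backup target are all assumptions); the commands are echoed rather than executed so nothing runs by accident:

```shell
#!/bin/sh
# Dry-run sketch: cold-copy a data directory before any corruption analysis.
# Remove the echos to actually run it, on the primary and the standby alike.
PGDATA=/var/lib/postgresql/9.1/main          # assumed Debian default path
BACKUP=/var/backups/pgdata-$(date +%Y%m%d)   # assumed backup target
echo "pg_ctlcluster 9.1 main stop"           # the server must be down for a consistent copy
echo "cp -a $PGDATA $BACKUP"
echo "pg_ctlcluster 9.1 main start"
```

A filesystem-level copy taken while the server is running would be internally inconsistent, which is why the cluster is stopped first.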
Hi,

thanks so much for answering. I found a "segmentation fault" in my logs, so please check below:

> On Tue, 2012-03-27 at 11:47 +0200, Janning Vygen wrote:
>>
>> I am running postgresql-9.1 from the Debian backports package with fsync=on
>> full_page_writes=off
>
> That may be unsafe (and usually is) depending on your I/O system and
> filesystem. However, because you didn't have any power failures, I
> don't think this is the cause of the problem.

I think I should switch to full_page_writes=on. I had turned it off to get maximum performance, because my hard disks are rather cheap.

> These WARNINGs below could also be caused by a power failure. Can
> you verify that no power failure occurred? E.g. check uptime, and
> maybe look at a few logfiles?

The PANIC occurred first on March 19. My server's uptime is 56 days, so it was last booted around February 4. There was no power failure since I started to use this machine, which has been in use since March 7. I checked it twice: no power failure.

But I found more strange things, so let me show you a summary (some things were shortened for readability):

1. Segmentation fault

Mar 13 19:01 LOG: server process (PID 32464) was terminated by signal 11: Segmentation fault
Mar 13 19:01 FATAL: the database system is in recovery mode
Mar 13 19:01 LOG: unexpected pageaddr 22/8D402000 in log file 35, segment 208, offset 4202496
Mar 13 19:01 LOG: redo done at 23/D0401F78
Mar 13 19:01 LOG: last completed transaction was at log time 2012-03-13 19:01:58.667779+01
Mar 13 19:01 LOG: checkpoint starting: end-of-recovery immediate

2. PANICs

Mar 19 22:14 PANIC: corrupted item pointer: offset = 21248, size = 16
Mar 20 23:38 PANIC: corrupted item pointer: offset = 21248, size = 16
Mar 21 23:30 PANIC: corrupted item pointer: offset = 21248, size = 16
Mar 23 02:10 PANIC: corrupted item pointer: offset = 21248, size = 16
Mar 24 06:12 PANIC: corrupted item pointer: offset = 21248, size = 16
Mar 25 01:28 PANIC: corrupted item pointer: offset = 21248, size = 16
Mar 26 22:16 PANIC: corrupted item pointer: offset = 21248, size = 16
Mar 27 09:17 PANIC: corrupted item pointer: offset = 21248, size = 16
Mar 27 09:21 PANIC: corrupted item pointer: offset = 21248, size = 16
Mar 27 09:36 PANIC: corrupted item pointer: offset = 21248, size = 16
Mar 27 09:48 PANIC: corrupted item pointer: offset = 21248, size = 16
Mar 27 10:01 PANIC: corrupted item pointer: offset = 21248, size = 16

I also see that my table rankingentry has not been autovacuumed since the first PANIC on March 19, although it was still autovacuumed after the segmentation fault without error.

3. Then I rebuilt all indexes on this table, dropped the old ones, and ran VACUUM on the table:

WARNING: relation "rankingentry" page 424147 is uninitialized --- fixing
WARNING: relation "rankingentry" page 424154 is uninitialized --- fixing
WARNING: relation "rankingentry" page 424155 is uninitialized --- fixing
WARNING: relation "rankingentry" page 424166 is uninitialized --- fixing
WARNING: relation "rankingentry" page 424167 is uninitialized --- fixing
WARNING: relation "rankingentry" page 424180 is uninitialized --- fixing

After this everything is running just fine. No more problems, just headache.

> Did you get the PANIC and WARNINGs on the primary or the replica? It
> might be worth doing some comparisons between the two systems.

It only happened on my primary server. My backup server has no suspicious log entries.

It seems pretty obvious to me that the segmentation fault is the main reason for the PANICs afterwards.
What can cause a segmentation fault? Is there anything to analyse further?

kind regards
Janning

--
Kicktipp GmbH
Venloer Straße 8, 40477 Düsseldorf
Sitz der Gesellschaft: Düsseldorf
Geschäftsführung: Janning Vygen
Handelsregister Düsseldorf: HRB 55639
http://www.kicktipp.de/
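One detail worth noticing in the PANIC list above: every crash names the identical item pointer (offset = 21248, size = 16), which suggests one damaged page being hit repeatedly rather than spreading corruption. A minimal log triage, with sample lines copied from the excerpt above:

```shell
#!/bin/sh
# Extract the PANIC text and count distinct messages; a single distinct line
# repeated N times points at one bad page, not N independent failures.
grep -o 'PANIC: .*' <<'LOG' | sort | uniq -c
Mar 19 22:14 PANIC: corrupted item pointer: offset = 21248, size = 16
Mar 20 23:38 PANIC: corrupted item pointer: offset = 21248, size = 16
Mar 21 23:30 PANIC: corrupted item pointer: offset = 21248, size = 16
LOG
```

Run against the real logfile, this collapses the twelve PANICs into one counted line, confirming that the same pointer was involved every time.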
On Fri, 2012-03-30 at 16:02 +0200, Janning Vygen wrote:
> The PANIC occurred first on March 19. My server's uptime is 56 days, so
> it was last booted around February 4. There was no power failure since I
> started to use this machine, which has been in use since March 7. I
> checked it twice: no power failure.

Just to be sure: the postgres instance didn't exist before you started to use it, right?

> > Did you get the PANIC and WARNINGs on the primary or the replica? It
> > might be worth doing some comparisons between the two systems.
>
> It only happened on my primary server. My backup server has no suspicious
> log entries.

Do you have a full copy of the two data directories? It might be worth exploring the differences there, but that could be a tedious process.

> It is pretty obvious to me the segmentation fault is the main reason for
> getting the PANIC afterwards. What can cause a segmentation fault? Is
> there anything to analyse further?

It's clear that they are connected, but it's not clear that it was the cause. To speculate: it might be that disk corruption caused the segfault as well as the PANICs.

Do you have any core files? Can you get backtraces?

Regards,
	Jeff Davis
Thank you so much for still helping me...

On 30.03.2012 20:24, Jeff Davis wrote:
> On Fri, 2012-03-30 at 16:02 +0200, Janning Vygen wrote:
>> The PANIC occurred first on March 19. My server's uptime is 56 days, so
>> it was last booted around February 4. There was no power failure since I
>> started to use this machine, which has been in use since March 7. I
>> checked it twice: no power failure.
>
> Just to be sure: the postgres instance didn't exist before you started
> to use it, right?

I don't really understand your question, but it was like this: the OS was installed a few days before, then I installed the postgresql instance. I configured my setup with a backup server via WAL archiving. Then I tested some things and played around with pg_reorg (but I didn't use it until then). Then I dropped the database, shut down my app, installed a fresh dump and restarted the app.

>>> Did you get the PANIC and WARNINGs on the primary or the replica? It
>>> might be worth doing some comparisons between the two systems.
>>
>> It only happened on my primary server. My backup server has no suspicious
>> log entries.
>
> Do you have a full copy of the two data directories? It might be worth
> exploring the differences there, but that could be a tedious process.

Is it still worth making the copy now? At the moment everything is running fine.

>> It is pretty obvious to me the segmentation fault is the main reason for
>> getting the PANIC afterwards. What can cause a segmentation fault? Is
>> there anything to analyse further?
>
> It's clear that they are connected, but it's not clear that it was the
> cause. To speculate: it might be that disk corruption caused the
> segfault as well as the PANICs.
>
> Do you have any core files?

No, I didn't find any in my postgresql dirs. Should I have a core file around when I see a segmentation fault? What should I look for?

> Can you get backtraces?

I have never done that before. But as everything runs fine at the moment, it's quite useless, isn't it?
regards
Janning

> Regards,
> 	Jeff Davis
On Sat, 2012-03-31 at 13:21 +0200, Janning Vygen wrote:
> The OS was installed a few days before, then I installed the postgresql
> instance. I configured my setup with a backup server via WAL archiving.
> Then I tested some things and played around with pg_reorg (but I didn't
> use it until then). Then I dropped the database, shut down my app,
> installed a fresh dump and restarted the app.

Hmm... I wonder if pg_reorg could be responsible for your problem? I know it does a few tricky internal things.

> Is it still worth making the copy now? At the moment everything is
> running fine.

Probably not very useful now.

> No, I didn't find any in my postgresql dirs. Should I have a core file
> around when I see a segmentation fault? What should I look for?

It's an OS setup thing, but generally a crash will generate a core file if it is allowed to. Use "ulimit -c unlimited" on linux in the shell that starts postgresql and I think that will work. You can test it by manually doing a "kill -11" on the pid of a backend process.

> I have never done that before. But as everything runs fine at the moment,
> it's quite useless, isn't it?

I meant a backtrace from the core file. If you don't have a core file, then you won't have this information.

Regards,
	Jeff Davis
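Jeff's recipe (raise the core limit in the shell that starts postgres, then sanity-check with a manual signal 11) can be sketched generically. No paths or cluster names are assumed here; the SIGSEGV check uses a throwaway shell rather than a real backend:

```shell
#!/bin/sh
# Raise the core-file soft limit; this only sticks for processes started from
# this shell, which is why it belongs in the script that launches postgres.
ulimit -c unlimited 2>/dev/null || echo "hard limit forbids raising core size"
echo "core limit: $(ulimit -c)"

# Sanity check: a process killed by SIGSEGV (signal 11) reports exit status
# 128 + 11 = 139, the same status a segfaulting backend would produce.
sh -c 'kill -SEGV $$' || st=$?
echo "exit status: $st"
```

With the limit in place, the next real segfault should leave a core file in the data directory (or wherever kernel.core_pattern points), ready for a gdb backtrace.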
On 06.04.2012 23:49, Jeff Davis wrote:
>> No, I didn't find any in my postgresql dirs. Should I have a core file
>> around when I see a segmentation fault? What should I look for?
>
> It's an OS setup thing, but generally a crash will generate a core file
> if it is allowed to. Use "ulimit -c unlimited" on linux in the shell
> that starts postgresql and I think that will work. You can test it by
> manually doing a "kill -11" on the pid of a backend process.

My system was set up with

$ cat /proc/32741/limits
Limit                     Soft Limit           Hard Limit           Units
...
Max core file size        0                    unlimited            bytes
...

Too bad, no core dump. I will follow the instructions on Peter's blog here:
http://petereisentraut.blogspot.de/2011/06/enabling-core-files-for-postgresql-on.html

So next time I'll be ready to handle this issue. Thanks a lot for your help, Jeff.

regards
Janning
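The limits check above can be scripted: in /proc/&lt;pid&gt;/limits the "Max core file size" label is four words, so the soft and hard limits are fields 5 and 6. This sketch parses a pasted sample mirroring Janning's output; in practice you would point it at /proc/&lt;postmaster pid&gt;/limits:

```shell
#!/bin/sh
# Parse the core-file limit line; soft limit 0 means no core dump will be
# written even though the hard limit would allow one.
awk '/Max core file size/ {print "soft=" $5, "hard=" $6}' <<'LIMITS'
Limit                     Soft Limit           Hard Limit           Units
Max core file size        0                    unlimited            bytes
LIMITS
```

A soft limit of 0 with a hard limit of unlimited is exactly the case here: the running postmaster can't dump core, but a restart under "ulimit -c unlimited" fixes it without any kernel change.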