Thread: PANIC: corrupted item pointer
Hi,

I am running postgresql-9.1 from the Debian backports package with

  fsync=on
  full_page_writes=off

I didn't have any power failures on this server. Now I got this:

1. Logfile PANIC

postgres[27352]: [4-1] PANIC: corrupted item pointer: offset = 21248, size = 16
postgres[27352]: [4-2] STATEMENT: insert into RankingEntry (rankingentry_mitglied_name, rankingentry_spieltagspunkte, rankingentry_gesamtpunkte, rankingentry_spieltagssiege, rankingentry_spieltagssieger, tippspieltag_id, mitglied_id) values ($1, $2, $3, $4, $5, $6, $7)
postgres[26286]: [2-1] LOG: server process (PID 27352) was terminated by signal 6: Aborted
postgres[26286]: [3-1] LOG: terminating any other active server processes

2. All my database connections are closed after this log entry.

3. My application is throwing lots of java.io.EOFException because of this.

Sometimes I get exactly the same behaviour but without no. 1: no PANIC is logged, but all connections are suddenly closed with an EOFException.

I searched the archives and found
http://archives.postgresql.org/pgsql-general/2007-06/msg01268.php

So I first rebuilt all indexes on table "rankingentry" concurrently and replaced the old ones. No errors. Then I ran "VACUUM rankingentry" and got:

kicktipp=# VACUUM rankingentry ;
WARNING: relation "rankingentry" page 424147 is uninitialized --- fixing
WARNING: relation "rankingentry" page 424154 is uninitialized --- fixing
WARNING: relation "rankingentry" page 424155 is uninitialized --- fixing
WARNING: relation "rankingentry" page 424166 is uninitialized --- fixing
WARNING: relation "rankingentry" page 424167 is uninitialized --- fixing
WARNING: relation "rankingentry" page 424180 is uninitialized --- fixing
VACUUM
Time: 138736.347 ms

Then I restarted the process which had issued the insert statement that caused the server panic. Everything runs fine now.

I am worried because I never had any error like this with PostgreSQL before. I just switched to 9.1 and started to run a hot standby server (WAL shipping).
Does this error have any relation to that? Should I check or exchange my hardware? Is it a hardware problem? Should I still worry about it?

regards
Janning

--
Kicktipp GmbH
Venloer Straße 8, 40477 Düsseldorf
Sitz der Gesellschaft: Düsseldorf
Geschäftsführung: Janning Vygen
Handelsregister Düsseldorf: HRB 55639
http://www.kicktipp.de/
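The rebuild-and-swap step Janning describes can be sketched as follows. PostgreSQL 9.1 has no REINDEX CONCURRENTLY, so the usual workaround is build-new / drop-old / rename. The index name and column here are made up for illustration, and the SQL is only printed (feed it to psql yourself once the names match your schema):

```shell
#!/bin/sh
# Hedged sketch of replacing a possibly-corrupt index without blocking writes.
# Index/column names are assumptions, not taken from Janning's schema.
cat <<'SQL'
CREATE INDEX CONCURRENTLY rankingentry_mitglied_idx_new
    ON rankingentry (mitglied_id);
DROP INDEX rankingentry_mitglied_idx;
ALTER INDEX rankingentry_mitglied_idx_new RENAME TO rankingentry_mitglied_idx;
VACUUM rankingentry;
SQL
```

Note that CREATE INDEX CONCURRENTLY cannot run inside a transaction block, so the swap is not atomic; the DROP/RENAME pair takes a brief exclusive lock.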
Hi,

First of all, shut down both servers (you indicated that you have a replica) and make a full copy of both data directories. At the first sign of corruption, that's always a good step as long as it's a practical amount of data (obviously this is more of a challenge if you have terabytes of data).

On Tue, 2012-03-27 at 11:47 +0200, Janning Vygen wrote:
> Hi,
>
> I am running postgresql-9.1 from the Debian backports package with
> fsync=on
> full_page_writes=off

That may be unsafe (and usually is) depending on your I/O system and filesystem. However, because you didn't have any power failures, I don't think this is the cause of the problem.

> I didn't have any power failures on this server.

These WARNINGs below could also be caused by a power failure. Can you verify that no power failure occurred? E.g. check uptime, and maybe look at a few logfiles?

> Now I got this:
>
> 1. Logfile PANIC
>
> postgres[27352]: [4-1] PANIC: corrupted item pointer: offset = 21248,
> size = 16
...
> Then I ran "VACUUM rankingentry" and got:
> kicktipp=# VACUUM rankingentry ;
> WARNING: relation "rankingentry" page 424147 is uninitialized --- fixing
> WARNING: relation "rankingentry" page 424154 is uninitialized --- fixing
> WARNING: relation "rankingentry" page 424155 is uninitialized --- fixing
> WARNING: relation "rankingentry" page 424166 is uninitialized --- fixing
> WARNING: relation "rankingentry" page 424167 is uninitialized --- fixing
> WARNING: relation "rankingentry" page 424180 is uninitialized --- fixing
> VACUUM
> Time: 138736.347 ms
...
> I am worried because I never had any error like this with PostgreSQL. I
> just switched to 9.1 and started to run a hot standby server (WAL
> shipping). Does this error have any relation to that?

Did you get the PANIC and WARNINGs on the primary or the replica? It might be worth doing some comparisons between the two systems. Again, make those copies first, so you have some room to explore to find out what happened.
It seems very unlikely that problems on the master would be caused by the presence of a replication slave.

> Should I check or exchange my hardware? Is it a hardware problem?

It could be.

> Should I still worry about it?

Yes. The WARNINGs might be harmless if it were a power failure, but you say you didn't have a power failure. The PANIC is pretty clearly indicating corruption.

Regards,
	Jeff Davis
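Jeff's "copy first" advice amounts to a cold copy of both clusters. A minimal sketch, assuming a Debian postgresql-9.1 layout (PGDATA path, cluster name, and backup target are all assumptions); the commands are echoed rather than executed so nothing runs by accident:

```shell
#!/bin/sh
# Dry-run sketch: cold-copy a data directory before any corruption analysis.
# Remove the echos to actually run it, on the primary and the standby alike.
PGDATA=/var/lib/postgresql/9.1/main          # assumed Debian default path
BACKUP=/var/backups/pgdata-$(date +%Y%m%d)   # assumed backup target
echo "pg_ctlcluster 9.1 main stop"           # the server must be down for a consistent copy
echo "cp -a $PGDATA $BACKUP"
echo "pg_ctlcluster 9.1 main start"
```

A filesystem-level copy taken while the server is running would be internally inconsistent, which is why the cluster is stopped first.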
Hi,

thanks so much for answering. I found a "segmentation fault" in my logs, so please check below:

> On Tue, 2012-03-27 at 11:47 +0200, Janning Vygen wrote:
>>
>> I am running postgresql-9.1 from the Debian backports package with fsync=on
>> full_page_writes=off
>
> That may be unsafe (and usually is) depending on your I/O system and
> filesystem. However, because you didn't have any power failures, I
> don't think this is the cause of the problem.

I think I should switch to full_page_writes=on. I had turned it off to get maximum performance, because my hard disks are rather cheap.

> These WARNINGs below could also be caused by a power failure. Can
> you verify that no power failure occurred? E.g. check uptime, and
> maybe look at a few logfiles?

The PANIC occurred first on March 19. My server's uptime is 56 days, so it was last booted around February 4. There was no power failure since I started to use this machine, which has been in use since March 7. I checked it twice: no power failure.

But I found more strange things, so let me show you a summary (some things were shortened for readability):

1. Segmentation fault

Mar 13 19:01 LOG: server process (PID 32464) was terminated by signal 11: Segmentation fault
Mar 13 19:01 FATAL: the database system is in recovery mode
Mar 13 19:01 LOG: unexpected pageaddr 22/8D402000 in log file 35, segment 208, offset 4202496
Mar 13 19:01 LOG: redo done at 23/D0401F78
Mar 13 19:01 LOG: last completed transaction was at log time 2012-03-13 19:01:58.667779+01
Mar 13 19:01 LOG: checkpoint starting: end-of-recovery immediate

2. PANICs

Mar 19 22:14 PANIC: corrupted item pointer: offset = 21248, size = 16
Mar 20 23:38 PANIC: corrupted item pointer: offset = 21248, size = 16
Mar 21 23:30 PANIC: corrupted item pointer: offset = 21248, size = 16
Mar 23 02:10 PANIC: corrupted item pointer: offset = 21248, size = 16
Mar 24 06:12 PANIC: corrupted item pointer: offset = 21248, size = 16
Mar 25 01:28 PANIC: corrupted item pointer: offset = 21248, size = 16
Mar 26 22:16 PANIC: corrupted item pointer: offset = 21248, size = 16
Mar 27 09:17 PANIC: corrupted item pointer: offset = 21248, size = 16
Mar 27 09:21 PANIC: corrupted item pointer: offset = 21248, size = 16
Mar 27 09:36 PANIC: corrupted item pointer: offset = 21248, size = 16
Mar 27 09:48 PANIC: corrupted item pointer: offset = 21248, size = 16
Mar 27 10:01 PANIC: corrupted item pointer: offset = 21248, size = 16

I also see that my table rankingentry has not been autovacuumed since the first PANIC on March 19, although it was still autovacuumed after the segmentation fault without error.

3. Then I rebuilt all indexes on this table, dropped the old ones, and ran VACUUM on the table:

WARNING: relation "rankingentry" page 424147 is uninitialized --- fixing
WARNING: relation "rankingentry" page 424154 is uninitialized --- fixing
WARNING: relation "rankingentry" page 424155 is uninitialized --- fixing
WARNING: relation "rankingentry" page 424166 is uninitialized --- fixing
WARNING: relation "rankingentry" page 424167 is uninitialized --- fixing
WARNING: relation "rankingentry" page 424180 is uninitialized --- fixing

After this everything is running just fine. No more problems, just headache.

> Did you get the PANIC and WARNINGs on the primary or the replica? It
> might be worth doing some comparisons between the two systems.

It only happened on my primary server. My backup server has no suspicious log entries.

It seems pretty obvious to me that the segmentation fault is the main reason for the PANICs afterwards.
What can cause a segmentation fault? Is there anything to analyse further?

kind regards
Janning

--
Kicktipp GmbH
Venloer Straße 8, 40477 Düsseldorf
Sitz der Gesellschaft: Düsseldorf
Geschäftsführung: Janning Vygen
Handelsregister Düsseldorf: HRB 55639
http://www.kicktipp.de/
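One detail worth noticing in the PANIC list above: every crash names the identical item pointer (offset = 21248, size = 16), which suggests one damaged page being hit repeatedly rather than spreading corruption. A minimal log triage, with sample lines copied from the excerpt above:

```shell
#!/bin/sh
# Extract the PANIC text and count distinct messages; a single distinct line
# repeated N times points at one bad page, not N independent failures.
grep -o 'PANIC: .*' <<'LOG' | sort | uniq -c
Mar 19 22:14 PANIC: corrupted item pointer: offset = 21248, size = 16
Mar 20 23:38 PANIC: corrupted item pointer: offset = 21248, size = 16
Mar 21 23:30 PANIC: corrupted item pointer: offset = 21248, size = 16
LOG
```

Run against the real logfile, this collapses the twelve PANICs into one counted line, confirming that the same pointer was involved every time.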
On Fri, 2012-03-30 at 16:02 +0200, Janning Vygen wrote:
> The PANIC occurred first on March 19. My server's uptime is 56 days, so
> it was last booted around February 4. There was no power failure since I
> started to use this machine, which has been in use since March 7. I
> checked it twice: no power failure.

Just to be sure: the postgres instance didn't exist before you started to use it, right?

> > Did you get the PANIC and WARNINGs on the primary or the replica? It
> > might be worth doing some comparisons between the two systems.
>
> It only happened on my primary server. My backup server has no suspicious
> log entries.

Do you have a full copy of the two data directories? It might be worth exploring the differences there, but that could be a tedious process.

> It is pretty obvious to me the segmentation fault is the main reason for
> getting the PANIC afterwards. What can cause a segmentation fault? Is
> there anything to analyse further?

It's clear that they are connected, but it's not clear that it was the cause. To speculate: it might be that disk corruption caused the segfault as well as the PANICs.

Do you have any core files? Can you get backtraces?

Regards,
	Jeff Davis
Thank you so much for still helping me...

On 30.03.2012 20:24, Jeff Davis wrote:
> On Fri, 2012-03-30 at 16:02 +0200, Janning Vygen wrote:
>> The PANIC occurred first on March 19. My server's uptime is 56 days, so
>> it was last booted around February 4. There was no power failure since I
>> started to use this machine, which has been in use since March 7. I
>> checked it twice: no power failure.
>
> Just to be sure: the postgres instance didn't exist before you started
> to use it, right?

I don't really understand your question, but it was like this: the OS was installed a few days before, then I installed the postgresql instance. I configured my setup with a backup server via WAL archiving. Then I tested some things and played around with pg_reorg (but I didn't use it until then). Then I dropped the database, shut down my app, installed a fresh dump and restarted the app.

>>> Did you get the PANIC and WARNINGs on the primary or the replica? It
>>> might be worth doing some comparisons between the two systems.
>>
>> It only happened on my primary server. My backup server has no suspicious
>> log entries.
>
> Do you have a full copy of the two data directories? It might be worth
> exploring the differences there, but that could be a tedious process.

Is it still worth making the copy now? At the moment everything is running fine.

>> It is pretty obvious to me the segmentation fault is the main reason for
>> getting the PANIC afterwards. What can cause a segmentation fault? Is
>> there anything to analyse further?
>
> It's clear that they are connected, but it's not clear that it was the
> cause. To speculate: it might be that disk corruption caused the
> segfault as well as the PANICs.
>
> Do you have any core files?

No, I didn't find any in my postgresql dirs. Should I have a core file around when I see a segmentation fault? What should I look for?

> Can you get backtraces?

I have never done that before. But as everything runs fine at the moment, it's quite useless, isn't it?
regards
Janning

> Regards,
> 	Jeff Davis
On Sat, 2012-03-31 at 13:21 +0200, Janning Vygen wrote:
> The OS was installed a few days before, then I installed the postgresql
> instance. I configured my setup with a backup server via WAL archiving.
> Then I tested some things and played around with pg_reorg (but I didn't
> use it until then). Then I dropped the database, shut down my app,
> installed a fresh dump and restarted the app.

Hmm... I wonder if pg_reorg could be responsible for your problem? I know it does a few tricky internal things.

> Is it still worth making the copy now? At the moment everything is
> running fine.

Probably not very useful now.

> No, I didn't find any in my postgresql dirs. Should I have a core file
> around when I see a segmentation fault? What should I look for?

It's an OS setup thing, but generally a crash will generate a core file if it is allowed to. Use "ulimit -c unlimited" on linux in the shell that starts postgresql and I think that will work. You can test it by manually doing a "kill -11" on the pid of a backend process.

> I have never done that before. But as everything runs fine at the moment,
> it's quite useless, isn't it?

I meant a backtrace from the core file. If you don't have a core file, then you won't have this information.

Regards,
	Jeff Davis
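Jeff's recipe (raise the core limit in the shell that starts postgres, then sanity-check with a manual signal 11) can be sketched generically. No paths or cluster names are assumed here; the SIGSEGV check uses a throwaway shell rather than a real backend:

```shell
#!/bin/sh
# Raise the core-file soft limit; this only sticks for processes started from
# this shell, which is why it belongs in the script that launches postgres.
ulimit -c unlimited 2>/dev/null || echo "hard limit forbids raising core size"
echo "core limit: $(ulimit -c)"

# Sanity check: a process killed by SIGSEGV (signal 11) reports exit status
# 128 + 11 = 139, the same status a segfaulting backend would produce.
sh -c 'kill -SEGV $$' || st=$?
echo "exit status: $st"
```

With the limit in place, the next real segfault should leave a core file in the data directory (or wherever kernel.core_pattern points), ready for a gdb backtrace.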
On 06.04.2012 23:49, Jeff Davis wrote:
>> No, I didn't find any in my postgresql dirs. Should I have a core file
>> around when I see a segmentation fault? What should I look for?
>
> It's an OS setup thing, but generally a crash will generate a core file
> if it is allowed to. Use "ulimit -c unlimited" on linux in the shell
> that starts postgresql and I think that will work. You can test it by
> manually doing a "kill -11" on the pid of a backend process.

My system was set up with

$ cat /proc/32741/limits
Limit                     Soft Limit           Hard Limit           Units
...
Max core file size        0                    unlimited            bytes
...

Too bad, no core dump. I will follow the instructions on Peter's blog here:
http://petereisentraut.blogspot.de/2011/06/enabling-core-files-for-postgresql-on.html

So next time I'll be ready to handle this issue. Thanks a lot for your help, Jeff.

regards
Janning
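The limits check above can be scripted: in /proc/&lt;pid&gt;/limits the "Max core file size" label is four words, so the soft and hard limits are fields 5 and 6. This sketch parses a pasted sample mirroring Janning's output; in practice you would point it at /proc/&lt;postmaster pid&gt;/limits:

```shell
#!/bin/sh
# Parse the core-file limit line; soft limit 0 means no core dump will be
# written even though the hard limit would allow one.
awk '/Max core file size/ {print "soft=" $5, "hard=" $6}' <<'LIMITS'
Limit                     Soft Limit           Hard Limit           Units
Max core file size        0                    unlimited            bytes
LIMITS
```

A soft limit of 0 with a hard limit of unlimited is exactly the case here: the running postmaster can't dump core, but a restart under "ulimit -c unlimited" fixes it without any kernel change.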