Thread: clarifying a few error messages
i'm hoping someone might be able to help me understand some of what might have been going on in the environment external to postgres, based on some error messages i just got in the logs of one of my pg installations.

in a period of duress (i.e., the box itself was rebooting and postgres was dying) on one of my servers, i saw the following a few times:

    ERROR:  deadlock detected

shortly afterward, there was this:

    DEBUG:  database system was interrupted at <timestamp>

what can interrupt the database? i've seen it get terminated by signal 9 when the box is failing, but is there any way to know what might have interrupted it? it seems to have died altogether, as it then goes through the checkpoint record, etc.

after it starts up, this:

    DEBUG:  database system is ready
    DEBUG:  server process (pid 882) was terminated by signal 11

does this mean postgres itself segfaulted, or that it received an external SIGSEGV from a critical system process (e.g., the kernel)?

also, is there any significance to the fact that those two statements occurred one right after the other?

a little later, after another recovery, i see these:

    DEBUG:  all server processes terminated; reinitializing shared memory and semaphores
    DEBUG:  database system was interrupted at <timestamp>

finally, it seems to stabilize for a bit. then, a little later, there was a whole spew of garbage characters in the log, immediately preceding another:

    DEBUG:  database system was interrupted at <timestamp>

how would garbage data end up in the log? does that indicate anything about the manner in which postgres was interrupted?

after yet another recovery, i see the following:

    FATAL 1:  The database system is starting up
    FATAL 2:  open of $PGDATA/pg_clog/0419 failed: No such file or directory
    FATAL 2:  open of $PGDATA/pg_clog/0419 failed: No such file or directory
    FATAL 2:  open of $PGDATA/pg_clog/0419 failed: No such file or directory
    FATAL 2:  open of $PGDATA/pg_clog/0419 failed: No such file or directory
    FATAL 2:  open of $PGDATA/pg_clog/0419 failed: No such file or directory
    DEBUG:  server process (pid 945) exited with exit code 2

here, what is exit code 2? does that just mean that postgres found a significant problem with clog files?

then, there was one more of the shared memory/interruption pairs, a final recovery, and smooth sailing again.

i guess i'd like to be able to determine whether this is a system resources issue, and if so, which system resources. is this sequence something that can be prevented in the future via postgresql.conf? more memory?

is there any way of knowing what actually brought down postgres from these messages?

this installation is on a Linux box running kernel 2.4.18 with 1GB RAM.

i can provide postgresql.conf settings upon request.

thanks!

-tfo
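For reference, the two kinds of shutdown message quoted above ("was terminated by signal 11" versus "exited with exit code 2") reflect the standard Unix child-status mechanism: the postmaster waits on its backend processes and reports either the signal that killed a child or the exit status the child chose for itself. The following is a minimal Python sketch of that waitpid() distinction, purely illustrative and not PostgreSQL code:

    import os, signal

    def report(pid, status):
        # mirrors the two postmaster log lines quoted above
        if os.WIFSIGNALED(status):
            print("server process (pid %d) was terminated by signal %d"
                  % (pid, os.WTERMSIG(status)))
        elif os.WIFEXITED(status):
            print("server process (pid %d) exited with exit code %d"
                  % (pid, os.WEXITSTATUS(status)))

    # child 1: killed by a signal (stand-in for a backend that segfaulted)
    pid = os.fork()
    if pid == 0:
        os.kill(os.getpid(), signal.SIGSEGV)
    report(*os.waitpid(pid, 0))

    # child 2: exits on its own with a nonzero status (stand-in for a backend
    # that hit a FATAL error, such as the missing pg_clog segment, and exited)
    pid = os.fork()
    if pid == 0:
        os._exit(2)
    report(*os.waitpid(pid, 0))

In other words, "terminated by signal 11" means the kernel delivered SIGSEGV to that backend, while a nonzero exit code is something the backend reported itself after detecting a fatal condition.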
I think I forgot to mention that this is pg 7.2.3...

-tfo

In article <tfo-C7DADB.15104309012003@news.hub.org>, Thomas O'Connell <tfo@monsterlabs.com> wrote:

> i'm hoping someone might be able to help me understand some of what
> might have been going on in the environment external to postgres based on
> some error messages i just got in the logs of one of my pg installations.
So I've managed to determine that the interrupt messages most likely coincided with the server reboots. Could the same thing have caused the signal 11? An unexpected external event?

And is exit code 2 just related to the bad clog?

-tfo

In article <tfo-C7DADB.15104309012003@news.hub.org>, Thomas O'Connell <tfo@monsterlabs.com> wrote:

> i'm hoping someone might be able to help me understand some of what
> might have been going on in the environment external to postgres based on
> some error messages i just got in the logs of one of my pg installations.
"Thomas O'Connell" <tfo@monsterlabs.com> writes: > So I've managed to determine that the interrupt messages most likely > coincided with the server reboots. > Could the same thing have caused the signal 11? An unexpected external > event? My guess is that you've got hardware problems, most likely bad RAM. The SIGSEGV is probably a side-effect of RAM dropping bits unexpectedly --- for example, the value of a pointer stored in memory might have changed so that it appears to point outside Postgres' valid address space, leading to SIGSEGV next time the pointer is used. The fact that you're seeing unexpected reboots is what points the finger at the hardware; evidently the kernel is suffering the same kinds of problems. (Or you could believe that your hardware is okay and both the kernel and Postgres have suddenly developed severe bugs; but the hardware theory seems much more plausible.) > And is exit code 2 just related to the bad clog? Yes. This part looks like corrupted data on disk :-( ... likely also a side effect of busted RAM. Probably the RAM corrupted a page image that was sitting in an in-memory buffer, and then it got written out before any other problem was noticed. I hope you have a recent good backup that you can restore from after you fix your hardware. I would not trust what's presently on your disk if I were you. regards, tom lane
Well, here's my concern: the first postgres outage was caused by the server rebooting itself. The rest of the server reboots, a few of which took postgres with them, were caused pre-emptively by our sysadmin.

In looking at snapshots of the activity on the machine surrounding recent outages (of either postgres or the whole box), it seems that postgres is one of the culprits. Right before it went out, memory was almost exhausted. I've seen the signal 9 error before, which results from a server under duress, right?

Could it not be a vicious cycle? I.e., postgres begins consuming tremendous resources on a machine, the kernel gets frightened and starts killing procs, including postgres, and reboots? The reboots don't occur during periods of light load, only when there are high numbers of both httpd and postgres connections running.

I'm a little suspicious of blaming the hardware. I think it's more likely an extremely stressful server environment. I'm just trying to figure out where to turn next for the diagnostics. Most recently, the memory usage issue came to light.

As for the bad data on disk, I've got a backup, but how severe are we talking? By not trusting it, do you mean that it could be flagrantly wrong (i.e., truly corrupted; bad data), or just out of sync with whatever writes were last occurring?

-tfo

On Monday, January 13, 2003, at 02:03, Tom Lane wrote:

> My guess is that you've got hardware problems, most likely bad RAM. The
> SIGSEGV is probably a side-effect of RAM dropping bits unexpectedly ---
> for example, the value of a pointer stored in memory might have changed
> so that it appears to point outside Postgres' valid address space,
> leading to SIGSEGV next time the pointer is used.
>
> I hope you have a recent good backup that you can restore from after you
> fix your hardware. I would not trust what's presently on your disk if I
> were you.
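On the memory-pressure theory: a rough upper bound on postgres' footprint in the 7.2 era can be estimated from postgresql.conf, since shared memory is roughly shared_buffers plus wal_buffers (counted in 8 kB blocks) and each backend can additionally use on the order of sort_mem per sort plus a few MB of per-process overhead. A back-of-the-envelope sketch with made-up settings (substitute the real values from the conf file):

    BLCKSZ = 8192              # default 8 kB block size

    # hypothetical postgresql.conf values -- substitute the real ones
    shared_buffers = 8192      # in BLCKSZ blocks -> ~64 MB here
    wal_buffers = 8            # in BLCKSZ blocks
    max_connections = 256
    sort_mem_kb = 4096         # per sort operation, per backend
    backend_overhead_mb = 4    # rough guess at per-process overhead

    shared_mb = (shared_buffers + wal_buffers) * BLCKSZ / (1024 * 1024)
    worst_case_private_mb = max_connections * (sort_mem_kb / 1024 + backend_overhead_mb)

    print("shared memory: ~%.0f MB" % shared_mb)
    print("worst-case private memory: ~%.0f MB" % worst_case_private_mb)
    print("total against 1 GB of RAM (plus httpd!): ~%.0f MB"
          % (shared_mb + worst_case_private_mb))

If that worst case exceeds physical RAM once the httpd processes are added in, the 2.4 kernel's out-of-memory killer starts sending SIGKILL, which would match the earlier "terminated by signal 9" messages; lowering sort_mem or max_connections, or adding RAM, shrinks the bound.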
"Thomas F.O'Connell" <tfo@monsterlabs.com> writes: > As for the bad data on disk, I've got a backup, but how severe are we > talking? By not trusting it, do you mean that it could be flagrantly > wrong (i.e., truly corrupted; bad data), or just out of sync with > whatever writes were last occurring? The high clog numbers you were quoting suggest completely-trashed transaction ID fields (I'm assuming that the files actually present in pg_clog have numbers nowhere near that). I've only seen that happen in the context of completely-trashed disk pages. You could possibly do some investigation for yourself using pg_filedump. regards, tom lane