Thread: clarifying a few error messages

clarifying a few error messages

From: Thomas O'Connell
Date:
i'm hoping someone might be able to help me understand some of what
might have been going on in the environment external to postgres based on
some error messages i just got in the logs of one of my pg installations.

in a period of duress (i.e., the box itself was rebooting and postgres
was dying) on one of my servers, i saw the following a few times:

ERROR:  deadlock detected

shortly afterward, there was this:

DEBUG:  database system was interrupted at <timestamp>

what can interrupt the database? i've seen it get terminated by signal 9
when the box is failing, but is there any way to know what might have
interrupted it? it seems to have died altogether, as it then goes
through the checkpoint record, etc.

after it starts up, this:

DEBUG:  database system is ready
DEBUG:  server process (pid 882) was terminated by signal 11

does this mean postgres itself segfaulted or that it received an
external SIGSEGV from a critical system process (e.g., the kernel)?

also, is there any significance to the fact that those two statements
occurred one right after the other?
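
for what it's worth, here's a rough python sketch (illustrative only, not
postgres code; the child and the signal are simulated) of how i understand
the postmaster learns that a backend was terminated by a signal: it just
inspects the dead child's wait status.

import os, signal, sys

# toy parent/child pair: the parent does the python equivalent of what the
# postmaster does in C when it logs "terminated by signal N" or
# "exited with exit code N"
pid = os.fork()
if pid == 0:
    # child: deliberately deliver SIGSEGV to ourselves to stand in for a segfault
    os.kill(os.getpid(), signal.SIGSEGV)
    sys.exit(0)  # never reached

_, status = os.waitpid(pid, 0)
if os.WIFSIGNALED(status):
    print("child terminated by signal", os.WTERMSIG(status))
elif os.WIFEXITED(status):
    print("child exited with exit code", os.WEXITSTATUS(status))

as far as i can tell, the wait status alone can't distinguish a genuine
segfault from a SIGSEGV delivered from outside the process.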

a little later, after another recovery, i see these:

DEBUG:  all server processes terminated; reinitializing shared memory
and semaphores
DEBUG:  database system was interrupted at <timestamp>

finally, it seems to stabilize for a bit. then, a little later, there was a
whole spew of garbage characters in the log immediately preceding another:

DEBUG:  database system was interrupted at <timestamp>

how would garbage data end up in the log? does that indicate anything
about the manner in which postgres was interrupted?

after yet another recovery, i see the following:

FATAL 1:  The database system is starting up
FATAL 2:  open of $PGDATA/pg_clog/0419 failed: No such file or directory
FATAL 2:  open of $PGDATA/pg_clog/0419 failed: No such file or directory
FATAL 2:  open of $PGDATA/pg_clog/0419 failed: No such file or directory
FATAL 2:  open of $PGDATA/pg_clog/0419 failed: No such file or directory
FATAL 2:  open of $PGDATA/pg_clog/0419 failed: No such file or directory
DEBUG:  server process (pid 945) exited with exit code 2

here, what is exit code 2? does that just mean that postgres found a
significant problem with clog files?
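
in case it's useful, something like this python sketch is what i have in
mind for checking which segment files actually exist in pg_clog versus the
0419 it's asking for (the fallback path here is a guess; substitute the
real $PGDATA):

import os

# assume PGDATA is set in the environment; otherwise fall back to a guessed path
pgdata = os.environ.get("PGDATA", "/usr/local/pgsql/data")
clog_dir = os.path.join(pgdata, "pg_clog")

# list the clog segment files that are actually present, with their sizes,
# so the missing 0419 can be compared against them
for name in sorted(os.listdir(clog_dir)):
    print(name, os.path.getsize(os.path.join(clog_dir, name)), "bytes")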

then, there was one more of the shared memory/interruption pairs, a
final recovery, and smooth sailing again.

i guess i'd like to be able to determine if this is a system resources
issue, and if so, which system resources. is this sequence something
that can be prevented in the future via postgresql.conf? more memory?

is there any way of knowing what actually brought down postgres from
these messages?

this installation is on a Linux box running kernel 2.4.18 with 1GB RAM.

i can provide postgresql.conf settings upon request.

thanks!

-tfo

Re: clarifying a few error messages

From: Thomas O'Connell
Date:
I think I forgot to mention that this is in pg7.2.3...

-tfo

In article <tfo-C7DADB.15104309012003@news.hub.org>,
 Thomas O'Connell <tfo@monsterlabs.com> wrote:

> i'm hoping someone might be able to help me understand some of what
> might have been going on in the environment external to postgres based on
> some error messages i just got in the logs of one of my pg installations.

Re: clarifying a few error messages

From: Thomas O'Connell
Date:
So I've managed to determine that the interrupt messages most likely
coincided with the server reboots.

Could the same thing have caused the signal 11? An unexpected external
event?

And is exit code 2 just related to the bad clog?

-tfo

In article <tfo-C7DADB.15104309012003@news.hub.org>,
 Thomas O'Connell <tfo@monsterlabs.com> wrote:

> i'm hoping someone might be able to help me understand some of what
> might have been going on in the environment external to postgres based on
> some error messages i just got in the logs of one of my pg installations.
>
> in a period of duress (i.e., the box itself was rebooting and postgres
> was dying) on one of my servers, i saw the following a few times:
>
> ERROR:  deadlock detected
>
> shortly afterward, there was this:
>
> DEBUG:  database system was interrupted at <timestamp>
>
> what can interrupt the database? i've seen it get terminated by signal 9
> when the box is failing, but is there any way to know what might have
> interrupted it? it seems to have died altogether, as it then goes
> through the checkpoint record, etc.
>
> after it starts up, this:
>
> DEBUG:  database system is ready
> DEBUG:  server process (pid 882) was terminated by signal 11
>
> does this mean postgres itself segfaulted or that it received an
> external SIGSEGV from a critical system process (e.g., the kernel)?
>
> also, is there any significance to the fact that those two statements
> occurred one right after the other?
>
> a little later, after another recovery, i see these:
>
> DEBUG:  all server processes terminated; reinitializing shared memory
> and semaphores
> DEBUG:  database system was interrupted at <timestamp>
>
> finally, it seems to stabilize for a bit. then, a little later, there was a
> whole spew of garbage characters in the log immediately preceding another:
>
> DEBUG:  database system was interrupted at <timestamp>
>
> how would garbage data end up in the log? does that indicate anything
> about the manner in which postgres was interrupted?
>
> after yet another recovery, i see the following:
>
> FATAL 1:  The database system is starting up
> FATAL 2:  open of $PGDATA/pg_clog/0419 failed: No such file or directory
> FATAL 2:  open of $PGDATA/pg_clog/0419 failed: No such file or directory
> FATAL 2:  open of $PGDATA/pg_clog/0419 failed: No such file or directory
> FATAL 2:  open of $PGDATA/pg_clog/0419 failed: No such file or directory
> FATAL 2:  open of $PGDATA/pg_clog/0419 failed: No such file or directory
> DEBUG:  server process (pid 945) exited with exit code 2
>
> here, what is exit code 2? does that just mean that postgres found a
> significant problem with clog files?
>
> then, there was one more of the shared memory/interruption pairs, a
> final recovery, and smooth sailing again.
>
> i guess i'd like to be able to determine if this is a system resources
> issue, and if so, which system resources. is this sequence something
> that can be prevented in the future via postgresql.conf? more memory?
>
> is there any way of knowing what actually brought down postgres from
> these messages?
>
> this installation is on a Linux box running kernel 2.4.18 with 1GB RAM.
>
> i can provide postgresql.conf settings upon request.
>
> thanks!
>
> -tfo

Re: clarifying a few error messages

From: Tom Lane
Date:
"Thomas O'Connell" <tfo@monsterlabs.com> writes:
> So I've managed to determine that the interrupt messages most likely
> coincided with the server reboots.
> Could the same thing have caused the signal 11? An unexpected external
> event?

My guess is that you've got hardware problems, most likely bad RAM.  The
SIGSEGV is probably a side-effect of RAM dropping bits unexpectedly ---
for example, the value of a pointer stored in memory might have changed
so that it appears to point outside Postgres' valid address space,
leading to SIGSEGV next time the pointer is used.

The fact that you're seeing unexpected reboots is what points the finger
at the hardware; evidently the kernel is suffering the same kinds of
problems.  (Or you could believe that your hardware is okay and both the
kernel and Postgres have suddenly developed severe bugs; but the
hardware theory seems much more plausible.)

> And is exit code 2 just related to the bad clog?

Yes.  This part looks like corrupted data on disk :-( ... likely also a
side effect of busted RAM.  Probably the RAM corrupted a page image that
was sitting in an in-memory buffer, and then it got written out before
any other problem was noticed.

I hope you have a recent good backup that you can restore from after you
fix your hardware.  I would not trust what's presently on your disk if I
were you.

            regards, tom lane

Re: clarifying a few error messages

From: Thomas F. O'Connell
Date:
Well, here's my concern: the first postgres outage was caused by the
server rebooting itself. The rest of the server reboots, a few of which
took postgres with them, were initiated pre-emptively by our sysadmin.

In looking at snapshots of the activity on the machine surrounding
recent outages (of either postgres or the whole box), it seems that
postgres is one of the culprits.

Right before it went out, memory was almost exhausted. I've seen the
signal 9 error before, which results from a server under duress, right?
Could it not be a vicious cycle? I.e., postgres begins consuming
tremendous resources on the machine, the kernel gets frightened and
starts killing processes, including postgres, and the box reboots? The
reboots don't occur during periods of light load, only when there are
high numbers of both httpd and postgres connections running.

I'm a little suspicious of blaming the hardware. I think it's more
likely an extremely stressful server environment. I'm just trying to
figure out where to turn next for the diagnostics. Most recently, the
memory usage issue came to light.
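
To that end, something like this rough Python sketch is what I have in
mind for watching memory around the next outage (the /proc/meminfo field
names and the poll interval are assumptions; adjust for the actual box):

import time

def meminfo():
    # parse /proc/meminfo into {field name: value in kB}
    values = {}
    with open("/proc/meminfo") as f:
        for line in f:
            if ":" not in line:
                continue
            key, rest = line.split(":", 1)
            parts = rest.split()
            if parts and parts[0].isdigit():
                values[key] = int(parts[0])
    return values

# poll every five seconds (ctrl-C to stop) and log free memory and swap,
# so the numbers can be lined up against the postgres and kernel logs later
while True:
    m = meminfo()
    print(time.strftime("%H:%M:%S"),
          "MemFree:", m.get("MemFree"), "kB",
          "SwapFree:", m.get("SwapFree"), "kB")
    time.sleep(5)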

As for the bad data on disk, I've got a backup, but how severe are we
talking? By not trusting it, do you mean that it could be flagrantly
wrong (i.e., truly corrupted; bad data), or just out of sync with
whatever writes were last occurring?

-tfo

On Monday, January 13, 2003, at 02:03, Tom Lane wrote:

> My guess is that you've got hardware problems, most likely bad RAM.  The
> SIGSEGV is probably a side-effect of RAM dropping bits unexpectedly ---
> for example, the value of a pointer stored in memory might have changed
> so that it appears to point outside Postgres' valid address space,
> leading to SIGSEGV next time the pointer is used.
>
> The fact that you're seeing unexpected reboots is what points the finger
> at the hardware; evidently the kernel is suffering the same kinds of
> problems.  (Or you could believe that your hardware is okay and both the
> kernel and Postgres have suddenly developed severe bugs; but the
> hardware theory seems much more plausible.)
>
>> And is exit code 2 just related to the bad clog?
>
> Yes.  This part looks like corrupted data on disk :-( ... likely also a
> side effect of busted RAM.  Probably the RAM corrupted a page image that
> was sitting in an in-memory buffer, and then it got written out before
> any other problem was noticed.
>
> I hope you have a recent good backup that you can restore from after you
> fix your hardware.  I would not trust what's presently on your disk if I
> were you.
>
>             regards, tom lane


Re: clarifying a few error messages

From: Tom Lane
Date:
"Thomas F.O'Connell" <tfo@monsterlabs.com> writes:
> As for the bad data on disk, I've got a backup, but how severe are we
> talking? By not trusting it, do you mean that it could be flagrantly
> wrong (i.e., truly corrupted; bad data), or just out of sync with
> whatever writes were last occurring?

The high clog numbers you were quoting suggest completely-trashed
transaction ID fields (I'm assuming that the files actually present in
pg_clog have numbers nowhere near that).  I've only seen that happen in
the context of completely-trashed disk pages.  You could possibly do
some investigation for yourself using pg_filedump.
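
For what it's worth, the arithmetic behind that goes roughly like this (a
Python sketch assuming the 7.2-era clog layout of two status bits per
transaction, 8K pages, 32 pages per SLRU segment, and the segment number
as the file name in hex; check the sources if you want the exact figures):

# which transaction IDs a pg_clog segment file would cover
BLCKSZ = 8192                    # bytes per clog page
CLOG_XACTS_PER_BYTE = 4          # 2 status bits per transaction
CLOG_XACTS_PER_PAGE = BLCKSZ * CLOG_XACTS_PER_BYTE
SLRU_PAGES_PER_SEGMENT = 32
XACTS_PER_SEGMENT = CLOG_XACTS_PER_PAGE * SLRU_PAGES_PER_SEGMENT

segment = int("0419", 16)        # file name taken as the segment number in hex
first_xid = segment * XACTS_PER_SEGMENT
last_xid = first_xid + XACTS_PER_SEGMENT - 1
print("pg_clog/0419 would cover transaction IDs %d .. %d" % (first_xid, last_xid))

That puts the referenced segment up around transaction ID 1.1 billion,
vastly more transactions than your installation has plausibly executed,
which is why a request for that file smells like a trashed XID rather
than a legitimately missing clog segment.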

            regards, tom lane