Thread: URGENT: Database keeps crashing - suspect damaged RAM

From
"Markus Wollny"
Date:
Hello!

I just installed PostgreSQL 7.2.1 on SuSE 7.3, 4xPIIIXEON 550MHz, 2GB
RAM, 5x18GB SCSI RAID. The OS was freshly installed, after that I
compiled and installed PostgreSQL from source (./configure
--prefix=/opt/pgsql/ --with-perl --enable-odbc --enable-locale
--enable-syslog). I copied the settings in postgresql.conf etc. from an
identical machine running the identical platform. Then I imported a
database to the new installation. The import seems to have been successful;
I didn't get any errors during import. A subsequent vacuum analyze
finished without anything out of the ordinary.

Just a few minutes after this vacuum analyze, the database crashed for
the first time. It keeps crashing every now and then - every one or two
minutes.

What puzzles me is the fact that this very same machine was running
Oracle 8i on Win2k more or less flawlessly just up to a few hours before
- more or less meaning that we never really noticed anything much out of
the ordinary. There might have been some minor issues after a
RAM-upgrade from 1 GB to 2 GB just a week ago, but looking back it's
hard to say if that could be due to bad RAM or just some bad code which
we've sorted out (or disposed of) by now. As the machine is already
running Linux and PostgreSQL it's quite impossible to prove my suspicion
by going back to Oracle and having a closer look.

What I'd like to know is if I need to look any further than RAM - shall
I just chuck the new modules out of the machine? Or is there some other
issue that could cause this behaviour? I am quite sure that I didn't do
anything wrong during installation, configuration and import and the
same application code is running without errors on a different machine
at this very moment. I don't like the "record with zero length" and
"Cannot allocate memory"-bits in the logfile at all, let alone the "was
terminated by signal 9"-thingy.

So: Is it bad RAM? How can I make sure? What else could it be?

Here's a small excerpt from the logfile:

2002-08-06 17:31:38 [17063]  DEBUG:  Pages 0: Changed 0, Empty 0; Tup 0:
Vac 0, Keep 0, UnUsed 0.
        Total CPU 0.00s/0.00u sec elapsed 0.00 sec.
2002-08-06 17:36:23 [17296]  DEBUG:  _mdfd_blind_getseg: couldn't open
/var/lib/pgsql/data/base/base/16596/16671: Cannot allocate memory
2002-08-06 17:36:24 [17296]  FATAL 2:  cannot write block 13387 of
16596/16671 blind: Cannot allocate memory
2002-08-06 17:36:24 [16530]  DEBUG:  server process (pid 17296) exited
with exit code 2
2002-08-06 17:36:24 [16530]  DEBUG:  terminating any other active server
processes
2002-08-06 17:36:24 [17081]  NOTICE:  Message from PostgreSQL backend:
        The Postmaster has informed me that some other backend
        died abnormally and possibly corrupted shared memory.
        I have rolled back the current transaction and am
        going to terminate your database system connection and exit.
[...]
2002-08-06 17:36:24 [16530]  DEBUG:  all server processes terminated;
reinitializing shared memory and semaphores
2002-08-06 17:36:24 [17298]  DEBUG:  database system was interrupted at
2002-08-06 17:31:21 CEST
2002-08-06 17:36:24 [17298]  DEBUG:  checkpoint record is at 0/325D7C78
2002-08-06 17:36:24 [17298]  DEBUG:  redo record is at 0/325D7C78; undo
record is at 0/0; shutdown FALSE
2002-08-06 17:36:24 [17298]  DEBUG:  next transaction id: 2270; next
oid: 901292
2002-08-06 17:36:24 [17298]  DEBUG:  database system was not properly
shut down; automatic recovery in progress
2002-08-06 17:36:24 [17298]  DEBUG:  redo starts at 0/325D7CB8
2002-08-06 17:36:25 [17298]  DEBUG:  ReadRecord: record with zero length
at 0/326E16C4
2002-08-06 17:36:25 [17298]  DEBUG:  redo done at 0/326E16A0
2002-08-06 17:36:30 [17298]  DEBUG:  database system is ready
2002-08-06 17:40:53 [16530]  DEBUG:  connection startup failed (fork
failure): Cannot allocate memory
2002-08-06 17:52:50 [16530]  DEBUG:  connection startup failed (fork
failure): Cannot allocate memory
2002-08-06 17:52:54 [16530]  DEBUG:  server process (pid 18237) was
terminated by signal 9
2002-08-06 17:52:54 [16530]  DEBUG:  terminating any other active server
processes
2002-08-06 17:52:54 [18234]  NOTICE:  Message from PostgreSQL backend:
        The Postmaster has informed me that some other backend
        died abnormally and possibly corrupted shared memory.
        I have rolled back the current transaction and am
        going to terminate your database system connection and exit.
[...]
2002-08-06 17:52:57 [18253]  FATAL 1:  The database system is in
recovery mode
2002-08-06 17:52:57 [18255]  FATAL 1:  The database system is in
recovery mode
2002-08-06 17:52:57 [18254]  FATAL 1:  The database system is in
recovery mode
2002-08-06 17:52:57 [18235]  NOTICE:  Message from PostgreSQL backend:
        The Postmaster has informed me that some other backend
        died abnormally and possibly corrupted shared memory.
        I have rolled back the current transaction and am
        going to terminate your database system connection and exit.
        Please reconnect to the database system and repeat your query.
2002-08-06 17:52:57 [18256]  FATAL 1:  The database system is in
recovery mode
2002-08-06 17:52:57 [18257]  FATAL 1:  The database system is in
recovery mode
2002-08-06 17:52:57 [18258]  FATAL 1:  The database system is in
recovery mode
2002-08-06 17:52:57 [16530]  DEBUG:  all server processes terminated;
reinitializing shared memory and semaphores
2002-08-06 17:52:57 [18260]  FATAL 1:  The database system is starting
up
2002-08-06 17:52:57 [18259]  DEBUG:  database system was interrupted at
2002-08-06 17:51:38 CEST
2002-08-06 17:52:57 [18259]  DEBUG:  checkpoint record is at 0/32991848
2002-08-06 17:52:57 [18259]  DEBUG:  redo record is at 0/3297F4D8; undo
record is at 0/0; shutdown FALSE
2002-08-06 17:52:57 [18259]  DEBUG:  next transaction id: 3704; next
oid: 909484
2002-08-06 17:52:57 [18259]  DEBUG:  database system was not properly
shut down; automatic recovery in progress
2002-08-06 17:52:57 [18259]  DEBUG:  redo starts at 0/3297F4D8
2002-08-06 17:52:57 [18261]  FATAL 1:  The database system is starting
up
2002-08-06 17:52:58 [18259]  DEBUG:  ReadRecord: record with zero length
at 0/32BF0278
2002-08-06 17:52:58 [18259]  DEBUG:  redo done at 0/32BF0254
2002-08-06 17:52:59 [18262]  FATAL 1:  The database system is starting
up
2002-08-06 17:53:00 [18259]  DEBUG:  database system is ready
2002-08-06 17:54:24 [16530]  DEBUG:  connection startup failed (fork
failure): Cannot allocate memory
2002-08-06 17:54:31 [16530]  DEBUG:  server process (pid 18283) was
terminated by signal 9
2002-08-06 17:54:31 [16530]  DEBUG:  terminating any other active server
processes
2002-08-06 17:54:31 [18275]  NOTICE:  Message from PostgreSQL backend:
        The Postmaster has informed me that some other backend
        died abnormally and possibly corrupted shared memory.
        I have rolled back the current transaction and am
        going to terminate your database system connection and exit.
        Please reconnect to the database system and repeat your query.
[...]
2002-08-06 17:54:32 [16530]  DEBUG:  all server processes terminated;
reinitializing shared memory and semaphores
2002-08-06 17:54:32 [18296]  DEBUG:  database system was interrupted at
2002-08-06 17:53:00 CEST
2002-08-06 17:54:32 [18296]  DEBUG:  checkpoint record is at 0/32BF0278
2002-08-06 17:54:32 [18296]  DEBUG:  redo record is at 0/32BF0278; undo
record is at 0/0; shutdown TRUE
2002-08-06 17:54:32 [18296]  DEBUG:  next transaction id: 4456; next
oid: 909484
2002-08-06 17:54:32 [18296]  DEBUG:  database system was not properly
shut down; automatic recovery in progress
2002-08-06 17:54:32 [18296]  DEBUG:  redo starts at 0/32BF02B8
2002-08-06 17:54:32 [18296]  DEBUG:  ReadRecord: record with zero length
at 0/32F0B3C0
2002-08-06 17:54:32 [18296]  DEBUG:  redo done at 0/32F0B39C
2002-08-06 17:54:34 [18297]  FATAL 1:  The database system is starting
up
2002-08-06 17:54:34 [18298]  FATAL 1:  The database system is starting
up
2002-08-06 17:54:34 [18299]  FATAL 1:  The database system is starting
up
2002-08-06 17:54:34 [18300]  FATAL 1:  The database system is starting
up
2002-08-06 17:54:34 [18296]  DEBUG:  database system is ready
2002-08-06 17:57:35 [16530]  DEBUG:  connection startup failed (fork
failure): Cannot allocate memory
2002-08-06 17:57:54 [16530]  DEBUG:  server process (pid 18366) was
terminated by signal 9
2002-08-06 17:57:54 [16530]  DEBUG:  terminating any other active server
processes
2002-08-06 17:57:54 [18368]  NOTICE:  Message from PostgreSQL backend:
        The Postmaster has informed me that some other backend
        died abnormally and possibly corrupted shared memory.
        I have rolled back the current transaction and am
        going to terminate your database system connection and exit.
        Please reconnect to the database system and repeat your query.
2002-08-06 17:57:56 [18409]  DEBUG:  ReadRecord: record with zero length
at 0/3338749C
2002-08-06 17:57:58 [18425]  FATAL 1:  The database system is starting
up
2002-08-06 17:57:58 [18409]  DEBUG:  database system is ready
2002-08-06 17:58:53 [18432]  NOTICE:  RelationBuildDesc: can't open
idx_bm_user_id: Cannot allocate memory
2002-08-06 17:59:00 [18443]  FATAL 1:  cannot open pg_attribute: Cannot
allocate memory
2002-08-06 17:59:01 [16530]  DEBUG:  connection startup failed (fork
failure): Cannot allocate memory
2002-08-06 17:59:01 [16530]  DEBUG:  server process (pid 18436) was
terminated by signal 9
2002-08-06 17:59:01 [16530]  DEBUG:  terminating any other active server
processes
2002-08-06 17:59:03 [18510]  DEBUG:  ReadRecord: record with zero length
at 0/336E9970
2002-08-06 18:00:15 [16530]  DEBUG:  connection startup failed (fork
failure): Cannot allocate memory
2002-08-06 18:00:17 [18589]  DEBUG:  ReadRecord: record with zero length
at 0/33A7C194

Thank you for your kind assistance!

Regards,

    Markus Wollny

Re: URGENT: Database keeps crashing - suspect damaged RAM

From
"Markus Wollny"
Date:
Oh - and I forgot to mention: The crashes only occur when there is load
on the machine. No load - no crashes. But then, that wouldn't be any
surprise, as it wouldn't make use of a lot of RAM without any load...

Regards,

    Markus

Re: URGENT: Database keeps crashing - suspect damaged RAM

From
nconway@klamath.dyndns.org (Neil Conway)
Date:
On Tue, Aug 06, 2002 at 06:38:24PM +0200, Markus Wollny wrote:
> Is it bad RAM? How can I make sure?

You can test your RAM with memtest86 -- www.memtest86.com

Cheers,

Neil

--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC

Re: URGENT: Database keeps crashing - suspect damaged RAM

From
John Gray
Date:
On Tue, 2002-08-06 at 17:38, Markus Wollny wrote:

> What I'd like to know is if I need to look any further than RAM - shall
> I just chuck the new modules out of the machine? Or is there some other
> issue that could cause this behaviour? I am quite sure that I didn't do
> anything wrong during installation, configuration and import and the
> same application code is running without errors on a different machine
> at this very moment. I don't like the "record with zero length" and
> "Cannot allocate memory"-bits in the logfile at all, let alone the "was
> terminated by signal 9"-thingy.
>

9 is SIGKILL - that is significant because it implies that your OS is
terminating the process (sig 11 would be likely for a bad pointer
dereference, which could well indicate RAM problems).
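
For reference, the signal numbers mentioned above can be looked up with Python's standard signal module. This is only an illustrative sketch; the helper name describe_signal is made up here, and the numbering shown is what Linux on x86 uses.

```python
import signal

def describe_signal(number: int) -> str:
    """Return the symbolic name for a signal number, e.g. 9 -> 'SIGKILL'."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # Not a signal number known on this platform.
        return f"unknown signal {number}"

# 9 is what the postmaster log reports; 11 is what a bad pointer
# dereference would typically produce.
for num in (9, 11):
    print(num, describe_signal(num))
```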

I don't think that you should immediately suspect your hardware. This
all looks suspiciously like an OS out-of-memory situation - that also
corresponds to it being under load. Three things to check:

1) Swap enabled, set to a suitable value for the load on the machine?
(what does "free" say?)

2) There is a Linux sysctl which determines whether to "overcommit"
memory. Also check that ulimit isn't imposing any per-process memory or
CPU limits.

3) If it's a stock Linux install, you may be running excessive daemons,
but I'd be surprised if things got quite this bad.
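
The first two checks above could be gathered in one go with a short script along these lines. It is a rough sketch, not a definitive diagnostic: the helper names are invented for illustration, it assumes a Linux /proc filesystem with the "Name: value kB" meminfo layout (older kernels print an extra header block that this parser simply skips), and it reports the limits of the process running the script, which may differ from those the postmaster was started with.

```python
import resource
from pathlib import Path

def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field name: value in kB}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values

def read_text(path: str):
    """Return the stripped file contents, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None

def format_limit(value: int) -> str:
    """Render an rlimit value the way the shell's ulimit would."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)

if __name__ == "__main__":
    meminfo = parse_meminfo(read_text("/proc/meminfo") or "")
    print("SwapTotal kB:", meminfo.get("SwapTotal"))
    print("SwapFree  kB:", meminfo.get("SwapFree"))
    # The overcommit policy sysctl (vm.overcommit_memory).
    print("overcommit_memory:", read_text("/proc/sys/vm/overcommit_memory"))
    # Per-process limits: data segment size and CPU time.
    for label, which in (("data segment", resource.RLIMIT_DATA),
                         ("CPU seconds", resource.RLIMIT_CPU)):
        soft, hard = resource.getrlimit(which)
        print(f"{label}: soft={format_limit(soft)} hard={format_limit(hard)}")
```

Running it as the same user that starts the postmaster gives the most relevant limits.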

Regards

John

--
John Gray
Azuli IT
www.azuli.co.uk



Re: URGENT: Database keeps crashing - suspect damaged RAM

From
Tom Lane
Date:
"Markus Wollny" <Markus.Wollny@computec.de> writes:
> So: Is it bad RAM? How can I make sure? What else could it be?

Have you tried running memtest86?  I've never used that myself but
some folks on the list say it works well.

> Here's a small excerpt from the logfile:

> 2002-08-06 17:36:23 [17296]  DEBUG:  _mdfd_blind_getseg: couldn't open
> /var/lib/pgsql/data/base/base/16596/16671: Cannot allocate memory

Is it possible that you are running with inadequate swap space, a small
data segment limit (ulimit -d), or something else that would make the
kernel refuse to give memory to a backend process?

> 2002-08-06 17:40:53 [16530]  DEBUG:  connection startup failed (fork
> failure): Cannot allocate memory
> 2002-08-06 17:52:50 [16530]  DEBUG:  connection startup failed (fork
> failure): Cannot allocate memory

Still looks like inadequate memory --- but now I'm thinking that it's a
system-wide condition, ie, you just plain haven't got enough RAM for the
number of processes you're trying to start.

> 2002-08-06 17:52:54 [16530]  DEBUG:  server process (pid 18237) was
> terminated by signal 9

Postgres never issues any kill -9 on itself, but I've heard that the
Linux kernel may start killing processes when it's desperately low on
memory.
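
If the kernel is indeed the one sending the kill, it normally leaves a trace in the kernel log. A minimal sketch of how one might scan for it follows; the exact message wording varies between kernel versions, so the pattern is an assumption, and the sample line in the usage note is invented for illustration (the thread contains no kernel log).

```python
import re

# Assumed wording: "Out of Memory: Killed process <pid> (...)" or a
# close variant; matched case-insensitively to tolerate differences.
OOM_PATTERN = re.compile(r"out of memory: kill(?:ed)? process (\d+)",
                         re.IGNORECASE)

def find_oom_kills(log_lines):
    """Return the PIDs named in OOM-killer messages found in log_lines."""
    pids = []
    for line in log_lines:
        match = OOM_PATTERN.search(line)
        if match:
            pids.append(int(match.group(1)))
    return pids
```

Feeding it the lines of the kernel log (for example the saved output of dmesg) and comparing the PIDs with those in the postmaster log would show whether the two line up; a hypothetical line such as "kernel: Out of Memory: Killed process 18237 (postmaster)." would yield [18237].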

Other than the signal 9, everything I see in this trace is either a
cannot-allocate-memory failure or followup effects from one.  How many
backends are you trying to start up, anyway?  Might you have a runaway
client that keeps opening new backend connections?
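
The observation that nearly every entry is an allocation failure or a follow-up can be checked mechanically. The sketch below is only an illustration with made-up helper names: it first rejoins the mail-wrapped lines into whole log entries (an entry is assumed to start with a timestamp followed by a bracketed PID, as in the excerpt), then tallies them into three coarse categories.

```python
import re
from collections import Counter

# An entry starts with "YYYY-MM-DD HH:MM:SS [pid]"; anything else is
# treated as a continuation of the previous entry.
ENTRY_START = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \[\d+\]")

def join_wrapped(lines):
    """Merge mail-wrapped continuation lines back into whole log entries."""
    entries = []
    for line in lines:
        if ENTRY_START.match(line) or not entries:
            entries.append(line.strip())
        else:
            entries[-1] += " " + line.strip()
    return entries

def classify(entry):
    """Assign a coarse category to one log entry."""
    if "Cannot allocate memory" in entry:
        return "out of memory"
    if "terminated by signal" in entry:
        return "killed by signal"
    return "other"

def tally(lines):
    """Count log entries per category."""
    return Counter(classify(entry) for entry in join_wrapped(lines))
```

Joining before classifying matters here because the mail client split phrases such as "Cannot allocate memory" across two lines.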

            regards, tom lane

Re: URGENT: Database keeps crashing - suspect damaged RAM

From
"Markus Wollny"
Date:
Hi!

    -----Original Message-----
    From: Tom Lane
    Sent: Tue 06.08.2002 18:59
    To: Markus Wollny
    Cc: pgsql-general@postgresql.org
    Subject: Re: [GENERAL] URGENT: Database keeps crashing - suspect damaged RAM

    "Markus Wollny" <Markus.Wollny@computec.de> writes:
    > So: Is it bad RAM? How can I make sure? What else could it be?

    Have you tried running memtest86?  I've never used that myself but
    some folks on the list say it works well.

No, I haven't tried that yet, but I'm surely going to do so tomorrow.

    > Here's a small excerpt from the logfile:

    > 2002-08-06 17:36:23 [17296]  DEBUG:  _mdfd_blind_getseg: couldn't open
    > /var/lib/pgsql/data/base/base/16596/16671: Cannot allocate memory

    Is it possible that you are running with inadequate swap space, a small
    data segment limit (ulimit -d), or something else that would make the
    kernel refuse to give memory to a backend process?

I shouldn't think so; the machine has 2 GB RAM (that was more than
sufficient for the same DB, applications and load on a different
machine) and 4 GB swap:
Disk geometry for /dev/sda: 0.000-51834.000 megabytes
Disk label type: msdos
Minor    Start       End     Type      Filesystem  Flags
1          0.031     15.688  primary   ext3        boot
2         15.688   4118.225  primary   linux-swap
3       4118.225  24599.531  primary   ext3
4      24599.531  51826.882  primary   ext3

Taking a closer look I am a bit confused: I allocated 4 GB to the swap
partition, as you can see above, but free only reports 2 GB? That's
strange, but cannot be the cause, I think, as the working machine has
got just 2 GB swap, too. ulimit is set to "unlimited" and there was RAM
available during load. As a matter of fact, right now free reports:

             total       used       free     shared    buffers     cached
Mem:       2061536    2053816       7720          0       4496    1825620
-/+ buffers/cache:     223700    1837836
Swap:      2097136     124800    1972336

That is on our fallback machine, which is running the very same database
and the very same application. The total disk usage of the database is
1.8 GB. When I switched to the new machine, there were about 30-50 open
connections; max. connections is set to 512 on both machines. The crashes
occurred immediately after making the DB accessible to our application, so
most of the DB was definitely not yet in memory. And again - our fallback
machine, which has no RAID and slower processors, can handle the very same
DB under the very same load with no such problems - I have never
encountered this "cannot allocate memory" error before.
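
As a worked check of the figures quoted above (which come from the fallback machine, not the crashing one), the "-/+ buffers/cache" row of free and the swap discrepancy can be reproduced with a few lines of arithmetic. The variable names are chosen here for illustration; all values are copied from the free and partition listings in this message.

```python
# Values in kB, copied from the free output quoted above.
total_kb, used_kb, free_kb = 2061536, 2053816, 7720
buffers_kb, cached_kb = 4496, 1825620
swap_total_kb = 2097136

# "-/+ buffers/cache" row: memory actually held by applications, and
# memory available once buffers and page cache are given back.
used_by_apps = used_kb - buffers_kb - cached_kb
available = free_kb + buffers_kb + cached_kb
print(used_by_apps)   # 223700
print(available)      # 1837836

# Swap reported by free, in MiB, versus the size of the swap partition
# (end minus start, in megabytes, from the partition table above).
print(round(swap_total_kb / 1024))   # 2048
swap_partition_mb = 4118.225 - 15.688
print(round(swap_partition_mb))      # 4103
```

So almost all of the "used" memory on the fallback machine is page cache, and the swap in use is about half the size of the partition that was created for it.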

    > 2002-08-06 17:40:53 [16530]  DEBUG:  connection startup failed (fork
    > failure): Cannot allocate memory
    > 2002-08-06 17:52:50 [16530]  DEBUG:  connection startup failed (fork
    > failure): Cannot allocate memory

    Still looks like inadequate memory --- but now I'm thinking that it's a
    system-wide condition, ie, you just plain haven't got enough RAM for the
    number of processes you're trying to start.

    > 2002-08-06 17:52:54 [16530]  DEBUG:  server process (pid 18237) was
    > terminated by signal 9

    Postgres never issues any kill -9 on itself, but I've heard that the
    Linux kernel may start killing processes when it's desperately low on
    memory.

    Other than the signal 9, everything I see in this trace is either a
    cannot-allocate-memory failure or followup effects from one.  How many
    backends are you trying to start up, anyway?  Might you have a runaway
    client that keeps opening new backend connections?

Must be something else - the number of connections was not at all high
(<100), the server load wasn't more than 3.5 (on a 4-processor machine),
there was RAM available at the time, both physical and swap, and I haven't
got any surplus daemons running... I think I'll be able to confirm or rule
out the bad-RAM suspicion tomorrow using memtest86.

Thank you!

Regards,

     Markus


Re: URGENT: Database keeps crashing - suspect damaged RAM

From
Jeff Davis
Date:
Virtual memory problems on Linux have certainly happened before; perhaps you're
running a kernel that had some major ones. Maybe upgrading to 2.4.19 would help?

Regards,
    Jeff Davis

On Tuesday 06 August 2002 11:02 am, Markus Wollny wrote:
> Hi!
>
>     -----Ursprüngliche Nachricht-----
>     Von: Tom Lane
>     Gesendet: Di 06.08.2002 18:59
>     An: Markus Wollny
>     Cc: pgsql-general@postgresql.org
>     Betreff: Re: [GENERAL] URGENT: Database keeps crashing - suspect
> damaged RAM
>
>
>
>     "Markus Wollny" <Markus.Wollny@computec.de> writes:
>
>     > So: Is it bad RAM? How can I make sure? What else could it be?
>
>
>     Have you tried running memtest86?  I've never used that myself
> but
>     some folks on the list say it works well.
>
>
>
> No, I haven't tried that yet, but I'm surely going to do so tomorrow.
>
>
>     > Here's a small excerpt from the logfile:
>
>
>
>     > 2002-08-06 17:36:23 [17296]  DEBUG:  _mdfd_blind_getseg:
>
> couldn't open
>
>     > /var/lib/pgsql/data/base/base/16596/16671: Cannot allocate
>
> memory
>
>     Is it possible that you are running with inadequate swap space,
> a small
>     data segment limit (ulimit -d), or something else that would
> make the
>     kernel refuse to give memory to a backend process?
>
> I shouldn't think so; the machine has 2 GB RAM (that was more than
> sufficient for the same DB, applications and load on a different
> machine) and 4 GB swap:
> Disk geometry for /dev/sda: 0.000-51834.000 megabytes
> Disk label type: msdos
> Minor    Start       End     Type      Filesystem  Flags
> 1          0.031     15.688  primary   ext3        boot
> 2         15.688   4118.225  primary   linux-swap
> 3       4118.225  24599.531  primary   ext3
> 4      24599.531  51826.882  primary   ext3
>
> Taking a closer look I am a bit confused: I allocated 4GB the swap
> partition, as you can see above, but free only reports 2GB? That's
> strange, but cannot be the cause, I think, as the working machine has
> got just 2 GB swap, too. ulimit is set to "unlimited" and there was RAM
> available during load. As a matter of fact, right now free reports:
>
>              total       used       free     shared    buffers
> cached
> Mem:       2061536    2053816       7720          0       4496
> 1825620
> -/+ buffers/cache:     223700    1837836
> Swap:      2097136     124800    1972336
>
> on our fallback-machine, and that's the very same database and very same
> application, it is running. When taking a look at total disk usage of
> the database, I get a total of 1,8 GB. When I switched to the new
> machine, there were about 30-50 open connections, max. connections is
> set to 512 on both machines. The crashes occurred immediately after
> making the DB accessible to our application, so most of the DB was
> definitely not yet in memory. And again - our fallback-machine which has
> got no RAID and slower processors can handle the very same DB under the
> very same load with no such problems - I never ever encountered this
> "cannot allocate memory" error before.
>
>
>     > 2002-08-06 17:40:53 [16530]  DEBUG:  connection startup failed (fork failure): Cannot allocate memory
>     > 2002-08-06 17:52:50 [16530]  DEBUG:  connection startup failed (fork failure): Cannot allocate memory
>
>
>     Still looks like inadequate memory --- but now I'm thinking that it's a
>     system-wide condition, ie, you just plain haven't got enough RAM for the
>     number of processes you're trying to start.
>
>
>     > 2002-08-06 17:52:54 [16530]  DEBUG:  server process (pid 18237) was terminated by signal 9
>
>
>     Postgres never issues any kill -9 on itself, but I've heard that the
>     Linux kernel may start killing processes when it's desperately low on
>     memory.
>
>     Other than the signal 9, everything I see in this trace is either a
>     cannot-allocate-memory failure or followup effects from one. How many
>     backends are you trying to start up, anyway?  Might you have a runaway
>     client that keeps opening new backend connections?
>
>
> Must be something else - the number of connections was not at all high
> (<100), the server load wasn't more than 3.5 (on a 4-processor machine),
> there was RAM available at the time, both physical and swap, and I haven't
> got any surplus daemons running... I think I'll be able to confirm or rule
> out the bad-RAM theory tomorrow using memtest86.
>
> Thank you!
>
> Regards,
>
>      Markus
>
>


Re: URGENT: Database keeps crashing - suspect damaged RAM

From
"scott.marlowe"
Date:
A couple of points: first, the Linux kernel (if memory serves) is limited
to 2 gigs per swap partition, but can use more than one swap partition.
In fact, on a server that winds up swapping it is quite advantageous to
have several swap partitions spread across as many drives as you can, as
the kernel will then interleave swap access across all the drives for
maximum performance.

What is your setting for sort_mem in PostgreSQL?
Note that large values for sort_mem can starve your machine for memory very
quickly, but only under load, and only when things need to be sorted.

I assume nothing like file-max or shmmax is the issue either?


Re: URGENT: Database keeps crashing - suspect damaged RAM

From
Ralph Graulich
Date:
> several partitions spread about on all the platters you can, as the kernel
> will then interleave swap access across all the drives for maximum
> performance.
[...]

Although this is a Linux question rather than a PostgreSQL one, I want
to add that if you have several swap partitions and want to prefer using
one over another, you should set the "pri" option in /etc/fstab, like

/dev/sdn1       swap    swap    pri=1 0 0
/dev/sdf4       swap    swap    pri=2 0 0
/dev/sdg4       swap    swap    pri=3 0 0
/dev/sdh4       swap    swap    pri=3 0 0
/dev/sdi4       swap    swap    pri=3 0 0

Refer to "man fstab" on your system.

This can dramatically improve performance if the server _has_ to swap: it
should preferably use an otherwise idle disk first and only - in extreme
situations - fall back to a disk that the application also uses.
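
To see what a set of "pri" values means in practice, here is a minimal Python sketch. It assumes the Linux behaviour described in swapon(8): swap areas with a higher priority value are used first, and areas sharing a priority are used round-robin. The names FSTAB and swap_usage_order are made up for this illustration, and the device names are the example ones from above written with all six fstab fields.

```python
from collections import defaultdict

# Example fstab swap entries (illustrative device names, six fields each).
FSTAB = """\
/dev/sdn1  swap  swap  pri=1  0 0
/dev/sdf4  swap  swap  pri=2  0 0
/dev/sdg4  swap  swap  pri=3  0 0
/dev/sdh4  swap  swap  pri=3  0 0
/dev/sdi4  swap  swap  pri=3  0 0
"""

def swap_usage_order(fstab_text):
    """Return (priority, [devices]) groups, highest priority first."""
    groups = defaultdict(list)
    for line in fstab_text.splitlines():
        fields = line.split()
        # Skip blank lines, comments and non-swap entries.
        if len(fields) < 4 or fields[0].startswith("#") or fields[2] != "swap":
            continue
        # Entries without pri= are lumped together as -1 here; this is a
        # simplification of the negative priorities the kernel assigns.
        priority = -1
        for option in fields[3].split(","):
            if option.startswith("pri="):
                priority = int(option[4:])
        groups[priority].append(fields[0])
    return [(pri, groups[pri]) for pri in sorted(groups, reverse=True)]

for pri, devices in swap_usage_order(FSTAB):
    print(f"pri={pri}: {', '.join(devices)}")
```

Under that assumption, the three pri=3 partitions would be filled first and interleaved with each other, and the pri=1 partition would be touched last.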


Kind regards
... Ralph ...



Re: URGENT: Database keeps crashing - suspect damaged RAM

From
"Markus Wollny"
Date:
Hi!
 
Thank you - that clears up my confusion about swap available being
smaller than the swap partition :)
sort_mem is set to 65534, following the recommendation about setting it
to 2-4% of available physical RAM.
If shmmax were the issue, the postmaster would refuse to start up - so
this isn't it either; I took care of both file-max and shmmax - and the
very same configuration is working on our fallback machine under the
same environment (application, load, database, data) without any
trouble.
 
I upgraded the kernel of the machine to 2.4.16. There are no RPMs of
2.4.19 for SuSE 7.3 yet, and not much experience with that combination,
and I'm quite cautious when it comes to the kernel; I do know how to
configure and compile a kernel, but on a production machine I leave this
to SuSE :)
 
Taking into account that this thing does work when run on a different
machine, I think bad RAM is my best bet. But there's only one way to
know for sure - I'll go and find out tomorrow.
 
Regards,
 
   Markus



Re: URGENT: Database keeps crashing - suspect damaged

From
"scott.marlowe"
Date:
On Tue, 6 Aug 2002, Markus Wollny wrote:

> Hi!
>
> Thank you - that clears up my confusion about swap available being
> smaller than the swap partition :)
> sort_mem is set to 65534, following the recommendation about setting it
> to 2-4% of available physical RAM.
> If shmmax were the issue, the postmaster would refuse to start up - so
> this isn't it either; I took care of both file-max and shmmax - and the
> very same configuration is working on our fallback machine under the
> same environment (application, load, database, data) without any
> trouble.
>
> I upgraded the kernel of the machine to 2.4.16. There are no RPMs of
> 2.4.19 for SuSE 7.3 yet, and not much experience with that combination,
> and I'm quite cautious when it comes to the kernel; I do know how to
> configure and compile a kernel, but on a production machine I leave this
> to SuSE :)
>
> Taking into account that this thing does work when run on a different
> machine, I think bad RAM is my best bet. But there's only one way to
> know for sure - I'll go and find out tomorrow.

Well, I'd first lower sort_mem myself.  64 Megs is pretty big, even on
a box with gigs of RAM.  But more importantly, since it looks like the
kernel was killing the processes, I would NOT tend to think of this as a
bad RAM issue, but as a memory starvation issue.  Bad memory results in
database corruption and things like that.  Yours seems to be just
suddenly shutting down and coming right back up.
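
As a rough illustration of why 64 Megs is a lot, here is a small Python sketch of the worst case. It assumes the figures quoted in this thread (sort_mem = 65534, RAM and swap totals from the free output) and relies on sort_mem being a per-sort limit in kilobytes for each backend. The function names are made up for this example, and the result is only an upper bound: the memory is used only by backends that are actually sorting that much data.

```python
# Figures quoted in this thread, all in kilobytes.
SORT_MEM_KB = 65534   # sort_mem setting; a limit per sort step, per backend
RAM_KB = 2061536      # "Mem: total" from the free output
SWAP_KB = 2097136     # "Swap: total" from the free output

def worst_case_sort_kb(backends, sorts_per_backend=1):
    """Upper bound on private sort memory if every backend sorts at once."""
    return backends * sorts_per_backend * SORT_MEM_KB

def backends_until_exhausted(limit_kb):
    """How many simultaneous full-size sorts fit below limit_kb."""
    return limit_kb // SORT_MEM_KB

for backends in (50, 100, 512):
    need = worst_case_sort_kb(backends)
    print(f"{backends:4d} backends: {need / 1024 / 1024:5.1f} GB "
          f"({need / (RAM_KB + SWAP_KB):.0%} of RAM + swap)")

print("full-size sorts that fit in RAM:", backends_until_exhausted(RAM_KB))
print("full-size sorts that fit in RAM + swap:",
      backends_until_exhausted(RAM_KB + SWAP_KB))
print(f"sort_mem as a share of RAM: {SORT_MEM_KB / RAM_KB:.1%}")
```

With these numbers, 31 full-size sorts already exceed physical RAM and 63 exceed RAM plus swap, well below the "<100" connections mentioned earlier, and that is before counting shared buffers and everything else on the box.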

Have you checked the available memory when the server is having these
problems?  I would tend to think it may be a configuration issue.  shmmax
doesn't just affect startup.  If the sort_mem is coming out of the
shared memory then the limit there could affect the ability of a child to
allocate memory when sorting, which would result in the problems you're
seeing where a backend dies while trying but failing to allocate memory.
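
For checking available memory, note that the "free" column alone is misleading on Linux, because most RAM is normally held as buffers and page cache that the kernel can give back. The following Python sketch parses the classic free layout quoted earlier in this thread and reproduces the "-/+ buffers/cache" figures. The helper names are invented for this example, and newer versions of free print different columns, so treat the parsing as an assumption tied to this output format.

```python
# Sample taken from the free output quoted earlier in the thread (kilobytes).
FREE_OUTPUT = """\
             total       used       free     shared    buffers     cached
Mem:       2061536    2053816       7720          0       4496    1825620
-/+ buffers/cache:     223700    1837836
Swap:      2097136     124800    1972336
"""

def parse_free(text):
    """Parse the Mem: and Swap: rows of the classic `free` layout."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Mem:":
            keys = ("total", "used", "free", "shared", "buffers", "cached")
            stats["mem"] = dict(zip(keys, map(int, fields[1:7])))
        elif fields[0] == "Swap:":
            keys = ("total", "used", "free")
            stats["swap"] = dict(zip(keys, map(int, fields[1:4])))
    return stats

def reclaimable_free_kb(mem):
    """Free memory plus buffers and cache, which the kernel can reclaim."""
    return mem["free"] + mem["buffers"] + mem["cached"]

def application_used_kb(mem):
    """Memory actually held by processes, excluding buffers and cache."""
    return mem["used"] - mem["buffers"] - mem["cached"]

stats = parse_free(FREE_OUTPUT)
print("used by applications (KB):", application_used_kb(stats["mem"]))
print("available to applications (KB):", reclaimable_free_kb(stats["mem"]))
print("swap in use (KB):", stats["swap"]["used"])
```

Running something like this in a loop while the problem occurs (for example against the output of free captured every few seconds) would show whether the machine really runs out of memory at the moment a backend dies.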

Someone correct me if sort_mem doesn't come under the heading of
shared memory.  It would NOT be the first time I've been wrong. :-)


Re: URGENT: Database keeps crashing - suspect damaged

From
"scott.marlowe"
Date:
OK, I did a little more testing, on one of our tables with 1.25 million
rows of semi-unique data (it's a keyword table: small row size, lots of
keywords, many repeaters).  Some words occur once, some occur 1200 times,
most occur 3 to 10 times.


This test box has 512 Megs of RAM, and other than having X running it is a
pretty close match to the servers we use (1.1 Gigahertz CPU, running a
4x2G RAID5 drive set).

Shared buffers are set to ~32 Megabytes (4000*8k).

I ran the following query in parallel in four psql sessions:

select distinct word from wordtable;

With sort_mem set to 64 Megs, my workstation, which normally sits at 0
used swap and about 300 Megs of system buff/cache, used all the available
memory and about 600 Megs of swap to run those four queries, and one of
them errored out with "ERROR:  MemoryContextAlloc: invalid request size
4294967293".  The run time was very long, with lots of swapping going on.
This was with only four processes connected.  Each one used about 130 Megs
of RAM according to top.  Subtracting 32 Megs of shared memory, that would
be about 100 Megs of private memory per process.

With sort_mem set to 8 Megs, the four queries used up all my RAM, pushing
100 Megs into swap.  The queries were much faster: about 180 seconds,
versus about 8 minutes or so with 64 Megs of sort_mem.  (I stopped
checking that run closely after about 5 minutes; I used explain analyze
to time all the tests with less than 64 Megs.)

Next I tested with 2 Megs of sort_mem.  Now I had a fair bit of RAM left
over (about 100 Megs), and the queries each took about 135 seconds to run.

I'd suggest lowering sort_mem to something more reasonable, unless you
have a test case that shows a marked performance increase with 64 Megs of
sort_mem.  All mine point to 1 to 4 megs being perfect for sort_mem on
most queries.
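
The timings above can be put side by side with a few lines of Python. This is only a sketch of the arithmetic: the 64 Meg figure is the rough "about 8 minutes" estimate rather than a measured value, and the names RUNS and relative_to_fastest are made up for this example.

```python
# Approximate wall-clock times for four parallel
# "select distinct word from wordtable" sessions on the 512 Meg test box.
RUNS = [
    # (sort_mem in Megs, seconds)
    (64, 8 * 60),   # rough estimate, run was not timed precisely
    (8, 180),
    (2, 135),
]

def relative_to_fastest(runs):
    """Return (sort_mem, seconds, slowdown factor versus the fastest run)."""
    fastest = min(seconds for _, seconds in runs)
    return [(mb, seconds, seconds / fastest) for mb, seconds in runs]

for mb, seconds, factor in relative_to_fastest(RUNS):
    print(f"sort_mem={mb:2d} Megs: {seconds:3d} s "
          f"({factor:.2f}x the fastest run)")
```

On this box the smallest setting tested was also the fastest, which fits the observation that the larger settings pushed the machine into swap.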

Good luck.


Re: URGENT: Database keeps crashing - suspect damaged RAM

From
"Markus Wollny"
Date:
Hi!

I think I'll have to bow down to you gurus - again :) I upgraded to
2.4.16 (there are no RPMs for 2.4.19 and I didn't want to compile from
source - yet), and the symptoms have disappeared altogether. Which is
strange because, as I already said, the very same config isn't giving me
any trouble on a different machine... Anyway: I'll shun 2.4.10 from now
on.

Regards,

    Markus

> -----Original Message-----
> From: Jeff Davis [mailto:list-pgsql-general@empires.org]
> Sent: Tuesday, 6 August 2002 20:29
> To: Markus Wollny; Tom Lane
> Cc: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] URGENT: Database keeps crashing - suspect damaged RAM
> 
> 
> Virtual memory problems on Linux have certainly happened before; perhaps
> you're running a kernel that had some major ones. Maybe if you upgraded
> to 2.4.19?
> 
> Regards,
>     Jeff Davis
> 