Thread: URGENT: Database keeps crashing - suspect damaged RAM
Hello!

I just installed PostgreSQL 7.2.1 on SuSE 7.3, 4x PIII Xeon 550 MHz, 2 GB RAM, 5x 18 GB SCSI RAID. The OS was freshly installed; after that I compiled and installed PostgreSQL from source (./configure --prefix=/opt/pgsql/ --with-perl --enable-odbc --enable-locale --enable-syslog). I copied the settings in postgresql.conf etc. from an identical machine running the identical platform. Then I imported a database to the new installation. The import seems to have been successful; I didn't get any errors during it. A subsequent VACUUM ANALYZE finished without anything out of the ordinary.

Just a few minutes after this VACUUM ANALYZE, the database crashed for the first time. It keeps crashing every now and then - every one or two minutes.

What puzzles me is that this very same machine was running Oracle 8i on Win2k more or less flawlessly up to a few hours before - "more or less" meaning that we never really noticed anything much out of the ordinary. There might have been some minor issues after a RAM upgrade from 1 GB to 2 GB just a week ago, but looking back it's hard to say whether those were due to bad RAM or just some bad code which we've sorted out (or disposed of) by now. As the machine is already running Linux and PostgreSQL, it's quite impossible to prove my suspicion by going back to Oracle and having a closer look.

What I'd like to know is whether I need to look any further than RAM - shall I just chuck the new modules out of the machine? Or is there some other issue that could cause this behaviour? I am quite sure that I didn't do anything wrong during installation, configuration and import, and the same application code is running without errors on a different machine at this very moment. I don't like the "record with zero length" and "Cannot allocate memory" bits in the logfile at all, let alone the "was terminated by signal 9" thingy.

So: Is it bad RAM? How can I make sure? What else could it be?

Here's a small excerpt from the logfile:

2002-08-06 17:31:38 [17063] DEBUG: Pages 0: Changed 0, Empty 0; Tup 0: Vac 0, Keep 0, UnUsed 0.
        Total CPU 0.00s/0.00u sec elapsed 0.00 sec.
2002-08-06 17:36:23 [17296] DEBUG: _mdfd_blind_getseg: couldn't open /var/lib/pgsql/data/base/base/16596/16671: Cannot allocate memory
2002-08-06 17:36:24 [17296] FATAL 2: cannot write block 13387 of 16596/16671 blind: Cannot allocate memory
2002-08-06 17:36:24 [16530] DEBUG: server process (pid 17296) exited with exit code 2
2002-08-06 17:36:24 [16530] DEBUG: terminating any other active server processes
2002-08-06 17:36:24 [17081] NOTICE: Message from PostgreSQL backend: The Postmaster has informed me that some other backend died abnormally and possibly corrupted shared memory. I have rolled back the current transaction and am going to terminate your database system connection and exit.
[...]
2002-08-06 17:36:24 [16530] DEBUG: all server processes terminated; reinitializing shared memory and semaphores
2002-08-06 17:36:24 [17298] DEBUG: database system was interrupted at 2002-08-06 17:31:21 CEST
2002-08-06 17:36:24 [17298] DEBUG: checkpoint record is at 0/325D7C78
2002-08-06 17:36:24 [17298] DEBUG: redo record is at 0/325D7C78; undo record is at 0/0; shutdown FALSE
2002-08-06 17:36:24 [17298] DEBUG: next transaction id: 2270; next oid: 901292
2002-08-06 17:36:24 [17298] DEBUG: database system was not properly shut down; automatic recovery in progress
2002-08-06 17:36:24 [17298] DEBUG: redo starts at 0/325D7CB8
2002-08-06 17:36:25 [17298] DEBUG: ReadRecord: record with zero length at 0/326E16C4
2002-08-06 17:36:25 [17298] DEBUG: redo done at 0/326E16A0
2002-08-06 17:36:30 [17298] DEBUG: database system is ready
2002-08-06 17:40:53 [16530] DEBUG: connection startup failed (fork failure): Cannot allocate memory
2002-08-06 17:52:50 [16530] DEBUG: connection startup failed (fork failure): Cannot allocate memory
2002-08-06 17:52:54 [16530] DEBUG: server process (pid 18237) was terminated by signal 9
2002-08-06 17:52:54 [16530] DEBUG: terminating any other active server processes
2002-08-06 17:52:54 [18234] NOTICE: Message from PostgreSQL backend: The Postmaster has informed me that some other backend died abnormally and possibly corrupted shared memory. I have rolled back the current transaction and am going to terminate your database system connection and exit.
[...]
2002-08-06 17:52:57 [18253] FATAL 1: The database system is in recovery mode
2002-08-06 17:52:57 [18255] FATAL 1: The database system is in recovery mode
2002-08-06 17:52:57 [18254] FATAL 1: The database system is in recovery mode
2002-08-06 17:52:57 [18235] NOTICE: Message from PostgreSQL backend: The Postmaster has informed me that some other backend died abnormally and possibly corrupted shared memory. I have rolled back the current transaction and am going to terminate your database system connection and exit. Please reconnect to the database system and repeat your query.
2002-08-06 17:52:57 [18256] FATAL 1: The database system is in recovery mode
2002-08-06 17:52:57 [18257] FATAL 1: The database system is in recovery mode
2002-08-06 17:52:57 [18258] FATAL 1: The database system is in recovery mode
2002-08-06 17:52:57 [16530] DEBUG: all server processes terminated; reinitializing shared memory and semaphores
2002-08-06 17:52:57 [18260] FATAL 1: The database system is starting up
2002-08-06 17:52:57 [18259] DEBUG: database system was interrupted at 2002-08-06 17:51:38 CEST
2002-08-06 17:52:57 [18259] DEBUG: checkpoint record is at 0/32991848
2002-08-06 17:52:57 [18259] DEBUG: redo record is at 0/3297F4D8; undo record is at 0/0; shutdown FALSE
2002-08-06 17:52:57 [18259] DEBUG: next transaction id: 3704; next oid: 909484
2002-08-06 17:52:57 [18259] DEBUG: database system was not properly shut down; automatic recovery in progress
2002-08-06 17:52:57 [18259] DEBUG: redo starts at 0/3297F4D8
2002-08-06 17:52:57 [18261] FATAL 1: The database system is starting up
2002-08-06 17:52:58 [18259] DEBUG: ReadRecord: record with zero length at 0/32BF0278
2002-08-06 17:52:58 [18259] DEBUG: redo done at 0/32BF0254
2002-08-06 17:52:59 [18262] FATAL 1: The database system is starting up
2002-08-06 17:53:00 [18259] DEBUG: database system is ready
2002-08-06 17:54:24 [16530] DEBUG: connection startup failed (fork failure): Cannot allocate memory
2002-08-06 17:54:31 [16530] DEBUG: server process (pid 18283) was terminated by signal 9
2002-08-06 17:54:31 [16530] DEBUG: terminating any other active server processes
2002-08-06 17:54:31 [18275] NOTICE: Message from PostgreSQL backend: The Postmaster has informed me that some other backend died abnormally and possibly corrupted shared memory. I have rolled back the current transaction and am going to terminate your database system connection and exit. Please reconnect to the database system and repeat your query.
[...]
2002-08-06 17:54:32 [16530] DEBUG: all server processes terminated; reinitializing shared memory and semaphores
2002-08-06 17:54:32 [18296] DEBUG: database system was interrupted at 2002-08-06 17:53:00 CEST
2002-08-06 17:54:32 [18296] DEBUG: checkpoint record is at 0/32BF0278
2002-08-06 17:54:32 [18296] DEBUG: redo record is at 0/32BF0278; undo record is at 0/0; shutdown TRUE
2002-08-06 17:54:32 [18296] DEBUG: next transaction id: 4456; next oid: 909484
2002-08-06 17:54:32 [18296] DEBUG: database system was not properly shut down; automatic recovery in progress
2002-08-06 17:54:32 [18296] DEBUG: redo starts at 0/32BF02B8
2002-08-06 17:54:32 [18296] DEBUG: ReadRecord: record with zero length at 0/32F0B3C0
2002-08-06 17:54:32 [18296] DEBUG: redo done at 0/32F0B39C
2002-08-06 17:54:34 [18297] FATAL 1: The database system is starting up
2002-08-06 17:54:34 [18298] FATAL 1: The database system is starting up
2002-08-06 17:54:34 [18299] FATAL 1: The database system is starting up
2002-08-06 17:54:34 [18300] FATAL 1: The database system is starting up
2002-08-06 17:54:34 [18296] DEBUG: database system is ready
2002-08-06 17:57:35 [16530] DEBUG: connection startup failed (fork failure): Cannot allocate memory
2002-08-06 17:57:54 [16530] DEBUG: server process (pid 18366) was terminated by signal 9
2002-08-06 17:57:54 [16530] DEBUG: terminating any other active server processes
2002-08-06 17:57:54 [18368] NOTICE: Message from PostgreSQL backend: The Postmaster has informed me that some other backend died abnormally and possibly corrupted shared memory. I have rolled back the current transaction and am going to terminate your database system connection and exit. Please reconnect to the database system and repeat your query.
2002-08-06 17:57:56 [18409] DEBUG: ReadRecord: record with zero length at 0/3338749C
2002-08-06 17:57:58 [18425] FATAL 1: The database system is starting up
2002-08-06 17:57:58 [18409] DEBUG: database system is ready
2002-08-06 17:58:53 [18432] NOTICE: RelationBuildDesc: can't open idx_bm_user_id: Cannot allocate memory
2002-08-06 17:59:00 [18443] FATAL 1: cannot open pg_attribute: Cannot allocate memory
2002-08-06 17:59:01 [16530] DEBUG: connection startup failed (fork failure): Cannot allocate memory
2002-08-06 17:59:01 [16530] DEBUG: server process (pid 18436) was terminated by signal 9
2002-08-06 17:59:01 [16530] DEBUG: terminating any other active server processes
2002-08-06 17:59:03 [18510] DEBUG: ReadRecord: record with zero length at 0/336E9970
2002-08-06 18:00:15 [16530] DEBUG: connection startup failed (fork failure): Cannot allocate memory
2002-08-06 18:00:17 [18589] DEBUG: ReadRecord: record with zero length at 0/33A7C194

Thank you for your kind assistance!

Regards,

Markus Wollny
Oh - and I forgot to mention: the crashes only occur when there is load on the machine. No load - no crashes. But then, that wouldn't be any surprise, as it wouldn't make use of a lot of RAM without any load...

Regards,

Markus

> -----Original Message-----
> From: Markus Wollny
> Sent: Tuesday, 6 August 2002 18:38
> To: pgsql-general@postgresql.org
> Subject: [GENERAL] URGENT: Database keeps crashing - suspect damaged RAM
>
> [...]
Re: URGENT: Database keeps crashing - suspect damaged RAM
From: nconway@klamath.dyndns.org (Neil Conway)
On Tue, Aug 06, 2002 at 06:38:24PM +0200, Markus Wollny wrote:
> Is it bad RAM? How can I make sure?

You can test your RAM with memtest86 -- www.memtest86.com

Cheers,

Neil

--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC
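(For anyone unfamiliar with it: memtest86 boots from its own floppy and runs outside the OS, so it exercises the RAM without Linux in the way. A rough sketch of preparing the boot disk - the archive and image names vary by version, so treat this as an outline and check the README that ships with it:)

  # unpack the source tarball, then write the raw boot image to a floppy
  tar xzf memtest86-src.tgz
  cd memtest86
  dd if=memtest.bin of=/dev/fd0 bs=8192
  # reboot from the floppy and let it run several full passes, ideally
  # overnight - marginal modules often fail only after warming up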
On Tue, 2002-08-06 at 17:38, Markus Wollny wrote:
> What I'd like to know is if I need to look any further than RAM - shall
> I just chuck the new modules out of the machine? Or is there some other
> issue that could cause this behaviour? I am quite sure that I didn't do
> anything wrong during installation, configuration and import and the
> same application code is running without errors on a different machine
> at this very moment. I don't like the "record with zero length" and
> "Cannot allocate memory"-bits in the logfile at all, let alone the "was
> terminated by signal 9"-thingy.

9 is SIGKILL - that is significant because it implies that your OS is terminating the process (signal 11 would be the likely one for a bad pointer dereference, which could well indicate RAM problems). I don't think that you should immediately suspect your hardware. This all looks suspiciously like an OS out-of-memory situation - that would also correspond to it only happening under load. A few things to check (sketched below):

1) Swap enabled, set to a suitable value for the load on the machine? (What does "free" say?)

2) There is a Linux sysctl which determines whether to "overcommit" memory. Also check that ulimit isn't imposing any per-process memory or CPU limits.

3) If it's a stock Linux install, you may be running excessive daemons, but I'd be surprised if things got quite this bad.

Regards

John
--
John Gray
Azuli IT
www.azuli.co.uk
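(A quick sketch of those three checks on a 2.4-era kernel - the /proc paths are standard, but defaults vary by distribution:)

  # 1) swap present and sized sensibly?
  free
  cat /proc/swaps

  # 2) overcommit policy and per-process limits
  cat /proc/sys/vm/overcommit_memory   # 0 = heuristic overcommit
  ulimit -a                            # watch the data seg / max memory lines

  # 3) anything else hogging memory?
  ps aux --sort=-rss | head -20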
"Markus Wollny" <Markus.Wollny@computec.de> writes: > So: Is it bad RAM? How can I make sure? What else could it be? Have you tried running memtest86? I've never used that myself but some folks on the list say it works well. > Here's a small excerpt from the logfile: > 2002-08-06 17:36:23 [17296] DEBUG: _mdfd_blind_getseg: couldn't open > /var/lib/pgsql/data/base/base/16596/16671: Cannot allocate memory Is it possible that you are running with inadequate swap space, a small data segment limit (ulimit -d), or something else that would make the kernel refuse to give memory to a backend process? > 2002-08-06 17:40:53 [16530] DEBUG: connection startup failed (fork > failure): Cannot allocate memory > 2002-08-06 17:52:50 [16530] DEBUG: connection startup failed (fork > failure): Cannot allocate memory Still looks like inadequate memory --- but now I'm thinking that it's a system-wide condition, ie, you just plain haven't got enough RAM for the number of processes you're trying to start. > 2002-08-06 17:52:54 [16530] DEBUG: server process (pid 18237) was > terminated by signal 9 Postgres never issues any kill -9 on itself, but I've heard that the Linux kernel may start killing processes when it's desperately low on memory. Other than the signal 9, everything I see in this trace is either a cannot-allocate-memory failure or followup effects from one. How many backends are you trying to start up, anyway? Might you have a runaway client that keeps opening new backend connections? regards, tom lane
Hi!

-----Original Message-----
From: Tom Lane
Sent: Tue 06.08.2002 18:59
To: Markus Wollny
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] URGENT: Database keeps crashing - suspect damaged RAM

> Have you tried running memtest86? I've never used that myself but
> some folks on the list say it works well.

No, I haven't tried that yet, but I'm surely going to do so tomorrow.

> Is it possible that you are running with inadequate swap space, a small
> data segment limit (ulimit -d), or something else that would make the
> kernel refuse to give memory to a backend process?

I shouldn't think so; the machine has 2 GB RAM (that was more than sufficient for the same DB, applications and load on a different machine) and 4 GB swap:

Disk geometry for /dev/sda: 0.000-51834.000 megabytes
Disk label type: msdos
Minor     Start        End    Type     Filesystem  Flags
1         0.031     15.688    primary  ext3        boot
2        15.688   4118.225    primary  linux-swap
3      4118.225  24599.531    primary  ext3
4     24599.531  51826.882    primary  ext3

Taking a closer look I am a bit confused: I allocated 4 GB to the swap partition, as you can see above, but free only reports 2 GB? That's strange, but it cannot be the cause, I think, as the working machine has got just 2 GB swap, too. ulimit is set to "unlimited" and there was RAM available during load. As a matter of fact, this is what free reports right now on our fallback-machine, which is running the very same database and very same application:

             total       used       free     shared    buffers     cached
Mem:       2061536    2053816       7720          0       4496    1825620
-/+ buffers/cache:     223700    1837836
Swap:      2097136     124800    1972336

When taking a look at total disk usage of the database, I get a total of 1.8 GB. When I switched to the new machine, there were about 30-50 open connections; max connections is set to 512 on both machines. The crashes occurred immediately after making the DB accessible to our application, so most of the DB was definitely not yet in memory. And again - our fallback-machine, which has got no RAID and slower processors, can handle the very same DB under the very same load with no such problems. I never ever encountered this "cannot allocate memory" error before.

> Still looks like inadequate memory --- but now I'm thinking that it's a
> system-wide condition, ie, you just plain haven't got enough RAM for the
> number of processes you're trying to start.
>
> How many backends are you trying to start up, anyway? Might you have a
> runaway client that keeps opening new backend connections?

Must be something else - the number of connections was not at all high (<100), the server load wasn't more than 3.5 (on a 4-processor machine), there was RAM available at the time, both physical and swap, and I haven't got any surplus daemons running... I think I'll be able to confirm or rule out the bad-RAM theory tomorrow using memtest86.

Thank you!

Regards,

Markus
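(If the box will crash again before the memtest run, it may be worth logging memory state continuously, so the seconds before a kill are on record - a minimal sketch; the log path is just an example:)

  # timestamped memory snapshot every 5 seconds; after the next crash,
  # the tail of the file shows whether free RAM and swap really were
  # exhausted at the time
  while true; do
      date >> /var/log/memwatch.log
      free >> /var/log/memwatch.log
      sleep 5
  done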
Virtual memory problems on Linux have certainly happened before; perhaps you're running a kernel that had some major ones. Maybe if you upgraded to 2.4.19?

Regards,
Jeff Davis

On Tuesday 06 August 2002 11:02 am, Markus Wollny wrote:
> [...]
A couple of points. One is that the Linux kernel (if memory serves) is limited to 2 gig swap partitions, but can have more than one swap. In fact, on a server that winds up swapping, it is quite advantageous to have several swap partitions spread about on all the platters you can, as the kernel will then interleave swap access across all the drives for maximum performance.

What are your settings for sort_mem in postgresql.conf? Note that large values for sort_mem can starve your machine for memory very quickly, but only under load, and only when things need to be sorted.

I assume nothing like file-max or shmmax is the issue either?
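(A sketch of checking all three in one go - the data directory path here is the one from the logs earlier in this thread:)

  # current kernel limits
  cat /proc/sys/kernel/shmmax
  cat /proc/sys/fs/file-max

  # what the server is actually configured with
  grep -E 'sort_mem|shared_buffers|max_connections' \
      /var/lib/pgsql/data/postgresql.conf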
> several partitions spread about on all the platters you can, as the kernel
> will then interleave swap access across all the drives for maximum
> performance.
[...]

Although this is rather a Linux question than a PostgreSQL one, I want to add that if you have several swap partitions and want to prefer using one over another, you should set the "pri" parameter in /etc/fstab, like

/dev/sdn1   swap   swap   pri=1   0 0
/dev/sdf4   swap   swap   pri=2   0 0
/dev/sdg4   swap   swap   pri=3   0 0
/dev/sdh4   swap   swap   pri=3   0 0
/dev/sdi4   swap   swap   pri=3   0 0

Refer to "man fstab" on your system. This can dramatically improve performance if the server _has_ to swap: it should preferably use an otherwise idle disk before - in extreme situations - it has to touch any application disk.

Kind regards ... Ralph ...
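(Once the partitions are active, a quick way to verify that the priorities took effect:)

  # both list the active swap areas with their priorities; areas of
  # equal priority are used round-robin, higher priorities first
  swapon -s
  cat /proc/swaps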
Hi!

Thank you - that clears up my confusion about the available swap being smaller than the swap partition :)

sort_mem is set to 65534, following the recommendation about setting it to 2-4% of available physical RAM. If shmmax were the issue, the postmaster would refuse to start up - so this isn't it either; I took care of both file-max and shmmax, and the very same configuration is working on our fallback-machine under the same environment (application, load, database, data) without any trouble.

I upgraded the kernel of the machine to 2.4.16 - there are no RPMs for, and not very much experience with, 2.4.19 on SuSE 7.3 yet, and I'm quite cautious when it comes to the kernel; I do know how to configure and compile a kernel, but on a production machine I leave this to SuSE :)

Taking into account that this thing does work when run on a different machine, I think bad RAM is my best bet. But there's only one way to know for sure - I'll go and find out tomorrow.

Regards,

Markus

-----Original Message-----
From: scott.marlowe
Sent: Tue 06.08.2002 20:51
To: Markus Wollny
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] URGENT: Database keeps crashing - suspect damaged RAM

[...]
On Tue, 6 Aug 2002, Markus Wollny wrote:
> Thank you - that clears up my confusion about the available swap being
> smaller than the swap partition :)
> sort_mem is set to 65534, following the recommendation about setting it
> to 2-4% of available physical RAM.
> [...]
> Taking into account that this thing does work when run on a different
> machine, I think bad RAM is my best bet. But there's only one way to
> know for sure - I'll go and find out tomorrow.

Well, I'd first lower the sort_mem myself; 64 megs is pretty big, even on a box with gigs of RAM. But more importantly, since the kernel looks like it was killing the processes, I would NOT tend to think of this as a bad-RAM issue, but as a memory starvation issue. Bad memory results in database corruption, things like that - yours seems to be just suddenly shutting down and coming right back up. Have you checked the available memory while the server is having these problems?

I would tend to think it may be a configuration issue. shmmax doesn't just affect startup: if the sort_mem is coming out of the shared memory, then the limit there could affect the ability of a child to allocate memory when sorting, which would result in the problems you're seeing, where a backend dies while trying but failing to allocate memory. Someone correct me if sort_mem doesn't come under the heading of shared memory - it would NOT be the first time that's happened. :-)
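(Some back-of-the-envelope arithmetic on why a big sort_mem only hurts under load: sort_mem is specified in kilobytes and applies per sort operation, so every backend that happens to be sorting at the same moment can claim that much at once. The figures below are the ones from earlier in this thread:)

  sort_mem          = 65534 KB ~ 64 MB   (per sort, per backend)
  open connections  = 30-50

  worst case, one concurrent sort each:
    50 backends x 64 MB = 3200 MB  >  2048 MB physical RAM

  with sort_mem = 4096 (4 MB):
    50 backends x 4 MB  =  200 MB  -- comfortably within RAM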
OK, I did a little more testing on one of our tables with 1.25 million rows of semi-unique data (it's a keyword table: small row size, lots of keywords, many repeaters - some words occur once, some occur 1200 times, most occur 3 to 10 times). This test box has 512 megs of RAM, and other than having X running it's a pretty close match to the servers we use (1.1 GHz CPU, 4x2G RAID5 drive set). Shared buffers are set to ~32 megabytes (4000*8k).

I ran the following query in four psql sessions in parallel:

select distinct word from wordtable;

With sort_mem set to 64 megs, my workstation, which sits at 0 used swap and about 300 megs of system buff/cache, used all the available memory and about 600 megs of swap to run those four queries, and one of them errored out with:

ERROR: MemoryContextAlloc: invalid request size 4294967293

The run time was very long, with lots of swapping going on - and this was with only four processes connected. Each one used about 130 megs of RAM according to top; subtracting 32 megs of shared, that would be about 100 megs of individual memory per backend.

With sort_mem set to 8 megs, the four queries used up all my RAM, pushing 100 megs into swap, but they were much faster: about 180 seconds. The test with 64 meg sort_mem took about 8 minutes or so (I stopped checking after about 5 minutes; I used explain analyze for all the tests below 64 megs).

Next I tested with 2 megs of sort memory. Now I had a fair bit of RAM left over (about 100 megs), and the queries each took about 135 seconds to run.

I'd suggest lowering sort_mem to something more reasonable, unless you have a test case that shows a marked performance increase with 64 megs of sort_mem. All mine point to 1 to 4 megs being perfect for sort_mem on most queries.

Good luck.
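(sort_mem can be changed per session, so a comparison like this needs no server restart. A sketch using the table from the test above - the database name is just a placeholder:)

  # run the same query at three sort_mem settings (values in KB)
  for mem in 65534 8192 2048; do
      echo "sort_mem = ${mem} KB"
      psql -d test \
           -c "SET sort_mem = ${mem}; EXPLAIN ANALYZE SELECT DISTINCT word FROM wordtable;"
  done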
Hi!

I think I'll have to bow down to you gurus - again :) I upgraded to 2.4.16 (there are no RPMs for 2.4.19 and I didn't want to compile from source - yet), and the symptoms have disappeared altogether. Which is strange, because, as I already said, the very same config isn't giving me any trouble on a different machine... Anyway: I'll shun 2.4.10 from now on.

Regards,

Markus

> -----Original Message-----
> From: Jeff Davis [mailto:list-pgsql-general@empires.org]
> Sent: Tuesday, 6 August 2002 20:29
> To: Markus Wollny; Tom Lane
> Cc: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] URGENT: Database keeps crashing - suspect damaged RAM
>
> Virtual memory problems on Linux have certainly happened before; perhaps
> you're running a kernel that had some major ones. Maybe if you upgraded
> to 2.4.19?
>
> [...]