Thread: Kernel kills postgres process - help needed
Hi,

I have a big problem with a PostgreSQL server. It has been happening regularly since I added 8 GB of memory to a server that already had 8 GB. Nothing else has changed. It is a Dell server, and all of Dell's memory diagnostics look good.

When there are a lot of connections on the server (persistent connections from 6 Apache/PHP web servers using PDO, about 110 processes on each web server), or long-running requests (it is difficult for me to know exactly when it happens), the kernel seems to kill my PostgreSQL process. The server then becomes completely unstable and most of the time needs a reboot.

I'm on Linux kernel 2.6.15 with PostgreSQL 8.1.10.
My database is 56 GB in size.
RAM = 16 GB
kernel shmmax: 941604096

PostgreSQL config:

max_connections = 2048
shared_buffers = 40000
#temp_buffers = 1000 # min 100, 8KB each
work_mem = 2048 # min 64, size in KB
maintenance_work_mem = 512000 # min 1024, size in KB
max_stack_depth = 4096 # min 100, size in KB
max_fsm_pages = 25000000
max_fsm_relations = 2000 # min 100, ~70 bytes each
max_files_per_process = 255 # min 25
fsync = on
wal_buffers = 128 # min 4, 8KB each
commit_delay = 500 # range 0-100000, in microseconds
commit_siblings = 5 # range 1-1000
checkpoint_segments = 160
effective_cache_size = 600000 # typically 8KB each
random_page_cost = 2

Syslog when crashing:

Jan 9 20:30:47 db2 kernel: oom-killer: gfp_mask=0x84d0, order=0
Jan 9 20:30:48 db2 kernel: Mem-info:
Jan 9 20:30:48 db2 kernel: DMA per-cpu:
Jan 9 20:30:48 db2 kernel: cpu 0 hot: low 0, high 0, batch 1 used:0
Jan 9 20:30:48 db2 kernel: cpu 0 cold: low 0, high 0, batch 1 used:0
Jan 9 20:30:48 db2 kernel: cpu 1 hot: low 0, high 0, batch 1 used:0
Jan 9 20:30:48 db2 kernel: cpu 1 cold: low 0, high 0, batch 1 used:0
Jan 9 20:30:48 db2 kernel: cpu 2 hot: low 0, high 0, batch 1 used:0
Jan 9 20:30:48 db2 kernel: cpu 2 cold: low 0, high 0, batch 1 used:0
Jan 9 20:30:48 db2 kernel: cpu 3 hot: low 0, high 0, batch 1 used:0
Jan 9 20:30:48 db2 kernel: cpu 3 cold: low 0, high 0, batch 1 used:0
Jan 9 20:30:48 db2 kernel: DMA32 per-cpu: empty
Jan 9 20:30:48 db2 kernel: Normal per-cpu:
Jan 9 20:30:48 db2 kernel: cpu 0 hot: low 0, high 186, batch 31 used:5
Jan 9 20:30:48 db2 kernel: cpu 0 cold: low 0, high 62, batch 15 used:59
Jan 9 20:30:48 db2 kernel: cpu 1 hot: low 0, high 186, batch 31 used:22
Jan 9 20:30:48 db2 kernel: cpu 1 cold: low 0, high 62, batch 15 used:49
Jan 9 20:30:48 db2 kernel: cpu 2 hot: low 0, high 186, batch 31 used:33
Jan 9 20:30:48 db2 kernel: cpu 2 cold: low 0, high 62, batch 15 used:60
Jan 9 20:30:48 db2 kernel: cpu 3 hot: low 0, high 186, batch 31 used:3
Jan 9 20:30:48 db2 kernel: cpu 3 cold: low 0, high 62, batch 15 used:55
Jan 9 20:30:48 db2 kernel: HighMem per-cpu:
Jan 9 20:30:48 db2 kernel: cpu 0 hot: low 0, high 186, batch 31 used:5
Jan 9 20:30:48 db2 kernel: cpu 0 cold: low 0, high 62, batch 15 used:5
Jan 9 20:30:48 db2 kernel: cpu 1 hot: low 0, high 186, batch 31 used:11
Jan 9 20:30:48 db2 kernel: cpu 1 cold: low 0, high 62, batch 15 used:4
Jan 9 20:30:48 db2 kernel: cpu 2 hot: low 0, high 186, batch 31 used:17
Jan 9 20:30:48 db2 kernel: cpu 2 cold: low 0, high 62, batch 15 used:14
Jan 9 20:30:48 db2 kernel: cpu 3 hot: low 0, high 186, batch 31 used:14
Jan 9 20:30:48 db2 kernel: cpu 3 cold: low 0, high 62, batch 15 used:9
Jan 9 20:30:48 db2 kernel: Free pages: 497624kB (490232kB HighMem)
Jan 9 20:30:48 db2 kernel: Active:3604892 inactive:234379 dirty:20273 writeback:210 unstable:0 free:124406 slab:49119 mapped:547571 pagetables:139724
Jan 9 20:30:48 db2 kernel: DMA free:3588kB min:68kB low:84kB high:100kB active:0kB inactive:0kB present:16384kB pages_scanned:1 all_unreclaimable? yes
Jan 9 20:30:48 db2 kernel: lowmem_reserve[]: 0 0 880 17392
Jan 9 20:30:48 db2 kernel: DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
Jan 9 20:30:48 db2 kernel: lowmem_reserve[]: 0 0 880 17392
Jan 9 20:30:48 db2 kernel: Normal free:3804kB min:3756kB low:4692kB high:5632kB active:508kB inactive:464kB present:901120kB pages_scanned:975 all_unreclaimable? yes
Jan 9 20:30:48 db2 kernel: lowmem_reserve[]: 0 0 0 132096
Jan 9 20:30:48 db2 kernel: HighMem free:490108kB min:512kB low:18148kB high:35784kB active:14419044kB inactive:937112kB present:16908288kB pages_scanned:0 all_unreclaimable? no
Jan 9 20:30:48 db2 kernel: lowmem_reserve[]: 0 0 0 0
Jan 9 20:30:48 db2 kernel: DMA: 1*4kB 0*8kB 2*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 3588kB
Jan 9 20:30:48 db2 kernel: DMA32: empty
Jan 9 20:30:48 db2 kernel: Normal: 35*4kB 0*8kB 7*16kB 5*32kB 1*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 3804kB
Jan 9 20:30:48 db2 kernel: HighMem: 29171*4kB 43358*8kB 1620*16kB 8*32kB 0*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 490108kB
Jan 9 20:30:48 db2 kernel: Swap cache: add 161, delete 160, find 98/138, race 0+0
Jan 9 20:30:48 db2 kernel: Free swap = 15623168kB
Jan 9 20:30:48 db2 kernel: Total swap = 15623172kB
Jan 9 20:30:48 db2 kernel: Free swap: 15623168kB
Jan 9 20:30:48 db2 kernel: oom-killer: gfp_mask=0x84d0, order=0
Jan 9 20:30:48 db2 kernel: Mem-info:
Jan 9 20:30:48 db2 kernel: DMA per-cpu:
Jan 9 20:30:48 db2 postgres[7634]: [2-1] LOG: background writer process (PID 7639) was terminated by signal 9
Jan 9 20:30:48 db2 kernel: cpu 0 hot: low 0, high 0, batch 1 used:0
Jan 9 20:30:48 db2 kernel: cpu 0 cold: low 0, high 0, batch 1 used:0
Jan 9 20:30:48 db2 postgres[7634]: [3-1] LOG: terminating any other active server processes
Jan 9 20:30:48 db2 postgres[4058]: [2-1] WARNING: terminating connection because of crash of another server process
Jan 9 20:30:48 db2 postgres[4058]: [2-2] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server
Jan 9 20:30:48 db2 postgres[4058]: [2-3] process exited abnormally and possibly corrupted shared memory.
Jan 9 20:30:48 db2 postgres[4044]: [2-1] WARNING: terminating connection because of crash of another server process
Jan 9 20:30:48 db2 postgres[4058]: [2-4] HINT: In a moment you should be able to reconnect to the database and repeat your command.
Jan 9 20:30:48 db2 postgres[4023]: [2-1] WARNING: terminating connection because of crash of another server process
Jan 9 20:30:48 db2 postgres[4023]: [2-2] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server
Jan 9 20:30:48 db2 postgres[4023]: [2-3] process exited abnormally and possibly corrupted shared memory.
Jan 9 20:30:48 db2 postgres[4023]: [2-4] HINT: In a moment you should be able to reconnect to the database and repeat your command.
etc.

At that moment I had 877 connections, which is nothing very big for our activity.

If anybody has any idea (a bad configuration parameter, or any other way to solve my problem), help would be really appreciated.

Regards,
--
Hervé
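To put the quoted settings in perspective, here is a rough sizing sketch. It assumes the default 8 kB block size (the config comments say "8KB each") and uses a deliberately pessimistic case in which every allowed backend runs one sort at the same time; the helper name `kb_to_mb` is made up for the example and nothing here is measured on the server in question.

```python
# Rough sizing of the settings quoted in the post (PostgreSQL 8.1 units).
BLOCK_KB = 8  # assumption: default 8 kB block size

settings = {
    "max_connections": 2048,
    "shared_buffers": 40000,         # counted in 8 kB buffers
    "work_mem": 2048,                # kB, per sort/hash operation
    "maintenance_work_mem": 512000,  # kB
}
SHMMAX_BYTES = 941604096             # kernel shmmax from the post


def kb_to_mb(kb):
    """Convert kilobytes to megabytes (1 MB = 1024 kB)."""
    return kb / 1024.0


shared_buffers_kb = settings["shared_buffers"] * BLOCK_KB

# Pessimistic sketch: one work_mem-sized sort in every allowed backend.
work_mem_total_kb = settings["work_mem"] * settings["max_connections"]

print("shared_buffers:       %8.1f MB" % kb_to_mb(shared_buffers_kb))
print("kernel shmmax:        %8.1f MB" % (SHMMAX_BYTES / 1024.0 / 1024.0))
print("work_mem x max_conn:  %8.1f MB" % kb_to_mb(work_mem_total_kb))
print("maintenance_work_mem: %8.1f MB" % kb_to_mb(settings["maintenance_work_mem"]))
```

The per-backend figure only covers sort memory; each backend also needs memory of its own that this arithmetic does not try to estimate.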
On Wed, 2008-01-09 at 22:57 +0100, Hervé Piedvache wrote:
> Hi,
>
> I have a big problem with a PostgreSQL server. It has been happening
> regularly since I added 8 GB of memory to a server that already had 8 GB.
> Nothing else has changed. It is a Dell server, and all of Dell's memory
> diagnostics look good.
> When there are a lot of connections on the server (persistent connections
> from 6 Apache/PHP web servers using PDO, about 110 processes on each web
> server), or long-running requests (it is difficult for me to know exactly
> when it happens), the kernel seems to kill my PostgreSQL process. The
> server then becomes completely unstable and most of the time needs a reboot.
>
> I'm on Linux kernel 2.6.15 with PostgreSQL 8.1.10.
> My database is 56 GB in size.
> RAM = 16 GB
> [snip]
> Jan 9 20:30:47 db2 kernel: oom-killer: gfp_mask=0x84d0, order=0

It looks like the Out Of Memory Killer was invoked, and you need to find out why it was invoked.

I posted to LKML here:

http://kerneltrap.org/mailarchive/linux-kernel/2007/2/12/54202

because Linux has a behavior -- which in my opinion is a bug -- that causes the OOM killer to almost always kill PostgreSQL first, regardless of whether it was truly the offending process or not.

So, find out which process truly caused the memory pressure that led to the OOM killer being invoked, and fix that problem.

You may also consider some other Linux configuration options that make invocation of the OOM killer less likely.

Regards,
Jeff Davis
Hervé Piedvache <bill.footcow@gmail.com> writes:
> When there are a lot of connections on the server (persistent connections
> from 6 Apache/PHP web servers using PDO, about 110 processes on each web
> server), or long-running requests (it is difficult for me to know exactly
> when it happens), the kernel seems to kill my PostgreSQL process. The
> server then becomes completely unstable and most of the time needs a reboot.

Turn off memory overcommit.

> max_connections = 2048

Have you considered using a connection pooler in front of a smaller number of backends?

If you really need that many backends, it'd likely be a good idea to reduce max_files_per_process to perhaps 100 or so. If you manage to run the kernel out of filetable slots, all sorts of userland stuff is going to get very unhappy.

regards, tom lane
Tom,

On Wednesday 09 January 2008, Tom Lane wrote:
> Hervé Piedvache <bill.footcow@gmail.com> writes:
> > When there are a lot of connections on the server (persistent connections
> > from 6 Apache/PHP web servers using PDO, about 110 processes on each web
> > server), or long-running requests (it is difficult for me to know exactly
> > when it happens), the kernel seems to kill my PostgreSQL process. The
> > server then becomes completely unstable and most of the time needs a
> > reboot.
>
> Turn off memory overcommit.

My sysctl.conf file looks like this:

kernel.shmmax= 941604096
kernel.sem = 250 32000 100 400
fs.file-max=655360
vm.overcommit_memory=2
vm.overcommit_ratio=30

> > max_connections = 2048
>
> Have you considered using a connection pooler in front of a smaller
> number of backends?

Which system do you recommend for this?

> If you really need that many backends, it'd likely be a good idea to
> reduce max_files_per_process to perhaps 100 or so. If you manage
> to run the kernel out of filetable slots, all sorts of userland stuff
> is going to get very unhappy.

I'll try this ...

regards,
--
Hervé
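The numbers in this exchange can be checked with a little arithmetic. The sketch below assumes the kernel's documented formula for strict overcommit mode (vm.overcommit_memory=2): the commit limit is swap plus overcommit_ratio percent of RAM, ignoring huge pages. It also assumes a nominal 16 GB of RAM (the real MemTotal is somewhat lower). The function names are made up for the example.

```python
def commit_limit_kb(ram_kb, swap_kb, overcommit_ratio):
    """Commit limit in strict overcommit mode: swap + ratio% of RAM (huge pages ignored)."""
    return swap_kb + ram_kb * overcommit_ratio // 100


def worst_case_open_files(max_connections, max_files_per_process):
    """Upper bound on files held open by backends if every slot is used."""
    return max_connections * max_files_per_process


RAM_KB = 16 * 1024 * 1024   # assumption: nominal 16 GB
SWAP_KB = 15623172          # "Total swap" from the syslog excerpt
FILE_MAX = 655360           # fs.file-max from sysctl.conf
MAX_CONNECTIONS = 2048      # from postgresql.conf

limit_kb = commit_limit_kb(RAM_KB, SWAP_KB, 30)
print("commit limit with ratio 30: %d kB" % limit_kb)

for per_process in (255, 100):  # current setting, then the suggested value
    files = worst_case_open_files(MAX_CONNECTIONS, per_process)
    print("max_files_per_process=%d -> up to %d files (%.0f%% of fs.file-max)"
          % (per_process, files, 100.0 * files / FILE_MAX))
```

The file count is only the share that PostgreSQL backends could claim; every other process on the machine draws from the same fs.file-max budget.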
On Wednesday 09 January 2008, Jeff Davis wrote:
> On Wed, 2008-01-09 at 22:57 +0100, Hervé Piedvache wrote:
> > Hi,
> >
> > I have a big problem with a PostgreSQL server. It has been happening
> > regularly since I added 8 GB of memory to a server that already had 8 GB.
> > Nothing else has changed. It is a Dell server, and all of Dell's memory
> > diagnostics look good.
> > When there are a lot of connections on the server (persistent connections
> > from 6 Apache/PHP web servers using PDO, about 110 processes on each web
> > server), or long-running requests (it is difficult for me to know exactly
> > when it happens), the kernel seems to kill my PostgreSQL process. The
> > server then becomes completely unstable and most of the time needs a
> > reboot.
> >
> > I'm on Linux kernel 2.6.15 with PostgreSQL 8.1.10.
> > My database is 56 GB in size.
> > RAM = 16 GB
> > [snip]
> >
> > Jan 9 20:30:47 db2 kernel: oom-killer: gfp_mask=0x84d0, order=0
>
> It looks like the Out Of Memory Killer was invoked, and you need to find
> out why it was invoked.
>
> I posted to LKML here:
>
> http://kerneltrap.org/mailarchive/linux-kernel/2007/2/12/54202
>
> because Linux has a behavior -- which in my opinion is a bug -- that
> causes the OOM killer to almost always kill PostgreSQL first, regardless
> of whether it was truly the offending process or not.
>
> So, find out which process truly caused the memory pressure that led to
> the OOM killer being invoked, and fix that problem.

How can I go about finding this? It's a production server for a web service, and I have no idea how to find which process was the cause of this!

> You may also consider some other Linux configuration options that make
> invocation of the OOM killer less likely.

On this server only PostgreSQL, Slony, and sshd are running; the rest are just basic Linux processes (cron, atd, getty, etc.).

regards,
--
Hervé Piedvache
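One way to approach the "which process?" question is to record the largest memory consumers at regular intervals (from cron, for instance) so that there is something to look at after the next OOM event. The sketch below reads the `VmRSS` line of `/proc/<pid>/status` on Linux; the function names are made up for the example, and it is an illustration rather than a tested monitoring tool.

```python
import os


def parse_vm_rss_kb(status_text):
    """Return VmRSS in kB from the text of /proc/<pid>/status, or None if absent."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None  # kernel threads have no VmRSS line


def parse_name(status_text):
    """Return the process name from the Name: line, or None if absent."""
    for line in status_text.splitlines():
        if line.startswith("Name:"):
            return line.split(None, 1)[1].strip()
    return None


def top_memory_processes(proc_root="/proc", limit=10):
    """Return up to `limit` (rss_kb, pid, name) tuples, largest resident size first."""
    rows = []
    if not os.path.isdir(proc_root):
        return rows
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as handle:
                text = handle.read()
        except (IOError, OSError):
            continue  # the process exited or is not readable
        rss_kb = parse_vm_rss_kb(text)
        if rss_kb is not None:
            rows.append((rss_kb, int(entry), parse_name(text)))
    rows.sort(reverse=True)
    return rows[:limit]


if __name__ == "__main__":
    for rss_kb, pid, name in top_memory_processes():
        print("%10d kB  pid %-7d %s" % (rss_kb, pid, name))
```

Resident size includes shared memory pages a process has touched, so PostgreSQL backends each appear to "own" part of shared_buffers; summing the column overstates real usage.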
On Jan 9, 2008 3:57 PM, Hervé Piedvache <bill.footcow@gmail.com> wrote:

SNIP

> 0+0
> Jan 9 20:30:48 db2 kernel: Free swap = 15623168kB
> Jan 9 20:30:48 db2 kernel: Total swap = 15623172kB
> Jan 9 20:30:48 db2 kernel: Free swap: 15623168kB
> Jan 9 20:30:48 db2 kernel: oom-killer: gfp_mask=0x84d0, order=0
> Jan 9 20:30:48 db2 kernel: Mem-info:
> Jan 9 20:30:48 db2 kernel: DMA per-cpu:
> Jan 9 20:30:48 db2 postgres[7634]: [2-1] LOG: background writer process (PID
> 7639) was terminated by signal 9

This makes no sense to me. The OS is showing that there's 16G free swap. Why is it killing things? I'm betting there's some bug with too large of a swap resulting in some kind of wrap around or something.
On Wed, 09 Jan 2008 14:17:14 -0800 Jeff Davis <pgsql@j-davis.com> wrote:
> I posted to LKML here:
>
> http://kerneltrap.org/mailarchive/linux-kernel/2007/2/12/54202
>
> because Linux has a behavior -- which in my opinion is a bug -- that
> causes the OOM killer to almost always kill PostgreSQL first,
> regardless of whether it was truly the offending process or not.

If that isn't an argument for FreeBSD I don't know what is...

/linuxpoet

> So, find out which process truly caused the memory pressure that led
> to the OOM killer being invoked, and fix that problem.
>
> You may also consider some other Linux configuration options that make
> invocation of the OOM killer less likely.
>
> Regards,
> Jeff Davis

--
The PostgreSQL Company: Since 1997, http://www.commandprompt.com/
Sales/Support: +1.503.667.4564 24x7/Emergency: +1.800.492.2240
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
SELECT 'Training', 'Consulting' FROM vendor WHERE name = 'CMD'
On Wednesday 09 January 2008 18:21, Joshua D. Drake wrote:
> On Wed, 09 Jan 2008 14:17:14 -0800
>
> Jeff Davis <pgsql@j-davis.com> wrote:
> > I posted to LKML here:
> >
> > http://kerneltrap.org/mailarchive/linux-kernel/2007/2/12/54202
> >
> > because Linux has a behavior -- which in my opinion is a bug -- that
> > causes the OOM killer to almost always kill PostgreSQL first,
> > regardless of whether it was truly the offending process or not.
>
> If that isn't an argument for FreeBSD I don't know what is...

Funny, it looked like an argument for Solaris to me. ;-)

--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
On Wed, Jan 09, 2008 at 04:59:45PM -0600, Scott Marlowe wrote:
> On Jan 9, 2008 3:57 PM, Hervé Piedvache <bill.footcow@gmail.com> wrote:
> > Jan 9 20:30:48 db2 kernel: Free swap = 15623168kB
> > Jan 9 20:30:48 db2 kernel: Total swap = 15623172kB
> > Jan 9 20:30:48 db2 kernel: Free swap: 15623168kB
> > Jan 9 20:30:48 db2 kernel: oom-killer: gfp_mask=0x84d0, order=0
> > Jan 9 20:30:48 db2 kernel: Mem-info:
> > Jan 9 20:30:48 db2 kernel: DMA per-cpu:
> > Jan 9 20:30:48 db2 postgres[7634]: [2-1] LOG: background writer process (PID
> > 7639) was terminated by signal 9
>
> This makes no sense to me. The OS is showing that there's
> 16G free swap. Why is it killing things? I'm betting there's some
> bug with too large of a swap resulting in some kind of wrap around or
> something.

At a guess it's this:

Jan 9 20:30:48 db2 kernel: DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no

Which is why the bgwriter got whacked: it couldn't allocate any memory for the disk transfer (though why the OOM killer gets invoked here I don't know). Disabling overcommit won't help you either.

Perhaps a 64-bit architecture? Or a RAID controller that can access high memory (is this possible?).

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Those who make peaceful revolution impossible will make violent revolution inevitable.
> -- John F Kennedy
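The per-zone lines in the original syslog excerpt are easier to compare side by side. The sketch below parses the four zone lines exactly as posted and prints how far each zone's free memory sits above its `min` watermark; `parse_zone` is a name invented for the example. In the posted log the DMA32 zone reports present:0kB, while the Normal zone has 3804kB free against a min of 3756kB. Whether that is what triggered the OOM killer is a reading of the log, not something the thread confirms.

```python
import re

# Zone lines copied from the syslog excerpt in the first message.
ZONE_LINES = [
    "DMA free:3588kB min:68kB low:84kB high:100kB active:0kB inactive:0kB "
    "present:16384kB pages_scanned:1 all_unreclaimable? yes",
    "DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB "
    "present:0kB pages_scanned:0 all_unreclaimable? no",
    "Normal free:3804kB min:3756kB low:4692kB high:5632kB active:508kB "
    "inactive:464kB present:901120kB pages_scanned:975 all_unreclaimable? yes",
    "HighMem free:490108kB min:512kB low:18148kB high:35784kB active:14419044kB "
    "inactive:937112kB present:16908288kB pages_scanned:0 all_unreclaimable? no",
]

FIELD_RE = re.compile(r"(\w+):(\d+)kB")


def parse_zone(line):
    """Return (zone_name, fields) where fields maps each kB counter to an int."""
    name = line.split()[0]
    fields = {key: int(value) for key, value in FIELD_RE.findall(line)}
    fields["all_unreclaimable"] = line.rstrip().endswith("yes")
    return name, fields


zones = dict(parse_zone(line) for line in ZONE_LINES)

print("%-8s %12s %10s %8s %10s" % ("zone", "present kB", "free kB", "min kB", "headroom"))
for name in ("DMA", "DMA32", "Normal", "HighMem"):
    z = zones[name]
    print("%-8s %12d %10d %8d %10d"
          % (name, z["present"], z["free"], z["min"], z["free"] - z["min"]))
```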
Tom,

On Wednesday 9 January 2008, Tom Lane wrote:
> Hervé Piedvache <bill.footcow@gmail.com> writes:
> > When there are a lot of connections on the server (persistent connections
> > from 6 Apache/PHP web servers using PDO, about 110 processes on each web
> > server), or long-running requests (it is difficult for me to know exactly
> > when it happens), the kernel seems to kill my PostgreSQL process. The
> > server then becomes completely unstable and most of the time needs a
> > reboot.
>
> Turn off memory overcommit.
>
> > max_connections = 2048
>
> Have you considered using a connection pooler in front of a smaller
> number of backends?

You never answered me on this point. We use persistent connections, so I don't understand the benefit of using a pooler.
Otherwise, which pooler do you recommend, and what would the improvement be for us?

Thanks,
--
Hervé Piedvache
Elma Ingénierie Informatique
Groupe Maximiles S.A.
3 rue d'Uzès
F-75002 - Paris - France
Pho. 33-144949901
Fax. 33-144882747
Hervé Piedvache wrote:
> Tom,
>
> On Wednesday 9 January 2008, Tom Lane wrote:
> > Hervé Piedvache <bill.footcow@gmail.com> writes:
> > > When there are a lot of connections on the server (persistent
> > > connections from 6 Apache/PHP web servers using PDO, about 110
> > > processes on each web server), or long-running requests (it is
> > > difficult for me to know exactly when it happens), the kernel seems to
> > > kill my PostgreSQL process. The server then becomes completely unstable
> > > and most of the time needs a reboot.

> > Have you considered using a connection pooler in front of a smaller
> > number of backends?
>
> You never answered me on this point. We use persistent connections, so I
> don't understand the benefit of using a pooler.

The problem with persistent connections is that they are, well, persistent -- so they keep resources allocated, which the server cannot then use for other things. The PHP model of persistent connections is silly and useless, because each PHP process keeps an open connection (or more than one, if it connects to different databases), which is then idle most of the time.

A pooler also keeps the connections open, but they are given in turns to different PHP processes as they need them. The total number of open connections to the database server is lower, which leads to resource wastage being lower.

> Otherwise, which pooler do you recommend, and what would the improvement be
> for us?

The two most recommended ones I've seen around here are pgbouncer and pgpool. I think pgbouncer is supposed to perform better, at the cost of not having certain bells and whistles (which you may not need anyway).

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
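The difference between "one connection per PHP process" and "a few connections handed out in turns" can be illustrated with a toy model. The sketch below is not how pgbouncer or pgpool are implemented; `FakeConnection`, `Pool`, and `simulate` are hypothetical names, and no real database is involved. It only shows many client threads sharing a small, fixed set of connections.

```python
import queue
import threading


class FakeConnection(object):
    """Stand-in for a real database connection (hypothetical, no network I/O)."""

    def __init__(self, ident):
        self.ident = ident
        self.queries = 0

    def execute(self, sql):
        self.queries += 1
        return "conn %d ran: %s" % (self.ident, sql)


class Pool(object):
    """Hands a fixed set of connections to callers one query at a time."""

    def __init__(self, size):
        self.connections = [FakeConnection(i) for i in range(size)]
        self._idle = queue.Queue()
        for conn in self.connections:
            self._idle.put(conn)

    def run(self, sql):
        conn = self._idle.get()      # blocks until some connection is free
        try:
            return conn.execute(sql)
        finally:
            self._idle.put(conn)     # give it back for the next client


def simulate(clients, pool_size, queries_per_client):
    """Run `clients` threads against a pool of `pool_size` connections."""
    pool = Pool(pool_size)

    def worker(number):
        for _ in range(queries_per_client):
            pool.run("SELECT %d" % number)

    threads = [threading.Thread(target=worker, args=(n,)) for n in range(clients)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return pool
```

With the figures from this thread, 6 web servers times about 110 PHP processes is roughly 660 persistent connections to one database server; in the toy model, `simulate(660, 20, 5)` would serve the same clients through 20 connections (the pool size of 20 is an arbitrary illustration).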
"Alvaro Herrera" <alvherre@commandprompt.com> writes: > The PHP model of persistent connections is silly and useless, because each > PHP process keeps an open connection (or more than one, if it connects to > different databases), which is then idle most of the time. Some might say that keeping PHP processes around which are idle most of the time would be silly in itself. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication support!
Thanks, Alvaro, for your very clear answer.

Another, maybe stupid, question: when you have several web nodes like I do, with several physical databases (I'm not talking about replication; it's just that a web node can contact 3 or 4 different databases for different applications), what is the best way to proceed with a pooler: install one on each node, or one on each database?

Regards,

On Thursday 7 February 2008, Alvaro Herrera wrote:
> Hervé Piedvache wrote:
> > Tom,
> >
> > On Wednesday 9 January 2008, Tom Lane wrote:
> > > Hervé Piedvache <bill.footcow@gmail.com> writes:
> > > > When there are a lot of connections on the server (persistent
> > > > connections from 6 Apache/PHP web servers using PDO, about 110
> > > > processes on each web server), or long-running requests (it is
> > > > difficult for me to know exactly when it happens), the kernel seems
> > > > to kill my PostgreSQL process. The server then becomes completely
> > > > unstable and most of the time needs a reboot.
> > >
> > > Have you considered using a connection pooler in front of a smaller
> > > number of backends?
> >
> > You never answered me on this point. We use persistent connections, so
> > I don't understand the benefit of using a pooler.
>
> The problem with persistent connections is that they are, well,
> persistent -- so they keep resources allocated, which the server cannot
> then use for other things. The PHP model of persistent connections is
> silly and useless, because each PHP process keeps an open connection (or
> more than one, if it connects to different databases), which is then
> idle most of the time.
>
> A pooler also keeps the connections open, but they are given in turns to
> different PHP processes as they need them. The total number of open
> connections to the database server is lower, which leads to resource
> wastage being lower.
>
> > Otherwise, which pooler do you recommend, and what would the improvement
> > be for us?
>
> The two most recommended ones I've seen around here are pgbouncer and
> pgpool. I think pgbouncer is supposed to perform better, at the cost of
> not having certain bells and whistles (which you may not need anyway).

--
Hervé Piedvache
Elma Ingénierie Informatique
Groupe Maximiles S.A.
3 rue d'Uzès
F-75002 - Paris - France
Pho. 33-144949901
Fax. 33-144882747
Hervé Piedvache wrote:
> Another, maybe stupid, question: when you have several web nodes like I
> do, with several physical databases (I'm not talking about replication;
> it's just that a web node can contact 3 or 4 different databases for
> different applications), what is the best way to proceed with a pooler:
> install one on each node, or one on each database?

I don't really know the answer to this, but if you have one per database server, then all the web nodes are going to share the connections to that database server.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
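The trade-off between the two placements can at least be counted. The sketch below assumes each pooler keeps at most `pool_size` server connections per database server; that parameter, the value 20, and the function name are illustrative only, and the thread itself does not settle which layout is better.

```python
def backend_connections(web_nodes, db_servers, pool_size, placement):
    """Return (per_db_server, total) upper bounds on pooled server connections.

    placement "per_node": every web node runs its own pooler.
    placement "per_db":   one pooler sits in front of each database server.
    """
    if placement == "per_node":
        per_server = web_nodes * pool_size   # each node's pooler connects separately
    elif placement == "per_db":
        per_server = pool_size               # all web nodes share one pool
    else:
        raise ValueError("placement must be 'per_node' or 'per_db'")
    return per_server, per_server * db_servers


# 6 web nodes and 4 database servers as described in the thread;
# pool_size=20 is an arbitrary illustration.
for placement in ("per_node", "per_db"):
    per_server, total = backend_connections(6, 4, 20, placement)
    print("%-8s -> %3d connections per database server, %3d in total"
          % (placement, per_server, total))
```

For comparison, without any pooler the persistent-connection model allows up to 6 x 110 = 660 connections per database server.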