Thread: Error: "could not fork new process for connection: Cannot allocate memory"

Error: "could not fork new process for connection: Cannot allocate memory"

From
frank picabia
Date:

This error occurred a couple of days ago and it is unlikely to repeat, as
it happened during a series of exams being done in Moodle by over 500 students.
However, I'd like to tune Postgres to see if it can be prevented.

First of all, we did not run out of memory within the OS.  This system is monitored by Cacti, and
Real Used memory never exceeded 30 GB on a server with 64 GB of RAM plus some swap.

I can also see in the messages log that the Linux kernel OOM killer was never triggered.

The postgres user running the DB has no ulimits on memory which might apply.

Here is the entry in the Postgres log:

<2020-12-19 13:19:29 GMT>LOG: could not fork new process for connection: Cannot allocate memory
TopMemoryContext: 136704 total in 13 blocks; 6688 free (4 chunks); 130016 used
smgr relation table: 24576 total in 2 blocks; 13904 free (4 chunks); 10672 used
TopTransactionContext: 8192 total in 1 blocks; 6152 free (5 chunks); 2040 used
<2020-12-19 13:19:29 GMT>LOG: could not fork new process for connection: Cannot allocate memory
TransactionAbortContext: 32768 total in 1 blocks; 32728 free (0 chunks); 40 used
Portal hash: 8192 total in 1 blocks; 1672 free (0 chunks); 6520 used
PortalMemory: 0 total in 0 blocks; 0 free (0 chunks); 0 used


My conclusion is that this memory error must be related to limits within the Postgres tunables, or perhaps something like SHM memory, which we have not set up in sysctl.conf, so it is just the default amount.

The values we have in postgresql.conf are mostly those suggested by the pgtune script.  The one exception is max_connections=500, which was increased gradually over time to avoid bottlenecks with the number of Apache clients, as we saw we still had lots of memory available with a higher connection limit.

Which tunables do you need to see?

Re: Error: "could not fork new process for connection: Cannot allocate memory"

From
frank picabia
Date:

Here are the settings, using the query suggested in the wiki to
display those changed from the defaults:
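
For reference, the wiki's query is roughly the following (a sketch run through psql; the exact wording on the wiki page may differ slightly):

    psql -c "SELECT name, current_setting(name), source
               FROM pg_settings
              WHERE source NOT IN ('default', 'override');"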

             name             |  current_setting  |        source        
------------------------------+-------------------+----------------------
 application_name             | psql              | client
 autovacuum                   | on                | configuration file
 autovacuum_analyze_threshold | 500               | configuration file
 autovacuum_naptime           | 1d                | configuration file
 autovacuum_vacuum_threshold  | 1000              | configuration file
 checkpoint_segments          | 12                | configuration file
 client_encoding              | UTF8              | client
 client_min_messages          | notice            | configuration file
 effective_cache_size         | 4TB               | configuration file
 escape_string_warning        | off               | configuration file
 lc_messages                  | en_US.UTF-8       | configuration file
 lc_monetary                  | en_US.UTF-8       | configuration file
 lc_numeric                   | en_US.UTF-8       | configuration file
 lc_time                      | en_US.UTF-8       | configuration file
 listen_addresses             | *                 | configuration file
 log_destination              | stderr            | configuration file
 log_directory                | pg_log            | configuration file
 log_duration                 | on                | configuration file
 log_filename                 | postgresql-%a.log | configuration file
 log_line_prefix              | <%t>              | configuration file
 log_min_error_statement      | debug1            | configuration file
 log_min_messages             | info              | configuration file
 log_rotation_age             | 1d                | configuration file
 log_rotation_size            | 0                 | configuration file
 log_truncate_on_rotation     | on                | configuration file
 logging_collector            | on                | configuration file
 maintenance_work_mem         | 160MB             | configuration file
 max_connections              | 500               | configuration file
 max_stack_depth              | 2MB               | environment variable
 shared_buffers               | 1GB               | configuration file
 standard_conforming_strings  | off               | configuration file
 statement_timeout            | 1h                | configuration file
 work_mem                     | 10MB              | configuration file




Re: Error: "could not fork new process for connection: Cannot allocate memory"

From
Laurenz Albe
Date:
On Mon, 2020-12-21 at 09:17 -0400, frank picabia wrote:
> This error occurred a couple of days ago and it is unlikely to repeat as
> it was a series of exams being done in moodle with over 500 students.
> However I'd like to tune Postgres to see if it can be prevented.
> 
> First of all, we did not run out of memory within the OS.  This system is monitored by cacti and
> Real Used memory never exceeded 30 GB on a server having 64 GB ram plus some swap.
> 
> I can also see in the messages log, the Linux kernel OOM killer was never triggered.
> 
> The postgres user running the DB has no ulimits on memory which might apply.
> 
> Here is the entry in the Postgres log:
> 
> <2020-12-19 13:19:29 GMT>LOG: could not fork new process for connection: Cannot allocate memory
> TopMemoryContext: 136704 total in 13 blocks; 6688 free (4 chunks); 130016 used
> smgr relation table: 24576 total in 2 blocks; 13904 free (4 chunks); 10672 used
> TopTransactionContext: 8192 total in 1 blocks; 6152 free (5 chunks); 2040 used
> <2020-12-19 13:19:29 GMT>LOG: could not fork new process for connection: Cannot allocate memory
> TransactionAbortContext: 32768 total in 1 blocks; 32728 free (0 chunks); 40 used
> Portal hash: 8192 total in 1 blocks; 1672 free (0 chunks); 6520 used
> PortalMemory: 0 total in 0 blocks; 0 free (0 chunks); 0 used
> 
> My conclusion is this memory error must be related to limits within Postgres tunables, or perhaps
>  something like SHM memory, which we have not set up in the sysctl.conf , so this is just a default amount.
> 
> The values we have in postgres.conf are mostly suggested by pgtune script.  The one exception being
>  max_connections=500, which was increased to try to avoid bottlenecks being reached with Apache client
>  numbers, and gradually over time as we saw we still had lots of memory available with a higher connection limit.
> 
> Which tunables do you need to see?

You probably have to increase the operating system limit for open files
for the "postgres" user (ulimit -n).
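
One way to verify which limits the running postmaster actually inherited is to read them out of /proc; the data directory path below is an assumption and needs adjusting for the actual install:

    # PID of the postmaster, from the first line of postmaster.pid
    PGPID=$(head -1 /var/lib/pgsql/data/postmaster.pid)
    # Effective per-process limits of the running server
    grep -E 'open files|processes|address space' /proc/$PGPID/limits
    # Limits a fresh shell for the postgres user would get
    su - postgres -c 'ulimit -n -u -v'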

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com




Re: Error: "could not fork new process for connection: Cannot allocate memory"

From
frank picabia
Date:
Is that a guess?

Isn't there a different error for number of files open?  I think maybe you answered the question
for the error "out of file descriptors: Too many open files in system".


On Mon, Dec 21, 2020 at 9:35 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:

You probably have to increase the operating system limit for open files
for the "postgres" user (ulimit -n).

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com

Re: Error: "could not fork new process for connection: Cannot allocate memory"

From
Tom Lane
Date:

Laurenz Albe <laurenz.albe@cybertec.at> writes:
> On Mon, 2020-12-21 at 09:17 -0400, frank picabia wrote:
>> First of all, we did not run out of memory within the OS.  This system is monitored by cacti and
>> Real Used memory never exceeded 30 GB on a server having 64 GB ram plus some swap.

> You probably have to increase the operating system limit for open files
> for the "postgres" user (ulimit -n).

Isn't ulimit -n a per-process, not per-user, limit?  Perhaps ulimit -u
(max user processes) would be worth checking, though.

However, I don't think I believe the assertion that the system wasn't
under overall memory pressure.  What we see in the quoted log fragment
is two separate postmaster fork-failure reports interspersed with a
memory context map, which has to have been coming out of some other
process because postmaster.c does not dump its contexts when reporting
a fork failure.  But there's no reason for a PG process to dump a
context map unless it suffered an ENOMEM allocation failure.
(It'd be interesting to look for the "out of memory" error that presumably
follows the context map, to see if it offers any more info.)
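
A sketch of how to pull that out of the server log, given the pg_log directory and postgresql-%a.log naming shown in the settings above (the data directory path is an assumption):

    # 2020-12-19 was a Saturday, so with log_filename = postgresql-%a.log:
    LOG=/var/lib/pgsql/data/pg_log/postgresql-Sat.log
    grep -n -A20 'could not fork new process' "$LOG"
    grep -n 'out of memory' "$LOG"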

So what we have is fork() being unhappy concurrently with ENOMEM
problems in at least one other process.  That smells like overall
memory pressure to me, cacti or no cacti.

If you've got cgroups enabled, or if the whole thing is running
inside a VM, there might be kernel-enforced memory limits somewhere.
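
A couple of quick checks along those lines (paths are assumptions; the cgroup mount point in particular varies between distributions and versions):

    # Which cgroups the postmaster belongs to
    cat /proc/$(head -1 /var/lib/pgsql/data/postmaster.pid)/cgroup
    # If a memory cgroup applies, its hard limit in bytes
    cat /sys/fs/cgroup/memory/memory.limit_in_bytes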

            regards, tom lane



Re: Error: "could not fork new process for connection: Cannot allocate memory"

From
Fernando Hevia
Date:


El lun, 21 de dic. de 2020 a la(s) 10:25, frank picabia (fpicabia@gmail.com) escribió:

Here are settings as suggested in the wiki to run a query to
display those changed from defaults:

             name             |  current_setting  |        source        
------------------------------+-------------------+----------------------
 effective_cache_size         | 4TB               | configuration file


Probably unrelated to your error per se, but this setting doesn't seem sane on a 64 GB box... unless you have lots of swap on some remarkably fast storage.
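
If that 4TB was meant to be a RAM-scale figure, something in the range of the sketch below would be more typical for a 64 GB machine; the number is illustrative only, and effective_cache_size is a planner hint rather than an allocation:

    # Confirm what the server is actually using
    psql -c "SHOW effective_cache_size;"
    # postgresql.conf (illustrative):  effective_cache_size = 48GB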

Regards,
Fernando.


Re: Error: "could not fork new process for connection: Cannot allocate memory"

From
frank picabia
Date:


On Mon, Dec 21, 2020 at 11:27 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

However, I don't think I believe the assertion that the system wasn't
under overall memory pressure.  What we see in the quoted log fragment
is two separate postmaster fork-failure reports interspersed with a
memory context map, which has to have been coming out of some other
process because postmaster.c does not dump its contexts when reporting
a fork failure.  But there's no reason for a PG process to dump a
context map unless it suffered an ENOMEM allocation failure.
(It'd be interesting to look for the "out of memory" error that presumably
follows the context map, to see if it offers any more info.)

I have included the only detail available in the logs.  Everything else is
query timings and out-of-memory log entries with no further detail.
If you think it could be useful, I can grab all of the log
lines between 9:15 and 9:20 AM.
 

So what we have is fork() being unhappy concurrently with ENOMEM
problems in at least one other process.  That smells like overall
memory pressure to me, cacti or no cacti.

If this did pop the memory limit of the system, then it somehow more than doubled its memory
usage from less than 30 GB to greater than 64 GB (80 GB including another 16 GB of swap),
spat out the error, then resumed running at exactly the memory consumption it had
previously (there is no perceivable bump in the graph), all within Cacti's 5-minute sampling interval.

[attachment: cacti-memory.png]

The Cacti/SNMP numbers are accurate when I compare the running system with results
from the free command in a shell.

Cacti can produce compressed, averaged graphs when viewed much later, but I took
this snapshot within an hour of the event, as we were called in to handle the outage.

There are errors from Postgres at 09:16 and 09:19 about being unable to fork due to memory.
In fact, the last out-of-memory error has a timestamp of 09:19:54 (converting from GMT).
Cacti samples the system info over SNMP at 09:15 and 09:20.  It would be
a remarkable feat for it to land back at almost exactly the same memory footprint
6 seconds later.  This kind of thing happens at magic shows, not in IT.


If you've got cgroups enabled, or if the whole thing is running
inside a VM, there might be kernel-enforced memory limits somewhere.

We didn't intentionally run cgroups, but out of the box, Red Hat Linux 6 does configure "slabs"
of memory for cgroup support.  If that were disabled, we would save about 1.6 GB of memory, so
I will look into the kernel option to disable it, but it doesn't explain what we saw.

We are running it within VMware.  I will ask its admin whether there could be
anything limiting memory access.
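
One thing worth checking from inside the guest, assuming VMware Tools (or open-vm-tools) is installed and provides the usual stat subcommands, is whether the hypervisor is reclaiming memory behind the guest's back:

    # Memory currently ballooned away from, or swapped out of, the guest by the hypervisor
    vmware-toolbox-cmd stat balloon
    vmware-toolbox-cmd stat swap
    # Any memory limit imposed on the VM
    vmware-toolbox-cmd stat memlimit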
 
Thanks for the analysis thus far.


Re: Error: "could not fork new process for connection: Cannot allocate memory"

From
frank picabia
Date:

My VMware admin has come back with a graph showing memory use over
the period in question.  He has looked over other indicators
and there are no alarms triggered on the system.
It jibes with what Cacti reported.  Memory was never exhausted
and used only 50% of the allocated RAM at the most.

If it's not a configuration issue in Postgres, and both internal and external tools
show memory was not consumed to the point of firing off the "cannot fork"
error, would that mean that there is a bug in either the kernel or Postgres?


Re: Error: "could not fork new process for connection: Cannot allocate memory"

From
Tom Lane
Date:

frank picabia <fpicabia@gmail.com> writes:
> My VMware admin has come back with a graph showing memory use over
> the period in question.  He has looked over other indicators
> and there are no alarms triggered on the system.
> It jibes with what Cacti reported.  Memory was never exhausted
> and used only 50% of allocated RAM at the most.

> If it's not a configuration issue in Postgres, and both internal and
> external tools
> show memory was not consumed to the point of firing off the "cannot fork"
> error, would that mean that there is a bug in either the kernel or Postgres?

[ shrug... ]  Postgres is just reporting to you that the kernel wouldn't
perform a fork().  Since you've gone to great lengths to show that
Postgres isn't consuming excessive resources, either this is a kernel bug
or you're running into some kernel-level (not Postgres) allocation limit.
I continue to suspect the latter.  Desultory googling shows that VMware
can be configured to enforce resource allocation limits, so maybe you
should be taking a hard look at your VMware settings.

            regards, tom lane



Re: Error: "could not fork new process for connection: Cannot allocate memory"

From
frank picabia
Date:

Thanks for the responses. 

We'll take this up with VMware support and then, if it isn't a configuration issue, move it along to Red Hat Linux support.

It was also useful to learn that cgroups default support within the kernel can use up so much memory on a system with larger RAM.  On the next reboot this will free up about 1.6 GB of RAM.  It might help with a little wiggle room until we know more about the other issue, which seems to limit us to half our RAM.
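
For what it's worth, the usual way to switch the memory controller off entirely is a kernel boot parameter; the grub.conf path is an assumption for a Red Hat 6 style setup:

    # Append to the kernel line in /boot/grub/grub.conf, then reboot:
    #   cgroup_disable=memory
    # After the reboot, confirm the parameter took effect:
    grep -o 'cgroup_disable=[^ ]*' /proc/cmdline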



Re: Error: "could not fork new process for connection: Cannot allocate memory"

From
MichaelDBA
Date:
Perhaps kernel parameters under the control of VMware: shmmax/shmall.
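
To inspect those limits and the shared memory segments actually in use (whether they matter depends on the PostgreSQL version, since newer releases use very little SysV shared memory):

    # Current SysV limits: shmmax is in bytes, shmall is in pages
    sysctl kernel.shmmax kernel.shmall
    # Shared memory segments currently allocated, including PostgreSQL's if it uses SysV
    ipcs -m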

Regards,
Michael Vitale


Re: Error: "could not fork new process for connection: Cannot allocate memory"

From
Kasahara Tatsuhito
Date:
Hi,

On Wed, Dec 23, 2020 at 12:59 AM frank picabia <fpicabia@gmail.com> wrote:
>
>
> Thanks for the responses.
>
> We'll take this up with VMware support and then if it isn't a configuration issue, move it along to Red Hat Linux support.
>
> It was also useful to learn cgroups default support within the kernel can use up so much memory on a system with larger RAM.  On the next reboot this will free up about 1.6 GB RAM.  It might help with a little wiggle room until we know more about the other issue which seems to limit us to 1/2 our RAM.

I'm not sure, but the kernel parameter vm.zone_reclaim_mode and/or
NUMA memory interleaving may have an effect.
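
A sketch of how to check both from a shell (numactl is assumed to be available on the box):

    # Non-zero means the kernel prefers reclaiming pages within a NUMA node over using a remote node
    cat /proc/sys/vm/zone_reclaim_mode
    # NUMA node layout and per-node free memory
    numactl --hardware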

regards,




--
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com