Thread: PANIC: could not flush dirty data: Cannot allocate memory

PANIC: could not flush dirty data: Cannot allocate memory

From: klaus.mailinglists@pernau.at

Hi all!

We have a setup with a master and plenty of logical replication slaves. 
Master and slaves are 12.12-1.pgdg22.04+1 running on Ubuntu 22.04.
The database size (SELECT pg_size_pretty( pg_database_size('regdns') );) 
ranges from 25GB (freshly installed slave) to 42GB (probably bloat).

Replication slave VMs have between 22G and 48G RAM; most have 48G RAM.

We are using:
maintenance_work_mem = 128MB
work_mem = 64MB

and VMs with 48G RAM:
effective_cache_size = 8192MB
shared_buffers = 6144MB

and VMs with 22G RAM:
effective_cache_size = 4096MB
shared_buffers = 2048MB

On several servers we see the error message: PANIC:  could not flush 
dirty data: Cannot allocate memory

Unfortunately I cannot find any reference to this kind of error. Can you 
please describe what happens here in detail? Is it related to server 
memory? Or to our memory settings? I am not so surprised that it happens 
on the 22G RAM VMs. It is not happening on our 32G RAM VMs. But it also 
happens on some of the 48G RAM VMs, which should have plenty of RAM 
available:
# free -h
               total        used        free      shared  buff/cache   available
Mem:            47Gi         9Gi       1.2Gi       6.1Gi        35Gi        30Gi
Swap:          7.8Gi       3.0Gi       4.9Gi

Of course I could upgrade all our VMs and then wait and see if that solves 
the problem. But I would like to understand what is happening here 
before spending $$$.

Thanks
Klaus




Re: PANIC: could not flush dirty data: Cannot allocate memory

From: Tom Lane

klaus.mailinglists@pernau.at writes:
> On several servers we see the error message: PANIC:  could not flush 
> dirty data: Cannot allocate memory

What that's telling you is that fsync (or some equivalent OS call)
returned ENOMEM, which would seem to be a kernel-level deficiency.
Perhaps you could dodge it by using a different wal_sync_method setting,
but complaining to your kernel vendor seems like the main thing
to be doing.

The reason we treat it as a PANIC condition is

 * Failure to fsync any data file is cause for immediate panic, unless
 * data_sync_retry is enabled.  Data may have been written to the operating
 * system and removed from our buffer pool already, and if we are running on
 * an operating system that forgets dirty data on write-back failure, there
 * may be only one copy of the data remaining: in the WAL.  A later attempt to
 * fsync again might falsely report success.  Therefore we must not allow any
 * further checkpoints to be attempted.  data_sync_retry can in theory be
 * enabled on systems known not to drop dirty buffered data on write-back
 * failure (with the likely outcome that checkpoints will continue to fail
 * until the underlying problem is fixed).

As noted here, turning on data_sync_retry would reduce the PANIC to
a WARNING.  But I wouldn't recommend that without some assurances
from your kernel vendor about what happens in the kernel after such a
failure.  The panic restart should (in theory) ensure data consistency
is preserved; without it we can't offer any guarantees.
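
For illustration, a minimal standalone sketch of that promotion logic, 
along the lines of data_sync_elevel() in src/backend/storage/file/fd.c 
(the enum and the flag below are simplified stand-ins, not the server's 
real definitions):

/* Stand-ins for PostgreSQL's error levels and the data_sync_retry GUC. */
typedef enum { ELEVEL_WARNING, ELEVEL_ERROR, ELEVEL_PANIC } elevel_t;

static int data_sync_retry = 0;     /* off by default, as in the server */

/*
 * Promote the caller's requested error level to PANIC unless the
 * administrator has asserted (via data_sync_retry) that the operating
 * system keeps dirty buffers around after a failed write-back, so that
 * retrying the flush later is safe.
 */
static elevel_t
data_sync_elevel(elevel_t elevel)
{
    return data_sync_retry ? elevel : ELEVEL_PANIC;
}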

            regards, tom lane



Re: PANIC: could not flush dirty data: Cannot allocate memory

From: Christoph Moench-Tegeder

## klaus.mailinglists@pernau.at (klaus.mailinglists@pernau.at):

> On several servers we see the error message: PANIC:  could not flush 
> dirty data: Cannot allocate memory

As far as I can see, that "could not flush dirty data" appears a total of 
three times in the code - there are other places where PostgreSQL could 
PANIC on fsync()-and-stuff-related issues, but they have different 
messages.
Of these three places, there's a sync_file_range(), a posix_fadvise() 
and an msync(), all in src/backend/storage/file/fd.c. "Cannot allocate 
memory" would be ENOMEM, which posix_fadvise() does not return (as per 
its docs). So this would be sync_file_range(), which could run out 
of memory (as per the manual), or msync(), where ENOMEM actually means 
"The indicated memory (or part of it) was not mapped". Both cases are 
somewhat WTF for this setup.
What filesystem are you running?
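
For illustration, a small standalone C test (not PostgreSQL code; the 
scratch file name is made up) that exercises those three calls on a 
freshly written file and prints any errno, to show where "Cannot 
allocate memory" could surface:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int
main(void)
{
    char  buf[8192] = {0};
    int   fd = open("flush_demo.tmp", O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd < 0 || write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf))
    {
        perror("setup");
        return 1;
    }

    /* 1. Start writeback of the range; ENOMEM is documented for this call. */
    if (sync_file_range(fd, 0, sizeof(buf), SYNC_FILE_RANGE_WRITE) != 0)
        fprintf(stderr, "sync_file_range: %s\n", strerror(errno));

    /* 2. posix_fadvise() returns the error code directly, not via errno. */
    int rc = posix_fadvise(fd, 0, sizeof(buf), POSIX_FADV_DONTNEED);
    if (rc != 0)
        fprintf(stderr, "posix_fadvise: %s\n", strerror(rc));

    /* 3. msync(): ENOMEM here means "the range was not mapped". */
    void *p = mmap(NULL, sizeof(buf), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED || msync(p, sizeof(buf), MS_ASYNC) != 0)
        fprintf(stderr, "mmap/msync: %s\n", strerror(errno));

    unlink("flush_demo.tmp");
    return 0;
}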

Regards,
Christoph

-- 
Spare Space



Re: PANIC: could not flush dirty data: Cannot allocate memory

From: Thomas Munro

On Tue, Nov 15, 2022 at 10:54 AM Christoph Moench-Tegeder
<cmt@burggraben.net> wrote:
> ## klaus.mailinglists@pernau.at (klaus.mailinglists@pernau.at):
> > On several servers we see the error message: PANIC:  could not flush
> > dirty data: Cannot allocate memory

> Of these three places, there's a sync_file_range(), a posix_fadvise()
> and an msync(), all in src/backend/storage/file/fd.c. "Cannot allocate
> memory" would be ENOMEM, which posix_fadvise() does not return (as per
> its docs). So this would be sync_file_range(), which could run out
> of memory (as per the manual), or msync(), where ENOMEM actually means
> "The indicated memory (or part of it) was not mapped". Both cases are
> somewhat WTF for this setup.

It must be sync_file_range().  The others are fallbacks that wouldn't
apply on a modern Linux.

It has been argued before that we might have been over-zealous
applying the PANIC promotion logic to sync_file_range().  It's used to
start asynchronous writeback to make the later fsync() call fast, so
it's "only a hint", but I have no idea if it could report a writeback
error from the kernel that would then be consumed and not reported to
the later fsync(), so I defaulted to assuming that it could.
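
Roughly, the pattern in question (a hedged sketch in plain POSIX/Linux 
terms, not the actual fd.c code; flush_files() and its arguments are 
invented for the example):

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Two-phase flush: first hint the kernel to start writeback on every
 * file, then fsync() each one.  Only the fsync() provides the durability
 * guarantee; the hint just makes it cheaper.
 */
static void
flush_files(const int *fds, int nfds)
{
    /* Phase 1: kick off asynchronous writeback ("only a hint"). */
    for (int i = 0; i < nfds; i++)
    {
        /* offset 0, nbytes 0 means "to the end of the file" */
        if (sync_file_range(fds[i], 0, 0, SYNC_FILE_RANGE_WRITE) != 0)
            fprintf(stderr, "writeback hint: %s\n", strerror(errno));
    }

    /* Phase 2: by now most dirty pages should already be on their way. */
    for (int i = 0; i < nfds; i++)
    {
        if (fsync(fds[i]) != 0)
            fprintf(stderr, "fsync: %s\n", strerror(errno));
    }
}

The open question is whether an error reported (and thus consumed) in 
phase 1 is guaranteed to show up again in phase 2.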



Re: PANIC: could not flush dirty data: Cannot allocate memory

From: Tom Lane

Thomas Munro <thomas.munro@gmail.com> writes:
> It has been argued before that we might have been over-zealous
> applying the PANIC promotion logic to sync_file_range().  It's used to
> start asynchronous writeback to make the later fsync() call fast, so
> it's "only a hint", but I have no idea if it could report a writeback
> error from the kernel that would then be consumed and not reported to
> the later fsync(), so I defaulted to assuming that it could.

Certainly, if it reports EIO, we should panic.  But maybe not for
ENOMEM?  One would assume that that means that the request didn't
get queued for lack of in-kernel memory space ... in which case
"nothing happened".

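For what it's worth, that distinction could look something like this 
(purely hypothetical sketch, not an actual patch; the names are invented):

#include <errno.h>

typedef enum { ELEVEL_WARNING, ELEVEL_PANIC } elevel_t;

/*
 * Hypothetical severity choice after a failed sync_file_range() hint:
 * EIO may mean the kernel has already dropped dirty data, so keep the
 * PANIC; ENOMEM suggests the request was never queued, i.e. "nothing
 * happened", so a WARNING and a later retry could suffice.
 */
static elevel_t
writeback_hint_elevel(int saved_errno)
{
    return (saved_errno == ENOMEM) ? ELEVEL_WARNING : ELEVEL_PANIC;
}
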
            regards, tom lane



Re: PANIC: could not flush dirty data: Cannot allocate memory

From: klaus.mailinglists@pernau.at

Thanks all for digging into this problem.

AFAIU the problem is not related to the memory settings in 
postgresql.conf. It is the kernel that for whatever reason reports 
ENOMEM. Correct?

On 2022-11-14 22:54, Christoph Moench-Tegeder wrote:
> ## klaus.mailinglists@pernau.at (klaus.mailinglists@pernau.at):
> 
>> On several servers we see the error message: PANIC:  could not flush
>> dirty data: Cannot allocate memory
> 
> As far as I can see, that "could not flush dirty data" appears a total of
> three times in the code - there are other places where PostgreSQL could
> PANIC on fsync()-and-stuff-related issues, but they have different
> messages.
> Of these three places, there's a sync_file_range(), a posix_fadvise()
> and an msync(), all in src/backend/storage/file/fd.c. "Cannot allocate
> memory" would be ENOMEM, which posix_fadvise() does not return (as per
> its docs). So this would be sync_file_range(), which could run out
> of memory (as per the manual), or msync(), where ENOMEM actually means
> "The indicated memory (or part of it) was not mapped". Both cases are
> somewhat WTF for this setup.
> What filesystem are you running?

Filesystem is ext4. VM technology is mixed: VMware, KVM and XEN PV. 
Kernel is 5.15.0-52-generic.

We have not seen this with Ubuntu 18.04 and 20.04 (although we might not 
have noticed it).

I guess upgrading to postgresql 13/14/15 does not help as the problem 
happens in the kernel.

Do you have any advice on how to go further? Shall I look out for certain 
kernel changes, in the kernel itself or in the ext4 changelog?

Thanks
Klaus





Re: PANIC: could not flush dirty data: Cannot allocate memory

From: Thomas Munro

On Wed, Nov 16, 2022 at 1:24 AM <klaus.mailinglists@pernau.at> wrote:
> Filesystem is ext4. VM technology is mixed: VMware, KVM and XEN PV.
> Kernel is 5.15.0-52-generic.
>
> We have not seen this with Ubuntu 18.04 and 20.04 (although we might not
> have noticed it).
>
> I guess upgrading to postgresql 13/14/15 does not help as the problem
> happens in the kernel.
>
> Do you have any advice on how to go further? Shall I look out for certain
> kernel changes, in the kernel itself or in the ext4 changelog?

It'd be good to figure out what is up with Linux or tuning.  I'll go
write a patch to reduce that error level for non-EIO errors, to
discuss for the next point release.  In the meantime, you could
experiment with setting checkpoint_flush_after to 0, so the
checkpointer/bgwriter/other backends don't call sync_file_range() all
day long.  That would have performance consequences for checkpoints
which might be unacceptable though.  The checkpointer will fsync
relations one after another, with less I/O concurrency.   Linux is
generally quite lazy at writing back dirty data, and doesn't know
about our checkpointer's plans to fsync files on a certain schedule,
which is why we ask it to get started on multiple files concurrently
using sync_file_range().

https://www.postgresql.org/docs/15/runtime-config-wal.html#RUNTIME-CONFIG-WAL-CHECKPOINTS
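
Very roughly, the mechanism behind that setting (a simplification: the 
server actually tracks page ranges per file, and the names below are 
invented) is to count pages written during a checkpoint and issue a 
write-back hint every checkpoint_flush_after pages, so a value of 0 
disables the hints entirely:

#define _GNU_SOURCE
#include <fcntl.h>

static int flush_after = 32;    /* stand-in for checkpoint_flush_after, in pages */
static int pending_pages = 0;

/* Called after each buffer written out during a checkpoint. */
static void
maybe_issue_writeback_hint(int fd)
{
    if (flush_after == 0)
        return;                 /* hints disabled: rely on fsync() alone */

    if (++pending_pages >= flush_after)
    {
        /* Ask the kernel to start writing this file's dirty pages now. */
        (void) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
        pending_pages = 0;
    }
}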



Re: PANIC: could not flush dirty data: Cannot allocate memory

From: Christoph Moench-Tegeder

## klaus.mailinglists@pernau.at (klaus.mailinglists@pernau.at):

> AFAIU the problem is not related to the memory settings in
> postgresql.conf. It is the kernel that for whatever reason reports
> ENOMEM. Correct?

Correct, there's an ENOMEM from the kernel when writing out data.

> Filesystem is ext4. VM technology is mixed: VMware, KVM and XEN PV. 
> Kernel is 5.15.0-52-generic.

I do not suspect the filesystem per se; ext4 is quite common and we
would have heard something about that (but then, someone has to be
the first reporter?). I would believe that the kernel would raise
a bunch of printks if it hit ENOMEM in the commonly used paths, so
you would see something in dmesg or wherever you collect your kernel
log if it happened where it was expected.
And coming from the other side: does this happen on all the hosts,
or is it limited to one host or one technology? Any uncommon options
on the filesystem or the mount point? Anything which could mess
with your block devices? (I'm especially thinking "antivirus", because
it's always "0 days since the AV ate a database" and they tend to
raise errors in the weirdest places, which would fit the bill here;
but anything which is not "commonly in use everywhere" could be a
candidate.)

Regards,
Christoph

-- 
Spare Space



Re: PANIC: could not flush dirty data: Cannot allocate memory

From: Andres Freund

Hi,

On 2022-11-15 13:23:56 +0100, klaus.mailinglists@pernau.at wrote:
> Filesystem is ext4. VM technology is mixed: VMware, KVM and XEN PV. Kernel
> is 5.15.0-52-generic.
> 
> We have not seen this with Ubuntu 18.04 and 20.04 (although we might not
> have noticed it).

Did this start after upgrading to 22.04? Or after a certain kernel upgrade?

Do you use cgroups or such to limit memory usage of postgres?

It'd be helpful to see /proc/meminfo from one of the affected instances.

Greetings,

Andres Freund



Re: PANIC: could not flush dirty data: Cannot allocate memory

From: Andres Freund

Hi,

On 2022-11-16 09:16:56 -0800, Andres Freund wrote:
> On 2022-11-15 13:23:56 +0100, klaus.mailinglists@pernau.at wrote:
> > Filesystem is ext4. VM technology is mixed: VMware, KVM and XEN PV. Kernel
> > is 5.15.0-52-generic.
> > 
> > We have not seen this with Ubuntu 18.04 and 20.04 (although we might not
> > have noticed it).
> 
> Did this start after upgrading to 22.04? Or after a certain kernel upgrade?
> 
> Do you use cgroups or such to limit memory usage of postgres?
> 
> It'd be helpful to see /proc/meminfo from one of the affected instances.

Another interesting thing would be to know the mount and file system options
for the FS that triggers the failures. E.g.
  tune2fs -l path/to/blockdev
and
  grep path/to/blockdev /proc/mounts

Greetings,

Andres Freund



Re: PANIC: could not flush dirty data: Cannot allocate memory

From: klaus.mailinglists@pernau.at

Hello all!


Thanks for the many hints on what to look for. We did some tuning and 
further debugging; here are the outcomes, answering all questions in a 
single email.


> In the meantime, you could experiment with setting 
> checkpoint_flush_after to 0
We did this:
# SHOW checkpoint_flush_after;
  checkpoint_flush_after
------------------------
  0
(1 row)

But we STILL have PANICs. I tried to understand the code but failed. I 
guess that there are some code paths which call pg_flush_data() without 
checking this setting, or the check does not work.



> Did this start after upgrading to 22.04? Or after a certain kernel 
> upgrade?

For sure it only started with Ubuntu 22.04. We did not have, and still do 
not have, any issues on servers with Ubuntu 20.04 and 18.04.


> I would believe that the kernel would raise
> a bunch of printks if it hit ENOMEM in the commonly used paths, so
> you would see something in dmesg or wherever you collect your kernel
> log if it happened where it was expected.

There is nothing in the kernel logs (dmesg).


> Do you use cgroups or such to limit memory usage of postgres?

No


> Any uncommon options on the filesystem or the mount point?
No. Also no Antivirus:
/dev/xvda2 / ext4 noatime,nodiratime,errors=remount-ro 0 1
or
LABEL=cloudimg-rootfs   /        ext4   discard,errors=remount-ro       0 1


> does this happen on all the hosts, or is it limited to one host or one 
> technology?

It happens on XEN VMs, KVM VMs and VMware VMs, on both Intel and AMD 
platforms.


> Another interesting thing would be to know the mount and file system 
> options
> for the FS that triggers the failures. E.g.

# tune2fs -l /dev/sda1
tune2fs 1.46.5 (30-Dec-2021)
Filesystem volume name:   cloudimg-rootfs
Last mounted on:          /
Filesystem UUID:          0522e6b3-8d40-4754-a87e-5678a6921e37
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index 
filetype needs_recovery extent 64bit flex_bg encrypt sparse_super 
large_file huge_file dir_nlink extra_isize metadata_csum
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              12902400
Block count:              26185979
Reserved block count:     0
Overhead clusters:        35096
Free blocks:              18451033
Free inodes:              12789946
First block:              0
Block size:               4096
Fragment size:            4096
Group descriptor size:    64
Reserved GDT blocks:      243
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         16128
Inode blocks per group:   1008
Flex block group size:    16
Filesystem created:       Wed Apr 20 18:31:24 2022
Last mount time:          Thu Nov 10 09:49:34 2022
Last write time:          Thu Nov 10 09:49:34 2022
Mount count:              7
Maximum mount count:      -1
Last checked:             Wed Apr 20 18:31:24 2022
Check interval:           0 (<none>)
Lifetime writes:          252 GB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     32
Desired extra isize:      32
Journal inode:            8
First orphan inode:       42571
Default directory hash:   half_md4
Directory Hash Seed:      c5ef129b-fbee-4f35-8f28-ad7cc93c1c43
Journal backup:           inode blocks
Checksum type:            crc32c
Checksum:                 0xb74ebbc3


Thanks
Klaus




Re: PANIC: could not flush dirty data: Cannot allocate memory

From: klaus.mailinglists@pernau.at

Some more updates ....

>> Did this start after upgrading to 22.04? Or after a certain kernel 
>> upgrade?
> 
> For sure it only started with Ubuntu 22.04. We did not have, and still do
> not have, any issues on servers with Ubuntu 20.04 and 18.04.

It also happens with Ubuntu 22.10 (kernel 5.19.0-23-generic). We are now 
trying the 6.0 mainline and 5.15 mainline kernels on some servers.

I also forgot to mention that the /var/lib/postgresql/12 directory is 
encrypted with fscrypt (ext4 encryption). So we have also deactivated the 
directory encryption on one server to see whether the problem is related 
to encryption.

thanks
Klaus