Re: pg13.2: invalid memory alloc request size NNNN - Mailing list pgsql-hackers

From Justin Pryzby
Subject Re: pg13.2: invalid memory alloc request size NNNN
Msg-id 20210212181052.GH1793@telsasoft.com
In response to Re: pg13.2: invalid memory alloc request size NNNN  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List pgsql-hackers
On Fri, Feb 12, 2021 at 06:44:54PM +0100, Tomas Vondra wrote:
> > (gdb) p len
> > $1 = -4
> > 
> > This VM had some issues earlier today and I killed the VM, causing PG to execute
> > recovery.  I'm tentatively blaming that on zfs, so this could conceivably be a
> > data error (although recovery supposedly would have resolved it).  I just
> > checked and data_checksums=off.
> 
> This seems very much like a corrupted varlena header - length (-4) is
> clearly bogus, and it's what triggers the problem, because that's what wraps
> around to 18446744073709551613 (which is 0xFFFFFFFFFFFFFFFD).
> 
> This has to be a value stored in a table, not some intermediate value
> created during execution. So I don't think the exact query matters. Can you
> try doing something like pg_dump, which has to detoast everything?

Right, COPY fails and VACUUM FULL crashes.

message                | invalid memory alloc request size 18446744073709551613
query                  | COPY child.tt TO '/dev/null';
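
For the record, that bogus size is just the negative length wrapping through the
unsigned allocation request. A minimal C sketch of the arithmetic (assuming the
common palloc(len + 1) pattern used when copying out a detoasted value, as in
text_to_cstring()):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* bogus varlena length from the corrupted header, per gdb */
        int64_t len = -4;

        /* what a palloc(len + 1) style request becomes once the signed
         * value is converted to the unsigned size that palloc() takes */
        uint64_t request = (uint64_t) (len + 1);

        printf("%llu (0x%llX)\n",
               (unsigned long long) request,
               (unsigned long long) request);
        /* prints: 18446744073709551613 (0xFFFFFFFFFFFFFFFD) */
        return 0;
    }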

> The question is whether this is due to the VM getting killed in some strange
> way (what VM system is this, how is the storage mounted?) or whether the
> recovery is borked and failed to do the right thing.

This is qemu/kvm, with block storage:
      <driver name='qemu' type='raw' cache='none' io='native'/>
      <source dev='/dev/data/postgres'/>

And then more block devices for ZFS vdevs:
      <driver name='qemu' type='raw' cache='none' io='native'/>
      <source dev='/dev/data/zfs2'/>
      ...

Those are LVM volumes (I know that ZFS on LVM is discouraged).

$ zpool list -v
NAME         SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
zfs          762G   577G   185G        -         -    71%    75%  1.00x    ONLINE  -
  vdj        127G  92.7G  34.3G        -         -    64%  73.0%      -  ONLINE  
  vdd        127G  95.6G  31.4G        -         -    74%  75.2%      -  ONLINE  
  vdf        127G  96.0G  31.0G        -         -    75%  75.6%      -  ONLINE  
  vdg        127G  95.8G  31.2G        -         -    74%  75.5%      -  ONLINE  
  vdh        127G  95.5G  31.5G        -         -    74%  75.2%      -  ONLINE  
  vdi        128G   102G  25.7G        -         -    71%  79.9%      -  ONLINE  

This was recently upgraded to ZFS 2.0.0, and then to 2.0.1:

Jan 21 09:33:26 Installed: zfs-dkms-2.0.1-1.el7.noarch
Dec 23 08:41:21 Installed: zfs-dkms-2.0.0-1.el7.noarch

The VM has gotten "wedged" and I've had to kill it a few times in the last 24h
(needless to say, this is not normal).  That part seems like a kernel issue and
not a postgres problem.  It's unclear whether that's due to me trying to tickle
the postgres ERROR.  It's the latest centos7 kernel: 3.10.0-1160.15.2.el7.x86_64

-- 
Justin


