Thread: valgrind error

valgrind error

From
Andrew Dunstan
Date:
I was just trying to revive lousyjack, my valgrind buildfarm animal
which has been offline for 12 days, after having upgraded the machine
(fedora 31, gcc 9.3.1, valgrind 3.15) and noticed lots of errors like this:


2020-04-17 19:26:03.483 EDT [63741:3] pg_regress LOG:  statement: CREATE
DATABASE "regression" TEMPLATE=template0
==63717== VALGRINDERROR-BEGIN
==63717== Use of uninitialised value of size 8
==63717==    at 0xAC5BB5: pg_comp_crc32c_sb8 (pg_crc32c_sb8.c:82)
==63717==    by 0x55A98B: XLogRecordAssemble (xloginsert.c:785)
==63717==    by 0x55A268: XLogInsert (xloginsert.c:461)
==63717==    by 0x8BC9E0: LogCurrentRunningXacts (standby.c:1005)
==63717==    by 0x8BC8F9: LogStandbySnapshot (standby.c:961)
==63717==    by 0x550CB3: CreateCheckPoint (xlog.c:8937)
==63717==    by 0x82A3B2: CheckpointerMain (checkpointer.c:441)
==63717==    by 0x56347D: AuxiliaryProcessMain (bootstrap.c:453)
==63717==    by 0x83CA18: StartChildProcess (postmaster.c:5474)
==63717==    by 0x83A120: reaper (postmaster.c:3045)
==63717==    by 0x4874B1F: ??? (in /usr/lib64/libpthread-2.30.so)
==63717==    by 0x5056F29: select (in /usr/lib64/libc-2.30.so)
==63717==    by 0x8380A0: ServerLoop (postmaster.c:1691)
==63717==    by 0x837A1F: PostmasterMain (postmaster.c:1400)
==63717==    by 0x74A71D: main (main.c:210)
==63717==  Uninitialised value was created by a stack allocation
==63717==    at 0x8BC942: LogCurrentRunningXacts (standby.c:984)
==63717==
==63717== VALGRINDERROR-END
{
   <insert_a_suppression_name_here>
   Memcheck:Value8
   fun:pg_comp_crc32c_sb8
   fun:XLogRecordAssemble
   fun:XLogInsert
   fun:LogCurrentRunningXacts
   fun:LogStandbySnapshot
   fun:CreateCheckPoint
   fun:CheckpointerMain
   fun:AuxiliaryProcessMain
   fun:StartChildProcess
   fun:reaper
   obj:/usr/lib64/libpthread-2.30.so
   fun:select
   fun:ServerLoop
   fun:PostmasterMain
   fun:main
}


I can't see what the problem is immediately.


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: valgrind error

From
Andrew Dunstan
Date:
On 4/18/20 9:15 AM, Andrew Dunstan wrote:
> I was just trying to revive lousyjack, my valgrind buildfarm animal
> which has been offline for 12 days, after having upgraded the machine
> (fedora 31, gcc 9.3.1, valgrind 3.15) and noticed lots of errors like this:
>
>
> 2020-04-17 19:26:03.483 EDT [63741:3] pg_regress LOG:  statement: CREATE
> DATABASE "regression" TEMPLATE=template0
> ==63717== VALGRINDERROR-BEGIN
> ==63717== Use of uninitialised value of size 8
> ==63717==    at 0xAC5BB5: pg_comp_crc32c_sb8 (pg_crc32c_sb8.c:82)
> ==63717==    by 0x55A98B: XLogRecordAssemble (xloginsert.c:785)
> ==63717==    by 0x55A268: XLogInsert (xloginsert.c:461)
> ==63717==    by 0x8BC9E0: LogCurrentRunningXacts (standby.c:1005)
> ==63717==    by 0x8BC8F9: LogStandbySnapshot (standby.c:961)
> ==63717==    by 0x550CB3: CreateCheckPoint (xlog.c:8937)
> ==63717==    by 0x82A3B2: CheckpointerMain (checkpointer.c:441)
> ==63717==    by 0x56347D: AuxiliaryProcessMain (bootstrap.c:453)
> ==63717==    by 0x83CA18: StartChildProcess (postmaster.c:5474)
> ==63717==    by 0x83A120: reaper (postmaster.c:3045)
> ==63717==    by 0x4874B1F: ??? (in /usr/lib64/libpthread-2.30.so)
> ==63717==    by 0x5056F29: select (in /usr/lib64/libc-2.30.so)
> ==63717==    by 0x8380A0: ServerLoop (postmaster.c:1691)
> ==63717==    by 0x837A1F: PostmasterMain (postmaster.c:1400)
> ==63717==    by 0x74A71D: main (main.c:210)
> ==63717==  Uninitialised value was created by a stack allocation
> ==63717==    at 0x8BC942: LogCurrentRunningXacts (standby.c:984)
> ==63717==
> ==63717== VALGRINDERROR-END
> {
>    <insert_a_suppression_name_here>
>    Memcheck:Value8
>    fun:pg_comp_crc32c_sb8
>    fun:XLogRecordAssemble
>    fun:XLogInsert
>    fun:LogCurrentRunningXacts
>    fun:LogStandbySnapshot
>    fun:CreateCheckPoint
>    fun:CheckpointerMain
>    fun:AuxiliaryProcessMain
>    fun:StartChildProcess
>    fun:reaper
>    obj:/usr/lib64/libpthread-2.30.so
>    fun:select
>    fun:ServerLoop
>    fun:PostmasterMain
>    fun:main
> }
>
>


After many hours of testing I have a culprit for this. The error appears
with valgrind 3.15.0  with everything else held constant. 3.14.0  does
not produce the problem.  So lousyjack will be back on the air before long.


Here are the build flags it's using:


CFLAGS=-Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Werror=vla -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
-fwrapv -fexcess-precision=standard -Wno-format-truncation
-Wno-stringop-truncatio
n -g -fno-omit-frame-pointer -O0 -fPIC
CPPFLAGS=-DUSE_VALGRIND  -DRELCACHE_FORCE_RELEASE -D_GNU_SOURCE
-I/usr/include/libxml2


and valgrind is invoked like this:


valgrind --quiet --trace-children=yes --track-origins=yes
--read-var-info=yes --num-callers=20 --leak-check=no
--gen-suppressions=all --error-limit=no
--suppressions=../pgsql/src/tools/valgrind.supp
--error-markers=VALGRINDERROR-BEGIN,VALGRINDERROR-END bin/postgres -D data-C


Does anyone see anything here that needs tweaking?


Note that this is quite an old machine:


andrew@freddo:bf (master)*$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              2
On-line CPU(s) list: 0,1
Thread(s) per core:  1
Core(s) per socket:  2
Socket(s):           1
NUMA node(s):        1
Vendor ID:           AuthenticAMD
CPU family:          16
Model:               6
Model name:          AMD Athlon(tm) II X2 215 Processor
Stepping:            2
CPU MHz:             2700.000
CPU max MHz:         2700.0000
CPU min MHz:         800.0000
BogoMIPS:            5425.13
Virtualization:      AMD-V
L1d cache:           64K
L1i cache:           64K
L2 cache:            512K
NUMA node0 CPU(s):   0,1
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext
fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl
nonstop_tsc cpuid extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy
svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs
skinit wdt hw_pstate vmmcall npt lbrv svm_lock nrip_save


I did not manage to reproduce this anywhere else, tried on various
physical, Virtualbox and Docker instances.


cheers


andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: valgrind error

From
Noah Misch
Date:
On Sun, May 10, 2020 at 09:29:05AM -0400, Andrew Dunstan wrote:
> On 4/18/20 9:15 AM, Andrew Dunstan wrote:
> > I was just trying to revive lousyjack, my valgrind buildfarm animal
> > which has been offline for 12 days, after having upgraded the machine
> > (fedora 31, gcc 9.3.1, valgrind 3.15) and noticed lots of errors like this:

> > {
> >    <insert_a_suppression_name_here>
> >    Memcheck:Value8
> >    fun:pg_comp_crc32c_sb8
> >    fun:XLogRecordAssemble
> >    fun:XLogInsert
> >    fun:LogCurrentRunningXacts
> >    fun:LogStandbySnapshot
> >    fun:CreateCheckPoint
> >    fun:CheckpointerMain
> >    fun:AuxiliaryProcessMain
> >    fun:StartChildProcess
> >    fun:reaper
> >    obj:/usr/lib64/libpthread-2.30.so
> >    fun:select
> >    fun:ServerLoop
> >    fun:PostmasterMain
> >    fun:main
> > }

> After many hours of testing I have a culprit for this. The error appears
> with valgrind 3.15.0  with everything else held constant. 3.14.0  does
> not produce the problem.

I suspect 3.15.0 is just better at tracking the uninitialized data.  A
more-remote possibility is valgrind-3.14.0 emulating sse42.  That would make
pg_crc32c_sse42_available() return true, avoiding the pg_comp_crc32c_sb8().

> andrew@freddo:bf (master)*$ lscpu
...
> Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
> pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext
> fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl
> nonstop_tsc cpuid extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy
> svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs
> skinit wdt hw_pstate vmmcall npt lbrv svm_lock nrip_save
> 
> 
> I did not manage to reproduce this anywhere else, tried on various
> physical, Virtualbox and Docker instances.

I can reproduce this on a 2017-vintage CPU with ./configure
... USE_SLICING_BY_8_CRC32C=1 and then running "make installcheck-parallel"
under valgrind-3.15.0 (as packaged by RHEL 7.8).  valgrind.supp has a
suppression for CRC calculations, but it didn't get the memo when commit
4f700bc renamed the function.  The attached patch fixes the suppression.

Attachment

Re: valgrind error

From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> I can reproduce this on a 2017-vintage CPU with ./configure
> ... USE_SLICING_BY_8_CRC32C=1 and then running "make installcheck-parallel"
> under valgrind-3.15.0 (as packaged by RHEL 7.8).  valgrind.supp has a
> suppression for CRC calculations, but it didn't get the memo when commit
> 4f700bc renamed the function.  The attached patch fixes the suppression.

I can also reproduce this, on RHEL 8.2 which likewise has valgrind-3.15.0,
using the same configuration to force use of that CRC function.  I concur
with your diagnosis that this is just a missed update of the pre-existing
suppression rule.  However, rather than

-       fun:pg_comp_crc32c
+       fun:pg_comp_crc32c*

as you have it, I'd prefer to use

-       fun:pg_comp_crc32c
+       fun:pg_comp_crc32c_sb8

which precisely matches what 4f700bc did.  The other way seems like
it's giving a free pass to problems that could lurk in unrelated CRC
implementations.

            regards, tom lane



Re: valgrind error

From
Noah Misch
Date:
On Fri, Jun 05, 2020 at 12:17:54PM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > I can reproduce this on a 2017-vintage CPU with ./configure
> > ... USE_SLICING_BY_8_CRC32C=1 and then running "make installcheck-parallel"
> > under valgrind-3.15.0 (as packaged by RHEL 7.8).  valgrind.supp has a
> > suppression for CRC calculations, but it didn't get the memo when commit
> > 4f700bc renamed the function.  The attached patch fixes the suppression.
> 
> I can also reproduce this, on RHEL 8.2 which likewise has valgrind-3.15.0,
> using the same configuration to force use of that CRC function.  I concur
> with your diagnosis that this is just a missed update of the pre-existing
> suppression rule.  However, rather than
> 
> -       fun:pg_comp_crc32c
> +       fun:pg_comp_crc32c*
> 
> as you have it, I'd prefer to use
> 
> -       fun:pg_comp_crc32c
> +       fun:pg_comp_crc32c_sb8
> 
> which precisely matches what 4f700bc did.  The other way seems like
> it's giving a free pass to problems that could lurk in unrelated CRC
> implementations.

The undefined data is in the CRC input, namely the padding bytes in xl_*
structs.  Apparently, valgrind-3.15.0 doesn't complain about undefined input
to _mm_crc32_u* functions.  We should not be surprised if Valgrind gains the
features necessary to complain about the other implementations.

Most COMP_CRC32C callers don't have a suppression, so Valgrind still studies
each CRC implementation via those other callers.



Re: valgrind error

From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> On Fri, Jun 05, 2020 at 12:17:54PM -0400, Tom Lane wrote:
>> as you have it, I'd prefer to use
>> -       fun:pg_comp_crc32c
>> +       fun:pg_comp_crc32c_sb8
>> which precisely matches what 4f700bc did.  The other way seems like
>> it's giving a free pass to problems that could lurk in unrelated CRC
>> implementations.

> The undefined data is in the CRC input, namely the padding bytes in xl_*
> structs.

Oh, I see.  Objection withdrawn.

> Apparently, valgrind-3.15.0 doesn't complain about undefined input
> to _mm_crc32_u* functions.  We should not be surprised if Valgrind gains the
> features necessary to complain about the other implementations.

Perhaps it already has ... I wonder if anyone's tried this on ARMv8
lately.

            regards, tom lane



Re: valgrind error

From
Tom Lane
Date:
I wrote:
> Noah Misch <noah@leadboat.com> writes:
>> Apparently, valgrind-3.15.0 doesn't complain about undefined input
>> to _mm_crc32_u* functions.  We should not be surprised if Valgrind gains the
>> features necessary to complain about the other implementations.

> Perhaps it already has ... I wonder if anyone's tried this on ARMv8
> lately.

I installed Fedora 32/aarch64 on a Raspberry Pi 3B+, and can report that
valgrind 3.16.0 is just as blind to this problem in pg_comp_crc32c_armv8
as it is in pg_comp_crc32c_sse42.  Seems odd, but there you have it.

(There are some other issues, but they seem fit for separate threads.)

            regards, tom lane