Thread: Non-reproducible valgrind failure on HEAD

Non-reproducible valgrind failure on HEAD

From
Tom Lane
Date:
I happened to be trying to run the core regression tests under
valgrind, and I got the complaints attached below, from the
process that had been running the insert_conflict test script.

I could not reproduce the failure in a second run, which is not
hugely surprising because it appears to be in cross-process
sinval processing; so timing sensitivity is to be expected.
That doesn't make it any less disturbing.

One point worth mentioning is that I'd forgotten to build with
"#define USE_VALGRIND" in the first try.  AFAIK that should make
valgrind strictly less sensitive, so I think it's not material,
but still.

Thoughts?

            regards, tom lane

==00:00:01:51.603 386293== Memcheck, a memory error detector
==00:00:01:51.603 386293== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==00:00:01:51.603 386293== Using Valgrind-3.16.0 and LibVEX; rerun with -h for copyright info
==00:00:01:51.603 386293== Command: postmaster -i -F
==00:00:01:51.603 386293== Parent PID: 384272
==00:00:01:51.603 386293==
==00:00:02:01.931 386293== Conditional jump or move depends on uninitialised value(s)
==00:00:02:01.931 386293==    at 0x945117: calc_bucket (dynahash.c:920)
==00:00:02:01.931 386293==    by 0x945117: hash_search_with_hash_value (dynahash.c:1012)
==00:00:02:01.931 386293==    by 0x8278C4: smgrclosenode (smgr.c:318)
==00:00:02:01.931 386293==    by 0x80EDEC: ReceiveSharedInvalidMessages (sinval.c:120)
==00:00:02:01.931 386293==    by 0x924112: AcceptInvalidationMessages (inval.c:689)
==00:00:02:01.931 386293==    by 0x8143B8: LockRelationOid (lmgr.c:137)
==00:00:02:01.931 386293==    by 0x51710C: relation_open (relation.c:56)
==00:00:02:01.931 386293==    by 0x59F6B5: table_open (table.c:43)
==00:00:02:01.931 386293==    by 0x5DBCA8: StorePartitionBound (heap.c:3789)
==00:00:02:01.931 386293==    by 0x686EB2: ATExecAttachPartition.isra.58 (tablecmds.c:17263)
==00:00:02:01.931 386293==    by 0x688F9E: ATExecCmd (tablecmds.c:5111)
==00:00:02:01.931 386293==    by 0x6896BA: ATRewriteCatalogs (tablecmds.c:4781)
==00:00:02:01.931 386293==    by 0x6896BA: ATController (tablecmds.c:4376)
==00:00:02:01.931 386293==    by 0x8320D2: ProcessUtilitySlow.isra.7 (utility.c:1284)
==00:00:02:01.931 386293==
==00:00:02:01.931 386293== Use of uninitialised value of size 8
==00:00:02:01.931 386293==    at 0x945026: hash_search_with_hash_value (dynahash.c:1017)
==00:00:02:01.931 386293==    by 0x8278C4: smgrclosenode (smgr.c:318)
==00:00:02:01.931 386293==    by 0x80EDEC: ReceiveSharedInvalidMessages (sinval.c:120)
==00:00:02:01.931 386293==    by 0x924112: AcceptInvalidationMessages (inval.c:689)
==00:00:02:01.931 386293==    by 0x8143B8: LockRelationOid (lmgr.c:137)
==00:00:02:01.931 386293==    by 0x51710C: relation_open (relation.c:56)
==00:00:02:01.931 386293==    by 0x59F6B5: table_open (table.c:43)
==00:00:02:01.931 386293==    by 0x5DBCA8: StorePartitionBound (heap.c:3789)
==00:00:02:01.932 386293==    by 0x686EB2: ATExecAttachPartition.isra.58 (tablecmds.c:17263)
==00:00:02:01.932 386293==    by 0x688F9E: ATExecCmd (tablecmds.c:5111)
==00:00:02:01.932 386293==    by 0x6896BA: ATRewriteCatalogs (tablecmds.c:4781)
==00:00:02:01.932 386293==    by 0x6896BA: ATController (tablecmds.c:4376)
==00:00:02:01.932 386293==    by 0x8320D2: ProcessUtilitySlow.isra.7 (utility.c:1284)
==00:00:02:01.932 386293==
==00:00:02:01.932 386293== Use of uninitialised value of size 8
==00:00:02:01.932 386293==    at 0x94503F: hash_search_with_hash_value (dynahash.c:1023)
==00:00:02:01.932 386293==    by 0x8278C4: smgrclosenode (smgr.c:318)
==00:00:02:01.932 386293==    by 0x80EDEC: ReceiveSharedInvalidMessages (sinval.c:120)
==00:00:02:01.932 386293==    by 0x924112: AcceptInvalidationMessages (inval.c:689)
==00:00:02:01.932 386293==    by 0x8143B8: LockRelationOid (lmgr.c:137)
==00:00:02:01.932 386293==    by 0x51710C: relation_open (relation.c:56)
==00:00:02:01.932 386293==    by 0x59F6B5: table_open (table.c:43)
==00:00:02:01.932 386293==    by 0x5DBCA8: StorePartitionBound (heap.c:3789)
==00:00:02:01.932 386293==    by 0x686EB2: ATExecAttachPartition.isra.58 (tablecmds.c:17263)
==00:00:02:01.932 386293==    by 0x688F9E: ATExecCmd (tablecmds.c:5111)
==00:00:02:01.932 386293==    by 0x6896BA: ATRewriteCatalogs (tablecmds.c:4781)
==00:00:02:01.932 386293==    by 0x6896BA: ATController (tablecmds.c:4376)
==00:00:02:01.932 386293==    by 0x8320D2: ProcessUtilitySlow.isra.7 (utility.c:1284)
==00:00:02:01.932 386293==
==00:00:02:01.932 386293== Conditional jump or move depends on uninitialised value(s)
==00:00:02:01.932 386293==    at 0x945071: hash_search_with_hash_value (dynahash.c:1033)
==00:00:02:01.932 386293==    by 0x8278C4: smgrclosenode (smgr.c:318)
==00:00:02:01.932 386293==    by 0x80EDEC: ReceiveSharedInvalidMessages (sinval.c:120)
==00:00:02:01.932 386293==    by 0x924112: AcceptInvalidationMessages (inval.c:689)
==00:00:02:01.932 386293==    by 0x8143B8: LockRelationOid (lmgr.c:137)
==00:00:02:01.932 386293==    by 0x51710C: relation_open (relation.c:56)
==00:00:02:01.932 386293==    by 0x59F6B5: table_open (table.c:43)
==00:00:02:01.932 386293==    by 0x5DBCA8: StorePartitionBound (heap.c:3789)
==00:00:02:01.932 386293==    by 0x686EB2: ATExecAttachPartition.isra.58 (tablecmds.c:17263)
==00:00:02:01.932 386293==    by 0x688F9E: ATExecCmd (tablecmds.c:5111)
==00:00:02:01.932 386293==    by 0x6896BA: ATRewriteCatalogs (tablecmds.c:4781)
==00:00:02:01.932 386293==    by 0x6896BA: ATController (tablecmds.c:4376)
==00:00:02:01.932 386293==    by 0x8320D2: ProcessUtilitySlow.isra.7 (utility.c:1284)
==00:00:02:01.932 386293==
==00:00:02:02.765 386293==
==00:00:02:02.765 386293== HEAP SUMMARY:
==00:00:02:02.765 386293==     in use at exit: 2,385,524 bytes in 538 blocks
==00:00:02:02.765 386293==   total heap usage: 8,340 allocs, 7,802 frees, 162,448,006 bytes allocated
==00:00:02:02.765 386293==
==00:00:02:03.192 386293== LEAK SUMMARY:
==00:00:02:03.192 386293==    definitely lost: 368 bytes in 1 blocks
==00:00:02:03.192 386293==    indirectly lost: 1,746 bytes in 45 blocks
==00:00:02:03.192 386293==      possibly lost: 704 bytes in 4 blocks
==00:00:02:03.192 386293==    still reachable: 2,341,786 bytes in 484 blocks
==00:00:02:03.192 386293==         suppressed: 40,920 bytes in 4 blocks
==00:00:02:03.192 386293== Rerun with --leak-check=full to see details of leaked memory
==00:00:02:03.192 386293==
==00:00:02:03.192 386293== Use --track-origins=yes to see where uninitialised values come from
==00:00:02:03.192 386293== For lists of detected and suppressed errors, rerun with: -s
==00:00:02:03.192 386293== ERROR SUMMARY: 123 errors from 4 contexts (suppressed: 272 from 8)

Re: Non-reproducible valgrind failure on HEAD

From
Andres Freund
Date:
Hi,

On 2021-05-09 15:29:06 -0400, Tom Lane wrote:
> I happened to be trying to run the core regression tests under
> valgrind, and I got the complaints attached below, from the
> process that had been running the insert_conflict test script.
>
> I could not reproduce the failure in a second run, which is not
> hugely surprising because it appears to be in cross-process
> sinval processing; so timing sensitivity is to be expected.
> That doesn't make it any less disturbing.
>
> One point worth mentioning is that I'd forgotten to build with
> "#define USE_VALGRIND" in the first try.  AFAIK that should make
> valgrind strictly less sensitive, so I think it's not material,
> but still.

I think it may be material - see the comments in
AddCatcacheInvalidationMessage(). Valgrind doesn't cope correctly with
the sinval ringbuffer being accessed in multiple processes. If process A
adds an invalidation to position 0 and then later processes another
invalidation at the same position 0 that was added by B, valgrind will
use the "is defined" state from the invalidation it queued itself, not
the one that B queued.

Greetings,

Andres Freund



Re: Non-reproducible valgrind failure on HEAD

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2021-05-09 15:29:06 -0400, Tom Lane wrote:
>> One point worth mentioning is that I'd forgotten to build with
>> "#define USE_VALGRIND" in the first try.  AFAIK that should make
>> valgrind strictly less sensitive, so I think it's not material,
>> but still.

> I think it may be material - see the comments in
> AddCatcacheInvalidationMessage().

Ah-hah.  Please excuse the noise.

(I wonder if we shouldn't adjust the comments in pg_config_manual.h,
though, as they certainly leave the impression that USE_VALGRIND
isn't essential.)

            regards, tom lane



Re: Non-reproducible valgrind failure on HEAD

From
Andres Freund
Date:
Hi,

On 2021-05-09 17:12:14 -0400, Tom Lane wrote:
> (I wonder if we shouldn't adjust the comments in pg_config_manual.h,
> though, as they certainly leave the impression that USE_VALGRIND
> isn't essential.)

That'd make sense to me. If we found a better way to deal with the
sinval thing it'd be good too - but I am not seeing anything convincing,
and I looked a couple times over the years...



Re: Non-reproducible valgrind failure on HEAD

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2021-05-09 17:12:14 -0400, Tom Lane wrote:
>> (I wonder if we shouldn't adjust the comments in pg_config_manual.h,
>> though, as they certainly leave the impression that USE_VALGRIND
>> isn't essential.)

> That'd make sense to me. If we found a better way to deal with the
> sinval thing it'd be good too - but I am not seeing anything convincing,
> and I looked a couple times over the years...

Yeah, it's actually somewhat amazing that we get useful results at all
around shared-memory accesses.

Proposed comment patch attached.

            regards, tom lane

diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index e28c990382..42ee43f0aa 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -279,12 +279,18 @@
  * enables detection of buffer accesses that take place without holding a
  * buffer pin (or without holding a buffer lock in the case of index access
  * methods that superimpose their own custom client requests on top of the
- * generic bufmgr.c requests).  See also src/tools/valgrind.supp.
+ * generic bufmgr.c requests).
  *
  * "make installcheck" is significantly slower under Valgrind.  The client
  * requests fall in hot code paths, so USE_VALGRIND slows execution by a few
  * percentage points even when not run under Valgrind.
  *
+ * Do not try to test the server under Valgrind without having built the
+ * server with USE_VALGRIND; else you will get false positives from sinval
+ * messaging (see comments in AddCatcacheInvalidationMessage).  It's also
+ * important to use the suppression file src/tools/valgrind.supp to
+ * exclude other known false positives.
+ *
  * You should normally use MEMORY_CONTEXT_CHECKING with USE_VALGRIND;
  * instrumentation of repalloc() is inferior without it.
  */

Re: Non-reproducible valgrind failure on HEAD

From
Andres Freund
Date:
On 2021-05-09 18:17:36 -0400, Tom Lane wrote:
> Proposed comment patch attached.

lgtm.



Re: Non-reproducible valgrind failure on HEAD

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2021-05-09 18:17:36 -0400, Tom Lane wrote:
>> Proposed comment patch attached.

> lgtm.

Pushed, thanks.

            regards, tom lane