Thread: Re: Valgrind failures caused by multivariate stats patch

Re: Valgrind failures caused by multivariate stats patch

From
Tomas Vondra
Date:
On 03/25/2017 10:10 PM, Andres Freund wrote:
...
> ==2486== Uninitialised byte(s) found during client check request
> ==2486==    at 0x1C4857: printtup (printtup.c:347)
> ==2486==    by 0x401FD5: ExecutePlan (execMain.c:1681)
> ==2486==    by 0x3FFDED: standard_ExecutorRun (execMain.c:355)
> ==2486==    by 0x3FFC07: ExecutorRun (execMain.c:298)
> ==2486==    by 0x5C40BD: PortalRunSelect (pquery.c:928)
> ==2486==    by 0x5C3D50: PortalRun (pquery.c:769)
> ==2486==    by 0x5BDC1A: exec_simple_query (postgres.c:1101)
> ==2486==    by 0x5C203F: PostgresMain (postgres.c:4071)
> ==2486==    by 0x525138: BackendRun (postmaster.c:4317)
> ==2486==    by 0x524848: BackendStartup (postmaster.c:3989)
> ==2486==    by 0x520C4A: ServerLoop (postmaster.c:1729)
> ==2486==    by 0x5201D9: PostmasterMain (postmaster.c:1337)
> ==2486==    by 0x45DFA6: main (main.c:228)
> ==2486==  Address 0x3e8c50f5 is in a rw- anonymous segment
> ==2486==  Uninitialised value was created by a heap allocation
> ==2486==    at 0x76212F: palloc (mcxt.c:872)
> ==2486==    by 0x578814: statext_ndistinct_build (mvdistinct.c:80)
> ==2486==    by 0x577CCE: BuildRelationExtStatistics (extended_stats.c:94)
> ==2486==    by 0x3546B6: do_analyze_rel (analyze.c:577)
> ==2486==    by 0x353B76: analyze_rel (analyze.c:271)
> ==2486==    by 0x3E7B89: vacuum (vacuum.c:321)
> ==2486==    by 0x3E7755: ExecVacuum (vacuum.c:122)
> ==2486==    by 0x5C5F1C: standard_ProcessUtility (utility.c:670)
> ==2486==    by 0x5C5724: ProcessUtility (utility.c:353)
> ==2486==    by 0x5C46CA: PortalRunUtility (pquery.c:1174)
> ==2486==    by 0x5C48D0: PortalRunMulti (pquery.c:1317)
> ==2486==    by 0x5C3E1A: PortalRun (pquery.c:795)
> ==2486==    by 0x5BDC1A: exec_simple_query (postgres.c:1101)
> ==2486==    by 0x5C203F: PostgresMain (postgres.c:4071)
> ==2486==    by 0x525138: BackendRun (postmaster.c:4317)
> ==2486==    by 0x524848: BackendStartup (postmaster.c:3989)
> ==2486==    by 0x520C4A: ServerLoop (postmaster.c:1729)
> ==2486==    by 0x5201D9: PostmasterMain (postmaster.c:1337)
> ==2486==    by 0x45DFA6: main (main.c:228)

Hmmm, so I have a theory about what's going on, but no matter what I do 
I can't trigger these valgrind failures. What switches are you using to 
start valgrind? I'm using this:

valgrind --leak-check=no --undef-value-errors=yes \  --expensive-definedness-checks=yes --track-origins=yes \
--read-var-info=yes--gen-suppressions=all \  --suppressions=src/tools/valgrind.supp --time-stamp=yes \
--log-file=$HOME/pg-valgrind/%p.log--trace-children=yes \  postgres --log_line_prefix="%m %p " --log_statement=all \
--shared_buffers=64MB-D /home/user/tmp/data-valgrind 2>&1
 

FWIW I think the issue is that statext_ndistinct_build does palloc() and 
then uses offsetof() to copy data into the chunk, which might result in 
a few bytes skipped due to alignment/padding. But as I can't reproduce 
the valgrind failure, it's hard to say.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Valgrind failures caused by multivariate stats patch

From
Andres Freund
Date:
On 2017-03-26 20:38:52 +0200, Tomas Vondra wrote:
> On 03/25/2017 10:10 PM, Andres Freund wrote:
> ...
> > ==2486== Uninitialised byte(s) found during client check request
> > ==2486==    at 0x1C4857: printtup (printtup.c:347)
> > ==2486==    by 0x401FD5: ExecutePlan (execMain.c:1681)
> > ==2486==    by 0x3FFDED: standard_ExecutorRun (execMain.c:355)
> > ==2486==    by 0x3FFC07: ExecutorRun (execMain.c:298)
> > ==2486==    by 0x5C40BD: PortalRunSelect (pquery.c:928)
> > ==2486==    by 0x5C3D50: PortalRun (pquery.c:769)
> > ==2486==    by 0x5BDC1A: exec_simple_query (postgres.c:1101)
> > ==2486==    by 0x5C203F: PostgresMain (postgres.c:4071)
> > ==2486==    by 0x525138: BackendRun (postmaster.c:4317)
> > ==2486==    by 0x524848: BackendStartup (postmaster.c:3989)
> > ==2486==    by 0x520C4A: ServerLoop (postmaster.c:1729)
> > ==2486==    by 0x5201D9: PostmasterMain (postmaster.c:1337)
> > ==2486==    by 0x45DFA6: main (main.c:228)
> > ==2486==  Address 0x3e8c50f5 is in a rw- anonymous segment
> > ==2486==  Uninitialised value was created by a heap allocation
> > ==2486==    at 0x76212F: palloc (mcxt.c:872)
> > ==2486==    by 0x578814: statext_ndistinct_build (mvdistinct.c:80)
> > ==2486==    by 0x577CCE: BuildRelationExtStatistics (extended_stats.c:94)
> > ==2486==    by 0x3546B6: do_analyze_rel (analyze.c:577)
> > ==2486==    by 0x353B76: analyze_rel (analyze.c:271)
> > ==2486==    by 0x3E7B89: vacuum (vacuum.c:321)
> > ==2486==    by 0x3E7755: ExecVacuum (vacuum.c:122)
> > ==2486==    by 0x5C5F1C: standard_ProcessUtility (utility.c:670)
> > ==2486==    by 0x5C5724: ProcessUtility (utility.c:353)
> > ==2486==    by 0x5C46CA: PortalRunUtility (pquery.c:1174)
> > ==2486==    by 0x5C48D0: PortalRunMulti (pquery.c:1317)
> > ==2486==    by 0x5C3E1A: PortalRun (pquery.c:795)
> > ==2486==    by 0x5BDC1A: exec_simple_query (postgres.c:1101)
> > ==2486==    by 0x5C203F: PostgresMain (postgres.c:4071)
> > ==2486==    by 0x525138: BackendRun (postmaster.c:4317)
> > ==2486==    by 0x524848: BackendStartup (postmaster.c:3989)
> > ==2486==    by 0x520C4A: ServerLoop (postmaster.c:1729)
> > ==2486==    by 0x5201D9: PostmasterMain (postmaster.c:1337)
> > ==2486==    by 0x45DFA6: main (main.c:228)
> 
> Hmmm, so I have a theory about what's going on, but no matter what I do I
> can't trigger these valgrind failures. What switches are you using to start
> valgrind? I'm using this:
> 
> valgrind --leak-check=no --undef-value-errors=yes \
>   --expensive-definedness-checks=yes --track-origins=yes \
>   --read-var-info=yes --gen-suppressions=all \
>   --suppressions=src/tools/valgrind.supp --time-stamp=yes \
>   --log-file=$HOME/pg-valgrind/%p.log --trace-children=yes \
>   postgres --log_line_prefix="%m %p " --log_statement=all \
>   --shared_buffers=64MB -D /home/user/tmp/data-valgrind 2>&1
--quiet \        --error-exitcode=55 \        --suppressions=${PG_SRC_DIR}/src/tools/valgrind.supp \
--trace-children=yes--track-origins=yes --read-var-info=yes \ --num-callers=20 \        --leak-check=no \
--gen-suppressions=all\
 

the error-exitcode makes it a whole lot easier to see an error, because
it triggers a crash-restart cycle at backend exit ;)

Note that skink also finds the issue:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=skink&dt=2017-03-26%2013%3A00%3A01&stg=check

Did you compile postgres with valgrind support (-DUSE_VALGRIND / pg_config_manual.h)?

- Andres



Re: Valgrind failures caused by multivariate stats patch

From
Tomas Vondra
Date:
On 03/26/2017 08:47 PM, Andres Freund wrote:
> On 2017-03-26 20:38:52 +0200, Tomas Vondra wrote:
>> ...
>> Hmmm, so I have a theory about what's going on, but no matter what I do I
>> can't trigger these valgrind failures. What switches are you using to start
>> valgrind? I'm using this:
>>
>> valgrind --leak-check=no --undef-value-errors=yes \
>>   --expensive-definedness-checks=yes --track-origins=yes \
>>   --read-var-info=yes --gen-suppressions=all \
>>   --suppressions=src/tools/valgrind.supp --time-stamp=yes \
>>   --log-file=$HOME/pg-valgrind/%p.log --trace-children=yes \
>>   postgres --log_line_prefix="%m %p " --log_statement=all \
>>   --shared_buffers=64MB -D /home/user/tmp/data-valgrind 2>&1
>
>     --quiet \
>          --error-exitcode=55 \
>          --suppressions=${PG_SRC_DIR}/src/tools/valgrind.supp \
>          --trace-children=yes --track-origins=yes --read-var-info=yes \
>      --num-callers=20 \
>          --leak-check=no \
>      --gen-suppressions=all \
>
> the error-exitcode makes it a whole lot easier to see an error, because
> it triggers a crash-restart cycle at backend exit ;)
>

OK, this did the trick. And just as I suspected, it seems to be due to 
doing memcpy+offsetof when serializing the statistic into a bytea. The 
attached patch fixes that for me. Can you test it on your build?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Valgrind failures caused by multivariate stats patch

From
Alvaro Herrera
Date:
Tomas Vondra wrote:

> OK, this did the trick. And just as I suspected, it seems to be due to doing
> memcpy+offsetof when serializing the statistic into a bytea. The attached
> patch fixes that for me. Can you test it on your build?

Buildfarm member skink confirms that this is fixed.  Thanks!

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services