Re: Valgrind failures in Apply Launcher's bgworker_quickdie() exit - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Valgrind failures in Apply Launcher's bgworker_quickdie() exit
Msg-id 20190618191852.aqmw5dt3milodkqd@alap3.anarazel.de
In response to Re: Valgrind failures in Apply Launcher's bgworker_quickdie() exit  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Hi,

On 2018-12-17 15:35:01 -0800, Andres Freund wrote:
> On 2018-12-16 13:48:00 -0800, Andres Freund wrote:
> > On 2018-12-17 08:25:38 +1100, Thomas Munro wrote:
> > > On Mon, Dec 17, 2018 at 7:57 AM Andres Freund <andres@anarazel.de> wrote:
> > > > The interesting bit is that if I replace the _exit(2) in
> > > > bgworker_quickdie() with an exit(2) (i.e. processing atexit handlers),
> > > > or manually add an OPENSSL_cleanup() before the _exit(2), valgrind
> > > > doesn't find errors.
> > > 
> > > Weird.  Well I can see that there were bugs last year where OpenSSL
> > > failed to clean up its thread locals[1], and after they fixed that,
> > > cases where it bogusly cleaned up someone else's thread locals[2].
> > > Maybe there is some interference between pthread keys or something
> > > like that.
> > > 
> > > [1] https://github.com/openssl/openssl/issues/3033
> > > [2] https://github.com/openssl/openssl/issues/3584
> > 
> > What confuses the heck out of me is that it happens on _exit(). Those
> > issues ought to be only visible when doing exit(), no?
> > 
> > I guess there's also a good argument to make that valgrind running its
> > intercept in the _exit() case is a bit dubious (given that's going to be
> > used in cases where e.g. a signal handler might have interrupted a
> > malloc), but given the stacktraces here I don't think that can be the
> > cause.
> 
> I've for now put --run-libc-freeres=no into skink's config. Locally that
> "fixes" the issue for me, but of course is not a proper solution. But I
> want to see whether that allows running all tests under valgrind.
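
The quoted exchange turns on the difference between exit() and _exit(): exit() runs the handlers registered with atexit() (and flushes stdio) before terminating the process, while _exit() terminates immediately without running them. A minimal POSIX sketch of that difference follows. The function and variable names are hypothetical, and the sketch deliberately does not model valgrind's own libc-freeres hook, which per the discussion above still ran on _exit().

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write end of the pipe, used only in the child (hypothetical name). */
static int g_report_fd = -1;

/* atexit handler: report that it ran by writing one byte to the pipe. */
static void report_atexit(void)
{
    char c = 'x';
    ssize_t r = write(g_report_fd, &c, 1);
    (void) r;
}

/*
 * Fork a child that registers an atexit handler and then terminates via
 * exit() or _exit(). Returns 1 if the handler ran, 0 if it did not,
 * -1 on error.
 */
int atexit_ran_in_child(int use_underscore_exit)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: keep only the write end. */
        close(fds[0]);
        g_report_fd = fds[1];
        if (atexit(report_atexit) != 0)
            _exit(3);
        if (use_underscore_exit)
            _exit(2);   /* skips atexit handlers and stdio flushing */
        exit(2);        /* runs atexit handlers first */
    }

    /* Parent: read until EOF; any byte means the handler ran. */
    close(fds[1]);
    char buf;
    int ran = 0;
    while (read(fds[0], &buf, 1) > 0)
        ran = 1;
    close(fds[0]);

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return ran;
}
```

Calling atexit_ran_in_child(0) should report that the handler ran, and atexit_ran_in_child(1) that it did not. A pipe is used rather than a shared variable because the handler runs in the forked child, whose memory the parent cannot see.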

Turns out to be caused by a glibc bug:
https://sourceware.org/bugzilla/show_bug.cgi?id=24476

The reason it only fails if SSL is enabled, and only after the OpenSSL
randomness integration, is that OpenSSL's randomness initialization
creates a TLS variable, which glibc then frees by accident: it tries to
free thread-local state of its own that was never initialized, and ends
up freeing OpenSSL's variable instead.
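
The failure mode described above, cleanup code freeing a thread-local slot it never set up, can be illustrated with a small, hypothetical library that creates its pthread key lazily. This is a sketch under assumptions, not glibc's or OpenSSL's actual code: all names are made up. The point is the guard in lib_freeres(), which skips cleanup when the key was never created, because a never-initialized pthread_key_t variable may hold a value that is a valid key owned by another library.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical library-private TLS key, created lazily on first use. */
static pthread_key_t lib_key;            /* zero-initialized: NOT a key we created */
static pthread_once_t lib_once = PTHREAD_ONCE_INIT;
static atomic_bool lib_key_created = false;

static void lib_key_init(void)
{
    /* free() is the per-thread destructor for stored buffers. */
    if (pthread_key_create(&lib_key, free) == 0)
        atomic_store(&lib_key_created, true);
}

/* Store a fresh heap buffer in this thread's slot, creating the key on first use. */
int lib_set_buffer(size_t size)
{
    pthread_once(&lib_once, lib_key_init);
    if (!atomic_load(&lib_key_created))
        return -1;

    void *buf = malloc(size);
    if (buf == NULL)
        return -1;
    free(pthread_getspecific(lib_key));   /* replace any previous buffer */
    if (pthread_setspecific(lib_key, buf) != 0) {
        free(buf);
        return -1;
    }
    return 0;
}

/*
 * Exit-time cleanup in the spirit of a "freeres" hook. Returns the number
 * of buffers freed. Without the lib_key_created guard, lib_key would still
 * be 0 here, which may be a key some OTHER library owns, and we would free
 * memory that is not ours.
 */
int lib_freeres(void)
{
    if (!atomic_load(&lib_key_created))
        return 0;
    void *buf = pthread_getspecific(lib_key);
    if (buf == NULL)
        return 0;
    pthread_setspecific(lib_key, NULL);
    free(buf);
    return 1;
}
```

Calling lib_freeres() before lib_set_buffer() is a no-op; after one call to lib_set_buffer() it frees exactly one buffer.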

Thus this can be "worked around" by doing something like
shared_preload_libraries=pg_stat_statements, as dlopening a library
initializes the relevant state.

Greetings,

Andres Freund
