Re: Regression tests fail with musl libc because libpq.so can't be loaded - Mailing list pgsql-bugs

From Wolfgang Walther
Subject Re: Regression tests fail with musl libc because libpq.so can't be loaded
Msg-id f98cd8de-1c66-491a-8409-e62c09932080@technowledgy.de
In response to Re: Regression tests fail with musl libc because libpq.so can't be loaded  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: Regression tests fail with musl libc because libpq.so can't be loaded  (walther@technowledgy.de)
Re: Regression tests fail with musl libc because libpq.so can't be loaded  (Andres Freund <andres@anarazel.de>)
List pgsql-bugs
Thomas Munro:
> Of course we have to distinguish between the basic argv[] clobbering
> trick which is barely even a trick, and the more advanced environ
> stealing trick which confuses musl.

Right. The latter not only confuses musl, but also makes
/proc/<pid>/environ return garbage. This is also mentioned at the bottom
of main.c, which has a workaround for the specific case of UBSan
depending on that. This is kind of funny: because we rely on undefined
behavior when modifying environ, we need a workaround for the
"UndefinedBehaviorSanitizer". I guess that by failing without this
workaround, it was trying to tell us something.

The /proc/<pid>/environ problem happens on glibc, too.

So summarizing:

1. The simple approach is to use PS_USE_CLOBBER_ARGV on Linux only for
glibc and other known-to-be-good-and-identifiable libc variants, and
otherwise default to PS_USE_NONE. This will not only keep the
/proc/.../environ problem for glibc users, but also disable ps status
for musl entirely. Considering that probably the biggest use case for
musl is running postgres in containers, it's quite likely that more
than one cluster runs on a single machine. In that case, ps status
would be especially handy to identify which cluster a process belongs to.

2. The next proposal was to stop clobbering environ once LD_LIBRARY_PATH
/ LD_PRELOAD is found, to keep those intact. This will keep ps status
support on musl, which is good. But the /proc/.../environ problem will
still be there, unchanged.

Both of those approaches rely on the undefined behavior of clobbering 
environ.
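
To make the difference between these variants concrete, here is a
minimal sketch of how the size of the clobberable area could be measured
for basic argv-only clobbering, for clobbering that stops at
LD_LIBRARY_PATH / LD_PRELOAD, and for clobbering all of environ. It is
not the code in ps_status.c; the names clobber_area_size, area_mode and
is_ld_var are made up for illustration, and it assumes the usual Linux
layout where the argv and environ strings sit back to back in memory.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical modes, named for this sketch only. */
enum area_mode
{
    AREA_ARGV_ONLY,   /* use the argv strings only */
    AREA_STOP_AT_LD,  /* extend into environ, stop at LD_LIBRARY_PATH / LD_PRELOAD */
    AREA_ALL          /* extend through all contiguous environ strings */
};

/* True if s is an LD_LIBRARY_PATH or LD_PRELOAD environment entry. */
static int
is_ld_var(const char *s)
{
    return strncmp(s, "LD_LIBRARY_PATH=", sizeof("LD_LIBRARY_PATH=") - 1) == 0 ||
           strncmp(s, "LD_PRELOAD=", sizeof("LD_PRELOAD=") - 1) == 0;
}

/*
 * Return the number of bytes (including NUL terminators) in the contiguous
 * run of strings that starts at argv[0], under the given mode.
 */
size_t
clobber_area_size(int argc, char **argv, char **envp, enum area_mode mode)
{
    char *start;
    char *end;

    if (argc <= 0 || argv == NULL || argv[0] == NULL)
        return 0;
    start = end = argv[0];

    /* Walk argv while each string begins right after the previous one. */
    for (int i = 0; i < argc; i++)
    {
        if (argv[i] != end)
            return (size_t) (end - start);  /* gap: the area ends here */
        end = argv[i] + strlen(argv[i]) + 1;
    }

    if (mode == AREA_ARGV_ONLY || envp == NULL)
        return (size_t) (end - start);

    /* Optionally continue into the environment strings. */
    for (int i = 0; envp[i] != NULL; i++)
    {
        if (envp[i] != end)
            break;
        if (mode == AREA_STOP_AT_LD && is_ld_var(envp[i]))
            break;
        end = envp[i] + strlen(envp[i]) + 1;
    }
    return (size_t) (end - start);
}
```

With AREA_STOP_AT_LD the result depends on where the LD_* variables
happen to sit in the environment, and with AREA_ARGV_ONLY it depends on
the command line alone, so in neither case is the size something the
program controls.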

3. The logical consequence of this is to stop clobbering environ and
use only the available argv space. However, this will quickly leave us
with a very small ps status buffer to work with, making the feature less
useful. Note that this could theoretically happen today as well, by
starting postgres with the fewest arguments and environment variables
possible. I'm not sure what the minimal buffer size achievable that way
is. The point is: the buffer size is not guaranteed at all.

4. The upstream (musl) suggestion, for which I sent a PoC, was to "exec
yourself with a bigger argv". This works. I chose to pad argv0 with
trailing slashes. Those can safely be stripped away again, because any
argv0 that came with a trailing slash to start with would name a
directory rather than the current executable, so exec would have failed
immediately anyway. This keeps /proc/.../environ intact and does not
rely on undefined behavior. Additionally, we get a guaranteed ps buffer
size of 256, which is what we use on BSDs and Windows, too.
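
The padding idea from point 4 can be sketched roughly as follows. This
is not the PoC itself, only an illustration under assumptions: the
helper names pad_argv0, stripped_len and reexec_padded are invented
here, PS_BUFFER_SIZE is just a local constant set to the 256 mentioned
above, and re-executing via execvp() on the original argv0 is one
possible choice that the actual patch may not share.

```c
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define PS_BUFFER_SIZE 256  /* assumed target size for this sketch */

/*
 * Return a malloc'd copy of argv0 padded with trailing '/' so that the
 * string plus its NUL terminator occupies at least "size" bytes.
 * Returns NULL on allocation failure.
 */
char *
pad_argv0(const char *argv0, size_t size)
{
    size_t len = strlen(argv0);
    size_t target = (size > 0) ? size - 1 : 0;
    size_t total = (len > target) ? len : target;
    char  *out = malloc(total + 1);

    if (out == NULL)
        return NULL;
    memcpy(out, argv0, len);
    memset(out + len, '/', total - len);
    out[total] = '\0';
    return out;
}

/* Length of argv0 once any trailing '/' padding is ignored. */
size_t
stripped_len(const char *argv0)
{
    size_t len = strlen(argv0);

    while (len > 0 && argv0[len - 1] == '/')
        len--;
    return len;
}

/*
 * If argv[0] is too short, exec the same program again with a padded
 * argv[0].  Returns 0 if no re-exec is needed, -1 on failure; on success
 * execvp() does not return.
 */
int
reexec_padded(char **argv)
{
    char *orig = argv[0];
    char *padded;

    if (strlen(orig) + 1 >= PS_BUFFER_SIZE)
        return 0;               /* already large enough, e.g. after re-exec */
    padded = pad_argv0(orig, PS_BUFFER_SIZE);
    if (padded == NULL)
        return -1;
    argv[0] = padded;
    execvp(orig, argv);         /* only returns on error */
    argv[0] = orig;
    free(padded);
    return -1;
}
```

In such a scheme, startup would call reexec_padded(argv) first; the
re-executed process then finds an argv0 that already spans
PS_BUFFER_SIZE bytes, uses stripped_len() to recover the real program
name, and treats the whole argv0 string as the ps buffer.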

I wonder why we actually fall back to PS_USE_NONE by default, and how
much of that is related to the environment clobbering to start with.
Could we even use the exec approach as the fallback in all other cases
except BSDs and Windows and get rid of PS_USE_NONE? Clobbering only argv
sure seems way safer than what we do right now.

Best,

Wolfgang


