Thread: Unnecessary connection overhead due copy-on-write (mainly openssl)

From
Andres Freund
Date:
Hi,

Looking at [1] I, again, noticed that a decent portion of our connection
overhead is due to openssl's atexit handler.

On my older workstation (with a few noisy things running):

c=16;pgbench -n -M prepared -c$c -j$c -P1 -T10 -f <(echo 'select') -C
-> 3057 TPS

If I change the exit() in proc_exit() to a _exit():
-> 3633 TPS
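
For readers unfamiliar with the distinction: exit() runs every handler registered via atexit(), while _exit() terminates the process without running them. A minimal sketch of that difference, assuming a POSIX system (the function and handler names are made up for illustration and are not from the PostgreSQL source):

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int report_fd = -1;      /* write end of the pipe, used by the handler */

/* Stand-in for a library's cleanup handler (e.g. one that frees memory). */
static void
cleanup_handler(void)
{
    char    c = 'x';
    ssize_t rc = write(report_fd, &c, 1);

    (void) rc;
}

/*
 * Fork a child that registers cleanup_handler() and then terminates with
 * either exit() or _exit().  Returns 1 if the handler ran in the child,
 * 0 if it did not, -1 on error.
 */
int
child_ran_atexit_handler(int use_underscore_exit)
{
    int     fds[2];
    pid_t   pid;
    char    c;
    ssize_t n;

    if (pipe(fds) != 0)
        return -1;
    fflush(NULL);               /* don't let the child flush copied stdio buffers */

    pid = fork();
    if (pid < 0)
    {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0)
    {
        close(fds[0]);
        report_fd = fds[1];
        atexit(cleanup_handler);
        if (use_underscore_exit)
            _exit(0);           /* skips atexit handlers */
        exit(0);                /* runs atexit handlers */
    }

    close(fds[1]);
    n = read(fds[0], &c, 1);    /* 1 byte if the handler ran, EOF otherwise */
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return n == 1 ? 1 : 0;
}
```

child_ran_atexit_handler(0) reports that the handler ran, child_ran_atexit_handler(1) reports that it did not.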

The reason for this difference is that by default openssl registers an atexit
handler that frees a lot of memory that was initialized in postmaster. That in
turn triggers page-faults due to the relevant pages now differing in child
processes. Which a) isn't cheap b) causes contention with postmaster, since
those datastructures are shared.
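
The effect can be approximated outside of postgres. The following is a rough sketch, assuming a POSIX system where getrusage() fills in ru_minflt: the parent allocates and initializes many small chunks, forks, and the child counts the minor page faults it takes while freeing them. The function names and the 256-byte chunk size are arbitrary choices for illustration; how many faults show up depends on the allocator, the page size and kernel settings.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Minor (no-I/O) page faults taken by this process so far, -1 on error. */
static long
minor_faults(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_minflt;
}

/*
 * Allocate and initialize nchunks small chunks in the parent, fork, and
 * optionally free them all in the child.  Returns the number of minor page
 * faults the child took while doing so, or -1 on error.
 */
long
child_faults_from_free(size_t nchunks, int do_free)
{
    void  **chunks;
    int     fds[2];
    pid_t   pid;
    long    delta = -1;
    size_t  i;

    chunks = calloc(nchunks, sizeof(void *));
    if (chunks == NULL)
        return -1;
    for (i = 0; i < nchunks; i++)
    {
        chunks[i] = malloc(256);
        if (chunks[i] == NULL)
            goto done;
        memset(chunks[i], 0x7f, 256);   /* make sure the pages exist */
    }

    if (pipe(fds) != 0)
        goto done;

    pid = fork();
    if (pid == 0)
    {
        long    before = minor_faults();
        long    result;
        ssize_t rc;

        /* free() updates allocator bookkeeping stored in the shared pages */
        if (do_free)
            for (i = 0; i < nchunks; i++)
                free(chunks[i]);
        result = minor_faults() - before;
        rc = write(fds[1], &result, sizeof(result));
        _exit(rc == (ssize_t) sizeof(result) ? 0 : 1);
    }

    close(fds[1]);
    if (pid > 0)
    {
        if (read(fds[0], &delta, sizeof(delta)) != (ssize_t) sizeof(delta))
            delta = -1;
        waitpid(pid, NULL, 0);
    }
    close(fds[0]);

done:
    for (i = 0; i < nchunks; i++)
        free(chunks[i]);
    free(chunks);
    return delta;
}
```

Comparing child_faults_from_free(n, 1) with child_faults_from_free(n, 0) for a large n gives a feel for how much copy-on-write work the frees alone cause; no particular numbers are claimed here.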


It's possible to tell openssl to not register an atexit handler, see [2]:

> OPENSSL_INIT_NO_ATEXIT
>   By default OpenSSL will attempt to clean itself up when the process exits via
>   an "atexit" handler. Using this option suppresses that behaviour. This means
>   that the application will have to clean up OpenSSL explicitly using
>   OPENSSL_cleanup().

One slight difficulty is that we initialize openssl somewhat indirectly, via
PostmasterMain()->InitProcessGlobals()->pg_prng_strong_seed() which then, if
built with openssl support, triggers initialization within RAND_status().


The quick hack of putting

#ifdef USE_OPENSSL
    OPENSSL_init_crypto(OPENSSL_INIT_NO_ATEXIT, NULL);
#endif

at the start of PostmasterMain() gets the connection speed up a fair bit:
-> 3449 TPS
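
A slightly more defensive variant of the same hack might look like the sketch below. The wrapper name is hypothetical; the only assumptions about OpenSSL are that OPENSSL_init_crypto() returns 1 on success and 0 on error, and that OPENSSL_INIT_NO_ATEXIT is a preprocessor macro that older releases may not define.

```c
#ifdef USE_OPENSSL
#include <openssl/crypto.h>
#endif

/*
 * Initialize libcrypto early, asking it not to register its atexit handler.
 * Returns 1 on success (or when there is nothing to do), 0 on failure.
 *
 * Hypothetical helper, meant to be called once at the start of
 * PostmasterMain(), before anything else touches OpenSSL.
 */
int
init_openssl_without_atexit(void)
{
#if defined(USE_OPENSSL) && defined(OPENSSL_INIT_NO_ATEXIT)
    /* With this flag, cleanup only happens if OPENSSL_cleanup() is called. */
    return OPENSSL_init_crypto(OPENSSL_INIT_NO_ATEXIT, NULL);
#else
    /* Built without OpenSSL, or the flag is unavailable: nothing to do. */
    return 1;
#endif
}
```

Checking the return value at least makes a failed initialization visible instead of silently falling back to the default behaviour.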


The reason this isn't as good as using _exit is that there are other libraries
with (effectively) atexit handlers. In particular ICU pulls in libstdc++,
which in turn seems to have a lot of destructors for global objects that
aren't cheap.

If I build without ICU support, the connection rate with exit() (and the
openssl "fix") is
-> 3863 TPS
and if I use _exit() it is
-> 3900 TPS

I.e. at that point the remaining atexit handlers only play a small role.

I don't know if there's a decent solution for the nontrivial overhead due to
ICU -> libstdc++'s atexit handlers.



There are a few related issues where we ourselves are to blame. The most prominent
one is that we go around and delete PostmasterContext in child processes. That
however doesn't really save memory, as the memory is still needed in
postmaster, we just end up causing page faults that trigger copy-on-write.

If I just comment out the MemoryContextDelete in PostgresMain() I see
connection rates improve from
-> 3891 TPS
to
-> 4004 TPS


If I build a much more minimal postgres, disabling all optional dependencies
other than openssl, I see a significant improvement, just due to fewer mmaps for
the libraries:
-> 4865 TPS

Interestingly, further disabling openssl and zlib does not help.


Greetings,

Andres Freund

[1] https://postgr.es/m/CAFbpF8OA44_UG%2BRYJcWH9WjF7E3GA6gka3gvH6nsrSnEe9H0NA%40mail.gmail.com
[2] https://docs.openssl.org/3.1/man3/OPENSSL_init_crypto/#name



Re: Unnecessary connection overhead due copy-on-write (mainly openssl)

From
Peter Eisentraut
Date:
On 05.06.25 21:58, Andres Freund wrote:
> The reason for this difference is that by default openssl registers an atexit
> handler that frees a lot of memory that was initialized in postmaster. That in
> turn triggers page-faults due to the relevant pages now differing in child
> processes. Which a) isn't cheap b) causes contention with postmaster, since
> those datastructures are shared.
> 
> 
> It's possible to tell openssl to not register an atexit handler, see [2]:
> 
>> OPENSSL_INIT_NO_ATEXIT
>>    By default OpenSSL will attempt to clean itself up when the process exits via
>>    an "atexit" handler. Using this option suppresses that behaviour. This means
>>    that the application will have to clean up OpenSSL explicitly using
>>    OPENSSL_cleanup().

It seems weird to me that openssl spends so much effort tidying up its 
memory allocations just before exiting.  We could just skip that. 
Looking through the code of OPENSSL_cleanup(), there might be one or two 
cases of log or trace files that get flushed during cleanup, so it's not 
an absolute no-brainer to skip all the cleanup.




On Thu, Jun 5, 2025 at 3:58 PM Andres Freund <andres@anarazel.de> wrote:
> There are a few related issues where we ourselves are to blame. The most prominent
> one is that we go around and delete PostmasterContext in child processes. That
> however doesn't really save memory, as the memory is still needed in
> postmaster, we just end up causing page faults that trigger copy-on-write.

If we're not going to bother deleting PostmasterContext, we could also
skip creating it in the first place. After all, if the storage isn't
actually freed, then we won't know whether things are leaking into
that context that actually do get used in child processes, so there's
really no point.

The current structure amounts to a design decision that at some point
in time the postmaster might allocate an amount of memory that we need
to free in child processes, whether or not that's actually true
currently. Not deleting it any more -- or not having it any more -- is
deciding that it shouldn't ever allocate a significant amount of
memory.

I don't know whether that's a good bet, but I wouldn't be surprised. I
think we've talked about wanting to move some things that the
postmaster currently does to a separate process, whether for
multi-threading or other reasons. But, if we do take the position that
the postmaster shouldn't allocate a significant amount of stuff, we
might want to add some checks someplace to prove that it doesn't.
Otherwise, it might get broken by some future patch without anybody
noticing.
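
One possible shape for such a check, as a standalone toy sketch: track how many bytes were requested from a context and compare that against a budget. The struct, the function names and the idea of a fixed byte budget are all hypothetical; nothing here is existing PostgreSQL API.

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for a memory context that counts requested bytes. */
typedef struct TrackedContext
{
    const char *name;
    size_t      bytes_allocated;
    size_t      budget;         /* hypothetical upper bound, in bytes */
} TrackedContext;

/* Allocate from the context and account for the request. */
void *
tracked_alloc(TrackedContext *cxt, size_t size)
{
    void   *p = malloc(size);

    if (p != NULL)
        cxt->bytes_allocated += size;
    return p;
}

/* Returns 1 while the context has stayed within its budget, else 0. */
int
context_within_budget(const TrackedContext *cxt)
{
    return cxt->bytes_allocated <= cxt->budget;
}
```

An assertion on context_within_budget() just before forking a child would flag a patch that starts allocating a lot in the postmaster.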

(For clarity, I'm not attempting to insist on anything here, just
sharing a few thoughts that come to mind.)

--
Robert Haas
EDB: http://www.enterprisedb.com



On Fri, Jun 6, 2025 at 4:56 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> It seems weird to me that openssl spends so much effort tidying up its
> memory allocations just before exiting.  We could just skip that.
> Looking through the code of OPENSSL_cleanup(), there might be one or two
> cases of log or trace files that get flushed during cleanup, so it's not
> an absolute no-brainer to skip all the cleanup.

I guess I'd be concerned that a hardware crypto provider might need
good-faith cleanup to work well. I understand they can't rely on
atexit in general, but there would be a big difference between "you
might have to clean up after a crash" and "every single connection
litters the hardware with unused stuff".

But that's pure FUD and guesswork; I have no examples to point to, so
there might not be any providers that need that.

--Jacob



Re: Jacob Champion
> I guess I'd be concerned that a hardware crypto provider might need
> good-faith cleanup to work well.

Hopefully not in every single backend.

Christoph



On Fri, Jun 06, 2025 at 08:41:20AM -0700, Jacob Champion wrote:
> I guess I'd be concerned that a hardware crypto provider might need
> good-faith cleanup to work well. I understand they can't rely on
> atexit in general, but there would be a big difference between "you
> might have to clean up after a crash" and "every single connection
> litters the hardware with unused stuff".

I'd expect all subsystems to recover cleanly from unclean shutdowns.  I
know, that's a lot to expect, but nowadays pretty much all filesystems
used in production do, for example.

> But that's pure FUD and guesswork; I have no examples to point to, so
> there might not be any providers that need that.

I doubt that PG w/ OpenSSL in any configuration maintains stateful
interactions with HW cryptographic providers.



Hi,

On 2025-06-06 08:41:20 -0700, Jacob Champion wrote:
> On Fri, Jun 6, 2025 at 4:56 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> > It seems weird to me that openssl spends so much effort tidying up its
> > memory allocations just before exiting.  We could just skip that.
> > Looking through the code of OPENSSL_cleanup(), there might be one or two
> > cases of log or trace files that get flushed during cleanup, so it's not
> > an absolute no-brainer to skip all the cleanup.
> 
> I guess I'd be concerned that a hardware crypto provider might need
> good-faith cleanup to work well. I understand they can't rely on
> atexit in general, but there would be a big difference between "you
> might have to clean up after a crash" and "every single connection
> litters the hardware with unused stuff".

It's not just crashes, e.g. the startup packet timeout is also handled by
_exit() - and it can be triggered remotely. ISTM that if crypto providers
can't handle _exit(), we have a bigger problem.

Alternatively we could try deferring more of openssl's initialization to
outside of postmaster - but that doesn't seem particularly realistic.

Greetings,

Andres Freund



On Fri, Jun 6, 2025 at 9:25 AM Nico Williams <nico@cryptonector.com> wrote:
> I'd expect all subsystems to recover cleanly from unclean shutdowns.  I
> know, that's a lot to expect, but nowadays pretty much all filesystems
> used in production do, for example.

I guess, but if we stop cleaning up entirely, we will suddenly be
stressing those code paths... But maybe that's a community service? :)

I realize I'm making an argument from fear and ignorance. Maybe that
ecosystem is very healthy. I'm just imagining the following
conversation:

DBA: we upgraded our server and our HSM is freaking out after a few
thousand connections; what gives?
us: oh, we stopped cleaning up after ourselves for performance! tell
your vendor to fix their drivers!
DBA: hahahaha

[1] is a description of the kind of problem I'm worried about. (It's
not 1:1 applicable to this situation, I just think we might start
seeing those sorts of bug reports.)

> I doubt that PG w/ OpenSSL in any configuration maintains stateful
> interactions with HW cryptographic providers.

(Why? From looking over the Cryptoki/PKCS#11 stuff, for example, isn't
a lot of that API stateful?)

--Jacob

[1] https://github.com/OpenSC/libp11/issues/228#issuecomment-402941378



On Fri, Jun 6, 2025 at 9:37 AM Andres Freund <andres@anarazel.de> wrote:
> It's not just crashes, e.g. the startup packet timeout is also handled by
> _exit() - and it can be triggered remotely.

Fair point...

> ISTM that if crypto providers
> can't handle _exit(), we have a bigger problem.

...so I guess I need to figure out whether we have a bigger problem. I
hope we don't.

Note that OpenSSL seems to be interested in removing the atexit()
handling altogether, and requiring applications to manually call the
cleanup function, in 4.0. [1]

--Jacob

[1] https://github.com/openssl/openssl/issues/22508



On Fri, Jun 06, 2025 at 11:58:38AM -0700, Jacob Champion wrote:
> > I'd expect all subsystems to recover cleanly from unclean shutdowns.  I
> > know, that's a lot to expect, but nowadays pretty much all filesystems
> > used in production do, for example.
> 
> I guess, but if we stop cleaning up entirely, we will suddenly be
> stressing those code paths... But maybe that's a community service? :)

The latter.

> I realize I'm making an argument from fear and ignorance. Maybe that
> ecosystem is very healthy. I'm just imagining the following
> conversation:
> 
> DBA: we upgraded our server and our HSM is freaking out after a few
> thousand connections; what gives?
> us: oh, we stopped cleaning up after ourselves for performance! tell
> your vendor to fix their drivers!
> DBA: hahahaha

TPMs for example have a concept of session.  You can have up to 64 open
sessions, and if you use the TPM resource manager and you're accessing
it through a file descriptor then the RM will just clean up when you
exit.  Though if you're accessing the raw TPM directly and then fail to
flush sessions, then yes, you'll eventually be unable to create new ones.

However no one will be using a discrete or firmware TPM for TLS server
certificate private key usage: discrete TPMs are way way too slow for
that, and firmware TPMs are... also way too slow.  You wouldn't bother
with a software TPM for this unless it's for privilege separation.

Anyways, if you were using a TPM then the user's startup scripts, or
postgres itself could just flush all sessions and be done.

Other types of hardware cryptographic providers also tend to have a
notion of "session", and they all tend to have relatively paltry limits,
which means that the software side that calls them will generally need
to be prepared to a) close its own sessions eagerly (at the cost of
extra overhead on the next operation), and b) recover from running out
of sessions (by flushing others at the cost of causing those that were
live to need retries).
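
To illustrate (b), here is a toy sketch of a session table with a small fixed limit that flushes the least recently used session when it runs out of slots. The limit of 4 and all names are invented for illustration; no real TPM, PKCS#11 or OpenSSL interface is being modelled.

```c
#include <stddef.h>

#define MAX_SESSIONS 4          /* hypothetical device limit; real limits vary */

typedef struct SessionTable
{
    int           in_use[MAX_SESSIONS];
    unsigned long last_used[MAX_SESSIONS];
    unsigned long clock;        /* logical time, bumped on every open */
    unsigned long flushes;      /* live sessions evicted to make room */
} SessionTable;

/*
 * Open a session and return its slot.  If every slot is taken, flush the
 * least recently used session and reuse its slot; the owner of the flushed
 * session would have to retry.
 */
int
session_open(SessionTable *t)
{
    int     slot = -1;
    int     i;

    for (i = 0; i < MAX_SESSIONS; i++)
    {
        if (!t->in_use[i])
        {
            slot = i;
            break;
        }
    }

    if (slot < 0)
    {
        slot = 0;
        for (i = 1; i < MAX_SESSIONS; i++)
            if (t->last_used[i] < t->last_used[slot])
                slot = i;
        t->flushes++;
    }

    t->in_use[slot] = 1;
    t->last_used[slot] = ++t->clock;
    return slot;
}

/* Eagerly release a session, as in (a). */
void
session_close(SessionTable *t, int slot)
{
    if (slot >= 0 && slot < MAX_SESSIONS)
        t->in_use[slot] = 0;
}
```

Closing sessions eagerly keeps the flushes counter at zero; a caller that never closes them ends up evicting a live session on every open once the table is full.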

But anyways, IIUC the OpenSSL engine interface is itself stateless and I
would expect providers to auto-recover.  And anyways I expect no one
uses PG with HW cryptographic providers to perform TLS server
signatures.  Instead the best current practice would be to use
short-lived server certificates with software keys and longer-lived
credentials in hardware with which to fetch new short-lived credentials
with software keys.  The kinds of HSMs that can do high rates of
signatures are neither cheap nor commonly used, and those do tend to
have higher session limits, and again you can recover from running out
of sessions by flushing extant sessions.

> > I doubt that PG w/ OpenSSL in any configuration maintains stateful
> > interactions with HW cryptographic providers.
> 
> (Why? From looking over the Cryptoki/PKCS#11 stuff, for example, isn't
> a lot of that API stateful?)

PKCS#11 is stateful, yes (it has session handles), but there are
generally low limits on how many sessions you can keep open, therefore
high pressure to close them soon, therefore the inference is that that
must be what actually happens at the rather high cost of having to set
up new sessions often.  That inference could be wrong, but then as you
note you'd be doing the community a service by testing it and making it
true in the future.

Nico
--