Thread: Unnecessary connection overhead due to copy-on-write (mainly openssl)
Hi,

Looking at [1] I, again, noticed that a decent portion of our connection overhead is due to openssl's atexit handler.

On my older workstation (with a few noisy things running):

    c=16;pgbench -n -M prepared -c$c -j$c -P1 -T10 -f <(echo 'select') -C
    -> 3057 TPS

If I change the exit() in proc_exit() to a _exit():
    -> 3633 TPS

The reason for this difference is that by default openssl registers an atexit handler that frees a lot of memory that was initialized in postmaster. That in turn triggers page-faults due to the relevant pages now differing in child processes. Which a) isn't cheap b) causes contention with postmaster, since those datastructures are shared.

It's possible to tell openssl to not register an atexit handler, see [2]:

> OPENSSL_INIT_NO_ATEXIT
> By default OpenSSL will attempt to clean itself up when the process exits via
> an "atexit" handler. Using this option suppresses that behaviour. This means
> that the application will have to clean up OpenSSL explicitly using
> OPENSSL_cleanup().

One slight difficulty is that we initialize openssl somewhat indirectly, via PostmasterMain()->InitProcessGlobals()->pg_prng_strong_seed() which then, if built with openssl support, triggers initialization within RAND_status().

The quick hack of putting

#ifdef USE_OPENSSL
    OPENSSL_init_crypto(OPENSSL_INIT_NO_ATEXIT, NULL);
#endif

at the start of PostmasterMain() gets the connection speed up a fair bit:
    -> 3449 TPS

The reason this isn't as good as using _exit is that there are other libraries with (effectively) atexit handlers. In particular ICU pulls in libstdc++, which in turn seems to have a lot of destructors for global objects that aren't cheap.

If I build without ICU support, the connection rate with exit() (and the openssl "fix") is
    -> 3863 TPS
and if I use _exit() it is
    -> 3900 TPS

I.e. at that point the remaining atexit handlers only play a small role.
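
To make the exit() versus _exit() difference concrete, here is a minimal, self-contained sketch. It is not PostgreSQL or OpenSSL code: cleanup_handler and count_handler_runs are names made up for illustration, and cleanup_handler merely stands in for a library cleanup routine such as the one OpenSSL registers by default. The sketch forks a child, registers the handler with atexit() in the child, and counts through a pipe how often the handler ran.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write end of a pipe; the atexit handler reports through it. */
static int report_fd = -1;

/* Hypothetical stand-in for a library's atexit cleanup routine. */
static void
cleanup_handler(void)
{
    char    byte = 'x';
    ssize_t rc = write(report_fd, &byte, 1);

    (void) rc;                  /* nothing useful to do on failure */
}

/*
 * Fork a child that registers cleanup_handler() and then terminates with
 * exit() (use_underscore_exit == 0) or _exit() (use_underscore_exit != 0).
 * Returns how many times the handler ran in the child, or -1 on error.
 */
int
count_handler_runs(int use_underscore_exit)
{
    int     fds[2];
    pid_t   pid;
    char    buf[8];
    ssize_t n;
    int     runs = 0;

    if (pipe(fds) != 0)
        return -1;

    pid = fork();
    if (pid < 0)
    {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0)
    {
        /* Child: behaves like a backend that is about to terminate. */
        close(fds[0]);
        report_fd = fds[1];
        if (atexit(cleanup_handler) != 0)
            _exit(2);
        if (use_underscore_exit)
            _exit(0);           /* skips atexit handlers */
        exit(0);                /* runs atexit handlers */
    }

    /* Parent: each byte read means one handler invocation in the child. */
    close(fds[1]);
    while ((n = read(fds[0], buf, sizeof(buf))) > 0)
        runs += (int) n;
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return runs;
}
```

Under these assumptions, count_handler_runs(0) reports one handler run and count_handler_runs(1) reports none, which is why the _exit() variant avoids the cleanup work (and the page faults it causes) altogether.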

I don't know if there's a decent solution for the nontrivial overhead due to ICU -> libstdc++'s atexit handlers.

There are a few related issues where we ourselves are to blame. The most prominent one is that we go around and delete PostmasterContext in child processes. That however doesn't really save memory, as the memory is still needed in postmaster, we just end up causing page faults that trigger copy-on-write. If I just comment out the MemoryContextDelete in PostgresMain() I see connection rates improve from
    -> 3891 TPS
to
    -> 4004 TPS

If I build a much more minimal postgres, disabling all optional dependencies other than openssl, I see a significant improvement, just due to fewer mmaps for the libraries:
    -> 4865 TPS

Further disabling openssl and zlib, interestingly, does not help.

Greetings,

Andres Freund

[1] https://postgr.es/m/CAFbpF8OA44_UG%2BRYJcWH9WjF7E3GA6gka3gvH6nsrSnEe9H0NA%40mail.gmail.com
[2] https://docs.openssl.org/3.1/man3/OPENSSL_init_crypto/#name
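
The copy-on-write effect described in this message can be observed with a small experiment. The following is only a sketch under stated assumptions: it uses plain malloc() rather than PostgreSQL memory contexts, the function name child_faults_touching_parent_memory is invented for illustration, and it assumes a POSIX system where getrusage() reports minor page faults (ru_minflt). The parent initializes a buffer, forks, and the child either writes to or merely reads one byte per page.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Minor (no-I/O) page faults taken so far by the calling process. */
static long
minor_faults(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return 0;
    return ru.ru_minflt;
}

/*
 * Initialize nbytes in the parent, fork, and let the child touch one byte
 * per page, writing if do_write is nonzero and only reading otherwise.
 * Returns the minor faults the child took while doing so, or -1 on error.
 */
long
child_faults_touching_parent_memory(size_t nbytes, int do_write)
{
    long            pagesize = sysconf(_SC_PAGESIZE);
    unsigned char  *mem;
    int             fds[2];
    pid_t           pid;
    long            delta = -1;

    if (pagesize <= 0 || (mem = malloc(nbytes)) == NULL)
        return -1;
    memset(mem, 1, nbytes);     /* the "postmaster" initializes the memory */

    if (pipe(fds) != 0)
    {
        free(mem);
        return -1;
    }

    pid = fork();
    if (pid < 0)
    {
        close(fds[0]);
        close(fds[1]);
        free(mem);
        return -1;
    }
    if (pid == 0)
    {
        volatile unsigned char *p = mem;
        volatile unsigned char  sink = 0;
        long        before = minor_faults();
        long        child_delta;
        ssize_t     rc;
        size_t      off;

        for (off = 0; off < nbytes; off += (size_t) pagesize)
        {
            if (do_write)
                p[off] = 2;     /* page must be copied for this process */
            else
                sink = p[off];  /* read only: page can stay shared */
        }
        (void) sink;
        child_delta = minor_faults() - before;
        rc = write(fds[1], &child_delta, sizeof(child_delta));
        _exit(rc == (ssize_t) sizeof(child_delta) ? 0 : 1);
    }

    /* Parent: collect the child's fault count. */
    close(fds[1]);
    if (read(fds[0], &delta, sizeof(delta)) != (ssize_t) sizeof(delta))
        delta = -1;
    close(fds[0]);
    waitpid(pid, NULL, 0);
    free(mem);
    return delta;
}
```

On a typical Linux system one would expect the write case to report a fault count that grows with the number of pages touched, and the read case to report close to none. The exact numbers depend on the kernel, transparent huge pages, and the allocator, so they are indicative only.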

On 05.06.25 21:58, Andres Freund wrote:
> The reason for this difference is that by default openssl registers an atexit
> handler that frees a lot of memory that was initialized in postmaster. That in
> turn triggers page-faults due to the relevant pages now differing in child
> processes. Which a) isn't cheap b) causes contention with postmaster, since
> those datastructures are shared.
>
>
> It's possible to tell openssl to not register an atexit handler, see [2]:
>
>> OPENSSL_INIT_NO_ATEXIT
>> By default OpenSSL will attempt to clean itself up when the process exits via
>> an "atexit" handler. Using this option suppresses that behaviour. This means
>> that the application will have to clean up OpenSSL explicitly using
>> OPENSSL_cleanup().

It seems weird to me that openssl spends so much effort tidying up its memory allocations just before exiting. We could just skip that. Looking through the code of OPENSSL_cleanup(), there might be one or two cases of log or trace files that get flushed during cleanup, so it's not an absolute no-brainer to skip all the cleanup.

On Thu, Jun 5, 2025 at 3:58 PM Andres Freund <andres@anarazel.de> wrote:
> There are a few related issues where we ourselves to blame. The most prominent
> one is that we go around and delete PostmasterContext in child processes. That
> however doesn't really save memory, as the memory is still needed in
> postmaster, we just end up causing page faults that trigger copy-on-write.

If we're not going to bother deleting PostmasterContext, we could also skip creating it in the first place. After all, if the storage isn't actually freed, then we won't know whether things are leaking into that context that actually do get used in child processes, so there's really no point.

The current structure amounts to a design decision that at some point in time the postmaster might allocate an amount of memory that we need to free in child processes, whether or not that's actually true currently. Not deleting it any more -- or not having it any more -- is deciding that it shouldn't ever allocate a significant amount of memory. I don't know whether that's a good bet, but I wouldn't be surprised. I think we've talked about wanting to move some things that the postmaster currently does to a separate process, whether for multi-threading or other reasons.

But, if we do take the position that the postmaster shouldn't allocate a significant amount of stuff, we might want to add some checks someplace to prove that it doesn't. Otherwise, it might get broken by some future patch without anybody noticing.

(For clarity, I'm not attempting to insist on anything here, just sharing a few thoughts that come to mind.)

--
Robert Haas
EDB: http://www.enterprisedb.com

On Fri, Jun 6, 2025 at 4:56 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> It seems weird to me that openssl spends so much effort tidying up its
> memory allocations just before exiting. We could just skip that.
> Looking through the code of OPENSSL_cleanup(), there might be one or two
> cases of log or trace files that get flushed during cleanup, so it's not
> an absolute no-brainer to skip all the cleanup.

I guess I'd be concerned that a hardware crypto provider might need good-faith cleanup to work well. I understand they can't rely on atexit in general, but there would be a big difference between "you might have to clean up after a crash" and "every single connection litters the hardware with unused stuff".

But that's pure FUD and guesswork; I have no examples to point to, so there might not be any providers that need that.

--Jacob

Re: Jacob Champion
> I guess I'd be concerned that a hardware crypto provider might need
> good-faith cleanup to work well.

Hopefully not in every single backend.

Christoph

On Fri, Jun 06, 2025 at 08:41:20AM -0700, Jacob Champion wrote:
> I guess I'd be concerned that a hardware crypto provider might need
> good-faith cleanup to work well. I understand they can't rely on
> atexit in general, but there would be a big difference between "you
> might have to clean up after a crash" and "every single connection
> litters the hardware with unused stuff".

I'd expect all subsystems to recover cleanly from unclean shutdowns. I know, that's a lot to expect, but nowadays pretty much all filesystems used in production do, for example.

> But that's pure FUD and guesswork; I have no examples to point to, so
> there might not be any providers that need that.

I doubt that PG w/ OpenSSL in any configuration maintains stateful interactions with HW cryptographic providers.

Hi,

On 2025-06-06 08:41:20 -0700, Jacob Champion wrote:
> On Fri, Jun 6, 2025 at 4:56 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> > It seems weird to me that openssl spends so much effort tidying up its
> > memory allocations just before exiting. We could just skip that.
> > Looking through the code of OPENSSL_cleanup(), there might be one or two
> > cases of log or trace files that get flushed during cleanup, so it's not
> > an absolute no-brainer to skip all the cleanup.
>
> I guess I'd be concerned that a hardware crypto provider might need
> good-faith cleanup to work well. I understand they can't rely on
> atexit in general, but there would be a big difference between "you
> might have to clean up after a crash" and "every single connection
> litters the hardware with unused stuff".

It's not just crashes, e.g. the startup packet timeout is also handled by _exit() - and it can be triggered remotely.

ISTM that if crypto providers can't handle _exit(), we have a bigger problem.

Alternatively we could try deferring more of openssl's initialization to outside of postmaster - but that doesn't seem particularly realistic.

Greetings,

Andres Freund

On Fri, Jun 6, 2025 at 9:25 AM Nico Williams <nico@cryptonector.com> wrote:
> I'd expect all subsystems to recover cleanly from unclean shutdowns. I
> know, that's a lot to expect, but nowadays pretty much all filesystems
> used in production do, for example.

I guess, but if we stop cleaning up entirely, we will suddenly be stressing those code paths... But maybe that's a community service? :)

I realize I'm making an argument from fear and ignorance. Maybe that ecosystem is very healthy. I'm just imagining the following conversation:

DBA: we upgraded our server and our HSM is freaking out after a few thousand connections; what gives?
us: oh, we stopped cleaning up after ourselves for performance! tell your vendor to fix their drivers!
DBA: hahahaha

[1] is a description of the kind of problem I'm worried about. (It's not 1:1 applicable to this situation, I just think we might start seeing those sorts of bug reports.)

> I doubt that PG w/ OpenSSL in any configuration maintains stateful
> interactions with HW cryptographic providers.

(Why? From looking over the Cryptoki/PKCS#11 stuff, for example, isn't a lot of that API stateful?)

--Jacob

[1] https://github.com/OpenSC/libp11/issues/228#issuecomment-402941378

On Fri, Jun 6, 2025 at 9:37 AM Andres Freund <andres@anarazel.de> wrote:
> It's not just crashes, e.g. the startup packet timeout is also handled by
> _exit() - and it can be triggered remotely.

Fair point...

> ISTM that if crypto providers
> can't handle _exit(), we have a bigger problem.

...so I guess I need to figure out whether we have a bigger problem. I hope we don't.

Note that OpenSSL seems to be interested in removing the atexit() handling altogether, and requiring applications to manually call the cleanup function, in 4.0. [1]

--Jacob

[1] https://github.com/openssl/openssl/issues/22508

On Fri, Jun 06, 2025 at 11:58:38AM -0700, Jacob Champion wrote:
> > I'd expect all subsystems to recover cleanly from unclean shutdowns. I
> > know, that's a lot to expect, but nowadays pretty much all filesystems
> > used in production do, for example.
>
> I guess, but if we stop cleaning up entirely, we will suddenly be
> stressing those code paths... But maybe that's a community service? :)

The latter.

> I realize I'm making an argument from fear and ignorance. Maybe that
> ecosystem is very healthy. I'm just imagining the following
> conversation:
>
> DBA: we upgraded our server and our HSM is freaking out after a few
> thousand connections; what gives?
> us: oh, we stopped cleaning up after ourselves for performance! tell
> your vendor to fix their drivers!
> DBA: hahahaha

TPMs for example have a concept of session. You can have up to 64 open sessions, and if you use the TPM resource manager and you're accessing it through a file descriptor then the RM will just clean up when you exit. Though if you're accessing the raw TPM directly and then fail to flush sessions, then yes, you'll eventually be unable to create new ones. However no one will be using a discrete or firmware TPM for TLS server certificate private key usage: discrete TPMs are way way too slow for that, and firmware TPMs are... also way too slow. You wouldn't bother with a software TPM for this unless it's for privilege separation. Anyways, if you were using a TPM then the user's startup scripts, or postgres itself, could just flush all sessions and be done.

Other types of hardware cryptographic providers also tend to have a notion of "session", and they all tend to have relatively paltry limits, which means that the software side that calls them will generally need to be prepared to a) close its own sessions eagerly (at the cost of extra overhead on the next operation), and b) recover from running out of sessions (by flushing others at the cost of causing those that were live to need retries).

But anyways, IIUC the OpenSSL engine interface is itself stateless and I would expect providers to auto-recover. And anyways I expect no one uses PG with HW cryptographic providers to perform TLS server signatures. Instead the best current practice would be to use short-lived server certificates with software keys and longer-lived credentials in hardware with which to fetch new short-lived credentials with software keys. The kinds of HSMs that can do high rates of signatures are neither cheap nor commonly used, and those do tend to have higher session limits, and again you can recover from running out of sessions by flushing extant sessions.

> > I doubt that PG w/ OpenSSL in any configuration maintains stateful
> > interactions with HW cryptographic providers.
>
> (Why? From looking over the Cryptoki/PKCS#11 stuff, for example, isn't
> a lot of that API stateful?)

PKCS#11 is stateful, yes (it has session handles), but there are generally low limits on how many sessions you can keep open, therefore high pressure to close them soon, therefore the inference is that that must be what actually happens at the rather high cost of having to set up new sessions often. That inference could be wrong, but then as you note you'd be doing the community a service by testing it and making it true in the future.

Nico
--
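
The two strategies mentioned in this message, closing sessions eagerly and recovering from exhaustion by flushing other sessions, can be illustrated with a toy model. This is purely a sketch: it talks to no real TPM, HSM, or PKCS#11 library, and MAX_SESSIONS, SessionTable, session_open and session_close are hypothetical names chosen for illustration. The limit of 4 is arbitrary; real devices have their own limits.

```c
#include <string.h>

#define MAX_SESSIONS 4          /* hypothetical device limit */

typedef struct
{
    int             in_use;
    unsigned long   opened_at;  /* logical clock value at open time */
} Session;

typedef struct
{
    Session         slots[MAX_SESSIONS];
    unsigned long   clock;      /* incremented on every open */
    unsigned long   evictions;  /* sessions flushed to make room */
} SessionTable;

void
session_table_init(SessionTable *t)
{
    memset(t, 0, sizeof(*t));
}

/*
 * Open a session and return its slot index.  If every slot is taken, the
 * oldest session is flushed and its slot reused; whoever owned it would
 * have to retry (strategy b in the message above).
 */
int
session_open(SessionTable *t)
{
    int     free_slot = -1;
    int     oldest = 0;
    int     i;

    for (i = 0; i < MAX_SESSIONS; i++)
    {
        if (!t->slots[i].in_use)
        {
            free_slot = i;
            break;
        }
        if (t->slots[i].opened_at < t->slots[oldest].opened_at)
            oldest = i;
    }

    if (free_slot < 0)
    {
        /* Out of sessions: recover by evicting the oldest one. */
        free_slot = oldest;
        t->evictions++;
    }

    t->slots[free_slot].in_use = 1;
    t->slots[free_slot].opened_at = ++t->clock;
    return free_slot;
}

/* Eagerly release a session so the slot can be reused (strategy a). */
void
session_close(SessionTable *t, int slot)
{
    if (slot >= 0 && slot < MAX_SESSIONS)
        t->slots[slot].in_use = 0;
}
```

In this model a caller that always calls session_close() never triggers an eviction, while a caller that exits without closing leaves slots occupied until someone else's session_open() flushes them. Whether real providers behave this way is exactly the open question in the thread.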