From: Jacob Champion
Subject: [PATCH] OAuth: fix performance bug with stuck multiplexer events
Msg-id: CAOYmi+nDZxJHaWj9_jRSyf8uMToCADAmOfJEggsKW-kY7aUwHA@mail.gmail.com
List: pgsql-hackers
Hi all,

The current implementation of the OAuth flow is overly eager to signal
readiness when there's nothing to do, so sometimes we burn CPU for no
reason. This was discussed briefly back in [1], but I'll summarize here
so no one has to dig through that megathread.

= Background =

A major interaction between libpq and libcurl happens through a
callback -- register_socket() in oauth-curl.c -- which links the
sockets that Curl cares about into a single descriptor that our clients
can wait on. This descriptor, which is either an epoll set or a kqueue,
is called the multiplexer.

Curl has wide latitude to make this interaction as simple or as
complicated as it wants, and the complexity additionally depends on the
transport in use. For a simple HTTP request/response with older Curls,
the callback order might look like

- wait for reads on fd 5
- done with fd 5

But for a modern dual-stack system speaking HTTPS, you might get
something like

- wait for writes on fd 6, time out after 200ms
- wait for writes on fd 7
- done with fd 6, cancel the timeout
- wait for reads only on fd 7, time out after five minutes
- wait for both reads and writes on fd 7
...
- done with fd 7, cancel the timeout
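(For readers unfamiliar with the API: the callback has the standard
CURLMOPT_SOCKETFUNCTION shape. The following is only a trimmed-down
sketch -- the real register_socket() handles errors and implements both
the epoll and kqueue variants.)

    #include <curl/curl.h>

    /*
     * Sketch of the callback's shape: libcurl tells us which socket it
     * wants watched and how, and we mirror that into the multiplexer.
     */
    static int
    register_socket(CURL *curl, curl_socket_t socket, int what,
                    void *ctx, void *socketp)
    {
        switch (what)
        {
            case CURL_POLL_IN:      /* wait for reads */
            case CURL_POLL_OUT:     /* wait for writes */
            case CURL_POLL_INOUT:   /* wait for both */
                /* add or update this socket in the epoll set or kqueue */
                break;

            case CURL_POLL_REMOVE:
                /* Curl is done with this socket; drop it */
                break;
        }

        return 0;
    }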
= Getting Stuck =

The Linux/epoll implementation handles the latter case well (with one
caveat, discussed later), but with the BSD/kqueue implementation, the
multiplexer can get "stuck open" on stale events that Curl no longer
cares about, and clients of libpq can loop on PQconnectPoll()
pointlessly. The flow still works -- there's no bug in functionality,
just in performance -- but that also makes it hard to detect in tests.
And for simpler network cases (the ones in our tests), the bug is
nearly invisible, because Curl will unregister descriptors and clear
the stuck events before they have any real effect.

0001, attached, fixes this by counting the number of outstanding event
registrations and draining up to that many events from the kqueue after
libcurl gives control back to us. This ensures that any stale events
will be knocked back down. (This design was originally suggested by
Thomas Munro in a related conversation about kqueue corner cases;
thanks Thomas!)

If you follow the flow from start to finish, you may notice the
following oddity: I took great pains not to track which events had been
set, instead telling libcurl "something happened, figure it out" -- and
now in this patchset I'm pulling all of those events off in a big wave
only to discard them. As punishment for trying to keep the epoll side
of things simple, I'm now having to add unnecessary complexity to the
kqueue side. Seems like I'll eventually need to balance the two out
again.
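(To make that concrete, the drain amounts to something like the sketch
below. Simplified: it assumes the registrations are edge-triggered with
EV_CLEAR, so that a consumed event stays down until it fires again, and
the real drain_socket_events() differs in detail.)

    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>

    /*
     * After libcurl returns control, consume up to `nregistered`
     * pending events from the kqueue so that registrations Curl no
     * longer cares about stop waking the client.  `mux` is the kqueue
     * acting as the multiplexer; `nregistered` counts the outstanding
     * event registrations.
     */
    static void
    drain_socket_events(int mux, int nregistered)
    {
        struct timespec no_wait = {0, 0};   /* poll only; never block */

        while (nregistered > 0)
        {
            struct kevent evs[4];
            int         nevents;

            nevents = kevent(mux, NULL, 0, evs,
                             nregistered < 4 ? nregistered : 4, &no_wait);
            if (nevents <= 0)
                break;          /* drained (or kevent() failed) */

            nregistered -= nevents;
        }
    }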
= Timeouts =

There is a similar "stale event" bug for the timer, with both epoll and
kqueue: if Curl does not disable a timeout explicitly after it fires,
that timer's descriptor will simply remain signaled until something
else happens to set it.

Unfortunately, we can't solve this bug in the same place as 0001. If we
clear stale timers after calling into Curl, that would introduce a
race: Curl may have set a short timeout that has already passed, and we
need that event to remain signaled so that the client will call us
back. So expired timers have to be cleared _before_ calling back into
Curl, which I have done in 0002.

As part of that, I've strengthened the timer_expired() API so that it
always returns false when the timer is disabled. It was previously a
don't-care, since we only called it when we knew the timer must be
running, so the kqueue and epoll implementations used to give different
answers.
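(In miniature, the ordering constraint looks like this. Names and
structure are illustrative only -- drain_timer_events() is stubbed out
here, and the real call sites live in oauth-curl.c.)

    #include <curl/curl.h>

    static void
    drain_timer_events(void)
    {
        /*
         * 0002's helper: consume any stale timer expirations from the
         * multiplexer.  Stubbed out for illustration.
         */
    }

    /*
     * Stale timer events are cleared before handing control to
     * libcurl, never after.  If we drained them afterwards, a short
     * timeout that Curl armed during the call (and that has already
     * passed) would be wiped out, and the client would never wake up
     * to drive the flow again.
     */
    static CURLMcode
    drive_one_step(CURLM *curlm, curl_socket_t sockfd, int ev_bitmask,
                   int *running)
    {
        drain_timer_events();   /* before, not after */

        /* any timeout Curl arms in here must stay signaled */
        return curl_multi_socket_action(curlm, sockfd, ev_bitmask,
                                        running);
    }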
= Testing =

It was hard to notice any of this with the current tests, because
they're mostly focused on ensuring that the flow works, not on internal
efficiency. I tried to find a way to internally detect when a file
descriptor had gone stale by accident, but I think it's hard to do that
without false positives if we don't know what Curl is doing.

In the end, I decided to test a heuristic: we expect roughly 5-15 calls
for these simple localhost flows, so if we see something more than an
order of magnitude above that, something is wrong. (When "bad behavior"
happens in the CI, I see hundreds or thousands of calls.) This is
implemented in 0003 by simply counting the number of calls and printing
them at the end of the flow in debug mode.

I strengthened a bunch of postconditions in the first two patches, so
to try to prevent decay (and keep other maintainers from having to
remember all this), I've added a unit test executable that pins many
internal expectations. It's essentially a version of the libpq-oauth
plugin which has been given a main() function and TAP macros. This is
done in 0004.

I feel confident that these new tests are useful, because:

- The new unit tests caught that my initial register_socket attempt,
  written on macOS, did not work on the other BSDs. (EV_RECEIPT works
  differently there.)
- If I revert the changes to register_socket, the new unit tests fail
  on all BSDs in the CI, and the call-count test fails on NetBSD and
  OpenBSD.
- If I comment out the new call to drain_timer_events(), the call-count
  test fails on macOS.
- If I revert the changes to timer_expired, the new unit tests fail on
  Linux. (There are no user-visible effects of that change, because it
  doesn't lead to more calls.)

Unfortunately, if I just comment out the new call to
drain_socket_events(), nothing fails in the CI. :( But my personal test
suite (which makes use of TLS plus dual-stack sockets) does notice the
regression in the call count immediately, with some tests using tens of
thousands of calls for a single client flow without the fix. I hope
that's good enough for now, because I'm not sure if I can safely push
the additional TLS infrastructure into the oauth_validator tests for
18.

My plan, if this code seems reasonable, is to backport 0001-0003, but
keep the larger 0004 on HEAD only until it has proven to be stable.
It's a big new suite and I want to make sure it's not flapping on some
buildfarm animal. Eventually I'll backport that too.

WDYT?

Thanks,
--Jacob

[1] https://postgr.es/m/CAOYmi%2Bn4EDOOUL27_OqYT2-F2rS6S%2B3mK-ppWb2Ec92UEoUbYA%40mail.gmail.com