
From: Jacob Champion
Subject: [PATCH] OAuth: fix performance bug with stuck multiplexer events
Msg-id: CAOYmi+nDZxJHaWj9_jRSyf8uMToCADAmOfJEggsKW-kY7aUwHA@mail.gmail.com
List: pgsql-hackers
Hi all,

The current implementation of the OAuth flow is overly eager to signal
readiness when there's nothing to do, so sometimes we burn CPU for no
reason. This was discussed briefly back in [1], but I'll summarize
here so no one has to dig through that megathread.

= Background =

A major interaction between libpq and libcurl happens through a
callback -- register_socket() in oauth-curl.c -- which links the
sockets that Curl cares about into a single descriptor that our
clients can wait on. This descriptor, which is either an epoll set or
a kqueue, is called the multiplexer.
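
To make that concrete, here is a rough sketch of the epoll side of the
callback. The names (struct async_ctx, actx->mux) are illustrative, and
the real register_socket() in oauth-curl.c handles more cases and does
proper error reporting:

    /* Sketch only; needs <sys/epoll.h>, <errno.h>, and curl/curl.h. */
    static int
    register_socket(CURL *curl, curl_socket_t socket, int what,
                    void *ctx, void *socketp)
    {
        struct async_ctx *actx = ctx;   /* actx->mux is the epoll set */
        struct epoll_event ev = {0};

        if (what == CURL_POLL_REMOVE)
            return epoll_ctl(actx->mux, EPOLL_CTL_DEL, socket, NULL);

        if (what == CURL_POLL_IN || what == CURL_POLL_INOUT)
            ev.events |= EPOLLIN;
        if (what == CURL_POLL_OUT || what == CURL_POLL_INOUT)
            ev.events |= EPOLLOUT;
        ev.data.fd = socket;

        /* Update an existing registration, or else add a new one. */
        if (epoll_ctl(actx->mux, EPOLL_CTL_MOD, socket, &ev) == 0)
            return 0;
        if (errno == ENOENT)
            return epoll_ctl(actx->mux, EPOLL_CTL_ADD, socket, &ev);
        return -1;
    }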

Curl has wide latitude to make this interaction as simple or as
complicated as it wants, and the complexity additionally depends on
the transport in use. For a simple HTTP request/response with older
Curls, the callback order might look like

- wait for reads on fd 5
- done with fd 5

But for a modern dual-stack system speaking HTTPS, you might get something like

- wait for writes on fd 6, time out after 200ms
- wait for writes on fd 7
- done with fd 6, cancel the timeout
- wait for reads only on fd 7, time out after five minutes
- wait for both reads and writes on fd 7
...
- done with fd 7, cancel the timeout
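
In callback terms (signatures elided), that second sequence arrives
roughly as follows, where register_timer() is our CURLMOPT_TIMERFUNCTION
callback and a timeout of -1 means "cancel":

    register_socket(curl, 6, CURL_POLL_OUT, ...);
    register_timer(curlm, 200, ...);      /* 200ms */
    register_socket(curl, 7, CURL_POLL_OUT, ...);
    register_socket(curl, 6, CURL_POLL_REMOVE, ...);
    register_timer(curlm, -1, ...);       /* cancel */
    register_socket(curl, 7, CURL_POLL_IN, ...);
    register_timer(curlm, 300000, ...);   /* five minutes */
    register_socket(curl, 7, CURL_POLL_INOUT, ...);
    ...
    register_socket(curl, 7, CURL_POLL_REMOVE, ...);
    register_timer(curlm, -1, ...);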

= Getting Stuck =

The Linux/epoll implementation handles the latter case well (with one
caveat, discussed later), but with the BSD/kqueue implementation, the
multiplexer can get "stuck open" on stale events that Curl no longer
cares about, and clients of libpq can loop on PQconnectPoll()
pointlessly.

The flow still works -- there's no bug in functionality, just in
performance -- but that also makes the problem hard to detect in tests.
And for simpler network cases (the ones in our tests), the bug is nearly
invisible, because Curl will unregister descriptors and clear the
stuck events before they have any real effect.

0001, attached, fixes this by counting the number of outstanding event
registrations and draining up to that many events from the kqueue
after libcurl gives control back to us. This ensures that any stale
events will be knocked back down. (This design was originally
suggested by Thomas Munro in a related conversation about kqueue
corner cases; thanks Thomas!)
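
The shape of that fix is roughly the following (hypothetical names and
simplified error handling; actx->nregistered stands in for the new
registration count, and Min/lengthof are the usual PG macros):

    /* Sketch only; needs <sys/event.h>. */
    static bool
    drain_socket_events(struct async_ctx *actx)
    {
        struct kevent   evs[16];
        struct timespec zero = {0, 0};
        int             remaining = actx->nregistered;

        /*
         * Pull at most `remaining` pending events off the kqueue, without
         * blocking, so stale events can't keep the mux signaled forever.
         */
        while (remaining > 0)
        {
            int     nev;

            nev = kevent(actx->mux, NULL, 0, evs,
                         Min(remaining, (int) lengthof(evs)), &zero);
            if (nev < 0)
                return false;   /* caller reports errno */
            if (nev == 0)
                break;          /* nothing left to drain */

            remaining -= nev;
        }

        return true;
    }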

If you follow the flow from start to finish, you may notice the
following oddity: I took great pains not to track which events had
been set, instead telling libcurl "something happened, figure it out"
-- and now in this patchset I'm pulling all of those events off in a
big wave only to discard them. As punishment for trying to keep the
epoll side of things simple, I'm now having to add unnecessary
complexity to the kqueue side. Seems like I'll eventually need to
balance the two out again.

= Timeouts =

There is a similar "stale event" bug for the timer, with both epoll
and kqueue: if Curl does not explicitly disable a timeout after it
fires, that timer's descriptor will simply remain signaled until
something else happens to set the timer again.

Unfortunately, we can't solve this bug in the same place as 0001. If
we clear stale timers after calling into Curl, that would introduce a
race: Curl may have set a short timeout that has already passed, and
we need that event to remain signaled so that the client will call us
back.

So expired timers have to be cleared _before_ calling back into Curl,
which I have done in 0002. As part of that, I've strengthened the
timer_expired() API so that it always returns false when the timer is
disabled. It was previously a don't-care, since we only called it when
we knew the timer must be running, so the kqueue and epoll
implementations used to give different answers.
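
For illustration, here's roughly what that looks like on the
epoll/timerfd side; the timer_armed flag is hypothetical, and the real
code in 0002 differs in the details:

    /* Sketch only; needs <sys/timerfd.h>. */

    /* Strengthened contract: a disabled timer is never "expired". */
    static int
    timer_expired(struct async_ctx *actx)
    {
        struct itimerspec spec = {0};

        if (!actx->timer_armed)
            return 0;

        if (timerfd_gettime(actx->timerfd, &spec) < 0)
            return -1;          /* caller reports errno */

        /* A fired one-shot timerfd reports zero time remaining. */
        return (spec.it_value.tv_sec == 0 && spec.it_value.tv_nsec == 0);
    }

    /*
     * Called before handing control back to Curl: if a one-shot timeout
     * has already fired, disarm it so its event can't go stale, since
     * Curl may be about to replace it anyway.
     */
    static bool
    drain_timer_events(struct async_ctx *actx)
    {
        int     expired = timer_expired(actx);

        if (expired < 0)
            return false;
        if (expired > 0)
            return set_timer(actx, -1);     /* disarm, clearing the event */
        return true;
    }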

= Testing =

It was hard to notice any of this with the current tests, because
they're mostly focused on ensuring that the flow works, not on
internal efficiency. I tried to find a way to internally detect when a
file descriptor had gone stale by accident, but I think it's hard to
do that without false positives if we don't know what Curl is doing.

In the end, I decided to test a heuristic: we expect roughly 5-15
calls for these simple localhost flows, so if we see something more
than an order of magnitude above that, something is wrong. (When "bad
behavior" happens in the CI, I see hundreds or thousands of calls.)
This is implemented in 0003 by simply counting the number of calls and
printing them at the end of the flow in debug mode.
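
The bookkeeping for that is intentionally dumb; roughly (hypothetical
names):

    /* In the entry point that drives the flow: */
    actx->debug_num_calls++;

    /* Once the flow completes, if debugging is enabled: */
    if (actx->debug)
        fprintf(stderr, "[libpq] total number of calls: %d\n",
                actx->debug_num_calls);

The test then just parses that line and complains if the count is over
the limit.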

I strengthened a bunch of postconditions in the first two patches. To
keep them from decaying (and to keep other maintainers from having to
remember all this), I've added a unit test executable that pins many
internal expectations. It's essentially a version of the libpq-oauth
plugin which has been given a main() function and TAP macros. This is
done in 0004.
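
To give a flavor of it (illustrative only; the real macros and test
cases in 0004 are more thorough):

    /* Sketch only; needs <stdio.h> plus the libpq-oauth internals. */
    static int tests_run = 0;

    #define ok(COND, NAME) \
        printf("%sok %d - %s\n", (COND) ? "" : "not ", ++tests_run, (NAME))

    int
    main(int argc, char *argv[])
    {
        struct async_ctx actx = {0};

        /* ... set up the mux and timer as the real flow would ... */

        ok(set_timer(&actx, -1), "timer can be disabled");
        ok(!timer_expired(&actx), "disabled timer is not expired");

        printf("1..%d\n", tests_run);
        return 0;
    }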

I feel confident that these new tests are useful, because:

- The new unit tests caught that my initial register_socket attempt,
written on macOS, did not work on the other BSDs. (EV_RECEIPT works
differently there.)
- If I revert the changes to register_socket, the new unit tests fail
on all BSDs in the CI, and the call-count test fails on NetBSD and
OpenBSD.
- If I comment out the new call to drain_timer_events(), the
call-count test fails on macOS.
- If I revert the changes to timer_expired, the new unit tests fail on
Linux. (There are no user-visible effects of that change, because it
doesn't lead to more calls.)

Unfortunately, if I just comment out the new call to
drain_socket_events(), nothing fails in the CI. :( But my personal
test suite (which makes use of TLS plus dual-stack sockets) does
notice the regression in the call count immediately, with some tests
using tens of thousands of calls for a single client flow without the
fix. I hope that's good enough for now, because I'm not sure if I can
safely push the additional TLS infrastructure into the oauth_validator
tests for 18.

My plan, if this code seems reasonable, is to backport 0001-0003, but
keep the larger 0004 on HEAD only until it has proven to be stable.
It's a big new suite and I want to make sure it's not flapping on some
buildfarm animal. Eventually I'll backport that too.

WDYT?

Thanks,
--Jacob

[1] https://postgr.es/m/CAOYmi%2Bn4EDOOUL27_OqYT2-F2rS6S%2B3mK-ppWb2Ec92UEoUbYA%40mail.gmail.com
