Thread: WL_SOCKET_ACCEPT fairness on Windows

WL_SOCKET_ACCEPT fairness on Windows

From: Thomas Munro

Hi,

Commit 7389aad6 started using WaitEventSetWait() to wait for incoming
connections.  Before that, we used select(), for which we have our own
implementation for Windows.

While hacking on patches to rip a bunch of unused Win32 socket wrapper
code out, I twigged that the old pgwin32_select() code was careful to
report multiple sockets at once by brute force polling of all of them,
while WaitEventSetWait() doesn't do that: it reports just one event,
because that's what the Windows WaitForMultipleObjects() call does.
I guess this means you can probably fill up the listen queue of server
socket 1 to prevent server socket 2 from ever being serviced, whereas
on Unix we'll accept one connection at a time from each in round-robin
fashion.

I think we could get the same effect as pgwin32_select() more cheaply
by doing the initial WaitForMultipleObjects() call with the caller's
timeout exactly as we do today, and then, while we have space,
repeatedly calling
WaitForMultipleObjects(handles=&events[last_signaled_event_index + 1],
timeout=0) until it reports timeout.  Windows always reports the
signaled event with the lowest index in the array, so the idea is to
poll the remaining part of the array without waiting, to check for any
more.  In the most common case of FEBE socket waits etc there would be
no extra system call (because it uses nevents=1, so no space for
more), and in the postmaster's main loop there would commonly be only
one extra system call to determine that no other events have been
signaled.
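
In other words, the shape I have in mind is something like this (just
a sketch, not the attached patch; report_event() is a hypothetical
stand-in for the real bookkeeping, and error handling is elided):

```
static int
drain_ready_handles(HANDLE *handles, DWORD nhandles, DWORD timeout, int space)
{
	DWORD		rc;
	DWORD		next;
	int			reported = 0;

	/* Initial wait, honouring the caller's timeout, as we do today. */
	rc = WaitForMultipleObjects(nhandles, handles, FALSE, timeout);
	if (rc == WAIT_TIMEOUT)
		return 0;
	next = rc - WAIT_OBJECT_0;
	report_event(next);			/* hypothetical helper */
	reported++;

	/* Poll the rest of the array without waiting, while we have space. */
	while (reported < space && ++next < nhandles)
	{
		rc = WaitForMultipleObjects(nhandles - next, &handles[next], FALSE, 0);
		if (rc == WAIT_TIMEOUT)
			break;				/* nothing more is signaled right now */
		next += rc - WAIT_OBJECT_0;
		report_event(next);		/* hypothetical helper */
		reported++;
	}

	return reported;
}
```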

The attached patch shows the idea.  It's using an ugly goto, but I
guess it should be made decent with a real loop; I just didn't want to
change the indentation level for this POC.

I mention this now because I'm not sure whether to consider this an
'open item' for 16, or merely an enhancement for 17.  I guess the
former, because someone might call that a new denial of service
vector.  On the other hand, if you fill up the listen queue for socket
1 with enough vigour, you're also denying service to socket 1, so I
don't know if it's worth worrying about.  Opinions on that?

I don't have Windows to test any of this.  Although this patch passes
on CI, that means nothing, as I don't expect the new code to be
reached, so I'm hoping to find someone who would be able to set up
such a test on Windows, i.e. putting an elog() into the new code path
and trying to reach it by connecting to two different ports fast
enough to exercise the multiple-event case.

Or maybe latch.c really needs its own test suite.


Re: WL_SOCKET_ACCEPT fairness on Windows

From: Andres Freund

Hi,

On 2023-04-01 13:42:21 +1300, Thomas Munro wrote:
> Commit 7389aad6 started using WaitEventSetWait() to wait for incoming
> connections.  Before that, we used select(), for which we have our own
> implementation for Windows.
> 
> While hacking on patches to rip a bunch of unused Win32 socket wrapper
> code out, I twigged that the old pgwin32_select() code was careful to
> report multiple sockets at once by brute force polling of all of them,
> while WaitEventSetWait() doesn't do that: it reports just one event,
> because that's what the Windows WaitForMultipleObjects() call does.
> I guess this means you can probably fill up the listen queue of server
> socket 1 to prevent server socket 2 from ever being serviced, whereas
> on Unix we'll accept one connection at a time from each in round-robin
> fashion.

It does indeed sound like it'd behave that way:

> If bWaitAll is FALSE, the return value minus WAIT_OBJECT_0 indicates the
> lpHandles array index of the object that satisfied the wait. If more than
> one object became signaled during the call, this is the array index of the
> signaled object with the smallest index value of all the signaled objects.

I wonder if we ought to bite the bullet and replace the use of
WaitForMultipleObjects() with RegisterWaitForSingleObject() and then use
GetQueuedCompletionStatus() to wait. The fairness issue here is a motivation,
but the bigger one is that that'd get us out from under the
MAXIMUM_WAIT_OBJECTS (IIRC 64) limit. Afaict that'd also allow us to read
multiple notifications at once, using GetQueuedCompletionStatusEx().

Medium term that'd also be a small step towards using readiness based APIs in
a few places...
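
Very roughly, and only as an untested sketch (the names and the
bookkeeping are invented here, not a concrete proposal for latch.c),
I'm thinking of something like:

```
#include <windows.h>

static HANDLE wait_iocp;		/* completion port we'd wait on */

/* Runs in a system thread pool thread whenever the registered handle fires. */
static VOID CALLBACK
signaled_callback(PVOID context, BOOLEAN timed_out)
{
	/* context carries the event's position in our set */
	PostQueuedCompletionStatus(wait_iocp, 0, (ULONG_PTR) context, NULL);
}

static void
register_wait_handle(HANDLE object, int position)
{
	HANDLE		registration;

	if (wait_iocp == NULL)
		wait_iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 1);

	/* error handling elided */
	RegisterWaitForSingleObject(&registration, object, signaled_callback,
								(PVOID) (ULONG_PTR) position, INFINITE,
								WT_EXECUTEDEFAULT);
}

static int
wait_many(OVERLAPPED_ENTRY *entries, ULONG space, DWORD timeout)
{
	ULONG		nreturned;

	/* One call can return many signaled positions; no 64-handle limit. */
	if (!GetQueuedCompletionStatusEx(wait_iocp, entries, space,
									 &nreturned, timeout, FALSE))
		return 0;				/* timeout (error handling elided) */
	return (int) nreturned;
}
```

The position would come back to the caller in each entry's
lpCompletionKey.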


> I think we could get the same effect as pgwin32_select() more cheaply
> by doing the initial WaitForMultipleObjects() call with the caller's
> timeout exactly as we do today, and then, while we have space,
> repeatedly calling
> WaitForMultipleObjects(handles=&events[last_signaled_event_index + 1],
> timeout=0) until it reports timeout.

That would make sense, and should indeed be reasonable cost-wise.


> I mention this now because I'm not sure whether to consider this an
> 'open item' for 16, or merely an enhancement for 17.  I guess the
> former, because someone might call that a new denial of service
> vector.  On the other hand, if you fill up the listen queue for socket
> 1 with enough vigour, you're also denying service to socket 1, so I
> don't know if it's worth worrying about.  Opinions on that?

I'm not sure either. It doesn't strike me as a particularly relevant
bottleneck. And the old approach of doing more work for every single
connection also made many connections worse, I think?

> diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
> index f4123e7de7..cc7b572008 100644
> --- a/src/backend/storage/ipc/latch.c
> +++ b/src/backend/storage/ipc/latch.c
> @@ -2025,6 +2025,8 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
>       */
>      cur_event = (WaitEvent *) &set->events[rc - WAIT_OBJECT_0 - 1];
>  
> +loop:
> +

As far as I can tell, we'll never see WL_LATCH_SET or WL_POSTMASTER_DEATH. I
think it'd look cleaner to move the body of if (cur_event->events & WL_SOCKET_MASK)
into a separate function that we then also can call further down.


>      occurred_events->pos = cur_event->pos;
>      occurred_events->user_data = cur_event->user_data;
>      occurred_events->events = 0;
> @@ -2044,6 +2046,7 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
>              occurred_events->events = WL_LATCH_SET;
>              occurred_events++;
>              returned_events++;
> +            nevents--;
>          }
>      }
>      else if (cur_event->events == WL_POSTMASTER_DEATH)
> @@ -2063,6 +2066,7 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
>              occurred_events->events = WL_POSTMASTER_DEATH;
>              occurred_events++;
>              returned_events++;
> +            nevents--;
>          }
>      }
>      else if (cur_event->events & WL_SOCKET_MASK)
> @@ -2124,6 +2128,36 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
>          {
>              occurred_events++;
>              returned_events++;
> +            nevents--;
> +        }
> +    }

Seems like we could use returned_events to get nevents in the way you want it,
without adding even more ++/-- to each of the different events?
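
I.e. roughly this, at the point where we decide whether there's room
for another one (just a sketch):

```
/* Sketch: derive the remaining space instead of decrementing nevents. */
if (returned_events < nevents)
{
	/* room to decode and report another occurred event */
}
```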

Greetings,

Andres Freund



Re: WL_SOCKET_ACCEPT fairness on Windows

From: Thomas Munro

On Sat, Apr 1, 2023 at 2:35 PM Andres Freund <andres@anarazel.de> wrote:
> I wonder if we ought to bite the bullet and replace the use of
> WaitForMultipleObjects() with RegisterWaitForSingleObject() and then use
> GetQueuedCompletionStatus() to wait. The fairness issue here is a motivation,
> but the bigger one is that that'd get us out from under the
> MAXIMUM_WAIT_OBJECTS (IIRC 64) limit. Afaict that'd also allow us to read
> multiple notifications at once, using GetQueuedCompletionStatusEx().

Interesting.  So a callback would run in a system-managed thread, and
that would post a custom message in an IOCP for us to read, kinda like
the fake waitpid() thing?  Seems a bit gross and context-switchy but I
agree that the 64 event limit is also terrible.

> Medium term that'd also be a small step towards using readiness based APIs in
> a few places...

Yeah, that would be cool.

> > I think we could get the same effect as pgwin32_select() more cheaply
> > by doing the initial WaitForMultipleObjects() call with the caller's
> > timeout exactly as we do today, and then, while we have space,
> > repeatedly calling
> > WaitForMultipleObjects(handles=&events[last_signaled_event_index + 1],
> > timeout=0) until it reports timeout.
>
> That would make sense, and should indeed be reasonable cost-wise.

Cool.

> > I mention this now because I'm not sure whether to consider this an
> > 'open item' for 16, or merely an enhancement for 17.  I guess the
> > former, because someone might call that a new denial of service
> > vector.  On the other hand, if you fill up the listen queue for socket
> > 1 with enough vigour, you're also denying service to socket 1, so I
> > don't know if it's worth worrying about.  Opinions on that?
>
> I'm not sure either. It doesn't strike me as a particularly relevant
> bottleneck. And the old approach of doing more work for every single
> connection also made many connections worse, I think?

Alright, let's see if anyone else thinks this is worth fixing for 16.

> > diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
> > index f4123e7de7..cc7b572008 100644
> > --- a/src/backend/storage/ipc/latch.c
> > +++ b/src/backend/storage/ipc/latch.c
> > @@ -2025,6 +2025,8 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
> >        */
> >       cur_event = (WaitEvent *) &set->events[rc - WAIT_OBJECT_0 - 1];
> >
> > +loop:
> > +
>
> As far as I can tell, we'll never see WL_LATCH_SET or WL_POSTMASTER_DEATH. I
> think it'd look cleaner to move the body of if (cur_event->events & WL_SOCKET_MASK)
> into a separate function that we then also can call further down.

We could see them, AFAICS, and I don't see much advantage in making
that assumption?  Shouldn't we just shove it in a loop, just like the
other platforms' implementations?  Done in this version, which is best
viewed with git show --ignore-all-space.

> Seems like we could use returned_events to get nevents in the way you want it,
> without adding even more ++/-- to each of the different events?

Yeah.  This time I use reported_events.  I also fixed a maths failure:
I'd forgotten to use rc - WAIT_OBJECT_0, suggesting that CI never
reached the code.
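
For reference, the second-round decode in the attached version now
looks like this:

```
/* We have another event to decode. */
cur_event = &set->events[next_pos + (rc - WAIT_OBJECT_0)];
```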


Re: WL_SOCKET_ACCEPT fairness on Windows

From: "Jonathan S. Katz"

On 3/31/23 11:00 PM, Thomas Munro wrote:

>>> I mention this now because I'm not sure whether to consider this an
>>> 'open item' for 16, or merely an enhancement for 17.  I guess the
>>> former, because someone might call that a new denial of service
>>> vector.  On the other hand, if you fill up the listen queue for socket
>>> 1 with enough vigour, you're also denying service to socket 1, so I
>>> don't know if it's worth worrying about.  Opinions on that?
>>
>> I'm not sure either. It doesn't strike me as a particularly relevant
>> bottleneck. And the old approach of doing more work for every single
>> connection also made many connections worse, I think?
> 
> Alright, let's see if anyone else thinks this is worth fixing for 16.

[RMT hat]

Given this has sat for a bit, I wanted to see if any of your thinking 
has changed on whether this should be fixed for v16 or v17. I have 
personally not formed an opinion yet, but per the current discussion, it 
seems like this could wait?

Thanks,

Jonathan



Re: WL_SOCKET_ACCEPT fairness on Windows

From: Thomas Munro

On Wed, May 17, 2023 at 2:57 AM Jonathan S. Katz <jkatz@postgresql.org> wrote:
> On 3/31/23 11:00 PM, Thomas Munro wrote:
> >>> I mention this now because I'm not sure whether to consider this an
> >>> 'open item' for 16, or merely an enhancement for 17.  I guess the
> >>> former, because someone might call that a new denial of service
> >>> vector.  On the other hand, if you fill up the listen queue for socket
> >>> 1 with enough vigour, you're also denying service to socket 1, so I
> >>> don't know if it's worth worrying about.  Opinions on that?
> >>
> >> I'm not sure either. It doesn't strike me as a particularly relevant
> >> bottleneck. And the old approach of doing more work for every single
> >> connection also made many connections worse, I think?
> >
> > Alright, let's see if anyone else thinks this is worth fixing for 16.
>
> [RMT hat]
>
> Given this has sat for a bit, I wanted to see if any of your thinking
> has changed on whether this should be fixed for v16 or v17. I have
> personally not formed an opinion yet, but per the current discussion, it
> seems like this could wait?

Yeah.  No one seems to think this is worth worrying about (please
speak up if you do).  I'll go ahead and remove this from the open item
lists now, but I'll leave the patch in the CF for 17, to see if a
Windows hacker/tester thinks it's a worthwhile improvement.



Re: WL_SOCKET_ACCEPT fairness on Windows

From: "Jonathan S. Katz"

On 5/16/23 4:41 PM, Thomas Munro wrote:
> On Wed, May 17, 2023 at 2:57 AM Jonathan S. Katz <jkatz@postgresql.org> wrote:

>> Given this has sat for a bit, I wanted to see if any of your thinking
>> has changed on whether this should be fixed for v16 or v17. I have
>> personally not formed an opinion yet, but per the current discussion, it
>> seems like this could wait?
> 
> Yeah.  No one seems to think this is worth worrying about (please
> speak up if you do).  I'll go ahead and remove this from the open item
> lists now, but I'll leave the patch in the CF for 17, to see if a
> Windows hacker/tester thinks it's a worthwhile improvement.

[RMT hat, personal opinion]

That seems reasonable to me.

Thanks,

Jonathan


Re: WL_SOCKET_ACCEPT fairness on Windows

From: Andres Freund

Hi,

On 2023-05-17 08:41:24 +1200, Thomas Munro wrote:
> Yeah.  No one seems to think this is worth worrying about (please
> speak up if you do).

+1 - we have much bigger fish to fry IMO.

Greetings,

Andres Freund



RE: WL_SOCKET_ACCEPT fairness on Windows

From: "Wei Wang (Fujitsu)"

On Sat, April 1, 2023 at 11:00 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
Hi,
Thanks for your patch.

I tried to test this patch on Windows. And I did cover the new code path below:
```
+        /* We have another event to decode. */
+        cur_event = &set->events[next_pos + (rc - WAIT_OBJECT_0)];
```

But I have one thing I want to confirm:
In my tests, I think the scenario with two different events (e.g., one ending
socket connection and one incoming socket connection) has been optimized.
However, it seems that when there are multiple incoming socket connections, the
function WaitEventSetWaitBlock is called multiple times instead of being called
once. Is this our expected result?

Here are my test details:
I use the debugger to ensure that multiple events occur when the function
WaitEventSetWaitBlock is called. First, I added a breakpoint in the below code:
```
        /*
        * Sleep.
        *
        * Need to wait for ->nevents + 1, because signal handle is in [0].
        */
b        rc = WaitForMultipleObjects(set->nevents + 1, set->handles, FALSE,
                                    cur_timeout);
```
And then I made sure that the postmaster process entered the function
WaitForMultipleObjects. (I think the postmaster process will only return from
the function WaitForMultipleObjects when any object is signaled or timeout
occurs). Before the timeout occurs, I set up multiple socket connections using
psql (the first connection makes the process returns from the function
WaitForMultipleObjects). Then, as I continued debugging, multiple socket
connections were handled by different calls of the function
WaitEventSetWaitBlock.

Please let me know if I am missing something.

Regards,
Wang wei

Re: WL_SOCKET_ACCEPT fairness on Windows

From: Thomas Munro

On Thu, May 18, 2023 at 8:53 PM Wei Wang (Fujitsu)
<wangw.fnst@fujitsu.com> wrote:
> On Sat, April 1, 2023 at 11:00 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> I tried to test this patch on Windows. And I did cover the new code path below:
> ```
> +               /* We have another event to decode. */
> +               cur_event = &set->events[next_pos + (rc - WAIT_OBJECT_0)];
> ```
>
> But I have one thing I want to confirm:
> In my tests, I think the scenario with two different events (e.g., one ending
> socket connection and one incoming socket connection) has been optimized.
> However, it seems that when there are multiple incoming socket connections, the
> function WaitEventSetWaitBlock is called multiple times instead of being called
> once. Is this our expected result?

Thanks for testing!  Maybe I am misunderstanding something: what I
expect to happen is that we call *WaitForMultipleObjects()* one extra
time (to see if there is another event available immediately, and
usually there is not), but not WaitEventSetWaitBlock().

> Here are my test details:
> I use the debugger to ensure that multiple events occur when the function
> WaitEventSetWaitBlock is called. First, I added a breakpoint in the below code:
> ```
>                 /*
>                 * Sleep.
>                 *
>                 * Need to wait for ->nevents + 1, because signal handle is in [0].
>                 */
> b               rc = WaitForMultipleObjects(set->nevents + 1, set->handles, FALSE,
>                                                                         cur_timeout);
> ```
> And then I made sure that the postmaster process entered the function
> WaitForMultipleObjects. (I think the postmaster process will only return from
> the function WaitForMultipleObjects when any object is signaled or timeout
> occurs). Before the timeout occurs, I set up multiple socket connections using
> psql (the first connection makes the process return from the function
> WaitForMultipleObjects). Then, as I continued debugging, multiple socket
> connections were handled by different calls of the function
> WaitEventSetWaitBlock.

This is a good test.  Also, to test the exact scenario I was worrying
about, you could initiate 4 psql sessions while the server is blocked
in a debugger, 2 on a Unix domain socket and 2 on a TCP socket, and
then "continue" the server with a breakpoint on (say) accept() so that
you can "continue" through each one.  You should see that it
alternates between the two sockets, accepting from both "fairly"
instead of draining one socket entirely first.

> Please let me know if I am missing something.

I think your report already shows that it basically works correctly.
Do you think it's worth committing for 17 to make it work more like
Unix, or was I just being paranoid?



Re: WL_SOCKET_ACCEPT fairness on Windows

From: Thomas Munro

I committed this for 17.  It would be good to come up with something
fundamentally better than this, to get rid of that 64 event limit
nonsense, but I don't see it happening in the 17 cycle, and I prefer
the semantics with this commit in the meantime.